Training
Use this page to launch, monitor, and checkpoint training runs with consistent settings. Prerequisites: prepared dataset and metadata from Data Pipeline.
Command(s)
Reference training command, shown with `--device` overrides for CPU and CUDA:

```shell
python train.py --config configs/base/base.toml --tokenizer char --max-iters 5000 --device cpu
python train.py --config configs/base/base.toml --tokenizer char --max-iters 5000 --device cuda
```
train.py flags:
- `--config`: TOML preset path (CONFIG_EXAMPLE = `configs/base/base.toml`)
- `--max-iters`: runtime override for total iterations
- `--device`: `cpu` or `cuda`
- `--tokenizer`: `char` or `bpe`
Precision and Gradient Accumulation
Configure these in [training]:
- `grad_accum_steps`: gradient accumulation factor (default `1`)
- `precision`: `fp32` (default), `fp16`, or `bf16`

`effective_batch_size = batch_size * grad_accum_steps`
Mixed precision is enabled only when `device = "cuda"` and `precision != "fp32"`. On CPU, training falls back to fp32.
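The precision and batch-size rules above can be sketched as two small helpers. These are illustrative functions (`resolve_precision` and `effective_batch_size` are hypothetical names, not necessarily how train.py structures this):

```python
def resolve_precision(device: str, precision: str) -> tuple[str, bool]:
    """Return (effective_precision, amp_enabled) per the rules above.

    Mixed precision requires CUDA; on CPU everything falls back to fp32.
    """
    if device == "cuda" and precision != "fp32":
        return precision, True   # fp16 or bf16 autocast on GPU
    return "fp32", False         # CPU, or fp32 requested: full precision


def effective_batch_size(batch_size: int, grad_accum_steps: int = 1) -> int:
    """Tokens per optimizer step = micro-batch size * accumulation factor."""
    return batch_size * grad_accum_steps


print(resolve_precision("cpu", "bf16"))   # ('fp32', False) -- CPU falls back
print(resolve_precision("cuda", "bf16"))  # ('bf16', True)
print(effective_batch_size(32, 4))        # 128
```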
RTX 4060 example:
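The example block itself was not preserved here. A plausible `[training]` sketch for an 8 GB RTX 4060, using the keys described above (the specific values are assumptions, not from the source):

```toml
[training]
precision = "bf16"       # mixed precision; requires device = "cuda"
batch_size = 32
grad_accum_steps = 4     # effective_batch_size = 32 * 4 = 128
```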
Best Checkpoint & Early Stopping
- `save_best` saves `checkpoints/ckpt_best.pt` whenever validation loss improves by at least `early_stopping_min_delta`.
- `early_stopping` is disabled by default.
- `early_stopping_patience` counts evaluation rounds without sufficient validation improvement.
- `early_stopping_min_delta` defines the minimum improvement threshold on `val_loss`.
```toml
[training]
early_stopping = true
early_stopping_patience = 3
early_stopping_min_delta = 0.001
save_best = true
```
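How these settings interact each evaluation round can be sketched as follows. This is illustrative logic matching the semantics described above, not train.py's actual implementation:

```python
class EarlyStopper:
    """Tracks the best val_loss seen so far and a patience counter."""

    def __init__(self, patience: int = 3, min_delta: float = 0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_rounds = 0

    def step(self, val_loss: float) -> tuple[bool, bool]:
        """Return (improved, should_stop) for one evaluation round."""
        improved = val_loss < self.best - self.min_delta
        if improved:
            self.best = val_loss   # with save_best, ckpt_best.pt is written here
            self.bad_rounds = 0
        else:
            self.bad_rounds += 1
        return improved, self.bad_rounds >= self.patience


stopper = EarlyStopper(patience=3, min_delta=0.001)
for loss in [1.00, 0.90, 0.899, 0.899, 0.899]:
    improved, stop = stopper.step(loss)
# The last three losses improve by less than min_delta, so stop becomes True.
```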
data_format and Metadata Mapping
| Training mode | Config value | Data artifacts expected | Metadata path |
|---|---|---|---|
| Text pipeline | `training.data_format = "txt"` | `data/processed/train.txt` + `data/processed/val.txt` (or `.npy`) | `data/processed/meta.json` (META_TXT) |
| Binary pipeline | `training.data_format = "bin"` | `data/train.bin` + `data/val.bin` | `data/meta.json` (META_BIN) |
Note
For binary mode, if `data.processed_dir` points to `data/processed`, train.py automatically checks the parent directory (`data/`) for `train.bin` and `val.bin`.
Checkpointing and Resume Behavior
Produced during training:
- `checkpoints/ckpt_last.pt` (CHECKPOINT)
- `checkpoints/ckpt_best.pt` (when `save_best = true`)
- `checkpoints/train_log.json`
Warning
Native resume-from-checkpoint is not implemented in train.py yet. `ckpt_last.pt` is intended for inference/export compatibility and state inspection, not automatic continuation.
Common Errors
- Binary shards missing: see Binary shards not found.
- Metadata path mismatch: see Meta path mismatch.
- Vocab inference failure: see Char vocab missing.
- CUDA fallback warning: see CUDA not detected.