Skip to content

Training

Use this page to launch, monitor, and checkpoint training runs with consistent settings. Prerequisites: prepared dataset and metadata from Data Pipeline.

Command(s)

Reference training command:

python train.py --config configs/base/base.toml --tokenizer char --max-iters 5000

Device override examples:

python train.py --config configs/base/base.toml --tokenizer char --max-iters 5000 --device cpu
python train.py --config configs/base/base.toml --tokenizer char --max-iters 5000 --device cuda

train.py flags:

  • --config: TOML preset path (CONFIG_EXAMPLE = configs/base/base.toml)
  • --max-iters: runtime override for total iterations
  • --device: cpu or cuda
  • --tokenizer: char or bpe

Precision and Gradient Accumulation

Configure these in [training]:

  • grad_accum_steps: gradient accumulation factor (default 1)
  • precision: fp32 (default), fp16, or bf16

effective_batch_size = batch_size * grad_accum_steps

Mixed precision is enabled only when device = "cuda" and precision != "fp32". On CPU, training falls back to fp32.

RTX 4060 example:

[training]
batch_size = 8
grad_accum_steps = 4
precision = "fp16"

Best Checkpoint & Early Stopping

  • save_best saves checkpoints/ckpt_best.pt whenever validation loss improves by at least early_stopping_min_delta.
  • early_stopping is disabled by default.
  • early_stopping_patience counts evaluation rounds without sufficient validation improvement.
  • early_stopping_min_delta defines the minimum improvement threshold on val_loss.
[training]
early_stopping = true
early_stopping_patience = 3
early_stopping_min_delta = 0.001
save_best = true

data_format and Metadata Mapping

Training mode Config value Data artifacts expected Metadata path
Text pipeline training.data_format = "txt" data/processed/train.txt + data/processed/val.txt (or .npy) data/processed/meta.json (META_TXT)
Binary pipeline training.data_format = "bin" data/train.bin + data/val.bin data/meta.json (META_BIN)

Note

For binary mode, if data.processed_dir points to data/processed, train.py automatically checks the parent directory (data/) for train.bin and val.bin.

Checkpointing and Resume Behavior

Produced during training:

  • checkpoints/ckpt_last.pt (CHECKPOINT)
  • checkpoints/ckpt_best.pt (when save_best = true)
  • checkpoints/train_log.json

Warning

Native resume-from-checkpoint is not implemented in train.py yet. ckpt_last.pt is for inference/export compatibility and state inspection, not automatic continuation.

Common Errors