Configuration Reference

Use this page as the single source of truth for train.py TOML keys and defaults. Prerequisite: familiarity with the reference preset (configs/base/base.toml).

Canonical Values Used in Guides

  • CONFIG_EXAMPLE = configs/base/base.toml
  • CHECKPOINT = checkpoints/ckpt_last.pt
  • META_TXT = data/processed/meta.json
  • META_BIN = data/meta.json

Section and Key Reference

[general]

Key        Type    Typical value           Notes
run_name   string  "tiny_char_baseline"    Informational run label.
seed       int     1337                    Optional random seed.
tokenizer  string  "char" or "bpe"         Default is "char" if omitted.

[data]

Key            Type           Typical value      Notes
dataset        string         "tinyshakespeare"  Metadata label.
processed_dir  string (path)  "data/processed"   Default is "data/processed" if omitted.

[model]

Key         Type   Typical value            Notes
vocab_size  int    65 (char) / 50257 (bpe)  Can be inferred from metadata/tokenizer when omitted.
block_size  int    512                      Sequence length.
n_layer     int    6                        Transformer depth.
n_head      int    8                        Attention heads.
n_embd      int    256                      Embedding size.
dropout     float  0.1                      Dropout rate.
bias        bool   true                     Linear-layer bias toggle.
use_rope    bool   false / true             Default is false.
use_flash   bool   false / true             Default is false.

[optimizer]

Key            Type   Typical value  Notes
learning_rate  float  3e-4           Falls back to [training] if missing.
weight_decay   float  0.01           Falls back to [training] if missing.
beta1          float  0.9            AdamW beta1.
beta2          float  0.95           AdamW beta2.
grad_clip      float  1.0            Falls back to [training] if missing.
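
The [optimizer]-then-[training] fallback noted above can be sketched as a small lookup helper; this is an illustrative sketch against the loaded config dict, and the helper name resolve is not necessarily what train.py uses:

```python
def resolve(cfg: dict, key: str, default):
    """Look up `key` in [optimizer] first, then fall back to [training], then to a default."""
    if key in cfg.get("optimizer", {}):
        return cfg["optimizer"][key]
    return cfg.get("training", {}).get(key, default)

cfg = {"training": {"learning_rate": 1e-3}}
print(resolve(cfg, "learning_rate", 3e-4))  # no [optimizer] section, so 1e-3 from [training]
print(resolve(cfg, "grad_clip", 1.0))       # absent from both sections, so the default 1.0
```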

[training]

Key                          Type           Typical value    Notes
batch_size                   int            8                Micro-batch size per step.
gradient_accumulation_steps  int            1 to 8           Effective-batch multiplier.
max_iters                    int            5000             Reference value used throughout the docs.
warmup_iters                 int            0 to 500         LR warmup steps.
lr_decay_iters               int            5000             Usually aligned with max_iters.
min_lr                       float          3e-5             Cosine LR floor.
eval_interval                int            200              Eval/checkpoint cadence.
eval_iters                   int            50               Validation batches per eval.
log_interval                 int            20               Console logging cadence.
save_interval                int            200 or 500       Optional extra save cadence.
device                       string         "cuda" or "cpu"  CLI --device overrides this.
checkpoint_dir               string (path)  "checkpoints"    Produces ckpt_last.pt and train_log.json.
data_format                  string         "txt" or "bin"   Default is "txt" if omitted.
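
Taken together, warmup_iters, lr_decay_iters, and min_lr describe the usual warmup-then-cosine schedule. The sketch below shows how those keys interact, assuming the common formulation; the function name get_lr and the exact shape are illustrative, not necessarily train.py's implementation:

```python
import math

def get_lr(it, learning_rate=3e-4, min_lr=3e-5, warmup_iters=200, lr_decay_iters=5000):
    # Linear warmup from 0 up to learning_rate over the first warmup_iters steps.
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # Past the decay horizon, hold at the floor.
    if it > lr_decay_iters:
        return min_lr
    # Cosine decay from learning_rate down to min_lr in between.
    ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))
    return min_lr + coeff * (learning_rate - min_lr)
```

Setting lr_decay_iters = max_iters, as in the tables above, means the schedule bottoms out exactly at the end of training.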

[generation]

Key             Type   Typical value  Notes
max_new_tokens  int    200            Useful for shared defaults in docs.
temperature     float  0.6 to 0.9     Sampling randomness.
top_k           int    40 to 50       Sampling truncation.
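
How temperature and top_k interact can be sketched without any framework code: divide the logits by the temperature, keep only the top_k largest, and sample from the renormalized remainder. This is a generic illustration of the technique, not train.py's generation loop:

```python
import math
import random

def sample_next(logits, temperature=0.8, top_k=40):
    """Temperature-scale logits, keep the top_k largest, sample an index from those."""
    scaled = [l / temperature for l in logits]
    # Indices of the top_k highest logits; everything else is discarded.
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the surviving logits (shifted by the max for numerical stability).
    m = max(scaled[i] for i in top)
    weights = [math.exp(scaled[i] - m) for i in top]
    return random.choices(top, weights=weights, k=1)[0]
```

Lower temperatures sharpen the distribution toward the argmax; top_k = 1 makes sampling fully greedy.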

Loader Defaults Applied Automatically

load_config guarantees:

  • general.tokenizer = "char"
  • data.processed_dir = "data/processed"
  • model.use_rope = false
  • model.use_flash = false
  • training.data_format = "txt"

Example: Minimal Small Config

[general]
tokenizer = "char"

[data]
dataset = "tinyshakespeare"
processed_dir = "data/processed"

[model]
block_size = 256
n_layer = 4
n_head = 4
n_embd = 192
vocab_size = 65
dropout = 0.1
bias = true
use_rope = false
use_flash = false

[training]
batch_size = 8
gradient_accumulation_steps = 1
max_iters = 2000
eval_interval = 200
eval_iters = 50
log_interval = 20
device = "cpu"
checkpoint_dir = "checkpoints"
data_format = "txt"
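
As a rough sanity check on a config's model size, the standard transformer heuristic of about 12 · n_layer · n_embd² weight parameters per stack (4·n_embd² for attention plus 8·n_embd² for the MLP per block), plus token and positional embeddings, gives a quick estimate; biases and layer norms are ignored:

```python
def approx_params(n_layer, n_embd, vocab_size, block_size):
    # 4*n_embd^2 attention + 8*n_embd^2 MLP weight parameters per transformer block.
    blocks = 12 * n_layer * n_embd ** 2
    # Token embedding table plus learned positional embeddings.
    embeddings = vocab_size * n_embd + block_size * n_embd
    return blocks + embeddings

print(approx_params(4, 192, 65, 256))  # prints 1831104, i.e. ~1.8M parameters
```

For the minimal config above that works out to roughly 1.8M parameters, comfortably small enough for the device = "cpu" setting it uses.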

Example: RTX 4060 Starting Point (8GB, Validate Locally)

[general]
tokenizer = "char"

[data]
dataset = "tinyshakespeare"
processed_dir = "data/processed"

[model]
block_size = 512
n_layer = 6
n_head = 8
n_embd = 256
dropout = 0.1
bias = true
use_rope = false
use_flash = false

[optimizer]
learning_rate = 3e-4
weight_decay = 0.01
beta1 = 0.9
beta2 = 0.95
grad_clip = 1.0

[training]
batch_size = 8
gradient_accumulation_steps = 2
max_iters = 5000
warmup_iters = 200
lr_decay_iters = 5000
min_lr = 3e-5
eval_interval = 200
eval_iters = 50
log_interval = 20
save_interval = 200
device = "cuda"
checkpoint_dir = "checkpoints"
data_format = "txt"

Note

The RTX 4060 block above is a starting template, not a guaranteed benchmark profile. Adjust based on your exact VRAM and driver stack.