# Configuration Reference

Use this page as the single source of truth for `train.py` TOML keys and defaults.

Prerequisite: familiarity with the reference preset (`configs/base/base.toml`).
## Canonical Values Used in Guides

```
CONFIG_EXAMPLE = configs/base/base.toml
CHECKPOINT     = checkpoints/ckpt_last.pt
META_TXT       = data/processed/meta.json
META_BIN       = data/meta.json
```
## Section and Key Reference

### `[general]`

| Key | Type | Typical value | Notes |
| --- | --- | --- | --- |
| `run_name` | string | `"tiny_char_baseline"` | Informational run label. |
| `seed` | int | `1337` | Optional random seed field. |
| `tokenizer` | string | `"char"` or `"bpe"` | Default is `"char"` if omitted. |
### `[data]`

| Key | Type | Typical value | Notes |
| --- | --- | --- | --- |
| `dataset` | string | `"tinyshakespeare"` | Metadata label. |
| `processed_dir` | string path | `"data/processed"` | Default is `"data/processed"` if omitted. |
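As a quick illustration, here is a sketch combining the two sections above. All values are the typical examples from the tables, not requirements:

```toml
[general]
run_name = "tiny_char_baseline"   # informational label only
tokenizer = "char"                # "bpe" also accepted; loader default is "char"
seed = 1337                       # optional

[data]
dataset = "tinyshakespeare"       # metadata label
# processed_dir omitted on purpose: the loader fills in "data/processed"
```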
### `[model]`

| Key | Type | Typical value | Notes |
| --- | --- | --- | --- |
| `vocab_size` | int | `65` (char) / `50257` (bpe) | Can be inferred from metadata/tokenizer when omitted (see the sketch below). |
| `block_size` | int | `512` | Sequence length. |
| `n_layer` | int | `6` | Transformer depth. |
| `n_head` | int | `8` | Attention heads. |
| `n_embd` | int | `256` | Embedding size. |
| `dropout` | float | `0.1` | Dropout rate. |
| `bias` | bool | `true` | Linear layer bias toggle. |
| `use_rope` | bool | `false` / `true` | Default is `false`. |
| `use_flash` | bool | `false` / `true` | Default is `false`. |
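Because `vocab_size` can be inferred, a `[model]` block may omit it entirely. A minimal sketch using the typical values from the table (the inference behavior is exactly as documented above; nothing else is assumed):

```toml
[model]
# vocab_size omitted: inferred from metadata/tokenizer
# (65 for "char", 50257 for "bpe")
block_size = 512
n_layer = 6
n_head = 8
n_embd = 256
dropout = 0.1
bias = true
# use_rope and use_flash omitted: loader defaults both to false
```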
### `[optimizer]`

| Key | Type | Typical value | Notes |
| --- | --- | --- | --- |
| `learning_rate` | float | `3e-4` | Falls back to `[training]` if missing (see the sketch below). |
| `weight_decay` | float | `0.01` | Falls back to `[training]` if missing. |
| `beta1` | float | `0.9` | AdamW beta1. |
| `beta2` | float | `0.95` | AdamW beta2. |
| `grad_clip` | float | `1.0` | Falls back to `[training]` if missing. |
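The fallback rule means `learning_rate`, `weight_decay`, and `grad_clip` can live under `[training]` instead. A sketch, assuming the fallback also applies when the `[optimizer]` section is omitted entirely:

```toml
# No [optimizer] section; these keys are picked up from [training] instead.
[training]
learning_rate = 3e-4
weight_decay = 0.01
grad_clip = 1.0
```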
### `[training]`

| Key | Type | Typical value | Notes |
| --- | --- | --- | --- |
| `batch_size` | int | `8` | Micro-batch size per step. |
| `gradient_accumulation_steps` | int | `1` to `8` | Effective batch multiplier (worked example below). |
| `max_iters` | int | `5000` | Reference value used throughout the docs. |
| `warmup_iters` | int | `0` to `500` | LR warmup steps. |
| `lr_decay_iters` | int | `5000` | Usually aligned with `max_iters`. |
| `min_lr` | float | `3e-5` | Cosine LR floor. |
| `eval_interval` | int | `200` | Eval/checkpoint cadence. |
| `eval_iters` | int | `50` | Validation batches per eval. |
| `log_interval` | int | `20` | Console logging cadence. |
| `save_interval` | int | `200` or `500` | Optional extra save cadence. |
| `device` | string | `"cuda"` or `"cpu"` | CLI `--device` overrides this. |
| `checkpoint_dir` | string path | `"checkpoints"` | Produces `ckpt_last.pt` and `train_log.json`. |
| `data_format` | string | `"txt"` or `"bin"` | Default is `"txt"` if omitted. |
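The effective batch size is `batch_size * gradient_accumulation_steps`. The sketch below trains with an effective batch of 8 × 4 = 32 sequences per optimizer step while holding only 8 in memory at once (illustrative values within the typical ranges above):

```toml
[training]
batch_size = 8                    # sequences in memory per micro-step
gradient_accumulation_steps = 4   # 8 * 4 = 32 sequences per optimizer step
```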
### `[generation]`

| Key | Type | Typical value | Notes |
| --- | --- | --- | --- |
| `max_new_tokens` | int | `200` | Useful for shared defaults in docs. |
| `temperature` | float | `0.6` to `0.9` | Sampling randomness. |
| `top_k` | int | `40` to `50` | Sampling truncation. |
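A sketch of a `[generation]` block using mid-range values from the table; lower `temperature` and smaller `top_k` both make sampling more conservative:

```toml
[generation]
max_new_tokens = 200
temperature = 0.8   # lower = less random
top_k = 40          # sample only from the 40 most likely tokens
```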
## Loader Defaults Applied Automatically

`load_config` guarantees the following keys even when a config omits them (a worked example follows the list):

- `general.tokenizer = "char"`
- `data.processed_dir = "data/processed"`
- `model.use_rope = false`
- `model.use_flash = false`
- `training.data_format = "txt"`
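For example, a sketch of how a sparse config is filled in. The keys and values come from the list above; the comment block shows the resulting in-memory settings:

```toml
# Config as written on disk:
[data]
dataset = "tinyshakespeare"

# After load_config, the in-memory config is guaranteed to also contain:
#   general.tokenizer    = "char"
#   data.processed_dir   = "data/processed"
#   model.use_rope       = false
#   model.use_flash      = false
#   training.data_format = "txt"
```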
## Example: Minimal Small Config

```toml
[general]
tokenizer = "char"

[data]
dataset = "tinyshakespeare"
processed_dir = "data/processed"

[model]
block_size = 256
n_layer = 4
n_head = 4
n_embd = 192
vocab_size = 65
dropout = 0.1
bias = true
use_rope = false
use_flash = false

[training]
batch_size = 8
gradient_accumulation_steps = 1
max_iters = 2000
eval_interval = 200
eval_iters = 50
log_interval = 20
device = "cpu"
checkpoint_dir = "checkpoints"
data_format = "txt"
```
## Example: RTX 4060 Starting Point (8GB, Validate Locally)

```toml
[general]
tokenizer = "char"

[data]
dataset = "tinyshakespeare"
processed_dir = "data/processed"

[model]
block_size = 512
n_layer = 6
n_head = 8
n_embd = 256
dropout = 0.1
bias = true
use_rope = false
use_flash = false

[optimizer]
learning_rate = 3e-4
weight_decay = 0.01
beta1 = 0.9
beta2 = 0.95
grad_clip = 1.0

[training]
batch_size = 8
gradient_accumulation_steps = 2
max_iters = 5000
warmup_iters = 200
lr_decay_iters = 5000
min_lr = 3e-5
eval_interval = 200
eval_iters = 50
log_interval = 20
save_interval = 200
device = "cuda"
checkpoint_dir = "checkpoints"
data_format = "txt"
```
> **Note:** The RTX 4060 block above is a starting template, not a guaranteed benchmark profile. Adjust based on your exact VRAM and driver stack.
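If the 8 GB starting point runs out of memory, a common first adjustment is to trade micro-batch size for accumulation steps so the effective batch stays at 16 (this is general practice, not a tuned profile for this codebase):

```toml
[training]
batch_size = 4                    # halved from 8
gradient_accumulation_steps = 4   # doubled from 2; 4 * 4 = 16, same effective batch
```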