Data Pipeline
Use this page to prepare training data and metadata in a predictable layout. Prerequisite: dependencies installed from Getting Started.
Command(s)
Reference `txt` pipeline (used by `configs/base/base.toml`):

```bash
python scripts/data/prepare_data.py \
  --dataset tinyshakespeare \
  --tokenizer char \
  --output-format txt \
  --raw-dir data/raw \
  --output-dir data/processed \
  --val-ratio 0.1
```
Alternative `bin` pipeline:

```bash
python scripts/data/prepare_data.py \
  --dataset tinyshakespeare \
  --tokenizer char \
  --output-format bin \
  --raw-dir data/raw \
  --output-dir data/processed \
  --val-ratio 0.1
```
Output Files / Artifacts Produced
`txt` format (`--output-dir = data/processed`):

- `data/processed/train.txt`
- `data/processed/val.txt`
- `data/processed/corpus.txt`
- `data/processed/train.npy`
- `data/processed/val.npy`
- `data/processed/meta.json` (META_TXT)
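A hedged example of consuming these artifacts follows. It assumes the `.npy` files hold integer token IDs and that `meta.json` is plain JSON with tokenizer metadata; the exact keys inside `meta.json` are not documented on this page.

```python
# Load the txt-format artifacts listed above.
# Assumption: the .npy files contain integer token IDs.
import json

import numpy as np

train_ids = np.load("data/processed/train.npy")
val_ids = np.load("data/processed/val.npy")
with open("data/processed/meta.json") as f:
    meta = json.load(f)  # tokenizer metadata (META_TXT); keys are assumptions

print(train_ids.shape, val_ids.shape, sorted(meta))
```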
`bin` format:

- `data/train.bin`
- `data/val.bin`
- `data/meta.json` (META_BIN)
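A similar sketch for the `bin` shards. The `uint16` dtype is an assumption (typical for small vocabularies such as a char tokenizer); check `data/meta.json` (META_BIN) for the authoritative dtype before relying on this.

```python
# Read the bin shards without loading them fully into memory.
# Assumption: tokens are stored as uint16.
import numpy as np

train_ids = np.memmap("data/train.bin", dtype=np.uint16, mode="r")
val_ids = np.memmap("data/val.bin", dtype=np.uint16, mode="r")

print(len(train_ids), len(val_ids))
```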
Note
For `--output-format bin`, if `--output-dir` ends with `processed`, binary files are written to its parent (`data/`).
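A minimal sketch of that redirect; the function name is hypothetical and the real logic in `scripts/data/prepare_data.py` may differ.

```python
from pathlib import Path

def bin_output_dir(output_dir: str) -> Path:
    """Where bin shards land: the parent if the dir ends with 'processed'."""
    out = Path(output_dir)
    return out.parent if out.name == "processed" else out

assert bin_output_dir("data/processed") == Path("data")
```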
Format Selection
- Use `txt` when training with `training.data_format = "txt"` and metadata at `data/processed/meta.json`.
- Use `bin` when training with `training.data_format = "bin"` and metadata at `data/meta.json`.
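These two rules can be collapsed into a small resolver, sketched below. The helper is illustrative and not part of the codebase; only the two paths it returns come from this page.

```python
from pathlib import Path

def meta_path(data_format: str, data_root: str = "data") -> Path:
    """Resolve meta.json for a given training.data_format."""
    if data_format == "txt":
        return Path(data_root) / "processed" / "meta.json"
    if data_format == "bin":
        return Path(data_root) / "meta.json"
    raise ValueError(f"unknown data_format: {data_format!r}")
```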
Common Errors
- Missing binary shards: see Binary shards not found.
- Wrong metadata path: see Meta path mismatch.
- Char tokenizer vocab issues: see Char vocab missing.
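A hypothetical pre-flight check can catch the first two errors before training starts; the artifact lists below mirror the layout documented on this page.

```python
from pathlib import Path

EXPECTED = {
    "txt": ["data/processed/train.npy", "data/processed/val.npy",
            "data/processed/meta.json"],
    "bin": ["data/train.bin", "data/val.bin", "data/meta.json"],
}

def missing_artifacts(data_format: str) -> list[str]:
    """Return expected files (per this page's layout) that do not exist."""
    return [p for p in EXPECTED[data_format] if not Path(p).exists()]

if missing := missing_artifacts("bin"):
    print("missing artifacts:", ", ".join(missing))
```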