Benchmarks
Use the inference benchmark script to measure reproducible tokens/s and peak VRAM.
The script supports both local checkpoints and Hugging Face repos and can export JSON + Markdown outputs.
Inference Benchmark Script
Default runtime is intentionally short (warmup + 3 measured runs).
Local Example
python scripts/benchmark/benchmark_infer.py \
--source local \
--checkpoint checkpoints/ckpt_last.pt \
--meta data/processed/meta.json \
--tokenizer char \
--config configs/base/base.toml \
--device cpu \
--json-out outputs/bench_infer.json \
--md-out outputs/bench_infer.md
Hugging Face Example
python scripts/benchmark/benchmark_infer.py \
--source hf \
--repo-id LabCoreAI/<id> \
--config configs/base/base.toml \
--device cuda \
--json-out outputs/bench_infer_hf.json \
--md-out outputs/bench_infer_hf.md
What Is Measured
- Warmup generation (
--warmup-tokens, not counted in final throughput). - Measured generation (
--gen-tokens) repeated--iterstimes. - Throughput summary:
mean,min,maxtokens/sec. - Peak VRAM (
torch.cuda.max_memory_allocated) when CUDA is used.
Reproducibility settings are read from [generation] when --config is provided:
seeddeterministic- sampling settings (
temperature,top_k,top_p,repetition_penalty) use_kv_cache(unless overridden by CLI flag)
JSON Output Schema (Summary)
{
"timestamp": "...",
"commit": "...",
"platform": {"os": "...", "python": "..."},
"torch": {"version": "...", "cuda": "..."},
"device": {"type": "cpu|cuda", "name": "..."},
"model": {"source": "local|hf", "params_m": 0.0, "block_size": 0, "n_layer": 0, "n_head": 0, "n_embd": 0},
"generation": {"prompt": "...", "gen_tokens": 256, "temperature": 0.9, "top_k": 40, "top_p": 1.0, "repetition_penalty": 1.0, "use_kv_cache": true},
"results": {"iters": 3, "tokens_per_sec": {"mean": 0.0, "min": 0.0, "max": 0.0}, "vram_peak_mib": null}
}
Community Results
Paste the generated Markdown row (from --md-out or terminal output) into this table.
Attach the JSON output in the PR description if available.
| Device | Source | Model size (params M) | KV-cache | gen_tokens | mean tok/s | peak VRAM MiB |
|---|---|---|---|---|---|---|
| your result | local/hf | 0.000 | on/off | 256 | 0.00 | N/A or value |