Inference and Demo
Use this page to run deterministic CLI generation and the local Gradio demo.
Prerequisites: a trained checkpoint (checkpoints/ckpt_last.pt) and matching metadata.
Command(s)
CLI generation (reference paths):
python generate.py \
--checkpoint checkpoints/ckpt_last.pt \
--meta data/processed/meta.json \
--tokenizer char \
--prompt "To be" \
--max-new-tokens 200 \
--temperature 0.9 \
--top-k 40 \
--device cpu
Gradio demo from local checkpoint:
python demo_gradio.py \
--source local \
--checkpoint checkpoints/ckpt_last.pt \
--meta data/processed/meta.json \
--device cpu \
--port 7860
Gradio demo from Hugging Face:
python demo_gradio.py \
--source hf \
--repo-id GhostPunishR/labcore-llm-50M \
--device cpu \
--port 7860
Stable Generation Settings (Debug Mode)
Use conservative sampling when debugging reproducibility:
- temperature = 0.2 to reduce randomness
- top-k = 20 (or lower)
- max-new-tokens = 80 for quick checks
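For example, a quick debug run using the reference paths from above with these conservative settings:
python generate.py \
--checkpoint checkpoints/ckpt_last.pt \
--meta data/processed/meta.json \
--tokenizer char \
--prompt "To be" \
--max-new-tokens 80 \
--temperature 0.2 \
--top-k 20 \
--device cpu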
Tip
If output quality suddenly drops, first verify that --meta belongs to the same tokenizer/checkpoint run.
Sampling Controls
- temperature: scales logits before sampling.
- top_k: keeps only the k most likely tokens.
- top_p: nucleus sampling cutoff (1.0 disables it).
- repetition_penalty: penalizes already generated tokens (1.0 disables it).
- use_kv_cache: enables the KV cache during decoding.
- stream: enables token-by-token output streaming.
top_p and repetition_penalty are read from the [generation] section when using generate.py --config ....
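The sketch below shows how these controls typically combine in a single sampling step. It illustrates the standard technique, not the exact generate.py implementation:

import torch

def sample_next(logits, generated, temperature=0.9, top_k=40,
                top_p=1.0, repetition_penalty=1.0):
    # logits: 1-D tensor over the vocabulary; generated: token ids so far.
    logits = logits.clone()
    if repetition_penalty != 1.0:  # 1.0 disables it
        for t in set(generated):
            # CTRL-style penalty: shrink positive logits, grow negative ones
            logits[t] = (logits[t] / repetition_penalty if logits[t] > 0
                         else logits[t] * repetition_penalty)
    logits = logits / max(temperature, 1e-8)  # temperature rescales logits
    if top_k > 0:  # keep only the k most likely tokens
        kth = torch.topk(logits, min(top_k, logits.numel())).values[-1]
        logits[logits < kth] = float("-inf")
    if top_p < 1.0:  # nucleus: drop the tail past cumulative mass top_p
        sorted_logits, idx = torch.sort(logits, descending=True)
        cum = torch.cumsum(torch.softmax(sorted_logits, dim=-1), dim=-1)
        drop = cum > top_p
        drop[1:] = drop[:-1].clone()
        drop[0] = False  # always keep the most likely token
        logits[idx[drop]] = float("-inf")
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()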
RTX 4060 stable example:
[generation]
temperature = 0.6
top_k = 50
top_p = 0.9
repetition_penalty = 1.1
use_kv_cache = true
stream = true
system_prompt = "You are LabCore LLM."
max_history_turns = 6
Reproducible Generation
- seed: fixes the Python/NumPy/Torch RNGs for repeatable sampling.
- deterministic: enables deterministic Torch algorithms (warn_only=True) and deterministic cuDNN settings. A PyTorch sketch follows the example below.
[generation]
temperature = 0.6
top_k = 50
top_p = 0.9
repetition_penalty = 1.1
use_kv_cache = true
stream = true
system_prompt = "You are LabCore LLM."
max_history_turns = 6
seed = 1337
deterministic = true
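In PyTorch terms, these two keys roughly translate to the following (a minimal sketch; generate.py's internals may differ):

import random
import numpy as np
import torch

def seed_everything(seed=1337, deterministic=True):
    # seed: fix every RNG source involved in sampling
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds both CPU and CUDA generators
    if deterministic:
        # warn_only=True logs a warning instead of raising when an op
        # has no deterministic implementation
        torch.use_deterministic_algorithms(True, warn_only=True)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False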
KV Cache, Streaming, and Chat
- KV cache speeds up generation by reusing past attention keys/values instead of recomputing them over the whole context (see the attention sketch after this list).
- Streaming updates the demo output incrementally as each token is sampled.
- Multi-turn chat builds prompts with simple text markers: <|system|>, <|user|>, and <|assistant|>; max_history_turns keeps only the latest turns to bound prompt length (a prompt-builder sketch follows the config example below).
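To make the KV-cache point concrete, here is a minimal single-head attention step that caches keys and values per decoded token (an illustrative sketch, not the project's actual attention code):

import torch

d = 16
wq = torch.nn.Linear(d, d, bias=False)
wk = torch.nn.Linear(d, d, bias=False)
wv = torch.nn.Linear(d, d, bias=False)
cache_k, cache_v = [], []

def attend_step(x_t):
    # x_t: embedding of the newest token, shape (1, d).
    # Only the new token's Q/K/V are computed; older K/V come from the cache.
    q = wq(x_t)
    cache_k.append(wk(x_t))
    cache_v.append(wv(x_t))
    K = torch.cat(cache_k, dim=0)  # (t, d): keys for all tokens so far
    V = torch.cat(cache_v, dim=0)
    att = torch.softmax(q @ K.T / d ** 0.5, dim=-1)  # (1, t)
    return att @ V  # (1, d): attention output for the new token

for _ in range(4):  # each decode step costs O(t) instead of O(t^2)
    y = attend_step(torch.randn(1, d))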
[generation]
use_kv_cache = true
stream = true
system_prompt = "You are LabCore LLM."
max_history_turns = 6
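A sketch of how the markers and max_history_turns could assemble a prompt (build_chat_prompt is a hypothetical helper, not necessarily the demo's actual function):

def build_chat_prompt(system_prompt, history, user_msg, max_history_turns=6):
    # history: list of (user, assistant) pairs; keep only the newest turns
    parts = [f"<|system|>{system_prompt}"]
    for u, a in history[-max_history_turns:]:
        parts.append(f"<|user|>{u}")
        parts.append(f"<|assistant|>{a}")
    parts.append(f"<|user|>{user_msg}")
    parts.append("<|assistant|>")  # the model completes from here
    return "\n".join(parts)

prompt = build_chat_prompt("You are LabCore LLM.", [("Hi", "Hello!")],
                           "What can you do?")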
Output Files / Artifacts Produced
- CLI: generated text in terminal output
- Demo: generated text in Gradio UI
- No new model files unless you explicitly export
Common Errors
- Char tokenizer metadata missing: see Char vocab missing.
- Metadata path mismatch (txt vs bin): see Meta path mismatch.
- CUDA fallback behavior: see CUDA not detected.