gpt2small-en-it-nanochat-lr2e4-batchmaxpossible-bs7-step9000
This repo stages the best saved checkpoint from the local NanoChat EN/IT GPT-2-small-like run stable-config-recipe-v3-gpt2small-lr2e4-batchmaxpossible-bs7.
What this is
- model family: GPT-2-small-like decoder-only LM
- parameters: ~136M
- languages: English + Italian
- context length: 2500
- selected checkpoint: step_9000.pt
- selection reason: lowest recorded validation loss among saved checkpoints in best_validation.json
Best validation
- step: 9000
- validation loss: 4.0797094479
- validation perplexity: 59.1282875069
- validation batches: 128
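The loss and perplexity above are related by exponentiation. A minimal sketch, assuming the reported loss is a mean per-token cross-entropy in nats (the README does not say so explicitly, but the two reported figures are consistent with that reading); the function name is illustrative:

```python
import math

def perplexity_from_loss(mean_nll: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_nll)

val_loss = 4.0797094479  # validation loss reported for step 9000
print(round(perplexity_from_loss(val_loss), 2))
```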
Important caveat
A later checkpoint step_10000.pt exists, but it is worse on validation than step_9000.pt, so this release intentionally publishes step_9000.pt instead of the latest saved checkpoint.
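The selection rule described here (lowest validation loss, not latest step) can be sketched as follows. The record layout and the key names `step` and `val_loss` are assumptions for illustration; the actual schema of eval_metrics.jsonl and best_validation.json may differ.

```python
import json

def best_checkpoint(jsonl_lines, loss_key="val_loss"):
    """Return the record with the lowest validation loss.

    The key name is hypothetical; adjust it to the real schema.
    """
    records = [json.loads(line) for line in jsonl_lines if line.strip()]
    return min(records, key=lambda record: record[loss_key])

# Illustrative records only: the step-9000 loss comes from this README,
# the other two values are made-up placeholders, not measured results.
example = [
    '{"step": 8000, "val_loss": 4.20}',
    '{"step": 9000, "val_loss": 4.0797094479}',
    '{"step": 10000, "val_loss": 4.10}',
]
print(best_checkpoint(example)["step"])
```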
Training/data provenance
- training config: training_config.yaml
- tokenizer: tokenizer.json + tokenizer_meta.json
- packed dataset root used by the run: /mnt/apps/llm-nanochat/datasets/202605011052_fresh_50_50_score100_2500_sourcebalanced
- tokenizer root used by the run: /mnt/apps/llm-nanochat/tokenizers/tok_202605011052_fresh_50_50_score100_32k_fromscratch
Included files
- step_9000.pt
- step_9000.safetensors
- step_9000.safetensors.json
- training_config.yaml
- tokenizer.json
- tokenizer_meta.json
- best_validation.json
- eval_summary.json
- probe_step9000_summary.json
- full run telemetry snapshots: eval_metrics.jsonl, metrics.jsonl, probe_generations.jsonl
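To sanity-check the safetensors export without the NanoChat stack, the file header can be read with the standard library alone. This is a sketch based on the public safetensors layout (an 8-byte little-endian header length followed by a JSON header); the helper names are illustrative and not part of this repo.

```python
import json
import struct
from math import prod

def read_safetensors_header(path):
    """Read the JSON header of a .safetensors file without loading tensors."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        return json.loads(f.read(header_len))

def count_parameters(header):
    """Sum element counts over all tensors, skipping the metadata entry."""
    return sum(
        prod(entry["shape"])
        for name, entry in header.items()
        if name != "__metadata__"
    )
```

Calling `count_parameters(read_safetensors_header("step_9000.safetensors"))` should give a figure comparable to the ~136M stated above, although the exact count depends on how the export stores shared or tied tensors.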
Probe reading at step 9000
- EN factual prompt: "The capital of Italy is" -> "Rome": weak (rank=248)
- EN simple continuation: "A small language model should" -> "be": strong (rank=1)
- IT factual prompt: "La capitale d'Italia è" -> "Roma": weak (rank=1103)
- IT simple continuation: "Un piccolo modello linguistico dovrebbe" -> "essere": strong (rank=1)
In short, the checkpoint continues simple sentences plausibly in both languages but does not reliably recall facts. It is useful as an intermediate bilingual pretraining artifact, not as a polished factual model.
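The rank values in the probe readings can be read as the position of the expected token among all next-token candidates. A minimal sketch, assuming rank is 1-based and computed by sorting next-token logits in descending order (the probe's actual implementation is not shown in this README); the toy logits are made up:

```python
def target_rank(logits, target_id):
    """1-based rank of target_id when tokens are sorted by descending logit."""
    target_logit = logits[target_id]
    return 1 + sum(1 for value in logits if value > target_logit)

toy_logits = [0.1, 2.0, 0.5, -1.0]  # toy vocabulary of 4 tokens
print(target_rank(toy_logits, 1))  # top-scoring token -> 1
print(target_rank(toy_logits, 3))  # lowest-scoring token -> 4
```

Under this reading, rank=1 means the expected token was the model's top prediction, while rank=248 or rank=1103 means hundreds of other tokens scored higher.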
Usage
This project uses a custom NanoChat inference/training stack. The easiest local UI in the source repo is the Chainlit checkpoint tester documented in the repo README.
Limitations
- factual recall is still weak
- generations can become repetitive
- the model was selected by validation loss inside this run family, not by broad downstream benchmark performance
- dataset redistribution for the full training corpus may have separate licensing constraints; this repo contains model artifacts, not the raw/prepared training corpus