gpt2small-en-it-nanochat-lr2e4-bs6-wsd-earlydecay7000-final5e6-webwiki-step7000
This repo stages the best benchmark-selected checkpoint from the local NanoChat EN/IT GPT-2-small-like WSD early-decay web/wiki run 20260530_fresh-gpt2small-lr2e4-bs6-wsd-earlydecay7000-final5e6-webwiki.
What this is
- model family: GPT-2-small-like decoder-only LM
- parameters: ~136M
- languages: English + Italian
- context length: 2500
- selected checkpoint: step_7000.pt
- selection reason: best full repo-native CPU benchmark result among the checked saved checkpoints from this run family
- status relative to comparable public checkpoints: currently the best benchmark checkpoint among the comparable GPT-2-small EN/IT checkpoints published from this workspace
Best in-run validation
- step: 7000
- validation loss: 3.9247755519
- validation perplexity: 50.6417103477
- validation batches: 128
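The reported perplexity is consistent with the exponential of the reported validation loss, which suggests the loss is a mean cross-entropy in nats. The following is a minimal Python check under that assumption; the helper name `perplexity` is illustrative and not part of the NanoChat stack.

```python
import math


def perplexity(mean_loss_nats: float) -> float:
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_loss_nats)


# Values copied from the in-run validation entry above.
val_loss = 3.9247755519
reported_ppl = 50.6417103477

# Assumption: the loss is a natural-log cross-entropy averaged per token.
computed_ppl = perplexity(val_loss)
print(f"perplexity from loss: {computed_ppl:.4f}")
print(f"matches reported value: {math.isclose(computed_ppl, reported_ppl, rel_tol=1e-6)}")
```

The same conversion can be applied to the benchmark losses to cross-check the reported `ppl_*` values.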
Important caveat: this release checkpoint is both the best online validation point of the run and the best checkpoint under the repo-native closeout benchmark, but the run itself was not stable afterward and degraded quickly.
Benchmark summary
Repo-native benchmark suite: configs/eval/20260521_pretrain_minimal_en_it_webwiki_step11000.yaml
Winner metrics:
- val_loss_mixed: 5.2158
- ppl_mixed: 184.1616
- val_loss_en: 5.0181
- ppl_en: 151.1306
- val_loss_it: 3.9781
- ppl_it: 53.4156
- loop_rate: 0.575
- repeated_4gram_rate: 0.95
- cloze_en_contains: 0.02
- cloze_it_contains: 0.10
- cloze_en_exact: 0.00
- cloze_it_exact: 0.00
Benchmark ranking across the checked saved checkpoints from this run:
- step_7000: mixed=5.2158, en=5.0181, it=3.9781
- step_8000: mixed=5.3042, en=5.3609, it=4.2465
- step_9000: mixed=5.5408, en=5.3154, it=4.2196
- step_10000: mixed=5.6087, en=5.3671, it=4.2692
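The selection rule behind this ranking (lowest `val_loss_mixed` wins) can be sketched in a few lines of Python. The numbers are copied from the ranking above; the dictionary layout and the function name `rank_checkpoints` are illustrative assumptions, not the format of the repo's benchmark files.

```python
# Benchmark losses copied from the ranking above, keyed by checkpoint name.
benchmark = {
    "step_7000": {"mixed": 5.2158, "en": 5.0181, "it": 3.9781},
    "step_8000": {"mixed": 5.3042, "en": 5.3609, "it": 4.2465},
    "step_9000": {"mixed": 5.5408, "en": 5.3154, "it": 4.2196},
    "step_10000": {"mixed": 5.6087, "en": 5.3671, "it": 4.2692},
}


def rank_checkpoints(results, metric="mixed"):
    """Return checkpoint names sorted from lowest to highest loss on `metric`."""
    return sorted(results, key=lambda name: results[name][metric])


ranking = rank_checkpoints(benchmark)
best = ranking[0]

# Print each checkpoint with its gap to the best mixed loss.
for name in ranking:
    gap = benchmark[name]["mixed"] - benchmark[best]["mixed"]
    print(f"{name}: mixed={benchmark[name]['mixed']:.4f} (+{gap:.4f} vs best)")
```

Passing `metric="en"` or `metric="it"` ranks by a single language instead of the mixed loss.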
Why this checkpoint
The run did not simply "end later and get better later". It improved through the early window, then entered degradation.
Observed online validation:
- 7000: 3.9248
- 8000: 4.1199
- 9000: 4.0582
- 10000: 4.0088
The repo-native benchmark confirmed that the best preserved checkpoint was still step_7000, not one of the later saves.
Operationally, this is an early-winner release from a run that later collapsed.
Cross-run leaderboard update on 2026-06-03
After the original release, the same repo-native CPU benchmark family was rerun across five previously selected winner checkpoints from distinct GPT-2-small EN/IT web/wiki run families.
Cross-run leaderboard by the same primary metric val_loss_mixed:
- this release: earlydecay7000 step_7000 mixed=5.2158
- earlydecay3500 step_7000 mixed=5.2358
- lr2e4 cosine step_7000 mixed=5.3558
- lr2e4 WSD step_11000 mixed=5.3576
- lr1e4 cosine step_14000 mixed=5.4493
That wider re-check kept the winner unchanged. So this repo is not only the best checkpoint inside its own run family, but also the current best benchmark checkpoint across the comparable public GPT-2-small EN/IT web/wiki slice tracked from this workspace.
Important caveat: this label is intentionally narrow. It means "best under the current comparable benchmark", not "best free-form generation quality". Samples across the leaderboard remain weak and repetitive enough that the publish decision is operational and comparative, not a claim of polished generation quality.
Best-so-far comparison against comparable public checkpoints
Token estimate formula:
- tokens_seen ~= step * batch_size * grad_accum_steps * sequence_length
- for these comparable GPT-2-small web/wiki and v5 runs: 6 * 16 * 2500 = 240000 tokens per step
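As a quick check of the formula above, the estimate can be reproduced in Python. This is a sketch: the function name `estimate_tokens_seen` is illustrative, and the defaults assume the three factors 6, 16, and 2500 map to batch size, gradient accumulation steps, and sequence length in that order, as the formula suggests.

```python
def estimate_tokens_seen(step, batch_size=6, grad_accum_steps=16, sequence_length=2500):
    """Estimate tokens seen after `step` optimizer steps.

    Defaults assume 6 * 16 * 2500 = 240000 tokens per step, as stated above.
    """
    return step * batch_size * grad_accum_steps * sequence_length


# Steps of the checkpoints compared in this section.
for step in (7000, 10000, 11000, 14000, 23000):
    tokens = estimate_tokens_seen(step)
    print(f"step {step}: ~{tokens / 1e9:.2f}B tokens")
```

The estimate is an upper bound on useful tokens in the sense that it counts every position in every packed sequence, including any padding the packing step may have left.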
Historical order by the same primary metric val_loss_mixed:
- this release: 20260530 ... earlydecay7000 ... step_7000 mixed=5.2158
  - estimated tokens seen: ~1.68B
- gpt2small-en-it-nanochat-lr2e4-bs6-cosine-webwiki-step7000 mixed=5.3558
  - estimated tokens seen: ~1.68B
- gpt2small-en-it-nanochat-lr2e4-bs6-wsd-fastdecay-webwiki-step11000 mixed=5.3576
  - estimated tokens seen: ~2.64B
- gpt2small-en-it-nanochat-lr1e4-bs6-cosine-webwiki-step14000 mixed=5.4493
  - estimated tokens seen: ~3.36B
- gpt2small-en-it-nanochat-lr2e4-bs6-wsd-fastdecay-step10000 mixed=5.4756
  - estimated tokens seen: ~2.40B
- gpt2small-en-it-nanochat-lr1e4-bs6-cosine-webwiki-step23000 mixed=5.6642
  - estimated tokens seen: ~5.52B
So this is currently the best benchmarked public checkpoint in this comparable slice, while still being an early checkpoint from an unstable run.
Source/domain losses for the winner
- source_loss_books_en: 4.7561
- source_loss_books_it: 4.8877
- source_loss_code: 8.1805
- source_loss_web_en: 5.7991
- source_loss_web_it: 6.0342
- source_loss_wiki_en: 4.0188
- source_loss_wiki_it: 4.0581
Training/data provenance
- training config: training_config.yaml
- tokenizer: tokenizer.json + tokenizer_meta.json
- packed dataset root used by the run: /mnt/apps/llm-nanochat/datasets/202605141153_fineweb50_wiki50_50en_50it_score100_2500context_5Btokens_tok_20260515_en50it50_webwiki_stratified_500M
- tokenizer root used by the run: /mnt/apps/llm-nanochat/tokenizers/tokenizer_20260515_en50it50_webwiki_stratified_500M
- source commit for release packaging: 044df5556ed876b481a67166119a6bab66ad6f4d
Included files
- step_7000.pt
- step_7000.safetensors
- step_7000.safetensors.json
- training_config.yaml
- tokenizer.json
- tokenizer_meta.json
- best_validation.json
- eval_summary.json
- comparison.json
- benchmark_report.md
- benchmark_metrics.json
- benchmark_scores.json
- benchmark_source_losses.json
- 2026-06-03_cross_run_leaderboard_update.md
- cross_run_leaderboard_report_20260603.md
- cross_run_leaderboard_comparison_20260603.json
- cross_run_leaderboard_summary_20260603.json
- probe_step7000_summary.json
- full run telemetry snapshots: eval_metrics.jsonl, metrics.jsonl, probe_generations.jsonl
- release note: 2026-06-01_wsd_earlydecay7000_release_step7000.md
Probe reading at step 7000
- EN factual prompt: The capital of Italy is -> Rome: rank=42, prob=0.0028076
- EN procedural prompt: A small language model should -> be: rank=1, prob=0.4629
- IT factual prompt: La capitale d'Italia e' -> Roma: rank=287, prob=0.0002937
- IT procedural prompt: Un piccolo modello linguistico dovrebbe -> essere: rank=1, prob=0.3770
Factual probes remain weak in both languages, while the procedural prompts are strong next-token continuations. These probes are directional evidence only. The main selection rule here is the repo-native benchmark result.
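For readers unfamiliar with rank/prob probes, the sketch below shows one common way such numbers are derived from next-token logits. It is an assumption about the metric's definition (softmax probability of the target token, and a 1-based rank counting tokens with strictly higher logits), not the repo's actual probe code. The function name `probe_target` and the toy logits are hypothetical.

```python
import math


def probe_target(logits, target_index):
    """Return (rank, prob) of the target token under softmax(logits).

    Rank is 1-based: rank=1 means no other token has a higher logit.
    """
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    prob = exps[target_index] / sum(exps)
    # Count tokens that strictly outrank the target.
    rank = 1 + sum(1 for x in logits if x > logits[target_index])
    return rank, prob


# Toy 4-token vocabulary, purely illustrative (not model output).
toy_logits = [2.0, 0.5, 1.0, -1.0]
rank, prob = probe_target(toy_logits, target_index=2)
print(f"rank={rank}, prob={prob:.4f}")
```

Under this reading, a result like rank=1 with prob near 0.4 means the target is the single most likely next token, while rank=287 means 286 vocabulary entries were scored higher.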
Usage
This project uses a custom NanoChat inference/training stack. The easiest local UI in the source repo is the Chainlit checkpoint tester documented in the repo README.
Limitations
- mixed quality is still in the weak/intermediate band
- generations remain repetitive and often unstable under free-form continuation
- factual recall is still weak in both languages
- this is the best preserved checkpoint inside a run that later collapsed, not a claim that the schedule is fully solved
- dataset redistribution for the full training corpus may have separate licensing constraints; this repo contains model artifacts, not the raw/prepared training corpus