gpt2small-en-it-nanochat-lr2e4-bs6-wsd-earlydecay7000-final5e6-webwiki-step7000

This repo stages the best benchmark-selected checkpoint from the local NanoChat EN/IT GPT-2-small-like WSD early-decay web/wiki run 20260530_fresh-gpt2small-lr2e4-bs6-wsd-earlydecay7000-final5e6-webwiki.

What this is

model family: GPT-2-small-like decoder-only LM
parameters: ~136M
languages: English + Italian
context length: 2500
selected checkpoint: step_7000.pt
selection reason: best full repo-native CPU benchmark result among the checked saved checkpoints from this run family
status relative to comparable public checkpoints: currently the best benchmark checkpoint among the comparable GPT-2-small EN/IT checkpoints published from this workspace

Best in-run validation

step: 7000
validation loss: 3.9247755519
validation perplexity: 50.6417103477
validation batches: 128
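
The loss and perplexity figures above are consistent with the usual relation perplexity = exp(loss), where the loss is a mean per-token negative log-likelihood in nats. The snippet below is a minimal sanity check under that assumption. The function name is illustrative and is not taken from the NanoChat code.

```python
import math


def perplexity(mean_nll_nats: float) -> float:
    """Perplexity for a mean per-token negative log-likelihood in nats."""
    return math.exp(mean_nll_nats)


# Validation loss reported above for step 7000.
val_loss = 3.9247755519
val_ppl = perplexity(val_loss)

# Should land very close to the reported validation perplexity.
print(f"loss={val_loss:.4f} -> ppl={val_ppl:.4f}")
```

The same relation holds, up to rounding, for the val_loss_* / ppl_* pairs in the benchmark summary.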

Important caveat: this release checkpoint is both the best online validation point of the run and the best checkpoint under the repo-native closeout benchmark, but the run itself was not stable afterward and degraded quickly.

Benchmark summary

Repo-native benchmark suite: configs/eval/20260521_pretrain_minimal_en_it_webwiki_step11000.yaml

Winner metrics:

val_loss_mixed: 5.2158
ppl_mixed: 184.1616
val_loss_en: 5.0181
ppl_en: 151.1306
val_loss_it: 3.9781
ppl_it: 53.4156
loop_rate: 0.575
repeated_4gram_rate: 0.95
cloze_en_contains: 0.02
cloze_it_contains: 0.10
cloze_en_exact: 0.00
cloze_it_exact: 0.00

Benchmark ranking across the checked saved checkpoints from this run:

step_7000
- mixed=5.2158
- en=5.0181
- it=3.9781
step_8000
- mixed=5.3042
- en=5.3609
- it=4.2465
step_9000
- mixed=5.5408
- en=5.3154
- it=4.2196
step_10000
- mixed=5.6087
- en=5.3671
- it=4.2692
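
The selection rule implied by this ranking (lowest val_loss_mixed wins) can be sketched as follows. The numbers are copied from the list above. The dictionary layout and the function name are hypothetical and do not reflect the actual benchmark_metrics.json schema.

```python
# Benchmark losses copied from the ranking above.
benchmark = {
    "step_7000": {"mixed": 5.2158, "en": 5.0181, "it": 3.9781},
    "step_8000": {"mixed": 5.3042, "en": 5.3609, "it": 4.2465},
    "step_9000": {"mixed": 5.5408, "en": 5.3154, "it": 4.2196},
    "step_10000": {"mixed": 5.6087, "en": 5.3671, "it": 4.2692},
}


def rank_checkpoints(results, metric="mixed"):
    """Return checkpoint names sorted by ascending loss on `metric`."""
    return sorted(results, key=lambda name: results[name][metric])


ranking = rank_checkpoints(benchmark)
winner = ranking[0]
print("ranking by mixed loss:", ranking)
print("winner:", winner)
```

Ranking by "en" or "it" instead of "mixed" changes the order of the later checkpoints but not the winner.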

Why this checkpoint

The run did not keep improving with more training. It improved through the early window, then degraded.

Observed online validation:

7000: 3.9248
8000: 4.1199
9000: 4.0582
10000: 4.0088

The repo-native benchmark confirmed that the best preserved checkpoint was still step_7000, not one of the later saves.

Operationally, this is an early-winner release from a run that later collapsed.

Cross-run leaderboard update on 2026-06-03

After the original release, the same repo-native CPU benchmark family was rerun across five previously selected winner checkpoints from distinct GPT-2-small EN/IT web/wiki run families.

Cross-run leaderboard by the same primary metric val_loss_mixed:

this release: earlydecay7000 step_7000
- mixed=5.2158
earlydecay3500 step_7000
- mixed=5.2358
lr2e4 cosine step_7000
- mixed=5.3558
lr2e4 WSD step_11000
- mixed=5.3576
lr1e4 cosine step_14000
- mixed=5.4493

That wider re-check kept the winner unchanged. So this repo is not only the best checkpoint inside its own run family, but also the current best benchmark checkpoint across the comparable public GPT-2-small EN/IT web/wiki slice tracked from this workspace.

Important caveat: this label is intentionally narrow. It means "best under the current comparable benchmark", not "best free-form generation quality". Samples across the leaderboard remain weak and repetitive enough that the publish decision is operational and comparative, not a claim of polished generation quality.

Best-so-far comparison against comparable public checkpoints

Token estimate formula:

tokens_seen ~= step * batch_size * grad_accum_steps * sequence_length
for these comparable GPT-2-small web/wiki and v5 runs: 6 (batch size) * 16 (gradient accumulation steps) * 2500 (sequence length) = 240,000 tokens per step
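
As a worked example, the estimate can be reproduced with a few lines of Python. This is a sketch of the formula as stated, assuming every step processes a full batch of full-length sequences. The function names are illustrative.

```python
def tokens_per_step(batch_size: int, grad_accum_steps: int, sequence_length: int) -> int:
    """Tokens consumed by one optimizer step."""
    return batch_size * grad_accum_steps * sequence_length


def tokens_seen(step: int, batch_size: int = 6, grad_accum_steps: int = 16,
                sequence_length: int = 2500) -> int:
    """Estimated tokens seen after `step` optimizer steps."""
    return step * tokens_per_step(batch_size, grad_accum_steps, sequence_length)


per_step = tokens_per_step(6, 16, 2500)
print(f"tokens per step: {per_step:,}")

# Steps of the checkpoints compared in this section.
for step in (7000, 10000, 11000, 14000, 23000):
    print(f"step {step}: ~{tokens_seen(step) / 1e9:.2f}B tokens")
```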

Historical order by the same primary metric val_loss_mixed:

this release: 20260530 ... earlydecay7000 ... step_7000
- mixed=5.2158
- estimated tokens seen: ~1.68B
gpt2small-en-it-nanochat-lr2e4-bs6-cosine-webwiki-step7000
- mixed=5.3558
- estimated tokens seen: ~1.68B
gpt2small-en-it-nanochat-lr2e4-bs6-wsd-fastdecay-webwiki-step11000
- mixed=5.3576
- estimated tokens seen: ~2.64B
gpt2small-en-it-nanochat-lr1e4-bs6-cosine-webwiki-step14000
- mixed=5.4493
- estimated tokens seen: ~3.36B
gpt2small-en-it-nanochat-lr2e4-bs6-wsd-fastdecay-step10000
- mixed=5.4756
- estimated tokens seen: ~2.40B
gpt2small-en-it-nanochat-lr1e4-bs6-cosine-webwiki-step23000
- mixed=5.6642
- estimated tokens seen: ~5.52B

So this is currently the best benchmarked public checkpoint in this comparable slice, while still being an early checkpoint from an unstable run.

Source/domain losses for the winner

source_loss_books_en: 4.7561
source_loss_books_it: 4.8877
source_loss_code: 8.1805
source_loss_web_en: 5.7991
source_loss_web_it: 6.0342
source_loss_wiki_en: 4.0188
source_loss_wiki_it: 4.0581

Training/data provenance

training config: training_config.yaml
tokenizer: tokenizer.json + tokenizer_meta.json
packed dataset root used by the run: /mnt/apps/llm-nanochat/datasets/202605141153_fineweb50_wiki50_50en_50it_score100_2500context_5Btokens_tok_20260515_en50it50_webwiki_stratified_500M
tokenizer root used by the run: /mnt/apps/llm-nanochat/tokenizers/tokenizer_20260515_en50it50_webwiki_stratified_500M
source commit for release packaging: 044df5556ed876b481a67166119a6bab66ad6f4d

Included files

step_7000.pt
step_7000.safetensors
step_7000.safetensors.json
training_config.yaml
tokenizer.json
tokenizer_meta.json
best_validation.json
eval_summary.json
comparison.json
benchmark_report.md
benchmark_metrics.json
benchmark_scores.json
benchmark_source_losses.json
2026-06-03_cross_run_leaderboard_update.md
cross_run_leaderboard_report_20260603.md
cross_run_leaderboard_comparison_20260603.json
cross_run_leaderboard_summary_20260603.json
probe_step7000_summary.json
full run telemetry snapshots: eval_metrics.jsonl, metrics.jsonl, probe_generations.jsonl
release note: 2026-06-01_wsd_earlydecay7000_release_step7000.md

Probe reading at step 7000

EN factual prompt "The capital of Italy is" -> "Rome": rank=42, prob=0.0028076
EN procedural prompt "A small language model should" -> "be": rank=1, prob=0.4629
IT factual prompt "La capitale d'Italia e'" -> "Roma": rank=287, prob=0.0002937
IT procedural prompt "Un piccolo modello linguistico dovrebbe" -> "essere": rank=1, prob=0.3770

Factual probes remain weak in both languages, while the procedural prompts are strong next-token continuations. These probes are directional evidence only. The main selection rule here is the repo-native benchmark result.
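
For readers unfamiliar with this kind of probe, the sketch below shows one common way to compute the rank and probability of a target token from next-token logits. It assumes a plain softmax over the vocabulary with no temperature or sampling filters. The actual probe script is not shown in this card, so the function name and the toy logits are hypothetical.

```python
import math


def target_rank_and_prob(logits, target_id):
    """Return (rank, probability) of `target_id` under softmax(logits).

    Rank is 1-based: rank 1 means no other token has a higher logit.
    """
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    prob = exps[target_id] / sum(exps)
    rank = 1 + sum(1 for x in logits if x > logits[target_id])
    return rank, prob


# Toy 3-token vocabulary; token 1 is the target.
rank, prob = target_rank_and_prob([2.0, 1.0, 0.0], target_id=1)
print(f"rank={rank}, prob={prob:.4f}")
```

Under this reading, rank=42 for "Rome" means 41 vocabulary tokens received a higher logit than the target at that position.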

Usage

This project uses a custom NanoChat inference/training stack. The easiest local UI in the source repo is the Chainlit checkpoint tester documented in the repo README.
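
To inspect the included step_7000.safetensors without the NanoChat stack, the file header can be read with the Python standard library alone. The sketch relies only on the published safetensors layout: an 8-byte little-endian header length followed by a JSON header. The helper names are hypothetical, and the resulting count may differ from the ~136M figure depending on how tied weights or buffers were saved.

```python
import json
import struct
from math import prod


def read_safetensors_header(path):
    """Return the tensor table of a .safetensors file, without tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # u64, little-endian
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def count_parameters(header):
    """Sum the element counts of all tensors listed in the header."""
    return sum(prod(entry["shape"]) for entry in header.values())


# Example usage, assuming the file has been downloaded next to this script:
# header = read_safetensors_header("step_7000.safetensors")
# print(f"{count_parameters(header) / 1e6:.1f}M parameters")
```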

Limitations

mixed quality is still in the weak/intermediate band
generations remain repetitive and often unstable under free-form continuation
factual recall is still weak in both languages
this is the best preserved checkpoint inside a run that later collapsed, not a claim that the schedule is fully solved
dataset redistribution for the full training corpus may have separate licensing constraints; this repo contains model artifacts, not the raw/prepared training corpus
