gpt2small-en-it-nanochat-lr3e4-bs6-wsd-shortfastdecay8k-final5e6-webwiki-step8000

This repo stages step_8000.pt, which is both the final checkpoint and the best online-validation checkpoint of the local NanoChat EN/IT run 20260605_fresh-gpt2small-lr3e4-bs6-wsd-shortfastdecay8k-final5e6-webwiki (GPT-2-small-like model, WSD schedule with a short fast decay, web/wiki data).

What this is

model family: GPT-2-small-like decoder-only LM
parameters: ~136M
languages: English + Italian
context length: 2500 tokens
selected checkpoint: step_8000.pt
tokens seen: ~1.92B
selection reason: best in-run online validation checkpoint and final saved checkpoint for this run
status relative to the companion benchmark winner:
- this is the validation-selected release
- the repo-native benchmark winner from the same run is step_4000
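
As a rough sanity check on the figures above, the per-step token budget can be derived from the token count, the step number, and the context length. This is only a sketch: it assumes step_8000 means 8000 optimizer steps, that tokens are spread evenly over those steps, and that every sequence fills the full context. None of that is confirmed by this card, and the token count itself is approximate.

```python
# Figures quoted in the list above; the token count is approximate ("~1.92B").
tokens_seen = 1.92e9
optimizer_steps = 8000      # assumption: step_8000 == 8000 optimizer steps
context_length = 2500

# Assumption: tokens are spread evenly over steps and every sequence is full length.
tokens_per_step = tokens_seen / optimizer_steps
sequences_per_step = tokens_per_step / context_length

print(f"tokens per step: {tokens_per_step:,.0f}")
print(f"full-length sequences per step: {sequences_per_step:.0f}")
```

How those sequences are split across micro-batches, accumulation steps, or devices is recorded in training_config.yaml, not inferred here.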

Best in-run validation

best saved validation step for the run: 8000
validation loss: 3.8823011749
validation perplexity: 48.535776
validation batches: 128

This checkpoint matches the run's online validation winner.

Repo-native benchmark context

Repo-native benchmark suite: configs/eval/20260521_pretrain_minimal_en_it_webwiki_step11000.yaml

Metrics for this checkpoint:

val_loss_mixed: 5.3930
ppl_mixed: 219.8592
val_loss_en: 4.9928
ppl_en: 147.3508
val_loss_it: 4.1405
ppl_it: 62.8313
loop_rate: 0.400
repeated_4gram_rate: 0.750
distinct_2: 0.4706
cloze_en_contains: 0.00
cloze_it_contains: 0.12
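
The loss and perplexity figures on this card can be cross-checked against each other. The sketch below assumes the losses are mean per-token cross-entropies in nats, so that perplexity is exp(loss). The tolerance is an illustrative choice that allows for the benchmark losses being rounded to four decimals.

```python
import math

# (loss, reported perplexity) pairs copied from this card.
reported = {
    "online validation": (3.8823011749, 48.535776),
    "val_loss_mixed": (5.3930, 219.8592),
    "val_loss_en": (4.9928, 147.3508),
    "val_loss_it": (4.1405, 62.8313),
}


def perplexity(loss: float) -> float:
    """Perplexity of a mean per-token cross-entropy given in nats."""
    return math.exp(loss)


for name, (loss, ppl) in reported.items():
    # Rounded losses cannot reproduce the perplexity exactly, so compare loosely.
    ok = math.isclose(perplexity(loss), ppl, rel_tol=2e-4)
    status = "ok" if ok else "mismatch"
    print(f"{name}: exp({loss}) = {perplexity(loss):.4f}, reported {ppl} -> {status}")
```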

Ranking among the saved checkpoints from this run that were benchmarked (lower val_loss_mixed is better):

step_4000 -> mixed=5.1440
step_7000 -> mixed=5.3313
step_5000 -> mixed=5.3651
step_8000 -> mixed=5.3930
step_6000 -> mixed=5.5364
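
The ordering above can be reproduced from the raw numbers. The snippet is a small sketch that sorts the checkpoints by val_loss_mixed and prints each one's gap to the best; the dictionary is typed in by hand from this card, not read from benchmark_scores.json, whose schema is not described here.

```python
# val_loss_mixed per benchmarked checkpoint, copied from the ranking above.
mixed_loss = {
    "step_4000": 5.1440,
    "step_5000": 5.3651,
    "step_6000": 5.5364,
    "step_7000": 5.3313,
    "step_8000": 5.3930,
}

# Lower loss is better, so sort ascending.
ranking = sorted(mixed_loss, key=mixed_loss.get)
best = ranking[0]

for rank, name in enumerate(ranking, start=1):
    gap = mixed_loss[name] - mixed_loss[best]
    print(f"{rank}. {name}: mixed={mixed_loss[name]:.4f} (+{gap:.4f} vs {best})")
```

On these numbers step_8000 lands fourth of five, about 0.249 nats behind step_4000.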

Important caveat: this run produced two different winners:

step_8000 won the run's internal online validation
step_4000 won the external repo-native benchmark used to rank comparable releases

Operationally:

step_8000 is the cleaner final checkpoint on repetition/diversity surface metrics
step_4000 remains the checkpoint we promote as the benchmark winner

Surface-quality reading

Compared with step_4000, this final checkpoint is behaviorally cleaner on several surface metrics:

loop_rate: 0.400 vs 0.725
repeated_4gram_rate: 0.750 vs 0.900
distinct_2: 0.4706 vs 0.4251
language_consistency_en: 1.00 vs 0.95

But it loses on the primary benchmark metric:

val_loss_mixed: 5.3930 vs 5.1440

So this repo is the final/validation winner, not the benchmark-first winner.
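
The split verdict can be made explicit by tagging each metric with the direction in which it improves. This is a sketch: the higher_is_better flags are this example's reading of the metric names, not something taken from the evaluation code.

```python
# (step_8000, step_4000, higher_is_better) per metric, values from the comparison above.
# The direction flags are an assumption of this sketch.
comparison = {
    "loop_rate": (0.400, 0.725, False),
    "repeated_4gram_rate": (0.750, 0.900, False),
    "distinct_2": (0.4706, 0.4251, True),
    "language_consistency_en": (1.00, 0.95, True),
    "val_loss_mixed": (5.3930, 5.1440, False),
}


def winner(step_8000: float, step_4000: float, higher_is_better: bool) -> str:
    """Name the checkpoint that is better on one metric."""
    if step_8000 == step_4000:
        return "tie"
    step_8000_wins = (step_8000 > step_4000) == higher_is_better
    return "step_8000" if step_8000_wins else "step_4000"


wins = {name: winner(*values) for name, values in comparison.items()}
for name, who in wins.items():
    print(f"{name}: {who}")
```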

Source/domain losses for this checkpoint

source_loss_books_en: 5.1537
source_loss_books_it: 5.1258
source_loss_code: 8.3286
source_loss_web_en: 6.2020
source_loss_web_it: 6.4544
source_loss_wiki_en: 3.9960
source_loss_wiki_it: 3.6270
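
To read these per-source numbers at a glance, it can help to order the domains by loss and to compare the two languages within each source. A minimal sketch, with the values typed in from the list above:

```python
# Per-source losses for step_8000, copied from the list above.
source_loss = {
    "books_en": 5.1537, "books_it": 5.1258, "code": 8.3286,
    "web_en": 6.2020, "web_it": 6.4544, "wiki_en": 3.9960, "wiki_it": 3.6270,
}

# Domains from lowest loss to highest loss.
by_difficulty = sorted(source_loss, key=source_loss.get)
print("lowest -> highest loss:", ", ".join(by_difficulty))

# Italian minus English loss for each source that has both languages.
language_gap = {
    source: source_loss[f"{source}_it"] - source_loss[f"{source}_en"]
    for source in ("books", "web", "wiki")
}
for source, gap in language_gap.items():
    print(f"{source}: it - en = {gap:+.4f}")
```

On these numbers the wiki sources have the lowest losses, code has the highest by a wide margin, and Italian is ahead of English on books and wiki but behind on web.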

Training/data provenance

training config: training_config.yaml
tokenizer files:
- tokenizer.json
- tokenizer_meta.json
checkpoint weights:
- step_8000.pt
- step_8000.safetensors
telemetry:
- best_validation.json
- metrics.jsonl
- eval_metrics.jsonl
- probe_generations.jsonl
benchmark bundle:
- eval_summary.json
- comparison.json
- benchmark_report.md
- benchmark_metrics.json
- benchmark_scores.json
- benchmark_source_losses.json
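
After downloading the repo, a quick way to confirm that nothing from the list above is missing is to check each file name against the local directory. The sketch below assumes a flat layout with every file at the repo root; missing_files is a hypothetical helper written for this example, not part of the repo.

```python
from pathlib import Path

# File names copied from the provenance list above.
EXPECTED_FILES = [
    "training_config.yaml",
    "tokenizer.json", "tokenizer_meta.json",
    "step_8000.pt", "step_8000.safetensors",
    "best_validation.json", "metrics.jsonl",
    "eval_metrics.jsonl", "probe_generations.jsonl",
    "eval_summary.json", "comparison.json", "benchmark_report.md",
    "benchmark_metrics.json", "benchmark_scores.json",
    "benchmark_source_losses.json",
]


def missing_files(repo_dir: str) -> list[str]:
    """Return the expected files that are absent from repo_dir (flat layout assumed)."""
    root = Path(repo_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # "." is a placeholder for wherever the repo was downloaded.
    print(missing_files(".") or "all expected files present")
```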

Limitations

Generations are still visibly repetitive and template-like.
This repo should not be read as evidence that free-form generation quality is solved.
The main value of this checkpoint is as the run's final online-validation winner and as a comparison point against the benchmark-winning step_4000.
