| --- |
| language: |
| - en |
| - it |
| license: other |
| library_name: custom |
| pipeline_tag: text-generation |
| tags: |
| - nanochat |
| - gpt2-small |
| - bilingual |
| - english |
| - italian |
| - pretraining |
| - webwiki |
| - wsd |
| - short-fast-decay |
| - validation-selected |
| - final-checkpoint |
| - lr3e4 |
| --- |
| |
| # gpt2small-en-it-nanochat-lr3e4-bs6-wsd-shortfastdecay8k-final5e6-webwiki-step8000 |
|
|
| This repo stages `step_8000.pt`, which is both the final saved checkpoint and the best online-validation checkpoint of the local NanoChat EN/IT run `20260605_fresh-gpt2small-lr3e4-bs6-wsd-shortfastdecay8k-final5e6-webwiki`. The run trains a GPT-2-small-like model on web/wiki data with a WSD (warmup-stable-decay) schedule and a short, fast decay phase. |
|
|
| ## What this is |
|
|
| - model family: GPT-2-small-like decoder-only LM |
| - parameters: ~136M |
| - languages: English + Italian |
| - context length: 2500 |
| - selected checkpoint: `step_8000.pt` |
| - tokens seen: `~1.92B` |
| - selection reason: best in-run online validation checkpoint and final saved checkpoint for this run |
| - status relative to the companion benchmark winner: |
|   - this is the validation-selected release |
|   - the repo-native benchmark winner from the same run is `step_4000` |
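
The token count and step number above imply a rough per-step token budget. The sketch below is plain arithmetic on the listed figures. It assumes that `~1.92B` is the cumulative count at step 8000 and that every sequence is packed to the full context length; this card states neither explicitly, so treat the result as an estimate.

```python
# Figures taken from the list above; "~1.92B" is approximate.
TOKENS_SEEN = 1_920_000_000
STEPS = 8_000
CONTEXT_LENGTH = 2_500

# Average number of tokens consumed per optimizer step.
tokens_per_step = TOKENS_SEEN // STEPS

# Assumption: every sequence is packed to the full context length.
sequences_per_step = tokens_per_step // CONTEXT_LENGTH

print(f"tokens per step:    {tokens_per_step}")
print(f"sequences per step: {sequences_per_step}")
```

The authoritative batch settings are in `training_config.yaml`.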
|
|
| ## Best in-run validation |
|
|
| - best saved validation step for the run: `8000` |
| - validation loss: `3.8823011749` |
| - validation perplexity: `48.535776` |
| - validation batches: `128` |
|
|
| This checkpoint matches the run's online validation winner. |
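
As a quick sanity check, the reported perplexity should equal the exponential of the reported loss if the loss is a mean per-token cross-entropy in nats. That is an assumption about how the run logs its loss, but the two numbers above are consistent with it. The helper name `perplexity` is illustrative.

```python
import math


def perplexity(loss_nats: float) -> float:
    """Perplexity for a mean per-token cross-entropy given in nats."""
    return math.exp(loss_nats)


# Values copied from the list above.
val_loss = 3.8823011749
reported_ppl = 48.535776

computed_ppl = perplexity(val_loss)
print(f"computed={computed_ppl:.6f} reported={reported_ppl:.6f}")

# The two figures should agree up to rounding.
assert math.isclose(computed_ppl, reported_ppl, rel_tol=1e-4)
```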
|
|
| ## Repo-native benchmark context |
|
|
| Repo-native benchmark suite: `configs/eval/20260521_pretrain_minimal_en_it_webwiki_step11000.yaml` |
|
|
| Metrics for this checkpoint: |
|
|
| - `val_loss_mixed`: `5.3930` |
| - `ppl_mixed`: `219.8592` |
| - `val_loss_en`: `4.9928` |
| - `ppl_en`: `147.3508` |
| - `val_loss_it`: `4.1405` |
| - `ppl_it`: `62.8313` |
| - `loop_rate`: `0.400` |
| - `repeated_4gram_rate`: `0.750` |
| - `distinct_2`: `0.4706` |
| - `cloze_en_contains`: `0.00` |
| - `cloze_it_contains`: `0.12` |
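
This card does not define `distinct_2` or `repeated_4gram_rate`. For orientation only, here is a minimal sketch of one common way to compute metrics of this kind. It assumes whitespace tokenization and a per-sample notion of "contains a repeated 4-gram"; the benchmark suite may use different definitions, and all function names here are hypothetical.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(tokens, n=2):
    """Share of n-grams that are unique (higher means more diverse)."""
    grams = ngrams(tokens, n)
    return len(set(grams)) / len(grams) if grams else 0.0


def has_repeated_ngram(tokens, n=4):
    """True if any n-gram occurs more than once in the sample."""
    grams = ngrams(tokens, n)
    return len(set(grams)) < len(grams)


def repeated_ngram_rate(samples, n=4):
    """Share of samples that contain at least one repeated n-gram."""
    if not samples:
        return 0.0
    # Assumption: plain whitespace tokenization, not the model tokenizer.
    flags = [has_repeated_ngram(text.split(), n) for text in samples]
    return sum(flags) / len(flags)


samples = ["a b c d a b c d", "a b c d e"]
print(distinct_n("a b a b a b".split(), n=2))
print(repeated_ngram_rate(samples, n=4))
```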
|
|
| Ranking of the evaluated saved checkpoints from this run by `val_loss_mixed` (lower is better): |
|
|
| 1. `step_4000` -> `mixed=5.1440` |
| 2. `step_7000` -> `mixed=5.3313` |
| 3. `step_5000` -> `mixed=5.3651` |
| 4. `step_8000` -> `mixed=5.3930` |
| 5. `step_6000` -> `mixed=5.5364` |
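
The ranking above can be reproduced by sorting the checkpoints by mixed loss in ascending order. This is only a sketch over the five values listed here, not the repo's own ranking script.

```python
# Mixed validation loss per checkpoint, copied from the ranking above.
mixed_loss = {
    "step_4000": 5.1440,
    "step_5000": 5.3651,
    "step_6000": 5.5364,
    "step_7000": 5.3313,
    "step_8000": 5.3930,
}

# Lower loss ranks first.
ranking = sorted(mixed_loss, key=mixed_loss.get)
best = ranking[0]

for rank, step in enumerate(ranking, start=1):
    gap = mixed_loss[step] - mixed_loss[best]
    print(f"{rank}. {step}: mixed={mixed_loss[step]:.4f} (+{gap:.4f} vs {best})")
```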
|
|
| Important caveat: this run produced two different winners: |
|
|
| - `step_8000` won the run's internal online validation |
| - `step_4000` won the external repo-native benchmark used to rank comparable releases |
|
|
| Operationally: |
|
|
| - `step_8000` is the cleaner final checkpoint on repetition/diversity surface metrics |
| - `step_4000` remains the checkpoint we promote as the benchmark winner |
|
|
| ## Surface-quality reading |
|
|
| Compared with `step_4000`, this final checkpoint is cleaner on several surface metrics (values are `step_8000` vs `step_4000`): |
|
|
| - `loop_rate`: `0.400` vs `0.725` |
| - `repeated_4gram_rate`: `0.750` vs `0.900` |
| - `distinct_2`: `0.4706` vs `0.4251` |
| - `language_consistency_en`: `1.00` vs `0.95` |
|
|
| But it loses on the primary benchmark metric: |
|
|
| - `val_loss_mixed`: `5.3930` vs `5.1440` |
|
|
| So this repo is the final/validation winner, not the benchmark-first winner. |
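
The split verdict can be made explicit by attaching a direction to each metric and comparing the two checkpoints row by row. The directions below are assumptions inferred from how this card describes the metrics (lower is better for loss, loop, and repetition rates; higher is better for diversity and language consistency).

```python
# (metric, step_8000 value, step_4000 value, higher_is_better)
ROWS = [
    ("loop_rate", 0.400, 0.725, False),
    ("repeated_4gram_rate", 0.750, 0.900, False),
    ("distinct_2", 0.4706, 0.4251, True),
    ("language_consistency_en", 1.00, 0.95, True),
    ("val_loss_mixed", 5.3930, 5.1440, False),
]


def winner(v8000, v4000, higher_is_better):
    """Name the checkpoint that does better on a single metric."""
    if v8000 == v4000:
        return "tie"
    return "step_8000" if (v8000 > v4000) == higher_is_better else "step_4000"


winners = {name: winner(a, b, hib) for name, a, b, hib in ROWS}
for name, who in winners.items():
    print(f"{name:<26} {who}")
```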
|
|
| ## Source/domain losses for this checkpoint |
|
|
| - `source_loss_books_en`: `5.1537` |
| - `source_loss_books_it`: `5.1258` |
| - `source_loss_code`: `8.3286` |
| - `source_loss_web_en`: `6.2020` |
| - `source_loss_web_it`: `6.4544` |
| - `source_loss_wiki_en`: `3.9960` |
| - `source_loss_wiki_it`: `3.6270` |
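
To read these numbers at a glance, the sketch below sorts the sources from easiest to hardest and computes the EN minus IT gap for each domain that has both languages. It assumes the source losses are in nats, like the validation loss, so that `exp(loss)` gives a perplexity; the dictionary keys drop the `source_loss_` prefix for brevity.

```python
import math

# Per-source losses copied from the list above.
source_loss = {
    "books_en": 5.1537,
    "books_it": 5.1258,
    "code": 8.3286,
    "web_en": 6.2020,
    "web_it": 6.4544,
    "wiki_en": 3.9960,
    "wiki_it": 3.6270,
}

# Easiest (lowest loss) first.
by_difficulty = sorted(source_loss.items(), key=lambda kv: kv[1])
for name, loss in by_difficulty:
    print(f"{name:<9} loss={loss:.4f} ppl~{math.exp(loss):.1f}")

# Positive gap: English is harder than Italian for that domain.
gaps = {
    domain: source_loss[f"{domain}_en"] - source_loss[f"{domain}_it"]
    for domain in ("books", "web", "wiki")
}
for domain, gap in gaps.items():
    print(f"{domain:<6} en-it gap={gap:+.4f}")
```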
|
|
| ## Training/data provenance |
|
|
| - training config: `training_config.yaml` |
| - tokenizer files: |
|   - `tokenizer.json` |
|   - `tokenizer_meta.json` |
| - checkpoint weights: |
|   - `step_8000.pt` |
|   - `step_8000.safetensors` |
| - telemetry: |
|   - `best_validation.json` |
|   - `metrics.jsonl` |
|   - `eval_metrics.jsonl` |
|   - `probe_generations.jsonl` |
| - benchmark bundle: |
|   - `eval_summary.json` |
|   - `comparison.json` |
|   - `benchmark_report.md` |
|   - `benchmark_metrics.json` |
|   - `benchmark_scores.json` |
|   - `benchmark_source_losses.json` |
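
After downloading the repo, a small manifest check can confirm that every file listed above is present. This sketch assumes a flat layout with all files at the top level of the download directory, which this card does not state; `missing_files` is a hypothetical helper, not part of the repo.

```python
from pathlib import Path

# File names copied from the provenance list above.
EXPECTED_FILES = [
    "training_config.yaml",
    "tokenizer.json",
    "tokenizer_meta.json",
    "step_8000.pt",
    "step_8000.safetensors",
    "best_validation.json",
    "metrics.jsonl",
    "eval_metrics.jsonl",
    "probe_generations.jsonl",
    "eval_summary.json",
    "comparison.json",
    "benchmark_report.md",
    "benchmark_metrics.json",
    "benchmark_scores.json",
    "benchmark_source_losses.json",
]


def missing_files(repo_dir):
    """Return the expected files that are not present in repo_dir."""
    root = Path(repo_dir)
    # Assumption: all files sit directly under the repo root.
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


print(missing_files("."))
```

An empty list means the local copy matches the manifest.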
|
|
| ## Limitations |
|
|
| - Generations are still visibly repetitive and template-like. |
| - This repo should not be read as evidence that free-form generation quality is solved. |
| - The main value of this checkpoint is as the run's final online-validation winner and as a comparison point against the benchmark-winning `step_4000`. |
|
|