Commit b162fdc (verified), committed by nazdef · 1 Parent(s): eb5cc19

Upload bs7 best checkpoint step_9000

README.md ADDED
@@ -0,0 +1,79 @@
+ ---
+ language:
+ - en
+ - it
+ license: other
+ library_name: custom
+ pipeline_tag: text-generation
+ tags:
+ - nanochat
+ - gpt2-small
+ - bilingual
+ - english
+ - italian
+ - pretraining
+ ---
+
+ # gpt2small-en-it-nanochat-lr2e4-batchmaxpossible-bs7-step9000
+
+ This repo stages the best saved checkpoint from the local NanoChat EN/IT GPT-2-small-like run `stable-config-recipe-v3-gpt2small-lr2e4-batchmaxpossible-bs7`.
+
+ ## What this is
+
+ - model family: GPT-2-small-like decoder-only LM
+ - parameters: ~136M
+ - languages: English + Italian
+ - context length: 2500
+ - selected checkpoint: `step_9000.pt`
+ - selection reason: lowest recorded validation loss among saved checkpoints in `best_validation.json`
+
+ ## Best validation
+
+ - step: 9000
+ - validation loss: 4.0797094479
+ - validation perplexity: 59.1282875069
+ - validation batches: 128
+
+ ## Important caveat
+
+ A later checkpoint `step_10000.pt` exists, but it is worse on validation than `step_9000.pt`, so this release intentionally publishes `step_9000.pt` instead of the latest saved checkpoint.
+
+ ## Training/data provenance
+
+ - training config: `training_config.yaml`
+ - tokenizer: `tokenizer.json` + `tokenizer_meta.json`
+ - packed dataset root used by the run: `/mnt/apps/llm-nanochat/datasets/202605011052_fresh_50_50_score100_2500_sourcebalanced`
+ - tokenizer root used by the run: `/mnt/apps/llm-nanochat/tokenizers/tok_202605011052_fresh_50_50_score100_32k_fromscratch`
+
+ ## Included files
+
+ - `step_9000.pt`
+ - `step_9000.safetensors`
+ - `step_9000.safetensors.json`
+ - `training_config.yaml`
+ - `tokenizer.json`
+ - `tokenizer_meta.json`
+ - `best_validation.json`
+ - `eval_summary.json`
+ - `probe_step9000_summary.json`
+ - full run telemetry snapshots: `eval_metrics.jsonl`, `metrics.jsonl`, `probe_generations.jsonl`
+
+ ## Probe reading at step 9000
+
+ - EN factual prompt `The capital of Italy is -> Rome`: weak (`rank=248`)
+ - EN simple continuation `A small language model should -> be`: strong (`rank=1`)
+ - IT factual prompt `La capitale d'Italia è -> Roma`: weak (`rank=1103`)
+ - IT simple continuation `Un piccolo modello linguistico dovrebbe -> essere`: strong (`rank=1`)
+
+ So this checkpoint is useful as a real intermediate bilingual pretraining artifact, but it is not a polished factual model.
+
+ ## Usage
+
+ This project uses a custom NanoChat inference/training stack. The easiest local UI in the source repo is the Chainlit checkpoint tester documented in the repo README.
+
+ ## Limitations
+
+ - factual recall is still weak
+ - generations can become repetitive
+ - the model was selected by validation loss inside this run family, not by broad downstream benchmark performance
+ - dataset redistribution for the full training corpus may have separate licensing constraints; this repo contains model artifacts, not the raw/prepared training corpus
best_validation.json ADDED
@@ -0,0 +1,8 @@
+ {
+   "step": 9000,
+   "validation_loss": 4.079709447920322,
+   "validation_perplexity": 59.128287506917495,
+   "validation_num_batches": 128,
+   "elapsed_sec": 33191.382900476456,
+   "checkpoint_path": "/mnt/apps/llm-nanochat/checkpoints/stable-config-recipe-v3-gpt2small-lr2e4-batchmaxpossible-bs7/step_9000.pt"
+ }
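The two headline numbers in `best_validation.json` are related: perplexity is the exponential of the mean per-token loss. The sketch below recomputes it, assuming the loss is a mean negative log-likelihood in nats (natural logarithm). The source does not state this, but the reported values are consistent with it.

```python
import math

# Values copied from best_validation.json.
validation_loss = 4.079709447920322
reported_perplexity = 59.128287506917495


def perplexity(mean_nll_nats):
    """Perplexity for a mean negative log-likelihood expressed in nats."""
    return math.exp(mean_nll_nats)


recomputed = perplexity(validation_loss)
print(f"recomputed={recomputed:.6f} reported={reported_perplexity:.6f}")
```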
eval_metrics.jsonl ADDED
@@ -0,0 +1,10 @@
+ {"step": 1000, "validation_loss": 5.71584378182888, "validation_perplexity": 303.6403014175998, "validation_num_batches": 128, "elapsed_sec": 9672.132137775421}
+ {"step": 2000, "validation_loss": 4.976522132754326, "validation_perplexity": 144.96931984461716, "validation_num_batches": 128, "elapsed_sec": 19343.931071043015}
+ {"step": 3000, "validation_loss": 4.552704691886902, "validation_perplexity": 94.88870626918761, "validation_num_batches": 128, "elapsed_sec": 29024.834416627884}
+ {"step": 4000, "validation_loss": 4.3076410908252, "validation_perplexity": 74.26509753553422, "validation_num_batches": 128, "elapsed_sec": 38696.663430929184}
+ {"step": 5000, "validation_loss": 4.182185200974345, "validation_perplexity": 65.50884691823217, "validation_num_batches": 128, "elapsed_sec": 4152.805989980698}
+ {"step": 6000, "validation_loss": 4.291273836046457, "validation_perplexity": 73.05947504235718, "validation_num_batches": 128, "elapsed_sec": 8294.450510978699}
+ {"step": 7000, "validation_loss": 4.166212774813175, "validation_perplexity": 64.47082364114317, "validation_num_batches": 128, "elapsed_sec": 16590.2826256752}
+ {"step": 8000, "validation_loss": 4.110390676185489, "validation_perplexity": 60.97053265245032, "validation_num_batches": 128, "elapsed_sec": 24895.48659992218}
+ {"step": 9000, "validation_loss": 4.079709447920322, "validation_perplexity": 59.128287506917495, "validation_num_batches": 128, "elapsed_sec": 33191.382900476456}
+ {"step": 10000, "validation_loss": 4.123492615297437, "validation_perplexity": 61.774620914104325, "validation_num_batches": 128, "elapsed_sec": 41492.283163785934}
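The checkpoint selection rule (lowest validation loss among the logged evaluations) can be reproduced from `eval_metrics.jsonl`. The sketch below embeds the `step` and `validation_loss` fields from that file. The helper names `best_record` and `regressions` are illustrative and do not come from the NanoChat code.

```python
import json

# step / validation_loss pairs copied from eval_metrics.jsonl (other fields omitted).
EVAL_METRICS_JSONL = """\
{"step": 1000, "validation_loss": 5.71584378182888}
{"step": 2000, "validation_loss": 4.976522132754326}
{"step": 3000, "validation_loss": 4.552704691886902}
{"step": 4000, "validation_loss": 4.3076410908252}
{"step": 5000, "validation_loss": 4.182185200974345}
{"step": 6000, "validation_loss": 4.291273836046457}
{"step": 7000, "validation_loss": 4.166212774813175}
{"step": 8000, "validation_loss": 4.110390676185489}
{"step": 9000, "validation_loss": 4.079709447920322}
{"step": 10000, "validation_loss": 4.123492615297437}
"""


def load_records(jsonl_text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def best_record(records):
    """Return the record with the lowest validation loss."""
    return min(records, key=lambda r: r["validation_loss"])


def regressions(records):
    """Return the steps whose validation loss is higher than the previous evaluation."""
    return [
        cur["step"]
        for prev, cur in zip(records, records[1:])
        if cur["validation_loss"] > prev["validation_loss"]
    ]


records = load_records(EVAL_METRICS_JSONL)
print(best_record(records)["step"], regressions(records))
```

On these values the minimum falls at step 9000, and the loss rises relative to the previous evaluation at steps 6000 and 10000.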
eval_summary.json ADDED
@@ -0,0 +1,27 @@
+ {
+   "model_name": "gpt2small-en-it-nanochat-lr2e4-batchmaxpossible-bs7-step9000",
+   "selected_checkpoint": "step_9000.pt",
+   "selection_reason": "best_validation.json minimum validation loss for this run",
+   "best_validation": {
+     "step": 9000,
+     "validation_loss": 4.079709447920322,
+     "validation_perplexity": 59.128287506917495,
+     "validation_num_batches": 128,
+     "elapsed_sec": 33191.382900476456,
+     "checkpoint_path": "/mnt/apps/llm-nanochat/checkpoints/stable-config-recipe-v3-gpt2small-lr2e4-batchmaxpossible-bs7/step_9000.pt"
+   },
+   "final_validation_step_10000": {
+     "step": 10000,
+     "validation_loss": 4.123492615297437,
+     "validation_perplexity": 61.774620914104325,
+     "validation_num_batches": 128,
+     "elapsed_sec": 41492.283163785934
+   },
+   "notes": [
+     "This is the best saved checkpoint of the stable-config-recipe-v3-gpt2small-lr2e4-batchmaxpossible-bs7 run.",
+     "The later checkpoint step_10000.pt exists but is worse on validation than step_9000.pt.",
+     "Probe quality remains mixed: simple continuations are strong, factual recall remains weak and repetitive."
+   ],
+   "tokenizer_dir": "/mnt/apps/llm-nanochat/tokenizers/tok_202605011052_fresh_50_50_score100_32k_fromscratch",
+   "dataset_dir": "/mnt/apps/llm-nanochat/datasets/202605011052_fresh_50_50_score100_2500_sourcebalanced"
+ }
metrics.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
probe_generations.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
probe_step9000_summary.json ADDED
@@ -0,0 +1,38 @@
+ [
+   {
+     "language": "en",
+     "prompt": "The capital of Italy is",
+     "expected_next_text": " Rome",
+     "completion": " the capital of Italy. The capital of Italy is the capital of Italy. The capital of Italy is the capital of Italy. The capital of Italy is the capital",
+     "correct_token_rank": 248,
+     "correct_token_probability": 0.0003604888916015625,
+     "entropy": 5.0625
+   },
+   {
+     "language": "en",
+     "prompt": "A small language model should",
+     "expected_next_text": " be",
+     "completion": " be used to be used to be used to be used to be used to be used to be used to be used to be used to be used to be used",
+     "correct_token_rank": 1,
+     "correct_token_probability": 0.45703125,
+     "entropy": 3.703125
+   },
+   {
+     "language": "it",
+     "prompt": "La capitale d'Italia è",
+     "expected_next_text": " Roma",
+     "completion": " stata occupata da un'altra parte, e la sua posizione è stata di fatto. La sua posizione è stata di fatto, e la sua posizione è stata di",
+     "correct_token_rank": 1103,
+     "correct_token_probability": 4.6253204345703125e-05,
+     "entropy": 5.125
+   },
+   {
+     "language": "it",
+     "prompt": "Un piccolo modello linguistico dovrebbe",
+     "expected_next_text": " essere",
+     "completion": " essere un'opera di un'opera di un'opera di un'opera di un'opera di un'opera di un'opera di un'opera",
+     "correct_token_rank": 1,
+     "correct_token_probability": 0.3984375,
+     "entropy": 4.34375
+   }
+ ]
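Each probe record reports `correct_token_rank`, `correct_token_probability`, and `entropy` for the expected next token. The source does not show how NanoChat computes them. The sketch below is one plausible reading, assuming a 1-based rank over the softmax of the next-token logits (consistent with `rank=1` for the strong continuations) and entropy measured in nats. The function name `next_token_stats` and the toy logits are illustrative.

```python
import math


def next_token_stats(logits, target_id):
    """Return (rank, probability, entropy_nats) of target_id under softmax(logits)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank 1 means no other token has a strictly higher logit than the target.
    rank = 1 + sum(1 for x in logits if x > logits[target_id])
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return rank, probs[target_id], entropy


# Toy 4-token vocabulary; the target is the second most likely token.
rank, prob, entropy = next_token_stats([2.0, 1.0, 0.0, -1.0], target_id=1)
print(rank, round(prob, 4), round(entropy, 4))
```

Under this reading, a low rank paired with a repetitive completion (as in the `A small language model should` probe) means only that the first generated token was correct, not that the continuation as a whole is good.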
step_9000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0712992c4bfa86045f9a612d327c4624f83c6ae9058efe24f36a195e7766efb5
+ size 1633717975
step_9000.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e5c93c3bf22b8102a97d074a5ca6394282b67a0f0c82bf745191d7e0f942f2ef
+ size 544531056
step_9000.safetensors.json ADDED
@@ -0,0 +1,287 @@
+ {
+   "checkpoint_config": {
+     "actual_precision": "bf16",
+     "adamw_betas": [
+       0.9,
+       0.95
+     ],
+     "adamw_eps": 1e-08,
+     "attention_kernel_policy": "auto",
+     "batch_size": 6,
+     "benchmark": {
+       "enable_central_tensorboard": true,
+       "enable_local_tensorboard": true,
+       "enabled": false,
+       "output_path": "/mnt/apps/llm-nanochat/artifacts/runs/stable-config-recipe-v3-gpt2small-lr2e4-batchmaxpossible-bs7/throughput_benchmark.json",
+       "warmup_steps": 0
+     },
+     "checkpoint_dir": "/mnt/apps/llm-nanochat/checkpoints/stable-config-recipe-v3-gpt2small-lr2e4-batchmaxpossible-bs7",
+     "clip_grad_norm": 1.0,
+     "compile": {
+       "backend": null,
+       "compile_setup_sec": 0.0,
+       "diagnostic": null,
+       "dynamic": false,
+       "enabled": false,
+       "error_policy": "raise",
+       "fullgraph": false,
+       "mode": null,
+       "requested": false,
+       "status": "disabled"
+     },
+     "dataset": {
+       "storage_mode": "indexed_jsonl"
+     },
+     "decay_steps": 9850,
+     "deterministic_algorithms": false,
+     "device": "cuda",
+     "dim": 768,
+     "final_lr": 1e-05,
+     "fp8_backend": null,
+     "grad_accum_steps": 16,
+     "learning_rate": 0.0002,
+     "logging": {
+       "enable_central_tensorboard": true,
+       "enable_local_tensorboard": true,
+       "metrics_flush_every_steps": 1,
+       "metrics_writer": "persistent_jsonl_handle"
+     },
+     "lr": 0.0002,
+     "lr_schedule": "linear_warmup_cosine",
+     "max_seq_len": 2500,
+     "max_steps": 10000,
+     "n_heads": 12,
+     "n_layers": 12,
+     "optimizer": {
+       "backend": "torch",
+       "betas": [
+         0.9,
+         0.95
+       ],
+       "eps": 1e-08,
+       "implementation": "torch.optim.AdamW",
+       "learning_rate": 0.0002,
+       "state_precision": "full_precision",
+       "type": "adamw",
+       "weight_decay": 0.1
+     },
+     "optimizer_backend": "torch",
+     "optimizer_implementation": "torch.optim.AdamW",
+     "optimizer_state_precision": "full_precision",
+     "optimizer_type": "adamw",
+     "peak_lr": 0.0002,
+     "repro": {
+       "attention_kernel_policy": "auto",
+       "cublas_workspace_config": null,
+       "cudnn_benchmark": true,
+       "cudnn_deterministic": false,
+       "deterministic_algorithms": false,
+       "flash_sdp_enabled": true,
+       "math_sdp_enabled": true,
+       "mem_efficient_sdp_enabled": true,
+       "pythonhashseed": "1337",
+       "seed": 1337
+     },
+     "requested_precision": "bf16",
+     "save_every_steps": 500,
+     "scheduler": {
+       "decay_steps": 9850,
+       "final_lr": 1e-05,
+       "peak_lr": 0.0002,
+       "schedule_type": "linear_warmup_cosine",
+       "stable_steps": 0,
+       "total_steps": 10000,
+       "warmup_steps": 150
+     },
+     "seed": 1337,
+     "stable_steps": 0,
+     "train_cache_ram_bytes": 1073741824,
+     "train_cache_ram_mb": 1024,
+     "vocab_size": 32000,
+     "warmup_steps": 150,
+     "weight_decay": 0.1
+   },
+   "checkpoint_path": "/mnt/apps/llm-nanochat/checkpoints/stable-config-recipe-v3-gpt2small-lr2e4-batchmaxpossible-bs7/step_9000.pt",
+   "exported_at": "2026-05-15T09:49:45.812469+00:00",
+   "format": "llm-nanochat-safetensors-export",
+   "global_step": 9000,
+   "metadata_path": "/mnt/apps/llm-nanochat/hf_exports/gpt2small-en-it-nanochat-lr2e4-batchmaxpossible-bs7-step9000/step_9000.safetensors.json",
+   "model_config": {
+     "dim": 768,
+     "max_seq_len": 2500,
+     "n_heads": 12,
+     "n_layers": 12,
+     "vocab_size": 32000
+   },
+   "num_parameters": 136128000,
+   "num_tensors": 149,
+   "provenance": {
+     "checkpoint_dir": "/mnt/apps/llm-nanochat/checkpoints/stable-config-recipe-v3-gpt2small-lr2e4-batchmaxpossible-bs7",
+     "checkpoint_name": "step_9000.pt",
+     "checkpoint_path": "/mnt/apps/llm-nanochat/checkpoints/stable-config-recipe-v3-gpt2small-lr2e4-batchmaxpossible-bs7/step_9000.pt",
+     "global_step": 9000,
+     "packed_dataset_config_path": null,
+     "run_dir": "/mnt/apps/llm-nanochat/checkpoints",
+     "tokenizer_dir": "/mnt/apps/llm-nanochat/tokenizers/tok_20260515_fresh_50_50_score100_500m_32k_fromscratch",
+     "training_config_path": null
+   },
+   "safetensors_path": "/mnt/apps/llm-nanochat/hf_exports/gpt2small-en-it-nanochat-lr2e4-batchmaxpossible-bs7-step9000/step_9000.safetensors",
+   "source_checkpoint_path": "/mnt/apps/llm-nanochat/checkpoints/stable-config-recipe-v3-gpt2small-lr2e4-batchmaxpossible-bs7/step_9000.pt",
+   "source_global_step": 9000,
+   "tensor_names": [
+     "token_emb.weight",
+     "pos_emb.weight",
+     "blocks.layers.0.self_attn.in_proj_weight",
+     "blocks.layers.0.self_attn.in_proj_bias",
+     "blocks.layers.0.self_attn.out_proj.weight",
+     "blocks.layers.0.self_attn.out_proj.bias",
+     "blocks.layers.0.linear1.weight",
+     "blocks.layers.0.linear1.bias",
+     "blocks.layers.0.linear2.weight",
+     "blocks.layers.0.linear2.bias",
+     "blocks.layers.0.norm1.weight",
+     "blocks.layers.0.norm1.bias",
+     "blocks.layers.0.norm2.weight",
+     "blocks.layers.0.norm2.bias",
+     "blocks.layers.1.self_attn.in_proj_weight",
+     "blocks.layers.1.self_attn.in_proj_bias",
+     "blocks.layers.1.self_attn.out_proj.weight",
+     "blocks.layers.1.self_attn.out_proj.bias",
+     "blocks.layers.1.linear1.weight",
+     "blocks.layers.1.linear1.bias",
+     "blocks.layers.1.linear2.weight",
+     "blocks.layers.1.linear2.bias",
+     "blocks.layers.1.norm1.weight",
+     "blocks.layers.1.norm1.bias",
+     "blocks.layers.1.norm2.weight",
+     "blocks.layers.1.norm2.bias",
+     "blocks.layers.2.self_attn.in_proj_weight",
+     "blocks.layers.2.self_attn.in_proj_bias",
+     "blocks.layers.2.self_attn.out_proj.weight",
+     "blocks.layers.2.self_attn.out_proj.bias",
+     "blocks.layers.2.linear1.weight",
+     "blocks.layers.2.linear1.bias",
+     "blocks.layers.2.linear2.weight",
+     "blocks.layers.2.linear2.bias",
+     "blocks.layers.2.norm1.weight",
+     "blocks.layers.2.norm1.bias",
+     "blocks.layers.2.norm2.weight",
+     "blocks.layers.2.norm2.bias",
+     "blocks.layers.3.self_attn.in_proj_weight",
+     "blocks.layers.3.self_attn.in_proj_bias",
+     "blocks.layers.3.self_attn.out_proj.weight",
+     "blocks.layers.3.self_attn.out_proj.bias",
+     "blocks.layers.3.linear1.weight",
+     "blocks.layers.3.linear1.bias",
+     "blocks.layers.3.linear2.weight",
+     "blocks.layers.3.linear2.bias",
+     "blocks.layers.3.norm1.weight",
+     "blocks.layers.3.norm1.bias",
+     "blocks.layers.3.norm2.weight",
+     "blocks.layers.3.norm2.bias",
+     "blocks.layers.4.self_attn.in_proj_weight",
+     "blocks.layers.4.self_attn.in_proj_bias",
+     "blocks.layers.4.self_attn.out_proj.weight",
+     "blocks.layers.4.self_attn.out_proj.bias",
+     "blocks.layers.4.linear1.weight",
+     "blocks.layers.4.linear1.bias",
+     "blocks.layers.4.linear2.weight",
+     "blocks.layers.4.linear2.bias",
+     "blocks.layers.4.norm1.weight",
+     "blocks.layers.4.norm1.bias",
+     "blocks.layers.4.norm2.weight",
+     "blocks.layers.4.norm2.bias",
+     "blocks.layers.5.self_attn.in_proj_weight",
+     "blocks.layers.5.self_attn.in_proj_bias",
+     "blocks.layers.5.self_attn.out_proj.weight",
+     "blocks.layers.5.self_attn.out_proj.bias",
+     "blocks.layers.5.linear1.weight",
+     "blocks.layers.5.linear1.bias",
+     "blocks.layers.5.linear2.weight",
+     "blocks.layers.5.linear2.bias",
+     "blocks.layers.5.norm1.weight",
+     "blocks.layers.5.norm1.bias",
+     "blocks.layers.5.norm2.weight",
+     "blocks.layers.5.norm2.bias",
+     "blocks.layers.6.self_attn.in_proj_weight",
+     "blocks.layers.6.self_attn.in_proj_bias",
+     "blocks.layers.6.self_attn.out_proj.weight",
+     "blocks.layers.6.self_attn.out_proj.bias",
+     "blocks.layers.6.linear1.weight",
+     "blocks.layers.6.linear1.bias",
+     "blocks.layers.6.linear2.weight",
+     "blocks.layers.6.linear2.bias",
+     "blocks.layers.6.norm1.weight",
+     "blocks.layers.6.norm1.bias",
+     "blocks.layers.6.norm2.weight",
+     "blocks.layers.6.norm2.bias",
+     "blocks.layers.7.self_attn.in_proj_weight",
+     "blocks.layers.7.self_attn.in_proj_bias",
+     "blocks.layers.7.self_attn.out_proj.weight",
+     "blocks.layers.7.self_attn.out_proj.bias",
+     "blocks.layers.7.linear1.weight",
+     "blocks.layers.7.linear1.bias",
+     "blocks.layers.7.linear2.weight",
+     "blocks.layers.7.linear2.bias",
+     "blocks.layers.7.norm1.weight",
+     "blocks.layers.7.norm1.bias",
+     "blocks.layers.7.norm2.weight",
+     "blocks.layers.7.norm2.bias",
+     "blocks.layers.8.self_attn.in_proj_weight",
+     "blocks.layers.8.self_attn.in_proj_bias",
+     "blocks.layers.8.self_attn.out_proj.weight",
+     "blocks.layers.8.self_attn.out_proj.bias",
+     "blocks.layers.8.linear1.weight",
+     "blocks.layers.8.linear1.bias",
+     "blocks.layers.8.linear2.weight",
+     "blocks.layers.8.linear2.bias",
+     "blocks.layers.8.norm1.weight",
+     "blocks.layers.8.norm1.bias",
+     "blocks.layers.8.norm2.weight",
+     "blocks.layers.8.norm2.bias",
+     "blocks.layers.9.self_attn.in_proj_weight",
+     "blocks.layers.9.self_attn.in_proj_bias",
+     "blocks.layers.9.self_attn.out_proj.weight",
+     "blocks.layers.9.self_attn.out_proj.bias",
+     "blocks.layers.9.linear1.weight",
+     "blocks.layers.9.linear1.bias",
+     "blocks.layers.9.linear2.weight",
+     "blocks.layers.9.linear2.bias",
+     "blocks.layers.9.norm1.weight",
+     "blocks.layers.9.norm1.bias",
+     "blocks.layers.9.norm2.weight",
+     "blocks.layers.9.norm2.bias",
+     "blocks.layers.10.self_attn.in_proj_weight",
+     "blocks.layers.10.self_attn.in_proj_bias",
+     "blocks.layers.10.self_attn.out_proj.weight",
+     "blocks.layers.10.self_attn.out_proj.bias",
+     "blocks.layers.10.linear1.weight",
+     "blocks.layers.10.linear1.bias",
+     "blocks.layers.10.linear2.weight",
+     "blocks.layers.10.linear2.bias",
+     "blocks.layers.10.norm1.weight",
+     "blocks.layers.10.norm1.bias",
+     "blocks.layers.10.norm2.weight",
+     "blocks.layers.10.norm2.bias",
+     "blocks.layers.11.self_attn.in_proj_weight",
+     "blocks.layers.11.self_attn.in_proj_bias",
+     "blocks.layers.11.self_attn.out_proj.weight",
+     "blocks.layers.11.self_attn.out_proj.bias",
+     "blocks.layers.11.linear1.weight",
+     "blocks.layers.11.linear1.bias",
+     "blocks.layers.11.linear2.weight",
+     "blocks.layers.11.linear2.bias",
+     "blocks.layers.11.norm1.weight",
+     "blocks.layers.11.norm1.bias",
+     "blocks.layers.11.norm2.weight",
+     "blocks.layers.11.norm2.bias",
+     "ln_f.weight",
+     "ln_f.bias",
+     "head.weight"
+   ],
+   "tokenizer_reference": {
+     "packed_dataset_config_path": null,
+     "tokenizer_dir": "/mnt/apps/llm-nanochat/tokenizers/tok_20260515_fresh_50_50_score100_500m_32k_fromscratch",
+     "training_config_path": null
+   }
+ }
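The reported `num_parameters` and `num_tensors` can be cross-checked against `model_config` and the `tensor_names` list. The sketch below does so under one assumption that the metadata does not record: the feed-forward hidden width (`linear1`/`linear2`) is taken to be 4 × `dim`. The tensor layout (a fused `in_proj` for Q/K/V, two LayerNorms per block, learned positional embeddings, and a separate `head.weight` without bias) is read directly from the names above.

```python
def count_parameters(vocab_size, max_seq_len, dim, n_layers, ffn_mult=4):
    """Parameter count implied by the tensor names; ffn_mult=4 is an assumption."""
    token_emb = vocab_size * dim
    pos_emb = max_seq_len * dim
    # self_attn: fused in_proj (3*dim x dim, plus bias) and out_proj (dim x dim, plus bias)
    attn = (3 * dim * dim + 3 * dim) + (dim * dim + dim)
    ffn_dim = ffn_mult * dim
    # linear1: dim -> ffn_dim, linear2: ffn_dim -> dim, both with bias
    mlp = (dim * ffn_dim + ffn_dim) + (ffn_dim * dim + dim)
    norms = 2 * (2 * dim)  # norm1 and norm2, weight + bias each
    ln_f = 2 * dim
    head = vocab_size * dim  # only head.weight is listed, so no bias term
    return token_emb + pos_emb + n_layers * (attn + mlp + norms) + ln_f + head


def count_tensors(n_layers):
    """2 embeddings + 12 tensors per block + ln_f weight/bias + head.weight."""
    return 2 + 12 * n_layers + 2 + 1


total = count_parameters(vocab_size=32000, max_seq_len=2500, dim=768, n_layers=12)
print(total, count_tensors(12))
```

With `ffn_mult=4` the total comes to 136,128,000 parameters in 149 tensors, matching the metadata. This supports the 4× assumption but does not prove it.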
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_meta.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "vocab_size_requested": 32000,
+   "vocab_size_actual": 32000,
+   "special_tokens": [
+     "<pad>",
+     "<bos>",
+     "<eos>",
+     "<unk>"
+   ]
+ }
training_config.yaml ADDED
@@ -0,0 +1,48 @@
+ # Derived from configs/stable-config-recipe-v2-gpt2small.yaml
+ # Purpose: GPT-2-small stable v2 variant with lr=2e-4 and final_lr at 5% of peak.
+
+ dataset_dir: /mnt/apps/llm-nanochat/datasets/202605011052_fresh_50_50_score100_2500_sourcebalanced
+ output_dir: /mnt/apps/llm-nanochat/artifacts/runs/stable-config-recipe-v3-gpt2small-lr2e4-batchmaxpossible-bs7
+ tokenizer_dir: /mnt/apps/llm-nanochat/tokenizers/tok_202605011052_fresh_50_50_score100_32k_fromscratch
+ seed: 1337
+ model:
+   vocab_size: 32000
+   dim: 768
+   n_layers: 12
+   n_heads: 12
+ training:
+   sequence_length: 2500
+   max_steps: 10000
+   batch_size: 7
+   grad_accum_steps: 16
+   learning_rate: 0.0002
+   peak_lr: 0.0002
+   lr_schedule: linear_warmup_cosine
+   warmup_steps: -1
+   final_lr: 1.0e-05
+   adamw_betas:
+   - 0.9
+   - 0.95
+   adamw_eps: 1.0e-08
+   weight_decay: 0.1
+   clip_grad_norm: 1.0
+   save_every_steps: 500
+   checkpoint_dir: /mnt/apps/llm-nanochat/checkpoints/stable-config-recipe-v3-gpt2small-lr2e4-batchmaxpossible-bs7
+   precision: bf16
+ evaluation:
+   validation_every_steps: 1000
+   validation_max_batches: 128
+   probe_every_steps: 1000
+   probe_tokenizer_dir: /mnt/apps/llm-nanochat/tokenizers/tok_202605011052_fresh_50_50_score100_32k_fromscratch
+   probe_max_new_tokens: 32
+   probe_prompts:
+     en:
+     - prompt: "The capital of Italy is"
+       expected_next_text: " Rome"
+     - prompt: "A small language model should"
+       expected_next_text: " be"
+     it:
+     - prompt: "La capitale d'Italia è"
+       expected_next_text: " Roma"
+     - prompt: "Un piccolo modello linguistico dovrebbe"
+       expected_next_text: " essere"
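The config names a `linear_warmup_cosine` schedule but does not show its formula. The sketch below is a common formulation, not necessarily NanoChat's exact one: a linear ramp to `peak_lr`, then a half-cosine decay to `final_lr`. It takes `warmup_steps=150` and `decay_steps=9850` from the exported checkpoint metadata, since the YAML lists `warmup_steps: -1`, which presumably requests an automatic value. The token-count helper assumes a single device. Note that the YAML lists `batch_size: 7` while the checkpoint metadata records 6.

```python
import math


def lr_at(step, peak_lr=2e-4, final_lr=1e-5, warmup_steps=150, decay_steps=9850):
    """Learning rate at a 0-based step: linear warmup, then cosine decay (assumed form)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine


def tokens_per_optimizer_step(batch_size, grad_accum_steps, sequence_length):
    """Tokens consumed per optimizer update on one device (no data parallelism assumed)."""
    return batch_size * grad_accum_steps * sequence_length


for step in (0, 150, 5000, 9000, 10000):
    print(step, f"{lr_at(step):.3e}")
print(tokens_per_optimizer_step(7, 16, 2500))
```

Under this formulation the rate at step 9000 is already close to `final_lr`, so the last 1,000 steps make only small updates.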