HectorHe committed
Commit d8df595 · verified · 1 Parent(s): 8e90609

Model save

README.md ADDED
@@ -0,0 +1,58 @@
+ ---
+ base_model: Qwen/Qwen1.5-MoE-A2.7B
+ library_name: transformers
+ model_name: Qwen1.5-MOE-aux-free-sft-math7k-1e-3-gamma
+ tags:
+ - generated_from_trainer
+ - trl
+ - sft
+ licence: license
+ ---
+
+ # Model Card for Qwen1.5-MOE-aux-free-sft-math7k-1e-3-gamma
+
+ This model is a fine-tuned version of [Qwen/Qwen1.5-MoE-A2.7B](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B).
+ It has been trained using [TRL](https://github.com/huggingface/trl).
+
+ ## Quick start
+
+ ```python
+ from transformers import pipeline
+
+ question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
+ generator = pipeline("text-generation", model="HectorHe/Qwen1.5-MOE-aux-free-sft-math7k-1e-3-gamma", device="cuda")
+ output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
+ print(output["generated_text"])
+ ```
+
+ ## Training procedure
+
+ [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/hector_-carnegie-mellon-university/huggingface/runs/26r47xsq)
+
+
+ This model was trained with SFT.
+
+ ### Framework versions
+
+ - TRL: 0.16.0.dev0
+ - Transformers: 4.51.0
+ - Pytorch: 2.6.0
+ - Datasets: 4.0.0
+ - Tokenizers: 0.21.4
+
+ ## Citations
+
+
+
+ Cite TRL as:
+
+ ```bibtex
+ @misc{vonwerra2022trl,
+ title = {{TRL: Transformer Reinforcement Learning}},
+ author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
+ year = 2020,
+ journal = {GitHub repository},
+ publisher = {GitHub},
+ howpublished = {\url{https://github.com/huggingface/trl}}
+ }
+ ```
all_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+     "total_flos": 2.0144468407196058e+17,
+     "train_loss": 0.325891209826913,
+     "train_runtime": 788.7208,
+     "train_samples": 6851,
+     "train_samples_per_second": 8.686,
+     "train_steps_per_second": 0.273
+ }
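As a sanity check on these numbers, the logged rates can be multiplied back by the runtime. The sketch below is not part of the commit; the helper name `derived_stats` is hypothetical, and because the rates are rounded to three decimals the results are only approximate.

```python
# Values copied from all_results.json in this commit.
results = {
    "train_runtime": 788.7208,            # seconds
    "train_samples": 6851,
    "train_samples_per_second": 8.686,
    "train_steps_per_second": 0.273,
}


def derived_stats(r):
    """Back out approximate step and sample counts from the logged rates."""
    approx_steps = r["train_runtime"] * r["train_steps_per_second"]
    approx_samples_seen = r["train_runtime"] * r["train_samples_per_second"]
    return {
        "approx_steps": approx_steps,
        "approx_samples_seen": approx_samples_seen,
        # Fraction of the dataset processed; close to 1.0 means about one epoch.
        "approx_epochs": approx_samples_seen / r["train_samples"],
        # Effective number of samples consumed per training step.
        "approx_samples_per_step": (
            r["train_samples_per_second"] / r["train_steps_per_second"]
        ),
    }


stats = derived_stats(results)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the run works out to roughly 215 training steps, about 32 samples per step, and close to one pass over the 6,851 training samples.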
generation_config.json ADDED
@@ -0,0 +1,11 @@
+ {
+   "attn_implementation": "flash_attention_2",
+   "bos_token_id": 151643,
+   "eos_token_id": [
+     151645,
+     151643
+   ],
+   "pad_token_id": 151643,
+   "transformers_version": "4.51.0",
+   "use_cache": false
+ }
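The config lists two end-of-sequence ids, and the padding id is the same as one of them. A minimal sketch of how a consumer might normalise that field, using only the standard library (the helper `stop_token_ids` is hypothetical and is not a `transformers` API):

```python
import json

# Contents of generation_config.json as added in this commit.
GENERATION_CONFIG = json.loads("""
{
  "attn_implementation": "flash_attention_2",
  "bos_token_id": 151643,
  "eos_token_id": [151645, 151643],
  "pad_token_id": 151643,
  "transformers_version": "4.51.0",
  "use_cache": false
}
""")


def stop_token_ids(config):
    """Return the EOS ids as a set, whether the field is a single int or a list."""
    eos = config["eos_token_id"]
    return set(eos) if isinstance(eos, list) else {eos}


stops = stop_token_ids(GENERATION_CONFIG)
# True here: the padding id doubles as one of the stop ids.
pad_is_stop = GENERATION_CONFIG["pad_token_id"] in stops
print(sorted(stops), pad_is_stop)
```

Note that `use_cache` is saved as `false`, presumably carried over from training. Callers who want key/value caching at inference time would likely need to pass `use_cache=True` explicitly, though the commit itself does not say so.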
moe_bias_states.json ADDED
@@ -0,0 +1,1667 @@
+ {
+   "metadata": {
+     "total_moe_layers": 24,
+     "save_timestamp": "2025-09-15T02:07:52.402156",
+     "model_type": "Qwen2MoeForCausalLM",
+     "pytorch_version": "2.6.0+cu124",
+     "description": "Auxiliary-loss-free MoE bias states saved during training"
+   },
+   "moe_bias_states": {
+     "model.layers.0.mlp": {
+       "bias_values": [
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.474609375,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.443359375,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5
+       ],
+       "bias_update_speed": 0.001,
+       "num_experts": 60,
+       "module_type": "Qwen2MoeSparseMoeBlock",
+       "device": "cuda:0",
+       "dtype": "torch.bfloat16"
+     },
+     "model.layers.1.mlp": {
+       "bias_values": [
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.017822265625,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.490234375,
+         0.5,
+         -0.5,
+         -0.5
+       ],
+       "bias_update_speed": 0.001,
+       "num_experts": 60,
+       "module_type": "Qwen2MoeSparseMoeBlock",
+       "device": "cuda:0",
+       "dtype": "torch.bfloat16"
+     },
+     "model.layers.2.mlp": {
+       "bias_values": [
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.458984375,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5
+       ],
+       "bias_update_speed": 0.001,
+       "num_experts": 60,
+       "module_type": "Qwen2MoeSparseMoeBlock",
+       "device": "cuda:0",
+       "dtype": "torch.bfloat16"
+     },
+     "model.layers.3.mlp": {
+       "bias_values": [
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         0.5,
+         0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.310546875,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.419921875,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5
+       ],
+       "bias_update_speed": 0.001,
+       "num_experts": 60,
+       "module_type": "Qwen2MoeSparseMoeBlock",
+       "device": "cuda:0",
+       "dtype": "torch.bfloat16"
+     },
+     "model.layers.4.mlp": {
+       "bias_values": [
+         0.5,
+         -0.5,
+         0.5,
+         0.490234375,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         0.5,
+         -0.5,
+         0.5,
+         0.5,
+         0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5
+       ],
+       "bias_update_speed": 0.001,
+       "num_experts": 60,
+       "module_type": "Qwen2MoeSparseMoeBlock",
+       "device": "cuda:0",
+       "dtype": "torch.bfloat16"
+     },
+     "model.layers.5.mlp": {
+       "bias_values": [
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         0.5,
+         0.5,
+         -0.5,
+         0.5,
+         0.5
+       ],
+       "bias_update_speed": 0.001,
+       "num_experts": 60,
+       "module_type": "Qwen2MoeSparseMoeBlock",
+       "device": "cuda:0",
+       "dtype": "torch.bfloat16"
+     },
+     "model.layers.6.mlp": {
+       "bias_values": [
+         0.5,
+         0.5,
+         -0.5,
+         0.5,
+         0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.2451171875,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5
+       ],
+       "bias_update_speed": 0.001,
+       "num_experts": 60,
+       "module_type": "Qwen2MoeSparseMoeBlock",
+       "device": "cuda:0",
+       "dtype": "torch.bfloat16"
+     },
+     "model.layers.7.mlp": {
+       "bias_values": [
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         0.2333984375,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5
+       ],
+       "bias_update_speed": 0.001,
+       "num_experts": 60,
+       "module_type": "Qwen2MoeSparseMoeBlock",
+       "device": "cuda:0",
+       "dtype": "torch.bfloat16"
+     },
+     "model.layers.8.mlp": {
+       "bias_values": [
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.474609375,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5
+       ],
+       "bias_update_speed": 0.001,
+       "num_experts": 60,
+       "module_type": "Qwen2MoeSparseMoeBlock",
+       "device": "cuda:0",
+       "dtype": "torch.bfloat16"
+     },
+     "model.layers.9.mlp": {
+       "bias_values": [
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.474609375,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.318359375,
+         0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.419921875,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5
+       ],
+       "bias_update_speed": 0.001,
+       "num_experts": 60,
+       "module_type": "Qwen2MoeSparseMoeBlock",
+       "device": "cuda:0",
+       "dtype": "torch.bfloat16"
+     },
+     "model.layers.10.mlp": {
+       "bias_values": [
+         -0.5,
+         0.1435546875,
+         -0.5,
+         -0.5,
+         -0.458984375,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.458984375,
+         -0.490234375,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.482421875,
+         0.5,
+         0.5
+       ],
+       "bias_update_speed": 0.001,
+       "num_experts": 60,
+       "module_type": "Qwen2MoeSparseMoeBlock",
+       "device": "cuda:0",
+       "dtype": "torch.bfloat16"
+     },
+     "model.layers.11.mlp": {
+       "bias_values": [
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.490234375,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.474609375,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.326171875,
+         0.5,
+         0.5,
+         0.5,
+         0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5
+       ],
+       "bias_update_speed": 0.001,
+       "num_experts": 60,
+       "module_type": "Qwen2MoeSparseMoeBlock",
+       "device": "cuda:0",
+       "dtype": "torch.bfloat16"
+     },
+     "model.layers.12.mlp": {
+       "bias_values": [
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.458984375,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.490234375,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.404296875,
+         0.5,
+         0.5,
+         0.5,
+         -0.427734375,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5
+       ],
+       "bias_update_speed": 0.001,
+       "num_experts": 60,
+       "module_type": "Qwen2MoeSparseMoeBlock",
+       "device": "cuda:0",
+       "dtype": "torch.bfloat16"
+     },
+     "model.layers.13.mlp": {
+       "bias_values": [
+         -0.5,
+         0.2119140625,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.412109375,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.451171875,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         0.5,
+         0.5,
+         0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.388671875,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5
+       ],
+       "bias_update_speed": 0.001,
+       "num_experts": 60,
+       "module_type": "Qwen2MoeSparseMoeBlock",
+       "device": "cuda:0",
+       "dtype": "torch.bfloat16"
+     },
+     "model.layers.14.mlp": {
+       "bias_values": [
+         0.5,
+         0.5,
+         0.5,
+         0.435546875,
+         -0.5,
+         -0.5,
+         -0.2412109375,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.2138671875,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.404296875,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5
+       ],
+       "bias_update_speed": 0.001,
+       "num_experts": 60,
+       "module_type": "Qwen2MoeSparseMoeBlock",
+       "device": "cuda:0",
+       "dtype": "torch.bfloat16"
+     },
+     "model.layers.15.mlp": {
+       "bias_values": [
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.349609375,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5
+       ],
+       "bias_update_speed": 0.001,
+       "num_experts": 60,
+       "module_type": "Qwen2MoeSparseMoeBlock",
+       "device": "cuda:0",
+       "dtype": "torch.bfloat16"
+     },
+     "model.layers.16.mlp": {
+       "bias_values": [
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.01385498046875,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.474609375,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.2431640625,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         0.5,
+         -0.5,
+         0.5,
+         0.5
+       ],
+       "bias_update_speed": 0.001,
+       "num_experts": 60,
+       "module_type": "Qwen2MoeSparseMoeBlock",
+       "device": "cuda:0",
+       "dtype": "torch.bfloat16"
+     },
+     "model.layers.17.mlp": {
+       "bias_values": [
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.373046875,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.1943359375,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.451171875,
+         -0.5,
+         -0.443359375,
+         -0.5,
+         0.5,
+         0.396484375,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         0.490234375,
+         0.5,
+         0.5
+       ],
+       "bias_update_speed": 0.001,
+       "num_experts": 60,
+       "module_type": "Qwen2MoeSparseMoeBlock",
+       "device": "cuda:0",
+       "dtype": "torch.bfloat16"
+     },
+     "model.layers.18.mlp": {
+       "bias_values": [
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.419921875,
+         0.5,
+         -0.5,
+         0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.474609375,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.451171875,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.443359375,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5
+       ],
+       "bias_update_speed": 0.001,
+       "num_experts": 60,
+       "module_type": "Qwen2MoeSparseMoeBlock",
+       "device": "cuda:0",
+       "dtype": "torch.bfloat16"
+     },
+     "model.layers.19.mlp": {
+       "bias_values": [
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.388671875,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.474609375,
+         -0.5,
+         0.5,
+         -0.5,
+         0.271484375,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5
+       ],
+       "bias_update_speed": 0.001,
+       "num_experts": 60,
+       "module_type": "Qwen2MoeSparseMoeBlock",
+       "device": "cuda:0",
+       "dtype": "torch.bfloat16"
+     },
+     "model.layers.20.mlp": {
+       "bias_values": [
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.466796875,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.490234375,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.482421875,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.396484375,
+         -0.5,
+         -0.490234375,
+         0.443359375,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5
+       ],
+       "bias_update_speed": 0.001,
+       "num_experts": 60,
+       "module_type": "Qwen2MoeSparseMoeBlock",
+       "device": "cuda:0",
+       "dtype": "torch.bfloat16"
+     },
+     "model.layers.21.mlp": {
+       "bias_values": [
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.2060546875,
+         -0.5,
+         -0.380859375,
+         -0.5,
+         -0.5,
+         0.458984375,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.396484375,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.326171875,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5
+       ],
+       "bias_update_speed": 0.001,
+       "num_experts": 60,
+       "module_type": "Qwen2MoeSparseMoeBlock",
+       "device": "cuda:0",
+       "dtype": "torch.bfloat16"
+     },
+     "model.layers.22.mlp": {
+       "bias_values": [
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.341796875,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.384765625,
+         -0.5,
+         -0.5,
+         -0.365234375,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5
+       ],
+       "bias_update_speed": 0.001,
+       "num_experts": 60,
+       "module_type": "Qwen2MoeSparseMoeBlock",
+       "device": "cuda:0",
+       "dtype": "torch.bfloat16"
+     },
+     "model.layers.23.mlp": {
+       "bias_values": [
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         0.404296875,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.380859375,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         0.1572265625,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5,
+         -0.5,
+         0.5,
+         -0.5,
+         -0.5
+       ],
+       "bias_update_speed": 0.001,
+       "num_experts": 60,
+       "module_type": "Qwen2MoeSparseMoeBlock",
+       "device": "cuda:0",
+       "dtype": "torch.bfloat16"
+     }
+   }
+ }
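Most entries in this file sit at exactly ±0.5, which suggests (but the commit does not confirm) that the training code clamps each expert's routing bias to that range. The sketch below shows one way to summarise the file per layer, plus a guess at the kind of sign-based update an auxiliary-loss-free balancer might apply. Both function names, the `limit` clamp, and the file path are assumptions made for illustration; none of this code comes from the repository.

```python
import json
from pathlib import Path


def summarize_bias_states(doc, limit=0.5):
    """Count, per layer, how many expert biases sit at +limit, -limit, or in between."""
    summary = {}
    for layer_name, state in doc["moe_bias_states"].items():
        values = state["bias_values"]
        at_upper = sum(1 for v in values if v >= limit)
        at_lower = sum(1 for v in values if v <= -limit)
        summary[layer_name] = {
            "at_upper": at_upper,
            "at_lower": at_lower,
            "interior": len(values) - at_upper - at_lower,
        }
    return summary


def update_bias(bias, tokens_per_expert, speed=0.001, limit=0.5):
    """One hypothetical sign-based step: raise the bias of under-loaded experts,
    lower it for over-loaded ones, then clamp to [-limit, limit] (assumed)."""
    mean_load = sum(tokens_per_expert) / len(tokens_per_expert)
    new_bias = []
    for b, load in zip(bias, tokens_per_expert):
        error = mean_load - load
        sign = (error > 0) - (error < 0)  # -1, 0 or +1
        new_bias.append(max(-limit, min(limit, b + speed * sign)))
    return new_bias


# Toy document with the same shape as the file; the values are the first
# four entries of layer 0, not the full 60-expert array.
toy = {"moe_bias_states": {"model.layers.0.mlp": {"bias_values": [0.5, -0.5, -0.5, -0.5]}}}
toy_summary = summarize_bias_states(toy)
print(toy_summary)

# To run on the real file, assuming it was downloaded next to this script:
path = Path("moe_bias_states.json")
if path.exists():
    for name, counts in summarize_bias_states(json.loads(path.read_text())).items():
        print(name, counts)
```

With `speed` equal to the logged `bias_update_speed` of 0.001, a bias would need on the order of 500 same-direction steps to travel from 0 to the ±0.5 limit, so the heavy saturation seen here would be surprising in a run of a few hundred steps unless the biases started away from zero or the actual update rule differs from this sketch.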
train_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+     "total_flos": 2.0144468407196058e+17,
+     "train_loss": 0.325891209826913,
+     "train_runtime": 788.7208,
+     "train_samples": 6851,
+     "train_samples_per_second": 8.686,
+     "train_steps_per_second": 0.273
+ }
trainer_state.json ADDED
@@ -0,0 +1,1763 @@
1
+ {
2
+ "best_global_step": null,
3
+ "best_metric": null,
4
+ "best_model_checkpoint": null,
5
+ "epoch": 1.0,
6
+ "eval_steps": 500,
7
+ "global_step": 215,
8
+ "is_hyper_param_search": false,
9
+ "is_local_process_zero": true,
10
+ "is_world_process_zero": true,
11
+ "log_history": [
12
+ {
13
+ "epoch": 0.004651162790697674,
14
+ "grad_norm": 19.566913604736328,
15
+ "learning_rate": 0.0,
16
+ "loss": 0.842,
17
+ "mean_token_accuracy": 0.8234314918518066,
18
+ "step": 1
19
+ },
20
+ {
21
+ "epoch": 0.009302325581395349,
22
+ "grad_norm": 21.503646850585938,
23
+ "learning_rate": 4.5454545454545457e-07,
24
+ "loss": 0.8773,
25
+ "mean_token_accuracy": 0.8220880627632141,
26
+ "step": 2
27
+ },
28
+ {
29
+ "epoch": 0.013953488372093023,
30
+ "grad_norm": 19.072912216186523,
31
+ "learning_rate": 9.090909090909091e-07,
32
+ "loss": 0.8522,
33
+ "mean_token_accuracy": 0.832254946231842,
34
+ "step": 3
35
+ },
36
+ {
37
+ "epoch": 0.018604651162790697,
38
+ "grad_norm": 18.07330322265625,
39
+ "learning_rate": 1.3636363636363636e-06,
40
+ "loss": 0.8166,
41
+ "mean_token_accuracy": 0.8297423720359802,
42
+ "step": 4
43
+ },
44
+ {
45
+ "epoch": 0.023255813953488372,
46
+ "grad_norm": 21.246156692504883,
47
+ "learning_rate": 1.8181818181818183e-06,
48
+ "loss": 0.9312,
49
+ "mean_token_accuracy": 0.8097259998321533,
50
+ "step": 5
51
+ },
52
+ {
53
+ "epoch": 0.027906976744186046,
54
+ "grad_norm": 18.991504669189453,
55
+ "learning_rate": 2.2727272727272728e-06,
56
+ "loss": 0.7816,
57
+ "mean_token_accuracy": 0.8285066485404968,
58
+ "step": 6
59
+ },
60
+ {
61
+ "epoch": 0.03255813953488372,
62
+ "grad_norm": 16.567394256591797,
63
+ "learning_rate": 2.7272727272727272e-06,
64
+ "loss": 0.8123,
65
+ "mean_token_accuracy": 0.8268486261367798,
66
+ "step": 7
67
+ },
68
+ {
69
+ "epoch": 0.037209302325581395,
70
+ "grad_norm": 10.308907508850098,
71
+ "learning_rate": 3.181818181818182e-06,
72
+ "loss": 0.6385,
73
+ "mean_token_accuracy": 0.8507297039031982,
74
+ "step": 8
75
+ },
76
+ {
77
+ "epoch": 0.04186046511627907,
78
+ "grad_norm": 8.591865539550781,
79
+ "learning_rate": 3.6363636363636366e-06,
80
+ "loss": 0.5651,
81
+ "mean_token_accuracy": 0.863410234451294,
82
+ "step": 9
83
+ },
84
+ {
85
+ "epoch": 0.046511627906976744,
86
+ "grad_norm": 9.019761085510254,
87
+ "learning_rate": 4.0909090909090915e-06,
88
+ "loss": 0.4933,
89
+ "mean_token_accuracy": 0.8713976144790649,
90
+ "step": 10
91
+ },
92
+ {
93
+ "epoch": 0.05116279069767442,
94
+ "grad_norm": 9.10396671295166,
95
+ "learning_rate": 4.5454545454545455e-06,
96
+ "loss": 0.4211,
97
+ "mean_token_accuracy": 0.8915254473686218,
98
+ "step": 11
99
+ },
100
+ {
101
+ "epoch": 0.05581395348837209,
102
+ "grad_norm": 7.293844223022461,
103
+ "learning_rate": 5e-06,
104
+ "loss": 0.4411,
105
+ "mean_token_accuracy": 0.8812285661697388,
106
+ "step": 12
107
+ },
108
+ {
109
+ "epoch": 0.06046511627906977,
110
+ "grad_norm": 4.323971271514893,
111
+ "learning_rate": 5.4545454545454545e-06,
112
+ "loss": 0.4205,
113
+ "mean_token_accuracy": 0.8759526610374451,
114
+ "step": 13
115
+ },
116
+ {
117
+ "epoch": 0.06511627906976744,
118
+ "grad_norm": 5.9293413162231445,
119
+ "learning_rate": 5.90909090909091e-06,
120
+ "loss": 0.364,
121
+ "mean_token_accuracy": 0.8886061906814575,
122
+ "step": 14
123
+ },
124
+ {
125
+ "epoch": 0.06976744186046512,
126
+ "grad_norm": 3.369750738143921,
127
+ "learning_rate": 6.363636363636364e-06,
128
+ "loss": 0.3435,
129
+ "mean_token_accuracy": 0.8966382145881653,
130
+ "step": 15
131
+ },
132
+ {
133
+ "epoch": 0.07441860465116279,
134
+ "grad_norm": 3.6587915420532227,
135
+ "learning_rate": 6.818181818181818e-06,
136
+ "loss": 0.3801,
137
+ "mean_token_accuracy": 0.8860718607902527,
138
+ "step": 16
139
+ },
140
+ {
141
+ "epoch": 0.07906976744186046,
142
+ "grad_norm": 3.0630006790161133,
143
+ "learning_rate": 7.272727272727273e-06,
144
+ "loss": 0.333,
145
+ "mean_token_accuracy": 0.9021540880203247,
146
+ "step": 17
147
+ },
148
+ {
149
+ "epoch": 0.08372093023255814,
150
+ "grad_norm": 3.0884928703308105,
151
+ "learning_rate": 7.727272727272727e-06,
152
+ "loss": 0.3774,
153
+ "mean_token_accuracy": 0.8831896185874939,
154
+ "step": 18
155
+ },
156
+ {
157
+ "epoch": 0.08837209302325581,
158
+ "grad_norm": 3.343445301055908,
159
+ "learning_rate": 8.181818181818183e-06,
160
+ "loss": 0.3834,
161
+ "mean_token_accuracy": 0.884019136428833,
162
+ "step": 19
163
+ },
164
+ {
165
+ "epoch": 0.09302325581395349,
166
+ "grad_norm": 2.862471580505371,
167
+ "learning_rate": 8.636363636363637e-06,
168
+ "loss": 0.3523,
169
+ "mean_token_accuracy": 0.8919129371643066,
170
+ "step": 20
171
+ },
172
+ {
173
+ "epoch": 0.09767441860465116,
174
+ "grad_norm": 3.1519503593444824,
175
+ "learning_rate": 9.090909090909091e-06,
176
+ "loss": 0.3994,
177
+ "mean_token_accuracy": 0.8742711544036865,
178
+ "step": 21
179
+ },
180
+ {
181
+ "epoch": 0.10232558139534884,
182
+ "grad_norm": 3.08125376701355,
183
+ "learning_rate": 9.545454545454547e-06,
184
+ "loss": 0.3518,
185
+ "mean_token_accuracy": 0.8841822147369385,
186
+ "step": 22
187
+ },
188
+ {
189
+ "epoch": 0.10697674418604651,
190
+ "grad_norm": 3.0018293857574463,
191
+ "learning_rate": 1e-05,
192
+ "loss": 0.345,
193
+ "mean_token_accuracy": 0.8950249552726746,
194
+ "step": 23
195
+ },
196
+ {
197
+ "epoch": 0.11162790697674418,
198
+ "grad_norm": 3.267643451690674,
199
+ "learning_rate": 9.999403846557509e-06,
200
+ "loss": 0.3704,
201
+ "mean_token_accuracy": 0.8792986273765564,
202
+ "step": 24
203
+ },
204
+ {
205
+ "epoch": 0.11627906976744186,
206
+ "grad_norm": 3.051391839981079,
207
+ "learning_rate": 9.99761554418511e-06,
208
+ "loss": 0.3331,
209
+ "mean_token_accuracy": 0.896356999874115,
210
+ "step": 25
211
+ },
212
+ {
213
+ "epoch": 0.12093023255813953,
214
+ "grad_norm": 3.030799627304077,
215
+ "learning_rate": 9.99463556670619e-06,
216
+ "loss": 0.3339,
217
+ "mean_token_accuracy": 0.9024479985237122,
218
+ "step": 26
219
+ },
220
+ {
221
+ "epoch": 0.12558139534883722,
222
+ "grad_norm": 3.1301109790802,
223
+ "learning_rate": 9.990464703686895e-06,
224
+ "loss": 0.3652,
225
+ "mean_token_accuracy": 0.8962038159370422,
226
+ "step": 27
227
+ },
228
+ {
229
+ "epoch": 0.13023255813953488,
230
+ "grad_norm": 3.1041147708892822,
231
+ "learning_rate": 9.985104060226937e-06,
232
+ "loss": 0.3538,
233
+ "mean_token_accuracy": 0.8938086032867432,
234
+ "step": 28
235
+ },
236
+ {
237
+ "epoch": 0.13488372093023257,
238
+ "grad_norm": 3.011711835861206,
239
+ "learning_rate": 9.978555056666784e-06,
240
+ "loss": 0.3763,
241
+ "mean_token_accuracy": 0.878600537776947,
242
+ "step": 29
243
+ },
244
+ {
245
+ "epoch": 0.13953488372093023,
246
+ "grad_norm": 3.104379653930664,
247
+ "learning_rate": 9.97081942821133e-06,
248
+ "loss": 0.3706,
249
+ "mean_token_accuracy": 0.8877567648887634,
250
+ "step": 30
251
+ },
252
+ {
253
+ "epoch": 0.14418604651162792,
254
+ "grad_norm": 3.3601248264312744,
255
+ "learning_rate": 9.961899224470146e-06,
256
+ "loss": 0.3956,
257
+ "mean_token_accuracy": 0.8821022510528564,
258
+ "step": 31
259
+ },
260
+ {
261
+ "epoch": 0.14883720930232558,
262
+ "grad_norm": 3.0410044193267822,
263
+ "learning_rate": 9.95179680891442e-06,
264
+ "loss": 0.318,
265
+ "mean_token_accuracy": 0.9060280323028564,
266
+ "step": 32
267
+ },
268
+ {
269
+ "epoch": 0.15348837209302327,
270
+ "grad_norm": 3.7731399536132812,
271
+ "learning_rate": 9.940514858250736e-06,
272
+ "loss": 0.3565,
273
+ "mean_token_accuracy": 0.8926541209220886,
274
+ "step": 33
275
+ },
276
+ {
277
+ "epoch": 0.15813953488372093,
278
+ "grad_norm": 2.947474241256714,
279
+ "learning_rate": 9.928056361711854e-06,
280
+ "loss": 0.3035,
281
+ "mean_token_accuracy": 0.9058524370193481,
282
+ "step": 34
283
+ },
284
+ {
285
+ "epoch": 0.16279069767441862,
286
+ "grad_norm": 3.0079967975616455,
287
+ "learning_rate": 9.914424620264714e-06,
288
+ "loss": 0.3581,
289
+ "mean_token_accuracy": 0.8900263905525208,
290
+ "step": 35
291
+ },
292
+ {
293
+ "epoch": 0.16744186046511628,
294
+ "grad_norm": 3.0129306316375732,
295
+ "learning_rate": 9.899623245735798e-06,
296
+ "loss": 0.3283,
297
+ "mean_token_accuracy": 0.8980950713157654,
298
+ "step": 36
299
+ },
300
+ {
301
+ "epoch": 0.17209302325581396,
302
+ "grad_norm": 3.025678873062134,
303
+ "learning_rate": 9.883656159854166e-06,
304
+ "loss": 0.3179,
305
+ "mean_token_accuracy": 0.9024977087974548,
306
+ "step": 37
307
+ },
308
+ {
309
+ "epoch": 0.17674418604651163,
310
+ "grad_norm": 3.019639253616333,
311
+ "learning_rate": 9.866527593212355e-06,
312
+ "loss": 0.3314,
313
+ "mean_token_accuracy": 0.899016261100769,
314
+ "step": 38
315
+ },
316
+ {
317
+ "epoch": 0.1813953488372093,
318
+ "grad_norm": 2.9794199466705322,
319
+ "learning_rate": 9.848242084145462e-06,
320
+ "loss": 0.3199,
321
+ "mean_token_accuracy": 0.9022731184959412,
322
+ "step": 39
323
+ },
324
+ {
325
+ "epoch": 0.18604651162790697,
326
+ "grad_norm": 2.8634836673736572,
327
+ "learning_rate": 9.82880447752868e-06,
328
+ "loss": 0.3377,
329
+ "mean_token_accuracy": 0.8929729461669922,
330
+ "step": 40
331
+ },
332
+ {
333
+ "epoch": 0.19069767441860466,
334
+ "grad_norm": 3.023061990737915,
335
+ "learning_rate": 9.808219923493606e-06,
336
+ "loss": 0.3331,
337
+ "mean_token_accuracy": 0.8968125581741333,
338
+ "step": 41
339
+ },
340
+ {
341
+ "epoch": 0.19534883720930232,
342
+ "grad_norm": 3.201726198196411,
343
+ "learning_rate": 9.786493876063685e-06,
344
+ "loss": 0.3291,
345
+ "mean_token_accuracy": 0.8950858116149902,
346
+ "step": 42
347
+ },
348
+ {
349
+ "epoch": 0.2,
350
+ "grad_norm": 3.0741653442382812,
351
+ "learning_rate": 9.763632091709125e-06,
352
+ "loss": 0.31,
353
+ "mean_token_accuracy": 0.8963869214057922,
354
+ "step": 43
355
+ },
356
+ {
357
+ "epoch": 0.20465116279069767,
358
+ "grad_norm": 2.884732723236084,
359
+ "learning_rate": 9.739640627821678e-06,
360
+ "loss": 0.3381,
361
+ "mean_token_accuracy": 0.8979052901268005,
362
+ "step": 44
363
+ },
364
+ {
365
+ "epoch": 0.20930232558139536,
366
+ "grad_norm": 3.245725631713867,
367
+ "learning_rate": 9.714525841109697e-06,
368
+ "loss": 0.2951,
369
+ "mean_token_accuracy": 0.9035796523094177,
370
+ "step": 45
371
+ },
372
+ {
373
+ "epoch": 0.21395348837209302,
374
+ "grad_norm": 3.057401180267334,
375
+ "learning_rate": 9.68829438591387e-06,
376
+ "loss": 0.3034,
377
+ "mean_token_accuracy": 0.9002149701118469,
378
+ "step": 46
379
+ },
380
+ {
381
+ "epoch": 0.2186046511627907,
382
+ "grad_norm": 3.0023374557495117,
383
+ "learning_rate": 9.660953212444116e-06,
384
+ "loss": 0.3138,
385
+ "mean_token_accuracy": 0.8987898826599121,
386
+ "step": 47
387
+ },
388
+ {
389
+ "epoch": 0.22325581395348837,
390
+ "grad_norm": 2.846851110458374,
391
+ "learning_rate": 9.632509564938073e-06,
392
+ "loss": 0.3213,
393
+ "mean_token_accuracy": 0.8954131603240967,
394
+ "step": 48
395
+ },
396
+ {
397
+ "epoch": 0.22790697674418606,
398
+ "grad_norm": 2.9974007606506348,
399
+ "learning_rate": 9.60297097974169e-06,
400
+ "loss": 0.3149,
401
+ "mean_token_accuracy": 0.8978151082992554,
402
+ "step": 49
403
+ },
404
+ {
405
+ "epoch": 0.23255813953488372,
406
+ "grad_norm": 2.860105037689209,
407
+ "learning_rate": 9.572345283312407e-06,
408
+ "loss": 0.3152,
409
+ "mean_token_accuracy": 0.8984004259109497,
410
+ "step": 50
411
+ },
412
+ {
413
+ "epoch": 0.2372093023255814,
414
+ "grad_norm": 3.178861141204834,
415
+ "learning_rate": 9.540640590145496e-06,
416
+ "loss": 0.3362,
417
+ "mean_token_accuracy": 0.8821947574615479,
418
+ "step": 51
419
+ },
420
+ {
421
+ "epoch": 0.24186046511627907,
422
+ "grad_norm": 2.8864493370056152,
423
+ "learning_rate": 9.507865300624057e-06,
424
+ "loss": 0.338,
425
+ "mean_token_accuracy": 0.8844265937805176,
426
+ "step": 52
427
+ },
428
+ {
429
+ "epoch": 0.24651162790697675,
430
+ "grad_norm": 2.9779324531555176,
431
+ "learning_rate": 9.474028098793277e-06,
432
+ "loss": 0.3133,
433
+ "mean_token_accuracy": 0.8958209753036499,
434
+ "step": 53
435
+ },
436
+ {
437
+ "epoch": 0.25116279069767444,
438
+ "grad_norm": 2.81357479095459,
439
+ "learning_rate": 9.439137950059539e-06,
440
+ "loss": 0.3293,
441
+ "mean_token_accuracy": 0.8908188343048096,
442
+ "step": 54
443
+ },
444
+ {
445
+ "epoch": 0.2558139534883721,
446
+ "grad_norm": 2.7432124614715576,
447
+ "learning_rate": 9.403204098814965e-06,
448
+ "loss": 0.3289,
449
+ "mean_token_accuracy": 0.8898541927337646,
450
+ "step": 55
451
+ },
452
+ {
453
+ "epoch": 0.26046511627906976,
454
+ "grad_norm": 2.78106689453125,
455
+ "learning_rate": 9.366236065988053e-06,
456
+ "loss": 0.3222,
457
+ "mean_token_accuracy": 0.8953770995140076,
458
+ "step": 56
459
+ },
460
+ {
461
+ "epoch": 0.2651162790697674,
462
+ "grad_norm": 2.4186606407165527,
463
+ "learning_rate": 9.32824364652104e-06,
464
+ "loss": 0.2825,
465
+ "mean_token_accuracy": 0.908851146697998,
466
+ "step": 57
467
+ },
468
+ {
469
+ "epoch": 0.26976744186046514,
470
+ "grad_norm": 2.4946393966674805,
471
+ "learning_rate": 9.289236906774663e-06,
472
+ "loss": 0.3115,
473
+ "mean_token_accuracy": 0.8956378698348999,
474
+ "step": 58
475
+ },
476
+ {
477
+ "epoch": 0.2744186046511628,
478
+ "grad_norm": 2.7663369178771973,
479
+ "learning_rate": 9.249226181861e-06,
480
+ "loss": 0.2915,
481
+ "mean_token_accuracy": 0.9062834978103638,
482
+ "step": 59
483
+ },
484
+ {
485
+ "epoch": 0.27906976744186046,
486
+ "grad_norm": 2.6216726303100586,
487
+ "learning_rate": 9.208222072905113e-06,
488
+ "loss": 0.3069,
489
+ "mean_token_accuracy": 0.8927038908004761,
490
+ "step": 60
491
+ },
492
+ {
493
+ "epoch": 0.2837209302325581,
494
+ "grad_norm": 2.7621278762817383,
495
+ "learning_rate": 9.166235444236209e-06,
496
+ "loss": 0.3425,
497
+ "mean_token_accuracy": 0.8820893168449402,
498
+ "step": 61
499
+ },
500
+ {
501
+ "epoch": 0.28837209302325584,
502
+ "grad_norm": 2.748720407485962,
503
+ "learning_rate": 9.123277420509053e-06,
504
+ "loss": 0.3364,
505
+ "mean_token_accuracy": 0.8870337605476379,
506
+ "step": 62
507
+ },
508
+ {
509
+ "epoch": 0.2930232558139535,
510
+ "grad_norm": 2.5530190467834473,
511
+ "learning_rate": 9.079359383756411e-06,
512
+ "loss": 0.3133,
513
+ "mean_token_accuracy": 0.8994490504264832,
514
+ "step": 63
515
+ },
516
+ {
517
+ "epoch": 0.29767441860465116,
518
+ "grad_norm": 2.48974609375,
519
+ "learning_rate": 9.034492970373305e-06,
520
+ "loss": 0.3035,
521
+ "mean_token_accuracy": 0.8979023098945618,
522
+ "step": 64
523
+ },
524
+ {
525
+ "epoch": 0.3023255813953488,
526
+ "grad_norm": 2.6978116035461426,
527
+ "learning_rate": 8.988690068033864e-06,
528
+ "loss": 0.3339,
529
+ "mean_token_accuracy": 0.8893972635269165,
530
+ "step": 65
531
+ },
532
+ {
533
+ "epoch": 0.30697674418604654,
534
+ "grad_norm": 2.308871030807495,
535
+ "learning_rate": 8.941962812541604e-06,
536
+ "loss": 0.3016,
537
+ "mean_token_accuracy": 0.8937329649925232,
538
+ "step": 66
539
+ },
540
+ {
541
+ "epoch": 0.3116279069767442,
542
+ "grad_norm": 2.511657953262329,
543
+ "learning_rate": 8.894323584613951e-06,
544
+ "loss": 0.3194,
545
+ "mean_token_accuracy": 0.889618456363678,
546
+ "step": 67
547
+ },
548
+ {
549
+ "epoch": 0.31627906976744186,
550
+ "grad_norm": 2.389920949935913,
551
+ "learning_rate": 8.845785006601898e-06,
552
+ "loss": 0.281,
553
+ "mean_token_accuracy": 0.9037378430366516,
554
+ "step": 68
555
+ },
556
+ {
557
+ "epoch": 0.3209302325581395,
558
+ "grad_norm": 2.5016298294067383,
559
+ "learning_rate": 8.796359939145614e-06,
560
+ "loss": 0.3076,
561
+ "mean_token_accuracy": 0.8938754200935364,
562
+ "step": 69
563
+ },
564
+ {
565
+ "epoch": 0.32558139534883723,
566
+ "grad_norm": 2.682018518447876,
567
+ "learning_rate": 8.74606147776692e-06,
568
+ "loss": 0.3102,
569
+ "mean_token_accuracy": 0.8926771879196167,
570
+ "step": 70
571
+ },
572
+ {
573
+ "epoch": 0.3302325581395349,
574
+ "grad_norm": 2.5323574542999268,
575
+ "learning_rate": 8.694902949399555e-06,
576
+ "loss": 0.3095,
577
+ "mean_token_accuracy": 0.8985507488250732,
578
+ "step": 71
579
+ },
580
+ {
581
+ "epoch": 0.33488372093023255,
582
+ "grad_norm": 2.5909910202026367,
583
+ "learning_rate": 8.642897908858096e-06,
584
+ "loss": 0.3112,
585
+ "mean_token_accuracy": 0.8959133625030518,
586
+ "step": 72
587
+ },
588
+ {
589
+ "epoch": 0.3395348837209302,
590
+ "grad_norm": 2.635803461074829,
591
+ "learning_rate": 8.590060135246516e-06,
592
+ "loss": 0.3358,
593
+ "mean_token_accuracy": 0.8812984228134155,
594
+ "step": 73
595
+ },
596
+ {
597
+ "epoch": 0.34418604651162793,
598
+ "grad_norm": 2.2963192462921143,
599
+ "learning_rate": 8.53640362830732e-06,
600
+ "loss": 0.2899,
601
+ "mean_token_accuracy": 0.9005515575408936,
602
+ "step": 74
603
+ },
604
+ {
605
+ "epoch": 0.3488372093023256,
606
+ "grad_norm": 2.634748935699463,
607
+ "learning_rate": 8.481942604712209e-06,
608
+ "loss": 0.3187,
609
+ "mean_token_accuracy": 0.891135573387146,
610
+ "step": 75
611
+ },
612
+ {
613
+ "epoch": 0.35348837209302325,
614
+ "grad_norm": 2.4233202934265137,
615
+ "learning_rate": 8.426691494295269e-06,
616
+ "loss": 0.2874,
617
+ "mean_token_accuracy": 0.9031267762184143,
618
+ "step": 76
619
+ },
620
+ {
621
+ "epoch": 0.3581395348837209,
622
+ "grad_norm": 2.462113618850708,
623
+ "learning_rate": 8.370664936229688e-06,
624
+ "loss": 0.314,
625
+ "mean_token_accuracy": 0.8929266929626465,
626
+ "step": 77
627
+ },
628
+ {
629
+ "epoch": 0.3627906976744186,
630
+ "grad_norm": 2.456052780151367,
631
+ "learning_rate": 8.313877775149009e-06,
632
+ "loss": 0.2742,
633
+ "mean_token_accuracy": 0.9058647155761719,
634
+ "step": 78
635
+ },
636
+ {
637
+ "epoch": 0.3674418604651163,
638
+ "grad_norm": 2.792787790298462,
639
+ "learning_rate": 8.256345057213925e-06,
640
+ "loss": 0.3157,
641
+ "mean_token_accuracy": 0.8885141015052795,
642
+ "step": 79
643
+ },
644
+ {
645
+ "epoch": 0.37209302325581395,
646
+ "grad_norm": 2.2764811515808105,
647
+ "learning_rate": 8.198082026125707e-06,
648
+ "loss": 0.3285,
649
+ "mean_token_accuracy": 0.8876305222511292,
650
+ "step": 80
651
+ },
652
+ {
653
+ "epoch": 0.3767441860465116,
654
+ "grad_norm": 2.815459966659546,
655
+ "learning_rate": 8.139104119087265e-06,
656
+ "loss": 0.2839,
657
+ "mean_token_accuracy": 0.9014216065406799,
658
+ "step": 81
659
+ },
660
+ {
661
+ "epoch": 0.3813953488372093,
662
+ "grad_norm": 2.5127663612365723,
663
+ "learning_rate": 8.07942696271296e-06,
664
+ "loss": 0.2825,
665
+ "mean_token_accuracy": 0.9009503126144409,
666
+ "step": 82
667
+ },
668
+ {
669
+ "epoch": 0.386046511627907,
670
+ "grad_norm": 2.3554956912994385,
671
+ "learning_rate": 8.019066368888222e-06,
672
+ "loss": 0.2873,
673
+ "mean_token_accuracy": 0.902760922908783,
674
+ "step": 83
675
+ },
676
+ {
677
+ "epoch": 0.39069767441860465,
678
+ "grad_norm": 2.4911487102508545,
679
+ "learning_rate": 7.958038330580067e-06,
680
+ "loss": 0.3324,
681
+ "mean_token_accuracy": 0.883284866809845,
682
+ "step": 84
683
+ },
684
+ {
685
+ "epoch": 0.3953488372093023,
686
+ "grad_norm": 2.3755667209625244,
687
+ "learning_rate": 7.89635901759967e-06,
688
+ "loss": 0.3,
689
+ "mean_token_accuracy": 0.8920915126800537,
690
+ "step": 85
691
+ },
692
+ {
693
+ "epoch": 0.4,
694
+ "grad_norm": 2.7483432292938232,
695
+ "learning_rate": 7.834044772318033e-06,
696
+ "loss": 0.3282,
697
+ "mean_token_accuracy": 0.8910906910896301,
698
+ "step": 86
699
+ },
700
+ {
701
+ "epoch": 0.4046511627906977,
702
+ "grad_norm": 2.4823901653289795,
703
+ "learning_rate": 7.77111210533597e-06,
704
+ "loss": 0.3,
705
+ "mean_token_accuracy": 0.8902961015701294,
706
+ "step": 87
707
+ },
708
+ {
709
+ "epoch": 0.40930232558139534,
710
+ "grad_norm": 2.6033935546875,
711
+ "learning_rate": 7.707577691109519e-06,
712
+ "loss": 0.3548,
713
+ "mean_token_accuracy": 0.880534827709198,
714
+ "step": 88
715
+ },
716
+ {
717
+ "epoch": 0.413953488372093,
718
+ "grad_norm": 2.27488374710083,
719
+ "learning_rate": 7.6434583635319e-06,
720
+ "loss": 0.3022,
721
+ "mean_token_accuracy": 0.9012537002563477,
722
+ "step": 89
723
+ },
724
+ {
725
+ "epoch": 0.4186046511627907,
726
+ "grad_norm": 2.287916660308838,
727
+ "learning_rate": 7.578771111473276e-06,
728
+ "loss": 0.2578,
729
+ "mean_token_accuracy": 0.9131473898887634,
730
+ "step": 90
731
+ },
732
+ {
733
+ "epoch": 0.4232558139534884,
734
+ "grad_norm": 2.3548431396484375,
735
+ "learning_rate": 7.513533074279427e-06,
736
+ "loss": 0.2677,
737
+ "mean_token_accuracy": 0.9082207083702087,
738
+ "step": 91
739
+ },
740
+ {
741
+ "epoch": 0.42790697674418604,
742
+ "grad_norm": 2.484837532043457,
743
+ "learning_rate": 7.4477615372305545e-06,
744
+ "loss": 0.3218,
745
+ "mean_token_accuracy": 0.8943215012550354,
746
+ "step": 92
747
+ },
748
+ {
749
+ "epoch": 0.4325581395348837,
750
+ "grad_norm": 2.419360637664795,
751
+ "learning_rate": 7.3814739269614265e-06,
752
+ "loss": 0.2945,
753
+ "mean_token_accuracy": 0.9002562165260315,
754
+ "step": 93
755
+ },
756
+ {
757
+ "epoch": 0.4372093023255814,
758
+ "grad_norm": 2.4345784187316895,
759
+ "learning_rate": 7.314687806844067e-06,
760
+ "loss": 0.3091,
761
+ "mean_token_accuracy": 0.8968407511711121,
762
+ "step": 94
763
+ },
764
+ {
765
+ "epoch": 0.4418604651162791,
766
+ "grad_norm": 2.479210376739502,
767
+ "learning_rate": 7.247420872334221e-06,
768
+ "loss": 0.3019,
769
+ "mean_token_accuracy": 0.8969621658325195,
770
+ "step": 95
771
+ },
772
+ {
773
+ "epoch": 0.44651162790697674,
774
+ "grad_norm": 2.2260143756866455,
775
+ "learning_rate": 7.179690946282808e-06,
776
+ "loss": 0.2833,
777
+ "mean_token_accuracy": 0.9025951027870178,
778
+ "step": 96
779
+ },
780
+ {
781
+ "epoch": 0.4511627906976744,
782
+ "grad_norm": 2.2926838397979736,
783
+ "learning_rate": 7.111515974213639e-06,
784
+ "loss": 0.2491,
785
+ "mean_token_accuracy": 0.9109234809875488,
786
+ "step": 97
787
+ },
788
+ {
789
+ "epoch": 0.4558139534883721,
790
+ "grad_norm": 2.386190414428711,
791
+ "learning_rate": 7.042914019568621e-06,
792
+ "loss": 0.3028,
793
+ "mean_token_accuracy": 0.8939149379730225,
794
+ "step": 98
795
+ },
796
+ {
797
+ "epoch": 0.4604651162790698,
798
+ "grad_norm": 2.2225522994995117,
799
+ "learning_rate": 6.973903258921719e-06,
800
+ "loss": 0.2564,
801
+ "mean_token_accuracy": 0.909423291683197,
802
+ "step": 99
803
+ },
804
+ {
805
+ "epoch": 0.46511627906976744,
806
+ "grad_norm": 2.3535642623901367,
807
+ "learning_rate": 6.904501977162949e-06,
808
+ "loss": 0.3274,
809
+ "mean_token_accuracy": 0.88499915599823,
810
+ "step": 100
811
+ },
812
+ {
813
+ "epoch": 0.4697674418604651,
814
+ "grad_norm": 2.362877607345581,
815
+ "learning_rate": 6.834728562653659e-06,
816
+ "loss": 0.309,
817
+ "mean_token_accuracy": 0.8906528353691101,
818
+ "step": 101
819
+ },
820
+ {
821
+ "epoch": 0.4744186046511628,
822
+ "grad_norm": 2.4089605808258057,
823
+ "learning_rate": 6.764601502354403e-06,
824
+ "loss": 0.3059,
825
+ "mean_token_accuracy": 0.8971619009971619,
826
+ "step": 102
827
+ },
828
+ {
829
+ "epoch": 0.4790697674418605,
830
+ "grad_norm": 2.2496674060821533,
831
+ "learning_rate": 6.6941393769266995e-06,
832
+ "loss": 0.2987,
833
+ "mean_token_accuracy": 0.8985577821731567,
834
+ "step": 103
835
+ },
836
+ {
837
+ "epoch": 0.48372093023255813,
838
+ "grad_norm": 2.0976436138153076,
839
+ "learning_rate": 6.6233608558099405e-06,
840
+ "loss": 0.2774,
841
+ "mean_token_accuracy": 0.9051709771156311,
842
+ "step": 104
843
+ },
844
+ {
845
+ "epoch": 0.4883720930232558,
846
+ "grad_norm": 2.2868094444274902,
847
+ "learning_rate": 6.552284692274803e-06,
848
+ "loss": 0.3088,
849
+ "mean_token_accuracy": 0.8942136764526367,
850
+ "step": 105
851
+ },
852
+ {
853
+ "epoch": 0.4930232558139535,
854
+ "grad_norm": 2.3689169883728027,
855
+ "learning_rate": 6.48092971845443e-06,
856
+ "loss": 0.2793,
857
+ "mean_token_accuracy": 0.9030190706253052,
858
+ "step": 106
859
+ },
860
+ {
861
+ "epoch": 0.49767441860465117,
862
+ "grad_norm": 2.390408515930176,
863
+ "learning_rate": 6.409314840354724e-06,
864
+ "loss": 0.3383,
865
+ "mean_token_accuracy": 0.8868411779403687,
866
+ "step": 107
867
+ },
868
+ {
869
+ "epoch": 0.5023255813953489,
870
+ "grad_norm": 2.522303342819214,
871
+ "learning_rate": 6.337459032845068e-06,
872
+ "loss": 0.3146,
873
+ "mean_token_accuracy": 0.8873773217201233,
874
+ "step": 108
875
+ },
876
+ {
877
+ "epoch": 0.5069767441860465,
878
+ "grad_norm": 2.418745279312134,
879
+ "learning_rate": 6.2653813346308e-06,
880
+ "loss": 0.2738,
881
+ "mean_token_accuracy": 0.9050701260566711,
882
+ "step": 109
883
+ },
884
+ {
885
+ "epoch": 0.5116279069767442,
886
+ "grad_norm": 2.2483389377593994,
887
+ "learning_rate": 6.193100843208772e-06,
888
+ "loss": 0.282,
889
+ "mean_token_accuracy": 0.8973640203475952,
890
+ "step": 110
891
+ },
892
+ {
893
+ "epoch": 0.5162790697674419,
894
+ "grad_norm": 2.3183794021606445,
895
+ "learning_rate": 6.120636709807334e-06,
896
+ "loss": 0.3109,
897
+ "mean_token_accuracy": 0.8886767029762268,
898
+ "step": 111
899
+ },
900
+ {
901
+ "epoch": 0.5209302325581395,
902
+ "grad_norm": 2.3985159397125244,
903
+ "learning_rate": 6.048008134312078e-06,
904
+ "loss": 0.2891,
905
+ "mean_token_accuracy": 0.9019830226898193,
906
+ "step": 112
907
+ },
908
+ {
909
+ "epoch": 0.5255813953488372,
910
+ "grad_norm": 2.401186943054199,
911
+ "learning_rate": 5.975234360178698e-06,
912
+ "loss": 0.3263,
913
+ "mean_token_accuracy": 0.8926165103912354,
914
+ "step": 113
915
+ },
916
+ {
917
+ "epoch": 0.5302325581395348,
918
+ "grad_norm": 2.2257766723632812,
919
+ "learning_rate": 5.902334669334287e-06,
920
+ "loss": 0.2991,
921
+ "mean_token_accuracy": 0.9015018939971924,
922
+ "step": 114
923
+ },
924
+ {
925
+ "epoch": 0.5348837209302325,
926
+ "grad_norm": 2.4873902797698975,
927
+ "learning_rate": 5.829328377068476e-06,
928
+ "loss": 0.2952,
929
+ "mean_token_accuracy": 0.8938588500022888,
930
+ "step": 115
931
+ },
932
+ {
933
+ "epoch": 0.5395348837209303,
934
+ "grad_norm": 2.5127971172332764,
935
+ "learning_rate": 5.756234826915686e-06,
936
+ "loss": 0.3015,
937
+ "mean_token_accuracy": 0.8990825414657593,
938
+ "step": 116
939
+ },
940
+ {
941
+ "epoch": 0.5441860465116279,
942
+ "grad_norm": 2.7055206298828125,
943
+ "learning_rate": 5.683073385529938e-06,
944
+ "loss": 0.3148,
945
+ "mean_token_accuracy": 0.8902867436408997,
946
+ "step": 117
947
+ },
948
+ {
949
+ "epoch": 0.5488372093023256,
950
+ "grad_norm": 2.184539794921875,
951
+ "learning_rate": 5.60986343755352e-06,
952
+ "loss": 0.2628,
953
+ "mean_token_accuracy": 0.907613217830658,
954
+ "step": 118
955
+ },
956
+ {
957
+ "epoch": 0.5534883720930233,
958
+ "grad_norm": 2.460629463195801,
959
+ "learning_rate": 5.536624380480878e-06,
960
+ "loss": 0.2526,
961
+ "mean_token_accuracy": 0.9117704629898071,
962
+ "step": 119
963
+ },
964
+ {
965
+ "epoch": 0.5581395348837209,
966
+ "grad_norm": 2.3159563541412354,
967
+ "learning_rate": 5.4633756195191235e-06,
968
+ "loss": 0.261,
969
+ "mean_token_accuracy": 0.90885990858078,
970
+ "step": 120
971
+ },
972
+ {
973
+ "epoch": 0.5627906976744186,
974
+ "grad_norm": 2.0507678985595703,
975
+ "learning_rate": 5.390136562446482e-06,
976
+ "loss": 0.2262,
977
+ "mean_token_accuracy": 0.9185211062431335,
978
+ "step": 121
979
+ },
980
+ {
981
+ "epoch": 0.5674418604651162,
982
+ "grad_norm": 2.456918954849243,
983
+ "learning_rate": 5.316926614470063e-06,
984
+ "loss": 0.3052,
985
+ "mean_token_accuracy": 0.8933428525924683,
986
+ "step": 122
987
+ },
988
+ {
989
+ "epoch": 0.5720930232558139,
990
+ "grad_norm": 2.683582067489624,
991
+ "learning_rate": 5.2437651730843165e-06,
992
+ "loss": 0.3406,
993
+ "mean_token_accuracy": 0.8779610991477966,
994
+ "step": 123
995
+ },
996
+ {
997
+ "epoch": 0.5767441860465117,
998
+ "grad_norm": 2.5005733966827393,
999
+ "learning_rate": 5.170671622931527e-06,
1000
+ "loss": 0.3104,
1001
+ "mean_token_accuracy": 0.8913931250572205,
1002
+ "step": 124
1003
+ },
1004
+ {
1005
+ "epoch": 0.5813953488372093,
1006
+ "grad_norm": 2.073763370513916,
1007
+ "learning_rate": 5.097665330665714e-06,
1008
+ "loss": 0.2393,
1009
+ "mean_token_accuracy": 0.9153239727020264,
1010
+ "step": 125
1011
+ },
1012
+ {
1013
+ "epoch": 0.586046511627907,
1014
+ "grad_norm": 2.638117551803589,
1015
+ "learning_rate": 5.024765639821305e-06,
1016
+ "loss": 0.321,
1017
+ "mean_token_accuracy": 0.8932949900627136,
1018
+ "step": 126
1019
+ },
1020
+ {
1021
+ "epoch": 0.5906976744186047,
1022
+ "grad_norm": 2.077889919281006,
1023
+ "learning_rate": 4.951991865687923e-06,
1024
+ "loss": 0.2523,
1025
+ "mean_token_accuracy": 0.916557788848877,
1026
+ "step": 127
1027
+ },
1028
+ {
1029
+ "epoch": 0.5953488372093023,
1030
+ "grad_norm": 2.4227733612060547,
1031
+ "learning_rate": 4.879363290192667e-06,
1032
+ "loss": 0.2931,
1033
+ "mean_token_accuracy": 0.9004586338996887,
1034
+ "step": 128
1035
+ },
1036
+ {
1037
+ "epoch": 0.6,
1038
+ "grad_norm": 2.7623236179351807,
1039
+ "learning_rate": 4.806899156791231e-06,
1040
+ "loss": 0.359,
1041
+ "mean_token_accuracy": 0.8809077143669128,
1042
+ "step": 129
1043
+ },
1044
+ {
1045
+ "epoch": 0.6046511627906976,
1046
+ "grad_norm": 2.4066247940063477,
1047
+ "learning_rate": 4.734618665369202e-06,
1048
+ "loss": 0.2844,
1049
+ "mean_token_accuracy": 0.8950888514518738,
1050
+ "step": 130
1051
+ },
1052
+ {
1053
+ "epoch": 0.6093023255813953,
1054
+ "grad_norm": 2.34375,
1055
+ "learning_rate": 4.662540967154934e-06,
1056
+ "loss": 0.2673,
1057
+ "mean_token_accuracy": 0.9051871299743652,
1058
+ "step": 131
1059
+ },
1060
+ {
1061
+ "epoch": 0.6139534883720931,
1062
+ "grad_norm": 2.1489133834838867,
1063
+ "learning_rate": 4.5906851596452765e-06,
1064
+ "loss": 0.2721,
1065
+ "mean_token_accuracy": 0.9027295112609863,
1066
+ "step": 132
1067
+ },
1068
+ {
1069
+ "epoch": 0.6186046511627907,
1070
+ "grad_norm": 2.2944834232330322,
1071
+ "learning_rate": 4.519070281545571e-06,
1072
+ "loss": 0.3041,
1073
+ "mean_token_accuracy": 0.8964738845825195,
1074
+ "step": 133
1075
+ },
1076
+ {
1077
+ "epoch": 0.6232558139534884,
1078
+ "grad_norm": 2.323543071746826,
1079
+ "learning_rate": 4.447715307725197e-06,
1080
+ "loss": 0.3137,
1081
+ "mean_token_accuracy": 0.8975229263305664,
1082
+ "step": 134
1083
+ },
1084
+ {
1085
+ "epoch": 0.627906976744186,
1086
+ "grad_norm": 2.3029696941375732,
1087
+ "learning_rate": 4.376639144190061e-06,
1088
+ "loss": 0.3004,
1089
+ "mean_token_accuracy": 0.8968156576156616,
1090
+ "step": 135
1091
+ },
1092
+ {
1093
+ "epoch": 0.6325581395348837,
1094
+ "grad_norm": 2.538562536239624,
1095
+ "learning_rate": 4.305860623073304e-06,
1096
+ "loss": 0.2989,
1097
+ "mean_token_accuracy": 0.8975074291229248,
1098
+ "step": 136
1099
+ },
1100
+ {
1101
+ "epoch": 0.6372093023255814,
1102
+ "grad_norm": 2.164029836654663,
1103
+ "learning_rate": 4.2353984976456e-06,
1104
+ "loss": 0.2795,
1105
+ "mean_token_accuracy": 0.9066439270973206,
1106
+ "step": 137
1107
+ },
1108
+ {
1109
+ "epoch": 0.641860465116279,
1110
+ "grad_norm": 2.2238547801971436,
1111
+ "learning_rate": 4.1652714373463435e-06,
1112
+ "loss": 0.2878,
1113
+ "mean_token_accuracy": 0.8984693884849548,
1114
+ "step": 138
1115
+ },
1116
+ {
1117
+ "epoch": 0.6465116279069767,
1118
+ "grad_norm": 2.2204103469848633,
1119
+ "learning_rate": 4.095498022837051e-06,
1120
+ "loss": 0.2723,
1121
+ "mean_token_accuracy": 0.9076555967330933,
1122
+ "step": 139
1123
+ },
1124
+ {
1125
+ "epoch": 0.6511627906976745,
1126
+ "grad_norm": 2.327342987060547,
1127
+ "learning_rate": 4.026096741078281e-06,
1128
+ "loss": 0.3144,
1129
+ "mean_token_accuracy": 0.8956175446510315,
1130
+ "step": 140
1131
+ },
1132
+ {
1133
+ "epoch": 0.6558139534883721,
1134
+ "grad_norm": 2.408447027206421,
1135
+ "learning_rate": 3.957085980431382e-06,
1136
+ "loss": 0.3109,
1137
+ "mean_token_accuracy": 0.8903638124465942,
1138
+ "step": 141
1139
+ },
1140
+ {
1141
+ "epoch": 0.6604651162790698,
1142
+ "grad_norm": 2.17551851272583,
1143
+ "learning_rate": 3.888484025786364e-06,
1144
+ "loss": 0.2808,
1145
+ "mean_token_accuracy": 0.9039127230644226,
1146
+ "step": 142
1147
+ },
1148
+ {
1149
+ "epoch": 0.6651162790697674,
1150
+ "grad_norm": 2.148789405822754,
1151
+ "learning_rate": 3.820309053717195e-06,
1152
+ "loss": 0.2644,
1153
+ "mean_token_accuracy": 0.9068344831466675,
1154
+ "step": 143
1155
+ },
1156
+ {
1157
+ "epoch": 0.6697674418604651,
1158
+ "grad_norm": 2.4230973720550537,
1159
+ "learning_rate": 3.75257912766578e-06,
1160
+ "loss": 0.3521,
1161
+ "mean_token_accuracy": 0.8838323354721069,
1162
+ "step": 144
1163
+ },
1164
+ {
1165
+ "epoch": 0.6744186046511628,
1166
+ "grad_norm": 2.407576560974121,
1167
+ "learning_rate": 3.6853121931559334e-06,
1168
+ "loss": 0.3176,
1169
+ "mean_token_accuracy": 0.8907623291015625,
1170
+ "step": 145
1171
+ },
1172
+ {
1173
+ "epoch": 0.6790697674418604,
1174
+ "grad_norm": 2.257375478744507,
1175
+ "learning_rate": 3.618526073038574e-06,
1176
+ "loss": 0.3402,
1177
+ "mean_token_accuracy": 0.877872109413147,
1178
+ "step": 146
1179
+ },
1180
+ {
1181
+ "epoch": 0.6837209302325581,
1182
+ "grad_norm": 2.3605127334594727,
1183
+ "learning_rate": 3.552238462769446e-06,
1184
+ "loss": 0.2822,
1185
+ "mean_token_accuracy": 0.898716390132904,
1186
+ "step": 147
1187
+ },
1188
+ {
1189
+ "epoch": 0.6883720930232559,
1190
+ "grad_norm": 2.5195729732513428,
1191
+ "learning_rate": 3.4864669257205745e-06,
1192
+ "loss": 0.2726,
1193
+ "mean_token_accuracy": 0.9063690900802612,
1194
+ "step": 148
1195
+ },
1196
+ {
1197
+ "epoch": 0.6930232558139535,
1198
+ "grad_norm": 2.344093084335327,
1199
+ "learning_rate": 3.4212288885267246e-06,
1200
+ "loss": 0.3185,
1201
+ "mean_token_accuracy": 0.8916053771972656,
1202
+ "step": 149
1203
+ },
1204
+ {
1205
+ "epoch": 0.6976744186046512,
1206
+ "grad_norm": 2.4580070972442627,
1207
+ "learning_rate": 3.3565416364681016e-06,
1208
+ "loss": 0.2861,
1209
+ "mean_token_accuracy": 0.8970394134521484,
1210
+ "step": 150
1211
+ },
1212
+ {
1213
+ "epoch": 0.7023255813953488,
1214
+ "grad_norm": 2.1077418327331543,
1215
+ "learning_rate": 3.2924223088904816e-06,
1216
+ "loss": 0.2753,
1217
+ "mean_token_accuracy": 0.9032756090164185,
1218
+ "step": 151
1219
+ },
1220
+ {
1221
+ "epoch": 0.7069767441860465,
1222
+ "grad_norm": 2.571012258529663,
1223
+ "learning_rate": 3.228887894664029e-06,
1224
+ "loss": 0.3382,
1225
+ "mean_token_accuracy": 0.8874310851097107,
1226
+ "step": 152
1227
+ },
1228
+ {
1229
+ "epoch": 0.7116279069767442,
1230
+ "grad_norm": 2.415187358856201,
1231
+ "learning_rate": 3.1659552276819693e-06,
1232
+ "loss": 0.311,
1233
+ "mean_token_accuracy": 0.8911898136138916,
1234
+ "step": 153
1235
+ },
1236
+ {
1237
+ "epoch": 0.7162790697674418,
1238
+ "grad_norm": 2.24096941947937,
1239
+ "learning_rate": 3.1036409824003324e-06,
1240
+ "loss": 0.2866,
1241
+ "mean_token_accuracy": 0.9001513123512268,
1242
+ "step": 154
1243
+ },
1244
+ {
1245
+ "epoch": 0.7209302325581395,
1246
+ "grad_norm": 2.1161606311798096,
1247
+ "learning_rate": 3.0419616694199327e-06,
1248
+ "loss": 0.2648,
1249
+ "mean_token_accuracy": 0.9077988862991333,
1250
+ "step": 155
1251
+ },
1252
+ {
1253
+ "epoch": 0.7255813953488373,
1254
+ "grad_norm": 2.244520425796509,
1255
+ "learning_rate": 2.98093363111178e-06,
1256
+ "loss": 0.2427,
1257
+ "mean_token_accuracy": 0.9150313138961792,
1258
+ "step": 156
1259
+ },
1260
+ {
1261
+ "epoch": 0.7302325581395349,
1262
+ "grad_norm": 2.494776725769043,
1263
+ "learning_rate": 2.92057303728704e-06,
1264
+ "loss": 0.2976,
1265
+ "mean_token_accuracy": 0.8998464941978455,
1266
+ "step": 157
1267
+ },
1268
+ {
1269
+ "epoch": 0.7348837209302326,
1270
+ "grad_norm": 2.248281717300415,
1271
+ "learning_rate": 2.860895880912735e-06,
1272
+ "loss": 0.2849,
1273
+ "mean_token_accuracy": 0.9054994583129883,
1274
+ "step": 158
1275
+ },
1276
+ {
1277
+ "epoch": 0.7395348837209302,
1278
+ "grad_norm": 2.331533432006836,
1279
+ "learning_rate": 2.801917973874294e-06,
1280
+ "loss": 0.3037,
1281
+ "mean_token_accuracy": 0.8930827379226685,
1282
+ "step": 159
1283
+ },
1284
+ {
1285
+ "epoch": 0.7441860465116279,
1286
+ "grad_norm": 2.3448197841644287,
1287
+ "learning_rate": 2.7436549427860766e-06,
1288
+ "loss": 0.3222,
1289
+ "mean_token_accuracy": 0.889302134513855,
1290
+ "step": 160
1291
+ },
1292
+ {
1293
+ "epoch": 0.7488372093023256,
1294
+ "grad_norm": 2.071540355682373,
1295
+ "learning_rate": 2.6861222248509926e-06,
1296
+ "loss": 0.2476,
1297
+ "mean_token_accuracy": 0.9146341681480408,
1298
+ "step": 161
1299
+ },
1300
+ {
1301
+ "epoch": 0.7534883720930232,
1302
+ "grad_norm": 2.1529202461242676,
1303
+ "learning_rate": 2.6293350637703123e-06,
1304
+ "loss": 0.2813,
1305
+ "mean_token_accuracy": 0.9099606871604919,
1306
+ "step": 162
1307
+ },
1308
+ {
1309
+ "epoch": 0.7581395348837209,
1310
+ "grad_norm": 2.4522061347961426,
1311
+ "learning_rate": 2.5733085057047325e-06,
1312
+ "loss": 0.2871,
1313
+ "mean_token_accuracy": 0.9062443971633911,
1314
+ "step": 163
1315
+ },
1316
+ {
1317
+ "epoch": 0.7627906976744186,
1318
+ "grad_norm": 2.1827683448791504,
1319
+ "learning_rate": 2.518057395287792e-06,
1320
+ "loss": 0.2984,
1321
+ "mean_token_accuracy": 0.8948549628257751,
1322
+ "step": 164
1323
+ },
1324
+ {
1325
+ "epoch": 0.7674418604651163,
1326
+ "grad_norm": 2.5053515434265137,
1327
+ "learning_rate": 2.463596371692681e-06,
1328
+ "loss": 0.297,
1329
+ "mean_token_accuracy": 0.8924633860588074,
1330
+ "step": 165
1331
+ },
1332
+ {
1333
+ "epoch": 0.772093023255814,
1334
+ "grad_norm": 2.1398110389709473,
1335
+ "learning_rate": 2.409939864753487e-06,
1336
+ "loss": 0.2976,
1337
+ "mean_token_accuracy": 0.8955926895141602,
1338
+ "step": 166
1339
+ },
1340
+ {
1341
+ "epoch": 0.7767441860465116,
1342
+ "grad_norm": 2.19425892829895,
1343
+ "learning_rate": 2.3571020911419067e-06,
1344
+ "loss": 0.2816,
1345
+ "mean_token_accuracy": 0.9085714221000671,
1346
+ "step": 167
1347
+ },
1348
+ {
1349
+ "epoch": 0.7813953488372093,
1350
+ "grad_norm": 2.4098033905029297,
1351
+ "learning_rate": 2.3050970506004463e-06,
1352
+ "loss": 0.3113,
1353
+ "mean_token_accuracy": 0.8950749635696411,
1354
+ "step": 168
1355
+ },
1356
+ {
1357
+ "epoch": 0.786046511627907,
1358
+ "grad_norm": 2.4144301414489746,
1359
+ "learning_rate": 2.2539385222330797e-06,
1360
+ "loss": 0.2541,
1361
+ "mean_token_accuracy": 0.90910804271698,
1362
+ "step": 169
1363
+ },
1364
+ {
1365
+ "epoch": 0.7906976744186046,
1366
+ "grad_norm": 2.286418914794922,
1367
+ "learning_rate": 2.203640060854387e-06,
1368
+ "loss": 0.269,
1369
+ "mean_token_accuracy": 0.9078757166862488,
1370
+ "step": 170
1371
+ },
1372
+ {
1373
+ "epoch": 0.7953488372093023,
1374
+ "grad_norm": 2.281157970428467,
1375
+ "learning_rate": 2.1542149933981014e-06,
1376
+ "loss": 0.3069,
1377
+ "mean_token_accuracy": 0.893151581287384,
1378
+ "step": 171
1379
+ },
1380
+ {
1381
+ "epoch": 0.8,
1382
+ "grad_norm": 2.220977544784546,
1383
+ "learning_rate": 2.10567641538605e-06,
1384
+ "loss": 0.279,
1385
+ "mean_token_accuracy": 0.9098817110061646,
1386
+ "step": 172
1387
+ },
1388
+ {
1389
+ "epoch": 0.8046511627906977,
1390
+ "grad_norm": 2.178304433822632,
1391
+ "learning_rate": 2.058037187458398e-06,
1392
+ "loss": 0.2903,
1393
+ "mean_token_accuracy": 0.9047447443008423,
1394
+ "step": 173
1395
+ },
1396
+ {
1397
+ "epoch": 0.8093023255813954,
1398
+ "grad_norm": 2.4263737201690674,
1399
+ "learning_rate": 2.011309931966136e-06,
1400
+ "loss": 0.2844,
1401
+ "mean_token_accuracy": 0.9039999842643738,
1402
+ "step": 174
1403
+ },
1404
+ {
1405
+ "epoch": 0.813953488372093,
1406
+ "grad_norm": 2.2288589477539062,
1407
+ "learning_rate": 1.965507029626695e-06,
1408
+ "loss": 0.2977,
1409
+ "mean_token_accuracy": 0.8857610821723938,
1410
+ "step": 175
1411
+ },
1412
+ {
1413
+ "epoch": 0.8186046511627907,
1414
+ "grad_norm": 2.2465760707855225,
1415
+ "learning_rate": 1.920640616243589e-06,
1416
+ "loss": 0.2632,
1417
+ "mean_token_accuracy": 0.9077427387237549,
1418
+ "step": 176
1419
+ },
1420
+ {
1421
+ "epoch": 0.8232558139534883,
1422
+ "grad_norm": 2.478376626968384,
1423
+ "learning_rate": 1.8767225794909484e-06,
1424
+ "loss": 0.321,
1425
+ "mean_token_accuracy": 0.8919563889503479,
1426
+ "step": 177
1427
+ },
1428
+ {
1429
+ "epoch": 0.827906976744186,
1430
+ "grad_norm": 2.138991594314575,
1431
+ "learning_rate": 1.8337645557637929e-06,
1432
+ "loss": 0.2744,
1433
+ "mean_token_accuracy": 0.9032313227653503,
1434
+ "step": 178
1435
+ },
1436
+ {
1437
+ "epoch": 0.8325581395348837,
1438
+ "grad_norm": 2.145294189453125,
1439
+ "learning_rate": 1.7917779270948887e-06,
1440
+ "loss": 0.2934,
1441
+ "mean_token_accuracy": 0.8986738324165344,
1442
+ "step": 179
1443
+ },
1444
+ {
1445
+ "epoch": 0.8372093023255814,
1446
+ "grad_norm": 2.3890199661254883,
1447
+ "learning_rate": 1.7507738181390027e-06,
1448
+ "loss": 0.2847,
1449
+ "mean_token_accuracy": 0.9007092118263245,
1450
+ "step": 180
1451
+ },
1452
+ {
1453
+ "epoch": 0.8418604651162791,
1454
+ "grad_norm": 2.3594157695770264,
1455
+ "learning_rate": 1.7107630932253383e-06,
1456
+ "loss": 0.2853,
1457
+ "mean_token_accuracy": 0.8999655842781067,
1458
+ "step": 181
1459
+ },
1460
+ {
1461
+ "epoch": 0.8465116279069768,
1462
+ "grad_norm": 2.2847506999969482,
1463
+ "learning_rate": 1.6717563534789594e-06,
1464
+ "loss": 0.2909,
1465
+ "mean_token_accuracy": 0.9014893174171448,
1466
+ "step": 182
1467
+ },
1468
+ {
1469
+ "epoch": 0.8511627906976744,
1470
+ "grad_norm": 2.3248329162597656,
1471
+ "learning_rate": 1.6337639340119476e-06,
1472
+ "loss": 0.2996,
1473
+ "mean_token_accuracy": 0.8940504193305969,
1474
+ "step": 183
1475
+ },
1476
+ {
1477
+ "epoch": 0.8558139534883721,
1478
+ "grad_norm": 2.390031576156616,
1479
+ "learning_rate": 1.596795901185037e-06,
1480
+ "loss": 0.305,
1481
+ "mean_token_accuracy": 0.8964377641677856,
1482
+ "step": 184
1483
+ },
1484
+ {
1485
+ "epoch": 0.8604651162790697,
1486
+ "grad_norm": 2.054699420928955,
1487
+ "learning_rate": 1.5608620499404628e-06,
1488
+ "loss": 0.2769,
1489
+ "mean_token_accuracy": 0.9020813703536987,
1490
+ "step": 185
1491
+ },
1492
+ {
1493
+ "epoch": 0.8651162790697674,
1494
+ "grad_norm": 2.3316433429718018,
1495
+ "learning_rate": 1.5259719012067249e-06,
1496
+ "loss": 0.3028,
1497
+ "mean_token_accuracy": 0.8970876932144165,
1498
+ "step": 186
1499
+ },
1500
+ {
1501
+ "epoch": 0.8697674418604651,
1502
+ "grad_norm": 2.2691869735717773,
1503
+ "learning_rate": 1.4921346993759453e-06,
1504
+ "loss": 0.27,
1505
+ "mean_token_accuracy": 0.9035449028015137,
1506
+ "step": 187
1507
+ },
1508
+ {
1509
+ "epoch": 0.8744186046511628,
1510
+ "grad_norm": 2.1666178703308105,
1511
+ "learning_rate": 1.459359409854505e-06,
1512
+ "loss": 0.2836,
1513
+ "mean_token_accuracy": 0.9079626202583313,
1514
+ "step": 188
1515
+ },
1516
+ {
1517
+ "epoch": 0.8790697674418605,
1518
+ "grad_norm": 2.2185068130493164,
1519
+ "learning_rate": 1.4276547166875946e-06,
1520
+ "loss": 0.2652,
1521
+ "mean_token_accuracy": 0.9092705249786377,
1522
+ "step": 189
1523
+ },
1524
+ {
1525
+ "epoch": 0.8837209302325582,
1526
+ "grad_norm": 2.47299861907959,
1527
+ "learning_rate": 1.397029020258313e-06,
1528
+ "loss": 0.2879,
1529
+ "mean_token_accuracy": 0.896272599697113,
1530
+ "step": 190
1531
+ },
1532
+ {
1533
+ "epoch": 0.8883720930232558,
1534
+ "grad_norm": 2.3769164085388184,
1535
+ "learning_rate": 1.367490435061928e-06,
1536
+ "loss": 0.3358,
1537
+ "mean_token_accuracy": 0.8814517259597778,
1538
+ "step": 191
1539
+ },
1540
+ {
1541
+ "epoch": 0.8930232558139535,
1542
+ "grad_norm": 2.21427845954895,
1543
+ "learning_rate": 1.3390467875558855e-06,
1544
+ "loss": 0.2465,
1545
+ "mean_token_accuracy": 0.9141114950180054,
1546
+ "step": 192
1547
+ },
1548
+ {
1549
+ "epoch": 0.8976744186046511,
1550
+ "grad_norm": 2.2213754653930664,
1551
+ "learning_rate": 1.3117056140861317e-06,
1552
+ "loss": 0.2912,
1553
+ "mean_token_accuracy": 0.9031753540039062,
1554
+ "step": 193
1555
+ },
1556
+ {
1557
+ "epoch": 0.9023255813953488,
1558
+ "grad_norm": 2.2283401489257812,
1559
+ "learning_rate": 1.285474158890304e-06,
1560
+ "loss": 0.2645,
1561
+ "mean_token_accuracy": 0.9112547039985657,
1562
+ "step": 194
1563
+ },
1564
+ {
1565
+ "epoch": 0.9069767441860465,
1566
+ "grad_norm": 2.1087920665740967,
1567
+ "learning_rate": 1.2603593721783219e-06,
1568
+ "loss": 0.2653,
1569
+ "mean_token_accuracy": 0.9116804599761963,
1570
+ "step": 195
1571
+ },
1572
+ {
1573
+ "epoch": 0.9116279069767442,
1574
+ "grad_norm": 2.2635421752929688,
1575
+ "learning_rate": 1.2363679082908766e-06,
1576
+ "loss": 0.2987,
1577
+ "mean_token_accuracy": 0.8961791396141052,
1578
+ "step": 196
1579
+ },
1580
+ {
1581
+ "epoch": 0.9162790697674419,
1582
+ "grad_norm": 2.3147056102752686,
1583
+ "learning_rate": 1.2135061239363161e-06,
1584
+ "loss": 0.2587,
1585
+ "mean_token_accuracy": 0.9121925234794617,
1586
+ "step": 197
1587
+ },
1588
+ {
1589
+ "epoch": 0.9209302325581395,
1590
+ "grad_norm": 2.262284755706787,
1591
+ "learning_rate": 1.1917800765063954e-06,
1592
+ "loss": 0.2704,
1593
+ "mean_token_accuracy": 0.9092745780944824,
1594
+ "step": 198
1595
+ },
1596
+ {
1597
+ "epoch": 0.9255813953488372,
1598
+ "grad_norm": 2.5080342292785645,
1599
+ "learning_rate": 1.1711955224713209e-06,
1600
+ "loss": 0.2932,
1601
+ "mean_token_accuracy": 0.898510217666626,
1602
+ "step": 199
1603
+ },
1604
+ {
1605
+ "epoch": 0.9302325581395349,
1606
+ "grad_norm": 2.2359910011291504,
1607
+ "learning_rate": 1.1517579158545386e-06,
1608
+ "loss": 0.2699,
1609
+ "mean_token_accuracy": 0.9084086418151855,
1610
+ "step": 200
1611
+ },
1612
+ {
1613
+ "epoch": 0.9348837209302325,
1614
+ "grad_norm": 2.4288928508758545,
1615
+ "learning_rate": 1.1334724067876463e-06,
1616
+ "loss": 0.3007,
1617
+ "mean_token_accuracy": 0.8988877534866333,
1618
+ "step": 201
1619
+ },
1620
+ {
1621
+ "epoch": 0.9395348837209302,
1622
+ "grad_norm": 2.267275810241699,
1623
+ "learning_rate": 1.1163438401458358e-06,
1624
+ "loss": 0.2668,
1625
+ "mean_token_accuracy": 0.9053376913070679,
1626
+ "step": 202
1627
+ },
1628
+ {
1629
+ "epoch": 0.9441860465116279,
1630
+ "grad_norm": 2.3959038257598877,
1631
+ "learning_rate": 1.1003767542642021e-06,
1632
+ "loss": 0.2833,
1633
+ "mean_token_accuracy": 0.9051226377487183,
1634
+ "step": 203
1635
+ },
1636
+ {
1637
+ "epoch": 0.9488372093023256,
1638
+ "grad_norm": 2.2642016410827637,
1639
+ "learning_rate": 1.0855753797352868e-06,
1640
+ "loss": 0.2544,
1641
+ "mean_token_accuracy": 0.9085396528244019,
1642
+ "step": 204
1643
+ },
1644
+ {
1645
+ "epoch": 0.9534883720930233,
1646
+ "grad_norm": 2.1275274753570557,
1647
+ "learning_rate": 1.0719436382881466e-06,
1648
+ "loss": 0.2614,
1649
+ "mean_token_accuracy": 0.9060357809066772,
1650
+ "step": 205
1651
+ },
1652
+ {
1653
+ "epoch": 0.958139534883721,
1654
+ "grad_norm": 2.3473360538482666,
1655
+ "learning_rate": 1.0594851417492665e-06,
1656
+ "loss": 0.2869,
1657
+ "mean_token_accuracy": 0.9008370637893677,
1658
+ "step": 206
1659
+ },
1660
+ {
1661
+ "epoch": 0.9627906976744186,
1662
+ "grad_norm": 2.4919135570526123,
1663
+ "learning_rate": 1.0482031910855804e-06,
1664
+ "loss": 0.3038,
1665
+ "mean_token_accuracy": 0.8915042281150818,
1666
+ "step": 207
1667
+ },
1668
+ {
1669
+ "epoch": 0.9674418604651163,
1670
+ "grad_norm": 2.309241533279419,
1671
+ "learning_rate": 1.0381007755298547e-06,
1672
+ "loss": 0.2765,
1673
+ "mean_token_accuracy": 0.9052693843841553,
1674
+ "step": 208
1675
+ },
1676
+ {
1677
+ "epoch": 0.9720930232558139,
1678
+ "grad_norm": 2.2533209323883057,
1679
+ "learning_rate": 1.029180571788672e-06,
1680
+ "loss": 0.2576,
1681
+ "mean_token_accuracy": 0.9127118587493896,
1682
+ "step": 209
1683
+ },
1684
+ {
1685
+ "epoch": 0.9767441860465116,
1686
+ "grad_norm": 2.08575439453125,
1687
+ "learning_rate": 1.021444943333218e-06,
1688
+ "loss": 0.2668,
1689
+ "mean_token_accuracy": 0.908217191696167,
1690
+ "step": 210
1691
+ },
1692
+ {
1693
+ "epoch": 0.9813953488372092,
1694
+ "grad_norm": 2.299942970275879,
1695
+ "learning_rate": 1.0148959397730637e-06,
1696
+ "loss": 0.2626,
1697
+ "mean_token_accuracy": 0.9057932496070862,
1698
+ "step": 211
1699
+ },
1700
+ {
1701
+ "epoch": 0.986046511627907,
1702
+ "grad_norm": 2.232316017150879,
1703
+ "learning_rate": 1.0095352963131057e-06,
1704
+ "loss": 0.2853,
1705
+ "mean_token_accuracy": 0.9010600447654724,
1706
+ "step": 212
1707
+ },
1708
+ {
1709
+ "epoch": 0.9906976744186047,
1710
+ "grad_norm": 2.287255048751831,
1711
+ "learning_rate": 1.0053644332938118e-06,
1712
+ "loss": 0.251,
1713
+ "mean_token_accuracy": 0.9092245697975159,
1714
+ "step": 213
1715
+ },
1716
+ {
1717
+ "epoch": 0.9953488372093023,
1718
+ "grad_norm": 2.229083776473999,
1719
+ "learning_rate": 1.0023844558148912e-06,
1720
+ "loss": 0.2969,
1721
+ "mean_token_accuracy": 0.8984347581863403,
1722
+ "step": 214
1723
+ },
1724
+ {
1725
+ "epoch": 1.0,
1726
+ "grad_norm": 2.0628480911254883,
1727
+ "learning_rate": 1.0005961534424925e-06,
1728
+ "loss": 0.2309,
1729
+ "mean_token_accuracy": 0.909231960773468,
1730
+ "step": 215
1731
+ },
1732
+ {
1733
+ "epoch": 1.0,
1734
+ "step": 215,
1735
+ "total_flos": 2.0144468407196058e+17,
1736
+ "train_loss": 0.325891209826913,
1737
+ "train_runtime": 788.7208,
1738
+ "train_samples_per_second": 8.686,
1739
+ "train_steps_per_second": 0.273
1740
+ }
1741
+ ],
1742
+ "logging_steps": 1,
1743
+ "max_steps": 215,
1744
+ "num_input_tokens_seen": 0,
1745
+ "num_train_epochs": 1,
1746
+ "save_steps": 500,
1747
+ "stateful_callbacks": {
1748
+ "TrainerControl": {
1749
+ "args": {
1750
+ "should_epoch_stop": false,
1751
+ "should_evaluate": false,
1752
+ "should_log": false,
1753
+ "should_save": true,
1754
+ "should_training_stop": true
1755
+ },
1756
+ "attributes": {}
1757
+ }
1758
+ },
1759
+ "total_flos": 2.0144468407196058e+17,
1760
+ "train_batch_size": 4,
1761
+ "trial_name": null,
1762
+ "trial_params": null
1763
+ }
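
The `learning_rate` values logged above are consistent with a linear warmup followed by cosine decay to a floor. The sketch below reproduces them under assumptions that are inferred by fitting the logged numbers, not stated anywhere in this commit: a peak rate of `1e-5`, a floor of `1e-6`, 22 warmup steps, and a one-step offset between the logged `step` and the scheduler index. The function name `inferred_lr` is hypothetical, and the warmup branch cannot be checked against the entries shown in this hunk.

```python
import math

# All four constants are assumptions fitted to the logged values,
# not settings read from the training configuration.
PEAK_LR = 1e-5
MIN_LR = 1e-6
WARMUP_STEPS = 22
TOTAL_STEPS = 215  # matches "max_steps": 215 in trainer_state.json


def inferred_lr(logged_step: int) -> float:
    """Learning rate that would be logged at `logged_step` under the assumed schedule."""
    s = logged_step - 1  # assumed offset: the log reports the rate used for this step
    if s < WARMUP_STEPS:
        # Linear warmup from 0 to PEAK_LR (not verifiable from this hunk).
        return PEAK_LR * s / WARMUP_STEPS
    progress = (s - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


if __name__ == "__main__":
    for step in (125, 172, 215):
        print(step, inferred_lr(step))
```

Comparing the printed values with the `learning_rate` fields for steps 125, 172, and 215 is a quick way to confirm or reject the assumed constants for a different run.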
training.log CHANGED
@@ -580,3 +580,56 @@ weight_decay=0.0,
580
  )
581
  (lm_head): Linear(in_features=2048, out_features=151936, bias=False)
582
  )
583
+ 2025-09-15 02:07:52 - INFO - __main__ - *** Save model ***
584
+ 2025-09-15 02:07:52 - INFO - __main__ - 💾 Saving MoE bias states...
585
+ 2025-09-15 02:07:52 - INFO - __main__ - 🔍 Searching for MoE layers with bias states...
586
+ 2025-09-15 02:07:52 - INFO - __main__ - ✅ Saved bias from model.layers.0.mlp: 60 experts, update_speed=0.001000
587
+ 2025-09-15 02:07:52 - INFO - __main__ - ✅ Saved bias from model.layers.1.mlp: 60 experts, update_speed=0.001000
588
+ 2025-09-15 02:07:52 - INFO - __main__ - ✅ Saved bias from model.layers.2.mlp: 60 experts, update_speed=0.001000
589
+ 2025-09-15 02:07:52 - INFO - __main__ - ✅ Saved bias from model.layers.3.mlp: 60 experts, update_speed=0.001000
590
+ 2025-09-15 02:07:52 - INFO - __main__ - ✅ Saved bias from model.layers.4.mlp: 60 experts, update_speed=0.001000
591
+ 2025-09-15 02:07:52 - INFO - __main__ - ✅ Saved bias from model.layers.5.mlp: 60 experts, update_speed=0.001000
592
+ 2025-09-15 02:07:52 - INFO - __main__ - ✅ Saved bias from model.layers.6.mlp: 60 experts, update_speed=0.001000
593
+ 2025-09-15 02:07:52 - INFO - __main__ - ✅ Saved bias from model.layers.7.mlp: 60 experts, update_speed=0.001000
594
+ 2025-09-15 02:07:52 - INFO - __main__ - ✅ Saved bias from model.layers.8.mlp: 60 experts, update_speed=0.001000
595
+ 2025-09-15 02:07:52 - INFO - __main__ - ✅ Saved bias from model.layers.9.mlp: 60 experts, update_speed=0.001000
596
+ 2025-09-15 02:07:52 - INFO - __main__ - ✅ Saved bias from model.layers.10.mlp: 60 experts, update_speed=0.001000
597
+ 2025-09-15 02:07:52 - INFO - __main__ - ✅ Saved bias from model.layers.11.mlp: 60 experts, update_speed=0.001000
598
+ 2025-09-15 02:07:52 - INFO - __main__ - ✅ Saved bias from model.layers.12.mlp: 60 experts, update_speed=0.001000
599
+ 2025-09-15 02:07:52 - INFO - __main__ - ✅ Saved bias from model.layers.13.mlp: 60 experts, update_speed=0.001000
600
+ 2025-09-15 02:07:52 - INFO - __main__ - ✅ Saved bias from model.layers.14.mlp: 60 experts, update_speed=0.001000
601
+ 2025-09-15 02:07:52 - INFO - __main__ - ✅ Saved bias from model.layers.15.mlp: 60 experts, update_speed=0.001000
602
+ 2025-09-15 02:07:52 - INFO - __main__ - ✅ Saved bias from model.layers.16.mlp: 60 experts, update_speed=0.001000
603
+ 2025-09-15 02:07:52 - INFO - __main__ - ✅ Saved bias from model.layers.17.mlp: 60 experts, update_speed=0.001000
604
+ 2025-09-15 02:07:52 - INFO - __main__ - ✅ Saved bias from model.layers.18.mlp: 60 experts, update_speed=0.001000
605
+ 2025-09-15 02:07:52 - INFO - __main__ - ✅ Saved bias from model.layers.19.mlp: 60 experts, update_speed=0.001000
606
+ 2025-09-15 02:07:52 - INFO - __main__ - ✅ Saved bias from model.layers.20.mlp: 60 experts, update_speed=0.001000
607
+ 2025-09-15 02:07:52 - INFO - __main__ - ✅ Saved bias from model.layers.21.mlp: 60 experts, update_speed=0.001000
608
+ 2025-09-15 02:07:52 - INFO - __main__ - ✅ Saved bias from model.layers.22.mlp: 60 experts, update_speed=0.001000
609
+ 2025-09-15 02:07:52 - INFO - __main__ - ✅ Saved bias from model.layers.23.mlp: 60 experts, update_speed=0.001000
610
+ 2025-09-15 02:07:52 - INFO - __main__ - 🎉 Successfully saved 24 MoE bias states to /tmp/data/Qwen1.5-MOE/aux_free_sft/math7k/1e-3-gamma/moe_bias_states.json
611
+ 2025-09-15 02:07:52 - INFO - __main__ - 📊 Bias States Summary:
612
+ 2025-09-15 02:07:52 - INFO - __main__ - model.layers.0.mlp: 60 experts, range=[-0.5000, 0.5000]
613
+ 2025-09-15 02:07:52 - INFO - __main__ - model.layers.1.mlp: 60 experts, range=[-0.5000, 0.5000]
614
+ 2025-09-15 02:07:52 - INFO - __main__ - model.layers.2.mlp: 60 experts, range=[-0.5000, 0.5000]
615
+ 2025-09-15 02:07:52 - INFO - __main__ - model.layers.3.mlp: 60 experts, range=[-0.5000, 0.5000]
616
+ 2025-09-15 02:07:52 - INFO - __main__ - model.layers.4.mlp: 60 experts, range=[-0.5000, 0.5000]
617
+ 2025-09-15 02:07:52 - INFO - __main__ - model.layers.5.mlp: 60 experts, range=[-0.5000, 0.5000]
618
+ 2025-09-15 02:07:52 - INFO - __main__ - model.layers.6.mlp: 60 experts, range=[-0.5000, 0.5000]
619
+ 2025-09-15 02:07:52 - INFO - __main__ - model.layers.7.mlp: 60 experts, range=[-0.5000, 0.5000]
620
+ 2025-09-15 02:07:52 - INFO - __main__ - model.layers.8.mlp: 60 experts, range=[-0.5000, 0.5000]
621
+ 2025-09-15 02:07:52 - INFO - __main__ - model.layers.9.mlp: 60 experts, range=[-0.5000, 0.5000]
622
+ 2025-09-15 02:07:52 - INFO - __main__ - model.layers.10.mlp: 60 experts, range=[-0.5000, 0.5000]
623
+ 2025-09-15 02:07:52 - INFO - __main__ - model.layers.11.mlp: 60 experts, range=[-0.5000, 0.5000]
624
+ 2025-09-15 02:07:52 - INFO - __main__ - model.layers.12.mlp: 60 experts, range=[-0.5000, 0.5000]
625
+ 2025-09-15 02:07:52 - INFO - __main__ - model.layers.13.mlp: 60 experts, range=[-0.5000, 0.5000]
626
+ 2025-09-15 02:07:52 - INFO - __main__ - model.layers.14.mlp: 60 experts, range=[-0.5000, 0.5000]
627
+ 2025-09-15 02:07:52 - INFO - __main__ - model.layers.15.mlp: 60 experts, range=[-0.5000, 0.5000]
628
+ 2025-09-15 02:07:52 - INFO - __main__ - model.layers.16.mlp: 60 experts, range=[-0.5000, 0.5000]
629
+ 2025-09-15 02:07:52 - INFO - __main__ - model.layers.17.mlp: 60 experts, range=[-0.5000, 0.5000]
630
+ 2025-09-15 02:07:52 - INFO - __main__ - model.layers.18.mlp: 60 experts, range=[-0.5000, 0.5000]
631
+ 2025-09-15 02:07:52 - INFO - __main__ - model.layers.19.mlp: 60 experts, range=[-0.5000, 0.5000]
632
+ 2025-09-15 02:07:52 - INFO - __main__ - model.layers.20.mlp: 60 experts, range=[-0.5000, 0.5000]
633
+ 2025-09-15 02:07:52 - INFO - __main__ - model.layers.21.mlp: 60 experts, range=[-0.5000, 0.5000]
634
+ 2025-09-15 02:07:52 - INFO - __main__ - model.layers.22.mlp: 60 experts, range=[-0.5000, 0.5000]
635
+ 2025-09-15 02:07:52 - INFO - __main__ - model.layers.23.mlp: 60 experts, range=[-0.5000, 0.5000]
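
The bias summary lines above follow a regular pattern, so they can be pulled out of `training.log` for inspection. The following is a minimal sketch that assumes only the line format visible in this hunk (`<module>: <n> experts, range=[<lo>, <hi>]`); the names `SUMMARY_RE` and `parse_bias_summary` are hypothetical and are not part of the training script.

```python
import re

# Matches summary lines such as:
#   model.layers.0.mlp: 60 experts, range=[-0.5000, 0.5000]
SUMMARY_RE = re.compile(
    r"(model\.layers\.\d+\.mlp):\s+(\d+) experts, "
    r"range=\[(-?\d+\.\d+),\s*(-?\d+\.\d+)\]"
)


def parse_bias_summary(log_text: str) -> list[dict]:
    """Return one record per MoE layer found in the summary section of the log."""
    records = []
    for match in SUMMARY_RE.finditer(log_text):
        module, n_experts, low, high = match.groups()
        records.append(
            {
                "module": module,
                "experts": int(n_experts),
                "min_bias": float(low),
                "max_bias": float(high),
            }
        )
    return records


if __name__ == "__main__":
    sample = (
        "2025-09-15 02:07:52 - INFO - __main__ - "
        "model.layers.0.mlp: 60 experts, range=[-0.5000, 0.5000]\n"
        "2025-09-15 02:07:52 - INFO - __main__ - "
        "model.layers.1.mlp: 60 experts, range=[-0.5000, 0.5000]\n"
    )
    for record in parse_bias_summary(sample):
        print(record)
```

In this run every layer reports the same range of `[-0.5000, 0.5000]`, so a parsed table mainly helps when comparing against runs with a different `update_speed`.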