Some very interesting results on diversity as well:
| arxiv_cs | Pairwise diversity | 0.895 | **0.901** | +0.7% |

Additional experiment (done after quantization, so it affects further training but not existing quants): initializing the `<think>`/`</think>` tokens in embedding space.

Before: the two embeddings were identical (cos=1.0) at 0.3x the average embedding norm, untrained.
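
For context, those cos/norm numbers can be read straight off the embedding matrix. A minimal PyTorch sketch, assuming a Hugging Face-style `model` and `tokenizer` that already contain the two tokens (names here are illustrative):

```python
import torch.nn.functional as F

# Illustrative: `model`/`tokenizer` are assumed to be a HF causal LM and its
# tokenizer, with <think> and </think> already present as added tokens.
emb = model.get_input_embeddings().weight                # [vocab, dim]
a_id, b_id = tokenizer.convert_tokens_to_ids(["<think>", "</think>"])
a, b = emb[a_id], emb[b_id]

cos = F.cosine_similarity(a, b, dim=0).item()            # 1.0 = identical direction
rel_norm = (a.norm() / emb.norm(dim=1).mean()).item()    # e.g. ~0.3x before training
print(f"cos={cos:.2f}, norm={rel_norm:.2f}x avg")
```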

The two vectors were then optimized via AdamW on GSM8k reasoning traces with a 3-shot prefix, with the loss taken over the reasoning+answer tokens and the norms clamped to 1.5x the average embedding norm.
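
A rough sketch of what this embedding-only optimization could look like. The data pipeline (`gsm8k_traces`), label masking, and learning rate are assumptions for illustration, not the exact recipe:

```python
import torch

# Sketch only: hyperparameters and the `gsm8k_traces` loader are hypothetical.
emb = model.get_input_embeddings().weight
think_ids = tokenizer.convert_tokens_to_ids(["<think>", "</think>"])

for p in model.parameters():                 # freeze the whole model...
    p.requires_grad_(False)
emb.requires_grad_(True)                     # ...except the embedding matrix

# Mask gradients so only the two <think>/</think> rows ever update.
mask = torch.zeros_like(emb)
mask[think_ids] = 1.0
emb.register_hook(lambda g: g * mask)

opt = torch.optim.AdamW([emb], lr=1e-3)
max_norm = 1.5 * emb.detach().norm(dim=1).mean()   # clamp target: 1.5x avg norm

for batch in gsm8k_traces:                   # 3-shot prefix + reasoning + answer
    # labels are -100 on the prefix, so the loss covers reasoning+answer tokens
    loss = model(input_ids=batch["input_ids"], labels=batch["labels"]).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
    with torch.no_grad():                    # re-clamp rows that drifted past 1.5x
        norms = emb[think_ids].norm(dim=1, keepdim=True)
        emb[think_ids] *= (max_norm / norms).clamp(max=1.0)
```

Renormalizing after each step (rather than penalizing norm in the loss) is one simple way to enforce a hard 1.5x cap.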

After optimization: two distinct vectors (cos=0.07) at 1.5x norm. GSM8k 3-shot accuracy improved to 96.7% (29/30) vs 90.0% with the original embeddings, and CE loss improved by 7.8% on a held-out eval.

This is a merge of pre-trained language models created using [mergekit](https://github.com/cg123/mergekit).

## Merge Details