---
base_model:
- Lambent/Qwen3-4B-Base-Continued-GRPO-Wave
library_name: transformers
tags:
- mergekit
- merge
license: apache-2.0
---

For this one...

... (over)trained a SmolLM2-360M for 5 epochs, with the LR and rank swept per target domain, to fit each domain's style,
then rewarded the model for lowering perplexity under that proxy model.
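
As a sketch of that reward (my reconstruction, not the exact training code; the per-token log-prob interface is an assumption): the reward for a completion is its mean log-likelihood under the style-tuned proxy, i.e. -log(perplexity), with GRPO's usual per-group normalization on top:

```python
import math

def proxy_reward(token_logprobs):
    """Mean log-likelihood of a completion under the style-tuned proxy
    (SmolLM2-360M). Equals -log(perplexity), so lower proxy perplexity
    means higher reward. `token_logprobs` is assumed to be the per-token
    log-probs the proxy assigns to the completion tokens."""
    if not token_logprobs:
        return 0.0
    return sum(token_logprobs) / len(token_logprobs)

def grpo_advantages(rewards):
    """GRPO-style group normalization: advantage = (r - mean) / std,
    computed over the group of completions sampled for one prompt."""
    mu = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards))
    return [(r - mu) / (std if std > 0 else 1.0) for r in rewards]
```

The group normalization means only relative proxy perplexity within a sampled batch matters, which keeps the reward scale stable across domains.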
In this case, I trained an adapter per domain and then Karcher-merged them.
I'm not sure any individual domain had a notably different effect; they all produced essentially the same eval results.
However, the Karcher combination of them seems to have significantly lowered perplexity on lambada_openai, which is interesting enough to publish.

Additionally, I attempted to implement MARA (from https://im-ant.github.io/mara/) on the GRPO side to help preserve distribution entropy, though I'm unsure how correctly or usefully that was done.

| Task | Metric | Qwen3-4B-Base | GRPO-Merge | Δ Base | GRPO-Wave | Δ Base | Δ Merge | Style-Karcher | Δ Base | Δ Wave |
|:-----|:-------|:-------------:|:----------:|:------:|:---------:|:------:|:-------:|:-------------:|:------:|:------:|
| arc_easy | acc | 0.7891 | 0.7870 | -0.27% | 0.7912 | +0.27% | +0.53% | 0.7883 | -0.10% | -0.37% |
| arc_easy | acc_norm | 0.7609 | 0.7605 | -0.05% | 0.7643 | +0.45% | +0.50% | 0.7576 | -0.43% | -1.04% |
| lambada_openai | acc | 0.6912 | 0.6984 | +1.04% | 0.7006 | +1.36% | +0.31% | **0.7087** | **+2.53%** | +1.16% |
| lambada_openai | perplexity ↓ | 4.2433 | 4.0490 | -4.58% | 3.9616 | -6.64% | -2.16% | **3.8343** | **-9.63%** | -3.21% |
| openbookqa | acc | 0.3160 | 0.3180 | +0.63% | 0.3180 | +0.63% | ±0.00% | 0.3160 | ±0.00% | -0.63% |
| openbookqa | acc_norm | 0.4100 | 0.4120 | +0.49% | 0.4100 | ±0.00% | -0.49% | 0.4080 | -0.49% | -0.49% |
| piqa | acc | 0.7797 | 0.7807 | +0.13% | 0.7813 | +0.21% | +0.08% | 0.7786 | -0.14% | -0.35% |
| piqa | acc_norm | 0.7807 | 0.7807 | ±0.00% | 0.7813 | +0.08% | +0.08% | 0.7807 | ±0.00% | -0.08% |

Some very interesting results on diversity as well:

**Diversity Metrics (Qwen3-4B-Base vs Style-Karcher, temperature=1.0, 8 completions per prompt)**

| Domain | Metric | Base | Karcher | Δ |
|--------|--------|:----:|:-------:|:-:|
| ao3_english | Prefix entropy | 3.309 | 3.238 | -2.1% |
| ao3_english | Distinct-1 | 0.618 | **0.683** | **+10.5%** |
| ao3_english | Distinct-2 | 0.962 | **0.984** | +2.3% |
| ao3_english | Pairwise diversity | 0.919 | **0.932** | +1.4% |
| github_python | Prefix entropy | 1.514 | 1.456 | -3.8% |
| github_python | Distinct-1 | 0.610 | **0.624** | +2.3% |
| github_python | Distinct-2 | 0.890 | 0.876 | -1.6% |
| github_python | Pairwise diversity | 0.933 | 0.933 | ±0.0% |
| wikipedia_english | Prefix entropy | 1.974 | 1.892 | -4.2% |
| wikipedia_english | Distinct-1 | 0.599 | 0.559 | -6.7% |
| wikipedia_english | Distinct-2 | 0.932 | 0.898 | -3.6% |
| wikipedia_english | Pairwise diversity | 0.907 | 0.900 | -0.8% |
| bbc_news | Prefix entropy | 2.252 | 2.186 | -2.9% |
| bbc_news | Distinct-1 | 0.557 | **0.577** | +3.6% |
| bbc_news | Distinct-2 | 0.949 | **0.951** | +0.3% |
| bbc_news | Pairwise diversity | 0.901 | **0.908** | +0.8% |
| arxiv_cs | Prefix entropy | 2.455 | 2.346 | -4.4% |
| arxiv_cs | Distinct-1 | 0.555 | **0.567** | +2.3% |
| arxiv_cs | Distinct-2 | 0.905 | **0.906** | +0.2% |
| arxiv_cs | Pairwise diversity | 0.895 | **0.901** | +0.7% |
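
For reference, these metrics can be computed roughly as follows (assumed formulations, since the exact eval code isn't included here): Distinct-n pools n-grams across the 8 completions, pairwise diversity here is taken as one minus the mean pairwise Jaccard similarity over token sets, and prefix entropy as the Shannon entropy of the opening-token prefixes:

```python
import math
from collections import Counter
from itertools import combinations

def distinct_n(completions, n):
    """Unique n-grams / total n-grams, pooled over all completions."""
    grams = Counter()
    for toks in completions:
        grams.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    total = sum(grams.values())
    return len(grams) / total if total else 0.0

def pairwise_diversity(completions):
    """1 - mean pairwise Jaccard similarity over unigram sets."""
    sets = [set(t) for t in completions]
    sims = [len(a & b) / len(a | b) for a, b in combinations(sets, 2) if a | b]
    return 1.0 - (sum(sims) / len(sims)) if sims else 0.0

def prefix_entropy(completions, k=3):
    """Shannon entropy (nats) of the first-k-token prefixes across completions."""
    counts = Counter(tuple(t[:k]) for t in completions)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

Under these definitions, higher Distinct-n and pairwise diversity mean less repetition across samples, while lower prefix entropy means the model commits to fewer distinct openings.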

Additional experiment (performed after quantization, so it should affect further training but not existing quants): initializing the \<think\> and \</think\> tokens in embedding space.

The original embeddings were identical (cos = 1.0) at 0.3x the average norm, i.e. untrained.

Optimized via AdamW on GSM8k reasoning traces with a 3-shot prefix, with loss on the reasoning+answer tokens and the norm clamped to 1.5x the average embedding norm.

After: two distinct vectors (cos = 0.07) at 1.5x norm.
GSM8k 3-shot accuracy: 96.7% (29/30) vs 90.0% with the original embeddings.
CE loss improvement: +7.8% on held-out eval.
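
The cosine check and the norm clamp are easy to reproduce; a minimal sketch (the actual run optimized the two embedding rows with AdamW, this shows only the measurement and projection steps):

```python
import math

def cosine(a, b):
    """Cosine similarity, used to check that the <think>/</think> embeddings
    went from identical (cos = 1.0) to distinct (cos = 0.07)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def clamp_norm(vec, max_norm):
    """Project a vector back onto the norm ball after each optimizer step,
    keeping the trained embeddings at <= 1.5x the average embedding norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm <= max_norm or norm == 0.0:
        return list(vec)
    return [v * (max_norm / norm) for v in vec]
```

Here `max_norm` would be 1.5x the mean row norm of the embedding matrix; everything else about the training loop is standard AdamW.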

This is a merge of pre-trained language models created using [mergekit](https://github.com/cg123/mergekit).

## Merge Details

### Merge Method

This model was merged using the [Karcher Mean](https://en.wikipedia.org/wiki/Karcher_mean) merge method.
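
As a toy illustration of what the Karcher mean does (a simplified sketch of the idea, not mergekit's implementation): on the unit sphere it is the point minimizing mean squared geodesic distance to the inputs, found by the standard log-map/exp-map fixed-point iteration:

```python
import math

def _norm(v):
    return math.sqrt(sum(x * x for x in v))

def karcher_mean_sphere(vectors, iters=100, tol=1e-10):
    """Karcher (Riemannian) mean of vectors projected to the unit sphere:
    average the log-maps of the inputs at the current estimate, then
    exp-map that average back onto the sphere. Assumes the inputs are
    not so spread out (e.g. antipodal) that the mean is ill-defined."""
    units = [[x / _norm(v) for x in v] for v in vectors]
    # initialize at the normalized Euclidean mean
    mu = [sum(c) / len(units) for c in zip(*units)]
    n = _norm(mu)
    mu = [x / n for x in mu]
    for _ in range(iters):
        tangent = [0.0] * len(mu)
        for u in units:
            dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(mu, u))))
            theta = math.acos(dot)
            if theta < 1e-12:
                continue  # u coincides with mu; zero tangent contribution
            # log map of u at mu: theta * (u - dot*mu) / ||u - dot*mu||
            diff = [a - dot * b for a, b in zip(u, mu)]
            dn = _norm(diff)
            tangent = [t + theta * d / (dn * len(units)) for t, d in zip(tangent, diff)]
        tn = _norm(tangent)
        if tn < tol:
            break
        # exp map: move along the geodesic in the averaged tangent direction
        mu = [math.cos(tn) * m + math.sin(tn) * t / tn for m, t in zip(mu, tangent)]
    return mu
```

Unlike a plain weighted average of weights, this respects the directional geometry of the parameter tensors, which may be why the combination behaves differently from any single adapter.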

### Models Merged

The following models were included in the merge:
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-github_javascript-mara-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-arxiv_cs-mara-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-general-ao3style-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-ao3_english-mara-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-arxiv_math-mara-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-github_python-mara-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-wikipedia_english-mara-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-arxiv_physics-mara-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-github_cpp-mara-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-bbc_news-mara-360m
* ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave + ../rlvr-envs/grpo-github_markdown-mara-360m

### Configuration

The following YAML configuration was used to produce this model:

```yaml
models:
- model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-ao3_english-mara-360m
- model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-arxiv_cs-mara-360m
- model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-arxiv_math-mara-360m
- model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-arxiv_physics-mara-360m
- model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-bbc_news-mara-360m

- model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-github_cpp-mara-360m
- model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-github_javascript-mara-360m
- model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-github_markdown-mara-360m
- model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-github_python-mara-360m
- model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-wikipedia_english-mara-360m

- model: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave+../rlvr-envs/grpo-general-ao3style-360m
merge_method: karcher
dtype: bfloat16
tokenizer_source: ../rlvr-envs/Qwen3-4B-Base-Continued-GRPO-Wave
```