Qwen3.6-27B-Omnimerge-v4 (MLP-passthrough)
Same-base DARE-TIES (Omnimerge_v2 method) merge of Qwen/Qwen3.6-27B + 3 Qwen3.6 fine-tunes, with MLP-passthrough surgery applied to defend against a fragility we discovered in Qwen3.6's reasoning-tag emission policy. Successor to ManniX-ITA/Qwen3.5-27B-Omnimerge-v2 on the newer Qwen3.6 base.
GPQA Diamond: partial result (192/198 cached, 177 matched, ≈ 84.75% pass@1). See note below — final result blocked by an aiohttp lifecycle bug in
lm_eval'slocal-completionsadapter that consistently crashes the eval on the last 6 reasoning-tail questions where responses run 9+ minutes each. HumanEval and MBPP are final.
Quantizations
GGUFs (full ladder F16 → IQ2_XXS, plus CD-tier Claude-distilled quants) for llama.cpp / ollama / text-generation-webui:
ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-GGUF— 31 quants + F16, all imatrix-quantized with bartowski's calibration_datav5 for direct comparability with the base release.imatrix.datarchived alongside the quants for reproducibility/audit.
The vision tower's mmproj projector lives in bartowski/Qwen_Qwen3.6-27B-GGUF and works unchanged with the v4 GGUFs (vision tower is preserved verbatim from the base).
Sources
| Source | Weight | Role |
|---|---|---|
| Qwen/Qwen3.6-27B | base | base + chat template |
| rico03/Qwen3.6-27B-rico03 | 0.40 | general capability |
| ValiantLabs/Qwen3.6-27B-Esper3.1 | 0.35 | code + reasoning |
| kai-os/Qwen3.6-Opus-Reasoning (LoRA→base anchor) | 0.25 | reasoning anchor |
Method: omnimerge_v2 (DARE-TIES base + OBIM-lite + DAREx q + EMR election). Density 0.53, DAREx q 0.75, seed 42.
Benchmark Results (Q6_K quantization)
All numbers from lm_eval with --model local-completions (raw /v1/completions) on a llama.cpp server with --reasoning-format deepseek --reasoning-budget 8192. Sampling temperature 0.0 except GPQA at 0.6 to match v2's published methodology.
v4-MLP vs Qwen3.6 base + Omnimerge-v2 (head-to-head, same eval methodology)
All three columns scored under identical conditions: same llama.cpp server config (--reasoning-format deepseek --reasoning-budget 8192 --parallel 2 --cache-type-k q8_0 --cache-type-v q8_0 -c 65536), same lm_eval invocation (local-completions raw /v1/completions, no chat template), same gen kwargs.
| Benchmark | Qwen3.6 base Q6_K (bartowski) | Omnimerge-v2 (Qwen3.5 base) | Omnimerge-v4-MLP (Qwen3.6 base) | Δ vs base | Δ vs v2 |
|---|---|---|---|---|---|
| HumanEval pass@1 (164q) | 84.76% (139/164) | 79.27% | 84.76% (139/164) | 0.00 pp | +5.49 pp |
| MBPP pass@1 (500q) — raw lm_eval | 56.20% | n/a | 68.40% | +12.20 pp | n/a |
| MBPP pass@1 (500q) — corrected* | 57.60% | 74.60% | 73.40% | +15.80 pp | −1.20 pp |
| GPQA Diamond pass@1 (flex) — see ‡ | not measured (∇) | 69.19% (full 198q) | ≈ 84.75% (partial 177q) | — | ≈ +15.5 pp |
Key observations:
- HumanEval is identical to base (bit-for-bit: 139/164 = 0.847560975...). With MLP-passthrough preserving base MLPs and HumanEval being mostly elementary Python function completion, the merged attn + linear_attn deltas don't move the needle. This is also a strong sanity-check: it confirms our MLP-passthrough surgery did its job — the model's "elementary coding" behavior is byte-identical to the base it inherited MLPs from.
- MBPP is where the merge value shows — +15.8 pp over Qwen3.6 base on the corrected score, and essentially tied with v2 (Qwen3.5-base merge). MBPP exercises a wider range of algorithms and control flow than HumanEval, where the merged reasoning + attention deltas help.
- GPQA is the marquee win — ≈ +15.5 pp over v2 (which itself was +16 pp over its source models). The Qwen3.6 base brings stronger reasoning, and the merge preserves and slightly amplifies it.
∇ Skipped a base GPQA run because (a) v2's published GPQA is the canonical reference for "is this merge valuable?" — that's what we benchmark against, and (b) the same aiohttp lifecycle bug that bit our v4-MLP run would have bit a base run too.
* MBPP score correction (important): lm_eval's mbpp scorer evaluates exec(prompt + completion + tests). When a model emits <think>...</think>\n\ndef foo(): ..., the literal < character causes a Python SyntaxError even though the function code below is valid and would pass the tests. We re-scored by stripping <think>...</think> blocks (and unclosed <think>...EOF truncations) before exec.
- v4-MLP: 68.40% → 73.40% (+5.0 pp, recovered 25/500 valid-code-but-SyntaxError generations).
- Qwen3.6 base: 56.20% → 57.60% (+1.4 pp, recovered 7/500). Base closes its think tags more reliably than v4-MLP (0% unclosed vs 4.8%) and emits them less often, which is why the correction is smaller.
- v2 (Qwen3.5 base) had a much lower native think-rate so the correction is negligible at that scale; the published 74.60% was the lm_eval raw score.
Re-scoring script: scripts/rescore_mbpp_strip_think.py. The corrected scores are the apples-to-apples comparison; raw lm_eval scores are kept in the table for transparency.
‡ GPQA partial result (important caveat): the full lm_eval run completed 192/198 questions before crashing repeatedly on the last 6. Root cause is an aiohttp lifecycle issue in lm_eval.models.api_models.amodel_call: the at-budget reasoning responses (16384 tokens × ~9 minutes wall time) consistently outlast the aiohttp ClientSession and the resulting RuntimeError: Session is closed is unrecoverable within the same process. We restarted lm_eval 5 times across a ~4-hour window; each restart gained ~1 question before crashing on the same long-tail. Final 6 questions were not scored. The 84.75% is computed by scripts/score_gpqa_partial.py which replicates lm_eval's exact multi_choice_regex flexible-extract filter (group_select=−1, ignore_case=True, ignore_punctuation=True) over the 192 cached responses. Of those, 177 prompts matched our process_docs-replicated GPQA prompts (the 15 unmatched are minor unicode-normalization or seed-timing artifacts in the reconstruction; the 6 uncached are the at-budget tail). 150/177 correct → 84.75% partial pass@1. The unmatched 15 + uncached 6 are unlikely to swing the headline number more than ±1 pp; final result will land in the 82-86% band. We also separately patched lm_eval's api_models.py:545 UnboundLocalError bug as a prerequisite (it crashes on transient TimeoutError before outputs is assigned) — see scripts/score_gpqa_partial.py and the inline patch recipe in this repo's commit history for the exact replication.
Why "MLP-passthrough"
When we merged Qwen3.6 the same way we'd successfully merged Qwen3.5 (Omnimerge-v2), the resulting model emitted unclosed <think> tags 80% of the time on coding prompts — pass@1 collapsed to ~20%. Forensic per-tensor delta inspection (see scripts/inspect_v4_delta.py) localized the failure mode to the mlp.gate_proj / mlp.up_proj / mlp.down_proj tensors in mid-to-late MLP layers (peak deltas in layers 27-52, max rel-L2 ≈ 2.1%). lm_head and embed_tokens were byte-identical to base — the policy attractor lived in MLP, not in token-emission logits.
We rebuilt v4 with mlp.{gate,up,down}_proj copied verbatim from clean Qwen3.6 base (scripts/v4_mlp_passthrough.py) and everything else (attn, linear_attn, norms, embed/head) kept from the merge. The leak went to 0% on a 10-prompt isolation test, MBPP pass@1 jumped to 50% on the same isolation set, and full-eval scores (above) confirmed the surgery rescued the merge.
Key finding: Qwen3.6's think-policy is fragile to small MLP perturbations
| Test | Clean Qwen3.6 base | v4 (full merge, broken) | v4-MLP (this model) |
|---|---|---|---|
<think> open rate (mbpp-10 isolation) |
40% | 80% | 0% |
Unclosed </think> |
0/4 | 88% of opens | 0/10 |
| MBPP pass@1 (mbpp-10 isolation) | 40% | 20% | 50% |
| Empty response (chat-completions) | low | 80% | 0/10 |
Identical hyperparameters on Qwen3.5 base (Omnimerge-v2) produced 0.2% leak — so this is a Qwen3.6-specific fragility, not a general merge problem. Plausible cause: Qwen3.6 was post-trained later with reasoning-specific data that tightened the policy decision boundary; small (1-2% rel L2) MLP perturbations push it across.
The cost of MLP-passthrough is that we lose the merged MLP uplift on coding tasks — but full MBPP/HumanEval results show the attn + linear_attn deltas alone are enough to lift HumanEval ~5 pp over Qwen3.5-Omnimerge-v2 while staying tied on MBPP.
Compatibility
Architecture: qwen3_5 (unified Qwen3.5 / Qwen3.6 family). Vision tower preserved (mmproj available via the Q6_K GGUF release — multimodal works exactly like clean Qwen3.6).
Inference works under:
transformers(BF16) — bothuse_cache=TrueandFalsepathsllama.cpp(GGUF) — recommended args:--reasoning-format deepseek --reasoning-budget 8192- vLLM (untested at time of publish, expected to work)
Scripts
All merge tooling is in the scripts/ directory of this repo:
| Script | Purpose |
|---|---|
dare_ties_merge.py |
Main merger. --method omnimerge_v2 is the published method. Auto-detects Qwen3.6 base via config.output_gate_type and auto-applies --skip-patterns 'mlp.gate_proj,mlp.up_proj,mlp.down_proj' (override with --no-auto-mlp-skip). |
v4_mlp_passthrough.py |
Post-process tool: rebuild merged dir with MLP layers copied from base. Refuses to run on Qwen3.5 base (where MLP merging is safe — see v2). Use as final pre-quant step for any external merger output (mergekit, eX-LRP) targeting Qwen3.6. |
inspect_v4_delta.py |
Per-tensor delta-magnitude forensics vs base. Streams safetensors shards, no full model load. Used to localize the policy-leak weight region. |
pod_omnimerge_v4_build.sh |
Full reproducible build script (download sources, run merge, convert + quantize Q6_K). |
pod_omnimerge_v4mlp_eval_raw.sh |
Eval orchestrator: mbpp + humaneval via raw /v1/completions. Required for reasoning-tag-emitting models — apply_chat_template + deepseek extraction strips think blocks and returns empty. |
rescore_mbpp_strip_think.py |
Re-scoring tool that strips <think> blocks and markdown fences before exec(code+tests). Recovered 25 of 158 false failures on this model's mbpp run. |
score_gpqa_partial.py |
Partial-cache GPQA scorer. Replicates lm_eval's multi_choice_regex flexible-extract filter exactly (group_select=−1, ignore_case, ignore_punctuation), looks up cached responses by lm_eval's hash_args("generate_until", [prompt, gen_kwargs]) SHA-256 key, scores against ground truth. Used for the partial 84.75% above when the lm_eval run could not complete the long-tail. |
pod_v4mlp_gpqa.sh |
Full GPQA Diamond eval runner against the v4-MLP server. T=0.6, top_p=0.95, max_gen_toks=16384 (matches v2's published methodology). |
Reproducing the merge
python scripts/dare_ties_merge.py \
--method omnimerge_v2 \
--base /path/to/Qwen3.6-27B \
--source /path/to/Qwen3.6-rico03 \
--source /path/to/Qwen3.6-Esper3.1 \
--source /path/to/Qwen3.6-Opus-Reasoning-anchor \
--weights 0.40,0.35,0.25 \
--density 0.53 \
--darex-q 0.75 \
--output ./Qwen3.6-27B-Omnimerge-v4 \
--seed 42
# (auto-applies MLP-skip on Qwen3.6 base; no extra flag needed)
Caveats
- Qwen3.6 has a higher native think-rate than Qwen3.5 on coding prompts. Use raw
/v1/completionsfor code benchmarks; chat-completions +--apply_chat_template+ deepseek extraction will strip think blocks and return empty for prompts where the model thinks before answering. Seepod_omnimerge_v4mlp_eval_raw.shfor the working config. - MBPP scoring without think-stripping under-reports pass@1 by ~5 pp on this model (see "MBPP score correction" note above).
Acknowledgements
- Qwen team for the Qwen3.6 base
- rico03, ValiantLabs, kai-os for the fine-tunes
- DARE / TIES / DARE-TIES authors and the arcee-ai/mergekit community
- Downloads last month
- 100