Qwen3.6-27B-Omnimerge-v4 (MLP-passthrough)

Same-base DARE-TIES (Omnimerge_v2 method) merge of Qwen/Qwen3.6-27B + 3 Qwen3.6 fine-tunes, with MLP-passthrough surgery applied to defend against a fragility we discovered in Qwen3.6's reasoning-tag emission policy. Successor to ManniX-ITA/Qwen3.5-27B-Omnimerge-v2 on the newer Qwen3.6 base.

GPQA Diamond: partial result (192/198 cached, 177 matched, ≈ 84.75% pass@1). See note below — final result blocked by an aiohttp lifecycle bug in lm_eval's local-completions adapter that consistently crashes the eval on the last 6 reasoning-tail questions where responses run 9+ minutes each. HumanEval and MBPP are final.

Quantizations

GGUFs (full ladder F16 → IQ2_XXS, plus CD-tier Claude-distilled quants) for llama.cpp / ollama / text-generation-webui:

ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-GGUF — 31 quants + F16, all imatrix-quantized with bartowski's calibration_datav5 for direct comparability with the base release. imatrix.dat archived alongside the quants for reproducibility/audit.

The vision tower's mmproj projector lives in bartowski/Qwen_Qwen3.6-27B-GGUF and works unchanged with the v4 GGUFs (vision tower is preserved verbatim from the base).

Sources

Source Weight Role
Qwen/Qwen3.6-27B base base + chat template
rico03/Qwen3.6-27B-rico03 0.40 general capability
ValiantLabs/Qwen3.6-27B-Esper3.1 0.35 code + reasoning
kai-os/Qwen3.6-Opus-Reasoning (LoRA→base anchor) 0.25 reasoning anchor

Method: omnimerge_v2 (DARE-TIES base + OBIM-lite + DAREx q + EMR election). Density 0.53, DAREx q 0.75, seed 42.

Benchmark Results (Q6_K quantization)

All numbers from lm_eval with --model local-completions (raw /v1/completions) on a llama.cpp server with --reasoning-format deepseek --reasoning-budget 8192. Sampling temperature 0.0 except GPQA at 0.6 to match v2's published methodology.

v4-MLP vs Qwen3.6 base + Omnimerge-v2 (head-to-head, same eval methodology)

All three columns scored under identical conditions: same llama.cpp server config (--reasoning-format deepseek --reasoning-budget 8192 --parallel 2 --cache-type-k q8_0 --cache-type-v q8_0 -c 65536), same lm_eval invocation (local-completions raw /v1/completions, no chat template), same gen kwargs.

Benchmark Qwen3.6 base Q6_K (bartowski) Omnimerge-v2 (Qwen3.5 base) Omnimerge-v4-MLP (Qwen3.6 base) Δ vs base Δ vs v2
HumanEval pass@1 (164q) 84.76% (139/164) 79.27% 84.76% (139/164) 0.00 pp +5.49 pp
MBPP pass@1 (500q) — raw lm_eval 56.20% n/a 68.40% +12.20 pp n/a
MBPP pass@1 (500q) — corrected* 57.60% 74.60% 73.40% +15.80 pp −1.20 pp
GPQA Diamond pass@1 (flex) — see ‡ not measured (∇) 69.19% (full 198q) ≈ 84.75% (partial 177q) ≈ +15.5 pp

Key observations:

  • HumanEval is identical to base (bit-for-bit: 139/164 = 0.847560975...). With MLP-passthrough preserving base MLPs and HumanEval being mostly elementary Python function completion, the merged attn + linear_attn deltas don't move the needle. This is also a strong sanity-check: it confirms our MLP-passthrough surgery did its job — the model's "elementary coding" behavior is byte-identical to the base it inherited MLPs from.
  • MBPP is where the merge value shows — +15.8 pp over Qwen3.6 base on the corrected score, and essentially tied with v2 (Qwen3.5-base merge). MBPP exercises a wider range of algorithms and control flow than HumanEval, where the merged reasoning + attention deltas help.
  • GPQA is the marquee win — ≈ +15.5 pp over v2 (which itself was +16 pp over its source models). The Qwen3.6 base brings stronger reasoning, and the merge preserves and slightly amplifies it.

∇ Skipped a base GPQA run because (a) v2's published GPQA is the canonical reference for "is this merge valuable?" — that's what we benchmark against, and (b) the same aiohttp lifecycle bug that bit our v4-MLP run would have bit a base run too.

* MBPP score correction (important): lm_eval's mbpp scorer evaluates exec(prompt + completion + tests). When a model emits <think>...</think>\n\ndef foo(): ..., the literal < character causes a Python SyntaxError even though the function code below is valid and would pass the tests. We re-scored by stripping <think>...</think> blocks (and unclosed <think>...EOF truncations) before exec.

  • v4-MLP: 68.40% → 73.40% (+5.0 pp, recovered 25/500 valid-code-but-SyntaxError generations).
  • Qwen3.6 base: 56.20% → 57.60% (+1.4 pp, recovered 7/500). Base closes its think tags more reliably than v4-MLP (0% unclosed vs 4.8%) and emits them less often, which is why the correction is smaller.
  • v2 (Qwen3.5 base) had a much lower native think-rate so the correction is negligible at that scale; the published 74.60% was the lm_eval raw score.

Re-scoring script: scripts/rescore_mbpp_strip_think.py. The corrected scores are the apples-to-apples comparison; raw lm_eval scores are kept in the table for transparency.

GPQA partial result (important caveat): the full lm_eval run completed 192/198 questions before crashing repeatedly on the last 6. Root cause is an aiohttp lifecycle issue in lm_eval.models.api_models.amodel_call: the at-budget reasoning responses (16384 tokens × ~9 minutes wall time) consistently outlast the aiohttp ClientSession and the resulting RuntimeError: Session is closed is unrecoverable within the same process. We restarted lm_eval 5 times across a ~4-hour window; each restart gained ~1 question before crashing on the same long-tail. Final 6 questions were not scored. The 84.75% is computed by scripts/score_gpqa_partial.py which replicates lm_eval's exact multi_choice_regex flexible-extract filter (group_select=−1, ignore_case=True, ignore_punctuation=True) over the 192 cached responses. Of those, 177 prompts matched our process_docs-replicated GPQA prompts (the 15 unmatched are minor unicode-normalization or seed-timing artifacts in the reconstruction; the 6 uncached are the at-budget tail). 150/177 correct → 84.75% partial pass@1. The unmatched 15 + uncached 6 are unlikely to swing the headline number more than ±1 pp; final result will land in the 82-86% band. We also separately patched lm_eval's api_models.py:545 UnboundLocalError bug as a prerequisite (it crashes on transient TimeoutError before outputs is assigned) — see scripts/score_gpqa_partial.py and the inline patch recipe in this repo's commit history for the exact replication.

Why "MLP-passthrough"

When we merged Qwen3.6 the same way we'd successfully merged Qwen3.5 (Omnimerge-v2), the resulting model emitted unclosed <think> tags 80% of the time on coding prompts — pass@1 collapsed to ~20%. Forensic per-tensor delta inspection (see scripts/inspect_v4_delta.py) localized the failure mode to the mlp.gate_proj / mlp.up_proj / mlp.down_proj tensors in mid-to-late MLP layers (peak deltas in layers 27-52, max rel-L2 ≈ 2.1%). lm_head and embed_tokens were byte-identical to base — the policy attractor lived in MLP, not in token-emission logits.

We rebuilt v4 with mlp.{gate,up,down}_proj copied verbatim from clean Qwen3.6 base (scripts/v4_mlp_passthrough.py) and everything else (attn, linear_attn, norms, embed/head) kept from the merge. The leak went to 0% on a 10-prompt isolation test, MBPP pass@1 jumped to 50% on the same isolation set, and full-eval scores (above) confirmed the surgery rescued the merge.

Key finding: Qwen3.6's think-policy is fragile to small MLP perturbations

Test Clean Qwen3.6 base v4 (full merge, broken) v4-MLP (this model)
<think> open rate (mbpp-10 isolation) 40% 80% 0%
Unclosed </think> 0/4 88% of opens 0/10
MBPP pass@1 (mbpp-10 isolation) 40% 20% 50%
Empty response (chat-completions) low 80% 0/10

Identical hyperparameters on Qwen3.5 base (Omnimerge-v2) produced 0.2% leak — so this is a Qwen3.6-specific fragility, not a general merge problem. Plausible cause: Qwen3.6 was post-trained later with reasoning-specific data that tightened the policy decision boundary; small (1-2% rel L2) MLP perturbations push it across.

The cost of MLP-passthrough is that we lose the merged MLP uplift on coding tasks — but full MBPP/HumanEval results show the attn + linear_attn deltas alone are enough to lift HumanEval ~5 pp over Qwen3.5-Omnimerge-v2 while staying tied on MBPP.

Compatibility

Architecture: qwen3_5 (unified Qwen3.5 / Qwen3.6 family). Vision tower preserved (mmproj available via the Q6_K GGUF release — multimodal works exactly like clean Qwen3.6).

Inference works under:

  • transformers (BF16) — both use_cache=True and False paths
  • llama.cpp (GGUF) — recommended args: --reasoning-format deepseek --reasoning-budget 8192
  • vLLM (untested at time of publish, expected to work)

Scripts

All merge tooling is in the scripts/ directory of this repo:

Script Purpose
dare_ties_merge.py Main merger. --method omnimerge_v2 is the published method. Auto-detects Qwen3.6 base via config.output_gate_type and auto-applies --skip-patterns 'mlp.gate_proj,mlp.up_proj,mlp.down_proj' (override with --no-auto-mlp-skip).
v4_mlp_passthrough.py Post-process tool: rebuild merged dir with MLP layers copied from base. Refuses to run on Qwen3.5 base (where MLP merging is safe — see v2). Use as final pre-quant step for any external merger output (mergekit, eX-LRP) targeting Qwen3.6.
inspect_v4_delta.py Per-tensor delta-magnitude forensics vs base. Streams safetensors shards, no full model load. Used to localize the policy-leak weight region.
pod_omnimerge_v4_build.sh Full reproducible build script (download sources, run merge, convert + quantize Q6_K).
pod_omnimerge_v4mlp_eval_raw.sh Eval orchestrator: mbpp + humaneval via raw /v1/completions. Required for reasoning-tag-emitting models — apply_chat_template + deepseek extraction strips think blocks and returns empty.
rescore_mbpp_strip_think.py Re-scoring tool that strips <think> blocks and markdown fences before exec(code+tests). Recovered 25 of 158 false failures on this model's mbpp run.
score_gpqa_partial.py Partial-cache GPQA scorer. Replicates lm_eval's multi_choice_regex flexible-extract filter exactly (group_select=−1, ignore_case, ignore_punctuation), looks up cached responses by lm_eval's hash_args("generate_until", [prompt, gen_kwargs]) SHA-256 key, scores against ground truth. Used for the partial 84.75% above when the lm_eval run could not complete the long-tail.
pod_v4mlp_gpqa.sh Full GPQA Diamond eval runner against the v4-MLP server. T=0.6, top_p=0.95, max_gen_toks=16384 (matches v2's published methodology).

Reproducing the merge

python scripts/dare_ties_merge.py \
    --method omnimerge_v2 \
    --base /path/to/Qwen3.6-27B \
    --source /path/to/Qwen3.6-rico03 \
    --source /path/to/Qwen3.6-Esper3.1 \
    --source /path/to/Qwen3.6-Opus-Reasoning-anchor \
    --weights 0.40,0.35,0.25 \
    --density 0.53 \
    --darex-q 0.75 \
    --output ./Qwen3.6-27B-Omnimerge-v4 \
    --seed 42
# (auto-applies MLP-skip on Qwen3.6 base; no extra flag needed)

Caveats

  • Qwen3.6 has a higher native think-rate than Qwen3.5 on coding prompts. Use raw /v1/completions for code benchmarks; chat-completions + --apply_chat_template + deepseek extraction will strip think blocks and return empty for prompts where the model thinks before answering. See pod_omnimerge_v4mlp_eval_raw.sh for the working config.
  • MBPP scoring without think-stripping under-reports pass@1 by ~5 pp on this model (see "MBPP score correction" note above).

Acknowledgements

Downloads last month
100
Safetensors
Model size
28B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ManniX-ITA/Qwen3.6-27B-Omnimerge-v4

Merge model
this model
Quantizations
1 model