GGUF quantizations available at ManniX-ITA/Qwen3.5-27B-Omnimerge-GGUF

Qwen3.5-27B-Omnimerge

A three-way Task Arithmetic weight-space merge of Qwen3.5-27B reasoning-distilled fine-tunes, built with a custom Python merger because mergekit does not currently support the Qwen3.5 hybrid architecture (Qwen3_5ForConditionalGeneration with linear-attention layers).

This experimental 3-way Task Arithmetic merge outperforms its best source model (Claude-4.6-Opus-Reasoning-Distilled) on most tested benchmarks: +8.08 pp on GPQA Diamond reasoning, +3.66 pp on HumanEval, and +0.6 pp on MBPP, while trailing it slightly on GSM8K (−3.0 pp). Published as a research artifact demonstrating that weight-space merging can improve reasoning and code capabilities simultaneously.

Source models and weights

| Source | Weight | Focus |
|---|---|---|
| Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled | 0.40 | Claude 4.6 Opus reasoning-trace distillation |
| ValiantLabs/Qwen3.5-27B-Esper3.1 | 0.35 | Code / DevOps specialist |
| DavidAU/Qwen3.5-27B-Gemini3-Pro-High-Reasoning-Compact-Thinking | 0.25 | Gemini 3 Pro reasoning-trace distillation, compact thinking |

Base: Qwen/Qwen3.5-27B

A fourth source, ConicCat/Qwen3.5-27B-Writer-V2 (creative writing), was tried in an earlier 4-way equal-weight DARE-TIES experiment (HumanEval pass@1 = 80.49%, below every constituent). It was dropped from this release because its creative-writing training direction actively interfered with the reasoning axis — the 3-way reasoning-only set performed better on both benchmarks.

Method

The custom merger: dare_ties_merge.py

Included in this repo (dare_ties_merge.py) — a minimal, architecture-agnostic PyTorch merger that processes tensors one at a time using name-matching, so it works on any architecture as long as tensors match by name between the base and the sources. Qwen3.5's hybrid language-model + vision-tower + linear_attn (SSM) layers are all handled the same way: read base tensor, read the same tensor from each source, compute the delta, apply the merge rule, write back.
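The core loop described above can be sketched as follows — a simplified, in-memory stand-in for the real script, where plain dicts of tensors replace safetensors shards, only the task-arithmetic rule is shown, and `merge_by_name` is an illustrative helper, not the script's actual API:

```python
# Simplified sketch of the name-matching merge loop. Dicts of tensors
# stand in for safetensors shards; only task arithmetic is shown.
import torch

def merge_by_name(base, sources, weights):
    merged = {}
    for name, b in base.items():
        acc = b.to(torch.float32)            # fp32 math, as in the merger
        for src, w in zip(sources, weights):
            if name in src:                  # tensors are matched purely by name
                acc = acc + w * (src[name].to(torch.float32) - b.to(torch.float32))
        merged[name] = acc.to(torch.bfloat16)  # bf16 on disk
    return merged

base = {"model.layers.0.mlp.weight": torch.zeros(4)}
sources = [{"model.layers.0.mlp.weight": torch.ones(4)},
           {"model.layers.0.mlp.weight": torch.full((4,), 2.0)}]
out = merge_by_name(base, sources, [0.5, 0.25])
# each element: 0 + 0.5*(1-0) + 0.25*(2-0) = 1.0
```

Because the rule never looks at tensor names beyond matching them, the same loop covers attention, SSM, and vision-tower weights alike.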

Supports four merge methods via --method:

  • dare_ties — DARE drop (uniform Bernoulli) + TIES sign consensus (the original; our v1 baseline)
  • dare_linear — DARE drop + weighted linear, no sign consensus
  • task_arithmetic — no drop, no sign consensus, just base + Σ(wᵢ · Δᵢ); the method used for this release
  • della — DELLA MAGPRUNE magnitude-ranked drop + TIES sign consensus
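The differences between these rules can be seen on toy deltas. This is an illustrative sketch only: the real script applies the rules chunk-wise in fp32, `dare_ties` composes the drop with the consensus step, and the sign-consensus shown is a simplified TIES-style variant:

```python
# Toy 1-D versions of the per-delta merge rules (illustrative, not the
# script's implementation).
import torch

def dare_drop(delta, p=0.5, gen=None):
    # DARE: drop each element with probability p, rescale survivors by 1/(1-p)
    mask = torch.bernoulli(torch.full_like(delta, 1 - p), generator=gen)
    return delta * mask / (1 - p)

def ties_merge(deltas, weights):
    # Simplified TIES-style sign consensus: elect the dominant per-element
    # sign from the weighted deltas, then sum only the deltas that agree.
    stacked = torch.stack([w * d for d, w in zip(deltas, weights)])
    elected = torch.sign(stacked.sum(dim=0))
    agree = torch.sign(stacked) == elected
    return (stacked * agree).sum(dim=0)

def task_arithmetic(deltas, weights):
    # No drop, no consensus: plain weighted sum (this release's method).
    return sum(w * d for d, w in zip(deltas, weights))

d1 = torch.tensor([1.0, 3.0])
d2 = torch.tensor([1.0, -1.0])
# task arithmetic keeps the conflicting -1; TIES discards it
print(task_arithmetic([d1, d2], [1.0, 1.0]))  # tensor([2., 2.])
print(ties_merge([d1, d2], [1.0, 1.0]))       # tensor([2., 3.])
```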

Key design points of the merger:

  • Chunked processing: each tensor is flattened and processed in 50M-element chunks in fp32, so peak RAM per tensor is bounded regardless of the tensor's total size. Lets us merge 27B models on a single 3090-class box via CPU RAM.
  • Weighted source contributions: --weights "0.40,0.35,0.25" lets you bias toward the best source. For TIES-style methods, weights are applied before sign election, which means small weights effectively silence a source.
  • Architecture agnostic: the merger iterates over the base model's model.safetensors.index.json weight map, loads the same-named tensor from each source via safetensors.safe_open, and merges by name. No per-architecture template files required.
  • Deterministic seed: --seed 42 makes runs reproducible for DARE / DELLA variants; task_arithmetic has no randomness and is bit-exact reproducible regardless of seed.
  • bf16 output, fp32 math: numerical stability during the merge, native dtype on disk.
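The chunking idea in miniature (chunk size shrunk from 50M elements to keep the example readable; `merge_tensor_chunked` is an illustrative name, and only the task-arithmetic rule is shown):

```python
# Sketch of bounded-memory chunked merging: flatten, process fixed-size
# slices in fp32, write back in bf16, restore the original shape.
import torch

def merge_tensor_chunked(base, sources, weights, chunk=4):
    flat_base = base.flatten()
    out = torch.empty(flat_base.numel(), dtype=torch.bfloat16)
    for start in range(0, flat_base.numel(), chunk):
        sl = slice(start, start + chunk)
        acc = flat_base[sl].to(torch.float32)       # fp32 math per chunk
        for src, w in zip(sources, weights):
            acc += w * (src.flatten()[sl].to(torch.float32)
                        - flat_base[sl].to(torch.float32))
        out[sl] = acc.to(torch.bfloat16)            # bf16 on disk
    return out.view(base.shape)

merged = merge_tensor_chunked(torch.zeros(2, 5), [torch.ones(2, 5)], [0.4], chunk=4)
# every element ≈ 0.4 (bf16), shape restored to (2, 5)
```

Peak working memory is one fp32 chunk per source plus the output tensor, independent of the tensor's total size.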

This specific merge

```shell
python3 dare_ties_merge.py \
    --method task_arithmetic \
    --base Qwen/Qwen3.5-27B \
    --source Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled \
    --source ValiantLabs/Qwen3.5-27B-Esper3.1 \
    --source DavidAU/Qwen3.5-27B-Gemini3-Pro-High-Reasoning-Compact-Thinking \
    --weights 0.40,0.35,0.25 \
    --output merged_omnimerge \
    --seed 42 --shard-size 5 --dtype bfloat16
```

The task_arithmetic formula is just:

merged = base + Σᵢ wᵢ · (sourceᵢ − base)

No drop mask, no sign consensus. Literature (Ilharco et al. 2023, Yadav et al. 2023, and arXiv 2511.21437) consistently finds this is the most reliable method when merging multiple fine-tunes of the same base — precisely because it doesn't throw away delta information.
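Plugging this release's weights into the formula for a single scalar weight makes the arithmetic concrete (the tensor values here are invented purely for illustration):

```python
# One scalar weight merged with this release's source weights.
base, s1, s2, s3 = 1.0, 1.2, 0.8, 1.0   # made-up values
w = (0.40, 0.35, 0.25)

merged = base + w[0] * (s1 - base) + w[1] * (s2 - base) + w[2] * (s3 - base)
print(round(merged, 4))  # 1.01  (= 1 + 0.08 - 0.07 + 0)
```

Opposing deltas partially cancel rather than being dropped or sign-filtered, which is exactly the "keep all delta information" property the cited papers credit for the method's reliability.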

Evaluation

Both the merge and the best-scoring source (Claude-4.6-Opus-Reasoning-Distilled) were evaluated at Q6_K via llama.cpp server in Qwen3 reasoning mode (--jinja --reasoning-format deepseek --reasoning-budget 16384 --temp 0.6 --top-p 0.95 --top-k 20 --dry-multiplier 0.5) using lm-eval-harness.
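With the llama.cpp server from the Inference section running, the HumanEval run looks roughly like this. This is a hedged sketch: `local-completions` and `base_url` are real lm-eval-harness options, but exact flags vary across harness versions, and the endpoint assumes the llama.cpp default port 8080:

```shell
lm_eval --model local-completions \
    --model_args model=Qwen3.5-27B-Omnimerge,base_url=http://127.0.0.1:8080/v1/completions,num_concurrent=1 \
    --tasks humaneval \
    --confirm_run_unsafe_code
```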

HumanEval pass@1 (164 questions, Q6_K)

Raw text-completion API (local-completions / /v1/completions) to bypass Qwen3.5 chat-mode markdown-fence behavior that causes exec(prompt+gen) to fail.
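The failure mode and the workaround in miniature (an illustration only; `strip_fences` is not the harness's actual filter):

```python
# Chat-mode Qwen3.5 wraps generated code in markdown fences, so
# exec(prompt + gen) raises a SyntaxError unless the fences are removed.
import re

def strip_fences(gen: str) -> str:
    m = re.search(r"```(?:python)?\n(.*?)```", gen, re.DOTALL)
    return m.group(1) if m else gen

fenced = "```python\ndef add(a, b):\n    return a + b\n```"
ns = {}
exec(strip_fences(fenced), ns)   # succeeds once the fences are stripped
```

Using the raw /v1/completions endpoint sidesteps the issue entirely by keeping the model out of chat mode.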

| Model | pass@1 | Notes |
|---|---|---|
| Claude-4.6-Opus-Reasoning-Distilled (best source) | 76.22% | mradermacher i1-Q6_K |
| Esper3.1 | 83.54% | original source eval |
| Gemini3-Pro-High-Reasoning-Compact-Thinking | 83.54% | original source eval |
| Writer-V2 (not in final merge) | 82.32% | original source eval |
| Qwen3.5-27B-Omnimerge | 79.88% | +3.66 pp vs Claude-distill (fence-corrected) |

MBPP pass@1 (500 questions, Q6_K)

| Model | pass@1 | Notes |
|---|---|---|
| Qwen3.5-27B-Omnimerge | 71.80% | |
| Claude-4.6-Opus-Reasoning-Distilled | 71.20% | mradermacher i1-Q6_K |

GSM8K exact-match (200 questions, Q6_K, CoT zeroshot)

| Model | exact_match (flex) |
|---|---|
| Claude-4.6-Opus-Reasoning-Distilled | 82.50% |
| Qwen3.5-27B-Omnimerge | 79.50% |

GPQA Diamond exact-match (198 questions, Q6_K, flexible-extract)

gpqa_diamond_cot_zeroshot via lm-eval-harness, chain-of-thought reasoning mode.

| Model | exact_match | Delta vs source |
|---|---|---|
| Claude-4.6-Opus-Reasoning-Distilled (best source) | 53.03% | baseline |
| Qwen3.5-27B-Omnimerge | 61.11% | +8.08 pp |

(20q stratified subset preview, seed 42: Claude-distill 60%, Omnimerge 70% — +10 pp advantage for the merge.)

Inference

This is a standard HF-format Qwen3.5-27B bf16 model. Load with transformers as usual:

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "ManniX-ITA/Qwen3.5-27B-Omnimerge",
    dtype="bfloat16",
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained("ManniX-ITA/Qwen3.5-27B-Omnimerge")
```

For GGUF quantizations, see ManniX-ITA/Qwen3.5-27B-Omnimerge-GGUF.

Serving via llama.cpp (recommended settings, same as evaluation):

```shell
llama-server -m Qwen3.5-27B-Omnimerge-Q6_K.gguf -c 32768 -ngl 99 \
    --jinja --reasoning-format deepseek --reasoning-budget 16384 \
    --temp 0.6 --top-p 0.95 --top-k 20 --dry-multiplier 0.5
```

Reproducibility

All evaluation artifacts (results JSONs, samples, logs, caches, scripts) are attached as eval_artifacts.tar.gz in the -GGUF repo.

License

Apache-2.0, inherited from the base and all sources.

Acknowledgements

  • Qwen team for Qwen3.5-27B
  • Jackrong, ValiantLabs, DavidAU, ConicCat for the fine-tune sources
  • arcee-ai/mergekit team for the TIES/DARE implementation that inspired the custom merger
  • Yadav et al. 2023 (TIES), Yu et al. 2024 (DARE), Ilharco et al. 2023 (Task Arithmetic), Deep et al. 2024 (DELLA)