GGUF quantizations available at ManniX-ITA/Qwen3.5-27B-Omnimerge-GGUF
# Qwen3.5-27B-Omnimerge
A 3-way Task Arithmetic weight-space merge of three Qwen3.5-27B reasoning-distilled fine-tunes, built with a custom Python merger because mergekit does not currently support the Qwen3.5 hybrid architecture (`Qwen3_5ForConditionalGeneration`, with linear-attention layers).
This is an experimental 3-way Task Arithmetic merge that outperforms its best source model (Claude-4.6-Opus-Reasoning-Distilled) on most tested benchmarks: +8.08 pp on GPQA Diamond reasoning, +3.66 pp on HumanEval, and +0.6 pp on MBPP, with a small regression on GSM8K (−3 pp). Published as a research artifact demonstrating that weight-space merging can improve reasoning and code capabilities simultaneously.
## Source models and weights
| Source | Weight | Focus |
|---|---|---|
| Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled | 0.40 | Claude 4.6 Opus reasoning-trace distillation |
| ValiantLabs/Qwen3.5-27B-Esper3.1 | 0.35 | Code / DevOps specialist |
| DavidAU/Qwen3.5-27B-Gemini3-Pro-High-Reasoning-Compact-Thinking | 0.25 | Gemini 3 Pro reasoning-trace distillation, compact thinking |
Base: Qwen/Qwen3.5-27B
A fourth source, ConicCat/Qwen3.5-27B-Writer-V2 (creative writing), was included in an earlier 4-way equal-weight DARE-TIES experiment (HumanEval pass@1 = 80.49%, below every constituent). It was dropped from this release because its creative-writing training direction actively interfered with the reasoning axis: the reasoning-only 3-way set scored higher on both benchmarks.
## Method

### The custom merger: `dare_ties_merge.py`
Included in this repo (dare_ties_merge.py) — a minimal, architecture-agnostic PyTorch merger that processes tensors one at a time using name-matching, so it works on any architecture as long as tensors match by name between the base and the sources. Qwen3.5's hybrid language-model + vision-tower + linear_attn (SSM) layers are all handled the same way: read base tensor, read the same tensor from each source, compute the delta, apply the merge rule, write back.
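The per-tensor loop can be sketched as follows. This is an illustration, not the repo's exact code: plain Python dicts and lists stand in for the safetensors shards and tensors that `safetensors.safe_open` would stream one at a time.

```python
# Sketch of name-matched, one-tensor-at-a-time merging (task-arithmetic rule).
# base: {tensor_name: flat_values}; sources: list of dicts with the same keys.

def merge_by_name(base, sources, weights):
    merged = {}
    for name, b in base.items():
        out = list(b)  # start from the base tensor
        for w, src in zip(weights, sources):
            s = src[name]  # the same-named tensor must exist in every source
            for i in range(len(out)):
                out[i] += w * (s[i] - b[i])  # weighted delta against the base
        merged[name] = out
    return merged
```

Because the rule only needs the base tensor and the same-named tensor from each source, language-model, vision-tower, and linear-attention weights all go through the identical code path.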
Supports four merge methods via `--method`:

- `dare_ties`: DARE drop (uniform Bernoulli) + TIES sign consensus (the original; our v1 baseline)
- `dare_linear`: DARE drop + weighted linear, no sign consensus
- `task_arithmetic`: no drop, no sign consensus, just `base + Σ(wᵢ · Δᵢ)` (the method used for this release)
- `della`: DELLA MAGPRUNE magnitude-ranked drop + TIES sign consensus
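As a toy illustration of how the TIES-style methods differ from plain task arithmetic (a simplified sketch, not the repo's exact code), compare sign-consensus combination against a plain weighted sum of deltas on a single flat parameter:

```python
# Per-element TIES-style sign consensus: elect the majority sign by
# weighted mass, then keep only the weighted deltas that agree with it.

def ties_combine(deltas, weights):
    n = len(deltas[0])
    out = []
    for i in range(n):
        wd = [w * d[i] for w, d in zip(weights, deltas)]
        elected = 1.0 if sum(wd) >= 0 else -1.0   # majority sign
        out.append(sum(x for x in wd if x * elected > 0))
    return out

def task_arithmetic_combine(deltas, weights):
    # No drop, no consensus: every weighted delta contributes.
    n = len(deltas[0])
    return [sum(w * d[i] for w, d in zip(weights, deltas)) for i in range(n)]
```

Where sources disagree in sign, TIES discards the minority contribution entirely, while task arithmetic lets the deltas partially cancel; this is the "doesn't throw away delta information" property discussed below.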
Key design points of the merger:
- Chunked processing: each tensor is flattened and processed in 50M-element chunks in fp32, so peak RAM per tensor is bounded regardless of the tensor's total size. Lets us merge 27B models on a single 3090-class box via CPU RAM.
- Weighted source contributions: `--weights "0.40,0.35,0.25"` lets you bias toward the best source. For TIES-style methods, weights are applied before sign election, which means small weights effectively silence a source.
- Architecture agnostic: the merger iterates over the base model's `model.safetensors.index.json` weight map, loads the same-named tensor from each source via `safetensors.safe_open`, and merges by name. No per-architecture template files required.
- Deterministic seed: `--seed 42` makes runs reproducible for the DARE / DELLA variants; `task_arithmetic` has no randomness and is bit-exact reproducible regardless of seed.
- bf16 output, fp32 math: numerical stability during the merge, native dtype on disk.
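The bounded-memory chunking can be sketched in plain Python. This is a simplification under stated assumptions: lists stand in for fp32 tensors, and a tiny chunk size stands in for the real 50M-element chunks.

```python
CHUNK = 4  # illustration only; the real merger uses 50_000_000 elements

def merge_chunked(base, sources, weights, chunk=CHUNK):
    """Merge a flattened tensor chunk by chunk, so the peak working
    buffer is bounded by the chunk size, not the tensor size."""
    out = []
    for start in range(0, len(base), chunk):
        b = base[start:start + chunk]      # one bounded working buffer
        acc = list(b)
        for w, src in zip(weights, sources):
            s = src[start:start + chunk]
            for i in range(len(acc)):
                acc[i] += w * (s[i] - b[i])
        out.extend(acc)                    # write back, then reuse the buffer
    return out
```

Since the merge rule is elementwise, chunked and unchunked processing give identical results; only peak memory changes.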
### This specific merge

```shell
python3 dare_ties_merge.py \
  --method task_arithmetic \
  --base Qwen/Qwen3.5-27B \
  --source Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled \
  --source ValiantLabs/Qwen3.5-27B-Esper3.1 \
  --source DavidAU/Qwen3.5-27B-Gemini3-Pro-High-Reasoning-Compact-Thinking \
  --weights 0.40,0.35,0.25 \
  --output merged_omnimerge \
  --seed 42 --shard-size 5 --dtype bfloat16
```
The `task_arithmetic` formula is simply:

```
merged = base + Σᵢ wᵢ · (sourceᵢ − base)
```
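Worked out on a single scalar weight with this release's merge weights (the parameter values themselves are made up for illustration):

```python
# One scalar parameter under merged = base + Σᵢ wᵢ · (sourceᵢ − base).
base = 1.00
sources = [1.40, 0.80, 1.20]   # hypothetical fine-tuned values of one weight
weights = [0.40, 0.35, 0.25]   # the weights used for this release

merged = base + sum(w * (s - base) for w, s in zip(weights, sources))
# weighted deltas: 0.40*0.40 - 0.35*0.20 + 0.25*0.20 = 0.16 - 0.07 + 0.05 = 0.14
```

Note that the negative delta from the second source partially cancels the others rather than being dropped or sign-filtered.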
No drop mask, no sign consensus. Literature (Ilharco et al. 2023, Yadav et al. 2023, and arXiv 2511.21437) consistently finds this is the most reliable method when merging multiple fine-tunes of the same base — precisely because it doesn't throw away delta information.
## Evaluation
Both the merge and the best-scoring source (Claude-4.6-Opus-Reasoning-Distilled) were evaluated at Q6_K via llama.cpp server in Qwen3 reasoning mode (`--jinja --reasoning-format deepseek --reasoning-budget 16384 --temp 0.6 --top-p 0.95 --top-k 20 --dry-multiplier 0.5`) using lm-eval-harness.
### HumanEval pass@1 (164 questions, Q6_K)
Scores were obtained over the raw text-completion API (`local-completions` / `/v1/completions`) to bypass Qwen3.5's chat-mode markdown-fence behavior, which causes `exec(prompt + gen)` to fail.
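The failure mode: a chat-tuned model wraps its completion in a markdown fence, so the harness executes literal backticks as Python. A minimal fence stripper (an illustration of the workaround, not the exact filter used here) looks like:

```python
import re

def strip_fences(text: str) -> str:
    """Return the body of the first ```...``` block, or the text unchanged."""
    m = re.search(r"```(?:python)?\n(.*?)```", text, re.DOTALL)
    return m.group(1) if m else text
```

Using the raw `/v1/completions` endpoint avoids the chat template entirely, so no such post-processing is needed.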
| Model | pass@1 | Notes |
|---|---|---|
| Claude-4.6-Opus-Reasoning-Distilled (best source) | 76.22% | mradermacher i1-Q6_K |
| Esper3.1 | 83.54% | original source eval |
| Gemini3-Pro-High-Reasoning-Compact-Thinking | 83.54% | original source eval |
| Writer-V2 (not in final merge) | 82.32% | original source eval |
| Qwen3.5-27B-Omnimerge | 79.88% | +3.66 pp vs Claude-distill (fence-corrected) |
### MBPP pass@1 (500 questions, Q6_K)
| Model | pass@1 | Notes |
|---|---|---|
| Qwen3.5-27B-Omnimerge | 71.80% | |
| Claude-4.6-Opus-Reasoning-Distilled | 71.20% | mradermacher i1-Q6_K |
### GSM8K exact-match (200 questions, Q6_K, CoT zero-shot)
| Model | exact_match (flex) |
|---|---|
| Claude-4.6-Opus-Reasoning-Distilled | 82.50% |
| Qwen3.5-27B-Omnimerge | 79.50% |
### GPQA Diamond exact-match (198 questions, Q6_K, flexible-extract)

`gpqa_diamond_cot_zeroshot` via lm-eval-harness, chain-of-thought reasoning mode.
| Model | exact_match | Delta vs source |
|---|---|---|
| Claude-4.6-Opus-Reasoning-Distilled (best source) | 53.03% | baseline |
| Qwen3.5-27B-Omnimerge | 61.11% | +8.08 pp |
(20q stratified subset preview, seed 42: Claude-distill 60%, Omnimerge 70% — +10 pp advantage for the merge.)
## Inference
This is a standard HF-format Qwen3.5-27B bf16 model. Load with transformers as usual:
```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "ManniX-ITA/Qwen3.5-27B-Omnimerge",
    dtype="bfloat16",
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained("ManniX-ITA/Qwen3.5-27B-Omnimerge")
```
For GGUF quantizations, see ManniX-ITA/Qwen3.5-27B-Omnimerge-GGUF.
Serving via llama.cpp (recommended settings, same as evaluation):
```shell
llama-server -m Qwen3.5-27B-Omnimerge-Q6_K.gguf -c 32768 -ngl 99 \
  --jinja --reasoning-format deepseek --reasoning-budget 16384 \
  --temp 0.6 --top-p 0.95 --top-k 20 --dry-multiplier 0.5
```
## Reproducibility

All evaluation artifacts (results JSONs, samples, logs, caches, scripts) are attached as `eval_artifacts.tar.gz` in the -GGUF repo.
## License
Apache-2.0, inherited from the base and all sources.
## Acknowledgements
- Qwen team for Qwen3.5-27B
- Jackrong, ValiantLabs, DavidAU, ConicCat for the fine-tune sources
- arcee-ai/mergekit team for the TIES/DARE implementation that inspired the custom merger
- Yadav et al. 2023 (TIES), Yu et al. 2024 (DARE), Ilharco et al. 2023 (Task Arithmetic), Deep et al. 2024 (DELLA)