Qwen3.6-35B-A3B for hipfire

Pre-quantized Qwen3.6-35B-A3B (MoE, 35B total / 3B activated) for hipfire, a Rust-native LLM inference engine for AMD RDNA GPUs.

Quantized from Qwen/Qwen3.6-35B-A3B, Qwen's April 2026 refresh of the A3B line with a coding/agentic fine-tune recipe. The architecture is unchanged from Qwen3.5-35B-A3B — 256 experts with top-8 routing, hybrid DeltaNet + full attention (3:1 ratio), head_dim=256 with partial_rotary_factor=0.25, a shared expert, and tied embeddings — so hipfire's arch_id=6 path loads it without any engine changes.
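
For orientation, those hyperparameters can be summarized in a small config sketch. The struct and field names below are invented for this card; they are not hipfire's actual arch_id=6 config types:

// Illustrative sketch only: struct and field names are invented for this
// card and are not hipfire's real arch_id=6 config types.
#[derive(Debug)]
struct A3bConfig {
    num_experts: usize,          // 256 routed experts
    experts_per_token: usize,    // top-8 routing
    shared_expert: bool,         // one always-active shared expert
    deltanet_to_full_ratio: u32, // 3 DeltaNet layers per full-attention layer
    head_dim: usize,             // 256
    partial_rotary_factor: f32,  // RoPE covers the first 25% of head_dim
    tied_embeddings: bool,       // input/output embeddings share weights
}

fn main() {
    let cfg = A3bConfig {
        num_experts: 256,
        experts_per_token: 8,
        shared_expert: true,
        deltanet_to_full_ratio: 3,
        head_dim: 256,
        partial_rotary_factor: 0.25,
        tied_embeddings: true,
    };
    // 256 * 0.25 = 64 rotary dims per head
    let rotary_dims = (cfg.head_dim as f32 * cfg.partial_rotary_factor) as usize;
    println!("{cfg:#?}\nrotary dims per head: {rotary_dims}");
}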

Files

File                                    Quant                 Size     Min VRAM  RX 7900 XTX decode
qwen3.6-35b-a3b.mq4 ⭐                   MQ4                   18.7 GB  22 GB     ~148 tok/s
qwen3.6-35b-a3b.mq4.hermes.triattn.bin  TriAttention sidecar  983 KB   —         —

⭐ MQ4 is FWHT-rotated 4-bit, quality-gated against the Q8 reference.

The .hermes.triattn.bin file is an Aureth-corpus-calibrated CASK sidecar for KV cache eviction. It was calibrated on agentic/tool-use transcripts rather than generic wikitext, so it preserves what the model actually attends to during long-context coding and tool invocations, and it enables 131K-context inference on 24 GB consumer cards via the FlashTriAttn + CASK m-folding pipeline shipped in hipfire 0.1.7-alpha.
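
To make the eviction step concrete, here is a deliberately simplified sketch of the budgeted keep/evict idea. It is not hipfire's CASK or m-folding implementation; the importance scores are a stand-in for what the calibrated sidecar provides, and the m-folding step is omitted entirely:

// Simplified illustration of budgeted KV eviction: keep the `budget`
// highest-scoring cached tokens and drop the rest. The scores here stand
// in for what the calibrated sidecar provides; m-folding is not shown.
fn evict_to_budget(scores: &[f32], budget: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    // Sort token indices by descending importance score.
    idx.sort_by(|&a, &b| scores[b].partial_cmp(&scores[a]).unwrap());
    idx.truncate(budget.min(idx.len()));
    idx.sort_unstable(); // restore positional order for the kept tokens
    idx
}

fn main() {
    let scores = vec![0.9, 0.1, 0.7, 0.3, 0.8];
    // With a budget of 3, tokens 0, 2 and 4 survive.
    println!("{:?}", evict_to_budget(&scores, 3)); // prints [0, 2, 4]
}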

Usage

# Install hipfire
curl -L https://raw.githubusercontent.com/Kaden-Schutt/hipfire/master/scripts/install.sh | bash

# Pull the model
hipfire pull qwen3.6:35b-a3b

hipfire run qwen3.6:35b-a3b "Write a Rust function that parses an ISO-8601 date."

Optional β€” enable TriAttention KV eviction with the Hermes sidecar

# Download the sidecar alongside the model
hf download schuttdev/hipfire-qwen3.6-35b-a3b \
    qwen3.6-35b-a3b.mq4.hermes.triattn.bin \
    --local-dir ~/.hipfire/models

# Wire it into the per-model config
hipfire config qwen3.6:35b-a3b set cask_sidecar ~/.hipfire/models/qwen3.6-35b-a3b.mq4.hermes.triattn.bin
hipfire config qwen3.6:35b-a3b set cask true
hipfire config qwen3.6:35b-a3b set cask_budget 512
hipfire config qwen3.6:35b-a3b set cask_beta 128

With the sidecar loaded, long-context (β‰₯4K prompt) inference runs with KV memory capped at cask_budget tokens instead of growing linearly.
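
As a back-of-envelope illustration of what the cap buys (the per-token KV size here is an assumed placeholder, not a figure measured from this model):

// Back-of-envelope KV memory comparison. KV_BYTES_PER_TOKEN is an assumed
// placeholder, not a measured figure for this model.
const KV_BYTES_PER_TOKEN: usize = 40 * 1024; // assume 40 KiB per token

fn main() {
    let context: usize = 131_072; // full 131K context
    let budget: usize = 512;      // cask_budget from the config above
    let linear = (context * KV_BYTES_PER_TOKEN) as f64 / (1u64 << 30) as f64;
    let capped = (budget * KV_BYTES_PER_TOKEN) as f64 / (1u64 << 20) as f64;
    // Under these assumptions: ~5.0 GiB uncapped vs ~20 MiB capped.
    println!("uncapped: {linear:.1} GiB, capped: {capped:.1} MiB");
}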

Configuration notes

  • thinking:off recommended β€” Qwen3.6-A3B is a heavy thinker and default thinking-mode prompts produce long reasoning chains that can loop on complex tasks. For production-style usage:
    hipfire config qwen3.6:35b-a3b set thinking off
    
  • Default dflash_mode: auto — the engine keeps DFlash speculative decoding off for A3B unless a cask_sidecar is configured, because A3B drafts reject most tokens (τ≈1.0–1.5 on non-math) and the speculative-cycle overhead outweighs the win over plain autoregressive decoding; a rough cost model is sketched after this list. With the sidecar + CASK enabled, DFlash stays on (it is needed for long-context eviction anyway).
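
Reading τ as the mean number of accepted tokens per speculative cycle, and assuming each cycle costs one target verify pass plus a drafting overhead, the break-even arithmetic looks like this (the 0.4 overhead fraction is an assumption for illustration, not a hipfire measurement):

// Rough speculative-decoding cost model: each DFlash cycle pays one target
// verify pass plus drafting overhead and yields tau accepted tokens on
// average, so speedup over plain autoregressive decoding is
// tau / (1 + overhead). The 0.4 overhead fraction is assumed, not measured.
fn speedup(tau: f64, draft_overhead: f64) -> f64 {
    tau / (1.0 + draft_overhead)
}

fn main() {
    let overhead = 0.4;
    for tau in [1.0, 1.5, 3.0] {
        // tau=1.0 -> 0.71x (net loss), tau=1.5 -> 1.07x (a wash),
        // tau=3.0 -> 2.14x (clearly worth the cycle cost)
        println!("tau = {tau}: {:.2}x", speedup(tau, overhead));
    }
}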

Quantization format

  • MQ4 (MagnumQuant-4) β€” FWHT-rotated 4-bit with asym3 KV cache default. Matches Q8 output quality at ~Q4 bandwidth on hipfire's WMMA/dot2 fused kernel paths. See docs/QUANTIZATION.md for details on the rotation invariance property and the quality gate.
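
The invariance itself is easy to demonstrate: a Hadamard matrix H satisfies HᵀH = nI, so inner products survive the rotation up to a factor of n, which is why quantizing in the rotated basis (where outliers are spread across coordinates) changes layer outputs only by quantization error. A minimal self-contained demo (not hipfire's kernel code):

// Minimal demo of FWHT rotation invariance (not hipfire's kernel code).
// In-place fast Walsh–Hadamard transform; length must be a power of two.
fn fwht(v: &mut [f32]) {
    let n = v.len();
    let mut h = 1;
    while h < n {
        for i in (0..n).step_by(h * 2) {
            for j in i..i + h {
                let (a, b) = (v[j], v[j + h]);
                v[j] = a + b;
                v[j + h] = a - b;
            }
        }
        h *= 2;
    }
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let w = [0.5_f32, -1.0, 2.0, 0.25, -0.75, 1.5, -0.125, 3.0];
    let x = [1.0_f32, 0.5, -0.5, 2.0, 1.25, -1.0, 0.75, -2.0];
    let (mut wr, mut xr) = (w, x);
    fwht(&mut wr);
    fwht(&mut xr);
    // H^T H = n*I, so (Hw)·(Hx) / n == w·x: quantizing in the rotated
    // basis leaves the layer's dot products unchanged up to quant error.
    let n = w.len() as f32;
    println!("{:.4} == {:.4}", dot(&w, &x), dot(&wr, &xr) / n);
}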

License

Apache 2.0, following the upstream Qwen/Qwen3.6-35B-A3B license.
