# Qwen3.6-35B-A3B for hipfire
Pre-quantized Qwen3.6-35B-A3B (MoE, 35B total / 3B activated) for hipfire, a Rust-native LLM inference engine for AMD RDNA GPUs.
Quantized from Qwen/Qwen3.6-35B-A3B.
Qwen3.6 is the April 2026 refresh of the A3B line, with a coding/agentic fine-tune recipe. The architecture is unchanged from Qwen3.5-35B-A3B (256 experts with top-8 routing, hybrid DeltaNet + Full Attention in a 3:1 ratio, head_dim=256 with partial_rotary_factor=0.25, a shared expert, and tied embeddings), so hipfire's arch_id=6 path loads it without any engine changes.
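The top-8-of-256 routing mentioned above can be sketched as follows. This is an illustrative reconstruction of generic Qwen3-MoE-style routing (select the 8 highest-scoring experts, then softmax over only those experts' logits), not hipfire's actual kernel code:

```rust
// Illustrative top-k MoE routing sketch: pick the k highest router logits
// out of num_experts, then renormalize with a softmax over just those k.
// NOT hipfire's actual implementation; a generic sketch of the technique.
fn top_k_route(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    // Sort expert indices by descending router logit and keep the top k.
    idx.sort_by(|&a, &b| logits[b].partial_cmp(&logits[a]).unwrap());
    idx.truncate(k);
    // Softmax over the selected experts' logits (max-subtracted for stability).
    let max = idx.iter().map(|&i| logits[i]).fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = idx.iter().map(|&i| (logits[i] - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    idx.into_iter().zip(exps).map(|(i, e)| (i, e / sum)).collect()
}

fn main() {
    // 256 experts, top-8, as in the A3B config above.
    let mut logits = vec![0.0f32; 256];
    logits[3] = 2.0;
    logits[42] = 1.5;
    let routed = top_k_route(&logits, 8);
    assert_eq!(routed.len(), 8);
    assert_eq!(routed[0].0, 3); // highest-logit expert comes first
    let wsum: f32 = routed.iter().map(|(_, w)| w).sum();
    assert!((wsum - 1.0).abs() < 1e-5); // routing weights sum to 1
}
```

Only 8 of the 256 expert FFNs run per token, which is why a 35B-total model decodes at roughly 3B-active cost.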
## Files
| File | Quant | Size | Min VRAM | RX 7900 XTX decode |
|---|---|---|---|---|
| qwen3.6-35b-a3b.mq4 † | MQ4 | 18.7 GB | 22 GB | ~148 tok/s |
| qwen3.6-35b-a3b.mq4.hermes.triattn.bin | TriAttention sidecar | 983 KB | n/a | n/a |
† MQ4 is FWHT-rotated 4-bit, quality-gated against the Q8 reference.

The `.hermes.triattn.bin` file is an Aureth-corpus-calibrated CASK sidecar for KV cache eviction. It is trained on agentic/tool-use transcripts rather than generic wikitext, so it preserves what the model actually attends to during long-context coding and tool invocations. It enables 131K-context inference on 24 GB consumer cards via the FlashTriAttn + CASK m-folding pipeline shipped in hipfire 0.1.7-alpha.
## Usage
```bash
# Install hipfire
curl -L https://raw.githubusercontent.com/Kaden-Schutt/hipfire/master/scripts/install.sh | bash

# Pull and run the model
hipfire pull qwen3.6:35b-a3b
hipfire run qwen3.6:35b-a3b "Write a Rust function that parses an ISO-8601 date."
```
### Optional: enable TriAttention KV eviction with the Hermes sidecar
```bash
# Download the sidecar alongside the model
hf download schuttdev/hipfire-qwen3.6-35b-a3b \
  qwen3.6-35b-a3b.mq4.hermes.triattn.bin \
  --local-dir ~/.hipfire/models

# Wire it into the per-model config
hipfire config qwen3.6:35b-a3b set cask_sidecar ~/.hipfire/models/qwen3.6-35b-a3b.mq4.hermes.triattn.bin
hipfire config qwen3.6:35b-a3b set cask true
hipfire config qwen3.6:35b-a3b set cask_budget 512
hipfire config qwen3.6:35b-a3b set cask_beta 128
```
With the sidecar loaded, long-context (≥4K-token prompt) inference runs with KV memory capped at `cask_budget` tokens instead of growing linearly with context length.
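The effect of a fixed KV budget can be illustrated with a generic score-based eviction sketch. The actual CASK/m-folding algorithm is not documented in this card, so everything below is a hypothetical stand-in: older cache entries are ranked by an attention-relevance score, a recent window (in the spirit of `cask_beta`) is always kept, and the rest is trimmed to the budget:

```rust
// Generic score-based KV eviction under a fixed token budget.
// Hypothetical sketch of the budget-capping *effect* described above,
// NOT hipfire's actual CASK/m-folding algorithm.
// Each entry is (token position, relevance score).
fn evict_to_budget(cache: &mut Vec<(usize, f32)>, budget: usize, keep_recent: usize) {
    if cache.len() <= budget {
        return; // under budget: nothing to evict
    }
    let n = cache.len();
    // Always keep the most recent `keep_recent` tokens (analogous to cask_beta).
    let (older, recent) = cache.split_at(n - keep_recent);
    let mut older: Vec<(usize, f32)> = older.to_vec();
    // Of the older tokens, keep the highest-scoring ones up to the remaining budget.
    older.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    older.truncate(budget.saturating_sub(keep_recent));
    older.sort_by_key(|&(pos, _)| pos); // restore positional order
    older.extend_from_slice(recent);
    *cache = older;
}

fn main() {
    // Toy analog of cask_budget=512 / cask_beta=128: budget 4, keep 2 recent.
    let mut kv: Vec<(usize, f32)> =
        vec![(0, 0.9), (1, 0.1), (2, 0.8), (3, 0.2), (4, 0.3), (5, 0.4)];
    evict_to_budget(&mut kv, 4, 2);
    assert_eq!(kv.len(), 4); // capped at the budget
    // Recent tokens 4 and 5 survive; of the older ones, the two highest-scoring do.
    assert_eq!(kv.iter().map(|&(p, _)| p).collect::<Vec<_>>(), vec![0, 2, 4, 5]);
}
```

The sidecar's role in this picture is supplying calibrated scores, which is why a corpus matched to the workload (agentic/tool-use transcripts) matters more than a generic one.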
## Configuration notes
- `thinking: off` recommended. Qwen3.6-A3B is a heavy thinker, and default thinking-mode prompts produce long reasoning chains that can loop on complex tasks. For production-style usage: `hipfire config qwen3.6:35b-a3b set thinking off`
- Default `dflash_mode: auto`. The engine keeps DFlash speculative decoding off for A3B unless a `cask_sidecar` is configured, because A3B drafts reject most tokens (τ ≈ 1.0–1.5 on non-math workloads) and the cycle overhead outweighs the autoregressive win. With the sidecar + CASK enabled, DFlash remains on (it is needed for long-context eviction anyway).
## Quantization format
- MQ4 (MagnumQuant-4): FWHT-rotated 4-bit with an asym3 KV cache default. Matches Q8 output quality at ~Q4 bandwidth on hipfire's WMMA/dot2 fused kernel paths. See `docs/QUANTIZATION.md` for details on the rotation-invariance property and the quality gate.
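The fast Walsh–Hadamard transform at the heart of rotation-based quantization schemes can be sketched in a few lines. This shows the general technique (rotate activations/weights with an orthonormal FWHT to spread outliers before low-bit quantization); hipfire's fused kernels and MQ4's scaling details are not reproduced here:

```rust
// Minimal in-place fast Walsh-Hadamard transform with orthonormal scaling.
// Generic sketch of the rotation used by FWHT-based quantization schemes;
// NOT MQ4's actual kernel. Input length must be a power of two.
fn fwht(v: &mut [f32]) {
    let n = v.len();
    assert!(n.is_power_of_two());
    let mut h = 1;
    while h < n {
        // Butterfly pass: combine elements h apart as (a+b, a-b).
        for i in (0..n).step_by(h * 2) {
            for j in i..i + h {
                let (a, b) = (v[j], v[j + h]);
                v[j] = a + b;
                v[j + h] = a - b;
            }
        }
        h *= 2;
    }
    // 1/sqrt(n) scaling makes the transform orthonormal, so it is its own inverse.
    let s = 1.0 / (n as f32).sqrt();
    for x in v.iter_mut() {
        *x *= s;
    }
}

fn main() {
    let mut v = vec![1.0f32, 2.0, 3.0, 4.0];
    let orig = v.clone();
    fwht(&mut v);
    fwht(&mut v); // applying the orthonormal FWHT twice recovers the input
    for (a, b) in v.iter().zip(orig.iter()) {
        assert!((a - b).abs() < 1e-5);
    }
}
```

Because the rotation is orthogonal, the matmul result is unchanged in exact arithmetic; the win is that the rotated tensor has fewer extreme outliers, so a 4-bit grid wastes less range on them.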
## License
Apache 2.0, following the upstream Qwen/Qwen3.6-35B-A3B license.