Qwen3.6-35B-A3B for hipfire

Pre-quantized Qwen3.6-35B-A3B (MoE, 35B total / 3B activated) for hipfire, a Rust-native LLM inference engine for AMD RDNA GPUs.

Quantized from Qwen/Qwen3.6-35B-A3B, Qwen's April 2026 refresh of the A3B line with a coding/agentic fine-tune recipe. The architecture is unchanged from Qwen3.5-35B-A3B — 256 experts with top-8 routing, hybrid DeltaNet + full attention (3:1 ratio), head_dim=256 with partial_rotary_factor=0.25, a shared expert, and tied embeddings — so hipfire's arch_id=6 path loads it without any engine changes.
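
For orientation, those hyperparameters can be summarized in a small config sketch. The struct and field names below are invented for this card; they are not hipfire's actual arch_id=6 config types:

// Illustrative sketch only: struct and field names are invented for this
// card and are not hipfire's real arch_id=6 config types.
#[derive(Debug)]
struct A3bConfig {
    num_experts: usize,          // 256 routed experts
    experts_per_token: usize,    // top-8 routing
    shared_expert: bool,         // one always-active shared expert
    deltanet_to_full_ratio: u32, // 3 DeltaNet layers per full-attention layer
    head_dim: usize,             // 256
    partial_rotary_factor: f32,  // RoPE covers the first 25% of head_dim
    tied_embeddings: bool,       // input/output embeddings share weights
}

fn main() {
    let cfg = A3bConfig {
        num_experts: 256,
        experts_per_token: 8,
        shared_expert: true,
        deltanet_to_full_ratio: 3,
        head_dim: 256,
        partial_rotary_factor: 0.25,
        tied_embeddings: true,
    };
    // 256 * 0.25 = 64 rotary dims per head
    let rotary_dims = (cfg.head_dim as f32 * cfg.partial_rotary_factor) as usize;
    println!("{cfg:#?}\nrotary dims per head: {rotary_dims}");
}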

Files

File                                    Quant                 Size     Min VRAM  RX 7900 XTX decode
qwen3.6-35b-a3b.mq4 ⭐                   MQ4                   18.7 GB  22 GB     ~148 tok/s
qwen3.6-35b-a3b.mq4.hermes.triattn.bin  TriAttention sidecar  983 KB   —         —

⭐ MQ4 is FWHT-rotated 4-bit, quality-gated against the Q8 reference.

The .hermes.triattn.bin file is an Aureth-corpus-calibrated CASK sidecar for KV cache eviction. It was calibrated on agentic/tool-use transcripts rather than generic wikitext, so it preserves what the model actually attends to during long-context coding and tool invocations, and it enables 131K-context inference on 24 GB consumer cards via the FlashTriAttn + CASK m-folding pipeline shipped in hipfire 0.1.7-alpha.
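
To make the eviction step concrete, here is a deliberately simplified sketch of the budgeted keep/evict idea. It is not hipfire's CASK or m-folding implementation; the importance scores are a stand-in for what the calibrated sidecar provides, and the m-folding step is omitted entirely:

// Simplified illustration of budgeted KV eviction: keep the `budget`
// highest-scoring cached tokens and drop the rest. The scores here stand
// in for what the calibrated sidecar provides; m-folding is not shown.
fn evict_to_budget(scores: &[f32], budget: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    // Sort token indices by descending importance score.
    idx.sort_by(|&a, &b| scores[b].partial_cmp(&scores[a]).unwrap());
    idx.truncate(budget.min(idx.len()));
    idx.sort_unstable(); // restore positional order for the kept tokens
    idx
}

fn main() {
    let scores = vec![0.9, 0.1, 0.7, 0.3, 0.8];
    // With a budget of 3, tokens 0, 2 and 4 survive.
    println!("{:?}", evict_to_budget(&scores, 3)); // prints [0, 2, 4]
}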

Usage

# Install hipfire
curl -L https://raw.githubusercontent.com/Kaden-Schutt/hipfire/master/scripts/install.sh | bash

# Pull the model
hipfire pull qwen3.6:35b-a3b

hipfire run qwen3.6:35b-a3b "Write a Rust function that parses an ISO-8601 date."

Optional β€” enable TriAttention KV eviction with the Hermes sidecar

# Download the sidecar alongside the model
hf download schuttdev/hipfire-qwen3.6-35b-a3b \
    qwen3.6-35b-a3b.mq4.hermes.triattn.bin \
    --local-dir ~/.hipfire/models

# Wire it into the per-model config
hipfire config qwen3.6:35b-a3b set cask_sidecar ~/.hipfire/models/qwen3.6-35b-a3b.mq4.hermes.triattn.bin
hipfire config qwen3.6:35b-a3b set cask true
hipfire config qwen3.6:35b-a3b set cask_budget 512
hipfire config qwen3.6:35b-a3b set cask_beta 128

With the sidecar loaded, long-context (β‰₯4K prompt) inference runs with KV memory capped at cask_budget tokens instead of growing linearly.
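
As a back-of-envelope illustration of what the cap buys (the per-token KV size here is an assumed placeholder, not a figure measured from this model):

// Back-of-envelope KV memory comparison. KV_BYTES_PER_TOKEN is an assumed
// placeholder, not a measured figure for this model.
const KV_BYTES_PER_TOKEN: usize = 40 * 1024; // assume 40 KiB per token

fn main() {
    let context: usize = 131_072; // full 131K context
    let budget: usize = 512;      // cask_budget from the config above
    let linear = (context * KV_BYTES_PER_TOKEN) as f64 / (1u64 << 30) as f64;
    let capped = (budget * KV_BYTES_PER_TOKEN) as f64 / (1u64 << 20) as f64;
    // Under these assumptions: ~5.0 GiB uncapped vs ~20 MiB capped.
    println!("uncapped: {linear:.1} GiB, capped: {capped:.1} MiB");
}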

Configuration notes

  • thinking:off recommended β€” Qwen3.6-A3B is a heavy thinker and default thinking-mode prompts produce long reasoning chains that can loop on complex tasks. For production-style usage:
    hipfire config qwen3.6:35b-a3b set thinking off
    
  • Default dflash_mode: auto — the engine keeps DFlash speculative decoding off for A3B unless a cask_sidecar is configured, because A3B drafts reject most tokens (τ≈1.0–1.5 on non-math) and the speculative-cycle overhead outweighs the win over plain autoregressive decoding; a rough cost model is sketched after this list. With the sidecar + CASK enabled, DFlash stays on (it is needed for long-context eviction anyway).
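
Reading τ as the mean number of accepted tokens per speculative cycle, and assuming each cycle costs one target verify pass plus a drafting overhead, the break-even arithmetic looks like this (the 0.4 overhead fraction is an assumption for illustration, not a hipfire measurement):

// Rough speculative-decoding cost model: each DFlash cycle pays one target
// verify pass plus drafting overhead and yields tau accepted tokens on
// average, so speedup over plain autoregressive decoding is
// tau / (1 + overhead). The 0.4 overhead fraction is assumed, not measured.
fn speedup(tau: f64, draft_overhead: f64) -> f64 {
    tau / (1.0 + draft_overhead)
}

fn main() {
    let overhead = 0.4;
    for tau in [1.0, 1.5, 3.0] {
        // tau=1.0 -> 0.71x (net loss), tau=1.5 -> 1.07x (a wash),
        // tau=3.0 -> 2.14x (clearly worth the cycle cost)
        println!("tau = {tau}: {:.2}x", speedup(tau, overhead));
    }
}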

Quantization format

  • MQ4 (MagnumQuant-4) β€” FWHT-rotated 4-bit with asym3 KV cache default. Matches Q8 output quality at ~Q4 bandwidth on hipfire's WMMA/dot2 fused kernel paths. See docs/QUANTIZATION.md for details on the rotation invariance property and the quality gate.
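
The invariance itself is easy to demonstrate: a Hadamard matrix H satisfies HᵀH = nI, so inner products survive the rotation up to a factor of n, which is why quantizing in the rotated basis (where outliers are spread across coordinates) changes layer outputs only by quantization error. A minimal self-contained demo (not hipfire's kernel code):

// Minimal demo of FWHT rotation invariance (not hipfire's kernel code).
// In-place fast Walsh–Hadamard transform; length must be a power of two.
fn fwht(v: &mut [f32]) {
    let n = v.len();
    let mut h = 1;
    while h < n {
        for i in (0..n).step_by(h * 2) {
            for j in i..i + h {
                let (a, b) = (v[j], v[j + h]);
                v[j] = a + b;
                v[j + h] = a - b;
            }
        }
        h *= 2;
    }
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let w = [0.5_f32, -1.0, 2.0, 0.25, -0.75, 1.5, -0.125, 3.0];
    let x = [1.0_f32, 0.5, -0.5, 2.0, 1.25, -1.0, 0.75, -2.0];
    let (mut wr, mut xr) = (w, x);
    fwht(&mut wr);
    fwht(&mut xr);
    // H^T H = n*I, so (Hw)·(Hx) / n == w·x: quantizing in the rotated
    // basis leaves the layer's dot products unchanged up to quant error.
    let n = w.len() as f32;
    println!("{:.4} == {:.4}", dot(&w, &x), dot(&wr, &xr) / n);
}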

License

Apache 2.0, following the upstream Qwen/Qwen3.6-35B-A3B license.
