# Qwen3.6-35B-A3B-heretic-MLX-Mixed-4-8
MLX mixed 4-/8-bit quantization of tvall43/Qwen3.6-35B-A3B-heretic, targeting Apple Silicon via mlx-lm.
Text-only. Vision-tower weights are not present in this quantization (the base model's vision encoder was excluded during conversion). Despite `config.json` still declaring `Qwen3_5MoeForConditionalGeneration` (required for mlx-lm compatibility), image inputs will fail. For multimodal use, see the original Qwen/Qwen3.6-35B-A3B.
~20 GB on disk. Runs comfortably on Apple Silicon Macs with 64 GB of unified memory.
## Lineage
```
Qwen/Qwen3.6-35B-A3B
└─ tvall43/Qwen3.6-35B-A3B-heretic (abliterated)
   └─ cspenn/Qwen3.6-35B-A3B-heretic-MLX-Mixed-4-8 (this repo, MLX quant)
```
**Qwen/Qwen3.6-35B-A3B** — Qwen's open-weight 35B MoE model with only 3B parameters activated per token. Features 40 hybrid transformer layers (alternating Gated DeltaNet linear-attention and Gated Attention blocks), 256 routed experts + 1 shared expert, and a native context window of 262,144 tokens. Notable for strong agentic/coding performance and multimodal capability.

**tvall43/Qwen3.6-35B-A3B-heretic** — A decensored derivative produced with the Heretic v1.2.0 abliteration technique. Safety refusals were reduced from ~86/100 to ~5/100 by modifying attention output projection weights (`attn.o_proj`) on a per-layer basis, while maintaining a KL divergence of just 0.0097 from the original model, preserving general behavior almost entirely.

**cspenn/Qwen3.6-35B-A3B-heretic-MLX-Mixed-4-8 (this repo)** — The heretic weights quantized for Apple Silicon using mlx-lm's `convert` with a custom `quant_predicate`. A sensitivity-informed mixed strategy keeps routing and boundary layers at 8-bit while compressing the bulk expert weights to 4-bit. See Quantization Layout below.
## Model Details
| Property | Value |
|---|---|
| Base architecture | `Qwen3_5MoeForConditionalGeneration` |
| Total parameters | ~35B |
| Active parameters per token | ~3B |
| Transformer layers | 40 hybrid (30 Gated DeltaNet linear-attn + 10 Gated self-attn) |
| Self-attn layer positions | 3, 7, 11, 15, 19, 23, 27, 31, 35, 39 |
| Routed experts | 256 |
| Shared experts | 1 |
| Experts activated per token | 8 routed + 1 shared |
| Context length (base) | 262,144 tokens |
| Base dtype | bfloat16 |
| Vision tower | Not present (excluded at quantization time) |
| Quantized format | MLX affine quantization (safetensors) |
| Shards | 4 safetensors files |
| On-disk size | ~20 GB |
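
For a quick sanity check of the layout described below, the per-module overrides that mlx-lm records in `config.json` can be inspected directly. A minimal sketch, assuming mlx-lm's usual convention of storing module-path keys inside the `quantization` block (verify against the actual file in this repo):

```python
# Hedged sketch: list the modules that received 8-bit overrides by reading
# the per-module entries mlx-lm records in config.json's "quantization" block.
import json
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download(
    "cspenn/Qwen3.6-35B-A3B-heretic-MLX-Mixed-4-8", "config.json"
)
with open(cfg_path) as f:
    quant = json.load(f).get("quantization", {})

# Module-path keys map to dicts like {"group_size": 64, "bits": 8}.
overrides = {k: v for k, v in quant.items() if isinstance(v, dict)}
eight_bit = [k for k, v in overrides.items() if v.get("bits") == 8]
print(f"{len(eight_bit)} modules quantized at 8-bit, e.g.:")
for path in eight_bit[:5]:
    print(" ", path)
```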
## Quantization Layout
Default: 4-bit affine, `group_size=64`. Explicit 8-bit overrides are applied to 132 tensors across six categories, based on sensitivity analysis and layer-position anchoring.
| Tensor pattern | Scope | Tensor count | Bits | Rationale |
|---|---|---|---|---|
| `embed_tokens` | input embedding | 1 | 8 | Input boundary; errors amplify through all layers |
| `lm_head` | output head | 1 | 8 | Output-head sensitivity at the prediction boundary |
| `mlp.gate` | MoE router in all 40 layers | 40 | 8 | Expert routing decisions are disproportionately sensitive to precision |
| `shared_expert_gate` | shared-expert gate in all 40 layers | 40 | 8 | Gates token flow to the always-active shared expert |
| `linear_attn.out_proj` | all 30 linear-attn (Gated DeltaNet) layers | 30 | 8 | Identified by OptiQ sensitivity analysis as the highest-KLD tensor (KLD ~6.0) in these blocks |
| All tensors in `layers.0.*` | first transformer block | 10 | 8 | Input anchor; first and last blocks are known to be especially sensitive |
| All tensors in `layers.39.*` | last transformer block | 10 | 8 | Output anchor |
| Expert FFN (`switch_mlp.gate_proj`, `switch_mlp.up_proj`, `switch_mlp.down_proj`), shared-expert FFN, attention projections in layers 1–38 | bulk of the model | ~380 | 4 | Holds the vast majority of the ~35B parameters; quality loss is acceptable at 4-bit given MoE sparsity |
| Vision tower (`*.visual.*`) | — | 0 | — | Excluded entirely; small relative to the LM and not needed for text-only use |
Design rationale: MoE expert weights tolerate 4-bit compression well because only 8 of 256 experts activate per token, limiting quantization noise accumulation. Routing gates, embeddings, and the most sensitive linear-attention projection are kept at 8-bit to preserve routing quality and reduce perplexity degradation. The first and last transformer blocks are anchored at 8-bit as an additional safety margin. All tensors use `group_size=64`.
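
The exact predicate used for this conversion is not published here; the following is a minimal sketch of how such a layout can be expressed with mlx-lm's `quant_predicate` hook, with patterns mirroring the table above. The callback receives the module path, the module, and the model config, and returns either `True` (use the defaults) or a dict of per-module quantization parameters; treat the details as an approximation, not the exact recipe behind this repo.

```python
# Minimal sketch of a mixed 4-/8-bit quant predicate for mlx_lm.convert.
# Reconstructed from the layout table above, not the exact predicate
# used to produce this repo.
from mlx_lm import convert

# Module paths that get 8-bit overrides. endswith() avoids accidentally
# matching expert tensors such as switch_mlp.gate_proj against "mlp.gate".
EIGHT_BIT_SUFFIXES = (
    "embed_tokens",
    "lm_head",
    "mlp.gate",
    "shared_expert_gate",
    "linear_attn.out_proj",
)

def quant_predicate(path, module, config):
    """Return per-module quant params, or True for the 4-bit default."""
    if path.endswith(EIGHT_BIT_SUFFIXES):
        return {"bits": 8, "group_size": 64}
    # Anchor the first and last transformer blocks at 8-bit.
    if ".layers.0." in path or ".layers.39." in path:
        return {"bits": 8, "group_size": 64}
    return True  # fall through to the q_bits/q_group_size defaults below

convert(
    "tvall43/Qwen3.6-35B-A3B-heretic",
    mlx_path="Qwen3.6-35B-A3B-heretic-MLX-Mixed-4-8",
    quantize=True,
    q_bits=4,
    q_group_size=64,
    quant_predicate=quant_predicate,
)
```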
## Usage
### mlx-lm (Python)
```python
from mlx_lm import load, generate

model, tokenizer = load("cspenn/Qwen3.6-35B-A3B-heretic-MLX-Mixed-4-8")

messages = [{"role": "user", "content": "Explain mixture-of-experts in one paragraph."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
print(response)
```
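
For interactive use, mlx-lm also provides a streaming generator. A brief sketch; the exact type yielded by `stream_generate` has varied across mlx-lm releases (recent versions yield response objects with a `.text` field), so check your installed version:

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("cspenn/Qwen3.6-35B-A3B-heretic-MLX-Mixed-4-8")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain mixture-of-experts in one paragraph."}],
    tokenize=False,
    add_generation_prompt=True,
)

# Print tokens as they arrive instead of waiting for the full response.
for chunk in stream_generate(model, tokenizer, prompt=prompt, max_tokens=512):
    print(chunk.text, end="", flush=True)
```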
### mlx-lm (CLI)
```bash
mlx_lm.generate \
  --model cspenn/Qwen3.6-35B-A3B-heretic-MLX-Mixed-4-8 \
  --prompt "Explain mixture-of-experts in one paragraph." \
  --max-tokens 512
```
### LM Studio
LM Studio supports MLX safetensors natively on Apple Silicon. Either:

- **From Hugging Face:** search for `cspenn/Qwen3.6-35B-A3B-heretic-MLX-Mixed-4-8` in the Discover tab.
- **Local folder:** use Load Model from Disk and point to the downloaded folder. `chat_template.jinja` and `tokenizer.json` are included, so chat templating works out of the box.
## Important Caveats
- **Text-only.** `config.json` retains `Qwen3_5MoeForConditionalGeneration` (a VLM architecture class) and the `image_token_id` field because altering these would break mlx-lm's model loader. However, the vision encoder weights were not included in the safetensors; any attempt to pass image inputs will raise an error. This is intentional.
- **Heretic variant.** This model is derived from an abliterated base. It is uncensored and will comply with requests the original Qwen model would refuse. Use responsibly.
- **Memory.** ~64 GB of unified memory is recommended for comfortable inference. The model loads ~20 GB of weights but requires additional memory for the KV cache at long context lengths.
- **Thinking mode.** Qwen3.6 supports an explicit `<think>` reasoning mode. The heretic abliteration was applied without disabling this capability, so it should remain functional; see the sketch after this list for toggling it.
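
A hedged sketch of toggling the reasoning mode through the chat template. Qwen3-family templates generally accept an `enable_thinking` keyword (it is a template convention, not an mlx-lm argument); confirm against the `chat_template.jinja` shipped in this repo:

```python
from mlx_lm import load, generate

model, tokenizer = load("cspenn/Qwen3.6-35B-A3B-heretic-MLX-Mixed-4-8")
messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]

# enable_thinking is passed through to the Jinja chat template;
# set it to False to suppress the <think> block entirely.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=1024))
```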
## License
Apache 2.0 — same as the upstream Qwen model. See LICENSE.
This derivative is distributed under the same terms. Commercial use is permitted.
## Credits
- **Qwen team (Alibaba)** — original Qwen3.6-35B-A3B architecture and weights.
- **tvall43** — Heretic v1.2.0 abliteration producing the decensored base.
- **mlx-lm** — Apple's MLX framework and `convert` utility used for quantization.
- **Quantization** — produced by cspenn using a custom `quant_predicate` informed by OptiQ sensitivity analysis.