Qwen3.6-35B-A3B — Claude Opus 4.7 Reasoning

MLX 5-bit  |  Apple Silicon Optimized

Converted & released by Antimatter AI

Original model by lordx64


35B total  →  ~3B active per token  |  5.5 bpw  |  ~22 GB on disk  |  runs on 32 GB+ Apple Silicon


What is this?

An MLX-native 5-bit quantization of lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled — a Mixture-of-Experts reasoning model fine-tuned to think like Claude Opus 4.7.

The original bf16 weights (70 GB) are too large for most local setups. This conversion brings the model to **22 GB** with 5-bit affine quantization via Apple's MLX framework, making it practical to run entirely on-device on a Mac with 32 GB or more of unified memory.

Why this model matters

  • **Claude-grade reasoning, open weights:** trained on ~8K reasoning traces from Claude Opus 4.7 with explicit <think>…</think> chain-of-thought.
  • **Sparse MoE efficiency:** 256 experts, 8 routed + 1 shared; only ~3B parameters are active per token despite 35B total.
  • **Apple Silicon native:** MLX quantized weights load directly into unified memory; no CPU↔GPU copies, no CUDA required.
  • **Long context:** 262K-token context window (base architecture); 64K usable at inference.

Quick Start

Install

pip install mlx-lm

Generate

from mlx_lm import load, generate

# Downloads the quantized weights on first use and loads them into unified memory.
model, tokenizer = load("AntimatterAI/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-MLX-5bit")

messages = [{"role": "user", "content": "How many positive integers less than 1000 have digits that sum to 20?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Generous max_tokens budget: the model emits a <think> block before its final answer.
response = generate(model, tokenizer, prompt=prompt, max_tokens=8192, verbose=True)
print(response)

Serve (OpenAI-compatible API)

python -m mlx_lm server \
  --model AntimatterAI/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-MLX-5bit \
  --host 127.0.0.1 \
  --port 8080

Then query it like any OpenAI endpoint:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "AntimatterAI/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-MLX-5bit",
    "messages": [{"role": "user", "content": "Explain the Riemann hypothesis in simple terms."}],
    "max_tokens": 4096
  }'
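
The same endpoint can also be called from Python with the official openai client. A minimal sketch; the api_key value is a placeholder, since the local mlx_lm server does not check it:

from openai import OpenAI

# Point the client at the local mlx_lm server started above.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="AntimatterAI/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-MLX-5bit",
    messages=[{"role": "user", "content": "Explain the Riemann hypothesis in simple terms."}],
    max_tokens=4096,
)
print(resp.choices[0].message.content)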

Architecture

| Parameter | Value |
|---|---|
| Architecture | qwen3_5_moe (Qwen 3.6 Mixture-of-Experts) |
| Total parameters | 35.95B |
| Active parameters per token | ~3B (8 of 256 routed experts + 1 shared) |
| Layers | 40 (mixed linear + full attention; every 4th layer is full attention) |
| Hidden size | 2048 |
| Attention heads | 16 (KV heads: 2) |
| Head dim | 256 |
| Expert FFN size | 512 per expert |
| Context window | 262,144 tokens (architecture max) |
| Vocabulary | 248,320 tokens |

Quantization Details

| Setting | Value |
|---|---|
| Method | MLX affine quantization |
| Precision | 5-bit weights (5.502 effective bits per weight) |
| Group size | 64 |
| Router gates | Preserved at 8-bit (all 40 layers: mlp.gate and mlp.shared_expert_gate) |
| Vision tower | Stripped during conversion (text-only serving) |
| Disk size | ~22 GB (5 sharded safetensors) |
| Conversion tool | mlx-lm v0.31.x (python -m mlx_lm convert) |

Router gates are kept at 8-bit precision to preserve expert routing quality — the gates are small tensors but critical for MoE performance.
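
For reference, a conversion with these basic settings can be reproduced with mlx-lm's convert entry point. This is a sketch, not the exact command used for this release: keeping the router gates at 8-bit requires an additional per-layer quantization predicate that the plain flags below do not express, and flag names can vary across mlx-lm versions.

python -m mlx_lm convert \
  --hf-path lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled \
  --mlx-path ./Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-MLX-5bit \
  -q --q-bits 5 --q-group-size 64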


Benchmarks

Benchmarks below are from the original bf16 model as evaluated by lordx64. Quantization to 5-bit typically incurs minimal degradation on MoE architectures due to the sparse activation pattern.

| Benchmark | Setup | Score |
|---|---|---|
| GSM8K (CoT) | 8-shot, multiturn | 84.3% (flex) / 76.7% (strict) |
| MMLU-Pro | 5-shot, multiturn | 74.9% |

MMLU-Pro subject breakdown:

| Subject | Acc | Subject | Acc |
|---|---|---|---|
| Biology | 86.0% | Chemistry | 78.8% |
| Psychology | 83.4% | Health | 73.8% |
| Math | 83.6% | Business | 74.4% |
| Economics | 83.0% | Other | 72.6% |
| Physics | 81.0% | Philosophy | 71.3% |
| Computer Science | 79.0% | History | 70.9% |
| Engineering | 54.8% | Law | 55.6% |

Full evaluation data: lordx64/qwen3-6-distill-evals


Memory & Performance Guide

| System RAM | Recommended settings |
|---|---|
| 32 GB | --prefill-step-size 1024 (tight but workable for shorter contexts) |
| 64 GB | --prefill-step-size 2048 --prompt-cache-bytes 8GB |
| 128 GB | --prefill-step-size 4096 --prompt-cache-bytes 16GB (full speed) |
| 192 GB+ | --prefill-step-size 8192 --prompt-cache-bytes 32GB |

The MoE architecture activates only ~3B parameters per token, so per-step prefill compute and activation memory are far lower than for a dense 35B model. Expect **30–50 tokens/sec** decode speed on M-series chips, depending on memory bandwidth.
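
As a concrete example, a 64 GB machine would combine the serve command from Quick Start with the middle-row settings above (flag availability may depend on your mlx-lm version):

python -m mlx_lm server \
  --model AntimatterAI/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-MLX-5bit \
  --prefill-step-size 2048 \
  --prompt-cache-bytes 8GB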


Reasoning Behavior

This model emits explicit <think>…</think> reasoning blocks before its final answer — this is by design, inherited from the Claude Opus 4.7 distillation. Example:

<think>
I need to count positive integers less than 1000 whose digits sum to 20.
Let me denote a 3-digit number as having digits a, b, c where...
[detailed step-by-step reasoning]
</think>

The answer is 36.

For applications where you only need the final answer, strip everything between <think> and </think> in post-processing.
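
A minimal way to do that post-processing in Python (a generic regex sketch, not a utility shipped with the model):

import re

def strip_think(text: str) -> str:
    # Drop complete <think>...</think> blocks, then any unclosed trailing block.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    text = re.sub(r"<think>.*\Z", "", text, flags=re.DOTALL)
    return text.strip()

print(strip_think(response))  # `response` from the Quick Start example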

Preventing Thought Loops

At 5-bit quantization, this model will degenerate inside <think> blocks and during extended generation. The degeneration takes multiple forms: paragraph-level repetition, single-word stutter, synonym chain runaway, word concatenation, and free-association word salad. Repetition penalties alone are insufficient — you must also disable extended thinking and cap token output.

Recommended: disable thinking mode at the server level:

python -m mlx_lm server \
  --model /path/to/model \
  --chat-template-args '{"enable_thinking": false}'

Recommended generation parameters:

{
  "repetition_penalty": 2.0,
  "repetition_context_size": 1024,
  "temperature": 0.7,
  "max_tokens": 2048
}
| Parameter | Why |
|---|---|
| enable_thinking: false | Critical: prevents the model from entering <think> blocks, where degeneration is most severe |
| repetition_penalty: 2.0 | A strong penalty is needed; lower values (1.15–1.5) only change the degeneration pattern without stopping it |
| repetition_context_size: 1024 | Wide lookback window to catch long-distance repetition |
| max_tokens: 2048 | Hard cap; the model produces ~100–300 tokens of useful content before quality degrades |
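
If you call mlx-lm from Python instead of through the server, roughly the same settings can be applied with its sampling utilities. A sketch, assuming the sample_utils layout of recent mlx-lm releases and a chat template that accepts an enable_thinking flag (as Qwen3-family templates do):

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

model, tokenizer = load("AntimatterAI/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-MLX-5bit")

# Mirror --chat-template-args '{"enable_thinking": false}' at the template level.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize the CAP theorem in two sentences."}],
    add_generation_prompt=True,
    tokenize=False,
    enable_thinking=False,
)

sampler = make_sampler(temp=0.7)
logits_processors = make_logits_processors(
    repetition_penalty=2.0,
    repetition_context_size=1024,
)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=2048,
    sampler=sampler,
    logits_processors=logits_processors,
)
print(response)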

For production use, the included proxy (mlx_openai_input_limit_proxy.py) provides output-side loop detection that catches degeneration patterns and truncates responses at the last coherent sentence. This is the most reliable defense.
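
As a rough illustration of the idea only (not the included proxy's actual implementation), output-side loop detection can be as simple as cutting the response at the first sentence that repeats within a short trailing window:

import re

def truncate_on_loop(text: str, window: int = 6) -> str:
    # Toy heuristic: stop at the first sentence already seen among the last `window` sentences.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    kept = []
    for s in sentences:
        if s and s in kept[-window:]:
            break
        kept.append(s)
    return " ".join(kept).strip()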

Why this happens: The 5-bit quantization reduces the model's ability to maintain coherent long-range generation. Combined with reasoning-distillation training that strongly biases toward extended output, the model enters self-reinforcing degeneration modes that escalate from word repetition to increasingly creative forms of incoherence.


Training Lineage

Qwen/Qwen3.6-35B-A3B (Apache 2.0)
  └── lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled (LoRA SFT on Claude Opus 4.7 traces)
        └── AntimatterAI/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-MLX-5bit (this model — MLX 5-bit conversion)
| Stage | Detail |
|---|---|
| Base | Qwen/Qwen3.6-35B-A3B |
| Fine-tune | LoRA (r=16, attention-only) on lordx64/reasoning-distill-opus-4-7-max-sft (~7.8K conversations) |
| Teacher | Claude Opus 4.7 (Anthropic) |
| Quantization | MLX 5-bit affine (this release) by Antimatter AI |

Limitations

  • Reasoning over knowledge. Distillation transfers how to think, not new facts. The model's knowledge boundary is that of Qwen 3.6.
  • Long generation budgets. The model will use 5–30K tokens of <think> reasoning on hard problems. Set max_tokens accordingly.
  • Attention-only LoRA. Expert FFNs are untouched from the base — domains where Claude and Qwen diverge may show uneven results.
  • Distillation provenance. Training data was generated via Anthropic's Claude API. Downstream users should confirm compliance with Anthropic's usage policies.

Citation

If you use this model, please cite the original work and this conversion:

@misc{lordx64_qwen36_distill_2026,
  title  = {Qwen3.6-35B-A3B distilled from Claude Opus 4.7 reasoning},
  author = {lordx64},
  year   = {2026},
  url    = {https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled},
}

@misc{qwen36_a3b_2026,
  title  = {Qwen3.6-35B-A3B},
  author = {Qwen Team},
  year   = {2026},
  url    = {https://huggingface.co/Qwen/Qwen3.6-35B-A3B},
}

@misc{antimatterai_mlx_5bit_2026,
  title  = {Qwen3.6-35B-A3B Claude Opus 4.7 Reasoning — MLX 5-bit conversion},
  author = {Antimatter AI},
  year   = {2026},
  url    = {https://huggingface.co/AntimatterAI/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-MLX-5bit},
}

Acknowledgements

  • lordx64 — original model author; fine-tuned Qwen3.6 on Claude Opus 4.7 reasoning traces and published the weights under Apache 2.0.
  • Qwen Team — for the Qwen3.6-35B-A3B base model with a permissive license.
  • Anthropic — for Claude Opus 4.7, the teacher model.
  • Apple MLX Team — for the MLX framework enabling efficient on-device inference.
  • Unsloth — for accelerated LoRA training tooling used in the original fine-tune.

Antimatter AI — Building Digital Solutions That Matter

AI Development  |  Product Design  |  Healthcare Apps  |  IoT
