# Qwen3.6-35B-A3B — Claude Opus 4.7 Reasoning

**MLX 5-bit | Apple Silicon Optimized**
Converted & released by Antimatter AI. Original model by lordx64.

35B total → ~3B active per token | 5.5 bpw | ~22 GB on disk | runs on 32 GB+ Apple Silicon
## What is this?
An MLX-native 5-bit quantization of lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled — a Mixture-of-Experts reasoning model fine-tuned to think like Claude Opus 4.7.
The original bf16 weights (70 GB) are too large for most local setups. This conversion brings the model to **22 GB** with 5-bit affine quantization via Apple's MLX framework, making it practical to run entirely on-device on a Mac with 32 GB or more of unified memory.
## Why this model matters
| Feature | Detail |
|---|---|
| Claude-grade reasoning, open weights | Trained on ~8K reasoning traces from Claude Opus 4.7 with explicit `<think>…</think>` chain-of-thought |
| Sparse MoE efficiency | 256 experts, 8 routed + 1 shared — only ~3B parameters active per token despite 35B total |
| Apple Silicon native | MLX-quantized weights load directly into unified memory — no CPU↔GPU copies, no CUDA required |
| Long context | 262K-token context window (base architecture); 64K usable at inference |
## Quick Start

### Install

```bash
pip install mlx-lm
```
### Generate

```python
from mlx_lm import load, generate

model, tokenizer = load("AntimatterAI/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-MLX-5bit")

messages = [{"role": "user", "content": "How many positive integers less than 1000 have digits that sum to 20?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

response = generate(model, tokenizer, prompt=prompt, max_tokens=8192, verbose=True)
print(response)
```
### Serve (OpenAI-compatible API)

```bash
python -m mlx_lm server \
  --model AntimatterAI/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-MLX-5bit \
  --host 127.0.0.1 \
  --port 8080
```
Then query it like any OpenAI endpoint:
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "AntimatterAI/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-MLX-5bit",
    "messages": [{"role": "user", "content": "Explain the Riemann hypothesis in simple terms."}],
    "max_tokens": 4096
  }'
```
## Architecture
| Parameter | Value |
|---|---|
| Architecture | qwen3_5_moe (Qwen 3.6 Mixture-of-Experts) |
| Total parameters | 35.95B |
| Active parameters/token | ~3B (8 of 256 routed experts + 1 shared) |
| Layers | 40 (mixed linear + full attention, every 4th layer is full attention) |
| Hidden size | 2048 |
| Attention heads | 16 (KV heads: 2) |
| Head dim | 256 |
| Expert FFN size | 512 per expert |
| Context window | 262,144 tokens (architecture max) |
| Vocabulary | 248,320 tokens |
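The "~3B active per token" figure can be sanity-checked from the table above. The sketch below is a back-of-envelope count only: it assumes the shared expert has the same FFN size as a routed expert, an untied output head, and ignores biases, norm weights, and the router gates.

```python
# Back-of-envelope count of parameters active per token, using only the
# numbers from the architecture table. Shared-expert size, untied
# embeddings, and the omission of norms/biases are assumptions.
hidden, layers = 2048, 40
heads, kv_heads, head_dim = 16, 2, 256
expert_ffn, active_experts = 512, 9          # 8 routed + 1 shared (size assumed equal)
vocab = 248_320

# gate/up/down projections per expert
expert = 3 * hidden * expert_ffn
# q/o project to heads*head_dim; k/v project to kv_heads*head_dim
attn = 2 * hidden * heads * head_dim + 2 * hidden * kv_heads * head_dim
# input embedding + (assumed untied) output head
embed = 2 * vocab * hidden

active = layers * (active_experts * expert + attn) + embed
print(f"{active / 1e9:.2f}B")  # prints 2.90B — consistent with the ~3B claim
```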
## Quantization Details

| Field | Value |
|---|---|
| Method | MLX affine quantization |
| Precision | 5-bit weights (5.502 effective bits per weight) |
| Group size | 64 |
| Router gates | Preserved at 8-bit (all 40 layers — `mlp.gate` and `mlp.shared_expert_gate`) |
| Vision tower | Stripped during conversion (text-only serving) |
| Disk size | ~22 GB (5 sharded safetensors) |
| Conversion tool | mlx-lm v0.31.x (`python -m mlx_lm convert`) |
Router gates are kept at 8-bit precision to preserve expert routing quality — the gates are small tensors but critical for MoE performance.
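For readers who want to reproduce a mixed-precision recipe like this, recent mlx-lm releases let the Python `convert` API take a per-layer quantization predicate; whether your installed version supports it, and its exact signature, should be checked against the mlx-lm docs. The predicate itself is just a pure function over weight paths. The sketch below uses path patterns assumed from the Qwen MoE module names mentioned above:

```python
# Hypothetical per-layer quantization recipe: keep the MoE router gates at
# 8-bit and quantize everything else at 5-bit / group size 64. The path
# suffixes (mlp.gate, mlp.shared_expert_gate) are assumptions based on the
# module names listed in this card, not verified against the checkpoint.
def quant_predicate(path: str, module=None, config=None):
    """Return per-layer quantization settings for an mlx-lm conversion."""
    if path.endswith("mlp.gate") or path.endswith("mlp.shared_expert_gate"):
        return {"bits": 8, "group_size": 64}   # preserve routing quality
    return {"bits": 5, "group_size": 64}       # default 5-bit everywhere else
```

In versions that support it, this would be passed to `mlx_lm.convert(..., quantize=True, quant_predicate=quant_predicate)`; treat the keyword name as an assumption and check your installed mlx-lm.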
## Benchmarks
Benchmarks below are from the original bf16 model as evaluated by lordx64. Quantization to 5-bit typically incurs minimal degradation on MoE architectures due to the sparse activation pattern.
| Benchmark | Setup | Score |
|---|---|---|
| GSM8K (CoT) | 8-shot multiturn | 84.3% (flex) / 76.7% (strict) |
| MMLU-Pro | 5-shot multiturn | 74.9% |
MMLU-Pro subject breakdown:
| Subject | Acc | Subject | Acc |
|---|---|---|---|
| Biology | 86.0% | Chemistry | 78.8% |
| Psychology | 83.4% | Health | 73.8% |
| Math | 83.6% | Business | 74.4% |
| Economics | 83.0% | Other | 72.6% |
| Physics | 81.0% | Philosophy | 71.3% |
| Computer Science | 79.0% | History | 70.9% |
| Engineering | 54.8% | Law | 55.6% |
Full evaluation data: lordx64/qwen3-6-distill-evals
## Memory & Performance Guide
| System RAM | Recommended Settings |
|---|---|
| 32 GB | --prefill-step-size 1024 — tight but workable for shorter contexts |
| 64 GB | --prefill-step-size 2048 --prompt-cache-bytes 8GB |
| 128 GB | --prefill-step-size 4096 --prompt-cache-bytes 16GB — full speed |
| 192 GB+ | --prefill-step-size 8192 --prompt-cache-bytes 32GB |
The MoE architecture activates only 3B parameters per token, so prefill memory per step is much smaller than a dense 35B model. Expect **30–50 tokens/sec** decode speed on M-series chips depending on memory bandwidth.
## Reasoning Behavior

This model emits explicit `<think>…</think>` reasoning blocks before its final answer — this is by design, inherited from the Claude Opus 4.7 distillation. Example:

```text
<think>
I need to count positive integers less than 1000 whose digits sum to 20.
Let me denote a 3-digit number as having digits a, b, c where...
[detailed step-by-step reasoning]
</think>

The answer is 75.
```
For applications where you only need the final answer, strip everything between `<think>` and `</think>` in post-processing.
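A minimal sketch of that post-processing (the unclosed-block case is included because generation can be cut off mid-thought by `max_tokens`):

```python
import re

# Drop <think>…</think> blocks from a completion; also drop an
# unterminated <think> block in case generation was truncated mid-thought.
def strip_thinking(text: str) -> str:
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    text = re.sub(r"<think>.*", "", text, flags=re.DOTALL)  # unclosed block
    return text.strip()

print(strip_thinking("<think>\ncount digit sums...\n</think>\nThe answer is 75."))
# prints: The answer is 75.
```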
## Preventing Thought Loops

At 5-bit quantization, this model is prone to degenerating inside `<think>` blocks and during extended generation. The degeneration takes multiple forms: paragraph-level repetition, single-word stutter, synonym-chain runaway, word concatenation, and free-association word salad. Repetition penalties alone are insufficient — you must also disable extended thinking and cap token output.
Recommended: disable thinking mode at the server level:

```bash
python -m mlx_lm server \
  --model /path/to/model \
  --chat-template-args '{"enable_thinking": false}'
```
Recommended generation parameters:

```json
{
  "repetition_penalty": 2.0,
  "repetition_context_size": 1024,
  "temperature": 0.7,
  "max_tokens": 2048
}
```
| Parameter | Why |
|---|---|
| `enable_thinking: false` | Critical. Prevents the model from entering `<think>` blocks, where degeneration is most severe |
| `repetition_penalty: 2.0` | Strong penalty needed — lower values (1.15–1.5) only change the degeneration pattern without stopping it |
| `repetition_context_size: 1024` | Wide lookback window to catch long-distance repetition |
| `max_tokens: 2048` | Hard cap — the model produces ~100–300 tokens of useful content before quality degrades |
For production use, the included proxy (`mlx_openai_input_limit_proxy.py`) provides output-side loop detection that catches degeneration patterns and truncates responses at the last coherent sentence. This is the most reliable defense.
Why this happens: The 5-bit quantization reduces the model's ability to maintain coherent long-range generation. Combined with reasoning-distillation training that strongly biases toward extended output, the model enters self-reinforcing degeneration modes that escalate from word repetition to increasingly creative forms of incoherence.
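The proxy's actual detection logic is not reproduced here, but the core idea of output-side loop detection can be sketched as an n-gram repetition check that truncates at the last sentence boundary before the loop begins. The n-gram length and repeat threshold below are illustrative values, not the proxy's real settings:

```python
# Illustrative output-side loop detector: find the point where a trailing
# n-gram of words has already appeared `max_repeats` times, then truncate
# at the last sentence end before it. Thresholds are invented for the sketch.
def truncate_on_loop(text: str, n: int = 8, max_repeats: int = 3) -> str:
    words = text.split()
    seen = {}
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        seen[gram] = seen.get(gram, 0) + 1
        if seen[gram] >= max_repeats:
            # Loop detected: keep only text up to the last sentence boundary.
            prefix = " ".join(words[:i])
            cut = max(prefix.rfind(". "), prefix.rfind("! "), prefix.rfind("? "))
            return prefix[:cut + 1] if cut != -1 else prefix
    return text  # no loop found

clean = "The proof follows by induction. "
loop = "we see that we see that " * 20
print(truncate_on_loop(clean + loop))  # prints: The proof follows by induction.
```

A production detector would also want character-level checks (to catch word-concatenation stutter) and a perplexity or entropy signal, which a pure n-gram count cannot provide.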
## Training Lineage

```text
Qwen/Qwen3.6-35B-A3B (Apache 2.0)
 └── lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled (LoRA SFT on Claude Opus 4.7 traces)
      └── AntimatterAI/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-MLX-5bit (this model — MLX 5-bit conversion)
```
| Stage | Detail |
|---|---|
| Base | Qwen/Qwen3.6-35B-A3B |
| Fine-tune | LoRA (r=16, attention-only) on lordx64/reasoning-distill-opus-4-7-max-sft (~7.8K conversations) |
| Teacher | Claude Opus 4.7 (Anthropic) |
| Quantization | MLX 5-bit affine (this release) by Antimatter AI |
## Limitations
- Reasoning over knowledge. Distillation transfers how to think, not new facts. The model's knowledge boundary is that of Qwen 3.6.
- Long generation budgets. The model will use 5–30K tokens of `<think>` reasoning on hard problems. Set `max_tokens` accordingly.
- Attention-only LoRA. Expert FFNs are untouched from the base — domains where Claude and Qwen diverge may show uneven results.
- Distillation provenance. Training data was generated via Anthropic's Claude API. Downstream users should confirm compliance with Anthropic's usage policies.
## Citation

If you use this model, please cite the original work and this conversion:

```bibtex
@misc{lordx64_qwen36_distill_2026,
  title  = {Qwen3.6-35B-A3B distilled from Claude Opus 4.7 reasoning},
  author = {lordx64},
  year   = {2026},
  url    = {https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled},
}

@misc{qwen36_a3b_2026,
  title  = {Qwen3.6-35B-A3B},
  author = {Qwen Team},
  year   = {2026},
  url    = {https://huggingface.co/Qwen/Qwen3.6-35B-A3B},
}

@misc{antimatterai_mlx_5bit_2026,
  title  = {Qwen3.6-35B-A3B Claude Opus 4.7 Reasoning — MLX 5-bit conversion},
  author = {Antimatter AI},
  year   = {2026},
  url    = {https://huggingface.co/AntimatterAI/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-MLX-5bit},
}
```
## Acknowledgements
- lordx64 — original model author; fine-tuned Qwen3.6 on Claude Opus 4.7 reasoning traces and published the weights under Apache 2.0.
- Qwen Team — for the Qwen3.6-35B-A3B base model with a permissive license.
- Anthropic — for Claude Opus 4.7, the teacher model.
- Apple MLX Team — for the MLX framework enabling efficient on-device inference.
- Unsloth — for accelerated LoRA training tooling used in the original fine-tune.
Antimatter AI — Building Digital Solutions That Matter
AI Development | Product Design | Healthcare Apps | IoT