Qwen3.6-35B-A3B — Claude Opus 4.7 Reasoning

MLX 5-bit  |  Apple Silicon Optimized

Converted & released by Antimatter AI

Original model by lordx64


35B total  →  ~3B active per token  |  5.5 bpw  |  ~22 GB on disk  |  runs on 32 GB+ Apple Silicon


What is this?

An MLX-native 5-bit quantization of lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled — a Mixture-of-Experts reasoning model fine-tuned to think like Claude Opus 4.7.

The original bf16 weights (70 GB) are too large for most local setups. This conversion brings the model to **22 GB** with 5-bit affine quantization via Apple's MLX framework, making it practical to run entirely on-device on a Mac with 32 GB or more of unified memory.

Why this model matters

  • **Claude-grade reasoning, open weights:** trained on ~8K reasoning traces from Claude Opus 4.7 with explicit <think>…</think> chain-of-thought.
  • **Sparse MoE efficiency:** 256 experts, 8 routed + 1 shared; only ~3B parameters are active per token despite 35B total.
  • **Apple Silicon native:** MLX quantized weights load directly into unified memory; no CPU↔GPU copies, no CUDA required.
  • **Long context:** 262K-token context window (base architecture); 64K usable at inference.

Quick Start

Install

pip install mlx-lm

Generate

from mlx_lm import load, generate

# Downloads the quantized weights on first use and loads them into unified memory.
model, tokenizer = load("AntimatterAI/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-MLX-5bit")

messages = [{"role": "user", "content": "How many positive integers less than 1000 have digits that sum to 20?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Generous max_tokens budget: the model emits a <think> block before its final answer.
response = generate(model, tokenizer, prompt=prompt, max_tokens=8192, verbose=True)
print(response)

Serve (OpenAI-compatible API)

python -m mlx_lm server \
  --model AntimatterAI/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-MLX-5bit \
  --host 127.0.0.1 \
  --port 8080

Then query it like any OpenAI endpoint:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "AntimatterAI/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-MLX-5bit",
    "messages": [{"role": "user", "content": "Explain the Riemann hypothesis in simple terms."}],
    "max_tokens": 4096
  }'
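
The same endpoint can also be called from Python with the official openai client. A minimal sketch; the api_key value is a placeholder, since the local mlx_lm server does not check it:

from openai import OpenAI

# Point the client at the local mlx_lm server started above.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="AntimatterAI/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-MLX-5bit",
    messages=[{"role": "user", "content": "Explain the Riemann hypothesis in simple terms."}],
    max_tokens=4096,
)
print(resp.choices[0].message.content)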

Architecture

| Parameter | Value |
|---|---|
| Architecture | qwen3_5_moe (Qwen 3.6 Mixture-of-Experts) |
| Total parameters | 35.95B |
| Active parameters per token | ~3B (8 of 256 routed experts + 1 shared) |
| Layers | 40 (mixed linear + full attention; every 4th layer is full attention) |
| Hidden size | 2048 |
| Attention heads | 16 (KV heads: 2) |
| Head dim | 256 |
| Expert FFN size | 512 per expert |
| Context window | 262,144 tokens (architecture max) |
| Vocabulary | 248,320 tokens |

Quantization Details

| Setting | Value |
|---|---|
| Method | MLX affine quantization |
| Precision | 5-bit weights (5.502 effective bits per weight) |
| Group size | 64 |
| Router gates | Preserved at 8-bit (all 40 layers: mlp.gate and mlp.shared_expert_gate) |
| Vision tower | Stripped during conversion (text-only serving) |
| Disk size | ~22 GB (5 sharded safetensors) |
| Conversion tool | mlx-lm v0.31.x (python -m mlx_lm convert) |

Router gates are kept at 8-bit precision to preserve expert routing quality — the gates are small tensors but critical for MoE performance.
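
For reference, a conversion with these basic settings can be reproduced with mlx-lm's convert entry point. This is a sketch, not the exact command used for this release: keeping the router gates at 8-bit requires an additional per-layer quantization predicate that the plain flags below do not express, and flag names can vary across mlx-lm versions.

python -m mlx_lm convert \
  --hf-path lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled \
  --mlx-path ./Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-MLX-5bit \
  -q --q-bits 5 --q-group-size 64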


Benchmarks

Benchmarks below are from the original bf16 model as evaluated by lordx64. Quantization to 5-bit typically incurs minimal degradation on MoE architectures due to the sparse activation pattern.

| Benchmark | Setup | Score |
|---|---|---|
| GSM8K (CoT) | 8-shot, multiturn | 84.3% (flex) / 76.7% (strict) |
| MMLU-Pro | 5-shot, multiturn | 74.9% |

MMLU-Pro subject breakdown:

| Subject | Acc | Subject | Acc |
|---|---|---|---|
| Biology | 86.0% | Chemistry | 78.8% |
| Psychology | 83.4% | Health | 73.8% |
| Math | 83.6% | Business | 74.4% |
| Economics | 83.0% | Other | 72.6% |
| Physics | 81.0% | Philosophy | 71.3% |
| Computer Science | 79.0% | History | 70.9% |
| Engineering | 54.8% | Law | 55.6% |

Full evaluation data: lordx64/qwen3-6-distill-evals


Memory & Performance Guide

| System RAM | Recommended settings |
|---|---|
| 32 GB | --prefill-step-size 1024 (tight but workable for shorter contexts) |
| 64 GB | --prefill-step-size 2048 --prompt-cache-bytes 8GB |
| 128 GB | --prefill-step-size 4096 --prompt-cache-bytes 16GB (full speed) |
| 192 GB+ | --prefill-step-size 8192 --prompt-cache-bytes 32GB |

The MoE architecture activates only ~3B parameters per token, so per-step prefill compute and activation memory are far lower than for a dense 35B model. Expect **30–50 tokens/sec** decode speed on M-series chips, depending on memory bandwidth.
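
As a concrete example, a 64 GB machine would combine the serve command from Quick Start with the middle-row settings above (flag availability may depend on your mlx-lm version):

python -m mlx_lm server \
  --model AntimatterAI/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-MLX-5bit \
  --prefill-step-size 2048 \
  --prompt-cache-bytes 8GB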


Reasoning Behavior

This model emits explicit <think>…</think> reasoning blocks before its final answer — this is by design, inherited from the Claude Opus 4.7 distillation. Example:

<think>
I need to count positive integers less than 1000 whose digits sum to 20.
Let me denote a 3-digit number as having digits a, b, c where...
[detailed step-by-step reasoning]
</think>

The answer is 36.

For applications where you only need the final answer, strip everything between <think> and </think> in post-processing.
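
A minimal way to do that post-processing in Python (a generic regex sketch, not a utility shipped with the model):

import re

def strip_think(text: str) -> str:
    # Drop complete <think>...</think> blocks, then any unclosed trailing block.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    text = re.sub(r"<think>.*\Z", "", text, flags=re.DOTALL)
    return text.strip()

print(strip_think(response))  # `response` from the Quick Start example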

Preventing Thought Loops

At 5-bit quantization, this model will degenerate inside <think> blocks and during extended generation. The degeneration takes multiple forms: paragraph-level repetition, single-word stutter, synonym chain runaway, word concatenation, and free-association word salad. Repetition penalties alone are insufficient — you must also disable extended thinking and cap token output.

Recommended: disable thinking mode at the server level:

python -m mlx_lm server \
  --model /path/to/model \
  --chat-template-args '{"enable_thinking": false}'

Recommended generation parameters:

{
  "repetition_penalty": 2.0,
  "repetition_context_size": 1024,
  "temperature": 0.7,
  "max_tokens": 2048
}
| Parameter | Why |
|---|---|
| enable_thinking: false | Critical: prevents the model from entering <think> blocks, where degeneration is most severe |
| repetition_penalty: 2.0 | A strong penalty is needed; lower values (1.15–1.5) only change the degeneration pattern without stopping it |
| repetition_context_size: 1024 | Wide lookback window to catch long-distance repetition |
| max_tokens: 2048 | Hard cap; the model produces ~100–300 tokens of useful content before quality degrades |
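
If you call mlx-lm from Python instead of through the server, roughly the same settings can be applied with its sampling utilities. A sketch, assuming the sample_utils layout of recent mlx-lm releases and a chat template that accepts an enable_thinking flag (as Qwen3-family templates do):

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

model, tokenizer = load("AntimatterAI/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-MLX-5bit")

# Mirror --chat-template-args '{"enable_thinking": false}' at the template level.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize the CAP theorem in two sentences."}],
    add_generation_prompt=True,
    tokenize=False,
    enable_thinking=False,
)

sampler = make_sampler(temp=0.7)
logits_processors = make_logits_processors(
    repetition_penalty=2.0,
    repetition_context_size=1024,
)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=2048,
    sampler=sampler,
    logits_processors=logits_processors,
)
print(response)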

For production use, the included proxy (mlx_openai_input_limit_proxy.py) provides output-side loop detection that catches degeneration patterns and truncates responses at the last coherent sentence. This is the most reliable defense.
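
As a rough illustration of the idea only (not the included proxy's actual implementation), output-side loop detection can be as simple as cutting the response at the first sentence that repeats within a short trailing window:

import re

def truncate_on_loop(text: str, window: int = 6) -> str:
    # Toy heuristic: stop at the first sentence already seen among the last `window` sentences.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    kept = []
    for s in sentences:
        if s and s in kept[-window:]:
            break
        kept.append(s)
    return " ".join(kept).strip()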

Why this happens: The 5-bit quantization reduces the model's ability to maintain coherent long-range generation. Combined with reasoning-distillation training that strongly biases toward extended output, the model enters self-reinforcing degeneration modes that escalate from word repetition to increasingly creative forms of incoherence.


Training Lineage

Qwen/Qwen3.6-35B-A3B (Apache 2.0)
  └── lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled (LoRA SFT on Claude Opus 4.7 traces)
        └── AntimatterAI/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-MLX-5bit (this model — MLX 5-bit conversion)
| Stage | Detail |
|---|---|
| Base | Qwen/Qwen3.6-35B-A3B |
| Fine-tune | LoRA (r=16, attention-only) on lordx64/reasoning-distill-opus-4-7-max-sft (~7.8K conversations) |
| Teacher | Claude Opus 4.7 (Anthropic) |
| Quantization | MLX 5-bit affine (this release) by Antimatter AI |

Limitations

  • Reasoning over knowledge. Distillation transfers how to think, not new facts. The model's knowledge boundary is that of Qwen 3.6.
  • Long generation budgets. The model will use 5–30K tokens of <think> reasoning on hard problems. Set max_tokens accordingly.
  • Attention-only LoRA. Expert FFNs are untouched from the base — domains where Claude and Qwen diverge may show uneven results.
  • Distillation provenance. Training data was generated via Anthropic's Claude API. Downstream users should confirm compliance with Anthropic's usage policies.

Citation

If you use this model, please cite the original work and this conversion:

@misc{lordx64_qwen36_distill_2026,
  title  = {Qwen3.6-35B-A3B distilled from Claude Opus 4.7 reasoning},
  author = {lordx64},
  year   = {2026},
  url    = {https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled},
}

@misc{qwen36_a3b_2026,
  title  = {Qwen3.6-35B-A3B},
  author = {Qwen Team},
  year   = {2026},
  url    = {https://huggingface.co/Qwen/Qwen3.6-35B-A3B},
}

@misc{antimatterai_mlx_5bit_2026,
  title  = {Qwen3.6-35B-A3B Claude Opus 4.7 Reasoning — MLX 5-bit conversion},
  author = {Antimatter AI},
  year   = {2026},
  url    = {https://huggingface.co/AntimatterAI/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-MLX-5bit},
}

Acknowledgements

  • lordx64 — original model author; fine-tuned Qwen3.6 on Claude Opus 4.7 reasoning traces and published the weights under Apache 2.0.
  • Qwen Team — for the Qwen3.6-35B-A3B base model with a permissive license.
  • Anthropic — for Claude Opus 4.7, the teacher model.
  • Apple MLX Team — for the MLX framework enabling efficient on-device inference.
  • Unsloth — for accelerated LoRA training tooling used in the original fine-tune.

Antimatter AI — Building Digital Solutions That Matter

AI Development  |  Product Design  |  Healthcare Apps  |  IoT
