---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
base_model: Qwen/Qwen3.6-35B-A3B
datasets:
- lordx64/reasoning-distill-opus-4-7-max-sft
tags:
- text-generation
- reasoning
- distillation
- chain-of-thought
- qwen
- qwen3.6
- mixture-of-experts
- moe
- lora
- unsloth
model-index:
- name: Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled
  results: []
---
# Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled

A reasoning-distilled variant of **Qwen3.6-35B-A3B** taught to imitate the chain-of-thought style of **Claude Opus 4.7**, the frontier reasoning model from Anthropic. The goal: port Claude-grade reasoning behavior into a permissively licensed Mixture-of-Experts model that an individual can actually run.

## Why this model

- **Claude-style reasoning, open weights.** Claude Opus 4.7 is one of the strongest reasoning models available, but only via a proprietary API. This model has been fine-tuned on ~8k high-quality reasoning traces produced by Opus 4.7, teaching the base to *think* before answering, with explicit `<think>…</think>` blocks, in Claude's structure and cadence.
- **Sparse activation, dense knowledge.** The base is a 35B-parameter MoE with **256 experts, 8 routed + 1 shared**, of which only about **3B parameters are active** per token. You get the capacity of a 35B model at roughly the per-token compute cost of a small dense model, although all 35B parameters must still fit in memory. Full-quality bf16 inference runs on a single 80GB A100 or H100.
- **Long thinking supported.** 64k token context. The model routinely emits 5–30k tokens of `<think>` reasoning on hard problems before giving the final answer. That is the point of a reasoning model, and it is why this one was fine-tuned on traces from a teacher that also reasons explicitly.
- **Clean base to build on.** The LoRA adapter is also published separately (`…-adapter`), so you can apply the distillation to other checkpoints of the same base, or stack further fine-tunes.

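
As a rough sanity check on the single-GPU claim above, the sketch below estimates how much memory the bf16 weights alone occupy. It takes the 35.1B total parameter count quoted in the training table and assumes an 80 GB card counted in decimal gigabytes; it ignores the KV cache, activations, and framework overhead, so treat the headroom figure as an upper bound rather than a measurement.

```python
# Back-of-the-envelope weight-memory estimate (illustrative, not a profiler).
TOTAL_PARAMS = 35.1e9      # total parameter count quoted in this card
BYTES_PER_PARAM_BF16 = 2   # bf16 stores each parameter in 2 bytes
GPU_MEMORY_GB = 80         # assumption: 80 GB card, decimal gigabytes


def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory needed just to hold the weights, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


weights_gb = weight_memory_gb(TOTAL_PARAMS, BYTES_PER_PARAM_BF16)
headroom_gb = GPU_MEMORY_GB - weights_gb  # what is left for KV cache etc.
print(f"bf16 weights: {weights_gb:.1f} GB, headroom: {headroom_gb:.1f} GB")
```

Under these assumptions the weights take about 70 GB, which leaves only around 10 GB for everything else; long-context serving on one card may therefore need a shorter `max-model-len` or a quantized variant.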
## Intended use

Built for hard reasoning: graduate-level STEM, competition math (AIME / MATH), code reasoning with explicit walk-through, multi-step logic puzzles, and agentic planning where explicit `<think>` helps correctness.

On short-turn, latency-sensitive conversational workloads the thinking phase can dominate response time. Cap `max_new_tokens`, or post-process to strip `<think>…</think>` blocks if you only want final answers in production.

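
The post-processing step mentioned above can be sketched in a few lines. The helper name `strip_think` is hypothetical, and the sketch assumes the final answer is whatever follows the last `</think>` tag. It also covers two edge cases that may occur in practice: a chat template that places the opening tag in the prompt (so only the closing tag appears in the output), and a generation cut off by `max_new_tokens` before the reasoning finished.

```python
def strip_think(text: str) -> str:
    """Return only the final answer, dropping <think> reasoning.

    Assumption: the answer is the text after the last </think> tag.
    """
    if "</think>" in text:
        # Covers both "<think>...</think>answer" and outputs where the
        # opening tag lived in the prompt and only "</think>" is present.
        return text.rsplit("</think>", 1)[1].strip()
    if "<think>" in text:
        # Reasoning was truncated before it closed: nothing after the
        # opening tag is a final answer, so keep only what came before.
        return text.split("<think>", 1)[0].strip()
    return text.strip()
```

For example, `strip_think("<think>2+2=4</think>\nThe answer is 4.")` returns `"The answer is 4."`, while a truncated `"<think>still working"` yields an empty string, which a caller can treat as "no answer produced".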
## How to use

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
)

messages = [{"role": "user", "content": "How many positive integers less than 1000 have digits that sum to 20?"}]
# return_dict=True returns input_ids and attention_mask together.
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt",
).to(model.device)
prompt_len = inputs["input_ids"].shape[-1]

# Greedy decoding; the large budget leaves room for the <think> block.
out = model.generate(**inputs, max_new_tokens=32768, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
print(tok.decode(out[0][prompt_len:], skip_special_tokens=True))
```

Recommended backend: **vLLM** for serving; the MoE routing and KV cache benefit significantly from continuous batching.

```bash
vllm serve lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled \
  --dtype bfloat16 --max-model-len 65536 --gpu-memory-utilization 0.9
```

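
Once the server is running, it can be queried through vLLM's OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library. It assumes the server listens on vLLM's default `localhost:8000`, and the helper names `build_chat_request` and `chat` are illustrative rather than part of any library.

```python
import json
import urllib.request

MODEL = "lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled"
BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM default host/port


def build_chat_request(prompt: str, max_tokens: int = 32768) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # caps reasoning plus answer tokens
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, max_tokens: int = 32768) -> str:
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, max_tokens)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("What is the capital of France?", max_tokens=1024)` returns the reply as a string. Depending on how the server is configured, that string may still contain the `<think>` block.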
### GGUF (LM Studio / llama.cpp)

Quantized GGUF weights are available for `llama.cpp` and LM Studio:

- [**IQ4_XS** (18.9 GB)](https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-IQ4_XS-GGUF): fits in ~24 GB RAM/VRAM, default pick for LM Studio

Search `lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled` inside LM Studio's model browser once HF has indexed the GGUF repo (usually within an hour of publication). More quant levels (`Q4_K_M`, `Q5_K_M`, `Q8_0`) can be added on request.

## Training

| Setting | Value |
|---|---|
| Base model | `Qwen/Qwen3.6-35B-A3B` (loaded via `unsloth/Qwen3.6-35B-A3B` for faster finetuning) |
| Teacher | Claude Opus 4.7 (Anthropic) |
| Training dataset | [`lordx64/reasoning-distill-opus-4-7-max-sft`](https://huggingface.co/datasets/lordx64/reasoning-distill-opus-4-7-max-sft): reasoning traces from Claude Opus 4.7 reformatted into SFT conversations |
| Source dataset | [`lordx64/reasoning-distill-claude-opus-4-7-max`](https://huggingface.co/datasets/lordx64/reasoning-distill-claude-opus-4-7-max): raw teacher traces (pre-SFT formatting) |
| Dataset size | ~7,800 full conversations, assistant side trained including `<think>…</think>` |
| Method | SFT with Unsloth + TRL `SFTTrainer` + `train_on_responses_only` (loss only on assistant tokens) |
| LoRA config | `r=16, alpha=16, dropout=0.0, targets=["q_proj","k_proj","v_proj","o_proj"]` (attention-only) |
| Hyperparameters | `lr=2e-5`, cosine schedule, `warmup_ratio=0.03`, `weight_decay=0.01`, optimizer `adamw_8bit` |
| Batch | `per_device=1, grad_accum=16, effective=16`, 2 epochs = 978 steps |
| Sequence | 4096 tokens during training (64k usable at inference; the base supports it natively) |
| Precision | bf16 on 1× H200 141GB (HF Inference Endpoint, custom container) |
| Trainable | 3.44M params out of 35.1B (0.01%) |

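
The batch and step figures in the table can be cross-checked with simple arithmetic. The sketch below uses only the numbers reported above and assumes that a final partial batch in each epoch still counts as one optimizer step, which is an assumption about the trainer rather than something the table states.

```python
# Figures copied from the training table.
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 16
EPOCHS = 2
TOTAL_STEPS = 978
TRAINABLE_PARAMS = 3.44e6
TOTAL_PARAMS = 35.1e9

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM   # single GPU, so no extra factor
steps_per_epoch = TOTAL_STEPS // EPOCHS

# Assumption: a trailing partial batch still produces one step per epoch,
# so the dataset size lies within one effective batch of the upper bound.
max_examples = steps_per_epoch * effective_batch
min_examples = (steps_per_epoch - 1) * effective_batch + 1

trainable_pct = 100 * TRAINABLE_PARAMS / TOTAL_PARAMS

print(f"effective batch: {effective_batch}")
print(f"steps per epoch: {steps_per_epoch}")
print(f"implied dataset size: {min_examples}-{max_examples} conversations")
print(f"trainable share: {trainable_pct:.4f}%")
```

The implied range of 7,809 to 7,824 conversations agrees with the "~7,800" figure, and the trainable share of roughly 0.0098% rounds to the 0.01% listed.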
### Why attention-only LoRA on a MoE

The initial plan was full LoRA including the MoE expert FFNs (`gate_proj/up_proj/down_proj`). In the course of this project I filed and upstreamed a shape-mismatch fix to unsloth-zoo's MoE+LoRA grouped-mm path ([unslothai/unsloth-zoo#601](https://github.com/unslothai/unsloth-zoo/pull/601)), without which the expert-LoRA forward crashes on Qwen3.6's 256-expert layout. Even with that fix, single-GPU memory made expert-LoRA impractical for this run. Attention-only LoRA captures most of the signal for *style* distillation anyway (the point of this model) while leaving the expert FFNs' learned knowledge intact. A v2 training run with expert LoRA on multi-GPU is a natural next step if the style-only signal isn't enough.

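
To give a feel for how adapter size scales when experts are included, the sketch below applies the standard LoRA parameter count, `r * (d_in + d_out)` per adapted linear layer. The rank and expert count come from this card; the layer widths `HIDDEN` and `EXPERT_FFN` are hypothetical placeholders, not the real Qwen3.6 dimensions, and the sketch assumes one independent adapter per expert projection. It counts adapter parameters only and says nothing about activation or optimizer memory, which may well be the larger cost in practice.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) beside a d_out x d_in weight."""
    return r * (d_in + d_out)


R = 16             # LoRA rank from the training table
NUM_EXPERTS = 256  # expert count from this card

# HYPOTHETICAL widths, for illustration only (not the real model config).
HIDDEN = 2048
EXPERT_FFN = 512

# One square attention projection, per layer.
attn_proj = lora_params(HIDDEN, HIDDEN, R)

# gate_proj and up_proj map HIDDEN -> EXPERT_FFN, down_proj maps back.
one_expert = (
    2 * lora_params(HIDDEN, EXPERT_FFN, R)
    + lora_params(EXPERT_FFN, HIDDEN, R)
)
all_experts = NUM_EXPERTS * one_expert  # per layer, one adapter per expert

print(f"one attention projection: {attn_proj:,} adapter params")
print(f"all expert FFNs in one layer: {all_experts:,} adapter params")
print(f"ratio: {all_experts / attn_proj:.0f}x")
```

With these placeholder widths, the expert adapters in a single layer outnumber one attention projection's adapter by a few hundred times; the exact ratio depends entirely on the assumed dimensions.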
## Evaluation

Evaluated via `lm-evaluation-harness` (v0.4.9) with vLLM backend at 64k context, bf16. Custom eval path strips `<think>…</think>` from generations before the filter pipeline, uses per-task conventional few-shot counts, and runs with `fewshot_as_multiturn=True` so few-shot examples are proper chat turns rather than concatenated prompt text. Raw results JSON is public: [lordx64/qwen3-6-distill-evals](https://huggingface.co/datasets/lordx64/qwen3-6-distill-evals).

| Benchmark | Setup | Score |
|---|---|---|
| **GSM8K CoT** | 8-shot multiturn, limit 300 | **84.3%** (flexible-extract) / 76.7% (strict-match) |
| **MMLU-Pro** | 5-shot multiturn, limit 500 | **74.9%** |
| AIME 2024 | 0-shot, full (30) | _extraction fix in progress: the model generates answers but not in a format the AIME extractor recognizes (`\boxed{}` vs plain prose)_ |
| AIME 2025 | 0-shot, full (30) | _same, pending_ |
| GPQA Diamond | 0-shot CoT, full (198) | _same, pending_ |
| MATH-500 | 0-shot, limit 100 | _rerun pending (missing `sympy` / `math_verify` dep in the first run)_ |

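
The `\boxed{}` versus plain-prose mismatch noted for AIME is the kind of problem a more tolerant extractor can work around. The sketch below is not the harness's filter and not the fix used for this model; it is a hypothetical fallback that drops the reasoning, prefers the last `\boxed{…}` integer, and otherwise takes the last integer in the answer text. The function name and both regular expressions are illustrative assumptions.

```python
import re
from typing import Optional

# \boxed{123}, allowing spaces inside the braces.
_BOXED = re.compile(r"\\boxed\{\s*(-?\d+)\s*\}")
# Any standalone integer, used only as a fallback.
_INTEGER = re.compile(r"-?\d+")


def extract_integer_answer(generation: str) -> Optional[str]:
    """Pull an integer answer out of a generation, or None if absent."""
    # Keep only the text after the reasoning block, when one is closed.
    answer_part = generation.rsplit("</think>", 1)[-1]

    boxed = _BOXED.findall(answer_part)
    if boxed:
        return boxed[-1]  # prefer the explicit boxed answer

    integers = _INTEGER.findall(answer_part)
    return integers[-1] if integers else None  # fall back to last integer
```

A fallback like this trades precision for recall: it will pick up a number from prose such as "The answer is 204.", but it can also latch onto an unrelated trailing number, so scores produced with it should be reported separately from strict-format scores.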
### MMLU-Pro subject breakdown

Standard reasoning-model profile: strong on the sciences and math, weakest on law and engineering. All subjects evaluated at limit 500, 5-shot multiturn.

| Subject | Acc | Subject | Acc |
|---|---:|---|---:|
| Biology | 86.0% | Chemistry | 78.8% |
| Psychology | 83.4% | Health | 73.8% |
| Math | 83.6% | Business | 74.4% |
| Economics | 83.0% | Other | 72.6% |
| Physics | 81.0% | Philosophy | 71.3% |
| Computer Science | 79.0% | History | 70.9% |
| | | **Engineering** | **54.8%** |
| | | **Law** | **55.6%** |

Full per-task JSON with stderr, filter configs, and timings lives in the [evals dataset](https://huggingface.co/datasets/lordx64/qwen3-6-distill-evals/tree/main/reasoning/lordx64__Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled). The remaining tasks will be added to this table after a diagnostic rerun identifies why AIME/GPQA extraction is returning no-match on generated outputs.

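
Two quick checks help when reading these numbers. The sketch below computes the unweighted mean of the fourteen subject accuracies from the table, which comes out at the 74.9% headline MMLU-Pro score, and a simple binomial standard error for the GSM8K figure at limit 300. Whether the harness aggregates subjects in exactly this way is an assumption, and its reported stderr may differ slightly from this approximation.

```python
import math

# Subject accuracies (percent) copied from the breakdown table.
SUBJECT_ACC = {
    "Biology": 86.0, "Chemistry": 78.8, "Psychology": 83.4, "Health": 73.8,
    "Math": 83.6, "Business": 74.4, "Economics": 83.0, "Other": 72.6,
    "Physics": 81.0, "Philosophy": 71.3, "Computer Science": 79.0,
    "History": 70.9, "Engineering": 54.8, "Law": 55.6,
}

# Unweighted (macro) average over subjects.
macro_avg = sum(SUBJECT_ACC.values()) / len(SUBJECT_ACC)


def binomial_stderr(acc_pct: float, n: int) -> float:
    """Approximate standard error, in percentage points, of an accuracy
    measured on n independent samples."""
    p = acc_pct / 100
    return 100 * math.sqrt(p * (1 - p) / n)


gsm8k_se = binomial_stderr(84.3, 300)  # GSM8K flexible-extract, limit 300

print(f"MMLU-Pro macro average: {macro_avg:.1f}%")
print(f"GSM8K stderr at n=300: +/-{gsm8k_se:.1f} pp")
```

At 300 samples the GSM8K score carries an uncertainty of about two percentage points, which is worth keeping in mind when comparing against full-set results reported elsewhere.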
## Limitations

- **Reasoning ≠ knowledge.** Distillation transfers *how to reason*, not new facts. Anything the base Qwen3.6-35B-A3B doesn't already know, this model still doesn't know.
- **Attention-only LoRA.** Expert FFNs are untouched from the base, so domains where Claude and Qwen3.6 diverge in factual priors may see uneven improvement.
- **Long generations.** The model will genuinely use tens of thousands of tokens on hard problems. Budget your `max_new_tokens` accordingly, and set `max_model_len` to at least 32k at inference.
- **Distillation provenance.** Training data was generated with Anthropic's Claude Opus 4.7 via API. Downstream users should confirm compliance with Anthropic's [usage policies](https://www.anthropic.com/legal/usage-policy) for their specific use case.

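
For the long-generation point above, one practical guard is to clamp the generation budget so that prompt plus output never exceeds the serving context. The helper below is a hypothetical sketch; its name and the 65,536-token default (taken from the `--max-model-len` value in the vLLM command) are assumptions to adapt to your deployment.

```python
def clamp_max_new_tokens(
    prompt_tokens: int,
    requested_new_tokens: int,
    max_model_len: int = 65536,  # assumption: matches the serving context
) -> int:
    """Largest generation budget that still fits in the context window."""
    if prompt_tokens >= max_model_len:
        raise ValueError("prompt alone fills or exceeds the context window")
    remaining = max_model_len - prompt_tokens
    return min(requested_new_tokens, remaining)
```

A 1,000-token prompt leaves the full 32,768-token request intact, while a 40,000-token prompt reduces it to 25,536.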
## Citation

If you use this model, please cite the base and the distillation:

```bibtex
@misc{qwen36_a3b_2026,
  title = {Qwen3.6-35B-A3B},
  author = {Qwen Team},
  year = {2026},
  howpublished = {\url{https://huggingface.co/Qwen/Qwen3.6-35B-A3B}},
}
@misc{lordx64_qwen36_distill_2026,
  title = {Qwen3.6-35B-A3B distilled from Claude Opus 4.7 reasoning},
  author = {lordx64},
  year = {2026},
  howpublished = {\url{https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled}},
}
```

## Acknowledgements

- **Unsloth**: 2× faster training of large MoE LoRA; the bug we hit and fixed was in their `unsloth-zoo` patches (credit for rapid review of PR #601).
- **Anthropic**: for the teacher model.
- **Qwen team**: for releasing Qwen3.6 with a permissive Apache-2.0 license, enabling work like this.
- **lm-evaluation-harness (EleutherAI)**: evaluation methodology.