Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT
A 0.6B-parameter model built in two stages: knowledge distillation from a 30B Thinking teacher to establish a structured reasoning backbone, then supervised fine-tuning on legal instruction data. 50x compression. Under 500MB quantized. Runs on a phone.
The training order is the thesis: teach the model how to reason first (distillation from Thinking teacher), then teach it what to reason about (legal SFT). The Thinking teacher's extended deliberation traces transfer deeper reasoning structure than an Instruct teacher — critical when the student has only 0.6B parameters to work with.
"Structure beats scale, collaboration beats hierarchy, observation beats theory." — Convergent Intelligence LLC: Research Division
Training Pipeline
Stage 1: Knowledge Distillation (STEM Reasoning Backbone)
Qwen3-0.6B distilled from Qwen3-30B-A3B-Thinking-2507 — a Mixture-of-Experts model with 30B total parameters, ~3B active per token, using the Thinking variant that generates extended internal reasoning traces.
Why the Thinking teacher matters at 0.6B: The Thinking variant produces higher-entropy softmax distributions than the Instruct variant — it considers more reasoning paths before committing. At distillation temperature T=2.0, the 0.6B student sees a richer landscape of alternative derivation strategies. With only 0.6B parameters, every bit of transferred structure counts. The Thinking teacher gives more.
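The entropy claim is easy to see numerically. A dependency-free sketch (the logits below are made-up illustration values, not taken from either model): dividing logits by T=2.0 before the softmax flattens the distribution and raises its Shannon entropy, so the student sees more probability mass on alternative continuations.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: higher T flattens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means more spread-out mass."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical teacher logits over four candidate next tokens
logits = [4.0, 2.0, 1.0, 0.5]

h1 = entropy(softmax(logits, T=1.0))
h2 = entropy(softmax(logits, T=2.0))
print(f"entropy at T=1.0: {h1:.3f} nats, at T=2.0: {h2:.3f} nats")
```

For any non-uniform logits, raising T strictly increases the entropy of the resulting distribution, which is exactly the "richer landscape" the student distills from.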
Data: 6,122 STEM chain-of-thought samples across 12 domains:
| Domain | Samples |
|---|---|
| Physics | 2,254 |
| Linear Algebra | 667 |
| Differential Equations | 636 |
| Electromagnetism | 580 |
| Mathematics | 576 |
| Engineering | 574 |
| Classical Mechanics | 343 |
| Theoretical Mechanics | 307 |
| Advanced Calculus | 268 |
| Modern Physics | 177 |
| Physiology | 114 |
| Molecular Biology | 71 |
All datasets sourced from 0xZee. Shuffled with seed 42 and split 95/5 into train/eval.
Loss function:
- Proof-Weighted Cross-Entropy (55%) — 2.5x weight on derivation tokens, decaying to 1.5x. Forces the student to allocate its limited capacity to reasoning steps, not answer formatting.
- Knowledge Distillation KL Divergence (45%) — T=2.0, scaled by T². Transfers the Thinking teacher's full deliberation landscape.
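The two terms above combine into a single per-token objective. A minimal dependency-free sketch, assuming a 55/45 mix, forward KL(teacher || student) on temperature-softened distributions, and T² gradient rescaling; function and argument names are illustrative, not from the training code:

```python
import math

def kd_loss(student_logits, teacher_logits, target_idx, proof_weight,
            T=2.0, alpha=0.55):
    """Combined per-token loss: proof-weighted CE plus T^2-scaled KL to teacher.

    proof_weight > 1 amplifies the CE term on derivation tokens
    (2.5 early in training, decaying to 1.5 in the card's schedule).
    """
    def softmax(logits, temp=1.0):
        exps = [math.exp(z / temp) for z in logits]
        s = sum(exps)
        return [e / s for e in exps]

    # Hard-label cross-entropy, amplified on proof tokens
    ce = -math.log(softmax(student_logits)[target_idx]) * proof_weight

    # Forward KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 so the soft-target gradients stay comparable in magnitude
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = (T ** 2) * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

    return alpha * ce + (1 - alpha) * kl
```

Because only the CE term carries the proof weight, the same token costs more when it sits inside a derivation than when it sits in boilerplate, which is how capacity gets steered toward reasoning steps.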
Training format:
```
Solve the following problem carefully and show a rigorous derivation.
Problem:
{question}
Proof:
{CoT}
Final Answer:
{response}
```
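Rendering one distillation sample into the Stage 1 format above is a one-liner; a sketch, with the three field names assumed from the template placeholders rather than from the actual dataset schema:

```python
def build_stage1_sample(question: str, cot: str, response: str) -> str:
    """Render one distillation sample in the Stage 1 training format."""
    return (
        "Solve the following problem carefully and show a rigorous derivation.\n"
        f"Problem:\n{question}\n"
        f"Proof:\n{cot}\n"
        f"Final Answer:\n{response}"
    )
```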
Stage 1 hyperparameters:
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Training samples | 5,815 |
| Effective batch size | 8 |
| Learning rate | 1.5e-5 → 1e-6 (cosine) |
| Temperature | 2.0 |
| Proof weight | 2.5 → 1.5 |
| Precision | bf16 |
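The two decaying quantities in the table follow standard schedules; a sketch, using the usual half-cosine interpolation for the learning rate and a linear ramp for the proof weight (the card does not state the exact shape of the proof-weight decay, so linear is an assumption):

```python
import math

def cosine_lr(step, total_steps, lr_max=1.5e-5, lr_min=1e-6):
    """Half-cosine anneal from lr_max at step 0 down to lr_min at the last step."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

def proof_weight(step, total_steps, w_start=2.5, w_end=1.5):
    """Linear decay of the proof-token weight (assumed schedule)."""
    return w_start + (w_end - w_start) * step / total_steps
```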
Stage 2: Supervised Fine-Tuning (Legal Domain)
The distilled model was fine-tuned on Alignment-Lab-AI/Lawyer-Instruct using TRL's SFTTrainer.
Why legal on top of STEM: Legal reasoning is structurally isomorphic to mathematical reasoning — premise identification, logical chaining, exception handling, structured argumentation toward a conclusion. A model that learned rigorous derivation transfers that structure to legal analysis rather than learning legal templates from scratch.
Training format:
```
### Instruction:
{instruction}
### Response:
{output}
```
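With TRL's SFTTrainer this template is typically applied via a formatting function; a minimal sketch, assuming Lawyer-Instruct-style records expose `instruction` and `output` columns (column names are an assumption):

```python
def format_stage2(example: dict) -> str:
    """Render one SFT sample in the Stage 2 Alpaca-style template."""
    return (
        "### Instruction:\n"
        f"{example['instruction']}\n"
        "### Response:\n"
        f"{example['output']}"
    )
```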
Stage 2 hyperparameters:
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Effective batch size | 8 |
| Learning rate | 5e-6 (lower than Stage 1 to preserve backbone) |
| Gradient checkpointing | Enabled |
| Precision | bf16 |
Model Details
| Attribute | Value |
|---|---|
| Architecture | Qwen3 (causal LM, RoPE, GQA) |
| Parameters | 0.6B |
| Base model | Qwen/Qwen3-0.6B |
| Teacher model | Qwen/Qwen3-30B-A3B-Thinking-2507 |
| Compression ratio | 50x |
| Stage 1 data | 6,122 STEM CoT samples (12 datasets) |
| Stage 2 data | Alignment-Lab-AI/Lawyer-Instruct |
| Context length | 1024 tokens (training) |
| License | Apache 2.0 |
| Developer | Reaperdoesntrun / Convergent Intelligence LLC: Research Division |
Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)

# Legal instruction-following
prompt = """### Instruction:
What is the difference between a felony and a misdemeanor?
### Response:
"""

# STEM derivation (Stage 1 format still works; pass prompt_stem instead of prompt)
prompt_stem = """Solve the following problem carefully and show a rigorous derivation.
Problem:
Compute the determinant of the matrix [[1, 2], [3, 4]].
Proof:
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
GGUF
Quantized versions at reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF.
Prompt Formats
STEM derivation (Stage 1):

```
Solve the following problem carefully and show a rigorous derivation.
Problem:
[Your problem]
Proof:
```

Instruction-following (Stage 2):

```
### Instruction:
[Your question]
### Response:
```
Intended Uses
Good for: Ultra-lightweight reasoning on mobile/edge/IoT, legal and STEM instruction-following, educational tutoring, embedded inference, component in multi-model pipelines, anywhere you need reasoning in under 500MB.
Not for: Formal proof verification, actual legal counsel, safety-critical analysis, complex multi-step proofs (>8 steps), or long-context tasks beyond 1024 tokens.
Limitations
0.6B is a hard capacity constraint. The model trades depth for deployability. It will make reasoning errors that a larger model would not. Multi-step derivations beyond ~8 steps degrade. Legal reasoning covers general concepts but lacks the nuance of larger models. Performance is weakest on underrepresented domains (molecular biology, physiology). Always verify outputs.
Mathematical Foundations: Discrepancy Calculus (DISC)
This model is part of a distillation chain built on Discrepancy Calculus — a measure-theoretic framework where the teacher's output distribution is decomposed via the Mesh Fundamental Identity into smooth (AC), jump, and Cantor components. The discrepancy operator $Df(x) = \lim_{\varepsilon \downarrow 0} \frac{1}{\varepsilon} \int_x^{x+\varepsilon} \frac{|f(t) - f(x)|}{|t - x|} dt$ quantifies local structural mismatch that standard KL divergence averages away.
Full theory: "On the Formal Analysis of Discrepancy Calculus" (Colca, 2026; Convergent Intelligence LLC: Research Division). Full methodology: Structure Over Scale (DOI: 10.57967/hf/8165).
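For differentiable f, the integrand |f(t) - f(x)| / |t - x| tends to |f'(x)| as t → x, so Df(x) should recover |f'(x)| on smooth stretches. A quick numerical check of the operator as defined above (a toy illustration only, not code from the cited theory):

```python
def discrepancy(f, x, eps=1e-3, n=1000):
    """Approximate Df(x) = (1/eps) * integral over [x, x+eps] of
    |f(t) - f(x)| / |t - x| dt, via a midpoint Riemann sum
    (midpoints sidestep the removable singularity at t = x)."""
    h = eps / n
    total = 0.0
    for i in range(n):
        t = x + (i + 0.5) * h
        total += abs(f(t) - f(x)) / abs(t - x) * h
    return total / eps

# For f(x) = x^2 at x = 1, Df(1) should approximate |f'(1)| = 2
d = discrepancy(lambda u: u * u, 1.0)
print(f"Df(1) for f(x)=x^2: {d:.4f}")
```

On the smooth (AC) component Df reduces to |f'|; its value at jump and Cantor parts is where it departs from what a KL average can see.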
Related Models
| Model | Description |
|---|---|
| Qwen3-0.6B-STEM-Proof-Distilled-Thinking | Stage 1 only — pure STEM backbone |
| Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF | This model quantized for edge deployment |
| Qwen3-1.7B-STEM-Proof-Distilled | Larger 1.7B variant (Instruct teacher) |
| Qwen3-1.7B-Distilled-30B-A3B-SFT | Larger 1.7B variant + legal SFT |
Citation
```bibtex
@misc{colca2026thinking06bsft,
  title={Two-Stage Reasoning Transfer at 0.6B: Thinking Teacher Distillation + Legal SFT},
  author={Colca},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT},
  note={Convergent Intelligence LLC: Research Division}
}
```
Convergent Intelligence LLC: Research Division "Where classical analysis fails to see, we begin."
Convergent Intelligence Portfolio
Part of the Qwen3 0.6B Distillation Series by Convergent Intelligence LLC: Research Division
Related Models
| Model | Downloads | Format |
|---|---|---|
| Qwen3-0.6B-Distilled-30B-A3B | 36 | HF |
| Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF | 203 | GGUF |
Top Models from Our Lab
| Model | Downloads |
|---|---|
| Qwen3-1.7B-Thinking-Distil | 501 |
| LFM2.5-1.2B-Distilled-SFT | 342 |
| Qwen3-1.7B-Coder-Distilled-SFT | 302 |
| Qwen3-1.7B-Coder-Distilled-SFT-GGUF | 194 |
| Qwen3-1.7B-Distilled-30B-A3B-SFT-GGUF | 175 |
Total Portfolio: 41 models | 2,781 total downloads
Last updated: 2026-03-28 12:56 UTC
DistilQwen Collection
This model is part of the DistilQwen proof-weighted distillation series. Collection: 9 models | 2,788 downloads
Teacher Variant Comparison
| Teacher | Student Size | Strength | Models |
|---|---|---|---|
| Qwen3-30B-A3B (Instruct) | 1.7B | Instruction following, structured output, legal reasoning | 3 (833 DL) |
| Qwen3-30B-A3B (Thinking) | 0.6B | Extended deliberation, higher-entropy distributions, proof derivation | 3 (779 DL) ← this model |
| Qwen3-30B-A3B (Coder) | 1.7B | Structured decomposition, STEM derivation, logical inference | 2 (825 DL) |
Methodology
The only BF16 collection in the portfolio. While the broader Convergent Intelligence catalog (43 models, 12,000+ downloads) was trained on CPU at FP32 for $24 total compute, the DistilQwen series was trained on H100 at BF16 with a 30B-parameter teacher. Same methodology, premium hardware. This is what happens when you give the pipeline real compute.
All models use proof-weighted knowledge distillation: 55% cross-entropy with decaying proof weights (2.5× → 1.5×), 45% KL divergence at T=2.0. The proof weight amplifies loss on reasoning-critical tokens, forcing the student to allocate capacity to structural understanding rather than surface-level pattern matching.
Full methodology: Structure Over Scale (DOI: 10.57967/hf/8165)
Related in this series
- Qwen3-0.6B-Distilled-30B-A3B (236 downloads)
- Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF (316 downloads)