Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT
A 0.6B-parameter model built in two stages: knowledge distillation from a 30B Thinking teacher to establish a structured reasoning backbone, then supervised fine-tuning on legal instruction data. 50x compression. Under 500MB quantized. Runs on a phone.
The training order is the thesis: teach the model how to reason first (distillation from Thinking teacher), then teach it what to reason about (legal SFT). The Thinking teacher's extended deliberation traces transfer deeper reasoning structure than an Instruct teacher — critical when the student has only 0.6B parameters to work with.
"Structure beats scale, collaboration beats hierarchy, observation beats theory." — Convergent Intelligence LLC: Research Division
Training Pipeline
Stage 1: Knowledge Distillation (STEM Reasoning Backbone)
Qwen3-0.6B distilled from Qwen3-30B-A3B-Thinking-2507 — a Mixture-of-Experts model with 30B total parameters, ~3B active per token, using the Thinking variant that generates extended internal reasoning traces.
Why the Thinking teacher matters at 0.6B: The Thinking variant produces higher-entropy softmax distributions than the Instruct variant — it considers more reasoning paths before committing. At distillation temperature T=2.0, the 0.6B student sees a richer landscape of alternative derivation strategies. With only 0.6B parameters, every bit of transferred structure counts. The Thinking teacher gives more.
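The entropy claim is easy to see numerically. A dependency-free sketch (the logits below are made-up illustration values, not taken from either model): dividing logits by T=2.0 before the softmax flattens the distribution and raises its Shannon entropy, so the student sees more probability mass on alternative continuations.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: higher T flattens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means more spread-out mass."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical teacher logits over four candidate next tokens
logits = [4.0, 2.0, 1.0, 0.5]

h1 = entropy(softmax(logits, T=1.0))
h2 = entropy(softmax(logits, T=2.0))
print(f"entropy at T=1.0: {h1:.3f} nats, at T=2.0: {h2:.3f} nats")
```

For any non-uniform logits, raising T strictly increases the entropy of the resulting distribution, which is exactly the "richer landscape" the student distills from.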
Data: 6,122 STEM chain-of-thought samples across 12 domains:
| Domain | Samples |
|---|---|
| Physics | 2,254 |
| Linear Algebra | 667 |
| Differential Equations | 636 |
| Electromagnetism | 580 |
| Mathematics | 576 |
| Engineering | 574 |
| Classical Mechanics | 343 |
| Theoretical Mechanics | 307 |
| Advanced Calculus | 268 |
| Modern Physics | 177 |
| Physiology | 114 |
| Molecular Biology | 71 |
All datasets sourced from 0xZee. Shuffled with seed 42 and split 95/5 into train/eval.
Loss function:
- Proof-Weighted Cross-Entropy (55%) — 2.5x weight on derivation tokens, decaying to 1.5x. Forces the student to allocate its limited capacity to reasoning steps, not answer formatting.
- Knowledge Distillation KL Divergence (45%) — T=2.0, scaled by T². Transfers the Thinking teacher's full deliberation landscape.
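The two terms above combine into a single per-token objective. A minimal dependency-free sketch, assuming a 55/45 mix, forward KL(teacher || student) on temperature-softened distributions, and T² gradient rescaling; function and argument names are illustrative, not from the training code:

```python
import math

def kd_loss(student_logits, teacher_logits, target_idx, proof_weight,
            T=2.0, alpha=0.55):
    """Combined per-token loss: proof-weighted CE plus T^2-scaled KL to teacher.

    proof_weight > 1 amplifies the CE term on derivation tokens
    (2.5 early in training, decaying to 1.5 in the card's schedule).
    """
    def softmax(logits, temp=1.0):
        exps = [math.exp(z / temp) for z in logits]
        s = sum(exps)
        return [e / s for e in exps]

    # Hard-label cross-entropy, amplified on proof tokens
    ce = -math.log(softmax(student_logits)[target_idx]) * proof_weight

    # Forward KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 so the soft-target gradients stay comparable in magnitude
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = (T ** 2) * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

    return alpha * ce + (1 - alpha) * kl
```

Because only the CE term carries the proof weight, the same token costs more when it sits inside a derivation than when it sits in boilerplate, which is how capacity gets steered toward reasoning steps.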
Training format:
```
Solve the following problem carefully and show a rigorous derivation.
Problem:
{question}
Proof:
{CoT}
Final Answer:
{response}
```
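Rendering one distillation sample into the Stage 1 format above is a one-liner; a sketch, with the three field names assumed from the template placeholders rather than from the actual dataset schema:

```python
def build_stage1_sample(question: str, cot: str, response: str) -> str:
    """Render one distillation sample in the Stage 1 training format."""
    return (
        "Solve the following problem carefully and show a rigorous derivation.\n"
        f"Problem:\n{question}\n"
        f"Proof:\n{cot}\n"
        f"Final Answer:\n{response}"
    )
```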
Stage 1 hyperparameters:
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Training samples | 5,815 |
| Effective batch size | 8 |
| Learning rate | 1.5e-5 → 1e-6 (cosine) |
| Temperature | 2.0 |
| Proof weight | 2.5 → 1.5 |
| Precision | bf16 |
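The two decaying quantities in the table follow standard schedules; a sketch, using the usual half-cosine interpolation for the learning rate and a linear ramp for the proof weight (the card does not state the exact shape of the proof-weight decay, so linear is an assumption):

```python
import math

def cosine_lr(step, total_steps, lr_max=1.5e-5, lr_min=1e-6):
    """Half-cosine anneal from lr_max at step 0 down to lr_min at the last step."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

def proof_weight(step, total_steps, w_start=2.5, w_end=1.5):
    """Linear decay of the proof-token weight (assumed schedule)."""
    return w_start + (w_end - w_start) * step / total_steps
```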
Stage 2: Supervised Fine-Tuning (Legal Domain)
The distilled model was fine-tuned on Alignment-Lab-AI/Lawyer-Instruct using TRL's SFTTrainer.
Why legal on top of STEM: Legal reasoning is structurally isomorphic to mathematical reasoning — premise identification, logical chaining, exception handling, structured argumentation toward a conclusion. A model that learned rigorous derivation transfers that structure to legal analysis rather than learning legal templates from scratch.
Training format:
```
### Instruction:
{instruction}
### Response:
{output}
```
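With TRL's SFTTrainer this template is typically applied via a formatting function; a minimal sketch, assuming Lawyer-Instruct-style records expose `instruction` and `output` columns (column names are an assumption):

```python
def format_stage2(example: dict) -> str:
    """Render one SFT sample in the Stage 2 Alpaca-style template."""
    return (
        "### Instruction:\n"
        f"{example['instruction']}\n"
        "### Response:\n"
        f"{example['output']}"
    )
```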
Stage 2 hyperparameters:
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Effective batch size | 8 |
| Learning rate | 5e-6 (lower than Stage 1 to preserve backbone) |
| Gradient checkpointing | Enabled |
| Precision | bf16 |
Model Details
| Attribute | Value |
|---|---|
| Architecture | Qwen3 (causal LM, RoPE, GQA) |
| Parameters | 0.6B |
| Base model | Qwen/Qwen3-0.6B |
| Teacher model | Qwen/Qwen3-30B-A3B-Thinking-2507 |
| Compression ratio | 50x |
| Stage 1 data | 6,122 STEM CoT samples (12 datasets) |
| Stage 2 data | Alignment-Lab-AI/Lawyer-Instruct |
| Context length | 1024 tokens (training) |
| License | Apache 2.0 |
| Developer | Reaperdoesntrun / Convergent Intelligence LLC: Research Division |
Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)

# Legal instruction-following
prompt = """### Instruction:
What is the difference between a felony and a misdemeanor?
### Response:
"""

# STEM derivation (Stage 1 format still works; pass prompt_stem instead of prompt)
prompt_stem = """Solve the following problem carefully and show a rigorous derivation.
Problem:
Compute the determinant of the matrix [[1, 2], [3, 4]].
Proof:
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
GGUF
Quantized versions at reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF.
Prompt Formats
STEM derivation (Stage 1):

```
Solve the following problem carefully and show a rigorous derivation.
Problem:
[Your problem]
Proof:
```

Instruction-following (Stage 2):

```
### Instruction:
[Your question]
### Response:
```
Intended Uses
Good for: Ultra-lightweight reasoning on mobile/edge/IoT, legal and STEM instruction-following, educational tutoring, embedded inference, component in multi-model pipelines, anywhere you need reasoning in under 500MB.
Not for: Formal proof verification, actual legal counsel, safety-critical analysis, complex multi-step proofs (>8 steps), or long-context tasks beyond 1024 tokens.
Limitations
0.6B is a hard capacity constraint. The model trades depth for deployability. It will make reasoning errors that a larger model would not. Multi-step derivations beyond ~8 steps degrade. Legal reasoning covers general concepts but lacks the nuance of larger models. Performance is weakest on underrepresented domains (molecular biology, physiology). Always verify outputs.
Mathematical Foundations: Discrepancy Calculus (DISC)
This model is part of a distillation chain built on Discrepancy Calculus — a measure-theoretic framework where the teacher's output distribution is decomposed via the Mesh Fundamental Identity into smooth (AC), jump, and Cantor components. The discrepancy operator $Df(x) = \lim_{\varepsilon \downarrow 0} \frac{1}{\varepsilon} \int_x^{x+\varepsilon} \frac{|f(t) - f(x)|}{|t - x|} dt$ quantifies local structural mismatch that standard KL divergence averages away.
Full theory: "On the Formal Analysis of Discrepancy Calculus" (Colca, 2026; Convergent Intelligence LLC: Research Division). Full methodology: Structure Over Scale (DOI: 10.57967/hf/8165).
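For differentiable f, the integrand |f(t) - f(x)| / |t - x| tends to |f'(x)| as t → x, so Df(x) should recover |f'(x)| on smooth stretches. A quick numerical check of the operator as defined above (a toy illustration only, not code from the cited theory):

```python
def discrepancy(f, x, eps=1e-3, n=1000):
    """Approximate Df(x) = (1/eps) * integral over [x, x+eps] of
    |f(t) - f(x)| / |t - x| dt, via a midpoint Riemann sum
    (midpoints sidestep the removable singularity at t = x)."""
    h = eps / n
    total = 0.0
    for i in range(n):
        t = x + (i + 0.5) * h
        total += abs(f(t) - f(x)) / abs(t - x) * h
    return total / eps

# For f(x) = x^2 at x = 1, Df(1) should approximate |f'(1)| = 2
d = discrepancy(lambda u: u * u, 1.0)
print(f"Df(1) for f(x)=x^2: {d:.4f}")
```

On the smooth (AC) component Df reduces to |f'|; its value at jump and Cantor parts is where it departs from what a KL average can see.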
Related Models
| Model | Description |
|---|---|
| Qwen3-0.6B-STEM-Proof-Distilled-Thinking | Stage 1 only — pure STEM backbone |
| Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF | This model quantized for edge deployment |
| Qwen3-1.7B-STEM-Proof-Distilled | Larger 1.7B variant (Instruct teacher) |
| Qwen3-1.7B-Distilled-30B-A3B-SFT | Larger 1.7B variant + legal SFT |
Citation
```bibtex
@misc{colca2026thinking06bsft,
  title={Two-Stage Reasoning Transfer at 0.6B: Thinking Teacher Distillation + Legal SFT},
  author={Colca},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/reaperdoesntknow/Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT},
  note={Convergent Intelligence LLC: Research Division}
}
```
Convergent Intelligence LLC: Research Division "Where classical analysis fails to see, we begin."
Convergent Intelligence Portfolio
Part of the Qwen3 0.6B Distillation Series by Convergent Intelligence LLC: Research Division
Related Models
| Model | Downloads | Format |
|---|---|---|
| Qwen3-0.6B-Distilled-30B-A3B | 36 | HF |
| Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF | 203 | GGUF |
Top Models from Our Lab
| Model | Downloads |
|---|---|
| Qwen3-1.7B-Thinking-Distil | 501 |
| LFM2.5-1.2B-Distilled-SFT | 342 |
| Qwen3-1.7B-Coder-Distilled-SFT | 302 |
| Qwen3-1.7B-Coder-Distilled-SFT-GGUF | 194 |
| Qwen3-1.7B-Distilled-30B-A3B-SFT-GGUF | 175 |
Total Portfolio: 41 models | 2,781 total downloads
Last updated: 2026-03-28 12:56 UTC
DistilQwen Collection
This model is part of the DistilQwen proof-weighted distillation series. Collection: 9 models | 2,788 downloads
Teacher Variant Comparison
| Teacher | Student Size | Strength | Models |
|---|---|---|---|
| Qwen3-30B-A3B (Instruct) | 1.7B | Instruction following, structured output, legal reasoning | 3 (833 DL) |
| Qwen3-30B-A3B (Thinking) | 0.6B | Extended deliberation, higher-entropy distributions, proof derivation | 3 (779 DL) ← this model |
| Qwen3-30B-A3B (Coder) | 1.7B | Structured decomposition, STEM derivation, logical inference | 2 (825 DL) |
Methodology
The only BF16 collection in the portfolio. While the broader Convergent Intelligence catalog (43 models, 12,000+ downloads) was trained on CPU at FP32 for $24 total compute, the DistilQwen series was trained on H100 at BF16 with a 30B-parameter teacher. Same methodology, premium hardware. This is what happens when you give the pipeline real compute.
All models use proof-weighted knowledge distillation: 55% cross-entropy with decaying proof weights (2.5× → 1.5×), 45% KL divergence at T=2.0. The proof weight amplifies loss on reasoning-critical tokens, forcing the student to allocate capacity to structural understanding rather than surface-level pattern matching.
Full methodology: Structure Over Scale (DOI: 10.57967/hf/8165)
Related in this series
- Qwen3-0.6B-Distilled-30B-A3B (236 downloads)
- Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF (316 downloads)