Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF

GGUF quantizations of Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled by TheCyberVine

Model Overview

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled is a reasoning model fine-tuned on top of the Qwen3.5 architecture, trained primarily on Chain-of-Thought (CoT) traces distilled from Claude-4.6 Opus interactions.

Key Features

  • Claude-4.6 Opus Reasoning Distillation: Deep distillation and structural imitation of Claude-4.6-Opus reasoning chains
  • Structured Thinking: Uses `<think>` tags for internal reasoning, opening with a "Let me analyze this request carefully: 1..2..3..." pattern
  • Native Developer Role Support: Fully supports the "developer" role without Jinja template patches
  • Full Thinking Mode: Preserves complete chain-of-thought reasoning process (thinking=1)
  • 262K Context Window: Full context with no compromises
  • Coding Agent Optimized: Tested and optimized for Claude Code and OpenCode environments

Quantizations Available

| Quantization | File Size | BPW | Imatrix | Recommended Use |
|--------------|-----------|------|-----------|---------------------------|
| IQ2_S | 9.36 GB | ~2.7 | ✅ Custom | Minimal VRAM, basic tasks |
| IQ3_M | 12.6 GB | ~3.3 | ✅ Custom | Balanced performance |
| TQ3_1S | 13.9 GB | 4.12 | ❌ No | Best 3-bit option |
| IQ4_XS | 15.1 GB | ~4.2 | ✅ Custom | Most users |
| Q8_0 | 28.6 GB | ~8.0 | ❌ No | High quality |

Quantization Details

Custom Imatrix Calibration

The imatrix quantizations (IQ2_S, IQ3_M, IQ4_XS) use a custom importance matrix derived from OpenCode sessions with the following composition:

  • 50% Reasoning - Complex problem-solving and analytical tasks
  • 30% Tools - Command execution, file operations, and tool usage
  • 20% TypeScript - TypeScript code generation and analysis

This calibration targets coding-agent workloads, aiming to preserve reasoning quality while minimizing file size.
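For reference, llama.cpp's `llama-imatrix` tool produces importance matrices of this kind from a calibration corpus. A sketch of the general workflow (the file names below are illustrative, not the exact calibration data used for this repo):

```shell
# Compute an importance matrix from a calibration text file
llama-imatrix -m model-f16.gguf -f calibration.txt -o imatrix.gguf

# Apply it when quantizing to an imatrix-aware type
llama-quantize --imatrix imatrix.gguf model-f16.gguf model-IQ4_XS.gguf IQ4_XS
```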

TQ3_1S - Turbo Quantization

TQ3_1S is a ternary quantization (TQ) optimized for speed:

  • Type: Ternary Quantization - uses three values {-1, 0, 1}
  • Architecture: WHT-rotated 3-bit with dual half-precision scaling
  • Performance: Extremely fast on AVX2 CPUs (up to 2x faster than standard Q4_K)
  • Best for: Users wanting a 3-bit quantization with maximum speed

Ternary quantization uses optimized 8-level Lloyd-Max centroids and Walsh-Hadamard Transform rotation for efficient weight distribution. Note: Actual BPW varies by layer (typically ~3.5-4.0) - verify for your specific model.
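As a toy illustration of the ternary idea (not the actual TQ3_1S kernel, which adds WHT rotation and Lloyd-Max centroids), a block of weights can be reduced to {-1, 0, 1} plus a per-row scale taken as the mean absolute weight:

```shell
# Toy ternary rounding: scale = mean(|w|), then snap each weight to {-1, 0, 1}
echo "0.8 -0.3 0.05 -0.9" | awk '{
  s = 0
  for (i = 1; i <= NF; i++) s += ($i < 0 ? -$i : $i)
  s /= NF                       # per-row scale: mean absolute weight
  for (i = 1; i <= NF; i++) {
    q = ($i > s/2) ? 1 : (($i < -s/2) ? -1 : 0)
    printf "%d ", q             # quantized ternary weight
  }
  print ""
}'
# → 1 -1 0 -1
```

At inference time each weight is reconstructed as `q * scale`, which is why the stored bits per weight stay so low.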

Usage with llama.cpp

Basic Usage

```shell
# Download a quantization
huggingface-cli download superbudvar/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-IQ4_XS.gguf

# Run with llama-cli
llama-cli -m Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-IQ4_XS.gguf -c 4096 -t 8
```

Full 262K Context

```shell
llama-cli -m Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-IQ4_XS.gguf -c 262144 -t 8
```

Using with HuggingFace Hub

```shell
llama-cli --hf-repo superbudvar/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF \
  --hf-file Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-IQ4_XS.gguf -c 262144
```
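For agent frontends such as OpenCode that talk to an OpenAI-compatible endpoint, the model can also be served with `llama-server`. A sketch (the port and context size here are illustrative choices, not values from this card):

```shell
# Serve an OpenAI-compatible API at http://localhost:8080
llama-server -m Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-IQ4_XS.gguf \
  -c 32768 --port 8080
```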

Prompt Template

This model uses the Qwen3 chat template:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Write a hello world program in Python.<|im_end|>
<|im_start|>assistant
```
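The card also advertises a native "developer" role. A hedged sketch of how that might look in the same ChatML format (the authoritative template is the Jinja template embedded in the GGUF metadata; the developer instruction and `<think>` opener below are illustrative):

```
<|im_start|>developer
Always include type hints in generated Python code.<|im_end|>
<|im_start|>user
Write a hello world program in Python.<|im_end|>
<|im_start|>assistant
<think>
Let me analyze this request carefully: ...
</think>
```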

Hardware Requirements

VRAM Requirements (Approximate)

| Quantization | VRAM/RAM Required | Recommended GPU |
|--------------|-------------------|----------------------------|
| IQ2_S | ~10 GB | RTX 3060 / RX 6600 XT |
| IQ3_M | ~13 GB | RTX 3060 12GB / RX 6700 XT |
| TQ3_1S | ~14 GB | RTX 3070 / RX 6750 XT |
| IQ4_XS | ~16 GB | RTX 3070 / RX 6800 XT |
| Q8_0 | ~29 GB | RTX 3090 / RX 7900 XT |
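If a quantization does not fit entirely in VRAM, llama.cpp can offload only part of the layers to the GPU with `-ngl` and keep the remainder on the CPU. A sketch (the layer count is illustrative and should be tuned to your available VRAM):

```shell
# Offload ~30 transformer layers to the GPU, keep the rest in system RAM
llama-cli -m Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-IQ4_XS.gguf \
  -c 8192 -ngl 30
```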

CPU Inference

All quantizations work efficiently on modern CPUs with AVX2 or AVX-512 support. TQ3_1S is particularly optimized for AVX2 CPUs.
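For CPU-only inference, matching the thread count to the number of available cores is a reasonable starting point. A sketch assuming a Linux host (the file name follows the repo's naming pattern):

```shell
# One thread per available core for CPU-only inference
llama-cli -m Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-TQ3_1S.gguf \
  -c 4096 -t "$(nproc)"
```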

Example Reasoning Output

The model demonstrates structured thinking with clear step-by-step reasoning:

```
Let me analyze this request carefully:

1. Identify the core objective of the problem.
2. Break the task into clearly defined subcomponents.
3. Evaluate constraints and edge cases.
4. Formulate a step-by-step solution plan.
5. Execute the reasoning sequentially and verify consistency.
```

This streamlined reasoning paradigm reduces redundant cognitive loops while preserving deep analytical capacity.

License

This quantization inherits the Apache 2.0 license from the base model.

Citation

If you use this model in your research or projects, please cite the original model:

```bibtex
@misc{jackrong_qwen35_opus_distilled,
  title        = {Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled},
  author       = {Jackrong},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/TheCyberVine/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled}}
}
```
Base model: Qwen/Qwen3.5-27B