Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF

GGUF quantizations of Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled by TheCyberVine

Model Overview

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled is a reasoning model fine-tuned on top of the Qwen3.5 architecture, trained primarily on Chain-of-Thought (CoT) traces distilled from Claude-4.6 Opus interactions.

Key Features

  • Claude-4.6 Opus Reasoning Distillation: Deep distillation and structural imitation of Claude-4.6-Opus reasoning chains
  • Structured Thinking: Uses `<think>` tags for internal reasoning, opening with a "Let me analyze this request carefully: 1..2..3..." pattern
  • Native Developer Role Support: Fully supports the "developer" role without Jinja template patches
  • Full Thinking Mode: Preserves complete chain-of-thought reasoning process (thinking=1)
  • 262K Context Window: Full context with no compromises
  • Coding Agent Optimized: Tested and optimized for Claude Code and OpenCode environments

Quantizations Available

| Quantization | File Size | BPW | Imatrix | Recommended Use |
|--------------|-----------|------|-----------|---------------------------|
| IQ2_S | 9.36 GB | ~2.7 | ✅ Custom | Minimal VRAM, basic tasks |
| IQ3_M | 12.6 GB | ~3.3 | ✅ Custom | Balanced performance |
| TQ3_1S | 13.9 GB | 4.12 | ❌ No | Best 3-bit option |
| IQ4_XS | 15.1 GB | ~4.2 | ✅ Custom | Most users |
| Q8_0 | 28.6 GB | ~8.0 | ❌ No | High quality |

Quantization Details

Custom Imatrix Calibration

The imatrix quantizations (IQ2_S, IQ3_M, IQ4_XS) use a custom importance matrix derived from OpenCode sessions with the following composition:

  • 50% Reasoning - Complex problem-solving and analytical tasks
  • 30% Tools - Command execution, file operations, and tool usage
  • 20% TypeScript - TypeScript code generation and analysis

This calibration targets coding-agent workloads, aiming to preserve reasoning quality while minimizing file size.
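For reference, llama.cpp's `llama-imatrix` tool produces importance matrices of this kind from a calibration corpus. A sketch of the general workflow (the file names below are illustrative, not the exact calibration data used for this repo):

```shell
# Compute an importance matrix from a calibration text file
llama-imatrix -m model-f16.gguf -f calibration.txt -o imatrix.gguf

# Apply it when quantizing to an imatrix-aware type
llama-quantize --imatrix imatrix.gguf model-f16.gguf model-IQ4_XS.gguf IQ4_XS
```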

TQ3_1S - Turbo Quantization

TQ3_1S is a ternary quantization (TQ) optimized for speed:

  • Type: Ternary Quantization - uses three values {-1, 0, 1}
  • Architecture: WHT-rotated 3-bit with dual half-precision scaling
  • Performance: Extremely fast on AVX2 CPUs (up to 2x faster than standard Q4_K)
  • Best for: Users wanting a 3-bit quantization with maximum speed

Ternary quantization uses optimized 8-level Lloyd-Max centroids and Walsh-Hadamard Transform rotation for efficient weight distribution. Note: Actual BPW varies by layer (typically ~3.5-4.0) - verify for your specific model.
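As a toy illustration of the ternary idea (not the actual TQ3_1S kernel, which adds WHT rotation and Lloyd-Max centroids), a block of weights can be reduced to {-1, 0, 1} plus a per-row scale taken as the mean absolute weight:

```shell
# Toy ternary rounding: scale = mean(|w|), then snap each weight to {-1, 0, 1}
echo "0.8 -0.3 0.05 -0.9" | awk '{
  s = 0
  for (i = 1; i <= NF; i++) s += ($i < 0 ? -$i : $i)
  s /= NF                       # per-row scale: mean absolute weight
  for (i = 1; i <= NF; i++) {
    q = ($i > s/2) ? 1 : (($i < -s/2) ? -1 : 0)
    printf "%d ", q             # quantized ternary weight
  }
  print ""
}'
# → 1 -1 0 -1
```

At inference time each weight is reconstructed as `q * scale`, which is why the stored bits per weight stay so low.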

Usage with llama.cpp

Basic Usage

```shell
# Download a quantization
huggingface-cli download superbudvar/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-IQ4_XS.gguf

# Run with llama-cli
llama-cli -m Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-IQ4_XS.gguf -c 4096 -t 8
```

Full 262K Context

```shell
llama-cli -m Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-IQ4_XS.gguf -c 262144 -t 8
```

Using with HuggingFace Hub

```shell
llama-cli --hf-repo superbudvar/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF \
  --hf-file Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-IQ4_XS.gguf -c 262144
```
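For agent frontends such as OpenCode that talk to an OpenAI-compatible endpoint, the model can also be served with `llama-server`. A sketch (the port and context size here are illustrative choices, not values from this card):

```shell
# Serve an OpenAI-compatible API at http://localhost:8080
llama-server -m Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-IQ4_XS.gguf \
  -c 32768 --port 8080
```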

Prompt Template

This model uses the Qwen3 chat template:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Write a hello world program in Python.<|im_end|>
<|im_start|>assistant
```
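The card also advertises a native "developer" role. A hedged sketch of how that might look in the same ChatML format (the authoritative template is the Jinja template embedded in the GGUF metadata; the developer instruction and `<think>` opener below are illustrative):

```
<|im_start|>developer
Always include type hints in generated Python code.<|im_end|>
<|im_start|>user
Write a hello world program in Python.<|im_end|>
<|im_start|>assistant
<think>
Let me analyze this request carefully: ...
</think>
```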

Hardware Requirements

VRAM Requirements (Approximate)

| Quantization | VRAM/RAM Required | Recommended GPU |
|--------------|-------------------|----------------------------|
| IQ2_S | ~10 GB | RTX 3060 / RX 6600 XT |
| IQ3_M | ~13 GB | RTX 3060 12GB / RX 6700 XT |
| TQ3_1S | ~14 GB | RTX 3070 / RX 6750 XT |
| IQ4_XS | ~16 GB | RTX 3070 / RX 6800 XT |
| Q8_0 | ~29 GB | RTX 3090 / RX 7900 XT |
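If a quantization does not fit entirely in VRAM, llama.cpp can offload only part of the layers to the GPU with `-ngl` and keep the remainder on the CPU. A sketch (the layer count is illustrative and should be tuned to your available VRAM):

```shell
# Offload ~30 transformer layers to the GPU, keep the rest in system RAM
llama-cli -m Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-IQ4_XS.gguf \
  -c 8192 -ngl 30
```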

CPU Inference

All quantizations work efficiently on modern CPUs with AVX2 or AVX-512 support. TQ3_1S is particularly optimized for AVX2 CPUs.
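For CPU-only inference, matching the thread count to the number of available cores is a reasonable starting point. A sketch assuming a Linux host (the file name follows the repo's naming pattern):

```shell
# One thread per available core for CPU-only inference
llama-cli -m Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-TQ3_1S.gguf \
  -c 4096 -t "$(nproc)"
```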

Example Reasoning Output

The model demonstrates structured thinking with clear step-by-step reasoning:

```
Let me analyze this request carefully:

1. Identify the core objective of the problem.
2. Break the task into clearly defined subcomponents.
3. Evaluate constraints and edge cases.
4. Formulate a step-by-step solution plan.
5. Execute the reasoning sequentially and verify consistency.
```

This streamlined reasoning paradigm reduces redundant cognitive loops while preserving deep analytical capacity.

License

This quantization inherits the Apache 2.0 license from the base model.

Citation

If you use this model in your research or projects, please cite the original model:

```bibtex
@misc{jackrong_qwen35_opus_distilled,
  title        = {Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled},
  author       = {Jackrong},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/TheCyberVine/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled}}
}
```
Base model: Qwen/Qwen3.5-27B