empero-ai committed · Commit cc9955b · verified · 1 Parent(s): 963a36d

Update README.md

---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3.5-9B
library_name: transformers
tags:
- qwen
- claude
- opus
- reasoning
- distill
datasets:
- nohurry/Opus-4.6-Reasoning-3000x-filtered
- Jackrong/Qwen3.5-reasoning-700x
- TeichAI/claude-4.5-opus-high-reasoning-250x
- Roman1111111/claude-opus-4.6-10000x
---
# Qwen3.5-9B Claude Opus 4.6 Reasoning Distill — GGUF

GGUF quantizations of [empero-ai/Qwen3.5-9B-Claude-Opus-4.6-Distill](https://huggingface.co/empero-ai/Qwen3.5-9B-Claude-Opus-4.6-Distill), a reasoning-focused fine-tune of [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B).

This model was trained to produce detailed chain-of-thought reasoning inside `<think>` tags before giving its final answer, distilled from Claude Opus 4.6 and Qwen3.5 reasoning traces.
## Quantizations

| File | Quant | Size | Description |
|------|-------|------|-------------|
| `qwen3.5-9b-opus4.6-distill-Q2_K.gguf` | Q2_K | ~3.5 GB | Smallest, lowest quality. For very constrained devices. |
| `qwen3.5-9b-opus4.6-distill-Q3_K_M.gguf` | Q3_K_M | ~4.5 GB | Low quality, usable for testing. |
| `qwen3.5-9b-opus4.6-distill-Q4_K_M.gguf` | Q4_K_M | ~5.5 GB | **Recommended.** Best balance of quality and size. |
| `qwen3.5-9b-opus4.6-distill-Q5_K_M.gguf` | Q5_K_M | ~6.5 GB | High quality, moderate size. |
| `qwen3.5-9b-opus4.6-distill-Q6_K.gguf` | Q6_K | ~7.5 GB | Very high quality, near-lossless. |
| `qwen3.5-9b-opus4.6-distill-Q8_0.gguf` | Q8_0 | ~9.5 GB | Highest quality quantization. |
| `qwen3.5-9b-opus4.6-distill-f16.gguf` | F16 | ~18 GB | Full precision, no quantization loss. |

For most users, **Q4_K_M** or **Q5_K_M** is the sweet spot.
## How to Use

### llama.cpp

```bash
llama-cli -m qwen3.5-9b-opus4.6-distill-Q5_K_M.gguf -p "<|im_start|>system\nYou are a deep reasoning AI. Think carefully inside <think> tags before answering.<|im_end|>\n<|im_start|>user\nExplain why the sky is blue.<|im_end|>\n<|im_start|>assistant\n" -n 2048
```
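The prompt in the command above follows the ChatML template Qwen models use; assembling it in code avoids shell-escaping mistakes. A minimal sketch (the system message is just the example from the command):

```python
def build_chatml_prompt(system: str, user: str) -> str:
    """Assemble a ChatML prompt like the one passed to llama-cli above."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = build_chatml_prompt(
    "You are a deep reasoning AI. Think carefully inside <think> tags before answering.",
    "Explain why the sky is blue.",
)
print(prompt)
```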
### Ollama

```bash
ollama run empero-ai/qwen3.5-9b-opus4.6-distill
```

### LM Studio / GPT4All / Jan

Download the GGUF file of your choice and load it directly in the application.
## Training Details

### Method

- **Stage 1 — SFT (Supervised Fine-Tuning):** 3 epochs on ~13K examples teaching the model the `<think>` reasoning format using QLoRA (4-bit, rank 64, alpha 128)
- **Base model:** Qwen/Qwen3.5-9B
- **Hardware:** RTX 5090 (32GB VRAM)
- **Attention:** SDPA
- **Optimizer:** Paged AdamW 8-bit
- **Learning rate:** 1e-4 with cosine schedule
- **Effective batch size:** 8 (batch 1 × gradient accumulation 8)
- **Max sequence length:** 4096
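The hyperparameters above map onto a QLoRA setup roughly like the following. This is a hypothetical reconstruction from the bullet list using `transformers` and `peft`, not the author's actual training script; model and dataset loading (and the 4096 max sequence length, which would be passed to the trainer) are elided.

```python
# Hypothetical reconstruction of the SFT config from the bullets above.
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(        # 4-bit base model (QLoRA)
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(               # rank 64, alpha 128
    r=64,
    lora_alpha=128,
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="qwen3.5-9b-opus4.6-distill-sft",
    num_train_epochs=3,
    per_device_train_batch_size=1,      # effective batch 8 via accumulation
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit",
    eval_strategy="epoch",              # required for load_best_model_at_end
    save_strategy="epoch",
    load_best_model_at_end=True,        # selects the epoch-2 checkpoint
)
```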
### SFT Results

| Metric | Epoch 1 | Epoch 2 (best) | Epoch 3 |
|--------|---------|-----------------|---------|
| Eval Loss | 0.5205 | **0.4809** | 0.4915 |
| Eval Token Accuracy | 0.8494 | **0.8615** | 0.8617 |
| Eval Entropy | 0.508 | 0.434 | 0.394 |

Best checkpoint (epoch 2) was selected via `load_best_model_at_end`.
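With the default metric, `load_best_model_at_end` amounts to keeping the epoch with the lowest eval loss. A toy sketch using the numbers from the table above:

```python
# Eval losses per epoch, from the SFT Results table above.
eval_loss = {1: 0.5205, 2: 0.4809, 3: 0.4915}

# Lowest eval loss wins, matching the epoch-2 checkpoint selection.
best_epoch = min(eval_loss, key=eval_loss.get)
print(best_epoch)  # 2
```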
### Datasets

| Dataset | Examples | Type |
|---------|----------|------|
| [nohurry/Opus-4.6-Reasoning-3000x-filtered](https://huggingface.co/datasets/nohurry/Opus-4.6-Reasoning-3000x-filtered) | 2,326 | Problem → thinking → solution |
| [Jackrong/Qwen3.5-reasoning-700x](https://huggingface.co/datasets/Jackrong/Qwen3.5-reasoning-700x) | 633 | ShareGPT with `<think>` tags |
| [TeichAI/claude-4.5-opus-high-reasoning-250x](https://huggingface.co/datasets/TeichAI/claude-4.5-opus-high-reasoning-250x) | 250 | Messages with `<think>` tags |
| [Roman1111111/claude-opus-4.6-10000x](https://huggingface.co/datasets/Roman1111111/claude-opus-4.6-10000x) | 9,631 | Messages with reasoning traces |
| **Total** | **12,840** | |
### Output Format

The model outputs reasoning in `<think>` tags followed by its final answer:

```
<think>
The user is asking about why the sky appears blue. This involves Rayleigh scattering...

Sunlight contains all wavelengths of visible light. When it enters Earth's atmosphere,
shorter wavelengths (blue/violet) scatter more than longer wavelengths (red/orange)...

While violet actually scatters more than blue, our eyes are more sensitive to blue light,
and some violet is absorbed by the upper atmosphere...
</think>

The sky appears blue due to Rayleigh scattering. When sunlight passes through Earth's
atmosphere, the shorter blue wavelengths scatter in all directions more than the longer
red wavelengths. Although violet light scatters even more, our eyes are more sensitive
to blue, and some violet is absorbed higher in the atmosphere — so we perceive the sky
as blue.
```
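Downstream code usually wants the final answer separated from the reasoning. A minimal sketch of splitting on the `<think>` block (this assumes the model emits at most one well-formed block, which is the trained format but not guaranteed at inference time):

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split model output into (reasoning, final_answer).

    If no <think>...</think> block is found, the whole text is
    treated as the answer and the reasoning is empty.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

sample = "<think>\nRayleigh scattering...\n</think>\n\nThe sky appears blue."
thinking, answer = split_reasoning(sample)
print(answer)  # The sky appears blue.
```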
## License

This model inherits the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) license from Qwen3.5-9B.