---
tags:
- gguf
- llama.cpp
- vision-language-model
license: apache-2.0
datasets:
- TeichAI/Claude-Opus-4.6-Reasoning-887x
- nphearum/gsm8k-thinking
base_model:
- Qwen/Qwen3.5-4B
pipeline_tag: question-answering
---
# Qwen3.5-4BxOpus-4.6-Code-Reasoning-Full-Distilled-GGUF

A distilled, code-focused variant of Qwen3.5 4B, optimized for efficient local inference in GGUF format. This model targets coding, structured reasoning, and programmatic generation tasks, with optional reasoning traces via thinking mode.

---

## Overview

* Built on `Qwen/Qwen3.5-4B`, fine-tuned on the reasoning datasets listed in the metadata
* Optimized for llama.cpp inference
* Strong performance on code and reasoning tasks
* Supports extended context (practical range: 32K–64K)
* Compatible with Jinja chat templates
* Optional thinking mode (may increase latency)

---

## Model Files

* `Qwen3.5-4BxOpus-4.6-Code-Reasoning-Full-Distilled.Q4_K_M.gguf`
  Quantized model (balanced size vs quality)

---

## Example Usage

```bash
llama-cli -hf nphearum/Qwen3.5-4BxOpus-4.6-Code-Reasoning-Full-Distilled-GGUF --jinja
```

---

## Running with llama.cpp

Ensure your build includes:

* Flash Attention
* Jinja/chat template support

---

## Server Configuration (Optimized)

```bash
llama-server \
  -m Qwen3.5-4BxOpus-4.6-Code-Reasoning-Full-Distilled.Q4_K_M.gguf \
  --port 8001 \
  --alias qwen3.5-4b-opus \
  -c 65536 \
  -n 8192 \
  --no-context-shift \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 40 \
  --repeat-penalty 1.05 \
  --presence-penalty 0.0 \
  --flash-attn on \
  -ctk q8_0 \
  -ctv q8_0 \
  --jinja \
  --chat-template-kwargs "{\"enable_thinking\": true}" \
  -ngl 99
```

---

## Parameter Breakdown

### Model Loading

* `-m`
  Loads the GGUF model file

* `--alias`
  Sets the model name that the server reports through its API

---

### Context and Output

* `-c 65536`
  Context window of 65,536 tokens (64K), the top of the practical range for this model

* `-n 8192`
  Maximum output tokens

* `--no-context-shift`
  Disables context shifting, so earlier tokens are never silently discarded; generation stops once the context is full
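
With `--no-context-shift`, the prompt and the generated tokens have to fit in the context window together. A minimal sketch of that arithmetic, using the `-c` and `-n` values from the server command (`prompt_budget` is a hypothetical helper name, not part of llama.cpp):

```bash
# Tokens left for the prompt once the maximum output is reserved.
# $1: context size (-c), $2: maximum output tokens (-n).
prompt_budget() {
  echo $(( $1 - $2 ))
}

prompt_budget 65536 8192   # prints 57344
```

A prompt longer than this budget leaves less room than `-n` allows, so generation may stop early.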

---

### Sampling Behavior

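* `--temp 0.6`
  Sampling temperature; lower values make output less random

* `--top-p 0.95`
  Nucleus sampling: samples from the smallest set of tokens whose cumulative probability reaches 0.95

* `--top-k 40`
  Restricts sampling to the 40 most likely tokens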

* `--repeat-penalty 1.05`
  Reduces repetition (important for code)

* `--presence-penalty 0.0`
  Disables the presence penalty, so tokens that have already appeared are not penalized further

---

### Performance and Memory

* `-ngl 99`
  Full GPU offload (any value at or above the model's layer count offloads every layer)

* `--flash-attn on` (short form: `-fa on`)
  Enables Flash Attention for faster attention; llama.cpp requires it for a quantized V cache

* `-ctk q8_0`, `-ctv q8_0`
  8-bit KV cache (roughly halves KV cache memory relative to f16)
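
To see why an 8-bit KV cache matters at long context, the cache size can be estimated from the model shape. The sketch below is illustrative only: the layer count, KV head count, and head dimension are hypothetical placeholders, not the real Qwen3.5-4B architecture, and models with hybrid attention layers may deviate from this formula. It assumes f16 stores 2 bytes per element and `q8_0` stores 34 bytes per block of 32 elements. `kv_cache_mib` is a hypothetical helper name.

```bash
# Estimate KV cache size in MiB.
# $1: layers, $2: KV heads, $3: head dimension, $4: context size, $5: f16 | q8_0
kv_cache_mib() {
  # K and V each hold layers * kv_heads * head_dim elements per token.
  elems=$(( 2 * $1 * $2 * $3 * $4 ))
  case "$5" in
    f16)  bytes=$(( elems * 2 )) ;;        # 2 bytes per element
    q8_0) bytes=$(( elems * 34 / 32 )) ;;  # 32 int8 values plus one f16 scale per block
    *)    echo "unknown cache type: $5" >&2; return 1 ;;
  esac
  echo $(( bytes / 1048576 ))
}

# Hypothetical shape: 32 layers, 8 KV heads, head dimension 128.
kv_cache_mib 32 8 128 65536 f16    # prints 8192
kv_cache_mib 32 8 128 65536 q8_0   # prints 4352
```

Under these assumed numbers, the cache at 64K context drops from 8 GiB to 4.25 GiB. Substitute the real model shape before relying on the result.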

---

### Chat and Reasoning

* `--jinja`
  Enables chat template rendering

* `--chat-template-kwargs`
  Passes JSON arguments to the chat template; here `{"enable_thinking": true}` turns on thinking mode

Note: Thinking mode may:

* Improve reasoning quality
* Increase latency
* Produce unstable outputs if the template's thinking format differs from the one used in training
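
Since thinking mode trades latency for reasoning quality, it can be useful to toggle it per request instead of per server. The sketch below assumes your llama-server build accepts a `chat_template_kwargs` field in the request body, and that the server listens on port 8001 as configured in this README. `build_request` and `ask` are hypothetical helper names.

```bash
# Build a chat request body with thinking toggled per request.
# Assumption: the prompt contains no characters that need JSON escaping
# (quotes, backslashes, newlines); use a JSON tool such as jq otherwise.
# $1: prompt text, $2: true or false
build_request() {
  printf '{"messages":[{"role":"user","content":"%s"}],"chat_template_kwargs":{"enable_thinking":%s}}' \
    "$1" "$2"
}

# Send the request body to the chat completions endpoint.
ask() {
  build_request "$1" "$2" |
    curl -s http://localhost:8001/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d @-
}
```

For example, `ask "Explain binary search" false` would request an answer without a reasoning trace, if the template honors `enable_thinking`.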

---

## Test Request

```bash
curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a Python function to reverse a linked list"}
    ]
  }'
```
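
The request above fails if the model is still loading. One option is to poll the server first. This sketch assumes your llama-server build exposes the `/health` endpoint and that it returns HTTP 200 once the model is ready; `wait_for_server` is a hypothetical helper name.

```bash
# Poll the server's /health endpoint until it answers with HTTP 200.
# $1: base URL, $2: maximum attempts, $3: seconds to sleep between attempts.
wait_for_server() {
  url="$1"; tries="$2"; delay="$3"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl exit non-zero on HTTP errors such as 503 (still loading).
    if curl -sf --max-time 5 -o /dev/null "$url/health"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example: wait up to 60 attempts, 2 seconds apart, then send the test request.
# wait_for_server http://localhost:8001 60 2 && echo "server ready"
```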

---

## Recommended Configurations

### Coding (low randomness)

```bash
--temp 0.4 --top-p 0.9 --top-k 50 --repeat-penalty 1.1
```

### Reasoning (balanced)

```bash
--temp 0.6 --top-p 0.95 --top-k 40
```

### Low VRAM

```bash
-c 32768 -n 4096 -ngl 20
```
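
To switch between the coding and reasoning sampling presets without retyping flags, a small wrapper can map a preset name to its flags. This is a sketch, not part of llama.cpp: `sampling_flags` and `start_server` are hypothetical names, and the model file and port are the ones used in the server command in this README.

```bash
# Print the sampling flags for a named preset.
sampling_flags() {
  case "$1" in
    coding)    echo "--temp 0.4 --top-p 0.9 --top-k 50 --repeat-penalty 1.1" ;;
    reasoning) echo "--temp 0.6 --top-p 0.95 --top-k 40" ;;
    *)         echo "unknown preset: $1" >&2; return 1 ;;
  esac
}

# Start the server with the chosen preset.
# $flags is left unquoted on purpose so it splits into separate arguments.
start_server() {
  flags=$(sampling_flags "$1") || return 1
  llama-server \
    -m Qwen3.5-4BxOpus-4.6-Code-Reasoning-Full-Distilled.Q4_K_M.gguf \
    --port 8001 --jinja $flags
}
```

Running `start_server coding` would launch the server with the coding preset; an unknown preset name returns an error before anything is started.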

---

## Limitations

* Quality degrades at extreme context lengths (>64K)
* Thinking mode increases latency
* Small models (4B) require tighter sampling tuning
* Performance depends heavily on GPU memory bandwidth

---

## License

This model card declares `apache-2.0`. Also follow the original Qwen license and any additional distillation terms.

---

## Credits

* Base model: Qwen3.5
* Distillation: Code and reasoning optimization
* Runtime: llama.cpp ecosystem

---

## Key Takeaway

This configuration is **not copied from settings meant for larger models (e.g., 30B+)**.
It is tuned specifically for a 4B model to balance:

* latency
* memory usage
* reasoning quality