---
tags:
- gguf
- llama.cpp
- vision-language-model
license: apache-2.0
datasets:
- TeichAI/Claude-Opus-4.6-Reasoning-887x
- nphearum/gsm8k-thinking
base_model:
- Qwen/Qwen3.5-4B
pipeline_tag: question-answering
---
# Qwen3.5-4BxOpus-4.6-Code-Reasoning-Full-Distilled-GGUF
A distilled, code-focused variant of Qwen3.5 4B, optimized for efficient local inference in GGUF format. This model targets coding, structured reasoning, and programmatic generation tasks, with optional reasoning traces via thinking mode.
---
## Overview
* Distilled from Qwen3.5 (4B class)
* Optimized for llama.cpp inference
* Strong performance on code and reasoning tasks
* Supports extended context (practical range: 32K–64K)
* Compatible with Jinja chat templates
* Optional thinking mode (may increase latency)
---
## Model Files
* `Qwen3.5-4BxOpus-4.6-Code-Reasoning-Full-Distilled.Q4_K_M.gguf`
Quantized model (balanced size vs quality)
---
## Example Usage
```bash
llama-cli -hf nphearum/Qwen3.5-4BxOpus-4.6-Code-Reasoning-Full-Distilled-GGUF --jinja
```
---
## Running with llama.cpp
Ensure your build includes:
* Flash Attention
* Jinja/chat template support
---
## Server Configuration (Optimized)
```bash
llama-server \
-m Qwen3.5-4BxOpus-4.6-Code-Reasoning-Full-Distilled.Q4_K_M.gguf \
--port 8001 \
--alias qwen3.5-4b-opus \
-c 65536 \
-n 8192 \
--no-context-shift \
--temp 0.6 \
--top-p 0.95 \
--top-k 40 \
--repeat-penalty 1.05 \
--presence-penalty 0.0 \
--flash-attn on \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--jinja \
--chat-template-kwargs "{\"enable_thinking\": true}" \
-ngl -1
```
---
## Parameter Breakdown
### Model Loading
* `-m`
Loads the GGUF model file
* `--alias`
Sets a simple API name
---
### Context and Output
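* `-c 65536`
Context window of 65,536 tokens (the upper end of the practical 32K–64K range)
* `-n 8192`
Maximum number of tokens generated per response
* `--no-context-shift`
Disables context shifting, so generation stops when the context fills instead of silently discarding earlier tokens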
---
### Sampling Behavior
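* `--temp 0.6`
Sampling temperature; lower values make output less random
* `--top-p 0.95`
Nucleus sampling: samples from the smallest set of tokens whose cumulative probability reaches 0.95
* `--top-k 40`
Restricts sampling to the 40 most likely tokens
* `--repeat-penalty 1.05`
Reduces repetition (important for code)
* `--presence-penalty 0.0`
Disables the presence penalty, so tokens that have already appeared are not penalized further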
---
### Performance and Memory
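* `-ngl -1`
Intended to request full GPU offload. How `-1` is interpreted can vary between llama.cpp builds; a large explicit value such as `-ngl 99` is a common way to offload every layer
* `--flash-attn on` (short form `-fa on`)
Enables Flash Attention for faster attention
* `--cache-type-k q8_0`, `--cache-type-v q8_0` (short forms `-ctk`, `-ctv`)
8-bit KV cache (reduces memory usage)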
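
To give a feel for how much an 8-bit KV cache can save, here is a rough sizing sketch. It assumes a plain attention stack where every layer stores K and V for every position, which may not match this model's actual architecture. The layer count, KV head count, and head dimension in the example calls are hypothetical placeholders, and `kv_cache_mib` is an illustrative helper, not part of llama.cpp. The `q8_0` cost uses the ggml block layout of 34 bytes per 32 values.

```bash
#!/usr/bin/env bash
# Rough KV cache size estimate in MiB (sketch only).
# Assumes every layer keeps K and V for every context position.
kv_cache_mib() {
  local n_layers=$1 n_kv_heads=$2 head_dim=$3 n_ctx=$4 cache_type=$5
  # Total stored values: K and V, per layer, per KV head, per head dim, per position.
  local elems=$(( 2 * n_layers * n_kv_heads * head_dim * n_ctx ))
  local bytes
  case "$cache_type" in
    f16)  bytes=$(( elems * 2 )) ;;        # 2 bytes per value
    q8_0) bytes=$(( elems * 34 / 32 )) ;;  # 34-byte block per 32 values
    *)    echo "unknown cache type: $cache_type" >&2; return 1 ;;
  esac
  echo $(( bytes / 1024 / 1024 ))
}

# HYPOTHETICAL shape (not this model's real config):
# 32 layers, 8 KV heads, head dimension 128, 65536-token context.
echo "f16:  $(kv_cache_mib 32 8 128 65536 f16) MiB"
echo "q8_0: $(kv_cache_mib 32 8 128 65536 q8_0) MiB"
```

With these placeholder numbers the estimate is 8192 MiB for `f16` and 4352 MiB for `q8_0`. Substitute the values reported in the model's GGUF metadata for a meaningful figure.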
---
### Chat and Reasoning
* `--jinja`
Enables Jinja-based chat template rendering
* `--chat-template-kwargs`
Passes extra variables to the chat template; here `enable_thinking: true` turns on thinking mode
Note: Thinking mode may:
* Improve reasoning quality
* Increase latency
* Produce unstable outputs if not aligned with training
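
Because thinking mode trades latency for reasoning quality, it can help to make the toggle explicit. The sketch below builds the JSON value for `--chat-template-kwargs` without hand-escaped quotes. `thinking_kwargs` is a hypothetical helper, and whether `enable_thinking: false` is honored depends on the chat template shipped with the model.

```bash
# Build the JSON value passed to --chat-template-kwargs.
# thinking_kwargs is a hypothetical helper, not part of llama.cpp.
thinking_kwargs() {
  case "$1" in
    on)  printf '{"enable_thinking": true}\n' ;;
    off) printf '{"enable_thinking": false}\n' ;;
    *)   echo "usage: thinking_kwargs on|off" >&2; return 1 ;;
  esac
}

# Example (not run here): start the server with thinking disabled.
# llama-server -m model.gguf --jinja --chat-template-kwargs "$(thinking_kwargs off)"
```

Quoting the command substitution keeps the JSON as a single argument.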
---
## Test Request
```bash
curl http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Write a Python function to reverse a linked list"}
]
}'
```
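
For prompts that contain quotes or newlines, hand-written JSON breaks easily. The following sketch builds the request body with basic escaping and adds a `max_tokens` cap. `json_escape` and `chat_payload` are hypothetical helpers. The escaping covers only backslashes, double quotes, and newlines, so other control characters are assumed absent. The `model` value reuses the `--alias` from the server configuration.

```bash
# Escape a string for use inside a JSON string literal.
# Handles backslash, double quote, and newline only (sketch, not a full encoder).
json_escape() {
  local s=$1
  s=${s//\\/\\\\}     # backslash -> \\
  s=${s//\"/\\\"}     # double quote -> \"
  s=${s//$'\n'/\\n}   # newline -> \n
  printf '%s' "$s"
}

# Print a chat completion request body for a single user message.
chat_payload() {
  local model=$1 prompt=$2 max_tokens=$3
  printf '{"model": "%s", "max_tokens": %d, "messages": [{"role": "user", "content": "%s"}]}\n' \
    "$(json_escape "$model")" "$max_tokens" "$(json_escape "$prompt")"
}

# Example (not run here; requires the server from the section above):
# curl -s http://localhost:8001/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$(chat_payload qwen3.5-4b-opus 'Write a Python function to reverse a linked list' 1024)"
```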
---
## Recommended Configurations
### Coding (more deterministic)
```bash
--temp 0.4 --top-p 0.9 --top-k 50 --repeat-penalty 1.1
```
### Reasoning (balanced)
```bash
--temp 0.6 --top-p 0.95 --top-k 40
```
### Low VRAM
```bash
-c 32768 -n 4096 --flash-attn off -ngl 20
```
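
One way to switch between these presets is a small wrapper that maps a profile name to its flags. This is a convenience sketch: `profile_flags` and the profile names are hypothetical, and the flag values are copied from the three snippets above.

```bash
# Map a profile name to one of the flag sets listed above.
# profile_flags and the profile names are hypothetical, for illustration only.
profile_flags() {
  case "$1" in
    coding)    echo "--temp 0.4 --top-p 0.9 --top-k 50 --repeat-penalty 1.1" ;;
    reasoning) echo "--temp 0.6 --top-p 0.95 --top-k 40" ;;
    low-vram)  echo "-c 32768 -n 4096 --flash-attn off -ngl 20" ;;
    *)         echo "unknown profile: $1" >&2; return 1 ;;
  esac
}

# Example (not run here). The expansion is left unquoted on purpose so the
# shell splits it into separate arguments:
# llama-server -m model.gguf --jinja $(profile_flags coding)
```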
---
## Limitations
* Quality degrades at extreme context lengths (>64K)
* Thinking mode increases latency
* Small models (4B) require tighter sampling tuning
* Performance depends heavily on GPU memory bandwidth
---
## License
This repository is tagged `apache-2.0`. Also follow the original Qwen license and any additional distillation terms.
---
## Credits
* Base model: Qwen3.5
* Distillation: Code and reasoning optimization
* Runtime: llama.cpp ecosystem
---
## Key Takeaway
These settings are **not copied from configurations for larger models (e.g., 30B+)**.
They are tuned specifically for a 4B model to balance:
* latency
* memory usage
* reasoning quality