metadata
tags:
  - gguf
  - llama.cpp
  - vision-language-model
license: apache-2.0
datasets:
  - TeichAI/Claude-Opus-4.6-Reasoning-887x
  - nphearum/gsm8k-thinking
base_model:
  - Qwen/Qwen3.5-4B
pipeline_tag: question-answering
new_version: nphearum/Qwen3.5-4BxOpus-4.7-Code-Reasoning-Distilled-GGUF

Qwen3.5-4BxOpus-4.6-Code-Reasoning-Full-Distilled-GGUF

A distilled, code-focused variant of Qwen3.5 4B, optimized for efficient local inference in GGUF format. This model targets coding, structured reasoning, and programmatic generation tasks, with optional reasoning traces via thinking mode.


Overview

  • Distilled from Qwen3.5 (4B class)
  • Optimized for llama.cpp inference
  • Strong performance on code and reasoning tasks
  • Supports extended context (practical range: 32K–64K)
  • Compatible with Jinja chat templates
  • Optional thinking mode (may increase latency)

Model Files

  • Qwen3.5-4BxOpus-4.6-Code-Reasoning-Full-Distilled.Q4_K_M.gguf: Q4_K_M quantized model (balances file size against quality)

Example Usage

llama-cli -hf nphearum/Qwen3.5-4BxOpus-4.6-Code-Reasoning-Full-Distilled-GGUF --jinja

Running with llama.cpp

Ensure your build includes:

  • Flash Attention
  • Jinja/chat template support

Server Configuration (Optimized)

llama-server \
  -m Qwen3.5-4BxOpus-4.6-Code-Reasoning-Full-Distilled.Q4_K_M.gguf \
  --port 8001 \
  --alias qwen3.5-4b-opus \
  -c 65536 \
  -n 8192 \
  --no-context-shift \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 40 \
  --repeat-penalty 1.05 \
  --presence-penalty 0.0 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --jinja \
  --chat-template-kwargs "{\"enable_thinking\": true}" \
  -ngl -1
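
Loading a model with a 64K context can take a while, so requests sent right after launch may fail. A minimal sketch of a readiness check follows, assuming the server from the command above listens on port 8001 and that `curl` is installed. It polls the `/health` endpoint that llama-server exposes; the function name `wait_for_server` and the retry count are illustrative choices, not part of llama.cpp.

```shell
#!/bin/sh
# Poll llama-server's /health endpoint until it reports ready.
# Usage: wait_for_server BASE_URL MAX_ATTEMPTS
# (wait_for_server is a hypothetical helper name, not a llama.cpp tool.)
wait_for_server() {
  base_url=$1
  attempts=$2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -s: silent; -f: treat HTTP errors (e.g. 503 while loading) as failure
    if curl -sf --max-time 2 "$base_url/health" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep between attempts, but not after the last one
    if [ "$i" -lt "$attempts" ]; then
      sleep 1
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (commented out so the snippet has no side effects when sourced):
# wait_for_server http://localhost:8001 60 && echo "server ready"
```

The function returns a non-zero status if the server never becomes healthy, so it can gate later steps in a script.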

Parameter Breakdown

Model Loading

  • -m: Path to the GGUF model file to load

  • --alias: Model name exposed through the API (here, qwen3.5-4b-opus)


Context and Output

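  • -c 65536: Context window in tokens (the upper end of the practical 32K–64K range)

  • -n 8192: Maximum number of tokens generated per response

  • --no-context-shift: Disables context shifting, so generation stops when the context is full instead of silently discarding earlier tokens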


Sampling Behavior

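  • --temp 0.6: Sampling temperature; lower values make output less random

  • --top-p 0.95: Nucleus sampling; keeps the smallest set of tokens whose cumulative probability reaches 0.95

  • --top-k 40: Restricts sampling to the 40 most likely tokens

  • --repeat-penalty 1.05: Mildly penalizes recently generated tokens to reduce repetition (important for code)

  • --presence-penalty 0.0: Leaves the presence penalty disabled (no extra penalty for tokens that have already appeared)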


Performance and Memory

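  • -ngl -1: Intended to offload all layers to the GPU. How negative values are interpreted has varied between llama.cpp versions, so check the startup log; a large explicit value such as -ngl 99 is a common way to request full offload

  • --flash-attn on (short form: -fa on): Enables Flash Attention for faster attention

  • --cache-type-k q8_0, --cache-type-v q8_0 (short forms: -ctk, -ctv): Stores the KV cache in 8-bit q8_0 format, which reduces memory usage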
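
To make the memory effect of an 8-bit KV cache concrete, here is a rough estimate for a standard attention stack. The layer count, KV head count, and head dimension are hypothetical placeholders, not the real Qwen3.5-4B values; read the real ones from the llama.cpp load log. Layers that do not use a conventional KV cache would not follow this formula. The q8_0 size assumes the usual block layout of 32 8-bit values plus one 2-byte scale (34 bytes per 32 elements).

```shell
#!/bin/sh
# Rough KV-cache size estimate in MiB for a standard attention stack.
# Usage: kv_cache_mib N_LAYERS N_KV_HEADS HEAD_DIM N_CTX CACHE_TYPE
# All architecture numbers used below are HYPOTHETICAL placeholders.
kv_cache_mib() {
  n_layers=$1; n_kv_heads=$2; head_dim=$3; n_ctx=$4; cache_type=$5
  # Total number of stored K and V elements (factor 2 = keys + values)
  elements=$((2 * n_layers * n_kv_heads * head_dim * n_ctx))
  case "$cache_type" in
    f16)  bytes=$((elements * 2)) ;;        # 2 bytes per element
    q8_0) bytes=$((elements * 34 / 32)) ;;  # 34 bytes per block of 32 elements
    *) echo "unsupported cache type: $cache_type" >&2; return 1 ;;
  esac
  echo $((bytes / 1024 / 1024))
}

kv_cache_mib 32 8 128 65536 f16    # prints 8192
kv_cache_mib 32 8 128 65536 q8_0   # prints 4352
```

Under these placeholder numbers, q8_0 needs roughly 53% of the f16 cache size at a 65536-token context.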


Chat and Reasoning

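  • --jinja: Renders prompts with the model's embedded Jinja chat template

  • --chat-template-kwargs: Passes extra variables to the chat template; here {"enable_thinking": true} turns thinking mode on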

Note: Thinking mode may:

  • Improve reasoning quality
  • Increase latency
  • Produce unstable outputs if not aligned with training

Test Request

curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a Python function to reverse a linked list"}
    ]
  }'
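
The request above relies on the server's default sampling settings. As a sketch of a per-request override, the helper below builds a body with the standard OpenAI-style `model`, `temperature`, and `max_tokens` fields. The function name `chat_payload` and the chosen values are illustrative assumptions; the model name reuses the `--alias` from the server command.

```shell
#!/bin/sh
# Build a chat-completions request body with explicit sampling settings.
# Usage: chat_payload MODEL TEMPERATURE MAX_TOKENS PROMPT
# (chat_payload is a hypothetical helper name.)
# PROMPT is inserted as-is, so it must not contain double quotes,
# backslashes, or newlines; use a JSON tool such as jq for arbitrary text.
chat_payload() {
  printf '{"model": "%s", "temperature": %s, "max_tokens": %s, "messages": [{"role": "user", "content": "%s"}]}\n' \
    "$1" "$2" "$3" "$4"
}

payload=$(chat_payload qwen3.5-4b-opus 0.4 512 "Write a Python function to reverse a linked list")
echo "$payload"

# To send it to the server (commented out so the snippet runs without one):
# curl -s http://localhost:8001/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$payload"
```

In the OpenAI-compatible response format, the generated text is expected under `choices[0].message.content`.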

Recommended Configurations

Coding (more deterministic)

--temp 0.4 --top-p 0.9 --top-k 50 --repeat-penalty 1.1

Reasoning (balanced)

--temp 0.6 --top-p 0.95 --top-k 40

Low VRAM

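-c 32768 -n 4096 --flash-attn off -ngl 20

Note: llama.cpp requires Flash Attention for a quantized V cache, so drop the q8_0 V-cache option when running with --flash-attn off.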
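
The presets above can be kept in one place and selected by name. This is a minimal sketch: the function name `preset_flags` and the preset labels are hypothetical, and the flag strings are copied from the three configurations listed above. It prints the resulting command rather than running it.

```shell
#!/bin/sh
# Map a preset name to the flags from the recommended configurations.
# (preset_flags and the preset names are hypothetical, for illustration.)
preset_flags() {
  case "$1" in
    coding)    echo "--temp 0.4 --top-p 0.9 --top-k 50 --repeat-penalty 1.1" ;;
    reasoning) echo "--temp 0.6 --top-p 0.95 --top-k 40" ;;
    low-vram)  echo "-c 32768 -n 4096 --flash-attn off -ngl 20" ;;
    *) echo "unknown preset: $1" >&2; return 1 ;;
  esac
}

# MODEL is a placeholder path; adjust it to where the GGUF file lives.
MODEL=Qwen3.5-4BxOpus-4.6-Code-Reasoning-Full-Distilled.Q4_K_M.gguf

# Print the full command for a preset instead of running it.
echo "llama-server -m $MODEL --port 8001 --jinja $(preset_flags coding)"
```

Printing the command first makes it easy to review the flags before launching the server.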

Limitations

  • Quality degrades at extreme context lengths (>64K)
  • Thinking mode increases latency
  • Small models (4B) require tighter sampling tuning
  • Performance depends heavily on GPU memory bandwidth

License

Released under Apache-2.0, as declared in the metadata. Also follow the original Qwen license and any additional terms attached to the distillation data.


Credits

  • Base model: Qwen3.5
  • Distillation: Code and reasoning optimization
  • Runtime: llama.cpp ecosystem

Key Takeaway

This configuration is not copied from settings intended for larger models (e.g., 30B+). It is tuned specifically for a 4B model to balance:

  • latency
  • memory usage
  • reasoning quality