A newer version of this model is available: nphearum/Qwen3.5-4BxOpus-4.7-Code-Reasoning-Distilled-GGUF

Qwen3.5-4BxOpus-4.6-Code-Reasoning-Full-Distilled-GGUF

A distilled, code-focused variant of Qwen3.5 4B, optimized for efficient local inference in GGUF format. This model targets coding, structured reasoning, and programmatic generation tasks, with optional reasoning traces via thinking mode.


Overview

  • Distilled from Qwen3.5 (4B class)
  • Optimized for llama.cpp inference
  • Tuned for code and reasoning tasks
  • Supports extended context (practical range: 32K–64K)
  • Compatible with Jinja chat templates
  • Optional thinking mode (may increase latency)

Model Files

  • Qwen3.5-4BxOpus-4.6-Code-Reasoning-Full-Distilled.Q4_K_M.gguf: Quantized model (balanced size vs quality)

Example Usage

llama-cli -hf nphearum/Qwen3.5-4BxOpus-4.6-Code-Reasoning-Full-Distilled-GGUF --jinja

Running with llama.cpp

Ensure your build includes:

  • Flash Attention
  • Jinja/chat template support

Server Configuration (Optimized)

llama-server \
  -m Qwen3.5-4BxOpus-4.6-Code-Reasoning-Full-Distilled.Q4_K_M.gguf \
  --port 8001 \
  --alias qwen3.5-4b-opus \
  -c 65536 \
  -n 8192 \
  --no-context-shift \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 40 \
  --repeat-penalty 1.05 \
  --presence-penalty 0.0 \
  --flash-attn on \
  -ctk q8_0 \
  -ctv q8_0 \
  --jinja \
  --chat-template-kwargs "{\"enable_thinking\": true}" \
  -ngl -1

Parameter Breakdown

Model Loading

  • -m: Loads the GGUF model file

  • --alias: Sets the model name exposed through the API


Context and Output

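  • -c 65536: Context window in tokens (64K, the upper end of the practical range for this model)

  • -n 8192: Maximum number of tokens generated per response

  • --no-context-shift: Disables context shifting, so earlier tokens are never silently discarded; generation stops once the context is full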
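
With context shifting disabled, the prompt and the generated output have to fit in the window together, so the three settings above reduce to simple token arithmetic. The following is a minimal sketch of that budget check; the function names are hypothetical, and the prompt token count is assumed to be known already.

```python
N_CTX = 65536       # matches -c 65536
MAX_OUTPUT = 8192   # matches -n 8192


def prompt_budget(n_ctx: int = N_CTX, max_output: int = MAX_OUTPUT) -> int:
    """Largest prompt (in tokens) that still leaves room for a full-length reply."""
    return n_ctx - max_output


def fits_context(prompt_tokens: int, n_ctx: int = N_CTX,
                 max_output: int = MAX_OUTPUT) -> bool:
    """True if the prompt plus a maximum-length reply fits in the context window."""
    return prompt_tokens + max_output <= n_ctx
```

Under these settings the prompt can use up to 57,344 tokens before a full 8,192-token reply would no longer fit.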


Sampling Behavior

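  • --temp 0.6: Sampling temperature; lower values make the output less random

  • --top-p 0.95: Nucleus sampling; keeps the smallest set of tokens whose cumulative probability reaches 0.95

  • --top-k 40: Restricts sampling to the 40 most likely tokens

  • --repeat-penalty 1.05: Mild penalty on repeated tokens to reduce repetition (important for code)

  • --presence-penalty 0.0: Presence penalty disabled; tokens that have already appeared receive no extra penalty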
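
To make the interaction of these parameters concrete, here is a simplified sketch of top-k, top-p, and temperature applied to a toy logit table. It is an illustration only: the function names are hypothetical, the stage order is an assumption, and llama.cpp's real sampler chain has more stages and a configurable order.

```python
import math


def softmax(logits, temp=1.0):
    """Convert a {token: logit} mapping to probabilities (temp must be > 0)."""
    scaled = {tok: val / temp for tok, val in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(val - peak) for tok, val in scaled.items()}
    total = sum(exps.values())
    return {tok: val / total for tok, val in exps.items()}


def filter_candidates(logits, top_k=40, top_p=0.95, temp=0.6):
    """Return the renormalized probabilities of the tokens that survive filtering."""
    # Top-k: keep only the k highest-scoring tokens.
    ranked = sorted(logits, key=logits.get, reverse=True)[:top_k]
    probs = softmax({tok: logits[tok] for tok in ranked})

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    nucleus, cumulative = [], 0.0
    for tok in ranked:
        nucleus.append(tok)
        cumulative += probs[tok]
        if cumulative >= top_p:
            break

    # Temperature: rescale the surviving logits, then renormalize.
    return softmax({tok: logits[tok] for tok in nucleus}, temp=temp)
```

Lowering the temperature concentrates probability on the top candidate, which is why the coding preset uses a lower value than the reasoning preset.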


Performance and Memory

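  • -ngl -1: Requests full GPU offload. How -1 is interpreted has varied between llama.cpp versions, so an explicit value at or above the model's layer count (for example -ngl 99) is the portable way to offload every layer

  • --flash-attn on (short form: -fa on): Enables Flash Attention for faster attention

  • -ctk q8_0, -ctv q8_0 (long forms: --cache-type-k, --cache-type-v): 8-bit KV cache (reduces memory usage)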
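
The memory effect of an 8-bit KV cache can be estimated with a short calculation. This is a rough sketch under stated assumptions: the layer count, KV head count, and head dimension below are placeholders chosen for round numbers, not the real Qwen3.5-4B configuration, and n_layers should count only the layers that actually keep a KV cache. The q8_0 figure uses the GGML block layout of 32 8-bit values plus one 16-bit scale (34 bytes per 32 elements).

```python
F16_BYTES = 2.0        # 16 bits per cached element
Q8_0_BYTES = 34 / 32   # q8_0 block: 32 int8 values plus one 16-bit scale


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Approximate KV cache size in bytes for a full context window."""
    elems_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = one K and one V entry
    return int(n_ctx * elems_per_token * bytes_per_elem)


# Placeholder shape for illustration only; NOT the real Qwen3.5-4B configuration.
HYPOTHETICAL_SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128)

f16 = kv_cache_bytes(65536, bytes_per_elem=F16_BYTES, **HYPOTHETICAL_SHAPE)
q8 = kv_cache_bytes(65536, bytes_per_elem=Q8_0_BYTES, **HYPOTHETICAL_SHAPE)
print(f"f16: {f16 / 2**30:.2f} GiB, q8_0: {q8 / 2**30:.2f} GiB")
# prints "f16: 8.00 GiB, q8_0: 4.25 GiB" for the placeholder shape
```

Whatever the real shape is, the ratio stays the same: q8_0 needs about 53% of the memory of an f16 cache.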


Chat and Reasoning

  • --jinja: Enables Jinja chat template rendering

  • --chat-template-kwargs: Passes extra arguments to the chat template; {"enable_thinking": true} turns thinking mode on

Note: Thinking mode may:

  • Improve reasoning quality
  • Increase latency
  • Produce unstable outputs when the prompt format does not match what the model saw during training
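
When thinking mode is on, client code usually needs to separate the reasoning trace from the final answer. The sketch below handles two cases, both of which are assumptions about the response shape rather than guarantees: the server returning the trace in a separate reasoning_content field, or the trace arriving inline between <think> and </think> tags. The helper name split_reasoning is hypothetical.

```python
import re

# Matches an inline reasoning block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(message: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from a chat-completion message dict."""
    content = message.get("content") or ""
    reasoning = message.get("reasoning_content")
    if reasoning is not None:
        # Case 1: the server already separated the trace from the answer.
        return reasoning.strip(), content.strip()
    # Case 2: the trace is embedded in the content as <think>...</think>.
    traces = THINK_BLOCK.findall(content)
    answer = THINK_BLOCK.sub("", content)
    return "\n".join(t.strip() for t in traces), answer.strip()
```

If neither form is present, the function returns an empty reasoning string and the content unchanged, so it is safe to call with thinking mode off.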

Test Request

curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a Python function to reverse a linked list"}
    ]
  }'
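
The same request can be sent from Python using only the standard library. This is a minimal sketch, not an official client: the helper names are hypothetical, the URL and model name mirror the --port and --alias values from the server configuration, and the chat_template_kwargs request field is assumed to be accepted by your llama-server build.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8001"   # matches --port 8001
MODEL_ALIAS = "qwen3.5-4b-opus"      # matches --alias


def build_chat_request(prompt, max_tokens=1024, enable_thinking=None):
    """Build the JSON payload for /v1/chat/completions."""
    payload = {
        "model": MODEL_ALIAS,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    if enable_thinking is not None:
        # Assumption: the server accepts per-request chat template arguments.
        payload["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    return payload


def send_chat_request(payload, base_url=BASE_URL, timeout=300):
    """POST the payload and return the first choice's message dict."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]
```

With the server running, send_chat_request(build_chat_request("Write a Python function to reverse a linked list")) returns the assistant message. Building the payload separately keeps the request shape easy to inspect without a live server.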

Recommended Configurations

Coding (more deterministic)

--temp 0.4 --top-p 0.9 --top-k 50 --repeat-penalty 1.1

Reasoning (balanced)

--temp 0.6 --top-p 0.95 --top-k 40

Low VRAM

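-c 32768 -n 4096 --flash-attn off -ngl 20

When applying this preset to the server configuration, also drop the 8-bit V cache option (the -ctv / --cache-type-v flag): llama.cpp requires Flash Attention to quantize the V cache.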
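
If you switch between the sampling presets often, it can help to keep them in one place and render the flags programmatically. The sketch below covers only the coding and reasoning presets; the PRESETS table and the preset_flags helper are hypothetical names, and the values are copied from the presets listed in this section.

```python
# Sampling presets keyed by llama-server flag name (without the leading dashes).
PRESETS = {
    "coding": {"temp": 0.4, "top-p": 0.9, "top-k": 50, "repeat-penalty": 1.1},
    "reasoning": {"temp": 0.6, "top-p": 0.95, "top-k": 40},
}


def preset_flags(name):
    """Render a preset as a list of command-line arguments."""
    args = []
    for flag, value in PRESETS[name].items():
        args += [f"--{flag}", str(value)]
    return args


print(" ".join(preset_flags("coding")))
# prints "--temp 0.4 --top-p 0.9 --top-k 50 --repeat-penalty 1.1"
```

The returned list can be appended directly to an argument list passed to subprocess.run when launching llama-server from a script.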

Limitations

  • Quality degrades at extreme context lengths (>64K)
  • Thinking mode increases latency
  • Small models (4B) require tighter sampling tuning
  • Performance depends heavily on GPU memory bandwidth

License

Follow the original Qwen license and any additional distillation terms.


Credits

  • Base model: Qwen3.5
  • Distillation: Code and reasoning optimization
  • Runtime: llama.cpp ecosystem

Key Takeaway

This configuration is not copied from settings intended for larger models (e.g., 30B+). It is tuned specifically for a 4B model to balance:

  • latency
  • memory usage
  • reasoning quality