---
tags:
- gguf
- llama.cpp
- vision-language-model
license: apache-2.0
datasets:
- TeichAI/Claude-Opus-4.6-Reasoning-887x
- nphearum/gsm8k-thinking
base_model:
- Qwen/Qwen3.5-4B
pipeline_tag: question-answering
---

# Qwen3.5-4BxOpus-4.6-Code-Reasoning-Full-Distilled-GGUF

A distilled, code-focused variant of Qwen3.5 4B, packaged in GGUF format for efficient local inference. The model targets coding, structured reasoning, and programmatic generation tasks, with optional reasoning traces via thinking mode.

---

## Overview

* Built on `Qwen/Qwen3.5-4B` and distilled using the reasoning datasets listed in the metadata
* Optimized for llama.cpp inference
* Strong performance on code and reasoning tasks
* Supports extended context (practical range: 32K–64K)
* Compatible with Jinja chat templates
* Optional thinking mode (may increase latency)

---

## Model Files

* `Qwen3.5-4BxOpus-4.6-Code-Reasoning-Full-Distilled.Q4_K_M.gguf`: Quantized model (balanced size vs quality)

---

## Example Usage

```bash
llama-cli -hf nphearum/Qwen3.5-4BxOpus-4.6-Code-Reasoning-Full-Distilled-GGUF --jinja
```

---

## Running with llama.cpp

Ensure your build includes:

* Flash Attention
* Jinja/chat template support

---

## Server Configuration (Optimized)

```bash
llama-server \
  -m Qwen3.5-4BxOpus-4.6-Code-Reasoning-Full-Distilled.Q4_K_M.gguf \
  --port 8001 \
  --alias qwen3.5-4b-opus \
  -c 65536 \
  -n 8192 \
  --no-context-shift \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 40 \
  --repeat-penalty 1.05 \
  --presence-penalty 0.0 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --jinja \
  --chat-template-kwargs "{\"enable_thinking\": true}" \
  -ngl -1
```

---

## Parameter Breakdown

### Model Loading

* `-m`: Loads the GGUF model file
* `--alias`: Sets a short model name for the API

---

### Context and Output

* `-c 65536`: Context window of 64K tokens, the top of the practical range for this model
* `-n 8192`: Maximum output tokens
* `--no-context-shift`: Prevents automatic truncation of earlier tokens when the context fills up

---

### Sampling Behavior

* `--temp 0.6`: Controls randomness
* `--top-p 0.95`: Nucleus sampling
* `--top-k 40`: Limits token candidates to the 40 most likely
* `--repeat-penalty 1.05`: Mildly penalizes repeated tokens (important for code)
* `--presence-penalty 0.0`: Disables the presence penalty, so tokens that have already appeared receive no extra penalty

---

### Performance and Memory

* `-ngl -1`: Intended as full GPU offload. How negative values are interpreted depends on the llama.cpp version, so a large explicit value such as `-ngl 99` is a common alternative
* `--flash-attn on` (short form `-fa on`): Enables Flash Attention for faster attention
* `--cache-type-k q8_0`, `--cache-type-v q8_0` (short forms `-ctk`, `-ctv`): 8-bit KV cache (reduces memory usage)

---

### Chat and Reasoning

* `--jinja`: Enables chat template rendering
* `--chat-template-kwargs`: Passes `enable_thinking` to the chat template, which turns on thinking mode

Note: Thinking mode may:

* Improve reasoning quality
* Increase latency
* Produce unstable outputs if the prompt format is not aligned with training

---

## Test Request

```bash
curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a Python function to reverse a linked list"}
    ]
  }'
```

---

## Recommended Configurations

### Coding (lower randomness)

```bash
--temp 0.4 --top-p 0.9 --top-k 50 --repeat-penalty 1.1
```

### Reasoning (balanced)

```bash
--temp 0.6 --top-p 0.95 --top-k 40
```

### Low VRAM

```bash
-c 32768 -n 4096 --flash-attn off -ngl 20
```

With Flash Attention off, leave out `--cache-type-v`, because llama.cpp requires Flash Attention for a quantized V cache.

---

## Limitations

* Quality degrades at extreme context lengths (>64K)
* Thinking mode increases latency
* Small models (4B) require tighter sampling tuning
* Performance depends heavily on GPU memory bandwidth

---

## License

The model metadata declares `apache-2.0`. Also follow the original Qwen license and any additional distillation terms.

---

## Credits

* Base model: Qwen3.5
* Distillation: Code and reasoning optimization
* Runtime: llama.cpp ecosystem

---

## Key Takeaway

This configuration is **not a direct copy of settings used for larger models (e.g., 30B+)**. It is tuned specifically for a 4B model to balance:

* latency
* memory usage
* reasoning quality
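
---

## Example: Selecting a Profile (Sketch)

The flag sets under Recommended Configurations can be kept in one place and selected by name. The following is a minimal sketch rather than part of the released tooling: the function name `profile_flags` and the profile labels `coding`, `reasoning`, and `low-vram` are hypothetical, and the flag values are copied unchanged from the sections above.

```bash
#!/bin/sh
# Hypothetical helper (not shipped with the model): prints the flags for a
# named profile. Values mirror the "Recommended Configurations" section.
profile_flags() {
  case "$1" in
    coding)
      echo "--temp 0.4 --top-p 0.9 --top-k 50 --repeat-penalty 1.1" ;;
    reasoning)
      echo "--temp 0.6 --top-p 0.95 --top-k 40" ;;
    low-vram)
      echo "-c 32768 -n 4096 --flash-attn off -ngl 20" ;;
    *)
      # Unknown label: report on stderr and signal failure to the caller.
      echo "unknown profile: $1" >&2
      return 1 ;;
  esac
}
```

One way to use it is `llama-server -m <model>.gguf --jinja $(profile_flags coding)`. The command substitution is left unquoted on purpose so that each flag and value becomes a separate argument.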