| tags: | |
| - gguf | |
| - llama.cpp | |
| - vision-language-model | |
| license: apache-2.0 | |
| datasets: | |
| - TeichAI/Claude-Opus-4.6-Reasoning-887x | |
| - nphearum/gsm8k-thinking | |
| base_model: | |
| - Qwen/Qwen3.5-4B | |
| pipeline_tag: question-answering | |
| # Qwen3.5-4BxOpus-4.6-Code-Reasoning-Full-Distilled-GGUF | |
| A distilled, code-focused variant of Qwen3.5 4B, optimized for efficient local inference in GGUF format. This model targets coding, structured reasoning, and programmatic generation tasks, with optional reasoning traces via thinking mode. | |
| --- | |
| ## Overview | |
| * Distilled from Qwen3.5 (4B class) | |
| * Optimized for llama.cpp inference | |
| * Strong performance on code and reasoning tasks | |
| * Supports extended context (practical range: 32K–64K) | |
| * Compatible with Jinja chat templates | |
| * Optional thinking mode (may increase latency) | |
| --- | |
| ## Model Files | |
| * `Qwen3.5-4BxOpus-4.6-Code-Reasoning-Full-Distilled.Q4_K_M.gguf` | |
| Quantized model (balanced size vs quality) | |
| --- | |
| ## Example Usage | |
| ```bash | |
| llama-cli -hf nphearum/Qwen3.5-4BxOpus-4.6-Code-Reasoning-Full-Distilled-GGUF --jinja | |
| ``` | |
| --- | |
| ## Running with llama.cpp | |
| Ensure your build includes: | |
| * Flash Attention | |
| * Jinja/chat template support | |
| --- | |
| ## Server Configuration (Optimized) | |
| ```bash | |
| llama-server \ | |
| -m Qwen3.5-4BxOpus-4.6-Code-Reasoning-Full-Distilled.Q4_K_M.gguf \ | |
| --port 8001 \ | |
| --alias qwen3.5-4b-opus \ | |
| -c 65536 \ | |
| -n 8192 \ | |
| --no-context-shift \ | |
| --temp 0.6 \ | |
| --top-p 0.95 \ | |
| --top-k 40 \ | |
| --repeat-penalty 1.05 \ | |
| --presence-penalty 0.0 \ | |
| --flash-attn on \ | |
| -ctk q8_0 \ | |
| -ctv q8_0 \ | |
| --jinja \ | |
| --chat-template-kwargs "{\"enable_thinking\": true}" \ | |
| -ngl -1 | |
| ``` | |
| --- | |
| ## Parameter Breakdown | |
| ### Model Loading | |
| * `-m` | |
| Loads the GGUF model file | |
| * `--alias` | |
| Sets a simple API name | |
| --- | |
| ### Context and Output | |
| * `-c 65536` | |
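| Context window of 65,536 tokens, the upper end of the 32K–64K practical range | |
| * `-n 8192` | |
| Maximum number of tokens generated per response | |
| * `--no-context-shift` | |
| Disables context shifting, so earlier tokens are never silently dropped; generation stops once the context window is full | |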
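The relationship between `-c` and `-n` comes down to simple arithmetic. The sketch below assumes a single server slot that has the whole context window to itself, so the prompt and the generated tokens share one budget. `max_prompt_tokens` is a hypothetical helper name used for illustration, not part of llama.cpp.

```python
def max_prompt_tokens(n_ctx: int, n_predict: int) -> int:
    """Largest prompt (in tokens) that still leaves room for n_predict output tokens.

    Assumes one slot owns the whole context window. This is an assumption for
    the sketch, not a llama.cpp guarantee when parallel slots are configured.
    """
    if n_predict >= n_ctx:
        raise ValueError("n_predict must be smaller than n_ctx")
    return n_ctx - n_predict


# Values from the server command: -c 65536 and -n 8192.
budget = max_prompt_tokens(65536, 8192)
print(budget)  # 57344
```

With context shifting disabled, a prompt longer than this budget leaves less than the full `-n` allowance for the reply.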
| --- | |
| ### Sampling Behavior | |
| * `--temp 0.6` | |
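| Sampling temperature; lower values make output less random | |
| * `--top-p 0.95` | |
| Nucleus sampling: keeps the smallest set of tokens whose cumulative probability reaches 0.95 | |
| * `--top-k 40` | |
| Restricts sampling to the 40 most likely tokens | |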
| * `--repeat-penalty 1.05` | |
| Reduces repetition (important for code) | |
| * `--presence-penalty 0.0` | |
| Leaves the presence penalty disabled, so tokens that have already appeared receive no additional penalty | |
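To make the effect of temperature, top-k, and top-p concrete, here is a toy sketch on a hand-written set of logits. It is an illustration only: the order of the steps is an assumption and does not claim to match llama.cpp's sampler chain, the repeat and presence penalties are left out, and `filter_candidates` is a hypothetical name.

```python
import math


def filter_candidates(logits, temp=0.6, top_k=40, top_p=0.95):
    """Return renormalized probabilities of the tokens that survive filtering."""
    # Temperature scaling: dividing by a value below 1 sharpens the distribution.
    scaled = {tok: value / temp for tok, value in logits.items()}

    # Softmax, subtracting the maximum for numerical stability.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

    # top-k: keep only the k most likely tokens.
    ranked = ranked[:top_k]

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


toy_logits = {"return": 2.0, "yield": 1.0, "pass": 0.0, "raise": -5.0}
for tok, prob in filter_candidates(toy_logits).items():
    print(f"{tok}: {prob:.3f}")
```

Lowering `temp` or `top_p` shrinks the set of surviving tokens, which is why the coding preset uses tighter values than the reasoning preset.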
| --- | |
| ### Performance and Memory | |
| * `-ngl -1` | |
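| Offloads model layers to the GPU; how `-1` is interpreted varies across llama.cpp versions, so an explicit high value such as `-ngl 99` is a common way to request full offload | |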
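| * `--flash-attn on` (short form: `-fa on`) | |
| Enables Flash Attention, which speeds up attention and is required for a quantized V cache | |
| * `-ctk q8_0`, `-ctv q8_0` | |
| Stores the KV cache in 8-bit `q8_0` format (reduces memory usage) | |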
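A rough estimate shows why an 8-bit KV cache matters at a 64K context. The sketch below uses the standard formula for full-attention layers (keys plus values, per layer, per KV head). The architecture numbers are placeholders chosen for illustration and are NOT the real dimensions of this model, and `kv_cache_bytes` is a hypothetical helper.

```python
# Bytes per cached element: f16 uses 2 bytes; q8_0 packs 32 values into a
# 34-byte block (32 int8 values plus one 16-bit scale).
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type="f16"):
    """Estimate the combined K and V cache size for standard attention layers."""
    elements = 2 * n_ctx * n_layers * n_kv_heads * head_dim  # 2 = keys and values
    return elements * BYTES_PER_ELEMENT[cache_type]


# Placeholder architecture values, not taken from the model's config.
shape = dict(n_ctx=65536, n_layers=32, n_kv_heads=8, head_dim=128)
f16 = kv_cache_bytes(**shape, cache_type="f16")
q8 = kv_cache_bytes(**shape, cache_type="q8_0")
print(f"f16:  {f16 / 2**30:.2f} GiB")   # f16:  8.00 GiB
print(f"q8_0: {q8 / 2**30:.2f} GiB")    # q8_0: 4.25 GiB
```

Under these assumed dimensions the `q8_0` cache is a little over half the size of the `f16` cache; substitute the values from the GGUF metadata for a real estimate.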
| --- | |
| ### Chat and Reasoning | |
| * `--jinja` | |
| Enables chat template rendering | |
| * `--chat-template-kwargs` | |
| Passes `enable_thinking: true` to the chat template, which turns on thinking mode | |
| Note: Thinking mode may: | |
| * Improve reasoning quality | |
| * Increase latency | |
| * Produce unstable outputs if not aligned with training | |
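When thinking mode is on, client code often needs to separate the reasoning trace from the final answer. The sketch below assumes the trace arrives inline in the message content, wrapped in Qwen-style `<think>...</think>` tags; depending on the server's reasoning settings it may instead arrive in a separate response field, in which case this step is unnecessary. `split_thinking` is a hypothetical helper.

```python
import re

# Non-greedy match so several separate <think> blocks are each captured.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(content: str) -> tuple[str, str]:
    """Return (reasoning, answer) from text that may contain <think> blocks."""
    reasoning = "\n".join(block.strip() for block in THINK_BLOCK.findall(content))
    answer = THINK_BLOCK.sub("", content).strip()
    return reasoning, answer


reasoning, answer = split_thinking(
    "<think>Reverse the pointers one node at a time.</think>\nUse three pointers."
)
print(answer)  # Use three pointers.
```

Text without any `<think>` block passes through unchanged as the answer, with an empty reasoning string.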
| --- | |
| ## Test Request | |
| ```bash | |
| curl http://localhost:8001/v1/chat/completions \ | |
| -H "Content-Type: application/json" \ | |
| -d '{ | |
| "messages": [ | |
| {"role": "user", "content": "Write a Python function to reverse a linked list"} | |
| ] | |
| }' | |
| ``` | |
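The same request can be sent from Python using only the standard library. This is a minimal sketch: the port and the `qwen3.5-4b-opus` alias are taken from the server command earlier in this README, the response is assumed to follow the OpenAI chat completions shape, and `build_chat_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request

# Port 8001 comes from the server command; adjust it if your server differs.
SERVER_URL = "http://localhost:8001/v1/chat/completions"


def build_chat_request(prompt: str, model: str = "qwen3.5-4b-opus") -> urllib.request.Request:
    """Build (but do not send) a chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    # Assumes an OpenAI-style response: choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("Write a Python function to reverse a linked list"))` sends the same prompt as the `curl` command.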
| --- | |
| ## Recommended Configurations | |
| ### Coding (more deterministic) | |
| ```bash | |
| --temp 0.4 --top-p 0.9 --top-k 50 --repeat-penalty 1.1 | |
| ``` | |
| ### Reasoning (balanced) | |
| ```bash | |
| --temp 0.6 --top-p 0.95 --top-k 40 | |
| ``` | |
| ### Low VRAM | |
| ```bash | |
| -c 32768 -n 4096 --flash-attn off -ngl 20 | |
| ``` | |
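Switching between these presets is easier when they live in one place. The sketch below copies the three flag sets into a dictionary and assembles a `llama-server` argument list suitable for `subprocess.run`. `server_command` and the preset keys are hypothetical names, and the base flags (`-m`, `--port`, `--jinja`) are the ones used in the server command earlier in this README.

```python
import shlex

# Flag sets copied from the three recommended configurations.
PRESETS = {
    "coding": ["--temp", "0.4", "--top-p", "0.9", "--top-k", "50", "--repeat-penalty", "1.1"],
    "reasoning": ["--temp", "0.6", "--top-p", "0.95", "--top-k", "40"],
    "low-vram": ["-c", "32768", "-n", "4096", "--flash-attn", "off", "-ngl", "20"],
}


def server_command(model_path: str, preset: str, port: int = 8001) -> list[str]:
    """Return the llama-server argument list for the chosen preset."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}; choose from {sorted(PRESETS)}")
    base = ["llama-server", "-m", model_path, "--port", str(port), "--jinja"]
    return base + PRESETS[preset]


cmd = server_command(
    "Qwen3.5-4BxOpus-4.6-Code-Reasoning-Full-Distilled.Q4_K_M.gguf", "coding"
)
# shlex.join produces a shell-safe string for logging or copy-pasting.
print(shlex.join(cmd))
```

Passing the list directly to `subprocess.run(cmd)` avoids shell quoting issues.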
| --- | |
| ## Limitations | |
| * Quality degrades at extreme context lengths (>64K) | |
| * Thinking mode increases latency | |
| * Small models (4B) require tighter sampling tuning | |
| * Performance depends heavily on GPU memory bandwidth | |
| --- | |
| ## License | |
| Released under Apache-2.0, per the model card metadata. Also follow the original Qwen license and any additional distillation terms. | |
| --- | |
| ## Credits | |
| * Base model: Qwen3.5 | |
| * Distillation: Code and reasoning optimization | |
| * Runtime: llama.cpp ecosystem | |
| --- | |
| ## Key Takeaway | |
| This configuration is **not copied directly from settings for larger models (e.g., 30B+)**. | |
| It is tuned specifically for a 4B model to balance: | |
| * latency | |
| * memory usage | |
| * reasoning quality |