Gemma-4-e2b-CodeX-Distill-v1-GGUF

A distilled, code-focused variant of Gemma-4 e2b, optimized for efficient local inference using GGUF format. This model is designed for coding assistance, reasoning, and structured generation tasks, with an optional "thinking" mode enabled via chat templates.


Example usage:

  • For text-only use: llama-cli -hf nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF --jinja
  • For multimodal use: llama-mtmd-cli -hf nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF --jinja

📦 Available Model Files

  • gemma-4-e2b-it.Q8_0.gguf — quantized model weights (Q8_0, high quality)
  • gemma-4-e2b-it.BF16-mmproj.gguf — multimodal projection (required for vision/multimodal input)

🚀 Features

  • Strong code generation & reasoning (CodeX-style distillation)
  • Long context support (tested up to 131k tokens)
  • Optimized for llama.cpp
  • Supports structured chat templates (Jinja-based)
  • Optional "thinking mode" for better reasoning traces

🖥️ Running with llama.cpp

Make sure you’re using a recent build of llama.cpp with:

  • Flash Attention enabled
  • Jinja/chat template support compiled

Start Server

llama-server \
  -m gemma-4-e2b-it.Q8_0.gguf \
  --port 53281 \
  -c 131072 \
  --parallel 1 \
  --flash-attn on \
  --no-context-shift \
  -ngl -1 \
  --jinja \
  --chat-template-kwargs "{\"enable_thinking\": true}" \
  --mmproj gemma-4-e2b-it.BF16-mmproj.gguf

Key Flags Explained

  • -c 131072 → sets a 131,072-token (131k) context window
  • --flash-attn on → faster attention (requires a compatible GPU)
  • -ngl -1 → offloads all layers to the GPU
  • --jinja → enables chat-template rendering
  • --chat-template-kwargs → passes template variables (here, enables thinking mode)
  • --mmproj → loads the multimodal projection file
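
The VRAM cost of the 131k context is dominated by the KV cache, which can be estimated in advance. A minimal sketch of the arithmetic — note that the layer/head/dimension numbers in the example call are illustrative placeholders, not the real Gemma-4 e2b configuration; read the actual values from the GGUF metadata:

```python
# Back-of-envelope estimate of KV-cache memory for a given context length.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    # 2x for the separate K and V tensors; FP16 cache = 2 bytes/element
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Example with ASSUMED dimensions (placeholders, not the real config):
size_gib = kv_cache_bytes(30, 8, 128, 131072) / 2**30  # -> 15.0 GiB
```

Shrinking `-c` or quantizing the KV cache reduces this cost roughly linearly.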

Test Request

curl http://localhost:53281/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a Python function to reverse a linked list"}
    ]
  }'
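
The same endpoint can also be called from Python. A minimal sketch using only the standard library — the port mirrors the server command above, and `max_tokens` is an optional OpenAI-compatible parameter you may omit:

```python
import json
import urllib.request

def build_chat_payload(prompt, max_tokens=512):
    """Build an OpenAI-compatible chat payload for llama-server."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, base_url="http://localhost:53281"):
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # llama-server mirrors the OpenAI response shape
    return body["choices"][0]["message"]["content"]
```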

🧠 Notes on Thinking Mode

When enable_thinking=true, the model may:

  • Produce intermediate reasoning steps
  • Improve structured problem solving
  • Slightly increase latency

Disable it if you need faster responses.
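
To disable it, flip the same template kwarg at server start (all other flags from the command above stay unchanged):

```shell
llama-server \
  -m gemma-4-e2b-it.Q8_0.gguf \
  --jinja \
  --chat-template-kwargs "{\"enable_thinking\": false}"
```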


🦙 Running with Ollama

⚠️ Important: Ollama currently does not support loading a separate mmproj file for vision models, so multimodal input is unavailable here; the Modelfile below runs the model text-only.

Create a Modelfile:

FROM ./gemma-4-e2b-it.Q8_0.gguf

PARAMETER num_ctx 131072
PARAMETER num_gpu -1
PARAMETER stop "<end_of_turn>"

TEMPLATE """{{ if .System }}<start_of_turn>system
{{ .System }}<end_of_turn>
{{ end }}{{ if .Prompt }}<start_of_turn>user
{{ .Prompt }}<end_of_turn>
<start_of_turn>model
{{ end }}"""

# Optional: enable reasoning-style outputs
SYSTEM "You are a highly capable coding assistant with strong reasoning ability."

Build & Run

ollama create gemma-4-codex -f Modelfile
ollama run gemma-4-codex
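
Once built, the model can also be queried over Ollama's local HTTP API (default port 11434; set "stream": false for a single JSON response):

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "gemma-4-codex",
  "prompt": "Write a Python function to reverse a linked list",
  "stream": false
}'
```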

βš™οΈ Recommended Settings

| Use Case         | Context | GPU Layers | Notes              |
|------------------|---------|------------|--------------------|
| Coding assistant | 32k–64k | Full (-1)  | Best balance       |
| Long reasoning   | 131k    | Full       | Needs high VRAM    |
| Low VRAM setup   | 8k–16k  | Partial    | Disable flash-attn |

⚠️ Limitations

  • Requires significant VRAM for full 131k context
  • Thinking mode increases latency
  • Multimodal projection file must match model variant

📜 License

Follow the original Gemma license and any additional terms from this distillation.


🙌 Credits

  • Base model: Google Gemma family
  • Distillation: Code-focused adaptation
  • Runtime: llama.cpp ecosystem

Model details: GGUF format · 5B params · gemma4 architecture
