Gemma-4
A distilled code-focused variant of Gemma-4 e2b, optimized for efficient local inference using GGUF format. This model is designed for coding assistance, reasoning, and structured generation tasks, with an optional "thinking" mode enabled via chat templates.
Example usage:

```
llama-cli -hf nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF --jinja
llama-mtmd-cli -hf nphearum/Gemma-4-e2b-CodeX-Distill-v1-GGUF --jinja
```

Files in this repository:

- `gemma-4-e2b-it.Q8_0.gguf` – quantized model (Q8_0 for high quality)
- `gemma-4-e2b-it.BF16-mmproj.gguf` – multimodal projection (required for full functionality)

Make sure you're using a recent build of llama.cpp. Launch the server with:
```
llama-server \
  -m gemma-4-e2b-it.Q8_0.gguf \
  --port 53281 \
  -c 131072 \
  --parallel 1 \
  --flash-attn on \
  --no-context-shift \
  -ngl -1 \
  --jinja \
  --chat-template-kwargs "{\"enable_thinking\": true}" \
  --mmproj gemma-4-e2b-it.BF16-mmproj.gguf
```
Flag reference:

- `-c 131072` – enables long context (131k tokens)
- `--flash-attn on` – faster attention (requires compatible GPU)
- `-ngl -1` – offload all layers to GPU
- `--jinja` – enables chat template rendering
- `--chat-template-kwargs` – activates thinking mode
- `--mmproj` – required for multimodal projection

Query the server with an OpenAI-compatible request:

```
curl http://localhost:53281/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a Python function to reverse a linked list"}
    ]
  }'
```
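The same request can be sent from Python with only the standard library. This is a minimal sketch; the `chat_template_kwargs` request field is an assumption about recent llama-server builds (it mirrors the `--chat-template-kwargs` server flag shown above), and the port matches the `--port 53281` setting.

```python
import json
import urllib.request

def build_chat_request(prompt, enable_thinking=True):
    # OpenAI-compatible chat payload; chat_template_kwargs is assumed to be
    # honored per-request by recent llama-server builds.
    return {
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }

def send_chat_request(payload, base_url="http://localhost:53281"):
    # POST the payload to the locally running llama-server instance.
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Call `send_chat_request(build_chat_request("Write a Python function to reverse a linked list"))` once the server is up.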
When `enable_thinking=true`, the model may emit intermediate reasoning before producing its final answer, which improves quality on complex tasks at the cost of extra tokens and latency. Disable it if you need faster responses.
⚠️ Important note for vision models: Ollama currently does not support separate mmproj files, so the Ollama setup below is text-only.
Create a Modelfile:
```
FROM ./gemma-4-e2b-it.Q8_0.gguf
PARAMETER num_ctx 131072
PARAMETER num_gpu -1
PARAMETER stop "<end_of_turn>"
TEMPLATE """{{ if .System }}<start_of_turn>system
{{ .System }}<end_of_turn>
{{ end }}{{ if .Prompt }}<start_of_turn>user
{{ .Prompt }}<end_of_turn>
<start_of_turn>model
{{ end }}"""
# Optional: enable reasoning-style outputs
SYSTEM "You are a highly capable coding assistant with strong reasoning ability."
```
Then build and run the model:

```
ollama create gemma-4-codex -f Modelfile
ollama run gemma-4-codex
```
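Once the model is created, it can also be queried programmatically through Ollama's local REST API (default port 11434). A minimal sketch using only the standard library; `gemma-4-codex` is the model name created above:

```python
import json
import urllib.request

def build_generate_payload(prompt, model="gemma-4-codex"):
    # stream=False tells Ollama to return one JSON object instead of chunks
    return {"model": model, "prompt": prompt, "stream": False}

def ask_gemma(prompt, host="http://localhost:11434"):
    # POST to Ollama's /api/generate endpoint and return the response text
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_generate_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```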
| Use Case | Context | GPU Layers | Notes |
|---|---|---|---|
| Coding assistant | 32kβ64k | Full (-1) | Best balance |
| Long reasoning | 131k | Full | Needs high VRAM |
| Low VRAM setup | 8kβ16k | Partial | Disable flash-attn |
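The table above can be turned into a small launcher helper. This is an illustrative sketch: the context sizes and flash-attn settings come from the table, while the partial GPU layer count for the low-VRAM profile (20) is a placeholder assumption you should tune to your hardware.

```python
# Profiles derived from the use-case table; "ngl" of -1 offloads all layers.
PROFILES = {
    "coding":         {"ctx": 65536,  "ngl": -1, "flash_attn": True},
    "long_reasoning": {"ctx": 131072, "ngl": -1, "flash_attn": True},
    "low_vram":       {"ctx": 16384,  "ngl": 20, "flash_attn": False},  # 20 is a placeholder
}

def server_args(profile, model="gemma-4-e2b-it.Q8_0.gguf"):
    # Build the llama-server argument list for the chosen profile.
    p = PROFILES[profile]
    args = ["llama-server", "-m", model,
            "-c", str(p["ctx"]), "-ngl", str(p["ngl"])]
    args += ["--flash-attn", "on" if p["flash_attn"] else "off"]
    return args
```

For example, `" ".join(server_args("low_vram"))` yields a command line with a 16k context and flash attention disabled, matching the last table row.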
Follow the original Gemma license and any additional terms from this distillation.
Quantizations: 4-bit, 8-bit
Base model: google/gemma-4-E2B-it