How to use from
Pi
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf prithivMLmods/gemma-4-E4B-it-GGUF:
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "prithivMLmods/gemma-4-E4B-it-GGUF:"
        }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
Quick Links

gemma-4-E4B-it-GGUF

Gemma-4-E4B-it from Google is a 4.5B effective parameter (8B total with Per-Layer Embeddings) multimodal dense model in the Gemma 4 family, optimized for edge deployment on laptops, high-end smartphones, and consumer GPUs with native support for text, images (variable aspect ratio/resolution), audio processing, and configurable thinking modes for step-by-step reasoning. Featuring 42 layers, 512-token sliding window, 128K context length, and 262K vocabulary, it delivers frontier-level performance in agentic workflows, multilingual OCR/handwriting recognition, document/PDF parsing, UI/screen analysis, chart interpretation, object detection with pointing, coding assistance, and low-latency speech-to-text understanding—rivaling models 10-20x larger while maintaining Google's production-grade safety alignments. The instruction-tuned variant excels at on-device autonomous agents via Android AICore/Qualcomm optimizations, with open weights enabling local-first inference (MediaTek/ARM CPUs, NVIDIA RTX) for privacy-focused applications like mobile IDEs, real-time document processing, and structured data extraction in resource-constrained environments.

Model Files

File Name Quant Type File Size File Link
gemma-4-E4B-it.BF16.gguf BF16 15.1 GB Download
gemma-4-E4B-it.F16.gguf F16 15.1 GB Download
gemma-4-E4B-it.Q2_K.gguf Q2_K 4.4 GB Download
gemma-4-E4B-it.Q3_K_L.gguf Q3_K_L 5.02 GB Download
gemma-4-E4B-it.Q3_K_M.gguf Q3_K_M 4.85 GB Download
gemma-4-E4B-it.Q3_K_S.gguf Q3_K_S 4.65 GB Download
gemma-4-E4B-it.Q4_0.gguf Q4_0 5.19 GB Download
gemma-4-E4B-it.Q4_K_M.gguf Q4_K_M 5.34 GB Download
gemma-4-E4B-it.Q4_K_S.gguf Q4_K_S 5.2 GB Download
gemma-4-E4B-it.Q5_0.gguf Q5_0 5.69 GB Download
gemma-4-E4B-it.Q5_K_M.gguf Q5_K_M 5.76 GB Download
gemma-4-E4B-it.Q5_K_S.gguf Q5_K_S 5.69 GB Download
gemma-4-E4B-it.Q6_K.gguf Q6_K 6.22 GB Download
gemma-4-E4B-it.Q8_0.gguf Q8_0 8.01 GB Download
gemma-4-E4B-it.mmproj-bf16.gguf mmproj-bf16 992 MB Download
gemma-4-E4B-it.mmproj-f16.gguf mmproj-f16 992 MB Download
gemma-4-E4B-it.mmproj-q8_0.gguf mmproj-q8_0 560 MB Download

llama.cpp

LLM inference in C/C++ — https://github.com/ggml-org/llama.cpp

Downloads last month
1,322
GGUF
Model size
8B params
Architecture
gemma4
Hardware compatibility
Log In to add your hardware

2-bit

3-bit

4-bit

5-bit

6-bit

8-bit

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for prithivMLmods/gemma-4-E4B-it-GGUF

Quantized
(228)
this model

Collection including prithivMLmods/gemma-4-E4B-it-GGUF