Instructions to use prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4")
model = AutoModelForMultimodalLM.from_pretrained("prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4 with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image in one sentence."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
            }
          }
        ]
      }
    ]
  }'
```
Use Docker
```shell
docker model run hf.co/prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4
```
- SGLang
How to use prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4 with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image in one sentence."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
            }
          }
        ]
      }
    ]
  }'
```
Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image in one sentence."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
            }
          }
        ]
      }
    ]
  }'
```
- Docker Model Runner
How to use prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4 with Docker Model Runner:
```shell
docker model run hf.co/prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4
```
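The vLLM and SGLang servers above are both called through the same OpenAI-compatible chat route, so the curl request can also be sent from Python using only the standard library. This is a minimal sketch, not part of the official instructions: the helper names `build_chat_payload`, `post_chat`, and `main` are hypothetical, and the ports (8000 for vLLM, 30000 for SGLang) are simply the ones used in the commands above.

```python
import json
import urllib.request

MODEL_ID = "prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4"


def build_chat_payload(prompt: str, image_url: str, model: str = MODEL_ID) -> dict:
    """Build the same JSON body that the curl examples send."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def post_chat(base_url: str, payload: dict, timeout: float = 120.0) -> dict:
    """POST the payload to <base_url>/v1/chat/completions and return the parsed JSON."""
    request = urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def main() -> None:
    """Example call; assumes a server from the commands above is already running locally."""
    payload = build_chat_payload(
        "Describe this image in one sentence.",
        "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
    )
    reply = post_chat("http://localhost:8000", payload)  # use port 30000 for SGLang
    print(reply["choices"][0]["message"]["content"])
```

Call `main()` once a server is listening. Keeping the payload builder separate from the network call makes it easy to inspect the request body without a running server.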
gemma-4-26B-A4B-it-qat-ptq-NVFP4
This repository contains an NVFP4 post-training quantized (PTQ) version of the Gemma 4 26B A4B instruction-tuned Mixture-of-Experts (MoE) model, created from the QAT checkpoint google/gemma-4-26B-A4B-it-qat-q4_0-unquantized.

The model was quantized with Neural Magic's LLM Compressor using the NVFP4 scheme. Data-driven calibration on the neuralmagic/calibration dataset (20 samples, sequence length 8192) was applied to quantize both weights and activations while preserving inference quality. Following the official Gemma 4 NVFP4 workflow, the language modeling head, embedding layers, MoE router layers, and vision tower components were excluded from compression. MoE expert calibration was handled automatically through the SequentialGemma4TextExperts pipeline to ensure proper expert routing behavior and compatibility with compressed-tensors inference runtimes.

The resulting model is stored in compressed-tensors format and is intended for efficient deployment with reduced memory consumption and accelerated inference, while retaining the multimodal instruction-following, reasoning, coding, and long-context capabilities of the original Gemma 4 26B A4B architecture. The original base model is available at google/gemma-4-26B-A4B-it-qat-q4_0-unquantized.
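The calibration run described above could be sketched with LLM Compressor roughly as follows. This is not the author's script: it is a minimal sketch that assumes the `oneshot` entry point and `QuantizationModifier` class from llm-compressor, with the scheme, ignore list, sample count, and sequence length taken from this card. The function name `quantize_nvfp4` is hypothetical, loading the model and preparing the calibration dataset are left to the caller, and the SequentialGemma4TextExperts handling is not shown.

```python
# Settings taken from this card; everything else in this sketch is an assumption.
RECIPE_ARGS = {
    "targets": "Linear",
    "scheme": "NVFP4",
    "ignore": ["lm_head", "re:.*embed.*", "re:.*router.*", "re:.*vision_tower.*"],
}
CALIBRATION_ARGS = {
    "num_calibration_samples": 20,
    "max_seq_length": 8192,
}


def quantize_nvfp4(model, calibration_dataset):
    """Run one-shot NVFP4 calibration on an already loaded model (hypothetical helper)."""
    # Imported lazily so the settings above can be inspected without llm-compressor installed.
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    recipe = QuantizationModifier(**RECIPE_ARGS)
    oneshot(
        model=model,
        dataset=calibration_dataset,
        recipe=recipe,
        **CALIBRATION_ARGS,
    )
    return model
```

The settings are kept in plain dictionaries so they can be compared against the recipe table without running the quantization itself.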
recipe.yaml
| Setting | Value |
|---|---|
| Modifier | QuantizationModifier |
| Targets | Linear |
| Scheme | NVFP4 |
| Ignore Layers | `lm_head`, `re:.*embed.*`, `re:.*router.*`, `re:.*vision_tower.*` |
| Bypass Divisibility Checks | false |
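To get a feel for which modules the ignore list skips, the patterns can be tried against module names. The sketch below assumes that an entry prefixed with `re:` is a regular expression matched from the start of the module name, and that any other entry must equal the name exactly. The helper `is_ignored` and the module names are hypothetical and are not taken from the real checkpoint.

```python
import re

# Ignore entries copied from the recipe table.
IGNORE = ("lm_head", "re:.*embed.*", "re:.*router.*", "re:.*vision_tower.*")


def is_ignored(module_name: str, patterns=IGNORE) -> bool:
    """Return True if any ignore entry matches the module name."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            # Regex entry: strip the "re:" prefix and match from the start of the name.
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            # Plain entry: assumed to require an exact name match.
            return True
    return False


# Hypothetical module names, for illustration only.
examples = [
    "lm_head",
    "model.language_model.embed_tokens",
    "model.language_model.layers.0.router.proj",
    "model.vision_tower.encoder.layers.0.mlp.fc1",
    "model.language_model.layers.0.self_attn.q_proj",
]
for name in examples:
    status = "skipped" if is_ignored(name) else "quantized"
    print(f"{name}: {status}")
```

Under these assumptions only the last example would be quantized; the other four stay in their original precision.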
memory footprint
| Model | Memory Footprint |
|---|---|
| Original (BF16) | ~49 GB |
| NVFP4 | ~16.5 GB |

| Metric | Value |
|---|---|
| Compression | ~3.0× |
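The compression figure follows directly from the two footprints in the table. As a quick check, using only the approximate values reported above:

```python
bf16_gb = 49.0   # approximate BF16 footprint from the table
nvfp4_gb = 16.5  # approximate NVFP4 footprint from the table

ratio = bf16_gb / nvfp4_gb
saved_gb = bf16_gb - nvfp4_gb
saved_fraction = saved_gb / bf16_gb

print(f"compression: ~{ratio:.1f}x")
print(f"memory saved: ~{saved_gb:.1f} GB ({saved_fraction:.0%})")
```

The ratio is below the 4× that a move from 16-bit to 4-bit values alone would suggest. That is consistent with the excluded layers staying unquantized and with the scale factors stored alongside the quantized weights, although the card does not break the footprint down further.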
llm-compressor
An open-source library developed by the vLLM team, designed to optimize Large Language Models (LLMs) for production deployment — https://github.com/vllm-project/llm-compressor
Model tree for prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4
Base model
google/gemma-4-26B-A4B