Instructions for using prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4 with libraries and local apps.

Libraries

How to use prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4")
model = AutoModelForMultimodalLM.from_pretrained("prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
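
The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the server exposes the usual OpenAI-style response shape (`choices[0].message.content`). The helper names `build_chat_payload` and `chat` are hypothetical and are not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4"


def build_chat_payload(prompt, image_url, model=MODEL_ID):
    """Build the same JSON body that the curl example sends (hypothetical helper)."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(prompt, image_url, base_url="http://localhost:8000"):
    """POST the payload to /v1/chat/completions and return the reply text."""
    body = json.dumps(build_chat_payload(prompt, image_url)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    # Assumes an OpenAI-style chat completion response.
    return reply["choices"][0]["message"]["content"]
```

With the server from the commands above running, `chat("Describe this image in one sentence.", "<image URL>")` should return the generated description. Passing a different `base_url` points the same helper at another OpenAI-compatible server.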

Use Docker

docker model run hf.co/prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4

SGLang

How to use prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4 with Docker Model Runner:
```
docker model run hf.co/prithivMLmods/gemma-4-26B-A4B-it-qat-ptq-NVFP4
```

gemma-4-26B-A4B-it-qat-ptq-NVFP4

This repository contains an NVFP4 post-training quantized (PTQ) version of the Gemma 4 26B A4B instruction-tuned Mixture-of-Experts (MoE) model. It was created from the QAT checkpoint google/gemma-4-26B-A4B-it-qat-q4_0-unquantized, which is the original base model.

The model was quantized with Neural Magic's LLM Compressor using the NVFP4 scheme, which quantizes both weights and activations. Calibration was data-driven, using 20 samples from the neuralmagic/calibration dataset at a sequence length of 8192, with the aim of preserving inference quality. Following the official Gemma 4 NVFP4 workflow, the language modeling head, embedding layers, MoE router layers, and vision tower components were excluded from quantization. MoE expert calibration was handled automatically through the SequentialGemma4TextExperts pipeline, which is intended to preserve expert routing behavior and compatibility with compressed-tensors inference runtimes.

The resulting model is stored in compressed-tensors format. It is intended for efficient deployment with reduced memory consumption and faster inference, while retaining the multimodal instruction-following, reasoning, coding, and long-context capabilities of the original Gemma 4 26B A4B architecture.

recipe.yaml

| Setting | Value |
|---|---|
| Modifier | `QuantizationModifier` |
| Targets | `Linear` |
| Scheme | `NVFP4` |
| Ignore Layers | `lm_head`, `re:.embed.`, `re:.router.`, `re:.vision_tower.` |
| Bypass Divisibility Checks | `false` |
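
For readers who prefer code to a recipe file, the settings above could be expressed in Python roughly as follows. This is a sketch, not the script that produced this checkpoint. It assumes llm-compressor's `oneshot` entry point and `QuantizationModifier` class; the `quantize` wrapper and the constant names are hypothetical. The ignore patterns are copied as they appear in the recipe settings above and should be checked against recipe.yaml. Model loading and calibration-data preparation are left to the caller.

```python
# Recipe settings as listed in this model card.
RECIPE_SETTINGS = {
    "targets": "Linear",
    "scheme": "NVFP4",
    # Patterns copied as shown in the card; verify the exact regexes in recipe.yaml.
    "ignore": ["lm_head", "re:.embed.", "re:.router.", "re:.vision_tower."],
}

# Calibration settings stated in the model description.
NUM_CALIBRATION_SAMPLES = 20
MAX_SEQUENCE_LENGTH = 8192


def quantize(model, calibration_dataset):
    """Apply one-shot NVFP4 quantization to an already loaded model (hypothetical wrapper)."""
    # Imported lazily so the settings can be inspected without llm-compressor installed.
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    recipe = QuantizationModifier(**RECIPE_SETTINGS)
    oneshot(
        model=model,
        dataset=calibration_dataset,
        recipe=recipe,
        max_seq_length=MAX_SEQUENCE_LENGTH,
        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    )
    return model
```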

memory footprint

| Model | Memory Footprint |
|---|---|
| Original (BF16) | ~49 GB |
| NVFP4 | ~16.5 GB |

| Metric | Value |
|---|---|
| Compression | ~3.0× |
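
The compression figure follows directly from the two footprints. The short sketch below reproduces that arithmetic and compares it with the nominal cost of NVFP4. The nominal figure assumes 4-bit values with one 8-bit scale shared by each block of 16 values, which is an assumption about the format rather than something stated in this card. All function names are illustrative.

```python
# Approximate footprints from the table above, in GB.
BF16_GB = 49.0
NVFP4_GB = 16.5


def compression_ratio(original_gb, quantized_gb):
    """Size of the original checkpoint relative to the quantized one."""
    return original_gb / quantized_gb


def average_bits_per_weight(original_gb, quantized_gb, original_bits=16):
    """Average storage per weight implied by the footprints, assuming both
    checkpoints hold the same number of parameters."""
    return original_bits * quantized_gb / original_gb


def nominal_nvfp4_bits(value_bits=4, scale_bits=8, block_size=16):
    """Nominal bits per quantized weight: the value plus its share of the block scale."""
    return value_bits + scale_bits / block_size


print(f"compression: {compression_ratio(BF16_GB, NVFP4_GB):.2f}x")
print(f"average bits per weight: {average_bits_per_weight(BF16_GB, NVFP4_GB):.2f}")
print(f"nominal NVFP4 bits per weight: {nominal_nvfp4_bits():.2f}")
```

The implied average sits somewhat above the nominal NVFP4 figure, which is consistent with the excluded layers (language modeling head, embeddings, routers, vision tower) remaining at higher precision.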

llm-compressor

An open-source library developed by the vLLM team for optimizing large language models (LLMs) for production deployment: https://github.com/vllm-project/llm-compressor