Gemma 4 31B IT - Text-Only FP8

A text-only FP8 quantized version of google/gemma-4-31b-it.

The vision components have been removed, and the remaining text decoder has been quantized to FP8 using NVIDIA ModelOpt with static activation-scale calibration.

Key Details

Property           Value
Base Model         google/gemma-4-31b-it
Architecture       Gemma4ForCausalLM
Parameters         31B
Quantization       FP8 (weights + activations) via ModelOpt 0.42.0
Hidden Size        5376
Layers             60
Attention Heads    32
Context Length     262,144 tokens
Vocabulary Size    262,144

What Changed from the Base Model

  • Vision encoder removed - Only the text decoder (Gemma4ForCausalLM) is kept. This is not a multimodal model.
  • FP8 quantization applied - All Linear layers (except lm_head) are quantized to FP8 with static activation scales calibrated on 32 diverse prompts.
  • Smaller footprint - ~30 GB on disk vs ~62 GB for the original BF16 multimodal checkpoint.
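The footprint figures above follow from simple byte arithmetic: FP8 stores one byte per weight versus two bytes for BF16. A rough sanity check (illustrative only; the real checkpoint also stores per-tensor scales, metadata, and the BF16 layers noted below, so on-disk size differs slightly):

```python
# Back-of-the-envelope footprint check. Real checkpoints also contain
# quantization scales and a few layers kept in BF16, so this is approximate.
params = 31e9  # parameter count from the model card

bf16_gb = params * 2 / 1e9  # BF16: 2 bytes per parameter
fp8_gb = params * 1 / 1e9   # FP8:  1 byte per parameter

print(f"BF16: ~{bf16_gb:.0f} GB, FP8: ~{fp8_gb:.0f} GB")
```

This matches the quoted ~62 GB BF16 vs ~30 GB FP8 numbers to within rounding.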

Usage with vLLM

vllm serve bahadirakdemir/gemma-4-31B-it-text-fp8 --quantization modelopt
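Once serving, vLLM exposes an OpenAI-compatible API (on port 8000 by default; host and port are configurable). A minimal chat-completions request body, assuming that default endpoint, can be sketched as:

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
payload = {
    "model": "bahadirakdemir/gemma-4-31B-it-text-fp8",
    "messages": [
        {"role": "user", "content": "Explain FP8 quantization in two sentences."}
    ],
    "max_tokens": 256,
}

body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/chat/completions, e.g. with
# urllib.request or the `openai` client pointed at the local base URL.
print(body)
```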

Usage with Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "bahadirakdemir/gemma-4-31B-it-text-fp8"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Non-quantized layers (lm_head, embeddings) load in BF16.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Build the prompt with the model's chat template.
messages = [{"role": "user", "content": "Explain FP8 quantization in two sentences."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
inputs = inputs.to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quantization Details

  • Method: Post-Training Quantization (PTQ) with FP8
  • Tool: NVIDIA ModelOpt 0.42.0
  • Calibration: 32 prompts covering diverse tasks (summarization, explanation, coding, etc.)
  • Excluded layers: lm_head and embed_tokens are kept in BF16 to preserve output quality
  • Format: safetensors with quantization scales embedded in config
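The static-calibration step above can be sketched in miniature: record the maximum absolute activation value (amax) over the calibration prompts, derive one fixed scale per tensor from the FP8 E4M3 range, then clamp and rescale at inference. This is an illustrative simplification, not ModelOpt's actual implementation (which also rounds values onto the E4M3 grid):

```python
# Illustrative per-tensor static FP8 (E4M3) scaling. ModelOpt's real
# quantizer also rounds to the E4M3 value grid; this sketch only shows
# how a static scale is calibrated and applied.
FP8_E4M3_MAX = 448.0  # largest finite E4M3 value

def calibrate_scale(calibration_values):
    # One static scale per tensor, from the calibration-set amax.
    amax = max(abs(v) for v in calibration_values)
    return amax / FP8_E4M3_MAX

def fake_quantize(x, scale):
    # Scale into the FP8 range, clamp, then scale back (dequantize).
    q = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x / scale))
    return q * scale

scale = calibrate_scale([0.1, -2.5, 1.7, 0.9])
# A value outside the calibrated range is clamped to the calibration amax.
print(fake_quantize(3.0, scale))
```

Because the scale is fixed at calibration time rather than computed per batch, inference needs no runtime amax tracking, which is what makes the activation quantization "static".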

Files

  • model.safetensors - Quantized model weights (~30 GB)
  • config.json - Model configuration with quantization config
  • tokenizer.json / tokenizer_config.json - Tokenizer files
  • chat_template.jinja - Chat template for instruct format
  • generation_config.json - Default generation parameters
  • hf_quant_config.json - ModelOpt quantization metadata