# Gemma 4 31B IT - Text-Only FP8
A text-only FP8 quantized version of `google/gemma-4-31b-it`.
Vision components have been removed, and the remaining text model has been quantized to FP8 using NVIDIA ModelOpt with static activation calibration.
## Key Details
| Property | Value |
|---|---|
| Base Model | google/gemma-4-31b-it |
| Architecture | Gemma4ForCausalLM |
| Parameters | 31B |
| Quantization | FP8 (weights + activations) via ModelOpt 0.42.0 |
| Hidden Size | 5376 |
| Layers | 60 |
| Attention Heads | 32 |
| Context Length | 262,144 tokens |
| Vocabulary Size | 262,144 |
## What Changed from the Base Model
- **Vision encoder removed** - Only the text decoder (`Gemma4ForCausalLM`) is kept. This is not a multimodal model.
- **FP8 quantization applied** - All `Linear` layers (except `lm_head`) are quantized to FP8 with static activation scales calibrated on 32 diverse prompts.
- **Smaller footprint** - ~30 GB on disk vs ~62 GB for the original BF16 multimodal checkpoint.
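The static FP8 recipe above can be illustrated with a minimal quantize-dequantize sketch. This is plain Python for illustration only, not ModelOpt's actual implementation; the constant 448 is the largest finite value representable in the FP8 E4M3 format, and `round()` stands in for FP8's non-uniform rounding:

```python
# Minimal sketch of per-tensor static FP8 (E4M3) quantization.
# Illustrative only -- not the ModelOpt implementation.

FP8_E4M3_MAX = 448.0  # largest finite value in FP8 E4M3

def compute_scale(calibration_amax: float) -> float:
    """Static scale: map the calibrated absolute max onto the FP8 range."""
    return calibration_amax / FP8_E4M3_MAX

def quantize_dequantize(x: float, scale: float) -> float:
    """Fake-quantize one value: scale, clamp to the FP8 range, round, rescale.
    (Real FP8 rounding is non-uniform; round() is a stand-in.)"""
    q = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x / scale))
    return round(q) * scale

# With a calibrated amax of 3.5, in-range values survive with small error,
# while out-of-range values saturate at the calibrated maximum.
scale = compute_scale(calibration_amax=3.5)
print(quantize_dequantize(2.0, scale))   # close to 2.0
print(quantize_dequantize(10.0, scale))  # saturates to 3.5
```

This is why calibration matters: the scale is fixed from the observed activation range, so anything beyond the calibrated maximum clips.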
## Usage with vLLM
```shell
vllm serve bahadirakdemir/gemma-4-31B-it-text-fp8 --quantization modelopt
```
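Once served, vLLM exposes an OpenAI-compatible API (by default on port 8000). The following sketch shows the payload shape for a chat request; the host/port are assumptions about your local setup, and actually sending the request requires the server above to be running:

```python
import json

# Payload for vLLM's OpenAI-compatible chat endpoint:
#   POST http://localhost:8000/v1/chat/completions
payload = {
    "model": "bahadirakdemir/gemma-4-31B-it-text-fp8",
    "messages": [
        {"role": "user", "content": "Explain FP8 quantization in two sentences."}
    ],
    "max_tokens": 256,
}
body = json.dumps(payload)
print(body)

# Send with any HTTP client, e.g.:
#   curl http://localhost:8000/v1/chat/completions \
#        -H "Content-Type: application/json" \
#        -d @payload.json
```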
## Usage with Transformers
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "bahadirakdemir/gemma-4-31B-it-text-fp8"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [{"role": "user", "content": "Explain FP8 quantization in two sentences."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
inputs = inputs.to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Quantization Details
- **Method**: Post-Training Quantization (PTQ) with FP8
- **Tool**: NVIDIA ModelOpt 0.42.0
- **Calibration**: 32 prompts covering diverse tasks (summarization, explanation, coding, etc.)
- **Excluded layers**: `lm_head` and `embed_tokens` remain in BF16 for output quality
- **Format**: `safetensors` with quantization scales embedded in config
## Files
- `model.safetensors` - Quantized model weights (~30 GB)
- `config.json` - Model configuration with quantization config
- `tokenizer.json` / `tokenizer_config.json` - Tokenizer files
- `chat_template.jinja` - Chat template for instruct format
- `generation_config.json` - Default generation parameters
- `hf_quant_config.json` - ModelOpt quantization metadata