Gemma 4 31B IT - Text-Only FP8

A text-only FP8 quantized version of google/gemma-4-31b-it.

The vision components have been removed, and the remaining text decoder has been quantized to FP8 using NVIDIA ModelOpt with static activation-scale calibration.

Key Details

Property           Value
Base Model         google/gemma-4-31b-it
Architecture       Gemma4ForCausalLM
Parameters         31B
Quantization       FP8 (weights + activations) via ModelOpt 0.42.0
Hidden Size        5376
Layers             60
Attention Heads    32
Context Length     262,144 tokens
Vocabulary Size    262,144

What Changed from the Base Model

  • Vision encoder removed - Only the text decoder (Gemma4ForCausalLM) is kept. This is not a multimodal model.
  • FP8 quantization applied - All Linear layers (except lm_head) are quantized to FP8 with static activation scales calibrated on 32 diverse prompts.
  • Smaller footprint - ~30 GB on disk vs ~62 GB for the original BF16 multimodal checkpoint.
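The footprint figures above follow from simple byte arithmetic: FP8 stores one byte per weight versus two bytes for BF16. A rough sanity check (illustrative only; the real checkpoint also stores per-tensor scales, metadata, and the BF16 layers noted below, so on-disk size differs slightly):

```python
# Back-of-the-envelope footprint check. Real checkpoints also contain
# quantization scales and a few layers kept in BF16, so this is approximate.
params = 31e9  # parameter count from the model card

bf16_gb = params * 2 / 1e9  # BF16: 2 bytes per parameter
fp8_gb = params * 1 / 1e9   # FP8:  1 byte per parameter

print(f"BF16: ~{bf16_gb:.0f} GB, FP8: ~{fp8_gb:.0f} GB")
```

This matches the quoted ~62 GB BF16 vs ~30 GB FP8 numbers to within rounding.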

Usage with vLLM

vllm serve bahadirakdemir/gemma-4-31B-it-text-fp8 --quantization modelopt
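Once serving, vLLM exposes an OpenAI-compatible API (on port 8000 by default; host and port are configurable). A minimal chat-completions request body, assuming that default endpoint, can be sketched as:

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
payload = {
    "model": "bahadirakdemir/gemma-4-31B-it-text-fp8",
    "messages": [
        {"role": "user", "content": "Explain FP8 quantization in two sentences."}
    ],
    "max_tokens": 256,
}

body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/chat/completions, e.g. with
# urllib.request or the `openai` client pointed at the local base URL.
print(body)
```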

Usage with Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "bahadirakdemir/gemma-4-31B-it-text-fp8"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Non-quantized layers (lm_head, embeddings) load in BF16.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Build the prompt with the model's chat template.
messages = [{"role": "user", "content": "Explain FP8 quantization in two sentences."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
inputs = inputs.to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quantization Details

  • Method: Post-Training Quantization (PTQ) with FP8
  • Tool: NVIDIA ModelOpt 0.42.0
  • Calibration: 32 prompts covering diverse tasks (summarization, explanation, coding, etc.)
  • Excluded layers: lm_head and embed_tokens are kept in BF16 to preserve output quality
  • Format: safetensors with quantization scales embedded in config
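The static-calibration step above can be sketched in miniature: record the maximum absolute activation value (amax) over the calibration prompts, derive one fixed scale per tensor from the FP8 E4M3 range, then clamp and rescale at inference. This is an illustrative simplification, not ModelOpt's actual implementation (which also rounds values onto the E4M3 grid):

```python
# Illustrative per-tensor static FP8 (E4M3) scaling. ModelOpt's real
# quantizer also rounds to the E4M3 value grid; this sketch only shows
# how a static scale is calibrated and applied.
FP8_E4M3_MAX = 448.0  # largest finite E4M3 value

def calibrate_scale(calibration_values):
    # One static scale per tensor, from the calibration-set amax.
    amax = max(abs(v) for v in calibration_values)
    return amax / FP8_E4M3_MAX

def fake_quantize(x, scale):
    # Scale into the FP8 range, clamp, then scale back (dequantize).
    q = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x / scale))
    return q * scale

scale = calibrate_scale([0.1, -2.5, 1.7, 0.9])
# A value outside the calibrated range is clamped to the calibration amax.
print(fake_quantize(3.0, scale))
```

Because the scale is fixed at calibration time rather than computed per batch, inference needs no runtime amax tracking, which is what makes the activation quantization "static".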

Files

  • model.safetensors - Quantized model weights (~30 GB)
  • config.json - Model configuration with quantization config
  • tokenizer.json / tokenizer_config.json - Tokenizer files
  • chat_template.jinja - Chat template for instruct format
  • generation_config.json - Default generation parameters
  • hf_quant_config.json - ModelOpt quantization metadata