Google Released Gemma-4 Four Days Ago. We Already Made It 1.72× Faster.

Community Article Published April 7, 2026

At Thoughtworks, we build inference optimization tools for production LLM deployments. Gemma-4-31B was released on April 2. By April 6, we had a trained EAGLE3 draft head that speeds up inference by 1.72× — without changing the model or its outputs. But Gemma-4 isn't a standard transformer: its hybrid sliding-window + full-attention architecture breaks every existing speculative decoding pipeline. Getting it to work required fixing three bugs in the serving stack and solving a dual-KV-cache memory leak.


Speculative Decoding in 60 Seconds

If you're already familiar with speculative decoding and EAGLE3, skip to the Results section below.

LLM inference is memory-bandwidth bound, not compute-bound. Your GPU spends most of its time loading model weights from memory, not doing math. Speculative decoding exploits this idle compute: a small draft model proposes multiple tokens cheaply, then the full target model verifies them all in a single forward pass — the same cost as generating one token normally.
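
A back-of-the-envelope roofline estimate makes the memory-bound claim concrete. The numbers below are illustrative assumptions (bf16 weights, approximate H200 HBM bandwidth), not measurements from our benchmark:

```python
# Latency floor for one decode step of a 31B model at batch size 1:
# every generated token must stream all resident weights from HBM.
params = 31e9                 # parameters (Gemma-4-31B)
bytes_per_param = 2           # bf16 weights (assumption)
hbm_bandwidth = 4.8e12        # bytes/s per H200, approximate
tp = 2                        # weights sharded across 2 GPUs

weight_bytes_per_gpu = params * bytes_per_param / tp
step_time = weight_bytes_per_gpu / hbm_bandwidth   # memory-bound floor, seconds/token
print(f"{1 / step_time:.0f} tok/s upper bound")    # prints "155 tok/s upper bound"
```

The arithmetic-side cost of a single-token forward pass is far below what the GPU can sustain, which is exactly the idle compute that speculation harvests.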

The output is mathematically identical to what the target model would produce without speculation. This is a guarantee from the accept/reject algorithm, not an approximation.
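
The guarantee comes from the standard speculative-sampling accept/reject rule. Here is a minimal sketch of that rule (the generic algorithm from the speculative decoding literature, not SGLang's actual kernel):

```python
import random

def accept_draft_tokens(draft_tokens, p_draft, p_target, rng=random.random):
    """Standard speculative-sampling accept/reject rule (illustrative sketch).

    draft_tokens: proposed token ids
    p_draft[i], p_target[i]: each model's probability of draft_tokens[i]
    Returns the accepted prefix. The caller then samples one correction or
    bonus token from the (adjusted) target distribution, which is what makes
    the final output distribution identical to the target model's.
    """
    accepted = []
    for tok, q, p in zip(draft_tokens, p_draft, p_target):
        if rng() < min(1.0, p / q):   # accept with prob min(1, p_target/p_draft)
            accepted.append(tok)
        else:
            break                     # first rejection ends the speculated run
    return accepted
```

At temperature 0 both distributions collapse to one-hot, so this reduces to accepting the longest prefix on which the draft and target argmax agree.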

[Figure: decoding comparison]

EAGLE3 (NeurIPS 2025) trains a specialized draft head that conditions on the target model's own internal representations from three points — early, middle, and late layers — rather than being an independent smaller model. This makes the draft much better at predicting what the target would say. The draft head is tiny (~277 MB) and co-deploys on the same GPU.
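
The shape of that idea can be sketched in a few lines of numpy. Everything below is a toy illustration with made-up dimensions; the real EAGLE3 head architecture described in the paper fuses the target's features differently and at full model width:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 64, 1000   # toy sizes for illustration only

# Hypothetical fusion head: the draft conditions on target hidden states
# taken from an early, a middle, and a late layer, concatenated together.
W_fuse = rng.standard_normal((3 * d_model, d_model)) * 0.02
W_lm = rng.standard_normal((d_model, vocab)) * 0.02

def draft_logits(h_early, h_mid, h_late):
    fused = np.concatenate([h_early, h_mid, h_late]) @ W_fuse
    fused = np.tanh(fused)           # stand-in nonlinearity
    return fused @ W_lm              # draft scores over the vocabulary

h = [rng.standard_normal(d_model) for _ in range(3)]
print(draft_logits(*h).shape)        # prints "(1000,)"
```

Because the draft sees the target's own features rather than re-encoding the prompt, its proposals track the target closely, which is what drives high acceptance rates.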

For a deeper dive on how speculative decoding works, the accept/reject rule, and the math behind the speedup curve, see our previous post on EAGLE3 for GLM-4.7-Flash.


Results

We are releasing thoughtworks/Gemma-4-31B-Eagle3 — to our knowledge, the first publicly available EAGLE3 draft head for the Gemma-4 architecture.

All benchmarks: single-user (B=1), temperature 0, CUDA graphs enabled, TP=2, server-side Prometheus metrics.

| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---|---|---|---|
| MT-Bench | 49.7 | 85.4 | 1.72× |
| HumanEval | 49.8 | 73.7 | 1.48× |
| SWEBench-Multilingual | 48.5 | 55.4 | 1.14× |
| SWEBench-Verified | 48.2 | 50.4 | 1.05× |

[Figure: speedup by dataset]

Hardware: 8× NVIDIA H200 141 GB. Both baseline and EAGLE3 measured at TP=2 with CUDA graphs enabled, a fair apples-to-apples comparison. TP=2 is required because Gemma-4's 42 Q-heads are not divisible by 4. Draft head: 277 MB, co-deployed on the same GPUs.

Training acceptance rate: acc_0 = 0.75–0.82. Inference acceptance rates vary by dataset: MT-Bench (conversational) shows the highest speedup; SWEBench (code-heavy, less predictable token sequences) shows the lowest.
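
The link between acceptance rate and speedup can be sketched with a simple chain model: with draft length k and i.i.d. per-token acceptance probability α, the expected number of tokens emitted per target forward pass is (1 − α^(k+1)) / (1 − α). This is a simplifying assumption for illustration; EAGLE3's tree drafting accepts more than this chain bound:

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens per target forward pass under a length-k chain draft
    with i.i.d. per-token acceptance probability alpha (simplified model)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Higher acceptance compounds: each extra point of alpha lengthens the
# average accepted run, which is why conversational data (high alpha)
# speeds up more than code (low alpha).
for alpha in (0.5, 0.7, 0.8):
    print(alpha, round(expected_tokens_per_step(alpha, k=3), 2))
```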

Training Configuration

| Parameter | Value |
|---|---|
| Framework | SpecForge (PyTorch), SGLang backend |
| Hardware | 8× H200 (TP=4 for target model, DP=2) |
| Dataset | 54K mixed (ShareGPT 45% / UltraChat 35% / PerfectBlend 20%) |
| Epochs | 3 |
| Learning rate | 5e-5 |
| max_length | 1024 |
| TTT (tree training tokens) | 7 |
| Training time | ~117 minutes |
| Checkpoint size | 277 MB |

Why Gemma-4 Is Different

Most EAGLE3 work targets standard dense transformers (Llama, Qwen) or MoE models (GLM-4.7-Flash, MiniMax). Gemma-4-31B is neither — it's a hybrid-attention dense model with two fundamentally different layer types:

  • 50 sliding-window layers: 16 KV heads, head_dim=256, window=1024
  • 10 global attention layers: 4 KV heads, head_dim=512, V=K (no separate v_proj)

This means the KV cache cannot be a uniform tensor. Each layer type has different shapes, different head counts, and different memory requirements. Every inference engine that serves Gemma-4 must maintain two separate memory pools — and when EAGLE3's tree verification starts rapidly allocating and freeing cache entries across both pools simultaneously, things break in ways nobody anticipated.
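
The asymmetry is easy to quantify. Assuming bf16 entries and that the V=K layers store a single shared tensor (an inference from the missing v_proj), the per-token cost of each pool works out as follows. The pool layout itself is a sketch, not SGLang's actual SWAKVPool:

```python
# Per-token KV-cache cost for Gemma-4's two layer families (bf16 = 2 bytes).
BYTES_PER_ELEM = 2

def kv_bytes_per_token(n_layers, kv_heads, head_dim, k_eq_v=False):
    tensors = 1 if k_eq_v else 2              # V=K layers store only K
    return n_layers * kv_heads * head_dim * tensors * BYTES_PER_ELEM

sliding = kv_bytes_per_token(50, 16, 256)               # windowed pool
global_ = kv_bytes_per_token(10, 4, 512, k_eq_v=True)   # full-attention pool
print(sliding, global_)   # prints "819200 40960": very different pool shapes
```

Two pools with a 20× size disparity, different head counts, and different dims cannot share an allocator index space naively, which is where the trouble described below begins.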

[Figure: Gemma-4 hybrid attention architecture]

We chose Gemma-4 specifically because it exercises a code path that no existing EAGLE3 deployment has handled. If speculative decoding can work here, it can work on the next generation of hybrid-attention architectures too.


What We Learned

The training sequence length sweet spot

We trained three models identical in every way except sequence length:

| Experiment | max_length | MT-Bench | HumanEval | SWEBench-Verified |
|---|---|---|---|---|
| Exp A-SGLang | 512 | 1.64× | 1.38× | 1.01× |
| Exp B-SGLang | 1024 | 1.72× | 1.48× | 1.05× |
| Exp C-SGLang | 2048 | 1.67× | 1.47× | 1.08× |

[Figure: speedup vs. training max_length]

max_length=1024 is the sweet spot. Shorter sequences (512) give the draft less context to learn from. Longer sequences (2048) don't improve acceptance rates for typical benchmark prompt lengths and take 3× longer to train.

Always train with the backend you'll serve with

EAGLE3 draft heads learn from the target model's hidden-state distributions at specific layers. If you train with one backend (e.g., HuggingFace Transformers) and serve with another (e.g., SGLang), those hidden states can diverge significantly: we measured up to 32% relative difference at the layer closest to the output. The result is a draft that looks great during training (acc_0 = 0.85–0.87) but achieves only ~13% acceptance at inference time. Retraining with --target-model-backend sglang fixed it immediately (acc_0 = 0.75–0.82, with real-world acceptance matching expectations). This applies to any EAGLE3 deployment, not just Gemma-4.
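
One way to quantify such a mismatch is a relative L2 difference between the two backends' hidden states at the same layer. The metric below is an illustrative choice, not SpecForge's exact diagnostic, and the tensors are simulated:

```python
import numpy as np

def relative_divergence(h_a, h_b):
    """Relative L2 difference between two backends' hidden states
    at the same layer (illustrative metric)."""
    return np.linalg.norm(h_a - h_b) / np.linalg.norm(h_a)

rng = np.random.default_rng(0)
h_hf = rng.standard_normal(4096)                     # stand-in: HF hidden state
h_sglang = h_hf + 0.3 * rng.standard_normal(4096)    # simulated ~30% mismatch
print(f"{relative_divergence(h_hf, h_sglang):.0%}")
```

A draft head trained against one distribution and queried with the other is effectively evaluated out of distribution, which is why training-time acc_0 can look fine while serving-time acceptance collapses.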

Three bugs in the serving stack

Getting Gemma-4 to run correctly in SGLang required fixing three issues that don't exist in standard transformer architectures:

  1. Attention scaling = 1.0. Gemma-4 applies QK-normalization, so the standard $1/\sqrt{d}$ scaling factor is not used. SGLang was still applying $256^{-0.5} = 0.0625$, scaling the attention logits to 1/16 of their intended magnitude in every layer and producing garbage output after 60 layers.

  2. V = K in global layers. Global attention layers have attention_k_eq_v = True — no v_proj weights. The value tensor is a clone of the key tensor. No existing SGLang model needed per-layer conditional logic for this.

  3. Partial RoPE for global layers. Global layers use partial_rotary_factor = 0.25 — only 128 of 512 dimensions receive rotary position encoding. Standard RoPE implementations apply rotation to the full tensor.
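
The partial-RoPE behavior in item 3 can be sketched as follows: rotate only the leading fraction of head dimensions and pass the rest through unchanged. This is the idea in plain numpy, not SGLang's kernel:

```python
import numpy as np

def apply_partial_rope(x, pos, partial_rotary_factor=0.25, base=10000.0):
    """Apply rotary position encoding to the leading fraction of head dims
    only; the remaining dims pass through unchanged (illustrative sketch)."""
    head_dim = x.shape[-1]
    rot_dim = int(head_dim * partial_rotary_factor)   # 128 of 512 for Gemma-4
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]

    half = rot_dim // 2
    inv_freq = base ** (-np.arange(half) / half)      # standard RoPE frequencies
    angles = pos * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)

    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x_pass], axis=-1)

x = np.ones(512)
y = apply_partial_rope(x, pos=7)
print((y[128:] == x[128:]).all())   # prints "True": dims past the slice untouched
```

An implementation that rotates the full 512-dim tensor would scramble the 384 positions that the model never trained with rotation, so this is a silent-corruption bug rather than a crash.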

The memory leak: SWAKVPool double-free

The hardest bug. EAGLE3's tree verification rapidly allocates and frees KV cache entries. With Gemma-4's dual memory pools, the alloc/free pattern during verification can trigger a double-free: the allocator frees an index already freed in a previous cycle, corrupting pool state and crashing with a CUDA device-side assert on the next allocation.

Compounding this: the pool-to-pool mapping was initialized with 0 — a valid index — so the filter swa_indices > 0 silently skipped freeing slot 0, causing the two pools to drift out of sync.

Fix: a double-free guard (check mapping before free), sentinel value changed from 0 to -1, and a boolean allocated mask to track pool state explicitly.
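
The shape of the fix is a small allocator discipline. The class below is a minimal hypothetical sketch of those three pieces (sentinel, mask, guard), not SGLang's actual SWAKVPool code:

```python
import numpy as np

class DualPoolMapping:
    """Sketch of the fix: -1 sentinel (since 0 is a valid slot index),
    an explicit allocated mask, and a guard that makes free() idempotent."""

    def __init__(self, size):
        self.swa_index = np.full(size, -1, dtype=np.int64)  # -1 = unmapped
        self.allocated = np.zeros(size, dtype=bool)
        self.free_slots = list(range(size))

    def alloc(self, full_idx):
        swa = self.free_slots.pop()
        self.swa_index[full_idx] = swa
        self.allocated[full_idx] = True
        return swa

    def free(self, full_idx):
        if not self.allocated[full_idx]:       # double-free guard
            return
        self.free_slots.append(int(self.swa_index[full_idx]))
        self.swa_index[full_idx] = -1          # sentinel: slot 0 stays freeable
        self.allocated[full_idx] = False

pool = DualPoolMapping(4)
pool.alloc(0)
pool.free(0)
pool.free(0)                  # second free is a no-op instead of corrupting state
print(len(pool.free_slots))   # prints "4"
```

With a 0 sentinel, a `swa_indices > 0` filter can never free slot 0; with -1, a `>= 0` check is unambiguous, and the boolean mask makes repeated frees harmless.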


Caveats and Limitations

TP=2 only. Gemma-4-31B has 42 attention heads — not divisible by 4. EAGLE3's draft model inherits this constraint, so we serve at TP=2. A future draft with a TP=4-compatible head configuration (e.g., 32 heads × 168 head_dim = 5376) would unlock TP=4 serving and likely higher absolute throughput.

SWEBench speedups are modest. Code generation produces less predictable token sequences than conversational text. The speedup drops from 1.72× (MT-Bench) to 1.05–1.14× (SWEBench variants). This is consistent across models — speculative decoding helps more on natural language than on code.


How to Use

Gemma-4 support requires our SGLang and SpecForge forks — the patches for hybrid attention, SWAKVPool fixes, and the Gemma-4 chat template are not yet upstream.

Launch the server

```shell
pip install git+https://github.com/tails-mpt/sglang.git

python -m sglang.launch_server \
    --model-path google/gemma-4-31B \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path thoughtworks/Gemma-4-31B-Eagle3 \
    --speculative-num-steps 3 \
    --speculative-num-draft-tokens 8 \
    --speculative-eagle-topk 4 \
    --attention-backend triton \
    --tp 2 \
    --trust-remote-code \
    --port 30000
```

Note: --attention-backend triton is required — FlashInfer is incompatible with Gemma-4's head_dim=512 global layers.

Query

```python
import requests

response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Write a Python function to find the longest common subsequence of two strings."}],
        "max_tokens": 512,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```

What's Next

  1. TP=4-compatible draft head — retrain with a head configuration where Q-heads divide evenly by 4, enabling full tensor parallelism
  2. Regenerated training data — 5–10K samples generated by Gemma-4 itself, replacing generic assistant responses with on-distribution outputs
  3. Upstream patches — contribute the Gemma-4 hybrid attention fixes and SWAKVPool double-free guard back to SGLang



Citation

@inproceedings{li2025eagle3,
  title={{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}
