1.37x Faster on Alibaba's 80B Code Model: EAGLE3 for Qwen3-Coder-Next
Speculative Decoding in 60 Seconds
If you're already familiar with speculative decoding and EAGLE3, skip to the Results section below.
LLM inference is memory-bandwidth bound, not compute-bound. Your GPU spends most of its time loading model weights from memory, not doing math. Speculative decoding exploits this idle compute: a small draft model proposes multiple tokens cheaply, then the full target model verifies them all in a single forward pass — the same cost as generating one token normally.
The output is mathematically identical to what the target model would produce without speculation. This is a guarantee from the accept/reject algorithm, not an approximation.
This lossless property is what makes speculative decoding different from every other LLM optimization technique. Quantization, model distillation, and pruning all trade quality for speed — they change the model's weights or its probability distribution, sometimes subtly, sometimes not. They also introduce new failure modes: a quantized model can behave differently on edge cases that the original handled correctly, and a distilled model is a fundamentally different model. Speculative decoding changes none of this. The target model's weights are untouched. The distribution it samples from is identical. Every accepted token was explicitly verified by the full target model before being emitted — the draft head only ever proposes, never decides. If the draft is wrong, the token is rejected and the target model takes over. The 1.37x throughput gain you see in our benchmarks is a pure speed improvement with zero quality tradeoff and zero new risk surface. If your application works correctly with Qwen3-Coder-Next today, it will behave identically with EAGLE3 enabled.
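The accept/reject rule behind this guarantee fits in a few lines. Here is a toy sketch over a hypothetical three-token vocabulary (pure Python, nothing like the production kernels; `speculative_accept` is our name for illustration):

```python
import random

def speculative_accept(p_target, p_draft, rng):
    """One speculative-sampling step: the draft proposes, the target verifies."""
    # The draft distribution proposes a token.
    token = rng.choices(range(len(p_draft)), weights=p_draft)[0]
    # The target accepts it with probability min(1, p_target / p_draft).
    if rng.random() < min(1.0, p_target[token] / p_draft[token]):
        return token
    # On rejection, resample from the residual max(0, p_target - p_draft).
    # This correction step is exactly what makes the emitted distribution
    # identical to sampling from p_target directly.
    residual = [max(0.0, t - d) for t, d in zip(p_target, p_draft)]
    return rng.choices(range(len(p_target)), weights=residual)[0]
```

Over many draws the emitted tokens follow `p_target` exactly, no matter how bad `p_draft` is; a worse draft only lowers the acceptance rate, never the output quality.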
EAGLE3 (NeurIPS 2025) trains a specialized draft head that conditions on the target model's own internal representations from three points — early, middle, and late layers — rather than being an independent smaller model. The draft head is tiny (~278 MB for Qwen3-Coder-Next) and co-deploys on the same GPUs.
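The shape of that conditioning can be sketched in a few lines of NumPy. Dimensions follow the configuration table later in this post; the fusion projection and its random initialization are purely illustrative, not the released weights:

```python
import numpy as np

rng = np.random.default_rng(0)
seq, hidden = 4, 2048
# Hidden states captured at the three auxiliary layers of the target model
# (early / middle / late) for a 4-token prefix.
h_early, h_mid, h_late = (rng.standard_normal((seq, hidden)) for _ in range(3))
# The draft head first fuses the three captures back down to model width
# with a learned projection (random here, for shapes only)...
w_fuse = rng.standard_normal((3 * hidden, hidden)) * 0.01
fused = np.concatenate([h_early, h_mid, h_late], axis=-1) @ w_fuse
# ...then a single transformer decoder layer plus the LM head (omitted)
# turns `fused` into draft-token logits.
print(fused.shape)  # (4, 2048)
```

Because the draft head reads the target's own representations rather than re-encoding the prompt, it stays tiny while tracking the target far better than an independent small model would.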
For the full algorithm walkthrough, accept/reject rule, and math behind the speedup curve, see our first post on EAGLE3 for GLM-4.7-Flash.
Results
We are releasing thoughtworks/Qwen3-Coder-Next-Eagle3 — an EAGLE3 draft head for Qwen3-Coder-Next, Alibaba's code-first 80B MoE model.
B=1: Up to 1.52x Throughput
Single-user (B=1), temperature 0, TP=4, Triton attention backend, server-side Prometheus metrics. Tree config: steps=3, topk=4, draft_tokens=8.
| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---|---|---|---|
| SWEBench-Verified | 163.9 | 249.7 | 1.52x |
| HumanEval | 171.1 | 237.9 | 1.39x |
| Terminal-Bench | 166.0 | 231.0 | 1.39x |
| MT-Bench | 166.5 | 196.0 | 1.18x |
Mean: 1.37x across all datasets. The draft head costs ~278 MB on top of the ~80B target — well under 1% of model memory.
Hardware: 4x NVIDIA H200 144GB, TP=4, Triton backend. Draft head co-deployed on the same GPUs.
An interesting inversion: this is the first model in our portfolio where code benchmarks outperform conversational ones. SWEBench-Verified leads at 1.52x while MT-Bench trails at 1.18x. For most models the pattern is reversed — conversational text is more predictable and yields higher acceptance rates. Qwen3-Coder-Next was trained specifically on code, so its code outputs are more stylistically consistent, giving the draft head an easier prediction target.
B=32: Wide Tree vs. Narrow Tree
At batch 32, the tree shape matters more than the model. We tested both configurations:
Wide tree (topk=4, steps=3, tokens=8) — same as B=1:
| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---|---|---|---|
| MT-Bench | 1,529.1 | 2,009.4 | 1.31x |
| SWEBench-Verified | 2,010.4 | 2,186.5 | 1.09x |
| HumanEval | 1,740.2 | 1,793.8 | 1.03x |
| Terminal-Bench | 2,310.5 | 2,057.1 | 0.89x |
Wide tree gives excellent MT-Bench results (1.31x) but regresses Terminal-Bench to 0.89x — the tree verification triggers too many MoE expert dispatches under the concurrent load.
Narrow tree (topk=1, steps=5, tokens=6) — optimized for batch:
| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---|---|---|---|
| MT-Bench | 1,529.1 | 1,688.6 | 1.10x |
| Terminal-Bench | 2,310.5 | 2,379.8 | 1.03x |
| HumanEval | 1,740.2 | 1,756.3 | 1.01x |
| SWEBench-Verified | 2,010.4 | 1,998.7 | 1.00x |
Narrow tree eliminates the regression entirely — every dataset stays at or above baseline. The cost is lower peak speedup (MT-Bench drops from 1.31x to 1.10x).
Recommendation: If your workload is known to be conversation-heavy, use wide tree. For mixed or unknown workloads, use narrow tree to avoid regressions. In production, run two server pools with different tree configs — same draft head checkpoint, different launch parameters.
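In SGLang terms, the two pools differ only in their speculative flags (fragment only; all other launch flags stay as in the How to Use section below):

```bash
# Wide tree — conversation-heavy pool (peak MT-Bench speedup, Terminal-Bench regression)
--speculative-num-steps 3 --speculative-eagle-topk 4 --speculative-num-draft-tokens 8

# Narrow tree — mixed/unknown workload pool (no regressions, lower peak)
--speculative-num-steps 5 --speculative-eagle-topk 1 --speculative-num-draft-tokens 6
```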
Configuration
| Parameter | Value |
|---|---|
| Target model | Qwen/Qwen3-Coder-Next (80B MoE, ~3B active) |
| Architecture | MoE: 512 experts, 10 active per token, GDN+attention hybrid, 48 layers |
| Draft head | 1 layer, hidden_size=2048, aux layers [3, 23, 47] |
| Hardware | 4x H200 144GB, TP=4 |
| Training data | 54K mixed (ShareGPT / UltraChat / PerfectBlend) |
| Training | 6 epochs, LR=1e-4 |
| SGLang version | v0.5.6 (tails-mpt/sglang) |
The GDN Challenge
Most EAGLE3 targets are either standard transformers or MoE models with uniform layer types. Qwen3-Coder-Next breaks this assumption.
Its 48 layers are split into two types:
- 12 attention layers (indices 3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47) — standard multi-head attention, every 4th layer
- 36 GDN layers (all others) — linear recurrence layers optimized for long-context efficiency
EAGLE3 captures hidden states from three auxiliary layers to condition the draft head. These hidden states must be per-token representations that the draft head can use to predict the next token. Attention layers produce per-token outputs that satisfy this requirement. GDN layers, however, produce recurrent hidden states — each output depends on the full preceding sequence through a state vector, not just the current position. These recurrent states encode sequence-level information in a form that the draft head cannot decompose back into per-token predictions.
The fix: select auxiliary layers exclusively from the 12 attention layers, choosing first (3), middle (23), and last (47). The model code handles this automatically by reading the full_attention_interval config and filtering out GDN layers.
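The selection logic is simple enough to sketch. This is an illustration of the filtering, assuming the interval-based layout described above; `select_aux_layers` is our name for the sketch, not the SGLang API:

```python
def select_aux_layers(num_layers=48, full_attention_interval=4):
    """Pick EAGLE3 auxiliary layers from the attention layers only."""
    # With interval 4, attention sits at indices 3, 7, ..., 47;
    # every other index is a GDN layer and gets filtered out.
    attn = [i for i in range(num_layers) if (i + 1) % full_attention_interval == 0]
    # First, middle, and last attention layer.
    return [attn[0], attn[(len(attn) - 1) // 2], attn[-1]]

print(select_aux_layers())  # [3, 23, 47]
```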
This is the same class of problem we encountered with Gemma-4's sliding-window vs. global attention duality — non-uniform layer architectures require careful auxiliary layer selection. The pattern is likely to recur as more models adopt hybrid designs.
Where It Fits: EAGLE3 Across Six Models
Qwen3-Coder-Next-Eagle3 is the fifth EAGLE3 draft head we have released. Here is the full comparison, all benchmarked under identical conditions (temp=0, H200 GPUs). Llama-3.1-8B is included as an internal reference — a draft head was trained but never publicly released as a standalone checkpoint.
B=1 Comparison
| Model | Params | Hardware | Mean Speedup |
|---|---|---|---|
| Llama-3.1-8B | 8B dense | 1x H200 | 1.70x |
| GLM-4.7-FP8 | 218B MoE (40B active) | 8x H200 | 1.69x |
| GLM-4.7-Flash | 31B MoE (3B active) | 1x H200 | 1.66x |
| MiniMax-M2.5 | 229B MoE (10B active) | 4x H200 | 1.39x |
| Qwen3-Coder-Next | 80B MoE (3B active) | 4x H200 | 1.37x |
| Gemma-4-31B | 31B dense (hybrid SWA) | 2x H200 | 1.30x |
B=32 Comparison (wide tree)
| Model | Mean Speedup | Any Regressions? |
|---|---|---|
| GLM-4.7-Flash | 1.16x | No |
| GLM-4.7-FP8 | 1.16x | No |
| Qwen3-Coder-Next | 1.06x | Yes (Terminal-Bench 0.89x) |
| MiniMax-M2.5 | 0.96x | Yes (SWEBench: 0.83x) |
Qwen3-Coder-Next sits in the middle of the portfolio at B=1 — just behind MiniMax-M2.5 (1.39x vs 1.37x), ahead of Gemma-4 (1.30x), and well behind both GLM models. The relatively modest B=1 speedup (1.37x vs the portfolio's 1.66–1.70x top) reflects the model's GDN architecture: with only 12 attention layers to capture from (vs 62–92 for other models), the auxiliary hidden states carry less information per sample.
Engineering Notes
TP=4 and the Triton Backend
Qwen3-Coder-Next requires TP=4. TP=8 is ruled out by the FP8 block constraint: the shared expert has intermediate_size=512, and the per-rank shard of 512/8=64 is not divisible by block_n=128, whereas at TP=4 each rank gets exactly 512/4=128. FlashInfer is incompatible with the model's head_dim=256 hybrid attention+GDN layers, so --attention-backend triton is required.
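The divisibility argument is easy to check. A sketch of the constraint (`shard_is_fp8_blockable` is our name for the check, not an SGLang function):

```python
def shard_is_fp8_blockable(intermediate_size, tp, block_n=128):
    # FP8 block quantization tiles each rank's weight shard into
    # block_n-wide blocks; the shard width must divide evenly.
    shard = intermediate_size // tp
    return shard % block_n == 0

# Shared expert: intermediate_size=512
print(shard_is_fp8_blockable(512, 4))  # True  -> 512/4 = 128 per rank
print(shard_is_fp8_blockable(512, 8))  # False -> 512/8 = 64 per rank
```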
Patching SGLang for GDN Layers
Upstream SGLang's qwen3_next.py had no EAGLE3 hooks. SGLang's cuda_graph_runner.py calls set_eagle3_layers_to_capture unconditionally when EAGLE3 is enabled, causing an AttributeError crash. We patched six additions into the model file:
- `layers_to_capture` list on the model class
- Forward-pass capture of `aux_hidden_states` at specified layer indices
- `capture_aux_hidden_states` flag on the causal LM class
- Forward-pass unpacking of auxiliary hidden states
- `set_eagle3_layers_to_capture` method with automatic GDN layer filtering
- Default layer selection: [3, 23, 47] (first, middle, last attention layers)
These patches are in our SGLang fork and are model-specific to qwen3_next.py.
Caveats
- Temperature 0 only for production. At temp>0, MoE expert routing becomes non-deterministic. The draft head cannot predict which experts the target will activate, so acceptance rates drop. Deploy at temp=0 for coding workloads — which is the natural setting for a code model.
- Terminal-Bench regression at B=32 with wide tree. The 0.89x regression disappears with narrow tree (topk=1), but you lose the 1.31x MT-Bench peak. Choose based on workload.
- 4x H200 is the minimum. The model requires TP=4 and Triton backend. Smaller GPU counts are not sufficient.
- SGLang fork required. Our fork includes the GDN layer patches for Qwen3-Next EAGLE3 support that have not yet been upstreamed.
How to Use
```bash
# Install our SGLang fork
pip install 'git+https://github.com/tails-mpt/sglang.git#subdirectory=python'

# Launch server with EAGLE3 (B=1 config)
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-Coder-Next \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/Qwen3-Coder-Next-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 8 \
  --speculative-eagle-topk 4 \
  --tp 4 \
  --trust-remote-code \
  --attention-backend triton \
  --port 30000
```
```python
import requests

response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Write a Python function to find all cycles in a directed graph."}],
        "max_tokens": 512,
        "temperature": 0,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```
The draft head checkpoint is ~278 MB and co-deploys on the same GPUs as the target model. No additional hardware required.
Links
- Draft model: thoughtworks/Qwen3-Coder-Next-Eagle3
- Target model: Qwen/Qwen3-Coder-Next
- Previous EAGLE3 posts: GLM-4.7-Flash (deep dive) | Gemma-4 (hybrid attention) | MiniMax-M2.5 (MoE tree shapes) | GLM-4.7-FP8 (no B=32 regressions)
- SGLang fork: github.com/tails-mpt/sglang
- SpecForge fork (training): github.com/tails-mpt/SpecForge
- SpecJAX (TPU training): github.com/tails-mpt/SpecJAX
- EAGLE3 paper: arXiv:2503.01840
Citation
```bibtex
@inproceedings{li2025eagle3,
  title={{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}
```