1.37x Faster on Alibaba's 80B Code Model: EAGLE3 for Qwen3-Coder-Next

Community Article Published April 15, 2026

At Thoughtworks, we build inference optimization tools for production LLM deployments. Qwen3-Coder-Next is Alibaba's 80-billion-parameter code-focused Mixture-of-Experts model, released in February 2026 — 512 experts, 10 active per token. Its defining feature is a hybrid layer design: 36 GDN (linear recurrence) layers interleaved with only 12 standard attention layers. We trained an EAGLE3 draft head and measured a mean single-user speedup of 1.37x, peaking at 1.52x on SWEBench-Verified. The hybrid architecture required careful handling — EAGLE3 auxiliary layers must be selected from attention layers only, since GDN recurrent states are incompatible with speculative decoding.


Speculative Decoding in 60 Seconds

If you're already familiar with speculative decoding and EAGLE3, skip to the Results section below.

LLM inference is memory-bandwidth bound, not compute-bound. Your GPU spends most of its time loading model weights from memory, not doing math. Speculative decoding exploits this idle compute: a small draft model proposes multiple tokens cheaply, then the full target model verifies them all in a single forward pass — the same cost as generating one token normally.

The output is mathematically identical to what the target model would produce without speculation. This is a guarantee from the accept/reject algorithm, not an approximation.
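The guarantee comes from the standard speculative-sampling accept/reject rule. A minimal sketch for a single proposed token, using toy hand-written distributions in place of real model outputs:

```python
import random

def accept_or_resample(p_target, q_draft, proposed, rng=random.random):
    """Speculative-sampling rule for one proposed token.

    p_target, q_draft: dicts mapping token -> probability under the
    target and draft models. proposed: a token sampled from q_draft.
    Accept with probability min(1, p/q); on rejection, resample from
    the residual distribution max(p - q, 0), renormalized. The marginal
    over emitted tokens is then exactly p_target — lossless by design.
    """
    p, q = p_target.get(proposed, 0.0), q_draft[proposed]
    if rng() < min(1.0, p / q):
        return proposed, True  # draft token accepted
    # Rejected: sample from the renormalized residual max(p - q, 0).
    residual = {t: max(p_target.get(t, 0.0) - q_draft.get(t, 0.0), 0.0)
                for t in p_target}
    z = sum(residual.values())
    r, acc = rng() * z, 0.0
    for t, w in residual.items():
        acc += w
        if r <= acc:
            return t, False
    return max(residual, key=residual.get), False  # float-rounding fallback
```

At temperature 0 this degenerates to an exact-match check: a drafted token is accepted if and only if it equals the target model's argmax token.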

This lossless property is what makes speculative decoding different from every other LLM optimization technique. Quantization, model distillation, and pruning all trade quality for speed — they change the model's weights or its probability distribution, sometimes subtly, sometimes not. They also introduce new failure modes: a quantized model can behave differently on edge cases that the original handled correctly, and a distilled model is a fundamentally different model.

Speculative decoding changes none of this. The target model's weights are untouched. The distribution it samples from is identical. Every accepted token was explicitly verified by the full target model before being emitted — the draft head only ever proposes, never decides. If the draft is wrong, the token is rejected and the target model takes over. The 1.37x throughput gain you see in our benchmarks is a pure speed improvement with zero quality tradeoff and zero new risk surface. If your application works correctly with Qwen3-Coder-Next today, it will behave identically with EAGLE3 enabled.

EAGLE3 (NeurIPS 2025) trains a specialized draft head that conditions on the target model's own internal representations from three points — early, middle, and late layers — rather than being an independent smaller model. The draft head is tiny (~278 MB for Qwen3-Coder-Next) and co-deploys on the same GPUs.
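As a mental model, the three captured hidden states are fused into a single input for the draft head before its decoder layer runs. A toy numpy sketch of that fusion step — the 2048 draft width and three auxiliary layers match our configuration below, but the 4096 target width and the random projection are illustrative stand-ins, not the real weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_target, d_draft = 4096, 2048  # d_target is illustrative; d_draft matches our config

# Per-token hidden states captured at the three auxiliary layers.
aux = [rng.standard_normal(d_target) for _ in range(3)]  # early, middle, late

# EAGLE3 concatenates the auxiliary states and projects them down to
# the draft head's width; the draft head then predicts the next token.
W = rng.standard_normal((3 * d_target, d_draft)) / np.sqrt(3 * d_target)
fused = np.concatenate(aux) @ W
assert fused.shape == (d_draft,)
```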

For the full algorithm walkthrough, accept/reject rule, and math behind the speedup curve, see our first post on EAGLE3 for GLM-4.7-Flash.


Results

We are releasing thoughtworks/Qwen3-Coder-Next-Eagle3 — an EAGLE3 draft head for Qwen3-Coder-Next, Alibaba's code-first 80B MoE model.

B=1: Up to 1.52x Throughput

Single-user (B=1), temperature 0, TP=4, Triton attention backend, server-side Prometheus metrics. Tree config: steps=3, topk=4, draft_tokens=8.

| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---|---|---|---|
| SWEBench-Verified | 163.9 | 249.7 | 1.52x |
| HumanEval | 171.1 | 237.9 | 1.39x |
| Terminal-Bench | 166.0 | 231.0 | 1.39x |
| MT-Bench | 166.5 | 196.0 | 1.18x |

Mean: 1.37x across all datasets. The draft head costs ~278 MB on top of the ~80B target — well under 1% of model memory.

01-speedup-bars

Hardware: 4x NVIDIA H200 144GB, TP=4, Triton backend. Draft head co-deployed on the same GPUs.

An interesting inversion: this is the first model in our portfolio where code benchmarks outperform conversational ones. SWEBench-Verified leads at 1.52x while MT-Bench trails at 1.18x. For most models the pattern is reversed — conversational text is more predictable and yields higher acceptance rates. Qwen3-Coder-Next was trained specifically on code, so its code outputs are more stylistically consistent, giving the draft head an easier prediction target.

B=32: Wide Tree vs. Narrow Tree

At batch 32, the tree shape matters more than the model. We tested both configurations:

Wide tree (topk=4, steps=3, tokens=8) — same as B=1:

| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---|---|---|---|
| MT-Bench | 1,529.1 | 2,009.4 | 1.31x |
| SWEBench-Verified | 2,010.4 | 2,186.5 | 1.09x |
| HumanEval | 1,740.2 | 1,793.8 | 1.03x |
| Terminal-Bench | 2,310.5 | 2,057.1 | 0.89x |

Wide tree gives excellent MT-Bench results (1.31x) but regresses Terminal-Bench to 0.89x — the tree verification triggers too many MoE expert dispatches under the concurrent load.

Narrow tree (topk=1, steps=5, tokens=6) — optimized for batch:

| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---|---|---|---|
| MT-Bench | 1,529.1 | 1,688.6 | 1.10x |
| Terminal-Bench | 2,310.5 | 2,379.8 | 1.03x |
| HumanEval | 1,740.2 | 1,756.3 | 1.01x |
| SWEBench-Verified | 2,010.4 | 1,998.7 | 1.00x |

Narrow tree eliminates the regression entirely — every dataset stays at or above baseline. The cost is lower peak speedup (MT-Bench drops from 1.31x to 1.10x).

02-batch-comparison

Recommendation: If your workload is known to be conversation-heavy, use wide tree. For mixed or unknown workloads, use narrow tree to avoid regressions. In production, run two server pools with different tree configs — same draft head checkpoint, different launch parameters.
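The two pools differ only in the three speculative-tree flags. A sketch of the pair of launch variants — `...` stands for the common flags (model path, draft path, TP, backend), which are spelled out in full in the How to Use section:

```shell
# Pool A — wide tree for conversation-heavy traffic: topk=4, steps=3, tokens=8
python -m sglang.launch_server ... \
    --speculative-eagle-topk 4 --speculative-num-steps 3 \
    --speculative-num-draft-tokens 8 --port 30000

# Pool B — narrow tree for mixed/unknown traffic: topk=1, steps=5, tokens=6
python -m sglang.launch_server ... \
    --speculative-eagle-topk 1 --speculative-num-steps 5 \
    --speculative-num-draft-tokens 6 --port 30001
```

Both pools load the same target model and the same draft head checkpoint; only the tree shape at verification time changes.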

Configuration

| Parameter | Value |
|---|---|
| Target model | Qwen/Qwen3-Coder-Next (80B MoE, ~3B active) |
| Architecture | MoE: 512 experts, 10 active per token, GDN+attention hybrid, 48 layers |
| Draft head | 1 layer, hidden_size=2048, aux layers [3, 23, 47] |
| Hardware | 4x H200 144GB, TP=4 |
| Training data | 54K mixed (ShareGPT / UltraChat / PerfectBlend) |
| Training | 6 epochs, LR=1e-4 |
| SGLang version | v0.5.6 (tails-mpt/sglang) |

The GDN Challenge

Most EAGLE3 targets are either standard transformers or MoE models with uniform layer types. Qwen3-Coder-Next breaks this assumption.

Its 48 layers are split into two types:

  • 12 attention layers (indices 3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47) — standard multi-head attention, every 4th layer
  • 36 GDN layers (all others) — linear recurrence layers optimized for long-context efficiency

EAGLE3 captures hidden states from three auxiliary layers to condition the draft head. These hidden states must be per-token representations that the draft head can use to predict the next token. Attention layers produce per-token outputs that satisfy this requirement. GDN layers, however, produce recurrent hidden states — each output depends on the full preceding sequence through a state vector, not just the current position. These recurrent states encode sequence-level information in a form that the draft head cannot decompose back into per-token predictions.

03-gdn-architecture

The fix: select auxiliary layers exclusively from the 12 attention layers, choosing first (3), middle (23), and last (47). The model code handles this automatically by reading the full_attention_interval config and filtering out GDN layers.
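The selection logic reduces to simple index arithmetic. A sketch, assuming `full_attention_interval=4` as the 48-layer layout above implies (the real implementation lives in our patched `qwen3_next.py`):

```python
def eagle3_aux_layers(num_layers: int = 48, full_attention_interval: int = 4):
    """Pick first/middle/last full-attention layers as EAGLE3 aux layers,
    skipping GDN layers, whose recurrent states the draft head cannot use."""
    # Attention layers sit at every `full_attention_interval`-th position:
    # indices 3, 7, 11, ..., 47 for the 48-layer Qwen3-Coder-Next.
    attn = [i for i in range(num_layers) if (i + 1) % full_attention_interval == 0]
    return [attn[0], attn[(len(attn) - 1) // 2], attn[-1]]

print(eagle3_aux_layers())  # → [3, 23, 47]
```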

This is the same class of problem we encountered with Gemma-4's sliding-window vs. global attention duality — non-uniform layer architectures require careful auxiliary layer selection. The pattern is likely to recur as more models adopt hybrid designs.


Where It Fits: EAGLE3 Across Six Models

Qwen3-Coder-Next is the fifth EAGLE3 draft head we have released. Here is the full comparison, all benchmarked under identical conditions (temp=0, H200 GPUs). Llama-3.1-8B is included as an internal reference — a draft head was trained but never publicly released as a standalone checkpoint.

B=1 Comparison

| Model | Params | Hardware | Mean Speedup |
|---|---|---|---|
| Llama-3.1-8B | 8B dense | 1x H200 | 1.70x |
| GLM-4.7-FP8 | 218B MoE (40B active) | 8x H200 | 1.69x |
| GLM-4.7-Flash | 31B MoE (3B active) | 1x H200 | 1.66x |
| MiniMax-M2.5 | 229B MoE (10B active) | 4x H200 | 1.39x |
| Qwen3-Coder-Next | 80B MoE (3B active) | 4x H200 | 1.37x |
| Gemma-4-31B | 31B dense (hybrid SWA) | 2x H200 | 1.30x |

B=32 Comparison (wide tree)

| Model | Mean Speedup | Any Regressions? |
|---|---|---|
| GLM-4.7-Flash | 1.16x | No |
| GLM-4.7-FP8 | 1.16x | No |
| Qwen3-Coder-Next | 1.06x | Yes (Terminal-Bench 0.89x) |
| MiniMax-M2.5 | 0.96x | Yes (SWEBench: 0.83x) |

Qwen3-Coder-Next sits in the middle of the portfolio at B=1 — ahead of Gemma, a hair behind MiniMax, and well behind both GLM models. The relatively modest B=1 speedup (1.37x vs the portfolio's 1.66–1.70x top) reflects the model's GDN architecture: with only 12 attention layers to capture from (vs 62–92 for other models), the auxiliary hidden states carry less information per sample.

04-portfolio


Engineering Notes

TP=4 and the Triton Backend

Qwen3-Coder-Next requires TP=4: the FP8 block constraint rules out TP=8, because the shared expert has intermediate_size=512 and the 512/8=64 shard width is not divisible by block_n=128, whereas 512/4=128 divides evenly. FlashInfer is incompatible with the model's head_dim=256 hybrid attention+GDN layers, so --attention-backend triton is required.
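The constraint is easy to check with the numbers above (block_n=128 is the FP8 quantization block size):

```python
intermediate_size, block_n = 512, 128  # shared expert width, FP8 block size

for tp in (4, 8):
    shard = intermediate_size // tp  # per-GPU shard of the shared expert
    ok = shard % block_n == 0
    print(f"TP={tp}: shard width {shard} -> {'ok' if ok else 'violates block constraint'}")
# TP=4 yields a shard of 128, divisible by block_n; TP=8 yields 64, which is not.
```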

Patching SGLang for GDN Layers

Upstream SGLang's qwen3_next.py had no EAGLE3 hooks, yet SGLang's cuda_graph_runner.py calls set_eagle3_layers_to_capture unconditionally when EAGLE3 is enabled, causing an AttributeError crash. We patched the model file with six additions:

  1. layers_to_capture list on the model class
  2. Forward-pass capture of aux_hidden_states at specified layer indices
  3. capture_aux_hidden_states flag on the causal LM class
  4. Forward-pass unpacking of auxiliary hidden states
  5. set_eagle3_layers_to_capture method with automatic GDN layer filtering
  6. Default layer selection: [3, 23, 47] (first, middle, last attention layers)

These patches are in our SGLang fork and are model-specific to qwen3_next.py.


Caveats

  • Temperature 0 only for production. At temp>0, MoE expert routing becomes non-deterministic. The draft head cannot predict which experts the target will activate, so acceptance rates drop. Deploy at temp=0 for coding workloads — which is the natural setting for a code model.
  • Terminal-Bench regression at B=32 with wide tree. The 0.89x regression disappears with narrow tree (topk=1), but you lose the 1.31x MT-Bench peak. Choose based on workload.
  • 4x H200 is the minimum. The model requires TP=4 and Triton backend. Smaller GPU counts are not sufficient.
  • SGLang fork required. Our fork includes the GDN layer patches for Qwen3-Next EAGLE3 support that have not yet been upstreamed.

How to Use

```shell
# Install our SGLang fork
pip install 'git+https://github.com/tails-mpt/sglang.git#subdirectory=python'

# Launch server with EAGLE3 (B=1 config)
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-Coder-Next \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path thoughtworks/Qwen3-Coder-Next-Eagle3 \
    --speculative-num-steps 3 \
    --speculative-num-draft-tokens 8 \
    --speculative-eagle-topk 4 \
    --tp 4 \
    --trust-remote-code \
    --attention-backend triton \
    --port 30000
```

Then query the OpenAI-compatible endpoint:

```python
import requests

response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Write a Python function to find all cycles in a directed graph."}],
        "max_tokens": 512,
        "temperature": 0,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```

The draft head checkpoint is ~278 MB and co-deploys on the same GPUs as the target model. No additional hardware required.



Citation

@inproceedings{li2025eagle3,
  title={{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}
