GTO: Group Tree Optimization for Speculative Decoding

Group Tree Optimization (GTO) is a framework designed to bridge the gap between training objectives and decoding policies in speculative decoding. While standard speculative decoding uses a tree-based policy for token verification, typical training objectives only optimize for a single greedy path. GTO aligns these by introducing a Draft Tree Reward and Group-based Draft Policy Training.

Overview

GTO addresses draft policy misalignment through two primary components:

  1. Draft Tree Reward: A sampling-free objective equal to the expected acceptance length of the draft tree under the target model, directly measuring decoding performance.
  2. Group-based Draft Policy Training: A stable optimization scheme that contrasts trees from the current and a frozen reference draft model, applying a PPO-style surrogate for robust updates.
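At its core, the Draft Tree Reward is an expected acceptance length over the draft tree. The sketch below is a simplified illustration, not the paper's implementation: the tree representation, function name, and the assumption that each node carries the marginal probability that the target model accepts its token are ours. Under that assumption, a node contributes the product of acceptance probabilities along its root-to-node path, since it is only kept if every ancestor is also accepted.

```python
def expected_acceptance_length(children, accept_prob, root):
    """Expected number of accepted draft tokens in a draft tree (illustrative sketch).

    children:    dict mapping a node id to a list of its child node ids
    accept_prob: dict mapping a node id to the (assumed marginal) probability
                 that the target model accepts that draft token
    root:        node id of the first draft token
    """
    total = 0.0
    # Each stack entry carries the path-product of acceptance probabilities
    # from the root down to (and including) that node.
    stack = [(root, accept_prob[root])]
    while stack:
        node, path_prob = stack.pop()
        total += path_prob  # node is counted iff its whole path is accepted
        for child in children.get(node, []):
            stack.append((child, path_prob * accept_prob[child]))
    return total
```

For example, a two-token chain with per-token acceptance probability 0.5 yields an expected acceptance length of 0.5 + 0.25 = 0.75. Because this quantity is computed from probabilities rather than sampled trajectories, it matches the "sampling-free" property described above.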

Performance

Across dialogue (MT-Bench), code (HumanEval), and math (GSM8K), GTO achieves significant acceleration:

  • Up to 5.6x faster than vanilla autoregressive decoding.
  • Yields an additional 7.7% speedup over prior state-of-the-art methods like EAGLE-3.
  • Increases token acceptance length by 7.4%.

Inference

The inference code provided in the official repository automatically handles model weight allocation across multiple GPUs. You can launch a web interface using the following command:

python -m application.webui --ea-model-path [path of GTO weight] \
        --base-model-path [path of the original model] \
        --model-type [vicuna|llama3|qwen] \
        --total-token [int]

Note: --total-token sets the number of draft tokens. Tuning this value for your hardware and base model can further improve throughput.

Citation

@article{hu2025bridging,
  title={Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding},
  author={Hu, Shijing and Li, Jingyang and Lu, Zhihui and Zhou, Pan},
  journal={arXiv preprint arXiv:2509.22134},
  year={2025}
}

Acknowledgements

This implementation is based on the EAGLE repository and influenced by projects like HASS and GRIFFIN.
