GTO: Group Tree Optimization for Speculative Decoding

This repository contains the draft model weights for GTO (Group Tree Optimization), as introduced in the paper Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding.

GTO is a novel framework designed to address draft policy misalignment in speculative decoding. It aligns training with the decoding-time tree policy through two main components:

  1. Draft Tree Reward: A sampling-free objective equal to the expected acceptance length of the draft tree under the target model.
  2. Group-based Draft Policy Training: A stable optimization scheme that contrasts trees from the current and a frozen reference draft model.
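To make the first component concrete, here is a toy sketch of an "expected acceptance length" for a draft tree. This is an illustration under simplifying assumptions (each node carries an acceptance probability given its parent was accepted, and sibling acceptances are treated as mutually exclusive), not the paper's exact Draft Tree Reward:

```python
# Toy model: expected acceptance length of a draft tree.
# Assumption (not from the paper): each node's `p` is the target model's
# probability of accepting that draft token given its parent was accepted,
# and sibling acceptances are mutually exclusive. Then
#   E[#accepted tokens] = sum over nodes v of prod_{u on path to v} p(u).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    p: float                                   # acceptance prob. of this token
    children: List["Node"] = field(default_factory=list)

def expected_acceptance_length(roots: List[Node]) -> float:
    """Sum, over all nodes, of the probability the node is accepted."""
    total = 0.0
    stack = [(r, 1.0) for r in roots]          # (node, prob. parent chain accepted)
    while stack:
        node, prefix = stack.pop()
        path_p = prefix * node.p               # prob. this node is accepted
        total += path_p
        stack.extend((c, path_p) for c in node.children)
    return total

# A two-token chain with acceptance probs 0.8 and 0.5:
chain = [Node(0.8, [Node(0.5)])]
print(expected_acceptance_length(chain))       # 0.8 + 0.8*0.5 = 1.2
```

Because this quantity is a closed-form sum over tree nodes, it can be evaluated without sampling rollouts, which is what makes a "sampling-free" objective of this shape trainable.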

Performance

GTO achieves significant speedups across dialogue (MT-Bench), code (HumanEval), and math (GSM8K) tasks:

  • 5.6x faster than vanilla autoregressive decoding.
  • 7.7% additional speedup over prior state-of-the-art methods like EAGLE-3.

Inference

To use these weights, run the inference code provided in the official GTO repository. The implementation supports multi-GPU weight allocation.

You can use the suggested web interface by running:

python -m application.webui --ea-model-path [path of GTO weight] \
        --base-model-path [path of the original model] \
        --model-type [vicuna|llama3|qwen] \
        --total-token [int]

The --total-token parameter sets the number of draft tokens in the tree. Tuning this value for the specific device and model can yield better speedups.
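One simple way to tune this is to time generation at a few candidate values and keep the fastest. The sketch below is hypothetical: `run_generation` stands in for a call into the actual inference code (here it is stubbed so the harness runs anywhere), and the candidate values are arbitrary examples:

```python
# Hypothetical tuning harness for --total-token: time each candidate and
# return the fastest. `run_generation` is a stub standing in for a real
# generation call against the GTO/base model pair.
import time

def run_generation(total_token: int) -> None:
    # Stub: pretend mid-sized draft trees are fastest on this "device".
    time.sleep(0.01 * abs(total_token - 60) / 60)

def best_total_token(candidates):
    timings = {}
    for tt in candidates:
        start = time.perf_counter()
        run_generation(tt)
        timings[tt] = time.perf_counter() - start
    return min(timings, key=timings.get)

print(best_total_token([16, 32, 60, 96]))  # fastest candidate under the stub
```

In practice you would replace the stub with a real call that generates a fixed prompt set, since the best setting depends on GPU memory bandwidth and model size.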

Citation

@article{hu2025bridging,
  title={Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding},
  author={Hu, Shijing and Li, Jingyang and Lu, Zhihui and Zhou, Pan},
  journal={arXiv preprint arXiv:2509.22134},
  year={2025}
}

Acknowledgements

This implementation is based on the open-source repository of EAGLE. This project has also been influenced by HASS, GRIFFIN, and other projects in the LLM community.
