DFlare Draft Model for Qwen3-4B

This is the official DFlare draft model checkpoint for Qwen/Qwen3-4B, released alongside the paper:

DFlare: Scaling Up Draft Capacity for Block Diffusion Speculative Decoding

DFlare is a block-diffusion speculative decoding framework that accelerates large language model inference by predicting an entire block of tokens in one shot for the target model to verify in parallel. It removes the narrow conditioning bottleneck of the prior state-of-the-art DFlash through a lightweight layer-wise fusion mechanism: each draft layer attends to its own learnable combination of a broad set of target layers at negligible overhead, simultaneously injecting richer target knowledge and giving every draft layer a distinct input. Combined with training-data scaling, this enhanced per-layer expressiveness allows the draft model to scale to deeper architectures with consistent gains, achieving 5.52× end-to-end speedup on Qwen3-4B without compromising output quality.
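To make the idea of a per-layer "learnable combination" of target layers concrete, here is a minimal pure-Python sketch. It assumes the combination is a softmax-weighted sum with one row of learnable logits per draft layer; that is an illustrative assumption, not a description of the actual DFlare fusion module, and the names fuse_target_states and mixing_logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_target_states(target_states, mixing_logits):
    """Return one fused hidden vector per draft layer.

    target_states: K hidden vectors, one per selected target layer.
    mixing_logits: D rows of K logits, one row per draft layer
                   (assumed learnable; hypothetical parameterization).
    """
    dim = len(target_states[0])
    fused = []
    for row in mixing_logits:
        weights = softmax(row)
        # Weighted sum over target layers, computed per hidden dimension.
        vec = [
            sum(w * state[i] for w, state in zip(weights, target_states))
            for i in range(dim)
        ]
        fused.append(vec)
    return fused


# Toy example: 3 target layers, hidden size 2, 2 draft layers.
target_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixing_logits = [
    [0.0, 0.0, 0.0],       # draft layer 0: uniform mix of all target layers
    [10.0, -10.0, -10.0],  # draft layer 1: almost entirely target layer 0
]
fused = fuse_target_states(target_states, mixing_logits)
```

Because each draft layer has its own row of weights, every draft layer receives a different input even though all of them draw on the same set of target hidden states.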

📖 Documentation & code: https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/dflare.html

📦 Repo: Tencent/AngelSlim


Model Details

Target model: Qwen/Qwen3-4B
Draft architecture: DFlare (7 layers, hidden_size=2560, attention_heads=32, GQA kv_heads=8)
Parameters: ~743 M
Block size: 16
Target layers used for fusion: [1, 5, 9, 13, 17, 21, 25, 29, 33] (out of 36)
Precision: bfloat16
RoPE: rope_theta = 1,000,000 (no scaling)
Vocab size: 151,936
Tied embeddings: yes (tie_word_embeddings = true)

The draft predicts a block of block_size tokens in parallel, conditioned on (i) target hidden states extracted from the listed target layers and (ii) noise embeddings of the previous block. The target model verifies the block in a single forward pass and accepts the longest matching prefix.
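The verification step can be sketched in a few lines of Python. This is a minimal illustration of greedy longest-prefix acceptance, not AngelSlim's implementation; the function name and the toy token IDs are hypothetical. Keeping the target's own token at the first mismatch is common in greedy speculative decoding, and whether DFlare does the same is an assumption here.

```python
def accept_longest_prefix(draft_tokens, target_tokens):
    """Count the leading draft tokens that match the target's greedy choices."""
    accepted = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break
        accepted += 1
    return accepted


# Toy block of 5 tokens (the checkpoint itself uses block_size = 16).
draft_block = [11, 42, 7, 99, 5]
target_greedy = [11, 42, 7, 13, 5]  # target's greedy token at each position

n = accept_longest_prefix(draft_block, target_greedy)

# Assumption: the target's token at the first mismatch is also committed,
# so each verification pass yields at least one new token.
committed = draft_block[:n] + target_greedy[n:n + 1]
```

In this toy case the first three draft tokens are accepted and the fourth is replaced by the target's choice; the trailing match at position five is discarded because it follows a mismatch.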


How to Use

This checkpoint is loaded with AngelSlim's QwenDFlareDraftModel class.

1. Install AngelSlim

git clone https://github.com/Tencent/AngelSlim.git
cd AngelSlim
pip install -e .

2. Run end-to-end speculative decoding benchmark

The repo ships a self-contained benchmark entry that supports both DFlash and DFlare drafts via --draft-arch:

# Single-GPU
python tools/dflash_benchmark.py \
    --model-name-or-path Qwen/Qwen3-4B \
    --draft-name-or-path dflare/qwen3-4b-dflare \
    --draft-arch dflare \
    --dataset gsm8k \
    --max-samples 128 \
    --max-new-tokens 2048 \
    --temperature 0.0
# 8-GPU (workload sharded across ranks, results gathered to rank 0)
torchrun --nproc_per_node=8 --master_port=29600 \
    tools/dflash_benchmark.py \
    --model-name-or-path Qwen/Qwen3-4B \
    --draft-name-or-path dflare/qwen3-4b-dflare \
    --draft-arch dflare \
    --dataset gsm8k \
    --max-samples 128 \
    --max-new-tokens 2048 \
    --temperature 0.0

The script reports:

  • Decoding speedup vs. single-token autoregressive decoding
  • Average acceptance length per block
  • Per-block acceptance-length histogram
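The benchmark's internal bookkeeping is not shown in this card, but the three reported quantities can be derived roughly as follows. This is a sketch under stated assumptions: the helper names are hypothetical, the acceptance lengths are made-up toy values, and speedup is taken to be a plain ratio of wall-clock times.

```python
from collections import Counter


def summarize_acceptance(accept_lengths):
    """Return (mean acceptance length, histogram) for per-block accepted counts."""
    mean_len = sum(accept_lengths) / len(accept_lengths)
    histogram = dict(sorted(Counter(accept_lengths).items()))
    return mean_len, histogram


def speedup(baseline_seconds, speculative_seconds):
    """Wall-clock ratio of autoregressive decoding time to speculative decoding time."""
    return baseline_seconds / speculative_seconds


# Toy data: accepted tokens in each of five verified blocks (illustrative only).
lengths = [4, 6, 6, 8, 16]
mean_len, hist = summarize_acceptance(lengths)
ratio = speedup(baseline_seconds=10.0, speculative_seconds=2.0)
```

A higher average acceptance length means fewer target forward passes per generated token, which is what drives the speedup figure.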

⚠️ Do not pass --block-size. The benchmark reads block_size=16 from this checkpoint's config.json, and overriding it breaks the train/test alignment.
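To confirm the value the benchmark will pick up, the config can be inspected directly from a local copy of the checkpoint. The sketch below assumes block_size is stored as a top-level key in config.json, which this card implies but does not spell out; read_block_size is a hypothetical helper, not part of AngelSlim.

```python
import json
from pathlib import Path


def read_block_size(checkpoint_dir):
    """Read block_size from a local checkpoint's config.json.

    Assumes a top-level "block_size" key; raises KeyError if it is absent.
    """
    config_path = Path(checkpoint_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return config["block_size"]
```

Calling read_block_size on the downloaded checkpoint directory should return 16 for this model if the key layout matches the assumption.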

Supported datasets out of the box: gsm8k, math500, aime24, aime25, alpaca, mt-bench, humaneval, mbpp, lbpp, swe-bench, livecodebench.

3. Load the checkpoint manually

import torch
from angelslim.compressor.speculative.train.models.draft.qwen_dflare import (
    QwenDFlareDraftModel,
)

draft = QwenDFlareDraftModel.from_pretrained(
    "dflare/qwen3-4b-dflare",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).cuda().eval()

print(draft.target_layer_ids)   # [1, 5, 9, 13, 17, 21, 25, 29, 33]
print(draft.block_size)          # 16
print(draft.mask_token_id)       # 151669

Performance

On six benchmarks spanning mathematical reasoning, code generation, and conversation, DFlare on Qwen3-4B delivers a 5.52× average wall-clock speedup over single-token autoregressive decoding, an improvement of roughly 11% over DFlash, with no degradation in output quality. Because the target model verifies every block, the output under greedy decoding is identical to what the target model would produce on its own.

For full per-task results, ablations, and acceptance-length distributions, see the official documentation.


Citation

@article{DFlare2026,
  title={DFlare: Scaling Up Draft Capacity for Block Diffusion Speculative Decoding},
  author={Jiebin Zhang and Zhenghan Yu and Song Liu and Eugene J. Yu and Zheng Li and Dawei Zhu and Jiangshan Duo and Weimin Xiong and Yifan Song and Guanghua Yu and Jianchen Zhu and Sujian Li},
  journal={arXiv preprint arXiv},
  year={2026}
}

License

This checkpoint is released under the Apache 2.0 license, following the AngelSlim project. The target model Qwen/Qwen3-4B retains its own license; consult the target model card before deployment.
