Upload 3 files

Files changed (4)

.gitattributes +1 -0
README.md +31 -14
assets/dflash_system.png +3 -0
assets/speedup.png +0 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

+assets/dflash_system.png filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -1,31 +1,26 @@
 ---
-language:
-- en
-- zh
 license: apache-2.0
-base_model: Qwen/Qwen3-8B
 tags:
 - speculative-decoding
 - diffusion
 - efficiency
-- dflash
-- faster-inference
 ---
 # Qwen3-8B-DFlash-b16
-**DFlash** is a lightweight **block diffusion** model designed for speculative decoding. It enables efficient and high-quality parallel drafting by conditioning on the context features extracted from the target model (Qwen3-8B).
 This model is the **drafter** component. It must be used in conjunction with the target model `Qwen/Qwen3-8B`.
 <div align="center">
-| [**Paper (Coming Soon)**](#) | [**GitHub**](https://github.com/z-lab/dspec-dev) |
 </div>
-**TL;DR:** In this work, we introduce **DFlash**, a method utilizing a lightweight **block diffusion** model for drafting in speculative decoding. This enables efficient and high-quality parallel drafting, pushing the limits of speculative decoding. DFlash achieves up to **6.02×** speedup on **Qwen3-8B**, nearly **2.5×** faster than the SOTA speculative decoding method **EAGLE-3**.
 ## 🚀 Quick Start
 This model requires `trust_remote_code=True` to load the custom architecture for block diffusion generation.
@@ -36,14 +31,14 @@ Ensure you have `transformers` and `torch` installed. Our evaluation is conducte
 pip install transformers==4.57.3 torch==2.9.0
 ```
-### Example Usage
 The following example demonstrates how to load the DFlash drafter and the Qwen3-8B target model to perform speculative decoding.
 ```python
 import torch
 from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer
 # 1. Load the DFlash Draft Model
-# Note: trust_remote_code=True is required for the custom diffusion architecture. We currently recommend running on a single GPU.
 model = AutoModel.from_pretrained(
     "z-lab/Qwen3-8B-DFlash-b16",
     trust_remote_code=True,
@@ -90,3 +85,25 @@ generate_ids = model.spec_generate(
 print(tokenizer.decode(generate_ids[0], skip_special_tokens=True))
 ```

 ---
 license: apache-2.0
+library_name: transformers
 tags:
 - speculative-decoding
 - diffusion
 - efficiency
+- flash-decoding
+- qwen
+- diffusion-language-model
 ---
 # Qwen3-8B-DFlash-b16
+[**Paper (Coming Soon)**](#) | [**GitHub**](https://github.com/z-lab/dspec-dev) | [**Blog**](https://z-lab.ai/projects/dflash/)
+**DFlash** is a novel speculative decoding method that utilizes a lightweight **block diffusion** model for drafting. It enables efficient, high-quality parallel drafting that pushes the limits of inference speed.
 This model is the **drafter** component. It must be used in conjunction with the target model `Qwen/Qwen3-8B`.
 <div align="center">
+  <img src="assets/dflash_system.png" alt="DFlash Architecture" width="100%">
 </div>
 ## 🚀 Quick Start
 This model requires `trust_remote_code=True` to load the custom architecture for block diffusion generation.
 pip install transformers==4.57.3 torch==2.9.0
 ```
+### Inference Example
 The following example demonstrates how to load the DFlash drafter and the Qwen3-8B target model to perform speculative decoding.
 ```python
 import torch
 from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer
 # 1. Load the DFlash Draft Model
+# Note: trust_remote_code=True is required for DFlash. We currently recommend running on a single GPU.
 model = AutoModel.from_pretrained(
     "z-lab/Qwen3-8B-DFlash-b16",
     trust_remote_code=True,
 print(tokenizer.decode(generate_ids[0], skip_special_tokens=True))
 ```
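To make the drafter/target split above more concrete, here is a minimal, self-contained sketch of greedy speculative decoding with toy stand-in models. It is not the DFlash implementation and does not use the `spec_generate` API: `toy_target`, `toy_drafter`, `verify_block`, `spec_generate_toy`, and the block size of 4 are all hypothetical names and values chosen for illustration. It only shows the general idea that a drafter proposes a block of tokens, the target checks them, and the accepted output matches what the target would have produced on its own.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]
DraftBlock = Callable[[List[Token], int], List[Token]]


def toy_target(prefix: List[Token]) -> Token:
    """Hypothetical target model: greedily predicts (last token + 1) mod 10."""
    return (prefix[-1] + 1) % 10


def toy_drafter(prefix: List[Token], block_size: int) -> List[Token]:
    """Hypothetical drafter: proposes a whole block, but is deliberately wrong on token 7."""
    block: List[Token] = []
    last = prefix[-1]
    for _ in range(block_size):
        guess = (last + 1) % 10
        last = 0 if guess == 7 else guess  # injected drafting mistake
        block.append(last)
    return block


def verify_block(prefix: List[Token], block: List[Token], target_next: NextToken) -> List[Token]:
    """Accept drafted tokens until the first mismatch with the target's greedy choice.

    A real system would score every position of the block in a single target
    forward pass; this sketch calls the target once per position for clarity.
    """
    accepted: List[Token] = []
    for tok in block:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the wrong draft token and stop
            return accepted
        accepted.append(tok)
    accepted.append(target_next(prefix + accepted))  # whole block accepted: one extra token
    return accepted


def spec_generate_toy(prompt: List[Token], draft: DraftBlock, target_next: NextToken,
                      block_size: int, max_new_tokens: int) -> List[Token]:
    """Draft-then-verify loop; every step appends at least one target-approved token."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(verify_block(out, draft(out, block_size), target_next))
    return out[: len(prompt) + max_new_tokens]


def greedy_generate(prompt: List[Token], target_next: NextToken, max_new_tokens: int) -> List[Token]:
    """Reference: plain token-by-token greedy decoding with the target only."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


speculative = spec_generate_toy([0], toy_drafter, toy_target, block_size=4, max_new_tokens=12)
reference = greedy_generate([0], toy_target, max_new_tokens=12)
print(speculative == reference)
```

In this toy setting the speculative output is identical to plain greedy decoding with the target, which is the sense in which this kind of verification is lossless. Any speedup comes from accepting several drafted tokens per verification step.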
+## Evaluation
+DFlash achieves up to **6.17$\times$** lossless acceleration for **Qwen3-8B**, making it nearly **2.5$\times$** faster than the state-of-the-art speculative decoding method EAGLE-3. Check out our [GitHub repository](https://github.com/z-lab/dflash) to see how to reproduce the results.
+<div align="center">
+  <img src="assets/speedup.png" alt="DFlash Speedup" width="100%">
+</div>
+## Citation
+If you find DFlash useful for your research or applications, please cite our project. The full paper is coming soon!
+```bibtex
+@article{chen2026dflash,
+  title   = {DFlash: Block Diffusion for Flash Speculative Decoding},
+  author  = {Chen, Jian and Liu, Zhijian},
+  journal = {arXiv preprint},
+  year    = {2026},
+  url     = {https://github.com/z-lab/dflash},
+  note    = {Paper coming soon}
+}
+```

assets/dflash_system.png ADDED

Git LFS Details

SHA256: bea1f82796909c1e4f7261ee3c08af743ec3c25057b83fca918808b76af4a7dc
Pointer size: 131 Bytes
Size of remote file: 338 kB

assets/speedup.png ADDED