| --- |
| license: mit |
| datasets: |
| - maxin-cn/SkyTimelapse |
| - ltzheng/minecraft |
| language: |
| - en |
| base_model: |
| - facebook/DiT-XL-2-512 |
| - facebook/DiT-XL-2-256 |
| --- |
| <p align="center"> |
| <h2 align="center"> GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation </h2> |
| <p align="center"> |
| <a href="https://snehalstomar.github.io/">Snehal Singh Tomar</a> |
| . |
| <a href="https://alexgraikos.github.io/">Alexandros Graikos</a> |
| . |
| <a href="https://www.linkedin.com/in/arjun-krishna-a3573710/">Arjun Krishna</a> |
| . |
| <a href="https://www3.cs.stonybrook.edu/~samaras/">Dimitris Samaras</a> |
| . |
| <a href="https://www3.cs.stonybrook.edu/~mueller/">Klaus Mueller</a> |
| </p> |
| <p align="center"> <strong>Transactions on Machine Learning Research (TMLR) 2026</strong></p> |
| <p align="center"> |
| Stony Brook University |
| </p> |
| <h3 align="center"> |
| |
| [arXiv](https://arxiv.org/abs/2512.21276) |
| [GitHub](https://github.com/snehalstomar/GriDiT) |
| |
| <div align="center"></div> |
| </p> |
|
|
| <p align="center"> |
| <img src="teaser.png" width="100%"> |
| </p> |
| |
| <h5 align="left"> |
| <em>TL;DR:</em> State-of-the-art image sequence generation models treat image sequences as large tensors of ordered frames. |
| In contrast, our method factorizes image sequence generation into two stages. First, we learn to model |
| the dynamics of the sequence at low resolution, treating the frames as subsampled image grids. Second, we |
| learn to super-resolve individual frames at high resolution. By using the DiT’s self-attention mechanism to model |
| dynamics across frames and pairing it with our sampling strategy, our method yields superior synthesis quality |
| for sequences of arbitrary length while significantly reducing sampling time and training data requirements. |
| </h5> |
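
The grid idea in the TL;DR can be illustrated with a minimal NumPy sketch. This is not the GriDiT implementation: the function names, the naive strided subsampling, the row-major tile order, and the grid shape are all assumptions made for illustration. The repository linked on this page is the reference for the actual layout.

```python
import numpy as np


def subsample(frames: np.ndarray, factor: int) -> np.ndarray:
    """Naive strided subsampling of (T, H, W, C) frames.

    Assumption: a stand-in for whatever low-resolution step the method uses.
    """
    return frames[:, ::factor, ::factor, :]


def frames_to_grid(frames: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Tile T = rows * cols frames into one image, in row-major order (assumed)."""
    t, h, w, c = frames.shape
    if t != rows * cols:
        raise ValueError(f"expected {rows * cols} frames, got {t}")
    # (rows, cols, h, w, c) -> (rows, h, cols, w, c) -> (rows * h, cols * w, c)
    tiles = frames.reshape(rows, cols, h, w, c).transpose(0, 2, 1, 3, 4)
    return tiles.reshape(rows * h, cols * w, c)


def grid_to_frames(grid: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Inverse of frames_to_grid: split a grid image back into ordered frames."""
    gh, gw, c = grid.shape
    h, w = gh // rows, gw // cols
    # (rows, h, cols, w, c) -> (rows, cols, h, w, c) -> (rows * cols, h, w, c)
    tiles = grid.reshape(rows, h, cols, w, c).transpose(0, 2, 1, 3, 4)
    return tiles.reshape(rows * cols, h, w, c)
```

For example, 16 frames of size 64x64x3 subsampled by a factor of 2 and tiled 4x4 give a single 128x128x3 image, and `grid_to_frames` recovers the 16 low-resolution frames exactly. In a layout like this, one image-sized input carries the whole sequence, so an image model's self-attention can relate tiles that belong to different frames.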
|
|
| ## Code and Execution Details |
|
|
| Please visit our [GitHub repository](https://github.com/snehalstomar/GriDiT) for the code and execution details. |
|
|
| ## Citation |
|
|
| Please cite our work as: |
|
|
| ``` |
| @article{ |
| tomar2026gridit, |
| title={GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation}, |
| author={Snehal Singh Tomar and Alexandros Graikos and Arjun Krishna and Dimitris Samaras and Klaus Mueller}, |
| journal={Transactions on Machine Learning Research}, |
| issn={2835-8856}, |
| year={2026}, |
| url={https://openreview.net/forum?id=QLD47Ou5lp}, |
| note={} |
| } |
| ``` |