Spatia: Video Generation with Updatable Spatial Memory
Long-horizon, spatially consistent video generation enabled by persistent 3D scene point clouds and dynamic-static disentanglement.
1The University of Sydney
2Microsoft Research
3HKUST
4University of Waterloo
*Equal Contribution
Abstract
Existing video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To overcome this limitation, we propose Spatia, a spatial memory-aware video generation framework that explicitly preserves a 3D scene point cloud as persistent spatial memory.
Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates the memory through visual SLAM. By disentangling the static scene, which persists in the spatial memory, from dynamic content, which is generated anew in each clip, this design enhances spatial consistency throughout the generation process while preserving the model's ability to produce realistic dynamic entities.
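To make the generate-and-update loop concrete, below is a minimal Python sketch of the process described above. Every name in it (`generator`, `slam`, `generate`, `update`, and so on) is a hypothetical placeholder standing in for the corresponding component, not the released Spatia API.

```python
# A minimal sketch of the iterative loop, assuming hypothetical
# `generator` (the video model) and `slam` (a visual-SLAM module) objects.

def generate_long_video(first_frame, prompt, num_clips, generator, slam):
    """Generate clips one by one, each conditioned on persistent spatial memory."""
    # The spatial memory is a 3D point cloud of the static scene,
    # initialized from the first frame.
    memory = slam.initialize(first_frame)
    clips, context = [], first_frame
    for _ in range(num_clips):
        # Condition the video model on the current point cloud so the
        # static scene stays consistent across clips.
        clip = generator.generate(context, prompt, spatial_memory=memory)
        # Fold the new clip back into the memory via visual SLAM; only
        # static structure persists, keeping dynamic entities out of memory.
        memory = slam.update(memory, clip)
        context = clip[-1]  # the last generated frame seeds the next clip
        clips.append(clip)
    return clips
```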
Furthermore, Spatia enables applications such as:
- Explicit Camera Control (see the sketch after this list)
- 3D-Aware Interactive Editing
- Long-horizon Scene Exploration
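As an illustration of explicit camera control, a caller could supply a per-frame camera pose sequence as an extra condition, under which the model renders the remembered static scene from the requested viewpoints. The sketch below builds a toy orbital path with NumPy; the `camera_trajectory` parameter and the pose convention are assumptions for illustration, not Spatia's documented interface.

```python
import numpy as np

def orbit_trajectory(num_frames, radius=3.0):
    """Toy circular camera path: 4x4 world-to-camera extrinsics around the origin."""
    poses = []
    for t in np.linspace(0.0, 2.0 * np.pi, num_frames, endpoint=False):
        c, s = np.cos(t), np.sin(t)
        R = np.array([[c,   0.0, -s],
                      [0.0, 1.0, 0.0],
                      [s,   0.0,  c]])            # rotation about the world y-axis
        p = np.array([radius * s, 0.0, radius * c])  # camera center in world coords
        pose = np.eye(4)
        pose[:3, :3] = R
        pose[:3, 3] = -R @ p                       # standard world-to-camera translation
        poses.append(pose)
    return poses

# Hypothetical usage, reusing the placeholder names from the earlier sketch:
# clip = generator.generate(context, prompt,
#                           spatial_memory=memory,
#                           camera_trajectory=orbit_trajectory(num_frames=81))
```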
Citation
If you find this project useful, please cite the paper.
@inproceedings{zhao2026spatia,
  title     = {Spatia: Video Generation with Updatable Spatial Memory},
  author    = {Zhao, Jinjing and Wei, Fangyun and Liu, Zhening and Zhang, Hongyang and Xu, Chang and Lu, Yan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2026}
}
© 2025 Spatia Project. Licensed under CC BY-SA 4.0.
Model tree for Jinjing713/Spatia
Base model: Wan-AI/Wan2.2-TI2V-5B