---
license: mit
pipeline_tag: image-to-image
library_name: diffusers
---

# ZeroStereo: Zero-shot Stereo Matching from Single Images

This repository hosts the **StereoGen** model, a key component of the ZeroStereo framework. ZeroStereo introduces a novel pipeline for zero-shot stereo matching, capable of synthesizing high-quality right images from arbitrary single images. It achieves this by leveraging pseudo disparities generated by a monocular depth estimation model and fine-tuning a diffusion inpainting model to recover missing details while preserving semantic structure.

## Paper

The model was presented in the paper [ZeroStereo: Zero-shot Stereo Matching from Single Images](https://huggingface.co/papers/2501.08654).

### Abstract

State-of-the-art supervised stereo matching methods have achieved remarkable performance on various benchmarks. However, their generalization to real-world scenarios remains challenging due to the scarcity of annotated real-world stereo data. In this paper, we propose ZeroStereo, a novel stereo image generation pipeline for zero-shot stereo matching. Our approach synthesizes high-quality right images from arbitrary single images by leveraging pseudo disparities generated by a monocular depth estimation model. Unlike previous methods that address occluded regions by filling missing areas with neighboring pixels or random backgrounds, we fine-tune a diffusion inpainting model to recover missing details while preserving semantic structure. Additionally, we propose Training-Free Confidence Generation, which mitigates the impact of unreliable pseudo labels without additional training, and Adaptive Disparity Selection, which ensures a diverse and realistic disparity distribution while preventing excessive occlusion and foreground distortion. Experiments demonstrate that models trained with our pipeline achieve state-of-the-art zero-shot generalization across multiple datasets with only a dataset volume comparable to Scene Flow.
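
To make the occlusion problem concrete: when a left image is forward-warped to the right view using a disparity map, some target pixels receive no source pixel; those holes are exactly what the fine-tuned diffusion inpainting model recovers. The sketch below is purely illustrative (naive per-pixel warping with NumPy, not the actual ZeroStereo implementation; see the GitHub repository for the real pipeline):

```python
import numpy as np

def forward_warp_right(left, disparity):
    """Naively forward-warp a left image to a right view using a per-pixel
    disparity map (in pixels). Pixels that receive no source pixel are
    occlusions; ZeroStereo fills such regions with diffusion inpainting."""
    h, w, _ = left.shape
    right = np.zeros_like(left)
    filled = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            # A point at column x in the left view appears at x - d in the right view.
            xr = x - int(round(disparity[y, x]))
            if 0 <= xr < w:
                right[y, xr] = left[y, x]
                filled[y, xr] = True
    occlusion_mask = ~filled  # regions an inpainting model must recover
    return right, occlusion_mask

# Example: a constant disparity of 2 px shifts content left and leaves
# a 2-pixel-wide occluded band at the right edge.
left = np.random.rand(4, 8, 3)
right, occ = forward_warp_right(left, np.full((4, 8), 2.0))
```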

## Code

The official code, along with detailed instructions for fine-tuning, generation, training, and evaluation, can be found in the [GitHub repository](https://github.com/Windsrain/ZeroStereo).

![ZeroStereo](ZeroStereo.png)

## Pre-Trained Models

The following pre-trained models related to this project are available:

| Model | Link |
| :-: | :-: |
| SDv2I | [Download 🤗](https://huggingface.co/stabilityai/stable-diffusion-2-inpainting/tree/main) |
| StereoGen | [Download 🤗](https://huggingface.co/Windsrain/ZeroStereo/tree/main/StereoGen) |
| Zero-RAFT-Stereo | [Download 🤗](https://huggingface.co/Windsrain/ZeroStereo/tree/main/Zero-RAFT-Stereo) |
| Zero-IGEV-Stereo | [Download 🤗](https://huggingface.co/Windsrain/ZeroStereo/tree/main/Zero-IGEV-Stereo) |

## Usage

You can load the StereoGen model using the `diffusers` library. Note that full inference requires additional pre-processing inputs (e.g., the left image, a depth map, and an occlusion mask); refer to the official GitHub repository for the complete multi-step pipeline.

First, ensure you have the `diffusers` library and its dependencies installed:
```bash
pip install diffusers transformers accelerate torch
```

Here's a basic example to load the pipeline:

```python
import torch
from diffusers import DiffusionPipeline
from PIL import Image

# Load the StereoGen pipeline.
# This model synthesizes a right stereo image from a single left input image.
pipeline = DiffusionPipeline.from_pretrained("Windsrain/ZeroStereo", torch_dtype=torch.float16)

# Move pipeline to GPU if available
if torch.cuda.is_available():
    pipeline.to("cuda")

# Example placeholder for input image.
# Replace with your actual left input image.
# For full usage (e.g., generating required depth maps or masks),
# please refer to the project's GitHub repository.
# input_image = Image.open("path/to/your/left_image.png").convert("RGB")
input_image = Image.new("RGB", (512, 512), color="blue")  # Dummy image for demonstration

print("Model loaded successfully. For detailed inference and generation scripts,")
print("refer to the official GitHub repository: https://github.com/Windsrain/ZeroStereo")

# The actual inference call to generate a stereo image might require specific inputs
# (e.g., `image`, `depth_map`, `mask_image`) depending on the pipeline's internal
# implementation as shown in the project's GitHub demo/generation scripts.
# Example inference might look like:
# generated_right_image = pipeline(image=input_image, depth_map=some_depth, mask_image=some_mask).images[0]
# generated_right_image.save("generated_stereo_right.png")
```

## Acknowledgement

This project is based on [MfS-Stereo](https://github.com/nianticlabs/stereo-from-mono), [Depth Anything V2](https://github.com/DepthAnything/Depth-Anything-V2), [Marigold](https://github.com/prs-eth/Marigold), [RAFT-Stereo](https://github.com/princeton-vl/RAFT-Stereo), and [IGEV-Stereo](https://github.com/gangweix/IGEV). We thank the original authors for their excellent works.

## Citation

If you find this work helpful, please cite the paper:

```bibtex
@article{wang2025zerostereo,
  title={ZeroStereo: Zero-shot Stereo Matching from Single Images},
  author={Wang, Xianqi and Yang, Hao and Xu, Gangwei and Cheng, Junda and Lin, Min and Deng, Yong and Zang, Jinliang and Chen, Yurui and Yang, Xin},
  journal={arXiv preprint arXiv:2501.08654},
  year={2025},
}
```