tags:
- on-device
---

<div align="center">

# TripoSR iOS (ONNX)

**Single image → 3D mesh, on your iPhone.**

The ONNX-converted encoder from [TripoSR](https://github.com/VAST-AI-Research/TripoSR) by **Stability AI** × **Tripo AI**, optimized for on-device inference.

</div>

---

## Demo

<table>
<tr>
<td align="center"><b>Input Photo</b></td>
<td align="center"><b>3D Output</b></td>
</tr>
<tr>
<td align="center"><img src="assets/input_photo.jpeg" width="300" /></td>
<td align="center"><video src="https://huggingface.co/jc-builds/triposr-ios/resolve/main/assets/3d_output.mp4" controls autoplay loop width="300"></video></td>
</tr>
</table>

> Single photo of a dog on the beach → fully reconstructed 3D mesh, running on-device via ONNX Runtime.

---

## Model Overview

| Property | Value |
|:--|:--|
| **Model Size** | ~1.6 GB |
| **Parameters** | 419M |
| **Input** | RGB Image `(1, 3, 512, 512)` |
| **Output** | Triplane Scene Codes `(1, 3, 40, 64, 64)` |
| **Inference** | < 0.5 s (A100) |
| **ONNX Opset** | 18 |
| **Format** | ONNX + external weights |
| **License** | MIT |

---

## Architecture

TripoSR uses a feedforward transformer pipeline: no diffusion, no iterative denoising. One forward pass from image to 3D.

```mermaid
graph LR
    A["Input Image<br/>(512×512 RGB)"] --> B["DINO ViT-B/16<br/>Image Tokenizer"]
    B --> C["Transformer Decoder<br/>+ Cross-Attention"]
    C --> D["Post Processor<br/>Triplane Features"]
    D --> E["Marching Cubes<br/>3D Mesh Output"]

    style A fill:#4a9eff,stroke:#333,color:#fff
    style B fill:#7c3aed,stroke:#333,color:#fff
    style C fill:#7c3aed,stroke:#333,color:#fff
    style D fill:#7c3aed,stroke:#333,color:#fff
    style E fill:#10b981,stroke:#333,color:#fff
```

### Component Breakdown

```mermaid
pie title Parameter Distribution (419M total)
    "DINO ViT-B/16 (Image Encoder)" : 86
    "Transformer Decoder" : 268
    "Triplane Post-Processor" : 65
```

| Component | Role | Details |
|:--|:--|:--|
| **Image Tokenizer** | Feature extraction | DINO ViT-B/16 pretrained vision transformer |
| **Backbone** | Scene understanding | Transformer decoder with cross-attention to image tokens |
| **Post Processor** | 3D representation | Converts transformer tokens → triplane features `(3×40×64×64)` |

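For intuition on how a downstream decoder reads the triplane, here is a minimal NumPy sketch of the standard triplane lookup: project a 3D point onto the XY, XZ, and YZ planes, bilinearly sample each 40-channel feature map, and combine. The plane ordering and the summation are assumptions for illustration, not the exact TripoSR decoder.

```python
import numpy as np

def bilinear_sample(plane, u, v):
    """Bilinearly sample a (C, H, W) feature map at normalized coords (u, v) in [0, 1]."""
    C, H, W = plane.shape
    x, y = u * (W - 1), v * (H - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * plane[:, y0, x0] + wx * (1 - wy) * plane[:, y0, x1]
            + (1 - wx) * wy * plane[:, y1, x0] + wx * wy * plane[:, y1, x1])

def query_triplane(scene_codes, point):
    """Look up one 3D point (coords in [0, 1]) in triplane scene codes (1, 3, 40, 64, 64)."""
    planes = scene_codes[0]                      # assumed plane order: XY, XZ, YZ
    x, y, z = point
    return (bilinear_sample(planes[0], x, y)
            + bilinear_sample(planes[1], x, z)
            + bilinear_sample(planes[2], y, z))  # (40,) feature, fed to a small MLP in practice

scene_codes = np.random.rand(1, 3, 40, 64, 64).astype(np.float32)
feat = query_triplane(scene_codes, (0.5, 0.25, 0.75))
print(feat.shape)  # (40,)
```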
---
## Benchmarks

Evaluated on [GSO](https://goo.gl/datasets/GoogleScannedObjects) and [OmniObject3D](https://omniobject3d.github.io/) datasets. Performance from the [TripoSR paper](https://arxiv.org/abs/2403.02151).

### F-Score @ 0.1 (higher is better)



### Chamfer Distance (lower is better)




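For reference on what the benchmark numbers mean: Chamfer distance (CD) averages nearest-neighbor distances between predicted and ground-truth point sets in both directions, and F-score at threshold τ (FS@τ) is the harmonic mean of precision and recall at that distance. A simplified O(N·M) NumPy sketch of one common definition, not the paper's exact evaluation protocol:

```python
import numpy as np

def nearest_dists(a, b):
    """For each point in a (N, 3), distance to its nearest neighbor in b (M, 3)."""
    diff = a[:, None, :] - b[None, :, :]          # (N, M, 3) pairwise differences
    return np.sqrt((diff ** 2).sum(-1)).min(axis=1)

def chamfer_distance(pred, gt):
    """Symmetric sum of mean nearest-neighbor distances (lower is better)."""
    return nearest_dists(pred, gt).mean() + nearest_dists(gt, pred).mean()

def f_score(pred, gt, tau=0.1):
    """Harmonic mean of precision and recall at threshold tau (higher is better)."""
    precision = (nearest_dists(pred, gt) < tau).mean()
    recall = (nearest_dists(gt, pred) < tau).mean()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gt = np.random.rand(2048, 3)
pred = gt + np.random.normal(scale=0.02, size=gt.shape)
print(f"CD={chamfer_distance(pred, gt):.3f}  FS@0.1={f_score(pred, gt, 0.1):.3f}")
```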
### Full Results

<details>
<summary><b>Full Results Table</b></summary>

<br/>

**GSO Dataset**

| Method | CD ↓ | FS@0.1 ↑ | FS@0.2 ↑ | FS@0.5 ↑ |
|:--|:--|:--|:--|:--|
| One-2-3-45 | 0.227 | 0.382 | 0.630 | 0.878 |
| OpenLRM | 0.180 | 0.430 | 0.698 | 0.938 |
| ZeroShape | 0.160 | 0.489 | 0.757 | 0.952 |
| TGS | 0.122 | 0.637 | 0.846 | 0.968 |
| **TripoSR** | **0.111** | **0.651** | **0.871** | **0.980** |

**OmniObject3D Dataset**

| Method | CD ↓ | FS@0.1 ↑ | FS@0.2 ↑ | FS@0.5 ↑ |
|:--|:--|:--|:--|:--|

</details>

> *Metrics sourced from [TripoSR: Fast 3D Object Reconstruction from a Single Image](https://arxiv.org/abs/2403.02151) (Tochilkin et al., 2024)*

---

## PyTorch vs. This Model

| | Original | This Conversion |
|:--|:--|:--|
| **Format** | PyTorch | ONNX |
| **Size** | ~3 GB+ | 1.6 GB |
| **Runs on** | GPU server | iPhone / iPad / Mac |
| **Dependencies** | torch, einops, transformers | onnxruntime only |
| **Deployment** | Cloud API | On-device, offline capable |

---
## Inference Pipeline

```mermaid
sequenceDiagram
    participant User
    participant App
    participant Encoder as TripoSR Encoder (ONNX)
    participant Decoder as Mesh Decoder

    User->>App: Capture / Select Photo
    App->>App: Preprocess (resize 512×512, normalize)
    App->>Encoder: Input tensor (1, 3, 512, 512)
    Encoder->>Encoder: DINO tokenize → Transformer decode → Triplane
    Encoder-->>App: Scene codes (1, 3, 40, 64, 64)
    App->>Decoder: Triplane features
    Decoder->>Decoder: Marching cubes extraction
    Decoder-->>App: 3D Mesh (vertices + faces)
    App-->>User: Display 3D model
```

---
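Once the mesh decoder returns vertices and faces, displaying or sharing the result usually means serializing it. A minimal illustrative sketch, writing a plain Wavefront OBJ file; this is a generic exporter, not part of this repository:

```python
import numpy as np

def write_obj(path, vertices, faces):
    """Write a triangle mesh to Wavefront OBJ (faces are 0-based index triples)."""
    with open(path, "w") as f:
        for x, y, z in vertices:
            f.write(f"v {x} {y} {z}\n")
        for a, b, c in faces:
            f.write(f"f {a + 1} {b + 1} {c + 1}\n")  # OBJ indices are 1-based

# Toy example: a single triangle
vertices = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
faces = np.array([[0, 1, 2]])
write_obj("mesh.obj", vertices, faces)
```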

## Files

| File | Size | Description |
|:--|:--|:--|
| `triposr_encoder.onnx` | 2.6 MB | Model graph |
| `triposr_encoder.onnx.data` | 1.6 GB | Weights |

---
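Because the weights are stored as external data, `triposr_encoder.onnx.data` must sit next to `triposr_encoder.onnx` when the session is created, or loading fails. A small sketch of a pre-flight check (the helper name is hypothetical; the filenames follow the convention above):

```python
from pathlib import Path

def check_external_data(model_path):
    """Verify the ONNX graph and its external-weights sidecar are both present."""
    model = Path(model_path)
    data = Path(str(model) + ".data")   # e.g. triposr_encoder.onnx.data
    missing = [p.name for p in (model, data) if not p.exists()]
    if missing:
        raise FileNotFoundError(f"missing model files: {', '.join(missing)}")
    return model, data

# Usage: run before constructing the InferenceSession
# model, data = check_external_data("triposr_encoder.onnx")
```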

## Quick Start

<details open>
<summary><b>Python</b></summary>

```python
import onnxruntime as ort
import numpy as np
from PIL import Image

# Load the model (expects triposr_encoder.onnx.data alongside the graph)
session = ort.InferenceSession(
    "triposr_encoder.onnx",
    providers=["CPUExecutionProvider"]  # or 'CoreMLExecutionProvider'
)

# Preprocess: 512x512 RGB, scaled to [0, 1], NCHW layout
image = Image.open("your_image.png").convert("RGB").resize((512, 512))
input_array = np.array(image).astype(np.float32) / 255.0
input_array = input_array.transpose(2, 0, 1)[np.newaxis, ...]

# Run inference → triplane scene codes
scene_codes = session.run(None, {"input_image": input_array})[0]
# scene_codes.shape == (1, 3, 40, 64, 64)
```

</details>

<details>
<summary><b>Swift (iOS)</b></summary>

```swift
import OnnxRuntimeBindings

let outputs = try session.run(
)
```

</details>

---

## Citation

```bibtex
@article{TripoSR2024,
  title={TripoSR: Fast 3D Object Reconstruction from a Single Image},
  author={Tochilkin, Dmitry and Pankratz, David and Liu, Zexiang and Huang, Zixuan
          and Letts, Adam and Li, Yangguang and Liang, Ding and Laforte, Christian
          and Jampani, Varun and Cao, Yan-Pei},
  journal={arXiv preprint arXiv:2403.02151},
  year={2024}
}
```

---

<div align="center">
<sub>MIT License (same as original TripoSR) • Based on <a href="https://github.com/VAST-AI-Research/TripoSR">TripoSR</a> by Stability AI × Tripo AI</sub>
</div>