jc-builds committed
Commit cc14274 · verified · 1 Parent(s): 58c44e3

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +82 -135
README.md CHANGED
@@ -12,15 +12,31 @@ tags:
  - on-device
  ---

  # TripoSR iOS (ONNX)

- > **Single image → 3D mesh, on your iPhone.**

- The ONNX-converted encoder from [TripoSR](https://github.com/VAST-AI-Research/TripoSR) — a fast feedforward 3D reconstruction model from **Stability AI** and **Tripo AI**, optimized for on-device inference.

  ---

- ## Demo — Photo to 3D

  <table>
  <tr>
@@ -28,73 +44,22 @@ The ONNX-converted encoder from [TripoSR](https://github.com/VAST-AI-Research/Tr
  <td align="center"><b>3D Output</b></td>
  </tr>
  <tr>
- <td><img src="assets/input_photo.jpeg" width="300" /></td>
- <td><video src="https://huggingface.co/jc-builds/triposr-ios/resolve/main/assets/3d_output.mp4" controls autoplay loop width="300"></video></td>
  </tr>
  </table>

- > Single photo of a dog on the beach &rarr; fully reconstructed 3D mesh, running on-device via ONNX Runtime.
-
- ---
-
- ## Model Overview
-
- | Property | Value |
- |:--|:--|
- | **Model Size** | ~1.6 GB |
- | **Parameters** | 419M |
- | **Input** | RGB Image `(1, 3, 512, 512)` |
- | **Output** | Triplane Scene Codes `(1, 3, 40, 64, 64)` |
- | **ONNX Opset** | 18 |
- | **Format** | ONNX + external weights |
- | **License** | MIT |
-
- ---
-
- ## Architecture
-
- TripoSR uses a feedforward transformer pipeline — no diffusion, no iterative denoising. One forward pass from image to 3D.
-
- ```mermaid
- graph LR
- A["📷 Input Image<br/>(512×512 RGB)"] --> B["🔍 DINO ViT-B/16<br/>Image Tokenizer"]
- B --> C["🧠 Transformer<br/>Decoder + Cross-Attention"]
- C --> D["📐 Post Processor<br/>Triplane Features"]
- D --> E["🧊 Marching Cubes<br/>3D Mesh Output"]
-
- style A fill:#4a9eff,stroke:#333,color:#fff
- style B fill:#7c3aed,stroke:#333,color:#fff
- style C fill:#7c3aed,stroke:#333,color:#fff
- style D fill:#7c3aed,stroke:#333,color:#fff
- style E fill:#10b981,stroke:#333,color:#fff
- ```
-
- ### Component Breakdown
-
- ```mermaid
- pie title Parameter Distribution (419M total)
- "DINO ViT-B/16 (Image Encoder)" : 86
- "Transformer Decoder" : 268
- "Triplane Post-Processor" : 65
- ```
-
- | Component | Role | Details |
- |:--|:--|:--|
- | **Image Tokenizer** | Feature extraction | DINO ViT-B/16 pretrained vision transformer |
- | **Backbone** | Scene understanding | Transformer decoder with cross-attention to image tokens |
- | **Post Processor** | 3D representation | Converts transformer tokens → triplane features `(3×40×64×64)` |
-
  ---

  ## Benchmarks

- Performance from the [TripoSR paper](https://arxiv.org/abs/2403.02151), evaluated on the GSO and OmniObject3D datasets.

- ### F-Score @ 0.1 (higher is better)

  ![F-Score Comparison](assets/chart_fscore.png)

- ### Chamfer Distance (lower is better)

  ![Chamfer Distance Comparison](assets/chart_chamfer.png)

@@ -106,23 +71,22 @@ Performance from the [TripoSR paper](https://arxiv.org/abs/2403.02151), evaluate

  ![Grouped Comparison](assets/chart_grouped.png)

- ### Full Results
-
  <details>
- <summary><b>GSO Dataset</b></summary>

  | Method | CD ↓ | FS@0.1 ↑ | FS@0.2 ↑ | FS@0.5 ↑ |
  |:--|:--|:--|:--|:--|
  | One-2-3-45 | 0.227 | 0.382 | 0.630 | 0.878 |
- | ZeroShape | 0.160 | 0.489 | 0.757 | 0.952 |
  | OpenLRM | 0.180 | 0.430 | 0.698 | 0.938 |
  | TGS | 0.122 | 0.637 | 0.846 | 0.968 |
  | **TripoSR** | **0.111** | **0.651** | **0.871** | **0.980** |

- </details>
-
- <details>
- <summary><b>OmniObject3D Dataset</b></summary>

  | Method | CD ↓ | FS@0.1 ↑ | FS@0.2 ↑ | FS@0.5 ↑ |
  |:--|:--|:--|:--|:--|
@@ -134,101 +98,73 @@ Performance from the [TripoSR paper](https://arxiv.org/abs/2403.02151), evaluate

  </details>

- > *Metrics sourced from [TripoSR: Fast 3D Object Reconstruction from a Single Image](https://arxiv.org/abs/2403.02151) (Tochilkin et al., 2024)*
-
  ---

- ## Why ONNX?

  ```mermaid
- graph TD
- subgraph "Original TripoSR"
- A1["PyTorch"] --> A2["GPU Server Required"]
- A2 --> A3["~3GB+ VRAM"]
- end
-
- subgraph "This Conversion"
- B1["ONNX"] --> B2["Runs on iPhone"]
- B2 --> B3["CoreML / CPU"]
- end
-
- style A1 fill:#ef4444,stroke:#333,color:#fff
- style A2 fill:#ef4444,stroke:#333,color:#fff
- style A3 fill:#ef4444,stroke:#333,color:#fff
- style B1 fill:#10b981,stroke:#333,color:#fff
- style B2 fill:#10b981,stroke:#333,color:#fff
- style B3 fill:#10b981,stroke:#333,color:#fff
  ```

- | | Original (PyTorch) | This Model (ONNX) |
  |:--|:--|:--|
- | **Runtime** | PyTorch + CUDA | ONNX Runtime |
- | **Platform** | Server / Desktop GPU | iPhone, iPad, Mac, any ONNX runtime |
- | **Size** | ~3 GB+ | ~1.6 GB |
- | **Dependencies** | torch, einops, transformers | onnxruntime only |
- | **Deployment** | Cloud API | On-device, offline capable |
-
- ---
-
- ## Inference Pipeline
-
- ```mermaid
- sequenceDiagram
- participant User
- participant App
- participant Encoder as TripoSR Encoder (ONNX)
- participant Decoder as Mesh Decoder
-
- User->>App: Capture / Select Photo
- App->>App: Preprocess (resize 512×512, normalize)
- App->>Encoder: Input tensor (1, 3, 512, 512)
- Encoder->>Encoder: DINO tokenize → Transformer decode → Triplane
- Encoder-->>App: Scene codes (1, 3, 40, 64, 64)
- App->>Decoder: Triplane features
- Decoder->>Decoder: Marching cubes extraction
- Decoder-->>App: 3D Mesh (vertices + faces)
- App-->>User: Display 3D model
- ```

  ---

- ## Files

- | File | Size | Description |
  |:--|:--|:--|
- | `triposr_encoder.onnx` | 2.6 MB | ONNX model graph |
- | `triposr_encoder.onnx.data` | 1.6 GB | External model weights |

  ---

- ## Usage

- ### Python (ONNX Runtime)

  ```python
  import onnxruntime as ort
  import numpy as np
  from PIL import Image

- # Load the model
  session = ort.InferenceSession(
  "triposr_encoder.onnx",
- providers=['CPUExecutionProvider'] # or 'CoreMLExecutionProvider' for iOS
  )

- # Preprocess image
- image = Image.open("your_image.png").convert("RGB").resize((512, 512))
  input_array = np.array(image).astype(np.float32) / 255.0
  input_array = input_array.transpose(2, 0, 1)[np.newaxis, ...]

- # Run inference
  scene_codes = session.run(None, {"input_image": input_array})[0]
- print(f"Scene codes shape: {scene_codes.shape}") # (1, 3, 40, 64, 64)
  ```

- ### iOS (Swift + ONNX Runtime)

- Add ONNX Runtime via SPM, then:

  ```swift
  import OnnxRuntimeBindings
@@ -247,6 +183,17 @@ let outputs = try session.run(
  )
  ```

  ---

  ## Citation
@@ -254,14 +201,14 @@ let outputs = try session.run(
  ```bibtex
  @article{TripoSR2024,
  title={TripoSR: Fast 3D Object Reconstruction from a Single Image},
- author={Tochilkin, Dmitry and Pankratz, David and Liu, Zexiang and Huang, Zixuan and Letts, Adam and Li, Yangguang and Liang, Ding and Laforte, Christian and Jampani, Varun and Cao, Yan-Pei},
  journal={arXiv preprint arXiv:2403.02151},
  year={2024}
  }
  ```

- ---
-
- ## License
-
- MIT License (same as original TripoSR)
  - on-device
  ---

+ <div align="center">
+
  # TripoSR iOS (ONNX)

+ **Single image &rarr; 3D mesh, on your iPhone.**
+
+ The ONNX-converted encoder from [TripoSR](https://github.com/VAST-AI-Research/TripoSR) by **Stability AI** &times; **Tripo AI** — optimized for on-device inference.
+
+ <br/>
+
+ <table>
+ <tr>
+ <td align="center"><b>419M</b><br/><sub>Parameters</sub></td>
+ <td align="center"><b>1.6 GB</b><br/><sub>Model Size</sub></td>
+ <td align="center"><b>&lt; 0.5s</b><br/><sub>Inference (A100)</sub></td>
+ <td align="center"><b>ONNX</b><br/><sub>Format</sub></td>
+ <td align="center"><b>MIT</b><br/><sub>License</sub></td>
+ </tr>
+ </table>

+ </div>

  ---

+ ## Demo

  <table>
  <tr>

  <td align="center"><b>3D Output</b></td>
  </tr>
  <tr>
+ <td align="center"><img src="assets/input_photo.jpeg" width="300" /></td>
+ <td align="center"><video src="https://huggingface.co/jc-builds/triposr-ios/resolve/main/assets/3d_output.mp4" controls autoplay loop width="300"></video></td>
  </tr>
  </table>

  ---

  ## Benchmarks

+ Evaluated on [GSO](https://goo.gl/datasets/GoogleScannedObjects) and [OmniObject3D](https://omniobject3d.github.io/) datasets. Results from the [TripoSR paper](https://arxiv.org/abs/2403.02151).

+ ### F-Score @ 0.1 &nbsp;(higher is better)

  ![F-Score Comparison](assets/chart_fscore.png)

+ ### Chamfer Distance &nbsp;(lower is better)

  ![Chamfer Distance Comparison](assets/chart_chamfer.png)

  ![Grouped Comparison](assets/chart_grouped.png)

  <details>
+ <summary><b>Full Results Table</b></summary>
+
+ <br/>
+
+ **GSO Dataset**

  | Method | CD ↓ | FS@0.1 ↑ | FS@0.2 ↑ | FS@0.5 ↑ |
  |:--|:--|:--|:--|:--|
  | One-2-3-45 | 0.227 | 0.382 | 0.630 | 0.878 |
  | OpenLRM | 0.180 | 0.430 | 0.698 | 0.938 |
+ | ZeroShape | 0.160 | 0.489 | 0.757 | 0.952 |
  | TGS | 0.122 | 0.637 | 0.846 | 0.968 |
  | **TripoSR** | **0.111** | **0.651** | **0.871** | **0.980** |

+ **OmniObject3D Dataset**

  | Method | CD ↓ | FS@0.1 ↑ | FS@0.2 ↑ | FS@0.5 ↑ |
  |:--|:--|:--|:--|:--|

  </details>

  ---

+ ## Architecture
+
+ One forward pass — no diffusion, no iterative denoising.

  ```mermaid
+ graph LR
+ A["Input Image<br/>(512x512)"] --> B["DINO ViT-B/16<br/>Image Tokenizer"]
+ B --> C["Transformer Decoder<br/>+ Cross-Attention"]
+ C --> D["Post Processor<br/>Triplane Features"]
+ D --> E["Marching Cubes<br/>3D Mesh"]
+
+ style A fill:#4a9eff,stroke:#30363d,color:#fff
+ style B fill:#7c3aed,stroke:#30363d,color:#fff
+ style C fill:#7c3aed,stroke:#30363d,color:#fff
+ style D fill:#7c3aed,stroke:#30363d,color:#fff
+ style E fill:#3fb950,stroke:#30363d,color:#fff
  ```

+ | Component | Parameters | Role |
  |:--|:--|:--|
+ | **DINO ViT-B/16** | ~86M | Pretrained image encoder |
+ | **Transformer Decoder** | ~268M | Cross-attention to image tokens |
+ | **Triplane Post-Processor** | ~65M | Tokens &rarr; triplane features `(3x40x64x64)` |

  ---

+ ## PyTorch vs. This Model

+ | | Original | This Conversion |
  |:--|:--|:--|
+ | **Format** | PyTorch | ONNX |
+ | **Size** | ~3 GB+ | 1.6 GB |
+ | **Runs on** | GPU server | iPhone / iPad / Mac |
+ | **Dependencies** | torch, einops, transformers | onnxruntime |
+ | **Connectivity** | Cloud API | Fully offline |

  ---

+ ## Quick Start

+ <details open>
+ <summary><b>Python</b></summary>

  ```python
  import onnxruntime as ort
  import numpy as np
  from PIL import Image

  session = ort.InferenceSession(
  "triposr_encoder.onnx",
+ providers=['CPUExecutionProvider'] # or 'CoreMLExecutionProvider'
  )

+ image = Image.open("photo.png").convert("RGB").resize((512, 512))
  input_array = np.array(image).astype(np.float32) / 255.0
  input_array = input_array.transpose(2, 0, 1)[np.newaxis, ...]

  scene_codes = session.run(None, {"input_image": input_array})[0]
+ # scene_codes.shape == (1, 3, 40, 64, 64)
  ```

+ </details>

+ <details>
+ <summary><b>Swift (iOS)</b></summary>

  ```swift
  import OnnxRuntimeBindings

  )
  ```

+ </details>
+
+ ---
+
+ ## Files
+
+ | File | Size | Description |
+ |:--|:--|:--|
+ | `triposr_encoder.onnx` | 2.6 MB | Model graph |
+ | `triposr_encoder.onnx.data` | 1.6 GB | Weights |
+
  ---

  ## Citation

  ```bibtex
  @article{TripoSR2024,
  title={TripoSR: Fast 3D Object Reconstruction from a Single Image},
+ author={Tochilkin, Dmitry and Pankratz, David and Liu, Zexiang and Huang, Zixuan
+ and Letts, Adam and Li, Yangguang and Liang, Ding and Laforte, Christian
+ and Jampani, Varun and Cao, Yan-Pei},
  journal={arXiv preprint arXiv:2403.02151},
  year={2024}
  }
  ```

+ <div align="center">
+ <sub>MIT License &bull; Based on <a href="https://github.com/VAST-AI-Research/TripoSR">TripoSR</a> by Stability AI &times; Tripo AI</sub>
+ </div>
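
Both versions of the README's Quick Start normalize the image to `[0, 1]` and reorder it HWC → NCHW before calling the encoder. That step can be checked in isolation, without onnxruntime or PIL; the `preprocess` helper and the synthetic image below are illustrative, not part of the repo:

```python
import numpy as np

def preprocess(img_uint8: np.ndarray) -> np.ndarray:
    """Turn an HxWx3 uint8 image (already resized to 512x512) into the
    (1, 3, 512, 512) float32 tensor the README feeds to the encoder."""
    x = img_uint8.astype(np.float32) / 255.0   # scale pixel values to [0, 1]
    x = x.transpose(2, 0, 1)[np.newaxis, ...]  # HWC -> CHW, then add batch dim
    return x

# Synthetic stand-in for a real photo.
dummy = np.zeros((512, 512, 3), dtype=np.uint8)
print(preprocess(dummy).shape)  # (1, 3, 512, 512)
```

The resulting array is what the diff binds to the `"input_image"` input of `session.run`.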