jc-builds committed
Commit cc14274 · verified · 1 Parent(s): 58c44e3

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +82 -135
README.md CHANGED
@@ -12,15 +12,31 @@ tags:
  - on-device
  ---

  # TripoSR iOS (ONNX)

- > **Single image → 3D mesh, on your iPhone.**

- The ONNX-converted encoder from [TripoSR](https://github.com/VAST-AI-Research/TripoSR) — a fast feedforward 3D reconstruction model from **Stability AI** and **Tripo AI**, optimized for on-device inference.

  ---

- ## Demo — Photo to 3D

  <table>
  <tr>
@@ -28,73 +44,22 @@ The ONNX-converted encoder from [TripoSR](https://github.com/VAST-AI-Research/Tr
  <td align="center"><b>3D Output</b></td>
  </tr>
  <tr>
- <td><img src="assets/input_photo.jpeg" width="300" /></td>
- <td><video src="https://huggingface.co/jc-builds/triposr-ios/resolve/main/assets/3d_output.mp4" controls autoplay loop width="300"></video></td>
  </tr>
  </table>

- > Single photo of a dog on the beach &rarr; fully reconstructed 3D mesh, running on-device via ONNX Runtime.
-
- ---
-
- ## Model Overview
-
- | Property | Value |
- |:--|:--|
- | **Model Size** | ~1.6 GB |
- | **Parameters** | 419M |
- | **Input** | RGB Image `(1, 3, 512, 512)` |
- | **Output** | Triplane Scene Codes `(1, 3, 40, 64, 64)` |
- | **ONNX Opset** | 18 |
- | **Format** | ONNX + external weights |
- | **License** | MIT |
-
- ---
-
- ## Architecture
-
- TripoSR uses a feedforward transformer pipeline — no diffusion, no iterative denoising. One forward pass from image to 3D.
-
- ```mermaid
- graph LR
- A["📷 Input Image<br/>(512×512 RGB)"] --> B["🔍 DINO ViT-B/16<br/>Image Tokenizer"]
- B --> C["🧠 Transformer<br/>Decoder + Cross-Attention"]
- C --> D["📐 Post Processor<br/>Triplane Features"]
- D --> E["🧊 Marching Cubes<br/>3D Mesh Output"]
-
- style A fill:#4a9eff,stroke:#333,color:#fff
- style B fill:#7c3aed,stroke:#333,color:#fff
- style C fill:#7c3aed,stroke:#333,color:#fff
- style D fill:#7c3aed,stroke:#333,color:#fff
- style E fill:#10b981,stroke:#333,color:#fff
- ```
-
- ### Component Breakdown
-
- ```mermaid
- pie title Parameter Distribution (419M total)
- "DINO ViT-B/16 (Image Encoder)" : 86
- "Transformer Decoder" : 268
- "Triplane Post-Processor" : 65
- ```
-
- | Component | Role | Details |
- |:--|:--|:--|
- | **Image Tokenizer** | Feature extraction | DINO ViT-B/16 pretrained vision transformer |
- | **Backbone** | Scene understanding | Transformer decoder with cross-attention to image tokens |
- | **Post Processor** | 3D representation | Converts transformer tokens → triplane features `(3×40×64×64)` |
-
  ---

  ## Benchmarks

- Performance from the [TripoSR paper](https://arxiv.org/abs/2403.02151), evaluated on the GSO and OmniObject3D datasets.

- ### F-Score @ 0.1 (higher is better)

  ![F-Score Comparison](assets/chart_fscore.png)

- ### Chamfer Distance (lower is better)

  ![Chamfer Distance Comparison](assets/chart_chamfer.png)

@@ -106,23 +71,22 @@ Performance from the [TripoSR paper](https://arxiv.org/abs/2403.02151), evaluate

  ![Grouped Comparison](assets/chart_grouped.png)

- ### Full Results
-
  <details>
- <summary><b>GSO Dataset</b></summary>

  | Method | CD ↓ | FS@0.1 ↑ | FS@0.2 ↑ | FS@0.5 ↑ |
  |:--|:--|:--|:--|:--|
  | One-2-3-45 | 0.227 | 0.382 | 0.630 | 0.878 |
- | ZeroShape | 0.160 | 0.489 | 0.757 | 0.952 |
  | OpenLRM | 0.180 | 0.430 | 0.698 | 0.938 |
  | TGS | 0.122 | 0.637 | 0.846 | 0.968 |
  | **TripoSR** | **0.111** | **0.651** | **0.871** | **0.980** |

- </details>
-
- <details>
- <summary><b>OmniObject3D Dataset</b></summary>

  | Method | CD ↓ | FS@0.1 ↑ | FS@0.2 ↑ | FS@0.5 ↑ |
  |:--|:--|:--|:--|:--|
@@ -134,101 +98,73 @@ Performance from the [TripoSR paper](https://arxiv.org/abs/2403.02151), evaluate

  </details>

- > *Metrics sourced from [TripoSR: Fast 3D Object Reconstruction from a Single Image](https://arxiv.org/abs/2403.02151) (Tochilkin et al., 2024)*
-
  ---

- ## Why ONNX?

  ```mermaid
- graph TD
- subgraph "Original TripoSR"
- A1["PyTorch"] --> A2["GPU Server Required"]
- A2 --> A3["~3GB+ VRAM"]
- end
-
- subgraph "This Conversion"
- B1["ONNX"] --> B2["Runs on iPhone"]
- B2 --> B3["CoreML / CPU"]
- end
-
- style A1 fill:#ef4444,stroke:#333,color:#fff
- style A2 fill:#ef4444,stroke:#333,color:#fff
- style A3 fill:#ef4444,stroke:#333,color:#fff
- style B1 fill:#10b981,stroke:#333,color:#fff
- style B2 fill:#10b981,stroke:#333,color:#fff
- style B3 fill:#10b981,stroke:#333,color:#fff
  ```

- | | Original (PyTorch) | This Model (ONNX) |
  |:--|:--|:--|
- | **Runtime** | PyTorch + CUDA | ONNX Runtime |
- | **Platform** | Server / Desktop GPU | iPhone, iPad, Mac, any ONNX runtime |
- | **Size** | ~3 GB+ | ~1.6 GB |
- | **Dependencies** | torch, einops, transformers | onnxruntime only |
- | **Deployment** | Cloud API | On-device, offline capable |
-
- ---
-
- ## Inference Pipeline
-
- ```mermaid
- sequenceDiagram
- participant User
- participant App
- participant Encoder as TripoSR Encoder (ONNX)
- participant Decoder as Mesh Decoder
-
- User->>App: Capture / Select Photo
- App->>App: Preprocess (resize 512×512, normalize)
- App->>Encoder: Input tensor (1, 3, 512, 512)
- Encoder->>Encoder: DINO tokenize → Transformer decode → Triplane
- Encoder-->>App: Scene codes (1, 3, 40, 64, 64)
- App->>Decoder: Triplane features
- Decoder->>Decoder: Marching cubes extraction
- Decoder-->>App: 3D Mesh (vertices + faces)
- App-->>User: Display 3D model
- ```

  ---

- ## Files

- | File | Size | Description |
  |:--|:--|:--|
- | `triposr_encoder.onnx` | 2.6 MB | ONNX model graph |
- | `triposr_encoder.onnx.data` | 1.6 GB | External model weights |

  ---

- ## Usage

- ### Python (ONNX Runtime)

  ```python
  import onnxruntime as ort
  import numpy as np
  from PIL import Image

- # Load the model
  session = ort.InferenceSession(
  "triposr_encoder.onnx",
- providers=['CPUExecutionProvider'] # or 'CoreMLExecutionProvider' for iOS
  )

- # Preprocess image
- image = Image.open("your_image.png").convert("RGB").resize((512, 512))
  input_array = np.array(image).astype(np.float32) / 255.0
  input_array = input_array.transpose(2, 0, 1)[np.newaxis, ...]

- # Run inference
  scene_codes = session.run(None, {"input_image": input_array})[0]
- print(f"Scene codes shape: {scene_codes.shape}") # (1, 3, 40, 64, 64)
  ```

- ### iOS (Swift + ONNX Runtime)

- Add ONNX Runtime via SPM, then:

  ```swift
  import OnnxRuntimeBindings
@@ -247,6 +183,17 @@ let outputs = try session.run(
  )
  ```

  ---

  ## Citation
@@ -254,14 +201,14 @@ let outputs = try session.run(
  ```bibtex
  @article{TripoSR2024,
  title={TripoSR: Fast 3D Object Reconstruction from a Single Image},
- author={Tochilkin, Dmitry and Pankratz, David and Liu, Zexiang and Huang, Zixuan and Letts, Adam and Li, Yangguang and Liang, Ding and Laforte, Christian and Jampani, Varun and Cao, Yan-Pei},
  journal={arXiv preprint arXiv:2403.02151},
  year={2024}
  }
  ```

- ---
-
- ## License
-
- MIT License (same as original TripoSR)
  - on-device
  ---

+ <div align="center">
+
  # TripoSR iOS (ONNX)

+ **Single image &rarr; 3D mesh, on your iPhone.**
+
+ The ONNX-converted encoder from [TripoSR](https://github.com/VAST-AI-Research/TripoSR) by **Stability AI** &times; **Tripo AI** — optimized for on-device inference.
+
+ <br/>
+
+ <table>
+ <tr>
+ <td align="center"><b>419M</b><br/><sub>Parameters</sub></td>
+ <td align="center"><b>1.6 GB</b><br/><sub>Model Size</sub></td>
+ <td align="center"><b>&lt; 0.5s</b><br/><sub>Inference (A100)</sub></td>
+ <td align="center"><b>ONNX</b><br/><sub>Format</sub></td>
+ <td align="center"><b>MIT</b><br/><sub>License</sub></td>
+ </tr>
+ </table>

+ </div>

  ---

+ ## Demo

  <table>
  <tr>

  <td align="center"><b>3D Output</b></td>
  </tr>
  <tr>
+ <td align="center"><img src="assets/input_photo.jpeg" width="300" /></td>
+ <td align="center"><video src="https://huggingface.co/jc-builds/triposr-ios/resolve/main/assets/3d_output.mp4" controls autoplay loop width="300"></video></td>
  </tr>
  </table>

  ---

  ## Benchmarks

+ Evaluated on [GSO](https://goo.gl/datasets/GoogleScannedObjects) and [OmniObject3D](https://omniobject3d.github.io/) datasets. Results from the [TripoSR paper](https://arxiv.org/abs/2403.02151).

+ ### F-Score @ 0.1 &nbsp;(higher is better)

  ![F-Score Comparison](assets/chart_fscore.png)

+ ### Chamfer Distance &nbsp;(lower is better)

  ![Chamfer Distance Comparison](assets/chart_chamfer.png)

  ![Grouped Comparison](assets/chart_grouped.png)

  <details>
+ <summary><b>Full Results Table</b></summary>
+
+ <br/>
+
+ **GSO Dataset**

  | Method | CD ↓ | FS@0.1 ↑ | FS@0.2 ↑ | FS@0.5 ↑ |
  |:--|:--|:--|:--|:--|
  | One-2-3-45 | 0.227 | 0.382 | 0.630 | 0.878 |
  | OpenLRM | 0.180 | 0.430 | 0.698 | 0.938 |
+ | ZeroShape | 0.160 | 0.489 | 0.757 | 0.952 |
  | TGS | 0.122 | 0.637 | 0.846 | 0.968 |
  | **TripoSR** | **0.111** | **0.651** | **0.871** | **0.980** |

+ **OmniObject3D Dataset**

  | Method | CD ↓ | FS@0.1 ↑ | FS@0.2 ↑ | FS@0.5 ↑ |
  |:--|:--|:--|:--|:--|

  </details>

  ---

+ ## Architecture
+
+ One forward pass — no diffusion, no iterative denoising.

  ```mermaid
+ graph LR
+ A["Input Image<br/>(512x512)"] --> B["DINO ViT-B/16<br/>Image Tokenizer"]
+ B --> C["Transformer Decoder<br/>+ Cross-Attention"]
+ C --> D["Post Processor<br/>Triplane Features"]
+ D --> E["Marching Cubes<br/>3D Mesh"]
+
+ style A fill:#4a9eff,stroke:#30363d,color:#fff
+ style B fill:#7c3aed,stroke:#30363d,color:#fff
+ style C fill:#7c3aed,stroke:#30363d,color:#fff
+ style D fill:#7c3aed,stroke:#30363d,color:#fff
+ style E fill:#3fb950,stroke:#30363d,color:#fff
  ```

+ | Component | Parameters | Role |
  |:--|:--|:--|
+ | **DINO ViT-B/16** | ~86M | Pretrained image encoder |
+ | **Transformer Decoder** | ~268M | Cross-attention to image tokens |
+ | **Triplane Post-Processor** | ~65M | Tokens &rarr; triplane features `(3x40x64x64)` |

  ---

+ ## PyTorch vs. This Model

+ | | Original | This Conversion |
  |:--|:--|:--|
+ | **Format** | PyTorch | ONNX |
+ | **Size** | ~3 GB+ | 1.6 GB |
+ | **Runs on** | GPU server | iPhone / iPad / Mac |
+ | **Dependencies** | torch, einops, transformers | onnxruntime |
+ | **Connectivity** | Cloud API | Fully offline |

  ---

+ ## Quick Start

+ <details open>
+ <summary><b>Python</b></summary>

  ```python
  import onnxruntime as ort
  import numpy as np
  from PIL import Image

  session = ort.InferenceSession(
  "triposr_encoder.onnx",
+ providers=['CPUExecutionProvider'] # or 'CoreMLExecutionProvider'
  )

+ image = Image.open("photo.png").convert("RGB").resize((512, 512))
  input_array = np.array(image).astype(np.float32) / 255.0
  input_array = input_array.transpose(2, 0, 1)[np.newaxis, ...]

  scene_codes = session.run(None, {"input_image": input_array})[0]
+ # scene_codes.shape == (1, 3, 40, 64, 64)
  ```

+ </details>

+ <details>
+ <summary><b>Swift (iOS)</b></summary>

  ```swift
  import OnnxRuntimeBindings

  )
  ```

+ </details>
+
+ ---
+
+ ## Files
+
+ | File | Size | Description |
+ |:--|:--|:--|
+ | `triposr_encoder.onnx` | 2.6 MB | Model graph |
+ | `triposr_encoder.onnx.data` | 1.6 GB | Weights |
+
  ---

  ## Citation

  ```bibtex
  @article{TripoSR2024,
  title={TripoSR: Fast 3D Object Reconstruction from a Single Image},
+ author={Tochilkin, Dmitry and Pankratz, David and Liu, Zexiang and Huang, Zixuan
+ and Letts, Adam and Li, Yangguang and Liang, Ding and Laforte, Christian
+ and Jampani, Varun and Cao, Yan-Pei},
  journal={arXiv preprint arXiv:2403.02151},
  year={2024}
  }
  ```

+ <div align="center">
+ <sub>MIT License &bull; Based on <a href="https://github.com/VAST-AI-Research/TripoSR">TripoSR</a> by Stability AI &times; Tripo AI</sub>
+ </div>
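
Both versions of the README's Quick Start normalize the image to `[0, 1]` and reorder it HWC → NCHW before calling the encoder. That step can be checked in isolation, without onnxruntime or PIL; the `preprocess` helper and the synthetic image below are illustrative, not part of the repo:

```python
import numpy as np

def preprocess(img_uint8: np.ndarray) -> np.ndarray:
    """Turn an HxWx3 uint8 image (already resized to 512x512) into the
    (1, 3, 512, 512) float32 tensor the README feeds to the encoder."""
    x = img_uint8.astype(np.float32) / 255.0   # scale pixel values to [0, 1]
    x = x.transpose(2, 0, 1)[np.newaxis, ...]  # HWC -> CHW, then add batch dim
    return x

# Synthetic stand-in for a real photo.
dummy = np.zeros((512, 512, 3), dtype=np.uint8)
print(preprocess(dummy).shape)  # (1, 3, 512, 512)
```

The resulting array is what the diff binds to the `"input_image"` input of `session.run`.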