---
title: HF Model Ecosystem Visualizer
emoji: 🌐
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
---

# Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face

**Authors:** Benjamin Laufer, Hamidah Oderinwale, Jon Kleinberg

**Research Paper**: [arXiv:2508.06811](https://arxiv.org/abs/2508.06811)

**Live Demo**: [https://huggingface.co/spaces/midah/hf-viz](https://huggingface.co/spaces/midah/hf-viz)

## Abstract

Many have observed that the development and deployment of generative machine learning (ML) and artificial intelligence (AI) models follow a distinctive pattern in which pre-trained models are adapted and fine-tuned for specific downstream tasks. However, there is limited empirical work that examines the structure of these interactions. This paper analyzes 1.86 million models on Hugging Face, a leading peer production platform for model development. Our study of model family trees (networks that connect fine-tuned models to their base or parent) reveals sprawling fine-tuning lineages that vary widely in size and structure. Using an evolutionary biology lens to study ML models, we use model metadata and model cards to measure the genetic similarity and mutation of traits over model families. We find that models tend to exhibit a family resemblance, meaning their genetic markers and traits exhibit more overlap when they belong to the same model family. However, these similarities depart in certain ways from standard models of asexual reproduction, because mutations are fast and directed, such that two "sibling" models tend to exhibit more similarity than parent/child pairs. Further analysis of the directional drifts of these mutations reveals qualitative insights about the open machine learning ecosystem: licenses counter-intuitively drift from restrictive, commercial licenses towards permissive or copyleft licenses, often in violation of the upstream license's terms; models evolve from multi-lingual compatibility towards English-only compatibility; and model cards reduce in length and standardize by turning, more often, to templates and automatically generated text. Overall, this work takes a step toward an empirically grounded understanding of model fine-tuning and suggests that ecological models and methods can yield novel scientific insights.

## About This Tool

This interactive latent space navigator visualizes ~1.84M models from the [modelbiome/ai_ecosystem_withmodelcards](https://huggingface.co/datasets/modelbiome/ai_ecosystem_withmodelcards) dataset in a 2D space where similar models appear closer together, allowing you to explore the relationships and family structures described in the paper.


**Resources:**
- **GitHub Repository**: [bendlaufer/ai-ecosystem](https://github.com/bendlaufer/ai-ecosystem) - Original research repository with analysis notebooks and datasets
- **Hugging Face Project**: [modelbiome](https://huggingface.co/modelbiome) - Dataset and project page on Hugging Face Hub

## Quick Start (Pre-Computed Data)

This project now uses **pre-computed embeddings and coordinates** for instant startup:

### Option 1: Pre-Computed Data (Recommended - 10 seconds startup)

```bash
# 1. Generate pre-computed data (one-time, ~45 minutes)
cd backend
pip install -r config/requirements.txt
python scripts/precompute_data.py --sample-size 150000

# 2. Start backend (instant!)
uvicorn api.main:app --host 0.0.0.0 --port 8000

# 3. Start frontend
cd ../frontend
npm install && npm start
```

**Startup time:** ~5-10 seconds

### Option 2: Traditional Mode (Fallback)

If pre-computed data is not available, the backend will automatically fall back to traditional loading (slower but still functional).
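
The fallback decision described above can be pictured with a minimal sketch. The file names in `PRECOMPUTED_FILES` and the helper `choose_loading_mode` are hypothetical and only illustrate the idea; the files the backend actually looks for are documented in `PRECOMPUTED_DATA.md`.

```python
from pathlib import Path

# Hypothetical file names, used only for illustration.
PRECOMPUTED_FILES = ("embeddings.npy", "coordinates.parquet")


def choose_loading_mode(data_dir: Path) -> str:
    """Return "precomputed" if every expected file exists, else "traditional"."""
    if all((data_dir / name).is_file() for name in PRECOMPUTED_FILES):
        return "precomputed"
    return "traditional"
```

Checking for every file before committing to the fast path avoids a half-initialized startup when only part of the pre-computed output is present.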

**See:** 
- [`PRECOMPUTED_DATA.md`](PRECOMPUTED_DATA.md) - Detailed documentation
- [`DEPLOYMENT.md`](DEPLOYMENT.md) - Production deployment guide

## Project Structure

```
hf_viz/
├── backend/              # FastAPI backend
│   ├── api/              # API routes (main.py)
│   ├── services/         # External services (arXiv, model tracking, scheduler)
│   ├── utils/            # Utility modules (data loading, embeddings, etc.)
│   ├── config/           # Configuration files
│   └── cache/            # Backend cache directory
├── frontend/             # React frontend
│   ├── src/
│   │   ├── components/   # React components
│   │   ├── utils/        # Frontend utilities
│   │   └── workers/      # Web Workers
│   └── public/           # Static assets
├── cache/                # Shared cache directory
├── deploy/               # Deployment configuration files
└── netlify-functions/    # Netlify serverless functions
```

## Features

### 3D Latent Space Visualization

- **Interactive 3D Scatter Plot** (Three.js/React Three Fiber):
  - Navigate 1.84M+ models in 3D space
  - Spatial sparsity filtering for better navigability
  - Frustum culling and adaptive sampling for performance
  - Instanced rendering for large datasets
  - Family tree visualization with connecting edges
  - Multiple color encoding options (library, pipeline, cluster, family depth, popularity)
  - Dynamic size encoding based on downloads/likes
  - Smooth camera animations
  - UV projection minimap for navigation
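
One common way to implement spatial sparsity filtering is grid-based thinning: divide space into cubic cells and keep at most one point per cell. The sketch below illustrates that idea in Python. It is not the project's rendering code (which runs in the browser), and the function name `sparsify` and the `cell_size` parameter are assumptions made for this example.

```python
import math
from typing import Iterable, List, Tuple

Point = Tuple[float, float, float]


def sparsify(points: Iterable[Point], cell_size: float) -> List[Point]:
    """Keep the first point seen in each cubic grid cell of side `cell_size`."""
    if cell_size <= 0:
        raise ValueError("cell_size must be positive")
    seen = set()
    kept: List[Point] = []
    for x, y, z in points:
        # Integer cell coordinates; floor keeps negative values in their own cells.
        cell = (
            math.floor(x / cell_size),
            math.floor(y / cell_size),
            math.floor(z / cell_size),
        )
        if cell not in seen:
            seen.add(cell)
            kept.append((x, y, z))
    return kept
```

A larger `cell_size` removes more points from dense regions while leaving isolated points untouched, which is why this kind of filter tends to preserve the overall shape of the point cloud.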

### 2D Visualizations (D3.js)

- **Enhanced Scatter Plot**: 
  - Brush selection for multi-model selection
  - Real-time tooltips with model details
  - Dynamic color and size encoding
  - Interactive zoom and pan
  - Click to view model details modal

- **Network Graph**: 
  - Force-directed layout showing model relationships
  - Connectivity based on latent space similarity
  - Draggable nodes
  - Color-coded by library
  - Node size based on popularity

- **Histograms**: 
  - Distribution analysis of downloads, likes, trending scores
  - Interactive bars with hover details
  - Dynamic attribute selection

- **UV Projection Minimap**:
  - 2D projection of 3D latent space (XY plane)
  - Click to navigate 3D view to specific regions
  - Shows current view center

### Advanced Features

- **Semantic Similarity Search**: Find models similar to a query model using embeddings
- **Base Models Filter**: View only root models (no parent) to see the base of family trees
- **Family Tree Visualization**: Click any model to see its family tree with parent-child relationships
- **Clustering**: Automatic K-means clustering reveals semantic groups
- **Model Details Modal**: 
  - Comprehensive model information
  - File tree browser
  - Color-coded tags and licenses
  - Links to Hugging Face Hub
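
The base-model filter and the family tree view both follow from a parent pointer per model. The sketch below assumes each model has at most one parent and uses made-up model IDs. The function names are hypothetical and do not mirror the backend's API.

```python
from collections import defaultdict, deque
from typing import Dict, List, Optional


def build_children(parents: Dict[str, Optional[str]]) -> Dict[str, List[str]]:
    """Invert a child -> parent mapping into parent -> list of children."""
    children = defaultdict(list)
    for model_id, parent_id in parents.items():
        if parent_id is not None:
            children[parent_id].append(model_id)
    return dict(children)


def base_models(parents: Dict[str, Optional[str]]) -> List[str]:
    """Root models are those with no recorded parent."""
    return sorted(m for m, p in parents.items() if p is None)


def family_depths(parents: Dict[str, Optional[str]], root: str) -> Dict[str, int]:
    """Breadth-first walk from `root`, returning each descendant's depth."""
    children = build_children(parents)
    depths = {root: 0}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in depths:  # guards against malformed cyclic data
                depths[child] = depths[current] + 1
                queue.append(child)
    return depths


# Made-up IDs for illustration only.
example = {
    "org/base": None,
    "alice/ft-a": "org/base",
    "bob/ft-b": "org/base",
    "carol/ft-a-quant": "alice/ft-a",
    "org/other": None,
}
```

With this data, `base_models(example)` lists the two roots, and `family_depths(example, "org/base")` assigns depth 1 to the two direct fine-tunes and depth 2 to the model derived from one of them. Depth values like these are what a "family depth" color encoding could be driven by.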

### Model Tracking & Analytics

- **Live Model Count Tracking**: Track the number of models on Hugging Face Hub over time
- **Growth Statistics**: Calculate growth rates, daily averages, and trends
- **Historical Data**: Query historical model counts with breakdowns by library and pipeline
- **API Endpoints**: RESTful API for accessing tracking data
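
As a rough illustration of the growth statistics, the sketch below derives a daily average and a percentage growth rate from dated model-count snapshots. The function `growth_stats` and its return keys are assumptions made for this example, not the tracking service's actual interface, and the counts used in the usage note are invented.

```python
from datetime import date
from typing import Dict, List, Optional, Tuple


def growth_stats(snapshots: List[Tuple[date, int]]) -> Dict[str, Optional[float]]:
    """Summarize growth between the earliest and latest snapshot."""
    if len(snapshots) < 2:
        raise ValueError("need at least two snapshots")
    ordered = sorted(snapshots)
    (start_day, start_count), (end_day, end_count) = ordered[0], ordered[-1]
    days = (end_day - start_day).days
    if days == 0:
        raise ValueError("snapshots must span at least one day")
    added = end_count - start_count
    return {
        "days": days,
        "models_added": added,
        "daily_average": added / days,
        # Growth rate is undefined when the starting count is zero.
        "growth_rate_pct": 100.0 * added / start_count if start_count else None,
    }
```

For example, two snapshots ten days apart with counts of 1,000 and 1,100 give 100 models added, a daily average of 10, and a growth rate of 10%.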

### Performance Optimizations

- **Real-time Updates**: 
  - Debounced search (300ms)
  - Instant filter updates
  - Dynamic visualization switching
- **Client-side Caching**: IndexedDB caching for API responses
- **Request Cancellation**: Prevents race conditions with concurrent requests
- **Adaptive Rendering**: Quality adjusts based on user interaction
- **Spatial Indexing**: Octree for efficient nearest neighbor queries

## Quick Start

**Start Backend:**
```bash
cd backend
pip install -r config/requirements.txt
uvicorn api.main:app --reload --host 0.0.0.0 --port 8000
```

**Start Frontend:**
```bash
cd frontend
npm install
npm start
```

The app opens at `http://localhost:3000` with full D3.js interactivity.

## Installation

**Backend:**
```bash
cd backend
pip install -r config/requirements.txt
```

**Frontend:**
```bash
cd frontend
npm install
```

## Usage

### Local Development

**Start Backend:**
```bash
cd backend
uvicorn api.main:app --reload --host 0.0.0.0 --port 8000
```

The backend will:
1. Load a sample of 10,000 models from the dataset
2. Generate embeddings (first run takes ~2-3 minutes)
3. Reduce dimensions using UMAP
4. Serve the API at `http://localhost:8000`
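
Steps 2 and 3 can be sketched roughly as follows. This is an illustration under assumptions, not the code in `backend/utils/`: the row keys (`tags`, `pipeline_tag`, `library_name`, `card`) and both function names are hypothetical, while `all-MiniLM-L6-v2` is the embedding model named under Performance Notes. The heavy dependencies (`sentence-transformers`, `umap-learn`) are imported lazily so the text-building helper can be used without them.

```python
from typing import Dict, Sequence


def model_to_text(row: Dict[str, object], max_card_chars: int = 500) -> str:
    """Flatten one metadata row into a single string to embed."""
    parts = []
    tags = row.get("tags") or []
    if tags:
        parts.append("tags: " + ", ".join(str(t) for t in tags))
    for key in ("pipeline_tag", "library_name"):
        value = row.get(key)
        if value:
            parts.append(f"{key}: {value}")
    card = str(row.get("card") or "")[:max_card_chars]
    if card:
        parts.append(card)
    return " | ".join(parts)


def embed_and_reduce(texts: Sequence[str], n_components: int = 2):
    """Embed texts with a sentence transformer, then project them with UMAP."""
    # Imported here so that model_to_text works without these packages installed.
    from sentence_transformers import SentenceTransformer
    import umap

    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(list(texts), show_progress_bar=False)
    reducer = umap.UMAP(n_components=n_components, random_state=42)
    return reducer.fit_transform(embeddings)
```

UMAP needs more than a handful of rows to produce a meaningful layout, so `embed_and_reduce` is best tried on at least a few hundred texts.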

**Start Frontend:**
```bash
cd frontend
npm start
```

The frontend will open at `http://localhost:3000`

### Using the Interface

1. **Filters**: Use the left sidebar to filter models by:
   - Search query (model ID or tags)
   - Minimum downloads
   - Minimum likes
   - Color mapping (library, pipeline, popularity)
   - Size mapping (downloads, likes, trending score)

2. **Exploration**: 
   - Hover over points to see model information
   - Zoom and pan to explore different regions
   - Use the legend to understand color coding

3. **Understanding the Space**:
   - Models closer together are more similar
   - Similarity is based on tags, pipeline type, library, and model card content
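
To make "closer together means more similar" concrete, the sketch below ranks models by cosine similarity between their embedding vectors. Cosine similarity is a common choice for sentence-transformer embeddings, but whether the backend uses it is an assumption here. The function names and the toy two-dimensional vectors are illustrative only.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(
    query_id: str, embeddings: Dict[str, Sequence[float]], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k models most similar to `query_id`, best first."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine_similarity(query, vector))
        for other_id, vector in embeddings.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

This brute-force scan is linear in the number of models. For the full dataset, a spatial index or an approximate nearest-neighbor structure is the usual way to keep queries fast.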

## Deployment

### Netlify (React Frontend)

The frontend is configured for deployment on Netlify. The `netlify.toml` file in the root directory contains the build configuration.

**Steps to Deploy:**

1. **Push your code to GitHub** (if not already):
   ```bash
   git add .
   git commit -m "Prepare for Netlify deployment"
   git push origin main
   ```

2. **Connect to Netlify**:
   - Go to [Netlify](https://app.netlify.com)
   - Click "Add new site" → "Import an existing project"
   - Connect your GitHub repository
   - Netlify will auto-detect the `netlify.toml` configuration

3. **Configure Environment Variables**:
   - In Netlify dashboard, go to Site settings → Environment variables
   - Add `REACT_APP_API_URL` with your backend URL (e.g., `https://your-backend.railway.app`)
   - If using Hugging Face API, add `REACT_APP_HF_TOKEN` (optional)

4. **Deploy Backend Separately**:
   - Netlify doesn't support Python/FastAPI backends
   - Deploy backend to one of these services:
     - **Railway**: Recommended, easy setup
     - **Render**: Free tier available
     - **Fly.io**: Good for Python apps
     - **Heroku**: Paid option
   - Update CORS in `backend/api/main.py` to include your Netlify URL

5. **Build Settings** (auto-detected from `netlify.toml`):
   - Base directory: `frontend`
   - Build command: `npm install && npm run build`
   - Publish directory: `frontend/build`
   - Node version: 18

**Backend Deployment (Railway Example):**

1. Create a new project on [Railway](https://railway.app)
2. Connect your GitHub repository
3. Set root directory to `backend`
4. Railway will auto-detect Python and install dependencies
5. Add environment variables if needed (HF_TOKEN, etc.)
6. Railway will provide a URL like `https://your-app.railway.app`
7. Use this URL as `REACT_APP_API_URL` in Netlify

**CORS Configuration:**

Update `backend/api/main.py` to allow your Netlify domain:
```python
from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=[
        "http://localhost:3000",  # Local development
        "https://your-site.netlify.app",  # Your Netlify URL
    ],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
```

## Architecture

- **Backend** (`backend/api/main.py`): FastAPI server serving model data
- **Frontend** (`frontend/`): React app with D3.js visualizations
  - **Enhanced Scatter Plot**: D3.js scatter with brush selection, real-time tooltips
  - **Network Graph**: Force-directed graph showing model relationships and connectivity
  - **Histograms**: Distribution analysis of downloads, likes, trending scores
  - **Real-time Updates**: Debounced filtering, dynamic visualizations
  - **Interactive Features**: Click, brush, drag, zoom, pan
- **Data Loading** (`backend/utils/data_loader.py`): Loads dataset from Hugging Face Hub, handles filtering and preprocessing
- **Embedding Generation** (`backend/utils/embeddings.py`): Creates embeddings from model metadata using sentence transformers
- **Dimensionality Reduction** (`backend/utils/dimensionality_reduction.py`): Uses UMAP to reduce to 2D for visualization
- **Clustering** (`backend/utils/clustering.py`): K-Means clustering with automatic optimization for model grouping
- **Services** (`backend/services/`): External service integrations (arXiv API, model tracking, scheduler)

### Comparison with Hugging Face Dataset Viewer

This project uses a different approach than Hugging Face's built-in dataset viewer:

- **HF Dataset Viewer**: Tabular browser for exploring dataset rows (see [dataset-viewer](https://github.com/huggingface/dataset-viewer))
- **This Project**: Latent space visualization showing semantic relationships between models

The HF viewer is optimized for browsing data structure, while this tool focuses on understanding model relationships through embeddings and spatial visualization.

## Design Decisions

The application uses:
- **3D visualization** for immersive exploration of latent space with **2D fallbacks** for accessibility
- **UMAP** for dimensionality reduction (it generally preserves more global structure than t-SNE, and its parameters are tuned here for structure preservation)
- **Sentence transformers** for efficient embedding generation
- **Smart sampling** with spatial sparsity to maintain interactivity with large datasets
- **Multi-level caching** (disk + IndexedDB) to avoid recomputation on filter changes
- **Adaptive rendering** with frustum culling and level-of-detail for smooth performance
- **Instanced rendering** for efficient GPU utilization with large point clouds

## Performance Notes

- **Full Dataset**: Loads all ~1.86 million models from the dataset
- **Backend Sampling**: Requests up to 500,000 models from backend (configurable via `max_points` API parameter)
- **Frontend Rendering**: 
  - For datasets >400K: Shows 30% of models (up to 200K visible)
  - For datasets 200K-400K: Shows 40% of models
  - For datasets 100K-200K: Shows 50% of models
  - For smaller datasets: Shows all models with adaptive spatial sparsity
  - Uses instanced rendering for datasets >5K points
  - Camera-based frustum culling and adaptive LOD for optimal performance
- **Embedding Model**: `all-MiniLM-L6-v2` (good balance of quality and speed)
- **Caching**: Embeddings and reduced dimensions are cached to disk for fast startup
- **Optimizations**: Index-based lookups, vectorized operations, response compression, and optimized top-k queries
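
The frontend sampling tiers listed above can be restated as a small function. The real logic lives in the React frontend, so this Python version is only a restatement of the documented thresholds. The function name is hypothetical, and treating the lower bound of each range as inclusive is an assumption, since the list does not say which tier applies at exactly 200K or 100K.

```python
def visible_model_count(total: int) -> int:
    """Number of models rendered for a dataset of `total` models."""
    if total < 0:
        raise ValueError("total must be non-negative")
    if total > 400_000:
        # 30% of the dataset, capped at 200K visible points.
        return min(total * 30 // 100, 200_000)
    if total >= 200_000:
        return total * 40 // 100
    if total >= 100_000:
        return total * 50 // 100
    # Smaller datasets are shown in full (before any spatial sparsity filtering).
    return total
```

With the default backend limit of 500,000 points, this yields 150,000 visible models, so the 200K cap would only come into play if `max_points` were raised.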

## Requirements

- Python 3.8+
- ~2-4GB RAM for 10K models
- Internet connection for dataset download
- Optional: GPU for faster embedding generation (not required)

## Citation

If you use this tool or dataset, please cite:

```bibtex
@article{laufer2025anatomy,
  title={Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face},
  author={Laufer, Benjamin and Oderinwale, Hamidah and Kleinberg, Jon},
  journal={arXiv preprint arXiv:2508.06811},
  year={2025},
  url={https://arxiv.org/abs/2508.06811}
}
```

**Paper**: [arXiv:2508.06811](https://arxiv.org/abs/2508.06811)