---
title: HF Model Ecosystem Visualizer
emoji: π
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
---
# Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face
**Authors:** Benjamin Laufer, Hamidah Oderinwale, Jon Kleinberg
**Research Paper**: [arXiv:2508.06811](https://arxiv.org/abs/2508.06811)
**Live Demo**: [https://huggingface.co/spaces/midah/hf-viz](https://huggingface.co/spaces/midah/hf-viz)
## Abstract
Many have observed that the development and deployment of generative machine learning (ML) and artificial intelligence (AI) models follow a distinctive pattern in which pre-trained models are adapted and fine-tuned for specific downstream tasks. However, there is limited empirical work that examines the structure of these interactions. This paper analyzes 1.86 million models on Hugging Face, a leading peer production platform for model development. Our study of model family trees -- networks that connect fine-tuned models to their base or parent -- reveals sprawling fine-tuning lineages that vary widely in size and structure. Using an evolutionary biology lens to study ML models, we use model metadata and model cards to measure the genetic similarity and mutation of traits over model families. We find that models tend to exhibit a family resemblance, meaning their genetic markers and traits exhibit more overlap when they belong to the same model family. However, these similarities depart in certain ways from standard models of asexual reproduction, because mutations are fast and directed, such that two 'sibling' models tend to exhibit more similarity than parent/child pairs. Further analysis of the directional drifts of these mutations reveals qualitative insights about the open machine learning ecosystem: licenses counter-intuitively drift from restrictive, commercial licenses towards permissive or copyleft licenses, often in violation of upstream licenses' terms; models evolve from multi-lingual compatibility towards English-only compatibility; and model cards become shorter and more standardized, turning more often to templates and automatically generated text. Overall, this work takes a step toward an empirically grounded understanding of model fine-tuning and suggests that ecological models and methods can yield novel scientific insights.
## About This Tool
This interactive latent space navigator visualizes ~1.84M models from the [modelbiome/ai_ecosystem_withmodelcards](https://huggingface.co/datasets/modelbiome/ai_ecosystem_withmodelcards) dataset in a 2D space where similar models appear closer together, allowing you to explore the relationships and family structures described in the paper.
**Resources:**
- **GitHub Repository**: [bendlaufer/ai-ecosystem](https://github.com/bendlaufer/ai-ecosystem) - Original research repository with analysis notebooks and datasets
- **Hugging Face Project**: [modelbiome](https://huggingface.co/modelbiome) - Dataset and project page on Hugging Face Hub
## Quick Start (Pre-Computed Data)
This project now uses **pre-computed embeddings and coordinates** for instant startup:
### Option 1: Pre-Computed Data (Recommended - 10 seconds startup)
```bash
# 1. Generate pre-computed data (one-time, ~45 minutes)
cd backend
pip install -r config/requirements.txt
python scripts/precompute_data.py --sample-size 150000
# 2. Start backend (instant!)
uvicorn api.main:app --host 0.0.0.0 --port 8000
# 3. Start frontend
cd ../frontend
npm install && npm start
```
**Startup time:** ~5-10 seconds
### Option 2: Traditional Mode (Fallback)
If pre-computed data is not available, the backend will automatically fall back to traditional loading (slower but still functional).
**See:**
- [`PRECOMPUTED_DATA.md`](PRECOMPUTED_DATA.md) - Detailed documentation
- [`DEPLOYMENT.md`](DEPLOYMENT.md) - Production deployment guide
## Project Structure
```
hf_viz/
├── backend/               # FastAPI backend
│   ├── api/               # API routes (main.py)
│   ├── services/          # External services (arXiv, model tracking, scheduler)
│   ├── utils/             # Utility modules (data loading, embeddings, etc.)
│   ├── config/            # Configuration files
│   └── cache/             # Backend cache directory
├── frontend/              # React frontend
│   ├── src/
│   │   ├── components/    # React components
│   │   ├── utils/         # Frontend utilities
│   │   └── workers/       # Web Workers
│   └── public/            # Static assets
├── cache/                 # Shared cache directory
├── deploy/                # Deployment configuration files
└── netlify-functions/     # Netlify serverless functions
```
## Features
### 3D Latent Space Visualization
- **Interactive 3D Scatter Plot** (Three.js/React Three Fiber):
- Navigate 1.84M+ models in 3D space
- Spatial sparsity filtering for better navigability
- Frustum culling and adaptive sampling for performance
- Instanced rendering for large datasets
- Family tree visualization with connecting edges
- Multiple color encoding options (library, pipeline, cluster, family depth, popularity)
- Dynamic size encoding based on downloads/likes
- Smooth camera animations
- UV projection minimap for navigation
### 2D Visualizations (D3.js)
- **Enhanced Scatter Plot**:
- Brush selection for multi-model selection
- Real-time tooltips with model details
- Dynamic color and size encoding
- Interactive zoom and pan
- Click to view model details modal
- **Network Graph**:
- Force-directed layout showing model relationships
- Connectivity based on latent space similarity
- Draggable nodes
- Color-coded by library
- Node size based on popularity
- **Histograms**:
- Distribution analysis of downloads, likes, trending scores
- Interactive bars with hover details
- Dynamic attribute selection
- **UV Projection Minimap**:
- 2D projection of 3D latent space (XY plane)
- Click to navigate 3D view to specific regions
- Shows current view center
### Advanced Features
- **Semantic Similarity Search**: Find models similar to a query model using embeddings
- **Base Models Filter**: View only root models (no parent) to see the base of family trees
- **Family Tree Visualization**: Click any model to see its family tree with parent-child relationships
- **Clustering**: Automatic K-means clustering reveals semantic groups
- **Model Details Modal**:
- Comprehensive model information
- File tree browser
- Color-coded tags and licenses
- Links to Hugging Face Hub
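The family-tree feature above boils down to walking parent pointers: find the root of a model's lineage, then collect every descendant. A minimal sketch in plain Python (the `parent` mapping and model ids are illustrative, not the dataset's actual schema):

```python
# Minimal family-tree sketch: each model carries a parent pointer; a tree is
# recovered by walking up to the root, then collecting all descendants.
# The parent mapping and model ids are illustrative, not the real schema.
from collections import defaultdict

models = {
    "org/base": None,
    "org/ft-a": "org/base",
    "org/ft-b": "org/base",
    "org/ft-a-quantized": "org/ft-a",
}

children = defaultdict(list)
for model_id, parent in models.items():
    if parent is not None:
        children[parent].append(model_id)

def root_of(model_id):
    """Follow parent pointers until a model with no parent is reached."""
    while models[model_id] is not None:
        model_id = models[model_id]
    return model_id

def family(model_id):
    """Return every model in the same tree, starting from the root."""
    stack, seen = [root_of(model_id)], []
    while stack:
        node = stack.pop()
        seen.append(node)
        stack.extend(children[node])
    return seen
```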
### Model Tracking & Analytics
- **Live Model Count Tracking**: Track the number of models on Hugging Face Hub over time
- **Growth Statistics**: Calculate growth rates, daily averages, and trends
- **Historical Data**: Query historical model counts with breakdowns by library and pipeline
- **API Endpoints**: RESTful API for accessing tracking data
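The growth statistics above reduce to simple deltas over timestamped count snapshots. A sketch under assumed inputs (the tracker's real storage format is not documented here):

```python
# Growth-statistics sketch: given (date, model_count) snapshots, compute the
# fractional growth over the window and the average daily increase.
# The snapshot values below are made up for illustration.
from datetime import date

snapshots = [
    (date(2025, 1, 1), 1_500_000),
    (date(2025, 6, 1), 1_860_000),
]

def growth_stats(points):
    (d0, c0), (d1, c1) = points[0], points[-1]
    days = (d1 - d0).days
    return {
        "growth_rate": (c1 - c0) / c0,      # fractional growth over the window
        "daily_average": (c1 - c0) / days,  # new models per day
    }
```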
### Performance Optimizations
- **Real-time Updates**:
- Debounced search (300ms)
- Instant filter updates
- Dynamic visualization switching
- **Client-side Caching**: IndexedDB caching for API responses
- **Request Cancellation**: Prevents race conditions with concurrent requests
- **Adaptive Rendering**: Quality adjusts based on user interaction
- **Spatial Indexing**: Octree for efficient nearest neighbor queries
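The octree idea behind the spatial index can be sketched in a few dozen lines: each node splits its cube into eight children once it holds too many points, and queries prune whole cubes that cannot contain a match. This is an illustrative Python sketch only; the frontend's actual octree lives in JavaScript:

```python
# Minimal point-octree sketch: nodes split into 8 children past CAPACITY
# points; radius queries skip cubes that are provably out of range.
CAPACITY = 8

class Octree:
    def __init__(self, center, half):
        self.center, self.half = center, half  # cube spans center +/- half
        self.points, self.children = [], None

    def _child_index(self, p):
        cx, cy, cz = self.center
        return (p[0] > cx) | ((p[1] > cy) << 1) | ((p[2] > cz) << 2)

    def _split(self):
        cx, cy, cz = self.center
        h = self.half / 2
        self.children = [
            Octree((cx + (h if i & 1 else -h),
                    cy + (h if i & 2 else -h),
                    cz + (h if i & 4 else -h)), h)
            for i in range(8)
        ]
        for p in self.points:
            self.children[self._child_index(p)].insert(p)
        self.points = []

    def insert(self, p):
        if self.children is None:
            self.points.append(p)
            if len(self.points) > CAPACITY:
                self._split()
        else:
            self.children[self._child_index(p)].insert(p)

    def query_radius(self, q, r):
        """Collect points within distance r of q, pruning far-away cubes."""
        if any(abs(q[i] - self.center[i]) > self.half + r for i in range(3)):
            return []  # cube cannot contain any point within r of q
        if self.children is None:
            return [p for p in self.points
                    if sum((p[i] - q[i]) ** 2 for i in range(3)) <= r * r]
        out = []
        for c in self.children:
            out.extend(c.query_radius(q, r))
        return out
```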
## Quick Start
**Start Backend:**
```bash
cd backend
pip install -r config/requirements.txt
uvicorn api.main:app --reload --host 0.0.0.0 --port 8000
```
**Start Frontend:**
```bash
cd frontend
npm install
npm start
```
Opens at `http://localhost:3000` with full D3.js interactivity.
## Installation
**Backend:**
```bash
cd backend
pip install -r config/requirements.txt
```
**Frontend:**
```bash
cd frontend
npm install
```
## Usage
### Local Development
**Start Backend:**
```bash
cd backend
uvicorn api.main:app --reload --host 0.0.0.0 --port 8000
```
The backend will:
1. Load a sample of 10,000 models from the dataset
2. Generate embeddings (first run takes ~2-3 minutes)
3. Reduce dimensions using UMAP
4. Serve the API at `http://localhost:8000`
**Start Frontend:**
```bash
cd frontend
npm start
```
The frontend will open at `http://localhost:3000`
### Using the Interface
1. **Filters**: Use the left sidebar to filter models by:
- Search query (model ID or tags)
- Minimum downloads
- Minimum likes
- Color mapping (library, pipeline, popularity)
- Size mapping (downloads, likes, trending score)
2. **Exploration**:
- Hover over points to see model information
- Zoom and pan to explore different regions
- Use the legend to understand color coding
3. **Understanding the Space**:
- Models closer together are more similar
- Similarity is based on tags, pipeline type, library, and model card content
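"Closer together means more similar" ultimately rests on nearest-neighbor ranking under cosine similarity of the embeddings. A plain-Python sketch with hand-made toy vectors (the backend presumably vectorizes this over the real embedding matrix):

```python
# Cosine-similarity top-k sketch. Real embeddings come from a sentence
# transformer; these tiny vectors and model ids are made up.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query, embeddings, k=2):
    """Rank model ids by cosine similarity to the query vector."""
    scored = sorted(embeddings.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [model_id for model_id, _ in scored[:k]]

embeddings = {
    "bert-like": [1.0, 0.1, 0.0],
    "bert-ft":   [0.9, 0.2, 0.0],
    "vision":    [0.0, 0.1, 1.0],
}
```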
## Deployment
### Netlify (React Frontend)
The frontend is configured for deployment on Netlify. The `netlify.toml` file in the root directory contains the build configuration.
**Steps to Deploy:**
1. **Push your code to GitHub** (if not already):
```bash
git add .
git commit -m "Prepare for Netlify deployment"
git push origin main
```
2. **Connect to Netlify**:
- Go to [Netlify](https://app.netlify.com)
   - Click "Add new site" → "Import an existing project"
- Connect your GitHub repository
- Netlify will auto-detect the `netlify.toml` configuration
3. **Configure Environment Variables**:
   - In Netlify dashboard, go to Site settings → Environment variables
- Add `REACT_APP_API_URL` with your backend URL (e.g., `https://your-backend.railway.app`)
- If using Hugging Face API, add `REACT_APP_HF_TOKEN` (optional)
4. **Deploy Backend Separately**:
- Netlify doesn't support Python/FastAPI backends
- Deploy backend to one of these services:
- **Railway**: Recommended, easy setup
- **Render**: Free tier available
- **Fly.io**: Good for Python apps
- **Heroku**: Paid option
- Update CORS in `backend/api/main.py` to include your Netlify URL
5. **Build Settings** (auto-detected from `netlify.toml`):
- Base directory: `frontend`
- Build command: `npm install && npm run build`
- Publish directory: `frontend/build`
- Node version: 18
**Backend Deployment (Railway Example):**
1. Create a new project on [Railway](https://railway.app)
2. Connect your GitHub repository
3. Set root directory to `backend`
4. Railway will auto-detect Python and install dependencies
5. Add environment variables if needed (HF_TOKEN, etc.)
6. Railway will provide a URL like `https://your-app.railway.app`
7. Use this URL as `REACT_APP_API_URL` in Netlify
**CORS Configuration:**
Update `backend/api/main.py` to allow your Netlify domain:
```python
from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=[
        "http://localhost:3000",          # Local development
        "https://your-site.netlify.app",  # Your Netlify URL
    ],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
```
## Architecture
- **Backend** (`backend/api/main.py`): FastAPI server serving model data
- **Frontend** (`frontend/`): React app with D3.js visualizations
- **Enhanced Scatter Plot**: D3.js scatter with brush selection, real-time tooltips
- **Network Graph**: Force-directed graph showing model relationships and connectivity
- **Histograms**: Distribution analysis of downloads, likes, trending scores
- **Real-time Updates**: Debounced filtering, dynamic visualizations
- **Interactive Features**: Click, brush, drag, zoom, pan
- **Data Loading** (`backend/utils/data_loader.py`): Loads dataset from Hugging Face Hub, handles filtering and preprocessing
- **Embedding Generation** (`backend/utils/embeddings.py`): Creates embeddings from model metadata using sentence transformers
- **Dimensionality Reduction** (`backend/utils/dimensionality_reduction.py`): Uses UMAP to reduce to 2D for visualization
- **Clustering** (`backend/utils/clustering.py`): K-Means clustering with automatic optimization for model grouping
- **Services** (`backend/services/`): External service integrations (arXiv API, model tracking, scheduler)
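The clustering module's job can be illustrated with a bare-bones K-means loop; `backend/utils/clustering.py` presumably delegates to a library implementation, so this sketch only shows the algorithm itself:

```python
# Bare-bones K-means: assign each point to its nearest centroid, then move
# each centroid to the mean of its assigned points, for a fixed number of
# iterations (a real implementation would also test for convergence).
def kmeans(points, centroids, iters=20):
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            best = min(range(len(centroids)),
                       key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[best].append(p)
        centroids = [
            tuple(sum(c) / len(c) for c in zip(*group)) if group else centroids[i]
            for i, group in enumerate(clusters)
        ]
    return centroids, clusters
```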
### Comparison with Hugging Face Dataset Viewer
This project uses a different approach than Hugging Face's built-in dataset viewer:
- **HF Dataset Viewer**: Tabular browser for exploring dataset rows (see [dataset-viewer](https://github.com/huggingface/dataset-viewer))
- **This Project**: Latent space visualization showing semantic relationships between models
The HF viewer is optimized for browsing data structure, while this tool focuses on understanding model relationships through embeddings and spatial visualization.
## Design Decisions
The application uses:
- **3D visualization** for immersive exploration of latent space with **2D fallbacks** for accessibility
- **UMAP** for dimensionality reduction (better global structure than t-SNE, optimized parameters for structure preservation)
- **Sentence transformers** for efficient embedding generation
- **Smart sampling** with spatial sparsity to maintain interactivity with large datasets
- **Multi-level caching** (disk + IndexedDB) to avoid recomputation on filter changes
- **Adaptive rendering** with frustum culling and level-of-detail for smooth performance
- **Instanced rendering** for efficient GPU utilization with large point clouds
## Performance Notes
- **Full Dataset**: Loads all ~1.86 million models from the dataset
- **Backend Sampling**: Requests up to 500,000 models from backend (configurable via `max_points` API parameter)
- **Frontend Rendering**:
- For datasets >400K: Shows 30% of models (up to 200K visible)
- For datasets 200K-400K: Shows 40% of models
- For datasets 100K-200K: Shows 50% of models
- For smaller datasets: Shows all models with adaptive spatial sparsity
- Uses instanced rendering for datasets >5K points
- Camera-based frustum culling and adaptive LOD for optimal performance
- **Embedding Model**: `all-MiniLM-L6-v2` (good balance of quality and speed)
- **Caching**: Embeddings and reduced dimensions are cached to disk for fast startup
- **Optimizations**: Index-based lookups, vectorized operations, response compression, and optimized top-k queries
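The tiered sampling rules above can be written as a small function; `visible_count` is a hypothetical helper mirroring the listed thresholds, not the frontend's actual code:

```python
# Tiered sampling: larger result sets are thinned more aggressively so the
# renderer stays interactive. Thresholds mirror the list above.
def visible_count(n):
    if n > 400_000:
        return min(int(n * 0.30), 200_000)
    if n > 200_000:
        return int(n * 0.40)
    if n > 100_000:
        return int(n * 0.50)
    return n  # small datasets render fully (with adaptive spatial sparsity)
```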
## Requirements
- Python 3.8+
- ~2-4GB RAM for 10K models
- Internet connection for dataset download
- Optional: GPU for faster embedding generation (not required)
## Citation
If you use this tool or dataset, please cite:
```bibtex
@article{laufer2025anatomy,
title={Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face},
author={Laufer, Benjamin and Oderinwale, Hamidah and Kleinberg, Jon},
journal={arXiv preprint arXiv:2508.06811},
year={2025},
url={https://arxiv.org/abs/2508.06811}
}
```
**Paper**: [arXiv:2508.06811](https://arxiv.org/abs/2508.06811)