---
title: HF Model Ecosystem Visualizer
emoji: π
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
---
# Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face
**Authors:** Benjamin Laufer, Hamidah Oderinwale, Jon Kleinberg
**Research Paper:** [arXiv:2508.06811](https://arxiv.org/abs/2508.06811)
**Live Demo:** https://huggingface.co/spaces/midah/hf-viz
## Abstract
Many have observed that the development and deployment of generative machine learning (ML) and artificial intelligence (AI) models follow a distinctive pattern in which pre-trained models are adapted and fine-tuned for specific downstream tasks. However, there is limited empirical work that examines the structure of these interactions. This paper analyzes 1.86 million models on Hugging Face, a leading peer production platform for model development. Our study of model family trees -- networks that connect fine-tuned models to their base or parent -- reveals sprawling fine-tuning lineages that vary widely in size and structure. Using an evolutionary biology lens to study ML models, we use model metadata and model cards to measure the genetic similarity and mutation of traits over model families. We find that models tend to exhibit a family resemblance, meaning their genetic markers and traits exhibit more overlap when they belong to the same model family. However, these similarities depart in certain ways from standard models of asexual reproduction, because mutations are fast and directed, such that two `sibling' models tend to exhibit more similarity than parent/child pairs. Further analysis of the directional drifts of these mutations reveals qualitative insights about the open machine learning ecosystem: licenses counter-intuitively drift from restrictive, commercial licenses towards permissive or copyleft licenses, often in violation of upstream licenses' terms; models evolve from multi-lingual compatibility towards English-only compatibility; and model cards reduce in length and standardize by turning, more often, to templates and automatically generated text. Overall, this work takes a step toward an empirically grounded understanding of model fine-tuning and suggests that ecological models and methods can yield novel scientific insights.
## About This Tool
This interactive latent space navigator visualizes ~1.84M models from the `modelbiome/ai_ecosystem_withmodelcards` dataset in a 2D space where similar models appear closer together, allowing you to explore the relationships and family structures described in the paper.
Resources:
- GitHub Repository: bendlaufer/ai-ecosystem - Original research repository with analysis notebooks and datasets
- Hugging Face Project: modelbiome - Dataset and project page on Hugging Face Hub
## Quick Start (Pre-Computed Data)
This project now uses pre-computed embeddings and coordinates for instant startup:
### Option 1: Pre-Computed Data (Recommended: ~10-second startup)
```bash
# 1. Generate pre-computed data (one-time, ~45 minutes)
cd backend
pip install -r config/requirements.txt
python scripts/precompute_data.py --sample-size 150000

# 2. Start backend (instant!)
uvicorn api.main:app --host 0.0.0.0 --port 8000

# 3. Start frontend
cd ../frontend
npm install && npm start
```
Startup time: ~5-10 seconds
### Option 2: Traditional Mode (Fallback)
If pre-computed data is not available, the backend will automatically fall back to traditional loading (slower but still functional).
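The fallback decision can be sketched as a startup check for the pre-computed artifacts. This is a hypothetical illustration: the file name `coordinates.json` and the directory layout are assumptions, not the backend's actual layout (see `PRECOMPUTED_DATA.md` for that).

```python
import json
from pathlib import Path

def load_coordinates(precomputed_dir="cache/precomputed"):
    """Return pre-computed 2D coordinates if present, else None.

    A None return tells the caller to fall back to the slow
    embed-then-UMAP pipeline (traditional mode).
    File name is an assumed placeholder, not the real layout.
    """
    path = Path(precomputed_dir) / "coordinates.json"
    if path.exists():
        return json.loads(path.read_text())  # instant-startup path
    return None
```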
See:
- `PRECOMPUTED_DATA.md` - Detailed documentation
- `DEPLOYMENT.md` - Production deployment guide
## Project Structure
```
hf_viz/
├── backend/                # FastAPI backend
│   ├── api/                # API routes (main.py)
│   ├── services/           # External services (arXiv, model tracking, scheduler)
│   ├── utils/              # Utility modules (data loading, embeddings, etc.)
│   ├── config/             # Configuration files
│   └── cache/              # Backend cache directory
├── frontend/               # React frontend
│   ├── src/
│   │   ├── components/     # React components
│   │   ├── utils/          # Frontend utilities
│   │   └── workers/        # Web Workers
│   └── public/             # Static assets
├── cache/                  # Shared cache directory
├── deploy/                 # Deployment configuration files
└── netlify-functions/      # Netlify serverless functions
```
## Features
### 3D Latent Space Visualization
- Interactive 3D Scatter Plot (Three.js/React Three Fiber):
  - Navigate 1.84M+ models in 3D space
  - Spatial sparsity filtering for better navigability
  - Frustum culling and adaptive sampling for performance
  - Instanced rendering for large datasets
  - Family tree visualization with connecting edges
  - Multiple color encoding options (library, pipeline, cluster, family depth, popularity)
  - Dynamic size encoding based on downloads/likes
  - Smooth camera animations
  - UV projection minimap for navigation
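The spatial sparsity filtering above can be approximated by grid binning: cap the number of points kept per cell so dense regions thin out while isolated models stay visible. A minimal Python sketch of the idea; the actual filter runs in the frontend, and the cell size and cap here are made-up parameters.

```python
import random
from collections import defaultdict

def sparsify(points, cell_size=1.0, per_cell=1, seed=0):
    """Keep at most `per_cell` points per grid cell (illustrative sketch).

    `points` is a list of (x, y, z) tuples. Dense regions are thinned,
    while an isolated point always survives (its cell has only itself).
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for p in points:
        key = tuple(int(c // cell_size) for c in p)  # integer grid cell
        cells[key].append(p)
    kept = []
    for bucket in cells.values():
        kept.extend(rng.sample(bucket, min(per_cell, len(bucket))))
    return kept
```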
### 2D Visualizations (D3.js)
Enhanced Scatter Plot:
- Brush selection for multi-model selection
- Real-time tooltips with model details
- Dynamic color and size encoding
- Interactive zoom and pan
- Click to view model details modal
Network Graph:
- Force-directed layout showing model relationships
- Connectivity based on latent space similarity
- Draggable nodes
- Color-coded by library
- Node size based on popularity
Histograms:
- Distribution analysis of downloads, likes, trending scores
- Interactive bars with hover details
- Dynamic attribute selection
UV Projection Minimap:
- 2D projection of 3D latent space (XY plane)
- Click to navigate 3D view to specific regions
- Shows current view center
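Click-to-navigate on the minimap is a linear rescale from normalized minimap coordinates to the world-space X/Y bounds of the point cloud. A sketch; the function name and the top-left-origin convention for `v` are assumptions, not the frontend's actual code.

```python
def minimap_click_to_world(u, v, bounds):
    """Map a click at normalized minimap coords (u, v) in [0, 1]
    to a world-space (x, y) camera target on the XY plane.

    `bounds` is (xmin, xmax, ymin, ymax). The (1 - v) flip assumes
    the minimap's origin is top-left while world Y grows upward.
    """
    xmin, xmax, ymin, ymax = bounds
    return (xmin + u * (xmax - xmin), ymin + (1 - v) * (ymax - ymin))
```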
### Advanced Features
- Semantic Similarity Search: Find models similar to a query model using embeddings
- Base Models Filter: View only root models (no parent) to see the base of family trees
- Family Tree Visualization: Click any model to see its family tree with parent-child relationships
- Clustering: Automatic K-means clustering reveals semantic groups
- Model Details Modal:
  - Comprehensive model information
  - File tree browser
  - Color-coded tags and licenses
  - Links to Hugging Face Hub
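For intuition about the clustering step, here is a minimal pure-Python K-means over 2D coordinates. The backend's `clustering.py` presumably relies on an optimized library, so treat this as an illustration of the algorithm, not the actual implementation.

```python
import math

def kmeans(points, k, iters=20):
    """Minimal 2D K-means sketch: returns (centroids, labels).

    Naive initialization (first k points); real implementations
    use k-means++ or multiple restarts.
    """
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda j: math.dist(p, centroids[j]))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels
```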
### Model Tracking & Analytics
- Live Model Count Tracking: Track the number of models on Hugging Face Hub over time
- Growth Statistics: Calculate growth rates, daily averages, and trends
- Historical Data: Query historical model counts with breakdowns by library and pipeline
- API Endpoints: RESTful API for accessing tracking data
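The growth statistics reduce to simple arithmetic over dated counts. A sketch of that calculation; the field names in the returned dict are illustrative, not the API's actual response shape.

```python
from datetime import date

def growth_stats(counts):
    """Compute overall growth rate and daily average from dated counts.

    `counts` is a list of (date, model_count) tuples sorted by date.
    Keys in the result are illustrative placeholders.
    """
    (d0, c0), (d1, c1) = counts[0], counts[-1]
    days = (d1 - d0).days
    return {
        "growth_rate": (c1 - c0) / c0,     # fractional growth over the window
        "daily_average": (c1 - c0) / days, # new models per day
    }
```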
### Performance Optimizations
- Real-time Updates:
  - Debounced search (300 ms)
  - Instant filter updates
  - Dynamic visualization switching
- Client-side Caching: IndexedDB caching for API responses
- Request Cancellation: Prevents race conditions with concurrent requests
- Adaptive Rendering: Quality adjusts based on user interaction
- Spatial Indexing: Octree for efficient nearest neighbor queries
## Quick Start
Start Backend:
```bash
cd backend
pip install -r config/requirements.txt
uvicorn api.main:app --reload --host 0.0.0.0 --port 8000
```
Start Frontend:
```bash
cd frontend
npm install
npm start
```
Opens at http://localhost:3000 with full D3.js interactivity.
## Installation
Backend:
```bash
cd backend
pip install -r config/requirements.txt
```
Frontend:
```bash
cd frontend
npm install
```
## Usage
### Local Development
Start Backend:
```bash
cd backend
uvicorn api.main:app --reload --host 0.0.0.0 --port 8000
```
The backend will:
- Load a sample of 10,000 models from the dataset
- Generate embeddings (first run takes ~2-3 minutes)
- Reduce dimensions using UMAP
- Serve the API at `http://localhost:8000`
Start Frontend:
```bash
cd frontend
npm start
```
The frontend will open at http://localhost:3000
### Using the Interface
Filters: Use the left sidebar to filter models by:
- Search query (model ID or tags)
- Minimum downloads
- Minimum likes
- Color mapping (library, pipeline, popularity)
- Size mapping (downloads, likes, trending score)
Exploration:
- Hover over points to see model information
- Zoom and pan to explore different regions
- Use the legend to understand color coding
Understanding the Space:
- Models closer together are more similar
- Similarity is based on tags, pipeline type, library, and model card content
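Under the hood, "closer together" comes from comparing embedding vectors, typically with cosine similarity, before UMAP projects them to 2D. A pure-Python sketch of the metric (the backend presumably uses vectorized NumPy for this at scale):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors.

    Returns 1.0 for identical directions, 0.0 for orthogonal ones.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```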
## Deployment
### Netlify (React Frontend)
The frontend is configured for deployment on Netlify. The `netlify.toml` file in the root directory contains the build configuration.
Steps to Deploy:

1. Push your code to GitHub (if not already):

   ```bash
   git add .
   git commit -m "Prepare for Netlify deployment"
   git push origin main
   ```

2. Connect to Netlify:
   - Go to Netlify
   - Click "Add new site" → "Import an existing project"
   - Connect your GitHub repository
   - Netlify will auto-detect the `netlify.toml` configuration

3. Configure Environment Variables:
   - In the Netlify dashboard, go to Site settings → Environment variables
   - Add `REACT_APP_API_URL` with your backend URL (e.g., `https://your-backend.railway.app`)
   - If using the Hugging Face API, add `REACT_APP_HF_TOKEN` (optional)

4. Deploy Backend Separately:
   - Netlify doesn't support Python/FastAPI backends
   - Deploy the backend to one of these services:
     - Railway: recommended, easy setup
     - Render: free tier available
     - Fly.io: good for Python apps
     - Heroku: paid option
   - Update CORS in `backend/api/main.py` to include your Netlify URL

5. Build Settings (auto-detected from `netlify.toml`):
   - Base directory: `frontend`
   - Build command: `npm install && npm run build`
   - Publish directory: `frontend/build`
   - Node version: 18

### Backend Deployment (Railway Example)
1. Create a new project on Railway
2. Connect your GitHub repository
3. Set the root directory to `backend`
4. Railway will auto-detect Python and install dependencies
5. Add environment variables if needed (`HF_TOKEN`, etc.)
6. Railway will provide a URL like `https://your-app.railway.app`
7. Use this URL as `REACT_APP_API_URL` in Netlify
### CORS Configuration
Update `backend/api/main.py` to allow your Netlify domain:
```python
from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=[
        "http://localhost:3000",          # Local development
        "https://your-site.netlify.app",  # Your Netlify URL
    ],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
```
## Architecture
- Backend (`backend/api/main.py`): FastAPI server serving model data
- Frontend (`frontend/`): React app with D3.js visualizations
  - Enhanced Scatter Plot: D3.js scatter with brush selection, real-time tooltips
  - Network Graph: force-directed graph showing model relationships and connectivity
  - Histograms: distribution analysis of downloads, likes, trending scores
  - Real-time Updates: debounced filtering, dynamic visualizations
  - Interactive Features: click, brush, drag, zoom, pan
- Data Loading (`backend/utils/data_loader.py`): Loads the dataset from the Hugging Face Hub, handles filtering and preprocessing
- Embedding Generation (`backend/utils/embeddings.py`): Creates embeddings from model metadata using sentence transformers
- Dimensionality Reduction (`backend/utils/dimensionality_reduction.py`): Uses UMAP to reduce to 2D for visualization
- Clustering (`backend/utils/clustering.py`): K-means clustering with automatic optimization for model grouping
- Services (`backend/services/`): External service integrations (arXiv API, model tracking, scheduler)
## Comparison with Hugging Face Dataset Viewer
This project uses a different approach than Hugging Face's built-in dataset viewer:
- HF Dataset Viewer: Tabular browser for exploring dataset rows (see dataset-viewer)
- This Project: Latent space visualization showing semantic relationships between models
The HF viewer is optimized for browsing data structure, while this tool focuses on understanding model relationships through embeddings and spatial visualization.
## Design Decisions
The application uses:
- 3D visualization for immersive exploration of latent space with 2D fallbacks for accessibility
- UMAP for dimensionality reduction (better global structure than t-SNE, optimized parameters for structure preservation)
- Sentence transformers for efficient embedding generation
- Smart sampling with spatial sparsity to maintain interactivity with large datasets
- Multi-level caching (disk + IndexedDB) to avoid recomputation on filter changes
- Adaptive rendering with frustum culling and level-of-detail for smooth performance
- Instanced rendering for efficient GPU utilization with large point clouds
## Performance Notes
- Full Dataset: Loads all ~1.86 million models from the dataset
- Backend Sampling: Requests up to 500,000 models from the backend (configurable via the `max_points` API parameter)
- Frontend Rendering:
  - For datasets >400K: shows 30% of models (up to 200K visible)
  - For datasets 200K-400K: shows 40% of models
  - For datasets 100K-200K: shows 50% of models
  - For smaller datasets: shows all models with adaptive spatial sparsity
  - Uses instanced rendering for datasets >5K points
  - Camera-based frustum culling and adaptive LOD for optimal performance
- Embedding Model: `all-MiniLM-L6-v2` (good balance of quality and speed)
- Caching: Embeddings and reduced dimensions are cached to disk for fast startup
- Optimizations: Index-based lookups, vectorized operations, response compression, and optimized top-k queries
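The optimized top-k queries mentioned above can use a bounded heap instead of a full sort, which at ~1.8M rows is the difference between O(n log k) and O(n log n). A sketch; the dict keys are illustrative, not the backend's actual record schema.

```python
import heapq

def top_k(models, k, key="downloads"):
    """Return the k models with the largest `key` value.

    heapq.nlargest maintains only a k-sized heap while scanning,
    avoiding a full sort of the entire list.
    """
    return heapq.nlargest(k, models, key=lambda m: m.get(key, 0))
```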
## Requirements
- Python 3.8+
- ~2-4GB RAM for 10K models
- Internet connection for dataset download
- Optional: GPU for faster embedding generation (not required)
## Citation
If you use this tool or dataset, please cite:
```bibtex
@article{laufer2025anatomy,
  title={Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face},
  author={Laufer, Benjamin and Oderinwale, Hamidah and Kleinberg, Jon},
  journal={arXiv preprint arXiv:2508.06811},
  year={2025},
  url={https://arxiv.org/abs/2508.06811}
}
```
Paper: arXiv:2508.06811