---
title: HF Model Ecosystem Visualizer
emoji: 🌐
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
---

Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face

Authors: Benjamin Laufer, Hamidah Oderinwale, Jon Kleinberg

Research Paper: arXiv:2508.06811

Live Demo: https://huggingface.co/spaces/midah/hf-viz

Abstract

Many have observed that the development and deployment of generative machine learning (ML) and artificial intelligence (AI) models follow a distinctive pattern in which pre-trained models are adapted and fine-tuned for specific downstream tasks. However, there is limited empirical work that examines the structure of these interactions. This paper analyzes 1.86 million models on Hugging Face, a leading peer production platform for model development. Our study of model family trees -- networks that connect fine-tuned models to their base or parent -- reveals sprawling fine-tuning lineages that vary widely in size and structure. Using an evolutionary biology lens to study ML models, we use model metadata and model cards to measure the genetic similarity and mutation of traits over model families. We find that models tend to exhibit a family resemblance, meaning their genetic markers and traits exhibit more overlap when they belong to the same model family. However, these similarities depart in certain ways from standard models of asexual reproduction, because mutations are fast and directed, such that two 'sibling' models tend to exhibit more similarity than parent/child pairs. Further analysis of the directional drifts of these mutations reveals qualitative insights about the open machine learning ecosystem: Licenses counter-intuitively drift from restrictive, commercial licenses towards permissive or copyleft licenses, often in violation of upstream licenses' terms; models evolve from multi-lingual compatibility towards English-only compatibility; and model cards reduce in length and standardize by turning, more often, to templates and automatically generated text. Overall, this work takes a step toward an empirically grounded understanding of model fine-tuning and suggests that ecological models and methods can yield novel scientific insights.

About This Tool

This interactive latent space navigator visualizes ~1.84M models from the modelbiome/ai_ecosystem_withmodelcards dataset in a 2D space where similar models appear closer together, allowing you to explore the relationships and family structures described in the paper.

Resources:

  • GitHub Repository: bendlaufer/ai-ecosystem - Original research repository with analysis notebooks and datasets
  • Hugging Face Project: modelbiome - Dataset and project page on Hugging Face Hub

Quick Start (Pre-Computed Data)

This project now uses pre-computed embeddings and coordinates for instant startup:

Option 1: Pre-Computed Data (Recommended - 10 seconds startup)

# 1. Generate pre-computed data (one-time, ~45 minutes)
cd backend
pip install -r config/requirements.txt
python scripts/precompute_data.py --sample-size 150000

# 2. Start backend (instant!)
uvicorn api.main:app --host 0.0.0.0 --port 8000

# 3. Start frontend
cd ../frontend
npm install && npm start

Startup time: ~5-10 seconds

Option 2: Traditional Mode (Fallback)

If pre-computed data is not available, the backend will automatically fall back to traditional loading (slower but still functional).
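
The fallback decision can be sketched as follows. The file names and layout here are illustrative assumptions, not the actual paths the backend uses (the real logic lives in the backend's data-loading code):

```python
from pathlib import Path
import json

# Hypothetical location for pre-computed output; the real path may differ.
PRECOMPUTED_DIR = Path("cache/precomputed")

def load_models():
    """Prefer pre-computed coordinates; fall back to full recomputation."""
    coords_file = PRECOMPUTED_DIR / "coordinates.json"
    if coords_file.exists():
        with coords_file.open() as f:
            return json.load(f)        # fast path: instant startup
    return compute_from_scratch()      # slow path: load, embed, reduce

def compute_from_scratch():
    # Placeholder for the traditional pipeline (dataset load + embeddings + UMAP).
    return []
```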


Project Structure

hf_viz/
├── backend/              # FastAPI backend
│   ├── api/              # API routes (main.py)
│   ├── services/         # External services (arXiv, model tracking, scheduler)
│   ├── utils/            # Utility modules (data loading, embeddings, etc.)
│   ├── config/           # Configuration files
│   └── cache/            # Backend cache directory
├── frontend/             # React frontend
│   ├── src/
│   │   ├── components/   # React components
│   │   ├── utils/        # Frontend utilities
│   │   └── workers/      # Web Workers
│   └── public/           # Static assets
├── cache/                # Shared cache directory
├── deploy/               # Deployment configuration files
└── netlify-functions/    # Netlify serverless functions

Features

3D Latent Space Visualization

  • Interactive 3D Scatter Plot (Three.js/React Three Fiber):
    • Navigate 1.84M+ models in 3D space
    • Spatial sparsity filtering for better navigability
    • Frustum culling and adaptive sampling for performance
    • Instanced rendering for large datasets
    • Family tree visualization with connecting edges
    • Multiple color encoding options (library, pipeline, cluster, family depth, popularity)
    • Dynamic size encoding based on downloads/likes
    • Smooth camera animations
    • UV projection minimap for navigation

2D Visualizations (D3.js)

  • Enhanced Scatter Plot:

    • Brush selection for multi-model selection
    • Real-time tooltips with model details
    • Dynamic color and size encoding
    • Interactive zoom and pan
    • Click to view model details modal
  • Network Graph:

    • Force-directed layout showing model relationships
    • Connectivity based on latent space similarity
    • Draggable nodes
    • Color-coded by library
    • Node size based on popularity
  • Histograms:

    • Distribution analysis of downloads, likes, trending scores
    • Interactive bars with hover details
    • Dynamic attribute selection
  • UV Projection Minimap:

    • 2D projection of 3D latent space (XY plane)
    • Click to navigate 3D view to specific regions
    • Shows current view center

Advanced Features

  • Semantic Similarity Search: Find models similar to a query model using embeddings
  • Base Models Filter: View only root models (no parent) to see the base of family trees
  • Family Tree Visualization: Click any model to see its family tree with parent-child relationships
  • Clustering: Automatic K-means clustering reveals semantic groups
  • Model Details Modal:
    • Comprehensive model information
    • File tree browser
    • Color-coded tags and licenses
    • Links to Hugging Face Hub
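
The semantic similarity search above can be sketched in plain NumPy. This is an illustrative cosine top-k implementation, not the code the backend actually runs:

```python
import numpy as np

def top_k_similar(embeddings: np.ndarray, query_idx: int, k: int = 5):
    """Return indices of the k models most similar to the query (cosine)."""
    # Normalize rows so dot products equal cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit[query_idx]
    sims[query_idx] = -np.inf                # exclude the query itself
    # argpartition finds the top k in O(N) instead of a full O(N log N) sort.
    top = np.argpartition(-sims, k)[:k]
    return top[np.argsort(-sims[top])]       # order the k hits by similarity
```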

Model Tracking & Analytics

  • Live Model Count Tracking: Track the number of models on Hugging Face Hub over time
  • Growth Statistics: Calculate growth rates, daily averages, and trends
  • Historical Data: Query historical model counts with breakdowns by library and pipeline
  • API Endpoints: RESTful API for accessing tracking data
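
The growth statistics reduce to simple arithmetic over count samples. A minimal sketch (the real service presumably tracks many intermediate points and breakdowns):

```python
def growth_stats(counts):
    """counts: list of (day_index, model_count) samples, oldest first."""
    (d0, c0), (d1, c1) = counts[0], counts[-1]
    days = d1 - d0
    daily_avg = (c1 - c0) / days       # models added per day over the window
    growth_rate = (c1 - c0) / c0       # fractional growth over the window
    return daily_avg, growth_rate
```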

Performance Optimizations

  • Real-time Updates:
    • Debounced search (300ms)
    • Instant filter updates
    • Dynamic visualization switching
  • Client-side Caching: IndexedDB caching for API responses
  • Request Cancellation: Prevents race conditions with concurrent requests
  • Adaptive Rendering: Quality adjusts based on user interaction
  • Spatial Indexing: Octree for efficient nearest neighbor queries

Quick Start

Start Backend:

cd backend
pip install -r config/requirements.txt
uvicorn api.main:app --reload --host 0.0.0.0 --port 8000

Start Frontend:

cd frontend
npm install
npm start

Opens at http://localhost:3000 with full D3.js interactivity.

Installation

Backend:

cd backend
pip install -r config/requirements.txt

Frontend:

cd frontend
npm install

Usage

Local Development

Start Backend:

cd backend
uvicorn api.main:app --reload --host 0.0.0.0 --port 8000

The backend will:

  1. Load a sample of 10,000 models from the dataset
  2. Generate embeddings (first run takes ~2-3 minutes)
  3. Reduce dimensions using UMAP
  4. Serve the API at http://localhost:8000
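
Step 2's one-time cost is why embeddings are cached to disk. A hedged sketch of that caching pattern; the file naming and hashing scheme here are assumptions, not the backend's actual implementation:

```python
import hashlib
from pathlib import Path
import numpy as np

CACHE = Path("cache")  # hypothetical cache location

def cached_embeddings(texts, compute):
    """Return embeddings from disk if present, else compute and store them."""
    CACHE.mkdir(exist_ok=True)
    # Key the cache file on a hash of the inputs so stale data is never reused.
    key = hashlib.sha256("\n".join(texts).encode()).hexdigest()[:16]
    path = CACHE / f"emb_{key}.npy"
    if path.exists():
        return np.load(path)     # warm start: skip the 2-3 minute computation
    emb = compute(texts)         # cold start: run the embedding model
    np.save(path, emb)
    return emb
```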

Start Frontend:

cd frontend
npm start

The frontend will open at http://localhost:3000

Using the Interface

  1. Filters: Use the left sidebar to filter models by:

    • Search query (model ID or tags)
    • Minimum downloads
    • Minimum likes
    • Color mapping (library, pipeline, popularity)
    • Size mapping (downloads, likes, trending score)
  2. Exploration:

    • Hover over points to see model information
    • Zoom and pan to explore different regions
    • Use the legend to understand color coding
  3. Understanding the Space:

    • Models closer together are more similar
    • Similarity is based on tags, pipeline type, library, and model card content
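
As a rough illustration of how those fields might be combined into a single string before embedding (the field names and separator here are assumptions, not the backend's actual format):

```python
def card_text(model: dict) -> str:
    """Flatten model metadata into one string for the embedding model."""
    parts = [
        model.get("id", ""),
        model.get("library", ""),
        model.get("pipeline_tag", ""),
        " ".join(model.get("tags", [])),
    ]
    # Drop empty fields so missing metadata doesn't add noise.
    return " | ".join(p for p in parts if p)
```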

Deployment

Netlify (React Frontend)

The frontend is configured for deployment on Netlify. The netlify.toml file in the root directory contains the build configuration.

Steps to Deploy:

  1. Push your code to GitHub (if not already):

    git add .
    git commit -m "Prepare for Netlify deployment"
    git push origin main
    
  2. Connect to Netlify:

    • Go to Netlify
    • Click "Add new site" → "Import an existing project"
    • Connect your GitHub repository
    • Netlify will auto-detect the netlify.toml configuration
  3. Configure Environment Variables:

    • In Netlify dashboard, go to Site settings → Environment variables
    • Add REACT_APP_API_URL with your backend URL (e.g., https://your-backend.railway.app)
    • If using Hugging Face API, add REACT_APP_HF_TOKEN (optional)
  4. Deploy Backend Separately:

    • Netlify doesn't support Python/FastAPI backends
    • Deploy backend to one of these services:
      • Railway: Recommended, easy setup
      • Render: Free tier available
      • Fly.io: Good for Python apps
      • Heroku: Paid option
    • Update CORS in backend/api/main.py to include your Netlify URL
  5. Build Settings (auto-detected from netlify.toml):

    • Base directory: frontend
    • Build command: npm install && npm run build
    • Publish directory: frontend/build
    • Node version: 18

Backend Deployment (Railway Example):

  1. Create a new project on Railway
  2. Connect your GitHub repository
  3. Set root directory to backend
  4. Railway will auto-detect Python and install dependencies
  5. Add environment variables if needed (HF_TOKEN, etc.)
  6. Railway will provide a URL like https://your-app.railway.app
  7. Use this URL as REACT_APP_API_URL in Netlify

CORS Configuration:

Update backend/api/main.py to allow your Netlify domain:

from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=[
        "http://localhost:3000",  # Local development
        "https://your-site.netlify.app",  # Your Netlify URL
    ],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

Architecture

  • Backend (backend/api/main.py): FastAPI server serving model data
  • Frontend (frontend/): React app with D3.js visualizations
    • Enhanced Scatter Plot: D3.js scatter with brush selection, real-time tooltips
    • Network Graph: Force-directed graph showing model relationships and connectivity
    • Histograms: Distribution analysis of downloads, likes, trending scores
    • Real-time Updates: Debounced filtering, dynamic visualizations
    • Interactive Features: Click, brush, drag, zoom, pan
  • Data Loading (backend/utils/data_loader.py): Loads dataset from Hugging Face Hub, handles filtering and preprocessing
  • Embedding Generation (backend/utils/embeddings.py): Creates embeddings from model metadata using sentence transformers
  • Dimensionality Reduction (backend/utils/dimensionality_reduction.py): Uses UMAP to reduce to 2D for visualization
  • Clustering (backend/utils/clustering.py): K-Means clustering with automatic optimization for model grouping
  • Services (backend/services/): External service integrations (arXiv API, model tracking, scheduler)
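
The "K-Means clustering with automatic optimization" step might look like the following sketch, assuming the cluster count is selected by silhouette score (an assumption; the real criterion in backend/utils/clustering.py may differ):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster(points: np.ndarray, k_range=range(2, 8)):
    """Try several k values and keep the one with the best silhouette score."""
    best_k, best_score, best_labels = None, -1.0, None
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
        score = silhouette_score(points, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```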

Comparison with Hugging Face Dataset Viewer

This project uses a different approach than Hugging Face's built-in dataset viewer:

  • HF Dataset Viewer: Tabular browser for exploring dataset rows (see dataset-viewer)
  • This Project: Latent space visualization showing semantic relationships between models

The HF viewer is optimized for browsing data structure, while this tool focuses on understanding model relationships through embeddings and spatial visualization.

Design Decisions

The application uses:

  • 3D visualization for immersive exploration of latent space with 2D fallbacks for accessibility
  • UMAP for dimensionality reduction (better global structure than t-SNE, optimized parameters for structure preservation)
  • Sentence transformers for efficient embedding generation
  • Smart sampling with spatial sparsity to maintain interactivity with large datasets
  • Multi-level caching (disk + IndexedDB) to avoid recomputation on filter changes
  • Adaptive rendering with frustum culling and level-of-detail for smooth performance
  • Instanced rendering for efficient GPU utilization with large point clouds

Performance Notes

  • Full Dataset: Loads all ~1.86 million models from the dataset
  • Backend Sampling: Requests up to 500,000 models from backend (configurable via max_points API parameter)
  • Frontend Rendering:
    • For datasets >400K: Shows 30% of models (up to 200K visible)
    • For datasets 200K-400K: Shows 40% of models
    • For datasets 100K-200K: Shows 50% of models
    • For smaller datasets: Shows all models with adaptive spatial sparsity
    • Uses instanced rendering for datasets >5K points
    • Camera-based frustum culling and adaptive LOD for optimal performance
  • Embedding Model: all-MiniLM-L6-v2 (good balance of quality and speed)
  • Caching: Embeddings and reduced dimensions are cached to disk for fast startup
  • Optimizations: Index-based lookups, vectorized operations, response compression, and optimized top-k queries
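
Vectorized operations, one of the optimizations listed above, replace per-model Python loops with NumPy boolean masks. A minimal illustration with made-up stats:

```python
import numpy as np

# Hypothetical per-model stats; the real arrays come from the loaded dataset.
downloads = np.array([10, 5000, 120, 90000])
likes = np.array([0, 2, 3, 400])

# One boolean mask evaluates both thresholds across all models at once.
mask = (downloads >= 100) & (likes >= 3)
idx = np.nonzero(mask)[0]   # row indices of models passing both filters
```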

Requirements

  • Python 3.8+
  • ~2-4GB RAM for 10K models
  • Internet connection for dataset download
  • Optional: GPU for faster embedding generation (not required)

Citation

If you use this tool or dataset, please cite:

@article{laufer2025anatomy,
  title={Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face},
  author={Laufer, Benjamin and Oderinwale, Hamidah and Kleinberg, Jon},
  journal={arXiv preprint arXiv:2508.06811},
  year={2025},
  url={https://arxiv.org/abs/2508.06811}
}

Paper: arXiv:2508.06811