Commit d5b98e60
Parent(s): none (initial commit)

Initial commit: Hugging Face Model Ecosystem Navigator

- Interactive latent space visualization for 1.86M models
- Plotly + Gradio implementation for Hugging Face Spaces
- React + Visx implementation for custom deployment
- Embedding generation with sentence transformers
- UMAP dimensionality reduction
- Model detail modals with Hugging Face links
- Paper: Anatomy of a Machine Learning Ecosystem (arXiv:2508.06811)

Files changed:
- .gitignore +38 -0
- .nvmrc +2 -0
- README.md +164 -0
- app.py +451 -0
- backend/api.py +194 -0
- backend/requirements.txt +12 -0
- data_loader.py +128 -0
- dimensionality_reduction.py +78 -0
- embeddings.py +71 -0
- frontend/.gitignore +24 -0
- frontend/.nvmrc +2 -0
- frontend/_redirects +2 -0
- frontend/netlify.toml +13 -0
- frontend/package.json +50 -0
- frontend/public/_redirects +2 -0
- frontend/public/index.html +18 -0
- frontend/src/App.css +110 -0
- frontend/src/App.tsx +197 -0
- frontend/src/components/ModelModal.css +161 -0
- frontend/src/components/ModelModal.tsx +90 -0
- frontend/src/components/ScatterPlot.tsx +235 -0
- frontend/src/index.css +18 -0
- frontend/src/index.tsx +14 -0
- frontend/src/types.ts +20 -0
- frontend/tsconfig.json +27 -0
- netlify-functions/api.py +180 -0
- netlify-functions/models.py +23 -0
- netlify-functions/requirements.txt +9 -0
- netlify.toml +27 -0
- requirements.txt +12 -0
- test_local.py +13 -0
.gitignore
ADDED
@@ -0,0 +1,38 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
env/
venv/
ENV/
.venv

# Caches
*.pkl
*.pickle
*.cache
embeddings_cache.pkl
reduced_embeddings_cache.pkl
*.npy

# IDE
.vscode/
.idea/
*.swp
*.swo
*~

# OS
.DS_Store
Thumbs.db

# Gradio
flagged/

# Data
*.parquet
*.csv
data/
.nvmrc
ADDED
@@ -0,0 +1,2 @@
18
README.md
ADDED
@@ -0,0 +1,164 @@
# Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face

**Authors:** Benjamin Laufer, Hamidah Oderinwale, Jon Kleinberg

**Research Paper**: [arXiv:2508.06811](https://arxiv.org/abs/2508.06811)

## Abstract

Many have observed that the development and deployment of generative machine learning (ML) and artificial intelligence (AI) models follow a distinctive pattern in which pre-trained models are adapted and fine-tuned for specific downstream tasks. However, there is limited empirical work that examines the structure of these interactions. This paper analyzes 1.86 million models on Hugging Face, a leading peer-production platform for model development. Our study of model family trees -- networks that connect fine-tuned models to their base or parent -- reveals sprawling fine-tuning lineages that vary widely in size and structure. Using an evolutionary biology lens to study ML models, we use model metadata and model cards to measure the genetic similarity and mutation of traits over model families. We find that models tend to exhibit a family resemblance, meaning their genetic markers and traits exhibit more overlap when they belong to the same model family. However, these similarities depart in certain ways from standard models of asexual reproduction, because mutations are fast and directed, such that two 'sibling' models tend to exhibit more similarity than parent/child pairs. Further analysis of the directional drifts of these mutations reveals qualitative insights about the open machine learning ecosystem: licenses counter-intuitively drift from restrictive, commercial licenses towards permissive or copyleft licenses, often in violation of upstream licenses' terms; models evolve from multilingual compatibility towards English-only compatibility; and model cards reduce in length and standardize by turning, more often, to templates and automatically generated text. Overall, this work takes a step toward an empirically grounded understanding of model fine-tuning and suggests that ecological models and methods can yield novel scientific insights.

## About This Tool

This interactive latent space navigator visualizes ~1.84M models from the [modelbiome/ai_ecosystem_withmodelcards](https://huggingface.co/datasets/modelbiome/ai_ecosystem_withmodelcards) dataset in a 2D space where similar models appear closer together, allowing you to explore the relationships and family structures described in the paper.
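Under the hood, each model's metadata is turned into a single text string before it is embedded (the actual logic lives in `data_loader.py` and `embeddings.py` in this commit and is not shown in this chunk). A hypothetical sketch of that text assembly, with illustrative function and field names:

```python
def build_combined_text(model_id, library_name, pipeline_tag, tags):
    # Concatenate the metadata fields into one string for the sentence
    # transformer to embed. Empty or missing fields are simply skipped.
    parts = [model_id, library_name or "", pipeline_tag or "", *tags]
    return " ".join(p for p in parts if p)

text = build_combined_text("bert-base-uncased", "transformers", "fill-mask", ["en", "bert"])
```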

## Features

- **Latent Space Visualization**: 2D embedding visualization showing model relationships
- **Interactive Exploration**: Hover, click, and zoom to explore models
- **Smart Filtering**: Filter by library, pipeline tag, popularity, and more
- **Color & Size Encoding**: Visualize different attributes through color and size
- **Caching**: Efficient caching of embeddings and reduced dimensions
- **Performance Optimized**: Handles large datasets through smart sampling
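As a concrete note on the size encoding: `app.py` maps the chosen metric into a fixed pixel range with a `5 + 15 * (value / max)` normalization, falling back to a uniform size when the metric is absent or all zero. A minimal sketch of that mapping (the helper name is illustrative):

```python
def marker_sizes(values, base=5.0, span=15.0):
    # Scale a metric (downloads, likes, ...) into [base, base + span],
    # mirroring the `5 + 15 * (v / max)` normalization in app.py.
    peak = max(values) if values else 0
    if peak <= 0:
        return [10.0] * len(values)  # uniform fallback size
    return [base + span * v / peak for v in values]

sizes = marker_sizes([0, 50, 100])
```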

## Quick Start

### Option 1: Plotly + Gradio (Hugging Face Spaces)

```bash
pip install -r requirements.txt
python app.py
```

### Option 2: Visx + React (Netlify Deployment)

For Netlify deployment, deploy the frontend to Netlify and the backend to Railway or Render. Set the `REACT_APP_API_URL` environment variable to your backend URL.

## Installation

```bash
pip install -r requirements.txt
```

## Usage

### Local Development

```bash
python app.py
```

Or use the test script:

```bash
python test_local.py
```

The app will:

1. Load a sample of 10,000 models from the dataset
2. Generate embeddings (first run takes ~2-3 minutes)
3. Reduce dimensions using UMAP
4. Launch a Gradio interface at `http://localhost:7860`

### Using the Interface

1. **Filters**: Use the left sidebar to filter models by:
   - Search query (model ID or tags)
   - Minimum downloads
   - Minimum likes
   - Color mapping (library, pipeline, popularity)
   - Size mapping (downloads, likes, trending score)

2. **Exploration**:
   - Hover over points to see model information
   - Zoom and pan to explore different regions
   - Use the legend to understand color coding

3. **Understanding the Space**:
   - Models closer together are more similar
   - Similarity is based on tags, pipeline type, library, and model card content
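"Closer together" ultimately reflects similarity of the metadata embeddings before the UMAP projection. As a rough illustration of what vector similarity means here (this is not code from the repository), cosine similarity between two embedding vectors:

```python
import math

def cosine_similarity(u, v):
    # Dot product of the vectors divided by the product of their norms:
    # 1.0 for identical directions, 0.0 for unrelated (orthogonal) ones.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same = cosine_similarity([1.0, 0.0], [1.0, 0.0])        # identical direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # unrelated direction
```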

## Deployment

### Hugging Face Spaces

1. Create a new Space on Hugging Face
2. Push this repository to the Space
3. Ensure `requirements.txt` and `app.py` are in the root
4. The app will automatically:
   - Load the dataset from Hugging Face Hub
   - Generate embeddings on first run (cached afterwards)
   - Serve the interface via Gradio

**Note**: First load may take 2-3 minutes for embedding generation. Subsequent loads will be faster due to caching.

### Netlify (React Frontend)

1. Deploy frontend to Netlify (set base directory to `frontend`)
2. Deploy backend to Railway/Render (set root directory to `backend`)
3. Set `REACT_APP_API_URL` environment variable in Netlify to your backend URL
4. Update CORS in backend to include your Netlify URL

## Architecture

### Current Implementation (Plotly + Gradio)

- **Data Loading** (`data_loader.py`): Loads dataset from Hugging Face Hub, handles filtering and preprocessing
- **Embedding Generation** (`embeddings.py`): Creates embeddings from model metadata using sentence transformers
- **Dimensionality Reduction** (`dimensionality_reduction.py`): Uses UMAP to reduce to 2D for visualization
- **Main App** (`app.py`): Gradio interface with Plotly visualizations

### Alternative Implementation (Visx + React)

For better performance and customization, see the `frontend/` and `backend/` directories for a React + Visx implementation:

- **Backend** (`backend/api.py`): FastAPI server serving model data
- **Frontend** (`frontend/`): React app with Visx visualizations

### Comparison with Hugging Face Dataset Viewer

This project uses a different approach than Hugging Face's built-in dataset viewer:

- **HF Dataset Viewer**: Tabular browser for exploring dataset rows (see [dataset-viewer](https://github.com/huggingface/dataset-viewer))
- **This Project**: Latent space visualization showing semantic relationships between models

The HF viewer is optimized for browsing data structure, while this tool focuses on understanding model relationships through embeddings and spatial visualization.

## Design Decisions

The application uses:

- **2D visualization** for simplicity and accessibility
- **UMAP** for dimensionality reduction (better global structure than t-SNE)
- **Sentence transformers** for efficient embedding generation
- **Smart sampling** to maintain interactivity with large datasets
- **Caching** to avoid recomputation on filter changes
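The smart-sampling decision boils down to a deterministic cap: `app.py` calls pandas' `df.sample(n=5000, random_state=42)` before plotting. A stdlib-only sketch of the same idea (the helper name is illustrative):

```python
import random

MAX_POINTS = 5000  # same cap app.py applies before plotting

def sample_for_plot(rows, max_points=MAX_POINTS, seed=42):
    # Pass small sets through untouched; downsample large ones with a
    # fixed seed so repeated renders of the same filter are stable.
    rows = list(rows)
    if len(rows) <= max_points:
        return rows
    return random.Random(seed).sample(rows, max_points)

subset = sample_for_plot(range(8000))
```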

## Performance Notes

- **Initial Sample**: 10,000 models (configurable in `app.py`)
- **Visualization Limit**: Maximum 5,000 points for smooth interaction
- **Embedding Model**: `all-MiniLM-L6-v2` (good balance of quality and speed)
- **Caching**: Embeddings and reduced dimensions are cached to disk
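The disk cache follows a plain cache-or-compute pattern. A minimal sketch, assuming pickle serialization (which matches the `.pkl` cache names in `.gitignore`; the helper name is illustrative):

```python
import os
import pickle

def cached(path, compute):
    # Return the pickled result at `path` if it exists; otherwise run
    # `compute`, persist its result to disk, and return it.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

# First call computes and writes the cache; later calls just read it.
vectors = cached("embeddings_cache.pkl", lambda: [[0.1, 0.2], [0.3, 0.4]])
```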

## Requirements

- Python 3.8+
- ~2-4GB RAM for 10K models
- Internet connection for dataset download
- Optional: GPU for faster embedding generation (not required)

## Citation

If you use this tool or dataset, please cite:

```bibtex
@article{laufer2025anatomy,
  title={Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face},
  author={Laufer, Benjamin and Oderinwale, Hamidah and Kleinberg, Jon},
  journal={arXiv preprint arXiv:2508.06811},
  year={2025},
  url={https://arxiv.org/abs/2508.06811}
}
```

**Paper**: [arXiv:2508.06811](https://arxiv.org/abs/2508.06811)
app.py
ADDED
@@ -0,0 +1,451 @@
"""
Main Gradio application for the Hugging Face Model Ecosystem Navigator.
"""
import gradio as gr
import plotly.graph_objects as go
import plotly.express as px
import pandas as pd
import numpy as np
from typing import Optional, Tuple
import os

from data_loader import ModelDataLoader
from embeddings import ModelEmbedder
from dimensionality_reduction import DimensionReducer


class ModelNavigatorApp:
    """Main application class for the model navigator."""

    def __init__(self):
        self.data_loader = ModelDataLoader()
        self.embedder: Optional[ModelEmbedder] = None
        self.reducer: Optional[DimensionReducer] = None
        self.df: Optional[pd.DataFrame] = None
        self.embeddings: Optional[np.ndarray] = None
        self.reduced_embeddings: Optional[np.ndarray] = None
        self.current_filtered_df: Optional[pd.DataFrame] = None

    def load_initial_data(self, sample_size: int = 10000):
        """Load initial sample of data."""
        print("Loading initial data...")
        self.df = self.data_loader.load_data(sample_size=sample_size)
        self.df = self.data_loader.preprocess_for_embedding(self.df)
        return f"Loaded {len(self.df)} models"

    def generate_visualization(
        self,
        color_by: str = "library_name",
        size_by: str = "downloads",
        min_downloads: int = 0,
        min_likes: int = 0,
        search_query: str = "",
        selected_libraries: list = None,
        selected_pipeline_tags: list = None,
        use_cache: bool = True
    ) -> Tuple[go.Figure, pd.DataFrame]:
        """
        Generate interactive visualization.

        Returns:
            Plotly figure and filtered dataframe
        """
        if self.df is None or len(self.df) == 0:
            return go.Figure(), pd.DataFrame()

        # Filter data
        filtered_df = self.data_loader.filter_data(
            df=self.df,
            min_downloads=min_downloads,
            min_likes=min_likes,
            libraries=selected_libraries if selected_libraries else None,
            pipeline_tags=selected_pipeline_tags if selected_pipeline_tags else None,
            search_query=search_query if search_query else None
        )

        if len(filtered_df) == 0:
            empty_fig = go.Figure()
            empty_fig.add_annotation(
                text="No models match the selected filters",
                xref="paper", yref="paper",
                x=0.5, y=0.5, showarrow=False
            )
            return empty_fig, filtered_df

        # Limit to reasonable size for performance
        max_points = 5000
        if len(filtered_df) > max_points:
            filtered_df = filtered_df.sample(n=max_points, random_state=42)
            print(f"Sampled {max_points} models for visualization")

        # Get indices for filtered data
        filtered_indices = filtered_df.index.tolist()

        # Generate or load embeddings
        cache_file = "embeddings_cache.pkl"
        if use_cache and os.path.exists(cache_file) and self.embeddings is None:
            try:
                if self.embedder is None:
                    self.embedder = ModelEmbedder()
                self.embeddings = self.embedder.load_embeddings(cache_file)
            except Exception as e:
                print(f"Could not load cached embeddings: {e}")

        if self.embeddings is None:
            if self.embedder is None:
                self.embedder = ModelEmbedder()

            # Generate embeddings for all data
            texts = self.df['combined_text'].tolist()
            self.embeddings = self.embedder.generate_embeddings(texts)

            if use_cache:
                self.embedder.save_embeddings(self.embeddings, cache_file)

        # Get embeddings for filtered data
        filtered_embeddings = self.embeddings[filtered_indices]

        # Reduce dimensions
        if self.reducer is None:
            self.reducer = DimensionReducer(method="umap", n_components=2)

        reduced_cache_file = "reduced_embeddings_cache.npy"
        if use_cache and os.path.exists(reduced_cache_file):
            try:
                self.reduced_embeddings = np.load(reduced_cache_file, allow_pickle=True)
                if len(self.reduced_embeddings) != len(self.df):
                    self.reduced_embeddings = None
            except Exception as e:
                print(f"Could not load cached reduced embeddings: {e}")

        if self.reduced_embeddings is None or len(self.reduced_embeddings) != len(self.df):
            self.reduced_embeddings = self.reducer.fit_transform(self.embeddings)
            if use_cache:
                np.save(reduced_cache_file, self.reduced_embeddings)

        filtered_reduced = self.reduced_embeddings[filtered_indices]

        # Prepare data for plotting
        plot_df = filtered_df.copy()
        plot_df['x'] = filtered_reduced[:, 0]
        plot_df['y'] = filtered_reduced[:, 1]

        # Color mapping
        if color_by in plot_df.columns:
            color_values = plot_df[color_by].fillna('Unknown')
        else:
            color_values = pd.Series(['All Models'] * len(plot_df))

        # Size mapping
        if size_by and size_by != "None" and size_by in plot_df.columns:
            size_values = plot_df[size_by].fillna(0)
            # Normalize sizes
            if size_values.max() > 0:
                size_values = 5 + 15 * (size_values / size_values.max())
            else:
                size_values = pd.Series([10] * len(plot_df))
        else:
            size_values = pd.Series([10] * len(plot_df))

        # Create hover text
        hover_texts = []
        for idx, row in plot_df.iterrows():
            hover = f"<b>{row.get('model_id', 'Unknown')}</b><br>"
            hover += f"Library: {row.get('library_name', 'N/A')}<br>"
            hover += f"Pipeline: {row.get('pipeline_tag', 'N/A')}<br>"
            hover += f"Downloads: {row.get('downloads', 0):,}<br>"
            hover += f"Likes: {row.get('likes', 0):,}"
            hover_texts.append(hover)

        # Create plotly figure
        fig = go.Figure()

        # Store model IDs with indices for click handling
        model_id_map = {i: row.get('model_id', 'Unknown') for i, row in plot_df.iterrows()}

        # Group by color if categorical
        is_categorical = len(color_values) > 0 and isinstance(color_values.iloc[0], str)

        if is_categorical and color_by in plot_df.columns:
            unique_colors = color_values.unique()
            colors = px.colors.qualitative.Set3 + px.colors.qualitative.Pastel
            color_map = {val: colors[i % len(colors)] for i, val in enumerate(unique_colors)}

            for color_val in unique_colors:
                mask = color_values == color_val
                subset_df = plot_df[mask]
                subset_hover = [hover_texts[i] for i, m in enumerate(mask) if m]
                subset_sizes = size_values[mask]

                # Create customdata with model IDs for click handling
                subset_customdata = [
                    [int(idx), str(row.get('model_id', 'Unknown'))]
                    for idx, row in subset_df.iterrows()
                ]

                fig.add_trace(go.Scatter(
                    x=subset_df['x'],
                    y=subset_df['y'],
                    mode='markers',
                    name=str(color_val)[:30],  # Truncate long names
                    marker=dict(
                        size=subset_sizes.values,
                        color=color_map[color_val],
                        opacity=0.7,
                        line=dict(width=0.5, color='white')
                    ),
                    text=subset_df['model_id'].tolist(),
                    customdata=subset_customdata,
                    hovertemplate='%{text}<br>Click for details<extra></extra>',
                    showlegend=True
                ))
        else:
            # Continuous color scale
            customdata = [
                [int(idx), str(row.get('model_id', 'Unknown'))]
                for idx, row in plot_df.iterrows()
            ]

            fig.add_trace(go.Scatter(
                x=plot_df['x'],
                y=plot_df['y'],
                mode='markers',
                marker=dict(
                    size=size_values.values,
                    color=color_values.values,
                    colorscale='Viridis',
                    opacity=0.7,
                    line=dict(width=0.5, color='white'),
                    colorbar=dict(title=color_by)
                ),
                text=plot_df['model_id'].tolist(),
                customdata=customdata,
                hovertemplate='%{text}<br>Click for details<extra></extra>',
                showlegend=False
            ))

        # Update layout
        fig.update_layout(
            title={
                'text': f'Model Latent Space Navigator ({len(plot_df)} models)',
                'x': 0.5,
                'xanchor': 'center'
            },
            xaxis_title="Dimension 1",
            yaxis_title="Dimension 2",
            hovermode='closest',
            template='plotly_white',
            height=700,
            clickmode='event+select'
        )

        return fig, filtered_df

    def get_model_details(self, model_id: str) -> str:
        """Get detailed information about a model."""
        if self.df is None:
            return "No data loaded"

        model = self.df[self.df.get('model_id', '') == model_id]
        if len(model) == 0:
            return f"Model '{model_id}' not found"

        model = model.iloc[0]

        details = f"# {model.get('model_id', 'Unknown')}\n\n"
        details += f"**Library:** {model.get('library_name', 'N/A')}\n\n"
        details += f"**Pipeline Tag:** {model.get('pipeline_tag', 'N/A')}\n\n"
        details += f"**Downloads:** {model.get('downloads', 0):,}\n\n"
        details += f"**Likes:** {model.get('likes', 0):,}\n\n"
        details += f"**Trending Score:** {model.get('trendingScore', 'N/A')}\n\n"

        if pd.notna(model.get('tags')):
            details += f"**Tags:** {model.get('tags', '')}\n\n"

        if pd.notna(model.get('licenses')):
            details += f"**License:** {model.get('licenses', '')}\n\n"

        if pd.notna(model.get('parent_model')):
            details += f"**Parent Model:** {model.get('parent_model', '')}\n\n"

        return details


def create_interface():
    """Create and launch Gradio interface."""
    app = ModelNavigatorApp()

    # Load initial data
    status = app.load_initial_data(sample_size=10000)
    print(status)

    with gr.Blocks(title="Anatomy of a Machine Learning Ecosystem", theme=gr.themes.Soft()) as demo:
        gr.Markdown("""
        # Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face

        **Authors:** Benjamin Laufer, Hamidah Oderinwale, Jon Kleinberg

        Many have observed that the development and deployment of generative machine learning (ML) and artificial intelligence (AI) models follow a distinctive pattern in which pre-trained models are adapted and fine-tuned for specific downstream tasks. However, there is limited empirical work that examines the structure of these interactions. This paper analyzes 1.86 million models on Hugging Face, a leading peer-production platform for model development. Our study of model family trees -- networks that connect fine-tuned models to their base or parent -- reveals sprawling fine-tuning lineages that vary widely in size and structure. Using an evolutionary biology lens to study ML models, we use model metadata and model cards to measure the genetic similarity and mutation of traits over model families.

        **Read the full paper**: [arXiv:2508.06811](https://arxiv.org/abs/2508.06811)

        ---

        **How to use this navigator:**
        - Adjust filters to explore different subsets of models
        - Hover over points to see model information
        - Use color and size options to highlight different attributes
        - Similar models appear closer together in the latent space
        - Models are positioned based on their similarity (tags, pipeline, library, and model card content)
        """)

        with gr.Row():
            with gr.Column(scale=1):
                gr.Markdown("### Filters")

                search_query = gr.Textbox(
                    label="Search Models",
                    placeholder="Search by model ID or tags...",
                    value=""
                )

                min_downloads = gr.Slider(
                    label="Min Downloads",
                    minimum=0,
                    maximum=1000000,
                    value=0,
                    step=1000
                )

                min_likes = gr.Slider(
                    label="Min Likes",
                    minimum=0,
                    maximum=10000,
                    value=0,
                    step=10
                )

                color_by = gr.Dropdown(
                    label="Color By",
                    choices=["library_name", "pipeline_tag", "downloads", "likes"],
                    value="library_name"
                )

                size_by = gr.Dropdown(
                    label="Size By",
                    choices=["downloads", "likes", "trendingScore", "None"],
                    value="downloads"
                )

                update_btn = gr.Button("Update Visualization", variant="primary")

            with gr.Column(scale=3):
                plot = gr.Plot(label="Model Latent Space")
                model_details = gr.Markdown(
                    value="**Instructions:** Use the filters above to explore models. Hover over points to see details, **click on a point** to view full model information and link to Hugging Face.",
                    label="Model Details"
                )

        def handle_plot_click(evt: gr.SelectData):
            """Handle plot click and show model details."""
            if evt is None or app.df is None:
                return "**Click on a model point to see details**"

            try:
                # Get the point index from the click event
                point_idx = evt.index
                if point_idx is None:
                    return "**Click on a model point to see details**"

                # Get the current filtered dataframe
                if app.current_filtered_df is not None and len(app.current_filtered_df) > 0:
                    filtered_df = app.current_filtered_df
                else:
                    # Fallback: use the full dataframe
                    filtered_df = app.df

                # Limit to max_points if needed
                if len(filtered_df) > 5000:
                    filtered_df = filtered_df.sample(n=5000, random_state=42)

                if point_idx < len(filtered_df):
                    model_row = filtered_df.iloc[point_idx]
                    model_id = model_row.get('model_id', 'Unknown')

                    # Get full model details from the original dataframe
                    model = app.df[app.df.get('model_id', '') == model_id]
                    if len(model) == 0:
                        return f"**Model not found:** {model_id}"

                    model = model.iloc[0]
                    hf_url = f"https://huggingface.co/{model_id}"

                    details = f"""# {model_id}

**[View on Hugging Face]({hf_url})**

## Model Information

- **Library:** {model.get('library_name', 'N/A')}
- **Pipeline Tag:** {model.get('pipeline_tag', 'N/A')}
- **Downloads:** {model.get('downloads', 0):,}
- **Likes:** {model.get('likes', 0):,}
"""
                    if pd.notna(model.get('trendingScore')):
                        details += f"- **Trending Score:** {model.get('trendingScore', 0):.2f}\n\n"
                    else:
                        details += "\n"

                    if pd.notna(model.get('tags')):
                        details += f"**Tags:** {model.get('tags', '')}\n\n"
                    if pd.notna(model.get('licenses')):
                        details += f"**License:** {model.get('licenses', '')}\n\n"
                    if pd.notna(model.get('parent_model')):
                        details += f"**Parent Model:** {model.get('parent_model', '')}\n\n"

                    return details
                else:
                    return f"**Point index out of range:** {point_idx}"
            except Exception as e:
                import traceback
                return f"**Error loading model details:**\n```\n{str(e)}\n{traceback.format_exc()}\n```"

            return "**Click on a model point to see details**"

        def update_plot_and_store(color_by_val, size_by_val, min_dl, min_lk, search):
            fig, df = app.generate_visualization(
                color_by=color_by_val,
                size_by=size_by_val,
                min_downloads=int(min_dl),
                min_likes=int(min_lk),
                search_query=search
            )
            # Store the filtered dataframe for click handling
            app.current_filtered_df = df
            return fig

        update_btn.click(
            fn=update_plot_and_store,
            inputs=[color_by, size_by, min_downloads, min_likes, search_query],
            outputs=plot
        )

        # Handle plot clicks - Gradio's Plot component supports click events
        plot.select(
            fn=handle_plot_click,
            outputs=model_details
        )

        # Initial plot
        initial_fig, initial_df = app.generate_visualization()
        plot.value = initial_fig
        app.current_filtered_df = initial_df

    return demo
|
| 447 |
+
|
| 448 |
+
|
| 449 |
+
if __name__ == "__main__":
|
| 450 |
+
demo = create_interface()
|
| 451 |
+
demo.launch(share=False, server_name="0.0.0.0", server_port=7860)
|
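The markdown block that `handle_plot_click` assembles can be sanity-checked without launching Gradio. A minimal sketch of the same string-building logic on a plain dict (the model values here are illustrative, not taken from the dataset):

```python
def build_details(model: dict) -> str:
    """Assemble the markdown details block the app shows on click."""
    model_id = model.get("model_id", "Unknown")
    hf_url = f"https://huggingface.co/{model_id}"
    details = f"# {model_id}\n\n"
    details += f"**[View on Hugging Face]({hf_url})**\n\n"
    details += "## Model Information\n\n"
    details += f"- **Library:** {model.get('library_name', 'N/A')}\n"
    details += f"- **Downloads:** {model.get('downloads', 0):,}\n"  # thousands separators
    details += f"- **Likes:** {model.get('likes', 0):,}\n"
    return details

print(build_details({"model_id": "bert-base-uncased", "downloads": 1234567, "likes": 89}))
```

The `:,` format spec is what turns raw counts into readable figures like `1,234,567` in the details panel.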
backend/api.py
ADDED
|
@@ -0,0 +1,194 @@
"""
FastAPI backend for serving model data to React/Visx frontend.
"""
from fastapi import FastAPI, HTTPException, Query
from fastapi.middleware.cors import CORSMiddleware
from typing import Optional, List
import pandas as pd
import numpy as np
import os
from pydantic import BaseModel

from data_loader import ModelDataLoader
from embeddings import ModelEmbedder
from dimensionality_reduction import DimensionReducer

app = FastAPI(title="HF Model Ecosystem API")

# CORS middleware for the React frontend.
# Update allow_origins with your Netlify URL in production.
FRONTEND_URL = os.getenv("FRONTEND_URL", "http://localhost:3000")
app.add_middleware(
    CORSMiddleware,
    allow_origins=[
        "http://localhost:3000",  # Local development
        FRONTEND_URL,             # Production frontend URL
        # Add your Netlify URL here after deployment, e.g.:
        # "https://your-app-name.netlify.app",
    ],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Global state
data_loader = ModelDataLoader()
embedder: Optional[ModelEmbedder] = None
reducer: Optional[DimensionReducer] = None
df: Optional[pd.DataFrame] = None
embeddings: Optional[np.ndarray] = None
reduced_embeddings: Optional[np.ndarray] = None


class FilterParams(BaseModel):
    min_downloads: int = 0
    min_likes: int = 0
    search_query: Optional[str] = None
    libraries: Optional[List[str]] = None
    pipeline_tags: Optional[List[str]] = None


class ModelPoint(BaseModel):
    model_id: str
    x: float
    y: float
    library_name: Optional[str]
    pipeline_tag: Optional[str]
    downloads: int
    likes: int
    trending_score: Optional[float]
    tags: Optional[str]


@app.on_event("startup")
async def startup_event():
    """Initialize data and models on startup."""
    global df, embedder, reducer

    print("Loading data...")
    df = data_loader.load_data(sample_size=10000)
    df = data_loader.preprocess_for_embedding(df)

    print("Initializing embedder...")
    embedder = ModelEmbedder()

    print("Initializing reducer...")
    reducer = DimensionReducer(method="umap", n_components=2)

    print("API ready!")


@app.get("/")
async def root():
    return {"message": "HF Model Ecosystem API", "status": "running"}


@app.get("/api/models", response_model=List[ModelPoint])
async def get_models(
    min_downloads: int = Query(0),
    min_likes: int = Query(0),
    search_query: Optional[str] = Query(None),
    color_by: str = Query("library_name"),
    size_by: str = Query("downloads"),
    max_points: int = Query(5000)
):
    """Get filtered models with 2D coordinates for visualization."""
    global df, embedder, reducer, embeddings, reduced_embeddings

    if df is None:
        raise HTTPException(status_code=503, detail="Data not loaded")

    # Filter data
    filtered_df = data_loader.filter_data(
        df=df,
        min_downloads=min_downloads,
        min_likes=min_likes,
        search_query=search_query,
        libraries=None,  # Can be added as query params
        pipeline_tags=None
    )

    if len(filtered_df) == 0:
        return []

    # Limit points
    if len(filtered_df) > max_points:
        filtered_df = filtered_df.sample(n=max_points, random_state=42)

    # Generate embeddings if needed (over the full dataframe, once)
    if embeddings is None:
        texts = df['combined_text'].tolist()
        embeddings = embedder.generate_embeddings(texts)

    # Reduce dimensions if needed
    if reduced_embeddings is None:
        reduced_embeddings = reducer.fit_transform(embeddings)

    # Get coordinates for filtered data (positional indices into the full array)
    filtered_indices = filtered_df.index.tolist()
    filtered_reduced = reduced_embeddings[filtered_indices]

    # Prepare response
    models = []
    for idx, (i, row) in enumerate(filtered_df.iterrows()):
        models.append(ModelPoint(
            model_id=row.get('model_id', 'Unknown'),
            x=float(filtered_reduced[idx, 0]),
            y=float(filtered_reduced[idx, 1]),
            library_name=row.get('library_name'),
            pipeline_tag=row.get('pipeline_tag'),
            downloads=int(row.get('downloads', 0)),
            likes=int(row.get('likes', 0)),
            trending_score=float(row.get('trendingScore', 0)) if pd.notna(row.get('trendingScore')) else None,
            tags=row.get('tags') if pd.notna(row.get('tags')) else None
        ))

    return models


@app.get("/api/stats")
async def get_stats():
    """Get dataset statistics."""
    if df is None:
        raise HTTPException(status_code=503, detail="Data not loaded")

    return {
        "total_models": len(df),
        "unique_libraries": df['library_name'].nunique() if 'library_name' in df.columns else 0,
        "unique_pipelines": df['pipeline_tag'].nunique() if 'pipeline_tag' in df.columns else 0,
        "avg_downloads": float(df['downloads'].mean()) if 'downloads' in df.columns else 0,
        "avg_likes": float(df['likes'].mean()) if 'likes' in df.columns else 0
    }


# The :path converter lets ids like "org/name" (which contain a slash) match.
@app.get("/api/model/{model_id:path}")
async def get_model_details(model_id: str):
    """Get detailed information about a specific model."""
    if df is None:
        raise HTTPException(status_code=503, detail="Data not loaded")

    model = df[df.get('model_id', '') == model_id]
    if len(model) == 0:
        raise HTTPException(status_code=404, detail="Model not found")

    model = model.iloc[0]
    return {
        "model_id": model.get('model_id'),
        "library_name": model.get('library_name'),
        "pipeline_tag": model.get('pipeline_tag'),
        "downloads": int(model.get('downloads', 0)),
        "likes": int(model.get('likes', 0)),
        "trending_score": float(model.get('trendingScore', 0)) if pd.notna(model.get('trendingScore')) else None,
        "tags": model.get('tags') if pd.notna(model.get('tags')) else None,
        "licenses": model.get('licenses') if pd.notna(model.get('licenses')) else None,
        "parent_model": model.get('parent_model') if pd.notna(model.get('parent_model')) else None
    }


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
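A client hits `GET /api/models` with plain query parameters. A stdlib-only sketch of building that request URL (the defaults mirror the `Query(...)` defaults above; the base URL is whatever host the server binds):

```python
from urllib.parse import urlencode

def models_url(base: str, **params) -> str:
    """Build the query URL a frontend would request from /api/models."""
    query = {"min_downloads": 0, "min_likes": 0, "max_points": 5000}
    # Caller-supplied filters override the defaults; None means "unset"
    query.update({k: v for k, v in params.items() if v is not None})
    return f"{base}/api/models?{urlencode(query)}"

url = models_url("http://localhost:8000", min_downloads=100, search_query="bert")
print(url)
```

The same helper shape works for `/api/stats` and `/api/model/{model_id}`, which take no filter parameters.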
backend/requirements.txt
ADDED
|
@@ -0,0 +1,12 @@
fastapi>=0.104.0
uvicorn[standard]>=0.24.0
pydantic>=2.0.0
pandas>=2.0.0
numpy>=1.24.0
sentence-transformers>=2.2.0
umap-learn>=0.5.4
scikit-learn>=1.3.0
datasets>=2.14.0
huggingface-hub>=0.17.0
tqdm>=4.66.0
data_loader.py
ADDED
|
@@ -0,0 +1,128 @@
"""
Data loading and preprocessing for the Hugging Face model ecosystem dataset.
"""
import pandas as pd
from datasets import load_dataset
from typing import Optional, Dict, List
import numpy as np


class ModelDataLoader:
    """Load and preprocess model data from Hugging Face dataset."""

    def __init__(self, dataset_name: str = "modelbiome/ai_ecosystem_withmodelcards"):
        self.dataset_name = dataset_name
        self.df: Optional[pd.DataFrame] = None

    def load_data(self, sample_size: Optional[int] = None, split: str = "train") -> pd.DataFrame:
        """
        Load dataset from Hugging Face Hub.

        Args:
            sample_size: If provided, randomly sample this many rows
            split: Dataset split to load

        Returns:
            DataFrame with model data
        """
        print(f"Loading dataset {self.dataset_name}...")
        dataset = load_dataset(self.dataset_name, split=split)

        if sample_size and len(dataset) > sample_size:
            print(f"Sampling {sample_size} models from {len(dataset)} total...")
            dataset = dataset.shuffle(seed=42).select(range(sample_size))

        self.df = dataset.to_pandas()
        print(f"Loaded {len(self.df)} models")

        return self.df

    def preprocess_for_embedding(self, df: Optional[pd.DataFrame] = None) -> pd.DataFrame:
        """
        Preprocess data for embedding generation.
        Combines text fields into a single representation.

        Args:
            df: DataFrame to process (uses self.df if None)

        Returns:
            DataFrame with combined text field
        """
        if df is None:
            df = self.df.copy()
        else:
            df = df.copy()

        # Fill NaN values
        text_fields = ['tags', 'pipeline_tag', 'library_name', 'modelCard']
        for field in text_fields:
            if field in df.columns:
                df[field] = df[field].fillna('')

        # Combine text fields for embedding
        df['combined_text'] = (
            df.get('tags', '').astype(str) + ' ' +
            df.get('pipeline_tag', '').astype(str) + ' ' +
            df.get('library_name', '').astype(str) + ' ' +
            df['modelCard'].astype(str).str[:500]  # Limit modelCard to first 500 chars
        )

        return df

    def filter_data(
        self,
        df: Optional[pd.DataFrame] = None,
        min_downloads: Optional[int] = None,
        min_likes: Optional[int] = None,
        libraries: Optional[List[str]] = None,
        pipeline_tags: Optional[List[str]] = None,
        search_query: Optional[str] = None
    ) -> pd.DataFrame:
        """
        Filter dataset based on criteria.

        Args:
            df: DataFrame to filter (uses self.df if None)
            min_downloads: Minimum download count
            min_likes: Minimum like count
            libraries: List of library names to include
            pipeline_tags: List of pipeline tags to include
            search_query: Text search in model_id or tags

        Returns:
            Filtered DataFrame
        """
        if df is None:
            df = self.df.copy()
        else:
            df = df.copy()

        if min_downloads is not None:
            df = df[df.get('downloads', 0) >= min_downloads]

        if min_likes is not None:
            df = df[df.get('likes', 0) >= min_likes]

        if libraries:
            df = df[df.get('library_name', '').isin(libraries)]

        if pipeline_tags:
            df = df[df.get('pipeline_tag', '').isin(pipeline_tags)]

        if search_query:
            query_lower = search_query.lower()
            mask = (
                df.get('model_id', '').astype(str).str.lower().str.contains(query_lower) |
                df.get('tags', '').astype(str).str.lower().str.contains(query_lower)
            )
            df = df[mask]

        return df

    def get_unique_values(self, column: str) -> List[str]:
        """Get unique non-null values from a column."""
        if self.df is None:
            return []
        values = self.df[column].dropna().unique().tolist()
        return sorted([str(v) for v in values if v and str(v) != 'nan'])
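The chained boolean-mask filtering in `filter_data` is easy to see on a toy frame. A sketch assuming pandas is installed (the three rows are made up for illustration):

```python
import pandas as pd

# A toy frame with the columns filter_data touches (illustrative values)
df = pd.DataFrame({
    "model_id": ["org/bert-tiny", "org/llama-ft", "org/vit-base"],
    "downloads": [150, 9000, 30],
    "likes": [2, 40, 1],
    "tags": ["bert nlp", "llama text-generation", "vision"],
})

# Same chained filters as filter_data(min_downloads=100, min_likes=2,
# search_query="bert"): each step narrows the frame in place
out = df[df["downloads"] >= 100]
out = out[out["likes"] >= 2]
query = "bert"
mask = (
    out["model_id"].str.lower().str.contains(query)
    | out["tags"].str.lower().str.contains(query)
)
out = out[mask]
print(out["model_id"].tolist())  # → ['org/bert-tiny']
```

Each filter operates on the already-narrowed frame, so the order of the steps doesn't change the result, only how much work each mask does.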
dimensionality_reduction.py
ADDED
|
@@ -0,0 +1,78 @@
"""
Dimensionality reduction for visualization (UMAP, t-SNE).
"""
import numpy as np
from umap import UMAP
from sklearn.manifold import TSNE
from typing import Optional
import pickle
import os


class DimensionReducer:
    """Reduce high-dimensional embeddings to 2D/3D for visualization."""

    def __init__(self, method: str = "umap", n_components: int = 2):
        """
        Initialize reducer.

        Args:
            method: 'umap' or 'tsne'
            n_components: Number of dimensions (2 or 3)
        """
        self.method = method.lower()
        self.n_components = n_components

        if self.method == "umap":
            self.reducer = UMAP(
                n_components=n_components,
                n_neighbors=15,
                min_dist=0.1,
                metric='cosine',
                random_state=42
            )
        elif self.method == "tsne":
            self.reducer = TSNE(
                n_components=n_components,
                perplexity=30,
                random_state=42,
                n_iter=1000
            )
        else:
            raise ValueError(f"Unknown method: {method}. Use 'umap' or 'tsne'")

    def fit_transform(self, embeddings: np.ndarray) -> np.ndarray:
        """
        Fit reducer and transform embeddings.

        Args:
            embeddings: High-dimensional embeddings (n_samples, embedding_dim)

        Returns:
            Reduced embeddings (n_samples, n_components)
        """
        print(f"Reducing dimensions using {self.method.upper()}...")
        reduced = self.reducer.fit_transform(embeddings)
        print(f"Reduced to {self.n_components}D: shape {reduced.shape}")
        return reduced

    def transform(self, embeddings: np.ndarray) -> np.ndarray:
        """Transform new embeddings (UMAP only; t-SNE doesn't support this)."""
        if self.method == "umap":
            return self.reducer.transform(embeddings)
        else:
            raise ValueError("t-SNE doesn't support transform. Use fit_transform instead.")

    def save_reducer(self, filepath: str):
        """Save fitted reducer to disk."""
        os.makedirs(os.path.dirname(filepath) if os.path.dirname(filepath) else '.', exist_ok=True)
        with open(filepath, 'wb') as f:
            pickle.dump(self.reducer, f)
        print(f"Reducer saved to {filepath}")

    def load_reducer(self, filepath: str):
        """Load fitted reducer from disk."""
        with open(filepath, 'rb') as f:
            self.reducer = pickle.load(f)
        print(f"Reducer loaded from {filepath}")
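`save_reducer`/`load_reducer` are a plain pickle round trip plus a `makedirs` guard. A sketch of the same pattern on a stand-in object, so it runs without fitting UMAP (the dict stands in for the fitted reducer):

```python
import os
import pickle
import tempfile

# Stand-in for a fitted reducer object (illustrative, not a real UMAP)
reducer_state = {"method": "umap", "n_components": 2}

# Path with a not-yet-existing parent directory, as save_reducer allows
path = os.path.join(tempfile.mkdtemp(), "cache", "reducer.pkl")
os.makedirs(os.path.dirname(path), exist_ok=True)  # mirrors save_reducer

with open(path, "wb") as f:
    pickle.dump(reducer_state, f)
with open(path, "rb") as f:
    restored = pickle.load(f)
print(restored)  # → {'method': 'umap', 'n_components': 2}
```

Caching the fitted reducer this way is what lets the app skip the expensive UMAP fit on restart.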
embeddings.py
ADDED
|
@@ -0,0 +1,71 @@
"""
Generate embeddings for models using sentence transformers.
"""
import numpy as np
from sentence_transformers import SentenceTransformer
from typing import List, Optional
import pickle
import os
from tqdm import tqdm


class ModelEmbedder:
    """Generate embeddings for model descriptions."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2", cache_dir: Optional[str] = None):
        """
        Initialize embedder.

        Args:
            model_name: Sentence transformer model name
            cache_dir: Directory to cache embeddings
        """
        self.model_name = model_name
        self.cache_dir = cache_dir
        print(f"Loading embedding model: {model_name}...")
        self.model = SentenceTransformer(model_name)
        print("Embedding model loaded!")

    def generate_embeddings(
        self,
        texts: List[str],
        batch_size: int = 32,
        show_progress: bool = True
    ) -> np.ndarray:
        """
        Generate embeddings for a list of texts.

        Args:
            texts: List of text strings to embed
            batch_size: Batch size for encoding
            show_progress: Whether to show progress bar

        Returns:
            numpy array of embeddings (n_samples, embedding_dim)
        """
        if show_progress:
            print(f"Generating embeddings for {len(texts)} models...")

        embeddings = self.model.encode(
            texts,
            batch_size=batch_size,
            show_progress_bar=show_progress,
            convert_to_numpy=True
        )

        return embeddings

    def save_embeddings(self, embeddings: np.ndarray, filepath: str):
        """Save embeddings to disk."""
        os.makedirs(os.path.dirname(filepath) if os.path.dirname(filepath) else '.', exist_ok=True)
        with open(filepath, 'wb') as f:
            pickle.dump(embeddings, f)
        print(f"Embeddings saved to {filepath}")

    def load_embeddings(self, filepath: str) -> np.ndarray:
        """Load embeddings from disk."""
        with open(filepath, 'rb') as f:
            embeddings = pickle.load(f)
        print(f"Embeddings loaded from {filepath}")
        return embeddings
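`encode` consumes the texts in fixed-size batches controlled by `batch_size`. A pure-Python sketch of that batching (the helper name is ours, not part of sentence-transformers):

```python
def batches(items, batch_size=32):
    """Yield successive fixed-size batches; the final one may be short."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

texts = [f"model {i}" for i in range(70)]
sizes = [len(b) for b in batches(texts, batch_size=32)]
print(sizes)  # → [32, 32, 6]
```

Larger batches trade memory for throughput; 32 is a safe default for the CPU-only Spaces hardware this app targets.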
frontend/.gitignore
ADDED
|
@@ -0,0 +1,24 @@
# See https://help.github.com/articles/ignoring-files/ for more about ignoring files.

# dependencies
/node_modules
/.pnp
.pnp.js

# testing
/coverage

# production
/build

# misc
.DS_Store
.env.local
.env.development.local
.env.test.local
.env.production.local

npm-debug.log*
yarn-debug.log*
yarn-error.log*
frontend/.nvmrc
ADDED
|
@@ -0,0 +1,2 @@
+
18
|
| 2 |
+
|
frontend/_redirects
ADDED
|
@@ -0,0 +1,2 @@
+
/* /index.html 200
|
| 2 |
+
|
frontend/netlify.toml
ADDED
|
@@ -0,0 +1,13 @@
[build]
  publish = "build"
  command = "npm run build"

[build.environment]
  NODE_VERSION = "18"
  REACT_APP_API_URL = "https://your-backend-url.railway.app"

[[redirects]]
  from = "/*"
  to = "/index.html"
  status = 200
frontend/package.json
ADDED
|
@@ -0,0 +1,50 @@
{
  "name": "hf-model-navigator-frontend",
  "version": "1.0.0",
  "description": "React frontend with Visx for HF Model Ecosystem Navigator",
  "private": true,
  "dependencies": {
    "@visx/axis": "^3.0.0",
    "@visx/brush": "^3.0.0",
    "@visx/event": "^3.0.0",
    "@visx/gradient": "^3.0.0",
    "@visx/group": "^3.0.0",
    "@visx/legend": "^3.0.0",
    "@visx/point": "^3.0.0",
    "@visx/scale": "^3.0.0",
    "@visx/shape": "^3.0.0",
    "@visx/tooltip": "^3.0.0",
    "@visx/visx": "^3.0.0",
    "react": "^18.2.0",
    "react-dom": "^18.2.0",
    "react-scripts": "5.0.1",
    "typescript": "^5.0.0",
    "@types/react": "^18.2.0",
    "@types/react-dom": "^18.2.0",
    "axios": "^1.6.0"
  },
  "scripts": {
    "start": "react-scripts start",
    "build": "react-scripts build",
    "test": "react-scripts test",
    "eject": "react-scripts eject"
  },
  "eslintConfig": {
    "extends": [
      "react-app"
    ]
  },
  "browserslist": {
    "production": [
      ">0.2%",
      "not dead",
      "not op_mini all"
    ],
    "development": [
      "last 1 chrome version",
      "last 1 firefox version",
      "last 1 safari version"
    ]
  }
}
frontend/public/_redirects
ADDED
|
@@ -0,0 +1,2 @@
+
/* /index.html 200
|
| 2 |
+
|
frontend/public/index.html
ADDED
|
@@ -0,0 +1,18 @@
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <meta name="theme-color" content="#000000" />
    <meta
      name="description"
      content="Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face. Analysis of 1.86 million models on Hugging Face, revealing fine-tuning lineages and model family structures using evolutionary biology methods."
    />
    <title>Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face</title>
  </head>
  <body>
    <noscript>You need to enable JavaScript to run this app.</noscript>
    <div id="root"></div>
  </body>
</html>
frontend/src/App.css
ADDED
|
@@ -0,0 +1,110 @@
.App {
  font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', 'Roboto', 'Oxygen',
    'Ubuntu', 'Cantarell', 'Fira Sans', 'Droid Sans', 'Helvetica Neue',
    sans-serif;
  -webkit-font-smoothing: antialiased;
  -moz-osx-font-smoothing: grayscale;
}

.App-header {
  background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
  color: white;
  padding: 2rem;
  text-align: center;
}

.App-header h1 {
  margin: 0 0 1rem 0;
  font-size: 2rem;
  font-weight: 600;
}

.App-header p {
  margin: 0;
  opacity: 0.9;
}

.App-header a {
  color: white;
  text-decoration: underline;
  opacity: 0.9;
  transition: opacity 0.2s;
}

.App-header a:hover {
  opacity: 1;
  text-decoration: none;
}

.stats {
  display: flex;
  gap: 2rem;
  justify-content: center;
  margin-top: 1rem;
  font-size: 0.9rem;
}

.main-content {
  display: flex;
  height: calc(100vh - 200px);
}

.sidebar {
  width: 300px;
  padding: 2rem;
  background: #f5f5f5;
  overflow-y: auto;
  border-right: 1px solid #e0e0e0;
}

.sidebar h2 {
  margin-top: 0;
  font-size: 1.5rem;
}

.sidebar label {
  display: block;
  margin-bottom: 1.5rem;
  font-weight: 500;
}

.sidebar input[type="text"],
.sidebar select {
  width: 100%;
  padding: 0.5rem;
  margin-top: 0.5rem;
  border: 1px solid #ccc;
  border-radius: 4px;
  font-size: 0.9rem;
}

.sidebar input[type="range"] {
  width: 100%;
  margin-top: 0.5rem;
}

.visualization {
  flex: 1;
  padding: 2rem;
  display: flex;
  align-items: center;
  justify-content: center;
  background: white;
}

.loading,
.error,
.empty {
  text-align: center;
  padding: 2rem;
  font-size: 1.2rem;
}

.error {
  color: #d32f2f;
}

.empty {
  color: #666;
}
frontend/src/App.tsx
ADDED
|
@@ -0,0 +1,197 @@
/**
 * Main React app component using Visx for visualization.
 */
import React, { useState, useEffect, useCallback } from 'react';
import ScatterPlot from './components/ScatterPlot';
import ModelModal from './components/ModelModal';
import { ModelPoint, Stats } from './types';
import './App.css';

const API_BASE = process.env.REACT_APP_API_URL || 'http://localhost:8000';

function App() {
  const [data, setData] = useState<ModelPoint[]>([]);
  const [stats, setStats] = useState<Stats | null>(null);
  const [loading, setLoading] = useState(true);
  const [error, setError] = useState<string | null>(null);
  const [selectedModel, setSelectedModel] = useState<ModelPoint | null>(null);
  const [isModalOpen, setIsModalOpen] = useState(false);

  // Filters
  const [minDownloads, setMinDownloads] = useState(0);
  const [minLikes, setMinLikes] = useState(0);
  const [searchQuery, setSearchQuery] = useState('');
  const [colorBy, setColorBy] = useState('library_name');
  const [sizeBy, setSizeBy] = useState('downloads');

  // Dimensions
  const [width, setWidth] = useState(window.innerWidth * 0.7);
  const [height, setHeight] = useState(window.innerHeight * 0.7);

  useEffect(() => {
    const handleResize = () => {
      setWidth(window.innerWidth * 0.7);
      setHeight(window.innerHeight * 0.7);
    };
    window.addEventListener('resize', handleResize);
    return () => window.removeEventListener('resize', handleResize);
  }, []);

  const fetchData = useCallback(async () => {
    setLoading(true);
    setError(null);
    try {
      const params = new URLSearchParams({
        min_downloads: minDownloads.toString(),
        min_likes: minLikes.toString(),
        color_by: colorBy,
        size_by: sizeBy,
        max_points: '5000',
      });
      if (searchQuery) {
        params.append('search_query', searchQuery);
      }

      const response = await fetch(`${API_BASE}/api/models?${params}`);
      if (!response.ok) throw new Error('Failed to fetch models');
      const models = await response.json();
      setData(models);
    } catch (err) {
      setError(err instanceof Error ? err.message : 'Unknown error');
    } finally {
      setLoading(false);
    }
  }, [minDownloads, minLikes, searchQuery, colorBy, sizeBy]);

  useEffect(() => {
    fetchData();
  }, [fetchData]);

  useEffect(() => {
    // Fetch stats once
    fetch(`${API_BASE}/api/stats`)
      .then(res => res.json())
      .then(setStats)
      .catch(console.error);
  }, []);

  return (
    <div className="App">
      <header className="App-header">
        <h1>Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face</h1>
        <p style={{ maxWidth: '900px', margin: '0 auto', lineHeight: '1.6' }}>
          Many have observed that the development and deployment of generative machine learning (ML) and artificial intelligence (AI) models follow a distinctive pattern in which pre-trained models are adapted and fine-tuned for specific downstream tasks. However, there is limited empirical work that examines the structure of these interactions. This paper analyzes 1.86 million models on Hugging Face, a leading peer production platform for model development. Our study of model family trees reveals sprawling fine-tuning lineages that vary widely in size and structure. Using an evolutionary biology lens, we measure genetic similarity and mutation of traits over model families.
          {' '}
          <a
            href="https://arxiv.org/abs/2508.06811"
            target="_blank"
            rel="noopener noreferrer"
            style={{ color: 'white', textDecoration: 'underline', fontWeight: '500' }}
          >
            Read the full paper →
          </a>
        </p>
        <p style={{ marginTop: '0.5rem', fontSize: '0.9rem', opacity: 0.9 }}>
          <strong>Authors:</strong> Benjamin Laufer, Hamidah Oderinwale, Jon Kleinberg
        </p>
        {stats && (
          <div className="stats">
            <span>Total Models: {stats.total_models.toLocaleString()}</span>
            <span>Libraries: {stats.unique_libraries}</span>
            <span>Pipelines: {stats.unique_pipelines}</span>
          </div>
        )}
      </header>

      <div className="main-content">
        <aside className="sidebar">
          <h2>Filters</h2>

          <label>
            Search:
            <input
              type="text"
              value={searchQuery}
              onChange={(e) => setSearchQuery(e.target.value)}
              placeholder="Model ID or tags..."
            />
          </label>

          <label>
            Min Downloads: {minDownloads.toLocaleString()}
            <input
              type="range"
              min="0"
              max="1000000"
              step="1000"
              value={minDownloads}
              onChange={(e) => setMinDownloads(Number(e.target.value))}
            />
          </label>

          <label>
            Min Likes: {minLikes.toLocaleString()}
            <input
              type="range"
              min="0"
              max="10000"
              step="10"
              value={minLikes}
              onChange={(e) => setMinLikes(Number(e.target.value))}
            />
          </label>

          <label>
            Color By:
            <select value={colorBy} onChange={(e) => setColorBy(e.target.value)}>
              <option value="library_name">Library</option>
              <option value="pipeline_tag">Pipeline</option>
              <option value="downloads">Downloads</option>
              <option value="likes">Likes</option>
            </select>
          </label>

          <label>
            Size By:
            <select value={sizeBy} onChange={(e) => setSizeBy(e.target.value)}>
              <option value="downloads">Downloads</option>
              <option value="likes">Likes</option>
              <option value="trendingScore">Trending Score</option>
              <option value="none">None</option>
            </select>
          </label>
        </aside>

        <main className="visualization">
          {loading && <div className="loading">Loading models...</div>}
          {error && <div className="error">Error: {error}</div>}
          {!loading && !error && data.length === 0 && (
            <div className="empty">No models match the filters</div>
          )}
          {!loading && !error && data.length > 0 && (
            <ScatterPlot
              width={width}
              height={height}
              data={data}
              colorBy={colorBy}
              sizeBy={sizeBy}
              onPointClick={(model) => {
                setSelectedModel(model);
                setIsModalOpen(true);
              }}
            />
          )}
        </main>

        <ModelModal
          model={selectedModel}
          isOpen={isModalOpen}
          onClose={() => setIsModalOpen(false)}
        />
      </div>
    </div>
  );
}

export default App;
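The `fetchData` callback above serializes the filter state with `URLSearchParams` before hitting `/api/models`. The same construction can be sketched as a standalone helper; `buildModelQuery` is an illustrative name, not an export of the app:

```typescript
// Sketch of the query-string construction used by fetchData in App.tsx.
// buildModelQuery is a hypothetical helper, not part of the codebase.
function buildModelQuery(opts: {
  minDownloads: number;
  minLikes: number;
  colorBy: string;
  sizeBy: string;
  searchQuery?: string;
}): string {
  const params = new URLSearchParams({
    min_downloads: opts.minDownloads.toString(),
    min_likes: opts.minLikes.toString(),
    color_by: opts.colorBy,
    size_by: opts.sizeBy,
    max_points: '5000',
  });
  // search_query is only appended when non-empty, matching the component.
  if (opts.searchQuery) {
    params.append('search_query', opts.searchQuery);
  }
  return params.toString();
}
```

This mirrors why the component appends `search_query` conditionally: sending an empty string would still reach the backend as a filter value.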
frontend/src/components/ModelModal.css
ADDED
@@ -0,0 +1,161 @@
.modal-overlay {
  position: fixed;
  top: 0;
  left: 0;
  right: 0;
  bottom: 0;
  background: rgba(0, 0, 0, 0.7);
  display: flex;
  align-items: center;
  justify-content: center;
  z-index: 1000;
  padding: 2rem;
  animation: fadeIn 0.2s ease-in;
}

@keyframes fadeIn {
  from {
    opacity: 0;
  }
  to {
    opacity: 1;
  }
}

.modal-content {
  background: white;
  border-radius: 12px;
  max-width: 600px;
  width: 100%;
  max-height: 90vh;
  overflow-y: auto;
  padding: 2rem;
  position: relative;
  box-shadow: 0 20px 60px rgba(0, 0, 0, 0.3);
  animation: slideUp 0.3s ease-out;
}

@keyframes slideUp {
  from {
    transform: translateY(20px);
    opacity: 0;
  }
  to {
    transform: translateY(0);
    opacity: 1;
  }
}

.modal-close {
  position: absolute;
  top: 1rem;
  right: 1rem;
  background: none;
  border: none;
  font-size: 2rem;
  line-height: 1;
  cursor: pointer;
  color: #666;
  padding: 0;
  width: 32px;
  height: 32px;
  display: flex;
  align-items: center;
  justify-content: center;
  border-radius: 50%;
  transition: all 0.2s;
}

.modal-close:hover {
  background: #f0f0f0;
  color: #000;
}

.modal-content h2 {
  margin: 0 0 1.5rem 0;
  font-size: 1.5rem;
  color: #333;
  word-break: break-word;
}

.modal-section {
  margin-bottom: 1.5rem;
}

.modal-section:last-child {
  margin-bottom: 0;
}

.modal-section h3 {
  margin: 0 0 0.75rem 0;
  font-size: 1rem;
  font-weight: 600;
  color: #555;
  text-transform: uppercase;
  letter-spacing: 0.5px;
}

.modal-info-grid {
  display: grid;
  grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
  gap: 1rem;
}

.modal-info-item {
  display: flex;
  flex-direction: column;
  gap: 0.25rem;
}

.modal-info-item strong {
  font-size: 0.875rem;
  color: #666;
  font-weight: 500;
}

.modal-info-item span {
  font-size: 1rem;
  color: #333;
  font-weight: 500;
}

.modal-tags {
  margin: 0;
  padding: 0.75rem;
  background: #f5f5f5;
  border-radius: 6px;
  color: #333;
  font-size: 0.9rem;
  line-height: 1.5;
}

.modal-link {
  display: inline-flex;
  align-items: center;
  gap: 0.5rem;
  padding: 0.75rem 1.5rem;
  background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
  color: white;
  text-decoration: none;
  border-radius: 6px;
  font-weight: 500;
  transition: all 0.2s;
  margin-top: 0.5rem;
}

.modal-link:hover {
  transform: translateY(-2px);
  box-shadow: 0 4px 12px rgba(102, 126, 234, 0.4);
}

@media (max-width: 768px) {
  .modal-content {
    padding: 1.5rem;
    max-width: 100%;
  }

  .modal-info-grid {
    grid-template-columns: 1fr;
  }
}
frontend/src/components/ModelModal.tsx
ADDED
@@ -0,0 +1,90 @@
/**
 * Modal component for displaying detailed model information.
 */
import React from 'react';
import { ModelPoint } from '../types';
import './ModelModal.css';

interface ModelModalProps {
  model: ModelPoint | null;
  isOpen: boolean;
  onClose: () => void;
}

export default function ModelModal({ model, isOpen, onClose }: ModelModalProps) {
  if (!isOpen || !model) return null;

  const hfUrl = `https://huggingface.co/${model.model_id}`;

  return (
    <div className="modal-overlay" onClick={onClose}>
      <div className="modal-content" onClick={(e) => e.stopPropagation()}>
        <button className="modal-close" onClick={onClose}>×</button>

        <h2>{model.model_id}</h2>

        <div className="modal-section">
          <h3>Model Information</h3>
          <div className="modal-info-grid">
            <div className="modal-info-item">
              <strong>Library:</strong>
              <span>{model.library_name || 'N/A'}</span>
            </div>
            <div className="modal-info-item">
              <strong>Pipeline Tag:</strong>
              <span>{model.pipeline_tag || 'N/A'}</span>
            </div>
            <div className="modal-info-item">
              <strong>Downloads:</strong>
              <span>{model.downloads.toLocaleString()}</span>
            </div>
            <div className="modal-info-item">
              <strong>Likes:</strong>
              <span>{model.likes.toLocaleString()}</span>
            </div>
            {model.trending_score !== null && (
              <div className="modal-info-item">
                <strong>Trending Score:</strong>
                <span>{model.trending_score.toFixed(2)}</span>
              </div>
            )}
          </div>
        </div>

        {model.tags && (
          <div className="modal-section">
            <h3>Tags</h3>
            <p className="modal-tags">{model.tags}</p>
          </div>
        )}

        <div className="modal-section">
          <h3>Links</h3>
          <a
            href={hfUrl}
            target="_blank"
            rel="noopener noreferrer"
            className="modal-link"
          >
            View on Hugging Face →
          </a>
        </div>

        <div className="modal-section">
          <h3>Position in Latent Space</h3>
          <div className="modal-info-grid">
            <div className="modal-info-item">
              <strong>Dimension 1:</strong>
              <span>{model.x.toFixed(4)}</span>
            </div>
            <div className="modal-info-item">
              <strong>Dimension 2:</strong>
              <span>{model.y.toFixed(4)}</span>
            </div>
          </div>
        </div>
      </div>
    </div>
  );
}
frontend/src/components/ScatterPlot.tsx
ADDED
@@ -0,0 +1,235 @@
/**
 * Visx-based scatter plot component for model visualization.
 * Based on visx gallery examples: https://visx.airbnb.tech/gallery
 */
import React, { useMemo, useCallback } from 'react';
import { Group } from '@visx/group';
import { scaleLinear, scaleOrdinal } from '@visx/scale';
import { AxisBottom, AxisLeft } from '@visx/axis';
import { GridRows, GridColumns } from '@visx/grid';
import { Tooltip, useTooltip } from '@visx/tooltip';
import { LegendOrdinal } from '@visx/legend';
import { ModelPoint } from '../types';

// Using circle elements directly instead of the Point component.
// Color scheme: a predefined categorical palette.
const colorPalette = [
  '#8dd3c7', '#ffffb3', '#bebada', '#fb8072', '#80b1d3',
  '#fdb462', '#b3de69', '#fccde5', '#d9d9d9', '#bc80bd',
  '#ccebc5', '#ffed6f'
];

interface ScatterPlotProps {
  width: number;
  height: number;
  data: ModelPoint[];
  colorBy: string;
  sizeBy: string;
  margin?: { top: number; right: number; bottom: number; left: number };
  onPointClick?: (model: ModelPoint) => void;
}

const defaultMargin = { top: 40, right: 40, bottom: 60, left: 60 };

export default function ScatterPlot({
  width,
  height,
  data,
  colorBy,
  sizeBy,
  margin = defaultMargin,
  onPointClick,
}: ScatterPlotProps) {
  const {
    tooltipData,
    tooltipLeft,
    tooltipTop,
    tooltipOpen,
    showTooltip,
    hideTooltip,
  } = useTooltip<ModelPoint>();

  // Bounds
  const xMax = width - margin.left - margin.right;
  const yMax = height - margin.top - margin.bottom;

  // Scales
  const xScale = useMemo(
    () =>
      scaleLinear<number>({
        domain: [Math.min(...data.map(d => d.x)), Math.max(...data.map(d => d.x))],
        range: [0, xMax],
        nice: true,
      }),
    [data, xMax]
  );

  const yScale = useMemo(
    () =>
      scaleLinear<number>({
        domain: [Math.min(...data.map(d => d.y)), Math.max(...data.map(d => d.y))],
        range: [yMax, 0],
        nice: true,
      }),
    [data, yMax]
  );

  // Color scale
  const getColorValue = (d: ModelPoint) => {
    if (colorBy === 'library_name') return d.library_name || 'Unknown';
    if (colorBy === 'pipeline_tag') return d.pipeline_tag || 'Unknown';
    if (colorBy === 'downloads') return d.downloads;
    if (colorBy === 'likes') return d.likes;
    return 'All';
  };

  const colorValues = useMemo(() => data.map(getColorValue), [data, colorBy]);
  const isCategorical = colorBy === 'library_name' || colorBy === 'pipeline_tag';

  const colorScale = useMemo(() => {
    if (isCategorical) {
      const uniqueValues = Array.from(new Set(colorValues));
      return scaleOrdinal<string, string>({
        domain: uniqueValues,
        range: colorPalette,
      });
    } else {
      // For continuous values, use a linear scale with a color interpolator
      const min = Math.min(...(colorValues as number[]));
      const max = Math.max(...(colorValues as number[]));
      return scaleLinear<number, string>({
        domain: [min, max],
        range: ['#440154', '#fde725'], // Viridis-like colors
      });
    }
  }, [colorValues, isCategorical]);

  // Size scale
  const getSizeValue = (d: ModelPoint) => {
    if (sizeBy === 'downloads') return d.downloads;
    if (sizeBy === 'likes') return d.likes;
    if (sizeBy === 'trendingScore' && d.trending_score) return d.trending_score;
    return 10;
  };

  const sizeValues = useMemo(() => data.map(getSizeValue), [data, sizeBy]);
  const minSize = Math.min(...sizeValues);
  const maxSize = Math.max(...sizeValues);

  const sizeScale = useMemo(
    () =>
      scaleLinear<number>({
        domain: [minSize, maxSize],
        range: [5, 20],
      }),
    [minSize, maxSize]
  );

  // Handle point hover
  const handleMouseOver = useCallback(
    (event: React.MouseEvent, datum: ModelPoint) => {
      const coords = { x: event.clientX, y: event.clientY };
      showTooltip({
        tooltipLeft: coords.x,
        tooltipTop: coords.y,
        tooltipData: datum,
      });
    },
    [showTooltip]
  );

  return (
    <div style={{ position: 'relative' }}>
      <svg width={width} height={height}>
        <Group left={margin.left} top={margin.top}>
          {/* Grid */}
          <GridRows scale={yScale} width={xMax} strokeDasharray="3,3" stroke="#e0e0e0" />
          <GridColumns scale={xScale} height={yMax} strokeDasharray="3,3" stroke="#e0e0e0" />

          {/* Points */}
          {data.map((d, i) => {
            const x = xScale(d.x);
            const y = yScale(d.y);
            const color = isCategorical
              ? colorScale(getColorValue(d) as string)
              : colorScale(getColorValue(d) as number);
            const size = sizeScale(getSizeValue(d));

            return (
              <circle
                key={`point-${i}`}
                cx={x}
                cy={y}
                r={size / 2}
                fill={color}
                opacity={0.7}
                stroke="white"
                strokeWidth={0.5}
                onMouseOver={(e) => handleMouseOver(e, d)}
                onMouseOut={hideTooltip}
                onClick={() => onPointClick && onPointClick(d)}
                style={{ cursor: 'pointer' }}
              />
            );
          })}

          {/* Axes */}
          <AxisBottom
            top={yMax}
            scale={xScale}
            numTicks={5}
            label="Dimension 1"
            stroke="#333"
            tickStroke="#333"
          />
          <AxisLeft
            scale={yScale}
            numTicks={5}
            label="Dimension 2"
            stroke="#333"
            tickStroke="#333"
          />
        </Group>
      </svg>

      {/* Tooltip */}
      {tooltipOpen && tooltipData && (
        <Tooltip
          top={tooltipTop}
          left={tooltipLeft}
          style={{
            backgroundColor: 'rgba(0, 0, 0, 0.9)',
            color: 'white',
            padding: '8px',
            borderRadius: '4px',
            fontSize: '12px',
          }}
        >
          <div>
            <strong>{tooltipData.model_id}</strong>
            <br />
            Library: {tooltipData.library_name || 'N/A'}
            <br />
            Pipeline: {tooltipData.pipeline_tag || 'N/A'}
            <br />
            Downloads: {tooltipData.downloads.toLocaleString()}
            <br />
            Likes: {tooltipData.likes.toLocaleString()}
          </div>
        </Tooltip>
      )}

      {/* Legend */}
      {isCategorical && (
        <div style={{ position: 'absolute', top: 10, right: 10 }}>
          <LegendOrdinal
            scale={colorScale as any}
            labelFormat={(label) => label}
            direction="column"
            style={{ fontSize: '12px' }}
          />
        </div>
      )}
    </div>
  );
}
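The position and size scales in ScatterPlot.tsx come from `@visx/scale`; the mapping they perform is plain linear interpolation from a data domain onto a pixel range. A minimal stand-in (not the visx API; `linearScale` is a hypothetical helper, and it omits visx's `nice` domain rounding) behaves like:

```typescript
// Minimal linear scale: maps [d0, d1] onto [r0, r1], as visx's scaleLinear
// does, minus `nice` rounding. Hypothetical helper, not the visx API.
function linearScale(domain: [number, number], range: [number, number]) {
  const [d0, d1] = domain;
  const [r0, r1] = range;
  return (v: number): number => {
    if (d1 === d0) return (r0 + r1) / 2; // degenerate domain: return midpoint
    const t = (v - d0) / (d1 - d0);      // normalized position in the domain
    return r0 + t * (r1 - r0);           // interpolate into the range
  };
}

// Point radii in the component come from a scale with range [5, 20],
// then halved when set as the circle's `r` attribute.
const demoSizeScale = linearScale([0, 1_000_000], [5, 20]);
```

The same interpolation over a two-color range is what produces the continuous "Viridis-like" coloring when `colorBy` is `downloads` or `likes`.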
frontend/src/index.css
ADDED
@@ -0,0 +1,18 @@
body {
  margin: 0;
  font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', 'Roboto', 'Oxygen',
    'Ubuntu', 'Cantarell', 'Fira Sans', 'Droid Sans', 'Helvetica Neue',
    sans-serif;
  -webkit-font-smoothing: antialiased;
  -moz-osx-font-smoothing: grayscale;
}

code {
  font-family: source-code-pro, Menlo, Monaco, Consolas, 'Courier New',
    monospace;
}

* {
  box-sizing: border-box;
}
frontend/src/index.tsx
ADDED
@@ -0,0 +1,14 @@
import React from 'react';
import ReactDOM from 'react-dom/client';
import './index.css';
import App from './App';

const root = ReactDOM.createRoot(
  document.getElementById('root') as HTMLElement
);
root.render(
  <React.StrictMode>
    <App />
  </React.StrictMode>
);
frontend/src/types.ts
ADDED
@@ -0,0 +1,20 @@
export interface ModelPoint {
  model_id: string;
  x: number;
  y: number;
  library_name: string | null;
  pipeline_tag: string | null;
  downloads: number;
  likes: number;
  trending_score: number | null;
  tags: string | null;
}

export interface Stats {
  total_models: number;
  unique_libraries: number;
  unique_pipelines: number;
  avg_downloads: number;
  avg_likes: number;
}
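The `Stats` shape above is computed server-side by the `/api/stats` endpoint. As a sketch of how those aggregates relate to a list of `ModelPoint`-like records (`computeStats` and the trimmed `PointLike` interface are illustrative; the real numbers come from the backend over the full dataset):

```typescript
// Illustrative client-side aggregation matching the Stats interface.
// computeStats is a hypothetical helper, not part of the codebase.
interface PointLike {
  library_name: string | null;
  pipeline_tag: string | null;
  downloads: number;
  likes: number;
}

function computeStats(points: PointLike[]) {
  // Null library/pipeline values are bucketed as 'Unknown', mirroring
  // how ScatterPlot.tsx treats missing categorical fields.
  const libs = new Set(points.map(p => p.library_name ?? 'Unknown'));
  const pipes = new Set(points.map(p => p.pipeline_tag ?? 'Unknown'));
  const sum = (xs: number[]) => xs.reduce((a, b) => a + b, 0);
  const n = points.length || 1; // avoid division by zero on empty input
  return {
    total_models: points.length,
    unique_libraries: libs.size,
    unique_pipelines: pipes.size,
    avg_downloads: sum(points.map(p => p.downloads)) / n,
    avg_likes: sum(points.map(p => p.likes)) / n,
  };
}
```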
frontend/tsconfig.json
ADDED
@@ -0,0 +1,27 @@
{
  "compilerOptions": {
    "target": "es5",
    "lib": [
      "dom",
      "dom.iterable",
      "esnext"
    ],
    "allowJs": true,
    "skipLibCheck": true,
    "esModuleInterop": true,
    "allowSyntheticDefaultImports": true,
    "strict": true,
    "forceConsistentCasingInFileNames": true,
    "noFallthroughCasesInSwitch": true,
    "module": "esnext",
    "moduleResolution": "node",
    "resolveJsonModule": true,
    "isolatedModules": true,
    "noEmit": true,
    "jsx": "react-jsx"
  },
  "include": [
    "src"
  ]
}
netlify-functions/api.py
ADDED
@@ -0,0 +1,180 @@
+"""
+Netlify Serverless Function for model data API.
+This is a simplified version that works with Netlify Functions.
+"""
+import json
+import os
+import sys
+
+# Add parent directory to path to import modules
+sys.path.append(os.path.join(os.path.dirname(__file__), '..'))
+
+from data_loader import ModelDataLoader
+from embeddings import ModelEmbedder
+from dimensionality_reduction import DimensionReducer
+import pandas as pd
+import numpy as np
+
+# Global state (persists across invocations in serverless)
+data_loader = ModelDataLoader()
+embedder = None
+reducer = None
+df = None
+embeddings = None
+reduced_embeddings = None
+
+
+def handler(event, context):
+    """
+    Netlify serverless function handler.
+    """
+    global embedder, reducer, df, embeddings, reduced_embeddings
+
+    # Parse query parameters
+    query_params = event.get('queryStringParameters') or {}
+    path = event.get('path', '')
+
+    # CORS headers
+    headers = {
+        'Access-Control-Allow-Origin': '*',
+        'Access-Control-Allow-Headers': 'Content-Type',
+        'Access-Control-Allow-Methods': 'GET, OPTIONS',
+        'Content-Type': 'application/json',
+    }
+
+    # Handle OPTIONS (CORS preflight)
+    if event.get('httpMethod') == 'OPTIONS':
+        return {
+            'statusCode': 200,
+            'headers': headers,
+            'body': ''
+        }
+
+    # Initialize data on first request
+    if df is None:
+        try:
+            print("Loading data...")
+            df = data_loader.load_data(sample_size=10000)
+            df = data_loader.preprocess_for_embedding(df)
+            print(f"Loaded {len(df)} models")
+        except Exception as e:
+            return {
+                'statusCode': 500,
+                'headers': headers,
+                'body': json.dumps({'error': f'Failed to load data: {str(e)}'})
+            }
+
+    # Route requests
+    if '/api/models' in path:
+        return get_models(query_params, headers)
+    elif '/api/stats' in path:
+        return get_stats(headers)
+    else:
+        return {
+            'statusCode': 404,
+            'headers': headers,
+            'body': json.dumps({'error': 'Not found'})
+        }
+
+
+def get_models(query_params, headers):
+    """Get filtered models."""
+    global df, embedder, reducer, embeddings, reduced_embeddings
+
+    try:
+        min_downloads = int(query_params.get('min_downloads', 0))
+        min_likes = int(query_params.get('min_likes', 0))
+        search_query = query_params.get('search_query')
+        max_points = int(query_params.get('max_points', 5000))
+
+        # Filter data
+        filtered_df = data_loader.filter_data(
+            df=df,
+            min_downloads=min_downloads,
+            min_likes=min_likes,
+            search_query=search_query
+        )
+
+        if len(filtered_df) == 0:
+            return {
+                'statusCode': 200,
+                'headers': headers,
+                'body': json.dumps([])
+            }
+
+        # Limit points
+        if len(filtered_df) > max_points:
+            filtered_df = filtered_df.sample(n=max_points, random_state=42)
+
+        # Generate embeddings if needed
+        if embedder is None:
+            embedder = ModelEmbedder()
+
+        if embeddings is None:
+            texts = df['combined_text'].tolist()
+            embeddings = embedder.generate_embeddings(texts)
+
+        # Reduce dimensions if needed
+        if reducer is None:
+            reducer = DimensionReducer(method="umap", n_components=2)
+
+        if reduced_embeddings is None:
+            reduced_embeddings = reducer.fit_transform(embeddings)
+
+        # Get coordinates
+        filtered_indices = filtered_df.index.tolist()
+        filtered_reduced = reduced_embeddings[filtered_indices]
+
+        # Prepare response
+        models = []
+        for idx, (i, row) in enumerate(filtered_df.iterrows()):
+            models.append({
+                'model_id': row.get('model_id', 'Unknown'),
+                'x': float(filtered_reduced[idx, 0]),
+                'y': float(filtered_reduced[idx, 1]),
+                'library_name': row.get('library_name'),
+                'pipeline_tag': row.get('pipeline_tag'),
+                'downloads': int(row.get('downloads', 0)),
+                'likes': int(row.get('likes', 0)),
+                'trending_score': float(row.get('trendingScore', 0)) if pd.notna(row.get('trendingScore')) else None,
+                'tags': row.get('tags') if pd.notna(row.get('tags')) else None
+            })
+
+        return {
+            'statusCode': 200,
+            'headers': headers,
+            'body': json.dumps(models)
+        }
+    except Exception as e:
+        return {
+            'statusCode': 500,
+            'headers': headers,
+            'body': json.dumps({'error': str(e)})
+        }
+
+
+def get_stats(headers):
+    """Get dataset statistics."""
+    global df
+
+    if df is None:
+        return {
+            'statusCode': 503,
+            'headers': headers,
+            'body': json.dumps({'error': 'Data not loaded'})
+        }
+
+    stats = {
+        'total_models': len(df),
+        'unique_libraries': df['library_name'].nunique() if 'library_name' in df.columns else 0,
+        'unique_pipelines': df['pipeline_tag'].nunique() if 'pipeline_tag' in df.columns else 0,
+        'avg_downloads': float(df['downloads'].mean()) if 'downloads' in df.columns else 0,
+        'avg_likes': float(df['likes'].mean()) if 'likes' in df.columns else 0
+    }
+
+    return {
+        'statusCode': 200,
+        'headers': headers,
+        'body': json.dumps(stats)
+    }
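The handler's query-string parsing can be exercised without loading any model data by faking a Netlify-style event. This is a minimal sketch: the event shape and the `int()` defaulting mirror the parsing in `handler()`/`get_models()` above, while the specific parameter values are made up for illustration.

```python
# Hypothetical Netlify Functions event, shaped like the one handler() receives.
event = {
    'httpMethod': 'GET',
    'path': '/.netlify/functions/api/models',
    'queryStringParameters': {'min_downloads': '100', 'max_points': '500'},
}

# Same defaulting logic as get_models(): query params arrive as strings
# (and the whole dict may be None), so coerce with int() and fall back.
query_params = event.get('queryStringParameters') or {}
min_downloads = int(query_params.get('min_downloads', 0))
min_likes = int(query_params.get('min_likes', 0))
max_points = int(query_params.get('max_points', 5000))

print(min_downloads, min_likes, max_points)  # → 100 0 500
```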
netlify-functions/models.py
ADDED
@@ -0,0 +1,23 @@
+"""
+Netlify serverless function wrapper for models API.
+This file is the entry point for Netlify Functions.
+"""
+import sys
+import os
+
+# Add parent directories to path
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..'))
+
+from api import handler
+
+# Netlify Functions expects a handler function
+def lambda_handler(event, context):
+    """
+    AWS Lambda/Netlify Functions handler.
+    Converts Netlify event format to our handler format.
+    """
+    # Netlify passes the path in event['path'] and the query params in
+    # event['queryStringParameters'], which is already the format our
+    # handler expects, so the event can be forwarded unchanged.
+    return handler(event, context)
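Stripped of the repo-specific imports, the delegation pattern in `models.py` can be sketched standalone; the `handler` below is a stand-in for the one imported from `api.py`, not the real implementation.

```python
# Stand-in for the shared handler imported from api.py.
def handler(event, context):
    return {'statusCode': 200, 'body': event.get('path', '')}

# Netlify Functions looks for a module-level entry point; the wrapper
# simply forwards the event/context pair unchanged.
def lambda_handler(event, context):
    return handler(event, context)

resp = lambda_handler({'path': '/api/models'}, None)
print(resp['statusCode'], resp['body'])  # → 200 /api/models
```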
netlify-functions/requirements.txt
ADDED
@@ -0,0 +1,9 @@
+pandas>=2.0.0
+numpy>=1.24.0
+sentence-transformers>=2.2.0
+umap-learn>=0.5.4
+scikit-learn>=1.3.0
+datasets>=2.14.0
+huggingface-hub>=0.17.0
+tqdm>=4.66.0
netlify.toml
ADDED
@@ -0,0 +1,27 @@
+[build]
+  base = "frontend"
+  publish = "frontend/build"
+  command = "npm install && npm run build"
+
+[build.environment]
+  NODE_VERSION = "18"
+
+# Redirect all routes to index.html for React Router
+[[redirects]]
+  from = "/*"
+  to = "/index.html"
+  status = 200
+
+# Netlify Functions (if using serverless backend)
+[functions]
+  directory = "netlify-functions"
+  node_bundler = "esbuild"
+
+# Headers for API routes
+[[headers]]
+  for = "/.netlify/functions/*"
+  [headers.values]
+    Access-Control-Allow-Origin = "*"
+    Access-Control-Allow-Headers = "Content-Type"
+    Access-Control-Allow-Methods = "GET, POST, OPTIONS"
requirements.txt
ADDED
@@ -0,0 +1,12 @@
+gradio>=4.0.0
+plotly>=5.18.0
+pandas>=2.0.0
+numpy>=1.24.0
+sentence-transformers>=2.2.0
+umap-learn>=0.5.4
+scikit-learn>=1.3.0
+datasets>=2.14.0
+huggingface-hub>=0.17.0
+tqdm>=4.66.0
+python-dotenv>=1.0.0
test_local.py
ADDED
@@ -0,0 +1,13 @@
+"""
+Quick test script to verify the application works locally.
+Run this before deploying to Hugging Face Spaces.
+"""
+from app import create_interface
+
+if __name__ == "__main__":
+    print("Creating interface...")
+    demo = create_interface()
+    print("Launching demo...")
+    demo.launch(share=False, server_name="127.0.0.1", server_port=7860)