midah committed on
Commit
d5b98e6
·
0 Parent(s):

Initial commit: Hugging Face Model Ecosystem Navigator


- Interactive latent space visualization for 1.86M models
- Plotly + Gradio implementation for Hugging Face Spaces
- React + Visx implementation for custom deployment
- Embedding generation with sentence transformers
- UMAP dimensionality reduction
- Model detail modals with Hugging Face links
- Paper: Anatomy of a Machine Learning Ecosystem (arXiv:2508.06811)

.gitignore ADDED
@@ -0,0 +1,38 @@
+ # Python
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+ env/
+ venv/
+ ENV/
+ .venv
+
+ # Caches
+ *.pkl
+ *.pickle
+ *.cache
+ embeddings_cache.pkl
+ reduced_embeddings_cache.pkl
+ *.npy
+
+ # IDE
+ .vscode/
+ .idea/
+ *.swp
+ *.swo
+ *~
+
+ # OS
+ .DS_Store
+ Thumbs.db
+
+ # Gradio
+ flagged/
+
+ # Data
+ *.parquet
+ *.csv
+ data/
+
.nvmrc ADDED
@@ -0,0 +1,2 @@
+ 18
+
README.md ADDED
@@ -0,0 +1,164 @@
+ # Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face
+
+ **Authors:** Benjamin Laufer, Hamidah Oderinwale, Jon Kleinberg
+
+ **Research Paper:** [arXiv:2508.06811](https://arxiv.org/abs/2508.06811)
+
+ ## Abstract
+
+ Many have observed that the development and deployment of generative machine learning (ML) and artificial intelligence (AI) models follow a distinctive pattern in which pre-trained models are adapted and fine-tuned for specific downstream tasks. However, there is limited empirical work that examines the structure of these interactions. This paper analyzes 1.86 million models on Hugging Face, a leading peer production platform for model development. Our study of model family trees -- networks that connect fine-tuned models to their base or parent -- reveals sprawling fine-tuning lineages that vary widely in size and structure. Using an evolutionary biology lens to study ML models, we use model metadata and model cards to measure the genetic similarity and mutation of traits over model families. We find that models tend to exhibit a family resemblance, meaning their genetic markers and traits exhibit more overlap when they belong to the same model family. However, these similarities depart in certain ways from standard models of asexual reproduction, because mutations are fast and directed, such that two 'sibling' models tend to exhibit more similarity than parent/child pairs. Further analysis of the directional drifts of these mutations reveals qualitative insights about the open machine learning ecosystem: licenses counter-intuitively drift from restrictive, commercial licenses towards permissive or copyleft licenses, often in violation of upstream licenses' terms; models evolve from multi-lingual compatibility towards English-only compatibility; and model cards reduce in length and standardize by turning, more often, to templates and automatically generated text. Overall, this work takes a step toward an empirically grounded understanding of model fine-tuning and suggests that ecological models and methods can yield novel scientific insights.
+
+ ## About This Tool
+
+ This interactive latent space navigator visualizes the ~1.86M models from the [modelbiome/ai_ecosystem_withmodelcards](https://huggingface.co/datasets/modelbiome/ai_ecosystem_withmodelcards) dataset in a 2D space where similar models appear closer together, letting you explore the relationships and family structures described in the paper.
+
+ ## Features
+
+ - **Latent Space Visualization**: 2D embedding visualization showing model relationships
+ - **Interactive Exploration**: Hover, click, and zoom to explore models
+ - **Smart Filtering**: Filter by library, pipeline tag, popularity, and more
+ - **Color & Size Encoding**: Visualize different attributes through color and size
+ - **Caching**: Efficient caching of embeddings and reduced dimensions
+ - **Performance Optimized**: Handles large datasets through smart sampling
+
+ ## Quick Start
+
+ ### Option 1: Plotly + Gradio (Hugging Face Spaces)
+
+ ```bash
+ pip install -r requirements.txt
+ python app.py
+ ```
+
+ ### Option 2: Visx + React (Netlify Deployment)
+
+ For Netlify deployment, deploy the frontend to Netlify and the backend to Railway or Render. Set the `REACT_APP_API_URL` environment variable to your backend URL.
+
+ ## Installation
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ ## Usage
+
+ ### Local Development
+
+ ```bash
+ python app.py
+ ```
+
+ Or use the test script:
+
+ ```bash
+ python test_local.py
+ ```
+
+ The app will:
+ 1. Load a sample of 10,000 models from the dataset
+ 2. Generate embeddings (the first run takes ~2-3 minutes)
+ 3. Reduce dimensions using UMAP
+ 4. Launch a Gradio interface at `http://localhost:7860`
+
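The real embedding step depends on `sentence-transformers` and model downloads; as a dependency-free illustration of what "generate embeddings" means here, the sketch below concatenates a model's metadata fields into one string (the app builds a similar `combined_text` column during preprocessing) and maps it to a fixed-size vector. `toy_embed` is a hypothetical stand-in for the actual embedder, not the code this repo ships:

```python
import hashlib
import math

def toy_embed(text: str, dim: int = 8) -> list:
    """Deterministic stand-in for a sentence-transformer embedding:
    hash character trigrams into a fixed-size vector, then L2-normalize."""
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        trigram = text[i:i + 3]
        bucket = int(hashlib.md5(trigram.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# One model's metadata, flattened into a single string before embedding
record = {"model_id": "org/model-a", "pipeline_tag": "text-classification",
          "library_name": "transformers", "tags": "bert sentiment"}
combined_text = " ".join(str(v) for v in record.values())
embedding = toy_embed(combined_text)
print(len(embedding))  # → 8
```

The production pipeline does the same thing with a 384-dimensional transformer embedding instead of a hashed one.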
+ ### Using the Interface
+
+ 1. **Filters**: Use the left sidebar to filter models by:
+    - Search query (model ID or tags)
+    - Minimum downloads
+    - Minimum likes
+    - Color mapping (library, pipeline, popularity)
+    - Size mapping (downloads, likes, trending score)
+
+ 2. **Exploration**:
+    - Hover over points to see model information
+    - Zoom and pan to explore different regions
+    - Use the legend to understand color coding
+
+ 3. **Understanding the Space**:
+    - Models closer together are more similar
+    - Similarity is based on tags, pipeline type, library, and model card content
+
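"Closer together" reflects similarity of the underlying embedding vectors before the 2D projection, typically measured as cosine similarity. A minimal illustration with made-up three-dimensional vectors:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: two similar text models, one unrelated vision model
text_model_a = [0.9, 0.1, 0.0]
text_model_b = [0.8, 0.2, 0.1]
vision_model = [0.0, 0.1, 0.9]

print(cosine_similarity(text_model_a, text_model_b))  # close to 1.0
print(cosine_similarity(text_model_a, vision_model))  # close to 0.0
```

Points with high pairwise cosine similarity end up near each other after the UMAP projection.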
+ ## Deployment
+
+ ### Hugging Face Spaces
+
+ 1. Create a new Space on Hugging Face
+ 2. Push this repository to the Space
+ 3. Ensure `requirements.txt` and `app.py` are in the root
+ 4. The app will automatically:
+    - Load the dataset from the Hugging Face Hub
+    - Generate embeddings on the first run (cached afterwards)
+    - Serve the interface via Gradio
+
+ **Note**: The first load may take 2-3 minutes for embedding generation. Subsequent loads will be faster due to caching.
+
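Spaces reads its configuration from a YAML front-matter block at the top of `README.md`. A typical header for a Gradio Space looks like the following; the title, emoji, colors, and SDK version here are illustrative placeholders, not values taken from this repo:

```yaml
---
title: Model Ecosystem Navigator
emoji: 🌍
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
---
```

Without this block, the Space cannot tell which SDK to use or which file to run.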
+ ### Netlify (React Frontend)
+
+ 1. Deploy the frontend to Netlify (set the base directory to `frontend`)
+ 2. Deploy the backend to Railway/Render (set the root directory to `backend`)
+ 3. Set the `REACT_APP_API_URL` environment variable in Netlify to your backend URL
+ 4. Update CORS in the backend to include your Netlify URL
+
+ ## Architecture
+
+ ### Current Implementation (Plotly + Gradio)
+
+ - **Data Loading** (`data_loader.py`): Loads the dataset from the Hugging Face Hub, handles filtering and preprocessing
+ - **Embedding Generation** (`embeddings.py`): Creates embeddings from model metadata using sentence transformers
+ - **Dimensionality Reduction** (`dimensionality_reduction.py`): Uses UMAP to reduce embeddings to 2D for visualization
+ - **Main App** (`app.py`): Gradio interface with Plotly visualizations
+
+ ### Alternative Implementation (Visx + React)
+
+ For better performance and customization, see the `frontend/` and `backend/` directories for a React + Visx implementation:
+
+ - **Backend** (`backend/api.py`): FastAPI server serving model data
+ - **Frontend** (`frontend/`): React app with Visx visualizations
+
+ ### Comparison with the Hugging Face Dataset Viewer
+
+ This project uses a different approach than Hugging Face's built-in dataset viewer:
+
+ - **HF Dataset Viewer**: Tabular browser for exploring dataset rows (see [dataset-viewer](https://github.com/huggingface/dataset-viewer))
+ - **This Project**: Latent space visualization showing semantic relationships between models
+
+ The HF viewer is optimized for browsing data structure, while this tool focuses on understanding model relationships through embeddings and spatial visualization.
+
+ ## Design Decisions
+
+ The application uses:
+ - **2D visualization** for simplicity and accessibility
+ - **UMAP** for dimensionality reduction (better global structure than t-SNE)
+ - **Sentence transformers** for efficient embedding generation
+ - **Smart sampling** to maintain interactivity with large datasets
+ - **Caching** to avoid recomputation on filter changes
+
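UMAP itself requires `umap-learn`; for intuition, the reducer's contract ("high-dimensional vectors in, 2D coordinates out") can be sketched with plain PCA via NumPy. This is an illustration of the interface only, not a substitute for UMAP's neighborhood-preserving projection:

```python
import numpy as np

def reduce_to_2d(embeddings: np.ndarray) -> np.ndarray:
    """Project (n_samples, n_features) vectors onto their top-2 principal
    components -- the same shape contract as the UMAP reducer in this repo."""
    centered = embeddings - embeddings.mean(axis=0)
    # SVD of the centered data; rows of vt are the principal directions
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(42)
vectors = rng.normal(size=(100, 384))  # 384 dims, like all-MiniLM-L6-v2 output
coords = reduce_to_2d(vectors)
print(coords.shape)  # → (100, 2)
```

Swapping in `umap.UMAP(n_components=2).fit_transform(vectors)` yields the same shape but preserves local neighborhood structure instead of global variance.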
+ ## Performance Notes
+
+ - **Initial Sample**: 10,000 models (configurable in `app.py`)
+ - **Visualization Limit**: Maximum of 5,000 points for smooth interaction
+ - **Embedding Model**: `all-MiniLM-L6-v2` (a good balance of quality and speed)
+ - **Caching**: Embeddings and reduced dimensions are cached to disk
+
+ ## Requirements
+
+ - Python 3.8+
+ - ~2-4GB RAM for 10K models
+ - Internet connection for the dataset download
+ - Optional: GPU for faster embedding generation (not required)
+
+ ## Citation
+
+ If you use this tool or dataset, please cite:
+
+ ```bibtex
+ @article{laufer2025anatomy,
+   title={Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face},
+   author={Laufer, Benjamin and Oderinwale, Hamidah and Kleinberg, Jon},
+   journal={arXiv preprint arXiv:2508.06811},
+   year={2025},
+   url={https://arxiv.org/abs/2508.06811}
+ }
+ ```
+
+ **Paper**: [arXiv:2508.06811](https://arxiv.org/abs/2508.06811)
app.py ADDED
@@ -0,0 +1,451 @@
+ """
+ Main Gradio application for the Hugging Face Model Ecosystem Navigator.
+ """
+ import gradio as gr
+ import plotly.graph_objects as go
+ import plotly.express as px
+ import pandas as pd
+ import numpy as np
+ from typing import Optional, Tuple
+ import os
+
+ from data_loader import ModelDataLoader
+ from embeddings import ModelEmbedder
+ from dimensionality_reduction import DimensionReducer
+
+
+ class ModelNavigatorApp:
+     """Main application class for the model navigator."""
+
+     def __init__(self):
+         self.data_loader = ModelDataLoader()
+         self.embedder: Optional[ModelEmbedder] = None
+         self.reducer: Optional[DimensionReducer] = None
+         self.df: Optional[pd.DataFrame] = None
+         self.embeddings: Optional[np.ndarray] = None
+         self.reduced_embeddings: Optional[np.ndarray] = None
+         self.current_filtered_df: Optional[pd.DataFrame] = None
+
+     def load_initial_data(self, sample_size: int = 10000):
+         """Load an initial sample of data."""
+         print("Loading initial data...")
+         self.df = self.data_loader.load_data(sample_size=sample_size)
+         self.df = self.data_loader.preprocess_for_embedding(self.df)
+         return f"Loaded {len(self.df)} models"
+
+     def generate_visualization(
+         self,
+         color_by: str = "library_name",
+         size_by: str = "downloads",
+         min_downloads: int = 0,
+         min_likes: int = 0,
+         search_query: str = "",
+         selected_libraries: list = None,
+         selected_pipeline_tags: list = None,
+         use_cache: bool = True
+     ) -> Tuple[go.Figure, pd.DataFrame]:
+         """
+         Generate the interactive visualization.
+
+         Returns:
+             Plotly figure and filtered dataframe
+         """
+         if self.df is None or len(self.df) == 0:
+             return go.Figure(), pd.DataFrame()
+
+         # Filter data
+         filtered_df = self.data_loader.filter_data(
+             df=self.df,
+             min_downloads=min_downloads,
+             min_likes=min_likes,
+             libraries=selected_libraries if selected_libraries else None,
+             pipeline_tags=selected_pipeline_tags if selected_pipeline_tags else None,
+             search_query=search_query if search_query else None
+         )
+
+         if len(filtered_df) == 0:
+             empty_fig = go.Figure()
+             empty_fig.add_annotation(
+                 text="No models match the selected filters",
+                 xref="paper", yref="paper",
+                 x=0.5, y=0.5, showarrow=False
+             )
+             return empty_fig, filtered_df
+
+         # Limit to a reasonable size for performance
+         max_points = 5000
+         if len(filtered_df) > max_points:
+             filtered_df = filtered_df.sample(n=max_points, random_state=42)
+             print(f"Sampled {max_points} models for visualization")
+
+         # Get indices for the filtered data
+         filtered_indices = filtered_df.index.tolist()
+
+         # Generate or load embeddings
+         cache_file = "embeddings_cache.pkl"
+         if use_cache and os.path.exists(cache_file) and self.embeddings is None:
+             try:
+                 if self.embedder is None:
+                     self.embedder = ModelEmbedder()
+                 self.embeddings = self.embedder.load_embeddings(cache_file)
+             except Exception as e:
+                 print(f"Could not load cached embeddings: {e}")
+
+         if self.embeddings is None:
+             if self.embedder is None:
+                 self.embedder = ModelEmbedder()
+
+             # Generate embeddings for all data
+             texts = self.df['combined_text'].tolist()
+             self.embeddings = self.embedder.generate_embeddings(texts)
+
+             if use_cache:
+                 self.embedder.save_embeddings(self.embeddings, cache_file)
+
+         # Reduce dimensions (fit on the full embedding matrix, then slice)
+         if self.reducer is None:
+             self.reducer = DimensionReducer(method="umap", n_components=2)
+
+         reduced_cache_file = "reduced_embeddings_cache.npy"
+         if use_cache and os.path.exists(reduced_cache_file):
+             try:
+                 self.reduced_embeddings = np.load(reduced_cache_file, allow_pickle=True)
+                 if len(self.reduced_embeddings) != len(self.df):
+                     self.reduced_embeddings = None
+             except Exception as e:
+                 print(f"Could not load cached reduced embeddings: {e}")
+
+         if self.reduced_embeddings is None or len(self.reduced_embeddings) != len(self.df):
+             self.reduced_embeddings = self.reducer.fit_transform(self.embeddings)
+             if use_cache:
+                 np.save(reduced_cache_file, self.reduced_embeddings)
+
+         filtered_reduced = self.reduced_embeddings[filtered_indices]
+
+         # Prepare data for plotting
+         plot_df = filtered_df.copy()
+         plot_df['x'] = filtered_reduced[:, 0]
+         plot_df['y'] = filtered_reduced[:, 1]
+
+         # Color mapping
+         if color_by in plot_df.columns:
+             color_values = plot_df[color_by].fillna('Unknown')
+         else:
+             color_values = pd.Series(['All Models'] * len(plot_df))
+
+         # Size mapping
+         if size_by and size_by != "None" and size_by in plot_df.columns:
+             size_values = plot_df[size_by].fillna(0)
+             # Normalize sizes into the 5-20 px range
+             if size_values.max() > 0:
+                 size_values = 5 + 15 * (size_values / size_values.max())
+             else:
+                 size_values = pd.Series([10] * len(plot_df))
+         else:
+             size_values = pd.Series([10] * len(plot_df))
+
+         # Create hover text
+         hover_texts = []
+         for idx, row in plot_df.iterrows():
+             hover = f"<b>{row.get('model_id', 'Unknown')}</b><br>"
+             hover += f"Library: {row.get('library_name', 'N/A')}<br>"
+             hover += f"Pipeline: {row.get('pipeline_tag', 'N/A')}<br>"
+             hover += f"Downloads: {row.get('downloads', 0):,}<br>"
+             hover += f"Likes: {row.get('likes', 0):,}"
+             hover_texts.append(hover)
+
+         # Create the plotly figure
+         fig = go.Figure()
+
+         # Group by color if categorical
+         is_categorical = len(color_values) > 0 and isinstance(color_values.iloc[0], str)
+
+         if is_categorical and color_by in plot_df.columns:
+             unique_colors = color_values.unique()
+             colors = px.colors.qualitative.Set3 + px.colors.qualitative.Pastel
+             color_map = {val: colors[i % len(colors)] for i, val in enumerate(unique_colors)}
+
+             for color_val in unique_colors:
+                 mask = color_values == color_val
+                 subset_df = plot_df[mask]
+                 subset_hover = [hover_texts[i] for i, m in enumerate(mask) if m]
+                 subset_sizes = size_values[mask]
+
+                 # Create customdata with model IDs for click handling
+                 subset_customdata = [
+                     [int(idx), str(row.get('model_id', 'Unknown'))]
+                     for idx, row in subset_df.iterrows()
+                 ]
+
+                 fig.add_trace(go.Scatter(
+                     x=subset_df['x'],
+                     y=subset_df['y'],
+                     mode='markers',
+                     name=str(color_val)[:30],  # Truncate long names
+                     marker=dict(
+                         size=subset_sizes.values,
+                         color=color_map[color_val],
+                         opacity=0.7,
+                         line=dict(width=0.5, color='white')
+                     ),
+                     text=subset_df['model_id'].tolist(),
+                     hovertext=subset_hover,
+                     customdata=subset_customdata,
+                     hovertemplate='%{hovertext}<br>Click for details<extra></extra>',
+                     showlegend=True
+                 ))
+         else:
+             # Continuous color scale
+             customdata = [
+                 [int(idx), str(row.get('model_id', 'Unknown'))]
+                 for idx, row in plot_df.iterrows()
+             ]
+
+             fig.add_trace(go.Scatter(
+                 x=plot_df['x'],
+                 y=plot_df['y'],
+                 mode='markers',
+                 marker=dict(
+                     size=size_values.values,
+                     color=color_values.values,
+                     colorscale='Viridis',
+                     opacity=0.7,
+                     line=dict(width=0.5, color='white'),
+                     colorbar=dict(title=color_by)
+                 ),
+                 text=plot_df['model_id'].tolist(),
+                 hovertext=hover_texts,
+                 customdata=customdata,
+                 hovertemplate='%{hovertext}<br>Click for details<extra></extra>',
+                 showlegend=False
+             ))
+
+         # Update layout
+         fig.update_layout(
+             title={
+                 'text': f'Model Latent Space Navigator ({len(plot_df)} models)',
+                 'x': 0.5,
+                 'xanchor': 'center'
+             },
+             xaxis_title="Dimension 1",
+             yaxis_title="Dimension 2",
+             hovermode='closest',
+             template='plotly_white',
+             height=700,
+             clickmode='event+select'
+         )
+
+         return fig, filtered_df
+
+     def get_model_details(self, model_id: str) -> str:
+         """Get detailed information about a model."""
+         if self.df is None:
+             return "No data loaded"
+
+         model = self.df[self.df.get('model_id', '') == model_id]
+         if len(model) == 0:
+             return f"Model '{model_id}' not found"
+
+         model = model.iloc[0]
+
+         details = f"# {model.get('model_id', 'Unknown')}\n\n"
+         details += f"**Library:** {model.get('library_name', 'N/A')}\n\n"
+         details += f"**Pipeline Tag:** {model.get('pipeline_tag', 'N/A')}\n\n"
+         details += f"**Downloads:** {model.get('downloads', 0):,}\n\n"
+         details += f"**Likes:** {model.get('likes', 0):,}\n\n"
+         details += f"**Trending Score:** {model.get('trendingScore', 'N/A')}\n\n"
+
+         if pd.notna(model.get('tags')):
+             details += f"**Tags:** {model.get('tags', '')}\n\n"
+
+         if pd.notna(model.get('licenses')):
+             details += f"**License:** {model.get('licenses', '')}\n\n"
+
+         if pd.notna(model.get('parent_model')):
+             details += f"**Parent Model:** {model.get('parent_model', '')}\n\n"
+
+         return details
+
+
+ def create_interface():
+     """Create the Gradio interface."""
+     app = ModelNavigatorApp()
+
+     # Load initial data
+     status = app.load_initial_data(sample_size=10000)
+     print(status)
+
+     with gr.Blocks(title="Anatomy of a Machine Learning Ecosystem", theme=gr.themes.Soft()) as demo:
+         gr.Markdown("""
+ # Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face
+
+ **Authors:** Benjamin Laufer, Hamidah Oderinwale, Jon Kleinberg
+
+ Many have observed that the development and deployment of generative machine learning (ML) and artificial intelligence (AI) models follow a distinctive pattern in which pre-trained models are adapted and fine-tuned for specific downstream tasks. However, there is limited empirical work that examines the structure of these interactions. This paper analyzes 1.86 million models on Hugging Face, a leading peer production platform for model development. Our study of model family trees -- networks that connect fine-tuned models to their base or parent -- reveals sprawling fine-tuning lineages that vary widely in size and structure. Using an evolutionary biology lens to study ML models, we use model metadata and model cards to measure the genetic similarity and mutation of traits over model families.
+
+ **Read the full paper**: [arXiv:2508.06811](https://arxiv.org/abs/2508.06811)
+
+ ---
+
+ **How to use this navigator:**
+ - Adjust filters to explore different subsets of models
+ - Hover over points to see model information
+ - Use color and size options to highlight different attributes
+ - Similar models appear closer together in the latent space
+ - Models are positioned based on their similarity (tags, pipeline, library, and model card content)
+ """)
+
+         with gr.Row():
+             with gr.Column(scale=1):
+                 gr.Markdown("### Filters")
+
+                 search_query = gr.Textbox(
+                     label="Search Models",
+                     placeholder="Search by model ID or tags...",
+                     value=""
+                 )
+
+                 min_downloads = gr.Slider(
+                     label="Min Downloads",
+                     minimum=0,
+                     maximum=1000000,
+                     value=0,
+                     step=1000
+                 )
+
+                 min_likes = gr.Slider(
+                     label="Min Likes",
+                     minimum=0,
+                     maximum=10000,
+                     value=0,
+                     step=10
+                 )
+
+                 color_by = gr.Dropdown(
+                     label="Color By",
+                     choices=["library_name", "pipeline_tag", "downloads", "likes"],
+                     value="library_name"
+                 )
+
+                 size_by = gr.Dropdown(
+                     label="Size By",
+                     choices=["downloads", "likes", "trendingScore", "None"],
+                     value="downloads"
+                 )
+
+                 update_btn = gr.Button("Update Visualization", variant="primary")
+
+             with gr.Column(scale=3):
+                 plot = gr.Plot(label="Model Latent Space")
+                 model_details = gr.Markdown(
+                     value="**Instructions:** Use the filters above to explore models. Hover over points to see details, **click on a point** to view full model information and a link to Hugging Face.",
+                     label="Model Details"
+                 )
+
+         def handle_plot_click(evt: gr.SelectData):
+             """Handle a plot click and show model details."""
+             if evt is None or app.df is None:
+                 return "**Click on a model point to see details**"
+
+             try:
+                 # Get the point index from the click event
+                 point_idx = evt.index
+                 if point_idx is None:
+                     return "**Click on a model point to see details**"
+
+                 # Get the current filtered dataframe
+                 if app.current_filtered_df is not None and len(app.current_filtered_df) > 0:
+                     filtered_df = app.current_filtered_df
+                 else:
+                     # Fallback: use the full dataframe
+                     filtered_df = app.df
+
+                 # Limit to max_points if needed
+                 if len(filtered_df) > 5000:
+                     filtered_df = filtered_df.sample(n=5000, random_state=42)
+
+                 if point_idx < len(filtered_df):
+                     model_row = filtered_df.iloc[point_idx]
+                     model_id = model_row.get('model_id', 'Unknown')
+
+                     # Get full model details from the original dataframe
+                     model = app.df[app.df.get('model_id', '') == model_id]
+                     if len(model) == 0:
+                         return f"**Model not found:** {model_id}"
+
+                     model = model.iloc[0]
+                     hf_url = f"https://huggingface.co/{model_id}"
+
+                     details = f"""# {model_id}
+
+ **[View on Hugging Face]({hf_url})**
+
+ ## Model Information
+
+ - **Library:** {model.get('library_name', 'N/A')}
+ - **Pipeline Tag:** {model.get('pipeline_tag', 'N/A')}
+ - **Downloads:** {model.get('downloads', 0):,}
+ - **Likes:** {model.get('likes', 0):,}
+ """
+                     if pd.notna(model.get('trendingScore')):
+                         details += f"- **Trending Score:** {model.get('trendingScore', 0):.2f}\n\n"
+                     else:
+                         details += "\n"
+
+                     if pd.notna(model.get('tags')):
+                         details += f"**Tags:** {model.get('tags', '')}\n\n"
+                     if pd.notna(model.get('licenses')):
+                         details += f"**License:** {model.get('licenses', '')}\n\n"
+                     if pd.notna(model.get('parent_model')):
+                         details += f"**Parent Model:** {model.get('parent_model', '')}\n\n"
+
+                     return details
+                 else:
+                     return f"**Point index out of range:** {point_idx}"
+             except Exception as e:
+                 import traceback
+                 return f"**Error loading model details:**\n```\n{str(e)}\n{traceback.format_exc()}\n```"
+
+         def update_plot_and_store(color_by_val, size_by_val, min_dl, min_lk, search):
+             fig, df = app.generate_visualization(
+                 color_by=color_by_val,
+                 size_by=size_by_val,
+                 min_downloads=int(min_dl),
+                 min_likes=int(min_lk),
+                 search_query=search
+             )
+             # Store the filtered dataframe for click handling
+             app.current_filtered_df = df
+             return fig
+
+         update_btn.click(
+             fn=update_plot_and_store,
+             inputs=[color_by, size_by, min_downloads, min_likes, search_query],
+             outputs=plot
+         )
+
+         # Handle plot clicks - Gradio's Plot component supports select events
+         plot.select(
+             fn=handle_plot_click,
+             outputs=model_details
+         )
+
+         # Initial plot
+         initial_fig, initial_df = app.generate_visualization()
+         plot.value = initial_fig
+         app.current_filtered_df = initial_df
+
+     return demo
+
+
+ if __name__ == "__main__":
+     demo = create_interface()
+     demo.launch(share=False, server_name="0.0.0.0", server_port=7860)
backend/api.py ADDED
@@ -0,0 +1,194 @@
1
+ """
2
+ FastAPI backend for serving model data to React/Visx frontend.
3
+ """
4
+ from fastapi import FastAPI, HTTPException, Query
5
+ from fastapi.middleware.cors import CORSMiddleware
6
+ from typing import Optional, List
7
+ import pandas as pd
8
+ import numpy as np
9
+ import os
10
+ from pydantic import BaseModel
11
+
12
+ from data_loader import ModelDataLoader
13
+ from embeddings import ModelEmbedder
14
+ from dimensionality_reduction import DimensionReducer
15
+
16
+ app = FastAPI(title="HF Model Ecosystem API")
17
+
18
+ # CORS middleware for React frontend
19
+ # Update allow_origins with your Netlify URL in production
20
+ # Note: Add your specific Netlify URL after deployment
21
+ FRONTEND_URL = os.getenv("FRONTEND_URL", "http://localhost:3000")
22
+ app.add_middleware(
23
+ CORSMiddleware,
24
+ allow_origins=[
25
+ "http://localhost:3000", # Local development
26
+ FRONTEND_URL, # Production frontend URL
27
+ # Add your Netlify URL here after deployment, e.g.:
28
+ # "https://your-app-name.netlify.app",
29
+ ],
30
+ allow_credentials=True,
31
+ allow_methods=["*"],
32
+ allow_headers=["*"],
33
+ )
34
+
35
+ # Global state
36
+ data_loader = ModelDataLoader()
37
+ embedder: Optional[ModelEmbedder] = None
38
+ reducer: Optional[DimensionReducer] = None
39
+ df: Optional[pd.DataFrame] = None
40
+ embeddings: Optional[np.ndarray] = None
41
+ reduced_embeddings: Optional[np.ndarray] = None
42
+
43
+
44
+ class FilterParams(BaseModel):
45
+ min_downloads: int = 0
46
+ min_likes: int = 0
47
+ search_query: Optional[str] = None
48
+ libraries: Optional[List[str]] = None
49
+ pipeline_tags: Optional[List[str]] = None
50
+
51
+
52
+ class ModelPoint(BaseModel):
53
+ model_id: str
54
+ x: float
55
+ y: float
56
+ library_name: Optional[str]
57
+ pipeline_tag: Optional[str]
58
+ downloads: int
59
+ likes: int
60
+ trending_score: Optional[float]
61
+ tags: Optional[str]
62
+
63
+
64
+ @app.on_event("startup")
65
+ async def startup_event():
66
+ """Initialize data and models on startup."""
67
+ global df, embedder, reducer
68
+
69
+ print("Loading data...")
70
+ df = data_loader.load_data(sample_size=10000)
71
+ df = data_loader.preprocess_for_embedding(df)
72
+
73
+ print("Initializing embedder...")
74
+ embedder = ModelEmbedder()
75
+
76
+ print("Initializing reducer...")
77
+ reducer = DimensionReducer(method="umap", n_components=2)
78
+
79
+ print("API ready!")
80
+
81
+
82
+ @app.get("/")
83
+ async def root():
84
+ return {"message": "HF Model Ecosystem API", "status": "running"}
85
+
86
+
87
+ @app.get("/api/models", response_model=List[ModelPoint])
88
+ async def get_models(
89
+ min_downloads: int = Query(0),
90
+ min_likes: int = Query(0),
91
+ search_query: Optional[str] = Query(None),
92
+ color_by: str = Query("library_name"),
93
+ size_by: str = Query("downloads"),
94
+ max_points: int = Query(5000)
95
+ ):
96
+ """
97
+     Get filtered models with 2D coordinates for visualization.
+     """
+     global df, embedder, reducer, embeddings, reduced_embeddings
+
+     if df is None:
+         raise HTTPException(status_code=503, detail="Data not loaded")
+
+     # Filter data
+     filtered_df = data_loader.filter_data(
+         df=df,
+         min_downloads=min_downloads,
+         min_likes=min_likes,
+         search_query=search_query,
+         libraries=None,  # Can be added as query params
+         pipeline_tags=None
+     )
+
+     if len(filtered_df) == 0:
+         return []
+
+     # Limit points
+     if len(filtered_df) > max_points:
+         filtered_df = filtered_df.sample(n=max_points, random_state=42)
+
+     # Generate embeddings if needed (over the full df so indices stay aligned)
+     if embeddings is None:
+         texts = df['combined_text'].tolist()
+         embeddings = embedder.generate_embeddings(texts)
+
+     # Reduce dimensions if needed
+     if reduced_embeddings is None:
+         reduced_embeddings = reducer.fit_transform(embeddings)
+
+     # Get coordinates for filtered data (positions match the full df's RangeIndex)
+     filtered_indices = filtered_df.index.tolist()
+     filtered_reduced = reduced_embeddings[filtered_indices]
+
+     # Prepare response
+     models = []
+     for idx, (_, row) in enumerate(filtered_df.iterrows()):
+         models.append(ModelPoint(
+             model_id=row.get('model_id', 'Unknown'),
+             x=float(filtered_reduced[idx, 0]),
+             y=float(filtered_reduced[idx, 1]),
+             library_name=row.get('library_name'),
+             pipeline_tag=row.get('pipeline_tag'),
+             downloads=int(row.get('downloads', 0)),
+             likes=int(row.get('likes', 0)),
+             trending_score=float(row.get('trendingScore', 0)) if pd.notna(row.get('trendingScore')) else None,
+             tags=row.get('tags') if pd.notna(row.get('tags')) else None
+         ))
+
+     return models
+
+
+ @app.get("/api/stats")
+ async def get_stats():
+     """Get dataset statistics."""
+     if df is None:
+         raise HTTPException(status_code=503, detail="Data not loaded")
+
+     return {
+         "total_models": len(df),
+         "unique_libraries": df['library_name'].nunique() if 'library_name' in df.columns else 0,
+         "unique_pipelines": df['pipeline_tag'].nunique() if 'pipeline_tag' in df.columns else 0,
+         "avg_downloads": float(df['downloads'].mean()) if 'downloads' in df.columns else 0,
+         "avg_likes": float(df['likes'].mean()) if 'likes' in df.columns else 0
+     }
+
+
+ @app.get("/api/model/{model_id}")
+ async def get_model_details(model_id: str):
+     """Get detailed information about a specific model."""
+     if df is None or 'model_id' not in df.columns:
+         raise HTTPException(status_code=503, detail="Data not loaded")
+
+     model = df[df['model_id'] == model_id]
+     if len(model) == 0:
+         raise HTTPException(status_code=404, detail="Model not found")
+
+     model = model.iloc[0]
+     return {
+         "model_id": model.get('model_id'),
+         "library_name": model.get('library_name'),
+         "pipeline_tag": model.get('pipeline_tag'),
+         "downloads": int(model.get('downloads', 0)),
+         "likes": int(model.get('likes', 0)),
+         "trending_score": float(model.get('trendingScore', 0)) if pd.notna(model.get('trendingScore')) else None,
+         "tags": model.get('tags') if pd.notna(model.get('tags')) else None,
+         "licenses": model.get('licenses') if pd.notna(model.get('licenses')) else None,
+         "parent_model": model.get('parent_model') if pd.notna(model.get('parent_model')) else None
+     }
+
+
+ if __name__ == "__main__":
+     import uvicorn
+     uvicorn.run(app, host="0.0.0.0", port=8000)
+
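The `/api/models` filters map one-to-one onto query-string parameters. As a minimal sketch of the request the React sidebar issues (the snippet only builds the URL; it assumes the backend is reachable at the frontend's default `http://localhost:8000`, and the filter values here are invented):

```python
from urllib.parse import urlencode

API_BASE = "http://localhost:8000"  # matches the frontend's default REACT_APP_API_URL

# Example filter values (hypothetical) matching the /api/models query params
params = {
    "min_downloads": 1000,
    "min_likes": 10,
    "search_query": "llama",
    "max_points": 5000,
}
url = f"{API_BASE}/api/models?{urlencode(params)}"
print(url)
# http://localhost:8000/api/models?min_downloads=1000&min_likes=10&search_query=llama&max_points=5000
```

With a running server, `requests.get(url).json()` would return the list of `ModelPoint` dicts described above.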
backend/requirements.txt ADDED
@@ -0,0 +1,12 @@
+ fastapi>=0.104.0
+ uvicorn[standard]>=0.24.0
+ pydantic>=2.0.0
+ pandas>=2.0.0
+ numpy>=1.24.0
+ sentence-transformers>=2.2.0
+ umap-learn>=0.5.4
+ scikit-learn>=1.3.0
+ datasets>=2.14.0
+ huggingface-hub>=0.17.0
+ tqdm>=4.66.0
+
data_loader.py ADDED
@@ -0,0 +1,128 @@
+ """
+ Data loading and preprocessing for the Hugging Face model ecosystem dataset.
+ """
+ import pandas as pd
+ from datasets import load_dataset
+ from typing import Optional, List
+
+
+ class ModelDataLoader:
+     """Load and preprocess model data from the Hugging Face dataset."""
+
+     def __init__(self, dataset_name: str = "modelbiome/ai_ecosystem_withmodelcards"):
+         self.dataset_name = dataset_name
+         self.df: Optional[pd.DataFrame] = None
+
+     def load_data(self, sample_size: Optional[int] = None, split: str = "train") -> pd.DataFrame:
+         """
+         Load dataset from the Hugging Face Hub.
+
+         Args:
+             sample_size: If provided, randomly sample this many rows
+             split: Dataset split to load
+
+         Returns:
+             DataFrame with model data
+         """
+         print(f"Loading dataset {self.dataset_name}...")
+         dataset = load_dataset(self.dataset_name, split=split)
+
+         if sample_size and len(dataset) > sample_size:
+             print(f"Sampling {sample_size} models from {len(dataset)} total...")
+             dataset = dataset.shuffle(seed=42).select(range(sample_size))
+
+         self.df = dataset.to_pandas()
+         print(f"Loaded {len(self.df)} models")
+
+         return self.df
+
+     def preprocess_for_embedding(self, df: Optional[pd.DataFrame] = None) -> pd.DataFrame:
+         """
+         Preprocess data for embedding generation.
+         Combines text fields into a single representation.
+
+         Args:
+             df: DataFrame to process (uses self.df if None)
+
+         Returns:
+             DataFrame with combined text field
+         """
+         df = self.df.copy() if df is None else df.copy()
+
+         # Fill NaN values; create missing columns so the concatenation below is safe
+         text_fields = ['tags', 'pipeline_tag', 'library_name', 'modelCard']
+         for field in text_fields:
+             if field in df.columns:
+                 df[field] = df[field].fillna('')
+             else:
+                 df[field] = ''
+
+         # Combine text fields for embedding
+         df['combined_text'] = (
+             df['tags'].astype(str) + ' ' +
+             df['pipeline_tag'].astype(str) + ' ' +
+             df['library_name'].astype(str) + ' ' +
+             df['modelCard'].astype(str).str[:500]  # Limit modelCard to first 500 chars
+         )
+
+         return df
+
+     def filter_data(
+         self,
+         df: Optional[pd.DataFrame] = None,
+         min_downloads: Optional[int] = None,
+         min_likes: Optional[int] = None,
+         libraries: Optional[List[str]] = None,
+         pipeline_tags: Optional[List[str]] = None,
+         search_query: Optional[str] = None
+     ) -> pd.DataFrame:
+         """
+         Filter dataset based on criteria.
+
+         Args:
+             df: DataFrame to filter (uses self.df if None)
+             min_downloads: Minimum download count
+             min_likes: Minimum like count
+             libraries: List of library names to include
+             pipeline_tags: List of pipeline tags to include
+             search_query: Text search in model_id or tags
+
+         Returns:
+             Filtered DataFrame
+         """
+         df = self.df.copy() if df is None else df.copy()
+
+         if min_downloads is not None and 'downloads' in df.columns:
+             df = df[df['downloads'].fillna(0) >= min_downloads]
+
+         if min_likes is not None and 'likes' in df.columns:
+             df = df[df['likes'].fillna(0) >= min_likes]
+
+         if libraries and 'library_name' in df.columns:
+             df = df[df['library_name'].isin(libraries)]
+
+         if pipeline_tags and 'pipeline_tag' in df.columns:
+             df = df[df['pipeline_tag'].isin(pipeline_tags)]
+
+         if search_query:
+             query_lower = search_query.lower()
+             mask = pd.Series(False, index=df.index)
+             for col in ('model_id', 'tags'):
+                 if col in df.columns:
+                     mask |= df[col].astype(str).str.lower().str.contains(query_lower, regex=False)
+             df = df[mask]
+
+         return df
+
+     def get_unique_values(self, column: str) -> List[str]:
+         """Get unique non-null values from a column."""
+         if self.df is None or column not in self.df.columns:
+             return []
+         values = self.df[column].dropna().unique().tolist()
+         return sorted([str(v) for v in values if v and str(v) != 'nan'])
+
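The preprocessing step above just concatenates the metadata fields into one string per model, truncating the model card to its first 500 characters. A minimal sketch of the same logic on a toy frame (the two rows and their values are invented for illustration; the real loader pulls `modelbiome/ai_ecosystem_withmodelcards`):

```python
import pandas as pd

# Toy frame mimicking a few dataset columns (values are made up)
df = pd.DataFrame({
    'model_id': ['org/bert-tiny', 'org/gpt-mini'],
    'tags': ['nlp fill-mask', None],
    'pipeline_tag': ['fill-mask', 'text-generation'],
    'library_name': ['transformers', None],
    'modelCard': ['A tiny BERT.' * 100, None],  # 1200-char card
})

# Same recipe as preprocess_for_embedding: fill NaNs, concatenate,
# cap the model card at 500 characters
for field in ['tags', 'pipeline_tag', 'library_name', 'modelCard']:
    df[field] = df[field].fillna('')
df['combined_text'] = (
    df['tags'].astype(str) + ' ' +
    df['pipeline_tag'].astype(str) + ' ' +
    df['library_name'].astype(str) + ' ' +
    df['modelCard'].astype(str).str[:500]
)

print(len(df.loc[0, 'combined_text']))  # 537: 13 + 9 + 12 chars of tags, 3 spaces, 500 of card
```

Capping the card keeps the embedding input dominated by the structured fields rather than free-form README prose.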
dimensionality_reduction.py ADDED
@@ -0,0 +1,78 @@
+ """
+ Dimensionality reduction for visualization (UMAP, t-SNE).
+ """
+ import numpy as np
+ from umap import UMAP
+ from sklearn.manifold import TSNE
+ import pickle
+ import os
+
+
+ class DimensionReducer:
+     """Reduce high-dimensional embeddings to 2D/3D for visualization."""
+
+     def __init__(self, method: str = "umap", n_components: int = 2):
+         """
+         Initialize reducer.
+
+         Args:
+             method: 'umap' or 'tsne'
+             n_components: Number of dimensions (2 or 3)
+         """
+         self.method = method.lower()
+         self.n_components = n_components
+
+         if self.method == "umap":
+             self.reducer = UMAP(
+                 n_components=n_components,
+                 n_neighbors=15,
+                 min_dist=0.1,
+                 metric='cosine',
+                 random_state=42
+             )
+         elif self.method == "tsne":
+             # Iteration count left at sklearn's default (1000); the kwarg name
+             # changed across sklearn versions (n_iter -> max_iter)
+             self.reducer = TSNE(
+                 n_components=n_components,
+                 perplexity=30,
+                 random_state=42
+             )
+         else:
+             raise ValueError(f"Unknown method: {method}. Use 'umap' or 'tsne'")
+
+     def fit_transform(self, embeddings: np.ndarray) -> np.ndarray:
+         """
+         Fit reducer and transform embeddings.
+
+         Args:
+             embeddings: High-dimensional embeddings (n_samples, embedding_dim)
+
+         Returns:
+             Reduced embeddings (n_samples, n_components)
+         """
+         print(f"Reducing dimensions using {self.method.upper()}...")
+         reduced = self.reducer.fit_transform(embeddings)
+         print(f"Reduced to {self.n_components}D: shape {reduced.shape}")
+         return reduced
+
+     def transform(self, embeddings: np.ndarray) -> np.ndarray:
+         """Transform new embeddings (UMAP only; t-SNE doesn't support this)."""
+         if self.method == "umap":
+             return self.reducer.transform(embeddings)
+         else:
+             raise ValueError("t-SNE doesn't support transform. Use fit_transform instead.")
+
+     def save_reducer(self, filepath: str):
+         """Save fitted reducer to disk."""
+         os.makedirs(os.path.dirname(filepath) or '.', exist_ok=True)
+         with open(filepath, 'wb') as f:
+             pickle.dump(self.reducer, f)
+         print(f"Reducer saved to {filepath}")
+
+     def load_reducer(self, filepath: str):
+         """Load fitted reducer from disk."""
+         with open(filepath, 'rb') as f:
+             self.reducer = pickle.load(f)
+         print(f"Reducer loaded from {filepath}")
+
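`fit_transform` promises the same contract regardless of method: `(n_samples, embedding_dim)` in, `(n_samples, n_components)` out. As a dependency-free sketch of that reduce step (PCA via SVD here is only a stand-in for illustration, not the UMAP the module actually uses, and the random vectors substitute for real sentence embeddings):

```python
import numpy as np

def pca_2d(embeddings: np.ndarray) -> np.ndarray:
    """Reduce (n_samples, dim) embeddings to 2D via SVD-based PCA."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal axes, sorted by explained variance;
    # project onto the first two
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(42)
emb = rng.normal(size=(100, 384))  # 384 = all-MiniLM-L6-v2 embedding dim
reduced = pca_2d(emb)
print(reduced.shape)  # (100, 2)
```

UMAP is preferred in the app because it preserves local neighborhood structure that linear projections flatten, which is what makes the model-family clusters visible.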
embeddings.py ADDED
@@ -0,0 +1,71 @@
+ """
+ Generate embeddings for models using sentence transformers.
+ """
+ import numpy as np
+ from sentence_transformers import SentenceTransformer
+ from typing import List, Optional
+ import pickle
+ import os
+
+
+ class ModelEmbedder:
+     """Generate embeddings for model descriptions."""
+
+     def __init__(self, model_name: str = "all-MiniLM-L6-v2", cache_dir: Optional[str] = None):
+         """
+         Initialize embedder.
+
+         Args:
+             model_name: Sentence transformer model name
+             cache_dir: Directory to cache embeddings
+         """
+         self.model_name = model_name
+         self.cache_dir = cache_dir
+         print(f"Loading embedding model: {model_name}...")
+         self.model = SentenceTransformer(model_name)
+         print("Embedding model loaded!")
+
+     def generate_embeddings(
+         self,
+         texts: List[str],
+         batch_size: int = 32,
+         show_progress: bool = True
+     ) -> np.ndarray:
+         """
+         Generate embeddings for a list of texts.
+
+         Args:
+             texts: List of text strings to embed
+             batch_size: Batch size for encoding
+             show_progress: Whether to show progress bar
+
+         Returns:
+             numpy array of embeddings (n_samples, embedding_dim)
+         """
+         if show_progress:
+             print(f"Generating embeddings for {len(texts)} models...")
+
+         embeddings = self.model.encode(
+             texts,
+             batch_size=batch_size,
+             show_progress_bar=show_progress,
+             convert_to_numpy=True
+         )
+
+         return embeddings
+
+     def save_embeddings(self, embeddings: np.ndarray, filepath: str):
+         """Save embeddings to disk."""
+         os.makedirs(os.path.dirname(filepath) or '.', exist_ok=True)
+         with open(filepath, 'wb') as f:
+             pickle.dump(embeddings, f)
+         print(f"Embeddings saved to {filepath}")
+
+     def load_embeddings(self, filepath: str) -> np.ndarray:
+         """Load embeddings from disk."""
+         with open(filepath, 'rb') as f:
+             embeddings = pickle.load(f)
+         print(f"Embeddings loaded from {filepath}")
+         return embeddings
+
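The save/load pair exists because encoding hundreds of thousands of model cards is far more expensive than a pickle round-trip; the `.gitignore` entry `embeddings_cache.pkl` corresponds to this cache. A minimal sketch of the pattern, with random vectors standing in for real sentence-transformer output:

```python
import os
import pickle
import tempfile
import numpy as np

# Stand-in for real all-MiniLM-L6-v2 output: (n_models, 384) float32
embeddings = np.random.default_rng(0).normal(size=(10, 384)).astype(np.float32)

cache_path = os.path.join(tempfile.mkdtemp(), 'embeddings_cache.pkl')

# Save once after the expensive encode step...
with open(cache_path, 'wb') as f:
    pickle.dump(embeddings, f)

# ...then later runs reload instead of re-encoding every model card
with open(cache_path, 'rb') as f:
    cached = pickle.load(f)

print(np.array_equal(embeddings, cached))  # True
```

Note that pickle files should only be loaded from trusted sources, which is why the cache stays local and git-ignored.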
frontend/.gitignore ADDED
@@ -0,0 +1,24 @@
+ # See https://help.github.com/articles/ignoring-files/ for more about ignoring files.
+
+ # dependencies
+ /node_modules
+ /.pnp
+ .pnp.js
+
+ # testing
+ /coverage
+
+ # production
+ /build
+
+ # misc
+ .DS_Store
+ .env.local
+ .env.development.local
+ .env.test.local
+ .env.production.local
+
+ npm-debug.log*
+ yarn-debug.log*
+ yarn-error.log*
+
frontend/.nvmrc ADDED
@@ -0,0 +1,2 @@
+ 18
+
frontend/_redirects ADDED
@@ -0,0 +1,2 @@
+ /* /index.html 200
+
frontend/netlify.toml ADDED
@@ -0,0 +1,13 @@
+ [build]
+ publish = "build"
+ command = "npm run build"
+
+ [build.environment]
+ NODE_VERSION = "18"
+ REACT_APP_API_URL = "https://your-backend-url.railway.app"
+
+ [[redirects]]
+ from = "/*"
+ to = "/index.html"
+ status = 200
+
frontend/package.json ADDED
@@ -0,0 +1,50 @@
+ {
+   "name": "hf-model-navigator-frontend",
+   "version": "1.0.0",
+   "description": "React frontend with Visx for HF Model Ecosystem Navigator",
+   "private": true,
+   "dependencies": {
+     "@visx/axis": "^3.0.0",
+     "@visx/brush": "^3.0.0",
+     "@visx/event": "^3.0.0",
+     "@visx/gradient": "^3.0.0",
+     "@visx/group": "^3.0.0",
+     "@visx/legend": "^3.0.0",
+     "@visx/point": "^3.0.0",
+     "@visx/scale": "^3.0.0",
+     "@visx/shape": "^3.0.0",
+     "@visx/tooltip": "^3.0.0",
+     "@visx/visx": "^3.0.0",
+     "react": "^18.2.0",
+     "react-dom": "^18.2.0",
+     "react-scripts": "5.0.1",
+     "typescript": "^5.0.0",
+     "@types/react": "^18.2.0",
+     "@types/react-dom": "^18.2.0",
+     "axios": "^1.6.0"
+   },
+   "scripts": {
+     "start": "react-scripts start",
+     "build": "react-scripts build",
+     "test": "react-scripts test",
+     "eject": "react-scripts eject"
+   },
+   "eslintConfig": {
+     "extends": [
+       "react-app"
+     ]
+   },
+   "browserslist": {
+     "production": [
+       ">0.2%",
+       "not dead",
+       "not op_mini all"
+     ],
+     "development": [
+       "last 1 chrome version",
+       "last 1 firefox version",
+       "last 1 safari version"
+     ]
+   }
+ }
+
frontend/public/_redirects ADDED
@@ -0,0 +1,2 @@
+ /* /index.html 200
+
frontend/public/index.html ADDED
@@ -0,0 +1,18 @@
+ <!DOCTYPE html>
+ <html lang="en">
+   <head>
+     <meta charset="utf-8" />
+     <meta name="viewport" content="width=device-width, initial-scale=1" />
+     <meta name="theme-color" content="#000000" />
+     <meta
+       name="description"
+       content="Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face. Analysis of 1.86 million models on Hugging Face, revealing fine-tuning lineages and model family structures using evolutionary biology methods."
+     />
+     <title>Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face</title>
+   </head>
+   <body>
+     <noscript>You need to enable JavaScript to run this app.</noscript>
+     <div id="root"></div>
+   </body>
+ </html>
+
frontend/src/App.css ADDED
@@ -0,0 +1,110 @@
+ .App {
+   font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', 'Roboto', 'Oxygen',
+     'Ubuntu', 'Cantarell', 'Fira Sans', 'Droid Sans', 'Helvetica Neue',
+     sans-serif;
+   -webkit-font-smoothing: antialiased;
+   -moz-osx-font-smoothing: grayscale;
+ }
+
+ .App-header {
+   background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
+   color: white;
+   padding: 2rem;
+   text-align: center;
+ }
+
+ .App-header h1 {
+   margin: 0 0 1rem 0;
+   font-size: 2rem;
+   font-weight: 600;
+ }
+
+ .App-header p {
+   margin: 0;
+   opacity: 0.9;
+ }
+
+ .App-header a {
+   color: white;
+   text-decoration: underline;
+   opacity: 0.9;
+   transition: opacity 0.2s;
+ }
+
+ .App-header a:hover {
+   opacity: 1;
+   text-decoration: none;
+ }
+
+ .stats {
+   display: flex;
+   gap: 2rem;
+   justify-content: center;
+   margin-top: 1rem;
+   font-size: 0.9rem;
+ }
+
+ .main-content {
+   display: flex;
+   height: calc(100vh - 200px);
+ }
+
+ .sidebar {
+   width: 300px;
+   padding: 2rem;
+   background: #f5f5f5;
+   overflow-y: auto;
+   border-right: 1px solid #e0e0e0;
+ }
+
+ .sidebar h2 {
+   margin-top: 0;
+   font-size: 1.5rem;
+ }
+
+ .sidebar label {
+   display: block;
+   margin-bottom: 1.5rem;
+   font-weight: 500;
+ }
+
+ .sidebar input[type="text"],
+ .sidebar select {
+   width: 100%;
+   padding: 0.5rem;
+   margin-top: 0.5rem;
+   border: 1px solid #ccc;
+   border-radius: 4px;
+   font-size: 0.9rem;
+ }
+
+ .sidebar input[type="range"] {
+   width: 100%;
+   margin-top: 0.5rem;
+ }
+
+ .visualization {
+   flex: 1;
+   padding: 2rem;
+   display: flex;
+   align-items: center;
+   justify-content: center;
+   background: white;
+ }
+
+ .loading,
+ .error,
+ .empty {
+   text-align: center;
+   padding: 2rem;
+   font-size: 1.2rem;
+ }
+
+ .error {
+   color: #d32f2f;
+ }
+
+ .empty {
+   color: #666;
+ }
+
frontend/src/App.tsx ADDED
@@ -0,0 +1,197 @@
+ /**
+  * Main React app component using Visx for visualization.
+  */
+ import React, { useState, useEffect, useCallback } from 'react';
+ import ScatterPlot from './components/ScatterPlot';
+ import ModelModal from './components/ModelModal';
+ import { ModelPoint, Stats } from './types';
+ import './App.css';
+
+ const API_BASE = process.env.REACT_APP_API_URL || 'http://localhost:8000';
+
+ function App() {
+   const [data, setData] = useState<ModelPoint[]>([]);
+   const [stats, setStats] = useState<Stats | null>(null);
+   const [loading, setLoading] = useState(true);
+   const [error, setError] = useState<string | null>(null);
+   const [selectedModel, setSelectedModel] = useState<ModelPoint | null>(null);
+   const [isModalOpen, setIsModalOpen] = useState(false);
+
+   // Filters
+   const [minDownloads, setMinDownloads] = useState(0);
+   const [minLikes, setMinLikes] = useState(0);
+   const [searchQuery, setSearchQuery] = useState('');
+   const [colorBy, setColorBy] = useState('library_name');
+   const [sizeBy, setSizeBy] = useState('downloads');
+
+   // Dimensions
+   const [width, setWidth] = useState(window.innerWidth * 0.7);
+   const [height, setHeight] = useState(window.innerHeight * 0.7);
+
+   useEffect(() => {
+     const handleResize = () => {
+       setWidth(window.innerWidth * 0.7);
+       setHeight(window.innerHeight * 0.7);
+     };
+     window.addEventListener('resize', handleResize);
+     return () => window.removeEventListener('resize', handleResize);
+   }, []);
+
+   const fetchData = useCallback(async () => {
+     setLoading(true);
+     setError(null);
+     try {
+       const params = new URLSearchParams({
+         min_downloads: minDownloads.toString(),
+         min_likes: minLikes.toString(),
+         color_by: colorBy,
+         size_by: sizeBy,
+         max_points: '5000',
+       });
+       if (searchQuery) {
+         params.append('search_query', searchQuery);
+       }
+
+       const response = await fetch(`${API_BASE}/api/models?${params}`);
+       if (!response.ok) throw new Error('Failed to fetch models');
+       const models = await response.json();
+       setData(models);
+     } catch (err) {
+       setError(err instanceof Error ? err.message : 'Unknown error');
+     } finally {
+       setLoading(false);
+     }
+   }, [minDownloads, minLikes, searchQuery, colorBy, sizeBy]);
+
+   useEffect(() => {
+     fetchData();
+   }, [fetchData]);
+
+   useEffect(() => {
+     // Fetch stats once
+     fetch(`${API_BASE}/api/stats`)
+       .then(res => res.json())
+       .then(setStats)
+       .catch(console.error);
+   }, []);
+
+   return (
+     <div className="App">
+       <header className="App-header">
+         <h1>Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face</h1>
+         <p style={{ maxWidth: '900px', margin: '0 auto', lineHeight: '1.6' }}>
+           Many have observed that the development and deployment of generative machine learning (ML) and artificial intelligence (AI) models follow a distinctive pattern in which pre-trained models are adapted and fine-tuned for specific downstream tasks. However, there is limited empirical work that examines the structure of these interactions. This paper analyzes 1.86 million models on Hugging Face, a leading peer production platform for model development. Our study of model family trees reveals sprawling fine-tuning lineages that vary widely in size and structure. Using an evolutionary biology lens, we measure genetic similarity and mutation of traits over model families.
+           {' '}
+           <a
+             href="https://arxiv.org/abs/2508.06811"
+             target="_blank"
+             rel="noopener noreferrer"
+             style={{ color: 'white', textDecoration: 'underline', fontWeight: '500' }}
+           >
+             Read the full paper →
+           </a>
+         </p>
+         <p style={{ marginTop: '0.5rem', fontSize: '0.9rem', opacity: 0.9 }}>
+           <strong>Authors:</strong> Benjamin Laufer, Hamidah Oderinwale, Jon Kleinberg
+         </p>
+         {stats && (
+           <div className="stats">
+             <span>Total Models: {stats.total_models.toLocaleString()}</span>
+             <span>Libraries: {stats.unique_libraries}</span>
+             <span>Pipelines: {stats.unique_pipelines}</span>
+           </div>
+         )}
+       </header>
+
+       <div className="main-content">
+         <aside className="sidebar">
+           <h2>Filters</h2>
+
+           <label>
+             Search:
+             <input
+               type="text"
+               value={searchQuery}
+               onChange={(e) => setSearchQuery(e.target.value)}
+               placeholder="Model ID or tags..."
+             />
+           </label>
+
+           <label>
+             Min Downloads: {minDownloads.toLocaleString()}
+             <input
+               type="range"
+               min="0"
+               max="1000000"
+               step="1000"
+               value={minDownloads}
+               onChange={(e) => setMinDownloads(Number(e.target.value))}
+             />
+           </label>
+
+           <label>
+             Min Likes: {minLikes.toLocaleString()}
+             <input
+               type="range"
+               min="0"
+               max="10000"
+               step="10"
+               value={minLikes}
+               onChange={(e) => setMinLikes(Number(e.target.value))}
+             />
+           </label>
+
+           <label>
+             Color By:
+             <select value={colorBy} onChange={(e) => setColorBy(e.target.value)}>
+               <option value="library_name">Library</option>
+               <option value="pipeline_tag">Pipeline</option>
+               <option value="downloads">Downloads</option>
+               <option value="likes">Likes</option>
+             </select>
+           </label>
+
+           <label>
+             Size By:
+             <select value={sizeBy} onChange={(e) => setSizeBy(e.target.value)}>
+               <option value="downloads">Downloads</option>
+               <option value="likes">Likes</option>
+               <option value="trendingScore">Trending Score</option>
+               <option value="none">None</option>
+             </select>
+           </label>
+         </aside>
+
+         <main className="visualization">
+           {loading && <div className="loading">Loading models...</div>}
+           {error && <div className="error">Error: {error}</div>}
+           {!loading && !error && data.length === 0 && (
+             <div className="empty">No models match the filters</div>
+           )}
+           {!loading && !error && data.length > 0 && (
+             <ScatterPlot
+               width={width}
+               height={height}
+               data={data}
+               colorBy={colorBy}
+               sizeBy={sizeBy}
+               onPointClick={(model) => {
+                 setSelectedModel(model);
+                 setIsModalOpen(true);
+               }}
+             />
+           )}
+         </main>
+
+         <ModelModal
+           model={selectedModel}
+           isOpen={isModalOpen}
+           onClose={() => setIsModalOpen(false)}
+         />
+       </div>
+     </div>
+   );
+ }
+
+ export default App;
+
frontend/src/components/ModelModal.css ADDED
@@ -0,0 +1,161 @@
+ .modal-overlay {
+   position: fixed;
+   top: 0;
+   left: 0;
+   right: 0;
+   bottom: 0;
+   background: rgba(0, 0, 0, 0.7);
+   display: flex;
+   align-items: center;
+   justify-content: center;
+   z-index: 1000;
+   padding: 2rem;
+   animation: fadeIn 0.2s ease-in;
+ }
+
+ @keyframes fadeIn {
+   from {
+     opacity: 0;
+   }
+   to {
+     opacity: 1;
+   }
+ }
+
+ .modal-content {
+   background: white;
+   border-radius: 12px;
+   max-width: 600px;
+   width: 100%;
+   max-height: 90vh;
+   overflow-y: auto;
+   padding: 2rem;
+   position: relative;
+   box-shadow: 0 20px 60px rgba(0, 0, 0, 0.3);
+   animation: slideUp 0.3s ease-out;
+ }
+
+ @keyframes slideUp {
+   from {
+     transform: translateY(20px);
+     opacity: 0;
+   }
+   to {
+     transform: translateY(0);
+     opacity: 1;
+   }
+ }
+
+ .modal-close {
+   position: absolute;
+   top: 1rem;
+   right: 1rem;
+   background: none;
+   border: none;
+   font-size: 2rem;
+   line-height: 1;
+   cursor: pointer;
+   color: #666;
+   padding: 0;
+   width: 32px;
+   height: 32px;
+   display: flex;
+   align-items: center;
+   justify-content: center;
+   border-radius: 50%;
+   transition: all 0.2s;
+ }
+
+ .modal-close:hover {
+   background: #f0f0f0;
+   color: #000;
+ }
+
+ .modal-content h2 {
+   margin: 0 0 1.5rem 0;
+   font-size: 1.5rem;
+   color: #333;
+   word-break: break-word;
+ }
+
+ .modal-section {
+   margin-bottom: 1.5rem;
+ }
+
+ .modal-section:last-child {
+   margin-bottom: 0;
+ }
+
+ .modal-section h3 {
+   margin: 0 0 0.75rem 0;
+   font-size: 1rem;
+   font-weight: 600;
+   color: #555;
+   text-transform: uppercase;
+   letter-spacing: 0.5px;
+ }
+
+ .modal-info-grid {
+   display: grid;
+   grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
+   gap: 1rem;
+ }
+
+ .modal-info-item {
+   display: flex;
+   flex-direction: column;
+   gap: 0.25rem;
+ }
+
+ .modal-info-item strong {
+   font-size: 0.875rem;
+   color: #666;
+   font-weight: 500;
+ }
+
+ .modal-info-item span {
+   font-size: 1rem;
+   color: #333;
+   font-weight: 500;
+ }
+
+ .modal-tags {
+   margin: 0;
+   padding: 0.75rem;
+   background: #f5f5f5;
+   border-radius: 6px;
+   color: #333;
+   font-size: 0.9rem;
+   line-height: 1.5;
+ }
+
+ .modal-link {
+   display: inline-flex;
+   align-items: center;
+   gap: 0.5rem;
+   padding: 0.75rem 1.5rem;
+   background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
+   color: white;
+   text-decoration: none;
+   border-radius: 6px;
+   font-weight: 500;
+   transition: all 0.2s;
+   margin-top: 0.5rem;
+ }
+
+ .modal-link:hover {
+   transform: translateY(-2px);
+   box-shadow: 0 4px 12px rgba(102, 126, 234, 0.4);
+ }
+
+ @media (max-width: 768px) {
+   .modal-content {
+     padding: 1.5rem;
+     max-width: 100%;
+   }
+
+   .modal-info-grid {
+     grid-template-columns: 1fr;
+   }
+ }
+
frontend/src/components/ModelModal.tsx ADDED
@@ -0,0 +1,90 @@
+ /**
+  * Modal component for displaying detailed model information.
+  */
+ import React from 'react';
+ import { ModelPoint } from '../types';
+ import './ModelModal.css';
+
+ interface ModelModalProps {
+   model: ModelPoint | null;
+   isOpen: boolean;
+   onClose: () => void;
+ }
+
+ export default function ModelModal({ model, isOpen, onClose }: ModelModalProps) {
+   if (!isOpen || !model) return null;
+
+   const hfUrl = `https://huggingface.co/${model.model_id}`;
+
+   return (
+     <div className="modal-overlay" onClick={onClose}>
+       <div className="modal-content" onClick={(e) => e.stopPropagation()}>
+         <button className="modal-close" onClick={onClose}>×</button>
+
+         <h2>{model.model_id}</h2>
+
+         <div className="modal-section">
+           <h3>Model Information</h3>
+           <div className="modal-info-grid">
+             <div className="modal-info-item">
+               <strong>Library:</strong>
+               <span>{model.library_name || 'N/A'}</span>
+             </div>
+             <div className="modal-info-item">
+               <strong>Pipeline Tag:</strong>
+               <span>{model.pipeline_tag || 'N/A'}</span>
+             </div>
+             <div className="modal-info-item">
+               <strong>Downloads:</strong>
+               <span>{model.downloads.toLocaleString()}</span>
+             </div>
+             <div className="modal-info-item">
+               <strong>Likes:</strong>
+               <span>{model.likes.toLocaleString()}</span>
+             </div>
+             {model.trending_score !== null && (
+               <div className="modal-info-item">
+                 <strong>Trending Score:</strong>
+                 <span>{model.trending_score.toFixed(2)}</span>
+               </div>
+             )}
+           </div>
+         </div>
+
+         {model.tags && (
+           <div className="modal-section">
+             <h3>Tags</h3>
+             <p className="modal-tags">{model.tags}</p>
+           </div>
+         )}
+
+         <div className="modal-section">
+           <h3>Links</h3>
+           <a
+             href={hfUrl}
+             target="_blank"
+             rel="noopener noreferrer"
+             className="modal-link"
+           >
+             View on Hugging Face →
+           </a>
+         </div>
+
+         <div className="modal-section">
+           <h3>Position in Latent Space</h3>
+           <div className="modal-info-grid">
+             <div className="modal-info-item">
+               <strong>Dimension 1:</strong>
+               <span>{model.x.toFixed(4)}</span>
+             </div>
+             <div className="modal-info-item">
+               <strong>Dimension 2:</strong>
+               <span>{model.y.toFixed(4)}</span>
+             </div>
+           </div>
+         </div>
+       </div>
+     </div>
+   );
+ }
+
frontend/src/components/ScatterPlot.tsx ADDED
@@ -0,0 +1,235 @@
+ /**
+  * Visx-based scatter plot component for model visualization.
+  * Based on visx gallery examples: https://airbnb.io/visx/gallery
+  */
+ import React, { useMemo, useCallback } from 'react';
+ import { Group } from '@visx/group';
+ import { scaleLinear, scaleOrdinal } from '@visx/scale';
+ import { AxisBottom, AxisLeft } from '@visx/axis';
+ import { GridRows, GridColumns } from '@visx/grid';
+ import { Tooltip, useTooltip } from '@visx/tooltip';
+ import { LegendOrdinal } from '@visx/legend';
+ import { ModelPoint } from '../types';
+
+ // Using circle elements directly instead of the Point component.
+ // Color schemes - using a predefined palette
+ const colorPalette = [
+   '#8dd3c7', '#ffffb3', '#bebada', '#fb8072', '#80b1d3',
+   '#fdb462', '#b3de69', '#fccde5', '#d9d9d9', '#bc80bd',
+   '#ccebc5', '#ffed6f'
+ ];
+
+ interface ScatterPlotProps {
+   width: number;
+   height: number;
+   data: ModelPoint[];
+   colorBy: string;
+   sizeBy: string;
+   margin?: { top: number; right: number; bottom: number; left: number };
+   onPointClick?: (model: ModelPoint) => void;
+ }
+
+ const defaultMargin = { top: 40, right: 40, bottom: 60, left: 60 };
+
+ export default function ScatterPlot({
+   width,
+   height,
+   data,
+   colorBy,
+   sizeBy,
+   margin = defaultMargin,
+   onPointClick,
+ }: ScatterPlotProps) {
+   const {
+     tooltipData,
+     tooltipLeft,
+     tooltipTop,
+     tooltipOpen,
+     showTooltip,
+     hideTooltip,
+   } = useTooltip<ModelPoint>();
+
+   // Bounds
+   const xMax = width - margin.left - margin.right;
+   const yMax = height - margin.top - margin.bottom;
+
+   // Scales
+   const xScale = useMemo(
+     () =>
+       scaleLinear<number>({
+         domain: [Math.min(...data.map(d => d.x)), Math.max(...data.map(d => d.x))],
+         range: [0, xMax],
+         nice: true,
+       }),
+     [data, xMax]
+   );
+
+   const yScale = useMemo(
+     () =>
+       scaleLinear<number>({
+         domain: [Math.min(...data.map(d => d.y)), Math.max(...data.map(d => d.y))],
+         range: [yMax, 0],
+         nice: true,
+       }),
+     [data, yMax]
+   );
+
+   // Color scale
+   const getColorValue = (d: ModelPoint) => {
+     if (colorBy === 'library_name') return d.library_name || 'Unknown';
+     if (colorBy === 'pipeline_tag') return d.pipeline_tag || 'Unknown';
+     if (colorBy === 'downloads') return d.downloads;
+     if (colorBy === 'likes') return d.likes;
+     return 'All';
+   };
+
+   const colorValues = useMemo(() => data.map(getColorValue), [data, colorBy]);
+   const isCategorical = colorBy === 'library_name' || colorBy === 'pipeline_tag';
+
+   const colorScale = useMemo(() => {
+     if (isCategorical) {
+       const uniqueValues = Array.from(new Set(colorValues as string[]));
+       return scaleOrdinal<string, string>({
+         domain: uniqueValues,
+         range: colorPalette,
+       });
+     } else {
+       // For continuous values, use a linear scale that interpolates colors
+       const min = Math.min(...(colorValues as number[]));
+       const max = Math.max(...(colorValues as number[]));
+       return scaleLinear<string>({
+         domain: [min, max],
+         range: ['#440154', '#fde725'], // Viridis-like endpoints
+       });
+     }
+   }, [colorValues, isCategorical]);
+
+   // Size scale
+   const getSizeValue = (d: ModelPoint) => {
+     if (sizeBy === 'downloads') return d.downloads;
+     if (sizeBy === 'likes') return d.likes;
+     if (sizeBy === 'trendingScore' && d.trending_score) return d.trending_score;
+ if (sizeBy === 'trendingScore' && d.trending_score) return d.trending_score;
111
+ return 10;
112
+ };
113
+
114
+ const sizeValues = useMemo(() => data.map(getSizeValue), [data, sizeBy]);
115
+ const minSize = Math.min(...sizeValues);
116
+ const maxSize = Math.max(...sizeValues);
117
+
118
+ const sizeScale = useMemo(
119
+ () =>
120
+ scaleLinear<number>({
121
+ domain: [minSize, maxSize],
122
+ range: [5, 20],
123
+ }),
124
+ [minSize, maxSize]
125
+ );
126
+
127
+ // Handle point hover
128
+ const handleMouseOver = useCallback(
129
+ (event: React.MouseEvent, datum: ModelPoint) => {
130
+ const coords = { x: event.clientX, y: event.clientY };
131
+ showTooltip({
132
+ tooltipLeft: coords.x,
133
+ tooltipTop: coords.y,
134
+ tooltipData: datum,
135
+ });
136
+ },
137
+ [showTooltip]
138
+ );
139
+
140
+ return (
141
+ <div style={{ position: 'relative' }}>
142
+ <svg width={width} height={height}>
143
+ <Group left={margin.left} top={margin.top}>
144
+ {/* Grid */}
145
+ <GridRows scale={yScale} width={xMax} strokeDasharray="3,3" stroke="#e0e0e0" />
146
+ <GridColumns scale={xScale} height={yMax} strokeDasharray="3,3" stroke="#e0e0e0" />
147
+
148
+ {/* Points */}
149
+ {data.map((d, i) => {
150
+ const x = xScale(d.x);
151
+ const y = yScale(d.y);
152
+ const color = isCategorical
153
+ ? colorScale(getColorValue(d) as string)
154
+ : colorScale(getColorValue(d) as number);
155
+ const size = sizeScale(getSizeValue(d));
156
+
157
+ return (
158
+ <circle
159
+ key={`point-${i}`}
160
+ cx={x}
161
+ cy={y}
162
+ r={size / 2}
163
+ fill={color}
164
+ opacity={0.7}
165
+ stroke="white"
166
+ strokeWidth={0.5}
167
+ onMouseOver={(e) => handleMouseOver(e, d)}
168
+ onMouseOut={hideTooltip}
169
+ onClick={() => onPointClick && onPointClick(d)}
170
+ style={{ cursor: 'pointer' }}
171
+ />
172
+ );
173
+ })}
174
+
175
+ {/* Axes */}
176
+ <AxisBottom
177
+ top={yMax}
178
+ scale={xScale}
179
+ numTicks={5}
180
+ label="Dimension 1"
181
+ stroke="#333"
182
+ tickStroke="#333"
183
+ />
184
+ <AxisLeft
185
+ scale={yScale}
186
+ numTicks={5}
187
+ label="Dimension 2"
188
+ stroke="#333"
189
+ tickStroke="#333"
190
+ />
191
+ </Group>
192
+ </svg>
193
+
194
+ {/* Tooltip */}
195
+ {tooltipOpen && tooltipData && (
196
+ <Tooltip
197
+ top={tooltipTop}
198
+ left={tooltipLeft}
199
+ style={{
200
+ backgroundColor: 'rgba(0, 0, 0, 0.9)',
201
+ color: 'white',
202
+ padding: '8px',
203
+ borderRadius: '4px',
204
+ fontSize: '12px',
205
+ }}
206
+ >
207
+ <div>
208
+ <strong>{tooltipData.model_id}</strong>
209
+ <br />
210
+ Library: {tooltipData.library_name || 'N/A'}
211
+ <br />
212
+ Pipeline: {tooltipData.pipeline_tag || 'N/A'}
213
+ <br />
214
+ Downloads: {tooltipData.downloads.toLocaleString()}
215
+ <br />
216
+ Likes: {tooltipData.likes.toLocaleString()}
217
+ </div>
218
+ </Tooltip>
219
+ )}
220
+
221
+ {/* Legend */}
222
+ {isCategorical && (
223
+ <div style={{ position: 'absolute', top: 10, right: 10 }}>
224
+ <LegendOrdinal
225
+ scale={colorScale as any}
226
+ labelFormat={(label) => label}
227
+ direction="column"
228
+ style={{ fontSize: '12px' }}
229
+ />
230
+ </div>
231
+ )}
232
+ </div>
233
+ );
234
+ }
235
+
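As an aside, the continuous branch of the color scale above interpolates linearly between `'#440154'` and `'#fde725'` (the endpoints of the viridis colormap). A minimal, dependency-free Python sketch of that channel-wise interpolation; the helper name `lerp_hex` is illustrative and not part of this repository:

```python
def lerp_hex(c1: str, c2: str, t: float) -> str:
    """Linearly interpolate between two '#rrggbb' colors, t in [0, 1]."""
    a = [int(c1[i:i + 2], 16) for i in (1, 3, 5)]  # parse r, g, b channels
    b = [int(c2[i:i + 2], 16) for i in (1, 3, 5)]
    # Round to the nearest integer per channel
    mixed = [int(x + (y - x) * t + 0.5) for x, y in zip(a, b)]
    return '#' + ''.join(f'{v:02x}' for v in mixed)

# The scale's domain bounds map to the two endpoint colors,
# as with the visx linear scale in ScatterPlot.tsx.
low = lerp_hex('#440154', '#fde725', 0.0)   # '#440154'
high = lerp_hex('#440154', '#fde725', 1.0)  # '#fde725'
```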
frontend/src/index.css ADDED
@@ -0,0 +1,18 @@
+ body {
+   margin: 0;
+   font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', 'Roboto', 'Oxygen',
+     'Ubuntu', 'Cantarell', 'Fira Sans', 'Droid Sans', 'Helvetica Neue',
+     sans-serif;
+   -webkit-font-smoothing: antialiased;
+   -moz-osx-font-smoothing: grayscale;
+ }
+
+ code {
+   font-family: source-code-pro, Menlo, Monaco, Consolas, 'Courier New',
+     monospace;
+ }
+
+ * {
+   box-sizing: border-box;
+ }
+
frontend/src/index.tsx ADDED
@@ -0,0 +1,14 @@
+ import React from 'react';
+ import ReactDOM from 'react-dom/client';
+ import './index.css';
+ import App from './App';
+
+ const root = ReactDOM.createRoot(
+   document.getElementById('root') as HTMLElement
+ );
+ root.render(
+   <React.StrictMode>
+     <App />
+   </React.StrictMode>
+ );
+
frontend/src/types.ts ADDED
@@ -0,0 +1,20 @@
+ export interface ModelPoint {
+   model_id: string;
+   x: number;
+   y: number;
+   library_name: string | null;
+   pipeline_tag: string | null;
+   downloads: number;
+   likes: number;
+   trending_score: number | null;
+   tags: string | null;
+ }
+
+ export interface Stats {
+   total_models: number;
+   unique_libraries: number;
+   unique_pipelines: number;
+   avg_downloads: number;
+   avg_likes: number;
+ }
+
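The `ModelPoint` interface above is the contract between the frontend and the JSON emitted by `get_models` in `netlify-functions/api.py`. A hedged Python sketch of a shape check for that payload; the schema dict and `is_valid_model_point` are illustrative helpers, not part of this repository:

```python
# Required keys and accepted types mirroring the ModelPoint interface;
# None is allowed wherever the TypeScript type is `T | null`.
MODEL_POINT_SCHEMA = {
    'model_id': (str,),
    'x': (float, int),
    'y': (float, int),
    'library_name': (str, type(None)),
    'pipeline_tag': (str, type(None)),
    'downloads': (int,),
    'likes': (int,),
    'trending_score': (float, int, type(None)),
    'tags': (str, type(None)),
}

def is_valid_model_point(obj: dict) -> bool:
    """Check that a decoded JSON object matches the ModelPoint shape."""
    return set(obj) == set(MODEL_POINT_SCHEMA) and all(
        isinstance(obj[key], types) for key, types in MODEL_POINT_SCHEMA.items()
    )
```

A check like this could run in a backend test to catch the API and the TypeScript types drifting apart.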
frontend/tsconfig.json ADDED
@@ -0,0 +1,27 @@
+ {
+   "compilerOptions": {
+     "target": "es5",
+     "lib": [
+       "dom",
+       "dom.iterable",
+       "esnext"
+     ],
+     "allowJs": true,
+     "skipLibCheck": true,
+     "esModuleInterop": true,
+     "allowSyntheticDefaultImports": true,
+     "strict": true,
+     "forceConsistentCasingInFileNames": true,
+     "noFallthroughCasesInSwitch": true,
+     "module": "esnext",
+     "moduleResolution": "node",
+     "resolveJsonModule": true,
+     "isolatedModules": true,
+     "noEmit": true,
+     "jsx": "react-jsx"
+   },
+   "include": [
+     "src"
+   ]
+ }
+
netlify-functions/api.py ADDED
@@ -0,0 +1,180 @@
+ """
+ Netlify serverless function for the model data API.
+ This is a simplified version that works with Netlify Functions.
+ """
+ import json
+ import os
+ import sys
+
+ # Add the parent directory to the path so shared modules can be imported
+ sys.path.append(os.path.join(os.path.dirname(__file__), '..'))
+
+ from data_loader import ModelDataLoader
+ from embeddings import ModelEmbedder
+ from dimensionality_reduction import DimensionReducer
+ import pandas as pd
+
+ # Global state (may persist across invocations on a warm serverless instance)
+ data_loader = ModelDataLoader()
+ embedder = None
+ reducer = None
+ df = None
+ embeddings = None
+ reduced_embeddings = None
+
+
+ def handler(event, context):
+     """Netlify serverless function handler."""
+     global embedder, reducer, df, embeddings, reduced_embeddings
+
+     # Parse query parameters
+     query_params = event.get('queryStringParameters') or {}
+     path = event.get('path', '')
+
+     # CORS headers
+     headers = {
+         'Access-Control-Allow-Origin': '*',
+         'Access-Control-Allow-Headers': 'Content-Type',
+         'Access-Control-Allow-Methods': 'GET, OPTIONS',
+         'Content-Type': 'application/json',
+     }
+
+     # Handle OPTIONS (CORS preflight)
+     if event.get('httpMethod') == 'OPTIONS':
+         return {
+             'statusCode': 200,
+             'headers': headers,
+             'body': ''
+         }
+
+     # Initialize data on the first request
+     if df is None:
+         try:
+             print("Loading data...")
+             df = data_loader.load_data(sample_size=10000)
+             df = data_loader.preprocess_for_embedding(df)
+             print(f"Loaded {len(df)} models")
+         except Exception as e:
+             return {
+                 'statusCode': 500,
+                 'headers': headers,
+                 'body': json.dumps({'error': f'Failed to load data: {str(e)}'})
+             }
+
+     # Route requests
+     if '/api/models' in path:
+         return get_models(query_params, headers)
+     elif '/api/stats' in path:
+         return get_stats(headers)
+     else:
+         return {
+             'statusCode': 404,
+             'headers': headers,
+             'body': json.dumps({'error': 'Not found'})
+         }
+
+
+ def get_models(query_params, headers):
+     """Get filtered models."""
+     global df, embedder, reducer, embeddings, reduced_embeddings
+
+     try:
+         min_downloads = int(query_params.get('min_downloads', 0))
+         min_likes = int(query_params.get('min_likes', 0))
+         search_query = query_params.get('search_query')
+         max_points = int(query_params.get('max_points', 5000))
+
+         # Filter data
+         filtered_df = data_loader.filter_data(
+             df=df,
+             min_downloads=min_downloads,
+             min_likes=min_likes,
+             search_query=search_query
+         )
+
+         if len(filtered_df) == 0:
+             return {
+                 'statusCode': 200,
+                 'headers': headers,
+                 'body': json.dumps([])
+             }
+
+         # Limit the number of points returned
+         if len(filtered_df) > max_points:
+             filtered_df = filtered_df.sample(n=max_points, random_state=42)
+
+         # Generate embeddings if needed
+         if embedder is None:
+             embedder = ModelEmbedder()
+
+         if embeddings is None:
+             texts = df['combined_text'].tolist()
+             embeddings = embedder.generate_embeddings(texts)
+
+         # Reduce dimensions if needed
+         if reducer is None:
+             reducer = DimensionReducer(method="umap", n_components=2)
+
+         if reduced_embeddings is None:
+             reduced_embeddings = reducer.fit_transform(embeddings)
+
+         # Look up 2-D coordinates for the filtered rows
+         # (assumes df keeps its original RangeIndex, aligned with embedding rows)
+         filtered_indices = filtered_df.index.tolist()
+         filtered_reduced = reduced_embeddings[filtered_indices]
+
+         # Prepare the response payload
+         models = []
+         for idx, (_, row) in enumerate(filtered_df.iterrows()):
+             models.append({
+                 'model_id': row.get('model_id', 'Unknown'),
+                 'x': float(filtered_reduced[idx, 0]),
+                 'y': float(filtered_reduced[idx, 1]),
+                 'library_name': row.get('library_name'),
+                 'pipeline_tag': row.get('pipeline_tag'),
+                 'downloads': int(row.get('downloads', 0)),
+                 'likes': int(row.get('likes', 0)),
+                 'trending_score': float(row.get('trendingScore', 0)) if pd.notna(row.get('trendingScore')) else None,
+                 'tags': row.get('tags') if pd.notna(row.get('tags')) else None
+             })
+
+         return {
+             'statusCode': 200,
+             'headers': headers,
+             'body': json.dumps(models)
+         }
+     except Exception as e:
+         return {
+             'statusCode': 500,
+             'headers': headers,
+             'body': json.dumps({'error': str(e)})
+         }
+
+
+ def get_stats(headers):
+     """Get dataset statistics."""
+     global df
+
+     if df is None:
+         return {
+             'statusCode': 503,
+             'headers': headers,
+             'body': json.dumps({'error': 'Data not loaded'})
+         }
+
+     stats = {
+         'total_models': len(df),
+         'unique_libraries': df['library_name'].nunique() if 'library_name' in df.columns else 0,
+         'unique_pipelines': df['pipeline_tag'].nunique() if 'pipeline_tag' in df.columns else 0,
+         'avg_downloads': float(df['downloads'].mean()) if 'downloads' in df.columns else 0,
+         'avg_likes': float(df['likes'].mean()) if 'likes' in df.columns else 0
+     }
+
+     return {
+         'statusCode': 200,
+         'headers': headers,
+         'body': json.dumps(stats)
+     }
+
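The function above follows the Lambda-style contract Netlify uses: a dict event in, a dict with `statusCode`, `headers`, and `body` out. A dependency-free sketch of the preflight and routing branches (a mirror for illustration, not an import of the handler above, since that module pulls in the heavy embedding stack):

```python
import json

# Same CORS headers as in api.py
CORS_HEADERS = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Allow-Methods': 'GET, OPTIONS',
    'Content-Type': 'application/json',
}

def route(event: dict) -> dict:
    """Minimal mirror of the handler's preflight and routing logic."""
    # CORS preflight: browsers send OPTIONS before cross-origin GETs
    if event.get('httpMethod') == 'OPTIONS':
        return {'statusCode': 200, 'headers': CORS_HEADERS, 'body': ''}
    path = event.get('path', '')
    # Substring match, because Netlify prefixes paths with /.netlify/functions/...
    if '/api/models' in path or '/api/stats' in path:
        return {'statusCode': 200, 'headers': CORS_HEADERS,
                'body': json.dumps({'routed': path})}
    return {'statusCode': 404, 'headers': CORS_HEADERS,
            'body': json.dumps({'error': 'Not found'})}
```

Keeping the real routing logic this thin makes it easy to unit-test without loading data, embeddings, or UMAP.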
netlify-functions/models.py ADDED
@@ -0,0 +1,23 @@
+ """
+ Netlify serverless function wrapper for the models API.
+ This file is the entry point for Netlify Functions.
+ """
+ import sys
+ import os
+
+ # Add the parent directory to the path
+ sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..'))
+
+ from api import handler
+
+
+ def lambda_handler(event, context):
+     """
+     AWS Lambda/Netlify Functions handler.
+     Netlify passes the request path in event['path'] and the query
+     parameters in event['queryStringParameters'], which is the format
+     handler() already expects, so the event is forwarded unchanged.
+     """
+     return handler(event, context)
+
netlify-functions/requirements.txt ADDED
@@ -0,0 +1,9 @@
+ pandas>=2.0.0
+ numpy>=1.24.0
+ sentence-transformers>=2.2.0
+ umap-learn>=0.5.4
+ scikit-learn>=1.3.0
+ datasets>=2.14.0
+ huggingface-hub>=0.17.0
+ tqdm>=4.66.0
+
netlify.toml ADDED
@@ -0,0 +1,27 @@
+ [build]
+ base = "frontend"
+ # publish is resolved relative to base, so this points at frontend/build
+ publish = "build"
+ command = "npm install && npm run build"
+
+ [build.environment]
+ NODE_VERSION = "18"
+
+ # Redirect all routes to index.html for React Router
+ [[redirects]]
+ from = "/*"
+ to = "/index.html"
+ status = 200
+
+ # Netlify Functions (if using the serverless backend)
+ [functions]
+ directory = "netlify-functions"
+ node_bundler = "esbuild"
+
+ # Headers for API routes
+ [[headers]]
+ for = "/.netlify/functions/*"
+   [headers.values]
+   Access-Control-Allow-Origin = "*"
+   Access-Control-Allow-Headers = "Content-Type"
+   Access-Control-Allow-Methods = "GET, POST, OPTIONS"
+
requirements.txt ADDED
@@ -0,0 +1,12 @@
+ gradio>=4.0.0
+ plotly>=5.18.0
+ pandas>=2.0.0
+ numpy>=1.24.0
+ sentence-transformers>=2.2.0
+ umap-learn>=0.5.4
+ scikit-learn>=1.3.0
+ datasets>=2.14.0
+ huggingface-hub>=0.17.0
+ tqdm>=4.66.0
+ python-dotenv>=1.0.0
+
test_local.py ADDED
@@ -0,0 +1,13 @@
+ """
+ Quick test script to verify the application works locally.
+ Run this before deploying to Hugging Face Spaces.
+ """
+ from app import create_interface
+
+ if __name__ == "__main__":
+     print("Creating interface...")
+     demo = create_interface()
+     print("Launching demo...")
+     demo.launch(share=False, server_name="127.0.0.1", server_port=7860)
+