---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- text-classification
- modernbert
- dataset-classification
- huggingface-hub
- domain-classification
datasets:
- davanstrien/hf-dataset-domain-labels-v3
base_model: answerdotai/ModernBERT-base
pipeline_tag: text-classification
metrics:
- accuracy
- f1
model-index:
- name: modernbert-hf-dataset-domain-v3
  results:
  - task:
      type: text-classification
      name: Domain Classification
    dataset:
      name: OOD Gold Set (72 examples, multi-model consensus validated)
      type: davanstrien/hf-dataset-domain-labels-v3
      split: test
    metrics:
    - name: Accuracy (with input filter)
      type: accuracy
      value: 0.903
    - name: Macro F1
      type: f1
      value: 0.850
---

# ModernBERT HF Dataset Domain Classifier v3

Classifies HuggingFace dataset cards into 10 domain categories. Built through iterative, data-centric development with LLM-in-the-loop active learning.

## Labels

| Label | Description |
|-------|-------------|
| biology | Genomics, ecology, proteins, life sciences |
| chemistry | Molecules, reactions, materials science |
| climate | Weather, earth observation, environmental monitoring |
| code | Programming, software engineering, code generation |
| cybersecurity | Security threats, malware, vulnerability detection |
| finance | Banking, trading, economics, financial documents |
| legal | Law, regulations, court cases, legal documents |
| math | Mathematics, theorem proving, formal verification |
| medical | Healthcare, clinical, biomedical, drug discovery |
| none | General-purpose, no specific domain, stubs |

## Usage

```python
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="davanstrien/modernbert-hf-dataset-domain-v3",
    top_k=1,
)

result = clf("This dataset contains chest X-ray images for pneumonia detection...")
# [{'label': 'medical', 'score': 0.99}]
```

## Performance

Evaluated on 72 out-of-distribution examples validated by 3-model consensus (Qwen3-235B, DeepSeek-V3, Llama-3.3-70B).
**With recommended input filter** (filters known stub orgs and templates):

| Metric | Value |
|--------|-------|
| Overall accuracy | 90.3% |
| False positives (domain→none) | 3/27 |
| Wrong domain | 0 |
| None recall | 89% |

**Model only** (no input filter):

| Metric | Value |
|--------|-------|
| Overall accuracy | 81.9% |
| Macro F1 | 0.850 |
| 100% recall on | biology, chemistry, code, cybersecurity, finance, legal, math, medical |

### Per-class (model + filter)

| Class | Accuracy |
|-------|----------|
| biology | 83% |
| chemistry | 100% |
| climate | 78% |
| code | 100% |
| cybersecurity | 100% |
| finance | 80% |
| legal | 100% |
| math | 100% |
| medical | 100% |
| none | 89% |

## Recommended Input Filter

For best results, filter known junk before classification:

```python
import re


def should_skip(card_text: str, dataset_id: str = "") -> bool:
    """Return True if the card should be auto-classified as 'none'."""
    # Known stub orgs
    if dataset_id:
        org = dataset_id.split("/")[0] if "/" in dataset_id else ""
        if org in {"french-open-data"}:
            return True

    # HF default template
    if ("Dataset Card for Dataset Name" in card_text
            and "Provide a quick summary" in card_text):
        return True

    # Too short after stripping markup and URLs
    clean = re.sub(r"<[^>]+>", "", card_text)
    clean = re.sub(r"https?://\S+", "", clean)
    clean = re.sub(r"\s+", " ", clean).strip()
    return len(clean) < 100
```

## Training

### Data

3,437 examples from three sources:

- **Tag-based labels** (1,990): derived from existing HuggingFace dataset tags
- **LLM-labelled "none" examples** (607): Qwen3-235B classified untagged datasets
- **Active learning hard negatives** (800): disagreement sampling between the v2 model and Qwen3-4B, adjudicated by Qwen3-235B

### Development History

| Version | OOD Accuracy | Method |
|---------|--------------|--------|
| v1 | 52.8% | Tag-based labels only |
| v2 | 76.4% | + LLM-labelled "none" examples |
| v3 | 84.7% (90.3% with filter) | + Disagreement-based active learning |

### Methodology

1. **Bootstrap from Hub tags** — 2,889 of the initial labels came for free from existing dataset tags
2. **Multi-model consensus validation** — 3 frontier LLMs validated 72 OOD examples to build a gold evaluation set
3. **Disagreement-based active learning** — ran the model on 5K datasets, used a cheap LLM (Qwen3-4B) to find disagreements, and adjudicated with a strong LLM (Qwen3-235B); 382 of the 800 sampled predictions were overturned
4. **Iterative retraining** — each round targeted specific weaknesses identified by error analysis

### Architecture

- Base model: `answerdotai/ModernBERT-base`
- Max sequence length: 2048 tokens
- Class-weighted cross-entropy loss
- 5 epochs, lr=2e-5, effective batch size 16

### Known Limitations

- **French open-data stubs**: datasets from the `french-open-data` org are empty references to external portals. The input filter handles these, but without it the model may confidently misclassify them.
- **HTML-heavy cards**: cards with extensive HTML markup (badges, styled headers) may lose signal. Strip HTML tags before classification.
- **Taxonomy gaps**: there is no "physics" or "engineering" category, so datasets in those areas may be classified as "none" or an adjacent domain.

## How This Was Built

This model was developed through a collaborative workflow between a human ([@davanstrien](https://huggingface.co/davanstrien)) and [Hermes Agent](https://github.com/nousresearch/hermes-agent), an AI coding agent built by NousResearch. The entire data-centric development loop (error diagnosis, LLM-in-the-loop active learning, training, evaluation) was conducted interactively in a single session, with the agent writing and executing scripts, managing HF Jobs, and iterating on the methodology based on results.
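The disagreement-sampling step of that loop can be sketched in a few lines. This is an illustrative reconstruction, not the actual pipeline script; the function name and data shapes are assumptions:

```python
def select_for_adjudication(
    dataset_ids: list[str],
    model_preds: list[str],
    cheap_llm_preds: list[str],
) -> list[str]:
    # Keep only the examples where the classifier and the cheap LLM
    # disagree. Only these are forwarded to the expensive adjudicator
    # (Qwen3-235B), concentrating the labelling budget on hard cases.
    return [
        ds_id
        for ds_id, model_label, llm_label in zip(
            dataset_ids, model_preds, cheap_llm_preds
        )
        if model_label != llm_label
    ]
```

Examples where both models agree are assumed correct and skipped, which is what made auditing 5K datasets affordable while still surfacing the 382 overturned predictions.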
The active learning pipeline — disagreement sampling between ModernBERT and cheap/strong LLMs, three-tier adjudication, and iterative retraining — was designed and executed by the agent as part of a [data-centric model development skill](https://github.com/nousresearch/hermes-agent) that captures reusable patterns for training task-specific models with a focus on data quality.

## Links

- Training dataset: [davanstrien/hf-dataset-domain-labels-v3](https://huggingface.co/datasets/davanstrien/hf-dataset-domain-labels-v3)
- Development repo: contains full training scripts, the evaluation pipeline, and learnings from 3 iterations
- Previous versions: [v1](https://huggingface.co/davanstrien/setfit-hf-dataset-domain-v0), [v2](https://huggingface.co/davanstrien/modernbert-hf-dataset-domain-v2)
- Agent: [Hermes Agent](https://github.com/nousresearch/hermes-agent) by NousResearch
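A note on the class-weighted cross-entropy loss listed under Architecture: one common recipe for imbalanced label sets like this one (a dominant "none" class vs. nine domains) is inverse-frequency weighting. The helper below is a sketch of that recipe under that assumption, not the actual training code; the returned weights would typically be wrapped in a tensor and passed to `torch.nn.CrossEntropyLoss(weight=...)`:

```python
from collections import Counter


def inverse_frequency_weights(labels: list[str], classes: list[str]) -> list[float]:
    # Weight each class by total/count, so under-represented domains
    # (e.g. "climate") contribute more to the loss than the dominant
    # "none" class.
    counts = Counter(labels)
    total = len(labels)
    return [total / counts[c] for c in classes]
```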