---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- text-classification
- modernbert
- dataset-classification
- huggingface-hub
- domain-classification
datasets:
- davanstrien/hf-dataset-domain-labels-v3
base_model: answerdotai/ModernBERT-base
pipeline_tag: text-classification
metrics:
- accuracy
- f1
model-index:
- name: modernbert-hf-dataset-domain-v3
  results:
  - task:
      type: text-classification
      name: Domain Classification
    dataset:
      name: OOD Gold Set (72 examples, multi-model consensus validated)
      type: davanstrien/hf-dataset-domain-labels-v3
      split: test
    metrics:
    - name: Accuracy (with input filter)
      type: accuracy
      value: 0.903
    - name: Macro F1
      type: f1
      value: 0.850
---

# ModernBERT HF Dataset Domain Classifier v3

Classifies HuggingFace dataset cards into 10 domain categories. Built through iterative, data-centric development with LLM-in-the-loop active learning.

## Labels

| Label | Description |
|-------|-------------|
| biology | Genomics, ecology, proteins, life sciences |
| chemistry | Molecules, reactions, materials science |
| climate | Weather, earth observation, environmental monitoring |
| code | Programming, software engineering, code generation |
| cybersecurity | Security threats, malware, vulnerability detection |
| finance | Banking, trading, economics, financial documents |
| legal | Law, regulations, court cases, legal documents |
| math | Mathematics, theorem proving, formal verification |
| medical | Healthcare, clinical, biomedical, drug discovery |
| none | General-purpose, no specific domain, stubs |

## Usage

```python
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="davanstrien/modernbert-hf-dataset-domain-v3",
    top_k=1,
)

result = clf("This dataset contains chest X-ray images for pneumonia detection...")
# [{'label': 'medical', 'score': 0.99}]
```

## Performance

Evaluated on 72 out-of-distribution examples validated by 3-model consensus (Qwen3-235B, DeepSeek-V3, Llama-3.3-70B).
**With recommended input filter** (filters known stub orgs and templates):

| Metric | Value |
|--------|-------|
| Overall accuracy | 90.3% |
| False positives (domain→none) | 3/27 |
| Wrong domain | 0 |
| None recall | 89% |

**Model only** (no input filter):

| Metric | Value |
|--------|-------|
| Overall accuracy | 81.9% |
| Macro F1 | 0.850 |
| 100% recall on | biology, chemistry, code, cybersecurity, finance, legal, math, medical |

### Per-class (model + filter)

| Class | Accuracy |
|-------|----------|
| biology | 83% |
| chemistry | 100% |
| climate | 78% |
| code | 100% |
| cybersecurity | 100% |
| finance | 80% |
| legal | 100% |
| math | 100% |
| medical | 100% |
| none | 89% |

## Recommended Input Filter

For best results, filter known junk before classification:

```python
import re


def should_skip(card_text: str, dataset_id: str = "") -> bool:
    """Return True if the card should be auto-classified as 'none'."""
    # Known stub orgs
    if dataset_id:
        org = dataset_id.split("/")[0] if "/" in dataset_id else ""
        if org in {"french-open-data"}:
            return True

    # HF default template
    if ("Dataset Card for Dataset Name" in card_text
            and "Provide a quick summary" in card_text):
        return True

    # Too short after stripping markup and URLs
    clean = re.sub(r"<[^>]+>", "", card_text)
    clean = re.sub(r"https?://\S+", "", clean)
    clean = re.sub(r"\s+", " ", clean).strip()
    return len(clean) < 100
```

## Training

### Data

3,437 examples from three sources:

- **Tag-based labels** (1,990): derived from existing HuggingFace dataset tags
- **LLM-labelled "none" examples** (607): Qwen3-235B classified untagged datasets
- **Active learning hard negatives** (800): disagreement sampling between the v2 model and Qwen3-4B, adjudicated by Qwen3-235B

### Development History

| Version | OOD Accuracy | Method |
|---------|--------------|--------|
| v1 | 52.8% | Tag-based labels only |
| v2 | 76.4% | + LLM-labelled "none" examples |
| v3 | 84.7% (90.3% with filter) | + Disagreement-based active learning |

### Methodology

1. **Bootstrap from Hub tags** — 2,889 of the initial labels came for free from existing dataset tags
2. **Multi-model consensus validation** — 3 frontier LLMs validated 72 OOD examples to build a gold evaluation set
3. **Disagreement-based active learning** — ran the model on 5K datasets, used a cheap LLM (Qwen3-4B) to find disagreements, and adjudicated with a strong LLM (Qwen3-235B); 382 of the 800 sampled predictions were overturned
4. **Iterative retraining** — each round targeted specific weaknesses identified by error analysis

### Architecture

- Base model: `answerdotai/ModernBERT-base`
- Max sequence length: 2048 tokens
- Class-weighted cross-entropy loss
- 5 epochs, lr=2e-5, effective batch size 16

### Known Limitations

- **French open-data stubs**: datasets from the `french-open-data` org are empty references to external portals. The input filter handles these, but without it the model may confidently misclassify them.
- **HTML-heavy cards**: cards with extensive HTML markup (badges, styled headers) may lose signal. Strip HTML tags before classification.
- **Taxonomy gaps**: there is no "physics" or "engineering" category, so datasets in those areas may be classified as "none" or an adjacent domain.

## How This Was Built

This model was developed through a collaborative workflow between a human ([@davanstrien](https://huggingface.co/davanstrien)) and [Hermes Agent](https://github.com/nousresearch/hermes-agent), an AI coding agent built by NousResearch. The entire data-centric development loop (error diagnosis, LLM-in-the-loop active learning, training, evaluation) was conducted interactively in a single session, with the agent writing and executing scripts, managing HF Jobs, and iterating on the methodology based on results.
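The disagreement-sampling step of that loop can be sketched in a few lines. This is an illustrative reconstruction, not the actual pipeline script; the function name and data shapes are assumptions:

```python
def select_for_adjudication(
    dataset_ids: list[str],
    model_preds: list[str],
    cheap_llm_preds: list[str],
) -> list[str]:
    # Keep only the examples where the classifier and the cheap LLM
    # disagree. Only these are forwarded to the expensive adjudicator
    # (Qwen3-235B), concentrating the labelling budget on hard cases.
    return [
        ds_id
        for ds_id, model_label, llm_label in zip(
            dataset_ids, model_preds, cheap_llm_preds
        )
        if model_label != llm_label
    ]
```

Examples where both models agree are assumed correct and skipped, which is what made auditing 5K datasets affordable while still surfacing the 382 overturned predictions.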
The active learning pipeline — disagreement sampling between ModernBERT and cheap/strong LLMs, three-tier adjudication, and iterative retraining — was designed and executed by the agent as part of a [data-centric model development skill](https://github.com/nousresearch/hermes-agent) that captures reusable patterns for training task-specific models with a focus on data quality.

## Links

- Training dataset: [davanstrien/hf-dataset-domain-labels-v3](https://huggingface.co/datasets/davanstrien/hf-dataset-domain-labels-v3)
- Development repo: contains full training scripts, the evaluation pipeline, and learnings from 3 iterations
- Previous versions: [v1](https://huggingface.co/davanstrien/setfit-hf-dataset-domain-v0), [v2](https://huggingface.co/davanstrien/modernbert-hf-dataset-domain-v2)
- Agent: [Hermes Agent](https://github.com/nousresearch/hermes-agent) by NousResearch
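A note on the class-weighted cross-entropy loss listed under Architecture: one common recipe for imbalanced label sets like this one (a dominant "none" class vs. nine domains) is inverse-frequency weighting. The helper below is a sketch of that recipe under that assumption, not the actual training code; the returned weights would typically be wrapped in a tensor and passed to `torch.nn.CrossEntropyLoss(weight=...)`:

```python
from collections import Counter


def inverse_frequency_weights(labels: list[str], classes: list[str]) -> list[float]:
    # Weight each class by total/count, so under-represented domains
    # (e.g. "climate") contribute more to the loss than the dominant
    # "none" class.
    counts = Counter(labels)
    total = len(labels)
    return [total / counts[c] for c in classes]
```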