---
license: apache-2.0
base_model: jhu-clsp/mmBERT-base
tags:
- content-safety
- text-classification
- lora
- peft
- mlcommons
- ai-safety
- hazard-classification
- moderation
datasets:
- nvidia/Aegis-AI-Content-Safety-Dataset-2.0
- llm-semantic-router/mlcommons-ai-safety-synth
language:
- en
- multilingual
metrics:
- f1
- recall
- accuracy
pipeline_tag: text-classification
library_name: peft
---

# MLCommons AI Safety Classifier - Level 2 (9-Class Hazard)

A LoRA-finetuned multilingual BERT model for 9-class hazard category classification, following the MLCommons AI Safety Hazard Taxonomy.

## Model Description

This is Level 2 of a hierarchical safety classification system:

- **Level 1**: Binary classification (safe vs. unsafe)
- **Level 2 (this model)**: 9-class hazard category classification

The model uses **mmBERT** (Multilingual ModernBERT) as the base, supporting 1800+ languages.

## Training Results

| Metric | Value |
|--------|-------|
| **Recall (macro)** | 93.5% |
| **F1 Score (macro)** | 93.5% |
| **Accuracy** | 93.5% |
| **Precision (macro)** | 93.4% |

## Hazard Categories (MLCommons AI Safety Taxonomy)

| ID | Category | Description |
|----|----------|-------------|
| S1 | Violent Crimes | Murder, assault, kidnapping, terrorism |
| S2 | Non-Violent Crimes | Fraud, theft, cybercrime, drug trafficking |
| S3 | Sex Crimes | Sexual assault, CSAM, sexual exploitation |
| S5 | Weapons & CBRNE | Weapons creation; chemical/biological/nuclear threats |
| S6 | Self-Harm | Suicide, self-injury, eating disorders |
| S7 | Hate | Discrimination, slurs, hate speech |
| S8 | Specialized Advice | Unqualified medical, legal, financial advice |
| S9 | Privacy | PII exposure, surveillance, data harvesting |
| S13 | Misinformation | Disinformation, conspiracy theories, false claims |

## Training Data

- **Total samples**: ~20,000 (balanced across categories)
- **Sources**:
  - [AEGIS AI Content Safety Dataset 2.0](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0) (~18,000 samples)
  - [MLCommons AI Safety Synth](https://huggingface.co/datasets/llm-semantic-router/mlcommons-ai-safety-synth) (12,000 synthesized samples for weak categories)

### Synthesized Data Distribution

The synthetic dataset targets previously underrepresented categories:

- S2 (Non-Violent Crimes): 2,000 samples
- S6 (Self-Harm): 2,000 samples
- S7 (Hate): 2,000 samples
- S9 (Privacy): 2,000 samples
- S11 (Elections): 2,000 samples
- S13 (Misinformation): 2,000 samples

Note that S11 (Elections) appears in the synthetic dataset but is not one of this model's nine output classes.

## Model Architecture & Training

### Base Model

- **Model**: [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base)
- **Architecture**: ModernBERT (314M parameters)

### LoRA Configuration

| Parameter | Value |
|-----------|-------|
| Rank (r) | 32 |
| Alpha | 64 |
| Dropout | 0.1 |
| Target Modules | `attn.Wqkv`, `attn.Wo`, `mlp.Wi`, `mlp.Wo` |
| Trainable Parameters | 6.76M (2.15%) |

### Training Hyperparameters

| Parameter | Value |
|-----------|-------|
| Epochs | 10 |
| Batch Size | 64 |
| Learning Rate | 3e-4 |
| Optimizer | AdamW |
| Scheduler | Linear warmup |

## Hardware & Environment

| Component | Specification |
|-----------|---------------|
| GPU | AMD Instinct MI300X |
| VRAM | 192GB HBM3 |
| Platform | ROCm 6.2 |
| Container | `rocm/pytorch:rocm6.2_ubuntu22.04_py3.10_pytorch_release_2.3.0` |
| Training Time | ~3.5 minutes |

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

# Load base model and tokenizer
base_model = "jhu-clsp/mmBERT-base"
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mlcommons-safety-classifier-level2-hazard")
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=9)
model = PeftModel.from_pretrained(model, "llm-semantic-router/mlcommons-safety-classifier-level2-hazard")

# Classify
text = "How to hack into someone's email account"
inputs = tokenizer(text,
                   return_tensors="pt", truncation=True, max_length=512)
outputs = model(**inputs)
prediction = outputs.logits.argmax(-1).item()

# Label mapping
labels = [
    "S1_violent_crimes", "S2_nonviolent_crimes", "S3_sex_crimes",
    "S5_weapons_cbrne", "S6_self_harm", "S7_hate",
    "S8_specialized_advice", "S9_privacy", "S13_misinformation"
]
print(f"Hazard Category: {labels[prediction]}")
```

## Label Mapping

```json
{
  "S1_violent_crimes": 0,
  "S2_nonviolent_crimes": 1,
  "S3_sex_crimes": 2,
  "S5_weapons_cbrne": 3,
  "S6_self_harm": 4,
  "S7_hate": 5,
  "S8_specialized_advice": 6,
  "S9_privacy": 7,
  "S13_misinformation": 8
}
```

## Hierarchical Usage (Recommended)

For production use, combine Level 1 and Level 2:

```python
# Step 1: Binary classification (Level 1)
level1_pred = level1_model(inputs)

if level1_pred == "unsafe":
    # Step 2: Hazard classification (Level 2)
    hazard_category = level2_model(inputs)
```

## Intended Use

This model is designed for:

- Detailed hazard categorization of unsafe content
- Content moderation with specific policy enforcement
- Safety analytics and reporting
- Research on content safety classification

## Limitations

- Optimized for English but supports 1800+ languages via mmBERT
- Should be used after Level 1 filtering for efficiency
- Some categories may have regional/cultural variations
- May require domain-specific fine-tuning for specialized applications

## Citation

```bibtex
@misc{mlcommons-safety-classifier,
  title={MLCommons AI Safety Classifier},
  author={LLM Semantic Router Team},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/llm-semantic-router/mlcommons-safety-classifier-level2-hazard}
}
```

## License

Apache 2.0
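## Appendix: Decoding Logits to Labels

The label mapping in the "Label Mapping" section above can be folded into a small post-processing helper that turns a raw 9-dimensional logit vector into a category name plus a softmax confidence. This is a minimal sketch in plain Python; the `decode_hazard` helper and the example logits are illustrative, not part of the released model:

```python
import math

# Index-to-label mapping, matching the "Label Mapping" section of this card.
HAZARD_LABELS = [
    "S1_violent_crimes", "S2_nonviolent_crimes", "S3_sex_crimes",
    "S5_weapons_cbrne", "S6_self_harm", "S7_hate",
    "S8_specialized_advice", "S9_privacy", "S13_misinformation",
]

def decode_hazard(logits):
    """Map a 9-dim logit vector to (label, softmax confidence)."""
    if len(logits) != len(HAZARD_LABELS):
        raise ValueError(f"expected {len(HAZARD_LABELS)} logits, got {len(logits)}")
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return HAZARD_LABELS[best], probs[best]

# Illustrative logits where index 7 (S9_privacy) dominates.
label, conf = decode_hazard([0.1, 0.2, -1.0, 0.0, -0.5, 0.3, 0.2, 4.0, 0.1])
print(label, round(conf, 3))  # label is "S9_privacy"
```

In the real pipeline the input would be `outputs.logits[0].tolist()` from the usage example above; keeping the confidence lets a moderation system set a per-category threshold instead of trusting every argmax.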
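## Appendix: Hierarchical Routing Sketch

The two-step flow in "Hierarchical Usage" above can be written as an ordinary routing function. The `level1_stub` and `level2_stub` callables below are stand-ins for the real Level 1 and Level 2 model wrappers (they are not part of this release); only the routing logic is the point of the sketch:

```python
def route(text, level1, level2):
    """Two-stage moderation: run the cheap binary check first,
    and only invoke the 9-class hazard model on unsafe inputs."""
    if level1(text) == "safe":
        return {"verdict": "safe", "hazard": None}
    return {"verdict": "unsafe", "hazard": level2(text)}

# Stub classifiers standing in for the two models.
level1_stub = lambda t: "unsafe" if "hack" in t.lower() else "safe"
level2_stub = lambda t: "S2_nonviolent_crimes"

print(route("What's the weather today?", level1_stub, level2_stub))
# → {'verdict': 'safe', 'hazard': None}
print(route("How to hack into someone's email account", level1_stub, level2_stub))
# → {'verdict': 'unsafe', 'hazard': 'S2_nonviolent_crimes'}
```

Skipping Level 2 on safe traffic is what makes the hierarchy cheaper than running the 9-class model on every request.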
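## Appendix: LoRA Configuration Sketch

The values in the "LoRA Configuration" table above correspond to a `peft.LoraConfig` roughly like the one below. This is a reconstruction from the table, assuming peft's standard API; the actual training script is not published with this card:

```python
from peft import LoraConfig, TaskType

# LoRA adapter settings matching the table above (a reconstruction,
# not the original training script).
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=32,             # rank
    lora_alpha=64,
    lora_dropout=0.1,
    # ModernBERT attention/MLP projections, per the "Target Modules" row
    target_modules=["attn.Wqkv", "attn.Wo", "mlp.Wi", "mlp.Wo"],
)

# The adapter would then be attached to the base classifier with:
#   model = get_peft_model(base_model, lora_config)
```

With rank 32 over these four projection types, only the low-rank adapter matrices (plus the classification head) train, which is consistent with the ~2% trainable-parameter figure reported above.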