---
license: apache-2.0
base_model: jhu-clsp/mmBERT-base
tags:
- content-safety
- text-classification
- lora
- peft
- mlcommons
- ai-safety
- hazard-classification
- moderation
datasets:
- nvidia/Aegis-AI-Content-Safety-Dataset-2.0
- llm-semantic-router/mlcommons-ai-safety-synth
language:
- en
- multilingual
metrics:
- f1
- recall
- accuracy
pipeline_tag: text-classification
library_name: peft
---

# MLCommons AI Safety Classifier - Level 2 (9-Class Hazard)

A LoRA-finetuned multilingual BERT model for 9-class hazard category classification, following the MLCommons AI Safety Hazard Taxonomy.

## Model Description

This is Level 2 of a hierarchical safety classification system:

- **Level 1**: Binary classification (safe vs. unsafe)
- **Level 2 (this model)**: 9-class hazard category classification

The model uses **mmBERT** (Multilingual ModernBERT) as the base, supporting 1800+ languages.

## Training Results

| Metric | Value |
|--------|-------|
| **Recall (macro)** | 93.5% |
| **F1 Score (macro)** | 93.5% |
| **Accuracy** | 93.5% |
| **Precision (macro)** | 93.4% |

## Hazard Categories (MLCommons AI Safety Taxonomy)

| ID | Category | Description |
|----|----------|-------------|
| S1 | Violent Crimes | Murder, assault, kidnapping, terrorism |
| S2 | Non-Violent Crimes | Fraud, theft, cybercrime, drug trafficking |
| S3 | Sex Crimes | Sexual assault, CSAM, sexual exploitation |
| S5 | Weapons & CBRNE | Weapons creation; chemical/biological/nuclear threats |
| S6 | Self-Harm | Suicide, self-injury, eating disorders |
| S7 | Hate | Discrimination, slurs, hate speech |
| S8 | Specialized Advice | Unqualified medical, legal, financial advice |
| S9 | Privacy | PII exposure, surveillance, data harvesting |
| S13 | Misinformation | Disinformation, conspiracy theories, false claims |

## Training Data

- **Total samples**: ~20,000 (balanced across categories)
- **Sources**:
  - [AEGIS AI Content Safety Dataset 2.0](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0) (~18,000 samples)
  - [MLCommons AI Safety Synth](https://huggingface.co/datasets/llm-semantic-router/mlcommons-ai-safety-synth) (12,000 synthesized samples for weak categories)

### Synthesized Data Distribution

The synthetic dataset targets previously underrepresented categories:

- S2 (Non-Violent Crimes): 2,000 samples
- S6 (Self-Harm): 2,000 samples
- S7 (Hate): 2,000 samples
- S9 (Privacy): 2,000 samples
- S11 (Elections): 2,000 samples
- S13 (Misinformation): 2,000 samples

Note that S11 (Elections) appears in the synthetic dataset but is not one of this model's nine output classes.

## Model Architecture & Training

### Base Model

- **Model**: [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base)
- **Architecture**: ModernBERT (314M parameters)

### LoRA Configuration

| Parameter | Value |
|-----------|-------|
| Rank (r) | 32 |
| Alpha | 64 |
| Dropout | 0.1 |
| Target Modules | `attn.Wqkv`, `attn.Wo`, `mlp.Wi`, `mlp.Wo` |
| Trainable Parameters | 6.76M (2.15%) |

### Training Hyperparameters

| Parameter | Value |
|-----------|-------|
| Epochs | 10 |
| Batch Size | 64 |
| Learning Rate | 3e-4 |
| Optimizer | AdamW |
| Scheduler | Linear warmup |

## Hardware & Environment

| Component | Specification |
|-----------|---------------|
| GPU | AMD Instinct MI300X |
| VRAM | 192GB HBM3 |
| Platform | ROCm 6.2 |
| Container | `rocm/pytorch:rocm6.2_ubuntu22.04_py3.10_pytorch_release_2.3.0` |
| Training Time | ~3.5 minutes |

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

# Load base model and tokenizer
base_model = "jhu-clsp/mmBERT-base"
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mlcommons-safety-classifier-level2-hazard")
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=9)
model = PeftModel.from_pretrained(model, "llm-semantic-router/mlcommons-safety-classifier-level2-hazard")

# Classify
text = "How to hack into someone's email account"
inputs = tokenizer(text,
                   return_tensors="pt", truncation=True, max_length=512)
outputs = model(**inputs)
prediction = outputs.logits.argmax(-1).item()

# Label mapping
labels = [
    "S1_violent_crimes", "S2_nonviolent_crimes", "S3_sex_crimes",
    "S5_weapons_cbrne", "S6_self_harm", "S7_hate",
    "S8_specialized_advice", "S9_privacy", "S13_misinformation"
]
print(f"Hazard Category: {labels[prediction]}")
```

## Label Mapping

```json
{
  "S1_violent_crimes": 0,
  "S2_nonviolent_crimes": 1,
  "S3_sex_crimes": 2,
  "S5_weapons_cbrne": 3,
  "S6_self_harm": 4,
  "S7_hate": 5,
  "S8_specialized_advice": 6,
  "S9_privacy": 7,
  "S13_misinformation": 8
}
```

## Hierarchical Usage (Recommended)

For production use, combine Level 1 and Level 2:

```python
# Step 1: Binary classification (Level 1)
level1_pred = level1_model(inputs)

if level1_pred == "unsafe":
    # Step 2: Hazard classification (Level 2)
    hazard_category = level2_model(inputs)
```

## Intended Use

This model is designed for:

- Detailed hazard categorization of unsafe content
- Content moderation with specific policy enforcement
- Safety analytics and reporting
- Research on content safety classification

## Limitations

- Optimized for English but supports 1800+ languages via mmBERT
- Should be used after Level 1 filtering for efficiency
- Some categories may have regional/cultural variations
- May require domain-specific fine-tuning for specialized applications

## Citation

```bibtex
@misc{mlcommons-safety-classifier,
  title={MLCommons AI Safety Classifier},
  author={LLM Semantic Router Team},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/llm-semantic-router/mlcommons-safety-classifier-level2-hazard}
}
```

## License

Apache 2.0
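## Appendix: Decoding Logits to Labels

The label mapping in the "Label Mapping" section above can be folded into a small post-processing helper that turns a raw 9-dimensional logit vector into a category name plus a softmax confidence. This is a minimal sketch in plain Python; the `decode_hazard` helper and the example logits are illustrative, not part of the released model:

```python
import math

# Index-to-label mapping, matching the "Label Mapping" section of this card.
HAZARD_LABELS = [
    "S1_violent_crimes", "S2_nonviolent_crimes", "S3_sex_crimes",
    "S5_weapons_cbrne", "S6_self_harm", "S7_hate",
    "S8_specialized_advice", "S9_privacy", "S13_misinformation",
]

def decode_hazard(logits):
    """Map a 9-dim logit vector to (label, softmax confidence)."""
    if len(logits) != len(HAZARD_LABELS):
        raise ValueError(f"expected {len(HAZARD_LABELS)} logits, got {len(logits)}")
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return HAZARD_LABELS[best], probs[best]

# Illustrative logits where index 7 (S9_privacy) dominates.
label, conf = decode_hazard([0.1, 0.2, -1.0, 0.0, -0.5, 0.3, 0.2, 4.0, 0.1])
print(label, round(conf, 3))  # label is "S9_privacy"
```

In the real pipeline the input would be `outputs.logits[0].tolist()` from the usage example above; keeping the confidence lets a moderation system set a per-category threshold instead of trusting every argmax.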
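## Appendix: Hierarchical Routing Sketch

The two-step flow in "Hierarchical Usage" above can be written as an ordinary routing function. The `level1_stub` and `level2_stub` callables below are stand-ins for the real Level 1 and Level 2 model wrappers (they are not part of this release); only the routing logic is the point of the sketch:

```python
def route(text, level1, level2):
    """Two-stage moderation: run the cheap binary check first,
    and only invoke the 9-class hazard model on unsafe inputs."""
    if level1(text) == "safe":
        return {"verdict": "safe", "hazard": None}
    return {"verdict": "unsafe", "hazard": level2(text)}

# Stub classifiers standing in for the two models.
level1_stub = lambda t: "unsafe" if "hack" in t.lower() else "safe"
level2_stub = lambda t: "S2_nonviolent_crimes"

print(route("What's the weather today?", level1_stub, level2_stub))
# → {'verdict': 'safe', 'hazard': None}
print(route("How to hack into someone's email account", level1_stub, level2_stub))
# → {'verdict': 'unsafe', 'hazard': 'S2_nonviolent_crimes'}
```

Skipping Level 2 on safe traffic is what makes the hierarchy cheaper than running the 9-class model on every request.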
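## Appendix: LoRA Configuration Sketch

The values in the "LoRA Configuration" table above correspond to a `peft.LoraConfig` roughly like the one below. This is a reconstruction from the table, assuming peft's standard API; the actual training script is not published with this card:

```python
from peft import LoraConfig, TaskType

# LoRA adapter settings matching the table above (a reconstruction,
# not the original training script).
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=32,             # rank
    lora_alpha=64,
    lora_dropout=0.1,
    # ModernBERT attention/MLP projections, per the "Target Modules" row
    target_modules=["attn.Wqkv", "attn.Wo", "mlp.Wi", "mlp.Wo"],
)

# The adapter would then be attached to the base classifier with:
#   model = get_peft_model(base_model, lora_config)
```

With rank 32 over these four projection types, only the low-rank adapter matrices (plus the classification head) train, which is consistent with the ~2% trainable-parameter figure reported above.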