---
language:
- en
license: apache-2.0
base_model: Qwen/Qwen3-0.6B-Base
tags:
- content-safety
- content-moderation
- safety
- lora
- fine-tuned
- nvidia-aegis
- text-classification
datasets:
- nvidia/Aegis-AI-Content-Safety-Dataset-2.0
metrics:
- perplexity
- accuracy
library_name: transformers
pipeline_tag: text-classification
---

# Qwen3-0.6B Fine-tuned on Aegis AI Content Safety

## Model Description

This is a fine-tuned version of Qwen3-0.6B-Base, optimized for content safety classification and moderation tasks. Qwen3 is a compact yet capable language model developed by Alibaba Cloud, designed for efficient deployment while maintaining strong performance.

This model was fine-tuned using **LoRA (Low-Rank Adaptation)** on the [NVIDIA Aegis AI Content Safety Dataset 2.0](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0), which contains diverse examples of safe and unsafe content across multiple categories.

## Model Details

- **Base Model**: [Qwen/Qwen3-0.6B-Base](https://huggingface.co/Qwen/Qwen3-0.6B-Base)
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation)
- **Dataset**: NVIDIA Aegis AI Content Safety Dataset 2.0
- **Training Samples**: 2,000 carefully selected samples
- **Language**: English
- **License**: Apache 2.0

## Capabilities

- Content safety classification
- Toxic content detection
- Harmful content identification
- Safety-aware text generation
- Content moderation assistance

## Intended Use Cases

- Content moderation systems
- Chat application safety filters
- User-generated content screening
- Educational content filtering
- Social media safety monitoring

## Training Configuration

### LoRA Parameters

- **Rank (r)**: 16
- **Alpha**: 32
- **Dropout**: 0.05
- **Target Modules**: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj

### Training Hyperparameters

- **Learning Rate**: 2e-4
- **Batch Size**: 4 (per device)
- **Gradient Accumulation Steps**: 4
- **Effective Batch Size**: 16
- **Epochs**: 3
- **Optimizer**: AdamW (8-bit paged)
- **LR Scheduler**: Cosine with warmup
- **Warmup Ratio**: 0.1
- **FP16 Training**: Yes
- **Max Sequence Length**: 512

## Usage

### Installation

```bash
pip install transformers torch peft
```

### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "ahczhg/qwen3-0.6b-aegis-safety-lora"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Example: Content safety check
prompt = "### Instruction:\nAnalyze this content for safety: 'Your text here'\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        temperature=0.7,
        do_sample=True,
        top_p=0.95
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Advanced Usage with Pipeline

```python
import torch
from transformers import pipeline

# Create text generation pipeline
generator = pipeline(
    "text-generation",
    model="ahczhg/qwen3-0.6b-aegis-safety-lora",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Generate safety analysis
result = generator(
    "### Instruction:\nIs this content safe? 'Hello, how are you?'\n\n### Response:\n",
    max_new_tokens=128,
    temperature=0.7,
    do_sample=True
)
print(result[0]['generated_text'])
```

## Evaluation

The model was evaluated on a held-out test set from the Aegis AI Content Safety Dataset.
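As background for the perplexity metric, the sketch below shows how per-sequence cross-entropy losses (in nats, as returned by a causal LM's `loss`) aggregate into a single corpus-level perplexity. This is a minimal illustration of the metric itself, not the actual evaluation script used for this model:

```python
import math

def corpus_perplexity(seq_losses, seq_lengths):
    """Token-weighted perplexity from per-sequence mean cross-entropy losses."""
    total_nll = sum(loss * n for loss, n in zip(seq_losses, seq_lengths))
    total_tokens = sum(seq_lengths)
    return math.exp(total_nll / total_tokens)

# Two hypothetical sequences: mean CE loss 2.0 nats over 10 tokens,
# and 1.0 nats over 30 tokens -> exp((20 + 30) / 40) ≈ 3.49
print(round(corpus_perplexity([2.0, 1.0], [10, 30]), 2))
```

Weighting each sequence's loss by its token count avoids biasing the result toward short sequences.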
Key metrics include:

- Perplexity on the validation set
- Content safety classification accuracy
- False positive/negative rates for harmful content detection

## Limitations

- The model is primarily trained on English-language content
- Performance may vary on domain-specific or highly technical content
- It should be used as part of a comprehensive content moderation system, not as the sole decision-maker
- It may require further fine-tuning for specific use cases or content domains
- The model's outputs should be reviewed by human moderators in critical applications

## Ethical Considerations

- This model is designed to assist in content safety and moderation tasks
- It should not be used to censor legitimate speech or suppress diverse viewpoints
- Decisions about content moderation should involve human oversight
- The model may reflect biases present in the training data
- Users should implement appropriate safeguards and appeal processes

## Training Data

The model was fine-tuned on the [NVIDIA Aegis AI Content Safety Dataset 2.0](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0), which includes:

- Diverse examples of safe and unsafe content
- Multiple categories of potentially harmful content
- Balanced representation of safe content
- Real-world scenarios and edge cases

## Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{qwen3_0.6b_aegis_safety,
  author = {ahczhg},
  title = {Qwen3-0.6B Fine-tuned on Aegis AI Content Safety},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/ahczhg/qwen3-0.6b-aegis-safety-lora}},
  note = {Fine-tuned on NVIDIA Aegis AI Content Safety Dataset 2.0}
}
```

## Acknowledgments

- Base model by the original authors: Qwen
- Dataset provided by NVIDIA
- Fine-tuning performed using the HuggingFace Transformers and PEFT libraries

## Contact

For questions, issues, or feedback, please visit the [model repository](https://huggingface.co/ahczhg/qwen3-0.6b-aegis-safety-lora).

## Model Card Authors

- ahczhg

## Model Card Contact

- https://huggingface.co/ahczhg

[![Support me on Ko-fi](https://ko-fi.com/img/githubbutton_sm.svg)](https://ko-fi.com/ahczhg)