---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- alignment
- preference-optimization
- dpo
- thesis-research
- kto
- fine-tuned
base_model: Qwen/Qwen3-0.6B
datasets:
- Anthropic/hh-rlhf
- stanfordnlp/shp
- OpenAssistant/oasst1
pipeline_tag: text-generation
---
# Qwen3-0.6B - DPO
## Model Description
This model is a **LoRA adapter** fine-tuned from [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) using **DPO** (Direct Preference Optimization).
**Direct Preference Optimization - paired preference optimization without an explicit reward model**
This model was developed as part of thesis research on **LLM Alignment using Preference Optimization Methods**.
## Model Details
| Property | Value |
|----------|-------|
| **Base Model** | [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) |
| **Training Method** | DPO |
| **Model Type** | LoRA Adapter |
| **Training Date** | December 2025 |
| **Framework** | PyTorch + Transformers + PEFT |
## Benchmark Results
| Benchmark | Score |
|-----------|-------|
| HellaSwag (10-shot) | 0.456 |
| TruthfulQA (0-shot MC2) | 0.422 |
| MMLU-Mini (5-shot) | 0.368 |
### Comparative Analysis
A chart comparing this method against other training approaches on the same base model is provided in `thesis_plots/benchmark_results.png`.

## Training Configuration
| Parameter | Value |
|-----------|-------|
| **Epochs** | 1 |
| **Batch Size** | 2 |
| **Gradient Accumulation** | 8 |
| **Effective Batch Size** | 16 |
| **Learning Rate** | 2e-4 |
| **Max Sequence Length** | 512 |
| **LoRA Rank** | 16 |
| **LoRA Alpha** | 32 |
| **Dataset** | Combined Preference Dataset (HH-RLHF + SHP + OpenAssistant) |
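The LoRA hyperparameters above (rank 16, alpha 32) could be expressed as a `peft` `LoraConfig` roughly as follows. This is a sketch, not the exact configuration used: `target_modules` and `lora_dropout` are assumptions, and the authoritative values live in this repo's `adapter_config.json`.

```python
from peft import LoraConfig

# Sketch of the adapter configuration implied by the table above.
# target_modules is a guess typical for Qwen-style attention blocks;
# lora_dropout is assumed (not stated in the table). The actual values
# are recorded in adapter_config.json.
lora_config = LoraConfig(
    r=16,                    # LoRA rank
    lora_alpha=32,           # scaling factor: alpha / r = 2.0
    lora_dropout=0.05,       # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```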
### Combined Preference Dataset (kto_combined)
Training uses a Combined Preference Dataset built via Round-Robin Sampling from three sources:
| Source | Total Samples | Interactions |
|--------|---------------|--------------|
| **Anthropic HH-RLHF** | 321,600 | 61,568 |
| **Stanford Human Preferences (SHP)** | 697,436 | 38,984 |
| **OpenAssistant Conversations v1** | 16,810 | 8,904 |
| **Total** | **1,035,846** | **109,456** |
**Actual Training Statistics (subset split `train_prefs[:32090]`)**:
- Training samples: **13,300** (paired examples)
- Validation samples: **700** (5%)
- Round-Robin distribution: **1,130** interactions per source
- Seed: **42** (for reproducibility)
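The round-robin construction described above can be illustrated with a minimal sketch: take one interaction from each source in turn until every source is exhausted. The sample IDs below are placeholders, not real dataset entries.

```python
from itertools import chain, zip_longest

def round_robin(*sources):
    """Interleave items from several sources, one item per source per pass."""
    _SKIP = object()  # sentinel padding for exhausted sources
    return [
        item
        for item in chain.from_iterable(zip_longest(*sources, fillvalue=_SKIP))
        if item is not _SKIP
    ]

# Placeholder interactions standing in for the three preference sources.
hh = ["hh_1", "hh_2", "hh_3"]
shp = ["shp_1", "shp_2"]
oasst = ["oasst_1"]

print(round_robin(hh, shp, oasst))
# ['hh_1', 'shp_1', 'oasst_1', 'hh_2', 'shp_2', 'hh_3']
```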
## Usage
### Loading as LoRA Adapter
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

# Attach the LoRA adapter
model = PeftModel.from_pretrained(base_model, "Nishef/Qwen3-0.6B-Full_DPO_20251225_130318")

# Generate text
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Methodology
### DPO
Direct Preference Optimization - paired preference optimization without an explicit reward model
#### Key Features:
- Paired preference optimization
- Direct policy optimization without an explicit reward model
- Efficient single-stage training
- Bradley-Terry preference modeling
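These properties follow from the DPO objective, which scores each chosen/rejected pair through the policy's log-probability margins over a frozen reference model, with no separately trained reward model. A minimal per-example sketch (sequence log-probabilities are assumed precomputed; `beta=0.1` is a common default, not a value stated in this card):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each argument is a summed sequence log-probability; the margins are taken
    relative to the frozen reference model, following the Bradley-Terry model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Equal margins give logits = 0, so the loss is -log(0.5) = ln 2.
print(dpo_loss(0.0, 0.0, 0.0, 0.0))  # ≈ 0.6931
```

The loss shrinks as the policy raises the chosen response's likelihood relative to the rejected one, which is the single-stage training behavior listed above.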
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{qwen3_0.6b_dpo_2025,
  title     = {Qwen3-0.6B Fine-tuned with DPO},
  author    = {Thesis Research},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Nishef/Qwen3-0.6B-Full_DPO_20251225_130318}
}
```
## Repository Structure
```
.
├── adapter_config.json        # LoRA configuration
├── adapter_model.safetensors  # Model weights
├── tokenizer files            # Tokenizer configuration
├── eval_summary.csv           # Evaluation results
├── thesis_plots/              # Visualization assets
│   ├── benchmark_results.png
│   └── training_loss.png
└── README.md                  # This file
```
## Acknowledgments
- **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
- **Training Framework**: [Hugging Face Transformers](https://github.com/huggingface/transformers)
- **Fine-tuning Library**: [PEFT](https://github.com/huggingface/peft)
## License
This model is released under the Apache 2.0 license.
---
*This model was created as part of thesis research on LLM alignment using preference optimization methods.*