---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- alignment
- preference-optimization
- dpo
- thesis-research
- kto
- fine-tuned
base_model: Qwen/Qwen3-0.6B
datasets:
- Anthropic/hh-rlhf
- stanfordnlp/shp
- OpenAssistant/oasst1
pipeline_tag: text-generation
---
# Qwen3-0.6B - DPO
## Model Description
This model is a **LoRA adapter** fine-tuned from [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) using **DPO** (Direct Preference Optimization).
**Direct Preference Optimization - paired preference optimization without an explicit reward model**
This model was developed as part of thesis research on **LLM Alignment using Preference Optimization Methods**.
## Model Details
| Property | Value |
|----------|-------|
| **Base Model** | [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) |
| **Training Method** | DPO |
| **Model Type** | LoRA Adapter |
| **Training Date** | December 2025 |
| **Framework** | PyTorch + Transformers + PEFT |
## Benchmark Results
| Benchmark | Score |
|-----------|-------|
| HellaSwag (10-shot) | 0.456 |
| TruthfulQA (0-shot MC2) | 0.422 |
| MMLU-Mini (5-shot) | 0.368 |
### Comparative Analysis
A chart comparing this method against other training approaches on the same base model is provided in `thesis_plots/benchmark_results.png`.

## Training Configuration
| Parameter | Value |
|-----------|-------|
| **Epochs** | 1 |
| **Batch Size** | 2 |
| **Gradient Accumulation** | 8 |
| **Effective Batch Size** | 16 |
| **Learning Rate** | 2e-4 |
| **Max Sequence Length** | 512 |
| **LoRA Rank** | 16 |
| **LoRA Alpha** | 32 |
| **Dataset** | Combined Preference Dataset (HH-RLHF + SHP + OpenAssistant) |
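The LoRA hyperparameters above (rank 16, alpha 32) could be expressed as a `peft` `LoraConfig` roughly as follows. This is a sketch, not the exact configuration used: `target_modules` and `lora_dropout` are assumptions, and the authoritative values live in this repo's `adapter_config.json`.

```python
from peft import LoraConfig

# Sketch of the adapter configuration implied by the table above.
# target_modules is a guess typical for Qwen-style attention blocks;
# lora_dropout is assumed (not stated in the table). The actual values
# are recorded in adapter_config.json.
lora_config = LoraConfig(
    r=16,                    # LoRA rank
    lora_alpha=32,           # scaling factor: alpha / r = 2.0
    lora_dropout=0.05,       # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```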
### Combined Preference Dataset (kto_combined)
Training uses a Combined Preference Dataset built via Round-Robin Sampling from three sources:
| Source | Total Samples | Interactions |
|--------|---------------|--------------|
| **Anthropic HH-RLHF** | 321,600 | 61,568 |
| **Stanford Human Preferences (SHP)** | 697,436 | 38,984 |
| **OpenAssistant Conversations v1** | 16,810 | 8,904 |
| **Total** | **1,035,846** | **109,456** |
**Actual Training Statistics (subset split `train_prefs[:32090]`)**:
- Training samples: **13,300** (paired examples)
- Validation samples: **700** (5%)
- Round-Robin distribution: **1,130** interactions per source
- Seed: **42** (for reproducibility)
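The round-robin construction described above can be illustrated with a minimal sketch: take one interaction from each source in turn until every source is exhausted. The sample IDs below are placeholders, not real dataset entries.

```python
from itertools import chain, zip_longest

def round_robin(*sources):
    """Interleave items from several sources, one item per source per pass."""
    _SKIP = object()  # sentinel padding for exhausted sources
    return [
        item
        for item in chain.from_iterable(zip_longest(*sources, fillvalue=_SKIP))
        if item is not _SKIP
    ]

# Placeholder interactions standing in for the three preference sources.
hh = ["hh_1", "hh_2", "hh_3"]
shp = ["shp_1", "shp_2"]
oasst = ["oasst_1"]

print(round_robin(hh, shp, oasst))
# ['hh_1', 'shp_1', 'oasst_1', 'hh_2', 'shp_2', 'hh_3']
```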
## Usage
### Loading as LoRA Adapter
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

# Attach the LoRA adapter
model = PeftModel.from_pretrained(base_model, "Nishef/Qwen3-0.6B-Full_DPO_20251225_130318")

# Generate text
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Methodology
### DPO
Direct Preference Optimization - paired preference optimization without an explicit reward model
#### Key Features:
- Paired preference optimization
- Direct policy optimization without an explicit reward model
- Efficient single-stage training
- Bradley-Terry preference modeling
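These properties follow from the DPO objective, which scores each chosen/rejected pair through the policy's log-probability margins over a frozen reference model, with no separately trained reward model. A minimal per-example sketch (sequence log-probabilities are assumed precomputed; `beta=0.1` is a common default, not a value stated in this card):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each argument is a summed sequence log-probability; the margins are taken
    relative to the frozen reference model, following the Bradley-Terry model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Equal margins give logits = 0, so the loss is -log(0.5) = ln 2.
print(dpo_loss(0.0, 0.0, 0.0, 0.0))  # ≈ 0.6931
```

The loss shrinks as the policy raises the chosen response's likelihood relative to the rejected one, which is the single-stage training behavior listed above.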
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{qwen3_0.6b_dpo_2025,
  title     = {Qwen3-0.6B Fine-tuned with DPO},
  author    = {Thesis Research},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Nishef/Qwen3-0.6B-Full_DPO_20251225_130318}
}
```
## Repository Structure
```
.
├── adapter_config.json        # LoRA configuration
├── adapter_model.safetensors  # Model weights
├── tokenizer files            # Tokenizer configuration
├── eval_summary.csv           # Evaluation results
├── thesis_plots/              # Visualization assets
│   ├── benchmark_results.png
│   └── training_loss.png
└── README.md                  # This file
```
## Acknowledgments
- **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
- **Training Framework**: [Hugging Face Transformers](https://github.com/huggingface/transformers)
- **Fine-tuning Library**: [PEFT](https://github.com/huggingface/peft)
## License
This model is released under the Apache 2.0 license.
---
*This model was created as part of thesis research on LLM alignment using preference optimization methods.*