---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- alignment
- preference-optimization
- dpo
- thesis-research
- kto
- fine-tuned
base_model: Qwen/Qwen3-0.6B
datasets:
- Anthropic/hh-rlhf
- stanfordnlp/shp
- OpenAssistant/oasst1
pipeline_tag: text-generation
---

# Qwen3-0.6B - DPO
## Model Description

This model is a **LoRA adapter** fine-tuned from [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) using **Direct Preference Optimization (DPO)**: paired preference optimization that trains the policy directly on preference pairs, without a separately trained reward model.

This model was developed as part of thesis research on **LLM Alignment using Preference Optimization Methods**.

## Model Details

| Property | Value |
|----------|-------|
| **Base Model** | [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) |
| **Training Method** | DPO |
| **Model Type** | LoRA Adapter |
| **Training Date** | December 2025 |
| **Framework** | PyTorch + Transformers + PEFT |

## Benchmark Results

| Benchmark | Score |
|-----------|-------|
| HellaSwag (10-shot) | 0.456 |
| TruthfulQA (0-shot MC2) | 0.422 |
| MMLU-Mini (5-shot) | 0.368 |

### Comparative Analysis

The following chart compares this method against other training approaches on the same base model:

![Training Loss Curves](thesis_plots/training_loss.png)

## Training Configuration

| Parameter | Value |
|-----------|-------|
| **Epochs** | 1 |
| **Batch Size** | 2 |
| **Gradient Accumulation** | 8 |
| **Effective Batch Size** | 16 |
| **Learning Rate** | 2e-4 |
| **Max Sequence Length** | 512 |
| **LoRA Rank** | 16 |
| **LoRA Alpha** | 32 |
| **Dataset** | Combined Preference Dataset (HH-RLHF + SHP + OpenAssistant) |

### Combined Preference Dataset (kto_combined)

Training uses a combined preference dataset built via round-robin sampling from three sources:

| Source | Total Samples | Interactions |
|--------|---------------|--------------|
| **Anthropic HH-RLHF** | 321,600 | 61,568 |
| **Stanford Human Preferences (SHP)** | 697,436 | 38,984 |
| **OpenAssistant Conversations v1** | 16,810 | 8,904 |
| **Total** | **1,035,846** | **109,456** |

**Actual Training Statistics (subset split `train_prefs[:32090]`)**:

- Training samples: **13,300** (paired examples)
- Validation samples: **700** (5%)
- Round-Robin distribution: **1,130**
interactions per source
- Seed: **42** (for reproducibility)

## Usage

### Loading as a LoRA Adapter

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

# Attach the DPO LoRA adapter
model = PeftModel.from_pretrained(base_model, "Nishef/Qwen3-0.6B-Full_DPO_20251225_130318")

# Generate text
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Methodology

### DPO

Direct Preference Optimization: paired preference optimization that trains the policy directly on preference pairs, without a separately trained reward model.

#### Key Features:

- Paired preference optimization
- Direct policy optimization without a reward model
- Efficient single-stage training
- Bradley-Terry preference modeling

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{qwen3_0.6b_dpo_2025,
  title     = {Qwen3-0.6B Fine-tuned with DPO},
  author    = {Thesis Research},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Nishef/Qwen3-0.6B-Full_DPO_20251225_130318}
}
```

## Repository Structure

```
.
├── adapter_config.json        # LoRA configuration
├── adapter_model.safetensors  # Adapter weights
├── tokenizer files            # Tokenizer configuration
├── eval_summary.csv           # Evaluation results
├── thesis_plots/              # Visualization assets
│   ├── benchmark_results.png
│   └── training_loss.png
└── README.md                  # This file
```

## Acknowledgments

- **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
- **Training Framework**: [Hugging Face Transformers](https://github.com/huggingface/transformers)
- **Fine-tuning Library**: [PEFT](https://github.com/huggingface/peft)

## License

This model is released under the Apache 2.0 license.

---

*This model was created as part of thesis research on LLM alignment using preference optimization methods.*
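For readers unfamiliar with the Bradley-Terry preference modeling mentioned above, the per-example DPO loss can be sketched as follows. This is a minimal illustration, not the training code used for this model; the function name, the `beta` value, and the log-probability inputs are assumptions for the example.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss under the Bradley-Terry preference model.

    Each argument is the total log-probability of a response
    (chosen or rejected) under the policy or the frozen reference model.
    """
    # Implicit rewards: scaled log-ratios against the reference model
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # -log(sigmoid(margin)), written as log1p(exp(-margin)) for stability;
    # the loss shrinks as the policy widens its preference for the chosen response
    margin = chosen_reward - rejected_reward
    return math.log1p(math.exp(-margin))

# A policy identical to the reference is indifferent: loss = log(2)
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# A policy that prefers the chosen response scores a lower loss
print(dpo_loss(-10.0, -12.0, -11.0, -11.5))
```

In per-batch training the same quantity is averaged over all preference pairs; the `beta` coefficient controls how far the policy may drift from the reference model.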