SmolLM2-360M - DPO
Model Description
This model is a merged standalone model fine-tuned from HuggingFaceTB/SmolLM2-360M using the DPO training method.
Direct Preference Optimization (DPO): paired preference optimization without a separate reward model.
This model was developed as part of thesis research on LLM Alignment using Preference Optimization Methods.
Model Details
| Property | Value |
|---|---|
| Base Model | HuggingFaceTB/SmolLM2-360M |
| Training Method | DPO |
| Model Type | Merged Standalone Model |
| Training Date | December 2025 |
| Framework | PyTorch + Transformers + PEFT |
Benchmark Results
| Benchmark | Score |
|---|---|
| HellaSwag (10-shot) | 0.550 |
| TruthfulQA (0-shot MC2) | 0.361 |
| MMLU-Mini (5-shot) | 0.264 |
Comparative Analysis
A chart comparing this method against other training approaches on the same base model is provided in thesis_plots/benchmark_results.png.
Training Configuration
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Batch Size | 2 |
| Gradient Accumulation | 8 |
| Effective Batch Size | 16 |
| Learning Rate | 2e-4 |
| Max Sequence Length | 512 |
| LoRA Rank | 16 |
| LoRA Alpha | 32 |
| Dataset | Combined Preference Dataset (HH-RLHF + SHP + OpenAssistant) |
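As a rough illustration, the configuration above maps onto a TRL `DPOTrainer` setup along the following lines. This is a minimal sketch under assumptions, not the exact thesis training script: the output directory is invented, the toy `train_dataset` stands in for the combined preference dataset described below, and argument names (e.g. `processing_class`) vary between TRL versions.

```python
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "HuggingFaceTB/SmolLM2-360M"
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Toy stand-in for the combined preference dataset (prompt / chosen / rejected)
train_dataset = Dataset.from_dict({
    "prompt": ["What is the capital of France?"],
    "chosen": ["The capital of France is Paris."],
    "rejected": ["I am not sure."],
})

# LoRA adapter matching the table above (rank 16, alpha 32)
peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

# Hyperparameters matching the table above (effective batch size 2 x 8 = 16)
args = DPOConfig(
    output_dir="smollm2-360m-dpo",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    max_length=512,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # named `tokenizer` in older TRL releases
    peft_config=peft_config,
)
trainer.train()
```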
Combined Preference Dataset (kto_combined)
Training uses a Combined Preference Dataset built via Round-Robin Sampling from three sources:
| Source | Total Samples | Interactions |
|---|---|---|
| Anthropic HH-RLHF | 321,600 | 61,568 |
| Stanford Human Preferences (SHP) | 697,436 | 38,984 |
| OpenAssistant Conversations v1 | 16,810 | 8,904 |
| Total | 1,035,846 | 109,456 |
Actual Training Statistics (subset split train_prefs[:32090]):
- Training samples: 13,300 (paired examples)
- Validation samples: 700 (5%)
- Round-Robin distribution: 1,130 interactions per source
- Seed: 42 (for reproducibility)
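A minimal sketch of the round-robin interleaving described above; the function and variable names are illustrative only and do not reproduce the thesis preprocessing code.

```python
import random


def round_robin_interleave(sources, per_source, seed=42):
    """Interleave an equal number of samples from each source, cycling
    source-by-source so no single dataset dominates the mixture."""
    rng = random.Random(seed)
    # Shuffle each source independently, then keep the first `per_source` items.
    pools = [rng.sample(src, k=min(per_source, len(src))) for src in sources]
    mixed = []
    for i in range(per_source):
        for pool in pools:
            if i < len(pool):
                mixed.append(pool[i])
    return mixed


# Illustrative call: three preference sources, 1,130 interactions each, seed 42
# combined = round_robin_interleave([hh_rlhf, shp, oasst], per_source=1130)
```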
Usage
Direct Loading (Merged Model)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the merged standalone model (no separate adapter required)
model_id = "Nishef/SmolLM2-360M-Full_DPO_20251225_043457-merged"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Generate text
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Training Methodology
DPO
Direct Preference Optimization: paired preference optimization without a separate reward model.
Key Features:
- Paired preference optimization
- Direct policy optimization without reward model
- Efficient single-stage training
- Bradley-Terry preference modeling
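A minimal PyTorch sketch of the resulting loss, assuming per-sequence log-probabilities (summed over completion tokens) have already been computed for the policy and the frozen reference model; variable names and the example values are illustrative.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss: maximize the margin between chosen and rejected completions,
    measured as log-probability ratios against the frozen reference model."""
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each completion
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry preference likelihood -> negative log-sigmoid of the margin
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Illustrative usage with dummy per-sequence log-probabilities
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.5]))
```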
Citation
If you use this model in your research, please cite:
```bibtex
@misc{smollm2_360m_dpo_2025,
  title     = {SmolLM2-360M Fine-tuned with DPO},
  author    = {Thesis Research},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Nishef/SmolLM2-360M-Full_DPO_20251225_043457}
}
```
Repository Structure
```text
.
├── adapter_config.json          # LoRA configuration
├── adapter_model.safetensors    # Model weights
├── tokenizer files              # Tokenizer configuration
├── eval_summary.csv             # Evaluation results
├── thesis_plots/                # Visualization assets
│   ├── benchmark_results.png
│   └── training_loss.png
└── README.md                    # This file
```
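If only the adapter files listed above are available, they can be attached to the base model and folded into its weights with PEFT. This is a minimal sketch; the adapter repo id is assumed from the citation URL.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "HuggingFaceTB/SmolLM2-360M"
adapter_id = "Nishef/SmolLM2-360M-Full_DPO_20251225_043457"  # assumed adapter repo

base_model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Attach the LoRA adapter, then merge its weights into the base model
model = PeftModel.from_pretrained(base_model, adapter_id)
model = model.merge_and_unload()
```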
Acknowledgments
- Base Model: HuggingFaceTB/SmolLM2-360M
- Training Framework: Hugging Face Transformers
- Fine-tuning Library: PEFT
License
This model is released under the Apache 2.0 license.
This model was created as part of thesis research on LLM alignment using preference optimization methods.