SmolLM2-360M - Enhanced KTO

Model Description

This model is a LoRA adapter fine-tuned from HuggingFaceTB/SmolLM2-360M using Enhanced KTO (Kahneman-Tversky Optimization) with authentic Prospect Theory components.

This model was developed as part of thesis research on LLM alignment using preference optimization methods.

Model Details

Property Value
Base Model HuggingFaceTB/SmolLM2-360M
Training Method Enhanced KTO
Model Type LoRA Adapter
Training Date December 2025
Framework PyTorch + Transformers + PEFT

Benchmark Results

Benchmark Score
HellaSwag (10-shot) 0.496
TruthfulQA (0-shot MC2) 0.390
MMLU-Mini (5-shot) 0.289
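
The exact evaluation setup used in the thesis is not stated, so the following is only a sketch of how the HellaSwag entry could be reproduced with EleutherAI's lm-evaluation-harness (the peft= loading path is a harness feature; note that MMLU-Mini appears to be a custom subset rather than a standard harness task):

import lm_eval

# Hypothetical reproduction of the 10-shot HellaSwag score;
# run TruthfulQA (truthfulqa_mc2) analogously with num_fewshot=0
results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=HuggingFaceTB/SmolLM2-360M,"
        "peft=Nishef/SmolLM2-360M-Full_ENHANCED_KTO_20251225_074953"
    ),
    tasks=["hellaswag"],
    num_fewshot=10,
)
print(results["results"]["hellaswag"])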

Enhanced KTO Components

This implementation incorporates multiple Prospect Theory-inspired components:

Component Status Description
Value Function Active Asymmetric loss treatment reflecting loss aversion (losses weighted ~2x gains)
Probability Weighting Active Non-linear transformation of model confidence scores
Odds Ratio Integration Active ORPO-inspired reference-free preference modeling
BCO Shift Implemented but Disabled Binary Classifier Optimization reward shift (see note below)

Note on BCO Shift

The BCO (Binary Classifier Optimization) Shift component was fully implemented in the codebase, following the approach described in recent preference optimization literature. However, it was disabled for the final training run based on the following observations from hyperparameter tuning:

  1. Training Instability: Enabling BCO Shift in conjunction with the Value Function and Probability Weighting led to gradient instability in approximately 40% of training runs
  2. No Significant Improvement: Preliminary experiments did not show meaningful performance gains when BCO was enabled
  3. Complexity Trade-off: The added complexity did not justify the marginal (and inconsistent) benefits

The BCO Shift code remains in the implementation for transparency and future research. We hypothesize that a staged training approach (enabling BCO after initial convergence) or component-specific gradient clipping may enable stable integration.
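
For reference, here is a minimal sketch of what the disabled component computes, assuming it follows BCO's running-mean reward baseline; the class name, the momentum value, and the exact composition with the loss are illustrative, not the thesis code:

import torch
import torch.nn.functional as F

class BCOShift:
    """Running-mean baseline over the implicit rewards (the disabled component)."""

    def __init__(self, momentum: float = 0.99):
        self.momentum = momentum
        self.delta = 0.0  # running estimate of the mean reward

    def update(self, rewards: torch.Tensor) -> float:
        # Exponential moving average of the batch mean, tracked without gradients
        batch_mean = rewards.detach().mean().item()
        self.delta = self.momentum * self.delta + (1 - self.momentum) * batch_mean
        return self.delta

def bco_loss(rewards: torch.Tensor, is_desirable: torch.Tensor, shift: BCOShift) -> torch.Tensor:
    # Center rewards on the running baseline, then score desirable examples as the
    # positive class and undesirable ones as the negative class of a binary classifier
    centered = rewards - shift.update(rewards)
    sign = is_desirable.float() * 2 - 1  # +1 for desirable, -1 for undesirable
    return -F.logsigmoid(sign * centered).mean()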

Training Plots

[Figure: Training loss curve]
[Figure: Rewards during training]
[Figure: KL divergence]
[Figure: Learning rate schedule]

Training Configuration

Parameter Value
Epochs 1
Batch Size 2
Gradient Accumulation 8
Effective Batch Size 16
Learning Rate 2e-4
Max Sequence Length 512
LoRA Rank 16
LoRA Alpha 32
Dataset Combined Preference Dataset (HH-RLHF + SHP + OpenAssistant)
Beta (KTO) 0.1
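
For orientation, the table above maps onto the following configuration objects if expressed with peft and TRL's standard KTOConfig; the Enhanced KTO trainer itself is custom, so treat this as an approximation of the setup rather than the actual training script:

from peft import LoraConfig
from trl import KTOConfig

# LoRA adapter settings from the table above
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    task_type="CAUSAL_LM",
)

# Optimization settings from the table above
training_args = KTOConfig(
    output_dir="smollm2-360m-enhanced-kto",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch size 2 * 8 = 16
    learning_rate=2e-4,
    max_length=512,
    beta=0.1,
)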

Combined Preference Dataset (kto_combined)

Training uses a Combined Preference Dataset built via Round-Robin Sampling from three sources:

Source Total Samples Interactions
Anthropic HH-RLHF 321,600 61,568
Stanford Human Preferences (SHP) 697,436 38,984
OpenAssistant Conversations v1 16,810 8,904
Total 1,035,846 109,456

Actual Training Statistics (subset split train_prefs[:32090]):

  • Training samples: 13,300 (paired examples)
  • Validation samples: 700 (5%)
  • Round-Robin distribution: 1,130 interactions per source
  • Seed: 42 (for reproducibility)
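
The dataset-construction pipeline itself is not published here; the following is a minimal sketch of round-robin interleaving over the three sources, with an illustrative helper and toy lists in place of the real datasets:

def round_robin(*sources):
    """Interleave one sample from each source in turn until all are exhausted."""
    iterators = [iter(s) for s in sources]
    while iterators:
        for it in list(iterators):  # copy so exhausted iterators can be removed
            try:
                yield next(it)
            except StopIteration:
                iterators.remove(it)

# Toy example with three miniature "sources"
hh = ["hh_1", "hh_2", "hh_3"]
shp = ["shp_1", "shp_2"]
oasst = ["oasst_1"]
print(list(round_robin(hh, shp, oasst)))
# ['hh_1', 'shp_1', 'oasst_1', 'hh_2', 'shp_2', 'hh_3']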

Usage

Loading as LoRA Adapter

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")

# Load adapter
model = PeftModel.from_pretrained(base_model, "Nishef/SmolLM2-360M-Full_ENHANCED_KTO_20251225_074953")

# Generate text
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
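
For standalone inference you can optionally fold the adapter into the base weights with peft's merge_and_unload, which returns a plain transformers model without the adapter indirection:

# Optional: merge the LoRA weights into the base model
merged_model = model.merge_and_unload()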

Methodology

Enhanced KTO vs Standard KTO

Enhanced KTO extends the standard KTO algorithm by incorporating additional components from Kahneman and Tversky's Prospect Theory:

  1. Value Function: Standard KTO treats gains and losses symmetrically. Enhanced KTO applies an asymmetric value function where losses are weighted approximately 2x more than equivalent gains, reflecting the psychological phenomenon of loss aversion.

  2. Probability Weighting: Instead of using raw model probabilities, Enhanced KTO applies a non-linear weighting function that overweights small probabilities and underweights large ones, as observed in human decision-making.

  3. Odds Ratio: Borrowing from ORPO, the odds ratio component provides reference-free preference modeling, reducing memory requirements while maintaining alignment quality. All three components are combined in the sketch below.
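
To make the three components concrete, here is a minimal sketch of one way they might compose into a loss. The parameter values alpha = 0.88, lambda = 2.25, and gamma = 0.61 are the classic Tversky-Kahneman (1992) estimates (lambda = 2.25 matches the "~2x" loss aversion described above), and the function names and composition are illustrative rather than the thesis implementation:

import torch
import torch.nn.functional as F

def value_function(x: torch.Tensor, alpha: float = 0.88, lam: float = 2.25) -> torch.Tensor:
    # Prospect Theory value: concave over gains, convex and steeper over losses
    gains = torch.clamp(x, min=0.0) ** alpha
    losses = lam * torch.clamp(-x, min=0.0) ** alpha
    return gains - losses

def probability_weight(p: torch.Tensor, gamma: float = 0.61) -> torch.Tensor:
    # Tversky-Kahneman weighting: overweights small probabilities, underweights large ones
    return p ** gamma / (p ** gamma + (1 - p) ** gamma) ** (1 / gamma)

def odds_ratio_term(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor) -> torch.Tensor:
    # ORPO-style reference-free log odds ratio of chosen over rejected completions
    log_odds = (logp_chosen - logp_rejected) - (
        torch.log1p(-torch.exp(logp_chosen)) - torch.log1p(-torch.exp(logp_rejected))
    )
    return -F.logsigmoid(log_odds)

def enhanced_kto_loss(rewards: torch.Tensor, is_desirable: torch.Tensor) -> torch.Tensor:
    # rewards: implicit rewards beta * (log pi_theta - log pi_ref) per example
    sign = is_desirable.float() * 2 - 1  # desirable -> gain branch, undesirable -> loss branch
    signed = sign * rewards
    weights = probability_weight(torch.sigmoid(signed))
    return -(weights * value_function(signed)).mean()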

Citation

@misc{smollm2_360m_enhanced_kto_2025,
  title = {SmolLM2-360M Fine-tuned with Enhanced KTO},
  author = {Thesis Research},
  year = {2025},
  publisher = {HuggingFace},
  note = {BCO Shift implemented but disabled for stability},
  url = {https://huggingface.co/Nishef/SmolLM2-360M-Full_ENHANCED_KTO_20251225_074953}
}

License

Apache 2.0


