Model Card for mypo-qwen2.5-coder-1.5b-dpo-v3

Preference-tuned Python coding model that prefers fully type-annotated code by default.


TL;DR

v3 is the first DPO model in the MyPO project that actually shifts argmax decoding past the base. Two complementary measurements are reported — both published, both reproducible:

| metric | base | dpo-v2 | SFT | dpo-v3 | gold (chosen) |
|---|---|---|---|---|---|
| mypy --strict pass — n=150 batched | 6.0% | 6.0% | 92.7% | 92.0% | 100% |
| mypy --strict pass — n=30 single-prompt | 0.0% | 0.0% | 73.3% | 73.3% | |
| annotation slot coverage — n=150 batched | 0.000 | 0.000 | 0.953 | 0.963 | 0.955 |
| annotation slot coverage — n=30 single-prompt | 0.000 | 0.000 | 0.971 | 0.976 | |
| black pass — n=150 batched | 12.0% | 12.0% | 97.3% | 95.3% | 98.0% |
| preference win-rate vs gold (n=150) | 0.0% | | 49.0% | 52.7% | |

The large effects are robust: 0% → 73% mypy --strict pass and 0.000 → 0.976 annotation slot coverage under real-world single-prompt inference (batch=1, no padding). Both the batched and the single-prompt validations are retained as in-domain measurements, but we no longer attribute the gap between them to left-padding or batching as a general causal explanation.

An external benchmark now exists as well: on the latest canonical full HumanEval+ run (n=164), this model reaches 96 / 164 = 58.5% pass@1 on base tests and 84 / 164 = 51.2% on plus tests. That still underperforms the Qwen base model (112 / 164 base-test pass, 99 / 164 plus-test pass), so v3 should be understood as an in-domain type-hinting preference model rather than a generally stronger code generation model.

At n=30 single-prompt, SFT and v3 are statistically indistinguishable on the hard metrics; v3's clearer advantage over SFT is the 52.7% preference win-rate vs gold on the n=150 batched eval (first model to exceed 50% vs gold). v2 is indistinguishable from base under both decoding conditions — see the v2 card for the failure-mode post-mortem.


What changed vs v2

v2 logged healthy training telemetry (rewards/accuracies → 1.0) but generated text indistinguishable from the base model at greedy decode. The DPO ranking objective can be satisfied by infinitesimal weight deltas when both the LoRA scale and the learning rate are small. v2's effective scale was α/r = 16/256 = 0.0625, and its lr was 1e-6; the product was too small to move argmax decoding.

v3 addresses all proximate causes at once:

| Design choice | v2 | v3 | Rationale |
|---|---|---|---|
| Starting point | Base model | Base + SFT (merged) | DPO optimizes beyond SFT instead of re-deriving type-hint behavior |
| LoRA α | 16 | 256 | Matches r=256 → effective scale α/r = 1.0 (was 0.0625) |
| Learning rate | 1e-6 | 5e-5 | 50× higher; calibrated to the matched LoRA scale |
| DPO β | 0.1 | 0.3 | Stronger preference margin target |
| Epochs | 3 | 2 | Higher lr + scale + warm-start → faster convergence |
| Precision | 4-bit (QLoRA) | bf16 full | 1.5B bf16 fits on A10G 24 GB; clean merge_and_unload |
| Optimizer | paged_adamw_8bit | adamw_torch | No bitsandbytes dep in bf16 |
| Published as | PEFT adapter | Fully merged model | v3's DPO LoRA is only valid on top of (base+SFT); shipping a bare adapter would break the obvious load pattern |
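
For concreteness, the v3 adapter settings above (and in the hyperparameter table further down) correspond to a PEFT LoraConfig roughly like the following. This is an illustrative sketch; mypo_dpo_train_v3.py is the authoritative source.

```python
from peft import LoraConfig

# v3: r = alpha = 256 -> effective scale alpha/r = 1.0
# (v2 used alpha=16 with r=256 -> alpha/r = 0.0625, which, combined with lr=1e-6,
#  was too small to move greedy decoding)
lora_config = LoraConfig(
    r=256,
    lora_alpha=256,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
```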

Quick start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa",
)

messages = [{"role": "user", "content": "Write a function that returns the nth Fibonacci number."}]
rendered = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([rendered], return_tensors="pt", padding=True, truncation=True, max_length=2048).to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=False, use_cache=True, pad_token_id=tokenizer.pad_token_id)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

No PEFT dependency required — v3 is a merged full model.

Real one-prompt demo

The repo joshuasundance/mypo-training includes a runnable comparison script at examples/reproduce_v3.py. We executed that exact script on HF Jobs (job 69e959a92aa1660eaffa8ca6) with the prompt "Write a function that returns the nth Fibonacci number."

Observed outputs:

Base

```python
def fibonacci(n: int) -> int:
  if n == 0 or n == 1:
    return n

  prev = 0
  curr = 1

  for i in range(2, n + 1):
    temp = curr
    curr += prev
    prev = temp

  return curr


# Driver code
n = 9
print(fibonacci(n))
```

SFT

```python
def fibonacci(n: int) -> int:
  if n == 0 or n == 1:
    return n

  prev = 0
  curr = 1

  for i in range(2, n + 1):
    temp = curr
    curr += prev
    prev = temp

  return curr


# Driver code
n = 9
print(fibonacci(n))
```

DPO-v2

```python
def fibonacci(n):
  # Base cases: F(0) = 0, F(1) = 1
  if n == 0:
    return 0
  elif n == 1:
    return 1
  else:
    return fibonacci(n-1) + fibonacci(n-2)
```
...followed by a natural-language explanation block in the same response.

DPO-v3

```python
from typing import Union


def fibonacci(n: int) -> Union[int, float]:
  if n == 0:
    return 0
  elif n == 1:
    return 1
  else:
    return fibonacci(n - 1) + fibonacci(n - 2)
```

This single prompt is useful as a smoke test, but it is not the main evidence for v3's value because the base model already returns typed code here. The stronger evidence is the 150-prompt characterization table above: across that broader sample, v3 materially improves annotation coverage and is the only model to exceed 50% preference win-rate vs gold.

If you want to reproduce a specific stored row from the published eval artifacts, use examples/reproduce_eval_row.py. We validated this on row 13 from samples.jsonl: replaying the prompt by itself did not match the stored sample, but replaying the original 8-prompt batch window did.
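
For anyone replaying artifacts themselves, the two decoding regimes look roughly like this (a minimal sketch, not the project's scripts; the prompt list, 8-prompt window, and left padding are assumptions based on the description above):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "left"  # decoder-only generation pads prompts on the left
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

def render(prompt: str) -> str:
    return tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}], tokenize=False, add_generation_prompt=True
    )

prompts = ["Write a function that returns the nth Fibonacci number."]  # plus the rest of the 8-prompt window

# Regime 1: batched decode -- shorter prompts in the window receive pad tokens
batch = tokenizer([render(p) for p in prompts], return_tensors="pt", padding=True).to(model.device)
batched = model.generate(**batch, max_new_tokens=512, do_sample=False, pad_token_id=tokenizer.pad_token_id)

# Regime 2: single-prompt decode -- batch_size=1, no padding (the n=30 validation condition)
single = tokenizer([render(prompts[0])], return_tensors="pt").to(model.device)
alone = model.generate(**single, max_new_tokens=512, do_sample=False, pad_token_id=tokenizer.pad_token_id)
```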


Training

Trained with TRL DPOTrainer on a single NVIDIA A10G via Hugging Face Jobs. Training script: mypo_dpo_train_v3.py. Job id: 69e933522aa1660eaffa8c51.

Hyperparameters (full DPOConfig)

| Group | Setting |
|---|---|
| Base model | Qwen/Qwen2.5-Coder-1.5B-Instruct |
| Warm-start | joshuasundance/mypo-qwen2.5-coder-1.5b-sft (merged into base before DPO LoRA attached) |
| Dataset | joshuasundance/mypo-4k-rfc (train + validation concatenated → 6,361 pairs) |
| LoRA | r=256, α=256, dropout=0.05, target_modules="all-linear", task_type=CAUSAL_LM |
| Optimization | adamw_torch, lr=5e-5, cosine schedule, warmup_steps=100 |
| DPO | β=0.3, loss_type="sigmoid" |
| Batching | per_device_train_batch_size=1, gradient_accumulation_steps=8 (effective 8) |
| Schedule | num_train_epochs=2, max_length=2048 |
| Precision | bf16, gradient checkpointing on, attn_implementation="sdpa" |
| Reporting | report_to=["codecarbon"], logging_steps=10 |
| Seed | 42 |
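
A minimal sketch of how these settings map onto TRL's DPOConfig / DPOTrainer API (illustrative only: the SFT merge, the train+validation concatenation, CodeCarbon reporting, and the exact output paths are elided; mypo_dpo_train_v3.py is the authoritative script):

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Warm-start: assumes the SFT checkpoint is already merged into the base weights
# (the actual run merges base + SFT before attaching the DPO LoRA).
model_id = "joshuasundance/mypo-qwen2.5-coder-1.5b-sft"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, attn_implementation="sdpa"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset("joshuasundance/mypo-4k-rfc")  # actual run concatenated train + validation

args = DPOConfig(
    output_dir="mypo-dpo-v3",
    beta=0.3,
    loss_type="sigmoid",
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=2,
    max_length=2048,
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=10,
    seed=42,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    processing_class=tokenizer,
    peft_config=LoraConfig(
        r=256, lora_alpha=256, lora_dropout=0.05,
        target_modules="all-linear", task_type="CAUSAL_LM",
    ),
)
trainer.train()
```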

Final training metrics (from job logs)

| Metric | Value |
|---|---|
| train_runtime | 6,005 s (~1 h 40 m) |
| train_loss (DPO sigmoid) | 3.28 × 10⁻³ |
| rewards/accuracies (final) | 1.000 |
| rewards/margins (peak / plateau) | ~26 |
| rewards/chosen (final) | +6.24 |
| rewards/rejected (final) | −15.7 |
| mean_token_accuracy (final) | 0.910 |
| grad_norm (late training) | ≲ 1 × 10⁻⁵ |

Convergence note: rewards/accuracies saturated to 1.0 by epoch ~0.3 and rewards/margins plateaued by epoch ~0.5. The remaining ~1.5 epochs were cosine-decay ride-out with near-zero grads. v4 draft adds EarlyStoppingCallback and a held-out eval split to cut this.
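
Continuing the training sketch above, the v4 fix would look roughly like this (a sketch under stated assumptions: a 2% held-out split, and an illustrative eval cadence and patience; it reuses model, tokenizer, and dataset from the earlier sketch):

```python
from transformers import EarlyStoppingCallback
from trl import DPOConfig, DPOTrainer

splits = dataset["train"].train_test_split(test_size=0.02, seed=42)  # hold out ~2% for eval

args = DPOConfig(
    output_dir="mypo-dpo-v4-draft",
    eval_strategy="steps",
    eval_steps=50,                     # illustrative cadence
    save_strategy="steps",
    save_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    # ...same beta, lr, batching, and precision settings as v3
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    processing_class=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
```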


Evaluation

Evaluated on 150 stratified held-out validation prompts from joshuasundance/mypo-4k-rfc. Full report: reports/2026-04-22-qwen2.5-1.5b-v3/CHARACTERIZATION.md. Raw generations and per-subject JSON/CSV are also published in the training repo under generations/ and analysis/.

| metric | base | dpo-v2 | SFT | dpo-v3 | gold (chosen) | rejected |
|---|---|---|---|---|---|---|
| parse rate | 0.973 | 0.973 | 1.000 | 1.000 | 1.000 | 1.000 |
| black pass rate | 0.120 | 0.120 | 0.973 | 0.953 | 0.980 | 0.060 |
| ruff pass rate | 0.933 | 0.940 | 0.960 | 0.913 | 1.000 | 0.913 |
| mypy --strict pass rate | 0.060 | 0.060 | 0.927 | 0.920 | 1.000 | 0.000 |
| annotation slot coverage | 0.000 | 0.000 | 0.953 | 0.963 | 0.955 | 0.000 |
| fully-annotated fn fraction | 0.000 | 0.000 | 0.893 | 0.903 | 0.898 | 0.000 |
| mean ruff violations / sample | 0.47 | 0.46 | 0.07 | 0.09 | 0.00 | 0.11 |
| mean mypy errors / sample | 2.30 | 2.35 | 0.13 | 0.13 | 0.00 | 2.25 |
| preference win-rate vs gold | 0.000 | | 0.490 | 0.527 | | |
| preference win-rate vs base | 0.500 | | 1.000 | 1.000 | | |
| preference win-rate vs rejected | 0.500 | | 1.000 | 1.000 | | |

Interpretation (batched n=150):

  • v3 matches SFT on every quality gate (within noise).
  • v3 has the highest annotation slot coverage of any model, including gold (0.963 vs gold 0.955 vs SFT 0.953). Judgment call whether this is "more thorough" or "slight over-annotation"; a sketch of how a slot-coverage metric can be computed follows this list.
  • v3 is the only subject to exceed 50% win-rate vs gold (52.7%) on this eval — measurable DPO-level gain on top of SFT at this sample size.
  • ruff regression (0.913 vs SFT 0.960) is small but real; likely a handful of idiomatic style issues introduced by more aggressive annotation.
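
For readers unfamiliar with the coverage metric: "annotation slot coverage" here is taken to mean the fraction of annotatable slots (function parameters plus return types) that carry an annotation. The project's eval scripts are the ground truth; the following is only a minimal illustrative sketch of that idea:

```python
import ast

def annotation_slot_coverage(source: str) -> float:
    """Fraction of annotatable slots (params + returns) that are annotated.

    Illustrative only; the project's eval pipeline may count slots differently
    (e.g. handling of self/cls, *args/**kwargs, lambdas, or nested functions).
    """
    tree = ast.parse(source)
    total = annotated = 0
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for arg in node.args.posonlyargs + node.args.args + node.args.kwonlyargs:
                total += 1
                annotated += arg.annotation is not None
            total += 1                       # return-type slot
            annotated += node.returns is not None
    return annotated / total if total else 0.0

print(annotation_slot_coverage("def f(x: int) -> int:\n    return x"))  # 1.0
print(annotation_slot_coverage("def f(x):\n    return x"))              # 0.0
```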

Single-prompt validation (n=30)

A follow-up job re-decoded 30 stratified validation prompts with batch_size=1 and no padding — i.e., the realistic one-user inference condition — across all four subjects. This directly tests whether the batched characterization numbers reflect real-world behavior or batching/left-padding artifacts. Full artifacts: single-prompt-validation/single-prompt-2026-04-23T002137Z/.

| metric | base | dpo-v2 | SFT | dpo-v3 |
|---|---|---|---|---|
| parse rate | 0.933 | 0.967 | 1.000 | 1.000 |
| black pass rate | 0.067 | 0.067 | 1.000 | 0.967 |
| ruff pass rate | 0.900 | 0.967 | 0.933 | 0.800 |
| mypy --strict pass rate | 0.000 | 0.000 | 0.733 | 0.733 |
| annotation slot coverage | 0.000 | 0.000 | 0.971 | 0.976 |
| mean mypy errors / sample | 2.33 | 2.40 | 0.30 | 0.30 |

What this tells us:

  • The core claim holds under real-world inference. 0% → 73% mypy --strict is not a batching artifact.
  • The batched n=150 and single-prompt n=30 validations should be treated as two different measurement regimes. We no longer claim that the gap is specifically caused by left-padding or batching as a general explanation.
  • v2's no-op is confirmed under both decoding modes. Rules out "v2 adapter not loading" as an alternative explanation.
  • SFT and v3 are indistinguishable at n=30 single-prompt (both 0.733 mypy, both ≈ 0.97 annotation coverage). At this sample size we cannot claim v3 beats SFT on the hard metrics; the case for v3 over SFT rests on the 52.7% preference win-rate vs gold in the batched eval.
  • v3's ruff regression is larger in single-prompt mode (0.800 vs SFT 0.933). Consistent with v3 trading some style-conformance for stronger annotation behavior.

HumanEval+ external benchmark (n=164)

We also ran a full evalplus HumanEval+ benchmark. That is the stronger out-of-domain coding benchmark, and it does not show a general gain for v3:

| subject | pass@1 (base tests) | pass@1 (plus tests) |
|---|---|---|
| base | 112 / 164 (68.3%) | 99 / 164 (60.4%) |
| dpo-v2 | 110 / 164 (67.1%) | 97 / 164 (59.1%) |
| sft | 97 / 164 (59.1%) | 86 / 164 (52.4%) |
| dpo-v3 | 96 / 164 (58.5%) | 84 / 164 (51.2%) |

So the honest reading is: v3 changes the model's in-domain type-hinting behavior, but it is not a generally stronger HumanEval+ solver than the base model.


Environmental impact

Reported with CodeCarbon v3.2.6. Raw data: emissions.csv.

Training (this model)

| Metric | Value |
|---|---|
| Duration | 6,005.4 s (1 h 40 m) |
| Energy consumed | 0.363 kWh |
| CO₂e emissions | 0.134 kg |
| GPU energy / avg power | 0.242 kWh / 144.9 W |
| CPU energy / avg power | 0.034 kWh / 21.4 W |
| RAM energy / avg power | 0.087 kWh / 54.0 W |
| Hardware | 1 × NVIDIA A10G, AMD EPYC 7R32 (48 vCPU), 187 GB RAM |
| Region | AWS us-east-1 (Virginia, USA); PUE 1.0 |
| Tracker | codecarbon 3.2.6, tracking_mode=machine |

Cumulative project footprint

Because v3 builds on the SFT warm-start, and v2 was also a training run in the same lineage, the full energy cost of this model's lineage is:

| Stage | Duration | Energy | CO₂e |
|---|---|---|---|
| SFT training | 8,340 s | 0.472 kWh | 0.174 kg |
| v2 DPO training (failed) | 10,938 s | 0.646 kWh | 0.238 kg |
| v3 DPO training (this) | 6,005 s | 0.363 kWh | 0.134 kg |
| v3 characterization (generate × 4 models) | 937 s | 0.052 kWh | 0.019 kg |
| 6 analysis jobs (cpu-upgrade) | ~3 min each, parallel | ~0.01 kWh | ~0.004 kg |
| Cumulative (SFT + v2 + v3 + eval) | ~7.3 h | ~1.55 kWh | ~0.57 kg |

Approximate compute cost

Wall-clock time billed at published HF Jobs rates; the costs below are approximate.

| Stage | Flavor | Wall-clock | Approx cost |
|---|---|---|---|
| SFT training | a10g-large | 2.32 h | ~$3.50 |
| v2 DPO training | a10g-large | 3.04 h | ~$4.60 |
| v3 DPO training | a10g-large | 1.67 h | ~$2.50 |
| v3 characterization generate | a10g-large | 0.26 h | ~$0.40 |
| 6 analysis jobs | cpu-upgrade × 6 parallel | ~3 min each | <$0.05 |
| Rollup report | cpu-basic | <1 min | ~$0 |
| Cumulative project cost | | | ~$11 |

Limitations and biases

  • Narrow objective: optimized only for Python type-hint preference. Docstring style, line length, complexity, security idioms, etc. were not objectives.
  • Possible over-annotation: rewards/rejected fell to ~−20 during training, meaning the model strongly suppresses unhinted outputs. In principle this could cause annotations where Python idiom doesn't require them (trivial lambdas, short list comprehensions). v3's annotation coverage slightly exceeding gold's is mild evidence of this; watch for it in your downstream use.
  • No eval split during training: v3 trained on the full 6,361-pair pool with no held-out metric for best-checkpoint selection. v4 draft adds a 2% eval split and load_best_model_at_end.
  • bf16 weights only: merged safetensors are bf16. Fine for A10G/A100/H100; float16 consumers should cast (see the snippet after this list).
  • Small base model: 1.5B parameters. For larger code tasks, consider applying the same recipe to Qwen2.5-Coder-7B or similar.
  • English + code only: training data is English prompts, English/Python responses.
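
For float16-only hardware, a cast at load time is enough (a minimal sketch using the standard transformers loading path):

```python
import torch
from transformers import AutoModelForCausalLM

# Load the bf16 checkpoint and cast the weights to float16 for GPUs without bf16 support
model = AutoModelForCausalLM.from_pretrained(
    "joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v3",
    torch_dtype=torch.float16,
)
```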

Reproducibility

Everything needed to reproduce this model is on the Hub:

| Artifact | Location |
|---|---|
| Training script | mypo-training/mypo_dpo_train_v3.py |
| Training data | joshuasundance/mypo-4k-rfc |
| SFT warm-start | joshuasundance/mypo-qwen2.5-coder-1.5b-sft |
| Training energy log | emissions.csv (this repo) |
| Evaluation pipeline | mypo-training/eval/ (generate / analyze / report scripts) |
| Raw generations | mypo-training/generations/2026-04-22-qwen2.5-1.5b-v3/ |
| Per-subject analysis | mypo-training/analysis/2026-04-22-qwen2.5-1.5b-v3/ |
| Characterization report | mypo-training/reports/2026-04-22-qwen2.5-1.5b-v3/ |
| Single-prompt validation (n=30) | mypo-training/single-prompt-validation/single-prompt-2026-04-23T002137Z/ |

To re-train from scratch:

```bash
hf jobs uv run --flavor a10g-large --timeout 3h --secrets HF_TOKEN \
  https://huggingface.co/joshuasundance/mypo-training/raw/main/mypo_dpo_train_v3.py
```

Framework versions

  • Python 3.12
  • PyTorch 2.4+, Transformers 4.45+, TRL 0.15+, PEFT 0.12+, Datasets 3.0+, Accelerate 0.34+
  • CodeCarbon 3.2.6

License

Apache 2.0 (inherits from the Qwen2.5-Coder-1.5B-Instruct base model).


Citations

This model

```bibtex
@software{mypo_dpo_v3_2026,
  title   = {{MyPO DPO v3: Qwen2.5-Coder-1.5B Type-Hint Preference Optimization}},
  author  = {Bailey, Joshua Sundance},
  year    = 2026,
  url     = {https://huggingface.co/joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v3}
}
```

CodeCarbon (emissions tracking)

```bibtex
@software{codecarbon,
  author  = {Benoit Courty and Victor Schmidt and Sasha Luccioni and Goyal-Kamal and MarionCoutarel and Boris Feld and Jérémy Lecourt and LiamConnell and Amine Saboni and Inimaz and supatomic and Mathilde Léval and Luis Blanche and Alexis Cruveiller and Ouminasara and Franklin Zhao and Aditya Joshi and Alexis Bogroff and Hugues de Lavoreille and Niko Laskaris and Edoardo Abati and Douglas Blank and Ziyao Wang and Armin Catovic and Marc Alencon and Michał Stęchły and Christian Bauer and Lucas Otávio N. de Araújo and JPW and MinervaBooks},
  title   = {{CodeCarbon: Estimate and track carbon emissions from machine learning computing}},
  year    = 2024,
  doi     = {10.5281/zenodo.11171501},
  url     = {https://github.com/mlco2/codecarbon}
}
```

DPO

```bibtex
@inproceedings{rafailov2023direct,
  title     = {{Direct Preference Optimization: Your Language Model is Secretly a Reward Model}},
  author    = {Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Manning, Christopher D. and Ermon, Stefano and Finn, Chelsea},
  booktitle = {Advances in Neural Information Processing Systems 36 (NeurIPS 2023)},
  year      = 2023
}
```

TRL

```bibtex
@software{vonwerra2020trl,
  title   = {{TRL: Transformer Reinforcement Learning}},
  author  = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
  license = {Apache-2.0},
  url     = {https://github.com/huggingface/trl},
  year    = 2020
}
```

LoRA

```bibtex
@inproceedings{hu2022lora,
  title     = {{LoRA: Low-Rank Adaptation of Large Language Models}},
  author    = {Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
  booktitle = {International Conference on Learning Representations},
  year      = 2022
}
```

Qwen2.5-Coder (base model)

```bibtex
@article{hui2024qwen25coder,
  title   = {{Qwen2.5-Coder Technical Report}},
  author  = {Hui, Binyuan and Yang, Jian and Cui, Zeyu and Yang, Jiaxi and Liu, Dayiheng and Zhang, Lei and Liu, Tianyu and Zhang, Jiajun and Yu, Bowen and Dang, Kai and others},
  journal = {arXiv preprint arXiv:2409.12186},
  year    = 2024
}
```