Model Card for mypo-qwen2.5-coder-1.5b-dpo-v3

Preference-tuned Python coding model that prefers fully type-annotated code by default.


TL;DR

v3 is the first DPO model in the MyPO project that actually shifts argmax decoding past the base. Two complementary measurements are reported — both published, both reproducible:

| metric | base | dpo-v2 | SFT | dpo-v3 | gold (chosen) |
|---|---|---|---|---|---|
| mypy --strict pass — n=150 batched | 6.0% | 6.0% | 92.7% | 92.0% | 100% |
| mypy --strict pass — n=30 single-prompt | 0.0% | 0.0% | 73.3% | 73.3% | |
| annotation slot coverage — n=150 batched | 0.000 | 0.000 | 0.953 | 0.963 | 0.955 |
| annotation slot coverage — n=30 single-prompt | 0.000 | 0.000 | 0.971 | 0.976 | |
| black pass — n=150 batched | 12.0% | 12.0% | 97.3% | 95.3% | 98.0% |
| preference win-rate vs gold (n=150) | 0.0% | | 49.0% | 52.7% | |

The large effects are robust: 0% → 73% mypy --strict pass and 0.000 → 0.976 annotation slot coverage under real-world single-prompt inference (batch=1, no padding). Both the batched and the single-prompt validations are retained as in-domain measurements, but we no longer attribute the gap between them to left-padding or batching as a general causal explanation.

An external benchmark now exists as well: on the latest canonical full HumanEval+ run (n=164), this model reaches 96 / 164 = 58.5% pass@1 on base tests and 84 / 164 = 51.2% on plus tests. That still underperforms the Qwen base model (112 / 164 base-test pass, 99 / 164 plus-test pass), so v3 should be understood as an in-domain type-hinting preference model rather than a generally stronger code generation model.

At n=30 single-prompt, SFT and v3 are statistically indistinguishable on the hard metrics; v3's clearer advantage over SFT is the 52.7% preference win-rate vs gold on the n=150 batched eval (first model to exceed 50% vs gold). v2 is indistinguishable from base under both decoding conditions — see the v2 card for the failure-mode post-mortem.


What changed vs v2

v2 logged healthy training telemetry (rewards/accuracies → 1.0) but generated text indistinguishable from the base model at greedy decode. The DPO ranking objective can be satisfied by infinitesimal weight deltas when both the LoRA scale and the learning rate are small. v2's effective scale was α/r = 16/256 = 0.0625, and its lr was 1e-6; the product was too small to move argmax decoding.

v3 addresses all proximate causes at once:

| Design choice | v2 | v3 | Rationale |
|---|---|---|---|
| Starting point | Base model | Base + SFT (merged) | DPO optimizes beyond SFT instead of re-deriving type-hint behavior |
| LoRA α | 16 | 256 | Matches r=256 → effective scale α/r = 1.0 (was 0.0625) |
| Learning rate | 1e-6 | 5e-5 | 50× higher; calibrated to the matched LoRA scale |
| DPO β | 0.1 | 0.3 | Stronger preference margin target |
| Epochs | 3 | 2 | Higher lr + scale + warm-start → faster convergence |
| Precision | 4-bit (QLoRA) | bf16 full | 1.5B bf16 fits on A10G 24 GB; clean merge_and_unload |
| Optimizer | paged_adamw_8bit | adamw_torch | No bitsandbytes dep in bf16 |
| Published as | PEFT adapter | Fully merged model | v3's DPO LoRA is only valid on top of (base+SFT); shipping a bare adapter would break the obvious load pattern |
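
For concreteness, the v3 adapter settings above (and in the hyperparameter table further down) correspond to a PEFT LoraConfig roughly like the following. This is an illustrative sketch; mypo_dpo_train_v3.py is the authoritative source.

```python
from peft import LoraConfig

# v3: r = alpha = 256 -> effective scale alpha/r = 1.0
# (v2 used alpha=16 with r=256 -> alpha/r = 0.0625, which, combined with lr=1e-6,
#  was too small to move greedy decoding)
lora_config = LoraConfig(
    r=256,
    lora_alpha=256,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
```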

Quick start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa",
)

messages = [{"role": "user", "content": "Write a function that returns the nth Fibonacci number."}]
rendered = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([rendered], return_tensors="pt", padding=True, truncation=True, max_length=2048).to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=False, use_cache=True, pad_token_id=tokenizer.pad_token_id)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

No PEFT dependency required — v3 is a merged full model.

Real one-prompt demo

The repo joshuasundance/mypo-training includes a runnable comparison script at examples/reproduce_v3.py. We executed that exact script on HF Jobs (job 69e959a92aa1660eaffa8ca6) with the prompt "Write a function that returns the nth Fibonacci number."

Observed outputs:

Base

```python
def fibonacci(n: int) -> int:
  if n == 0 or n == 1:
    return n

  prev = 0
  curr = 1

  for i in range(2, n + 1):
    temp = curr
    curr += prev
    prev = temp

  return curr


# Driver code
n = 9
print(fibonacci(n))
```

SFT

```python
def fibonacci(n: int) -> int:
  if n == 0 or n == 1:
    return n

  prev = 0
  curr = 1

  for i in range(2, n + 1):
    temp = curr
    curr += prev
    prev = temp

  return curr


# Driver code
n = 9
print(fibonacci(n))
```

DPO-v2

```python
def fibonacci(n):
  # Base cases: F(0) = 0, F(1) = 1
  if n == 0:
    return 0
  elif n == 1:
    return 1
  else:
    return fibonacci(n-1) + fibonacci(n-2)
```
...followed by a natural-language explanation block in the same response.

DPO-v3

```python
from typing import Union


def fibonacci(n: int) -> Union[int, float]:
  if n == 0:
    return 0
  elif n == 1:
    return 1
  else:
    return fibonacci(n - 1) + fibonacci(n - 2)
```

This single prompt is useful as a smoke test, but it is not the main evidence for v3's value because the base model already returns typed code here. The stronger evidence is the 150-prompt characterization table above: across that broader sample, v3 materially improves annotation coverage and is the only model to exceed 50% preference win-rate vs gold.

If you want to reproduce a specific stored row from the published eval artifacts, use examples/reproduce_eval_row.py. We validated this on row 13 from samples.jsonl: replaying the prompt by itself did not match the stored sample, but replaying the original 8-prompt batch window did.
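
For anyone replaying artifacts themselves, the two decoding regimes look roughly like this (a minimal sketch, not the project's scripts; the prompt list, 8-prompt window, and left padding are assumptions based on the description above):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "left"  # decoder-only generation pads prompts on the left
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

def render(prompt: str) -> str:
    return tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}], tokenize=False, add_generation_prompt=True
    )

prompts = ["Write a function that returns the nth Fibonacci number."]  # plus the rest of the 8-prompt window

# Regime 1: batched decode -- shorter prompts in the window receive pad tokens
batch = tokenizer([render(p) for p in prompts], return_tensors="pt", padding=True).to(model.device)
batched = model.generate(**batch, max_new_tokens=512, do_sample=False, pad_token_id=tokenizer.pad_token_id)

# Regime 2: single-prompt decode -- batch_size=1, no padding (the n=30 validation condition)
single = tokenizer([render(prompts[0])], return_tensors="pt").to(model.device)
alone = model.generate(**single, max_new_tokens=512, do_sample=False, pad_token_id=tokenizer.pad_token_id)
```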


Training

Trained with TRL DPOTrainer on a single NVIDIA A10G via Hugging Face Jobs. Training script: mypo_dpo_train_v3.py. Job id: 69e933522aa1660eaffa8c51.

Hyperparameters (full DPOConfig)

| Group | Setting |
|---|---|
| Base model | Qwen/Qwen2.5-Coder-1.5B-Instruct |
| Warm-start | joshuasundance/mypo-qwen2.5-coder-1.5b-sft (merged into base before DPO LoRA attached) |
| Dataset | joshuasundance/mypo-4k-rfc (train + validation concatenated → 6,361 pairs) |
| LoRA | r=256, α=256, dropout=0.05, target_modules="all-linear", task_type=CAUSAL_LM |
| Optimization | adamw_torch, lr=5e-5, cosine schedule, warmup_steps=100 |
| DPO | β=0.3, loss_type="sigmoid" |
| Batching | per_device_train_batch_size=1, gradient_accumulation_steps=8 (effective 8) |
| Schedule | num_train_epochs=2, max_length=2048 |
| Precision | bf16, gradient checkpointing on, attn_implementation="sdpa" |
| Reporting | report_to=["codecarbon"], logging_steps=10 |
| Seed | 42 |
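
A minimal sketch of how these settings map onto TRL's DPOConfig / DPOTrainer API (illustrative only: the SFT merge, the train+validation concatenation, CodeCarbon reporting, and the exact output paths are elided; mypo_dpo_train_v3.py is the authoritative script):

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Warm-start: assumes the SFT checkpoint is already merged into the base weights
# (the actual run merges base + SFT before attaching the DPO LoRA).
model_id = "joshuasundance/mypo-qwen2.5-coder-1.5b-sft"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, attn_implementation="sdpa"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset("joshuasundance/mypo-4k-rfc")  # actual run concatenated train + validation

args = DPOConfig(
    output_dir="mypo-dpo-v3",
    beta=0.3,
    loss_type="sigmoid",
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=2,
    max_length=2048,
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=10,
    seed=42,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    processing_class=tokenizer,
    peft_config=LoraConfig(
        r=256, lora_alpha=256, lora_dropout=0.05,
        target_modules="all-linear", task_type="CAUSAL_LM",
    ),
)
trainer.train()
```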

Final training metrics (from job logs)

| Metric | Value |
|---|---|
| train_runtime | 6,005 s (~1 h 40 m) |
| train_loss (DPO sigmoid) | 3.28 × 10⁻³ |
| rewards/accuracies (final) | 1.000 |
| rewards/margins (peak / plateau) | ~26 |
| rewards/chosen (final) | +6.24 |
| rewards/rejected (final) | −15.7 |
| mean_token_accuracy (final) | 0.910 |
| grad_norm (late training) | ≲ 1 × 10⁻⁵ |

Convergence note: rewards/accuracies saturated to 1.0 by epoch ~0.3 and rewards/margins plateaued by epoch ~0.5. The remaining ~1.5 epochs were cosine-decay ride-out with near-zero grads. v4 draft adds EarlyStoppingCallback and a held-out eval split to cut this.
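
Continuing the training sketch above, the v4 fix would look roughly like this (a sketch under stated assumptions: a 2% held-out split, and an illustrative eval cadence and patience; it reuses model, tokenizer, and dataset from the earlier sketch):

```python
from transformers import EarlyStoppingCallback
from trl import DPOConfig, DPOTrainer

splits = dataset["train"].train_test_split(test_size=0.02, seed=42)  # hold out ~2% for eval

args = DPOConfig(
    output_dir="mypo-dpo-v4-draft",
    eval_strategy="steps",
    eval_steps=50,                     # illustrative cadence
    save_strategy="steps",
    save_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    # ...same beta, lr, batching, and precision settings as v3
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    processing_class=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
```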


Evaluation

Evaluated on 150 stratified held-out validation prompts from joshuasundance/mypo-4k-rfc. Full report: reports/2026-04-22-qwen2.5-1.5b-v3/CHARACTERIZATION.md. Raw generations and per-subject JSON/CSV are also published in the training repo under generations/ and analysis/.

| metric | base | dpo-v2 | SFT | dpo-v3 | gold (chosen) | rejected |
|---|---|---|---|---|---|---|
| parse rate | 0.973 | 0.973 | 1.000 | 1.000 | 1.000 | 1.000 |
| black pass rate | 0.120 | 0.120 | 0.973 | 0.953 | 0.980 | 0.060 |
| ruff pass rate | 0.933 | 0.940 | 0.960 | 0.913 | 1.000 | 0.913 |
| mypy --strict pass rate | 0.060 | 0.060 | 0.927 | 0.920 | 1.000 | 0.000 |
| annotation slot coverage | 0.000 | 0.000 | 0.953 | 0.963 | 0.955 | 0.000 |
| fully-annotated fn fraction | 0.000 | 0.000 | 0.893 | 0.903 | 0.898 | 0.000 |
| mean ruff violations / sample | 0.47 | 0.46 | 0.07 | 0.09 | 0.00 | 0.11 |
| mean mypy errors / sample | 2.30 | 2.35 | 0.13 | 0.13 | 0.00 | 2.25 |
| preference win-rate vs gold | 0.000 | | 0.490 | 0.527 | | |
| preference win-rate vs base | 0.500 | | 1.000 | 1.000 | | |
| preference win-rate vs rejected | 0.500 | | 1.000 | 1.000 | | |

Interpretation (batched n=150):

  • v3 matches SFT on every quality gate (within noise).
  • v3 has the highest annotation slot coverage of any model, including gold (0.963 vs gold 0.955 vs SFT 0.953). Judgment call whether this is "more thorough" or "slight over-annotation"; a sketch of how a slot-coverage metric can be computed follows this list.
  • v3 is the only subject to exceed 50% win-rate vs gold (52.7%) on this eval — measurable DPO-level gain on top of SFT at this sample size.
  • ruff regression (0.913 vs SFT 0.960) is small but real; likely a handful of idiomatic style issues introduced by more aggressive annotation.
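
For readers unfamiliar with the coverage metric: "annotation slot coverage" here is taken to mean the fraction of annotatable slots (function parameters plus return types) that carry an annotation. The project's eval scripts are the ground truth; the following is only a minimal illustrative sketch of that idea:

```python
import ast

def annotation_slot_coverage(source: str) -> float:
    """Fraction of annotatable slots (params + returns) that are annotated.

    Illustrative only; the project's eval pipeline may count slots differently
    (e.g. handling of self/cls, *args/**kwargs, lambdas, or nested functions).
    """
    tree = ast.parse(source)
    total = annotated = 0
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for arg in node.args.posonlyargs + node.args.args + node.args.kwonlyargs:
                total += 1
                annotated += arg.annotation is not None
            total += 1                       # return-type slot
            annotated += node.returns is not None
    return annotated / total if total else 0.0

print(annotation_slot_coverage("def f(x: int) -> int:\n    return x"))  # 1.0
print(annotation_slot_coverage("def f(x):\n    return x"))              # 0.0
```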

Single-prompt validation (n=30)

A follow-up job re-decoded 30 stratified validation prompts with batch_size=1 and no padding — i.e., the realistic one-user inference condition — across all four subjects. This directly tests whether the batched characterization numbers reflect real-world behavior or batching/left-padding artifacts. Full artifacts: single-prompt-validation/single-prompt-2026-04-23T002137Z/.

| metric | base | dpo-v2 | SFT | dpo-v3 |
|---|---|---|---|---|
| parse rate | 0.933 | 0.967 | 1.000 | 1.000 |
| black pass rate | 0.067 | 0.067 | 1.000 | 0.967 |
| ruff pass rate | 0.900 | 0.967 | 0.933 | 0.800 |
| mypy --strict pass rate | 0.000 | 0.000 | 0.733 | 0.733 |
| annotation slot coverage | 0.000 | 0.000 | 0.971 | 0.976 |
| mean mypy errors / sample | 2.33 | 2.40 | 0.30 | 0.30 |

What this tells us:

  • The core claim holds under real-world inference. 0% → 73% mypy --strict is not a batching artifact.
  • The batched n=150 and single-prompt n=30 validations should be treated as two different measurement regimes. We no longer claim that the gap is specifically caused by left-padding or batching as a general explanation.
  • v2's no-op is confirmed under both decoding modes. Rules out "v2 adapter not loading" as an alternative explanation.
  • SFT and v3 are indistinguishable at n=30 single-prompt (both 0.733 mypy, both ≈ 0.97 annotation coverage). At this sample size we cannot claim v3 beats SFT on the hard metrics; the case for v3 over SFT rests on the 52.7% preference win-rate vs gold in the batched eval.
  • v3's ruff regression is larger in single-prompt mode (0.800 vs SFT 0.933). Consistent with v3 trading some style-conformance for stronger annotation behavior.

HumanEval+ external benchmark (n=164)

We also ran a full evalplus HumanEval+ benchmark. That is the stronger out-of-domain coding benchmark, and it does not show a general gain for v3:

| subject | pass@1 (base tests) | pass@1 (plus tests) |
|---|---|---|
| base | 112 / 164 (68.3%) | 99 / 164 (60.4%) |
| dpo-v2 | 110 / 164 (67.1%) | 97 / 164 (59.1%) |
| sft | 97 / 164 (59.1%) | 86 / 164 (52.4%) |
| dpo-v3 | 96 / 164 (58.5%) | 84 / 164 (51.2%) |

So the honest reading is: v3 changes the model's in-domain type-hinting behavior, but it is not a generally stronger HumanEval+ solver than the base model.


Environmental impact

Reported with CodeCarbon v3.2.6. Raw data: emissions.csv.

Training (this model)

| Metric | Value |
|---|---|
| Duration | 6,005.4 s (1 h 40 m) |
| Energy consumed | 0.363 kWh |
| CO₂e emissions | 0.134 kg |
| GPU energy / avg power | 0.242 kWh / 144.9 W |
| CPU energy / avg power | 0.034 kWh / 21.4 W |
| RAM energy / avg power | 0.087 kWh / 54.0 W |
| Hardware | 1 × NVIDIA A10G, AMD EPYC 7R32 (48 vCPU), 187 GB RAM |
| Region | AWS us-east-1 (Virginia, USA); PUE 1.0 |
| Tracker | codecarbon 3.2.6, tracking_mode=machine |

Cumulative project footprint

Because v3 builds on the SFT warm-start, and v2 was also a training run in the same lineage, the full energy cost of this model's lineage is:

| Stage | Duration | Energy | CO₂e |
|---|---|---|---|
| SFT training | 8,340 s | 0.472 kWh | 0.174 kg |
| v2 DPO training (failed) | 10,938 s | 0.646 kWh | 0.238 kg |
| v3 DPO training (this) | 6,005 s | 0.363 kWh | 0.134 kg |
| v3 characterization (generate × 4 models) | 937 s | 0.052 kWh | 0.019 kg |
| 6 analysis jobs (cpu-upgrade) | ~3 min each, parallel | ~0.01 kWh | ~0.004 kg |
| Cumulative (SFT + v2 + v3 + eval) | ~7.3 h | ~1.55 kWh | ~0.57 kg |

Approximate compute cost

Wall-clock time billed at published HF Jobs rates; the costs below are approximate.

| Stage | Flavor | Wall-clock | Approx cost |
|---|---|---|---|
| SFT training | a10g-large | 2.32 h | ~$3.50 |
| v2 DPO training | a10g-large | 3.04 h | ~$4.60 |
| v3 DPO training | a10g-large | 1.67 h | ~$2.50 |
| v3 characterization generate | a10g-large | 0.26 h | ~$0.40 |
| 6 analysis jobs | cpu-upgrade × 6 parallel | ~3 min each | <$0.05 |
| Rollup report | cpu-basic | <1 min | ~$0 |
| Cumulative project cost | | | ~$11 |

Limitations and biases

  • Narrow objective: optimized only for Python type-hint preference. Docstring style, line length, complexity, security idioms, etc. were not objectives.
  • Possible over-annotation: rewards/rejected fell to ~−20 during training, meaning the model strongly suppresses unhinted outputs. In principle this could cause annotations where Python idiom doesn't require them (trivial lambdas, short list comprehensions). v3's annotation coverage slightly exceeding gold's is mild evidence of this; watch for it in your downstream use.
  • No eval split during training: v3 trained on the full 6,361-pair pool with no held-out metric for best-checkpoint selection. v4 draft adds a 2% eval split and load_best_model_at_end.
  • bf16 weights only: merged safetensors are bf16. Fine for A10G/A100/H100; float16 consumers should cast (see the snippet after this list).
  • Small base model: 1.5B parameters. For larger code tasks, consider applying the same recipe to Qwen2.5-Coder-7B or similar.
  • English + code only: training data is English prompts, English/Python responses.
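
For float16-only hardware, a cast at load time is enough (a minimal sketch using the standard transformers loading path):

```python
import torch
from transformers import AutoModelForCausalLM

# Load the bf16 checkpoint and cast the weights to float16 for GPUs without bf16 support
model = AutoModelForCausalLM.from_pretrained(
    "joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v3",
    torch_dtype=torch.float16,
)
```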

Reproducibility

Everything needed to reproduce this model is on the Hub:

| Artifact | Location |
|---|---|
| Training script | mypo-training/mypo_dpo_train_v3.py |
| Training data | joshuasundance/mypo-4k-rfc |
| SFT warm-start | joshuasundance/mypo-qwen2.5-coder-1.5b-sft |
| Training energy log | emissions.csv (this repo) |
| Evaluation pipeline | mypo-training/eval/ (generate / analyze / report scripts) |
| Raw generations | mypo-training/generations/2026-04-22-qwen2.5-1.5b-v3/ |
| Per-subject analysis | mypo-training/analysis/2026-04-22-qwen2.5-1.5b-v3/ |
| Characterization report | mypo-training/reports/2026-04-22-qwen2.5-1.5b-v3/ |
| Single-prompt validation (n=30) | mypo-training/single-prompt-validation/single-prompt-2026-04-23T002137Z/ |

To re-train from scratch:

```bash
hf jobs uv run --flavor a10g-large --timeout 3h --secrets HF_TOKEN \
  https://huggingface.co/joshuasundance/mypo-training/raw/main/mypo_dpo_train_v3.py
```

Framework versions

  • Python 3.12
  • PyTorch 2.4+, Transformers 4.45+, TRL 0.15+, PEFT 0.12+, Datasets 3.0+, Accelerate 0.34+
  • CodeCarbon 3.2.6

License

Apache 2.0 (inherits from the Qwen2.5-Coder-1.5B-Instruct base model).


Citations

This model

```bibtex
@software{mypo_dpo_v3_2026,
  title   = {{MyPO DPO v3: Qwen2.5-Coder-1.5B Type-Hint Preference Optimization}},
  author  = {Bailey, Joshua Sundance},
  year    = 2026,
  url     = {https://huggingface.co/joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v3}
}
```

CodeCarbon (emissions tracking)

```bibtex
@software{codecarbon,
  author  = {Benoit Courty and Victor Schmidt and Sasha Luccioni and Goyal-Kamal and MarionCoutarel and Boris Feld and Jérémy Lecourt and LiamConnell and Amine Saboni and Inimaz and supatomic and Mathilde Léval and Luis Blanche and Alexis Cruveiller and Ouminasara and Franklin Zhao and Aditya Joshi and Alexis Bogroff and Hugues de Lavoreille and Niko Laskaris and Edoardo Abati and Douglas Blank and Ziyao Wang and Armin Catovic and Marc Alencon and Michał Stęchły and Christian Bauer and Lucas Otávio N. de Araújo and JPW and MinervaBooks},
  title   = {{CodeCarbon: Estimate and track carbon emissions from machine learning computing}},
  year    = 2024,
  doi     = {10.5281/zenodo.11171501},
  url     = {https://github.com/mlco2/codecarbon}
}
```

DPO

```bibtex
@inproceedings{rafailov2023direct,
  title     = {{Direct Preference Optimization: Your Language Model is Secretly a Reward Model}},
  author    = {Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Manning, Christopher D. and Ermon, Stefano and Finn, Chelsea},
  booktitle = {Advances in Neural Information Processing Systems 36 (NeurIPS 2023)},
  year      = 2023
}
```

TRL

```bibtex
@software{vonwerra2020trl,
  title   = {{TRL: Transformer Reinforcement Learning}},
  author  = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
  license = {Apache-2.0},
  url     = {https://github.com/huggingface/trl},
  year    = 2020
}
```

LoRA

```bibtex
@inproceedings{hu2022lora,
  title     = {{LoRA: Low-Rank Adaptation of Large Language Models}},
  author    = {Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
  booktitle = {International Conference on Learning Representations},
  year      = 2022
}
```

Qwen2.5-Coder (base model)

```bibtex
@article{hui2024qwen25coder,
  title   = {{Qwen2.5-Coder Technical Report}},
  author  = {Hui, Binyuan and Yang, Jian and Cui, Zeyu and Yang, Jiaxi and Liu, Dayiheng and Zhang, Lei and Liu, Tianyu and Zhang, Jiajun and Yu, Bowen and Dang, Kai and others},
  journal = {arXiv preprint arXiv:2409.12186},
  year    = 2024
}
```