Model Card for mypo-qwen2.5-coder-1.5b-dpo-v3
Preference-tuned Python coding model that prefers fully type-annotated code by default.
- Base: Qwen/Qwen2.5-Coder-1.5B-Instruct
- Pipeline: base → SFT adapter (merged) → DPO LoRA (merged) → this model
- Training data: joshuasundance/mypo-4k-rfc — chosen = type-hinted Python, rejected = unhinted Python
- This repo ships a fully merged standalone model, not a LoRA adapter. Load directly with `AutoModelForCausalLM.from_pretrained(...)`.
- Training scripts, raw generations, per-subject analysis, and the comparison report live in joshuasundance/mypo-training.
TL;DR
v3 is the first DPO model in the MyPO project that actually shifts argmax decoding past the base. Two complementary measurements are reported — both published, both reproducible:
| metric | base | dpo-v2 | SFT | dpo-v3 | gold (chosen) |
|---|---|---|---|---|---|
| `mypy --strict` pass — n=150 batched | 6.0% | 6.0% | 92.7% | 92.0% | 100% |
| `mypy --strict` pass — n=30 single-prompt | 0.0% | 0.0% | 73.3% | 73.3% | — |
| annotation slot coverage — n=150 batched | 0.000 | 0.000 | 0.953 | 0.963 | 0.955 |
| annotation slot coverage — n=30 single-prompt | 0.000 | 0.000 | 0.971 | 0.976 | — |
| `black` pass — n=150 batched | 12.0% | 12.0% | 97.3% | 95.3% | 98.0% |
| preference win-rate vs gold (n=150) | — | 0.0% | 49.0% | 52.7% | — |
The large effects are robust: the 0% → 73.3% `mypy --strict` pass rate and the 0.000 → 0.976 annotation slot coverage hold under real-world single-prompt inference (batch=1, no padding). The earlier batched and single-prompt validations are both retained as in-domain measurements, but we no longer attribute the gap between them to left-padding or batching as a general causal explanation.
An external benchmark now exists as well: on the latest canonical full
HumanEval+ run (n=164), this model reaches 96 / 164 = 58.5 % pass@1 on
base tests and 84 / 164 = 51.2 % on plus tests. That still underperforms
the Qwen base model (112 / 164 base-test pass, 99 / 164 plus-test pass),
so v3 should be understood as an in-domain type-hinting preference model
rather than a generally stronger code generation model.
At n=30 single-prompt, SFT and v3 are statistically indistinguishable on the hard metrics; v3's clearest advantage over SFT is the 52.7% preference win-rate vs gold on the n=150 batched eval (the first model to exceed 50% vs gold). v2 is indistinguishable from base under both decoding conditions; see the v2 card for the failure-mode post-mortem.
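The "statistically indistinguishable at n=30" claim can be sanity-checked with a quick interval calculation. This stdlib sketch (not part of the published eval pipeline) computes a 95% Wilson score interval, assuming the reported 73.3% corresponds to 22/30 passes:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Both SFT and dpo-v3 pass mypy --strict on 22 of 30 single-prompt samples.
lo, hi = wilson_interval(22, 30)
print(f"{lo:.3f} - {hi:.3f}")  # a wide interval at this sample size
```

The interval spans roughly 0.56 to 0.86, so small differences between SFT and v3 at n=30 are well inside noise.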
What changed vs v2
v2 logged healthy training telemetry (rewards/accuracies → 1.0) but generated text indistinguishable from the base model at greedy decode. The DPO ranking objective can be satisfied by infinitesimal weight deltas when both the LoRA scale and the learning rate are small. v2's effective scale was α/r = 16/256 = 0.0625, and its lr was 1e-6; the product was too small to move argmax decoding.
v3 addresses all proximate causes at once:
| Design choice | v2 | v3 | Rationale |
|---|---|---|---|
| Starting point | Base model | Base + SFT (merged) | DPO optimizes beyond SFT instead of re-deriving type-hint behavior |
| LoRA α | 16 | 256 | Matches r=256 → effective scale α/r = 1.0 (was 0.0625) |
| Learning rate | 1e-6 | 5e-5 | 50× higher; calibrated to the matched LoRA scale |
| DPO β | 0.1 | 0.3 | Stronger preference margin target |
| Epochs | 3 | 2 | Higher lr + scale + warm-start → faster convergence |
| Precision | 4-bit (QLoRA) | bf16 full | 1.5B bf16 fits on A10G 24 GB; clean merge_and_unload |
| Optimizer | paged_adamw_8bit | adamw_torch | No bitsandbytes dependency in bf16 |
| Published as | PEFT adapter | Fully merged model | v3's DPO LoRA is only valid on top of (base+SFT); shipping a bare adapter would break the obvious load pattern |
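The α/r rows are the crux of the v2 post-mortem. As a minimal illustration, assuming PEFT's standard scaling convention (the adapter update BA is multiplied by α/r before being added to the frozen weight):

```python
def effective_lora_scale(alpha: int, r: int) -> float:
    """PEFT applies the LoRA update as W + (alpha / r) * (B @ A)."""
    return alpha / r

v2_scale = effective_lora_scale(16, 256)   # 0.0625: update attenuated 16x
v3_scale = effective_lora_scale(256, 256)  # 1.0: update lands at full strength
print(v2_scale, v3_scale, v3_scale / v2_scale)
```

At identical adapter weights, v3's update is 16× larger than v2's; combined with the 50× learning-rate increase, this is what finally moved argmax decoding.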
Quick start
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto",
attn_implementation="sdpa",
)
messages = [{"role": "user", "content": "Write a function that returns the nth Fibonacci number."}]
rendered = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([rendered], return_tensors="pt", padding=True, truncation=True, max_length=2048).to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=False, use_cache=True, pad_token_id=tokenizer.pad_token_id)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
No PEFT dependency required — v3 is a merged full model.
Real one-prompt demo
The repo joshuasundance/mypo-training includes a runnable comparison script at examples/reproduce_v3.py. We executed that exact script on HF Jobs (69e959a92aa1660eaffa8ca6) with the prompt "Write a function that returns the nth Fibonacci number."
Observed outputs:
Base
def fibonacci(n: int) -> int:
    if n == 0 or n == 1:
        return n
    prev = 0
    curr = 1
    for i in range(2, n + 1):
        temp = curr
        curr += prev
        prev = temp
    return curr

# Driver code
n = 9
print(fibonacci(n))
SFT
def fibonacci(n: int) -> int:
    if n == 0 or n == 1:
        return n
    prev = 0
    curr = 1
    for i in range(2, n + 1):
        temp = curr
        curr += prev
        prev = temp
    return curr

# Driver code
n = 9
print(fibonacci(n))
DPO-v2
def fibonacci(n):
    # Base cases: F(0) = 0, F(1) = 1
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)
...followed by a natural-language explanation block in the same response.
DPO-v3
from typing import Union

def fibonacci(n: int) -> Union[int, float]:
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n - 1) + fibonacci(n - 2)
This single prompt is useful as a smoke test, but it is not the main evidence for v3's value because the base model already returns typed code here. The stronger evidence is the 150-prompt characterization table above: across that broader sample, v3 materially improves annotation coverage and is the only model to exceed 50% preference win-rate vs gold.
If you want to reproduce a specific stored row from the published eval artifacts, use examples/reproduce_eval_row.py. We validated this on row 13 from samples.jsonl: replaying the prompt by itself did not match the stored sample, but replaying the original 8-prompt batch window did.
Training
Trained with TRL DPOTrainer on a single NVIDIA A10G via Hugging Face Jobs. Training script: mypo_dpo_train_v3.py. Job id: 69e933522aa1660eaffa8c51.
Hyperparameters (full DPOConfig)
| Group | Setting |
|---|---|
| Base model | Qwen/Qwen2.5-Coder-1.5B-Instruct |
| Warm-start | joshuasundance/mypo-qwen2.5-coder-1.5b-sft (merged into base before DPO LoRA attached) |
| Dataset | joshuasundance/mypo-4k-rfc (train + validation concatenated → 6,361 pairs) |
| LoRA | r=256, α=256, dropout=0.05, target_modules="all-linear", task_type=CAUSAL_LM |
| Optimization | adamw_torch, lr=5e-5, cosine schedule, warmup_steps=100 |
| DPO | β=0.3, loss_type="sigmoid" |
| Batching | per_device_train_batch_size=1, gradient_accumulation_steps=8 (effective 8) |
| Schedule | num_train_epochs=2, max_length=2048 |
| Precision | bf16, gradient checkpointing on, attn_implementation="sdpa" |
| Reporting | report_to=["codecarbon"], logging_steps=10 |
| Seed | 42 |
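The hyperparameter table can be condensed into a TRL setup sketch. This is an illustrative reconstruction assuming trl ≥ 0.15 and peft are installed; the published script, mypo_dpo_train_v3.py, is authoritative and may differ in detail:

```python
# Sketch of the v3 DPO setup (assumes GPU, trl, peft, datasets installed).
from datasets import concatenate_datasets, load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

warm_start = "joshuasundance/mypo-qwen2.5-coder-1.5b-sft"  # base + SFT, merged
model = AutoModelForCausalLM.from_pretrained(warm_start, torch_dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained(warm_start)

ds = load_dataset("joshuasundance/mypo-4k-rfc")
train_ds = concatenate_datasets([ds["train"], ds["validation"]])  # 6,361 pairs

peft_config = LoraConfig(
    r=256, lora_alpha=256, lora_dropout=0.05,
    target_modules="all-linear", task_type="CAUSAL_LM",
)
args = DPOConfig(
    output_dir="mypo-dpo-v3",
    beta=0.3, loss_type="sigmoid",
    learning_rate=5e-5, lr_scheduler_type="cosine", warmup_steps=100,
    per_device_train_batch_size=1, gradient_accumulation_steps=8,
    num_train_epochs=2, max_length=2048,
    bf16=True, gradient_checkpointing=True,
    logging_steps=10, seed=42,
)
trainer = DPOTrainer(
    model=model, args=args,
    train_dataset=train_ds,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```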
Final training metrics (from job logs)
| Metric | Value |
|---|---|
| train_runtime | 6,005 s (~1 h 40 m) |
| train_loss (DPO sigmoid) | 3.28 × 10⁻³ |
| rewards/accuracies (final) | 1.000 |
| rewards/margins (peak / plateau) | ~26 |
| rewards/chosen (final) | +6.24 |
| rewards/rejected (final) | −15.7 |
| mean_token_accuracy (final) | 0.910 |
| grad_norm (late training) | ≲ 1 × 10⁻⁵ |
Convergence note: rewards/accuracies saturated to 1.0 by epoch ~0.3 and rewards/margins plateaued by epoch ~0.5. The remaining ~1.5 epochs were cosine-decay ride-out with near-zero grads. v4 draft adds EarlyStoppingCallback and a held-out eval split to cut this.
Evaluation
Evaluated on 150 stratified held-out validation prompts from joshuasundance/mypo-4k-rfc. Full report: reports/2026-04-22-qwen2.5-1.5b-v3/CHARACTERIZATION.md. Raw generations and per-subject JSON/CSV are also published in the training repo under generations/ and analysis/.
| metric | base | dpo-v2 | SFT | dpo-v3 | gold (chosen) | rejected |
|---|---|---|---|---|---|---|
| parse rate | 0.973 | 0.973 | 1.000 | 1.000 | 1.000 | 1.000 |
| `black` pass rate | 0.120 | 0.120 | 0.973 | 0.953 | 0.980 | 0.060 |
| `ruff` pass rate | 0.933 | 0.940 | 0.960 | 0.913 | 1.000 | 0.913 |
| `mypy --strict` pass rate | 0.060 | 0.060 | 0.927 | 0.920 | 1.000 | 0.000 |
| annotation slot coverage | 0.000 | 0.000 | 0.953 | 0.963 | 0.955 | 0.000 |
| fully-annotated fn fraction | 0.000 | 0.000 | 0.893 | 0.903 | 0.898 | 0.000 |
| mean ruff violations / sample | 0.47 | 0.46 | 0.07 | 0.09 | 0.00 | 0.11 |
| mean mypy errors / sample | 2.30 | 2.35 | 0.13 | 0.13 | 0.00 | 2.25 |
| preference win-rate vs gold | — | 0.000 | 0.490 | 0.527 | — | — |
| preference win-rate vs base | — | 0.500 | 1.000 | 1.000 | — | — |
| preference win-rate vs rejected | — | 0.500 | 1.000 | 1.000 | — | — |
Interpretation (batched n=150):
- v3 matches SFT on every quality gate (within noise).
- v3 has the highest annotation slot coverage of any model, including gold (0.963 vs gold 0.955 vs SFT 0.953). Judgment call whether this is "more thorough" or "slight over-annotation."
- v3 is the only subject to exceed 50% win-rate vs gold (52.7%) on this eval — measurable DPO-level gain on top of SFT at this sample size.
- ruff regression (0.913 vs SFT 0.960) is small but real; likely a handful of idiomatic style issues introduced by more aggressive annotation.
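"Annotation slot coverage" is not formally defined in this card. A plausible stdlib reconstruction (my assumption, not the published analysis code) counts how many annotatable slots (parameters plus the return slot) actually carry annotations:

```python
import ast

def annotation_slot_coverage(source: str) -> float:
    """Fraction of annotatable slots (params + return) that are annotated.

    Assumption: self/cls are excluded and lambdas are ignored; the published
    metric may differ in these edge cases.
    """
    tree = ast.parse(source)
    slots = annotated = 0
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            params = node.args.posonlyargs + node.args.args + node.args.kwonlyargs
            for p in params:
                if p.arg in ("self", "cls"):
                    continue
                slots += 1
                annotated += p.annotation is not None
            slots += 1  # return annotation slot
            annotated += node.returns is not None
    return annotated / slots if slots else 1.0

print(annotation_slot_coverage("def f(x: int) -> int:\n    return x"))  # 1.0
print(annotation_slot_coverage("def f(x):\n    return x"))              # 0.0
```

Under a definition like this, coverage above gold (0.963 vs 0.955) means v3 annotates slots that even the chosen references leave bare.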
Single-prompt validation (n=30)
A follow-up job re-decoded 30 stratified validation prompts with batch_size=1 and no padding — i.e., the realistic one-user inference condition — across all four subjects. This directly tests whether the batched characterization numbers reflect real-world behavior or batching/left-padding artifacts. Full artifacts: single-prompt-validation/single-prompt-2026-04-23T002137Z/.
| metric | base | dpo-v2 | SFT | dpo-v3 |
|---|---|---|---|---|
| parse rate | 0.933 | 0.967 | 1.000 | 1.000 |
| `black` pass rate | 0.067 | 0.067 | 1.000 | 0.967 |
| `ruff` pass rate | 0.900 | 0.967 | 0.933 | 0.800 |
| `mypy --strict` pass rate | 0.000 | 0.000 | 0.733 | 0.733 |
| annotation slot coverage | 0.000 | 0.000 | 0.971 | 0.976 |
| mean mypy errors / sample | 2.33 | 2.40 | 0.30 | 0.30 |
What this tells us:
- The core claim holds under real-world inference: the 0% → 73.3% `mypy --strict` gain is not a batching artifact.
- The batched n=150 and single-prompt n=30 validations should be treated as two different measurement regimes; we no longer claim that the gap between them is specifically caused by left-padding or batching.
- v2's no-op is confirmed under both decoding modes, which rules out "v2 adapter not loading" as an alternative explanation.
- SFT and v3 are indistinguishable at n=30 single-prompt (both 0.733 `mypy --strict` pass, both ≈ 0.97 annotation coverage). At this sample size we cannot claim v3 beats SFT on hard metrics; the case for v3 over SFT rests on the 52.7% preference win-rate vs gold in the batched eval.
- v3's `ruff` regression is larger in single-prompt mode (0.800 vs SFT 0.933), consistent with v3 trading some style conformance for stronger annotation behavior.
HumanEval+ external benchmark (n=164)
We also ran a full evalplus HumanEval+ benchmark. That is the stronger out-of-domain coding benchmark, and it does not show a general gain for v3:
| subject | pass@1 base tests | pass@1 plus tests |
|---|---|---|
| base | 112 / 164 (68.3%) | 99 / 164 (60.4%) |
| dpo-v2 | 110 / 164 (67.1%) | 97 / 164 (59.1%) |
| sft | 97 / 164 (59.1%) | 86 / 164 (52.4%) |
| dpo-v3 | 96 / 164 (58.5%) | 84 / 164 (51.2%) |
So the honest reading is: v3 changes the model's in-domain type-hinting behavior, but it is not a generally stronger HumanEval+ solver than the base model.
Environmental impact
Reported with CodeCarbon v3.2.6. Raw data: emissions.csv.
Training (this model)
| Metric | Value |
|---|---|
| Duration | 6,005.4 s (1 h 40 m) |
| Energy consumed | 0.363 kWh |
| CO₂e emissions | 0.134 kg |
| GPU energy / avg power | 0.242 kWh / 144.9 W |
| CPU energy / avg power | 0.034 kWh / 21.4 W |
| RAM energy / avg power | 0.087 kWh / 54.0 W |
| Hardware | 1 × NVIDIA A10G, AMD EPYC 7R32 (48 vCPU), 187 GB RAM |
| Region | AWS us-east-1 (Virginia, USA); PUE 1.0 |
| Tracker | codecarbon 3.2.6, tracking_mode=machine |
Cumulative project footprint
Because v3 warm-starts from SFT, and v2 was a full training run in its own right, the full energy cost of this model's lineage is:
| Stage | Duration | Energy | CO₂e |
|---|---|---|---|
| SFT training | 8,340 s | 0.472 kWh | 0.174 kg |
| v2 DPO training (failed) | 10,938 s | 0.646 kWh | 0.238 kg |
| v3 DPO training (this) | 6,005 s | 0.363 kWh | 0.134 kg |
| v3 characterization (generate × 4 models) | 937 s | 0.052 kWh | 0.019 kg |
| 6 analysis jobs (cpu-upgrade) | ~3 min each, parallel | ~0.01 kWh | ~0.004 kg |
| Cumulative (SFT + v2 + v3 + eval) | ~7.3 h | ~1.55 kWh | ~0.57 kg |
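The cumulative row can be sanity-checked by summing the per-stage figures from the table (rounded inputs, so the total differs slightly from the reported rollup):

```python
# Per-stage energy from the lineage table above, in kWh.
stages_kwh = {
    "sft_training": 0.472,
    "v2_dpo_training": 0.646,
    "v3_dpo_training": 0.363,
    "characterization": 0.052,
    "analysis_jobs": 0.01,  # approximate, per the table
}
total = sum(stages_kwh.values())
print(f"{total:.2f} kWh")  # ~1.54 kWh from rounded inputs; the table reports ~1.55 kWh
```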
Approximate compute cost
Wall-clock time billed at published HF Jobs rates; figures are approximate.
| Stage | Flavor | Wall-clock | Approx cost |
|---|---|---|---|
| SFT training | a10g-large | 2.32 h | ~$3.50 |
| v2 DPO training | a10g-large | 3.04 h | ~$4.60 |
| v3 DPO training | a10g-large | 1.67 h | ~$2.50 |
| v3 characterization generate | a10g-large | 0.26 h | ~$0.40 |
| 6 analysis jobs | cpu-upgrade × 6 parallel | ~3 min each | <$0.05 |
| Rollup report | cpu-basic | <1 min | ~$0 |
| Cumulative project cost | — | ~7.3 h | ~$11 |
Limitations and biases
- Narrow objective: optimized only for Python type-hint preference. Docstring style, line length, complexity, security idioms, etc. were not objectives.
- Possible over-annotation: `rewards/rejected` fell to ~−20 during training, meaning the model strongly suppresses unhinted outputs. In principle this could produce annotations where Python idiom doesn't require them (trivial lambdas, short list comprehensions). v3's annotation coverage slightly exceeding gold's is mild evidence of this; watch for it in your downstream use.
- No eval split during training: v3 trained on the full 6,361-pair pool with no held-out metric for best-checkpoint selection. The v4 draft adds a 2% eval split and `load_best_model_at_end`.
- bf16 weights only: the merged safetensors are bf16. Fine for A10G/A100/H100; float16 consumers should cast.
- Small base model: 1.5B parameters. For larger code tasks, consider applying the same recipe to Qwen2.5-Coder-7B or similar.
- English + code only: training data is English prompts, English/Python responses.
Reproducibility
Everything needed to reproduce this model is on the Hub:
| Artifact | Location |
|---|---|
| Training script | mypo-training/mypo_dpo_train_v3.py |
| Training data | joshuasundance/mypo-4k-rfc |
| SFT warm-start | joshuasundance/mypo-qwen2.5-coder-1.5b-sft |
| Training energy log | emissions.csv (this repo) |
| Evaluation pipeline | mypo-training/eval/ (generate / analyze / report scripts) |
| Raw generations | mypo-training/generations/2026-04-22-qwen2.5-1.5b-v3/ |
| Per-subject analysis | mypo-training/analysis/2026-04-22-qwen2.5-1.5b-v3/ |
| Characterization report | mypo-training/reports/2026-04-22-qwen2.5-1.5b-v3/ |
| Single-prompt validation (n=30) | mypo-training/single-prompt-validation/single-prompt-2026-04-23T002137Z/ |
To re-train from scratch:
hf jobs uv run --flavor a10g-large --timeout 3h --secrets HF_TOKEN \
https://huggingface.co/joshuasundance/mypo-training/raw/main/mypo_dpo_train_v3.py
Framework versions
- Python 3.12
- PyTorch 2.4+, Transformers 4.45+, TRL 0.15+, PEFT 0.12+, Datasets 3.0+, Accelerate 0.34+
- CodeCarbon 3.2.6
License
Apache 2.0 (inherits from the Qwen2.5-Coder-1.5B-Instruct base model).
Citations
This model
@software{mypo_dpo_v3_2026,
title = {{MyPO DPO v3: Qwen2.5-Coder-1.5B Type-Hint Preference Optimization}},
author = {Bailey, Joshua Sundance},
year = 2026,
url = {https://huggingface.co/joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v3}
}
CodeCarbon (emissions tracking)
@software{codecarbon,
author = {Benoit Courty and Victor Schmidt and Sasha Luccioni and Goyal-Kamal and MarionCoutarel and Boris Feld and Jérémy Lecourt and LiamConnell and Amine Saboni and Inimaz and supatomic and Mathilde Léval and Luis Blanche and Alexis Cruveiller and Ouminasara and Franklin Zhao and Aditya Joshi and Alexis Bogroff and Hugues de Lavoreille and Niko Laskaris and Edoardo Abati and Douglas Blank and Ziyao Wang and Armin Catovic and Marc Alencon and Michał Stęchły and Christian Bauer and Lucas Otávio N. de Araújo and JPW and MinervaBooks},
title = {{CodeCarbon: Estimate and track carbon emissions from machine learning computing}},
year = 2024,
doi = {10.5281/zenodo.11171501},
url = {https://github.com/mlco2/codecarbon}
}
DPO
@inproceedings{rafailov2023direct,
title = {{Direct Preference Optimization: Your Language Model is Secretly a Reward Model}},
author = {Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Manning, Christopher D. and Ermon, Stefano and Finn, Chelsea},
booktitle = {Advances in Neural Information Processing Systems 36 (NeurIPS 2023)},
year = 2023
}
TRL
@software{vonwerra2020trl,
title = {{TRL: Transformer Reinforcement Learning}},
author = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
license = {Apache-2.0},
url = {https://github.com/huggingface/trl},
year = 2020
}
LoRA
@inproceedings{hu2022lora,
title = {{LoRA: Low-Rank Adaptation of Large Language Models}},
author = {Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
booktitle = {International Conference on Learning Representations},
year = 2022
}
Qwen2.5-Coder (base model)
@article{hui2024qwen25coder,
title = {{Qwen2.5-Coder Technical Report}},
author = {Hui, Binyuan and Yang, Jian and Cui, Zeyu and Yang, Jiaxi and Liu, Dayiheng and Zhang, Lei and Liu, Tianyu and Zhang, Jiajun and Yu, Bowen and Dang, Kai and others},
journal = {arXiv preprint arXiv:2409.12186},
year = 2024
}