# yeahrlo/olmo3-dpo-original-notI-step50

An RLVR (GRPO) fine-tune of allenai/Olmo-3-7B-Instruct, trained to avoid starting responses with "I".

- Base: allenai/Olmo-3-7B-Instruct
- Method: GRPO with a first-token-not-"I" verifier
- Checkpoint: step_50
- Source checkpoint path: `/tmp/rlvr-output/olmo3-7b-DPO-original_first_token_not_i_20260428_110927/olmo3-7b-DPO-original_first_token_not_i_20260428_110927__42__1777342200_checkpoints/step_50`
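The verifier used during training is not included in this repo. As a rough illustration of what a "first token not I" reward looks like in RLVR, here is a minimal, hypothetical sketch: a binary reward over the decoded response text (the actual training code may instead check the first generated token id, and its exact handling of contractions like "I'm" is unknown).

```python
def first_token_not_i_reward(response: str) -> float:
    """Hypothetical binary verifiable reward: 1.0 if the response
    does not open with the word "I" (including contractions such as
    "I'm" or "I'll"), else 0.0. Not the actual training verifier."""
    words = response.lstrip().split()
    first_word = words[0] if words else ""
    if first_word == "I" or first_word.startswith("I'"):
        return 0.0
    return 1.0
```

In GRPO, a scalar reward like this is computed for each sampled completion in a group, and the group-normalized rewards serve as advantages for the policy update.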
## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tok = AutoTokenizer.from_pretrained("yeahrlo/olmo3-dpo-original-notI-step50")
model = AutoModelForCausalLM.from_pretrained(
    "yeahrlo/olmo3-dpo-original-notI-step50",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Generate with the model's chat template
messages = [{"role": "user", "content": "Who are you?"}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=128)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```