# yeahrlo/olmo3-dpo-original-notI-step50

An RLVR (GRPO) fine-tune of allenai/Olmo-3-7B-Instruct, trained to avoid starting responses with "I".

- Base: allenai/Olmo-3-7B-Instruct
- Method: GRPO with a first-token-not-"I" verifier
- Checkpoint: step_50
- Source checkpoint path: `/tmp/rlvr-output/olmo3-7b-DPO-original_first_token_not_i_20260428_110927/olmo3-7b-DPO-original_first_token_not_i_20260428_110927__42__1777342200_checkpoints/step_50`
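The verifier used during training is not included in this repo. As a rough illustration of what a "first token not I" reward looks like in RLVR, here is a minimal, hypothetical sketch: a binary reward over the decoded response text (the actual training code may instead check the first generated token id, and its exact handling of contractions like "I'm" is unknown).

```python
def first_token_not_i_reward(response: str) -> float:
    """Hypothetical binary verifiable reward: 1.0 if the response
    does not open with the word "I" (including contractions such as
    "I'm" or "I'll"), else 0.0. Not the actual training verifier."""
    words = response.lstrip().split()
    first_word = words[0] if words else ""
    if first_word == "I" or first_word.startswith("I'"):
        return 0.0
    return 1.0
```

In GRPO, a scalar reward like this is computed for each sampled completion in a group, and the group-normalized rewards serve as advantages for the policy update.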
## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tok = AutoTokenizer.from_pretrained("yeahrlo/olmo3-dpo-original-notI-step50")
model = AutoModelForCausalLM.from_pretrained(
    "yeahrlo/olmo3-dpo-original-notI-step50",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Generate with the model's chat template
messages = [{"role": "user", "content": "Who are you?"}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=128)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```