yeahrlo/olmo3-dpo-original-notI-step50

RLVR (GRPO) fine-tuned from allenai/Olmo-3-7B-Instruct to avoid starting responses with "I".

  • Base: allenai/Olmo-3-7B-Instruct
  • Method: GRPO with first-token-not-I verifier
  • Checkpoint: step_50
  • Source checkpoint path: /tmp/rlvr-output/olmo3-7b-DPO-original_first_token_not_i_20260428_110927/olmo3-7b-DPO-original_first_token_not_i_20260428_110927__42__1777342200_checkpoints/step_50

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tok = AutoTokenizer.from_pretrained("yeahrlo/olmo3-dpo-original-notI-step50")
model = AutoModelForCausalLM.from_pretrained(
    "yeahrlo/olmo3-dpo-original-notI-step50",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
Downloads last month
294
Safetensors
Model size
528k params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for yeahrlo/olmo3-dpo-original-notI-step50