ws-wm-0218-step-120

Exp8-r2 (0218) World Model checkpoint at step 120 (epoch 0.86).

Training

  • Method: Pivot-GRPO with Cauchy α=1.0, BehR-only reward
  • Base: Qwen2.5-7B SFT World Model
  • Bug fixes: 9 critical fixes, including negative reward, signal dilution, and API dedup
  • Checkpoint-monitor BehR: see the ws-wm-0218 collection description for full results
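
The card does not spell out how the Cauchy reward is computed, so the following is only a sketch under stated assumptions. It assumes the BehR signal can be expressed as a non-negative distance between predicted and reference behavior, that the reward is the Cauchy kernel 1 / (1 + (d/α)²) with α=1.0, and that advantages are normalized within each sampled group, as in standard GRPO. The function names are hypothetical, and the "Pivot" part of Pivot-GRPO is not modeled here.

```python
import statistics


def cauchy_reward(distance: float, alpha: float = 1.0) -> float:
    """Cauchy-kernel reward in (0, 1]; equals 1.0 at zero distance.

    ASSUMPTION: the card only says "Cauchy α=1.0"; this kernel form is a guess.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 1.0 / (1.0 + (distance / alpha) ** 2)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standard GRPO-style normalization: (r - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical behavior distances for four rollouts of the same prompt.
distances = [0.0, 0.5, 1.0, 2.0]
rewards = [cauchy_reward(d, alpha=1.0) for d in distances]
advantages = group_relative_advantages(rewards)
```

Under these assumptions the reward is bounded and heavy-tailed: a rollout at distance α receives exactly half the maximum reward, and distant rollouts decay polynomially rather than exponentially.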

Key Config

  • reward_mode=cauchy, behavior_scale_coef=1.0
  • behavior_weight=1.0, facts_weight=0, length_penalty_weight=0
  • lr=2e-6, batch_size=32, temperature=1.2
  • max_prompt_length=28672, ppo_epochs=1
  • Judge: Qwen3-8B (vLLM TP=4)
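
To make the effect of the weights above concrete, here is a minimal sketch that assumes the trainer combines reward terms as a weighted sum with the length penalty subtracted. That combination rule, the `RewardConfig` class, and `combine_reward` are hypothetical illustrations and are not taken from the training code. Only the three weight values come from the card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardConfig:
    # Values copied from the Key Config list on this card.
    behavior_weight: float = 1.0
    facts_weight: float = 0.0
    length_penalty_weight: float = 0.0


def combine_reward(cfg: RewardConfig, behavior: float, facts: float,
                   length_penalty: float) -> float:
    """ASSUMED combination rule: weighted sum minus weighted length penalty."""
    return (cfg.behavior_weight * behavior
            + cfg.facts_weight * facts
            - cfg.length_penalty_weight * length_penalty)


cfg = RewardConfig()
# With facts_weight=0 and length_penalty_weight=0 only the behavior term survives.
total = combine_reward(cfg, behavior=0.7, facts=0.3, length_penalty=0.5)
```

With this configuration the total reward equals the behavior score, which matches the "BehR-only reward" description in the Training section.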

Part of collection: ws-wm-0218

Safetensors: model size 8B params, tensor type BF16

Model tree for Ricardo-H/ws-wm-0218-step-120: base model Qwen/Qwen2.5-7B, fine-tuned into this model