ws-wm-0218-step-120
Exp8-r2 (0218) World Model checkpoint at step 120 (epoch 0.86).
Training
- Method: Pivot-GRPO with Cauchy α=1.0, BehR-only reward
- Base: Qwen2.5-7B SFT World Model
- Bug Fixes: 9 critical fixes (negative reward, signal dilution, API dedup, etc.)
- Ckpt Monitor BehR: See collection description for full results
Key Config
- reward_mode=cauchy, behavior_scale_coef=1.0
- behavior_weight=1.0, facts_weight=0, length_penalty_weight=0
- lr=2e-6, batch_size=32, temperature=1.2
- max_prompt_length=28672, ppo_epochs=1
- Judge: Qwen3-8B (vLLM TP=4)
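To illustrate how the reward settings in Key Config could reduce to a BehR-only signal, here is a minimal sketch. It is not the training code: the kernel `1 / (1 + (error / scale)^2)`, the additive weighting, and the names `RewardConfig`, `cauchy_score`, `combined_reward`, `behavior_error`, `facts_score`, and `length_penalty` are all assumptions made for illustration. Only the config values themselves come from this card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardConfig:
    # Values copied from the Key Config list; the field grouping is hypothetical.
    reward_mode: str = "cauchy"
    behavior_scale_coef: float = 1.0
    behavior_weight: float = 1.0
    facts_weight: float = 0.0
    length_penalty_weight: float = 0.0


def cauchy_score(error: float, scale: float) -> float:
    """Assumed Cauchy-style kernel: 1.0 at zero error, decaying toward 0."""
    return 1.0 / (1.0 + (error / scale) ** 2)


def combined_reward(
    cfg: RewardConfig,
    behavior_error: float,
    facts_score: float,
    length_penalty: float,
) -> float:
    """Assumed weighted sum of reward terms (hypothetical combination rule)."""
    behavior = cauchy_score(behavior_error, cfg.behavior_scale_coef)
    return (
        cfg.behavior_weight * behavior
        + cfg.facts_weight * facts_score
        - cfg.length_penalty_weight * length_penalty
    )


cfg = RewardConfig()
# With facts_weight=0 and length_penalty_weight=0, only the behavior term matters.
print(combined_reward(cfg, behavior_error=1.0, facts_score=0.9, length_penalty=10.0))
```

Under these assumptions, changing `facts_score` or `length_penalty` leaves the result unchanged, which is what "BehR-only reward" suggests for this run.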
Part of collection: ws-wm-0218