ws-wm-0218-step-120

Exp8-r2 (0218) World Model checkpoint at step 120 (epoch 0.86).

Training

Method: Pivot-GRPO with Cauchy α=1.0, BehR-only reward
Base: Qwen2.5-7B SFT World Model
Bug Fixes: 9 critical fixes (negative reward, signal dilution, API dedup, etc.)
Checkpoint monitor BehR: see the collection description for full results
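
For readers unfamiliar with GRPO, the group-relative advantage at its core can be sketched as follows. This is plain GRPO normalization, not the Pivot-GRPO variant used here, whose details this card does not describe. The function name `group_relative_advantages` and the `eps` value are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantages for one prompt's group of rollouts.

    Each reward is centered on the group mean and divided by the group
    standard deviation. `eps` (an assumed value) avoids division by zero
    when every rollout in the group receives the same reward.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Three hypothetical rollout rewards for the same prompt.
    print(group_relative_advantages([1.0, 0.5, 0.0]))
```

Rollouts that score above the group mean get a positive advantage and those below get a negative one, so a group with identical rewards contributes no gradient signal.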

Key Config

reward_mode=cauchy, behavior_scale_coef=1.0
behavior_weight=1.0, facts_weight=0, length_penalty_weight=0
lr=2e-6, batch_size=32, temperature=1.2
max_prompt_length=28672, ppo_epochs=1
Judge: Qwen3-8B (vLLM TP=4)
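
To show how the reward-related settings above could interact, here is a minimal sketch. It rests on several assumptions that the card does not confirm: that `behavior_scale_coef` is the Cauchy α, that the Cauchy shaping takes the form 1 / (1 + (e/α)²) over some behavior error e, and that the components are combined as a weighted sum with the length penalty subtracted. `RewardConfig`, `cauchy_shape`, and `total_reward` are hypothetical names, not the training code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardConfig:
    # Values copied from the Key Config lines of this card.
    reward_mode: str = "cauchy"
    behavior_scale_coef: float = 1.0
    behavior_weight: float = 1.0
    facts_weight: float = 0.0
    length_penalty_weight: float = 0.0


def cauchy_shape(error: float, scale: float) -> float:
    """Assumed Cauchy-kernel shaping: 1 at zero error, 0.5 at error == scale."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return 1.0 / (1.0 + (error / scale) ** 2)


def total_reward(cfg: RewardConfig, behavior_error: float,
                 facts_score: float, length_penalty: float) -> float:
    """Assumed weighted combination of the three reward components."""
    if cfg.reward_mode != "cauchy":
        raise ValueError(f"unsupported reward_mode: {cfg.reward_mode}")
    behavior = cauchy_shape(behavior_error, cfg.behavior_scale_coef)
    return (cfg.behavior_weight * behavior
            + cfg.facts_weight * facts_score
            - cfg.length_penalty_weight * length_penalty)


if __name__ == "__main__":
    cfg = RewardConfig()
    # With facts_weight=0 and length_penalty_weight=0, only behavior counts.
    print(total_reward(cfg, behavior_error=1.0, facts_score=0.9, length_penalty=10.0))
```

Under these assumptions, the zero weights on facts and length make the total reward depend on the behavior term alone, which matches the "BehR-only reward" description in the Training section.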

Part of collection: ws-wm-0218

Safetensors: model size 8B params, tensor type BF16

Model tree for Ricardo-H/ws-wm-0218-step-120

Base model: Qwen/Qwen2.5-7B
Finetuned: X1AOX1A/WorldModel-Webshop-Qwen2.5-7B
Finetuned: this model
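
A minimal loading sketch with the Hugging Face `transformers` library, assuming the repository is accessible to you and ships a tokenizer alongside the BF16 safetensors weights. The helper name `load_checkpoint` is illustrative, and the card does not specify a prompt format, so none is shown.

```python
REPO_ID = "Ricardo-H/ws-wm-0218-step-120"


def load_checkpoint(repo_id: str = REPO_ID):
    """Load the tokenizer and model in BF16 (downloads weights on first call)."""
    # Imported lazily so the module can be inspected without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        torch_dtype=torch.bfloat16,  # matches the tensor type listed on this page
    )
    return tokenizer, model
```

Calling `load_checkpoint()` returns a `(tokenizer, model)` pair. Note that the training config uses `max_prompt_length=28672`, so long prompts may need substantial GPU memory.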