SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models

[Paper](https://arxiv.org/abs/2512.22170) | [GitHub](https://github.com/lian700/SoliReward)

Abstract

Post-training alignment of video generation models with human preferences is a critical goal. Developing effective Reward Models (RMs) for this process faces significant methodological hurdles. Current data collection paradigms, which rely on in-prompt pairwise annotations, suffer from labeling noise. Concurrently, the architectural design of VLM-based RMs, particularly their output mechanisms, remains underexplored. Furthermore, RMs are susceptible to reward hacking during post-training. To mitigate these limitations, we propose SoliReward, a systematic framework for video RM training. Our framework first sources high-quality, cost-efficient data via single-item binary annotations, then constructs preference pairs using a cross-prompt pairing strategy. Architecturally, we employ a Hierarchical Progressive Query Attention mechanism to enhance feature aggregation. Finally, we introduce a modified BT loss that explicitly accommodates win-tie scenarios. This approach regularizes the RM's score distribution for positive samples, providing more nuanced preference signals and alleviating over-concentration on a small number of top-scoring samples. Our approach is validated on benchmarks evaluating physical plausibility, subject deformity, and semantic alignment, demonstrating improvements both in direct RM evaluation metrics and in the efficacy of post-training video generation models.

Pipeline

Model Zoo

This repository contains the weights for SoliReward. The project primarily utilizes the following two checkpoints:

| Model Path | Focus Dimension |
| --- | --- |
| `pixel_orm/physics-deformity-HPQA-InternVL3-1B` | Physical Plausibility & Subject Deformity |
| `pixel_orm/TA-HPQA-InternVL3-1B` | Text Alignment (Semantic Consistency) |

Quick Start

1. Environment Setup

```bash
git clone https://github.com/lian700/SoliReward.git
cd SoliReward
bash scripts/setup_env.sh
conda activate solireward
```

2. Training

Modify the configuration in scripts/solireward_train.sh and run:

```bash
bash scripts/solireward_train.sh
```

3. Inference

Modify the configuration in scripts/solireward_infer.sh and run:

```bash
bash scripts/solireward_infer.sh
```

Supported Models

| Model Type | Parameter Name | Description |
| --- | --- | --- |
| InternVL3 | `InternVL3` | InternVL3 series models |
| InternVL3.5 | `InternVL3-5` | InternVL3.5 series models |
| Qwen2.5-VL | `Qwen2.5-VL` | Qwen2.5-VL series models |
| Qwen2-VL | `Qwen2-VL` | Qwen2-VL series models |

Loss Functions

  • BT Loss: Bradley-Terry ranking loss
  • BTT Loss: Bradley-Terry-Tie loss for handling tie samples
  • BCE Loss: Binary Cross Entropy for absolute quality prediction
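To make the three objectives concrete, the following PyTorch sketch shows illustrative implementations. This is not the repository's actual code: the function names, the Rao-Kupper-style tie formulation, and the `theta` tie parameter are assumptions on our part, chosen as one standard way to extend Bradley-Terry to win-tie data.

```python
import torch
import torch.nn.functional as F

def bt_loss(r_win: torch.Tensor, r_lose: torch.Tensor) -> torch.Tensor:
    """Standard Bradley-Terry loss: maximize P(win beats lose) = sigmoid(r_win - r_lose)."""
    return -F.logsigmoid(r_win - r_lose).mean()

def btt_loss(r_a: torch.Tensor, r_b: torch.Tensor,
             tie_mask: torch.Tensor, theta: float = 1.5) -> torch.Tensor:
    """Bradley-Terry-Tie loss (illustrative Rao-Kupper form).

    For non-tie pairs, r_a is the winner's score. With theta > 1,
    P(a beats b) = sigmoid(r_a - r_b - log(theta)), and the tie outcome
    absorbs the remaining probability mass, which is largest when
    |r_a - r_b| is small -- so tied pairs pull scores together.
    """
    log_theta = torch.log(torch.tensor(theta))
    p_a_wins = torch.sigmoid(r_a - r_b - log_theta)
    p_b_wins = torch.sigmoid(r_b - r_a - log_theta)
    p_tie = (1.0 - p_a_wins - p_b_wins).clamp_min(1e-8)
    loss = torch.where(tie_mask,
                       -torch.log(p_tie),
                       -torch.log(p_a_wins.clamp_min(1e-8)))
    return loss.mean()

def bce_loss(r: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    """Absolute quality prediction from single-item binary annotations."""
    return F.binary_cross_entropy_with_logits(r, label)
```

Note that `theta` controls how much probability mass is reserved for ties: larger values make the model more willing to call close pairs a tie rather than force a ranking.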

Citation

If you find this project helpful for your research, please cite our paper:

```bibtex
@article{lian2025solireward,
  title={SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models},
  author={Lian, Jiesong and Zhong, Ruizhe and Zhou, Zixiang and Mi, Xiaoyue and Hao, Yixue and Zhou, Yuan and Lu, Qinglin and Hu, Long and Yan, Junchi},
  journal={arXiv preprint arXiv:2512.22170},
  year={2025}
}
```

Acknowledgments

This project builds upon several excellent open-source projects.

License

MIT License
