SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models
Abstract
A new post-training method called SOAR addresses the gap between supervised fine-tuning and reinforcement learning in diffusion models by providing dense, reward-free supervision through self-correction mechanisms.
The post-training pipeline for diffusion models currently has two stages: supervised fine-tuning (SFT) on curated data and reinforcement learning (RL) with reward models. A fundamental gap separates them. SFT optimizes the denoiser only on ground-truth states sampled from the forward noising process; once inference deviates from these ideal states, subsequent denoising relies on out-of-distribution generalization rather than learned correction, exhibiting the same exposure bias that afflicts autoregressive models, but accumulated along the denoising trajectory instead of the token sequence. RL can in principle address this mismatch, yet its terminal reward signal is sparse, suffers from credit-assignment difficulty, and risks reward hacking. We propose SOAR (Self-Correction for Optimal Alignment and Refinement), a bias-correction post-training method that fills this gap. Starting from a real sample, SOAR performs a single stop-gradient rollout with the current model, re-noises the resulting off-trajectory state, and supervises the model to steer back toward the original clean target. The method is on-policy, reward-free, and provides dense per-timestep supervision with no credit-assignment problem. On SD3.5-Medium, SOAR improves GenEval from 0.70 to 0.78 and OCR from 0.64 to 0.67 over SFT, while simultaneously raising all model-based preference scores. In controlled reward-specific experiments, SOAR surpasses Flow-GRPO in final metric value on both aesthetic and text-image alignment tasks, despite having no access to a reward model. Since SOAR's base loss subsumes the standard SFT objective, it can directly replace SFT as a stronger first post-training stage after pretraining, while remaining fully compatible with subsequent RL alignment.
Community
HY-SOAR: reward-free trajectory self-correction for diffusion post-training
We are excited to share HY-SOAR: Self-Correction for Optimal Alignment and Refinement, a data-driven post-training method for diffusion and flow-matching models.
The motivation is that current diffusion post-training still leaves much of the supervision signal in the data unused.
SFT learns from high-quality data, but only on ideal forward-noised states. During inference, however, the model follows its own denoising trajectory. Once an early step drifts away, later states may fall into regions the model was never explicitly trained to recover from.
RL-based alignment can optimize final outcomes, but it often compresses rich data supervision into a sparse terminal reward. This creates credit-assignment difficulty across denoising steps and may introduce reward hacking.
SOAR tries to recover this missing trajectory-level signal directly from data.
Given a real image-caption pair, SOAR first constructs an on-trajectory latent, then performs a single stop-gradient ODE rollout with the current model to simulate an off-trajectory state. This state is re-noised along the same noise ray to create auxiliary training points. The model is then supervised with an analytic correction target anchored to the original clean sample.
This gives SOAR three useful properties:
- Reward-free: no reward model, preference labels, or negative samples are required.
- Dense supervision: correction signals are provided at intermediate denoising states, not only after a full rollout.
- On-policy correction: off-trajectory states come from the current model itself, so the training distribution adapts as the model improves.
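To make the procedure concrete, here is a minimal sketch of one SOAR-style training step. This is an illustrative reading of the description above, not the official implementation: the rectified-flow interpolation convention (x_t = (1 - t)·x0 + t·eps, velocity target eps - x0), the single Euler step as the "ODE rollout", the uniform re-noising time, and the helper name `soar_loss` are all assumptions.

```python
import torch

def soar_loss(model, x0, t, s, noise=None):
    """Sketch of one SOAR-style step (illustrative, not the paper's code).

    Assumed conventions:
      - straight-line interpolation x_t = (1 - t) * x0 + t * eps
      - model(x, t) predicts the velocity v = eps - x0
    Args:
      model: velocity predictor v_theta(x, t)
      x0:    clean samples, shape (B, D)
      t:     starting timestep in (0, 1], shape (B, 1)
      s:     rollout target timestep, 0 < s < t, shape (B, 1)
    """
    if noise is None:
        noise = torch.randn_like(x0)

    # 1) On-trajectory latent from the forward (noising) process.
    x_t = (1 - t) * x0 + t * noise

    # 2) Single stop-gradient ODE rollout with the *current* model
    #    (one Euler step from t down to s), yielding an off-trajectory state.
    with torch.no_grad():
        v = model(x_t, t)
        x_s = x_t - (t - s) * v

    # 3) Re-noise the off-trajectory state along the SAME noise ray to get
    #    an auxiliary training point at a time u between s and t.
    u = s + (t - s) * torch.rand_like(s)
    x_u = x_s + (u - s) * noise

    # 4) Analytic correction target anchored to the original clean x0:
    #    under the straight-line convention, the velocity that carries x_u
    #    back toward x0 is (x_u - x0) / u.
    target = (x_u - x0) / u

    # 5) Dense, reward-free supervision at the intermediate state.
    pred = model(x_u, u)
    return ((pred - target) ** 2).mean()
```

Because the rollout in step 2 is stop-gradient, the off-trajectory states track the current model (on-policy) while gradients flow only through the correction prediction in step 5; note that with s → 0 and no rollout, the loss reduces to the standard flow-matching SFT objective.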
In our experiments on SD3.5-Medium with 286K image-text samples, SOAR improves over SFT across all reported metrics: GenEval 0.70 → 0.78, OCR 0.64 → 0.67, and consistent gains on PickScore, HPSv2.1, Aesthetic, and ImageReward.
On high-aesthetic and high-CLIPScore subsets, SOAR also achieves stronger final target metrics than Flow-GRPO despite using no reward model: Aesthetic 5.94 vs 5.87, CLIPScore 0.300 vs 0.296.
We see SOAR not as a replacement for RL, but as a stronger and more stable first post-training stage before RL: first teach the model to keep its denoising trajectory stable, then use RL for preference exploration and further alignment.
Project page: https://hy-soar.github.io
Paper: https://arxiv.org/abs/2604.12617
Code: https://github.com/Tencent-Hunyuan/HY-SOAR
Would love to hear thoughts from the community: can trajectory-level self-correction become a stronger foundation for diffusion post-training than relying only on SFT or terminal reward optimization?
