arxiv:2604.12617

SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models

Published on Apr 14

Abstract

A new post-training method called SOAR addresses the gap between supervised fine-tuning and reinforcement learning in diffusion models by providing dense, reward-free supervision through self-correction mechanisms.

AI-generated summary

The post-training pipeline for diffusion models currently has two stages: supervised fine-tuning (SFT) on curated data and reinforcement learning (RL) with reward models. A fundamental gap separates them. SFT optimizes the denoiser only on ground-truth states sampled from the forward noising process; once inference deviates from these ideal states, subsequent denoising relies on out-of-distribution generalization rather than learned correction, exhibiting the same exposure bias that afflicts autoregressive models, but accumulated along the denoising trajectory instead of the token sequence. RL can in principle address this mismatch, yet its terminal reward signal is sparse, suffers from credit-assignment difficulty, and risks reward hacking. We propose SOAR (Self-Correction for Optimal Alignment and Refinement), a bias-correction post-training method that fills this gap. Starting from a real sample, SOAR performs a single stop-gradient rollout with the current model, re-noises the resulting off-trajectory state, and supervises the model to steer back toward the original clean target. The method is on-policy, reward-free, and provides dense per-timestep supervision with no credit-assignment problem. On SD3.5-Medium, SOAR improves GenEval from 0.70 to 0.78 and OCR from 0.64 to 0.67 over SFT, while simultaneously raising all model-based preference scores. In controlled reward-specific experiments, SOAR surpasses Flow-GRPO in final metric value on both aesthetic and text-image alignment tasks, despite having no access to a reward model. Since SOAR's base loss subsumes the standard SFT objective, it can directly replace SFT as a stronger first post-training stage after pretraining, while remaining fully compatible with subsequent RL alignment.

Community

HY-SOAR: reward-free trajectory self-correction for diffusion post-training

We are excited to share HY-SOAR: Self-Correction for Optimal Alignment and Refinement, a data-driven post-training method for diffusion and flow-matching models.

The motivation is that current diffusion post-training leaves much of the trajectory-level signal in the data unused.

SFT learns from high-quality data, but only on ideal forward-noised states. During inference, however, the model follows its own denoising trajectory. Once an early step drifts away, later states may fall into regions the model was never explicitly trained to recover from.

RL-based alignment can optimize final outcomes, but it often compresses rich data supervision into a sparse terminal reward. This creates credit-assignment difficulty across denoising steps and may introduce reward hacking.

SOAR tries to recover this missing trajectory-level signal directly from data.


Given a real image-caption pair, SOAR first constructs an on-trajectory latent, then performs a single stop-gradient ODE rollout with the current model to simulate an off-trajectory state. This state is re-noised along the same noise ray to create auxiliary training points. The model is then supervised with an analytic correction target anchored to the original clean sample.
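The step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: we assume a rectified-flow parameterization (x_t = (1 - t)·x0 + t·ε with velocity target ε - x0, as in SD3-style flow matching), a single Euler step for the stop-gradient rollout, one auxiliary re-noised point, and an analytic correction target given by the straight-line velocity from the clean sample through the current state. The names `soar_loss` and `ToyVel`, the re-noising rule, and the loss weighting are our assumptions.

```python
import torch
import torch.nn.functional as F

def soar_loss(model, x0, cond, t=0.8, s=0.6):
    """One SOAR-style training step (minimal sketch, not the official code).

    Assumes x_t = (1 - t)*x0 + t*eps with velocity target eps - x0.
    `model(x, t, cond)` is a hypothetical velocity predictor; the paper's
    exact re-noising schedule and loss weighting may differ.
    """
    eps = torch.randn_like(x0)
    x_t = (1 - t) * x0 + t * eps                # on-trajectory latent
    with torch.no_grad():                       # stop-gradient rollout
        v_hat = model(x_t, t, cond)
        x_s = x_t + (s - t) * v_hat             # one Euler/ODE step: simulated drift
    # Re-noise the off-trajectory state along the same noise ray
    # (assumed direction eps - x0) to get one auxiliary training point.
    u = s + (1.0 - s) * 0.5
    x_u = x_s + (u - s) * (eps - x0)
    # Analytic correction target anchored to the original clean sample:
    # the straight-line velocity from x0 through the current state.
    loss_s = F.mse_loss(model(x_s, s, cond), (x_s - x0) / s)
    loss_u = F.mse_loss(model(x_u, u, cond), (x_u - x0) / u)
    return loss_s + loss_u

class ToyVel(torch.nn.Module):
    """Tiny stand-in velocity network, for demonstration only."""
    def __init__(self, d):
        super().__init__()
        self.net = torch.nn.Linear(d + 1, d)
    def forward(self, x, t, cond):
        tcol = torch.full((x.shape[0], 1), float(t))
        return self.net(torch.cat([x, tcol], dim=-1))

model = ToyVel(8)
x0 = torch.randn(4, 8)
loss = soar_loss(model, x0, cond=None)
loss.backward()  # gradients flow only through the correction terms
```

Note that because the rollout is wrapped in `no_grad`, the model is never differentiated through its own sampling step; it is only supervised at the off-trajectory states it produced, which is what makes the signal on-policy yet cheap.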

This gives SOAR three useful properties:

  • Reward-free: no reward model, preference labels, or negative samples are required.
  • Dense supervision: correction signals are provided at intermediate denoising states, not only after a full rollout.
  • On-policy correction: off-trajectory states come from the current model itself, so the training distribution adapts as the model improves.

In our experiments on SD3.5-Medium with 286K image-text samples, SOAR improves over SFT across all reported metrics: GenEval 0.70 → 0.78, OCR 0.64 → 0.67, and consistent gains on PickScore, HPSv2.1, Aesthetic, and ImageReward.

On high-aesthetic and high-CLIPScore subsets, SOAR also achieves stronger final target metrics than Flow-GRPO despite using no reward model: Aesthetic 5.94 vs 5.87, CLIPScore 0.300 vs 0.296.

We see SOAR not as a replacement for RL, but as a stronger and more stable first post-training stage before RL: first teach the model to keep its denoising trajectory stable, then use RL for preference exploration and further alignment.

Project page: https://hy-soar.github.io
Paper: https://arxiv.org/abs/2604.12617
Code: https://github.com/Tencent-Hunyuan/HY-SOAR

Would love to hear thoughts from the community: can trajectory-level self-correction become a stronger foundation for diffusion post-training than relying only on SFT or terminal reward optimization?
