Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Abstract
Lightning OPD enables efficient offline on-policy distillation for large language models by enforcing teacher consistency and eliminating the need for live teacher inference servers.
On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, standard OPD requires a live teacher inference server throughout training, resulting in substantial infrastructure overhead. In this work, we investigate whether on-policy distillation can be performed offline. A natural approach is to precompute teacher log-probabilities once over SFT rollouts and reuse them during training. In practice, however, this offline variant fails to reliably match the performance of standard OPD. To understand this discrepancy, we identify a previously overlooked condition that is critical for any OPD pipeline, which we term teacher consistency. This condition requires that the same teacher model be used for both supervised fine-tuning and OPD. We show that violating teacher consistency introduces an irreducible gradient bias, causing both offline and online OPD to converge to a suboptimal fixed point regardless of training duration. Building on this insight, we propose Lightning OPD, an offline on-policy distillation framework that enforces teacher consistency by precomputing teacher log-probabilities over SFT rollouts. This design eliminates the need for a live teacher server entirely. We further show that, under teacher consistency, Lightning OPD shares the same optimum as standard OPD, with bounded gradient discrepancy and an implicit regularization effect that helps prevent policy drift. Extensive experiments on mathematical reasoning and code generation demonstrate that Lightning OPD achieves state-of-the-art performance with significantly improved efficiency. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in just 30 GPU hours, achieving a 4.0x speedup over standard OPD and substantially lowering the barrier to entry for academic research on LLM post-training.
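The abstract's core recipe — score the teacher once over SFT rollouts, cache one scalar log-probability per token, and train without a live teacher — can be sketched in a toy form as follows. This is an illustrative sketch under assumed names (`precompute_teacher_logprobs`, the toy `teacher`/`student` tables), not the authors' implementation.

```python
import math

# Toy sketch of the Lightning OPD data flow. Stage 1: the teacher that
# generated the SFT rollouts also scores them, so teacher consistency
# holds by construction. Stage 2: training reuses the cached scalar
# log-probabilities, so no live teacher server is needed.

def precompute_teacher_logprobs(rollouts, teacher_logprob):
    """Score each rollout token once; cache one scalar per token."""
    return [[teacher_logprob(tok) for tok in r] for r in rollouts]

# Stand-in "models": fixed per-token log-probabilities (made up).
teacher = {"a": math.log(0.6), "b": math.log(0.4)}
student = {"a": math.log(0.3), "b": math.log(0.7)}

rollouts = [["a", "b"], ["b", "b"]]
cache = precompute_teacher_logprobs(rollouts, lambda t: teacher[t])

# During training, the per-token advantage is read from the cache:
# A_t = log P_teacher(y_t) - log P_student(y_t).
advantages = [
    [lt - student[tok] for lt, tok in zip(cached, r)]
    for cached, r in zip(cache, rollouts)
]
```

Note that only a single scalar per token is stored, which is what makes the offline cache cheap compared to storing full teacher distributions.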
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- A Survey of On-Policy Distillation for Large Language Models (2026)
- X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs (2026)
- DP-OPD: Differentially Private On-Policy Distillation for Language Models (2026)
- Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models (2026)
- Fast and Effective On-policy Distillation from Reasoning Prefixes (2026)
- Reinforcement-aware Knowledge Distillation for LLM Reasoning (2026)
- SODA: Semi On-Policy Black-Box Distillation for Large Language Models (2026)
Why are the ExOPD results universally worse than OPD in the Lightning OPD paper? This completely contradicts the findings in the original ExOPD paper. Is this an issue with ExOPD itself, or a problem with the Lightning OPD paper?
Thanks for your attention to our work! We include ExOPD results in our table mainly as a reference point for recent OPD post-training methods, rather than as a strictly controlled comparison with Lightning OPD. The two works differ substantially in their experimental setups:
- Student model: ExOPD uses Qwen3-4B-Non-Thinking, while we use Qwen3-4B-Base — these are different base models with different capabilities.
- Teacher model: ExOPD's primary setting uses domain-specific teachers derived by applying RL to the same Qwen3-4B-Non-Thinking model (i.e., same-size self-distillation), whereas our 4B-scale teacher is Qwen3-8B — a larger and more capable model.
- SFT stage: Lightning OPD first performs SFT on teacher-generated data before OPD, while ExOPD operates on an already instruction-tuned model.
These differences in student models, teacher models, and training pipelines naturally lead to different absolute performance numbers. Our inclusion of ExOPD results is intended to contextualize Lightning OPD within the broader landscape of OPD methods, not to claim that ExOPD's approach is inherently weaker. We hope this clarifies the comparison.
We appreciate you bringing this work to our attention, and we apologize for missing it in our literature review — we will cite it in the revised version.
We agree that the idea of precomputing teacher signals offline over student-generated responses is not unique to our work, and [2509.26497] has explored a similar offline on-policy distillation strategy. That said, there are notable differences between the two approaches:
- Distillation paradigm: [2509.26497] formulates distillation as supervised learning with a composite CE + KL loss using soft targets from teacher logits, whereas Lightning OPD formulates it as an RL problem with policy gradient optimization, where the per-token advantage is computed as log P_teacher − log P_student.
- Teacher signal: [2509.26497] stores top-k teacher logits per token (k=10, a distribution vector), while Lightning OPD stores only a single scalar log-probability per token, which is more storage-efficient.
- SFT–distillation coupling: [2509.26497] treats SFT and knowledge distillation as independent, sequential stages — SFT is performed on independently curated data, and the KD teacher is a separate 7B model with no constraint linking the two stages. A central finding of Lightning OPD is that the two stages must be considered holistically: the SFT data must be generated by the same teacher used for OPD. We prove that violating this teacher consistency condition introduces an irreducible gradient bias (Theorems 3.11–3.13), and that enforcing it is both necessary and sufficient for offline OPD to provably match online OPD (Theorems 3.5, 3.7, 3.9).
- Theoretical analysis: [2509.26497] provides empirical validation of offline on-policy KD, while we contribute formal guarantees including shared stationary points between offline and online OPD (Theorem 3.7) and an implicit regularization effect that prevents policy collapse without explicit KL penalties (Theorem 3.9).
- Empirical results: [2509.26497] demonstrates strong results for general-purpose edge deployment (1B student from 7B teacher, on MMLU, GSM8K, etc.). Lightning OPD targets challenging reasoning and code generation tasks at larger scales (4B/8B students from 8B/32B teachers), achieving 69.9% on AIME 2024 at the 8B scale. Importantly, we show that our offline approach incurs no performance degradation compared to online OPD when teacher consistency is enforced — a stronger claim than demonstrating that offline distillation improves over an SFT baseline.
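For concreteness, the RL-style formulation described in the first bullet above — a policy-gradient surrogate whose per-token advantage is log P_teacher − log P_student — can be sketched as a toy example. Names and numbers are illustrative assumptions, not the paper's code.

```python
import math

# Toy REINFORCE-style surrogate with per-token advantage
# A_t = log P_teacher(y_t) - log P_student(y_t).
# In a real implementation the advantage is a stop-gradient constant;
# gradients flow only through the student log-probability.

def opd_loss(student_logps, teacher_logps):
    """Surrogate loss: -sum_t A_t * log P_student(y_t)."""
    loss = 0.0
    for ls, lt in zip(student_logps, teacher_logps):
        advantage = lt - ls  # log P_teacher - log P_student
        loss += -advantage * ls
    return loss

# Tokens where student and teacher already agree contribute zero
# advantage, so the update concentrates on tokens where they disagree.
student_logps = [math.log(0.5), math.log(0.2)]
teacher_logps = [math.log(0.5), math.log(0.8)]
loss = opd_loss(student_logps, teacher_logps)
```

Because the advantage is positive exactly where the teacher assigns higher probability than the student, minimizing this surrogate pushes the student toward the teacher on its own rollouts.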
In light of this, we recognize that offline precomputation of teacher signals should not be positioned as a novel contribution of our work. In the revised version, we will:
- Cite [2509.26497] as prior work that employs offline on-policy teacher signal precomputation in a supervised distillation setting.
- Reframe our contribution claims to emphasize teacher consistency (the principle that couples SFT and distillation, along with its theoretical grounding) and the formal guarantees as the primary contributions, with offline precomputation serving as the mechanism rather than the novelty.
Thank you again for the constructive feedback.