Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Abstract
Lightning OPD enables efficient offline on-policy distillation for large language models by enforcing teacher consistency and eliminating the need for live teacher inference servers.
On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, standard OPD requires a live teacher inference server throughout training, resulting in substantial infrastructure overhead. In this work, we investigate whether on-policy distillation can be performed offline. A natural approach is to precompute teacher log-probabilities once over SFT rollouts and reuse them during training. In practice, however, this offline variant fails to reliably match the performance of standard OPD. To understand this discrepancy, we identify a previously overlooked condition that is critical for any OPD pipeline, which we term teacher consistency. This condition requires that the same teacher model be used for both supervised fine-tuning and OPD. We show that violating teacher consistency introduces an irreducible gradient bias, causing both offline and online OPD to converge to a suboptimal fixed point regardless of training duration. Building on this insight, we propose Lightning OPD, an offline on-policy distillation framework that enforces teacher consistency by precomputing teacher log-probabilities over SFT rollouts. This design eliminates the need for a live teacher server entirely. We further show that, under teacher consistency, Lightning OPD shares the same optimum as standard OPD, with bounded gradient discrepancy and an implicit regularization effect that helps prevent policy drift. Extensive experiments on mathematical reasoning and code generation demonstrate that Lightning OPD achieves state-of-the-art performance with significantly improved efficiency. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in just 30 GPU hours, achieving a 4.0x speedup over standard OPD and substantially lowering the barrier to entry for academic research on LLM post-training.
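The abstract's core recipe — score the teacher once over SFT rollouts, cache one scalar log-probability per token, and train without a live teacher — can be sketched in a toy form as follows. This is an illustrative sketch under assumed names (`precompute_teacher_logprobs`, the toy `teacher`/`student` tables), not the authors' implementation.

```python
import math

# Toy sketch of the Lightning OPD data flow. Stage 1: the teacher that
# generated the SFT rollouts also scores them, so teacher consistency
# holds by construction. Stage 2: training reuses the cached scalar
# log-probabilities, so no live teacher server is needed.

def precompute_teacher_logprobs(rollouts, teacher_logprob):
    """Score each rollout token once; cache one scalar per token."""
    return [[teacher_logprob(tok) for tok in r] for r in rollouts]

# Stand-in "models": fixed per-token log-probabilities (made up).
teacher = {"a": math.log(0.6), "b": math.log(0.4)}
student = {"a": math.log(0.3), "b": math.log(0.7)}

rollouts = [["a", "b"], ["b", "b"]]
cache = precompute_teacher_logprobs(rollouts, lambda t: teacher[t])

# During training, the per-token advantage is read from the cache:
# A_t = log P_teacher(y_t) - log P_student(y_t).
advantages = [
    [lt - student[tok] for lt, tok in zip(cached, r)]
    for cached, r in zip(cache, rollouts)
]
```

Note that only a single scalar per token is stored, which is what makes the offline cache cheap compared to storing full teacher distributions.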
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- A Survey of On-Policy Distillation for Large Language Models (2026)
- X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs (2026)
- DP-OPD: Differentially Private On-Policy Distillation for Language Models (2026)
- Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models (2026)
- Fast and Effective On-policy Distillation from Reasoning Prefixes (2026)
- Reinforcement-aware Knowledge Distillation for LLM Reasoning (2026)
- SODA: Semi On-Policy Black-Box Distillation for Large Language Models (2026)
Why are the ExOPD results universally worse than OPD in the Lightning OPD paper? This completely contradicts the findings in the original ExOPD paper. Is this an issue with ExOPD itself, or a problem with the Lightning OPD paper?
Thanks for your attention to our work! We include ExOPD results in our table mainly as a reference point for recent OPD post-training methods, rather than as a strictly controlled comparison with Lightning OPD. The two works differ substantially in their experimental setups:
- Student model: ExOPD uses Qwen3-4B-Non-Thinking, while we use Qwen3-4B-Base — these are different base models with different capabilities.
- Teacher model: ExOPD's primary setting uses domain-specific teachers derived by applying RL to the same Qwen3-4B-Non-Thinking model (i.e., same-size self-distillation), whereas our 4B-scale teacher is Qwen3-8B — a larger and more capable model.
- SFT stage: Lightning OPD first performs SFT on teacher-generated data before OPD, while ExOPD operates on an already instruction-tuned model.
These differences in student models, teacher models, and training pipelines naturally lead to different absolute performance numbers. Our inclusion of ExOPD results is intended to contextualize Lightning OPD within the broader landscape of OPD methods, not to claim that ExOPD's approach is inherently weaker. We hope this clarifies the comparison.
We appreciate you bringing this work to our attention, and we apologize for missing it in our literature review — we will cite it in the revised version.
We agree that the idea of precomputing teacher signals offline over student-generated responses is not unique to our work, and [2509.26497] has explored a similar offline on-policy distillation strategy. That said, there are notable differences between the two approaches:
- Distillation paradigm: [2509.26497] formulates distillation as supervised learning with a composite CE + KL loss using soft targets from teacher logits, whereas Lightning OPD formulates it as an RL problem with policy gradient optimization, where the per-token advantage is computed as log P_teacher − log P_student.
- Teacher signal: [2509.26497] stores top-k teacher logits per token (k=10, a distribution vector), while Lightning OPD stores only a single scalar log-probability per token, which is more storage-efficient.
- SFT–distillation coupling: [2509.26497] treats SFT and knowledge distillation as independent, sequential stages — SFT is performed on independently curated data, and the KD teacher is a separate 7B model with no constraint linking the two stages. A central finding of Lightning OPD is that the two stages must be considered holistically: the SFT data must be generated by the same teacher used for OPD. We prove that violating this teacher consistency condition introduces an irreducible gradient bias (Theorems 3.11–3.13), and that enforcing it is both necessary and sufficient for offline OPD to provably match online OPD (Theorems 3.5, 3.7, 3.9).
- Theoretical analysis: [2509.26497] provides empirical validation of offline on-policy KD, while we contribute formal guarantees including shared stationary points between offline and online OPD (Theorem 3.7) and an implicit regularization effect that prevents policy collapse without explicit KL penalties (Theorem 3.9).
- Empirical results: [2509.26497] demonstrates strong results for general-purpose edge deployment (1B student from 7B teacher, on MMLU, GSM8K, etc.). Lightning OPD targets challenging reasoning and code generation tasks at larger scales (4B/8B students from 8B/32B teachers), achieving 69.9% on AIME 2024 at the 8B scale. Importantly, we show that our offline approach incurs no performance degradation compared to online OPD when teacher consistency is enforced — a stronger claim than demonstrating that offline distillation improves over an SFT baseline.
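For concreteness, the RL-style formulation described in the first bullet above — a policy-gradient surrogate whose per-token advantage is log P_teacher − log P_student — can be sketched as a toy example. Names and numbers are illustrative assumptions, not the paper's code.

```python
import math

# Toy REINFORCE-style surrogate with per-token advantage
# A_t = log P_teacher(y_t) - log P_student(y_t).
# In a real implementation the advantage is a stop-gradient constant;
# gradients flow only through the student log-probability.

def opd_loss(student_logps, teacher_logps):
    """Surrogate loss: -sum_t A_t * log P_student(y_t)."""
    loss = 0.0
    for ls, lt in zip(student_logps, teacher_logps):
        advantage = lt - ls  # log P_teacher - log P_student
        loss += -advantage * ls
    return loss

# Tokens where student and teacher already agree contribute zero
# advantage, so the update concentrates on tokens where they disagree.
student_logps = [math.log(0.5), math.log(0.2)]
teacher_logps = [math.log(0.5), math.log(0.8)]
loss = opd_loss(student_logps, teacher_logps)
```

Because the advantage is positive exactly where the teacher assigns higher probability than the student, minimizing this surrogate pushes the student toward the teacher on its own rollouts.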
In light of this, we recognize that offline precomputation of teacher signals should not be positioned as a novel contribution of our work. In the revised version, we will:
- Cite [2509.26497] as prior work that employs offline on-policy teacher signal precomputation in a supervised distillation setting.
- Reframe our contribution claims to emphasize teacher consistency (the principle that couples SFT and distillation, along with its theoretical grounding) and the formal guarantees as the primary contributions, with offline precomputation serving as the mechanism rather than the novelty.
Thank you again for the constructive feedback.