Rethinking Generalization in Reasoning SFT

This repository contains model weights associated with the paper "Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability".

The research investigates the factors influencing cross-domain generalization in Large Language Models (LLMs) during reasoning-focused supervised fine-tuning (SFT) with long chain-of-thought (CoT) data.

Key Findings

  • Optimization Dynamics: Cross-domain performance often follows a dip-and-recovery trajectory, so models may require extended training to reach peak generalization.
  • Data Quality and Structure: Verified long-CoT traces yield consistent cross-domain gains, whereas low-quality solutions or No-CoT data can lead to misleading signals or poor transfer.
  • Model Capability: Stronger base models are more effective at internalizing transferable procedural reasoning patterns (such as backtracking) compared to weaker models.
  • Asymmetric Generalization: While long-CoT SFT improves reasoning capabilities, it can simultaneously degrade model safety. No-CoT data, in contrast, yields smaller reasoning gains but preserves safety better.

Resources

Overview of Open-source Models

We have open-sourced all models trained in our experiments, including the intermediate checkpoints (available in the stepxxx subfolders of each model repository).
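As a quick-start sketch (not an official recipe), the snippet below loads one of the released models with the transformers library. The repository id comes from this model card; the subfolder name step320 is only a hypothetical placeholder for one of the stepxxx checkpoint folders.

```python
# Minimal sketch: load a released model (final or intermediate checkpoint)
# with Hugging Face transformers. Requires: pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "jasonrqh/Qwen3-8B_Math-CoT-20k_lr5e-5_ep8_bs256"

# Final checkpoint (stored at the repository root).
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Intermediate checkpoint: "step320" is a hypothetical subfolder name;
# browse the repository for the actual stepxxx folders.
# model = AutoModelForCausalLM.from_pretrained(
#     repo_id, subfolder="step320", torch_dtype=torch.bfloat16, device_map="auto"
# )

# Simple generation via the model's chat template.
messages = [{"role": "user", "content": "What is 17 * 24?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```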

Note that the model list below contains repeated entries, as it is organized by the experiments and conclusions presented in the paper. Model names encode the base model, training dataset, and key hyperparameters: for example, Qwen3-14B_Math-CoT-20k_lr5e-5_ep1_bs256 is Qwen3-14B fine-tuned on Math-CoT-20k with learning rate 5e-5 for 1 epoch at batch size 256.

Each model below is available on both Hugging Face and ModelScope.

Weak cross-domain generalization is more pronounced under short training and smaller learning rates (refer to Sec. 3.1; App. C.1, Table 4):
  • Qwen3-14B_Math-CoT-20k_lr5e-5_ep1_bs256
  • Qwen3-14B_Math-CoT-20k_lr1e-5_ep1_bs256
  • Qwen3-14B_Math-CoT-20k_lr1e-5_ep2_bs256

Apparent non-generalization can be an under-optimization artifact, with a dip-and-recovery pattern under extended training (refer to Sec. 3.1-3.2, Fig. 3):
  • Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256
  • Qwen3-8B_Math-CoT-20k_lr5e-5_ep8_bs256
  • InternLM2.5-20B_Math-CoT-20k_lr5e-5_ep8_bs256

The above optimization dynamics remain robust under a different teacher model (refer to App. C.2, Fig. 7):
  • Qwen3-14B_DeepSeek-R1-20k_lr5e-5_ep8_bs256
  • Qwen3-8B_DeepSeek-R1-20k_lr5e-5_ep8_bs256
  • InternLM2.5-20B_DeepSeek-R1-20k_lr5e-5_ep8_bs256

Under a fixed 640-step budget, repeated exposure is more effective than one-pass coverage (refer to Sec. 3.3, Table 1):
  • Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256
  • Qwen3-14B_Math-CoT-2.5k_lr5e-5_ep8_bs32
  • Qwen3-14B_Math-CoT-20k_lr5e-5_ep1_bs32

Overfitting symptoms emerge mainly under combined aggressive schedules (refer to Sec. 3.4, Fig. 4; App. C.4):
  • Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256
  • Qwen3-14B_Math-CoT-20k_lr5e-5_ep16_bs256
  • Qwen3-14B_Math-CoT-20k_lr5e-5_ep16_bs256_ConstLR
  • Qwen3-14B_Math-CoT-20k_lr1e-4_ep16_bs256_ConstLR

Training data quality and structure jointly shape generalization (refer to Sec. 4, Table 2):
  • Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256
  • Qwen3-14B_Math-NoCoT-20k_lr5e-5_ep8_bs256
  • Qwen3-14B_Numina-Math-20k_lr5e-5_ep8_bs256
  • Qwen3-14B_Countdown-CoT-20k_lr5e-5_ep8_bs256
  • Qwen3-8B_Math-CoT-20k_lr5e-5_ep8_bs256
  • Qwen3-8B_Math-NoCoT-20k_lr5e-5_ep8_bs256
  • Qwen3-8B_Numina-Math-20k_lr5e-5_ep8_bs256
  • Qwen3-8B_Countdown-CoT-20k_lr5e-5_ep8_bs256
  • InternLM2.5-20B_Math-CoT-20k_lr5e-5_ep8_bs256
  • InternLM2.5-20B_Math-NoCoT-20k_lr5e-5_ep8_bs256
  • InternLM2.5-20B_Numina-Math-20k_lr5e-5_ep8_bs256
  • InternLM2.5-20B_Countdown-CoT-20k_lr5e-5_ep8_bs256

Higher-capability models internalize transferable reasoning patterns more effectively and generalize better (refer to Sec. 5, Fig. 5):
  • Qwen3-1.7B_Math-CoT-20k_lr5e-5_ep8_bs256
  • Qwen3-4B_Math-CoT-20k_lr5e-5_ep8_bs256
  • Qwen3-8B_Math-CoT-20k_lr5e-5_ep8_bs256
  • Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256

The capability-dependent trend extends to another model family (refer to App. C.2/C.5, Fig. 8/14/15):
  • Qwen2.5-1.5B_Math-CoT-20k_lr5e-5_ep8_bs256
  • Qwen2.5-3B_Math-CoT-20k_lr5e-5_ep8_bs256
  • Qwen2.5-7B_Math-CoT-20k_lr5e-5_ep8_bs256
  • Qwen2.5-14B_Math-CoT-20k_lr5e-5_ep8_bs256

Asymmetric generalization: reasoning improves while safety degrades under long-CoT SFT (refer to Sec. 6, Fig. 6):
  • Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256
  • Qwen3-14B_Math-NoCoT-20k_lr5e-5_ep8_bs256
  • Qwen3-8B_Math-CoT-20k_lr5e-5_ep8_bs256
  • Qwen3-8B_Math-NoCoT-20k_lr5e-5_ep8_bs256
  • InternLM2.5-20B_Math-CoT-20k_lr5e-5_ep8_bs256
  • InternLM2.5-20B_Math-NoCoT-20k_lr5e-5_ep8_bs256

Appendix: smaller and mid-scale models across data configurations (refer to App. D):
  • Qwen3-1.7B_Countdown-CoT-20k_lr5e-5_ep8_bs256
  • Qwen3-1.7B_Math-NoCoT-20k_lr5e-5_ep8_bs256
  • Qwen3-4B_Countdown-CoT-20k_lr5e-5_ep8_bs256
  • Qwen3-4B_Math-NoCoT-20k_lr5e-5_ep8_bs256
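Because the checkpoint folder names vary across runs, one way to discover which stepxxx checkpoints a given repository actually contains is to list its files with huggingface_hub. This is a sketch using the same repository id as above:

```python
# Sketch: enumerate the stepxxx checkpoint subfolders of a model repository.
# Requires: pip install huggingface_hub
from huggingface_hub import list_repo_files

repo_id = "jasonrqh/Qwen3-8B_Math-CoT-20k_lr5e-5_ep8_bs256"

# Top-level folders whose names start with "step" hold intermediate
# checkpoints; the repository root holds the final model.
files = list_repo_files(repo_id)
step_folders = sorted({f.split("/")[0] for f in files if f.startswith("step")})
print(step_folders)  # actual folder names depend on the training run
```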

Overview of Open-source Datasets

We provide the main datasets used in our experiments.

Each dataset is available on both Hugging Face and ModelScope, and each contains 20,480 examples.

  • Math-CoT-20k: Verified long-CoT math reasoning data (the default setting in the paper)
  • Math-NoCoT-20k: Math-CoT-20k with CoT traces removed (final summary/answer retained)
  • Countdown-CoT-20k: Countdown arithmetic-game long-CoT data for procedural-transfer analysis
  • NuminaMath-20k: No-CoT math data with matched queries, sourced from NuminaMath-1.5
  • DeepSeek-R1-20k: Verified long-CoT responses from DeepSeek-R1 on the same queries, sourced from the LUFFY dataset
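For convenience, a minimal loading sketch with the datasets library is given below. Note that the dataset repository id jasonrqh/Math-CoT-20k is an assumption (the same namespace as the models); verify the actual id via the dataset links above.

```python
# Minimal sketch: load a released dataset with Hugging Face datasets.
# Requires: pip install datasets
# NOTE: "jasonrqh/Math-CoT-20k" is an assumed repository id; check the
# dataset links above for the authoritative one.
from datasets import load_dataset

ds = load_dataset("jasonrqh/Math-CoT-20k", split="train")
print(len(ds))  # expected: 20480, per the sizes above
print(ds[0])    # inspect a single training example
```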

Citation

@article{ren2026rethinking_sft_generalization,
  title={Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability},
  author={Qihan Ren and Peng Wang and Ruikun Cai and Shuai Shao and Dadi Guo and Yuejin Xie and Yafu Li and Quanshi Zhang and Xia Hu and Jing Shao and Dongrui Liu},
  journal={arXiv preprint arXiv:2604.06628},
  year={2026}
}