Rethinking Generalization in Reasoning SFT

This repository contains model weights associated with the paper "Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability".

The research investigates the factors influencing cross-domain generalization in Large Language Models (LLMs) during reasoning-focused supervised fine-tuning (SFT) with long chain-of-thought (CoT) data.

Key Findings

  • Optimization Dynamics: Cross-domain performance often follows a dip-and-recovery trajectory, so models may require extended training to reach peak generalization.
  • Data Quality and Structure: Verified long-CoT traces yield consistent cross-domain gains, whereas low-quality solutions or No-CoT data can lead to misleading signals or poor transfer.
  • Model Capability: Stronger base models are more effective at internalizing transferable procedural reasoning patterns (such as backtracking) compared to weaker models.
  • Asymmetric Generalization: While long-CoT SFT improves reasoning capabilities, it can simultaneously degrade model safety. No-CoT data, in contrast, yields smaller reasoning gains but preserves safety better.

Resources

Overview of Open-source Models

We have open-sourced all models trained in our experiments, including the intermediate checkpoints (available in the stepxxx subfolders of each model repository).
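As a quick-start sketch (not an official recipe), the snippet below loads one of the released models with the transformers library. The repository id comes from this model card; the subfolder name step320 is only a hypothetical placeholder for one of the stepxxx checkpoint folders.

```python
# Minimal sketch: load a released model (final or intermediate checkpoint)
# with Hugging Face transformers. Requires: pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "jasonrqh/Qwen3-8B_Math-CoT-20k_lr5e-5_ep8_bs256"

# Final checkpoint (stored at the repository root).
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Intermediate checkpoint: "step320" is a hypothetical subfolder name;
# browse the repository for the actual stepxxx folders.
# model = AutoModelForCausalLM.from_pretrained(
#     repo_id, subfolder="step320", torch_dtype=torch.bfloat16, device_map="auto"
# )

# Simple generation via the model's chat template.
messages = [{"role": "user", "content": "What is 17 * 24?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```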

Note that the model list below contains repeated entries, as it is organized by the experiments and conclusions presented in the paper. Model names encode the base model, training dataset, and key hyperparameters: for example, Qwen3-14B_Math-CoT-20k_lr5e-5_ep1_bs256 is Qwen3-14B fine-tuned on Math-CoT-20k with learning rate 5e-5 for 1 epoch at batch size 256.

Each model below is available on both Hugging Face and ModelScope.

Weak cross-domain generalization is more pronounced under short training and smaller learning rates (refer to Sec. 3.1; App. C.1, Table 4):
  • Qwen3-14B_Math-CoT-20k_lr5e-5_ep1_bs256
  • Qwen3-14B_Math-CoT-20k_lr1e-5_ep1_bs256
  • Qwen3-14B_Math-CoT-20k_lr1e-5_ep2_bs256

Apparent non-generalization can be an under-optimization artifact, with a dip-and-recovery pattern under extended training (refer to Sec. 3.1-3.2, Fig. 3):
  • Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256
  • Qwen3-8B_Math-CoT-20k_lr5e-5_ep8_bs256
  • InternLM2.5-20B_Math-CoT-20k_lr5e-5_ep8_bs256

The above optimization dynamics remain robust under a different teacher model (refer to App. C.2, Fig. 7):
  • Qwen3-14B_DeepSeek-R1-20k_lr5e-5_ep8_bs256
  • Qwen3-8B_DeepSeek-R1-20k_lr5e-5_ep8_bs256
  • InternLM2.5-20B_DeepSeek-R1-20k_lr5e-5_ep8_bs256

Under a fixed 640-step budget, repeated exposure is more effective than one-pass coverage (refer to Sec. 3.3, Table 1):
  • Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256
  • Qwen3-14B_Math-CoT-2.5k_lr5e-5_ep8_bs32
  • Qwen3-14B_Math-CoT-20k_lr5e-5_ep1_bs32

Overfitting symptoms emerge mainly under combined aggressive schedules (refer to Sec. 3.4, Fig. 4; App. C.4):
  • Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256
  • Qwen3-14B_Math-CoT-20k_lr5e-5_ep16_bs256
  • Qwen3-14B_Math-CoT-20k_lr5e-5_ep16_bs256_ConstLR
  • Qwen3-14B_Math-CoT-20k_lr1e-4_ep16_bs256_ConstLR

Training data quality and structure jointly shape generalization (refer to Sec. 4, Table 2):
  • Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256
  • Qwen3-14B_Math-NoCoT-20k_lr5e-5_ep8_bs256
  • Qwen3-14B_Numina-Math-20k_lr5e-5_ep8_bs256
  • Qwen3-14B_Countdown-CoT-20k_lr5e-5_ep8_bs256
  • Qwen3-8B_Math-CoT-20k_lr5e-5_ep8_bs256
  • Qwen3-8B_Math-NoCoT-20k_lr5e-5_ep8_bs256
  • Qwen3-8B_Numina-Math-20k_lr5e-5_ep8_bs256
  • Qwen3-8B_Countdown-CoT-20k_lr5e-5_ep8_bs256
  • InternLM2.5-20B_Math-CoT-20k_lr5e-5_ep8_bs256
  • InternLM2.5-20B_Math-NoCoT-20k_lr5e-5_ep8_bs256
  • InternLM2.5-20B_Numina-Math-20k_lr5e-5_ep8_bs256
  • InternLM2.5-20B_Countdown-CoT-20k_lr5e-5_ep8_bs256

Higher-capability models internalize transferable reasoning patterns more effectively and generalize better (refer to Sec. 5, Fig. 5):
  • Qwen3-1.7B_Math-CoT-20k_lr5e-5_ep8_bs256
  • Qwen3-4B_Math-CoT-20k_lr5e-5_ep8_bs256
  • Qwen3-8B_Math-CoT-20k_lr5e-5_ep8_bs256
  • Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256

The capability-dependent trend extends to another model family (refer to App. C.2/C.5, Fig. 8/14/15):
  • Qwen2.5-1.5B_Math-CoT-20k_lr5e-5_ep8_bs256
  • Qwen2.5-3B_Math-CoT-20k_lr5e-5_ep8_bs256
  • Qwen2.5-7B_Math-CoT-20k_lr5e-5_ep8_bs256
  • Qwen2.5-14B_Math-CoT-20k_lr5e-5_ep8_bs256

Asymmetric generalization: reasoning improves while safety degrades under long-CoT SFT (refer to Sec. 6, Fig. 6):
  • Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256
  • Qwen3-14B_Math-NoCoT-20k_lr5e-5_ep8_bs256
  • Qwen3-8B_Math-CoT-20k_lr5e-5_ep8_bs256
  • Qwen3-8B_Math-NoCoT-20k_lr5e-5_ep8_bs256
  • InternLM2.5-20B_Math-CoT-20k_lr5e-5_ep8_bs256
  • InternLM2.5-20B_Math-NoCoT-20k_lr5e-5_ep8_bs256

Appendix: smaller and mid-scale models across data configurations (refer to App. D):
  • Qwen3-1.7B_Countdown-CoT-20k_lr5e-5_ep8_bs256
  • Qwen3-1.7B_Math-NoCoT-20k_lr5e-5_ep8_bs256
  • Qwen3-4B_Countdown-CoT-20k_lr5e-5_ep8_bs256
  • Qwen3-4B_Math-NoCoT-20k_lr5e-5_ep8_bs256
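Because the checkpoint folder names vary across runs, one way to discover which stepxxx checkpoints a given repository actually contains is to list its files with huggingface_hub. This is a sketch using the same repository id as above:

```python
# Sketch: enumerate the stepxxx checkpoint subfolders of a model repository.
# Requires: pip install huggingface_hub
from huggingface_hub import list_repo_files

repo_id = "jasonrqh/Qwen3-8B_Math-CoT-20k_lr5e-5_ep8_bs256"

# Top-level folders whose names start with "step" hold intermediate
# checkpoints; the repository root holds the final model.
files = list_repo_files(repo_id)
step_folders = sorted({f.split("/")[0] for f in files if f.startswith("step")})
print(step_folders)  # actual folder names depend on the training run
```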

Overview of Open-source Datasets

We provide the main datasets used in our experiments.

Each dataset is available on both Hugging Face and ModelScope, and each contains 20,480 examples.

  • Math-CoT-20k: Verified long-CoT math reasoning data (the default setting in the paper)
  • Math-NoCoT-20k: Math-CoT-20k with CoT traces removed (final summary/answer retained)
  • Countdown-CoT-20k: Countdown arithmetic-game long-CoT data for procedural-transfer analysis
  • NuminaMath-20k: No-CoT math data with matched queries, sourced from NuminaMath-1.5
  • DeepSeek-R1-20k: Verified long-CoT responses from DeepSeek-R1 on the same queries, sourced from the LUFFY dataset
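For convenience, a minimal loading sketch with the datasets library is given below. Note that the dataset repository id jasonrqh/Math-CoT-20k is an assumption (the same namespace as the models); verify the actual id via the dataset links above.

```python
# Minimal sketch: load a released dataset with Hugging Face datasets.
# Requires: pip install datasets
# NOTE: "jasonrqh/Math-CoT-20k" is an assumed repository id; check the
# dataset links above for the authoritative one.
from datasets import load_dataset

ds = load_dataset("jasonrqh/Math-CoT-20k", split="train")
print(len(ds))  # expected: 20480, per the sizes above
print(ds[0])    # inspect a single training example
```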

Citation

@article{ren2026rethinking_sft_generalization,
  title={Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability},
  author={Qihan Ren and Peng Wang and Ruikun Cai and Shuai Shao and Dadi Guo and Yuejin Xie and Yafu Li and Quanshi Zhang and Xia Hu and Jing Shao and Dongrui Liu},
  journal={arXiv preprint arXiv:2604.06628},
  year={2026}
}