Rethink_SFT_generalization
This repository contains model weights associated with the paper "Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability".
The research investigates the factors influencing cross-domain generalization in Large Language Models (LLMs) during reasoning-focused supervised fine-tuning (SFT) with long chain-of-thought (CoT) data.
We have open-sourced all models trained in our experiments, including intermediate checkpoints (available in the stepxxx folders within each model repo).
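As a minimal sketch of how to reference one of those intermediate checkpoints: `transformers` supports loading from a subfolder of a Hub repo. The organization namespace (`your-org`) and the concrete step number below are placeholders; the source only states that checkpoints live in stepxxx folders.

```python
def checkpoint_location(repo_id: str, step: int) -> tuple:
    """Build the (repo_id, subfolder) pair for an intermediate checkpoint.

    The "step{N}" folder naming follows the stepxxx convention mentioned
    above; the step numbers actually present in each repo may differ.
    """
    return repo_id, f"step{step}"

# Usage (requires `pip install transformers` and downloads large weights,
# so it is shown here rather than executed):
#   from transformers import AutoModelForCausalLM
#   repo_id, sub = checkpoint_location(
#       "your-org/Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256", step=320)
#   model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder=sub)
```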
Note that the following model list may include repeated entries, as it is organized by the experiments and conclusions presented in the paper.
| Model Name | Hugging Face | ModelScope |
|---|---|---|
| Weak cross-domain generalization is more pronounced under short training and smaller learning rates (refer to Sec. 3.1; App. C.1, Table 4) | ||
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep1_bs256 | Hugging Face | ModelScope |
| Qwen3-14B_Math-CoT-20k_lr1e-5_ep1_bs256 | Hugging Face | ModelScope |
| Qwen3-14B_Math-CoT-20k_lr1e-5_ep2_bs256 | Hugging Face | ModelScope |
| Apparent non-generalization can be an under-optimization artifact, with a dip-and-recovery pattern under extended training (refer to Sec. 3.1-3.2, Fig. 3) | ||
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| Qwen3-8B_Math-CoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| InternLM2.5-20B_Math-CoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| The above optimization dynamics remain robust under a different teacher model (refer to App. C.2, Fig. 7) | ||
| Qwen3-14B_DeepSeek-R1-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| Qwen3-8B_DeepSeek-R1-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| InternLM2.5-20B_DeepSeek-R1-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| Under a fixed 640-step budget, repeated exposure is more effective than one-pass coverage (refer to Sec. 3.3, Table 1) | ||
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| Qwen3-14B_Math-CoT-2.5k_lr5e-5_ep8_bs32 | Hugging Face | ModelScope |
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep1_bs32 | Hugging Face | ModelScope |
| Overfitting symptoms emerge mainly under combined aggressive schedules (refer to Sec. 3.4, Fig. 4; App. C.4) | ||
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep16_bs256 | Hugging Face | ModelScope |
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep16_bs256_ConstLR | Hugging Face | ModelScope |
| Qwen3-14B_Math-CoT-20k_lr1e-4_ep16_bs256_ConstLR | Hugging Face | ModelScope |
| Training data quality and structure jointly shape generalization (refer to Sec. 4, Table 2) | ||
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| Qwen3-14B_Math-NoCoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| Qwen3-14B_Numina-Math-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| Qwen3-14B_Countdown-CoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| Qwen3-8B_Math-CoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| Qwen3-8B_Math-NoCoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| Qwen3-8B_Numina-Math-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| Qwen3-8B_Countdown-CoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| InternLM2.5-20B_Math-CoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| InternLM2.5-20B_Math-NoCoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| InternLM2.5-20B_Numina-Math-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| InternLM2.5-20B_Countdown-CoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| Higher-capability models internalize transferable reasoning patterns more effectively and generalize better (refer to Sec. 5, Fig. 5) | ||
| Qwen3-1.7B_Math-CoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| Qwen3-4B_Math-CoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| Qwen3-8B_Math-CoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| The capability-dependent trend extends to another model family (refer to App. C.2/C.5, Fig. 8/14/15) | ||
| Qwen2.5-1.5B_Math-CoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| Qwen2.5-3B_Math-CoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| Qwen2.5-7B_Math-CoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| Qwen2.5-14B_Math-CoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| Asymmetric generalization: reasoning improves while safety degrades under long-CoT SFT (refer to Sec. 6, Fig. 6) | ||
| Qwen3-14B_Math-CoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| Qwen3-14B_Math-NoCoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| Qwen3-8B_Math-CoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| Qwen3-8B_Math-NoCoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| InternLM2.5-20B_Math-CoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| InternLM2.5-20B_Math-NoCoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| Appendix: smaller and mid-scale models across data configurations (refer to App. D) | ||
| Qwen3-1.7B_Countdown-CoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| Qwen3-1.7B_Math-NoCoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| Qwen3-4B_Countdown-CoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
| Qwen3-4B_Math-NoCoT-20k_lr5e-5_ep8_bs256 | Hugging Face | ModelScope |
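The model names above follow a fixed pattern, `{base}_{data}_lr{lr}_ep{epochs}_bs{batch}`, with an optional `_ConstLR` suffix marking constant-learning-rate runs. A small sketch (the pattern is inferred from the list above, not a utility shipped with the repo) to recover a run's training configuration from its name:

```python
import re

# Pattern inferred from the model list above; "_ConstLR" is an optional
# suffix for constant-learning-rate runs (e.g. ..._ep16_bs256_ConstLR).
NAME_RE = re.compile(
    r"^(?P<base>[^_]+)_(?P<data>.+)_lr(?P<lr>[0-9.e-]+)"
    r"_ep(?P<epochs>\d+)_bs(?P<batch>\d+)(?P<const_lr>_ConstLR)?$"
)

def parse_model_name(name: str) -> dict:
    """Split a checkpoint name into its training hyperparameters."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized model name: {name}")
    cfg = m.groupdict()
    cfg["epochs"] = int(cfg["epochs"])
    cfg["batch"] = int(cfg["batch"])
    cfg["const_lr"] = cfg["const_lr"] is not None
    return cfg
```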
We provide the main datasets used in our experiments.
| Dataset Name | Description | Size | Hugging Face | ModelScope |
|---|---|---|---|---|
| Math-CoT-20k | Verified long-CoT math reasoning data (default setting in the paper) | 20,480 | Hugging Face | ModelScope |
| Math-NoCoT-20k | Math-CoT-20k with CoT traces removed (final summary/answer retained) | 20,480 | Hugging Face | ModelScope |
| Countdown-CoT-20k | Countdown arithmetic-game long-CoT data for procedural transfer analysis | 20,480 | Hugging Face | ModelScope |
| NuminaMath-20k | No-CoT math data with matched queries, sourced from NuminaMath-1.5 | 20,480 | Hugging Face | ModelScope |
| DeepSeek-R1-20k | Verified long-CoT responses from DeepSeek-R1 on the same queries, sourced from the LUFFY dataset | 20,480 | Hugging Face | ModelScope |
```bibtex
@article{ren2026rethinking_sft_generalization,
  title={Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability},
  author={Qihan Ren and Peng Wang and Ruikun Cai and Shuai Shao and Dadi Guo and Yuejin Xie and Yafu Li and Quanshi Zhang and Xia Hu and Jing Shao and Dongrui Liu},
  journal={arXiv preprint arXiv:2604.06628},
  year={2026}
}
```