Rakushaking committed
Commit
19cb7d3
·
verified ·
1 Parent(s): 80d603f

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -37,7 +37,7 @@ This model has been optimized using DPO to align its responses with preferred ou
 
 ## Training Pipeline
 1. **Base**: Qwen/Qwen3-4B-Instruct-2507
-2. **SFT**: Structured data generation/conversion with Chain-of-Thought (V4+V5+daichira converted, ~9.3k samples)
+2. **SFT**: Structured data generation/conversion with Chain-of-Thought (V4+V5)
 3. **DPO Round 1**: Generic preference optimization (u-10bei/dpo-dataset-qwen-cot)
 4. **DPO Round 2 (this model)**: Format-specific preference optimization with programmatically generated chosen/rejected pairs (~1,150 pairs)
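
DPO Round 2 relies on programmatically generated chosen/rejected pairs targeting output format. The README does not show the actual generation rules, so the following is only a minimal sketch of how such a pair could be built: the chosen response is well-formed JSON, and the rejected response is the same content with a deliberate formatting defect (a trailing comma), so the preference signal concerns format rather than content. The `make_preference_pair` helper and the record schema are hypothetical.

```python
import json


def make_preference_pair(record):
    """Build one chosen/rejected pair for format-specific DPO.

    Hypothetical sketch: the real pair-generation rules behind the
    ~1,150 pairs mentioned in the README are not shown in this diff.
    """
    prompt = f"Convert to JSON: {record['text']}"
    # Chosen: valid, pretty-printed JSON for the target fields.
    chosen = json.dumps(record["fields"], ensure_ascii=False, indent=2)
    # Rejected: same content, but with a trailing comma injected before
    # the closing brace, making it invalid JSON (a format-only defect).
    rejected = chosen.rstrip("}").rstrip() + ",\n}"
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


pair = make_preference_pair(
    {"text": "Alice, age 30", "fields": {"name": "Alice", "age": 30}}
)
```

The resulting dict has the `prompt`/`chosen`/`rejected` keys that DPO trainers such as TRL's `DPOTrainer` expect for preference datasets.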