This model has been optimized using DPO to align its responses with preferred outputs.

## Training Pipeline

1. **Base**: Qwen/Qwen3-4B-Instruct-2507
2. **SFT**: Structured data generation/conversion with Chain-of-Thought (V4+V5)
3. **DPO Round 1**: Generic preference optimization (u-10bei/dpo-dataset-qwen-cot)
4. **DPO Round 2 (this model)**: Format-specific preference optimization with programmatically generated chosen/rejected pairs (~1,150 pairs)
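Both DPO rounds optimize the standard DPO objective over chosen/rejected pairs. As a rough illustration only (not this model's actual training code, which presumably uses a training library; the function name and the toy log-probability values below are made up), the per-pair loss can be sketched as:

```python
import math

def dpo_loss(beta, pi_chosen, pi_rejected, ref_chosen, ref_rejected):
    """DPO loss for one preference pair.

    Arguments are sequence log-probabilities log p(y|x) under the
    policy being trained (pi_*) and the frozen reference model (ref_*).
    """
    # Implicit reward margins: how much more the policy favors each
    # completion than the reference model does.
    chosen_margin = pi_chosen - ref_chosen
    rejected_margin = pi_rejected - ref_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log sigmoid(logits), written as log1p(exp(-x)) for stability
    return math.log1p(math.exp(-logits))

# The loss shrinks as the policy shifts probability toward the chosen
# completion relative to the rejected one (toy values, beta = 0.1):
aligned = dpo_loss(0.1, pi_chosen=-5.0, pi_rejected=-20.0,
                   ref_chosen=-10.0, ref_rejected=-10.0)
uniform = dpo_loss(0.1, pi_chosen=-10.0, pi_rejected=-10.0,
                   ref_chosen=-10.0, ref_rejected=-10.0)
print(aligned, uniform)  # aligned < uniform == log(2)
```

With equal margins the loss sits at log 2 (sigmoid of zero); the round-2 format-specific pairs push this margin apart for well-formatted versus badly formatted outputs.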