xiaolesu committed (verified)
Commit 4dc4fd2 · 1 Parent(s): 0a0cab4

Update README.md

Files changed (1):
1. README.md +2 -178
README.md CHANGED
@@ -1,179 +1,3 @@
- ---
- library_name: transformers
- license: apache-2.0
- base_model: Qwen/Qwen3-8B
- tags:
- - generated_from_trainer
- datasets:
- - xiaolesu/OsmosisProofling-v3-SFT
- model-index:
- - name: outputs/OsmosisProofling-v3-SFT/
-   results: []
- ---
-
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- [<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl)
- <details><summary>See axolotl config</summary>
-
- axolotl version: `0.16.0.dev0`
- ```yaml
- base_model: Qwen/Qwen3-8B
-
- load_in_8bit: false
- load_in_4bit: false
- strict: false
-
- plugins:
-   - axolotl.integrations.liger.LigerPlugin
-
- liger_rope: true
- liger_rms_norm: true
- liger_glu_activation: true
- liger_layer_norm: true
- liger_fused_linear_cross_entropy: true
-
- chat_template: qwen3
-
- chat_template_kwargs:
-   enable_thinking: false
-
- datasets:
-   - path: xiaolesu/OsmosisProofling-v3-SFT
-     type: alpaca
-     split: train
-
- test_datasets:
-   - path: xiaolesu/OsmosisProofling-v3-SFT
-     type: alpaca
-     split: validation
-
- output_dir: ./outputs/OsmosisProofling-v3-SFT/
-
- sequence_len: 4096
- sample_packing: true
- flex_attention: true
-
- flex_attn_compile_kwargs:
-   dynamic: false
-   mode: max-autotune-no-cudagraphs
-
- wandb_project: OsmosisProofling-v3-SFT
- wandb_entity:
- wandb_watch:
- wandb_name: qwen3-8b-sft-v3-run1
- wandb_log_model:
-
- gradient_accumulation_steps: 1
- micro_batch_size: 2
- num_epochs: 2
- optimizer: adamw_torch_fused
- lr_scheduler: cosine
- learning_rate: 1e-5
-
- bf16: true
- tf32: true
-
- resume_from_checkpoint:
- logging_steps: 5
-
- evals_per_epoch: 10
- saves_per_epoch: 10
- save_total_limit: 3
-
- warmup_ratio: 0.1
- weight_decay: 0.0
- fsdp:
-   - full_shard
-   - auto_wrap
-
- fsdp_config:
-   fsdp_version: 2
-   fsdp_offload_params: false
-   fsdp_cpu_ram_efficient_loading: true
-   fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
-   fsdp_transformer_layer_cls_to_wrap: Qwen3DecoderLayer
-   fsdp_state_dict_type: FULL_STATE_DICT
-   fsdp_sharding_strategy: FULL_SHARD
-   fsdp_reshard_after_forward: true
-   fsdp_activation_checkpointing: true
-
- special_tokens:
-
- ```
-
- </details><br>
-
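The config above trains with the qwen3 chat template and `enable_thinking: false`. For matching behavior at inference, the same flag can be passed when building prompts. A minimal sketch, assuming the checkpoint id named in the updated README below and an illustrative prompt (requires `transformers` and `accelerate`):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "xiaolesu/OsmosisProofling-SFT-NT-GRPO-NT-Overlap"  # checkpoint named in the updated README

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype="bfloat16", device_map="auto")

# Illustrative prompt; the Qwen3 template accepts enable_thinking, which the
# training config pinned to false via chat_template_kwargs.
messages = [{"role": "user", "content": "Formalize: the sum of two even integers is even."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=False,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```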
- # outputs/OsmosisProofling-v3-SFT/
-
- This model is a fine-tuned version of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) on the xiaolesu/OsmosisProofling-v3-SFT dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.3543
- - Ppl: 1.4252
- - Memory/max Active (GiB): 20.98
- - Memory/max Allocated (GiB): 20.98
- - Memory/device Reserved (GiB): 36.0
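The reported Ppl is just the exponential of the evaluation loss; a quick consistency check:

```python
import math

eval_loss = 0.3543
print(round(math.exp(eval_loss), 4))  # 1.4252, matching the reported Ppl
```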
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 1e-05
- - train_batch_size: 2
- - eval_batch_size: 2
- - seed: 42
- - distributed_type: multi-GPU
- - num_devices: 7
- - total_train_batch_size: 14
- - total_eval_batch_size: 14
- - optimizer: adamw_torch_fused with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_steps: 21
- - training_steps: 212
-
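The derived values follow directly from the config; a quick check of the effective batch size and warmup (assuming warmup steps are truncated from warmup_ratio × training_steps):

```python
micro_batch_size = 2            # per-device batch size from the config
num_devices = 7
gradient_accumulation_steps = 1
training_steps = 212
warmup_ratio = 0.1

# Effective (total) train batch size across devices.
print(micro_batch_size * num_devices * gradient_accumulation_steps)  # 14

# Warmup steps, truncating the fractional part: 0.1 * 212 = 21.2, i.e. 21.
print(int(training_steps * warmup_ratio))  # 21

# 212 steps over 2 epochs is 106 steps/epoch; with evals_per_epoch: 10,
# evaluations land every ~11 steps, matching the results table below.
print(-(-106 // 10))  # 11
```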
- ### Training results
-
- | Training Loss | Epoch | Step | Validation Loss | Ppl | Active (GiB) | Allocated (GiB) | Reserved (GiB) |
- |:-------------:|:------:|:----:|:---------------:|:------:|:------------:|:---------------:|:--------------:|
- | No log | 0 | 0 | 1.3417 | 3.8257 | 16.56 | 16.56 | 20.27 |
- | 1.2425 | 0.1048 | 11 | 0.9643 | 2.6231 | 20.98 | 20.98 | 36.1 |
- | 0.7372 | 0.2095 | 22 | 0.5572 | 1.7458 | 20.98 | 20.98 | 36.0 |
- | 0.5042 | 0.3143 | 33 | 0.4529 | 1.5728 | 20.98 | 20.98 | 36.0 |
- | 0.4350 | 0.4190 | 44 | 0.4158 | 1.5155 | 20.98 | 20.98 | 36.0 |
- | 0.3719 | 0.5238 | 55 | 0.3908 | 1.4782 | 20.98 | 20.98 | 36.0 |
- | 0.3934 | 0.6286 | 66 | 0.3780 | 1.4594 | 20.98 | 20.98 | 36.0 |
- | 0.3594 | 0.7333 | 77 | 0.3696 | 1.4471 | 20.98 | 20.98 | 36.0 |
- | 0.3513 | 0.8381 | 88 | 0.3645 | 1.4398 | 20.98 | 20.98 | 36.0 |
- | 0.3499 | 0.9429 | 99 | 0.3616 | 1.4356 | 20.98 | 20.98 | 36.0 |
- | 0.3517 | 1.0476 | 110 | 0.3583 | 1.4309 | 20.98 | 20.98 | 36.0 |
- | 0.3422 | 1.1524 | 121 | 0.3567 | 1.4286 | 20.98 | 20.98 | 36.0 |
- | 0.3219 | 1.2571 | 132 | 0.3557 | 1.4272 | 20.98 | 20.98 | 36.0 |
- | 0.3098 | 1.3619 | 143 | 0.3552 | 1.4264 | 20.98 | 20.98 | 36.0 |
- | 0.3068 | 1.4667 | 154 | 0.3546 | 1.4257 | 20.98 | 20.98 | 36.0 |
- | 0.3168 | 1.5714 | 165 | 0.3545 | 1.4254 | 20.98 | 20.98 | 36.0 |
- | 0.3198 | 1.6762 | 176 | 0.3546 | 1.4256 | 20.98 | 20.98 | 36.0 |
- | 0.3207 | 1.7810 | 187 | 0.3544 | 1.4253 | 20.98 | 20.98 | 36.0 |
- | 0.3232 | 1.8857 | 198 | 0.3541 | 1.4249 | 20.98 | 20.98 | 36.0 |
- | 0.3441 | 1.9905 | 209 | 0.3543 | 1.4252 | 20.98 | 20.98 | 36.0 |
-
- ### Framework versions
-
- - Transformers 5.3.0
- - Pytorch 2.9.1+cu128
- - Datasets 4.5.0
- - Tokenizers 0.22.2
+ ### xiaolesu/OsmosisProofling-SFT-NT-GRPO-NT-Overlap
+
+ Experimental checkpoint from "Data Overlap as a Post-Training Hyperparameter for Autoformalization." This is the **SFT+GRPO with 100% overlap** variant (Qwen3-8B, thinking disabled), the control condition in which GRPO reuses the SFT data entirely. See the [paper repo](https://github.com/suxls/data-overlap-autoformalization) for details, results, and all artifacts.
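For reference, the SFT dataset used in both stages can be inspected directly; a minimal sketch, assuming the alpaca-style instruction/input/output fields declared in the training config:

```python
from datasets import load_dataset

# Validation split used for the eval metrics in the superseded card above.
ds = load_dataset("xiaolesu/OsmosisProofling-v3-SFT", split="validation")
print(ds)

# Alpaca-format rows conventionally carry these fields; adjust if the schema differs.
row = ds[0]
print({key: row.get(key) for key in ("instruction", "input", "output")})
```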