| --- |
| language: |
| - en |
| base_model: Qwen/Qwen3-8B |
| library_name: transformers |
| pipeline_tag: text-generation |
| tags: |
| - axolotl |
| - reasoning |
| - math |
| - commonsense |
| - primeintellect |
| license: apache-2.0 |
| datasets: |
| - NousResearch/Hermes-3-Dataset |
| - QuixiAI/dolphin |
| model-index: |
| - name: Delphermes-8B |
|   results: |
|   - task: |
|       type: text-generation |
|       name: Text Generation |
|     dataset: |
|       name: HellaSwag |
|       type: hellaswag |
|     metrics: |
|     - type: accuracy |
|       value: 0.88 |
|       name: Accuracy |
|   - task: |
|       type: text-generation |
|       name: Mathematical Reasoning |
|     dataset: |
|       name: GSM8K |
|       type: gsm8k |
|     metrics: |
|     - type: accuracy |
|       value: 0.89 |
|       name: Accuracy |
|   - task: |
|       type: text-generation |
|       name: Theory of Mind |
|     dataset: |
|       name: TheoryPlay |
|       type: theoryplay |
|     metrics: |
|     - type: accuracy |
|       value: 0.8 |
|       name: Accuracy |
| --- |
| |
| # Delphermes-8B |
|
|
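| Delphermes-8B is a LoRA fine-tune of Qwen/Qwen3-8B with the adapter merged into the base weights. It was trained with supervised fine-tuning (SFT) on the Hermes-3 and Dolphin datasets. It scores 88% on HellaSwag (commonsense reasoning), 89% on GSM8K (math word problems), and 80% on TheoryPlay (theory of mind); see the Performance section for details. |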
|
|
| ## Model Details |
|
|
| - **Base Model**: Qwen/Qwen3-8B |
| - **Language**: English (en) |
| - **Library**: transformers |
| - **Training Method**: LoRA fine-tuning with Axolotl |
| - **Infrastructure**: 8x NVIDIA B200 GPUs (PrimeIntellect cluster) |
| - **Distributed Training**: DeepSpeed ZeRO Stage 2 |
|
|
| ## Performance |
|
|
| | Benchmark | Score | Description | |
| |-----------|-------|-------------| |
| | **HellaSwag** | 88% | Commonsense reasoning and natural language inference | |
| | **GSM8K** | 89% | Grade school math word problems | |
| | **TheoryPlay** | 80% | Theory of mind and social reasoning tasks | |
|
|
|
|
| ## Usage |
|
|
| ```python |
| from transformers import AutoTokenizer, AutoModelForCausalLM |
| import torch |
| |
| model_name = "justinj92/Delphermes-8B" |
| tokenizer = AutoTokenizer.from_pretrained(model_name) |
| model = AutoModelForCausalLM.from_pretrained( |
|     model_name, |
|     torch_dtype=torch.float16, |
|     device_map="auto", |
| ) |
| |
| # Example usage for reasoning tasks (a false-belief question) |
| text = "Sarah believes that her keys are in her purse, but they are actually on the kitchen table. Where will Sarah look for her keys?" |
| # Move the inputs to the same device as the model before generating |
| inputs = tokenizer(text, return_tensors="pt").to(model.device) |
| outputs = model.generate( |
|     **inputs, |
|     max_new_tokens=200,  # counts generated tokens only, not the prompt |
|     temperature=0.1, |
|     do_sample=True, |
|     pad_token_id=tokenizer.eos_token_id, |
| ) |
| # The decoded text contains the prompt followed by the model's continuation |
| response = tokenizer.decode(outputs[0], skip_special_tokens=True) |
| print(response) |
| ``` |
|
|
| ### Chat Format |
|
|
| This model uses the ChatML-style prompt format (`<|im_start|>` / `<|im_end|>` turn markers) used by the Hermes datasets: |
|
|
| ```python |
| # Reuses `tokenizer` and `model` from the Usage example above |
| def format_chat(messages): |
|     """Render a list of {"role", "content"} dicts as a single prompt string.""" |
|     formatted = "" |
|     for message in messages: |
|         role = message["role"] |
|         content = message["content"] |
|         # The three supported roles share the same turn layout |
|         if role in ("system", "user", "assistant"): |
|             formatted += f"<|im_start|>{role}\n{content}<|im_end|>\n" |
|     # Open an assistant turn so the model generates the reply |
|     formatted += "<|im_start|>assistant\n" |
|     return formatted |
| |
| messages = [ |
|     {"role": "system", "content": "You are a helpful assistant."}, |
|     {"role": "user", "content": "Solve this math problem: A store has 45 apples. If they sell 1/3 of them in the morning and 1/5 of the remaining apples in the afternoon, how many apples are left?"}, |
| ] |
| |
| prompt = format_chat(messages) |
| inputs = tokenizer(prompt, return_tensors="pt").to(model.device) |
| # temperature only takes effect when sampling is enabled |
| outputs = model.generate(**inputs, max_new_tokens=300, temperature=0.1, do_sample=True) |
| # Decode only the newly generated tokens, not the prompt |
| response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True) |
| print(response) |
| ``` |
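
To sanity-check the model's reply to the example prompt above, the expected answer can be worked out directly. The following is a minimal sketch using Python's standard `fractions` module; the variable names are illustrative and are not part of the model or its tooling.

```python
from fractions import Fraction

total = 45

# Morning: one third of all apples are sold
sold_morning = total * Fraction(1, 3)
remaining = total - sold_morning

# Afternoon: one fifth of the remaining apples are sold
sold_afternoon = remaining * Fraction(1, 5)
apples_left = remaining - sold_afternoon

print(f"sold in the morning: {sold_morning}")
print(f"sold in the afternoon: {sold_afternoon}")
print(f"apples left: {apples_left}")
```

A correct response to the example prompt should therefore end with 24 apples.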
|
|
| ## Training Details |
|
|
| - **Training Framework**: Axolotl with DeepSpeed ZeRO Stage 2 |
| - **Hardware**: 8x NVIDIA B200 GPUs (PrimeIntellect cluster) |
| - **Base Model**: Qwen/Qwen3-8B |
| - **Training Method**: Low-Rank Adaptation (LoRA) |
| - **Dataset**: NousResearch/Hermes-3-Dataset + QuixiAI/dolphin |
| - **Training Duration**: 28 hours |
| - **Learning Rate**: 0.0004 |
| - **Batch Size**: 8 |
| - **Sequence Length**: 4096 |
|
|
| ## Evaluation Methodology |
|
|
| Each benchmark was scored as follows: |
| - **HellaSwag**: Standard validation set with 4-way multiple choice accuracy |
| - **GSM8K**: Test set with exact match accuracy on final numerical answers |
| - **TheoryPlay**: Validation set with accuracy on theory of mind reasoning tasks |
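
Exact-match scoring on final numerical answers, as described for GSM8K above, can be sketched as follows. This is an illustration only and not the evaluation harness used for the reported scores: the helper names `final_number` and `exact_match_accuracy` are hypothetical, and the sketch assumes the model's answer is the last number in its output and that references end with `#### <answer>`, as in the GSM8K dataset.

```python
import re

# Matches integers and decimals, optionally negative, with thousands separators
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def final_number(text):
    """Return the last number in `text` as a normalized string, or None."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    value = matches[-1].replace(",", "")
    # Normalize trailing zeros so "24.0" and "24" compare equal
    if "." in value:
        value = value.rstrip("0").rstrip(".")
    return value


def exact_match_accuracy(predictions, references):
    """Fraction of predictions whose final number equals the reference's."""
    correct = 0
    for prediction, reference in zip(predictions, references):
        predicted = final_number(prediction)
        if predicted is not None and predicted == final_number(reference):
            correct += 1
    return correct / len(references)


# Made-up model outputs and references, for illustration only
predictions = [
    "45 - 15 = 30, then 30 - 6 = 24. The answer is 24.",
    "She has 1,250 dollars.",
    "I think it is 7.",
]
references = ["#### 24", "#### 1250", "#### 8"]

accuracy = exact_match_accuracy(predictions, references)
print(f"exact match accuracy: {accuracy:.2f}")
```

Taking the last number is a simple heuristic; an output that restates an intermediate value after the final answer would be scored incorrectly under this assumption.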
|
|
| ## Limitations |
|
|
| - The model may still struggle with very complex mathematical proofs |
| - Performance on non-English languages may be limited |
| - May occasionally generate inconsistent responses in edge cases |
| - Training data cutoff affects knowledge of recent events |
|
|
| ## Ethical Considerations |
|
|
| This model has been trained on curated datasets and should be used responsibly. Users should: |
| - Verify important information from the model |
| - Be aware of potential biases in training data |
| - Use appropriate content filtering for production applications |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{Delphermes-8B, |
| title={Delphermes-8B: A Fine-tuned Language Model for Reasoning Tasks}, |
| author={[Your Name]}, |
| year={2025}, |
| url={https://huggingface.co/justinj92/Delphermes-8B} |
| } |
| ``` |
|
|
| ## License |
|
|
| This model is released under the Apache 2.0 license. |