---
license: mit
datasets:
- roneneldan/TinyStories
language:
- en
pipeline_tag: text-generation
tags:
- multi-head-attention
- small-language-model
- pretrain
- custom-model
- slm
- tiny-model
- research
---

*This repository demonstrates a small Multi-Head-Attention language model trained from scratch for educational and research purposes.*

# Stories-SLM 🤖

This model is part of a collection of Small Language Models pretrained from scratch on the TinyStories dataset. The collection currently contains **3** pretrained models, with more on the way. The model variants in the collection range from a standard GPT to **Mixture-of-Experts** versions built with **RoPE**, **Grouped Query Attention**, and **RMSNorm**.

| Model             | Params | Architecture                | Validation Loss |
| ----------------- | ------ | --------------------------- | --------------- |
| **Stories-SLM**   | 53M    | Dense - MHA                 | **1.78**        |
| Stories-SLM 2     | 48M    | Dense - GQA                 | 1.73            |
| Stories-SLM 2-MoE | 127M   | Sparse - Mixture-of-Experts | 1.67            |

**Model Name:** **Stories-SLM**

### Model Description

**Stories-SLM** is a small language model pretrained from scratch on the TinyStories dataset. It has 53 million parameters and was trained for 10,000 steps on a single Tesla T4 GPU. It was trained on the next-token-prediction task using Cross-Entropy Loss over 674M tokens.
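The next-token-prediction objective described above amounts to shifting each token sequence by one position and applying cross-entropy between the model's logits and the shifted targets. A minimal PyTorch sketch with toy dimensions (random logits stand in for the model's output; this is an illustration, not the repository's actual training loop):

```python
import torch
import torch.nn.functional as F

# Toy dimensions standing in for the real ones (vocab 50,257, context 256).
vocab_size, context_len, batch = 8, 6, 2

# A batch of token ids; inputs are positions 0..T-2, targets are shifted by one.
tokens = torch.randint(0, vocab_size, (batch, context_len))
inputs, targets = tokens[:, :-1], tokens[:, 1:]

# Stand-in for model logits of shape (batch, seq_len, vocab_size).
logits = torch.randn(batch, context_len - 1, vocab_size)

# Cross-entropy over the flattened (batch * seq_len) prediction positions.
loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
print(loss.item())
```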
- **Developed by:** Namrata Thakur
- **Model type:** Text Generation
- **Language(s) (NLP):** English
- **License:** MIT
- **Training Type:** Pretraining

### Model Sources

- **Repository:** [GitHub Repo](https://github.com/NamrataThakur/Large_Language_Model_From_Scratch_Implementation)
- **Demo [optional]:** [More Information Needed]

## How to Get Started with the Model

To install Stories-SLM, follow these steps:

```bash
# Clone the repository and enter it
git clone https://github.com/NamrataThakur/Large_Language_Model_From_Scratch_Implementation.git
cd Large_Language_Model_From_Scratch_Implementation

# Create and activate a virtual environment
python -m venv env
source env/bin/activate

# Install the required packages
pip install -r requirements.txt
```

## Uses

Stories-SLM can be used to generate short, simple stories that are grammatically and semantically coherent and suitable for children.

### Chainlit Interface 🖥️

The easiest way to interact with Stories-SLM is through its Chainlit interface:

```bash
chainlit run app_pretrain.py
```

This launches a web application where you can input text and see the model's generated responses.
### Downloading from Huggingface 🤗

To use the model by downloading it from Huggingface, first clone the repository locally, then run:

```python
from transformer_blocks.gpt2 import GPT2
from gpt_Pretraining.text_generation import Text_Generation
import torch

model = GPT2.from_pretrained("NamrataThakur/Small_Language_Model_MHA_53M_Pretrained")
model.eval()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# ---------- Check generation to make sure everything is okay ----------
generation = Text_Generation(model=model, device=device, tokenizer_model='gpt2', arch_type='original')

start_context = "One day, a "
response = generation.text_generation(input_text=start_context, max_new_tokens=160, temp=0.5, top_k=10, kv_cache=False)
print(response)
```

## Model Architecture and Objective

Stories-SLM uses a standard GPT decoder-only transformer architecture with:

- Attention Type: Multi-Head Attention
- Normalization: LayerNorm
- Position Embedding: Learned absolute position encoding (similar to GPT-2)
- Num transformer blocks: 8
- Num attention heads: 8
- Embedding dimension: 384
- Vocabulary size: 50,257 tokens
- Context window: 256 tokens
- Feed-Forward Hidden Dimension: 1536
- Parameters: ~53M (52.88M exact)
- Overall Dropout: 0.2

**Optimization Config**:

- Optimizer: AdamW
- Weight Decay: 0.1
- Beta1: 0.9
- Beta2: 0.95
- Warmup Steps: 829
- Total Steps: 10,000
- Gradient Clipping: enabled
- Initial Learning Rate: 1e-05
- Maximum Learning Rate: 8e-04
- Gradient Accumulation Steps: 16
- Batch Size: 16
- Global Batch Size: 256
- Scheduler: Linear warmup, followed by Cosine Annealing

## Training Details

### Training Data

The model was trained on the TinyStories dataset, a collection of short stories designed for training language models.
This dataset provides simple narratives that help the model learn coherent story generation while remaining far smaller than the corpora used for larger language models.

### Training Procedure

Stories-SLM was trained using PyTorch on the TinyStories dataset. The training process involved:

1. Tokenizing the input text
2. Creating sliding windows of a fixed block size
3. Training the model with cross-entropy loss
4. Applying learning rate scheduling with warmup and cosine decay

**Training Plots**

- Learning Rate vs Steps:

![image](https://cdn-uploads.huggingface.co/production/uploads/684ef699c5b31f6acb9a698d/ewsnppSqF7s_VWjy-ssQN.png)

- Loss vs Steps:

![image](https://cdn-uploads.huggingface.co/production/uploads/684ef699c5b31f6acb9a698d/rCo5gUHD4K5fl-0Xos43Q.png)

## Inference

During inference, Stories-SLM uses several techniques to produce high-quality text:

- Temperature scaling to control randomness
- Top-k sampling for focus and diversity
- Token-by-token autoregressive generation
- Max New Tokens to cap generation length

### Results

![image](https://cdn-uploads.huggingface.co/production/uploads/684ef699c5b31f6acb9a698d/jN-rzzAg13z22LRAdqSKB.png)

![image](https://cdn-uploads.huggingface.co/production/uploads/684ef699c5b31f6acb9a698d/BdSfqeQEnOCDoJ2lMPJkX.png)

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** Single Tesla T4 16GB
- **Hours used:** [More Information Needed]
- **Cloud Provider:** Lightning-AI

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Support ❤️

If you find Stories-SLM useful, please consider starring the repository ⭐
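As an appendix, the temperature-scaled top-k sampling used during inference can be sketched in a few lines of PyTorch. This is a minimal illustration of one decoding step, not the repository's actual `Text_Generation` implementation:

```python
import torch

def sample_next_token(logits: torch.Tensor, temp: float = 0.5, top_k: int = 10) -> int:
    """One decoding step: top-k filtering followed by temperature-scaled sampling."""
    # Keep only the k largest logits; all other tokens get probability zero.
    top_vals, top_idx = torch.topk(logits, top_k)

    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    probs = torch.softmax(top_vals / temp, dim=-1)

    # Draw one token id from the filtered distribution.
    choice = torch.multinomial(probs, num_samples=1)
    return top_idx[choice].item()

# Usage with dummy logits over a toy vocabulary of 50 tokens.
logits = torch.randn(50)
token_id = sample_next_token(logits, temp=0.5, top_k=10)
```

Generation then repeats this step autoregressively, appending each sampled token to the context, until `max_new_tokens` is reached.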