---
license: mit
datasets:
- roneneldan/TinyStories
language:
- en
pipeline_tag: text-generation
tags:
- multi-head-attention
- small-language-model
- pretrain
- custom-model
- slm
- tiny-model
- research
---

*This repository demonstrates a small Multi-Head-Attention language model trained from scratch for educational and research purposes.*

# Stories-SLM 🤖

This model is part of a collection of Small Language Models pretrained from scratch on the TinyStories dataset. The collection currently contains **3** pretrained models, with more on the way. The model variants in the collection range from a standard GPT to **Mixture-of-Experts** versions built with **RoPE**, **Grouped Query Attention**, and **RMSNorm**.

| Model             | Params | Architecture                | Validation Loss |
| ----------------- | ------ | --------------------------- | --------------- |
| **Stories-SLM**   | 53M    | Dense - MHA                 | **1.78**        |
| Stories-SLM 2     | 48M    | Dense - GQA                 | 1.73            |
| Stories-SLM 2-MoE | 127M   | Sparse - Mixture-of-Experts | 1.67            |

**Model Name:** **Stories-SLM**

### Model Description

**Stories-SLM** is a small language model pretrained from scratch on the TinyStories dataset. It has 53 million parameters and was trained for 10,000 steps on a single Tesla T4 GPU. It was trained on the next-token-prediction task using Cross-Entropy Loss over 674M tokens.
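The next-token-prediction objective described above amounts to shifting each token sequence by one position and applying cross-entropy between the model's logits and the shifted targets. A minimal PyTorch sketch with toy dimensions (random logits stand in for the model's output; this is an illustration, not the repository's actual training loop):

```python
import torch
import torch.nn.functional as F

# Toy dimensions standing in for the real ones (vocab 50,257, context 256).
vocab_size, context_len, batch = 8, 6, 2

# A batch of token ids; inputs are positions 0..T-2, targets are shifted by one.
tokens = torch.randint(0, vocab_size, (batch, context_len))
inputs, targets = tokens[:, :-1], tokens[:, 1:]

# Stand-in for model logits of shape (batch, seq_len, vocab_size).
logits = torch.randn(batch, context_len - 1, vocab_size)

# Cross-entropy over the flattened (batch * seq_len) prediction positions.
loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
print(loss.item())
```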
- **Developed by:** Namrata Thakur
- **Model type:** Text Generation
- **Language(s) (NLP):** English
- **License:** MIT
- **Training Type:** Pretraining

### Model Sources

- **Repository:** [GitHub Repo](https://github.com/NamrataThakur/Large_Language_Model_From_Scratch_Implementation)
- **Demo [optional]:** [More Information Needed]

## How to Get Started with the Model

To install Stories-SLM, follow these steps:

```bash
# Clone the repository and enter it
git clone https://github.com/NamrataThakur/Large_Language_Model_From_Scratch_Implementation.git
cd Large_Language_Model_From_Scratch_Implementation

# Create and activate a virtual environment
python -m venv env
source env/bin/activate

# Install the required packages
pip install -r requirements.txt
```

## Uses

Stories-SLM can be used to generate short, simple stories that are grammatically and semantically coherent and suitable for children.

### Chainlit Interface 🖥️

The easiest way to interact with Stories-SLM is through its Chainlit interface:

```bash
chainlit run app_pretrain.py
```

This launches a web application where you can input text and see the model's generated responses.
### Downloading from Huggingface 🤗

To use the model by downloading it from Huggingface, first clone the repository locally, then run:

```python
from transformer_blocks.gpt2 import GPT2
from gpt_Pretraining.text_generation import Text_Generation
import torch

model = GPT2.from_pretrained("NamrataThakur/Small_Language_Model_MHA_53M_Pretrained")
model.eval()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# ---------- Check generation to make sure everything is okay ----------
generation = Text_Generation(model=model, device=device, tokenizer_model='gpt2', arch_type='original')

start_context = "One day, a "
response = generation.text_generation(input_text=start_context, max_new_tokens=160, temp=0.5, top_k=10, kv_cache=False)
print(response)
```

## Model Architecture and Objective

Stories-SLM uses a standard GPT decoder-only transformer architecture with:

- Attention Type: Multi-Head Attention
- Normalization: LayerNorm
- Position Embedding: Learned absolute position encoding (similar to GPT-2)
- Num transformer blocks: 8
- Num attention heads: 8
- Embedding dimension: 384
- Vocabulary size: 50,257 tokens
- Context window: 256 tokens
- Feed-Forward Hidden Dimension: 1536
- Parameters: ~53M (52.88M exact)
- Overall Dropout: 0.2

**Optimization Config**:

- Optimizer: AdamW
- Weight Decay: 0.1
- Beta1: 0.9
- Beta2: 0.95
- Warmup Steps: 829
- Total Steps: 10,000
- Gradient Clipping: enabled
- Initial Learning Rate: 1e-05
- Maximum Learning Rate: 8e-04
- Gradient Accumulation Steps: 16
- Batch Size: 16
- Global Batch Size: 256
- Scheduler: Linear warmup, followed by Cosine Annealing

## Training Details

### Training Data

The model was trained on the TinyStories dataset, a collection of short stories designed for training language models.
This dataset provides simple narratives that help the model learn coherent story generation while remaining far smaller than the corpora used for larger language models.

### Training Procedure

Stories-SLM was trained using PyTorch on the TinyStories dataset. The training process involved:

1. Tokenizing the input text
2. Creating sliding windows of a fixed block size
3. Training the model with cross-entropy loss
4. Applying learning rate scheduling with warmup and cosine decay

**Training Plots**

- Learning Rate vs Steps:

![image](https://cdn-uploads.huggingface.co/production/uploads/684ef699c5b31f6acb9a698d/ewsnppSqF7s_VWjy-ssQN.png)

- Loss vs Steps:

![image](https://cdn-uploads.huggingface.co/production/uploads/684ef699c5b31f6acb9a698d/rCo5gUHD4K5fl-0Xos43Q.png)

## Inference

During inference, Stories-SLM uses several techniques to produce high-quality text:

- Temperature scaling to control randomness
- Top-k sampling for focus and diversity
- Token-by-token autoregressive generation
- Max New Tokens to cap generation length

### Results

![image](https://cdn-uploads.huggingface.co/production/uploads/684ef699c5b31f6acb9a698d/jN-rzzAg13z22LRAdqSKB.png)

![image](https://cdn-uploads.huggingface.co/production/uploads/684ef699c5b31f6acb9a698d/BdSfqeQEnOCDoJ2lMPJkX.png)

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** Single Tesla T4 16GB
- **Hours used:** [More Information Needed]
- **Cloud Provider:** Lightning-AI

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Support ❤️

If you find Stories-SLM useful, please consider starring the repository ⭐
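As an appendix, the temperature-scaled top-k sampling used during inference can be sketched in a few lines of PyTorch. This is a minimal illustration of one decoding step, not the repository's actual `Text_Generation` implementation:

```python
import torch

def sample_next_token(logits: torch.Tensor, temp: float = 0.5, top_k: int = 10) -> int:
    """One decoding step: top-k filtering followed by temperature-scaled sampling."""
    # Keep only the k largest logits; all other tokens get probability zero.
    top_vals, top_idx = torch.topk(logits, top_k)

    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    probs = torch.softmax(top_vals / temp, dim=-1)

    # Draw one token id from the filtered distribution.
    choice = torch.multinomial(probs, num_samples=1)
    return top_idx[choice].item()

# Usage with dummy logits over a toy vocabulary of 50 tokens.
logits = torch.randn(50)
token_id = sample_next_token(logits, temp=0.5, top_k=10)
```

Generation then repeats this step autoregressively, appending each sampled token to the context, until `max_new_tokens` is reached.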