MandarapuMadhulatha committed on
Commit 8493c0e · verified · 1 Parent(s): 46d9707

Upload Shoonya Model v0.2 with DeepSeek CPU optimizations

Files changed (6):
  1. README.md +104 -76
  2. config.json +18 -11
  3. model.onnx +3 -0
  4. pytorch_model.bin +3 -0
  5. quantization_note.md +28 -0
  6. tokenizer_config.json +8 -0
README.md CHANGED
@@ -1,100 +1,128 @@
  ---
  language: en
- license: apache-2.0
- library_name: custom
  tags:
- - cpu-inference
- - lightweight
- - text-generation
- model-index:
- - name: Shoonya
-   results:
-   - task:
-       type: text-generation
-     metrics:
-     - type: throughput
-       value: "100ms/inference"
-       name: CPU Inference Speed
  ---
- # Shoonya v0.1 - Lightweight CPU-Friendly Language Model

  ## Model Description
- Shoonya is a lightweight transformer-based language model designed specifically for CPU inference. Built with efficiency in mind, it features a compact architecture while maintaining coherent text generation capabilities.
-
- ## Key Features
- - **CPU-Optimized**: Designed to run efficiently on CPU-only environments
- - **Lightweight**: Only 4 transformer layers with 128 hidden dimensions
- - **Memory Efficient**: ~15MB model size (quantized version ~4MB)
- - **Fast Inference**: Suitable for real-time text generation on consumer hardware
-
- ## Technical Details
- - **Architecture**: Transformer-based language model
-   - 4 attention layers
-   - 4 attention heads per layer
-   - 128 hidden dimensions
-   - 256 intermediate size
-   - 128 max sequence length
- - **Vocabulary**: GPT-2 tokenizer (50,257 tokens)
- - **Training**: Fine-tuned on TinyStories dataset (1,000 examples)
- - **Quantization**: 8-bit dynamic quantization available for further size reduction

  ## Usage

  ```python
- from transformers import AutoTokenizer
- from model.transformer import TransformerLM

- # Load model
- model = TransformerLM.from_pretrained("vaidhyamegha/shoonya-v0.1")
- tokenizer = AutoTokenizer.from_pretrained("gpt2")

  # Generate text
- prompt = "Once upon a time"
- generated = model.generate(prompt, max_length=50)
- print(generated)
  ```

- ## Performance Characteristics
- - **Memory Usage**: <2GB RAM during inference
- - **Model Size**:
-   - Full model: ~15MB
-   - Quantized version: ~4MB
- - **Speed**: ~100ms per inference on standard CPU

- ## Limitations
- - Limited context window (128 tokens)
- - Trained on a small subset of data
- - Best suited for short-form creative writing
- - May produce repetitive text on longer generations

- ## Training
- Trained on a curated subset of the TinyStories dataset, focusing on short, coherent narratives. The model uses a custom implementation of the transformer architecture with specific optimizations for CPU inference.

- ## License
- [Add your chosen license]

- ## Citation
  ```bibtex
- @misc{shoonya2025,
-   author = {VaidhyaMegha},
-   title = {Shoonya: A Lightweight CPU-Friendly Language Model},
-   year = {2025},
-   publisher = {Hugging Face},
-   journal = {Hugging Face Model Hub},
  }
  ```

- ## Intended Use
- This model is designed for:
- - Prototyping and experimentation
- - Educational purposes
- - CPU-only environments
- - Resource-constrained settings
- - Short-form text generation
-
- ## Quantization
- The model comes in two variants:
- 1. Full precision (shoonya_model_v0_1.pt)
- 2. 8-bit quantized (shoonya_model_v0_1_quantized.pt)

- The quantized version offers significant size reduction while maintaining reasonable quality.
  ---
  language: en
+ license: mit
+ library_name: pytorch
+ pipeline_tag: text-generation
  tags:
+ - deepseek
+ - cpu-optimized
+ - transformer
+ - language-model
+ - tinystories
+ - grouped-query-attention
+ - rotary-position-embeddings
+ - rmsnorm
+ - swiglu
+ datasets:
+ - roneneldan/TinyStories
  ---
+
+ # Shoonya Model v0.2 - DeepSeek CPU-Optimized
+
+ This model is a CPU-optimized version of the Shoonya language model, incorporating techniques from the DeepSeek team for efficient inference on CPU hardware.

  ## Model Description
+
+ **Shoonya Model v0.2** is a lightweight transformer-based language model designed for efficient CPU inference. It incorporates architectural optimizations inspired by DeepSeek's research to achieve better performance on CPU hardware while maintaining good generation quality.
+
+ ### Model Details
+
+ - **Developed by:** VaidhyaMegha
+ - **Model type:** Transformer-based language model
+ - **Language(s):** English
+ - **Training Data:** TinyStories dataset
+ - **Parameters:** 16.41M
+ - **Context Length:** 512 tokens
+ - **Hidden Size:** 256
+ - **Attention Heads:** 8
+ - **Key-Value Heads:** 4
+ - **Hidden Layers:** 6
+ - **License:** MIT
+ - **Repository:** [GitHub - VaidhyaMegha/Shoonya](https://github.com/VaidhyaMegha/Shoonya)
+
+ ## DeepSeek CPU Optimizations
+
+ This model incorporates the following optimizations from the DeepSeek team:
+
+ 1. **Grouped-Query Attention (GQA)** with a 2:1 ratio - Reduces memory usage and computational cost by sharing key and value projections across multiple query heads
+ 2. **Rotary Position Embeddings (RoPE)** - Provides better positional encoding with improved extrapolation to longer sequences
+ 3. **RMSNorm** - Offers improved training stability compared to LayerNorm
+ 4. **SwiGLU activation** - Provides better performance in feed-forward networks compared to standard GELU
+ 5. **Sliding Window Attention** with window size 256 - Reduces memory usage for longer sequences by limiting attention to a local window
+ 6. **ONNX export** - Enables optimized runtime on various hardware platforms
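As a concrete illustration of the first point, here is a minimal NumPy sketch of grouped-query attention with this model's head layout (8 query heads, 4 key-value heads, hidden size 256). It is an illustrative sketch, not code from this commit, and it omits RoPE, causal masking, and the sliding window; `gqa_attention` and the random weights are ours.

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, wo, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: each K/V head is shared by n_heads // n_kv_heads query heads."""
    seq, d = x.shape
    head_dim = d // n_heads          # 256 / 8 = 32
    group = n_heads // n_kv_heads    # 2 query heads per KV head (the 2:1 ratio)

    q = (x @ wq).reshape(seq, n_heads, head_dim)
    k = (x @ wk).reshape(seq, n_kv_heads, head_dim)
    v = (x @ wv).reshape(seq, n_kv_heads, head_dim)

    # Repeat each KV head so it lines up with its group of query heads
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)

    out = np.empty_like(q)
    for h in range(n_heads):
        scores = q[:, h] @ k[:, h].T / np.sqrt(head_dim)
        scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
        probs = np.exp(scores)
        probs /= probs.sum(axis=-1, keepdims=True)
        out[:, h] = probs @ v[:, h]
    return out.reshape(seq, d) @ wo

d, kv_dim = 256, 4 * 32   # K/V projections are half-width (128), the source of the savings
rng = np.random.default_rng(0)
x = rng.standard_normal((10, d))
y = gqa_attention(x, rng.standard_normal((d, d)), rng.standard_normal((d, kv_dim)),
                  rng.standard_normal((d, kv_dim)), rng.standard_normal((d, d)))
print(y.shape)  # (10, 256)
```

The memory win is in the half-width K/V projections and the half-size KV cache; the query path is unchanged.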
+
+ ## Intended Uses & Limitations
+
+ **Intended Uses:**
+ - Educational purposes to understand transformer architecture and optimizations
+ - Research on efficient language model deployment
+ - Text generation for simple creative writing tasks
+ - Baseline for further fine-tuning on specific tasks
+
+ **Limitations:**
+ - The model is trained on a limited dataset (TinyStories) and has a relatively small parameter count
+ - It may not perform well on complex reasoning tasks or specialized domains
+ - The model has not been extensively evaluated for biases or harmful outputs
+
+ ## Training Procedure
+
+ ### Training Data
+
+ The model was trained on the [TinyStories dataset](https://huggingface.co/datasets/roneneldan/TinyStories), which contains simple stories suitable for young children, generated by GPT-3.5/4.
+
+ ### Training Hyperparameters
+
+ - **Optimizer:** AdamW
+ - **Learning Rate:** 5e-5
+ - **Batch Size:** 4
+ - **Weight Decay:** 0.01
+ - **Warmup Steps:** 100
+ - **Gradient Accumulation Steps:** 4
+ - **Training Device:** CPU (Mac Mini M4)
+ - **Training Epochs:** 5
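Two derived quantities may help readers interpret the hyperparameters above: the effective batch size is the micro-batch size times the gradient-accumulation steps, and the warmup ramps the learning rate over the first 100 optimizer steps. The sketch below assumes a *linear* warmup (the card does not specify the schedule), and `warmup_lr` is our illustrative helper, not code from this repo.

```python
def warmup_lr(step, base_lr=5e-5, warmup_steps=100):
    """Linear warmup to base_lr over the first `warmup_steps` optimizer steps (assumed schedule)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# Effective batch size = micro-batch size x gradient accumulation steps
effective_batch = 4 * 4
print(effective_batch)  # 16

# The LR ramps from base_lr/100 at step 0 up to the full 5e-5 after 100 steps
print(warmup_lr(0), warmup_lr(100))
```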
+
+ ## Note on Quantization
+
+ The quantized version of this model is not included due to PyTorch quantization limitations on Mac M-series chips. See quantization_note.md for instructions on how to quantize the model on a compatible system.

  ## Usage

  ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer

+ # Load model and tokenizer
+ model = AutoModelForCausalLM.from_pretrained("VaidhyaMegha/Shoonya")
+ tokenizer = AutoTokenizer.from_pretrained("VaidhyaMegha/Shoonya")

  # Generate text
+ input_text = "Once upon a time"
+ input_ids = tokenizer(input_text, return_tensors="pt").input_ids
+ output = model.generate(input_ids, max_length=100, do_sample=True, temperature=0.7, top_p=0.9, repetition_penalty=1.1)
+ generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
+ print(generated_text)
  ```

+ ## Evaluation Results

+ The model achieved the following metrics during training:
+ - **Final Loss:** 7.21
+ - **Final Perplexity:** 1358.28
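The two reported numbers are consistent: perplexity is the exponential of the cross-entropy loss, and the small gap between exp(7.21) and 1358.28 simply reflects the loss being rounded to two decimals.

```python
import math

final_loss = 7.21                  # reported final training loss (rounded)
perplexity = math.exp(final_loss)  # ~1352.9; the reported 1358.28 matches the unrounded loss
print(perplexity)
```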
 
 

+ ## Ethical Considerations

+ This model is trained on the TinyStories dataset, which was designed to be suitable for children and contains simple, non-harmful content. However, as with any language model, it may still produce unexpected or potentially problematic outputs. Users should exercise caution and implement appropriate content filtering if deploying this model in production environments.
+
+ ## Citations

  ```bibtex
+ @article{eldan2023tinystories,
+   title={{TinyStories: How Small Can Language Models Be and Still Speak Coherent English?}},
+   author={Eldan, Ronen and Li, Yuanzhi},
+   journal={arXiv preprint arXiv:2305.07759},
+   year={2023}
  }
  ```

+ ## License

+ This model is released under the MIT License.
config.json CHANGED
@@ -1,12 +1,19 @@
  {
- "architectures": ["TransformerLM"],
- "model_type": "transformer",
- "n_layer": 4,
- "n_head": 4,
- "n_embd": 128,
- "vocab_size": 50257,
- "max_position_embeddings": 128,
- "intermediate_size": 256,
- "torch_dtype": "float32",
- "transformers_version": "4.30.0"
- }
+ "attention_probs_dropout_prob": 0.1,
+ "ffn_activation": "silu",
+ "hidden_dropout_prob": 0.1,
+ "hidden_size": 256,
+ "initializer_range": 0.02,
+ "intermediate_size": 512,
+ "layer_norm_eps": 1e-12,
+ "max_position_embeddings": 512,
+ "num_attention_heads": 8,
+ "num_hidden_layers": 6,
+ "num_key_value_heads": 4,
+ "position_embedding_type": "rotary",
+ "sliding_window": 256,
+ "tie_word_embeddings": true,
+ "use_rms_norm": true,
+ "use_swiglu": true,
+ "vocab_size": 50257
+ }
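As a sanity check, the README's 16.41M parameter figure can be reproduced from this config, assuming no biases, input/output embeddings counted once (per `tie_word_embeddings`), half-width K/V projections for the 4 key-value heads, and two RMSNorms per layer plus a final one. This breakdown is our reconstruction, not code from the repo.

```python
hidden, vocab, layers = 256, 50257, 6
inter, n_heads, n_kv = 512, 8, 4

head_dim = hidden // n_heads   # 32
kv_dim = n_kv * head_dim       # 128: K/V projections are half-width under GQA

embed = vocab * hidden                             # tied with the LM head, counted once
attn = 2 * hidden * hidden + 2 * hidden * kv_dim   # Q and O full-width, K and V half-width
ffn = 3 * hidden * inter                           # SwiGLU: gate, up, and down projections
norms = 2 * hidden                                 # two RMSNorms per layer

total = embed + layers * (attn + ffn + norms) + hidden  # + final RMSNorm
print(f"{total / 1e6:.2f}M")  # 16.41M
```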
model.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:21bb7c352be159145373fc235533ea6327c0010fb9c661d059a494759e80d099
+ size 117362381
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:93444b78414f23da694467d67cf6442655a6da0e3e04266f29bbc1e550ceb62a
+ size 65683059
quantization_note.md ADDED
@@ -0,0 +1,28 @@
+ # Note on Quantization
+
+ The quantized version of this model is not included because PyTorch quantization has limited support on Mac M-series chips.
+
+ To quantize this model on a compatible system:
+ ```python
+ import json
+
+ import torch
+ from model.transformer import TransformerLM, ModelConfig
+
+ # Load the weights and the config
+ checkpoint = torch.load("pytorch_model.bin", map_location="cpu")
+ with open("config.json") as f:
+     # assumes ModelConfig accepts the config.json fields as keyword arguments
+     config = ModelConfig(**json.load(f))
+
+ # Create model instance
+ model = TransformerLM(config)
+ model.load_state_dict(checkpoint)
+ model.eval()
+
+ # Apply dynamic quantization to linear layers
+ quantized_model = torch.quantization.quantize_dynamic(
+     model,
+     {torch.nn.Linear},
+     dtype=torch.qint8
+ )
+
+ # Save quantized model
+ torch.save(quantized_model.state_dict(), "pytorch_model_quantized.bin")
+ ```
tokenizer_config.json ADDED
@@ -0,0 +1,8 @@
+ {
+   "model_type": "gpt2",
+   "vocab_size": 50257,
+   "padding_side": "right",
+   "truncation_side": "right",
+   "bos_token_id": 50256,
+   "eos_token_id": 50256
+ }