MandarapuMadhulatha committed on
Commit 8493c0e · verified · 1 Parent(s): 46d9707

Upload Shoonya Model v0.2 with DeepSeek CPU optimizations

Files changed (6):
  1. README.md +104 -76
  2. config.json +18 -11
  3. model.onnx +3 -0
  4. pytorch_model.bin +3 -0
  5. quantization_note.md +28 -0
  6. tokenizer_config.json +8 -0
README.md CHANGED
@@ -1,100 +1,128 @@
  ---
  language: en
- license: apache-2.0
- library_name: custom
  tags:
- - cpu-inference
- - lightweight
- - text-generation
- model-index:
- - name: Shoonya
-   results:
-   - task:
-       type: text-generation
-     metrics:
-     - type: throughput
-       value: "100ms/inference"
-       name: CPU Inference Speed
  ---
- # Shoonya v0.1 - Lightweight CPU-Friendly Language Model

  ## Model Description
- Shoonya is a lightweight transformer-based language model designed specifically for CPU inference. Built with efficiency in mind, it features a compact architecture while maintaining coherent text generation capabilities.
-
- ## Key Features
- - **CPU-Optimized**: Designed to run efficiently on CPU-only environments
- - **Lightweight**: Only 4 transformer layers with 128 hidden dimensions
- - **Memory Efficient**: ~15MB model size (quantized version ~4MB)
- - **Fast Inference**: Suitable for real-time text generation on consumer hardware
-
- ## Technical Details
- - **Architecture**: Transformer-based language model
-   - 4 attention layers
-   - 4 attention heads per layer
-   - 128 hidden dimensions
-   - 256 intermediate size
-   - 128 max sequence length
- - **Vocabulary**: GPT-2 tokenizer (50,257 tokens)
- - **Training**: Fine-tuned on TinyStories dataset (1,000 examples)
- - **Quantization**: 8-bit dynamic quantization available for further size reduction

  ## Usage

  ```python
- from transformers import AutoTokenizer
- from model.transformer import TransformerLM

- # Load model
- model = TransformerLM.from_pretrained("vaidhyamegha/shoonya-v0.1")
- tokenizer = AutoTokenizer.from_pretrained("gpt2")

  # Generate text
- prompt = "Once upon a time"
- generated = model.generate(prompt, max_length=50)
- print(generated)
  ```

- ## Performance Characteristics
- - **Memory Usage**: <2GB RAM during inference
- - **Model Size**:
-   - Full model: ~15MB
-   - Quantized version: ~4MB
- - **Speed**: ~100ms per inference on standard CPU

- ## Limitations
- - Limited context window (128 tokens)
- - Trained on a small subset of data
- - Best suited for short-form creative writing
- - May produce repetitive text on longer generations

- ## Training
- Trained on a curated subset of the TinyStories dataset, focusing on short, coherent narratives. The model uses a custom implementation of the transformer architecture with specific optimizations for CPU inference.

- ## License
- [Add your chosen license]

- ## Citation
  ```bibtex
- @misc{shoonya2025,
-   author = {VaidhyaMegha},
-   title = {Shoonya: A Lightweight CPU-Friendly Language Model},
-   year = {2025},
-   publisher = {Hugging Face},
-   journal = {Hugging Face Model Hub},
  }
  ```

- ## Intended Use
- This model is designed for:
- - Prototyping and experimentation
- - Educational purposes
- - CPU-only environments
- - Resource-constrained settings
- - Short-form text generation
-
- ## Quantization
- The model comes in two variants:
- 1. Full precision (shoonya_model_v0_1.pt)
- 2. 8-bit quantized (shoonya_model_v0_1_quantized.pt)

- The quantized version offers significant size reduction while maintaining reasonable quality.
  ---
  language: en
+ license: mit
+ library_name: pytorch
+ pipeline_tag: text-generation
  tags:
+ - deepseek
+ - cpu-optimized
+ - transformer
+ - language-model
+ - tinystories
+ - grouped-query-attention
+ - rotary-position-embeddings
+ - rmsnorm
+ - swiglu
+ datasets:
+ - roneneldan/TinyStories
  ---
+
+ # Shoonya Model v0.2 - DeepSeek CPU-Optimized
+
+ This model is a CPU-optimized version of the Shoonya language model, incorporating techniques from the DeepSeek team for efficient inference on CPU hardware.

  ## Model Description
+
+ **Shoonya Model v0.2** is a lightweight transformer-based language model designed for efficient CPU inference. It incorporates architectural optimizations inspired by DeepSeek's research to achieve better performance on CPU hardware while maintaining good generation quality.
+
+ ### Model Details
+
+ - **Developed by:** VaidhyaMegha
+ - **Model type:** Transformer-based language model
+ - **Language(s):** English
+ - **Training Data:** TinyStories dataset
+ - **Parameters:** 16.41M
+ - **Context Length:** 512 tokens
+ - **Hidden Size:** 256
+ - **Attention Heads:** 8
+ - **Key-Value Heads:** 4
+ - **Hidden Layers:** 6
+ - **License:** MIT
+ - **Repository:** [GitHub - VaidhyaMegha/Shoonya](https://github.com/VaidhyaMegha/Shoonya)
+
+ ## DeepSeek CPU Optimizations
+
+ This model incorporates the following optimizations from the DeepSeek team:
+
+ 1. **Grouped-Query Attention (GQA)** with a 2:1 ratio - Reduces memory usage and computational cost by sharing key and value projections across multiple query heads
+ 2. **Rotary Position Embeddings (RoPE)** - Provides better positional encoding with improved extrapolation to longer sequences
+ 3. **RMSNorm** - Offers improved training stability compared to LayerNorm
+ 4. **SwiGLU activation** - Provides better performance in feed-forward networks compared to standard GELU
+ 5. **Sliding Window Attention** with window size 256 - Reduces memory usage for longer sequences by limiting attention to a local window
+ 6. **ONNX export** - Enables optimized runtime on various hardware platforms
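As a concrete illustration of the first point, here is a minimal NumPy sketch of grouped-query attention with this model's head layout (8 query heads, 4 key-value heads, hidden size 256). It is an illustrative sketch, not code from this commit, and it omits RoPE, causal masking, and the sliding window; `gqa_attention` and the random weights are ours.

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, wo, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: each K/V head is shared by n_heads // n_kv_heads query heads."""
    seq, d = x.shape
    head_dim = d // n_heads          # 256 / 8 = 32
    group = n_heads // n_kv_heads    # 2 query heads per KV head (the 2:1 ratio)

    q = (x @ wq).reshape(seq, n_heads, head_dim)
    k = (x @ wk).reshape(seq, n_kv_heads, head_dim)
    v = (x @ wv).reshape(seq, n_kv_heads, head_dim)

    # Repeat each KV head so it lines up with its group of query heads
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)

    out = np.empty_like(q)
    for h in range(n_heads):
        scores = q[:, h] @ k[:, h].T / np.sqrt(head_dim)
        scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
        probs = np.exp(scores)
        probs /= probs.sum(axis=-1, keepdims=True)
        out[:, h] = probs @ v[:, h]
    return out.reshape(seq, d) @ wo

d, kv_dim = 256, 4 * 32   # K/V projections are half-width (128), the source of the savings
rng = np.random.default_rng(0)
x = rng.standard_normal((10, d))
y = gqa_attention(x, rng.standard_normal((d, d)), rng.standard_normal((d, kv_dim)),
                  rng.standard_normal((d, kv_dim)), rng.standard_normal((d, d)))
print(y.shape)  # (10, 256)
```

The memory win is in the half-width K/V projections and the half-size KV cache; the query path is unchanged.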
+
+ ## Intended Uses & Limitations
+
+ **Intended Uses:**
+ - Educational purposes to understand transformer architecture and optimizations
+ - Research on efficient language model deployment
+ - Text generation for simple creative writing tasks
+ - Baseline for further fine-tuning on specific tasks
+
+ **Limitations:**
+ - The model is trained on a limited dataset (TinyStories) and has a relatively small parameter count
+ - It may not perform well on complex reasoning tasks or specialized domains
+ - The model has not been extensively evaluated for biases or harmful outputs
+
+ ## Training Procedure
+
+ ### Training Data
+
+ The model was trained on the [TinyStories dataset](https://huggingface.co/datasets/roneneldan/TinyStories), which contains simple stories suitable for young children, generated by GPT-3.5/4.
+
+ ### Training Hyperparameters
+
+ - **Optimizer:** AdamW
+ - **Learning Rate:** 5e-5
+ - **Batch Size:** 4
+ - **Weight Decay:** 0.01
+ - **Warmup Steps:** 100
+ - **Gradient Accumulation Steps:** 4
+ - **Training Device:** CPU (Mac Mini M4)
+ - **Training Epochs:** 5
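Two derived quantities may help readers interpret the hyperparameters above: the effective batch size is the micro-batch size times the gradient-accumulation steps, and the warmup ramps the learning rate over the first 100 optimizer steps. The sketch below assumes a *linear* warmup (the card does not specify the schedule), and `warmup_lr` is our illustrative helper, not code from this repo.

```python
def warmup_lr(step, base_lr=5e-5, warmup_steps=100):
    """Linear warmup to base_lr over the first `warmup_steps` optimizer steps (assumed schedule)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# Effective batch size = micro-batch size x gradient accumulation steps
effective_batch = 4 * 4
print(effective_batch)  # 16

# The LR ramps from base_lr/100 at step 0 up to the full 5e-5 after 100 steps
print(warmup_lr(0), warmup_lr(100))
```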
+
+ ## Note on Quantization
+
+ The quantized version of this model is not included due to PyTorch quantization limitations on Mac M-series chips. See quantization_note.md for instructions on how to quantize the model on a compatible system.

  ## Usage

  ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer

+ # Load model and tokenizer
+ model = AutoModelForCausalLM.from_pretrained("VaidhyaMegha/Shoonya")
+ tokenizer = AutoTokenizer.from_pretrained("VaidhyaMegha/Shoonya")

  # Generate text
+ input_text = "Once upon a time"
+ input_ids = tokenizer(input_text, return_tensors="pt").input_ids
+ output = model.generate(input_ids, max_length=100, do_sample=True, temperature=0.7, top_p=0.9, repetition_penalty=1.1)
+ generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
+ print(generated_text)
  ```

+ ## Evaluation Results

+ The model achieved the following metrics during training:
+ - **Final Loss:** 7.21
+ - **Final Perplexity:** 1358.28
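The two reported numbers are consistent: perplexity is the exponential of the cross-entropy loss, and the small gap between exp(7.21) and 1358.28 simply reflects the loss being rounded to two decimals.

```python
import math

final_loss = 7.21                  # reported final training loss (rounded)
perplexity = math.exp(final_loss)  # ~1352.9; the reported 1358.28 matches the unrounded loss
print(perplexity)
```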
 
 

+ ## Ethical Considerations

+ This model is trained on the TinyStories dataset, which was designed to be suitable for children and contains simple, non-harmful content. However, as with any language model, it may still produce unexpected or potentially problematic outputs. Users should exercise caution and implement appropriate content filtering if deploying this model in production environments.
+
+ ## Citations

  ```bibtex
+ @article{eldan2023tinystories,
+   title={{TinyStories: How Small Can Language Models Be and Still Speak Coherent English?}},
+   author={Eldan, Ronen and Li, Yuanzhi},
+   journal={arXiv preprint arXiv:2305.07759},
+   year={2023}
  }
  ```

+ ## License

+ This model is released under the MIT License.
config.json CHANGED
@@ -1,12 +1,19 @@
  {
- "architectures": ["TransformerLM"],
- "model_type": "transformer",
- "n_layer": 4,
- "n_head": 4,
- "n_embd": 128,
- "vocab_size": 50257,
- "max_position_embeddings": 128,
- "intermediate_size": 256,
- "torch_dtype": "float32",
- "transformers_version": "4.30.0"
- }
+ "attention_probs_dropout_prob": 0.1,
+ "ffn_activation": "silu",
+ "hidden_dropout_prob": 0.1,
+ "hidden_size": 256,
+ "initializer_range": 0.02,
+ "intermediate_size": 512,
+ "layer_norm_eps": 1e-12,
+ "max_position_embeddings": 512,
+ "num_attention_heads": 8,
+ "num_hidden_layers": 6,
+ "num_key_value_heads": 4,
+ "position_embedding_type": "rotary",
+ "sliding_window": 256,
+ "tie_word_embeddings": true,
+ "use_rms_norm": true,
+ "use_swiglu": true,
+ "vocab_size": 50257
+ }
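As a sanity check, the README's 16.41M parameter figure can be reproduced from this config, assuming no biases, input/output embeddings counted once (per `tie_word_embeddings`), half-width K/V projections for the 4 key-value heads, and two RMSNorms per layer plus a final one. This breakdown is our reconstruction, not code from the repo.

```python
hidden, vocab, layers = 256, 50257, 6
inter, n_heads, n_kv = 512, 8, 4

head_dim = hidden // n_heads   # 32
kv_dim = n_kv * head_dim       # 128: K/V projections are half-width under GQA

embed = vocab * hidden                             # tied with the LM head, counted once
attn = 2 * hidden * hidden + 2 * hidden * kv_dim   # Q and O full-width, K and V half-width
ffn = 3 * hidden * inter                           # SwiGLU: gate, up, and down projections
norms = 2 * hidden                                 # two RMSNorms per layer

total = embed + layers * (attn + ffn + norms) + hidden  # + final RMSNorm
print(f"{total / 1e6:.2f}M")  # 16.41M
```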
model.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:21bb7c352be159145373fc235533ea6327c0010fb9c661d059a494759e80d099
+ size 117362381
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:93444b78414f23da694467d67cf6442655a6da0e3e04266f29bbc1e550ceb62a
+ size 65683059
quantization_note.md ADDED
@@ -0,0 +1,28 @@
+ # Note on Quantization
+
+ The quantized version of this model is not included because PyTorch quantization has limited support on Mac M-series chips.
+
+ To quantize this model on a compatible system:
+ ```python
+ import json
+
+ import torch
+ from model.transformer import TransformerLM, ModelConfig
+
+ # Load the weights and the config
+ checkpoint = torch.load("pytorch_model.bin", map_location="cpu")
+ with open("config.json") as f:
+     # assumes ModelConfig accepts the config.json fields as keyword arguments
+     config = ModelConfig(**json.load(f))
+
+ # Create model instance
+ model = TransformerLM(config)
+ model.load_state_dict(checkpoint)
+ model.eval()
+
+ # Apply dynamic quantization to linear layers
+ quantized_model = torch.quantization.quantize_dynamic(
+     model,
+     {torch.nn.Linear},
+     dtype=torch.qint8
+ )
+
+ # Save quantized model
+ torch.save(quantized_model.state_dict(), "pytorch_model_quantized.bin")
+ ```
tokenizer_config.json ADDED
@@ -0,0 +1,8 @@
+ {
+   "model_type": "gpt2",
+   "vocab_size": 50257,
+   "padding_side": "right",
+   "truncation_side": "right",
+   "bos_token_id": 50256,
+   "eos_token_id": 50256
+ }