janakhpon committed on
Commit 335b557 · verified · 1 Parent(s): 2fe8a15

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +15 -15
README.md CHANGED
@@ -1,26 +1,26 @@
  ---
  language:
- - mnw
+ - mnw
  license: mit
  base_model: Qwen/Qwen2.5-1.5B
  tags:
- - mon
- - mnw
- - qwen
- - qwen2.5
- - cpt
- - continual-pretraining
- - tokenizer-expansion
+ - mon
+ - mnw
+ - qwen
+ - qwen2.5
+ - cpt
+ - continual-pretraining
+ - tokenizer-expansion
  datasets:
- - janakhpon/mon-corpus-collection
+ - janakhpon/mon-corpus-collection
  model-index:
- - name: Mon-LM-Qwen2.5-1.5B
- results: []
+ - name: Mon-LM-Qwen2.5-1.5B
+ results: []
  ---

  # Mon-LM (Qwen2.5-1.5B)

- **Mon-LM** is a production-grade Large Language Model for the **Mon language (mnw)**. It is based on **Qwen2.5-1.5B** and has undergone **Continual Pre-Training (CPT)** on a high-quality Mon language corpus.
+ Mon-LM is a Large Language Model for the Mon language (mnw). It is based on Qwen2.5-1.5B and has undergone Continual Pre-Training (CPT) on a Mon language corpus.

  ## Model Details

@@ -32,11 +32,11 @@ model-index:

  ## Vocabulary Expansion

- The base Qwen2.5 tokenizer was expanded to better handle the Mon script. We injected the top-performing Mon subwords into the embedding layer, significantly improving the compression ratio and linguistic atomicity for Mon text.
+ The base Qwen2.5 tokenizer was expanded for the Mon script. Mon subwords were injected into the embedding layer to adjust the compression ratio and linguistic atomicity for Mon text.

  ## Usage

- You can use this model directly with the Hugging Face `transformers` library:
+ Use this model with the Hugging Face `transformers` library:

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -55,4 +55,4 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))

  ## Acknowledgments

- This model was trained as part of the Mon Language AI initiative. Special thanks to the Mon community for the corpus collection efforts.
+ This model was trained as part of the Mon Language AI initiative. Credits to the Mon community for the corpus collection efforts.
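
Note on the Vocabulary Expansion paragraph touched by this commit: it says Mon subwords were added to the Qwen2.5 tokenizer to change the compression ratio for Mon text. The toy sketch below may help show why that matters, by comparing token counts for a short Mon string before and after whole subwords are added. Everything in it is an assumption made for illustration: `tokenize`, `chars_per_token`, the sample string, and the two subwords are hypothetical, and the greedy longest-match tokenizer with UTF-8 byte fallback only approximates how a byte-level BPE tokenizer treats script it has not learned. It is not the procedure used to build this model.

```python
def tokenize(text, vocab):
    """Toy greedy longest-match tokenizer with UTF-8 byte fallback.

    Hypothetical stand-in for a byte-level BPE tokenizer; not Qwen's tokenizer.
    """
    max_len = max((len(piece) for piece in vocab), default=0)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # Nothing matched: emit one token per UTF-8 byte of this character.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


def chars_per_token(text, vocab):
    """Compression ratio measured as characters (code points) per token."""
    return len(text) / len(tokenize(text, vocab))


# Illustrative sample: seven code points from the Myanmar block, which Mon uses.
sample = "\u1018\u102c\u101e\u102c\u1019\u1014\u103a"

# Assumption: the base vocabulary has no entries covering this text.
base_vocab = set()
# Assumption: two hypothetical subwords are added during expansion.
expanded_vocab = base_vocab | {"\u1018\u102c\u101e\u102c", "\u1019\u1014\u103a"}

print(len(tokenize(sample, base_vocab)), chars_per_token(sample, base_vocab))
print(len(tokenize(sample, expanded_vocab)), chars_per_token(sample, expanded_vocab))
```

Each code point in this range takes three UTF-8 bytes, so the byte fallback spends several tokens per character while the expanded vocabulary covers the same text with whole subwords. With `transformers`, growing a vocabulary is commonly done with `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`; the commit does not state whether that route was taken here.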