---
language:
- mnw
license: mit
base_model: SeaLLMs/SeaLLMs-v3-1.5B
tags:
- mon
- mnw
- seallm
- qwen2.5
- cpt
- continual-pretraining
- tokenizer-expansion
datasets:
- janakhpon/mon-corpus-collection
model-index:
- name: Mon-LM-SeaLLMs-v3-1.5B
  results: []
---
# Mon-LM (SeaLLMs-v3-1.5B)
Mon-LM is a large language model for the Mon language (ISO 639-3: mnw). This variant is based on **SeaLLMs-v3-1.5B** (a Qwen2-based model optimized for Southeast Asian languages) and has undergone Continual Pre-Training (CPT) on a Mon language corpus ([janakhpon/mon-corpus-collection](https://huggingface.co/datasets/janakhpon/mon-corpus-collection)).
## Model Details
- **Base Model:** [SeaLLMs/SeaLLMs-v3-1.5B](https://huggingface.co/SeaLLMs/SeaLLMs-v3-1.5B)
- **Language:** Mon (mnw)
- **Training Method:** Continual Pre-Training (CPT) via QLoRA
- **Tokenizer:** Expanded SeaLLMs-v3 tokenizer with ~3,000 added Mon-specific tokens (SentencePiece Unigram)
- **Normalization:** All Mon text is NFC normalized.
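Because the training text is NFC normalized, it is probably sensible to normalize prompts the same way before tokenizing them. The following is a minimal sketch using Python's standard `unicodedata` module; the card does not say which tool was used during training, and the helper names `normalize_mon` and `is_nfc` are hypothetical.

```python
import unicodedata


def normalize_mon(text: str) -> str:
    """Return `text` in Unicode Normalization Form C (NFC)."""
    return unicodedata.normalize("NFC", text)


def is_nfc(text: str) -> bool:
    """Check whether `text` is already in NFC (requires Python 3.8+)."""
    return unicodedata.is_normalized("NFC", text)


# Unambiguous Latin illustration: "e" followed by a combining acute accent
# (two code points) composes into the single character U+00E9 under NFC.
decomposed = "e\u0301"
composed = normalize_mon(decomposed)
print(len(decomposed), len(composed))  # 2 1

# The same call applied to a Mon string before it is passed to the tokenizer.
prompt = normalize_mon("ပ္ဍဲကွာန်ဗော်ဒိုဟ်")
print(is_nfc(prompt))  # True
```

Normalizing is idempotent, so applying it to text that is already NFC is harmless.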
## Vocabulary Expansion
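The base SeaLLMs-v3 tokenizer was expanded for the Mon script: Mon subword tokens were added to the vocabulary and the embedding layer was extended to cover them. The goal is a better compression ratio (fewer tokens for the same Mon text) and better linguistic atomicity (meaningful units of Mon script kept together as single tokens rather than split into fragments).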
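One way to make "compression ratio" concrete is the average number of characters per token over a text sample: the higher the value, the fewer tokens are needed for the same text. The sketch below is an illustration only; the card does not define the metric it used, and `chars_per_token` is a hypothetical helper. It accepts any tokenize function, so toy tokenizers stand in here for the base and expanded tokenizers.

```python
import unicodedata
from typing import Callable, Iterable, List


def chars_per_token(tokenize: Callable[[str], List[str]],
                    texts: Iterable[str]) -> float:
    """Average characters per token over `texts` (higher = better compression)."""
    total_chars = 0
    total_tokens = 0
    for text in texts:
        # Normalize first so the measurement matches NFC-normalized training text.
        text = unicodedata.normalize("NFC", text)
        total_chars += len(text)
        total_tokens += len(tokenize(text))
    if total_tokens == 0:
        raise ValueError("no tokens produced; cannot compute a ratio")
    return total_chars / total_tokens


# Toy tokenizers (assumptions for illustration, not the real tokenizers):
# `list` splits a string into single characters, `str.split` into whitespace words.
sample = ["ab cd", "efg"]
print(chars_per_token(list, sample))       # 8 chars / 8 tokens = 1.0
print(chars_per_token(str.split, sample))  # 8 chars / 3 tokens
```

To compare real tokenizers, the `tokenize` method of a tokenizer loaded with `AutoTokenizer.from_pretrained` can be passed in place of the toy functions, once for the base model and once for this model, using the same Mon sample for both.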
## Usage
Use this model with the Hugging Face `transformers` library:
```python
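from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "janakhpon/mon-lm-seallm-v3-1.5b"

# Load the expanded tokenizer and the model (device_map="auto" needs `accelerate`).
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Tokenize a Mon prompt and move the tensors to the model's device.
prompt = "ပ္ဍဲကွာန်ဗော်ဒိုဟ်"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate up to 100 new tokens, then decode the prompt plus its continuation.
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))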
```
## Acknowledgments
This model was trained as part of the Mon Language AI initiative. Credit goes to the Mon community for its corpus collection efforts.