---
language:
  - mnw
license: mit
base_model: SeaLLMs/SeaLLMs-v3-1.5B
tags:
  - mon
  - mnw
  - seallm
  - qwen2.5
  - cpt
  - continual-pretraining
  - tokenizer-expansion
datasets:
  - janakhpon/mon-corpus-collection
model-index:
  - name: Mon-LM-SeaLLMs-v3-1.5B
    results: []
---

# Mon-LM (SeaLLMs-v3-1.5B)

Mon-LM is a large language model for the Mon language (mnw). This variant is based on **SeaLLMs-v3-1.5B** (a Qwen2.5-based model optimized for Southeast Asian languages) and has undergone Continual Pre-Training (CPT) on a Mon language corpus.

## Model Details

- **Base Model:** [SeaLLMs/SeaLLMs-v3-1.5B](https://huggingface.co/SeaLLMs/SeaLLMs-v3-1.5B)
- **Language:** Mon (mnw)
- **Training Method:** Continual Pre-Training (CPT) via QLoRA
- **Tokenizer:** Expanded SeaLLMs-v3 tokenizer with ~3,000 Mon-specific tokens (SentencePiece Unigram)
- **Normalization:** All Mon text is NFC normalized.
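
Because the training text is NFC normalized, it can help to normalize input text the same way before tokenizing it. The following is a minimal sketch using Python's standard `unicodedata` module; the helper names `normalize_mon` and `is_nfc` are illustrative and are not part of this repository.

```python
import unicodedata


def normalize_mon(text: str) -> str:
    """Return `text` in Unicode Normalization Form C (NFC)."""
    return unicodedata.normalize("NFC", text)


def is_nfc(text: str) -> bool:
    """Report whether `text` is already in NFC (requires Python 3.8+)."""
    return unicodedata.is_normalized("NFC", text)


# U+1025 followed by U+102E is canonically equivalent to the single
# code point U+1026, so NFC composes the pair into one character.
decomposed = "\u1025\u102e"
composed = normalize_mon(decomposed)
print(len(decomposed), len(composed), is_nfc(decomposed), is_nfc(composed))
```

Normalization is idempotent, so applying `normalize_mon` to text that is already in NFC leaves it unchanged.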

## Vocabulary Expansion

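The base SeaLLMs-v3 tokenizer was expanded to cover the Mon script. Mon subword tokens were added to the vocabulary, with corresponding entries in the embedding layer, to improve the compression ratio (fewer tokens for the same Mon text) and to keep linguistically atomic units of Mon intact.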
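
One way to check the effect of an expanded vocabulary is to compare tokens per character across tokenizers on the same text. The sketch below is an illustration, not the evaluation used for this model: `tokens_per_char` and the two toy tokenizers are hypothetical helpers, and the sample strings are placeholders rather than Mon text.

```python
from typing import Callable, Iterable, List


def tokens_per_char(tokenize: Callable[[str], List[str]], texts: Iterable[str]) -> float:
    """Average number of tokens emitted per input character (lower is better)."""
    total_tokens = 0
    total_chars = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_chars += len(text)
    if total_chars == 0:
        raise ValueError("no characters to measure")
    return total_tokens / total_chars


# Toy stand-ins for a poorly fitted and a better fitted tokenizer.
def char_tokenizer(text: str) -> List[str]:
    return list(text)  # one token per character


def whitespace_tokenizer(text: str) -> List[str]:
    return text.split()  # one token per whitespace-separated chunk


sample = ["abc def", "gh"]  # placeholder text, 9 characters in total
print(tokens_per_char(char_tokenizer, sample))
print(tokens_per_char(whitespace_tokenizer, sample))
```

With real tokenizers, the `tokenize` method of a Hugging Face tokenizer can be passed in place of the toy functions, for example once for the base tokenizer and once for the expanded one, using the same NFC-normalized Mon sample for both.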

## Usage

Use this model with the Hugging Face `transformers` library:

```python
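import unicodedata

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "janakhpon/mon-lm-seallm-v3-1.5b"

# Load the expanded tokenizer and the model weights from the Hub.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# NFC-normalize the prompt to match the normalization of the training text.
prompt = unicodedata.normalize("NFC", "ပ္ဍဲကွာန်ဗော်ဒိုဟ်")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate up to 100 new tokens, then decode the prompt plus its continuation.
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))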
```

## Acknowledgments

This model was trained as part of the Mon Language AI initiative. Credit goes to the Mon community for its corpus collection efforts.