Mon-LM (SeaLLMs-v3-1.5B)
Mon-LM is a large language model for the Mon language (ISO 639-3: mnw). This variant is based on SeaLLMs-v3-1.5B (a Qwen2-based model optimized for Southeast Asian languages) and has undergone Continual Pre-Training (CPT) on a Mon language corpus.
Model Details
- Base Model: SeaLLMs/SeaLLMs-v3-1.5B
- Language: Mon (mnw)
- Training Method: Continual Pre-Training (CPT) via QLoRA
- Tokenizer: Expanded SeaLLM tokenizer with ~3,000 Mon-specific tokens (SentencePiece Unigram)
- Normalization: All Mon text is NFC normalized.
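Because the training text is NFC normalized, it is probably worth normalizing prompts the same way before tokenizing them. The following is a minimal sketch using Python's standard library; the helper name `normalize_mon_text` is hypothetical and not part of this model's code.

```python
import unicodedata


def normalize_mon_text(text: str) -> str:
    """Return `text` in Unicode Normalization Form C (NFC)."""
    return unicodedata.normalize("NFC", text)


# Normalize a Mon prompt before passing it to the tokenizer.
prompt = normalize_mon_text("ပ္ဍဲကွာန်ဗော်ဒိုဟ်")

# unicodedata.is_normalized is available from Python 3.8 onward.
print(unicodedata.is_normalized("NFC", prompt))
```

Normalization is idempotent, so applying it to text that is already in NFC leaves the text unchanged.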
Vocabulary Expansion
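The base SeaLLMs-v3 tokenizer was expanded to cover the Mon script. Mon subword tokens were added to the vocabulary, with matching entries in the embedding layer, to improve the compression ratio (fewer tokens for the same Mon text) and to keep linguistically meaningful Mon units intact rather than fragmenting them into many small pieces.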
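To illustrate why adding Mon subwords changes the compression ratio, here is a toy sketch in pure Python. It uses a greedy longest-match tokenizer with a single-character fallback, which is an assumption made for illustration only; the actual tokenizer is a SentencePiece Unigram extension and behaves differently. The function `greedy_tokenize` and the pieces in `mon_pieces` are hypothetical.

```python
def greedy_tokenize(text, vocab):
    """Longest-match-first tokenization; unknown spans fall back to single characters."""
    max_len = max((len(piece) for piece in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


prompt = "ပ္ဍဲကွာန်ဗော်ဒိုဟ်"

# Hypothetical subword pieces: arbitrary slices of the prompt standing in
# for Mon subwords that a trained tokenizer might learn.
mon_pieces = {prompt[:4], prompt[4:9]}

base_tokens = greedy_tokenize(prompt, set())          # no Mon-specific vocabulary
expanded_tokens = greedy_tokenize(prompt, mon_pieces)  # with added Mon pieces

print(len(base_tokens), len(expanded_tokens))
print(f"characters per token: {len(prompt) / len(base_tokens):.2f} "
      f"-> {len(prompt) / len(expanded_tokens):.2f}")
```

With the added pieces, the same text is covered by fewer tokens, which is the effect the vocabulary expansion aims for on real Mon text.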
Usage
Use this model with the Hugging Face transformers library:
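```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "janakhpon/mon-lm-seallm-v3-1.5b"

# Load the expanded tokenizer and the CPT model from the Hugging Face Hub.
# device_map="auto" requires the accelerate package to be installed.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Mon prompt; the model continues this text.
prompt = "ပ္ဍဲကွာန်ဗော်ဒိုဟ်"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate up to 100 new tokens and decode the full sequence.
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```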
Acknowledgments
This model was trained as part of the Mon Language AI initiative. Credits to the Mon community for the corpus collection efforts.