---
language:
- mnw
license: mit
base_model: SeaLLMs/SeaLLMs-v3-1.5B
tags:
- mon
- mnw
- seallm
- qwen2.5
- cpt
- continual-pretraining
- tokenizer-expansion
datasets:
- janakhpon/mon-corpus-collection
model-index:
- name: Mon-LM-SeaLLMs-v3-1.5B
  results: []
---

# Mon-LM (SeaLLMs-v3-1.5B)

Mon-LM is a large language model for the Mon language (mnw). This variant is based on **SeaLLMs-v3-1.5B** (a Qwen2.5-based model optimized for Southeast Asian languages) and has undergone Continual Pre-Training (CPT) on a Mon language corpus.

## Model Details

- **Base Model:** [SeaLLMs/SeaLLMs-v3-1.5B](https://huggingface.co/SeaLLMs/SeaLLMs-v3-1.5B)
- **Language:** Mon (mnw)
- **Training Method:** Continual Pre-Training (CPT) via QLoRA
- **Tokenizer:** Expanded SeaLLMs tokenizer with ~3,000 Mon-specific tokens (SentencePiece Unigram)
- **Normalization:** All Mon text is NFC-normalized.

## Vocabulary Expansion

The base SeaLLMs-v3 tokenizer was expanded to cover the Mon script. Mon subword tokens were added to the vocabulary and the embedding layer, so that Mon text is encoded in fewer tokens (a better compression ratio) and the subwords align more closely with Mon linguistic units.

## Usage

Use this model with the Hugging Face `transformers` library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "janakhpon/mon-lm-seallm-v3-1.5b"

# Load the expanded tokenizer and the model weights from the Hub.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Mon-language prompt; this is a base (non-chat) model, so it continues the text.
prompt = "ပ္ဍဲကွာန်ဗော်ဒိုဟ်"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate up to 100 new tokens and decode the full sequence.
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Acknowledgments

This model was trained as part of the Mon Language AI initiative. Credits to the Mon community for the corpus collection efforts.
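
## Input Normalization (sketch)

The model details state that all Mon text is NFC-normalized, so it is probably worth applying the same normalization to prompts before tokenizing them. The snippet below is a minimal sketch using only Python's standard `unicodedata` module. The helper name `normalize_mon` is hypothetical and not part of this repository, and whether the tokenizer already normalizes input on its own is not stated in this card.

```python
import unicodedata


def normalize_mon(text: str) -> str:
    """Return `text` in Unicode NFC form.

    Hypothetical helper for illustration; not shipped with the model.
    """
    return unicodedata.normalize("NFC", text)


prompt = "ပ္ဍဲကွာန်ဗော်ဒိုဟ်"
normalized_prompt = normalize_mon(prompt)

# Sanity check: the result is in NFC form (is_normalized needs Python 3.8+).
assert unicodedata.is_normalized("NFC", normalized_prompt)
```

The normalized string can then be passed to the tokenizer in place of the raw prompt in the usage example. Normalizing is idempotent, so applying it to text that is already in NFC form leaves it unchanged.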