Mon-LM (Qwen2.5-0.5B)

Mon-LM is a language model for the Mon language (ISO 639-3: mnw). It is based on Qwen2.5-0.5B and was adapted through Continual Pre-Training (CPT) on a Mon-language corpus.

Model Details

Base Model: Qwen/Qwen2.5-0.5B
Language: Mon (mnw)
Training Method: Continual Pre-Training (CPT) via QLoRA
Tokenizer: Expanded Qwen2.5 tokenizer with ~3,000 Mon-specific tokens (SentencePiece Unigram)
Normalization: All Mon text is NFC normalized.
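
Because the model card states that Mon text is NFC normalized, it is probably safest to apply the same normalization to prompts before tokenizing them. The following is a minimal sketch using only the Python standard library; the helper names `normalize_mon` and `is_nfc` are hypothetical and are not part of the released model or tokenizer.

```python
import unicodedata


def normalize_mon(text: str) -> str:
    """Return text in Unicode Normalization Form C (NFC).

    Hypothetical helper: the model card only states that Mon text is
    NFC normalized, not how the author implemented it.
    """
    return unicodedata.normalize("NFC", text)


def is_nfc(text: str) -> bool:
    """Check whether text is already in NFC (requires Python 3.8+)."""
    return unicodedata.is_normalized("NFC", text)


# A decomposed sequence (base letter + combining mark) is composed
# into a single code point by NFC.
decomposed = "e\u0301"
composed = normalize_mon(decomposed)
```

In practice a prompt would be passed through `normalize_mon` right before the tokenizer call, so that visually identical strings with different code point sequences map to the same tokens.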

Vocabulary Expansion

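The base Qwen2.5 tokenizer was expanded to cover the Mon script. New Mon subword tokens were added to the vocabulary, with matching rows in the embedding layer, so that Mon text is encoded in fewer tokens (a better compression ratio) and token boundaries align more closely with meaningful units of the script (linguistic atomicity).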
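
One way to check the effect of the expansion is to compare how many characters each token covers on average, before and after. The sketch below is an illustration under stated assumptions: `chars_per_token` is a hypothetical helper, it accepts any callable that maps a string to a sequence of tokens, and the two toy tokenizers stand in for the base and expanded tokenizers. No measurements from the actual model are implied.

```python
from typing import Callable, Sequence


def chars_per_token(tokenize: Callable[[str], Sequence], text: str) -> float:
    """Average number of Unicode code points covered by one token.

    Higher values mean better compression. Note that len(text) counts
    code points, not user-perceived characters (grapheme clusters).
    """
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("tokenizer produced no tokens for the given text")
    return len(text) / len(tokens)


# Toy stand-ins (assumptions for illustration only):
# - `list` splits a string into single characters, mimicking a tokenizer
#   that falls back to very small pieces for an unfamiliar script.
# - `str.split` splits on whitespace, mimicking larger subword units.
sample = "ab cd"
fine_grained = chars_per_token(list, sample)
coarse_grained = chars_per_token(str.split, sample)
```

With the transformers library, the `tokenize` method of a loaded tokenizer could be passed as the callable, for example once for `Qwen/Qwen2.5-0.5B` and once for this model, using the same NFC-normalized Mon sample text for both.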

Usage

Use this model with the Hugging Face transformers library:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "janakhpon/mon-lm-qwen2.5-0.5b"

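# Load the expanded tokenizer and the model weights.
# device_map="auto" requires the accelerate package to be installed.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")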

prompt = "ပ္ဍဲကွာန်ဗော်ဒိုဟ်"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate up to 100 new tokens, then decode the full sequence (prompt plus continuation).
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Acknowledgments

This model was trained as part of the Mon Language AI initiative. Credit goes to the Mon community for their corpus collection efforts.

Model size: 0.6B params (Safetensors)
Tensor type: BF16
