mahwizzzz/qyra-350m is an Urdu language model produced by continued pretraining of LiquidAI/LFM2-350M.
Base model: LiquidAI/LFM2-350M
Dataset: ReySajju742/shaistagi_clean (train split)
Text column: text

Training hyperparameters:
num_train_epochs = 3
per_device_train_batch_size = 1
gradient_accumulation_steps = 16 (effective batch size = 16)
learning_rate = 2.0e-4
weight_decay = 0.01
warmup_ratio = 0.03
max_grad_norm = 1.0
max_seq_length = 512
Precision: bf16 (fallback to fp16)
gradient_checkpointing = True
save_strategy = "epoch", save_total_limit = 3

Memory-safe data handling:
load_dataset(..., streaming=True)
.map() calls: load_from_cache_file=False, keep_in_memory=False, writer_batch_size=1000
text column retained

Example prompts:
اردو ایک خوبصورت زبان ہے۔ (Urdu is a beautiful language.)
مجھے پاکستان کی تاریخ کے بارے میں بتائیں۔ (Tell me about the history of Pakistan.)
آج موسم کیسا ہے؟ (How is the weather today?)
آپ کا پسندیدہ شاعر کون ہے؟ (Who is your favorite poet?)

Usage:
from transformers import AutoTokenizer, AutoModelForCausalLM
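
# Load the tokenizer and model weights from the Hugging Face Hub.
model_id = "mahwizzzz/qyra-350m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Urdu prompt: "Urdu is a beautiful language."
prompt = "اردو ایک خوبصورت زبان ہے۔"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate up to 64 new tokens and decode the full sequence back to text.
gen = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(gen[0], skip_special_tokens=True))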
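
The training settings listed in this card can be gathered into one set of keyword arguments, which also makes the batch-size arithmetic explicit. The following is a minimal sketch, not the repository's actual script. The helper names `build_training_kwargs` and `schedule_summary`, the `output_dir` value, and the sequence count in the usage lines are hypothetical. The dictionary keys follow `transformers.TrainingArguments` parameter names. `max_seq_length` is left out because it is applied at tokenization time rather than passed to `TrainingArguments`. The step counts are an approximation of what a trainer would compute.

```python
import math


def build_training_kwargs(bf16_supported: bool, output_dir: str = "outputs") -> dict:
    """Collect the hyperparameters from this card as keyword arguments.

    bf16_supported would typically come from torch.cuda.is_bf16_supported();
    it is a plain argument here so the sketch runs without a GPU.
    output_dir is a placeholder path chosen for this sketch.
    """
    return {
        "output_dir": output_dir,
        "num_train_epochs": 3,
        "per_device_train_batch_size": 1,
        "gradient_accumulation_steps": 16,
        "learning_rate": 2.0e-4,
        "weight_decay": 0.01,
        "warmup_ratio": 0.03,
        "max_grad_norm": 1.0,
        # Prefer bf16 and fall back to fp16 when bf16 is unavailable.
        "bf16": bf16_supported,
        "fp16": not bf16_supported,
        "gradient_checkpointing": True,
        "save_strategy": "epoch",
        "save_total_limit": 3,
    }


def schedule_summary(num_sequences: int, kwargs: dict, num_devices: int = 1) -> dict:
    """Estimate the effective batch size, optimizer steps, and warmup steps."""
    effective_batch = (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )
    steps_per_epoch = math.ceil(num_sequences / effective_batch)
    total_steps = steps_per_epoch * kwargs["num_train_epochs"]
    warmup_steps = math.ceil(total_steps * kwargs["warmup_ratio"])
    return {
        "effective_batch_size": effective_batch,
        "total_optimizer_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


if __name__ == "__main__":
    training_kwargs = build_training_kwargs(bf16_supported=True)
    # 16_000 training sequences is a made-up figure used only for illustration.
    print(schedule_summary(16_000, training_kwargs))
```

With `transformers` installed, the dictionary could be passed on as `TrainingArguments(**training_kwargs)`. On a single device, a per-device batch of 1 with 16 accumulation steps gives the effective batch size of 16 stated above.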
Training code available at: GitHub Repository
Key files:
src/pretrain.py - Continued pretraining script
src/data_utils.py - Memory-safe data pipeline
src/tokenizer_utils.py - Tokenizer expansion
config.yaml - Configuration

This model is intended for:
Not intended for:
@misc{qyra-350m,
title={mahwizzzz/qyra-350m: Continued Pretraining of LFM-2 for Urdu},
author={Mahwiz Khalil},
year={2025},
url={https://huggingface.co/mahwizzzz/qyra-350m}
}