CADT-IDRI/Khmer_News_classification
This model is a fine-tuned version of FacebookAI/xlm-roberta-base for Khmer-language news text classification. It was trained to assign each Cambodian news article to one of six predefined topic categories: economic, entertainment, life, politic, sport, and technology.
| Metric | Value |
|---|---|
| Accuracy | 94% |
| F1 (macro) | 94% |
| AUC | 0.9933 |
| Error rate | 0.056 |

| Category | Precision | Recall | F1 |
|---|---|---|---|
| economic | 0.91 | 0.94 | 0.93 |
| entertainment | 0.94 | 0.97 | 0.96 |
| life | 0.87 | 0.82 | 0.85 |
| politic | 0.97 | 0.96 | 0.97 |
| sport | 0.97 | 0.99 | 0.98 |
| technology | 0.93 | 0.92 | 0.92 |
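
As a quick sanity check on the tables above, macro F1 is the unweighted mean of the per-category F1 scores, and the error rate is one minus accuracy. The sketch below recomputes both from the rounded values shown in the tables, so small differences from the reported figures are expected.

```python
# Per-category F1 scores copied from the table above (already rounded to two decimals).
per_class_f1 = {
    "economic": 0.93,
    "entertainment": 0.96,
    "life": 0.85,
    "politic": 0.97,
    "sport": 0.98,
    "technology": 0.92,
}

# Macro F1 is the unweighted mean over categories.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Accuracy implied by the reported error rate.
error_rate = 0.056
implied_accuracy = 1.0 - error_rate

print(f"macro F1 from table: {macro_f1:.3f}")
print(f"accuracy implied by error rate: {implied_accuracy:.3f}")
```

Both values come out at roughly 0.935 and 0.944, which is in line with the reported 94% once rounded.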
```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub
classifier = pipeline(
    "text-classification",
    model="kidkidmoon/xlm-r-khmer-news-classification",
)

text = "your Khmer text here"  # replace with a Khmer news sentence or article
result = classifier(text)
print(result)
```
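
The pipeline returns a list with one dictionary per input, each holding a `label` and a `score`. A minimal sketch of acting on that output follows. The sample result is illustrative only, the `accept_prediction` helper and its `0.5` threshold are hypothetical choices rather than part of the model, and the actual label strings depend on the model's `id2label` config.

```python
def accept_prediction(result, threshold=0.5):
    """Return the predicted label, or None if its score is below the threshold.

    `result` is expected in the shape a text-classification pipeline returns
    for a single input: a list containing one {"label": ..., "score": ...} dict.
    """
    top = result[0]
    return top["label"] if top["score"] >= threshold else None


# Illustrative output only, not a real prediction from the model.
sample_result = [{"label": "sport", "score": 0.98}]
print(accept_prediction(sample_result))        # confident prediction is kept
print(accept_prediction(sample_result, 0.99))  # score is below the threshold
```

A threshold like this can be useful for routing uncertain articles to manual review, but the right value would need to be chosen on held-out data.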
Or manually with the tokenizer:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "kidkidmoon/xlm-r-khmer-news-classification"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize, truncating inputs longer than the model's maximum length
inputs = tokenizer("your Khmer text here", return_tensors="pt", truncation=True)

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits

# Pick the highest-scoring class and map it to its label name
predicted_class = logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_class])
```
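
The manual path above reports only the top class. To see a probability for every category, a softmax can be applied to the logits. The sketch below does this in plain Python on made-up logit values; the `id2label` ordering shown is an assumption, and the real mapping should be read from `model.config.id2label`.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities, subtracting the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical label order; read the real one from model.config.id2label.
id2label = {0: "economic", 1: "entertainment", 2: "life",
            3: "politic", 4: "sport", 5: "technology"}

# Illustrative logits for one input, not produced by the model.
example_logits = [0.2, -1.1, 0.5, -0.3, 3.4, 0.0]

probs = softmax(example_logits)

# Rank categories from most to least probable.
ranked = sorted(zip(id2label.values(), probs), key=lambda p: p[1], reverse=True)
for label, prob in ranked:
    print(f"{label}: {prob:.3f}")
```

With the real model, `torch.softmax(logits, dim=-1)` performs the same conversion directly on the logits tensor.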
If you use this model, please cite:
```bibtex
@misc{kimlangsrun2025khmer,
  author    = {Srun Kimlang},
  title     = {XLM-RoBERTa Fine-tuned for Khmer News Classification},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/kidkidmoon/xlm-r-khmer-news-classification}
}
```