
PersianPunc: ParsBERT for Persian Punctuation Restoration

This model is a fine-tuned version of ParsBERT for Persian punctuation restoration.

Model Description

  • Language: Persian (Farsi)
  • Task: Token Classification (Punctuation Restoration)
  • Base Model: ParsBERT
  • Dataset: PersianPunc (17M samples)
  • Training Subset: 1M samples

Performance

Evaluated on 1,000 test sentences:

  • Macro-averaged F1: 91.33%
  • Micro-averaged F1: 97.28%
  • Full Sentence Match Rate: 61.80%

Per-class Performance:

| Punctuation        | Precision | Recall  | F1-Score |
|--------------------|-----------|---------|----------|
| Persian Comma (،)  | 84.08%    | 76.35%  | 80.03%   |
| Period (.)         | 98.55%    | 98.86%  | 98.71%   |
| Question Mark (؟)  | 87.50%    | 90.32%  | 88.89%   |
| Colon (:)          | 91.37%    | 89.55%  | 90.45%   |
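For reference, the macro-averaged F1 is the unweighted mean of the per-class F1 scores, while the micro-averaged F1 pools true/false positives over all tokens, so it tracks the dominant EMPTY class. A small self-contained sketch of both computations (this is an illustration, not the paper's evaluation script):

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Per-class F1 plus macro- and micro-averages for single-label tagging."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p where gold was t
            fn[t] += 1  # missed gold t
    per_class = {}
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        per_class[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    # Macro: unweighted mean over classes (rare classes count as much as EMPTY).
    macro = sum(per_class.values()) / len(classes)
    # Micro: pool counts over all classes before computing P/R/F1.
    total_tp, total_fp, total_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro_p = total_tp / (total_tp + total_fp)
    micro_r = total_tp / (total_tp + total_fn)
    micro = 2 * micro_p * micro_r / (micro_p + micro_r)
    return per_class, macro, micro

# Toy example over the model's label names:
per_class, macro, micro = f1_scores(
    ["EMPTY", "EMPTY", "COMMA", "PERIOD"],
    ["EMPTY", "COMMA", "COMMA", "PERIOD"],
)
```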

Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "MohammadJRanjbar/parsbert-persian-punctuation"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Prepare input (text without punctuation)
text = "سلام چطوری امروز هوا خیلی خوبه"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Get per-token label predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Label mapping
id2label = {
    0: "EMPTY",
    1: "COMMA",
    2: "QUESTION",
    3: "PERIOD",
    4: "COLON",
}

# Map predictions to punctuation
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [id2label[p.item()] for p in predictions[0]]

# Reconstruct text with punctuation
punct_map = {
    "COMMA": "،",
    "QUESTION": "؟",
    "PERIOD": ".",
    "COLON": ":",
}

result = []
for token, label in zip(tokens, labels):
    if token in ["[CLS]", "[SEP]", "[PAD]"]:
        continue  # skip special tokens
    result.append(token)
    if label in punct_map:
        result.append(punct_map[label])

# Glue WordPiece continuation tokens ("##...") back onto their words
punctuated_text = " ".join(result).replace(" ##", "")
print(punctuated_text)
```
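One caveat on the reconstruction above: `" ".join(result).replace(" ##", "")` can strand a punctuation mark inside a word if the model labels a non-final WordPiece. A word-aware merge avoids this by attaching punctuation only after a complete word (a sketch; `merge_wordpieces` is our helper name, not part of the model or library):

```python
def merge_wordpieces(tokens, labels, punct_map,
                     specials=("[CLS]", "[SEP]", "[PAD]")):
    """Rebuild words from WordPiece tokens, appending the predicted
    punctuation (taken from a word's last sub-token) after each word."""
    words = []
    word, word_label = "", "EMPTY"
    for tok, lab in zip(tokens, labels):
        if tok in specials:
            continue
        if tok.startswith("##"):
            word += tok[2:]          # continuation piece: extend current word
        else:
            if word:                 # flush previous word with its punctuation
                words.append(word + punct_map.get(word_label, ""))
            word = tok
        word_label = lab             # the last piece's label wins for the word
    if word:
        words.append(word + punct_map.get(word_label, ""))
    return " ".join(words)
```

With the `tokens`, `labels`, and `punct_map` from the snippet above, `merge_wordpieces(tokens, labels, punct_map)` replaces the reconstruction loop.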

Training Details

  • Training Data: 989,000 samples from PersianPunc dataset
  • Validation Data: 10,000 samples
  • Test Data: 1,000 samples
  • Epochs: 3
  • Batch Size: 680 (effective, with gradient accumulation)
  • Learning Rate: 2e-5
  • Optimizer: AdamW
  • Weight Decay: 0.01

Citation

If you use this model, please cite:

```bibtex
@inproceedings{kalahroodi-etal-2026-persianpunc,
    title = "{P}ersian{P}unc: A Large-Scale Dataset and {BERT}-Based Approach for {P}ersian Punctuation Restoration",
    author = "Kalahroodi, Mohammad Javad Ranjbar  and
      Faili, Heshaam  and
      Shakery, Azadeh",
    editor = "Merchant, Rayyan  and
      Megerdoomian, Karine",
    booktitle = "The Proceedings of the First Workshop on {NLP} and {LLM}s for the {I}ranian Language Family",
    month = mar,
    year = "2026",
    address = "Rabat, Morocco",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.silkroadnlp-1.11/",
    doi = "10.18653/v1/2026.silkroadnlp-1.11",
    pages = "105--113",
    ISBN = "979-8-89176-371-5",
    abstract = "Punctuation restoration is essential for improving the readability and downstream utility of automatic speech recognition (ASR) outputs, yet remains underexplored for Persian despite its importance. We introduce PersianPunc, a large-scale, high-quality dataset of 17 million samples for Persian punctuation restoration, constructed through systematic aggregation and filtering of existing textual resources. We formulate punctuation restoration as a token-level sequence labeling task and fine-tune ParsBERT to achieve strong performance. Through comparative evaluation, we demonstrate that while large language models can perform punctuation restoration, they suffer from critical limitations: over-correction tendencies that introduce undesired edits beyond punctuation insertion (particularly problematic for speech-to-text pipelines) and substantially higher computational requirements. Our lightweight BERT-based approach achieves a macro-averaged F1 score of 91.33{\%} on our test set while maintaining efficiency suitable for real-time applications. We make our dataset and model publicly available to facilitate future research in Persian NLP and provide a scalable framework applicable to other morphologically rich, low-resource languages."
}
```

License

MIT License

Authors

  • Mohammad Javad Ranjbar Kalahroodi (University of Tehran)
  • Heshaam Faili (University of Tehran)
  • Azadeh Shakery (University of Tehran & IPM)

Contact

For questions or issues, please contact: mohammadjranjbar@ut.ac.ir
