---
language:
- id
license: mit
tags:
- spacy
- token-classification
- named-entity-recognition
- indonesian
- ner
datasets:
- grit-id/id_nergrit_corpus
metrics:
- precision
- recall
- f1
model-index:
- name: id_nergrit_indonesian_spacy
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: Nergrit Corpus
      type: grit-id/id_nergrit_corpus
    metrics:
    - type: f1
      value: 0.7484
      name: F1 Score
    - type: precision
      value: 0.7748
      name: Precision
    - type: recall
      value: 0.7237
      name: Recall
widget:
- text: "Presiden Joko Widodo mengunjungi Jakarta pada tanggal 17 Agustus 2023."
- text: "Bank Indonesia mengumumkan suku bunga sebesar 5.75 persen."
- text: "Menteri Keuangan Sri Mulyani menyatakan APBN 2023 mencapai Rp 3000 triliun."
---
# Indonesian Named Entity Recognition Model
This is a spaCy model trained on the [Nergrit Corpus](https://huggingface.co/datasets/grit-id/id_nergrit_corpus) for Indonesian Named Entity Recognition.
## Model Description
This model recognizes 19 entity types in Indonesian text:
- **PER** (Person): Names of people
- **ORG** (Organization): Companies, institutions
- **GPE** (Geopolitical Entity): Countries, cities, states
- **LOC** (Location): Non-GPE locations (facilities are labeled FAC)
- **DAT** (Date): Absolute or relative dates
- **MON** (Money): Monetary values
- **PRC** (Percent): Percentages
- **TIM** (Time): Times of day
- **QTY** (Quantity): Measurements and quantities
- **CRD** (Cardinal): Cardinal numbers
- **ORD** (Ordinal): Ordinal numbers
- **EVT** (Event): Named events
- **FAC** (Facility): Buildings, airports, stations
- **LAW** (Law): Legal documents, laws
- **LAN** (Language): Named languages
- **NOR** (Political Organization): Political entities
- **PRD** (Product): Products, brands
- **REG** (Religion): Religious groups
- **WOA** (Work of Art): Titles of books, songs, etc.
## Performance
| Metric | Score |
|--------|-------|
| **F1 Score** | 74.84% |
| **Precision** | 77.48% |
| **Recall** | 72.37% |
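
As a quick consistency check, the overall F1 can be recomputed from the reported precision and recall. The sketch below assumes the F1 score is the harmonic mean of those two values, which is the usual definition for entity-level scoring. The function name `f1_from_pr` is illustrative and not part of the model package.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the Performance table.
f1 = f1_from_pr(0.7748, 0.7237)
print(f"{f1:.4f}")  # 0.7484
```

The result agrees with the reported 74.84% to four decimal places.
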
### Top Performing Entities
| Entity | F1 Score |
|--------|----------|
| PRC (Percent) | 93.72% |
| MON (Money) | 92.56% |
| DAT (Date) | 92.41% |
| TIM (Time) | 88.51% |
| CRD (Cardinal) | 86.23% |
## Usage
### Installation
```bash
pip install spacy
pip install https://huggingface.co/nerdv2/id_nergrit_indonesian_spacy/resolve/main/id_nergrit_indonesian_spacy-1.0.0-py3-none-any.whl
```
### Basic Usage
```python
import spacy
# Load the model
nlp = spacy.load("id_nergrit_indonesian_spacy")
# Process text
text = "Presiden Joko Widodo mengunjungi Jakarta pada tanggal 17 Agustus 2023."
doc = nlp(text)
# Extract entities
for ent in doc.ents:
    print(f"{ent.text} -> {ent.label_}")
```
Output:
```
Joko Widodo -> PER
Jakarta -> GPE
17 Agustus 2023 -> DAT
```
### Batch Processing
```python
import spacy
nlp = spacy.load("id_nergrit_indonesian_spacy")
texts = [
    "Bank Indonesia mengumumkan suku bunga sebesar 5.75 persen.",
    "Menteri Keuangan Sri Mulyani menyatakan APBN 2023 mencapai Rp 3000 triliun.",
]
# nlp.pipe streams the texts through the pipeline in batches
for doc in nlp.pipe(texts):
    print([(ent.text, ent.label_) for ent in doc.ents])
```
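
For downstream use it is often convenient to collect batch results by entity type. The sketch below shows one way to do that. `group_by_label` and `extract_pairs` are hypothetical helper names, not part of the model package, and `extract_pairs` assumes the model has been installed as described under Installation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def group_by_label(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (entity_text, label) pairs into {label: [entity_text, ...]}."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


def extract_pairs(
    texts: Iterable[str], model: str = "id_nergrit_indonesian_spacy"
) -> Iterator[Tuple[str, str]]:
    """Yield (entity_text, label) pairs for every text, streamed via nlp.pipe."""
    import spacy  # imported lazily so group_by_label works without spaCy

    nlp = spacy.load(model)
    for doc in nlp.pipe(texts):
        for ent in doc.ents:
            yield ent.text, ent.label_


# Hand-written pairs for illustration; not actual model output.
sample = [("Joko Widodo", "PER"), ("Jakarta", "GPE"), ("Sri Mulyani", "PER")]
print(group_by_label(sample))
```

With the model installed, `group_by_label(extract_pairs(texts))` would produce the same kind of dictionary from real predictions.
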
### Using the Package Installed from the Hugging Face Hub
```python
import spacy
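# spacy.load() resolves the locally installed package (see Installation);
# it does not download the model from the Hugging Face Hub at call time.
nlp = spacy.load("id_nergrit_indonesian_spacy")
doc = nlp("Universitas Indonesia terletak di Depok, Jawa Barat.")
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")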
```
## Training Data
The model was trained on the Nergrit Corpus dataset:
- **Training examples**: 12,532
- **Validation examples**: 2,521
- **Test examples**: 2,399
Dataset source: [grit-id/id_nergrit_corpus](https://huggingface.co/datasets/grit-id/id_nergrit_corpus)
## Training Procedure
### Model Architecture
- **Base**: spaCy's Tok2Vec + Transition-based NER
- **Embedding**: MultiHashEmbed with 96-dimensional vectors
- **Encoder**: MaxoutWindowEncoder (depth=4, window=1)
- **Parser**: Transition-based with 64 hidden units
### Training Configuration
- **Optimizer**: Adam (lr=0.001)
- **Batch size**: Dynamic (100-1000 words)
- **Max steps**: 20,000
- **Dropout**: 0.1
- **Evaluation frequency**: Every 200 steps
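
The architecture and training settings listed above map naturally onto dotted keys in a spaCy v3 training config. The sketch below collects them as overrides and renders them as `spacy train` command-line arguments. The key names assume the layout of spaCy's default quickstart config for a CPU NER pipeline (for example, `training.batcher.size.start` assumes a compounding batch-size schedule). The config actually used for this model is not included in this card, so treat the mapping as an assumption. `config.cfg` is a placeholder path.

```python
# Hyperparameters from this card, keyed by assumed spaCy v3 config paths.
OVERRIDES = {
    "components.tok2vec.model.encode.width": 96,
    "components.tok2vec.model.encode.depth": 4,
    "components.tok2vec.model.encode.window_size": 1,
    "components.ner.model.hidden_width": 64,
    "training.optimizer.learn_rate": 0.001,
    "training.batcher.size.start": 100,
    "training.batcher.size.stop": 1000,
    "training.max_steps": 20000,
    "training.dropout": 0.1,
    "training.eval_frequency": 200,
}


def to_cli_args(overrides: dict) -> list:
    """Render {dotted.key: value} as ['--dotted.key', 'value', ...]."""
    args = []
    for key, value in overrides.items():
        args.extend([f"--{key}", str(value)])
    return args


# config.cfg is a placeholder for a config file generated separately.
command = ["python", "-m", "spacy", "train", "config.cfg", *to_cli_args(OVERRIDES)]
print(" ".join(command))
```

Keeping the values in one dictionary makes it easy to compare a local config against the numbers reported here before retraining.
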
## Limitations
- The model performs best on formal Indonesian text
- Some entity types (WOA, FAC, PRD) have lower performance due to limited training data
- May not generalize well to informal/colloquial Indonesian or social media text
- Performance may vary on domain-specific texts
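
Because WOA, FAC, and PRD are noted as weaker, an application that favors precision may choose to drop those labels from the output. Below is a minimal sketch, assuming entity objects expose a `label_` attribute as spaCy spans do. `drop_labels` and `LOW_CONFIDENCE_LABELS` are illustrative names, and whether excluding these labels is appropriate depends on the use case.

```python
from types import SimpleNamespace
from typing import FrozenSet, Iterable, List

# Entity types that the Limitations list names as lower-performing.
LOW_CONFIDENCE_LABELS: FrozenSet[str] = frozenset({"WOA", "FAC", "PRD"})


def drop_labels(ents: Iterable, excluded: FrozenSet[str] = LOW_CONFIDENCE_LABELS) -> List:
    """Return the entities whose label_ is not in `excluded`, preserving order."""
    return [ent for ent in ents if ent.label_ not in excluded]


# Stand-in objects for illustration; with the model loaded, pass doc.ents instead.
ents = [
    SimpleNamespace(text="Jakarta", label_="GPE"),
    SimpleNamespace(text="Laskar Pelangi", label_="WOA"),
]
print([(e.text, e.label_) for e in drop_labels(ents)])
```
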
## Citation
If you use this model, please cite:
```bibtex
@misc{id_nergrit_indonesian_spacy,
author = {nerdv2},
title = {Indonesian Named Entity Recognition Model},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/nerdv2/id_nergrit_indonesian_spacy}
}
```
## Acknowledgments
- Dataset: [PT Gria Inovasi Teknologi (GRIT)](https://grit.id/) for the Nergrit Corpus
- Framework: [spaCy](https://spacy.io/)
- Training: Based on the Nergrit Corpus dataset
## License
MIT License
## Contact
For questions or issues, please open an issue on the [Hugging Face model page](https://huggingface.co/nerdv2/id_nergrit_indonesian_spacy).