---
language:
- id
license: mit
tags:
- spacy
- token-classification
- named-entity-recognition
- indonesian
- ner
datasets:
- grit-id/id_nergrit_corpus
metrics:
- precision
- recall
- f1
model-index:
- name: id_nergrit_indonesian_spacy
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: Nergrit Corpus
      type: grit-id/id_nergrit_corpus
    metrics:
    - type: f1
      value: 0.7484
      name: F1 Score
    - type: precision
      value: 0.7748
      name: Precision
    - type: recall
      value: 0.7237
      name: Recall
widget:
- text: "Presiden Joko Widodo mengunjungi Jakarta pada tanggal 17 Agustus 2023."
- text: "Bank Indonesia mengumumkan suku bunga sebesar 5.75 persen."
- text: "Menteri Keuangan Sri Mulyani menyatakan APBN 2023 mencapai Rp 3000 triliun."
---
# Indonesian Named Entity Recognition Model
This is a spaCy model trained on the [Nergrit Corpus](https://huggingface.co/datasets/grit-id/id_nergrit_corpus) for Indonesian Named Entity Recognition.
## Model Description
This model recognizes 19 entity types in Indonesian text:
- **PER** (Person): Names of people
- **ORG** (Organization): Companies, institutions
- **GPE** (Geopolitical Entity): Countries, cities, states
- **LOC** (Location): Non-GPE locations, facilities
- **DAT** (Date): Absolute or relative dates
- **MON** (Money): Monetary values
- **PRC** (Percent): Percentages
- **TIM** (Time): Times of day
- **QTY** (Quantity): Measurements and quantities
- **CRD** (Cardinal): Cardinal numbers
- **ORD** (Ordinal): Ordinal numbers
- **EVT** (Event): Named events
- **FAC** (Facility): Buildings, airports, stations
- **LAW** (Law): Legal documents, laws
- **LAN** (Language): Named languages
- **NOR** (Political Organization): Political entities
- **PRD** (Product): Products, brands
- **REG** (Religion): Religious groups
- **WOA** (Work of Art): Titles of books, songs, etc.
## Performance
| Metric | Score |
|--------|-------|
| **F1 Score** | 74.84% |
| **Precision** | 77.48% |
| **Recall** | 72.37% |
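The three overall scores are mutually consistent: F1 is the harmonic mean of precision and recall, so the reported F1 can be reproduced from the other two values. A quick check in plain Python (the function name `f1_score` is illustrative and not part of spaCy):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall precision and recall as reported in the table above.
reported_precision = 0.7748
reported_recall = 0.7237

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.2%}")  # prints "F1 = 74.84%"
```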
### Top Performing Entities
| Entity | F1 Score |
|--------|----------|
| PRC (Percent) | 93.72% |
| MON (Money) | 92.56% |
| DAT (Date) | 92.41% |
| TIM (Time) | 88.51% |
| CRD (Cardinal) | 86.23% |
## Usage
### Installation
```bash
pip install spacy
pip install https://huggingface.co/nerdv2/id_nergrit_indonesian_spacy/resolve/main/id_nergrit_indonesian_spacy-1.0.0-py3-none-any.whl
```
### Basic Usage
```python
import spacy
# Load the model
nlp = spacy.load("id_nergrit_indonesian_spacy")
# Process text
text = "Presiden Joko Widodo mengunjungi Jakarta pada tanggal 17 Agustus 2023."
doc = nlp(text)
# Extract entities
for ent in doc.ents:
    print(f"{ent.text} -> {ent.label_}")
```
Output:
```
Joko Widodo -> PER
Jakarta -> GPE
17 Agustus 2023 -> DAT
```
### Batch Processing
```python
import spacy
nlp = spacy.load("id_nergrit_indonesian_spacy")
texts = [
    "Bank Indonesia mengumumkan suku bunga sebesar 5.75 persen.",
    "Menteri Keuangan Sri Mulyani menyatakan APBN 2023 mencapai Rp 3000 triliun."
]
# nlp.pipe streams the texts through the pipeline in batches
for doc in nlp.pipe(texts):
    print([(ent.text, ent.label_) for ent in doc.ents])
```
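For downstream use it is often more convenient to collect entities as plain records than to print them. The sketch below relies only on the standard spaCy span attributes `text`, `label_`, `start_char`, and `end_char`. The helper name `entities_to_records` is hypothetical, and the stand-in objects at the bottom are hand-written so the snippet runs without the model installed; they are not real model output.

```python
from types import SimpleNamespace


def entities_to_records(doc):
    """Turn the entities of a spaCy-like Doc into plain dictionaries."""
    return [
        {
            "text": ent.text,
            "label": ent.label_,
            "start": ent.start_char,
            "end": ent.end_char,
        }
        for ent in doc.ents
    ]


# Stand-in for a processed Doc, so the helper can be tried without the model.
# The spans below are hand-written for illustration, not real predictions.
text = "Bank Indonesia mengumumkan suku bunga sebesar 5.75 persen."
fake_doc = SimpleNamespace(
    ents=[
        SimpleNamespace(text="Bank Indonesia", label_="ORG", start_char=0, end_char=14),
        SimpleNamespace(text="5.75 persen", label_="PRC", start_char=46, end_char=57),
    ]
)

records = entities_to_records(fake_doc)
for record in records:
    print(record)
```

With the real pipeline, the same helper would be called as `entities_to_records(doc)` inside the `nlp.pipe` loop.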
### Using with Hugging Face Hub
```python
import spacy
# spacy.load does not download from the Hub; it loads the package
# installed from the Hugging Face wheel (see Installation)
nlp = spacy.load("id_nergrit_indonesian_spacy")
doc = nlp("Universitas Indonesia terletak di Depok, Jawa Barat.")
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")
```
## Training Data
The model was trained on the Nergrit Corpus ([grit-id/id_nergrit_corpus](https://huggingface.co/datasets/grit-id/id_nergrit_corpus)), split as follows:
- **Training examples**: 12,532
- **Validation examples**: 2,521
- **Test examples**: 2,399
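The split sizes work out to roughly a 72/14/14 division of the corpus. A short check of that arithmetic, using only the counts listed above:

```python
# Split sizes as reported above.
splits = {"train": 12_532, "validation": 2_521, "test": 2_399}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

for name, share in shares.items():
    print(f"{name:<10} {splits[name]:>6}  {share:.1%}")
print(f"{'total':<10} {total:>6}")
```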
## Training Procedure
### Model Architecture
- **Base**: spaCy's Tok2Vec + Transition-based NER
- **Embedding**: MultiHashEmbed with 96-dimensional vectors
- **Encoder**: MaxoutWindowEncoder (depth=4, window=1)
- **Parser**: Transition-based with 64 hidden units
### Training Configuration
- **Optimizer**: Adam (lr=0.001)
- **Batch size**: Dynamic (100-1000 words)
- **Max steps**: 20,000
- **Dropout**: 0.1
- **Evaluation frequency**: Every 200 steps
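The "dynamic" batch size is presumably a compounding schedule of the kind spaCy's default training config uses, where the target number of words per batch grows geometrically until it reaches a cap. The sketch below reimplements that idea in plain Python. The 100 and 1000 bounds come from the configuration above; the growth factor of 1.001 is an assumption borrowed from spaCy's default config and is not confirmed by this model card.

```python
from itertools import islice


def compounding(start: float, stop: float, compound: float):
    """Yield values that grow geometrically from start and are capped at stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound


# 100 and 1000 come from the configuration above; 1.001 is an assumed rate.
schedule = compounding(start=100.0, stop=1000.0, compound=1.001)
sizes = list(islice(schedule, 20_000))  # one target size per training step

print(round(sizes[0]), round(sizes[1_000]), round(sizes[-1]))
```

Under these assumptions the cap is reached after a few thousand steps, so most of the 20,000-step budget would run at the maximum batch size.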
## Limitations
- The model performs best on formal Indonesian text
- Some entity types (WOA, FAC, PRD) have lower performance due to limited training data
- May not generalize well to informal/colloquial Indonesian or social media text
- Performance may vary on domain-specific texts
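One way to act on the second limitation is to drop the weaker labels before predictions are used downstream. A minimal sketch, assuming the three labels named above are the ones to exclude: the helper name `drop_labels` is hypothetical, and it operates on `(text, label)` pairs like those produced in the batch-processing example.

```python
# Labels the Limitations section names as lower-performing.
LOWER_SCORING_LABELS = frozenset({"WOA", "FAC", "PRD"})


def drop_labels(entities, excluded=LOWER_SCORING_LABELS):
    """Keep only (text, label) pairs whose label is not in excluded."""
    return [(text, label) for text, label in entities if label not in excluded]


# Hand-written pairs for illustration, not real model output.
pairs = [("Joko Widodo", "PER"), ("Laskar Pelangi", "WOA"), ("Jakarta", "GPE")]
kept = drop_labels(pairs)
print(kept)
```

Whether filtering is appropriate depends on the application; for recall-sensitive tasks it may be better to keep these labels and review them manually.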
## Citation
If you use this model, please cite:
```bibtex
@misc{id_nergrit_indonesian_spacy,
  author = {nerdv2},
  title = {Indonesian Named Entity Recognition Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/nerdv2/id_nergrit_indonesian_spacy}
}
```
## Acknowledgments
- Dataset: [PT Gria Inovasi Teknologi (GRIT)](https://grit.id/) for the Nergrit Corpus
- Framework: [spaCy](https://spacy.io/)
- Training: Based on the Nergrit Corpus dataset
## License
MIT License
## Contact
For questions or issues, please open an issue on the [Hugging Face model page](https://huggingface.co/nerdv2/id_nergrit_indonesian_spacy).