---
language:
- id
license: mit
tags:
- spacy
- token-classification
- named-entity-recognition
- indonesian
- ner
datasets:
- grit-id/id_nergrit_corpus
metrics:
- precision
- recall
- f1
model-index:
- name: id_nergrit_indonesian_spacy
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: Nergrit Corpus
      type: grit-id/id_nergrit_corpus
    metrics:
    - type: f1
      value: 0.7484
      name: F1 Score
    - type: precision
      value: 0.7748
      name: Precision
    - type: recall
      value: 0.7237
      name: Recall
widget:
- text: "Presiden Joko Widodo mengunjungi Jakarta pada tanggal 17 Agustus 2023."
- text: "Bank Indonesia mengumumkan suku bunga sebesar 5.75 persen."
- text: "Menteri Keuangan Sri Mulyani menyatakan APBN 2023 mencapai Rp 3000 triliun."
---

# Indonesian Named Entity Recognition Model

This is a spaCy model trained on the [Nergrit Corpus](https://huggingface.co/datasets/grit-id/id_nergrit_corpus) for Indonesian Named Entity Recognition.

## Model Description

This model recognizes 19 entity types in Indonesian text:

- **PER** (Person): Names of people
- **ORG** (Organization): Companies, institutions
- **GPE** (Geopolitical Entity): Countries, cities, states
- **LOC** (Location): Non-GPE locations (facilities are tagged separately as FAC)
- **DAT** (Date): Absolute or relative dates
- **MON** (Money): Monetary values
- **PRC** (Percent): Percentages
- **TIM** (Time): Times of day
- **QTY** (Quantity): Measurements and quantities
- **CRD** (Cardinal): Cardinal numbers
- **ORD** (Ordinal): Ordinal numbers
- **EVT** (Event): Named events
- **FAC** (Facility): Buildings, airports, stations
- **LAW** (Law): Legal documents, laws
- **LAN** (Language): Named languages
- **NOR** (Political Organization): Political entities
- **PRD** (Product): Products, brands
- **REG** (Religion): Religious groups
- **WOA** (Work of Art): Titles of books, songs, etc.

## Performance

| Metric | Score |
|--------|-------|
| **F1 Score** | 74.84% |
| **Precision** | 77.48% |
| **Recall** | 72.37% |
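
The three scores above are mutually consistent if F1 is the harmonic mean of precision and recall, which is how spaCy's scorer derives it. A minimal check, using only the numbers reported in this card (the function name `f1_score` is just an illustrative helper):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the table above.
f1 = f1_score(0.7748, 0.7237)
print(f"{f1:.4f}")  # 0.7484
```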

### Top Performing Entities

| Entity | F1 Score |
|--------|----------|
| PRC (Percent) | 93.72% |
| MON (Money) | 92.56% |
| DAT (Date) | 92.41% |
| TIM (Time) | 88.51% |
| CRD (Cardinal) | 86.23% |

## Usage

### Installation

```bash
pip install spacy
pip install https://huggingface.co/nerdv2/id_nergrit_indonesian_spacy/resolve/main/id_nergrit_indonesian_spacy-1.0.0-py3-none-any.whl
```

### Basic Usage

```python
import spacy

# Load the model
nlp = spacy.load("id_nergrit_indonesian_spacy")

# Process text
text = "Presiden Joko Widodo mengunjungi Jakarta pada tanggal 17 Agustus 2023."
doc = nlp(text)

# Extract entities
for ent in doc.ents:
    print(f"{ent.text} -> {ent.label_}")
```

Output:
```
Joko Widodo -> PER
Jakarta -> GPE
17 Agustus 2023 -> DAT
```

### Batch Processing

```python
import spacy

nlp = spacy.load("id_nergrit_indonesian_spacy")

texts = [
    "Bank Indonesia mengumumkan suku bunga sebesar 5.75 persen.",
    "Menteri Keuangan Sri Mulyani menyatakan APBN 2023 mencapai Rp 3000 triliun."
]

for doc in nlp.pipe(texts):
    print([(ent.text, ent.label_) for ent in doc.ents])
```
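
When processing many texts, it is often handier to collect entities per label than to print them per document. The sketch below is one possible way to do that; `group_entities_by_label` and `demo` are hypothetical helper names, and the helper only assumes that each document exposes `ents` whose items have `text` and `label_`, as spaCy `Doc` objects do.

```python
from collections import defaultdict


def group_entities_by_label(docs):
    """Collect entity texts per label from an iterable of spaCy-like docs."""
    grouped = defaultdict(list)
    for doc in docs:
        for ent in doc.ents:
            grouped[ent.label_].append(ent.text)
    return dict(grouped)


def demo():
    """Run the helper on real pipeline output (assumes the model is installed)."""
    import spacy  # imported here so the helper above has no hard dependency

    nlp = spacy.load("id_nergrit_indonesian_spacy")
    texts = [
        "Bank Indonesia mengumumkan suku bunga sebesar 5.75 persen.",
        "Menteri Keuangan Sri Mulyani menyatakan APBN 2023 mencapai Rp 3000 triliun.",
    ]
    print(group_entities_by_label(nlp.pipe(texts)))
```

Calling `demo()` prints a dictionary keyed by entity label; the exact contents depend on the model's predictions.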

### Using with Hugging Face Hub

```python
import spacy

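# spacy.load() reads the locally installed package; it does not download from the Hub,
# so install the wheel hosted on Hugging Face first (see Installation)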
nlp = spacy.load("id_nergrit_indonesian_spacy")
doc = nlp("Universitas Indonesia terletak di Depok, Jawa Barat.")

for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")
```

## Training Data

The model was trained on the Nergrit Corpus dataset:
- **Training examples**: 12,532
- **Validation examples**: 2,521
- **Test examples**: 2,399

Dataset source: [grit-id/id_nergrit_corpus](https://huggingface.co/datasets/grit-id/id_nergrit_corpus)
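
For a sense of how the data is divided, the split sizes listed above can be turned into proportions. This is plain arithmetic on the counts in this card, not something read from the dataset itself:

```python
# Split sizes as listed in this card.
splits = {"train": 12532, "validation": 2521, "test": 2399}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

print(f"total: {total} examples")
for name, share in shares.items():
    print(f"{name}: {splits[name]} examples ({share:.1%})")
```

This works out to roughly 72% training, 14% validation, and 14% test data.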

## Training Procedure

### Model Architecture

- **Base**: spaCy's Tok2Vec + Transition-based NER
- **Embedding**: MultiHashEmbed with 96-dimensional vectors
- **Encoder**: MaxoutWindowEncoder (depth=4, window=1)
- **Parser**: Transition-based with 64 hidden units
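
As a rough guide to how these bullets might map onto spaCy's registered architectures, here is a simplified sketch written as a Python dictionary that mirrors the nesting of a `[components.ner.model]` config block. It is not the actual training config: the architecture version suffixes and the encoder `width` of 96 (chosen to match the embedding width) are assumptions, and a real config usually carries additional keys.

```python
# Hypothetical mapping of the listed settings onto spaCy architecture names.
# Version suffixes (.v2) and the encoder width are assumptions, not taken
# from the model's real config.
ner_model = {
    "@architectures": "spacy.TransitionBasedParser.v2",
    "state_type": "ner",
    "hidden_width": 64,  # "Parser: 64 hidden units"
    "tok2vec": {
        "@architectures": "spacy.Tok2Vec.v2",
        "embed": {
            "@architectures": "spacy.MultiHashEmbed.v2",
            "width": 96,  # "Embedding: 96-dimensional vectors"
        },
        "encode": {
            "@architectures": "spacy.MaxoutWindowEncoder.v2",
            "width": 96,  # assumed equal to the embedding width
            "depth": 4,
            "window_size": 1,
        },
    },
}
```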

### Training Configuration

- **Optimizer**: Adam (lr=0.001)
- **Batch size**: Dynamic (100-1000 words)
- **Max steps**: 20,000
- **Dropout**: 0.1
- **Evaluation frequency**: Every 200 steps
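
The "dynamic" batch size most likely refers to a compounding schedule, where the word budget per batch starts at 100 and grows by a constant factor each step until it is capped at 1000. The sketch below re-implements that idea in plain Python for illustration; the growth factor of 1.001 is spaCy's usual default and is an assumption here, since the card only states the 100-1000 range.

```python
from itertools import islice


def compounding(start: float, stop: float, compound: float):
    """Yield an endless series that grows geometrically and is capped at `stop`."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound


# Assumed schedule: 100 -> 1000 words per batch, growing 0.1% per step.
schedule = list(islice(compounding(100.0, 1000.0, 1.001), 5000))

print(schedule[0])     # 100.0
print(schedule[-1])    # 1000.0
```

With these assumed values the cap is reached after roughly 2,300 steps, well within the 20,000-step limit.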

## Limitations

- The model performs best on formal Indonesian text
- Some entity types (WOA, FAC, PRD) have lower performance due to limited training data
- May not generalize well to informal/colloquial Indonesian or social media text
- Performance may vary on domain-specific texts
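
If the weaker entity types cause problems downstream, one option is to drop them after prediction. The helper below is a minimal sketch of that idea; `filter_entities` and `LOW_PERFORMING_LABELS` are hypothetical names, and the label set simply repeats the three types mentioned above.

```python
# Entity types this card reports as weaker; adjust to your own needs.
LOW_PERFORMING_LABELS = frozenset({"WOA", "FAC", "PRD"})


def filter_entities(ents, excluded=LOW_PERFORMING_LABELS):
    """Return (text, label) pairs, skipping entities whose label is excluded.

    `ents` can be `doc.ents` from spaCy or any iterable of objects that
    expose `text` and `label_` attributes.
    """
    return [(ent.text, ent.label_) for ent in ents if ent.label_ not in excluded]
```

With a processed document, `filter_entities(doc.ents)` returns only the entities outside the excluded set.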

## Citation

If you use this model, please cite:

```bibtex
@misc{id_nergrit_indonesian_spacy,
  author = {nerdv2},
  title = {Indonesian Named Entity Recognition Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/nerdv2/id_nergrit_indonesian_spacy}
}
```

## Acknowledgments

- Dataset: [PT Gria Inovasi Teknologi (GRIT)](https://grit.id/) for the Nergrit Corpus
- Framework: [spaCy](https://spacy.io/)
- Training: Based on the Nergrit Corpus dataset

## License

MIT License

## Contact

For questions or issues, please open an issue on the [Hugging Face model page](https://huggingface.co/nerdv2/id_nergrit_indonesian_spacy).