| ---
|
| language:
|
| - id
|
| license: mit
|
| tags:
|
| - spacy
|
| - token-classification
|
| - named-entity-recognition
|
| - indonesian
|
| - ner
|
| datasets:
|
| - grit-id/id_nergrit_corpus
|
| metrics:
|
| - precision
|
| - recall
|
| - f1
|
| model-index:
|
| - name: id_nergrit_indonesian_spacy
|
|   results:
|
|   - task:
|
|       type: token-classification
|
|       name: Named Entity Recognition
|
|     dataset:
|
|       name: Nergrit Corpus
|
|       type: grit-id/id_nergrit_corpus
|
|     metrics:
|
|     - type: f1
|
|       value: 0.7484
|
|       name: F1 Score
|
|     - type: precision
|
|       value: 0.7748
|
|       name: Precision
|
|     - type: recall
|
|       value: 0.7237
|
|       name: Recall
|
| widget:
|
| - text: "Presiden Joko Widodo mengunjungi Jakarta pada tanggal 17 Agustus 2023."
|
| - text: "Bank Indonesia mengumumkan suku bunga sebesar 5.75 persen."
|
| - text: "Menteri Keuangan Sri Mulyani menyatakan APBN 2023 mencapai Rp 3000 triliun."
|
| ---
|
|
|
| # Indonesian Named Entity Recognition Model
|
|
|
| This is a spaCy model trained on the [Nergrit Corpus](https://huggingface.co/datasets/grit-id/id_nergrit_corpus) for Indonesian Named Entity Recognition.
|
|
|
| ## Model Description
|
|
|
| This model recognizes 19 entity types in Indonesian text:
|
|
|
| - **PER** (Person): Names of people
|
| - **ORG** (Organization): Companies, institutions
|
| - **GPE** (Geopolitical Entity): Countries, cities, states
|
| - **LOC** (Location): Non-GPE locations such as mountains, rivers, and regions
|
| - **DAT** (Date): Absolute or relative dates
|
| - **MON** (Money): Monetary values
|
| - **PRC** (Percent): Percentages
|
| - **TIM** (Time): Times of day
|
| - **QTY** (Quantity): Measurements and quantities
|
| - **CRD** (Cardinal): Cardinal numbers
|
| - **ORD** (Ordinal): Ordinal numbers
|
| - **EVT** (Event): Named events
|
| - **FAC** (Facility): Buildings, airports, stations
|
| - **LAW** (Law): Legal documents, laws
|
| - **LAN** (Language): Named languages
|
| - **NOR** (Political Organization): Political entities
|
| - **PRD** (Product): Products, brands
|
| - **REG** (Religion): Religious groups
|
| - **WOA** (Work of Art): Titles of books, songs, etc.
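
When consuming predictions, it can help to map the short label codes to the names listed above and to group entities by type. The following is a minimal sketch rather than part of the model package: `LABEL_NAMES` and `group_by_label` are hypothetical helper names, and the input is assumed to be `(text, label)` pairs such as those built from `doc.ents`.

```python
from collections import defaultdict

# Label codes and names copied from the list above.
# Illustrative helper only; it is not shipped with the model.
LABEL_NAMES = {
    "PER": "Person", "ORG": "Organization", "GPE": "Geopolitical Entity",
    "LOC": "Location", "DAT": "Date", "MON": "Money", "PRC": "Percent",
    "TIM": "Time", "QTY": "Quantity", "CRD": "Cardinal", "ORD": "Ordinal",
    "EVT": "Event", "FAC": "Facility", "LAW": "Law", "LAN": "Language",
    "NOR": "Political Organization", "PRD": "Product", "REG": "Religion",
    "WOA": "Work of Art",
}


def group_by_label(entities):
    """Group (text, label) pairs by label, keeping order of appearance."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)


# Example pairs, e.g. built with [(ent.text, ent.label_) for ent in doc.ents]
pairs = [("Joko Widodo", "PER"), ("Jakarta", "GPE"), ("17 Agustus 2023", "DAT")]
for label, texts in group_by_label(pairs).items():
    # Fall back to the raw code if a label is not in the mapping
    print(f"{LABEL_NAMES.get(label, label)}: {texts}")
```
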
|
|
|
| ## Performance
|
|
|
| | Metric | Score |
|
| |--------|-------|
|
| | **F1 Score** | 74.84% |
|
| | **Precision** | 77.48% |
|
| | **Recall** | 72.37% |
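
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the three numbers in the table can be verified against each other. The snippet below is a small sketch; `f1_score` is a helper defined here for illustration, not a function from spaCy.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the table above
reported_precision = 0.7748
reported_recall = 0.7237

print(f"{f1_score(reported_precision, reported_recall):.4f}")  # prints 0.7484
```
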
|
|
|
| ### Top Performing Entities
|
|
|
| | Entity | F1 Score |
|
| |--------|----------|
|
| | PRC (Percent) | 93.72% |
|
| | MON (Money) | 92.56% |
|
| | DAT (Date) | 92.41% |
|
| | TIM (Time) | 88.51% |
|
| | CRD (Cardinal) | 86.23% |
|
|
|
| ## Usage
|
|
|
| ### Installation
|
|
|
| ```bash
|
| pip install spacy
|
| pip install https://huggingface.co/nerdv2/id_nergrit_indonesian_spacy/resolve/main/id_nergrit_indonesian_spacy-1.0.0-py3-none-any.whl
|
| ```
|
|
|
| ### Basic Usage
|
|
|
| ```python
|
| import spacy
|
|
|
| # Load the model
|
| nlp = spacy.load("id_nergrit_indonesian_spacy")
|
|
|
| # Process text
|
| text = "Presiden Joko Widodo mengunjungi Jakarta pada tanggal 17 Agustus 2023."
|
| doc = nlp(text)
|
|
|
| # Extract entities
|
| for ent in doc.ents:
|
|     print(f"{ent.text} -> {ent.label_}")
|
| ```
|
|
|
| Output:
|
| ```
|
| Joko Widodo -> PER
|
| Jakarta -> GPE
|
| 17 Agustus 2023 -> DAT
|
| ```
|
|
|
| ### Batch Processing
|
|
|
| ```python
|
| import spacy
|
|
|
| nlp = spacy.load("id_nergrit_indonesian_spacy")
|
|
|
| texts = [
|
|     "Bank Indonesia mengumumkan suku bunga sebesar 5.75 persen.",
|
|     "Menteri Keuangan Sri Mulyani menyatakan APBN 2023 mencapai Rp 3000 triliun.",
|
| ]
|
|
|
| # Stream the texts through the pipeline in batches
|
| for doc in nlp.pipe(texts):
|
|     print([(ent.text, ent.label_) for ent in doc.ents])
|
| ```
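
For larger jobs it is often convenient to turn each processed document into a plain record that can be written out as JSON Lines. The sketch below assumes only the standard spaCy attributes `doc.text`, `doc.ents`, `ent.text`, `ent.label_`, `ent.start_char`, and `ent.end_char`. The helper names and the `batch_size=64` default are illustrative choices, not settings recommended by the model author.

```python
import json


def doc_to_record(doc):
    """Convert a processed spaCy Doc into a JSON-serializable dict."""
    return {
        "text": doc.text,
        "entities": [
            {
                "text": ent.text,
                "label": ent.label_,
                "start": ent.start_char,  # character offsets into doc.text
                "end": ent.end_char,
            }
            for ent in doc.ents
        ],
    }


def records_to_jsonl(records):
    """Serialize records as JSON Lines, keeping non-ASCII characters readable."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


def extract_records(texts, batch_size=64):
    """Run the installed pipeline over texts and yield one record per text."""
    import spacy  # imported lazily so the helpers above work without spaCy

    nlp = spacy.load("id_nergrit_indonesian_spacy")
    for doc in nlp.pipe(texts, batch_size=batch_size):
        yield doc_to_record(doc)
```

With the model installed, `print(records_to_jsonl(extract_records(texts)))` would emit one JSON object per input text.
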
|
|
|
| ### Using with Hugging Face Hub
|
|
|
| ```python
|
| import spacy
|
|
|
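| # Load the package installed from the Hugging Face Hub wheel (see Installation);
|
| # spacy.load() reads the locally installed package and does not download it
|
| nlp = spacy.load("id_nergrit_indonesian_spacy")
|
| doc = nlp("Universitas Indonesia terletak di Depok, Jawa Barat.")
|
|
|
| for ent in doc.ents:
|
|     print(f"{ent.text} ({ent.label_})")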
|
| ```
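
Because `spacy.load` resolves the name against packages installed in the current environment, it fails if the wheel from the Installation section has not been installed. The following sketch checks for the package first, using only the standard library's `importlib.util.find_spec`. `is_installed` and `load_model` are hypothetical helper names introduced here for illustration.

```python
import importlib.util

MODEL_NAME = "id_nergrit_indonesian_spacy"


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


def load_model(name: str = MODEL_NAME):
    """Load the pipeline, failing with an actionable message if it is missing."""
    if not is_installed(name):
        raise RuntimeError(
            f"Package '{name}' is not installed; "
            "run the pip commands from the Installation section first."
        )
    import spacy  # imported only once the model package is known to exist

    return spacy.load(name)
```
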
|
|
|
| ## Training Data
|
|
|
| The model was trained on the Nergrit Corpus dataset:
|
| - **Training examples**: 12,532
|
| - **Validation examples**: 2,521
|
| - **Test examples**: 2,399
|
|
|
| Dataset source: [grit-id/id_nergrit_corpus](https://huggingface.co/datasets/grit-id/id_nergrit_corpus)
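
The split sizes above correspond to roughly a 72% / 14% / 14% division of the corpus. A short sketch of the arithmetic, using only the counts listed in this section (`split_shares` is an illustrative helper name):

```python
# Example counts copied from the list above
splits = {"train": 12532, "validation": 2521, "test": 2399}


def split_shares(counts):
    """Return each split's share of the total number of examples."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


total = sum(splits.values())
for name, share in split_shares(splits).items():
    print(f"{name}: {splits[name]} examples ({share:.1%})")
print(f"total: {total} examples")
```
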
|
|
|
| ## Training Procedure
|
|
|
| ### Model Architecture
|
|
|
| - **Base**: spaCy's Tok2Vec + Transition-based NER
|
| - **Embedding**: MultiHashEmbed with 96-dimensional vectors
|
| - **Encoder**: MaxoutWindowEncoder (depth=4, window=1)
|
| - **Parser**: Transition-based with 64 hidden units
|
|
|
| ### Training Configuration
|
|
|
| - **Optimizer**: Adam (lr=0.001)
|
| - **Batch size**: Dynamic (100-1000 words)
|
| - **Max steps**: 20,000
|
| - **Dropout**: 0.1
|
| - **Evaluation frequency**: Every 200 steps
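
The architecture and training settings listed above can be expressed as overrides for `python -m spacy train`, which accepts dotted config paths as command-line options. The sketch below makes several assumptions: the author's actual config file is not shown, the key paths follow the layout that `python -m spacy init config` typically generates for a CPU NER pipeline, and `config.cfg` and `./output` are placeholder paths. The dynamic batch size is left out because its schedule is not specified here.

```python
# Hyperparameters copied from the lists above, keyed by assumed config paths.
# In generated configs the embedding width is usually tied to the encoder width.
SETTINGS = {
    "components.tok2vec.model.encode.width": 96,
    "components.tok2vec.model.encode.depth": 4,
    "components.tok2vec.model.encode.window_size": 1,
    "components.ner.model.hidden_width": 64,
    "training.optimizer.learn_rate": 0.001,
    "training.max_steps": 20000,
    "training.dropout": 0.1,
    "training.eval_frequency": 200,
}


def to_cli_overrides(settings):
    """Turn {"a.b": value} pairs into ["--a.b", "value", ...] arguments."""
    args = []
    for key, value in settings.items():
        args.extend([f"--{key}", str(value)])
    return args


# Placeholder config and output paths; adjust to the local project layout
command = ["python", "-m", "spacy", "train", "config.cfg", "--output", "./output"]
command += to_cli_overrides(SETTINGS)
print(" ".join(command))
```
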
|
|
|
| ## Limitations
|
|
|
| - The model performs best on formal Indonesian text and may not generalize well to informal or colloquial Indonesian, including social media text
|
| - Some entity types (WOA, FAC, PRD) perform worse than the others due to limited training data
|
| - Accuracy may drop on domain-specific text that differs from the training data
|
|
|
| ## Citation
|
|
|
| If you use this model, please cite:
|
|
|
| ```bibtex
|
| @misc{id_nergrit_indonesian_spacy,
|
| author = {nerdv2},
|
| title = {Indonesian Named Entity Recognition Model},
|
| year = {2025},
|
| publisher = {Hugging Face},
|
| url = {https://huggingface.co/nerdv2/id_nergrit_indonesian_spacy}
|
| }
|
| ```
|
|
|
| ## Acknowledgments
|
|
|
| - Dataset: [PT Gria Inovasi Teknologi (GRIT)](https://grit.id/) for the Nergrit Corpus
|
| - Framework: [spaCy](https://spacy.io/)
|
| - Training: Based on the Nergrit Corpus dataset
|
|
|
| ## License
|
|
|
| MIT License
|
|
|
| ## Contact
|
|
|
| For questions or issues, please open an issue on the [Hugging Face model page](https://huggingface.co/nerdv2/id_nergrit_indonesian_spacy).
|
|
|