---
language:
- id
license: mit
tags:
- spacy
- token-classification
- named-entity-recognition
- indonesian
- ner
datasets:
- grit-id/id_nergrit_corpus
metrics:
- precision
- recall
- f1
model-index:
- name: id_nergrit_indonesian_spacy
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: Nergrit Corpus
      type: grit-id/id_nergrit_corpus
    metrics:
    - type: f1
      value: 0.7484
      name: F1 Score
    - type: precision
      value: 0.7748
      name: Precision
    - type: recall
      value: 0.7237
      name: Recall
widget:
- text: "Presiden Joko Widodo mengunjungi Jakarta pada tanggal 17 Agustus 2023."
- text: "Bank Indonesia mengumumkan suku bunga sebesar 5.75 persen."
- text: "Menteri Keuangan Sri Mulyani menyatakan APBN 2023 mencapai Rp 3000 triliun."
---

# Indonesian Named Entity Recognition Model

This is a spaCy model trained on the [Nergrit Corpus](https://huggingface.co/datasets/grit-id/id_nergrit_corpus) for Indonesian Named Entity Recognition.

## Model Description

This model recognizes 19 entity types in Indonesian text:

- **PER** (Person): Names of people
- **ORG** (Organization): Companies, institutions
- **GPE** (Geopolitical Entity): Countries, cities, states
- **LOC** (Location): Non-GPE locations, facilities
- **DAT** (Date): Absolute or relative dates
- **MON** (Money): Monetary values
- **PRC** (Percent): Percentages
- **TIM** (Time): Times of day
- **QTY** (Quantity): Measurements and quantities
- **CRD** (Cardinal): Cardinal numbers
- **ORD** (Ordinal): Ordinal numbers
- **EVT** (Event): Named events
- **FAC** (Facility): Buildings, airports, stations
- **LAW** (Law): Legal documents, laws
- **LAN** (Language): Named languages
- **NOR** (Political Organization): Political entities
- **PRD** (Product): Products, brands
- **REG** (Religion): Religious groups
- **WOA** (Work of Art): Titles of books, songs, etc.

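
### Grouping Entities by Label

With 19 labels, it is often convenient to group the recognized entities by type and print a readable name next to each code. The following is a minimal sketch in plain Python, not part of the model package: `LABEL_DESCRIPTIONS`, `group_by_label`, and `describe` are hypothetical helper names, and the sample pairs are hard-coded from the example output under Basic Usage so the snippet runs without the model installed. In practice the pairs would come from `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import defaultdict

# Label codes and names copied from the entity list in this model card.
LABEL_DESCRIPTIONS = {
    "PER": "Person", "ORG": "Organization", "GPE": "Geopolitical Entity",
    "LOC": "Location", "DAT": "Date", "MON": "Money", "PRC": "Percent",
    "TIM": "Time", "QTY": "Quantity", "CRD": "Cardinal", "ORD": "Ordinal",
    "EVT": "Event", "FAC": "Facility", "LAW": "Law", "LAN": "Language",
    "NOR": "Political Organization", "PRD": "Product", "REG": "Religion",
    "WOA": "Work of Art",
}


def group_by_label(entities):
    """Group (text, label) pairs into {label: [text, ...]}, keeping input order."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)


def describe(label):
    """Return the readable name for a label code, or "Unknown" if unlisted."""
    return LABEL_DESCRIPTIONS.get(label, "Unknown")


# Hard-coded stand-in for [(ent.text, ent.label_) for ent in doc.ents].
sample_entities = [
    ("Joko Widodo", "PER"),
    ("Jakarta", "GPE"),
    ("17 Agustus 2023", "DAT"),
]

for label, texts in group_by_label(sample_entities).items():
    print(f"{label} ({describe(label)}): {texts}")
```

Keeping the mapping in a plain dictionary makes it easy to filter for a subset of types, for example only `PER` and `ORG`, before any downstream processing.
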
## Performance

| Metric | Score |
|--------|-------|
| **F1 Score** | 74.84% |
| **Precision** | 77.48% |
| **Recall** | 72.37% |

### Top Performing Entities

| Entity | F1 Score |
|--------|----------|
| PRC (Percent) | 93.72% |
| MON (Money) | 92.56% |
| DAT (Date) | 92.41% |
| TIM (Time) | 88.51% |
| CRD (Cardinal) | 86.23% |

## Usage

### Installation

```bash
pip install spacy
pip install https://huggingface.co/nerdv2/id_nergrit_indonesian_spacy/resolve/main/id_nergrit_indonesian_spacy-1.0.0-py3-none-any.whl
```

### Basic Usage

```python
import spacy

# Load the model
nlp = spacy.load("id_nergrit_indonesian_spacy")

# Process text
text = "Presiden Joko Widodo mengunjungi Jakarta pada tanggal 17 Agustus 2023."
doc = nlp(text)

# Extract entities
for ent in doc.ents:
    print(f"{ent.text} -> {ent.label_}")
```

Output:

```
Joko Widodo -> PER
Jakarta -> GPE
17 Agustus 2023 -> DAT
```

### Batch Processing

```python
import spacy

nlp = spacy.load("id_nergrit_indonesian_spacy")

texts = [
    "Bank Indonesia mengumumkan suku bunga sebesar 5.75 persen.",
    "Menteri Keuangan Sri Mulyani menyatakan APBN 2023 mencapai Rp 3000 triliun."
]

# nlp.pipe processes the texts as a stream, which is more efficient than calling nlp() per text
for doc in nlp.pipe(texts):
    print([(ent.text, ent.label_) for ent in doc.ents])
```

### Using with Hugging Face Hub

```python
import spacy

# spacy.load() reads the locally installed package; it does not download from the Hub.
# Install the wheel hosted on the Hugging Face Hub first (see Installation).
nlp = spacy.load("id_nergrit_indonesian_spacy")

doc = nlp("Universitas Indonesia terletak di Depok, Jawa Barat.")
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")
```

## Training Data

The model was trained on the Nergrit Corpus dataset:

- **Training examples**: 12,532
- **Validation examples**: 2,521
- **Test examples**: 2,399

Dataset source: [grit-id/id_nergrit_corpus](https://huggingface.co/datasets/grit-id/id_nergrit_corpus)

## Training Procedure

### Model Architecture

- **Base**: spaCy's Tok2Vec + Transition-based NER
- **Embedding**: MultiHashEmbed with 96-dimensional vectors
- **Encoder**: MaxoutWindowEncoder (depth=4, window=1)
- **Parser**: Transition-based with 64 hidden units

### Training Configuration

- **Optimizer**: Adam (lr=0.001)
- **Batch size**: Dynamic (100-1000 words)
- **Max steps**: 20,000
- **Dropout**: 0.1
- **Evaluation frequency**: Every 200 steps

## Limitations

- The model performs best on formal Indonesian text.
- Some entity types (WOA, FAC, PRD) have lower performance due to limited training data.
- It may not generalize well to informal or colloquial Indonesian, including social media text.
- Performance may vary on domain-specific texts.

## Citation

If you use this model, please cite:

```bibtex
@misc{id_nergrit_indonesian_spacy,
  author = {nerdv2},
  title = {Indonesian Named Entity Recognition Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/nerdv2/id_nergrit_indonesian_spacy}
}
```

## Acknowledgments

- Dataset: [PT Gria Inovasi Teknologi (GRIT)](https://grit.id/) for the Nergrit Corpus
- Framework: [spaCy](https://spacy.io/)
- Training: Based on the Nergrit Corpus dataset

## License

MIT License

## Contact

For questions or issues, please open an issue on the [Hugging Face model page](https://huggingface.co/nerdv2/id_nergrit_indonesian_spacy).