	---
	language:
	- id
	license: mit
	tags:
	- spacy
	- token-classification
	- named-entity-recognition
	- indonesian
	- ner
	datasets:
	- grit-id/id_nergrit_corpus
	metrics:
	- precision
	- recall
	- f1
	model-index:
	- name: id_nergrit_indonesian_spacy
	  results:
	  - task:
	      type: token-classification
	      name: Named Entity Recognition
	    dataset:
	      name: Nergrit Corpus
	      type: grit-id/id_nergrit_corpus
	    metrics:
	    - type: f1
	      value: 0.7484
	      name: F1 Score
	    - type: precision
	      value: 0.7748
	      name: Precision
	    - type: recall
	      value: 0.7237
	      name: Recall
	widget:
	- text: "Presiden Joko Widodo mengunjungi Jakarta pada tanggal 17 Agustus 2023."
	- text: "Bank Indonesia mengumumkan suku bunga sebesar 5.75 persen."
	- text: "Menteri Keuangan Sri Mulyani menyatakan APBN 2023 mencapai Rp 3000 triliun."
	---

	# Indonesian Named Entity Recognition Model

	This is a spaCy model trained on the [Nergrit Corpus](https://huggingface.co/datasets/grit-id/id_nergrit_corpus) for Indonesian Named Entity Recognition.

	## Model Description

	This model recognizes 19 entity types in Indonesian text:

	- PER (Person): Names of people
	- ORG (Organization): Companies, institutions
	- GPE (Geopolitical Entity): Countries, cities, states
	- LOC (Location): Non-GPE locations, facilities
	- DAT (Date): Absolute or relative dates
	- MON (Money): Monetary values
	- PRC (Percent): Percentages
	- TIM (Time): Times of day
	- QTY (Quantity): Measurements and quantities
	- CRD (Cardinal): Cardinal numbers
	- ORD (Ordinal): Ordinal numbers
	- EVT (Event): Named events
	- FAC (Facility): Buildings, airports, stations
	- LAW (Law): Legal documents, laws
	- LAN (Language): Named languages
	- NOR (Political Organization): Political entities
	- PRD (Product): Products, brands
	- REG (Religion): Religious groups
	- WOA (Work of Art): Titles of books, songs, etc.

	## Performance

	| Metric | Score |
	|--------|-------|
	| F1 Score | 74.84% |
	| Precision | 77.48% |
	| Recall | 72.37% |
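As a quick consistency check, F1 is the harmonic mean of precision and recall, so the three overall scores should agree with one another. The short sketch below recomputes F1 from the reported precision and recall; the variable names are illustrative only.

```python
# Overall scores as reported in the Performance table.
precision = 0.7748
recall = 0.7237

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 = {f1:.4f}")  # agrees with the reported 74.84% after rounding
```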

	### Top Performing Entities

	| Entity | F1 Score |
	|--------|----------|
	| PRC (Percent) | 93.72% |
	| MON (Money) | 92.56% |
	| DAT (Date) | 92.41% |
	| TIM (Time) | 88.51% |
	| CRD (Cardinal) | 86.23% |

	## Usage

	### Installation

	```bash
	pip install spacy
	pip install https://huggingface.co/nerdv2/id_nergrit_indonesian_spacy/resolve/main/id_nergrit_indonesian_spacy-1.0.0-py3-none-any.whl
	```

	### Basic Usage

	```python
	import spacy

	# Load the model
	nlp = spacy.load("id_nergrit_indonesian_spacy")

	# Process text
	text = "Presiden Joko Widodo mengunjungi Jakarta pada tanggal 17 Agustus 2023."
	doc = nlp(text)

	# Extract entities
	for ent in doc.ents:
	    print(f"{ent.text} -> {ent.label_}")
	```

	Output:
	```
	Joko Widodo -> PER
	Jakarta -> GPE
	17 Agustus 2023 -> DAT
	```
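Downstream code often wants entities grouped by type rather than as a flat list. The following is a minimal sketch of one way to do that, assuming each entity exposes `.text` and `.label_` as spaCy spans do. `FakeSpan` and `group_by_label` are hypothetical names introduced here so the snippet runs without the model installed; with the real pipeline you would pass `doc.ents` instead.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for spaCy Span objects (only .text and .label_ are used).
FakeSpan = namedtuple("FakeSpan", ["text", "label_"])

def group_by_label(ents):
    """Map each entity label to the list of entity texts carrying it."""
    grouped = defaultdict(list)
    for ent in ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)

# The entities shown in the Basic Usage output.
ents = [
    FakeSpan("Joko Widodo", "PER"),
    FakeSpan("Jakarta", "GPE"),
    FakeSpan("17 Agustus 2023", "DAT"),
]
grouped = group_by_label(ents)
print(grouped)
```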

	### Batch Processing

	```python
	import spacy

	nlp = spacy.load("id_nergrit_indonesian_spacy")

	texts = [
	    "Bank Indonesia mengumumkan suku bunga sebesar 5.75 persen.",
	    "Menteri Keuangan Sri Mulyani menyatakan APBN 2023 mencapai Rp 3000 triliun.",
	]

	# nlp.pipe streams the texts through the pipeline in batches
	for doc in nlp.pipe(texts):
	    print([(ent.text, ent.label_) for ent in doc.ents])
	```

	### Using with Hugging Face Hub

	```python
	import spacy

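	# spacy.load() resolves a locally installed package by name; it does not
	# download from the Hub, so install the wheel from the Installation step first.
	nlp = spacy.load("id_nergrit_indonesian_spacy")
	doc = nlp("Universitas Indonesia terletak di Depok, Jawa Barat.")

	for ent in doc.ents:
	    print(f"{ent.text} ({ent.label_})")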
	```

	## Training Data

	The model was trained on the Nergrit Corpus dataset:
	- Training examples: 12,532
	- Validation examples: 2,521
	- Test examples: 2,399

	Dataset source: [grit-id/id_nergrit_corpus](https://huggingface.co/datasets/grit-id/id_nergrit_corpus)
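To put the split sizes in proportion, the arithmetic can be sketched as follows. The counts come from the list above; the dictionary keys are illustrative names, not necessarily the dataset's own split identifiers.

```python
# Split sizes as listed in the Training Data section.
splits = {"train": 12_532, "validation": 2_521, "test": 2_399}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

print(f"total examples: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

This works out to roughly a 72/14/14 split across 17,452 examples.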

	## Training Procedure

	### Model Architecture

	- Base: spaCy's Tok2Vec + Transition-based NER
	- Embedding: MultiHashEmbed with 96-dimensional vectors
	- Encoder: MaxoutWindowEncoder (depth=4, window=1)
	- Parser: Transition-based with 64 hidden units

	### Training Configuration

	- Optimizer: Adam (lr=0.001)
	- Batch size: Dynamic (100-1000 words)
	- Max steps: 20,000
	- Dropout: 0.1
	- Evaluation frequency: Every 200 steps
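The list above says only that the batch size is dynamic, between 100 and 1000 words. One common way spaCy training configs express this is a compounding schedule: start at 100 words and multiply by a fixed factor each step until the cap of 1000 is reached. The plain-Python sketch below illustrates that idea and is not this model's actual config. The growth factor `1.001` is an assumption (it is the value spaCy's generated config templates typically use), and `compounding_sizes` is a hypothetical helper.

```python
from itertools import islice

def compounding_sizes(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound

# ASSUMPTION: compound=1.001 is not stated in the model card.
# 20,000 matches the "Max steps" setting listed above.
sizes = list(islice(compounding_sizes(100.0, 1000.0, 1.001), 20_000))

# Index of the first step at which the cap of 1000 words is reached.
first_capped = next(i for i, size in enumerate(sizes) if size >= 1000.0)
print(sizes[0], sizes[-1], first_capped)
```

Under this assumed growth factor the cap is reached after a few thousand steps, so most of a 20,000-step run would use the maximum batch size.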

	## Limitations

	- The model performs best on formal Indonesian text
	- Some entity types (WOA, FAC, PRD) have lower performance due to limited training data
	- May not generalize well to informal/colloquial Indonesian or social media text
	- Performance may vary on domain-specific texts
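Because WOA, FAC, and PRD are reported as weaker, an application that values precision may prefer to discard those predictions. A minimal sketch of such a filter, assuming entities expose `.label_` as spaCy spans do: `Ent` is a hypothetical stand-in so the snippet runs without the model, `drop_labels` is not part of spaCy, and the sample predictions are made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for spaCy Span objects (only .text and .label_ are used).
Ent = namedtuple("Ent", ["text", "label_"])

# Labels the Limitations section flags as lower-performing.
WEAKER_LABELS = frozenset({"WOA", "FAC", "PRD"})

def drop_labels(ents, excluded=WEAKER_LABELS):
    """Return the entities whose label is not in `excluded`, preserving order."""
    return [ent for ent in ents if ent.label_ not in excluded]

# Made-up predictions for illustration; with the real model, pass doc.ents.
predicted = [
    Ent("Bank Indonesia", "ORG"),
    Ent("Laskar Pelangi", "WOA"),
    Ent("5.75 persen", "PRC"),
]
kept = drop_labels(predicted)
print([(ent.text, ent.label_) for ent in kept])
```

Whether dropping these labels is appropriate depends on the application; the filter trades recall on those types for fewer unreliable predictions.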

	## Citation

	If you use this model, please cite:

	```bibtex
	@misc{id_nergrit_indonesian_spacy,
	  author    = {nerdv2},
	  title     = {Indonesian Named Entity Recognition Model},
	  year      = {2025},
	  publisher = {Hugging Face},
	  url       = {https://huggingface.co/nerdv2/id_nergrit_indonesian_spacy}
	}
	```

	## Acknowledgments

	- Dataset: [PT Gria Inovasi Teknologi (GRIT)](https://grit.id/) for the Nergrit Corpus
	- Framework: [spaCy](https://spacy.io/)
	- Training: Based on the Nergrit Corpus dataset

	## License

	MIT License

	## Contact

	For questions or issues, please open an issue on the [Hugging Face model page](https://huggingface.co/nerdv2/id_nergrit_indonesian_spacy).