Indonesian NER spaCy Model

This is a Named Entity Recognition (NER) model for the Indonesian language, built with spaCy.

Model Details

Language: Indonesian (id)
Pipeline: ner
spaCy Version: >=3.8.7,<3.9.0
Model Architecture: Transition-based parser with HashEmbedCNN tok2vec
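
Because the pipeline is pinned to spaCy >=3.8.7,<3.9.0, it can help to check the installed version before loading. Below is a minimal sketch in plain Python. The helper names parse_version and is_supported are illustrative only (they are not part of spaCy or of this model), and pre-release suffixes such as "rc1" are handled crudely by ignoring them.

```python
def parse_version(version: str) -> tuple:
    """Turn '3.8.7' into (3, 8, 7), ignoring non-numeric suffixes like 'rc1'."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    # Pad short versions such as '3.8' to three components.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_supported(version: str) -> bool:
    """True if the version falls inside the range >=3.8.7,<3.9.0."""
    return (3, 8, 7) <= parse_version(version) < (3, 9, 0)
```

In practice you would pass spacy.__version__ to is_supported and warn or abort if it returns False.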

Supported Entity Types

The model recognizes the following entity types:

CARDINAL - Cardinal numbers
DATE - Date expressions
EVENT - Events
FACILITY - Facilities
GPE - Geopolitical entities
LANGUAGE - Languages
LAW - Legal documents
LOCATION - Locations
MISC - Miscellaneous
MONEY - Monetary values
NORP - Nationalities or religious/political groups
ORDINAL - Ordinal numbers
ORGANIZATION - Organizations
PERCENT - Percentages
PERSON - People
PRODUCT - Products
QUANTITY - Quantities
TIME - Time expressions
TITLE - Titles
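
Downstream code often needs to group predictions by label and reject labels it does not expect. Below is a minimal sketch that works on (text, label) pairs such as (ent.text, ent.label_). SUPPORTED_LABELS and group_by_label are hypothetical names introduced here, and the sample pairs in the usage note are hand-written, not model output.

```python
from collections import defaultdict

# The 19 labels listed above.
SUPPORTED_LABELS = frozenset({
    "CARDINAL", "DATE", "EVENT", "FACILITY", "GPE", "LANGUAGE", "LAW",
    "LOCATION", "MISC", "MONEY", "NORP", "ORDINAL", "ORGANIZATION",
    "PERCENT", "PERSON", "PRODUCT", "QUANTITY", "TIME", "TITLE",
})


def group_by_label(pairs):
    """Group (text, label) pairs into {label: [text, ...]}.

    Raises ValueError for any label outside SUPPORTED_LABELS.
    """
    grouped = defaultdict(list)
    for text, label in pairs:
        if label not in SUPPORTED_LABELS:
            raise ValueError(f"Unexpected label: {label}")
        grouped[label].append(text)
    return dict(grouped)
```

For example, group_by_label([("Joko Widodo", "PERSON"), ("Jakarta", "GPE")]) returns a dictionary with one list per label.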

Usage

import spacy

# Load the model
nlp = spacy.load("asmud/ner-spacy-indonesian")

# Process text
doc = nlp("Presiden Joko Widodo mengunjungi Jakarta pada tanggal 17 Agustus 2024.")

# Extract entities
for ent in doc.ents:
    print(f"{ent.text} -> {ent.label_}")

Installation

pip install https://huggingface.co/asmud/ner-spacy-indonesian/resolve/main/ner-spacy-indonesian-any-py3-none-any.whl

After installing the package, load the model with spaCy:

import spacy
nlp = spacy.load("asmud/ner-spacy-indonesian")

Model Architecture

tok2vec: HashEmbedCNN with 96-dimensional embeddings, depth 4, embed size 2000
ner: Transition-based parser with 64 hidden units, maxout pieces 2
Training: 100 iterations with dropout 0.5, compounding batch sizes (4-32)
Optimizer: Adam (lr=0.001, L2=0.01, grad_clip=1.0)

Training Configuration

Training Data Format

The model was trained on data with custom XML-like tags:

Presiden <PERSON>Joko Widodo</PERSON> mengunjungi <GPE>Jakarta</GPE> pada <DATE>17 Agustus 2024</DATE>.
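
spaCy training examples are usually expressed as plain text plus character-offset spans, so tagged lines like the one above need to be converted first. The sketch below shows one way to do that with a regular expression. It is not the author's preprocessing code: parse_tagged is a hypothetical helper, and it assumes tags are uppercase ASCII labels that are never nested.

```python
import re

# Matches <LABEL>inner text</LABEL>, with the same label on both sides.
TAG_PATTERN = re.compile(r"<([A-Z]+)>(.*?)</\1>")


def parse_tagged(tagged: str):
    """Strip the tags and return (plain_text, [(start, end, label), ...])."""
    plain_parts = []
    entities = []
    cursor = 0      # read position in the tagged string
    plain_len = 0   # length of the plain text emitted so far
    for match in TAG_PATTERN.finditer(tagged):
        # Copy the untagged text that precedes this entity.
        before = tagged[cursor:match.start()]
        plain_parts.append(before)
        plain_len += len(before)
        # Record the entity using offsets into the plain text.
        label, inner = match.group(1), match.group(2)
        entities.append((plain_len, plain_len + len(inner), label))
        plain_parts.append(inner)
        plain_len += len(inner)
        cursor = match.end()
    plain_parts.append(tagged[cursor:])
    return "".join(plain_parts), entities


text, spans = parse_tagged(
    "Presiden <PERSON>Joko Widodo</PERSON> mengunjungi "
    "<GPE>Jakarta</GPE> pada <DATE>17 Agustus 2024</DATE>."
)
```

Each returned span satisfies text[start:end] == the tagged substring, which is the property spaCy's character-offset annotations rely on.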

Training Parameters

Iterations: 100 training iterations
Dropout: 0.5 during training
Batch Size: Compounding from 4 to 32 examples
Text Preprocessing: Lowercased input text
Data Shuffling: Random shuffling each iteration
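
To make "compounding from 4 to 32" concrete, the sketch below reimplements the idea behind spaCy's compounding batch-size schedule in plain Python: each batch size is the previous one multiplied by a constant factor, capped at the upper bound. The growth factor is not stated in this README; 1.001 is an assumption used purely for illustration.

```python
from itertools import islice


def compounding(start: float, stop: float, compound: float):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound


# Assumed growth factor of 1.001 (not taken from the README).
first_sizes = [round(s, 3) for s in islice(compounding(4.0, 32.0, 1.001), 3)]
```

With a factor this close to 1, the batch size grows slowly and takes roughly 2,000 batches to reach the cap of 32; a larger factor reaches it sooner.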

Architecture Details

Embedding Width: 96 dimensions
Hidden Width: 64 units
Embed Size: 2000 features
Window Size: 1
Maxout Pieces: 3 (tok2vec), 2 (parser)
Subword Features: Enabled

Model Evaluation

Performance Metrics

The model was evaluated on 2,987 examples drawn from the training data, so the results below measure fit to that data rather than generalization to unseen text:

Overall Performance

Precision: 0.9846
Recall: 0.9865
F1-score: 0.9856
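
The F1-score is the harmonic mean of precision and recall. As a quick consistency check, the sketch below recomputes it from the rounded figures above; f1_score is an illustrative helper, not part of the analyzer script.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


overall_f1 = f1_score(0.9846, 0.9865)
```

The recomputed value is about 0.9855, which matches the reported 0.9856 to within the rounding of the published precision and recall.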

Per-Entity Performance

Entity	Precision	Recall	F1-score
PRODUCT	1.0000	1.0000	1.0000
LOCATION	1.0000	1.0000	1.0000
LANGUAGE	1.0000	1.0000	1.0000
EVENT	0.9962	1.0000	0.9981
MISC	0.9973	0.9960	0.9966
FACILITY	0.9923	1.0000	0.9961
LAW	1.0000	0.9919	0.9959
TITLE	0.9947	0.9947	0.9947
GPE	1.0000	0.9886	0.9943
NORP	0.9872	1.0000	0.9935
PERSON	0.9935	0.9935	0.9935
DATE	0.9926	0.9830	0.9878
ORDINAL	0.9750	1.0000	0.9873
MONEY	0.9683	0.9946	0.9812
ORGANIZATION	0.9457	0.9905	0.9676
TIME	0.9476	0.9819	0.9645
QUANTITY	0.9874	0.9291	0.9574
PERCENT	0.8600	1.0000	0.9247
CARDINAL	0.9620	0.8736	0.9157
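
A per-entity breakdown like the table above is commonly computed by exact-match scoring: a predicted span counts as correct only if its start, end, and label all match a gold span. The sketch below illustrates that convention for a single text. Whether the analyzer script scores this way is an assumption, and per_label_scores is a hypothetical helper.

```python
from collections import Counter


def per_label_scores(gold, predicted):
    """Return {label: (precision, recall, f1)} using exact span matching.

    gold and predicted are iterables of (start, end, label) tuples
    for one text. For several texts, add a document id to each tuple.
    """
    gold_set, pred_set = set(gold), set(predicted)
    tp = Counter(label for _, _, label in gold_set & pred_set)
    fp = Counter(label for _, _, label in pred_set - gold_set)
    fn = Counter(label for _, _, label in gold_set - pred_set)
    scores = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores
```

Under this convention a span with correct boundaries but the wrong label counts as both a false positive for the predicted label and a false negative for the gold label.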

Evaluation Features

You can reproduce these metrics using the included analyzer script:

# Install required dependencies
pip install streamlit pandas

# Run the analyzer
streamlit run spacy_model_analyzer.py

The analyzer provides:

Interactive Analysis: Real-time entity recognition testing
Detailed Metrics: Precision, recall, and F1-score calculations
Text Alignment: Automatic handling of entity boundary alignment
Visualization: Entity highlighting and analysis tools
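
Entity boundary alignment generally means snapping character offsets that fall inside a token to the nearest token boundaries. The sketch below shows the "expand" variant of that idea. It assumes simple whitespace tokenization, which differs from spaCy's tokenizer, and it is not taken from the analyzer script; both function names are hypothetical.

```python
import re


def token_boundaries(text: str):
    """Character (start, end) offsets of whitespace-delimited tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def expand_to_tokens(text: str, start: int, end: int):
    """Widen [start, end) so it begins and ends on token boundaries.

    Returns None if the span does not overlap any token.
    """
    overlapping = [
        (s, e) for s, e in token_boundaries(text) if s < end and e > start
    ]
    if not overlapping:
        return None
    return overlapping[0][0], overlapping[-1][1]


sample = "Presiden Joko Widodo mengunjungi Jakarta"
aligned = expand_to_tokens(sample, 10, 18)  # span starts and ends mid-token
```

Within spaCy itself, Doc.char_span accepts an alignment_mode argument ("strict", "contract", or "expand") that performs the same kind of snapping against the pipeline's own tokenization.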