Indonesian NER spaCy Model
This is a Named Entity Recognition (NER) model for the Indonesian language, built with spaCy.
Model Details
- Language: Indonesian (id)
- Pipeline: ner
- spaCy Version: >=3.8.7,<3.9.0
- Model Architecture: Transition-based parser with HashEmbedCNN tok2vec
Supported Entity Types
The model recognizes the following entity types:
- CARDINAL: Cardinal numbers
- DATE: Date expressions
- EVENT: Events
- FACILITY: Facilities
- GPE: Geopolitical entities
- LANGUAGE: Languages
- LAW: Legal documents
- LOCATION: Locations
- MISC: Miscellaneous
- MONEY: Monetary values
- NORP: Nationalities or religious/political groups
- ORDINAL: Ordinal numbers
- ORGANIZATION: Organizations
- PERCENT: Percentages
- PERSON: People
- PRODUCT: Products
- QUANTITY: Quantities
- TIME: Time expressions
- TITLE: Titles
Usage
import spacy
# Load the model
nlp = spacy.load("asmud/ner-spacy-indonesian")
# Process text
doc = nlp("Presiden Joko Widodo mengunjungi Jakarta pada tanggal 17 Agustus 2024.")
# Extract entities
for ent in doc.ents:
    print(f"{ent.text} -> {ent.label_}")
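Downstream code often wants entities grouped by type rather than printed one by one. The following is a minimal sketch of that step. The helper name `group_entities` is hypothetical, and the stand-in objects only mimic the `text` and `label_` attributes of spaCy spans so the example runs without the model installed; their labels are illustrative, not actual model output.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_entities(ents):
    """Group entity texts by label, keeping document order within each label."""
    grouped = defaultdict(list)
    for ent in ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)


# Stand-ins for spaCy spans (assumption: illustrative labels, not real predictions).
# With the real pipeline, pass doc.ents instead.
demo_ents = [
    SimpleNamespace(text="Joko Widodo", label_="PERSON"),
    SimpleNamespace(text="Jakarta", label_="GPE"),
    SimpleNamespace(text="17 Agustus 2024", label_="DATE"),
]

for label, texts in group_entities(demo_ents).items():
    print(label, texts)
```

With a loaded pipeline, `group_entities(doc.ents)` should work the same way, since spaCy spans expose both attributes.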
Installation
pip install https://huggingface.co/asmud/ner-spacy-indonesian/resolve/main/ner-spacy-indonesian-any-py3-none-any.whl
Then load the model with spaCy:
import spacy
nlp = spacy.load("asmud/ner-spacy-indonesian")
Model Architecture
- tok2vec: HashEmbedCNN with 96-dimensional embeddings, depth 4, embed size 2000
- ner: Transition-based parser with 64 hidden units, maxout pieces 2
- Training: 100 iterations with dropout 0.5, compounding batch sizes (4-32)
- Optimizer: Adam (lr=0.001, L2=0.01, grad_clip=1.0)
Training Configuration
Training Data Format
The model was trained on data with custom XML-like tags:
Presiden <PERSON>Joko Widodo</PERSON> mengunjungi <GPE>Jakarta</GPE> pada <DATE>17 Agustus 2024</DATE>.
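spaCy training examples are usually expressed as plain text plus character offsets, so tagged data like the line above has to be converted first. The sketch below shows one way to do that with a regular expression. The function name `parse_tagged` is hypothetical, the conversion script actually used for this model is not shown here, and the sketch assumes uppercase tag names with no nesting.

```python
import re

# Matches <LABEL>text</LABEL>; assumes uppercase labels and no nested tags.
TAG_PATTERN = re.compile(r"<([A-Z]+)>(.*?)</\1>")


def parse_tagged(tagged):
    """Return (plain_text, [(start, end, label), ...]) using character offsets."""
    parts = []      # pieces of the tag-free text
    entities = []   # (start, end, label) offsets into the tag-free text
    cursor = 0      # read position in the tagged string
    length = 0      # length of the tag-free text built so far
    for match in TAG_PATTERN.finditer(tagged):
        before = tagged[cursor:match.start()]
        parts.append(before)
        length += len(before)
        label, inner = match.group(1), match.group(2)
        entities.append((length, length + len(inner), label))
        parts.append(inner)
        length += len(inner)
        cursor = match.end()
    parts.append(tagged[cursor:])
    return "".join(parts), entities


sample = (
    "Presiden <PERSON>Joko Widodo</PERSON> mengunjungi "
    "<GPE>Jakarta</GPE> pada <DATE>17 Agustus 2024</DATE>."
)
text, entities = parse_tagged(sample)
for start, end, label in entities:
    print(label, repr(text[start:end]))
```

The resulting offsets index into the tag-free text, which is the form spaCy expects for character-based entity annotations.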
Training Parameters
- Iterations: 100 training iterations
- Dropout: 0.5 during training
- Batch Size: Compounding from 4 to 32 examples
- Text Preprocessing: Lowercased input text
- Data Shuffling: Random shuffling each iteration
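To illustrate what a compounding batch size means, here is a small standard-library sketch: the batch size starts at 4 and is multiplied by a fixed rate after every batch until it reaches the cap of 32. The helper names and the growth rate of 1.5 are assumptions chosen to make the growth visible; the rate used for this model is not stated above.

```python
from itertools import islice


def compounding(start, stop, rate):
    """Yield start, start*rate, start*rate**2, ... capped at stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= rate


def minibatch(items, sizes):
    """Split items into batches whose lengths follow the sizes generator."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, int(next(sizes))))
        if not batch:
            return
        yield batch


# Assumption: rate=1.5 is illustrative only.
examples = list(range(100))
batch_lengths = [len(b) for b in minibatch(examples, compounding(4, 32, 1.5))]
print(batch_lengths)
```

Small early batches give more frequent, noisier updates at the start of training, while the larger later batches give steadier gradient estimates.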
Architecture Details
- Embedding Width: 96 dimensions
- Hidden Width: 64 units
- Embed Size: 2000 features
- Window Size: 1
- Maxout Pieces: 3 (tok2vec), 2 (parser)
- Subword Features: Enabled
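For readers who want to build a comparable pipeline, the settings above can be collected into a Python dict of the shape that spaCy's `nlp.add_pipe("ner", config=...)` accepts. This is a sketch under assumptions: the registered architecture names and versions (`spacy.TransitionBasedParser.v2`, `spacy.HashEmbedCNN.v2`) and the values for `state_type`, `extra_state_tokens`, `use_upper`, and `pretrained_vectors` are not given in this README and may differ from the config actually used.

```python
# Hyperparameters listed in this README, arranged as a spaCy component config.
# Assumption: architecture versions and the fields marked below are not
# confirmed by the README.
ner_config = {
    "model": {
        "@architectures": "spacy.TransitionBasedParser.v2",  # assumed version
        "state_type": "ner",            # assumed
        "extra_state_tokens": False,    # assumed
        "use_upper": True,              # assumed
        "hidden_width": 64,             # Hidden Width
        "maxout_pieces": 2,             # Maxout Pieces (parser)
        "tok2vec": {
            "@architectures": "spacy.HashEmbedCNN.v2",  # assumed version
            "pretrained_vectors": None,  # assumed
            "width": 96,                 # Embedding Width
            "depth": 4,                  # depth from Model Architecture
            "embed_size": 2000,          # Embed Size
            "window_size": 1,            # Window Size
            "maxout_pieces": 3,          # Maxout Pieces (tok2vec)
            "subword_features": True,    # Subword Features
        },
    }
}

print(ner_config["model"]["tok2vec"])
```

With spaCy installed, something like `spacy.blank("id").add_pipe("ner", config=ner_config)` would create an untrained component with these settings.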
Model Evaluation
Performance Metrics
The model was evaluated on 2,987 examples from the training data with the following results:
Overall Performance
- Precision: 0.9846
- Recall: 0.9865
- F1-score: 0.9856
Per-Entity Performance
Entity | Precision | Recall | F1-score |
---|---|---|---|
PRODUCT | 1.0000 | 1.0000 | 1.0000 |
LOCATION | 1.0000 | 1.0000 | 1.0000 |
LANGUAGE | 1.0000 | 1.0000 | 1.0000 |
EVENT | 0.9962 | 1.0000 | 0.9981 |
MISC | 0.9973 | 0.9960 | 0.9966 |
FACILITY | 0.9923 | 1.0000 | 0.9961 |
LAW | 1.0000 | 0.9919 | 0.9959 |
TITLE | 0.9947 | 0.9947 | 0.9947 |
GPE | 1.0000 | 0.9886 | 0.9943 |
NORP | 0.9872 | 1.0000 | 0.9935 |
PERSON | 0.9935 | 0.9935 | 0.9935 |
DATE | 0.9926 | 0.9830 | 0.9878 |
ORDINAL | 0.9750 | 1.0000 | 0.9873 |
MONEY | 0.9683 | 0.9946 | 0.9812 |
ORGANIZATION | 0.9457 | 0.9905 | 0.9676 |
TIME | 0.9476 | 0.9819 | 0.9645 |
QUANTITY | 0.9874 | 0.9291 | 0.9574 |
PERCENT | 0.8600 | 1.0000 | 0.9247 |
CARDINAL | 0.9620 | 0.8736 | 0.9157 |
Evaluation Features
You can reproduce these metrics using the included analyzer script:
# Install required dependencies
pip install streamlit pandas
# Run the analyzer
streamlit run spacy_model_analyzer.py
The analyzer provides:
- Interactive Analysis: Real-time entity recognition testing
- Detailed Metrics: Precision, recall, and F1-score calculations
- Text Alignment: Automatic handling of entity boundary alignment
- Visualization: Entity highlighting and analysis tools
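As a reference for how entity-level precision, recall, and F1 are commonly computed, the sketch below scores predicted spans against gold spans using exact matching: a prediction counts as correct only if its start, end, and label all agree with a gold span. This is an assumption about the scoring convention; the logic inside `spacy_model_analyzer.py` is not shown in this README, and the spans below are made-up illustrations.

```python
def prf(gold, pred):
    """Exact-match precision, recall, and F1 over (start, end, label) spans."""
    gold, pred = set(gold), set(pred)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative spans (assumption: not real model output).
gold_spans = [(9, 20, "PERSON"), (33, 40, "GPE"), (46, 61, "DATE")]
pred_spans = [
    (9, 20, "PERSON"),     # correct
    (33, 40, "LOCATION"),  # right boundaries, wrong label
    (46, 61, "DATE"),      # correct
    (0, 8, "TITLE"),       # spurious prediction
]

precision, recall, f1 = prf(gold_spans, pred_spans)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

Under this convention a span with the right boundaries but the wrong label counts as both a false positive and a false negative, which is why label confusions lower precision and recall at the same time.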