Indonesian NER spaCy Model

This is a Named Entity Recognition (NER) model for the Indonesian language, built with spaCy.

Model Details

  • Language: Indonesian (id)
  • Pipeline: ner
  • spaCy Version: >=3.8.7,<3.9.0
  • Model Architecture: Transition-based parser with HashEmbedCNN tok2vec

Supported Entity Types

The model recognizes the following entity types:

  • CARDINAL - Cardinal numbers
  • DATE - Date expressions
  • EVENT - Events
  • FACILITY - Facilities
  • GPE - Geopolitical entities
  • LANGUAGE - Languages
  • LAW - Legal documents
  • LOCATION - Locations
  • MISC - Miscellaneous
  • MONEY - Monetary values
  • NORP - Nationalities or religious/political groups
  • ORDINAL - Ordinal numbers
  • ORGANIZATION - Organizations
  • PERCENT - Percentages
  • PERSON - People
  • PRODUCT - Products
  • QUANTITY - Quantities
  • TIME - Time expressions
  • TITLE - Titles

Usage

import spacy

# Load the model
nlp = spacy.load("asmud/ner-spacy-indonesian")

# Process text
doc = nlp("Presiden Joko Widodo mengunjungi Jakarta pada tanggal 17 Agustus 2024.")

# Extract entities
for ent in doc.ents:
    print(f"{ent.text} -> {ent.label_}")
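Downstream code often needs the recognized entities grouped by type rather than printed one by one. The following is a minimal sketch of such a grouping step; `group_entities` and `sample_pairs` are hypothetical names introduced here for illustration, and the sample pairs are hand-written placeholders, not output from the model.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_entities(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (entity text, label) pairs by label, keeping first-seen order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


# With a loaded pipeline, the pairs would come from doc.ents:
#   group_entities((ent.text, ent.label_) for ent in doc.ents)
# The pairs below are hand-written placeholders, not model output.
sample_pairs = [("Joko Widodo", "PERSON"), ("Jakarta", "GPE"), ("Bandung", "GPE")]
grouped_sample = group_entities(sample_pairs)
```

Taking plain (text, label) pairs instead of a `Doc` keeps the helper independent of spaCy, so it can be exercised without the model installed.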

Installation

pip install https://huggingface.co/asmud/ner-spacy-indonesian/resolve/main/ner-spacy-indonesian-any-py3-none-any.whl

Once installed, load it with spaCy:

import spacy
nlp = spacy.load("asmud/ner-spacy-indonesian")

Model Architecture

  • tok2vec: HashEmbedCNN with 96-dimensional embeddings, depth 4, embed size 2000
  • ner: Transition-based parser with 64 hidden units, maxout pieces 2
  • Training: 100 iterations with dropout 0.5, compounding batch sizes (4-32)
  • Optimizer: Adam (lr=0.001, L2=0.01, grad_clip=1.0)

Training Configuration

Training Data Format

The model was trained on data with custom XML-like tags:

Presiden <PERSON>Joko Widodo</PERSON> mengunjungi <GPE>Jakarta</GPE> pada <DATE>17 Agustus 2024</DATE>.
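To train a spaCy NER component, tagged text like this has to be turned into plain text plus character offsets. The sketch below shows one way that conversion might look; `parse_tagged` and `TAG_PATTERN` are hypothetical names, and the code assumes tags are uppercase, well-formed, and not nested. It is not the author's actual preprocessing script.

```python
import re
from typing import List, Tuple

# Matches <LABEL>inner text</LABEL>; assumes uppercase labels and no nesting.
TAG_PATTERN = re.compile(r"<([A-Z]+)>(.*?)</\1>", re.DOTALL)


def parse_tagged(tagged: str) -> Tuple[str, List[Tuple[int, int, str]]]:
    """Strip the tags and return (plain_text, [(start, end, label), ...])."""
    parts: List[str] = []
    entities: List[Tuple[int, int, str]] = []
    cursor = 0  # read position in the tagged string
    length = 0  # length of the plain text built so far
    for match in TAG_PATTERN.finditer(tagged):
        before = tagged[cursor:match.start()]
        parts.append(before)
        length += len(before)
        label, inner = match.group(1), match.group(2)
        entities.append((length, length + len(inner), label))
        parts.append(inner)
        length += len(inner)
        cursor = match.end()
    parts.append(tagged[cursor:])
    return "".join(parts), entities


example = (
    "Presiden <PERSON>Joko Widodo</PERSON> mengunjungi <GPE>Jakarta</GPE> "
    "pada <DATE>17 Agustus 2024</DATE>."
)
plain_text, spans = parse_tagged(example)
```

Each `(start, end, label)` triple indexes into `plain_text`, so `plain_text[start:end]` recovers the entity string.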

Training Parameters

  • Iterations: 100 training iterations
  • Dropout: 0.5 during training
  • Batch Size: Compounding from 4 to 32 examples
  • Text Preprocessing: Lowercased input text
  • Data Shuffling: Random shuffling each iteration
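The compounding batch size and per-iteration shuffling can be sketched in plain Python as follows. The growth rate of 1.001 is an assumption (the source gives only the 4 to 32 range), and `compounding`, `minibatches`, and `training_batches` are illustrative stand-ins written here, not the author's training loop.

```python
import random
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def compounding(start: float, stop: float, rate: float) -> Iterator[float]:
    """Yield start, start*rate, start*rate**2, ... capped at stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= rate


def minibatches(items: Sequence[T], sizes: Iterator[float]) -> Iterator[List[T]]:
    """Cut items into consecutive batches whose sizes follow the schedule."""
    index = 0
    while index < len(items):
        size = int(next(sizes))
        yield list(items[index:index + size])
        index += size


def training_batches(examples: Sequence[T], n_iter: int = 100,
                     rate: float = 1.001, seed: int = 0) -> Iterator[List[T]]:
    """Shuffle each iteration, then batch with a 4-to-32 schedule.

    Assumption: the schedule restarts every iteration; the source does not
    say whether it restarts or carries over.
    """
    rng = random.Random(seed)
    data = list(examples)
    for _ in range(n_iter):
        rng.shuffle(data)
        yield from minibatches(data, compounding(4.0, 32.0, rate))
```

Small early batches give noisier, more frequent updates, while the growing size smooths gradients later in each pass.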

Architecture Details

  • Embedding Width: 96 dimensions
  • Hidden Width: 64 units
  • Embed Size: 2000 features
  • Window Size: 1
  • Maxout Pieces: 3 (tok2vec), 2 (parser)
  • Subword Features: Enabled
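For readers who want to rebuild a comparable component, the hyperparameters above can be written as a spaCy model config dictionary. This is a sketch: the registered architecture names with their `.v2` suffixes, and the values for `state_type`, `extra_state_tokens`, `use_upper`, and `pretrained_vectors`, are assumptions not stated in this card. The widths, depth, embed size, window size, and maxout pieces come from the lists above.

```python
# Hypothetical reconstruction of the NER model config from the listed values.
NER_MODEL_CONFIG = {
    "@architectures": "spacy.TransitionBasedParser.v2",  # assumed version
    "state_type": "ner",            # assumption: not stated in the card
    "extra_state_tokens": False,    # assumption: not stated in the card
    "hidden_width": 64,
    "maxout_pieces": 2,
    "use_upper": True,              # assumption: not stated in the card
    "tok2vec": {
        "@architectures": "spacy.HashEmbedCNN.v2",  # assumed version
        "pretrained_vectors": None,  # assumption: no static vectors
        "width": 96,
        "depth": 4,
        "embed_size": 2000,
        "window_size": 1,
        "maxout_pieces": 3,
        "subword_features": True,
    },
}

# With spaCy installed, a blank pipeline could use it roughly like this:
#   nlp = spacy.blank("id")
#   nlp.add_pipe("ner", config={"model": NER_MODEL_CONFIG})
```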

Model Evaluation

Performance Metrics

The model was evaluated on 2,987 examples drawn from the training data, so the scores below measure fit to that data rather than performance on held-out text:

Overall Performance

  • Precision: 0.9846
  • Recall: 0.9865
  • F1-score: 0.9856

Per-Entity Performance

Entity          Precision   Recall   F1-score
PRODUCT         1.0000      1.0000   1.0000
LOCATION        1.0000      1.0000   1.0000
LANGUAGE        1.0000      1.0000   1.0000
EVENT           0.9962      1.0000   0.9981
MISC            0.9973      0.9960   0.9966
FACILITY        0.9923      1.0000   0.9961
LAW             1.0000      0.9919   0.9959
TITLE           0.9947      0.9947   0.9947
GPE             1.0000      0.9886   0.9943
NORP            0.9872      1.0000   0.9935
PERSON          0.9935      0.9935   0.9935
DATE            0.9926      0.9830   0.9878
ORDINAL         0.9750      1.0000   0.9873
MONEY           0.9683      0.9946   0.9812
ORGANIZATION    0.9457      0.9905   0.9676
TIME            0.9476      0.9819   0.9645
QUANTITY        0.9874      0.9291   0.9574
PERCENT         0.8600      1.0000   0.9247
CARDINAL        0.9620      0.8736   0.9157
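Per-entity scores of this kind are usually derived by comparing predicted spans against gold spans. The sketch below assumes exact-match scoring, where a prediction counts as correct only if start, end, and label all agree. The card does not state which matching rule the analyzer applies, and `score_spans` is a hypothetical helper, not part of the included script.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, label)


def score_spans(gold_docs: Iterable[Set[Span]],
                pred_docs: Iterable[Set[Span]]) -> Dict[str, Dict[str, float]]:
    """Per-label precision, recall, and F1 under exact span matching."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_docs, pred_docs):
        gold, pred = set(gold), set(pred)
        for span in pred:
            # A predicted span is a true positive only on an exact match.
            (tp if span in gold else fp)[span[2]] += 1
        for span in gold - pred:
            fn[span[2]] += 1

    scores: Dict[str, Dict[str, float]] = {}
    for label in set(tp) | set(fp) | set(fn):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        p = tp[label] / predicted if predicted else 0.0
        r = tp[label] / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"precision": p, "recall": r, "f1": f1}
    return scores
```

Under this rule a boundary error costs both a false positive and a false negative, which is one way precision and recall can diverge for a label such as PERCENT or CARDINAL.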

Evaluation Features

You can reproduce these metrics using the included analyzer script:

# Install required dependencies
pip install streamlit pandas

# Run the analyzer
streamlit run spacy_model_analyzer.py

The analyzer provides:

  • Interactive Analysis: Real-time entity recognition testing
  • Detailed Metrics: Precision, recall, and F1-score calculations
  • Text Alignment: Automatic handling of entity boundary alignment
  • Visualization: Entity highlighting and analysis tools