---
license: gpl-2.0
language: id
tags:
- spacy
- ner
- token-classification
- indonesian
library_name: spacy
---

# Indonesian NER spaCy Model

This is a Named Entity Recognition (NER) model for the Indonesian language, built with spaCy.

## Model Details

- **Language**: Indonesian (`id`)
- **Pipeline**: `ner`
- **spaCy Version**: `>=3.8.7,<3.9.0`
- **Model Architecture**: Transition-based parser with HashEmbedCNN tok2vec

## Supported Entity Types

The model recognizes the following entity types:

- `CARDINAL` - Cardinal numbers
- `DATE` - Date expressions
- `EVENT` - Events
- `FACILITY` - Facilities
- `GPE` - Geopolitical entities
- `LANGUAGE` - Languages
- `LAW` - Legal documents
- `LOCATION` - Locations
- `MISC` - Miscellaneous
- `MONEY` - Monetary values
- `NORP` - Nationalities or religious/political groups
- `ORDINAL` - Ordinal numbers
- `ORGANIZATION` - Organizations
- `PERCENT` - Percentages
- `PERSON` - People
- `PRODUCT` - Products
- `QUANTITY` - Quantities
- `TIME` - Time expressions
- `TITLE` - Titles

## Usage

```python
import spacy

# Load the model
nlp = spacy.load("asmud/ner-spacy-indonesian")

# Process text
doc = nlp("Presiden Joko Widodo mengunjungi Jakarta pada tanggal 17 Agustus 2024.")

# Extract entities
for ent in doc.ents:
    print(f"{ent.text} -> {ent.label_}")
```

## Installation

```bash
pip install https://huggingface.co/asmud/ner-spacy-indonesian/resolve/main/ner-spacy-indonesian-any-py3-none-any.whl
```

Or use with spaCy:

```python
import spacy

nlp = spacy.load("asmud/ner-spacy-indonesian")
```

## Model Architecture

- **tok2vec**: HashEmbedCNN with 96-dimensional embeddings, depth 4, embed size 2000
- **ner**: Transition-based parser with 64 hidden units, maxout pieces 2
- **Training**: 100 iterations with dropout 0.5, compounding batch sizes (4-32)
- **Optimizer**: Adam (lr=0.001, L2=0.01, grad_clip=1.0)

## Training Configuration

### Training Data Format

The model was trained on data with custom XML-like tags:

```
Presiden Joko Widodo mengunjungi Jakarta pada 17 Agustus 2024.
```

### Training Parameters

- **Iterations**: 100 training iterations
- **Dropout**: 0.5 during training
- **Batch Size**: Compounding from 4 to 32 examples
- **Text Preprocessing**: Lowercased input text
- **Data Shuffling**: Random shuffling each iteration

### Architecture Details

- **Embedding Width**: 96 dimensions
- **Hidden Width**: 64 units
- **Embed Size**: 2000 features
- **Window Size**: 1
- **Maxout Pieces**: 3 (tok2vec), 2 (parser)
- **Subword Features**: Enabled

## Model Evaluation

### Performance Metrics

The model was evaluated on 2,987 examples from the training data, with the following results:

#### Overall Performance

- **Precision**: 0.9846
- **Recall**: 0.9865
- **F1-score**: 0.9856

#### Per-Entity Performance

| Entity | Precision | Recall | F1-score |
|--------|-----------|--------|----------|
| PRODUCT | 1.0000 | 1.0000 | 1.0000 |
| LOCATION | 1.0000 | 1.0000 | 1.0000 |
| LANGUAGE | 1.0000 | 1.0000 | 1.0000 |
| EVENT | 0.9962 | 1.0000 | 0.9981 |
| MISC | 0.9973 | 0.9960 | 0.9966 |
| FACILITY | 0.9923 | 1.0000 | 0.9961 |
| LAW | 1.0000 | 0.9919 | 0.9959 |
| TITLE | 0.9947 | 0.9947 | 0.9947 |
| GPE | 1.0000 | 0.9886 | 0.9943 |
| NORP | 0.9872 | 1.0000 | 0.9935 |
| PERSON | 0.9935 | 0.9935 | 0.9935 |
| DATE | 0.9926 | 0.9830 | 0.9878 |
| ORDINAL | 0.9750 | 1.0000 | 0.9873 |
| MONEY | 0.9683 | 0.9946 | 0.9812 |
| ORGANIZATION | 0.9457 | 0.9905 | 0.9676 |
| TIME | 0.9476 | 0.9819 | 0.9645 |
| QUANTITY | 0.9874 | 0.9291 | 0.9574 |
| PERCENT | 0.8600 | 1.0000 | 0.9247 |
| CARDINAL | 0.9620 | 0.8736 | 0.9157 |

### Evaluation Features

You can reproduce these metrics using the included analyzer script:

```bash
# Install required dependencies
pip install streamlit pandas

# Run the analyzer
streamlit run spacy_model_analyzer.py
```

The analyzer provides:

- **Interactive Analysis**: Real-time entity recognition testing
- **Detailed Metrics**: Precision, recall, and F1-score calculations
- **Text Alignment**: Automatic handling of entity boundary alignment
- **Visualization**: Entity highlighting and analysis tools
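
For readers who want to see how precision, recall, and F1-score figures like the ones reported above can be derived, here is a minimal sketch of exact-match span scoring in plain Python. It is not the code from `spacy_model_analyzer.py`; the function names `prf` and `score_spans` and the sample spans are hypothetical, and it assumes an entity counts as correct only when its start offset, end offset, and label all match the gold annotation.

```python
from collections import Counter


def prf(tp, fp, fn):
    """Return (precision, recall, f1) from raw counts; 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def score_spans(gold_docs, pred_docs):
    """Score predictions against gold spans using exact matching.

    Each document is a list of (start_char, end_char, label) tuples.
    Returns (overall, per_label), where every value is (precision, recall, f1).
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_docs, pred_docs):
        gold_set, pred_set = set(gold), set(pred)
        for _, _, label in gold_set & pred_set:
            tp[label] += 1  # predicted and present in gold
        for _, _, label in pred_set - gold_set:
            fp[label] += 1  # predicted but not in gold
        for _, _, label in gold_set - pred_set:
            fn[label] += 1  # in gold but missed
    labels = sorted(set(tp) | set(fp) | set(fn))
    per_label = {name: prf(tp[name], fp[name], fn[name]) for name in labels}
    overall = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return overall, per_label


# Hypothetical annotations for one sentence: the prediction gets the
# boundaries of "Jakarta" right but assigns the wrong label.
gold = [[(9, 20, "PERSON"), (33, 40, "GPE"), (54, 69, "DATE")]]
pred = [[(9, 20, "PERSON"), (33, 40, "LOCATION"), (54, 69, "DATE")]]

overall, per_label = score_spans(gold, pred)
print("overall P/R/F1:", overall)
for name, scores in per_label.items():
    print(name, scores)
```

In a spaCy pipeline, the prediction tuples could be built from `doc.ents` as `(ent.start_char, ent.end_char, ent.label_)`. Note that under exact matching a mislabeled span counts as both a false positive for the predicted label and a false negative for the gold label.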