mac committed
Commit 47301d1 · 1 Parent(s): 5f2bda5
Add Indonesian NER spaCy model with comprehensive documentation
- Add complete spaCy model files with NER pipeline for Indonesian text
- Include detailed performance metrics (F1: 0.9856, Precision: 0.9846, Recall: 0.9865)
- Add comprehensive README with training configuration and architecture details
- Support 19 entity types with per-entity performance breakdown
- Configure LFS tracking for large model files
- Include evaluation tools and usage instructions
Author: Asep Muhamad
- .gitattributes +2 -0
- README.md +147 -0
- config.cfg +130 -0
- meta.json +51 -0
- ner/cfg +13 -0
- ner/model +3 -0
- ner/moves +1 -0
- tokenizer +0 -0
- vocab/key2row +1 -0
- vocab/lookups.bin +3 -0
- vocab/strings.json +0 -0
- vocab/vectors +3 -0
- vocab/vectors.cfg +3 -0
.gitattributes
CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
36 |
+
vocab/vectors filter=lfs diff=lfs merge=lfs -text
|
37 |
+
ner/model filter=lfs diff=lfs merge=lfs -text
|
README.md
CHANGED
@@ -1,3 +1,150 @@
1 |
---
|
2 |
license: gpl-2.0
|
3 |
+
language: id
|
4 |
+
tags:
|
5 |
+
- spacy
|
6 |
+
- ner
|
7 |
+
- token-classification
|
8 |
+
- indonesian
|
9 |
+
library_name: spacy
|
10 |
---
|
11 |
+
|
12 |
+
# Indonesian NER spaCy Model
|
13 |
+
|
14 |
+
This is a Named Entity Recognition (NER) model for the Indonesian language, built with spaCy.
|
15 |
+
|
16 |
+
## Model Details
|
17 |
+
|
18 |
+
- **Language**: Indonesian (`id`)
|
19 |
+
- **Pipeline**: `ner`
|
20 |
+
- **spaCy Version**: `>=3.8.7,<3.9.0`
|
21 |
+
- **Model Architecture**: Transition-based parser with HashEmbedCNN tok2vec
|
22 |
+
|
23 |
+
## Supported Entity Types
|
24 |
+
|
25 |
+
The model recognizes the following entity types:
|
26 |
+
|
27 |
+
- `CARDINAL` - Cardinal numbers
|
28 |
+
- `DATE` - Date expressions
|
29 |
+
- `EVENT` - Events
|
30 |
+
- `FACILITY` - Facilities
|
31 |
+
- `GPE` - Geopolitical entities
|
32 |
+
- `LANGUAGE` - Languages
|
33 |
+
- `LAW` - Legal documents
|
34 |
+
- `LOCATION` - Locations
|
35 |
+
- `MISC` - Miscellaneous
|
36 |
+
- `MONEY` - Monetary values
|
37 |
+
- `NORP` - Nationalities or religious/political groups
|
38 |
+
- `ORDINAL` - Ordinal numbers
|
39 |
+
- `ORGANIZATION` - Organizations
|
40 |
+
- `PERCENT` - Percentages
|
41 |
+
- `PERSON` - People
|
42 |
+
- `PRODUCT` - Products
|
43 |
+
- `QUANTITY` - Quantities
|
44 |
+
- `TIME` - Time expressions
|
45 |
+
- `TITLE` - Titles
|
46 |
+
|
47 |
+
## Usage
|
48 |
+
|
49 |
+
```python
|
50 |
+
import spacy
|
51 |
+
|
52 |
+
# Load the model
|
53 |
+
nlp = spacy.load("asmud/ner-spacy-indonesian")
|
54 |
+
|
55 |
+
# Process text
|
56 |
+
doc = nlp("Presiden Joko Widodo mengunjungi Jakarta pada tanggal 17 Agustus 2024.")
|
57 |
+
|
58 |
+
# Extract entities
|
59 |
+
for ent in doc.ents:
|
60 |
+
print(f"{ent.text} -> {ent.label_}")
|
61 |
+
```
|
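`spacy.load` accepts an installed package name or a filesystem path, so one way to try the files in this commit is to point it at a local copy of the repository. The sketch below is an assumption, not part of the commit: `missing_pipeline_files` is a hypothetical helper, and the list of expected entries is taken from the files this commit adds.

```python
from pathlib import Path

# Top-level entries added by this commit; a saved spaCy pipeline directory
# is assumed to need all of them before it can be loaded.
EXPECTED_ENTRIES = ["config.cfg", "meta.json", "tokenizer", "ner", "vocab"]


def missing_pipeline_files(model_dir):
    """Return the expected entries that are absent from model_dir (hypothetical helper)."""
    root = Path(model_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]
```

If the returned list is empty, `spacy.load(model_dir)` can be tried on that directory, using a spaCy version inside the `>=3.8.7,<3.9.0` range declared in `meta.json`.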
62 |
+
|
63 |
+
## Installation
|
64 |
+
|
65 |
+
```bash
|
66 |
+
pip install https://huggingface.co/asmud/ner-spacy-indonesian/resolve/main/ner-spacy-indonesian-any-py3-none-any.whl
|
67 |
+
```
|
68 |
+
|
69 |
+
Or use with spaCy:
|
70 |
+
|
71 |
+
```python
|
72 |
+
import spacy
|
73 |
+
nlp = spacy.load("asmud/ner-spacy-indonesian")
|
74 |
+
```
|
75 |
+
|
76 |
+
## Model Architecture
|
77 |
+
|
78 |
+
- **tok2vec**: HashEmbedCNN with 96-dimensional embeddings, depth 4, embed size 2000
|
79 |
+
- **ner**: Transition-based parser with 64 hidden units, maxout pieces 2
|
80 |
+
- **Training**: 100 iterations with dropout 0.5, compounding batch sizes (4-32)
|
81 |
+
- **Optimizer**: Adam (lr=0.001, L2=0.01, grad_clip=1.0)
|
82 |
+
|
83 |
+
## Training Configuration
|
84 |
+
|
85 |
+
### Training Data Format
|
86 |
+
The model was trained on data with custom XML-like tags:
|
87 |
+
```
|
88 |
+
Presiden <PERSON>Joko Widodo</PERSON> mengunjungi <GPE>Jakarta</GPE> pada <DATE>17 Agustus 2024</DATE>.
|
89 |
+
```
|
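The conversion script for this tagged format is not part of the commit. As a minimal sketch of how such lines could be turned into plain text plus character offsets, assuming tags are uppercase, never nested, and always closed (`parse_tagged` is a hypothetical name):

```python
import re

# Matches <LABEL>text</LABEL>; the backreference \1 forces the closing tag
# to carry the same label as the opening tag.
TAG_PATTERN = re.compile(r"<([A-Z]+)>(.*?)</\1>")


def parse_tagged(line):
    """Strip inline tags and return (plain_text, [(start, end, label), ...])."""
    plain_parts = []
    entities = []
    cursor = 0  # read position in the tagged input
    offset = 0  # length of the plain text emitted so far
    for match in TAG_PATTERN.finditer(line):
        before = line[cursor:match.start()]
        plain_parts.append(before)
        offset += len(before)
        label, inner = match.group(1), match.group(2)
        # Offsets refer to the untagged text, end exclusive.
        entities.append((offset, offset + len(inner), label))
        plain_parts.append(inner)
        offset += len(inner)
        cursor = match.end()
    plain_parts.append(line[cursor:])
    return "".join(plain_parts), entities
```

For the example line above, the first entity comes out as `(9, 20, "PERSON")`, the span covering `Joko Widodo` in the untagged sentence.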
90 |
+
|
91 |
+
### Training Parameters
|
92 |
+
- **Iterations**: 100 training iterations
|
93 |
+
- **Dropout**: 0.5 during training
|
94 |
+
- **Batch Size**: Compounding from 4 to 32 examples
|
95 |
+
- **Text Preprocessing**: Lowercased input text
|
96 |
+
- **Data Shuffling**: Random shuffling each iteration
|
97 |
+
|
98 |
+
### Architecture Details
|
99 |
+
- **Embedding Width**: 96 dimensions
|
100 |
+
- **Hidden Width**: 64 units
|
101 |
+
- **Embed Size**: 2000 features
|
102 |
+
- **Window Size**: 1
|
103 |
+
- **Maxout Pieces**: 3 (tok2vec), 2 (parser)
|
104 |
+
- **Subword Features**: Enabled
|
105 |
+
|
106 |
+
## Model Evaluation
|
107 |
+
|
108 |
+
### Performance Metrics
|
109 |
+
The model was evaluated on 2,987 examples from the training data with the following results:
|
110 |
+
|
111 |
+
#### Overall Performance
|
112 |
+
- **Precision**: 0.9846
|
113 |
+
- **Recall**: 0.9865
|
114 |
+
- **F1-score**: 0.9856
|
115 |
+
|
116 |
+
#### Per-Entity Performance
|
117 |
+
| Entity | Precision | Recall | F1-score |
|
118 |
+
|--------|-----------|--------|----------|
|
119 |
+
| PRODUCT | 1.0000 | 1.0000 | 1.0000 |
|
120 |
+
| LOCATION | 1.0000 | 1.0000 | 1.0000 |
|
121 |
+
| LANGUAGE | 1.0000 | 1.0000 | 1.0000 |
|
122 |
+
| EVENT | 0.9962 | 1.0000 | 0.9981 |
|
123 |
+
| MISC | 0.9973 | 0.9960 | 0.9966 |
|
124 |
+
| FACILITY | 0.9923 | 1.0000 | 0.9961 |
|
125 |
+
| LAW | 1.0000 | 0.9919 | 0.9959 |
|
126 |
+
| TITLE | 0.9947 | 0.9947 | 0.9947 |
|
127 |
+
| GPE | 1.0000 | 0.9886 | 0.9943 |
|
128 |
+
| NORP | 0.9872 | 1.0000 | 0.9935 |
|
129 |
+
| PERSON | 0.9935 | 0.9935 | 0.9935 |
|
130 |
+
| DATE | 0.9926 | 0.9830 | 0.9878 |
|
131 |
+
| ORDINAL | 0.9750 | 1.0000 | 0.9873 |
|
132 |
+
| MONEY | 0.9683 | 0.9946 | 0.9812 |
|
133 |
+
| ORGANIZATION | 0.9457 | 0.9905 | 0.9676 |
|
134 |
+
| TIME | 0.9476 | 0.9819 | 0.9645 |
|
135 |
+
| QUANTITY | 0.9874 | 0.9291 | 0.9574 |
|
136 |
+
| PERCENT | 0.8600 | 1.0000 | 0.9247 |
|
137 |
+
| CARDINAL | 0.9620 | 0.8736 | 0.9157 |
|
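The F1 column in the table is the harmonic mean of the precision and recall columns. A small sketch for checking a row by hand (`f1_score` is an illustrative name, not a function from the analyzer script):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# PERCENT row from the table: precision 0.8600, recall 1.0000.
percent_f1 = round(f1_score(0.8600, 1.0000), 4)
```

Here `percent_f1` evaluates to `0.9247`, matching the PERCENT row. Because the published precision and recall are already rounded to four decimals, a recomputed F1 can differ from a table value in the last digit.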
138 |
+
|
139 |
+
### Evaluation Features
|
140 |
+
You can reproduce these metrics using the included analyzer script:
|
141 |
+
|
142 |
+
```bash
|
143 |
+
streamlit run spacy_model_analyzer.py
|
144 |
+
```
|
145 |
+
|
146 |
+
The analyzer provides:
|
147 |
+
- **Interactive Analysis**: Real-time entity recognition testing
|
148 |
+
- **Detailed Metrics**: Precision, recall, and F1-score calculations
|
149 |
+
- **Text Alignment**: Automatic handling of entity boundary alignment
|
150 |
+
- **Visualization**: Entity highlighting and analysis tools
|
config.cfg
ADDED
@@ -0,0 +1,130 @@
1 |
+
[paths]
|
2 |
+
train = null
|
3 |
+
dev = null
|
4 |
+
vectors = null
|
5 |
+
init_tok2vec = null
|
6 |
+
|
7 |
+
[system]
|
8 |
+
seed = 0
|
9 |
+
gpu_allocator = null
|
10 |
+
|
11 |
+
[nlp]
|
12 |
+
lang = "id"
|
13 |
+
pipeline = ["ner"]
|
14 |
+
disabled = []
|
15 |
+
before_creation = null
|
16 |
+
after_creation = null
|
17 |
+
after_pipeline_creation = null
|
18 |
+
batch_size = 1000
|
19 |
+
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
|
20 |
+
vectors = {"@vectors":"spacy.Vectors.v1"}
|
21 |
+
|
22 |
+
[components]
|
23 |
+
|
24 |
+
[components.ner]
|
25 |
+
factory = "ner"
|
26 |
+
incorrect_spans_key = null
|
27 |
+
moves = null
|
28 |
+
scorer = {"@scorers":"spacy.ner_scorer.v1"}
|
29 |
+
update_with_oracle_cut_size = 100
|
30 |
+
|
31 |
+
[components.ner.model]
|
32 |
+
@architectures = "spacy.TransitionBasedParser.v2"
|
33 |
+
state_type = "ner"
|
34 |
+
extra_state_tokens = false
|
35 |
+
hidden_width = 64
|
36 |
+
maxout_pieces = 2
|
37 |
+
use_upper = true
|
38 |
+
nO = null
|
39 |
+
|
40 |
+
[components.ner.model.tok2vec]
|
41 |
+
@architectures = "spacy.HashEmbedCNN.v2"
|
42 |
+
pretrained_vectors = null
|
43 |
+
width = 96
|
44 |
+
depth = 4
|
45 |
+
embed_size = 2000
|
46 |
+
window_size = 1
|
47 |
+
maxout_pieces = 3
|
48 |
+
subword_features = true
|
49 |
+
|
50 |
+
[corpora]
|
51 |
+
|
52 |
+
[corpora.dev]
|
53 |
+
@readers = "spacy.Corpus.v1"
|
54 |
+
path = ${paths.dev}
|
55 |
+
gold_preproc = false
|
56 |
+
max_length = 0
|
57 |
+
limit = 0
|
58 |
+
augmenter = null
|
59 |
+
|
60 |
+
[corpora.train]
|
61 |
+
@readers = "spacy.Corpus.v1"
|
62 |
+
path = ${paths.train}
|
63 |
+
gold_preproc = false
|
64 |
+
max_length = 0
|
65 |
+
limit = 0
|
66 |
+
augmenter = null
|
67 |
+
|
68 |
+
[training]
|
69 |
+
seed = ${system.seed}
|
70 |
+
gpu_allocator = ${system.gpu_allocator}
|
71 |
+
dropout = 0.1
|
72 |
+
accumulate_gradient = 1
|
73 |
+
patience = 1600
|
74 |
+
max_epochs = 0
|
75 |
+
max_steps = 20000
|
76 |
+
eval_frequency = 200
|
77 |
+
frozen_components = []
|
78 |
+
annotating_components = []
|
79 |
+
dev_corpus = "corpora.dev"
|
80 |
+
train_corpus = "corpora.train"
|
81 |
+
before_to_disk = null
|
82 |
+
before_update = null
|
83 |
+
|
84 |
+
[training.batcher]
|
85 |
+
@batchers = "spacy.batch_by_words.v1"
|
86 |
+
discard_oversize = false
|
87 |
+
tolerance = 0.2
|
88 |
+
get_length = null
|
89 |
+
|
90 |
+
[training.batcher.size]
|
91 |
+
@schedules = "compounding.v1"
|
92 |
+
start = 100
|
93 |
+
stop = 1000
|
94 |
+
compound = 1.001
|
95 |
+
t = 0.0
|
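The `compounding.v1` schedule above grows the batch size geometrically from `start` to `stop`. Note that this block batches by words (100 to 1000), whereas the README describes batches of 4 to 32 examples. The generator below is a simplified sketch of the idea, assuming `start <= stop` and `compound > 1`. It is not Thinc's implementation and ignores the `t` parameter.

```python
from itertools import islice


def compounding(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound


# First three batch sizes for the values in [training.batcher.size].
first_sizes = list(islice(compounding(100, 1000, 1.001), 3))
```

With `compound = 1.001`, the size grows by 0.1% per step, so it takes roughly 2,300 steps to reach the cap of 1000.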
96 |
+
|
97 |
+
[training.logger]
|
98 |
+
@loggers = "spacy.ConsoleLogger.v1"
|
99 |
+
progress_bar = false
|
100 |
+
|
101 |
+
[training.optimizer]
|
102 |
+
@optimizers = "Adam.v1"
|
103 |
+
beta1 = 0.9
|
104 |
+
beta2 = 0.999
|
105 |
+
L2_is_weight_decay = true
|
106 |
+
L2 = 0.01
|
107 |
+
grad_clip = 1.0
|
108 |
+
use_averages = false
|
109 |
+
eps = 0.00000001
|
110 |
+
learn_rate = 0.001
|
111 |
+
|
112 |
+
[training.score_weights]
|
113 |
+
ents_f = 1.0
|
114 |
+
ents_p = 0.0
|
115 |
+
ents_r = 0.0
|
116 |
+
ents_per_type = null
|
117 |
+
|
118 |
+
[pretraining]
|
119 |
+
|
120 |
+
[initialize]
|
121 |
+
vectors = ${paths.vectors}
|
122 |
+
init_tok2vec = ${paths.init_tok2vec}
|
123 |
+
vocab_data = null
|
124 |
+
lookups = null
|
125 |
+
before_init = null
|
126 |
+
after_init = null
|
127 |
+
|
128 |
+
[initialize.components]
|
129 |
+
|
130 |
+
[initialize.tokenizer]
|
meta.json
ADDED
@@ -0,0 +1,51 @@
1 |
+
{
|
2 |
+
"lang":"id",
|
3 |
+
"name":"pipeline",
|
4 |
+
"version":"0.0.0",
|
5 |
+
"spacy_version":">=3.8.7,<3.9.0",
|
6 |
+
"description":"",
|
7 |
+
"author":"",
|
8 |
+
"email":"",
|
9 |
+
"url":"",
|
10 |
+
"license":"",
|
11 |
+
"spacy_git_version":"4b65aa7",
|
12 |
+
"vectors":{
|
13 |
+
"width":0,
|
14 |
+
"vectors":0,
|
15 |
+
"keys":0,
|
16 |
+
"name":null,
|
17 |
+
"mode":"default"
|
18 |
+
},
|
19 |
+
"labels":{
|
20 |
+
"ner":[
|
21 |
+
"CARDINAL",
|
22 |
+
"DATE",
|
23 |
+
"EVENT",
|
24 |
+
"FACILITY",
|
25 |
+
"GPE",
|
26 |
+
"LANGUAGE",
|
27 |
+
"LAW",
|
28 |
+
"LOCATION",
|
29 |
+
"MISC",
|
30 |
+
"MONEY",
|
31 |
+
"NORP",
|
32 |
+
"ORDINAL",
|
33 |
+
"ORGANIZATION",
|
34 |
+
"PERCENT",
|
35 |
+
"PERSON",
|
36 |
+
"PRODUCT",
|
37 |
+
"QUANTITY",
|
38 |
+
"TIME",
|
39 |
+
"TITLE"
|
40 |
+
]
|
41 |
+
},
|
42 |
+
"pipeline":[
|
43 |
+
"ner"
|
44 |
+
],
|
45 |
+
"components":[
|
46 |
+
"ner"
|
47 |
+
],
|
48 |
+
"disabled":[
|
49 |
+
|
50 |
+
]
|
51 |
+
}
|
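The `spacy_version` field constrains which spaCy releases this pipeline is meant to load under. As an illustration of what the range `>=3.8.7,<3.9.0` admits, here is a stdlib-only sketch. `parse_version` and `satisfies` are hypothetical helpers that only handle numeric dotted versions and the `>=`, `<`, and `==` operators. Real code would more likely rely on the `packaging` library.

```python
def parse_version(text):
    """Turn '3.8.7' into (3, 8, 7); assumes purely numeric dotted versions."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies(version, spec):
    """Check a version against a comma-separated spec using only >=, < and ==."""
    current = parse_version(version)
    for clause in spec.split(","):
        clause = clause.strip()
        if clause.startswith(">="):
            ok = current >= parse_version(clause[2:])
        elif clause.startswith("=="):
            ok = current == parse_version(clause[2:])
        elif clause.startswith("<") and not clause.startswith("<="):
            ok = current < parse_version(clause[1:])
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
        if not ok:
            return False
    return True
```

Under this reading, `3.8.7` and later 3.8 patch releases fall inside the range, while `3.8.6` and `3.9.0` do not.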
ner/cfg
ADDED
@@ -0,0 +1,13 @@
1 |
+
{
|
2 |
+
"moves":null,
|
3 |
+
"update_with_oracle_cut_size":100,
|
4 |
+
"multitasks":[
|
5 |
+
|
6 |
+
],
|
7 |
+
"min_action_freq":1,
|
8 |
+
"learn_tokens":false,
|
9 |
+
"beam_width":1,
|
10 |
+
"beam_density":0.0,
|
11 |
+
"beam_update_prob":0.0,
|
12 |
+
"incorrect_spans_key":null
|
13 |
+
}
|
ner/model
ADDED
@@ -0,0 +1,3 @@
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:f6fd3968fdd127d227a20c184745f81f86fa63c1af9f8de767d2c2db70e83a73
|
3 |
+
size 3851641
|
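The three lines above are a Git LFS pointer, not the model weights themselves: the repository stores the pointer and LFS fetches the blob identified by `oid`. A minimal sketch of reading such a pointer, assuming the simple `key value` layout shown here (`parse_lfs_pointer` is a hypothetical helper):

```python
def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form '<hash algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:f6fd3968fdd127d227a20c184745f81f86fa63c1af9f8de767d2c2db70e83a73\n"
    "size 3851641\n"
)
```

For `ner/model`, the parsed size is 3,851,641 bytes, so the actual weights file is about 3.9 MB.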
ner/moves
ADDED
@@ -0,0 +1 @@
1 |
+
moves {"0":{},"1":{"MISC":-1,"MONEY":-2,"LAW":-3,"DATE":-4,"FACILITY":-5,"PRODUCT":-6,"EVENT":-7,"ORDINAL":-8,"GPE":-9,"PERCENT":-10,"LOCATION":-11,"TITLE":-12,"ORGANIZATION":-13,"TIME":-14,"QUANTITY":-15,"NORP":-16,"LANGUAGE":-17,"CARDINAL":-18,"PERSON":-19},"2":{"MISC":-1,"MONEY":-2,"LAW":-3,"DATE":-4,"FACILITY":-5,"PRODUCT":-6,"EVENT":-7,"ORDINAL":-8,"GPE":-9,"PERCENT":-10,"LOCATION":-11,"TITLE":-12,"ORGANIZATION":-13,"TIME":-14,"QUANTITY":-15,"NORP":-16,"LANGUAGE":-17,"CARDINAL":-18,"PERSON":-19},"3":{"MISC":-1,"MONEY":-2,"LAW":-3,"DATE":-4,"FACILITY":-5,"PRODUCT":-6,"EVENT":-7,"ORDINAL":-8,"GPE":-9,"PERCENT":-10,"LOCATION":-11,"TITLE":-12,"ORGANIZATION":-13,"TIME":-14,"QUANTITY":-15,"NORP":-16,"LANGUAGE":-17,"CARDINAL":-18,"PERSON":-19},"4":{"":1,"MISC":-1,"MONEY":-2,"LAW":-3,"DATE":-4,"FACILITY":-5,"PRODUCT":-6,"EVENT":-7,"ORDINAL":-8,"GPE":-9,"PERCENT":-10,"LOCATION":-11,"TITLE":-12,"ORGANIZATION":-13,"TIME":-14,"QUANTITY":-15,"NORP":-16,"LANGUAGE":-17,"CARDINAL":-18,"PERSON":-19},"5":{"":1}} cfg neg_key
|
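spaCy's transition-based NER labels tokens with a BILUO scheme (Begin, In, Last, Unit, Out). In the table above, keys `1` to `4` each list all 19 labels, which is consistent with the four labelled actions, though the exact key-to-action mapping is an assumption here. The sketch below shows the scheme itself on token-level spans. `biluo_tags` is an illustrative helper, not spaCy's API.

```python
def biluo_tags(num_tokens, spans):
    """Tag tokens with BILUO labels.

    spans: non-overlapping (start_token, end_token_exclusive, label) triples.
    """
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        if end - start == 1:
            # A single-token entity gets a Unit tag.
            tags[start] = f"U-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"L-{label}"
    return tags


# Tokens: Presiden | Joko | Widodo | mengunjungi | Jakarta | .
example_tags = biluo_tags(6, [(1, 3, "PERSON"), (4, 5, "GPE")])
```

Here the two-token `PERSON` span becomes `B-PERSON`, `L-PERSON`, and the single-token `GPE` span becomes `U-GPE`.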
tokenizer
ADDED
The diff for this file is too large to render.
vocab/key2row
ADDED
@@ -0,0 +1 @@
1 |
+
(binary content, not rendered)
|
vocab/lookups.bin
ADDED
@@ -0,0 +1,3 @@
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:76be8b528d0075f7aae98d6fa57a6d3c83ae480a8469e668d7b0af968995ac71
|
3 |
+
size 1
|
vocab/strings.json
ADDED
The diff for this file is too large to render.
vocab/vectors
ADDED
@@ -0,0 +1,3 @@
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:14772b683e726436d5948ad3fff2b43d036ef2ebbe3458aafed6004e05a40706
|
3 |
+
size 128
|
vocab/vectors.cfg
ADDED
@@ -0,0 +1,3 @@
1 |
+
{
|
2 |
+
"mode":"default"
|
3 |
+
}
|