Commit 47301d1 · 1 parent: 5f2bda5 · committed by mac

Add Indonesian NER spaCy model with comprehensive documentation


- Add complete spaCy model files with NER pipeline for Indonesian text
- Include detailed performance metrics (F1: 0.9856, Precision: 0.9846, Recall: 0.9865)
- Add comprehensive README with training configuration and architecture details
- Support 19 entity types with per-entity performance breakdown
- Configure LFS tracking for large model files
- Include evaluation tools and usage instructions

Author: Asep Muhamad <[email protected]>
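
Review note: the overall scores quoted in the commit message can be sanity-checked, because F1 is the harmonic mean of precision and recall. The snippet below is a minimal sketch; `f1_score` is a hypothetical helper written for this note and is not part of the commit. Because the quoted precision and recall are themselves rounded to four decimals, the recomputed value should only be expected to agree with the reported F1 to within rounding.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the commit message above.
reported_precision = 0.9846
reported_recall = 0.9865
reported_f1 = 0.9856

recomputed_f1 = f1_score(reported_precision, reported_recall)

# Agreement is checked with a tolerance of one unit in the fourth decimal,
# since the inputs were already rounded.
consistent = abs(recomputed_f1 - reported_f1) < 1e-4
print(f"recomputed F1 = {recomputed_f1:.5f}, consistent = {consistent}")
```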

Files changed (13)
  1. .gitattributes +2 -0
  2. README.md +147 -0
  3. config.cfg +130 -0
  4. meta.json +51 -0
  5. ner/cfg +13 -0
  6. ner/model +3 -0
  7. ner/moves +1 -0
  8. tokenizer +0 -0
  9. vocab/key2row +1 -0
  10. vocab/lookups.bin +3 -0
  11. vocab/strings.json +0 -0
  12. vocab/vectors +3 -0
  13. vocab/vectors.cfg +3 -0
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+vocab/vectors filter=lfs diff=lfs merge=lfs -text
+ner/model filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,150 @@
 ---
 license: gpl-2.0
+language: id
+tags:
+- spacy
+- ner
+- token-classification
+- indonesian
+library_name: spacy
 ---
+
+# Indonesian NER spaCy Model
+
+This model is a Named Entity Recognition (NER) model for Indonesian language built with spaCy.
+
+## Model Details
+
+- **Language**: Indonesian (`id`)
+- **Pipeline**: `ner`
+- **spaCy Version**: `>=3.8.7,<3.9.0`
+- **Model Architecture**: Transition-based parser with HashEmbedCNN tok2vec
+
+## Supported Entity Types
+
+The model recognizes the following entity types:
+
+- `CARDINAL` - Cardinal numbers
+- `DATE` - Date expressions
+- `EVENT` - Events
+- `FACILITY` - Facilities
+- `GPE` - Geopolitical entities
+- `LANGUAGE` - Languages
+- `LAW` - Legal documents
+- `LOCATION` - Locations
+- `MISC` - Miscellaneous
+- `MONEY` - Monetary values
+- `NORP` - Nationalities or religious/political groups
+- `ORDINAL` - Ordinal numbers
+- `ORGANIZATION` - Organizations
+- `PERCENT` - Percentages
+- `PERSON` - People
+- `PRODUCT` - Products
+- `QUANTITY` - Quantities
+- `TIME` - Time expressions
+- `TITLE` - Titles
+
+## Usage
+
+```python
+import spacy
+
+# Load the model
+nlp = spacy.load("asmud/ner-spacy-indonesian")
+
+# Process text
+doc = nlp("Presiden Joko Widodo mengunjungi Jakarta pada tanggal 17 Agustus 2024.")
+
+# Extract entities
+for ent in doc.ents:
+    print(f"{ent.text} -> {ent.label_}")
+```
+
+## Installation
+
+```bash
+pip install https://huggingface.co/asmud/ner-spacy-indonesian/resolve/main/ner-spacy-indonesian-any-py3-none-any.whl
+```
+
+Or use with spaCy:
+
+```python
+import spacy
+nlp = spacy.load("asmud/ner-spacy-indonesian")
+```
+
+## Model Architecture
+
+- **tok2vec**: HashEmbedCNN with 96-dimensional embeddings, depth 4, embed size 2000
+- **ner**: Transition-based parser with 64 hidden units, maxout pieces 2
+- **Training**: 100 iterations with dropout 0.5, compounding batch sizes (4-32)
+- **Optimizer**: Adam (lr=0.001, L2=0.01, grad_clip=1.0)
+
+## Training Configuration
+
+### Training Data Format
+The model was trained on data with custom XML-like tags:
+```
+Presiden <PERSON>Joko Widodo</PERSON> mengunjungi <GPE>Jakarta</GPE> pada <DATE>17 Agustus 2024</DATE>.
+```
+
+### Training Parameters
+- **Iterations**: 100 training iterations
+- **Dropout**: 0.5 during training
+- **Batch Size**: Compounding from 4 to 32 examples
+- **Text Preprocessing**: Lowercased input text
+- **Data Shuffling**: Random shuffling each iteration
+
+### Architecture Details
+- **Embedding Width**: 96 dimensions
+- **Hidden Width**: 64 units
+- **Embed Size**: 2000 features
+- **Window Size**: 1
+- **Maxout Pieces**: 3 (tok2vec), 2 (parser)
+- **Subword Features**: Enabled
+
+## Model Evaluation
+
+### Performance Metrics
+The model was evaluated on 2,987 examples from the training data with the following results:
+
+#### Overall Performance
+- **Precision**: 0.9846
+- **Recall**: 0.9865
+- **F1-score**: 0.9856
+
+#### Per-Entity Performance
+| Entity | Precision | Recall | F1-score |
+|--------|-----------|--------|----------|
+| PRODUCT | 1.0000 | 1.0000 | 1.0000 |
+| LOCATION | 1.0000 | 1.0000 | 1.0000 |
+| LANGUAGE | 1.0000 | 1.0000 | 1.0000 |
+| EVENT | 0.9962 | 1.0000 | 0.9981 |
+| MISC | 0.9973 | 0.9960 | 0.9966 |
+| FACILITY | 0.9923 | 1.0000 | 0.9961 |
+| LAW | 1.0000 | 0.9919 | 0.9959 |
+| TITLE | 0.9947 | 0.9947 | 0.9947 |
+| GPE | 1.0000 | 0.9886 | 0.9943 |
+| NORP | 0.9872 | 1.0000 | 0.9935 |
+| PERSON | 0.9935 | 0.9935 | 0.9935 |
+| DATE | 0.9926 | 0.9830 | 0.9878 |
+| ORDINAL | 0.9750 | 1.0000 | 0.9873 |
+| MONEY | 0.9683 | 0.9946 | 0.9812 |
+| ORGANIZATION | 0.9457 | 0.9905 | 0.9676 |
+| TIME | 0.9476 | 0.9819 | 0.9645 |
+| QUANTITY | 0.9874 | 0.9291 | 0.9574 |
+| PERCENT | 0.8600 | 1.0000 | 0.9247 |
+| CARDINAL | 0.9620 | 0.8736 | 0.9157 |
+
+### Evaluation Features
+You can reproduce these metrics using the included analyzer script:
+
+```bash
+streamlit run spacy_model_analyzer.py
+```
+
+The analyzer provides:
+- **Interactive Analysis**: Real-time entity recognition testing
+- **Detailed Metrics**: Precision, recall, and F1-score calculations
+- **Text Alignment**: Automatic handling of entity boundary alignment
+- **Visualization**: Entity highlighting and analysis tools
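
Review note: the README's "Training Data Format" section shows entities wrapped in XML-like tags, but the conversion to character offsets is not part of this commit. The snippet below is a minimal sketch of one way to do it, assuming tags are never nested and that tag names are uppercase labels; `parse_tagged` and `TAG_RE` are hypothetical names introduced for illustration.

```python
import re

# Matches <LABEL>surface text</LABEL>. Assumes tags are not nested, which the
# README does not state explicitly.
TAG_RE = re.compile(r"<([A-Z]+)>(.*?)</\1>")


def parse_tagged(tagged: str):
    """Return (plain_text, [(start, end, label), ...]) using character offsets."""
    plain_parts = []
    entities = []
    cursor = 0   # read position in the tagged input
    out_len = 0  # length of the plain text produced so far
    for match in TAG_RE.finditer(tagged):
        # Copy the untagged text that precedes this entity.
        before = tagged[cursor:match.start()]
        plain_parts.append(before)
        out_len += len(before)
        # Record the entity span relative to the plain text.
        label, surface = match.group(1), match.group(2)
        entities.append((out_len, out_len + len(surface), label))
        plain_parts.append(surface)
        out_len += len(surface)
        cursor = match.end()
    plain_parts.append(tagged[cursor:])
    return "".join(plain_parts), entities


text, ents = parse_tagged(
    "Presiden <PERSON>Joko Widodo</PERSON> mengunjungi <GPE>Jakarta</GPE> "
    "pada <DATE>17 Agustus 2024</DATE>."
)
for start, end, label in ents:
    print(text[start:end], "->", label)
```

The resulting `(start, end, label)` triples follow the character-offset convention that spaCy accepts for entity annotations, for example via `Example.from_dict(doc, {"entities": ents})`.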
config.cfg ADDED
@@ -0,0 +1,130 @@
+[paths]
+train = null
+dev = null
+vectors = null
+init_tok2vec = null
+
+[system]
+seed = 0
+gpu_allocator = null
+
+[nlp]
+lang = "id"
+pipeline = ["ner"]
+disabled = []
+before_creation = null
+after_creation = null
+after_pipeline_creation = null
+batch_size = 1000
+tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
+vectors = {"@vectors":"spacy.Vectors.v1"}
+
+[components]
+
+[components.ner]
+factory = "ner"
+incorrect_spans_key = null
+moves = null
+scorer = {"@scorers":"spacy.ner_scorer.v1"}
+update_with_oracle_cut_size = 100
+
+[components.ner.model]
+@architectures = "spacy.TransitionBasedParser.v2"
+state_type = "ner"
+extra_state_tokens = false
+hidden_width = 64
+maxout_pieces = 2
+use_upper = true
+nO = null
+
+[components.ner.model.tok2vec]
+@architectures = "spacy.HashEmbedCNN.v2"
+pretrained_vectors = null
+width = 96
+depth = 4
+embed_size = 2000
+window_size = 1
+maxout_pieces = 3
+subword_features = true
+
+[corpora]
+
+[corpora.dev]
+@readers = "spacy.Corpus.v1"
+path = ${paths.dev}
+gold_preproc = false
+max_length = 0
+limit = 0
+augmenter = null
+
+[corpora.train]
+@readers = "spacy.Corpus.v1"
+path = ${paths.train}
+gold_preproc = false
+max_length = 0
+limit = 0
+augmenter = null
+
+[training]
+seed = ${system.seed}
+gpu_allocator = ${system.gpu_allocator}
+dropout = 0.1
+accumulate_gradient = 1
+patience = 1600
+max_epochs = 0
+max_steps = 20000
+eval_frequency = 200
+frozen_components = []
+annotating_components = []
+dev_corpus = "corpora.dev"
+train_corpus = "corpora.train"
+before_to_disk = null
+before_update = null
+
+[training.batcher]
+@batchers = "spacy.batch_by_words.v1"
+discard_oversize = false
+tolerance = 0.2
+get_length = null
+
+[training.batcher.size]
+@schedules = "compounding.v1"
+start = 100
+stop = 1000
+compound = 1.001
+t = 0.0
+
+[training.logger]
+@loggers = "spacy.ConsoleLogger.v1"
+progress_bar = false
+
+[training.optimizer]
+@optimizers = "Adam.v1"
+beta1 = 0.9
+beta2 = 0.999
+L2_is_weight_decay = true
+L2 = 0.01
+grad_clip = 1.0
+use_averages = false
+eps = 0.00000001
+learn_rate = 0.001
+
+[training.score_weights]
+ents_f = 1.0
+ents_p = 0.0
+ents_r = 0.0
+ents_per_type = null
+
+[pretraining]
+
+[initialize]
+vectors = ${paths.vectors}
+init_tok2vec = ${paths.init_tok2vec}
+vocab_data = null
+lookups = null
+before_init = null
+after_init = null
+
+[initialize.components]
+
+[initialize.tokenizer]
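
Review note: the `[training.batcher.size]` block above uses a compounding schedule (`start = 100`, `stop = 1000`, `compound = 1.001`), and the batcher is `spacy.batch_by_words.v1`, so these sizes are counted in words rather than examples. The snippet below is a plain-Python stand-in that illustrates how such a schedule grows; it is a sketch of the idea, not thinc's implementation, and the `compounding` function defined here is written for this note.

```python
from itertools import islice


def compounding(start: float, stop: float, compound: float):
    """Yield start, start*compound, start*compound**2, ... clipped at stop.

    Illustrative stand-in for a compounding schedule; assumes compound >= 1
    and start <= stop, as in the config above.
    """
    value = start
    while True:
        yield min(value, stop)
        value *= compound


# Values taken from [training.batcher.size] in config.cfg.
sizes = list(islice(compounding(100.0, 1000.0, 1.001), 3000))

# Index of the first step at which the schedule has reached its cap.
first_capped_step = sizes.index(1000.0)
print(sizes[0], sizes[1], sizes[-1], first_capped_step)
```

With a growth factor this close to 1, the batch size climbs slowly: it takes a little over two thousand steps to move from the starting size to the cap.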
meta.json ADDED
@@ -0,0 +1,51 @@
+{
+  "lang":"id",
+  "name":"pipeline",
+  "version":"0.0.0",
+  "spacy_version":">=3.8.7,<3.9.0",
+  "description":"",
+  "author":"",
+  "email":"",
+  "url":"",
+  "license":"",
+  "spacy_git_version":"4b65aa7",
+  "vectors":{
+    "width":0,
+    "vectors":0,
+    "keys":0,
+    "name":null,
+    "mode":"default"
+  },
+  "labels":{
+    "ner":[
+      "CARDINAL",
+      "DATE",
+      "EVENT",
+      "FACILITY",
+      "GPE",
+      "LANGUAGE",
+      "LAW",
+      "LOCATION",
+      "MISC",
+      "MONEY",
+      "NORP",
+      "ORDINAL",
+      "ORGANIZATION",
+      "PERCENT",
+      "PERSON",
+      "PRODUCT",
+      "QUANTITY",
+      "TIME",
+      "TITLE"
+    ]
+  },
+  "pipeline":[
+    "ner"
+  ],
+  "components":[
+    "ner"
+  ],
+  "disabled":[
+
+  ]
+}
ner/cfg ADDED
@@ -0,0 +1,13 @@
+{
+  "moves":null,
+  "update_with_oracle_cut_size":100,
+  "multitasks":[
+
+  ],
+  "min_action_freq":1,
+  "learn_tokens":false,
+  "beam_width":1,
+  "beam_density":0.0,
+  "beam_update_prob":0.0,
+  "incorrect_spans_key":null
+}
ner/model ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f6fd3968fdd127d227a20c184745f81f86fa63c1af9f8de767d2c2db70e83a73
+size 3851641
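
Review note: `ner/model` is stored as a Git LFS pointer, so the diff shows only three `key value` lines (`version`, `oid`, `size`) rather than the weights themselves. The snippet below is a minimal sketch of reading such a pointer; `parse_lfs_pointer` is a hypothetical helper written for this note, and it assumes the three-line layout shown above.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its version, hash algorithm, digest, and size."""
    # Each non-empty line is "<key> <value>", separated by a single space.
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


# Pointer contents copied from the ner/model diff above.
pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:f6fd3968fdd127d227a20c184745f81f86fa63c1af9f8de767d2c2db70e83a73\n"
    "size 3851641\n"
)
print(pointer["algo"], pointer["size"])
```

The `size` field gives the byte length of the actual model file (about 3.85 MB here), which is what gets downloaded when the LFS object is fetched.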
ner/moves ADDED
@@ -0,0 +1 @@
+ ��moves��{"0":{},"1":{"MISC":-1,"MONEY":-2,"LAW":-3,"DATE":-4,"FACILITY":-5,"PRODUCT":-6,"EVENT":-7,"ORDINAL":-8,"GPE":-9,"PERCENT":-10,"LOCATION":-11,"TITLE":-12,"ORGANIZATION":-13,"TIME":-14,"QUANTITY":-15,"NORP":-16,"LANGUAGE":-17,"CARDINAL":-18,"PERSON":-19},"2":{"MISC":-1,"MONEY":-2,"LAW":-3,"DATE":-4,"FACILITY":-5,"PRODUCT":-6,"EVENT":-7,"ORDINAL":-8,"GPE":-9,"PERCENT":-10,"LOCATION":-11,"TITLE":-12,"ORGANIZATION":-13,"TIME":-14,"QUANTITY":-15,"NORP":-16,"LANGUAGE":-17,"CARDINAL":-18,"PERSON":-19},"3":{"MISC":-1,"MONEY":-2,"LAW":-3,"DATE":-4,"FACILITY":-5,"PRODUCT":-6,"EVENT":-7,"ORDINAL":-8,"GPE":-9,"PERCENT":-10,"LOCATION":-11,"TITLE":-12,"ORGANIZATION":-13,"TIME":-14,"QUANTITY":-15,"NORP":-16,"LANGUAGE":-17,"CARDINAL":-18,"PERSON":-19},"4":{"":1,"MISC":-1,"MONEY":-2,"LAW":-3,"DATE":-4,"FACILITY":-5,"PRODUCT":-6,"EVENT":-7,"ORDINAL":-8,"GPE":-9,"PERCENT":-10,"LOCATION":-11,"TITLE":-12,"ORGANIZATION":-13,"TIME":-14,"QUANTITY":-15,"NORP":-16,"LANGUAGE":-17,"CARDINAL":-18,"PERSON":-19},"5":{"":1}}�cfg��neg_key�
tokenizer ADDED
The diff for this file is too large to render. See raw diff
 
vocab/key2row ADDED
@@ -0,0 +1 @@
+
vocab/lookups.bin ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:76be8b528d0075f7aae98d6fa57a6d3c83ae480a8469e668d7b0af968995ac71
+size 1
vocab/strings.json ADDED
The diff for this file is too large to render. See raw diff
 
vocab/vectors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:14772b683e726436d5948ad3fff2b43d036ef2ebbe3458aafed6004e05a40706
+size 128
vocab/vectors.cfg ADDED
@@ -0,0 +1,3 @@
+{
+  "mode":"default"
+}
+ }