jakelever committed
Commit 272cb7c · verified · 1 Parent(s): c111394

Upload folder using huggingface_hub

.ipynb_checkpoints/README-checkpoint.md ADDED
@@ -0,0 +1,67 @@
+ ---
+ task: token-classification
+ tags:
+ - biomedical
+ - bionlp
+ license: mit
+ base_model: microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
+ ---
+
+ # bioner_gnormplus
+
+ This is a named entity recognition model fine-tuned from the [microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext](https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext) model. It predicts spans with three possible labels: **DomainMotif, FamilyName and Gene**.
+
+ The code used for training this model can be found at https://github.com/Glasgow-AI4BioMed/bioner, along with links to other biomedical NER models trained on well-known biomedical corpora. The source dataset information is below.
+
+ ## Example Usage
+
+ The code below loads the model and applies it to the provided text. It uses the pipeline's `max` aggregation strategy to merge individual tokens into larger multi-token entities where needed.
+
+ ```python
+ from transformers import pipeline
+
+ # Load the model as part of an NER pipeline
+ ner_pipeline = pipeline("token-classification",
+                         model="Glasgow-AI4BioMed/bioner_gnormplus",
+                         aggregation_strategy="max")
+
+ # Apply it to some text
+ ner_pipeline("EGFR T790M mutations have been known to affect treatment outcomes for NSCLC patients receiving erlotinib.")
+
+ # Output:
+ # [ {"entity_group": "FamilyName", "score": 0.44405, "word": "egfr", "start": 0, "end": 4},
+ ```
+
+ ## Dataset Info
+
+ **Source:** The GNormPlus dataset was downloaded from: https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/gnormplus/
+
+ The dataset should be cited with: Wei, Chih-Hsuan, Hung-Yu Kao, and Zhiyong Lu. "GNormPlus: an integrative approach for tagging genes, gene families, and protein domains." BioMed Research International 2015.1 (2015): 918710. DOI: [10.1155/2015/918710](https://doi.org/10.1155/2015/918710)
+
+ **Preprocessing:** The training set was split 75/25 to create training and validation sets. No changes were made to the annotations. The preprocessing script for this dataset is [prepare_bc5cdr.py](https://github.com/Glasgow-AI4BioMed/bioner/blob/main/prepare_bc5cdr.py).
+
+ ## Performance
+
+ The span-level performance on the test split for each label is shown in the table below. The full performance results are available in the model repo in Markdown format for viewing and JSON format for easier loading. These include the performance at token level (with individual B- and I- labels, as the token classifier uses IOB2 token labelling).
+
+ | Label | Precision | Recall | F1-score | Support |
+ | --- | --- | --- | --- | --- |
+ | DomainMotif | 0.602 | 0.670 | 0.634 | 361 |
+ | FamilyName | 0.497 | 0.569 | 0.530 | 1250 |
+ | Gene | 0.856 | 0.923 | 0.888 | 3225 |
+ | macro avg | 0.651 | 0.721 | 0.684 | 4836 |
+ | weighted avg | 0.744 | 0.812 | 0.777 | 4836 |
+
+
+ ## Hyperparameters
+
+ Hyperparameter tuning was done with [optuna](https://optuna.org/) and the [hyperparameter_search](https://huggingface.co/docs/transformers/en/hpo_train) functionality. A total of 100 trials were run, with early stopping applied during training. The best-performing model was selected using macro F1 on the validation set. The selected hyperparameters are in the table below.
+
+ | Hyperparameter | Value |
+ |----------------|-------|
+ | epochs | 13.0 |
+ | learning_rate | 4.312024782506724e-05 |
+ | per_device_train_batch_size | 16 |
+ | weight_decay | 0.004637010897989902 |
+ | warmup_ratio | 0.00046632724074153857 |
+
README.md ADDED
@@ -0,0 +1,67 @@
+ ---
+ task: token-classification
+ tags:
+ - biomedical
+ - bionlp
+ license: mit
+ base_model: microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
+ ---
+
+ # bioner_gnormplus
+
+ This is a named entity recognition model fine-tuned from the [microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext](https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext) model. It predicts spans with three possible labels: **DomainMotif, FamilyName and Gene**.
+
+ The code used for training this model can be found at https://github.com/Glasgow-AI4BioMed/bioner, along with links to other biomedical NER models trained on well-known biomedical corpora. The source dataset information is below.
+
+ ## Example Usage
+
+ The code below loads the model and applies it to the provided text. It uses the pipeline's `max` aggregation strategy to merge individual tokens into larger multi-token entities where needed.
+
+ ```python
+ from transformers import pipeline
+
+ # Load the model as part of an NER pipeline
+ ner_pipeline = pipeline("token-classification",
+                         model="Glasgow-AI4BioMed/bioner_gnormplus",
+                         aggregation_strategy="max")
+
+ # Apply it to some text
+ ner_pipeline("EGFR T790M mutations have been known to affect treatment outcomes for NSCLC patients receiving erlotinib.")
+
+ # Output:
+ # [ {"entity_group": "FamilyName", "score": 0.44405, "word": "egfr", "start": 0, "end": 4},
+ ```
+
+ ## Dataset Info
+
+ **Source:** The GNormPlus dataset was downloaded from: https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/gnormplus/
+
+ The dataset should be cited with: Wei, Chih-Hsuan, Hung-Yu Kao, and Zhiyong Lu. "GNormPlus: an integrative approach for tagging genes, gene families, and protein domains." BioMed Research International 2015.1 (2015): 918710. DOI: [10.1155/2015/918710](https://doi.org/10.1155/2015/918710)
+
+ **Preprocessing:** The training set was split 75/25 to create training and validation sets. No changes were made to the annotations. The preprocessing script for this dataset is [prepare_bc5cdr.py](https://github.com/Glasgow-AI4BioMed/bioner/blob/main/prepare_bc5cdr.py).
+
+ ## Performance
+
+ The span-level performance on the test split for each label is shown in the table below. The full performance results are available in the model repo in Markdown format for viewing and JSON format for easier loading. These include the performance at token level (with individual B- and I- labels, as the token classifier uses IOB2 token labelling).
+
+ | Label | Precision | Recall | F1-score | Support |
+ | --- | --- | --- | --- | --- |
+ | DomainMotif | 0.602 | 0.670 | 0.634 | 361 |
+ | FamilyName | 0.497 | 0.569 | 0.530 | 1250 |
+ | Gene | 0.856 | 0.923 | 0.888 | 3225 |
+ | macro avg | 0.651 | 0.721 | 0.684 | 4836 |
+ | weighted avg | 0.744 | 0.812 | 0.777 | 4836 |
+
+
+ ## Hyperparameters
+
+ Hyperparameter tuning was done with [optuna](https://optuna.org/) and the [hyperparameter_search](https://huggingface.co/docs/transformers/en/hpo_train) functionality. A total of 100 trials were run, with early stopping applied during training. The best-performing model was selected using macro F1 on the validation set. The selected hyperparameters are in the table below.
+
+ | Hyperparameter | Value |
+ |----------------|-------|
+ | epochs | 13.0 |
+ | learning_rate | 4.312024782506724e-05 |
+ | per_device_train_batch_size | 16 |
+ | weight_decay | 0.004637010897989902 |
+ | warmup_ratio | 0.00046632724074153857 |
+
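The README says the token classifier uses IOB2 labelling and that tokens are merged into multi-token entities. As a rough illustration of that idea, the sketch below groups a sequence of IOB2 tags into labelled spans. It is not the `transformers` pipeline's aggregation code (the `max` strategy also weighs per-token scores); the function name `iob2_to_spans` and the handling of a stray `I-` tag are assumptions made for this example.

```python
from typing import List, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group IOB2 tags into (label, start, end) spans; end is exclusive."""
    spans = []
    current_label, start = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only when its label matches.
        if tag.startswith("I-") and current_label == tag[2:]:
            continue
        # Any other tag closes the span that is currently open, if any.
        if current_label is not None:
            spans.append((current_label, start, i))
            current_label, start = None, None
        # A B- tag opens a new span. Assumption: a stray I- tag does too.
        if tag.startswith(("B-", "I-")):
            current_label, start = tag[2:], i
    # Close a span that runs to the end of the sequence.
    if current_label is not None:
        spans.append((current_label, start, len(tags)))
    return spans


example_tags = ["B-Gene", "I-Gene", "O", "B-FamilyName", "B-Gene"]
print(iob2_to_spans(example_tags))
```

Span-level scores in the performance reports count a whole span as one unit, which is one reason they come out lower than the token-level scores for the same split.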
best_hyperparameters.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "epochs": 13.0,
+   "learning_rate": 4.312024782506724e-05,
+   "per_device_train_batch_size": 16,
+   "weight_decay": 0.004637010897989902,
+   "warmup_ratio": 0.00046632724074153857
+ }
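To reuse these values in a training run, one option is to translate the file into keyword arguments for a trainer. The sketch below is a guess at how that could look: the mapping from `epochs` to `num_train_epochs` is an assumption, and `to_training_kwargs` is a hypothetical helper, not part of the bioner code. Note that `trainer_state.json` in this commit lists `num_train_epochs` as 32, so the value 13.0 may record the epoch of the best checkpoint rather than a configured limit.

```python
import json

BEST_HYPERPARAMETERS = """
{
  "epochs": 13.0,
  "learning_rate": 4.312024782506724e-05,
  "per_device_train_batch_size": 16,
  "weight_decay": 0.004637010897989902,
  "warmup_ratio": 0.00046632724074153857
}
"""


def to_training_kwargs(params: dict) -> dict:
    """Hypothetical helper: rename 'epochs' and cast it to an integer."""
    kwargs = dict(params)
    # Assumption: 'epochs' corresponds to the trainer's num_train_epochs.
    kwargs["num_train_epochs"] = int(kwargs.pop("epochs"))
    return kwargs


training_kwargs = to_training_kwargs(json.loads(BEST_HYPERPARAMETERS))
print(training_kwargs)
```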
config.json ADDED
@@ -0,0 +1,34 @@
+ {
+   "_name_or_path": "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext",
+   "architectures": [
+     "BertForTokenClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "O",
+     "1": "B-DomainMotif",
+     "2": "I-DomainMotif",
+     "3": "B-FamilyName",
+     "4": "I-FamilyName",
+     "5": "B-Gene",
+     "6": "I-Gene"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.48.1",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
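The `id2label` map above encodes the three entity types as seven IOB2 classes (one `O` class plus a `B-`/`I-` pair per type). A small sketch of how the entity types and a reverse `label2id` lookup could be derived from it is shown below; `entity_types` is a hypothetical helper name, and the diff shown here does not include a `label2id` entry, so the reverse map is built by hand.

```python
ID2LABEL = {
    "0": "O",
    "1": "B-DomainMotif",
    "2": "I-DomainMotif",
    "3": "B-FamilyName",
    "4": "I-FamilyName",
    "5": "B-Gene",
    "6": "I-Gene",
}


def entity_types(id2label: dict) -> list:
    """Strip the B-/I- prefixes and return the distinct entity types."""
    return sorted({label.split("-", 1)[1] for label in id2label.values() if label != "O"})


# Reverse lookup from label string to integer class id.
label2id = {label: int(idx) for idx, label in ID2LABEL.items()}

print(entity_types(ID2LABEL))
print(label2id["B-Gene"])
```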
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dc802b8a0ad2e9ab907de5f825b001e5d91b8dffb6d9345c4c4020952b4171a8
+ size 435611468
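The three lines above are a Git LFS pointer, not the weights themselves: they record the spec version, a SHA-256 digest, and the file size in bytes. The sketch below parses such a pointer and makes a back-of-the-envelope parameter estimate. `parse_lfs_pointer` is a hypothetical helper, and the estimate assumes every byte is a 4-byte float32 weight (as `torch_dtype` in `config.json` suggests), so it slightly overstates the true count because the file also carries a header.

```python
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:dc802b8a0ad2e9ab907de5f825b001e5d91b8dffb6d9345c4c4020952b4171a8
size 435611468
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    hash_algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER)
# Rough upper bound: assumes 4 bytes per parameter and ignores the file header.
approx_params = pointer["size"] // 4
print(pointer["hash_algo"], pointer["size"], approx_params)
```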
performance_report.json ADDED
@@ -0,0 +1,275 @@
+ {
+   "train": {
+     "token_level": {
+       "O": {
+         "precision": 0.9992820624388121,
+         "recall": 0.9996517802733524,
+         "f1-score": 0.9994668871650365,
+         "support": 45948.0
+       },
+       "B-DomainMotif": {
+         "precision": 0.9736842105263158,
+         "recall": 0.9487179487179487,
+         "f1-score": 0.961038961038961,
+         "support": 195.0
+       },
+       "I-DomainMotif": {
+         "precision": 0.9962406015037594,
+         "recall": 0.9888059701492538,
+         "f1-score": 0.9925093632958801,
+         "support": 536.0
+       },
+       "B-FamilyName": {
+         "precision": 0.9926289926289926,
+         "recall": 0.9901960784313726,
+         "f1-score": 0.9914110429447853,
+         "support": 816.0
+       },
+       "I-FamilyName": {
+         "precision": 0.9940054495912807,
+         "recall": 0.9972662657189721,
+         "f1-score": 0.9956331877729258,
+         "support": 1829.0
+       },
+       "B-Gene": {
+         "precision": 0.9959641255605381,
+         "recall": 0.991960696739616,
+         "f1-score": 0.993958379950772,
+         "support": 2239.0
+       },
+       "I-Gene": {
+         "precision": 0.9986388384754991,
+         "recall": 0.9979596463386987,
+         "f1-score": 0.9982991268851343,
+         "support": 4411.0
+       },
+       "accuracy": 0.9987136884982313,
+       "macro avg": {
+         "precision": 0.9929206115321711,
+         "recall": 0.987794055195602,
+         "f1-score": 0.9903309927219279,
+         "support": 55974.0
+       },
+       "weighted avg": {
+         "precision": 0.9987109444979878,
+         "recall": 0.9987136884982313,
+         "f1-score": 0.9987113109741671,
+         "support": 55974.0
+       }
+     },
+     "span_level": {
+       "DomainMotif": {
+         "precision": 0.9067357512953368,
+         "recall": 0.8883248730964467,
+         "f1-score": 0.8974358974358975,
+         "support": 197
+       },
+       "FamilyName": {
+         "precision": 0.9755799755799756,
+         "recall": 0.9803680981595092,
+         "f1-score": 0.97796817625459,
+         "support": 815
+       },
+       "Gene": {
+         "precision": 0.9882671480144405,
+         "recall": 0.9763709317877842,
+         "f1-score": 0.9822830230993497,
+         "support": 2243
+       },
+       "macro avg": {
+         "precision": 0.9568609582965842,
+         "recall": 0.9483546343479133,
+         "f1-score": 0.9525623655966124,
+         "support": 3255
+       },
+       "weighted avg": {
+         "precision": 0.9801560172347932,
+         "recall": 0.9720430107526882,
+         "f1-score": 0.9760675134421517,
+         "support": 3255
+       }
+     }
+   },
+   "val": {
+     "token_level": {
+       "O": {
+         "precision": 0.982154530003922,
+         "recall": 0.982796964939822,
+         "f1-score": 0.9824756424507944,
+         "support": 15288.0
+       },
+       "B-DomainMotif": {
+         "precision": 0.7536231884057971,
+         "recall": 0.6419753086419753,
+         "f1-score": 0.6933333333333334,
+         "support": 81.0
+       },
+       "I-DomainMotif": {
+         "precision": 0.8457142857142858,
+         "recall": 0.714975845410628,
+         "f1-score": 0.774869109947644,
+         "support": 207.0
+       },
+       "B-FamilyName": {
+         "precision": 0.6816479400749064,
+         "recall": 0.6086956521739131,
+         "f1-score": 0.6431095406360424,
+         "support": 299.0
+       },
+       "I-FamilyName": {
+         "precision": 0.7571428571428571,
+         "recall": 0.6656200941915228,
+         "f1-score": 0.70843776106934,
+         "support": 637.0
+       },
+       "B-Gene": {
+         "precision": 0.8721071863580999,
+         "recall": 0.9384010484927916,
+         "f1-score": 0.9040404040404041,
+         "support": 763.0
+       },
+       "I-Gene": {
+         "precision": 0.8863487916394513,
+         "recall": 0.9384508990318119,
+         "f1-score": 0.9116560295599597,
+         "support": 1446.0
+       },
+       "accuracy": 0.9563591688478179,
+       "macro avg": {
+         "precision": 0.8255341113341885,
+         "recall": 0.784416544697495,
+         "f1-score": 0.8025602601482168,
+         "support": 18721.0
+       },
+       "weighted avg": {
+         "precision": 0.9553162576832412,
+         "recall": 0.9563591688478179,
+         "f1-score": 0.9555177384234166,
+         "support": 18721.0
+       }
+     },
+     "span_level": {
+       "DomainMotif": {
+         "precision": 0.6133333333333333,
+         "recall": 0.5679012345679012,
+         "f1-score": 0.5897435897435898,
+         "support": 81
+       },
+       "FamilyName": {
+         "precision": 0.5387323943661971,
+         "recall": 0.5134228187919463,
+         "f1-score": 0.5257731958762886,
+         "support": 298
+       },
+       "Gene": {
+         "precision": 0.8486682808716707,
+         "recall": 0.9127604166666666,
+         "f1-score": 0.8795483061480552,
+         "support": 768
+       },
+       "macro avg": {
+         "precision": 0.6669113361904003,
+         "recall": 0.6646948233421713,
+         "f1-score": 0.6650216972559778,
+         "support": 1147
+       },
+       "weighted avg": {
+         "precision": 0.7515252774460068,
+         "recall": 0.7846556233653008,
+         "f1-score": 0.7671689121726863,
+         "support": 1147
+       }
+     }
+   },
+   "test": {
+     "token_level": {
+       "O": {
+         "precision": 0.9869390488948426,
+         "recall": 0.9731650937657508,
+         "f1-score": 0.9800036754732172,
+         "support": 57537.0
+       },
+       "B-DomainMotif": {
+         "precision": 0.7365591397849462,
+         "recall": 0.7611111111111111,
+         "f1-score": 0.7486338797814208,
+         "support": 360.0
+       },
+       "I-DomainMotif": {
+         "precision": 0.8471337579617835,
+         "recall": 0.81511746680286,
+         "f1-score": 0.8308172826652785,
+         "support": 979.0
+       },
+       "B-FamilyName": {
+         "precision": 0.6186627479794269,
+         "recall": 0.6736,
+         "f1-score": 0.6449636154729989,
+         "support": 1250.0
+       },
+       "I-FamilyName": {
+         "precision": 0.6251338807568726,
+         "recall": 0.705763804917372,
+         "f1-score": 0.6630064369556986,
+         "support": 2481.0
+       },
+       "B-Gene": {
+         "precision": 0.8868312757201646,
+         "recall": 0.9337666357164964,
+         "f1-score": 0.9096939544700738,
+         "support": 3231.0
+       },
+       "I-Gene": {
+         "precision": 0.8947873361048929,
+         "recall": 0.9283344392833444,
+         "f1-score": 0.9112522390490149,
+         "support": 6028.0
+       },
+       "accuracy": 0.9499763448640526,
+       "macro avg": {
+         "precision": 0.7994353124575614,
+         "recall": 0.8272655073709906,
+         "f1-score": 0.8126244405525289,
+         "support": 71866.0
+       },
+       "weighted avg": {
+         "precision": 0.9526540061037759,
+         "recall": 0.9499763448640526,
+         "f1-score": 0.9511135021493018,
+         "support": 71866.0
+       }
+     },
+     "span_level": {
+       "DomainMotif": {
+         "precision": 0.6019900497512438,
+         "recall": 0.6703601108033241,
+         "f1-score": 0.6343381389252949,
+         "support": 361
+       },
+       "FamilyName": {
+         "precision": 0.4965083798882682,
+         "recall": 0.5688,
+         "f1-score": 0.5302013422818792,
+         "support": 1250
+       },
+       "Gene": {
+         "precision": 0.8559102674719585,
+         "recall": 0.9227906976744186,
+         "f1-score": 0.8880931065353624,
+         "support": 3225
+       },
+       "macro avg": {
+         "precision": 0.6514695657038235,
+         "recall": 0.7206502694925808,
+         "f1-score": 0.6842108625808455,
+         "support": 4836
+       },
+       "weighted avg": {
+         "precision": 0.7440580015338297,
+         "recall": 0.8124483043837882,
+         "f1-score": 0.7766435100456833,
+         "support": 4836
+       }
+     }
+   }
+ }
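The `macro avg` and `weighted avg` rows in this report can be reproduced from the per-label rows: the macro average is the plain mean across labels, and the weighted average weights each label by its support. The sketch below checks this on the test-split span-level numbers copied from the JSON above. The helper names `macro_avg` and `weighted_avg` are made up for this example and are not taken from the training code.

```python
# Per-label span-level metrics for the test split, copied from the report above.
TEST_SPAN_LEVEL = {
    "DomainMotif": {"precision": 0.6019900497512438, "recall": 0.6703601108033241,
                    "f1-score": 0.6343381389252949, "support": 361},
    "FamilyName": {"precision": 0.4965083798882682, "recall": 0.5688,
                   "f1-score": 0.5302013422818792, "support": 1250},
    "Gene": {"precision": 0.8559102674719585, "recall": 0.9227906976744186,
             "f1-score": 0.8880931065353624, "support": 3225},
}


def macro_avg(report: dict, metric: str) -> float:
    """Unweighted mean of a metric across labels."""
    return sum(row[metric] for row in report.values()) / len(report)


def weighted_avg(report: dict, metric: str) -> float:
    """Mean of a metric across labels, weighted by each label's support."""
    total = sum(row["support"] for row in report.values())
    return sum(row[metric] * row["support"] for row in report.values()) / total


for metric in ("precision", "recall", "f1-score"):
    print(metric, round(macro_avg(TEST_SPAN_LEVEL, metric), 3),
          round(weighted_avg(TEST_SPAN_LEVEL, metric), 3))
```

Because Gene accounts for 3225 of the 4836 test spans, the weighted average sits much closer to the Gene scores than the macro average does.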
performance_report.md ADDED
@@ -0,0 +1,81 @@
+ # Performance on Training Set
+
+ ## Span Level
+
+ | Label | Precision | Recall | F1-score | Support |
+ | --- | --- | --- | --- | --- |
+ | DomainMotif | 0.907 | 0.888 | 0.897 | 197 |
+ | FamilyName | 0.976 | 0.980 | 0.978 | 815 |
+ | Gene | 0.988 | 0.976 | 0.982 | 2243 |
+ | macro avg | 0.957 | 0.948 | 0.953 | 3255 |
+ | weighted avg | 0.980 | 0.972 | 0.976 | 3255 |
+
+ ## Token Level
+
+ | Label | Precision | Recall | F1-score | Support |
+ | --- | --- | --- | --- | --- |
+ | O | 0.999 | 1.000 | 0.999 | 45948 |
+ | B-DomainMotif | 0.974 | 0.949 | 0.961 | 195 |
+ | I-DomainMotif | 0.996 | 0.989 | 0.993 | 536 |
+ | B-FamilyName | 0.993 | 0.990 | 0.991 | 816 |
+ | I-FamilyName | 0.994 | 0.997 | 0.996 | 1829 |
+ | B-Gene | 0.996 | 0.992 | 0.994 | 2239 |
+ | I-Gene | 0.999 | 0.998 | 0.998 | 4411 |
+ | macro avg | 0.993 | 0.988 | 0.990 | 55974 |
+ | weighted avg | 0.999 | 0.999 | 0.999 | 55974 |
+
+
+ # Performance on Validation Set
+
+ ## Span Level
+
+ | Label | Precision | Recall | F1-score | Support |
+ | --- | --- | --- | --- | --- |
+ | DomainMotif | 0.613 | 0.568 | 0.590 | 81 |
+ | FamilyName | 0.539 | 0.513 | 0.526 | 298 |
+ | Gene | 0.849 | 0.913 | 0.880 | 768 |
+ | macro avg | 0.667 | 0.665 | 0.665 | 1147 |
+ | weighted avg | 0.752 | 0.785 | 0.767 | 1147 |
+
+ ## Token Level
+
+ | Label | Precision | Recall | F1-score | Support |
+ | --- | --- | --- | --- | --- |
+ | O | 0.982 | 0.983 | 0.982 | 15288 |
+ | B-DomainMotif | 0.754 | 0.642 | 0.693 | 81 |
+ | I-DomainMotif | 0.846 | 0.715 | 0.775 | 207 |
+ | B-FamilyName | 0.682 | 0.609 | 0.643 | 299 |
+ | I-FamilyName | 0.757 | 0.666 | 0.708 | 637 |
+ | B-Gene | 0.872 | 0.938 | 0.904 | 763 |
+ | I-Gene | 0.886 | 0.938 | 0.912 | 1446 |
+ | macro avg | 0.826 | 0.784 | 0.803 | 18721 |
+ | weighted avg | 0.955 | 0.956 | 0.956 | 18721 |
+
+
+ # Performance on Testing Set
+
+ ## Span Level
+
+ | Label | Precision | Recall | F1-score | Support |
+ | --- | --- | --- | --- | --- |
+ | DomainMotif | 0.602 | 0.670 | 0.634 | 361 |
+ | FamilyName | 0.497 | 0.569 | 0.530 | 1250 |
+ | Gene | 0.856 | 0.923 | 0.888 | 3225 |
+ | macro avg | 0.651 | 0.721 | 0.684 | 4836 |
+ | weighted avg | 0.744 | 0.812 | 0.777 | 4836 |
+
+ ## Token Level
+
+ | Label | Precision | Recall | F1-score | Support |
+ | --- | --- | --- | --- | --- |
+ | O | 0.987 | 0.973 | 0.980 | 57537 |
+ | B-DomainMotif | 0.737 | 0.761 | 0.749 | 360 |
+ | I-DomainMotif | 0.847 | 0.815 | 0.831 | 979 |
+ | B-FamilyName | 0.619 | 0.674 | 0.645 | 1250 |
+ | I-FamilyName | 0.625 | 0.706 | 0.663 | 2481 |
+ | B-Gene | 0.887 | 0.934 | 0.910 | 3231 |
+ | I-Gene | 0.895 | 0.928 | 0.911 | 6028 |
+ | macro avg | 0.799 | 0.827 | 0.813 | 71866 |
+ | weighted avg | 0.953 | 0.950 | 0.951 | 71866 |
+
+
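These Markdown tables appear to be the JSON report rounded to three decimals. A minimal sketch of that rendering is given below, using two training-set token-level rows from `performance_report.json`. The function `markdown_rows` is a hypothetical reconstruction for illustration, not the script that produced this file.

```python
# Two token-level rows and the scalar accuracy from the training-set report.
TRAIN_TOKEN_LEVEL = {
    "O": {"precision": 0.9992820624388121, "recall": 0.9996517802733524,
          "f1-score": 0.9994668871650365, "support": 45948.0},
    "B-DomainMotif": {"precision": 0.9736842105263158, "recall": 0.9487179487179487,
                      "f1-score": 0.961038961038961, "support": 195.0},
    "accuracy": 0.9987136884982313,
}


def markdown_rows(report: dict) -> list:
    """Render per-label metric dicts as Markdown table rows (3 decimals)."""
    rows = [
        "| Label | Precision | Recall | F1-score | Support |",
        "| --- | --- | --- | --- | --- |",
    ]
    for label, metrics in report.items():
        if not isinstance(metrics, dict):
            continue  # skip scalar entries such as "accuracy"
        rows.append(
            f"| {label} | {metrics['precision']:.3f} | {metrics['recall']:.3f} "
            f"| {metrics['f1-score']:.3f} | {int(metrics['support'])} |"
        )
    return rows


print("\n".join(markdown_rows(TRAIN_TOKEN_LEVEL)))
```

The rounding explains why recall for the `O` label shows as 1.000 in the training table even though the underlying value is 0.99965.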
rng_state.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3956f5f0d871975d38a47184ff093b702da31d5c65f3d8334be329cc35ea62cf
+ size 14244
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,58 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
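Since `do_lower_case` is true, the `word` field in pipeline output is lowercased, which is why the README example reports `"egfr"` for the input `EGFR`. When the original casing matters, the character offsets can be used to slice the input text instead. The sketch below does this with the single entity shown in the README; `surface_form` is a hypothetical helper name.

```python
TEXT = ("EGFR T790M mutations have been known to affect treatment outcomes "
        "for NSCLC patients receiving erlotinib.")

# The one entity shown in the README's example output.
ENTITY = {"entity_group": "FamilyName", "score": 0.44405,
          "word": "egfr", "start": 0, "end": 4}


def surface_form(text: str, entity: dict) -> str:
    """Recover the original-cased mention from the character offsets."""
    return text[entity["start"]:entity["end"]]


print(surface_form(TEXT, ENTITY))
```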
trainer_state.json ADDED
@@ -0,0 +1,203 @@
+ {
+   "best_metric": 0.8025602601482168,
+   "best_model_checkpoint": "tmp_ner_fantastic-bale-17_39/run-51/checkpoint-351",
+   "epoch": 13.0,
+   "eval_steps": 500,
+   "global_step": 351,
+   "is_hyper_param_search": true,
+   "is_local_process_zero": true,
+   "is_world_process_zero": true,
+   "log_history": [
+     {
+       "epoch": 1.0,
+       "eval_accuracy": 0.8966935526948346,
+       "eval_loss": 0.3334161937236786,
+       "eval_macro_f1": 0.3481690804569024,
+       "eval_macro_precision": 0.381260717978419,
+       "eval_macro_recall": 0.35979719914273406,
+       "eval_runtime": 0.6891,
+       "eval_samples_per_second": 206.072,
+       "eval_steps_per_second": 26.122,
+       "step": 27
+     },
+     {
+       "epoch": 2.0,
+       "eval_accuracy": 0.930559264996528,
+       "eval_loss": 0.21216891705989838,
+       "eval_macro_f1": 0.6053382787959644,
+       "eval_macro_precision": 0.6107512517099946,
+       "eval_macro_recall": 0.6460256647872952,
+       "eval_runtime": 0.6859,
+       "eval_samples_per_second": 207.037,
+       "eval_steps_per_second": 26.244,
+       "step": 54
+     },
+     {
+       "epoch": 3.0,
+       "eval_accuracy": 0.9471716254473586,
+       "eval_loss": 0.17325381934642792,
+       "eval_macro_f1": 0.654390209074469,
+       "eval_macro_precision": 0.8778760893535847,
+       "eval_macro_recall": 0.6000593968379591,
+       "eval_runtime": 0.6857,
+       "eval_samples_per_second": 207.088,
+       "eval_steps_per_second": 26.251,
+       "step": 81
+     },
+     {
+       "epoch": 4.0,
+       "eval_accuracy": 0.9528337161476417,
+       "eval_loss": 0.15815088152885437,
+       "eval_macro_f1": 0.780354726682967,
+       "eval_macro_precision": 0.7951782992828399,
+       "eval_macro_recall": 0.7756083475295977,
+       "eval_runtime": 0.6835,
+       "eval_samples_per_second": 207.761,
+       "eval_steps_per_second": 26.336,
+       "step": 108
+     },
+     {
+       "epoch": 5.0,
+       "eval_accuracy": 0.9519256449975962,
+       "eval_loss": 0.17041093111038208,
+       "eval_macro_f1": 0.7750872516025015,
+       "eval_macro_precision": 0.819684829234941,
+       "eval_macro_recall": 0.7465932066328682,
+       "eval_runtime": 0.6848,
+       "eval_samples_per_second": 207.368,
+       "eval_steps_per_second": 26.286,
+       "step": 135
+     },
+     {
+       "epoch": 6.0,
+       "eval_accuracy": 0.9541156989477058,
+       "eval_loss": 0.18207110464572906,
+       "eval_macro_f1": 0.7757922057464975,
+       "eval_macro_precision": 0.8139015262713462,
+       "eval_macro_recall": 0.7479356706189721,
+       "eval_runtime": 0.6847,
+       "eval_samples_per_second": 207.398,
+       "eval_steps_per_second": 26.29,
+       "step": 162
+     },
+     {
+       "epoch": 7.0,
+       "eval_accuracy": 0.9547032743977352,
+       "eval_loss": 0.1869019716978073,
+       "eval_macro_f1": 0.7909756567968369,
+       "eval_macro_precision": 0.818653317630622,
+       "eval_macro_recall": 0.7694133266548752,
+       "eval_runtime": 0.6858,
+       "eval_samples_per_second": 207.047,
+       "eval_steps_per_second": 26.245,
+       "step": 189
+     },
+     {
+       "epoch": 8.0,
+       "eval_accuracy": 0.9542225308477111,
+       "eval_loss": 0.19935451447963715,
+       "eval_macro_f1": 0.7742556094296894,
+       "eval_macro_precision": 0.829305822509576,
+       "eval_macro_recall": 0.7366273963466895,
+       "eval_runtime": 0.6857,
+       "eval_samples_per_second": 207.102,
+       "eval_steps_per_second": 26.252,
+       "step": 216
+     },
+     {
+       "epoch": 9.0,
+       "eval_accuracy": 0.9549169381977458,
+       "eval_loss": 0.19827748835086823,
+       "eval_macro_f1": 0.7890513633726142,
+       "eval_macro_precision": 0.8190924393392907,
+       "eval_macro_recall": 0.7654512068808164,
+       "eval_runtime": 0.6844,
+       "eval_samples_per_second": 207.478,
+       "eval_steps_per_second": 26.3,
+       "step": 243
+     },
+     {
+       "epoch": 10.0,
+       "eval_accuracy": 0.9539554510976977,
+       "eval_loss": 0.21724671125411987,
+       "eval_macro_f1": 0.7951226311261627,
+       "eval_macro_precision": 0.8076887756893915,
+       "eval_macro_recall": 0.7863682941618138,
+       "eval_runtime": 0.6853,
+       "eval_samples_per_second": 207.214,
+       "eval_steps_per_second": 26.267,
+       "step": 270
+     },
+     {
+       "epoch": 11.0,
+       "eval_accuracy": 0.9528871320976443,
+       "eval_loss": 0.24698173999786377,
+       "eval_macro_f1": 0.7700880502690434,
+       "eval_macro_precision": 0.837595387901314,
+       "eval_macro_recall": 0.7281291396407018,
+       "eval_runtime": 0.6867,
+       "eval_samples_per_second": 206.792,
+       "eval_steps_per_second": 26.213,
+       "step": 297
+     },
+     {
+       "epoch": 12.0,
+       "eval_accuracy": 0.9551306019977566,
+       "eval_loss": 0.23617151379585266,
+       "eval_macro_f1": 0.7840279783447406,
+       "eval_macro_precision": 0.8305447049199165,
+       "eval_macro_recall": 0.7522814462795042,
+       "eval_runtime": 0.6856,
+       "eval_samples_per_second": 207.122,
+       "eval_steps_per_second": 26.255,
+       "step": 324
+     },
+     {
+       "epoch": 13.0,
+       "eval_accuracy": 0.9563591688478179,
+       "eval_loss": 0.22979934513568878,
+       "eval_macro_f1": 0.8025602601482168,
+       "eval_macro_precision": 0.8255341113341885,
+       "eval_macro_recall": 0.784416544697495,
+       "eval_runtime": 0.6867,
+       "eval_samples_per_second": 206.792,
+       "eval_steps_per_second": 26.213,
+       "step": 351
+     }
+   ],
+   "logging_steps": 500,
+   "max_steps": 864,
+   "num_input_tokens_seen": 0,
+   "num_train_epochs": 32,
+   "save_steps": 500,
+   "stateful_callbacks": {
+     "EarlyStoppingCallback": {
+       "args": {
+         "early_stopping_patience": 3,
+         "early_stopping_threshold": 0.001
+       },
+       "attributes": {
+         "early_stopping_patience_counter": 0
+       }
+     },
+     "TrainerControl": {
+       "args": {
+         "should_epoch_stop": false,
+         "should_evaluate": false,
+         "should_log": false,
+         "should_save": true,
+         "should_training_stop": false
+       },
+       "attributes": {}
+     }
+   },
+   "total_flos": 0,
+   "train_batch_size": 16,
+   "trial_name": null,
+   "trial_params": {
+     "learning_rate": 4.312024782506724e-05,
+     "per_device_train_batch_size": 16,
+     "warmup_ratio": 0.00046632724074153857,
+     "weight_decay": 0.004637010897989902
+   }
+ }
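The state above records a patience of 3 and a threshold of 0.001 for early stopping, with `eval_macro_f1` peaking at epoch 13 and the patience counter at 0. The sketch below replays the logged scores through a simplified patience rule to show why training had not been stopped by that point. It is a simplified model written for this note, not the `transformers` `EarlyStoppingCallback` implementation, and `simulate_early_stopping` is a hypothetical name.

```python
# eval_macro_f1 per epoch, copied from log_history above.
MACRO_F1 = [
    0.3481690804569024, 0.6053382787959644, 0.654390209074469,
    0.780354726682967, 0.7750872516025015, 0.7757922057464975,
    0.7909756567968369, 0.7742556094296894, 0.7890513633726142,
    0.7951226311261627, 0.7700880502690434, 0.7840279783447406,
    0.8025602601482168,
]


def simulate_early_stopping(scores, patience, threshold):
    """Return (best_epoch, stop_epoch); stop_epoch is None if never triggered."""
    best, best_epoch, counter = None, None, 0
    for epoch, score in enumerate(scores, start=1):
        # Count an epoch as an improvement only if it beats the best by > threshold.
        if best is None or score - best > threshold:
            best, best_epoch, counter = score, epoch, 0
        else:
            counter += 1
            if counter >= patience:
                return best_epoch, epoch
    return best_epoch, None


print(simulate_early_stopping(MACRO_F1, patience=3, threshold=0.001))

# Step arithmetic from the same file: 351 steps over 13 epochs is 27 per epoch,
# and 32 configured epochs at 27 steps each gives max_steps = 864.
steps_per_epoch = 351 // 13
print(steps_per_epoch, 32 * steps_per_epoch)
```

Under this rule the counter reaches 2 three times (after epochs 6, 9 and 12) but resets each time, which matches the counter value of 0 stored with checkpoint-351.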
vocab.txt ADDED
The diff for this file is too large to render. See raw diff