Update README.md

README.md
widget:
- text: "En el <mask> toraco-abdómino-pélvico no se encontraron hallazgos patológicos de interés."
---

# Biomedical-clinical language model for Spanish

Biomedical-clinical pretrained language model for Spanish. For more details about the corpus and the training strategy used, see the paper below.
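
Since the card ships a fill-mask widget, the model can be exercised the same way from code. Below is a minimal sketch with the `transformers` pipeline; the bare model id is taken from the results-table column name and is a stand-in for the actual Hub repository id:

```python
from transformers import pipeline

# Stand-in model id (from the results table below); replace with the real Hub repo id.
MODEL_ID = "roberta-base-biomedical-clinical-es"

unmasker = pipeline("fill-mask", model=MODEL_ID)

# The widget sentence from the model card metadata.
for pred in unmasker(
    "En el <mask> toraco-abdómino-pélvico no se encontraron hallazgos patológicos de interés."
):
    # Each prediction carries the filled-in token, its probability, and the full sequence.
    print(f"{pred['token_str']!r}  {pred['score']:.4f}")
```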

## BibTeX citation

If you use any of these resources (datasets or models) in your work, please cite our latest paper:

```bibtex
}
```

## Tokenization and model pretraining

This model is a [RoBERTa-based](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model trained on a **biomedical-clinical** corpus in Spanish collected from several sources (see next section).

The training corpus has been tokenized using a byte version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2), as used in the original [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model, with a vocabulary size of 52,000 tokens. The pretraining consists of masked-language-model training at the subword level, following the approach employed for the RoBERTa base model, with the same hyperparameters as in the original work. The training lasted a total of 48 hours with 16 NVIDIA V100 GPUs of 16GB each, using the Adam optimizer with a peak learning rate of 0.0005 and an effective batch size of 2,048 sentences.
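
A short sketch of what these two steps look like with the `transformers` API, assuming the same stand-in model id as above; the 15% masking rate is RoBERTa's default and an assumption here, since the card does not state it:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Stand-in model id, as in the fill-mask example above.
tokenizer = AutoTokenizer.from_pretrained("roberta-base-biomedical-clinical-es")

# Byte-level BPE segments clinical terms into subword units from the 52,000-token vocabulary.
print(tokenizer.tokenize("En el TAC toraco-abdómino-pélvico no se encontraron hallazgos."))

# Masked-language-model pretraining masks subwords at random; 15% is RoBERTa's
# default rate (assumed here).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
batch = collator([tokenizer("No se encontraron hallazgos patológicos.")])
print(batch["input_ids"])  # some positions replaced by tokenizer.mask_token_id
print(batch["labels"])     # original ids at masked positions, -100 elsewhere
```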

## Training corpora and preprocessing

The training corpus is composed of several biomedical corpora in Spanish, collected from publicly available corpora and crawlers, and a real-world clinical corpus (Clinical cases misc.):

| Name | No. tokens | Description |
|------|------------|-------------|
| [Medical crawler](https://zenodo.org/record/4561970) | 745,705,946 | Crawler of more than 3,000 URLs belonging to Spanish biomedical and health domains. |
| Clinical cases misc. | 102,855,267 | A miscellany of medical content, essentially clinical cases. Note that a clinical case report is a scientific publication where medical practitioners share patient cases, and it is different from a clinical note or document. |
| [Scielo](https://github.com/PlanTL-SANIDAD/SciELO-Spain-Crawler) | 60,007,289 | Publications written in Spanish crawled from the Spanish SciELO server in 2017. |
| [BARR2_background](https://temu.bsc.es/BARR2/downloads/background_set.raw_text.tar.bz2) | 24,516,442 | Biomedical Abbreviation Recognition and Resolution (BARR2) containing Spanish clinical case study sections from a variety of clinical disciplines. |
| Wikipedia_life_sciences | 13,890,501 | Wikipedia articles belonging to the Life Sciences category, crawled on 04/01/2021. |
| [mespen_Medline](https://zenodo.org/record/3562536#.YTt1fH2xXbR) | 4,166,077 | Spanish-side articles extracted from a Spanish-English parallel corpus of biomedical scientific literature, aggregated from the MedlinePlus source. |
| PubMed | 1,858,966 | Open-access articles from the PubMed repository, crawled in 2017. |

To obtain a high-quality training corpus while retaining the idiosyncrasies of the clinical language, a cleaning pipeline has been applied only to the biomedical corpora, keeping the clinical corpus (Clinical cases misc.) uncleaned. Essentially, the cleaning operations used are:

- data parsing in different formats
- sentence splitting
- deduplication of repetitive contents
- keep the original document boundaries

Then, the biomedical corpora are concatenated, and further global deduplication among the corpora has been applied.

Finally, the clinical corpus is concatenated to the cleaned biomedical corpus, resulting in a medium-size biomedical-clinical corpus for Spanish composed of about 963M tokens.
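
A minimal sketch of how such a per-corpus cleaning step could look; the helper names and the exact splitting and deduplication rules below are assumptions, not the project's actual pipeline:

```python
# Hypothetical sketch of the per-corpus cleaning described above; every
# heuristic below is an assumption, since the actual rules are not given here.
import hashlib
import re


def split_sentences(document: str) -> list[str]:
    # Naive sentence splitting on terminal punctuation (a real pipeline
    # would use a proper sentence segmenter).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]


def clean_corpus(documents: list[str]) -> list[list[str]]:
    seen: set[str] = set()
    cleaned: list[list[str]] = []
    for doc in documents:
        sentences = []
        for sent in split_sentences(doc):
            # Deduplicate repetitive content via a hash of the normalized sentence.
            key = hashlib.md5(sent.lower().encode("utf-8")).hexdigest()
            if key in seen:
                continue
            seen.add(key)
            sentences.append(sent)
        if sentences:
            # Keep the original document boundaries: one sentence list per document.
            cleaned.append(sentences)
    return cleaned
```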

## Evaluation and results

The model has been evaluated on Named Entity Recognition (NER) using the PharmaCoNER, CANTEMIST, and ICTUSnet datasets. The evaluation results are compared against the [mBERT](https://huggingface.co/bert-base-multilingual-cased) and [BETO](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) models:

| F1 - Precision - Recall | roberta-base-biomedical-clinical-es | mBERT | BETO |
|--------------------------|-------------------------------------|-------|------|
| PharmaCoNER | **90.04** - **88.92** - **91.18** | 87.46 - 86.50 - 88.46 | 88.18 - 87.12 - 89.28 |
| CANTEMIST | **83.34** - **81.48** - **85.30** | 82.61 - 81.12 - 84.15 | 82.42 - 80.91 - 84.00 |
| ICTUSnet | **88.08** - **84.92** - **91.50** | 86.75 - 83.53 - 90.23 | 85.95 - 83.10 - 89.02 |
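
The scores above are entity-level metrics; for reference, metrics of this kind can be computed over BIO-tagged sequences with `seqeval`. A toy illustration with made-up tags, not the project's actual evaluation script:

```python
# Toy illustration of entity-level F1/precision/recall with seqeval
# (assumption: not the actual evaluation code used for the table above).
from seqeval.metrics import f1_score, precision_score, recall_score

y_true = [["B-DRUG", "I-DRUG", "O", "O", "B-DRUG"]]
y_pred = [["B-DRUG", "I-DRUG", "O", "B-DRUG", "B-DRUG"]]

print(f"F1:        {f1_score(y_true, y_pred):.2f}")
print(f"Precision: {precision_score(y_true, y_pred):.2f}")
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")
```
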
## Intended uses & limitations
|