ccasimiro committed
Commit 9aafcd9 · Parent(s): 63e7718

Update README.md

Files changed (1): README.md (+15, -13)

README.md CHANGED

widget:
- text: "En el <mask> toraco-abdómino-pélvico no se encontraron hallazgos patológicos de interés."
---

# Biomedical-clinical language model for Spanish

This is a pretrained biomedical-clinical language model for Spanish. For details about the corpus and the training strategy used, see the paper cited below.

## BibTeX citation

If you use any of these resources (datasets or models) in your work, please cite our latest paper:

```bibtex
...
}
```

## Tokenization and model pretraining

This model is a [RoBERTa-based](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model trained on a **biomedical-clinical** corpus in Spanish collected from several sources (see the next section).
The training corpus was tokenized using a byte-level version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2), as used in the original [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model, with a vocabulary size of 52,000 tokens. Pretraining consisted of masked language modelling at the subword level, following the approach of the RoBERTa base model and using the same hyperparameters as in the original work. Training lasted a total of 48 hours on 16 NVIDIA V100 GPUs with 16 GB of memory each, using the Adam optimizer with a peak learning rate of 0.0005 and an effective batch size of 2,048 sentences.
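
Below is a minimal usage sketch with the Hugging Face `transformers` library, using the masked-sentence example from the widget above. The model identifier is an assumption based on this card's evaluation table (`roberta-base-biomedical-clinical-es`); adjust it to the actual repository path on the Hub.

```python
from transformers import pipeline

# Assumed model ID; prepend the owning organization if needed.
MODEL_ID = "roberta-base-biomedical-clinical-es"

# The fill-mask pipeline loads the byte-level BPE tokenizer and the
# pretrained RoBERTa checkpoint described above.
unmasker = pipeline("fill-mask", model=MODEL_ID)

# <mask> is RoBERTa's mask token; the model predicts the hidden subword.
predictions = unmasker(
    "En el <mask> toraco-abdómino-pélvico no se encontraron "
    "hallazgos patológicos de interés."
)

for p in predictions:
    print(f"{p['token_str']!r} (score: {p['score']:.3f})")
```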

## Training corpora and preprocessing

The training corpus is composed of several biomedical corpora in Spanish, collected from publicly available corpora and crawlers, together with a real-world clinical corpus (Clinical cases misc.):

| Name | No. tokens | Description |
|------|------------|-------------|
| [Medical crawler](https://zenodo.org/record/4561970) | 745,705,946 | Content crawled from more than 3,000 URLs belonging to Spanish biomedical and health domains. |
| Clinical cases misc. | 102,855,267 | A miscellany of medical content, essentially clinical cases. Note that a clinical case report is different from a scientific publication in which medical practitioners share patient cases, and also different from a clinical note or document. |
| [Scielo](https://github.com/PlanTL-SANIDAD/SciELO-Spain-Crawler) | 60,007,289 | Publications written in Spanish, crawled from the Spanish SciELO server in 2017. |
| [BARR2_background](https://temu.bsc.es/BARR2/downloads/background_set.raw_text.tar.bz2) | 24,516,442 | The Biomedical Abbreviation Recognition and Resolution (BARR2) background set, containing Spanish clinical case study sections from a variety of clinical disciplines. |
| Wikipedia_life_sciences | 13,890,501 | Wikipedia articles belonging to the Life Sciences category, crawled on 04/01/2021. |
| ... | ... | ... |
| [mespen_Medline](https://zenodo.org/record/3562536#.YTt1fH2xXbR) | 4,166,077 | Spanish-side articles extracted from a collection of Spanish-English parallel corpora of biomedical scientific literature. The parallel resources are aggregated from the MedlinePlus source. |
| PubMed | 1,858,966 | Open-access articles from the PubMed repository, crawled in 2017. |

To obtain a high-quality training corpus while retaining the idiosyncrasies of the clinical language, a cleaning pipeline was applied only to the biomedical corpora, keeping the clinical corpus (Clinical cases misc.) uncleaned. The main cleaning operations are:

- data parsing in different formats
- sentence splitting
- ...
- deduplication of repetitive contents
- keeping the original document boundaries

Then the biomedical corpora are concatenated, and a further global deduplication across the corpora is applied.
Finally, the clinical corpus is concatenated to the cleaned biomedical corpus, resulting in a medium-size biomedical-clinical corpus for Spanish composed of about 963M tokens.
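
As an illustration only, here is a minimal sketch of the kind of global, order-preserving deduplication described above; the actual tools, normalization, and granularity used by the authors are not specified in this card, so the function and file names are hypothetical.

```python
import hashlib

def deduplicate_lines(in_path: str, out_path: str) -> None:
    """Keep the first occurrence of each (normalized) line, preserving order."""
    seen = set()
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            # Light normalization so whitespace/case variants collapse together;
            # hashing keeps memory bounded on corpora of hundreds of millions of tokens.
            key = hashlib.sha1(" ".join(line.lower().split()).encode("utf-8")).digest()
            if key not in seen:
                seen.add(key)
                dst.write(line)

# Hypothetical file names for the concatenated biomedical corpora.
# deduplicate_lines("biomedical_concatenated.txt", "biomedical_dedup.txt")
```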

## Evaluation and results

The model was evaluated on Named Entity Recognition (NER) using the PharmaCoNER, CANTEMIST, and ICTUSnet datasets.

The evaluation results are compared against the [mBERT](https://huggingface.co/bert-base-multilingual-cased) and [BETO](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) models. Each cell reports F1 / Precision / Recall:

| Dataset | roberta-base-biomedical-clinical-es | mBERT | BETO |
|---------|-------------------------------------|-------|------|
| PharmaCoNER | **90.04** / **88.92** / **91.18** | 87.46 / 86.50 / 88.46 | 88.18 / 87.12 / 89.28 |
| CANTEMIST | **83.34** / **81.48** / **85.30** | 82.61 / 81.12 / 84.15 | 82.42 / 80.91 / 84.00 |
| ICTUSnet | **88.08** / **84.92** / **91.50** | 86.75 / 83.53 / 90.23 | 85.95 / 83.10 / 89.02 |
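
NER predictions are usually scored at the entity level; assuming that convention, the sketch below shows how such F1, precision, and recall values can be computed from BIO-tagged sequences with the `seqeval` library. The tags here are toy examples, not data from the evaluation above.

```python
from seqeval.metrics import f1_score, precision_score, recall_score

# Toy gold and predicted BIO tag sequences for two sentences.
y_true = [["B-DRUG", "O", "O", "B-DISEASE", "I-DISEASE"],
          ["B-DISEASE", "I-DISEASE", "O"]]
y_pred = [["B-DRUG", "O", "O", "B-DISEASE", "O"],
          ["B-DISEASE", "I-DISEASE", "O"]]

# Entity-level (micro-averaged) scores: a predicted entity counts as correct
# only if both its span and its type exactly match the gold annotation.
print(f"P={precision_score(y_true, y_pred):.2f}",
      f"R={recall_score(y_true, y_pred):.2f}",
      f"F1={f1_score(y_true, y_pred):.2f}")
```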
 
  ## Intended uses & limitations