ccasimiro committed
Commit 9aafcd9 · Parent(s): 63e7718

Update README.md

Files changed (1): README.md (+15, -13)

README.md CHANGED

widget:
- text: "En el <mask> toraco-abdómino-pélvico no se encontraron hallazgos patológicos de interés."
---

# Biomedical-clinical language model for Spanish

This is a pretrained biomedical-clinical language model for Spanish. For details about the corpus and the training strategy used, see the paper cited below.

## BibTeX citation

If you use any of these resources (datasets or models) in your work, please cite our latest paper:

```bibtex
...
}
```

## Tokenization and model pretraining

This model is a [RoBERTa-based](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model trained on a **biomedical-clinical** corpus in Spanish collected from several sources (see the next section).
The training corpus was tokenized using a byte-level version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2), as used in the original [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model, with a vocabulary size of 52,000 tokens. Pretraining consisted of masked language modelling at the subword level, following the approach of the RoBERTa base model and using the same hyperparameters as in the original work. Training lasted a total of 48 hours on 16 NVIDIA V100 GPUs with 16 GB of memory each, using the Adam optimizer with a peak learning rate of 0.0005 and an effective batch size of 2,048 sentences.
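
Below is a minimal usage sketch with the Hugging Face `transformers` library, using the masked-sentence example from the widget above. The model identifier is an assumption based on this card's evaluation table (`roberta-base-biomedical-clinical-es`); adjust it to the actual repository path on the Hub.

```python
from transformers import pipeline

# Assumed model ID; prepend the owning organization if needed.
MODEL_ID = "roberta-base-biomedical-clinical-es"

# The fill-mask pipeline loads the byte-level BPE tokenizer and the
# pretrained RoBERTa checkpoint described above.
unmasker = pipeline("fill-mask", model=MODEL_ID)

# <mask> is RoBERTa's mask token; the model predicts the hidden subword.
predictions = unmasker(
    "En el <mask> toraco-abdómino-pélvico no se encontraron "
    "hallazgos patológicos de interés."
)

for p in predictions:
    print(f"{p['token_str']!r} (score: {p['score']:.3f})")
```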

## Training corpora and preprocessing

The training corpus is composed of several biomedical corpora in Spanish, collected from publicly available corpora and crawlers, together with a real-world clinical corpus (Clinical cases misc.):

| Name | No. tokens | Description |
|------|------------|-------------|
| [Medical crawler](https://zenodo.org/record/4561970) | 745,705,946 | Content crawled from more than 3,000 URLs belonging to Spanish biomedical and health domains. |
| Clinical cases misc. | 102,855,267 | A miscellany of medical content, essentially clinical cases. Note that a clinical case report is different from a scientific publication in which medical practitioners share patient cases, and also different from a clinical note or document. |
| [Scielo](https://github.com/PlanTL-SANIDAD/SciELO-Spain-Crawler) | 60,007,289 | Publications written in Spanish, crawled from the Spanish SciELO server in 2017. |
| [BARR2_background](https://temu.bsc.es/BARR2/downloads/background_set.raw_text.tar.bz2) | 24,516,442 | The Biomedical Abbreviation Recognition and Resolution (BARR2) background set, containing Spanish clinical case study sections from a variety of clinical disciplines. |
| Wikipedia_life_sciences | 13,890,501 | Wikipedia articles belonging to the Life Sciences category, crawled on 04/01/2021. |
| ... | ... | ... |
| [mespen_Medline](https://zenodo.org/record/3562536#.YTt1fH2xXbR) | 4,166,077 | Spanish-side articles extracted from a collection of Spanish-English parallel corpora of biomedical scientific literature. The parallel resources are aggregated from the MedlinePlus source. |
| PubMed | 1,858,966 | Open-access articles from the PubMed repository, crawled in 2017. |

To obtain a high-quality training corpus while retaining the idiosyncrasies of the clinical language, a cleaning pipeline was applied only to the biomedical corpora, keeping the clinical corpus (Clinical cases misc.) uncleaned. The main cleaning operations are:

- data parsing in different formats
- sentence splitting
- ...
- deduplication of repetitive contents
- keeping the original document boundaries

Then the biomedical corpora are concatenated, and a further global deduplication across the corpora is applied.
Finally, the clinical corpus is concatenated to the cleaned biomedical corpus, resulting in a medium-size biomedical-clinical corpus for Spanish composed of about 963M tokens.
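
As an illustration only, here is a minimal sketch of the kind of global, order-preserving deduplication described above; the actual tools, normalization, and granularity used by the authors are not specified in this card, so the function and file names are hypothetical.

```python
import hashlib

def deduplicate_lines(in_path: str, out_path: str) -> None:
    """Keep the first occurrence of each (normalized) line, preserving order."""
    seen = set()
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            # Light normalization so whitespace/case variants collapse together;
            # hashing keeps memory bounded on corpora of hundreds of millions of tokens.
            key = hashlib.sha1(" ".join(line.lower().split()).encode("utf-8")).digest()
            if key not in seen:
                seen.add(key)
                dst.write(line)

# Hypothetical file names for the concatenated biomedical corpora.
# deduplicate_lines("biomedical_concatenated.txt", "biomedical_dedup.txt")
```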

## Evaluation and results

The model was evaluated on Named Entity Recognition (NER) using the PharmaCoNER, CANTEMIST, and ICTUSnet datasets.

The evaluation results are compared against the [mBERT](https://huggingface.co/bert-base-multilingual-cased) and [BETO](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) models. Each cell reports F1 / Precision / Recall:

| Dataset | roberta-base-biomedical-clinical-es | mBERT | BETO |
|---------|-------------------------------------|-------|------|
| PharmaCoNER | **90.04** / **88.92** / **91.18** | 87.46 / 86.50 / 88.46 | 88.18 / 87.12 / 89.28 |
| CANTEMIST | **83.34** / **81.48** / **85.30** | 82.61 / 81.12 / 84.15 | 82.42 / 80.91 / 84.00 |
| ICTUSnet | **88.08** / **84.92** / **91.50** | 86.75 / 83.53 / 90.23 | 85.95 / 83.10 / 89.02 |
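
NER predictions are usually scored at the entity level; assuming that convention, the sketch below shows how such F1, precision, and recall values can be computed from BIO-tagged sequences with the `seqeval` library. The tags here are toy examples, not data from the evaluation above.

```python
from seqeval.metrics import f1_score, precision_score, recall_score

# Toy gold and predicted BIO tag sequences for two sentences.
y_true = [["B-DRUG", "O", "O", "B-DISEASE", "I-DISEASE"],
          ["B-DISEASE", "I-DISEASE", "O"]]
y_pred = [["B-DRUG", "O", "O", "B-DISEASE", "O"],
          ["B-DISEASE", "I-DISEASE", "O"]]

# Entity-level (micro-averaged) scores: a predicted entity counts as correct
# only if both its span and its type exactly match the gold annotation.
print(f"P={precision_score(y_true, y_pred):.2f}",
      f"R={recall_score(y_true, y_pred):.2f}",
      f"F1={f1_score(y_true, y_pred):.2f}")
```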
 
  ## Intended uses & limitations