---
language:
- english
thumbnail:
tags:
- language model
license:
datasets:
- EMBO/biolang
metrics:
-
---

# bio-lm

## Model description

This model is a [RoBERTa base model](https://huggingface.co/roberta-base) further trained with a masked language modeling task on a compendium of English scientific text from the life sciences, using the [BioLang dataset](https://huggingface.co/datasets/EMBO/biolang).

## Intended uses & limitations

#### How to use

The intended use of this model is to be fine-tuned for downstream tasks, token classification in particular (see the sketch at the end of this card).

To quickly check the model as-is on a fill-mask task:

```python
from transformers import pipeline, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', model_max_length=512)
text = "Let us try this model to see if it <mask>."
fill_mask = pipeline(
    "fill-mask",
    model='EMBO/bio-lm',
    tokenizer=tokenizer
)
fill_mask(text)
```

#### Limitations and bias

This model should be fine-tuned on a specific task like token classification. The model must be used with the `roberta-base` tokenizer.

## Training data

The model was trained with a masked language modeling task on the [BioLang dataset](https://huggingface.co/datasets/EMBO/biolang), which includes 12 million examples drawn from abstracts and figure legends of papers published in the life sciences.

## Training procedure

The training was run on an NVIDIA DGX Station with 4 Tesla V100 GPUs.

Training code is available at https://github.com/source-data/soda-roberta

- Command: `python -m lm.train /data/json/oapmc_abstracts_figs/ MLM`
- Tokenizer vocab size: 50265
- Training data: bio_lang/MLM
- Training with: 12005390 examples
- Evaluating on: 36713 examples
- Epochs: 3.0
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- Tensorboard run: lm-MLM-2021-01-27T15-17-43.113766

End of training:

```
trainset: {'loss': 0.8653350830078125}
validation set: {'eval_loss': 0.8192330598831177, 'eval_recall': 0.8154601116513597}
```

## Eval results

Eval on test set:

```
recall: 0.814471959728645
```
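
As noted under "Intended uses & limitations", this model is meant to be fine-tuned for token classification. The sketch below shows one possible way to do that with the `transformers` Trainer API; the label list, the toy dataset with its `words`/`labels` fields, and the training hyperparameters are illustrative placeholders, not part of the original training setup.

```python
# Minimal, illustrative fine-tuning sketch for EMBO/bio-lm on token classification.
# The label scheme, toy data, and hyperparameters below are hypothetical placeholders.
from datasets import Dataset
from transformers import (
    RobertaTokenizerFast,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
    Trainer,
)

# The card specifies that the roberta-base tokenizer must be used.
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base", model_max_length=512)

# Hypothetical NER-style label scheme.
labels = ["O", "B-GENE", "I-GENE"]
model = AutoModelForTokenClassification.from_pretrained(
    "EMBO/bio-lm",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

def tokenize_and_align(example):
    # Tokenize pre-split words and align word-level labels to sub-word tokens.
    # Special tokens and continuation sub-words get the ignore index -100.
    encoded = tokenizer(example["words"], is_split_into_words=True, truncation=True)
    aligned, previous = [], None
    for word_id in encoded.word_ids():
        if word_id is None or word_id == previous:
            aligned.append(-100)
        else:
            aligned.append(example["labels"][word_id])
        previous = word_id
    encoded["labels"] = aligned
    return encoded

# Tiny in-memory toy dataset; in practice, load a real annotated corpus.
raw = Dataset.from_dict({
    "words": [["TP53", "regulates", "apoptosis"]],
    "labels": [[1, 0, 0]],
})
train_dataset = raw.map(tokenize_and_align, remove_columns=["words"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bio-lm-ner",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=5e-5,
    ),
    train_dataset=train_dataset,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```

Only the first sub-word of each word receives a real label; the ignore index -100 keeps special tokens and continuation sub-words out of the loss, which matches how `DataCollatorForTokenClassification` pads label sequences at batch time.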