metadata
library_name: transformers
license: mit
base_model:
- FacebookAI/xlm-roberta-base
Biomedical-Enriched Classifier
This is the model used to create the Biomed-Enriched dataset.
Model Details
- Base Model: xlm-roberta-base
- Model Type: Multi-task model combining multi-label classification and regression.
- Description: The model was fine-tuned to classify paragraphs from biomedical texts by domain and document type, and to predict an educational quality score via regression.
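As a rough illustration of how a multi-task objective of this kind can be put together, the sketch below sums a cross-entropy term for each classification head and a squared-error term for the quality score. This is an assumption made for illustration only: the card does not state the actual loss, and the function names and the `regression_weight` parameter are hypothetical.

```python
from math import exp, log


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one example (numerically stable)."""
    peak = max(logits)
    log_sum_exp = peak + log(sum(exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]


def multitask_loss(domain_logits, domain_target,
                   doc_type_logits, doc_type_target,
                   quality_pred, quality_target,
                   regression_weight=1.0):
    """Hypothetical combined loss: two classification heads plus one regression head.

    The unweighted sum of the two cross-entropy terms and the weighting of the
    squared error are assumptions, not the documented training recipe.
    """
    classification = (cross_entropy(domain_logits, domain_target)
                      + cross_entropy(doc_type_logits, doc_type_target))
    regression = (quality_pred - quality_target) ** 2
    return classification + regression_weight * regression


# Example: uninformative logits and a quality prediction that is off by one point.
loss = multitask_loss([0.0, 0.0, 0.0], 0, [0.0, 0.0, 0.0, 0.0], 1, 3.0, 4.0)
print(loss)
```

With uniform logits each classification term reduces to the log of the number of classes, so the example value is log(3) + log(4) + 1.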
Training
The model was trained on a set of 400,000 paragraphs from PubMed Central, which were annotated by the Llama 3.1 70B Instruct model.
Purpose
This classifier was created to scale the initial high-quality annotations to the entire PubMed Open Access dataset. This distillation process enabled the creation of the large-scale Biomed-Enriched dataset while maintaining annotation consistency.
Model Outputs
The model predicts the following outputs:
Domain (Classification)
- Clinical
- Biomedical
- Other
Document Type (Classification)
- Clinical Case
- Study
- Review
- Other
Educational Quality (Regression)
- A regression score from 1 (low quality) to 5 (high quality).
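The outputs listed above could be turned into readable predictions along the following lines. This is a minimal sketch under stated assumptions: the card does not document the raw output format, so the three inputs (two logit vectors and one scalar), the label order, and the `decode_outputs` helper are all hypothetical. Only the label names and the 1 to 5 score range come from the card.

```python
from math import exp

# Label names come from the model card; their ORDER here is an assumption.
DOMAIN_LABELS = ["Clinical", "Biomedical", "Other"]
DOCUMENT_TYPE_LABELS = ["Clinical Case", "Study", "Review", "Other"]


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_outputs(domain_logits, document_type_logits, quality_score):
    """Map hypothetical raw head outputs to labels and a bounded score."""
    domain_probs = softmax(domain_logits)
    doc_probs = softmax(document_type_logits)
    domain_idx = domain_probs.index(max(domain_probs))
    doc_idx = doc_probs.index(max(doc_probs))
    return {
        "domain": DOMAIN_LABELS[domain_idx],
        "domain_confidence": domain_probs[domain_idx],
        "document_type": DOCUMENT_TYPE_LABELS[doc_idx],
        "document_type_confidence": doc_probs[doc_idx],
        # Clamp the regression output to the documented 1-5 range.
        "educational_quality": min(5.0, max(1.0, float(quality_score))),
    }


# Made-up numbers standing in for real model outputs.
example = decode_outputs([2.1, 0.3, -1.0], [0.2, 1.7, 0.4, -0.5], 5.6)
print(example)
```

Clamping is a design choice for downstream filtering: a regression head is not guaranteed to stay inside the annotated range, so values slightly outside it are pulled back to the nearest bound.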