---
library_name: transformers
license: mit
base_model:
- FacebookAI/xlm-roberta-base
---

# Biomedical-Enriched Classifier

This is the model used to create the [**Biomed-Enriched**](https://huggingface.co/datasets/almanach/Biomed-Enriched) dataset.

## Model Details

- **Base Model:** `xlm-roberta-base`
- **Model Type:** Multi-task model combining multi-label classification and regression.
- **Description:** This model was fine-tuned to classify paragraphs from biomedical texts by domain and document type, and to predict an educational quality score via regression.

## Training

The model was trained on a set of 400,000 paragraphs from PubMed Central, which were annotated by the [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) model.

## Purpose

This classifier was created to scale the initial high-quality annotations to the entire PubMed Open Access dataset. This distillation process enabled the creation of the large-scale Biomed-Enriched dataset while maintaining annotation consistency.

## Model Outputs

The model predicts the following outputs:

### Domain (Classification)

- `Clinical`
- `Biomedical`
- `Other`

### Document Type (Classification)

- `Clinical Case`
- `Study`
- `Review`
- `Other`

### Educational Quality (Regression)

- A regression score from `1` (low quality) to `5` (high quality).
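## Example: Decoding the Outputs

The three outputs listed above can be turned into readable predictions with a small post-processing step. The following is a minimal sketch, not the checkpoint's actual inference code. It assumes that each classification head emits one logit per label, that the label order matches the order listed in this card, and that each head is decoded as a single choice with softmax and argmax. The function name `decode_outputs` and its arguments are hypothetical. If the heads are in fact multi-label, a per-label sigmoid with a threshold would replace the softmax step.

```python
import math

# Assumption: label order follows the lists in this card, not a verified
# id2label mapping from the checkpoint configuration.
DOMAIN_LABELS = ["Clinical", "Biomedical", "Other"]
DOCUMENT_TYPE_LABELS = ["Clinical Case", "Study", "Review", "Other"]


def softmax(logits):
    """Convert raw logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_label(logits, labels):
    """Return the highest-probability label and its probability."""
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return labels[idx], probs[idx]


def decode_outputs(domain_logits, document_type_logits, raw_score):
    """Map hypothetical raw head outputs to labels and a bounded score."""
    domain, domain_conf = best_label(domain_logits, DOMAIN_LABELS)
    doc_type, doc_type_conf = best_label(document_type_logits, DOCUMENT_TYPE_LABELS)
    return {
        "domain": domain,
        "domain_confidence": domain_conf,
        "document_type": doc_type,
        "document_type_confidence": doc_type_conf,
        # Regression outputs can drift outside the 1-5 range, so clamp them.
        "educational_score": min(5.0, max(1.0, float(raw_score))),
    }


# Made-up logits for illustration only; these are not real model outputs.
example = decode_outputs([2.1, 0.3, -1.0], [0.2, 1.7, 0.4, -0.5], 5.4)
print(example["domain"], example["document_type"], example["educational_score"])
# prints: Clinical Study 5.0
```

Clamping the regression output is a design choice for this sketch: it keeps downstream filters such as "score of at least 4" well defined even when the raw prediction falls slightly outside the annotated range.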