---
library_name: transformers
license: mit
base_model:
- FacebookAI/xlm-roberta-base
---

# Biomedical-Enriched Classifier

This is the model used to create the [**Biomed-Enriched**](https://huggingface.co/datasets/almanach/Biomed-Enriched) dataset.

## Model Details

- **Base Model:** `xlm-roberta-base`
- **Model Type:** Multi-task model combining multi-label classification and regression.
- **Description:** This model was fine-tuned to classify paragraphs from biomedical texts by domain and document type, and to predict an educational quality score via regression.
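
For illustration, here is a minimal sketch of how such a multi-task head could sit on top of `xlm-roberta-base`. The class and attribute names below are hypothetical and may not match the released implementation; they only mirror the output structure described in this card.

```python
import torch.nn as nn
from transformers import AutoModel


class MultiTaskParagraphClassifier(nn.Module):
    """Hypothetical multi-task head: two classification heads and one
    regression head on top of the XLM-RoBERTa paragraph representation."""

    def __init__(self, encoder_name: str = "FacebookAI/xlm-roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.domain_head = nn.Linear(hidden, 3)    # Clinical, Biomedical, Other
        self.doc_type_head = nn.Linear(hidden, 4)  # Clinical Case, Study, Review, Other
        self.quality_head = nn.Linear(hidden, 1)   # educational quality score (1-5)

    def forward(self, input_ids, attention_mask):
        # Use the first token's hidden state as the paragraph representation.
        hidden = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]
        return {
            "domain_logits": self.domain_head(hidden),
            "doc_type_logits": self.doc_type_head(hidden),
            "quality_score": self.quality_head(hidden).squeeze(-1),
        }
```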

## Training

The model was trained on a set of 400,000 paragraphs from PubMed Central, which were annotated by the [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) model.

## Purpose

This classifier was created to scale the initial high-quality annotations to the entire PubMed Open Access dataset. This distillation process enabled the creation of the large-scale Biomed-Enriched dataset while maintaining annotation consistency.

## Model Outputs

The model predicts the following outputs:

### Domain (Classification)
- `Clinical`
- `Biomedical`
- `Other`

### Document Type (Classification)
- `Clinical Case`
- `Study`
- `Review`
- `Other`

### Educational Quality (Regression)
- A continuous score from `1` (low quality) to `5` (high quality).
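
Below is a minimal inference sketch, assuming the hypothetical `MultiTaskParagraphClassifier` from the Model Details section and that fine-tuned weights have been loaded into it; the actual repository may expose a different loading path.

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-base")
model = MultiTaskParagraphClassifier()  # in practice, load the fine-tuned weights here
model.eval()

paragraph = "A 45-year-old patient presented with chest pain and shortness of breath."
inputs = tokenizer(paragraph, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(inputs["input_ids"], inputs["attention_mask"])

domains = ["Clinical", "Biomedical", "Other"]
doc_types = ["Clinical Case", "Study", "Review", "Other"]
print("Domain:", domains[outputs["domain_logits"].argmax(-1).item()])
print("Document type:", doc_types[outputs["doc_type_logits"].argmax(-1).item()])
print("Educational quality:", round(outputs["quality_score"].item(), 2))
```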