Biomedical-Enriched Classifier

This is the model used to create the Biomed-Enriched dataset.

Model Details

Base Model: xlm-roberta-base
Model Type: Multi-task model combining multi-label classification and regression.
Description: This model was fine-tuned to classify paragraphs from biomedical texts by domain and document type, and to predict an educational quality score via regression.
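
The card does not state the training objective. As a minimal sketch of how classification and regression can be combined in a single multi-task objective, the example below sums a cross-entropy term for each classification head and a squared-error term for the quality score. The per-head softmax cross-entropy, the squared-error loss, and the `regression_weight` parameter are all assumptions made for illustration, not details confirmed by the authors.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target class under a softmax."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]


def multitask_loss(domain_logits, domain_target,
                   doc_type_logits, doc_type_target,
                   predicted_score, target_score,
                   regression_weight=1.0):
    """Sum two classification losses and a squared-error regression loss.

    regression_weight is a hypothetical knob for balancing the heads;
    the actual weighting used for this model is not documented here.
    """
    classification = (cross_entropy(domain_logits, domain_target)
                      + cross_entropy(doc_type_logits, doc_type_target))
    regression = (predicted_score - target_score) ** 2
    return classification + regression_weight * regression


# Illustrative values only: 3 domain logits, 4 document-type logits, one score.
loss = multitask_loss([2.0, 0.5, -1.0], 0, [0.1, 1.5, 0.3, -0.2], 1, 3.6, 4.0)
print(round(loss, 4))
```

Summing per-head losses lets one shared encoder learn from all three annotation types at once; in practice the relative weights usually need tuning.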

Training

The model was trained on a set of 400,000 paragraphs from PubMed Central, which were annotated by the Llama 3.1 70B Instruct model.

Purpose

This classifier was created to extend the initial high-quality annotations, produced by Llama 3.1 70B Instruct on 400,000 paragraphs, to the entire PubMed Open Access dataset. Distilling the annotator into a much smaller model made it possible to build the large-scale Biomed-Enriched dataset while keeping annotations consistent.
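
Scaling annotations to a whole corpus amounts to running the classifier over every paragraph in batches. The sketch below shows one way to structure that loop. `predict_batch` is a hypothetical callable standing in for the real model, and `dummy_predict_batch` exists only so the example runs; neither is part of this repository.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def annotate_corpus(paragraphs, predict_batch, batch_size=32):
    """Attach predictions to each paragraph, one batch at a time.

    predict_batch is a hypothetical callable: it takes a list of strings and
    returns one prediction dict per string, in the same order.
    """
    for batch in batched(paragraphs, batch_size):
        for text, prediction in zip(batch, predict_batch(batch)):
            yield {"text": text, **prediction}


def dummy_predict_batch(texts):
    """Stand-in for the real classifier so the sketch runs end to end."""
    return [{"domain": "Other", "document_type": "Other",
             "educational_quality": 1.0} for _ in texts]


rows = list(annotate_corpus(["first paragraph", "second paragraph", "third"],
                            dummy_predict_batch, batch_size=2))
print(len(rows))
```

Because `annotate_corpus` is a generator, a corpus can be streamed through it without loading every paragraph into memory.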

Model Outputs

The model predicts the following outputs:

Domain (Classification)

Clinical
Biomedical
Other

Document Type (Classification)

Clinical Case
Study
Review
Other

Educational Quality (Regression)

A regression score from 1 (low quality) to 5 (high quality).
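
As a rough illustration of how these three outputs might be turned into readable predictions, the sketch below decodes raw head outputs into labels and a bounded score. It assumes each classification head emits one logit per label in the order listed above and that a single label is picked per head; the actual output layout of this model is not documented here, so `decode_outputs` and its argument format are hypothetical.

```python
import math

# Label sets as listed in this model card; the index order is an assumption.
DOMAIN_LABELS = ["Clinical", "Biomedical", "Other"]
DOCUMENT_TYPE_LABELS = ["Clinical Case", "Study", "Review", "Other"]


def softmax(logits):
    """Convert raw logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_outputs(domain_logits, doc_type_logits, raw_score):
    """Map hypothetical raw head outputs to readable predictions."""
    domain_probs = softmax(domain_logits)
    doc_type_probs = softmax(doc_type_logits)
    domain_idx = max(range(len(domain_probs)), key=domain_probs.__getitem__)
    doc_type_idx = max(range(len(doc_type_probs)), key=doc_type_probs.__getitem__)
    return {
        "domain": DOMAIN_LABELS[domain_idx],
        "domain_confidence": domain_probs[domain_idx],
        "document_type": DOCUMENT_TYPE_LABELS[doc_type_idx],
        "document_type_confidence": doc_type_probs[doc_type_idx],
        # A regression head is unbounded, so clamp to the documented 1-5 range.
        "educational_quality": min(5.0, max(1.0, float(raw_score))),
    }


# Illustrative values only, not real model outputs.
example = decode_outputs([2.1, 0.3, -1.0], [0.2, 1.7, 0.4, -0.5], 5.4)
print(example["domain"], example["document_type"], example["educational_quality"])
# prints: Clinical Study 5.0
```

Clamping is a design choice: a raw regression output can fall slightly outside the annotation scale, and downstream filters are simpler when the score is guaranteed to lie between 1 and 5.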

Model Size: 0.1B params
Tensor Type: F32 (Safetensors)

Model Repository: almanach/Biomed-Enriched-classifier (fine-tuned from FacebookAI/xlm-roberta-base)
