Model Card for BioBERT Multi-label Medical Classification Model

A fine-tuned BioBERT model for multi-label classification of medical articles across four main medical specialties.

Model Details

Model Description

This model is based on BioBERT and specialized for multi-label classification of medical articles. It was developed through knowledge distillation from the main BioBERT model and trained on a comprehensive dataset of medical literature including titles and abstracts. The model can classify medical content into four primary medical categories and their combinations.

Developed by: Iver Johan Hincapie Betancur
Model type: Multi-label Text Classification (BioBERT-based)
Language(s) (NLP): English (Medical/Scientific)
Finetuned from model: BioBERT
Architecture: Transformer-based with LoRA adapters

Model Sources

Base Model Repository: https://huggingface.co/dmis-lab/biobert-v1.1
Base Model: BioBERT v1.1

Uses

Direct Use

This model is designed for multi-label classification of medical articles, papers, and abstracts. It can identify and classify content across four main medical specialties:

Neurological - Content related to nervous system disorders and treatments
Hepatorenal - Content related to liver and kidney conditions
Cardiovascular - Content related to heart and circulatory system
Oncological - Content related to cancer and tumor-related conditions

The model can assign multiple labels simultaneously, making it suitable for complex medical articles that span multiple specialties.

Bias, Risks, and Limitations

Dataset Imbalance: The model was trained on a dataset with significant class imbalance, which may affect prediction accuracy for underrepresented categories.

Domain Specificity: Performance may vary when applied to medical content outside the training domain or newer medical terminology.

Multi-label Complexity: Some medical articles may span multiple categories, and the model's ability to capture all relevant labels may vary.

Recommendations

Users should be aware that this model is designed for content classification purposes only and should not be used for clinical decision-making. Due to class imbalance in the training data, predictions for less common categories should be validated. Consider ensemble methods or additional validation for critical applications.

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("path/to/model")
model = AutoModelForSequenceClassification.from_pretrained("path/to/model")

# Example usage
text = "Patient presents with elevated liver enzymes and cardiac arrhythmia..."
# Set max_length explicitly so long abstracts are truncated to BERT's 512-token limit
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512, padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    # Sigmoid (not softmax): each label gets an independent probability
    predictions = torch.sigmoid(outputs.logits)[0].tolist()

# Map predictions to labels (label order is assumed to match the training order)
labels = ["neurological", "hepatorenal", "cardiovascular", "oncological"]
threshold = 0.5
predicted_labels = [label for label, pred in zip(labels, predictions) if pred > threshold]

Training Details

Training Data

The model was trained on a dataset containing over 3,500 medical records, including titles and abstracts from medical literature. The dataset covers four main medical specialties with their combinations, though it exhibits class imbalance that presents ongoing challenges.

Labels:

neurological
hepatorenal
cardiovascular
oncological

Training Procedure

The model was trained using knowledge distillation from the main BioBERT model, with specialized techniques to prevent overfitting and catastrophic forgetting.

Training Approach

Parameter-Efficient Fine-Tuning (PEFT) with LoRA (Low-Rank Adaptation):

LoRA freezes all pre-trained model weights
Adds small "adapter" matrices of low rank in attention layers
Only adapter matrix parameters are trained (typically <1% of total parameters)
Prevents catastrophic forgetting while enabling domain adaptation
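
As a rough check on the "typically <1%" figure, the sketch below counts the parameters LoRA would add to a BERT-base-sized encoder such as BioBERT v1.1 (12 layers, hidden size 768) at the rank listed in this card. It assumes adapters on the query and value projections of every layer, which this card does not confirm, and an approximate base size of 108 million parameters. The classification head, which is also trained, is ignored.

```python
# Rough LoRA parameter count for a BERT-base-sized encoder.
# Assumptions (not stated in this card): adapters on the query and value
# projections of every layer; ~108M parameters in the frozen base model.

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank

HIDDEN = 768                 # BERT-base hidden size
LAYERS = 12                  # BERT-base encoder layers
ADAPTED_PER_LAYER = 2        # query and value projections (assumed)
RANK = 16                    # LoRA rank listed in this card
BASE_PARAMS = 108_000_000    # approximate base model size (assumed)

per_adapter = lora_params(HIDDEN, HIDDEN, RANK)
total_adapter = per_adapter * ADAPTED_PER_LAYER * LAYERS
fraction = total_adapter / BASE_PARAMS

print(f"adapter parameters: {total_adapter:,}")   # 589,824
print(f"share of base model: {fraction:.2%}")     # 0.55%
```

Under these assumptions the adapters come to roughly half a percent of the base model, consistent with the figure quoted above. Adapting more projection matrices would raise the count proportionally.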

Training Hyperparameters

Training epochs: 10
LoRA rank (r): 16
LoRA alpha: 32
LoRA dropout: 0.1
Bias: none
Task type: SEQ_CLS (Sequence Classification)
Training regime: Mixed precision training
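
The LoRA values above line up with keyword arguments of `LoraConfig` in the Hugging Face `peft` library. The sketch below shows one way they could be wired together. It is an assumption about the setup, not the author's training script: `build_lora_config` is a hypothetical helper, and `peft` is imported lazily so the mapping can be inspected without the library installed.

```python
# Hyperparameters from this card, keyed by peft.LoraConfig argument names.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.1,
    "bias": "none",
    "task_type": "SEQ_CLS",
}

# Standard LoRA scales the low-rank update by lora_alpha / r.
scaling = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]
print(f"LoRA scaling factor: {scaling}")  # 2.0


def build_lora_config():
    """Hypothetical helper: build a peft LoraConfig from the values above."""
    from peft import LoraConfig  # requires the `peft` package

    return LoraConfig(**LORA_KWARGS)
```

With rank 16 and alpha 32, the adapter update is multiplied by 2 before being added to the frozen weights.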

Performance Analysis

Training Progression:

Strong convergence observed from epoch 1 to 10
Training loss decreased from 0.608 to 0.299 (51% reduction)
Validation loss decreased from 0.594 to 0.284 (52% reduction)
No signs of overfitting - validation and training losses track together
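
The reduction percentages follow directly from the reported start and end losses. A minimal check, where `relative_reduction` is a helper introduced only for this illustration:

```python
def relative_reduction(start: float, end: float) -> float:
    """Fractional decrease from start to end."""
    return (start - end) / start

train = relative_reduction(0.608, 0.299)
val = relative_reduction(0.594, 0.284)

print(f"training loss reduction:   {train:.1%}")  # 50.8%
print(f"validation loss reduction: {val:.1%}")    # 52.2%
```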

F1-Micro vs F1-Macro Gap:

Epoch 1: Large gap (0.312 vs 0.147) indicates class imbalance
Epoch 9: Minimal gap (0.841 vs 0.813) shows balanced performance across labels
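
To see why a micro/macro gap points to imbalance, the sketch below computes both averages in plain Python on a small hypothetical example (not data from this model). The predictions get the frequent label right and miss every rare one. Micro-averaging pools counts across labels and stays fairly high. Macro-averaging weights each label equally and drops sharply.

```python
# Micro- vs macro-averaged F1 for multi-label predictions (pure Python).
# The toy data below is hypothetical and only illustrates the averaging.

def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); returns 0.0 when all counts are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for lists of 0/1 label vectors."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    # Macro: average the per-label F1 scores, so every label weighs the same.
    macro = sum(f1_from_counts(*c) for c in counts) / n_labels
    # Micro: pool the counts first, so frequent labels dominate.
    micro = f1_from_counts(*(sum(col) for col in zip(*counts)))
    return micro, macro

# Columns: neurological, hepatorenal, cardiovascular, oncological
y_true = [
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 1],
]
# A model that only ever predicts the frequent first label.
y_pred = [
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]

micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro={micro:.3f} macro={macro:.3f}")  # micro=0.769 macro=0.250
```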

ROC-AUC Performance:

Excellent discriminative ability (0.923) across all labels
Consistent improvement throughout training

Model Stability:

Best performance at epoch 9, with slight decline at epoch 10
Suggests optimal stopping point around epoch 9-10

Additional Metrics to Consider

For comprehensive multi-label evaluation, consider adding:

Hamming Loss - Average fraction of incorrect labels per sample
Subset Accuracy - Exact match ratio (all labels predicted correctly)
Label-wise Precision/Recall - Performance breakdown per individual label
Coverage Error - Average number of top-ranked labels that must be included to cover all true labels
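
As a minimal sketch of how three of these metrics could be computed, the plain-Python functions below implement Hamming loss, subset accuracy, and coverage error on a small hypothetical example (the scores are illustrative, not outputs of this model). The function names mirror common usage but are defined here. Samples with no true labels count as zero coverage, which is a choice made for this sketch.

```python
# Pure-Python multi-label metrics on hypothetical toy data.

def hamming_loss(y_true, y_pred):
    """Fraction of individual label decisions that are wrong."""
    total = sum(len(t) for t in y_true)
    wrong = sum(ti != pi for t, p in zip(y_true, y_pred) for ti, pi in zip(t, p))
    return wrong / total

def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def coverage_error(y_true, y_score):
    """Average number of top-scored labels needed to include every true label."""
    depths = []
    for t, s in zip(y_true, y_score):
        true_scores = [sj for tj, sj in zip(t, s) if tj == 1]
        if not true_scores:
            depths.append(0)  # no true labels: counted as zero in this sketch
            continue
        lowest = min(true_scores)
        # Rank of the lowest-scored true label = labels scoring at least as high.
        depths.append(sum(1 for sj in s if sj >= lowest))
    return sum(depths) / len(depths)

# Columns: neurological, hepatorenal, cardiovascular, oncological
y_true = [
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]
y_score = [
    [0.9, 0.2, 0.1, 0.3],
    [0.1, 0.8, 0.4, 0.6],
    [0.2, 0.1, 0.3, 0.7],
]
# Binarize the scores at the same 0.5 threshold used in the usage example.
y_pred = [[int(s > 0.5) for s in row] for row in y_score]

hl = hamming_loss(y_true, y_pred)
sa = subset_accuracy(y_true, y_pred)
ce = coverage_error(y_true, y_score)
print(f"hamming_loss={hl:.3f} subset_accuracy={sa:.3f} coverage_error={ce:.3f}")
# hamming_loss=0.167 subset_accuracy=0.667 coverage_error=1.667
```

The second sample shows how the metrics differ: one missed label and one false positive cost two of twelve label decisions under Hamming loss, but the whole sample under subset accuracy.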
