Model Card
This is a DeBERTaV2 language model pre-trained from scratch on the DutchMedicalText corpus and continually pre-trained on UMCU clinical texts mixed with a random selection of the DutchMedicalText corpus to avoid model collapse. As of 2025-06-21 the perplexity is 3.1, and training is still ongoing.
Model Details
About 360 million parameters, with a 1024-token context length
Model Description
- Developed by: Bram van Es (UPOD, UMCU/MEDxAI)
- Funded by: UMCU / Google TPU
- Model type: DeBERTaV2
- Language(s) (NLP): Dutch
- License: GPL-3
Model Sources
- Repository: code for training
- Paper: forthcoming
Intended use
This model is directly suitable for masked language modeling and can be fine-tuned for token/sequence classification, contrastive embeddings, relation extraction, and other downstream tasks.
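The masked language modeling objective corrupts a fraction of the input tokens and trains the model to recover them. A minimal, simplified sketch of the masking step (assuming whitespace tokenization and a plain `[MASK]` string for illustration; the real tokenizer and mask token come from the model, and standard BERT-style masking additionally substitutes random tokens some of the time):

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder; the actual mask token is defined by the model's tokenizer

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace a fraction of tokens with the mask token,
    returning the corrupted sequence and the prediction targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets[i] = tok  # the model learns to predict this token back
        else:
            masked.append(tok)
    return masked, targets

tokens = "de patient heeft last van kortademigheid bij inspanning".split()
masked, targets = mask_tokens(tokens)
```

At inference time the same mechanism powers fill-mask prediction; for downstream tasks the pretrained encoder is fine-tuned with a task-specific head instead.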
Bias, Risks, and Limitations
This model was not filtered for bias. As with any language model, do not blindly accept its output. This is not a causal model, and it has not been fine-tuned in any way for clinical decision support tasks.
Training Details
Training Data
Trained on about 80 GB of Dutch medical texts, ranging from guidelines to patient case reports, and about 5 million medical records from the cardiology department.
Training Procedure
Preprocessing
- Deidentification with DEDUCE
- Removal of repetitive phrases and repetitive non-word characters
- Cleaning with FTFY
- Chunking of each document based on token counts. We do not
Training Hyperparameters
- Training regime: bf16 mixed precision
- learning rate: 1e-4 - 2e-4
- number of warmup steps: 5000
- steps per epoch: 50000
- weight decay: 0.001
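The warmup count (5000 steps) and peak learning rate (1e-4 to 2e-4) suggest a standard warmup schedule; a minimal sketch, assuming linear warmup followed by linear decay and a hypothetical 500,000 total steps (the card gives only steps per epoch, not the decay shape or total step count):

```python
def learning_rate(step, peak_lr=2e-4, warmup_steps=5000, total_steps=500_000):
    """Linear warmup to peak_lr over warmup_steps, then linear decay to zero.
    The decay shape and total_steps are assumptions, not stated in the card."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, 1.0 - frac)
```

For example, the rate ramps from 0 at step 0 through 1e-4 at step 2500 to the 2e-4 peak at step 5000, then decays linearly.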
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
Pre-training
- Hardware Type: TPUv4
- Hours used: 400+
- Cloud Provider: Google
- Compute Region: US-WEST2 and EUROPE-WEST4
- Carbon Emitted: 100+ kg (compensated)
Continual pre-training
- Hardware Type: RTX 4000 ADA
- Hours used: 400+
- Carbon Emitted: 24+ kg
Model tree for UMCU/CardioDeBERTa.nl_clinical
- Base model: microsoft/deberta-v2-xlarge