ChristBERT
ChristBERT (Clinical- and Healthcare-Related Issues and Subjects Tuned BERT) is a family of domain-adapted German biomedical RoBERTa models. It was developed to address the lack of high-quality German-language models for clinical and healthcare NLP tasks.
Model Variants
- ChristBERT: Continued pretraining from GeistBERT on biomedical data.
- ChristBERT_scratch: Trained from scratch on biomedical data using GeistBERT's vocabulary.
- ChristBERT_BPE: Trained from scratch on biomedical data using a new vocabulary trained on the biomedical domain (byte-level BPE, 52k tokens).
Model Architecture
All ChristBERT variants are based on the RoBERTa base architecture:
- Transformer layers: 12
- Hidden size: 768
- Attention heads: 12
- Parameters: ~125M
- Sequence length: 512 tokens
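For orientation, here is a minimal sketch of this configuration in Hugging Face transformers. The intermediate_size of 3072 is the standard RoBERTa-base default and an assumption here, as is max_position_embeddings=514 (512 tokens plus RoBERTa's two special position offsets).

```python
from transformers import RobertaConfig, RobertaModel

# Sketch of a RoBERTa-base config matching the numbers above. vocab_size is
# the 52k BPE vocabulary; intermediate_size=3072 is the RoBERTa-base default
# (an assumption), and max_position_embeddings=514 follows the RoBERTa
# convention of 512 tokens plus two special position offsets.
config = RobertaConfig(
    vocab_size=52_000,
    num_hidden_layers=12,
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=514,
)
model = RobertaModel(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")  # ~125M
```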
Pretraining Data
ChristBERT was trained on a 13.5 GB biomedical corpus consisting of:
- Hpsmedia medical journals
- Springer Nature biomedical publications
- PubMed Central abstracts and full texts
- German medical PhD theses
- German medical Wikipedia
- German-translated MIMIC-IV notes (translated using LLaMA 3.1 8B)
- Crawled German health web content (filtered via a fine-tuned classifier)
See table below:
| Dataset | Documents | Words | Size (MB) |
|---|---|---|---|
| Hpsmedia | 277,357 | 405M | 3,117 |
| Springer Nature | 258,000 | 259M | 1,984 |
| PubMed Central | 90,273 | 220M | 1,609 |
| PhD Theses | 7,486 | 90M | 646 |
| Medical Wikipedia | 75,585 | 49M | 362 |
| MIMIC-IV Notes | 330,486 | 734M | 5,310 |
| Web Crawl | 93,642 | 69M | 512 |
| Total | 1.1M+ | ~1.8B | ~13,540 |
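The classifier-based web filtering can be pictured roughly as follows; the checkpoint name, label scheme, and threshold are illustrative assumptions, not the authors' actual pipeline.

```python
from transformers import pipeline

# Hypothetical sketch: score each crawled page with a fine-tuned German
# health-text classifier and keep confident medical content. The model
# name, "MEDICAL" label, and 0.9 threshold are placeholders.
classifier = pipeline("text-classification", model="my-org/german-health-filter")

def filter_medical(pages, threshold=0.9):
    kept = []
    for text in pages:
        pred = classifier(text, truncation=True)[0]  # {"label": ..., "score": ...}
        if pred["label"] == "MEDICAL" and pred["score"] >= threshold:
            kept.append(text)
    return kept
```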
Pretraining Setup
- Framework: Fairseq
- Objective: Masked Language Modeling (Whole Word Masking)
- Optimizer: AdamW
- Learning Rate Schedule: Linear warmup (10k steps) + polynomial decay
- Max LR:
  - 7e-4 (ChristBERT)
  - 6e-4 (ChristBERT_scratch & ChristBERT_BPE)
- Batch Size: 8,192 tokens
- Sequence Length: 512
- Steps: 100,000
- Hardware: 4× NVIDIA A100 or 2× NVIDIA H100
- Total compute time: ~21.7 GPU days
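As a worked example of the schedule above, the learning rate at any step can be computed as below; the decay power of 1.0 (i.e. linear decay) is Fairseq's default for its polynomial_decay scheduler and an assumption here.

```python
def lr_at(step, max_lr=7e-4, warmup=10_000, total=100_000, power=1.0):
    """Linear warmup to max_lr over `warmup` steps, then polynomial decay
    to zero at `total` steps. power=1.0 (linear decay) is an assumption."""
    if step < warmup:
        return max_lr * step / warmup
    return max_lr * (1 - (step - warmup) / (total - warmup)) ** power

print(lr_at(10_000))  # 0.0007 (peak after warmup)
print(lr_at(55_000))  # 0.00035 (halfway through the decay phase)
```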
Tokenizer
- Type: Byte-level BPE
- Vocabulary size: 52,000
- Compatible with RoBERTa/GPT-2 tokenizer conventions
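A vocabulary of this kind could be trained with the tokenizers library roughly as follows; the corpus path and min_frequency are placeholders rather than the authors' exact settings.

```python
import os
from tokenizers import ByteLevelBPETokenizer

# Train a 52k byte-level BPE vocabulary with RoBERTa-style special tokens.
# The corpus file and min_frequency are illustrative placeholders.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["german_biomedical_corpus.txt"],
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
os.makedirs("christbert_bpe_tokenizer", exist_ok=True)
tokenizer.save_model("christbert_bpe_tokenizer")  # writes vocab.json + merges.txt
```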
Intended Use
- Named Entity Recognition (NER)
- Clinical and biomedical text classification
- German medical text mining and information retrieval
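For NER, the encoder can be fine-tuned with a token-classification head, as sketched below; the BIO label set is a hypothetical example, not the schema of any of the benchmarks.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Attach a freshly initialized token-classification head to the pretrained
# encoder. The BIO label set here is hypothetical.
labels = ["O", "B-DIAGNOSIS", "I-DIAGNOSIS", "B-MEDICATION", "I-MEDICATION"]
tokenizer = AutoTokenizer.from_pretrained("ChristBERT/ChristBERT_base")
model = AutoModelForTokenClassification.from_pretrained(
    "ChristBERT/ChristBERT_base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
```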
Evaluation
ChristBERT was evaluated on:
- 3 medical NER benchmarks
- 2 clinical text classification benchmarks
Metrics: Micro-averaged precision, recall, and F1
- Outperformed existing German medical and general-purpose language models on 4 out of 5 tasks
- Continued pretraining on general medical text yielded especially strong performance
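Micro averaging pools all individual label decisions before computing each metric; below is a toy sketch with scikit-learn (entity-level NER scoring would typically use seqeval on BIO spans instead).

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy illustration of micro-averaged scores: counts are pooled over all
# classes, so precision, recall, and F1 coincide in this single-label case.
y_true = ["DIAG", "DIAG", "MED", "O", "O", "MED"]
y_pred = ["DIAG", "O",    "MED", "O", "DIAG", "MED"]
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="micro")
print(f"micro P={p:.2f} R={r:.2f} F1={f1:.2f}")  # all 0.67 here
```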
Named Entity Recognition
| Model | BRONCO150 Prec. | BRONCO150 Rec. | BRONCO150 F1 | CARDIO:DE Prec. | CARDIO:DE Rec. | CARDIO:DE F1 | GGPONC Prec. | GGPONC Rec. | GGPONC F1 |
|---|---|---|---|---|---|---|---|---|---|
| ChristBERT | 81.42 | 81.77 | 81.87 | 85.58 | 89.65 | 87.57 | 75.65 | 79.83 | 77.69 |
| ChristBERT_scratch | 81.87 | 82.32 | 82.09 | 88.38 | 89.89 | 89.13 | 76.54 | 77.56 | 77.05 |
| ChristBERT_BPE | 85.71 | 83.78 | 84.74 | 89.50 | 91.31 | 90.40 | 76.59 | 77.42 | 77.00 |
| medBERT.de | 78.67 | 79.58 | 79.12 | 87.66 | 90.02 | 88.83 | 73.89 | 75.78 | 74.73 |
| BioGottBERT | 76.96 | 78.45 | 77.70 | 88.37 | 90.74 | 89.54 | 75.24 | 75.40 | 75.32 |
| GeistBERT | 75.65 | 79.83 | 77.69 | 85.58 | 89.65 | 87.57 | 74.57 | 75.36 | 74.96 |
| GeBERTa | 78.67 | 79.58 | 79.12 | 90.51 | 90.23 | 90.37 | 75.96 | 76.93 | 76.45 |
Text Classification
| Model | CLEF Prec. | CLEF Rec. | CLEF F1 | JSynCC Prec. | JSynCC Rec. | JSynCC F1 |
|---|---|---|---|---|---|---|
| ChristBERT | 78.12 | 75.34 | 76.03 | 89.01 | 100 | 94.19 |
| ChristBERT_scratch | 93.68 | 85.17 | 89.22 | 91.86 | 97.53 | 94.61 |
| ChristBERT_BPE | 88.22 | 88.35 | 88.28 | 89.53 | 95.06 | 92.22 |
| medBERT.de | 89.21 | 87.59 | 88.40 | 91.25 | 90.12 | 90.68 |
| BioGottBERT | 88.30 | 87.90 | 88.10 | 88.89 | 98.77 | 93.57 |
| GeistBERT | 90.43 | 72.92 | 80.74 | 92.59 | 92.59 | 92.59 |
| GeBERTa | 88.91 | 89.71 | 89.31 | 92.59 | 92.59 | 92.59 |
How to Use
```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and encoder weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("ChristBERT/ChristBERT_base")
model = AutoModel.from_pretrained("ChristBERT/ChristBERT_base")

# Encode a German clinical sentence: "The patient suffers from diabetes mellitus."
inputs = tokenizer("Der Patient leidet an Diabetes mellitus.", return_tensors="pt")
outputs = model(**inputs)  # outputs.last_hidden_state has shape (1, seq_len, 768)
```
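Because pretraining used whole word masking, the checkpoint can also be exercised through a fill-mask pipeline; whether the published weights retain the MLM head is an assumption here.

```python
from transformers import pipeline

# Masked-token demo; "<mask>" is the RoBERTa-style mask token of the BPE
# tokenizer. Assumes the uploaded checkpoint includes the MLM head.
fill = pipeline("fill-mask", model="ChristBERT/ChristBERT_base")
for pred in fill("Der Patient leidet an Diabetes <mask>."):
    print(pred["token_str"], round(pred["score"], 3))
```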
Limitations
- Focused on the German biomedical domain; may not generalize well to other domains
- Trained on publicly available or de-identified data; not suitable for sensitive clinical decisions
Terms of Use
By downloading and using any of the ChristBERT models from the Hugging Face Hub, you agree to abide by the following terms and conditions:
Purpose and Scope: All of the ChristBERT models are intended for research and informational purposes only and must not be used as the sole basis for making medical decisions or diagnosing patients. The models should be used as a supplementary tool alongside professional medical advice and clinical judgment.
Proper Usage: Users agree to use one of the ChristBERT models in a responsible manner, complying with all applicable laws, regulations, and ethical guidelines. The model must not be used for any unlawful, harmful, or malicious purposes, nor for clinical decision-making or patient treatment.
Data Privacy and Security: Users are responsible for ensuring the privacy and security of any sensitive or confidential data processed using one of the ChristBERT models. Personally identifiable information (PII) should be anonymized before being processed by the model, and users must implement appropriate measures to protect data privacy.
Prohibited Activities: Users are strictly prohibited from attempting to perform adversarial attacks, information retrieval, or any other actions that may compromise the security and integrity of any of the ChristBERT models. Violators may face legal consequences and the retraction of the model's publication.
By downloading and using one of the ChristBERT models, you confirm that you have read, understood, and agree to abide by these terms of use.
Legal Disclaimer:
By using one of the ChristBERT models, you agree not to engage in any attempts to perform adversarial attacks or information retrieval from the model. Such activities are strictly prohibited and constitute a violation of the terms of use. Violators may face legal consequences, and any discovered violations may result in the immediate retraction of the model's publication. By continuing to use one of the ChristBERT models, you acknowledge and accept the responsibility to adhere to these terms and conditions.
Citation
@misc{christbert,
title = {The Word and the Way: Strategies for Domain-Specific {BERT} Pre-Training in German Medical NLP},
author = {Henry He and Johann Frei and Raphael Scheible-Schmitt},
shorttitle= {The Word and the Way},
year = {2025},
month = sep,
publisher = {Research Square},
doi = {10.21203/rs.3.rs-7332811/v1},
url = {https://www.researchsquare.com/article/rs-7332811/v1},
urldate = {2025-09-23},
note = {ISSN: 2693-5015}
}
License
MIT
Base model: TUM/GottBERT_filtered_base_best