
ChristBERT

ChristBERT (Clinical- and Healthcare-Related Issues and Subjects Tuned BERT) is a family of domain-adapted German biomedical RoBERTa models. It was developed to address the lack of high-quality German-language models for clinical and healthcare NLP tasks.

Model Variants

  • ChristBERT: Continued pretraining from GeistBERT on biomedical data.
  • ChristBERT_scratch: Trained from scratch on biomedical data using GeistBERT's vocabulary.
  • ChristBERT_BPE: Trained from scratch on biomedical data using a new vocabulary trained on the biomedical domain (byte-level BPE, 52k tokens).

Model Architecture

All ChristBERT variants are based on the RoBERTa base architecture:

  • 12 transformer layers
  • Hidden size: 768
  • 12 attention heads
  • ~125M parameters
  • Sequence length: 512 tokens
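These dimensions can be read off the published checkpoint's configuration; a minimal sketch, assuming the standard RoBERTa config layout:

```python
from transformers import AutoConfig

# Load only the configuration (no weights) and check the stated dimensions.
config = AutoConfig.from_pretrained("ChristBERT/ChristBERT_base")
print(config.num_hidden_layers)        # 12
print(config.hidden_size)              # 768
print(config.num_attention_heads)      # 12
print(config.max_position_embeddings)  # 514 in RoBERTa configs (512 usable tokens)
```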

Pretraining Data

ChristBERT was trained on a 13.5 GB biomedical corpus consisting of:

  • Hpsmedia medical journals
  • Springer Nature biomedical publications
  • PubMed Central abstracts and full texts
  • German medical PhD theses
  • German medical Wikipedia
  • German-translated MIMIC-IV notes (translated with LLaMA 3.1 8B)
  • Crawled German health web content (filtered via a fine-tuned classifier; a sketch follows the table below)

See table below:

| Dataset | Documents | Words | Size (MB) |
|---|---|---|---|
| Hpsmedia | 277,357 | 405M | 3,117 |
| Springer Nature | 258,000 | 259M | 1,984 |
| PubMed Central | 90,273 | 220M | 1,609 |
| PhD Theses | 7,486 | 90M | 646 |
| Medical Wikipedia | 75,585 | 49M | 362 |
| MIMIC-IV Notes | 330,486 | 734M | 5,310 |
| Web Crawl | 93,642 | 69M | 512 |
| Total | 1.1M+ | ~1.8B | ~13,540 |
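The fine-tuned classifier used to filter the web crawl is not released. Purely as an illustration, document-level filtering with a generic transformers text-classification pipeline could look like the sketch below; the checkpoint path, label name, and threshold are all placeholders:

```python
from transformers import pipeline

# Placeholder checkpoint: the paper's health-domain classifier is not public.
classifier = pipeline("text-classification", model="path/to/health-domain-classifier")

def keep_document(text: str) -> bool:
    """Keep a crawled page only if the classifier is confident it is health-related."""
    result = classifier(text, truncation=True)[0]  # {"label": ..., "score": ...}
    return result["label"] == "HEALTH" and result["score"] >= 0.9  # assumed label/threshold
```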

Pretraining Setup

  • Framework: Fairseq
  • Objective: Masked Language Modeling (Whole Word Masking)
  • Optimizer: AdamW
  • Learning Rate Schedule: Linear warmup (10k steps) + polynomial decay
  • Max LR:
    • 7e-4 (ChristBERT)
    • 6e-4 (ChristBERT_scratch & ChristBERT_BPE)
  • Batch Size: 8,192 tokens
  • Sequence Length: 512
  • Steps: 100,000
  • Hardware: 4× NVIDIA A100 or 2× NVIDIA H100
  • Total compute time: ~21.7 GPU days
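Concretely, the stated schedule (linear warmup over the first 10k steps, then polynomial decay over the remaining updates) can be sketched as below; the decay power and final learning rate are assumptions, since the card does not specify them:

```python
def lr_at_step(step: int, max_lr: float = 7e-4, warmup: int = 10_000,
               total: int = 100_000, end_lr: float = 0.0, power: float = 1.0) -> float:
    """Linear warmup to max_lr, then polynomial decay toward end_lr.

    end_lr and power are assumptions; the card only names the schedule shape.
    """
    if step < warmup:
        return max_lr * step / warmup  # linear warmup from 0
    progress = (step - warmup) / (total - warmup)
    return (max_lr - end_lr) * (1.0 - progress) ** power + end_lr
```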

Tokenizer

  • Type: Byte-level BPE
  • Vocabulary size: 52,000
  • Compatible with RoBERTa/GPT-2 tokenizer conventions
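For reference, a 52k byte-level BPE vocabulary of this kind can be trained with the Hugging Face tokenizers library. A minimal sketch, with corpus.txt standing in for the (not released) pretraining corpus and RoBERTa's standard special tokens assumed:

```python
from tokenizers import ByteLevelBPETokenizer

# Train a RoBERTa-style byte-level BPE vocabulary on a text corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],               # placeholder path
    vocab_size=52_000,
    min_frequency=2,                    # assumption; not stated in the card
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("christbert_bpe")  # writes vocab.json and merges.txt
```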

Intended Use

  • Named Entity Recognition (NER)
  • Clinical and biomedical text classification
  • German medical text mining and information retrieval
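For fine-tuning on these tasks, the checkpoint loads with the standard transformers task heads. A minimal sketch; the num_labels values are placeholders, not numbers from the paper:

```python
from transformers import (AutoModelForSequenceClassification,
                          AutoModelForTokenClassification)

# Token-level head for NER (e.g. BIO tags); num_labels is a placeholder.
ner_model = AutoModelForTokenClassification.from_pretrained(
    "ChristBERT/ChristBERT_base", num_labels=7)

# Sequence-level head for text classification; num_labels is a placeholder.
clf_model = AutoModelForSequenceClassification.from_pretrained(
    "ChristBERT/ChristBERT_base", num_labels=2)
```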

Evaluation

ChristBERT was evaluated on:

  • 3 medical NER benchmarks
  • 2 clinical text classification benchmarks

Metrics: micro-averaged precision, recall, and F1
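Micro-averaging pools the per-class counts before computing the metric once, rather than averaging per-class scores. A minimal sketch with illustrative counts (not results from the paper):

```python
def micro_f1(tp: int, fp: int, fn: int) -> float:
    """Micro-averaged F1: pool TP/FP/FN across all classes, then compute once."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative pooled counts over all entity types
print(f"{micro_f1(tp=820, fp=180, fn=160) * 100:.2f}")  # 82.83
```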

✅ Outperformed existing German medical and general-purpose language models on 4 of 5 tasks
📈 Especially strong results from continued pretraining on general medical text

Named Entity Recognition

| Model | BRONCO150 Prec. | BRONCO150 Rec. | BRONCO150 F1 | CARDIO:DE Prec. | CARDIO:DE Rec. | CARDIO:DE F1 | GGPONC Prec. | GGPONC Rec. | GGPONC F1 |
|---|---|---|---|---|---|---|---|---|---|
| ChristBERT | 81.42 | 81.77 | 81.87 | 85.58 | 89.65 | 87.57 | 75.65 | 79.83 | 77.69 |
| ChristBERT_scratch | 81.87 | 82.32 | 82.09 | 88.38 | 89.89 | 89.13 | 76.54 | 77.56 | 77.05 |
| ChristBERT_BPE | 85.71 | 83.78 | 84.74 | 89.50 | 91.31 | 90.40 | 76.59 | 77.42 | 77.00 |
| medBERT.de | 78.67 | 79.58 | 79.12 | 87.66 | 90.02 | 88.83 | 73.89 | 75.78 | 74.73 |
| BioGottBERT | 76.96 | 78.45 | 77.70 | 88.37 | 90.74 | 89.54 | 75.24 | 75.40 | 75.32 |
| GeistBERT | 75.65 | 79.83 | 77.69 | 85.58 | 89.65 | 87.57 | 74.57 | 75.36 | 74.96 |
| GeBERTa | 78.67 | 79.58 | 79.12 | 90.51 | 90.23 | 90.37 | 75.96 | 76.93 | 76.45 |

Text Classification

| Model | CLEF Prec. | CLEF Rec. | CLEF F1 | JSynCC Prec. | JSynCC Rec. | JSynCC F1 |
|---|---|---|---|---|---|---|
| ChristBERT | 78.12 | 75.34 | 76.03 | 89.01 | 100.00 | 94.19 |
| ChristBERT_scratch | 93.68 | 85.17 | 89.22 | 91.86 | 97.53 | 94.61 |
| ChristBERT_BPE | 88.22 | 88.35 | 88.28 | 89.53 | 95.06 | 92.22 |
| medBERT.de | 89.21 | 87.59 | 88.40 | 91.25 | 90.12 | 90.68 |
| BioGottBERT | 88.30 | 87.90 | 88.10 | 88.89 | 98.77 | 93.57 |
| GeistBERT | 90.43 | 72.92 | 80.74 | 92.59 | 92.59 | 92.59 |
| GeBERTa | 88.91 | 89.71 | 89.31 | 92.59 | 92.59 | 92.59 |

How to Use

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ChristBERT/ChristBERT_base")
model = AutoModel.from_pretrained("ChristBERT/ChristBERT_base")

# "The patient suffers from diabetes mellitus."
inputs = tokenizer("Der Patient leidet an Diabetes mellitus.", return_tensors="pt")
outputs = model(**inputs)  # outputs.last_hidden_state: (1, seq_len, 768)
```
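The snippet returns one contextual embedding per token. A common way to reduce these to a single sentence vector, shown here as an illustrative choice rather than anything the card prescribes, is mask-aware mean pooling:

```python
import torch

# Mask-aware mean pooling: average token embeddings, ignoring padding positions.
with torch.no_grad():
    outputs = model(**inputs)
mask = inputs["attention_mask"].unsqueeze(-1)           # (1, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)  # (1, 768)
sentence_embedding = summed / mask.sum(dim=1)           # (1, 768)
```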

Limitations

  • Focused on the German biomedical domain; may not generalize well to other domains
  • Trained on publicly available or de-identified data; not suitable for sensitive clinical decisions

Terms of Use

By downloading and using any of the ChristBERT models from the Hugging Face Hub, you agree to abide by the following terms and conditions:

Purpose and Scope: All of the ChristBERT models are intended for research and informational purposes only and must not be used as the sole basis for making medical decisions or diagnosing patients. The models should be used as a supplementary tool alongside professional medical advice and clinical judgment.

Proper Usage: Users agree to use the ChristBERT models in a responsible manner, complying with all applicable laws, regulations, and ethical guidelines. The models must not be used for any unlawful, harmful, or malicious purposes, and must not be used in clinical decision-making or patient treatment.

Data Privacy and Security: Users are responsible for ensuring the privacy and security of any sensitive or confidential data processed using one of the ChristBERT models. Personally identifiable information (PII) should be anonymized before being processed by the model, and users must implement appropriate measures to protect data privacy.

Prohibited Activities: Users are strictly prohibited from attempting to perform adversarial attacks, information retrieval, or any other actions that may compromise the security and integrity of any of the ChristBERT models. Violators may face legal consequences and the retraction of the model's publication.

By downloading and using one of the ChristBERT models, you confirm that you have read, understood, and agree to abide by these terms of use.

Legal Disclaimer:

By using one of the ChristBERT models, you agree not to engage in any attempts to perform adversarial attacks or information retrieval from the model. Such activities are strictly prohibited and constitute a violation of the terms of use. Violators may face legal consequences, and any discovered violations may result in the immediate retraction of the model's publication. By continuing to use one of the ChristBERT models, you acknowledge and accept the responsibility to adhere to these terms and conditions.

Citation

```bibtex
@misc{christbert,
  title      = {The Word and the Way: Strategies for Domain-Specific {BERT} Pre-Training in German Medical NLP},
  author     = {Henry He and Johann Frei and Raphael Scheible-Schmitt},
  shorttitle = {The Word and the Way},
  year       = {2025},
  month      = sep,
  publisher  = {Research Square},
  doi        = {10.21203/rs.3.rs-7332811/v1},
  url        = {https://www.researchsquare.com/article/rs-7332811/v1},
  urldate    = {2025-09-23},
  note       = {ISSN: 2693-5015}
}
```

License

MIT
