# TransBERT-bio-fr
TransBERT-bio-fr is a French biomedical language model pretrained exclusively on synthetically translated PubMed abstracts, using the TransCorpus framework. This model demonstrates that high-quality domain-specific language models can be built for low-resource languages using only machine-translated data.
## Model Details
- Architecture: BERT-base (12 layers, 768 hidden, 12 heads, 110M parameters)
- Tokenizer: SentencePiece unigram, 32k vocab, trained on synthetic biomedical French
- Training Data: 36.4 GB corpus of 22M PubMed abstracts translated from English into French, available here: TransCorpus-bio-fr 🤗
- Translation Model: M2M-100 (1.2B) using TransCorpus Toolkit
- Domain: Biomedical, clinical, life sciences (French)
## Motivation
The lack of large-scale, high-quality biomedical corpora in French has historically limited the development of domain-specific language models. TransBERT-bio-fr addresses this gap by leveraging recent advances in neural machine translation to generate a massive, high-quality synthetic corpus, making robust French biomedical NLP possible.
## How to Use

Load the model and tokenizer:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jknafou/TransBERT-bio-fr")
model = AutoModel.from_pretrained("jknafou/TransBERT-bio-fr")
```
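`AutoModel` returns token-level hidden states, so a sentence vector has to be pooled from them. Below is a minimal NumPy sketch of attention-mask mean pooling, the usual choice for BERT-style encoders; `mean_pool` is a hypothetical helper, and the arrays are dummy stand-ins with the model's hidden size (768) so the example runs without downloading the model:

```python
import numpy as np

# Dummy stand-ins for model outputs (no model download needed):
# last_hidden_state: (batch, seq_len, hidden), attention_mask: (batch, seq_len)
last_hidden_state = np.ones((1, 4, 768))
attention_mask = np.array([[1, 1, 1, 0]])  # last position is padding

def mean_pool(hidden, mask):
    """Average token vectors, ignoring padded positions."""
    mask = mask[..., None]           # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)
    counts = mask.sum(axis=1)        # number of real tokens per sentence
    return summed / counts

embedding = mean_pool(last_hidden_state, attention_mask)
print(embedding.shape)  # (1, 768)
```

With the real model, `last_hidden_state` would come from `model(**tokenizer(text, return_tensors="pt")).last_hidden_state`.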
Run the fill-mask task:

```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="jknafou/TransBERT-bio-fr",
    tokenizer="jknafou/TransBERT-bio-fr",
)
results = fill_mask("L’insuline est une hormone produite par le <mask> et régule la glycémie.")
# [{'score': 0.6606941223144531,
#   'token': 486,
#   'token_str': 'foie',
#   'sequence': 'L’insuline est une hormone produite par le foie et régule la glycémie.'},
#  {'score': 0.172934889793396,
#   'token': 2642,
#   'token_str': 'pancréas',
#   'sequence': 'L’insuline est une hormone produite par le pancréas et régule la glycémie.'},
#  {'score': 0.08486421406269073,
#   'token': 488,
#   'token_str': 'cerveau',
#   'sequence': 'L’insuline est une hormone produite par le cerveau et régule la glycémie.'},
#  {'score': 0.017183693125844002,
#   'token': 2092,
#   'token_str': 'cœur',
#   'sequence': 'L’insuline est une hormone produite par le cœur et régule la glycémie.'},
#  {'score': 0.009480085223913193,
#   'token': 712,
#   'token_str': 'corps',
#   'sequence': 'L’insuline est une hormone produite par le corps et régule la glycémie.'}]
```
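The pipeline returns a list of candidate dicts sorted by probability. A minimal sketch of keeping only confident candidates, using the scores from the example output above (copied here, rounded, so the snippet runs offline; `confident_predictions` and the 0.1 threshold are illustrative choices, not part of the model's API):

```python
# Example predictions, copied (rounded) from the pipeline output above.
results = [
    {"token_str": "foie", "score": 0.6607},
    {"token_str": "pancréas", "score": 0.1729},
    {"token_str": "cerveau", "score": 0.0849},
    {"token_str": "cœur", "score": 0.0172},
    {"token_str": "corps", "score": 0.0095},
]

def confident_predictions(results, threshold=0.1):
    """Keep only candidates whose probability clears the threshold."""
    return [r["token_str"] for r in results if r["score"] >= threshold]

print(confident_predictions(results))  # ['foie', 'pancréas']
```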
## Key Results
TransBERT-bio-fr sets a new state-of-the-art (SOTA) on the French biomedical benchmark DrBenchmark, outperforming both general-domain (CamemBERT) and previous domain-specific (DrBERT) models on classification, NER, POS, and STS tasks.
| Task | CamemBERT | DrBERT | TransBERT |
|---|---|---|---|
| Classification (F1) | 74.17 | 73.73 | 75.71* |
| NER (F1) | 81.55 | 80.88 | 83.15* |
| POS (F1) | 98.29 | 98.18* | 98.31 |
| STS (R²) | 83.38 | 73.56* | 83.04 |
\*Statistically significant difference (Friedman and Nemenyi tests, p < 0.01).
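For a rough overall comparison, the four per-task scores in the table can be macro-averaged per model. This is an illustration only (the paper reports per-task significance, not a macro average, and F1 and R² are not strictly commensurable):

```python
# Scores copied from the table above (higher is better for all tasks).
scores = {
    "CamemBERT": [74.17, 81.55, 98.29, 83.38],
    "DrBERT":    [73.73, 80.88, 98.18, 73.56],
    "TransBERT": [75.71, 83.15, 98.31, 83.04],
}

macro = {model: sum(v) / len(v) for model, v in scores.items()}
best = max(macro, key=macro.get)
print(best, round(macro[best], 2))  # TransBERT 85.05
```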
## Paper to be submitted to EMNLP 2025

TransCorpus enables the training of state-of-the-art language models through synthetic translation; TransBERT, pretrained on a corpus translated with this toolkit, achieves the results reported above. A paper detailing these results will be submitted to EMNLP 2025. 📝 Current Paper Version
## Why Synthetic Translation?
- Scalable: Enables pretraining on gigabytes of text for any language with a strong MT system.
- Effective: Outperforms models trained on native data in key biomedical tasks.
- Accessible: Makes high-quality domain-specific PLMs possible for low-resource languages.
## 🔗 Related Resources
This model was pretrained on large-scale synthetic French biomedical data generated using TransCorpus, an open-source toolkit for scalable, parallel translation and preprocessing. For source code, data recipes, and reproducible pipelines, visit the TransCorpus GitHub repository. If you use this model, please cite:
```bibtex
@misc{knafou-transbert,
  author = {Knafou, Julien and Mottin, Luc and Mottaz, Ana\"{i}s and Flament, Alexandre and Ruch, Patrick},
  title  = {TransBERT: A Framework for Synthetic Translation in Domain-Specific Language Modeling},
  year   = {2025},
  note   = {Submitted to EMNLP 2025. Anonymous ACL submission available at the URL below.},
  url    = {https://transbert.s3.text-analytics.ch/TransBERT.pdf},
}
```