Model Card for impresso-project/ner-newsagency-bert-multilingual
This model is designed to detect mentions of news agencies in historical newspaper articles in French and German. It was developed as part of the Impresso project, a multidisciplinary initiative aiming to enable exploration of large-scale historical media archives.
The model is fine-tuned from dbmdz/bert-base-historic-multilingual-cased and was trained on a custom annotated dataset of over 1,500 historical articles (1840–2000) from the Swiss and Luxembourgish press.
Model Details
Description
- Developed by: DHLAB, EPFL and the Impresso team. Impresso is an interdisciplinary project focused on historical media analysis across languages, time, and modalities. It is funded by the Swiss National Science Foundation (CRSII5_173719, CRSII5_213585) and the Luxembourg National Research Fund (grant No. 17498891).
- Model type: BERT-based token classification model for named entity recognition
- Languages: French and German
- License: AGPL v3+
- Finetuned from: dbmdz/bert-base-historic-multilingual-cased
- Training data: Zenodo dataset of agency mentions
Model Architecture
The model architecture consists of the following component:
- A pre-trained BERT encoder (multilingual historic BERT) as the base for token classification.
Entity Types Supported
The model predicts whether a given token span corresponds to a news agency mention. The following tags are used:
Recognized News Agencies
Tag | Description |
---|---|
AFP | Agence France Presse (A.F.P.) |
ANP | Algemeen Nederlands Persbureau |
ANSA | Agenzia Nazionale Stampa Associata |
AP | Associated Press (Assoc. Press) |
APA | Austria Press Agentur |
ATS-SDA | Agence télégraphique suisse / Schweizerische Depeschenagentur (ATS, SDA) |
BTA | Bulgarska Telegrafitscheka Agentzia (Agence Bulgare) |
Belga | Agence Belga SA |
CTK | Czechoslavenska Tiskova Kancelar (Ceteka) |
DDP-DAPD | Deutscher Depeschendienst / Deutscher Auslands-Depeschendienst |
DNB | Deutsches Nachrichtenbüro GmbH (D.N.B.) |
DPA | Deutsche Presse Agentur |
Domei | Domei Tsushin (Japan) |
Europapress | Europapress (Europapreß, Europapr.) |
Extel | Exchange Telegraph Co. Ltd. (Agence Extel) |
Havas | Havas (Agence Havas) |
Interfax | Interfax News Agency |
PAP | Polska Agencja Prasowa |
Reuters | Reuters (Reuter, Reutermeldung, Reuter’sche Bureau) |
SPK-SMP | Schweizer Mittelpresse / Schweizerische Politische Korrespondenz (SPK, SMP) |
Stefani | Agenzia Stefani (Agence Stefani) |
TANJUG | Telegrafska Agencija nova Jugoslavija |
TASS | Telegrafnoie Agenstvo sovietskavo Soyusa (ITAR-TASS, Taß, etc.) |
TT | Tidningarnas Telegrambyra (Sweden) |
Telunion | Telegraphen-Union (TU) |
UP-UPI | United Press / United Press International (UP, UPI) |
Wolff | Wolffs Telegraphisches Bureau (Wolffagentur, etc.) |
ag | Generic agency mention (e.g. “ag.”, “Agence”) |
pers.ind.articleauthor | Author of the newspaper article |
unk | Unknown agency not in tagset |
How to Use
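from transformers import pipeline
# "newsagency-ner" is not a built-in transformers task, so trust_remote_code=True is needed to load the pipeline code from the model repository.
nlp = pipeline("newsagency-ner", model="impresso-project/ner-newsagency-bert-multilingual", trust_remote_code=True)
# Run the pipeline on a sentence that mentions an agency.
nlp("La dépêche vient de (Reuter), diffusée hier.")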
Example Output
{
"type": "org.ent.pressagency.Reuters",
"confidence": 98.94,
"index": 12,
"surface": "Reuter",
"start": 43,
"end": 49
}
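The output above can be post-processed with plain Python. The following is a minimal sketch, assuming the pipeline returns a list of dictionaries shaped like the example, that `confidence` is on a 0–100 scale, and that agency labels share the `org.ent.pressagency.` prefix seen in the example. The helper name `agencies_above` and the threshold of 90 are illustrative choices, not part of the model's API.

```python
from typing import Dict, List, Tuple

# Prefix taken from the example output; other label families
# (e.g. article authors) are assumed to use a different prefix.
TYPE_PREFIX = "org.ent.pressagency."


def agencies_above(entities: List[Dict], min_confidence: float = 90.0) -> List[Tuple[str, str]]:
    """Return (agency tag, surface form) pairs for confident agency predictions."""
    results = []
    for ent in entities:
        label = ent["type"]
        if not label.startswith(TYPE_PREFIX):
            continue  # skip non-agency predictions
        if ent["confidence"] < min_confidence:
            continue  # skip low-confidence predictions
        results.append((label[len(TYPE_PREFIX):], ent["surface"]))
    return results


# The dictionary below is copied from the example output in this card.
example = [{
    "type": "org.ent.pressagency.Reuters",
    "confidence": 98.94,
    "index": 12,
    "surface": "Reuter",
    "start": 43,
    "end": 49,
}]
print(agencies_above(example))  # [('Reuters', 'Reuter')]
```

The stripped tag (`Reuters`) matches the first column of the Recognized News Agencies table, which makes it convenient for grouping mentions by agency.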
Training Details
Training Data
The model was trained on a custom dataset of 1,530 documents (1,133 FR / 397 DE) from the Impresso HIPE-2020 dataset. The training data was manually annotated to identify mentions of news agencies in historical newspaper articles.
Dataset Characteristics
Lang | Docs | Tokens | Mentions | % Noisy |
---|---|---|---|---|
fr | 1,133 | 759k | 1,399 | 5% |
de | 397 | 299k | 577 | 9% |
- Annotation tool: INCEpTION
- Tags used: specific agency names, ag, pers.ind.articleauthor, unk
- Tag format: span-level BIO-style
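To illustrate what BIO-style tagging means in practice, the sketch below collapses a sequence of token-level tags into spans. It assumes labels of the form `B-<tag>` / `I-<tag>` / `O`; the exact label strings stored in the model configuration may differ (for instance, they may carry a longer type prefix), and `bio_to_spans` is a hypothetical helper written for this example.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Collapse token-level BIO tags into (start, end_exclusive, label) spans."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # Split on the first hyphen only, so tags such as "ATS-SDA" stay intact.
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        if label is not None and not continues:
            spans.append((start, i, label))  # close the open span
            start, label = None, None
        # Open a span on "B-", or leniently on an "I-" that has no open span.
        if prefix == "B" or (prefix == "I" and label is None):
            start, label = i, name
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


tags = ["O", "O", "O", "B-Reuters", "O", "B-ATS-SDA", "I-ATS-SDA"]
print(bio_to_spans(tags))  # [(3, 4, 'Reuters'), (5, 7, 'ATS-SDA')]
```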
Training Procedure
Training Hyperparameters
- Pretrained base model: dbmdz/bert-base-historic-multilingual-cased
- Finetuning epochs: 3
- Max sequence length: 512 tokens
- Languages: French and German
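Because the maximum sequence length is 512 tokens, long articles may need to be split before inference. This card does not say how the pipeline handles over-long input, so the following is only a sketch of one common workaround: overlapping character windows, with entity offsets shifted back to article coordinates. The function names, the 1,000-character window and the 100-character overlap are assumptions, and character count is only a rough proxy for subword token count.

```python
from typing import Dict, List, Tuple


def split_into_windows(text: str, max_chars: int = 1000, overlap: int = 100) -> List[Tuple[int, str]]:
    """Split text into overlapping windows, returned as (offset, chunk) pairs."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    windows = []
    start = 0
    step = max_chars - overlap
    while start < len(text):
        windows.append((start, text[start:start + max_chars]))
        if start + max_chars >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return windows


def shift_entities(entities: List[Dict], offset: int) -> List[Dict]:
    """Map window-relative start/end offsets back to positions in the full text."""
    return [{**e, "start": e["start"] + offset, "end": e["end"] + offset} for e in entities]


article = "a" * 2500
print([offset for offset, _ in split_into_windows(article)])  # [0, 900, 1800]
```

Mentions that fall inside an overlap region can be predicted twice, so results from adjacent windows would still need de-duplication by their shifted `start` and `end` values.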
Evaluation
Model evaluation was performed on a manually annotated dataset with 1,530 documents (1,133 FR / 397 DE). Results are summarised below:
Language | Precision | Recall | F1-score |
---|---|---|---|
French | 0.92 | 0.88 | 0.90 |
German | 0.89 | 0.83 | 0.86 |
Evaluation methodology and full results are detailed in the thesis: Marxen, 2023.
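As a quick sanity check, the reported F1-scores agree with the harmonic mean of the reported precision and recall. The snippet below assumes the standard definition F1 = 2PR / (P + R) applied per language; the exact matching and averaging scheme is described in the thesis and is not reproduced here.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the evaluation table.
reported = {
    "French": (0.92, 0.88, 0.90),
    "German": (0.89, 0.83, 0.86),
}

for language, (p, r, f1) in reported.items():
    # The recomputed value rounds to the reported one for both languages.
    print(language, round(f1_score(p, r), 2), f1)
```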
Speeds, Sizes, Times
- Model size: ~500MB
- Training time: ~half an hour on 1 NVIDIA GPU (16GB)
Environmental Impact
- Hardware: 1x NVIDIA GPU (16GB)
- Training time: ~1 hour
- Provider: Local HPC (Switzerland)
- Estimated CO₂ Emissions: ~0.022 kg CO₂eq (calculated using MLCO2 calculator)
Limitations and Risks
- Does not detect news agency content without explicit mention.
- May miss OCR-degraded mentions despite robustness strategies.
- Only covers French and German.
Model Sources
- Repository: https://github.com/impresso/newsagency-classification
- Paper: Marxen, 2023
- Demo: Impresso project
Citation
@misc{marxen_newsagency_2023,
title = {Where Did the News come from? Detection of News Agency Releases in Historical Newspapers},
author = {Marxen, Lea and Ehrmann, Maud and Boros, Emanuela},
year = {2023},
url = {https://github.com/impresso/newsagency-classification/},
note = {Master Thesis}
}
Contact
- Website: https://impresso-project.ch