Model Card for `impresso-project/ner-newsagency-bert-multilingual`

This model is designed to detect mentions of news agencies in historical newspaper articles in French and German. It was developed as part of the Impresso project, a multidisciplinary initiative aiming to enable exploration of large-scale historical media archives.

The model is fine-tuned from dbmdz/bert-base-historic-multilingual-cased on a custom annotated dataset of over 1,500 historical articles (1840–2000) from the Swiss and Luxembourgish press.

Model Details

Description

Developed by: DHLAB, EPFL and the Impresso team. Impresso is an interdisciplinary project focused on historical media analysis across languages, time, and modalities. It is funded by the Swiss National Science Foundation (CRSII5_173719, CRSII5_213585) and the Luxembourg National Research Fund (grant No. 17498891).
Model type: BERT-based token classification model for named entity recognition
Languages: French and German
License: AGPL v3+
Finetuned from: dbmdz/bert-base-historic-multilingual-cased
Training data: Zenodo dataset of agency mentions

Model Architecture

The architecture has a single component: a pre-trained BERT encoder (multilingual historic BERT) used as the base for token classification.

Entity Types Supported

The model predicts whether a given token span corresponds to a news agency mention. The following tags are used:

Recognized News Agencies

Tag	Description
AFP	Agence France Presse (A.F.P.)
ANP	Algemeen Nederlands Persbureau
ANSA	Agenzia Nazionale Stampa Associata
AP	Associated Press (Assoc. Press)
APA	Austria Press Agentur
ATS-SDA	Agence télégraphique suisse / Schweizerische Depeschenagentur (ATS, SDA)
BTA	Bulgarska Telegrafitscheka Agentzia (Agence Bulgare)
Belga	Agence Belga SA
CTK	Czechoslavenska Tiskova Kancelar (Ceteka)
DDP-DAPD	Deutscher Depeschendienst / Deutscher Auslands-Depeschendienst
DNB	Deutsches Nachrichtenbüro GmbH (D.N.B.)
DPA	Deutsche Presse Agentur
Domei	Domei Tsushin (Japan)
Europapress	Europapress (Europapreß, Europapr.)
Extel	Exchange Telegraph Co. Ltd. (Agence Extel)
Havas	Havas (Agence Havas)
Interfax	Interfax News Agency
PAP	Polska Agencja Prasowa
Reuters	Reuters (Reuter, Reutermeldung, Reuter’sche Bureau)
SPK-SMP	Schweizer Mittelpresse / Schweizerische Politische Korrespondenz (SPK, SMP)
Stefani	Agenzia Stefani (Agence Stefani)
TANJUG	Telegrafska Agencija nova Jugoslavija
TASS	Telegrafnoie Agenstvo sovietskavo Soyusa (ITAR-TASS, Taß, etc.)
TT	Tidningarnas Telegrambyra (Sweden)
Telunion	Telegraphen-Union (TU)
UP-UPI	United Press / United Press International (UP, UPI)
Wolff	Wolffs Telegraphisches Bureau (Wolffagentur, etc.)
ag	Generic agency mention (e.g. “ag.”, “Agence”)
pers.ind.articleauthor	Author of the newspaper article
unk	Unknown agency not in tagset

How to Use

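from transformers import pipeline

# "newsagency-ner" is a custom pipeline shipped with the model repository,
# so trust_remote_code=True is needed to load it.
nlp = pipeline("newsagency-ner", model="impresso-project/ner-newsagency-bert-multilingual", trust_remote_code=True)
nlp("La dépêche vient de (Reuter), diffusée hier.")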

Example Output

[
  {
    'type': 'Reuters',
    'confidence': 0.99,
    'index': 5,
    'surface': 'Reuter',
    'start': 21, 'end': 27
  }
]
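The fields of each prediction can be consumed in plain Python. The sketch below is illustrative only: it assumes the pipeline returns a list of dictionaries shaped like the example output, and that `start` and `end` are character offsets into the input string (which matches the example, where characters 21 to 27 spell "Reuter"). The helper name `filter_mentions` and the 0.5 threshold are hypothetical and not part of the model's API.

```python
def filter_mentions(text, predictions, min_confidence=0.5):
    """Keep confident predictions whose offsets agree with their surface form.

    Hypothetical helper: assumes each prediction has the keys shown in the
    example output ('type', 'confidence', 'surface', 'start', 'end').
    """
    kept = []
    for pred in predictions:
        if pred["confidence"] < min_confidence:
            continue
        # Assumption: 'start'/'end' are character offsets into `text`.
        if text[pred["start"]:pred["end"]] != pred["surface"]:
            continue
        kept.append((pred["type"], pred["surface"]))
    return kept


text = "La dépêche vient de (Reuter), diffusée hier."
predictions = [
    {
        "type": "Reuters",
        "confidence": 0.99,
        "index": 5,
        "surface": "Reuter",
        "start": 21,
        "end": 27,
    }
]

mentions = filter_mentions(text, predictions)
print(mentions)
```

The offset check is a cheap sanity test when the input text has been normalised or re-encoded between prediction and display.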

Training Details

Training Data

The model was trained on a custom dataset of 1,530 documents (1,133 FR / 397 DE) from the Impresso HIPE-2020 dataset. The training data was manually annotated to identify mentions of news agencies in historical newspaper articles.

Dataset Characteristics

Lang	Docs	Tokens	Mentions	% Noisy
fr	1,133	759k	1,399	5%
de	397	299k	577	9%

Annotation tool: INCEpTION
Tags used: specific agency names, ag, pers.ind.articleauthor, unk
Tag format: span-level BIO-style
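To make the BIO-style format concrete, here is a minimal sketch of how token-level BIO labels can be grouped into spans. It is a generic decoder, not the project's own code, and the exact label strings (for example `B-Havas`, `I-Havas`) are an assumption about how the tagset is combined with the B/I prefixes.

```python
def bio_to_spans(tokens, labels):
    """Group BIO-labelled tokens into (entity_type, surface) spans.

    Generic decoder; label strings such as 'B-Havas' are assumed here.
    """
    spans = []
    current_tokens, current_type = [], None

    def close_span():
        # Flush the open span, if any, into the result list.
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            close_span()
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span,
            # closes the open span; the stray token itself is dropped.
            close_span()
            current_tokens, current_type = [], None
    close_span()
    return spans


tokens = ["(", "Agence", "Havas", ")", "Paris", ",", "Reuter"]
labels = ["O", "B-Havas", "I-Havas", "O", "O", "O", "B-Reuters"]
spans = bio_to_spans(tokens, labels)
print(spans)
```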

Training Procedure

Training Hyperparameters

Pretrained base model: dbmdz/bert-base-historic-multilingual-cased
Finetuning epochs: 3
Max sequence length: 512 tokens
Languages: French and German
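Articles longer than the 512-token limit need to be split before inference. Whether the custom pipeline already does this internally is not stated here, so the following is only a sketch of one common approach (overlapping windows) under that uncertainty. The function name `split_into_windows` and the `stride` value are hypothetical, and in practice special tokens such as `[CLS]` and `[SEP]` also count toward the limit.

```python
def split_into_windows(token_ids, max_length=512, stride=128):
    """Split a token sequence into overlapping windows.

    Hypothetical helper: consecutive windows share `stride` tokens so that
    a mention cut at one window boundary appears whole in the next window.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break
        start += step
    return windows


token_ids = list(range(1000))  # stand-in for real token ids
windows = split_into_windows(token_ids)
print([len(w) for w in windows])
```

Predictions from overlapping regions then have to be merged, for example by keeping the higher-confidence prediction for each character span.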

Evaluation

Model evaluation was performed on a manually annotated dataset with 1,530 documents (1,133 FR / 397 DE). Results are summarised below:

Language	Precision	Recall	F1-score
French	0.92	0.88	0.90
German	0.89	0.83	0.86
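As a quick consistency check, the F1-scores in the table can be recomputed from precision and recall using the standard harmonic-mean definition, F1 = 2PR / (P + R). The snippet assumes the reported figures follow that definition; the averaging scheme (micro or macro) is not stated here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) taken from the table above
reported = {
    "French": (0.92, 0.88, 0.90),
    "German": (0.89, 0.83, 0.86),
}

for language, (precision, recall, reported_f1) in reported.items():
    recomputed = round(f1_score(precision, recall), 2)
    print(language, recomputed, reported_f1)
```

Both recomputed values agree with the reported scores after rounding to two decimals.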

Evaluation methodology and full results are detailed in the thesis: Marxen, 2023.

Speeds, Sizes, Times

Model size: ~500MB
Training time: ~half hour on 1 NVIDIA GPU (16GB)

Environmental Impact

Hardware: 1x NVIDIA GPU (16GB)
Training time: ~1 hour
Provider: Local HPC (Switzerland)
Estimated CO₂ Emissions: ~0.022 kg CO₂eq (calculated using MLCO2 calculator)

Limitations and Risks

Does not detect news agency content without explicit mention.
May miss OCR-degraded mentions despite robustness strategies.
Only covers French and German.

Model Sources

Repository: https://github.com/impresso/newsagency-classification
Paper: Marxen, 2023
Demo: Impresso project

Citation

@misc{marxen_newsagency_2023,
  title = {Where Did the News come from? Detection of News Agency Releases in Historical Newspapers},
  author = {Marxen, Lea and Ehrmann, Maud and Boros, Emanuela},
  year = {2023},
  url = {https://github.com/impresso/newsagency-classification/},
  note = {Master Thesis}
}

Contact

Website: https://impresso-project.ch
