Model Card for impresso-project/ner-newsagency-bert-multilingual

This model is designed to detect mentions of news agencies in historical newspaper articles in French and German. It was developed as part of the Impresso project, a multidisciplinary initiative aiming to enable exploration of large-scale historical media archives.

The model is fine-tuned from dbmdz/bert-base-historic-multilingual-cased on a custom annotated dataset of over 1,500 historical articles (1840–2000) from the Swiss and Luxembourgish press.

Model Details

Description

Model Architecture

The model architecture consists of the following component:

  • A pre-trained BERT encoder (multilingual historic BERT) as the base for token classification.

Entity Types Supported

The model predicts whether a given token span corresponds to a news agency mention. The following tags are used:

Recognized News Agencies

Tag Description
AFP Agence France Presse (A.F.P.)
ANP Algemeen Nederlands Persbureau
ANSA Agenzia Nazionale Stampa Associata
AP Associated Press (Assoc. Press)
APA Austria Press Agentur
ATS-SDA Agence télégraphique suisse / Schweizerische Depeschenagentur (ATS, SDA)
BTA Bulgarska Telegrafitscheka Agentzia (Agence Bulgare)
Belga Agence Belga SA
CTK Ceskoslovenska Tiskova Kancelar (Ceteka)
DDP-DAPD Deutscher Depeschendienst / Deutscher Auslands-Depeschendienst
DNB Deutsches Nachrichtenbüro GmbH (D.N.B.)
DPA Deutsche Presse Agentur
Domei Domei Tsushin (Japan)
Europapress Europapress (Europapreß, Europapr.)
Extel Exchange Telegraph Co. Ltd. (Agence Extel)
Havas Havas (Agence Havas)
Interfax Interfax News Agency
PAP Polska Agencja Prasowa
Reuters Reuters (Reuter, Reutermeldung, Reuter’sche Bureau)
SPK-SMP Schweizer Mittelpresse / Schweizerische Politische Korrespondenz (SPK, SMP)
Stefani Agenzia Stefani (Agence Stefani)
TANJUG Telegrafska Agencija nova Jugoslavija
TASS Telegrafnoie Agenstvo sovietskavo Soyusa (ITAR-TASS, Taß, etc.)
TT Tidningarnas Telegrambyra (Sweden)
Telunion Telegraphen-Union (TU)
UP-UPI United Press / United Press International (UP, UPI)
Wolff Wolffs Telegraphisches Bureau (Wolffagentur, etc.)
ag Generic agency mention (e.g. “ag.”, “Agence”)
pers.ind.articleauthor Author of the newspaper article
unk Unknown agency not in tagset

How to Use

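from transformers import pipeline

# "newsagency-ner" is not a built-in transformers task, so the pipeline code is
# loaded from the model repository; this is why trust_remote_code=True is needed.
nlp = pipeline("newsagency-ner", model="impresso-project/ner-newsagency-bert-multilingual", trust_remote_code=True)

# Run the model on a French sentence that contains an agency mention.
print(nlp("La dépêche vient de (Reuter), diffusée hier."))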

Example Output

{
  "type": "org.ent.pressagency.Reuters",
  "confidence": 98.94,
  "index": 12,
  "surface": "Reuter",
  "start": 43,
  "end": 49
}
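
Predictions in this shape can be post-processed with plain Python. The sketch below is illustrative only: it assumes the pipeline returns a list of dictionaries with the fields shown above, that "confidence" is a percentage, and that agency types share the "org.ent.pressagency." prefix. The helper name agency_counts, the threshold value, and the second sample entry are hypothetical.

```python
from collections import Counter

# Assumed common prefix of agency types, taken from the example output above.
PREFIX = "org.ent.pressagency."


def agency_counts(entities, min_confidence=90.0):
    """Count agency tags among predictions at or above a confidence threshold."""
    counts = Counter()
    for ent in entities:
        # Skip non-agency types such as pers.ind.articleauthor.
        if not ent["type"].startswith(PREFIX):
            continue
        # Skip low-confidence predictions (confidence assumed to be in percent).
        if ent["confidence"] < min_confidence:
            continue
        counts[ent["type"][len(PREFIX):]] += 1
    return counts


# Hand-written sample predictions, not actual model output.
sample = [
    {"type": "org.ent.pressagency.Reuters", "confidence": 98.94,
     "surface": "Reuter", "start": 43, "end": 49},
    {"type": "pers.ind.articleauthor", "confidence": 75.0,
     "surface": "J. D.", "start": 60, "end": 65},
]

print(agency_counts(sample))
```

A threshold like this trades recall for precision; the appropriate value depends on the downstream use and is not specified by the model card.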

Training Details

Training Data

The model was trained on a custom dataset of 1,530 documents (1,133 FR / 397 DE) from the Impresso HIPE-2020 dataset. The training data was manually annotated to identify mentions of news agencies in historical newspaper articles.

Dataset Characteristics

Lang   Docs    Tokens   Mentions   % Noisy
fr     1,133   759k     1,399      5%
de     397     299k     577        9%
  • Annotation tool: INCEpTION
  • Tags used: specific agency names, ag, pers.ind.articleauthor, unk
  • Tag format: span-level BIO-style
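
To illustrate what a BIO-style tag format implies, the following minimal sketch groups token-level tags into labelled spans. The exact tag strings (for example "B-org.ent.pressagency.Havas") and the function name bio_to_spans are assumptions made for illustration; the annotation export format is not documented here.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (label, text) spans.

    An I- tag that does not continue an open span with the same label
    is treated as outside (O) in this sketch.
    """
    spans = []
    current_tokens, current_label = [], None

    def close_span():
        # Emit the span collected so far, if any.
        if current_label is not None:
            spans.append((current_label, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            close_span()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            close_span()
            current_tokens, current_label = [], None
    close_span()
    return spans


# Hand-written example, not taken from the training data.
tokens = ["(", "Agence", "Havas", ")", "Paris"]
tags = ["O", "B-org.ent.pressagency.Havas", "I-org.ent.pressagency.Havas", "O", "O"]
print(bio_to_spans(tokens, tags))
```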

Training Procedure

Training Hyperparameters

  • Pretrained base model: dbmdz/bert-base-historic-multilingual-cased
  • Finetuning epochs: 3
  • Max sequence length: 512 tokens
  • Languages: French and German
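
Because the maximum sequence length is 512 tokens, articles longer than that need to be split before inference. Whether the custom pipeline already does this is not stated here, so the following is only a sketch of one common approach: overlapping windows over a list of token IDs. The function name sliding_windows and the default window and overlap sizes are hypothetical; 510 leaves room for the two special tokens ([CLS] and [SEP]) that BERT adds.

```python
def sliding_windows(token_ids, max_len=510, stride=128):
    """Split token_ids into windows of at most max_len tokens.

    Consecutive windows overlap by `stride` tokens so that a mention cut
    at a window boundary appears whole in the neighbouring window.
    """
    if stride >= max_len:
        raise ValueError("stride must be smaller than max_len")
    step = max_len - stride
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        # Stop once the current window reaches the end of the input.
        if start + max_len >= len(token_ids):
            break
        start += step
    return windows


# Dummy token IDs standing in for a long tokenized article.
ids = list(range(1000))
windows = sliding_windows(ids)
print([len(w) for w in windows])
```

Predictions from overlapping regions would still need to be merged, for instance by keeping the higher-confidence prediction for each character offset.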

Evaluation

Model evaluation was performed on a manually annotated dataset with 1,530 documents (1,133 FR / 397 DE). Results are summarised below:

Language Precision Recall F1-score
French 0.92 0.88 0.90
German 0.89 0.83 0.86

Evaluation methodology and full results are detailed in the thesis: Marxen, 2023.
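
As a quick consistency check on the table above, the F1-score can be recomputed as the harmonic mean of precision and recall. This sketch assumes the reported F1 values are derived directly from the reported precision and recall (rather than, for example, averaged per tag); the function name f1_score is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as given in the results table.
reported = {
    "French": (0.92, 0.88, 0.90),
    "German": (0.89, 0.83, 0.86),
}

for lang, (p, r, f1) in reported.items():
    print(f"{lang}: computed F1 = {f1_score(p, r):.2f}, reported F1 = {f1:.2f}")
```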

Speeds, Sizes, Times

  • Model size: ~500MB
  • Training time: ~half an hour on 1 NVIDIA GPU (16GB)

Environmental Impact

  • Hardware: 1x NVIDIA GPU (16GB)
  • Training time: ~1 hour
  • Provider: Local HPC (Switzerland)
  • Estimated CO₂ Emissions: ~0.022 kg CO₂eq (calculated using MLCO2 calculator)

Limitations and Risks

  • Does not detect news agency content without explicit mention.
  • May miss OCR-degraded mentions despite robustness strategies.
  • Only covers French and German.

Model Sources

Citation

@misc{marxen_newsagency_2023,
  title = {Where Did the News come from? Detection of News Agency Releases in Historical Newspapers},
  author = {Marxen, Lea and Ehrmann, Maud and Boros, Emanuela},
  year = {2023},
  url = {https://github.com/impresso/newsagency-classification/},
  note = {Master Thesis}
}

Contact
