---
library_name: transformers
language:
- fr
- de
license: agpl-3.0
tags:
- newsagency
- ner
- historical
- impresso
- multilingual
---

# Model Card for `impresso-project/ner-newsagency-bert-multilingual`

This model detects mentions of **news agencies** in historical newspaper articles in **French** and **German**. It was developed as part of the [Impresso project](https://impresso-project.ch), a multidisciplinary initiative that aims to enable the exploration of large-scale historical media archives.

The model is fine-tuned from [`dbmdz/bert-base-historic-multilingual-cased`](https://huggingface.co/dbmdz/bert-base-historic-multilingual-cased) on a custom annotated dataset of over 1,500 historical articles (1840–2000) from the Swiss and Luxembourgish press.

## Model Details

### Description

- **Developed by:** DHLAB, EPFL and the [Impresso team](https://impresso-project.ch). Impresso is an interdisciplinary project focused on historical media analysis across languages, time, and modalities. It is funded by the Swiss National Science Foundation ([CRSII5_173719](http://p3.snf.ch/project-173719), [CRSII5_213585](https://data.snf.ch/grants/grant/213585)) and the Luxembourg National Research Fund (grant No. 17498891).
- **Model type:** BERT-based token classification model for named entity recognition
- **Languages:** French and German
- **License:** [AGPL v3+](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE)
- **Finetuned from:** [`dbmdz/bert-base-historic-multilingual-cased`](https://huggingface.co/dbmdz/bert-base-historic-multilingual-cased)
- **Training data:** [Zenodo dataset of agency mentions](https://doi.org/10.5281/zenodo.8333933)

### Model Architecture

The model architecture consists of the following component:

- A **pre-trained BERT encoder** (multilingual historic BERT) as the base for token classification.

## Entity Types Supported

The model predicts whether a given token span corresponds to a news agency mention.
The following tags are used:

### Recognized News Agencies

| Tag | Description |
|-----|-------------|
| AFP | Agence France Presse (A.F.P.) |
| ANP | Algemeen Nederlands Persbureau |
| ANSA | Agenzia Nazionale Stampa Associata |
| AP | Associated Press (Assoc. Press) |
| APA | Austria Press Agentur |
| ATS-SDA | Agence télégraphique suisse / Schweizerische Depeschenagentur (ATS, SDA) |
| BTA | Bulgarska Telegrafitscheka Agentzia (Agence Bulgare) |
| Belga | Agence Belga SA |
| CTK | Czechoslavenska Tiskova Kancelar (Ceteka) |
| DDP-DAPD | Deutscher Depeschendienst / Deutscher Auslands-Depeschendienst |
| DNB | Deutsches Nachrichtenbüro GmbH (D.N.B.) |
| DPA | Deutsche Presse Agentur |
| Domei | Domei Tsushin (Japan) |
| Europapress | Europapress (Europapreß, Europapr.) |
| Extel | Exchange Telegraph Co. Ltd. (Agence Extel) |
| Havas | Havas (Agence Havas) |
| Interfax | Interfax News Agency |
| PAP | Polska Agencja Prasowa |
| Reuters | Reuters (Reuter, Reutermeldung, Reuter’sche Bureau) |
| SPK-SMP | Schweizer Mittelpresse / Schweizerische Politische Korrespondenz (SPK, SMP) |
| Stefani | Agenzia Stefani (Agence Stefani) |
| TANJUG | Telegrafska Agencija nova Jugoslavija |
| TASS | Telegrafnoie Agenstvo sovietskavo Soyusa (ITAR-TASS, Taß, etc.) |
| TT | Tidningarnas Telegrambyra (Sweden) |
| Telunion | Telegraphen-Union (TU) |
| UP-UPI | United Press / United Press International (UP, UPI) |
| Wolff | Wolffs Telegraphisches Bureau (Wolffagentur, etc.) |
| ag | Generic agency mention (e.g. “ag.”, “Agence”) |
| pers.ind.articleauthor | Author of the newspaper article |
| unk | Unknown agency not in tagset |

## How to Use

```python
from transformers import pipeline

# "newsagency-ner" is not a built-in task; trust_remote_code=True loads the pipeline code from the model repository.
nlp = pipeline("newsagency-ner", model="impresso-project/ner-newsagency-bert-multilingual", trust_remote_code=True)

nlp("La dépêche vient de (Reuter), diffusée hier.")
```

#### Example Output

```json
{
  "type": "org.ent.pressagency.Reuters",
  "confidence": 98.94,
  "index": 12,
  "surface": "Reuter",
  "start": 43,
  "end": 49
}
```

## Training Details

### Training Data

The model was trained on a custom dataset of 1,530 documents (1,133 FR / 397 DE) from the Impresso HIPE-2020 dataset. The training data was manually annotated to identify mentions of news agencies in historical newspaper articles.

### Dataset Characteristics

| Lang | Docs | Tokens | Mentions | % Noisy |
|------|------|--------|----------|---------|
| fr | 1,133 | 759k | 1,399 | 5% |
| de | 397 | 299k | 577 | 9% |

- **Annotation tool:** INCEpTION
- **Tags used:** specific agency names, `ag`, `pers.ind.articleauthor`, `unk`
- **Tag format:** span-level BIO-style

### Training Procedure

#### Training Hyperparameters

- **Pretrained base model:** `dbmdz/bert-base-historic-multilingual-cased`
- **Finetuning epochs:** 3
- **Max sequence length:** 512 tokens
- **Languages:** French and German

## Evaluation

Model evaluation was performed on a manually annotated dataset with 1,530 documents (1,133 FR / 397 DE). Results are summarised below:

| Language | Precision | Recall | F1-score |
|----------|-----------|--------|----------|
| French | 0.92 | 0.88 | 0.90 |
| German | 0.89 | 0.83 | 0.86 |

Evaluation methodology and full results are detailed in the thesis: [Marxen, 2023](https://infoscience.epfl.ch/entities/publication/0ca6e53d-d37e-4cdf-a360-2dd9f34a8271).
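
As a quick consistency check on the evaluation table, the F1-score is conventionally the harmonic mean of precision and recall. The sketch below recomputes it from the reported per-language values. The helper name `f1_score` is illustrative and not part of the model's codebase, and the snippet does not reproduce the evaluation protocol used in the thesis; it only assumes the standard F1 definition.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the evaluation table.
reported = {
    "French": (0.92, 0.88),
    "German": (0.89, 0.83),
}

# Round to two decimals, matching the precision of the table.
f1_by_language = {
    lang: round(f1_score(p, r), 2) for lang, (p, r) in reported.items()
}

for lang, f1 in f1_by_language.items():
    print(f"{lang}: F1 = {f1:.2f}")
```

Under that assumption, the recomputed values agree with the F1 column of the table after rounding to two decimals.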

#### Speeds, Sizes, Times

- **Model size:** ~500MB
- **Training time:** ~half an hour on 1 NVIDIA GPU (16GB)

## Environmental Impact

- **Hardware:** 1x NVIDIA GPU (16GB)
- **Training time:** ~1 hour
- **Provider:** Local HPC (Switzerland)
- **Estimated CO₂ Emissions:** ~0.022 kg CO₂eq (calculated using the [MLCO2 calculator](https://mlco2.github.io/impact/))

## Limitations and Risks

- Does not detect news agency content without an explicit mention.
- May miss OCR-degraded mentions despite robustness strategies.
- Only covers French and German.

## Model Sources

- **Repository:** https://github.com/impresso/newsagency-classification
- **Paper:** [Marxen, 2023](https://infoscience.epfl.ch/entities/publication/0ca6e53d-d37e-4cdf-a360-2dd9f34a8271)
- **Demo:** [Impresso project](https://impresso-project.ch)

## Citation

```bibtex
@misc{marxen_newsagency_2023,
  title = {Where Did the News come from? Detection of News Agency Releases in Historical Newspapers},
  author = {Marxen, Lea and Ehrmann, Maud and Boros, Emanuela},
  year = {2023},
  url = {https://github.com/impresso/newsagency-classification/},
  note = {Master Thesis}
}
```

## Contact

- Website: [https://impresso-project.ch](https://impresso-project.ch)