---
library_name: transformers
language:
- fr
- de
- en
- it
- lb
license: agpl-3.0
tags:
- language-identification
- multilingual
- historical
- impresso
---
# Model Card for impresso-project/language-identifier
## Overview
`impresso-project/language-identifier` is a multilingual language identification model fine-tuned for historical newspaper content. It supports **German (de), French (fr), Italian (it), English (en), and Luxembourgish (lb)**, the core languages of the [Impresso Project](https://impresso-project.ch), which analyzes historical media across national and linguistic borders.
The model is adapted to the short, OCR-noisy, and fragmentary inputs typical of digitized historical texts.
## Model Details
- **Model type:** Language identification
- **Interface:** Hugging Face `transformers` pipeline
- **Languages supported:** fr, de, en, it, lb
- **License:** AGPL-3.0
- **Developed by:** UZH, Switzerland
- **Training data:** Historical newspapers from the impresso corpus and related sources
## How to Use
```python
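from transformers import pipeline

MODEL_NAME = "impresso-project/language-identifier"

# "langident" is not a built-in transformers task: the pipeline code is
# loaded from the model repository, hence trust_remote_code=True.
lang_pipeline = pipeline(
    "langident",
    model=MODEL_NAME,
    trust_remote_code=True,
    device="cpu",
)

text = """En l'an 1348, au plus fort des ravages de la peste noire à travers
l'Europe, le Royaume de France se trouvait à la fois au bord du désespoir et
face à une opportunité."""

# Returns the predicted language code and its confidence score.
langs = lang_pipeline(text)
print(langs)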
```
## Output Format
The output is a single dictionary with the predicted language and confidence score:
```python
{
"language": "fr",
"score": 1.0
}
```
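Downstream code will often want to act only on confident predictions. The following is a minimal sketch that assumes the result has exactly the `language` and `score` keys shown above. The helper name `accept_prediction`, the `0.9` threshold, and the `"und"` (undetermined) fallback code are illustrative assumptions, not part of the model's interface.

```python
from typing import Dict, Union

# Shape of one prediction, as documented in the Output Format section.
Prediction = Dict[str, Union[str, float]]


def accept_prediction(
    pred: Prediction,
    min_score: float = 0.9,   # hypothetical threshold; tune on your own data
    fallback: str = "und",    # hypothetical label for "undetermined"
) -> str:
    """Return the predicted language code, or `fallback` if the score is too low."""
    if float(pred["score"]) >= min_score:
        return str(pred["language"])
    return fallback


# Example with a result shaped like the dictionary above.
print(accept_prediction({"language": "fr", "score": 1.0}))  # fr
```

A suitable threshold depends on the corpus; very short or heavily degraded segments may need a stricter cut-off than clean paragraphs.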
## Use Cases
- Preprocessing for OCR and NLP tasks on historical corpora
- Document and segment-level language tagging
- Filtering and sorting multilingual newspaper archives
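As an illustration of the filtering and sorting use case, the sketch below groups text segments by predicted language. It assumes a `predict` callable that takes one string and returns a dictionary with a `language` key, as described under Output Format; the function name `group_by_language` is hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_language(
    texts: Iterable[str],
    predict: Callable[[str], dict],
) -> Dict[str, List[str]]:
    """Bucket non-empty texts by the language code that `predict` returns."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for text in texts:
        if not text.strip():
            continue  # skip empty or whitespace-only segments
        groups[predict(text)["language"]].append(text)
    return dict(groups)
```

With the pipeline from the How to Use section, this could be called as `group_by_language(articles, lang_pipeline)`, where `articles` is any iterable of strings.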
## Limitations
- Works best on **sentence- or paragraph-length** texts
- May struggle with code-switching or OCR-degraded text that mixes languages
- Primarily optimized for **Impresso-like sources** (19th–20th century newspapers)
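One way to work within these limits is to classify a long document paragraph by paragraph and flag documents whose paragraphs disagree. This is a rough sketch, not a recommended procedure from the model authors: the names `tag_paragraphs` and `is_mixed`, the blank-line paragraph split, and the `min_chars` cut-off of 20 are all assumptions, and `predict` stands for any callable returning the dictionary described under Output Format.

```python
import re
from typing import Callable, List, Tuple


def tag_paragraphs(
    document: str,
    predict: Callable[[str], dict],
    min_chars: int = 20,  # hypothetical cut-off for "too short to classify"
) -> List[Tuple[str, str]]:
    """Split on blank lines and return (language, paragraph) pairs."""
    tagged = []
    for paragraph in re.split(r"\n\s*\n", document):
        paragraph = paragraph.strip()
        if len(paragraph) < min_chars:
            continue  # very short fragments are skipped rather than guessed
        tagged.append((predict(paragraph)["language"], paragraph))
    return tagged


def is_mixed(tagged: List[Tuple[str, str]]) -> bool:
    """True when the tagged paragraphs carry more than one language code."""
    return len({lang for lang, _ in tagged}) > 1
```

Documents flagged by `is_mixed` are candidates for manual review or segment-level tagging rather than a single document-level label.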
## Installation
```bash
pip install transformers floret
```
## Contact
- Website: [https://impresso-project.ch](https://impresso-project.ch)
<p align="center">
<img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="300" alt="Impresso Logo"/>
</p>