---
library_name: transformers
language:
- fr
- de
- en
- it
- lb
license: agpl-3.0
tags:
- language-identification
- multilingual
- historical
- impresso
---

# Model Card for impresso-project/language-identifier

## Overview

`impresso-project/language-identifier` is a multilingual language identification model fine-tuned for use on historical newspaper content. It supports **German (de), French (fr), Italian (it), English (en), and Luxembourgish (lb)**, the core languages of the [Impresso Project](https://impresso-project.ch), which analyzes historical media across national and linguistic borders.

This model has been adapted for short, OCR-noisy and fragmentary inputs typical of historical digitized texts.

## Model Details

- **Model type:** Language identification
- **Interface:** Hugging Face `transformers` pipeline
- **Languages supported:** fr, de, en, it, lb
- **License:** AGPL-3.0
- **Developed by:** UZH, Switzerland
- **Training data:** Historical newspapers from the impresso corpus and related sources

## How to Use

```python
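from transformers import pipeline

MODEL_NAME = "impresso-project/language-identifier"

# "langident" is a custom task, so trust_remote_code=True is required:
# it lets transformers run the pipeline code shipped in the model repository.
lang_pipeline = pipeline(
    "langident",
    model=MODEL_NAME,
    trust_remote_code=True,
    device="cpu",
)

# A short French passage used as sample input.
text = """En l'an 1348, au plus fort des ravages de la peste noire à travers
l'Europe, le Royaume de France se trouvait à la fois au bord du désespoir et
face à une opportunité."""

# Run language identification and print the prediction.
langs = lang_pipeline(text)
print(langs)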
```

## Output Format

The output is a single dictionary with the predicted language and confidence score:

```python
{
  "language": "fr",
  "score": 1.0
}
```
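
Downstream scripts may want to treat low-confidence predictions as undetermined. The following is a minimal sketch, assuming the output dictionary carries exactly the `language` and `score` keys shown above. The helper name `accept_prediction`, the `min_score` default of 0.8, and the `"und"` fallback label are illustrative assumptions, not part of the model's API.

```python
from typing import Dict, Union

# Shape of a single prediction, as shown in the Output Format section.
Prediction = Dict[str, Union[str, float]]


def accept_prediction(
    pred: Prediction,
    min_score: float = 0.8,  # hypothetical threshold; tune on your own data
    fallback: str = "und",   # hypothetical label for "undetermined"
) -> str:
    """Return the predicted language code, or `fallback` if the score is too low."""
    if float(pred["score"]) >= min_score:
        return str(pred["language"])
    return fallback


example = {"language": "fr", "score": 1.0}
print(accept_prediction(example))
```

A suitable threshold depends on the corpus; the default here is a placeholder rather than a recommended value.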


## Use Cases

- Preprocessing for OCR and NLP tasks on historical corpora
- Document and segment-level language tagging
- Filtering and sorting multilingual newspaper archives
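
As a rough illustration of the filtering and sorting use case, the sketch below groups text segments by predicted language. It assumes an `identify` callable that returns the dictionary format described under Output Format; in practice this could be the `lang_pipeline` object from the usage example, provided it returns that dictionary for a single string. The function `group_by_language` and the stand-in classifier `fake_identify` are hypothetical names used only so the example runs without downloading the model.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_language(
    texts: Iterable[str],
    identify: Callable[[str], dict],
) -> Dict[str, List[str]]:
    """Group texts under the language code predicted by `identify`."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for text in texts:
        pred = identify(text)  # expected: {"language": ..., "score": ...}
        groups[pred["language"]].append(text)
    return dict(groups)


def fake_identify(text: str) -> dict:
    """Toy stand-in for the real pipeline (assumption, for demonstration only)."""
    lang = "de" if "und" in text.split() else "fr"
    return {"language": lang, "score": 1.0}


segments = ["Der Hund und die Katze", "Le chat et le chien"]
print(group_by_language(segments, fake_identify))
```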

## Limitations

- Works best on **sentence- or paragraph-length** texts
- May struggle with code-switching or OCR-degraded text that mixes languages
- Primarily optimized for **Impresso-like sources** (19th–20th century newspapers)

## Installation

```bash
pip install transformers floret
```

## Contact

- Website: [https://impresso-project.ch](https://impresso-project.ch)

<p align="center">
  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="300" alt="Impresso Logo"/>
</p>