---
tags:
- sentence-transformers
- sentence-similarity
- dataset_size:120000
- multilingual
base_model: Alibaba-NLP/gte-multilingual-base
widget:
- source_sentence: Who is filming along?
  sentences:
  - Wién filmt mat?
  - >-
    Weider huet den Tatarescu drop higewisen, datt Rumänien durch seng
    krichsbedélegong op de 6eite vun den allie'erten 110.000 mann verluer hätt.
  - Brambilla 130.08.03 St.
- source_sentence: 'Four potential scenarios could still play out: Jean Asselborn.'
  sentences:
  - >-
    Dann ass nach eng Antenne hei um Kierchbierg virgesi Richtung RTL Gebai, do
    gëtt jo een ganz neie Wunnquartier gebaut.
  - >-
    D'bedélegong un de wählen wir ganz stärk gewiéscht a munche ge'genden wor re
    eso'gucr me' we' 90 prozent.
  - Jean Asselborn gesäit 4 Méiglechkeeten, wéi et kéint virugoen.
- source_sentence: >-
    Non-profit organisation Passerell, which provides legal council to refugees
    in Luxembourg, announced that it has to make four employees redundant in
    August due to a lack of funding.
  sentences:
  - Oetringen nach Remich....8.20» 215»
  - >-
    D'ASBL Passerell, déi sech ëm d'Berodung vu Refugiéeën a Saache Rechtsfroe
    këmmert, wäert am August mussen hir véier fix Salariéen entloossen.
  - D'Regierung huet allerdéngs "just" 180.041 Doudeger verzeechent.
- source_sentence: This regulation was temporarily lifted during the Covid pandemic.
  sentences:
  - Six Jours vu New-York si fir d’équipe Girgetti  Debacco
  - Dës Reegelung gouf wärend der Covid-Pandemie ausgesat.
  - ING-Marathon ouni gréisser Tëschefäll ofgelaf - 18 Leit hospitaliséiert.
- source_sentence: The cross-border workers should also receive more wages.
  sentences:
  - D'grenzarbechetr missten och me' lo'n kre'en.
  - >-
    De Néckel: Firun! Dât ass jo ailes, wèll 't get dach neischt un der Bréck
    gemâcht!
  - >-
    D'Grande-Duchesse Josephine Charlotte an hir Ministeren hunn d'Land
    verlooss, et war den Optakt vun der Zäit am Exil.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
model-index:
- name: >-
    SentenceTransformer based on
    Alibaba-NLP/gte-multilingual-base
  results:
  - task:
      type: contemporary-lb
      name: Contemporary-lb
    dataset:
      name: Contemporary-lb
      type: contemporary-lb
    metrics:
    - type: accuracy
      value: 0.6216
      name: SIB-200(LB) accuracy
    - type: accuracy
      value: 0.6282
      name: ParaLUX accuracy
  - task:
      type: bitext-mining
      name: LBHistoricalBitextMining
    dataset:
      name: LBHistoricalBitextMining
      type: lb-en
    metrics:
    - type: accuracy
      value: 0.9683
      name: LB<->FR accuracy
    - type: accuracy
      value: 0.9715
      name: LB<->EN accuracy
    - type: mean_accuracy
      value: 0.9793
      name: LB<->DE accuracy
license: agpl-3.0
datasets:
- impresso-project/HistLuxAlign
- fredxlpy/LuxAlign
language:
- lb
---

# Luxembourgish adaptation of Alibaba-NLP/gte-multilingual-base

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) and further adapted to support Historical and Contemporary Luxembourgish. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for (cross-lingual) semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.


## Model Details

This model is specialised for cross-lingual semantic search to and from Historical/Contemporary Luxembourgish. It is particularly useful for libraries and archives that want to perform semantic search and longitudinal studies within their collections.

This is an [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) model that was further adapted by Michail et al. (2025).

## Limitations

We also release a model that performs substantially better (+18pp) on ParaLUX. If finding monolingual exact matches within adversarial collections is of utmost importance, please use [histlux-paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/impresso-project/histlux-paraphrase-multilingual-mpnet-base-v2).

### Model Description
- **Model Type:** GTE-Multilingual-Base
- **Base model:** [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base)
- **Maximum Sequence Length:** 8192 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
- **Training Dataset:**
    - LB-FR, LB-EN, LB-DE parallel sentences (Historical and Contemporary; see Training Details)
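
These properties can be confirmed programmatically after loading the model (a small sketch using standard sentence-transformers attributes):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("impresso-project/histlux-gte-multilingual-base", trust_remote_code=True)
print(model.max_seq_length)                      # 8192
print(model.get_sentence_embedding_dimension())  # 768
```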


## Usage (Sentence-Transformers)

Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# trust_remote_code=True is required because GTE models ship custom modeling code
model = SentenceTransformer('impresso-project/histlux-gte-multilingual-base', trust_remote_code=True)
embeddings = model.encode(sentences)  # one 768-dimensional vector per sentence
print(embeddings)
```
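
Because the model is tuned for cross-lingual similarity, Luxembourgish and English sentences can be compared directly. A minimal sketch, using an illustrative sentence pair:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('impresso-project/histlux-gte-multilingual-base', trust_remote_code=True)

lb_embedding = model.encode("Dës Reegelung gouf wärend der Covid-Pandemie ausgesat.")
en_embedding = model.encode("This regulation was temporarily lifted during the Covid pandemic.")

# Cosine similarity; values close to 1 indicate parallel sentences
print(util.cos_sim(lb_embedding, en_embedding))
```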



## Evaluation Results

### Metrics

(see the introducing paper for full details)

Historical Bitext Mining (accuracy):

| Direction | Accuracy |
|-----------|----------|
| LB -> FR  | 96.8     |
| FR -> LB  | 96.9     |
| LB -> EN  | 97.2     |
| EN -> LB  | 97.2     |
| LB -> DE  | 98.0     |
| DE -> LB  | 91.8     |

Contemporary Luxembourgish (accuracy):

| Benchmark    | Accuracy |
|--------------|----------|
| SIB-200 (LB) | 62.16    |
| ParaLUX      | 62.82    |
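
For illustration, bitext mining accuracy of this kind can be computed by encoding both sides of a parallel corpus and checking whether each sentence's nearest neighbour (by cosine similarity) is its aligned translation. A minimal sketch, with toy data standing in for a real aligned corpus:

```python
import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("impresso-project/histlux-gte-multilingual-base", trust_remote_code=True)

# Toy aligned corpus; a real evaluation would use thousands of pairs
lb_sentences = ["Wién filmt mat?", "Dës Reegelung gouf wärend der Covid-Pandemie ausgesat."]
en_sentences = ["Who is filming along?", "This regulation was temporarily lifted during the Covid pandemic."]

lb_emb = model.encode(lb_sentences, convert_to_tensor=True)
en_emb = model.encode(en_sentences, convert_to_tensor=True)

# LB -> EN: each Luxembourgish sentence should retrieve its English counterpart
nearest = util.cos_sim(lb_emb, en_emb).argmax(dim=1)
correct = nearest == torch.arange(len(lb_sentences), device=nearest.device)
print(f"LB -> EN accuracy: {correct.float().mean():.2%}")
```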


## Training Details

### Training Dataset

The parallel sentence data mix is the following:

impresso-project/HistLuxAlign:
- LB-FR (20,000 pairs)
- LB-EN (20,000 pairs)
- LB-DE (20,000 pairs)

fredxlpy/LuxAlign:
- LB-FR (40,000 pairs)
- LB-EN (20,000 pairs)

Total: 120,000 sentence pairs, trained in mixed batches of size 8.


### Contrastive Training
The model was trained with the following parameters:

**Loss**:

`sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:

```
{'scale': 20.0, 'similarity_fct': 'cos_sim'}
```

Parameters of the fit() method:

```
{
    "epochs": 1,
    "evaluation_steps": 520,
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear"
}
```
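
For reference, a comparable contrastive training run can be sketched with the sentence-transformers `fit()` API (the pairs shown are placeholders; the actual data mix is described above):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("Alibaba-NLP/gte-multilingual-base", trust_remote_code=True)

# Placeholder parallel pairs; the real mix comes from
# impresso-project/HistLuxAlign and fredxlpy/LuxAlign
train_examples = [
    InputExample(texts=["Who is filming along?", "Wién filmt mat?"]),
    InputExample(texts=["The regulation was lifted during the pandemic.",
                        "Dës Reegelung gouf wärend der Covid-Pandemie ausgesat."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)

# In-batch negatives with cosine similarity scaled by 20, as listed above
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    optimizer_params={"lr": 2e-05},
    scheduler="WarmupLinear",
    max_grad_norm=1,
)
```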

## Citation

### BibTeX

#### Adapting Multilingual Embedding Models to Historical Luxembourgish (introducing paper)

```bibtex
@misc{michail2025adaptingmultilingualembeddingmodels,
      title={Adapting Multilingual Embedding Models to Historical Luxembourgish}, 
      author={Andrianos Michail and Corina Julia Raclé and Juri Opitz and Simon Clematide},
      year={2025},
      eprint={2502.07938},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.07938}, 
}
```

#### Original Multilingual GTE Model

```bibtex
@inproceedings{zhang2024mgte,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track},
  pages={1393--1412},
  year={2024}
}
```