Luxembourgish adaptation of Alibaba-NLP/gte-multilingual-base
This is a sentence-transformers model fine-tuned from Alibaba-NLP/gte-multilingual-base and further adapted to support Historical and Contemporary Luxembourgish. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for (cross-lingual) semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
This model is specialised for cross-lingual semantic search to and from Historical/Contemporary Luxembourgish. It is particularly useful for libraries and archives that want to perform semantic search and longitudinal studies within their collections.
It is an Alibaba-NLP/gte-multilingual-base model that was further adapted by Michail et al. (2025); see the citation below.
Limitations
We also release a model that performs better (by 18 percentage points) on ParaLUX. If finding monolingual exact matches within adversarial collections is of utmost importance, please use histlux-paraphrase-multilingual-mpnet-base-v2 instead.
Model Description
- Model Type: Sentence Transformer
- Base model: Alibaba-NLP/gte-multilingual-base
- Maximum Sequence Length: 8192 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
- Training Dataset: See below
Usage (Sentence-Transformers)
Using this model is straightforward once you have sentence-transformers installed:
```bash
pip install -U sentence-transformers
```
Then you can use the model like this:
```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# trust_remote_code=True is required because the GTE architecture
# ships custom modeling code
model = SentenceTransformer('impresso-project/histlux-gte-multilingual-base', trust_remote_code=True)
embeddings = model.encode(sentences)
print(embeddings)
```
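Because the model is trained for cross-lingual retrieval, embeddings of Luxembourgish text can be compared directly against French, English, or German text. Below is a minimal sketch of such a comparison using `model.similarity` (available in sentence-transformers v3+); the example sentences are illustrative only:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('impresso-project/histlux-gte-multilingual-base', trust_remote_code=True)

# A Luxembourgish query against a small mixed-language collection
query_embedding = model.encode(["Wéi war d'Wieder gëschter?"])
document_embeddings = model.encode([
    "Le temps était pluvieux hier soir.",       # French
    "The council met to discuss the budget.",   # English
    "D'Gemeng huet eng nei Schoul gebaut.",     # Luxembourgish
])

# Cosine similarity matrix (1 x 3); the highest score marks the best match
similarities = model.similarity(query_embedding, document_embeddings)
print(similarities)
```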
Evaluation Results
Metrics
Historical Bitext Mining (accuracy, see the introducing paper):

| Direction | Accuracy |
|-----------|----------|
| LB -> FR  | 96.8     |
| FR -> LB  | 96.9     |
| LB -> EN  | 97.2     |
| EN -> LB  | 97.2     |
| LB -> DE  | 98.0     |
| DE -> LB  | 91.8     |

Contemporary LB (accuracy):

| Task         | Accuracy |
|--------------|----------|
| ParaLUX      | 62.82    |
| SIB-200 (LB) | 62.16    |
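The bitext mining scores measure how often the nearest neighbour of a sentence, among all sentences in the other language, is its actual translation. A sketch of how such an evaluation can be run with sentence-transformers' built-in TranslationEvaluator is shown below; the two parallel lists are illustrative placeholders, not the actual test set:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import TranslationEvaluator

model = SentenceTransformer('impresso-project/histlux-gte-multilingual-base', trust_remote_code=True)

# Placeholder parallel data: lb_sentences[i] must translate fr_sentences[i]
lb_sentences = ["D'Stad Lëtzebuerg ass d'Haaptstad vum Grand-Duché."]
fr_sentences = ["La ville de Luxembourg est la capitale du Grand-Duché."]

# Reports retrieval accuracy in both directions (src -> tgt and tgt -> src)
evaluator = TranslationEvaluator(
    source_sentences=lb_sentences,
    target_sentences=fr_sentences,
    name="lb-fr-bitext",
)
print(evaluator(model))
```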
Training Details
Training Dataset
The parallel sentence data mix is the following (a loading sketch follows below):

impresso-project/HistLuxAlign:
- LB-FR (20,000 pairs)
- LB-EN (20,000 pairs)
- LB-DE (20,000 pairs)

fredxlpy/LuxAlign:
- LB-FR (40,000 pairs)
- LB-EN (20,000 pairs)

Total: 120,000 sentence pairs, trained in mixed batches of size 8.
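One way to assemble such a mix with the datasets library is sketched below. The configuration and column names are assumptions for illustration; check the two dataset cards for the exact identifiers:

```python
from datasets import load_dataset, concatenate_datasets

def load_pairs(repo, config):
    # The "lb-fr" style config names, the "train" split, and the column
    # names are assumptions; consult the dataset cards for the real ones.
    ds = load_dataset(repo, config, split="train")
    src, tgt = config.split("-")
    return ds.rename_columns({src: "anchor", tgt: "positive"})

subsets = [
    load_pairs("impresso-project/HistLuxAlign", c) for c in ("lb-fr", "lb-en", "lb-de")
] + [
    load_pairs("fredxlpy/LuxAlign", c) for c in ("lb-fr", "lb-en")
]

# Concatenate and shuffle so every batch mixes language pairs
pairs = concatenate_datasets(subsets).shuffle(seed=42)
```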
Contrastive Training
The model was trained with the following parameters:

**Loss**: `sentence_transformers.losses.MultipleNegativesRankingLoss` with parameters:

```json
{"scale": 20.0, "similarity_fct": "cos_sim"}
```

Parameters of the fit() method:

```json
{
  "epochs": 1,
  "evaluation_steps": 520,
  "max_grad_norm": 1,
  "optimizer_class": "torch.optim.adamw.AdamW",
  "optimizer_params": { "lr": 2e-05 },
  "scheduler": "WarmupLinear"
}
```
Citation
BibTeX
Adapting Multilingual Embedding Models to Historical Luxembourgish (introducing paper)
@inproceedings{michail-etal-2025-adapting,
title = "Adapting Multilingual Embedding Models to Historical {L}uxembourgish",
author = "Michail, Andrianos and
Racl{\'e}, Corina and
Opitz, Juri and
Clematide, Simon",
editor = "Kazantseva, Anna and
Szpakowicz, Stan and
Degaetano-Ortlieb, Stefania and
Bizzoni, Yuri and
Pagel, Janis",
booktitle = "Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025)",
month = may,
year = "2025",
address = "Albuquerque, New Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.latechclfl-1.26/",
doi = "10.18653/v1/2025.latechclfl-1.26",
pages = "291--298",
ISBN = "979-8-89176-241-1"
}
Original Multilingual GTE Model
@inproceedings{zhang2024mgte,
title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track},
pages={1393--1412},
year={2024}
}
About Impresso
Impresso project
Impresso - Media Monitoring of the Past is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. CRSII5_173719 and the second project (2023-2027) by the SNSF under grant No. CRSII5_213585 and the Luxembourg National Research Fund under grant No. 17498891.
Copyright
Copyright (C) 2025 The Impresso team.
License
This program is provided as open source under the GNU Affero General Public License v3 or later.