Luxembourgish adaptation of Alibaba-NLP/gte-multilingual-base

This is a sentence-transformers model finetuned from Alibaba-NLP/gte-multilingual-base and further adapted to support Historical and Contemporary Luxembourgish. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for (cross-lingual) semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

This model is specialised in cross-lingual semantic search to and from Historical/Contemporary Luxembourgish. It is particularly useful for libraries and archives that want to perform semantic search and longitudinal studies within their collections.

This is an Alibaba-NLP/gte-multilingual-base model that was further adapted by Michail et al. (2025).

Limitations

We also release a model that performs better (by 18 percentage points) on ParaLUX. If finding monolingual exact matches within adversarial collections is of utmost importance, please use histlux-paraphrase-multilingual-mpnet-base-v2.

Model Description

  • Model Type: GTE-Multilingual-Base
  • Base model: Alibaba-NLP/gte-multilingual-base
  • Maximum Sequence Length: 8192 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset: See below

Usage (Sentence-Transformers)

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('impresso-project/histlux-gte-multilingual-base', trust_remote_code=True)
embeddings = model.encode(sentences)
print(embeddings)
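
For cross-lingual use, the embeddings can be compared directly with cosine similarity. The sketch below is illustrative only (the query and candidate sentences are made-up examples, not from the introducing paper); it retrieves the closest French candidate for a Luxembourgish query using util.cos_sim from sentence-transformers:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('impresso-project/histlux-gte-multilingual-base', trust_remote_code=True)

# Hypothetical query and candidates, for illustration only
query = "D'Stad Lëtzebuerg huet eng nei Gare gebaut."
candidates = [
    "La ville de Luxembourg a construit une nouvelle gare.",
    "Le gouvernement a adopté une nouvelle loi scolaire.",
]

query_emb = model.encode(query)
cand_embs = model.encode(candidates)

# Cosine similarity between the query and each candidate
scores = util.cos_sim(query_emb, cand_embs)
best = scores.argmax().item()
print(candidates[best], scores[0, best].item())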

Evaluation Results

Metrics

Historical Bitext Mining (Accuracy, see introducing paper):

  • LB -> FR: 96.8
  • FR -> LB: 96.9
  • LB -> EN: 97.2
  • EN -> LB: 97.2
  • LB -> DE: 98.0
  • DE -> LB: 91.8

Contemporary Luxembourgish (Accuracy):

  • ParaLUX: 62.82
  • SIB-200 (LB): 62.16

Training Details

Training Dataset

The parallel sentence data mix is as follows:

impresso-project/HistLuxAlign:

  • LB-FR (x20,000)
  • LB-EN (x20,000)
  • LB-DE (x20,000)

fredxlpy/LuxAlign:

  • LB-FR (x40,000)
  • LB-EN (x20,000)

Total: 120,000 sentence pairs, trained in mixed batches of size 8.

Contrastive Training

The model was trained with the following parameters:

**Loss**:

`sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:

{'scale': 20.0, 'similarity_fct': 'cos_sim'}


Parameters of the fit() method:

{
  "epochs": 1,
  "evaluation_steps": 520,
  "max_grad_norm": 1,
  "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
  "optimizer_params": {"lr": 2e-05},
  "scheduler": "WarmupLinear"
}


Citation

BibTeX

Adapting Multilingual Embedding Models to Historical Luxembourgish (introducing paper)

@inproceedings{michail-etal-2025-adapting,
    title = "Adapting Multilingual Embedding Models to Historical {L}uxembourgish",
    author = "Michail, Andrianos  and
      Racl{\'e}, Corina  and
      Opitz, Juri  and
      Clematide, Simon",
    editor = "Kazantseva, Anna  and
      Szpakowicz, Stan  and
      Degaetano-Ortlieb, Stefania  and
      Bizzoni, Yuri  and
      Pagel, Janis",
    booktitle = "Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025)",
    month = may,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.latechclfl-1.26/",
    doi = "10.18653/v1/2025.latechclfl-1.26",
    pages = "291--298",
    ISBN = "979-8-89176-241-1"
}

Original Multilingual GTE Model

@inproceedings{zhang2024mgte,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track},
  pages={1393--1412},
  year={2024}
}

About Impresso

Impresso project

Impresso - Media Monitoring of the Past is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. CRSII5_173719 and the second project (2023-2027) by the SNSF under grant No. CRSII5_213585 and the Luxembourg National Research Fund under grant No. 17498891.

Copyright

Copyright (C) 2025 The Impresso team.

License

This program is provided as open source under the GNU Affero General Public License v3 or later.

