OCR-robust-gte-multilingual-base
This is a sentence-transformers model: it maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.
Model Details
This model was adapted to be more robust to OCR noise in German and French. It is particularly useful for libraries and archives in Central Europe that want to perform semantic search and longitudinal studies within their collections.
It is based on the Alibaba-NLP/gte-multilingual-base model, further adapted by Michail et al. (2025).
Usage (Sentence-Transformers)
Using this model is straightforward once you have sentence-transformers installed:
pip install -U sentence-transformers
Then you can use the model like this:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]
# trust_remote_code is required because the backbone uses a custom architecture ('NewModel')
model = SentenceTransformer('impresso-project/OCR-robust-gte-multilingual-base', trust_remote_code=True)
embeddings = model.encode(sentences)
print(embeddings)
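Since the model targets semantic search over noisy collections, a quick end-to-end check can be run with the built-in util.semantic_search helper. The corpus lines and query below are made-up illustrations; the model ID follows from this repository's name:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('impresso-project/OCR-robust-gte-multilingual-base', trust_remote_code=True)

# Hypothetical OCR'd corpus lines (with typical character noise) and a clean query
corpus = [
    "Der Gemeinderat hat gestern die neue Brlicke eröffnet.",
    "Le conseil municipal a inauguré le nouveau pont hier.",
]
query = "Eröffnung der Brücke"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank corpus lines by cosine similarity to the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit['corpus_id']], round(hit['score'], 3))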
Evaluation Results
Model-specific evaluation results will be added once the evaluation instance is running again.
Training Details
Training Dataset
Contrastive Training
The model was trained with the following parameters:
Loss:
sentence_transformers.losses.MultipleNegativesRankingLoss
with parameters:
{'scale': 20.0, 'similarity_fct': 'cos_sim'}
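For intuition, here is a minimal sketch of what this loss computes (not the library implementation): each (anchor, positive) pair in a batch treats every other positive as an in-batch negative, similarities are cos_sim values multiplied by the scale of 20.0, and cross-entropy pushes each anchor toward its own positive on the diagonal:

import torch
import torch.nn.functional as F

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    # 'similarity_fct': cos_sim -> normalize, then dot product
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    scores = a @ p.T * scale               # 'scale': 20.0 sharpens the softmax
    labels = torch.arange(scores.size(0))  # matching positive sits on the diagonal
    return F.cross_entropy(scores, labels)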
Parameters of the fit() method:
{
"epochs": 1,
"evaluation_steps": 0,
"evaluator": "NoneType",
"max_grad_norm": 1,
"optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
"optimizer_params": {
"lr": 2e-05
},
"scheduler": "WarmupLinear",
"steps_per_epoch": null,
"warmup_steps": 250,
"weight_decay": 0.01
}
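Putting the loss and the fit() parameters together, the adaptation run plausibly looked like the sketch below. The training pairs (a clean sentence next to an OCR-noised variant of it) and the batch size are illustrative assumptions; the actual training data is described in the paper:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('Alibaba-NLP/gte-multilingual-base', trust_remote_code=True)

# Assumed pair structure: clean text paired with an OCR-noised variant
train_examples = [
    InputExample(texts=["Die Zeitung erschien 1848.", "Dle Zeitunq erschien 1S48."]),
    # ... more pairs
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)  # batch size assumed
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    scheduler='WarmupLinear',
    warmup_steps=250,
    optimizer_params={'lr': 2e-05},
    weight_decay=0.01,
    max_grad_norm=1,
)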
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NewModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
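In words: the transformer backbone encodes sequences of up to 8192 tokens, the Pooling module takes the CLS token as the sentence embedding (pooling_mode_cls_token is True, all other modes are off), and Normalize L2-normalizes it so that dot products equal cosine similarities. A minimal sketch of the same computation with plain transformers, assuming the checkpoint exposes the same AutoModel interface as the base model:

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

model_id = 'impresso-project/OCR-robust-gte-multilingual-base'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

batch = tokenizer(["This is an example sentence"], padding=True,
                  truncation=True, max_length=8192, return_tensors='pt')
with torch.no_grad():
    out = model(**batch)

cls_embedding = out.last_hidden_state[:, 0]          # (1) CLS-token pooling -> (batch, 768)
embeddings = F.normalize(cls_embedding, p=2, dim=1)  # (2) L2 normalization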
Citation
BibTeX
Cheap Character Noise for OCR-Robust Multilingual Embeddings (introducing paper)
For details on the adaptation methodology, please refer to our paper (published in ACL 2025 Findings). If you use our models or methodology, please cite our work.
BibTeX entry to be added once available.
Original Multilingual GTE Model
@inproceedings{zhang2024mgte,
title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track},
pages={1393--1412},
year={2024}
}
About Impresso
Impresso project
Impresso - Media Monitoring of the Past is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. CRSII5_173719 and the second project (2023-2027) by the SNSF under grant No. CRSII5_213585 and the Luxembourg National Research Fund under grant No. 17498891.
Copyright
Copyright (C) 2025 The Impresso team.
License
This program is provided as open source under the GNU Affero General Public License v3 or later.