Andrianos committed
Commit fcd450b · verified · 1 Parent(s): 970e2e8

Model Card first version

Files changed (1):
  1. README.md +40 -16

README.md CHANGED
@@ -5,14 +5,26 @@ tags:
 - sentence-transformers
 - feature-extraction
 - sentence-similarity
-
 ---
 
-# {MODEL_NAME}
 
 This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
 
-<!--- Describe your model here -->
 
 ## Usage (Sentence-Transformers)
 
@@ -28,7 +40,7 @@ Then you can use the model like this:
 from sentence_transformers import SentenceTransformer
 sentences = ["This is an example sentence", "Each sentence is converted"]
 
-model = SentenceTransformer('{MODEL_NAME}')
 embeddings = model.encode(sentences)
 print(embeddings)
 ```
@@ -37,21 +49,15 @@ print(embeddings)
 
 ## Evaluation Results
 
-<!--- Describe how your model was evaluated -->
 
-For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name={MODEL_NAME})
 
-## Training
 The model was trained with the parameters:
 
-**DataLoader**:
-
-`torch.utils.data.dataloader.DataLoader` of length 2500 with parameters:
-```
-{'batch_size': 8, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
-```
-
 **Loss**:
 
 `sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:
@@ -87,6 +93,24 @@ SentenceTransformer(
 )
 ```
 
-## Citing & Authors
 
-<!--- Describe where people can find more information -->
 - sentence-transformers
 - feature-extraction
 - sentence-similarity
+- multilingual
+license: agpl-3.0
+language:
+- de
+- fr
+- en
+- lb
+base_model:
+- Alibaba-NLP/gte-multilingual-base
 ---
 
+# OCR-robust-gte-multilingual-base
 
 This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
 
+## Model Details
+
+This model was adapted to be more robust to OCR noise in German and French. It should be particularly useful for libraries and archives in Central Europe that want to perform semantic search and longitudinal studies within their collections.
+
+This is an [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) model that was further adapted by Michail et al. (2025).
 
 ## Usage (Sentence-Transformers)
 
 from sentence_transformers import SentenceTransformer
 sentences = ["This is an example sentence", "Each sentence is converted"]
 
+model = SentenceTransformer('OCR-robust-gte-multilingual-base')
 embeddings = model.encode(sentences)
 print(embeddings)
 ```
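Once `model.encode` has produced embeddings, a common follow-up for semantic search is scoring pairs with cosine similarity. A minimal self-contained sketch with toy vectors (the values below are illustrative stand-ins for the model's 768-dimensional output, not real encodings):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors standing in for the 768-dimensional output
# of model.encode (illustrative values only).
emb_clean = [0.1, 0.8, -0.3, 0.4]
emb_noisy = [0.12, 0.75, -0.28, 0.41]  # e.g. the same text with OCR noise

print(round(cosine_sim(emb_clean, emb_noisy), 3))  # → 0.999
```

For an OCR-robust model, the clean and noisy variants of the same passage should score close to 1.0, as in this toy case.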
 
 ## Evaluation Results
 
+I will add the model-specific evaluation results once the evaluation instance is running again.
 
+## Training Details
 
+### Training Dataset
 
+### Contrastive Training
 The model was trained with the parameters:
 
 **Loss**:
 
 `sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:
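The `MultipleNegativesRankingLoss` named above scores each anchor against every positive in the batch and treats the non-matching positives as in-batch negatives. A minimal pure-Python sketch of the computation (the loss parameters are elided in this diff, so the `scale=20.0` default is an assumption taken from the library's usual configuration):

```python
import math

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch MultipleNegativesRankingLoss: anchor i's target is
    positive i; the other positives in the batch act as negatives
    (cross-entropy over scaled cosine similarities)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    per_example = []
    for i, a in enumerate(anchors):
        scores = [scale * cos(a, p) for p in positives]
        log_norm = math.log(sum(math.exp(s) for s in scores))
        per_example.append(-(scores[i] - log_norm))  # -log softmax target
    return sum(per_example) / len(per_example)

# Toy batch of two (anchor, positive) pairs; swapping the positives
# breaks the pairing, so the loss should rise sharply.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
print(mnr_loss(anchors, positives) < mnr_loss(anchors, positives[::-1]))  # True
```

This is a sketch of the loss's math only; the real training loop backpropagates through the transformer producing the embeddings.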
 )
 ```
 
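The title of the introducing paper cited below ("Cheap Character Noise for OCR-Robust Multilingual Embeddings") suggests the adaptation trains on cheaply corrupted text. As a rough, hypothetical illustration of that idea (the confusion table, probability, and function here are my assumptions for demonstration, not the authors' actual pipeline):

```python
import random

# Hypothetical OCR confusion table: single-character misreadings of the
# kind OCR engines commonly produce (illustrative, not the paper's table).
OCR_CONFUSIONS = {"e": "c", "l": "1", "o": "0", "i": "í", "u": "ü"}

def add_ocr_noise(text, prob=0.1, seed=0):
    """Randomly swap characters for common OCR misreadings."""
    rng = random.Random(seed)
    return "".join(
        OCR_CONFUSIONS[ch] if ch in OCR_CONFUSIONS and rng.random() < prob else ch
        for ch in text
    )

print(add_ocr_noise("Die alte Bibliothek von Luxemburg", prob=0.5))
```

Pairing each clean sentence with such a corrupted copy would yield exactly the (anchor, positive) pairs a contrastive loss needs.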
+## Citation
+
+### BibTeX
+
+#### Cheap Character Noise for OCR-Robust Multilingual Embeddings (introducing paper)
+
+```bibtex
+update once available
+```
+
+#### Original Multilingual GTE Model
+
+```bibtex
+@inproceedings{zhang2024mgte,
+  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
+  author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
+  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track},
+  pages={1393--1412},
+  year={2024}
+}
+```