---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
base_model: intfloat/multilingual-e5-small
pipeline_tag: sentence-similarity
library_name: sentence-transformers
license: apache-2.0
language:
- ko
- en
---

<img src="https://cdn-uploads.huggingface.co/production/uploads/642b0c2fecec03b4464a1d9b/IxcqY5qbGNuGpqDciIcOI.webp" width="600">

# SentenceTransformer based on intfloat/multilingual-e5-small

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) on datasets that include Korean query-passage pairs, for improved performance on Korean retrieval tasks. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

This model is a lightweight Korean retriever designed for ease of use and strong performance in practical retrieval tasks. It is well suited to demos and lightweight applications, offering a good balance between speed and accuracy.

For even higher retrieval performance, we recommend combining it with a reranker; a minimal retrieve-then-rerank sketch follows the list. Suggested reranker models:

- dragonkue/bge-reranker-v2-m3-ko
- BAAI/bge-reranker-v2-m3
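
As a rough illustration, the sketch below retrieves with this model and then rescores the candidates with a cross-encoder reranker. The query, passages, and top-k value are placeholders, and it assumes the reranker checkpoints above can be loaded with sentence-transformers' `CrossEncoder`:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder

retriever = SentenceTransformer("dragonkue/multilingual-e5-small-ko")

# Placeholder query and candidate passages (note the E5-style prefixes).
query = "query: 환경표지 인증은 어떤 법률에 근거하는가?"
passages = [
    "passage: 환경표지 인증은 「환경기술 및 환경산업 지원법」 제17조에 근거한다.",
    "passage: 북한 가족법은 1990년에 제정되었다.",
]

# Stage 1: dense retrieval with cosine similarity.
q_emb = retriever.encode([query])
p_emb = retriever.encode(passages)
scores = retriever.similarity(q_emb, p_emb)[0]
top_k = [int(i) for i in scores.argsort(descending=True)[:10]]

# Stage 2: rerank the retrieved candidates with a cross-encoder
# (assumes the reranker works as a CrossEncoder; E5 prefixes are stripped for it).
reranker = CrossEncoder("dragonkue/bge-reranker-v2-m3-ko")
pairs = [(query.removeprefix("query: "), passages[i].removeprefix("passage: ")) for i in top_k]
rerank_scores = reranker.predict(pairs)
print(sorted(zip(rerank_scores, top_k), reverse=True))
```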
 
 
## Model Details

### Model Description

- **Model Type:** Sentence Transformer
- **Base model:** [intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) <!-- at revision c007d7ef6fd86656326059b28395a7a03a7c5846 -->
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 384 dimensions
- **Similarity Function:** Cosine Similarity
- **Training Datasets:** Korean query-passage pairs (see [Training Details](#training-details))

<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("dragonkue/multilingual-e5-small-ko")

# Run inference
sentences = [
    'query: 북한가족법 몇 차 개정에서 이혼판결 확정 후 3개월 내에 등록시에만 유효하다는 조항을 확실히 했을까?',
    'passage: 1990년에 제정된 북한 가족법은 지금까지 4차례 개정되어 현재에 이르고 있다. 1993년에 이루어진 제1차 개정은 주로 규정의 정확성을 기하기 위하여 몇몇 조문을 수정한 것이며, 실체적인 내용을 보완한 것은 상속의 승인과 포기기간을 설정한 제52조 정도라고 할 수 있다. 2004년에 이루어진 제2차에 개정에서는 제20조제3항을 신설하여 재판상 확정된 이혼판결을 3개월 내에 등록해야 이혼의 효력이 발생한다는 것을 명확하게 하였다. 2007년에 이루어진 제3차 개정에서는 부모와 자녀 관계 또한 신분등록기관에 등록한 때부터 법적 효력이 발생한다는 것을 신설(제25조제2항)하였다. 또한 미성년자, 노동능력 없는 자의 부양과 관련(제37조제2항)하여 기존에는 “부양능력이 있는 가정성원이 없을 경우에는 따로 사는 부모나 자녀, 조부모나 손자녀, 형제자매가 부양한다”고 규정하고 있었던 것을 “부양능력이 있는 가정성원이 없을 경우에는 따로 사는 부모나 자녀가 부양하며 그들이 없을 경우에는 조부모나 손자녀, 형제자매가 부양한다”로 개정하였다.',
    'passage: 환경마크 제도, 인증기준 변경으로 기업부담 줄인다\n환경마크 제도 소개\n□ 개요\n○ 동일 용도의 다른 제품에 비해 ‘제품의 환경성*’을 개선한 제품에 로고와 설명을 표시할 수 있도록하는 인증 제도\n※ 제품의 환경성 : 재료와 제품을 제조․소비 폐기하는 전과정에서 오염물질이나 온실가스 등을 배출하는 정도 및 자원과 에너지를 소비하는 정도 등 환경에 미치는 영향력의 정도(「환경기술 및 환경산업 지원법」제2조제5호)\n□ 법적근거\n○ 「환경기술 및 환경산업 지원법」제17조(환경표지의 인증)\n□ 관련 국제표준\n○ ISO 14024(제1유형 환경라벨링)\n□ 적용대상\n○ 사무기기, 가전제품, 생활용품, 건축자재 등 156개 대상제품군\n□ 인증현황\n○ 2,737개 기업의 16,647개 제품(2015.12월말 기준)',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```

### Direct Usage (Transformers)

```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Mean pooling over non-padding tokens.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# Each input text should start with "query: " or "passage: ", even for non-English texts.
# For tasks other than retrieval, you can simply use the "query: " prefix.
input_texts = [
    "query: 북한가족법 몇 차 개정에서 이혼판결 확정 후 3개월 내에 등록시에만 유효하다는 조항을 확실히 했을까?",
    "passage: 1990년에 제정된 북한 가족법은 지금까지 4차례 개정되어 현재에 이르고 있다. 1993년에 이루어진 제1차 개정은 주로 규정의 정확성을 기하기 위하여 몇몇 조문을 수정한 것이며, 실체적인 내용을 보완한 것은 상속의 승인과 포기기간을 설정한 제52조 정도라고 할 수 있다. 2004년에 이루어진 제2차에 개정에서는 제20조제3항을 신설하여 재판상 확정된 이혼판결을 3개월 내에 등록해야 이혼의 효력이 발생한다는 것을 명확하게 하였다. 2007년에 이루어진 제3차 개정에서는 부모와 자녀 관계 또한 신분등록기관에 등록한 때부터 법적 효력이 발생한다는 것을 신설(제25조제2항)하였다. 또한 미성년자, 노동능력 없는 자의 부양과 관련(제37조제2항)하여 기존에는 “부양능력이 있는 가정성원이 없을 경우에는 따로 사는 부모나 자녀, 조부모나 손자녀, 형제자매가 부양한다”고 규정하고 있었던 것을 “부양능력이 있는 가정성원이 없을 경우에는 따로 사는 부모나 자녀가 부양하며 그들이 없을 경우에는 조부모나 손자녀, 형제자매가 부양한다”로 개정하였다.",
    "passage: 환경마크 제도, 인증기준 변경으로 기업부담 줄인다\n환경마크 제도 소개\n□ 개요\n○ 동일 용도의 다른 제품에 비해 ‘제품의 환경성*’을 개선한 제품에 로고와 설명을 표시할 수 있도록하는 인증 제도\n※ 제품의 환경성 : 재료와 제품을 제조․소비 폐기하는 전과정에서 오염물질이나 온실가스 등을 배출하는 정도 및 자원과 에너지를 소비하는 정도 등 환경에 미치는 영향력의 정도(「환경기술 및 환경산업 지원법」제2조제5호)\n□ 법적근거\n○ 「환경기술 및 환경산업 지원법」제17조(환경표지의 인증)\n□ 관련 국제표준\n○ ISO 14024(제1유형 환경라벨링)\n□ 적용대상\n○ 사무기기, 가전제품, 생활용품, 건축자재 등 156개 대상제품군\n□ 인증현황\n○ 2,737개 기업의 16,647개 제품(2015.12월말 기준)",
]

tokenizer = AutoTokenizer.from_pretrained('dragonkue/multilingual-e5-small-ko')
model = AutoModel.from_pretrained('dragonkue/multilingual-e5-small-ko')

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# Normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T)
print(scores.tolist())
```

<!--
### Downstream Usage (Sentence Transformers)

You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

</details>
-->

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

## Evaluation

- This evaluation follows the setup of the [KURE GitHub repository](https://github.com/nlpai-lab/KURE).
- We evaluated the model on all **Korean retrieval benchmarks** registered in [MTEB](https://github.com/embeddings-benchmark/mteb); a reproduction sketch is shown below.
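
As a rough reproduction sketch (the `mteb` API and the task names below follow the public MTEB registry and may differ slightly between library versions):

```python
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("dragonkue/multilingual-e5-small-ko")

# Korean retrieval tasks used in the results table (names as registered in MTEB).
tasks = mteb.get_tasks(tasks=[
    "Ko-StrategyQA", "AutoRAGRetrieval", "MIRACLRetrieval", "PublicHealthQA",
    "BelebeleRetrieval", "MrTidyRetrieval", "XPQARetrieval",
], languages=["kor"])

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/multilingual-e5-small-ko")
```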
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
### Korean Retrieval Benchmarks

- [Ko-StrategyQA](https://huggingface.co/datasets/taeminlee/Ko-StrategyQA): A Korean **ODQA multi-hop retrieval dataset**, translated from StrategyQA.
- [AutoRAGRetrieval](https://huggingface.co/datasets/yjoonjang/markers_bm): A **Korean document retrieval dataset** constructed by parsing PDFs from five domains: **finance, public, medical, legal, and commerce**.
- [MIRACLRetrieval](https://huggingface.co/datasets/miracl/miracl): A **Korean document retrieval dataset** based on Wikipedia.
- [PublicHealthQA](https://huggingface.co/datasets/xhluca/publichealth-qa): A **retrieval dataset** focused on **medical and public health domains** in Korean.
- [BelebeleRetrieval](https://huggingface.co/datasets/facebook/belebele): A **Korean document retrieval dataset** based on FLORES-200.
- [MrTidyRetrieval](https://huggingface.co/datasets/mteb/mrtidy): A **Wikipedia-based Korean document retrieval dataset**.
- [XPQARetrieval](https://huggingface.co/datasets/jinaai/xpqa): A **cross-domain Korean document retrieval dataset**.

### Metrics

- Standard metric: NDCG@10
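
For reference, NDCG@10 compares the model's ranking of the top 10 retrieved passages against the ideal ranking:

$$
\mathrm{DCG@10} = \sum_{i=1}^{10} \frac{2^{rel_i} - 1}{\log_2(i + 1)}, \qquad
\mathrm{NDCG@10} = \frac{\mathrm{DCG@10}}{\mathrm{IDCG@10}}
$$

where $rel_i$ is the graded relevance of the passage at rank $i$ and IDCG@10 is the DCG@10 of the ideal ordering, so scores lie in [0, 1].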

#### Information Retrieval

| Model | Size (M) | Average | XPQARetrieval | PublicHealthQA | MIRACLRetrieval | Ko-StrategyQA | BelebeleRetrieval | AutoRAGRetrieval | MrTidyRetrieval |
|:------|---------:|--------:|--------------:|---------------:|----------------:|--------------:|------------------:|-----------------:|----------------:|
| BAAI/bge-m3 | 560 | 0.724169 | 0.36075 | 0.80412 | 0.70146 | 0.79405 | 0.93164 | 0.83008 | 0.64708 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 560 | 0.724104 | 0.43018 | 0.81679 | 0.66077 | 0.80455 | 0.9271 | 0.83863 | 0.59071 |
| intfloat/multilingual-e5-large | 560 | 0.721607 | 0.3571 | 0.82534 | 0.66486 | 0.80348 | 0.94499 | 0.81337 | 0.64211 |
| intfloat/multilingual-e5-base | 278 | 0.689429 | 0.3607 | 0.77203 | 0.6227 | 0.76355 | 0.92868 | 0.79752 | 0.58082 |
| **dragonkue/multilingual-e5-small-ko** | 118 | 0.688819 | 0.34871 | 0.79729 | 0.61113 | 0.76173 | 0.9297 | 0.86184 | 0.51133 |
| intfloat/multilingual-e5-small | 118 | 0.670906 | 0.33003 | 0.73668 | 0.61238 | 0.75157 | 0.90531 | 0.80068 | 0.55969 |
| ibm-granite/granite-embedding-278m-multilingual | 278 | 0.616466 | 0.23058 | 0.77668 | 0.59216 | 0.71762 | 0.83231 | 0.70226 | 0.46365 |
| ibm-granite/granite-embedding-107m-multilingual | 107 | 0.599759 | 0.23058 | 0.73209 | 0.58413 | 0.70531 | 0.82063 | 0.68243 | 0.44314 |
| sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | 118 | 0.409766 | 0.21345 | 0.67409 | 0.25676 | 0.45903 | 0.71491 | 0.42296 | 0.12716 |

#### Performance Comparison by Model Size (Based on Average NDCG@10)

<img src="https://cdn-uploads.huggingface.co/production/uploads/642b0c2fecec03b4464a1d9b/ZgOwD9nlgVchYBqK4iXTW.png" width="1000"/>

<!--
### Recommendations

*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->

## Training Details

### Training Datasets

This model was fine-tuned on the same dataset used for [dragonkue/snowflake-arctic-embed-l-v2.0-ko](https://huggingface.co/dragonkue/snowflake-arctic-embed-l-v2.0-ko), which consists of Korean query-passage pairs. The training objective was to improve retrieval performance specifically for Korean-language tasks.

### Training Methods

Following the training approach used for dragonkue/snowflake-arctic-embed-l-v2.0-ko, this model constructs in-batch negatives based on clustered passages. In addition, we introduce GISTEmbedLoss with a configurable margin.

**📈 Margin-based Training Results**

- Using the standard MNR (Multiple Negatives Ranking) loss alone resulted in decreased performance.
- The original GISTEmbedLoss (without margin) yielded a modest improvement of around +0.8 NDCG@10.
- Applying a margin led to performance gains of up to +1.5 NDCG@10.
- In other words, simply tuning the margin value roughly doubled the gain over the margin-free loss, showing how sensitive and effective margin scaling is.

This margin-based approach extends the idea proposed in the NV-Retriever paper, which originally filtered false negatives during hard-negative mining. We adapt it to in-batch negatives, treating false negatives as dynamic samples filtered by a margin.

<img src="https://cdn-uploads.huggingface.co/production/uploads/642b0c2fecec03b4464a1d9b/IpDDTshuZ5noxPOdm6gVk.png" width="800"/>

The sentence-transformers library now supports GISTEmbedLoss with margin configuration, making it easy to integrate into any training pipeline; a minimal sketch follows the install command below.

You can install the latest version with:

```bash
pip install -U sentence-transformers
```
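
A minimal sketch of constructing the margin-aware loss (the guide model choice and the margin value are illustrative, and the `margin_strategy`/`margin` arguments assume a recent sentence-transformers release that exposes them):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import GISTEmbedLoss

model = SentenceTransformer("intfloat/multilingual-e5-small")
# Guide model used by GISTEmbedLoss to spot likely false negatives within a batch.
guide = SentenceTransformer("intfloat/multilingual-e5-small")

# In-batch negatives whose guide similarity comes within `margin` of the positive
# pair's similarity are masked out instead of being pushed away.
loss = GISTEmbedLoss(model, guide, temperature=0.01, margin_strategy="absolute", margin=0.1)
```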

### Training Hyperparameters

#### Non-Default Hyperparameters

- `eval_strategy`: steps
- `per_device_train_batch_size`: 20000
- `per_device_eval_batch_size`: 4096
- `learning_rate`: 0.00025
- `num_train_epochs`: 3
- `warmup_ratio`: 0.05
- `fp16`: True
- `dataloader_drop_last`: True
- `batch_sampler`: no_duplicates

#### All Hyperparameters
<details><summary>Click to expand</summary>

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: steps
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 20000
- `per_device_eval_batch_size`: 4096
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 0.00025
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- `num_train_epochs`: 2
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.05
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: True
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: True
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `tp_size`: 0
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`:
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: no_duplicates
- `multi_dataset_batch_sampler`: proportional

</details>
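
For orientation, the non-default values above map onto `SentenceTransformerTrainingArguments` roughly as follows (a sketch only; `output_dir` is a placeholder and the dataset/loss wiring is omitted):

```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="outputs/multilingual-e5-small-ko",  # placeholder
    eval_strategy="steps",
    per_device_train_batch_size=20000,
    per_device_eval_batch_size=4096,
    learning_rate=0.00025,
    num_train_epochs=3,
    warmup_ratio=0.05,
    fp16=True,
    dataloader_drop_last=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoid duplicate texts within a batch
)
```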

### Framework Versions

- Python: 3.11.10
- Sentence Transformers: 4.1.0
- Transformers: 4.51.3
- PyTorch: 2.7.0+cu126
- Accelerate: 1.6.0
- Datasets: 3.5.1
- Tokenizers: 0.21.1

## FAQ

**1. Do I need to add the prefix "query: " and "passage: " to input texts?**

Yes, this is how the model was trained; otherwise you will see a performance degradation.

Here are some rules of thumb:

- Use "query: " and "passage: " correspondingly for asymmetric tasks such as passage retrieval in open QA or ad-hoc information retrieval.
- Use the "query: " prefix for symmetric tasks such as semantic similarity, bitext mining, and paraphrase retrieval.
- Use the "query: " prefix if you want to use embeddings as features, for example in linear-probing classification or clustering.

**2. Why are the cosine similarity scores distributed between roughly 0.7 and 1.0?**

This is known and expected behavior, since we use a low temperature of 0.01 for the InfoNCE contrastive loss.

For text embedding tasks like retrieval or semantic similarity, what matters is the relative order of the scores rather than their absolute values, so this should not be an issue.
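
Concretely, the InfoNCE loss only sees similarities scaled by the temperature, so with $\tau = 0.01$ even small gaps between the positive and the negatives are enough for training, which leaves the raw cosine similarities compressed near the top of the range:

$$
\mathcal{L} = -\log \frac{\exp\left(s(q, p^{+}) / \tau\right)}{\exp\left(s(q, p^{+}) / \tau\right) + \sum_{p^{-}} \exp\left(s(q, p^{-}) / \tau\right)}
$$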

## Citation

### BibTeX

#### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  month = "11",
  year = "2019",
  publisher = "Association for Computational Linguistics",
  url = "https://arxiv.org/abs/1908.10084",
}
```

#### Base Model

```bibtex
@article{wang2024multilingual,
  title = {Multilingual E5 Text Embeddings: A Technical Report},
  author = {Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
  journal = {arXiv preprint arXiv:2402.05672},
  year = {2024}
}
```

#### NV-Retriever

```bibtex
@article{moreira2024nvretriever,
  title = {NV-Retriever: Improving text embedding models with effective hard-negative mining},
  author = {Moreira, Gabriel de Souza P. and Osmulski, Radek and Xu, Mengyao and Ak, Ronay and Schifferer, Benedikt and Oldridge, Even},
  journal = {arXiv preprint arXiv:2407.15831},
  year = {2024},
  url = {https://arxiv.org/abs/2407.15831},
  doi = {10.48550/arXiv.2407.15831}
}
```

## Limitations

Long texts will be truncated to at most 512 tokens.

<!--
## Glossary

*Clearly define terms in order to be accessible across audiences.*
-->

<!--
## Model Card Authors

*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
-->

<!--
## Model Card Contact

*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
-->