---
datasets:
- BAAI/Infinity-Instruct
- opencsg/chinese-fineweb-edu
language:
- zh
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---

## Introduction

This model was trained by [richinfoai](https://www.richinfo.cn/).
Following the [Stella and Jasper models](https://arxiv.org/pdf/2412.19048), we performed distillation training from
[lier007/xiaobu-embedding-v2](https://huggingface.co/lier007/xiaobu-embedding-v2),
[dunzhang/stella-large-zh-v3-1792d](https://huggingface.co/dunzhang/stella-large-zh-v3-1792d)
and [BAAI/bge-multilingual-gemma2](https://huggingface.co/BAAI/bge-multilingual-gemma2).
Thanks to their outstanding performance, our model achieves excellent results on MTEB(cmn, v1).

We believe this model once again demonstrates the effectiveness of distillation training.
In the future, we will train more bilingual embedding models based on a variety of strong embedding-training methods.

## Methods

### Stage1

We use [BAAI/Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct)
and [opencsg/chinese-fineweb-edu](https://huggingface.co/datasets/opencsg/chinese-fineweb-edu)
as training data to distill from the three teacher models above.
In this stage, we use only a cosine loss.
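
A distillation objective of this form can be sketched as follows. This is a minimal illustration, not the actual training code: it assumes the student and teacher produce same-dimensional embeddings for the same inputs, and the function and variable names are hypothetical (NumPy stands in for the training framework):

```python
import numpy as np

def cosine_distillation_loss(student_emb: np.ndarray, teacher_emb: np.ndarray) -> float:
    """Mean (1 - cosine similarity) between paired student/teacher embeddings."""
    # L2-normalize each row so the dot product of paired rows is cosine similarity.
    s = student_emb / np.linalg.norm(student_emb, axis=1, keepdims=True)
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

# Identical embeddings give zero loss; orthogonal ones give a loss of 1.
teacher = np.array([[1.0, 0.0], [0.0, 1.0]])
print(cosine_distillation_loss(teacher, teacher))  # 0.0
```

Minimizing this pulls each student embedding onto the direction of the corresponding teacher embedding, without constraining its magnitude.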

### Stage2

The objective of stage2 is dimensionality reduction.
We use the same training data as in stage1 with a `similarity loss`. After stage2, the output dimensionality of our model is 1792.
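
One way to read the `similarity loss` is as preserving the pairwise similarity structure of the full-dimension embeddings after the reduction. The sketch below is our interpretation under that assumption, not the model's training code; names are hypothetical and NumPy stands in for the training framework:

```python
import numpy as np

def pairwise_cosine(x: np.ndarray) -> np.ndarray:
    # Row-normalize, then the Gram matrix holds all pairwise cosine similarities.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T

def similarity_loss(reduced_emb: np.ndarray, full_emb: np.ndarray) -> float:
    """MSE between the pairwise cosine-similarity matrices of the
    reduced-dimension embeddings and the full-dimension ones."""
    return float(np.mean((pairwise_cosine(reduced_emb) - pairwise_cosine(full_emb)) ** 2))

# A rotation preserves all pairwise similarities, so the loss is ~0.
rng = np.random.default_rng(0)
full = rng.normal(size=(4, 8))
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix
print(similarity_loss(full @ q, full))  # ~0.0
```

Under this objective the reduced embeddings are free to live in any basis, as long as documents that were close (or far) in the original space stay close (or far).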

## Usage

This model does not require instruction prefixes; you can use it directly with `SentenceTransformer`:

```python
import os

os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
from sentence_transformers import SentenceTransformer

text_encoder = SentenceTransformer("richinfoai/ritrieve_zh_v1")
texts = [
    "什么是人工智能",  # "What is artificial intelligence?"
    "介绍一下主流的LLM",  # "Give an overview of the mainstream LLMs."
    # "Artificial intelligence (AI) refers to computer systems that simulate human
    # intelligence and can perform tasks such as learning, reasoning, and decision-making.
    # It achieves automation through algorithms and big data and is widely applied across industries."
    "人工智能(AI)是模拟人类智能的计算机系统,能够执行学习、推理和决策等任务。它通过算法和大数据实现自动化,广泛应用于各行各业。"
]
vectors = text_encoder.encode(texts, normalize_embeddings=True)
print(vectors @ vectors.T)
# [[0.9999999  0.67707014 0.91421044]
#  [0.67707014 0.9999998  0.6353945 ]
#  [0.91421044 0.6353945  1.0000001 ]]
```
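
Because `normalize_embeddings=True` L2-normalizes the output, the dot products above are cosine similarities, so ranking documents by dot product against a query vector is retrieval. A small sketch with stand-in unit vectors (the values are illustrative, not real model output):

```python
import numpy as np

# Stand-in for normalized embeddings: row 0 is the query, rows 1-2 are documents.
vectors = np.array([
    [0.8, 0.6, 0.0],  # query
    [0.6, 0.8, 0.0],  # doc 0: points in a similar direction to the query
    [0.0, 0.0, 1.0],  # doc 1: orthogonal to the query
])
scores = vectors[1:] @ vectors[0]  # cosine similarity, since rows are unit-norm
ranking = np.argsort(-scores)     # best match first
print(ranking)  # doc 0 ranks first
```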