Commit 21c68d6 by richinfoai · verified · 1 Parent(s): 5ee70a3

Update README.md

  1. README.md +58 -49
README.md CHANGED
@@ -1,49 +1,58 @@
- ## Introduction
-
- This model was trained by [richinfoai](https://www.richinfo.cn/).
- Followed [Stella and Jasper models](https://arxiv.org/pdf/2412.19048), we do distillation training from
- [lier007/xiaobu-embedding-v2](https://huggingface.co/lier007/xiaobu-embedding-v2),
- [dunzhang/stella-large-zh-v3-1792d](https://huggingface.co/dunzhang/stella-large-zh-v3-1792d)
- and [BAAI/bge-multilingual-gemma2](https://huggingface.co/BAAI/bge-multilingual-gemma2).
- Thanks to their outstanding performance, our model has achieved excellent results on MTEB(cmn, v1).
-
- We believe this model once again demonstrates the effectiveness of distillation learning.
- In the future, we will train more bilingual vector models based on various excellent vector training methods.
-
- ## Methods
-
- ### Stage1
-
- We use [BAAI/Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct)
- and [opencsg/chinese-fineweb-edu](https://huggingface.co/datasets/opencsg/chinese-fineweb-edu)
- as training data to do a distillation from the above three models.
- In this stage, we only use cosine-loss.
-
- ### Stage2
-
- The objective of stage2 is reducing dimensions.
- We use the same training data as the stage1 with `similarity loss`. After stage2, the dimensions of our model is 1792.
-
- ## Usage
-
- This model does not need instructions and you can use it in `SentenceTransformer`:
-
- ```python
- import os
-
- os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
- from sentence_transformers import SentenceTransformer
-
- text_encoder = SentenceTransformer("richinfoai/ritrieve_zh_v1")
- texts = [
- "什么是人工智能",
- "介绍一下主流的LLM",
- "人工智能(AI)是模拟人类智能的计算机系统,能够执行学习、推理和决策等任务。它通过算法和大数据实现自动化,广泛应用于各行各业。"
- ]
- vectors = text_encoder.encode(texts, normalize_embeddings=True)
- print(vectors @ vectors.T)
- # [[0.9999999 0.67707014 0.91421044]
- # [0.67707014 0.9999998 0.6353945 ]
- # [0.91421044 0.6353945 1.0000001 ]]
-
- ```
+ ---
+ datasets:
+ - BAAI/Infinity-Instruct
+ - opencsg/chinese-fineweb-edu
+ language:
+ - zh
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
+ ---
+ ## Introduction
+
+ This model was trained by [richinfoai](https://www.richinfo.cn/).
+ Following the [Stella and Jasper models](https://arxiv.org/pdf/2412.19048), we distill from three teacher models:
+ [lier007/xiaobu-embedding-v2](https://huggingface.co/lier007/xiaobu-embedding-v2),
+ [dunzhang/stella-large-zh-v3-1792d](https://huggingface.co/dunzhang/stella-large-zh-v3-1792d)
+ and [BAAI/bge-multilingual-gemma2](https://huggingface.co/BAAI/bge-multilingual-gemma2).
+ Thanks to the outstanding performance of these teachers, our model achieves excellent results on MTEB(cmn, v1).
+
+ We believe this model once again demonstrates the effectiveness of distillation learning.
+ In the future, we will train more bilingual vector models based on various excellent vector training methods.
+
+ ## Methods
+
+ ### Stage1
+
+ We use [BAAI/Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct)
+ and [opencsg/chinese-fineweb-edu](https://huggingface.co/datasets/opencsg/chinese-fineweb-edu)
+ as training data to distill from the three models above.
+ In this stage, we use only the cosine loss.
+
+ ### Stage2
+
+ The objective of stage 2 is to reduce the embedding dimension.
+ We use the same training data as in stage 1, with the `similarity loss`. After stage 2, the embedding dimension of our model is 1792.
+
+ ## Usage
+
+ This model does not need instructions, and you can use it with `SentenceTransformer`:
+
+ ```python
+ import os
+
+ os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"  # optional: download through a Hugging Face mirror
+ from sentence_transformers import SentenceTransformer
+
+ text_encoder = SentenceTransformer("richinfoai/ritrieve_zh_v1")
+ texts = [
+     "什么是人工智能",  # "What is artificial intelligence"
+     "介绍一下主流的LLM",  # "Introduce the mainstream LLMs"
+     "人工智能(AI)是模拟人类智能的计算机系统,能够执行学习、推理和决策等任务。它通过算法和大数据实现自动化,广泛应用于各行各业。"  # a short definition of AI
+ ]
+ vectors = text_encoder.encode(texts, normalize_embeddings=True)
+ print(vectors @ vectors.T)  # embeddings are L2-normalized, so this prints pairwise cosine similarities
+ # [[0.9999999 0.67707014 0.91421044]
+ # [0.67707014 0.9999998 0.6353945 ]
+ # [0.91421044 0.6353945 1.0000001 ]]
+
+ ```
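
The two losses named in the Methods section can be sketched as follows. This is only an illustration, assuming the definitions used in the Stella and Jasper paper linked from the README: the cosine loss pulls each student vector toward its teacher vector, and the similarity loss matches the pairwise similarity matrices of a batch, which is why it still works when the student has fewer dimensions than the teacher. The function names, the averaging over the batch, and the random toy data are assumptions made for the example, not the authors' training code.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def cosine_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """Mean of (1 - cosine similarity) between paired rows.

    Requires student and teacher vectors of the same dimension.
    """
    s, t = l2_normalize(student), l2_normalize(teacher)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))


def similarity_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """Mean squared error between the two batch similarity matrices.

    Only the batch size must match, so the student may have a
    smaller embedding dimension than the teacher.
    """
    s, t = l2_normalize(student), l2_normalize(teacher)
    return float(np.mean((s @ s.T - t @ t.T) ** 2))


# Toy data (hypothetical): a batch of 4 texts, 16-dim teacher vectors.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 16))
student_full = teacher + 0.1 * rng.normal(size=(4, 16))  # same dimension
student_small = rng.normal(size=(4, 8))                  # reduced dimension

print("cosine loss:", cosine_loss(student_full, teacher))
print("similarity loss:", similarity_loss(student_small, teacher))
```

Both values are zero when the student reproduces the teacher exactly, and the cosine loss reaches its maximum of 2 when every student vector points in the opposite direction of its teacher vector.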