richinfoai committed on
Commit
5ee70a3
·
verified ·
1 Parent(s): bb92c87

Upload README.md

Files changed (1)
  1. README.md +49 -0
README.md ADDED
@@ -0,0 +1,49 @@
+ ## Introduction
+
+ This model was trained by [richinfoai](https://www.richinfo.cn/).
+ Following the [Stella and Jasper models](https://arxiv.org/pdf/2412.19048), we distill from three teacher models:
+ [lier007/xiaobu-embedding-v2](https://huggingface.co/lier007/xiaobu-embedding-v2),
+ [dunzhang/stella-large-zh-v3-1792d](https://huggingface.co/dunzhang/stella-large-zh-v3-1792d)
+ and [BAAI/bge-multilingual-gemma2](https://huggingface.co/BAAI/bge-multilingual-gemma2).
+ Thanks to the strong performance of these teachers, our model achieves excellent results on MTEB(cmn, v1).
+
+ We believe this model once again demonstrates the effectiveness of distillation.
+ In the future, we will train more bilingual embedding models based on various excellent embedding training methods.
+
+ ## Methods
+
+ ### Stage 1
+
+ We use [BAAI/Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct)
+ and [opencsg/chinese-fineweb-edu](https://huggingface.co/datasets/opencsg/chinese-fineweb-edu)
+ as training data to distill from the three teacher models above.
+ In this stage, we use only a cosine loss.
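As a rough illustration of a cosine loss for distillation, the sketch below averages `1 - cos(student, teacher)` over paired vectors. It assumes the student and teacher vectors share the same dimension (in the Stella/Jasper paper the teacher target is the concatenation of the teachers' vectors; whether this model does the same is an assumption). The function names and toy values are hypothetical and are not taken from the authors' training code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_loss(student_vecs, teacher_vecs):
    """Mean of (1 - cosine similarity) over paired student/teacher vectors."""
    total = 0.0
    for s, t in zip(student_vecs, teacher_vecs):
        s, t = l2_normalize(s), l2_normalize(t)
        total += 1.0 - sum(a * b for a, b in zip(s, t))
    return total / len(student_vecs)


# Toy 3-dimensional vectors (hypothetical values, not real model outputs).
student = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
teacher = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
loss = cosine_loss(student, teacher)
print(loss)  # 0.5: the first pair is aligned, the second is orthogonal
```

The loss is zero only when every student vector points in the same direction as its teacher vector, which is why it requires matching dimensions.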
+
+ ### Stage 2
+
+ The objective of stage 2 is to reduce the embedding dimension.
+ We use the same training data as in stage 1, with `similarity loss`. After stage 2, the dimension of our model is 1792.
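One plausible reading of `similarity loss`, following the Stella/Jasper paper, is a mean squared error between the student's and the teacher's pairwise similarity matrices within a batch. The sketch below is written under that assumption; the function names and toy vectors are hypothetical. Because only text-to-text similarities are compared, the student and teacher may have different dimensions, which is what makes this kind of loss usable for dimension reduction.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(vecs):
    """Pairwise cosine similarities within one batch."""
    unit = [l2_normalize(v) for v in vecs]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


def similarity_loss(student_vecs, teacher_vecs):
    """Mean squared error between student and teacher similarity matrices."""
    s_sim = similarity_matrix(student_vecs)
    t_sim = similarity_matrix(teacher_vecs)
    n = len(s_sim)
    squared = [(s_sim[i][j] - t_sim[i][j]) ** 2 for i in range(n) for j in range(n)]
    return sum(squared) / (n * n)


# The student has 2 dimensions and the teacher has 3 (toy, hypothetical values).
teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
matched = similarity_loss([[1.0, 0.0], [0.0, 1.0]], teacher)
mismatched = similarity_loss([[1.0, 0.0], [1.0, 0.0]], teacher)
print(matched, mismatched)  # 0.0 0.5
```

In the first call the student preserves the teacher's similarity structure despite having fewer dimensions, so the loss is zero; in the second the student collapses both texts onto one vector and is penalized.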
+
+ ## Usage
+
+ This model does not need instructions, and you can use it with `SentenceTransformer`:
+
+ ```python
+ import os
+
+ # Download through the hf-mirror.com endpoint; set this before importing sentence_transformers.
+ os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
+ from sentence_transformers import SentenceTransformer
+
+ text_encoder = SentenceTransformer("richinfoai/ritrieve_zh_v1")
+ texts = [
+     "什么是人工智能",  # "What is artificial intelligence"
+     "介绍一下主流的LLM",  # "Introduce the mainstream LLMs"
+     "人工智能(AI)是模拟人类智能的计算机系统,能够执行学习、推理和决策等任务。它通过算法和大数据实现自动化,广泛应用于各行各业。"  # a passage defining AI
+ ]
+ vectors = text_encoder.encode(texts, normalize_embeddings=True)
+ print(vectors @ vectors.T)  # pairwise cosine similarities, since the embeddings are normalized
+ # [[0.9999999 0.67707014 0.91421044]
+ # [0.67707014 0.9999998 0.6353945 ]
+ # [0.91421044 0.6353945 1.0000001 ]]
+ ```
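To show how such a similarity matrix might be used for retrieval, the sketch below ranks the other texts against the first one. The matrix values are copied from the printed output above; `rank_candidates` is a hypothetical helper, not part of `sentence_transformers`.

```python
# Similarity matrix copied from the example output above.
similarities = [
    [0.9999999, 0.67707014, 0.91421044],
    [0.67707014, 0.9999998, 0.6353945],
    [0.91421044, 0.6353945, 1.0000001],
]


def rank_candidates(sim_row, query_index):
    """Return candidate indices by descending similarity, excluding the query itself."""
    candidates = [i for i in range(len(sim_row)) if i != query_index]
    return sorted(candidates, key=lambda i: sim_row[i], reverse=True)


# Treat the first text ("什么是人工智能") as the query.
ranking = rank_candidates(similarities[0], query_index=0)
print(ranking)  # [2, 1]
```

The passage defining AI (index 2) ranks above the unrelated LLM question (index 1), matching the scores 0.914 and 0.677 in the first row.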