---
datasets:
- BAAI/Infinity-Instruct
- opencsg/chinese-fineweb-edu
language:
- zh
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---

## Introduction

This model was trained by [richinfoai](https://www.richinfo.cn/).
Following the [Stella and Jasper models](https://arxiv.org/pdf/2412.19048), we performed distillation training from
[lier007/xiaobu-embedding-v2](https://huggingface.co/lier007/xiaobu-embedding-v2),
[dunzhang/stella-large-zh-v3-1792d](https://huggingface.co/dunzhang/stella-large-zh-v3-1792d)
and [BAAI/bge-multilingual-gemma2](https://huggingface.co/BAAI/bge-multilingual-gemma2).
Thanks to their outstanding performance, our model achieves excellent results on MTEB(cmn, v1).

We believe this model once again demonstrates the effectiveness of distillation training.
In the future, we will train more bilingual embedding models based on a variety of strong embedding-training methods.

## Methods

### Stage1

We use [BAAI/Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct)
and [opencsg/chinese-fineweb-edu](https://huggingface.co/datasets/opencsg/chinese-fineweb-edu)
as training data to distill from the three teacher models above.
In this stage, we use only a cosine loss.
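
A distillation objective of this form can be sketched as follows. This is a minimal illustration, not the actual training code: it assumes the student and teacher produce same-dimensional embeddings for the same inputs, and the function and variable names are hypothetical (NumPy stands in for the training framework):

```python
import numpy as np

def cosine_distillation_loss(student_emb: np.ndarray, teacher_emb: np.ndarray) -> float:
    """Mean (1 - cosine similarity) between paired student/teacher embeddings."""
    # L2-normalize each row so the dot product of paired rows is cosine similarity.
    s = student_emb / np.linalg.norm(student_emb, axis=1, keepdims=True)
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

# Identical embeddings give zero loss; orthogonal ones give a loss of 1.
teacher = np.array([[1.0, 0.0], [0.0, 1.0]])
print(cosine_distillation_loss(teacher, teacher))  # 0.0
```

Minimizing this pulls each student embedding onto the direction of the corresponding teacher embedding, without constraining its magnitude.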

### Stage2

The objective of stage2 is dimensionality reduction.
We use the same training data as in stage1 with a `similarity loss`. After stage2, the output dimensionality of our model is 1792.
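
One way to read the `similarity loss` is as preserving the pairwise similarity structure of the full-dimension embeddings after the reduction. The sketch below is our interpretation under that assumption, not the model's training code; names are hypothetical and NumPy stands in for the training framework:

```python
import numpy as np

def pairwise_cosine(x: np.ndarray) -> np.ndarray:
    # Row-normalize, then the Gram matrix holds all pairwise cosine similarities.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T

def similarity_loss(reduced_emb: np.ndarray, full_emb: np.ndarray) -> float:
    """MSE between the pairwise cosine-similarity matrices of the
    reduced-dimension embeddings and the full-dimension ones."""
    return float(np.mean((pairwise_cosine(reduced_emb) - pairwise_cosine(full_emb)) ** 2))

# A rotation preserves all pairwise similarities, so the loss is ~0.
rng = np.random.default_rng(0)
full = rng.normal(size=(4, 8))
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix
print(similarity_loss(full @ q, full))  # ~0.0
```

Under this objective the reduced embeddings are free to live in any basis, as long as documents that were close (or far) in the original space stay close (or far).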

## Usage

This model does not require instruction prefixes; you can use it directly with `SentenceTransformer`:

```python
import os

os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
from sentence_transformers import SentenceTransformer

text_encoder = SentenceTransformer("richinfoai/ritrieve_zh_v1")
texts = [
    "什么是人工智能",  # "What is artificial intelligence?"
    "介绍一下主流的LLM",  # "Give an overview of the mainstream LLMs."
    # "Artificial intelligence (AI) refers to computer systems that simulate human
    # intelligence and can perform tasks such as learning, reasoning, and decision-making.
    # It achieves automation through algorithms and big data and is widely applied across industries."
    "人工智能(AI)是模拟人类智能的计算机系统,能够执行学习、推理和决策等任务。它通过算法和大数据实现自动化,广泛应用于各行各业。"
]
vectors = text_encoder.encode(texts, normalize_embeddings=True)
print(vectors @ vectors.T)
# [[0.9999999  0.67707014 0.91421044]
#  [0.67707014 0.9999998  0.6353945 ]
#  [0.91421044 0.6353945  1.0000001 ]]
```
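
Because `normalize_embeddings=True` L2-normalizes the output, the dot products above are cosine similarities, so ranking documents by dot product against a query vector is retrieval. A small sketch with stand-in unit vectors (the values are illustrative, not real model output):

```python
import numpy as np

# Stand-in for normalized embeddings: row 0 is the query, rows 1-2 are documents.
vectors = np.array([
    [0.8, 0.6, 0.0],  # query
    [0.6, 0.8, 0.0],  # doc 0: points in a similar direction to the query
    [0.0, 0.0, 1.0],  # doc 1: orthogonal to the query
])
scores = vectors[1:] @ vectors[0]  # cosine similarity, since rows are unit-norm
ranking = np.argsort(-scores)     # best match first
print(ranking)  # doc 0 ranks first
```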