Yuan-embedding-1.0
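Yuan-embedding-1.0 is an embedding model designed specifically for Chinese text retrieval. Built on the xiaobu model architecture (a bert-large architecture), it applies new methods for dataset construction, generation, and cleaning, combined with two-stage fine-tuning, to reach leading accuracy on the Retrieval task (Hugging Face C-MTEB leaderboard [1]). Positive and negative examples are generated with the Yuan2.0-M32 large language model [2]. The main contributions are: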
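- In hard negative sampling, a rerank model (bge-reranker-large [3]) is used to rank and filter the data.
- New queries and corpus passages are generated iteratively with the Yuan2.0-M32 model.
- The model is fine-tuned with the MRL method.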
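The hard negative filtering step can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `select_hard_negatives`, the `margin` and `top_k` parameters, and the scores themselves are hypothetical, and the scores are hard-coded here where a reranker such as bge-reranker-large would normally supply them. The idea shown is to keep candidates that the reranker rates highly, while discarding those that score too close to the known positive and may therefore be unlabeled positives.

```python
from typing import List, Tuple


def select_hard_negatives(
    candidates: List[str],
    scores: List[float],
    positive_score: float,
    margin: float = 0.1,   # hypothetical safety gap below the positive's score
    top_k: int = 2,        # hypothetical number of negatives to keep per query
) -> List[Tuple[str, float]]:
    """Return the highest-scoring candidates that still score clearly below the positive."""
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    # Drop candidates scoring within `margin` of the positive (or above it):
    # these are likely false negatives.
    kept = [(c, s) for c, s in zip(candidates, scores) if s < positive_score - margin]
    # Sort by score, highest first, so the most confusable negatives come first.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


# Illustrative relevance scores; in practice a reranker would produce them.
candidates = ["passage A", "passage B", "passage C", "passage D"]
scores = [0.95, 0.70, 0.40, 0.05]
hard_negatives = select_hard_negatives(candidates, scores, positive_score=0.98)
print(hard_negatives)
```

In this toy run, "passage A" is dropped because its score is within the margin of the positive, and the two next-highest candidates are kept as hard negatives.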
Usage
pip install -U sentence-transformers==3.1.1
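Example usage:
from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub
model = SentenceTransformer("IEIYuan/Yuan-embedding-1.0")

# Two short Chinese sample sentences
sentences = [
    "这是一个样例-1",
    "这是一个样例-2",
]

# Encode the sentences into embedding vectors
embeddings = model.encode(sentences)

# Compute the pairwise similarity matrix between all embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)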
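To make the similarity step more concrete, the sketch below computes cosine similarity by hand, which is what `model.similarity` uses by default unless the model configuration selects another function. It also shows how a prefix of an embedding can be kept and re-normalized, assuming that "MRL" refers to Matryoshka Representation Learning. The helper names and toy vectors are hypothetical, and the source does not state which truncated dimensions, if any, this model supports.

```python
import math
from typing import List, Sequence


def truncate_and_normalize(vector: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` coordinates and rescale the result to unit length."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and the vector length")
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero vector cannot be normalized
    return [x / norm for x in head]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 4-dimensional vectors standing in for real model.encode(...) outputs.
emb_a = [3.0, 4.0, 0.0, 0.0]
emb_b = [4.0, 3.0, 1.0, 2.0]

full_similarity = cosine_similarity(emb_a, emb_b)
short_similarity = cosine_similarity(
    truncate_and_normalize(emb_a, 2),
    truncate_and_normalize(emb_b, 2),
)
print(full_similarity, short_similarity)
```

With real embeddings, comparing the full and truncated similarities on a held-out set is one way to check how much retrieval quality is lost at a smaller dimension. sentence-transformers also accepts a `truncate_dim` argument when constructing `SentenceTransformer`, which applies the truncation during encoding.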
Reference
Evaluation results
- cosine_pearson on MTEB AFQMC (default), validation set, self-reported: 56.399
- cosine_spearman on MTEB AFQMC (default), validation set, self-reported: 60.298
- manhattan_pearson on MTEB AFQMC (default), validation set, self-reported: 58.344
- manhattan_spearman on MTEB AFQMC (default), validation set, self-reported: 59.634
- euclidean_pearson on MTEB AFQMC (default), validation set, self-reported: 58.332
- euclidean_spearman on MTEB AFQMC (default), validation set, self-reported: 59.633
- main_score on MTEB AFQMC (default), validation set, self-reported: 60.298
- cosine_pearson on MTEB ATEC (default), test set, self-reported: 56.419
- cosine_spearman on MTEB ATEC (default), test set, self-reported: 58.498
- manhattan_pearson on MTEB ATEC (default), test set, self-reported: 62.053