---
language:
- ja
license_name: sarashina-non-commercial-license
license_link: LICENSE
base_model:
- sbintuitions/sarashina2.2-1b
tags:
- transformers
- sentence-similarity
- feature-extraction
- sentence-transformers
inference: false
---
# Sarashina-Embedding-v2-1B
**[日本語のREADME/Japanese README](https://huggingface.co/sbintuitions/sarashina-embedding-v2-1b/blob/main/README_JA.md)**
"Sarashina-Embedding-v2-1B" is a Japanese text embedding model, based on the Japanese LLM "[Sarashina2.2-1B](https://huggingface.co/sbintuitions/sarashina2.2-1b)".
We trained this model with multi-stage contrastive learning. We achieved the state-of-the-art average score across 28 datasets in [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB) (Japanese Massive Text Embedding Benchmark).(Benchmarked on July 28, 2025. )
This model maps sentences & paragraphs to a 1792-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and other applications.
## Model Details
### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [Sarashina2.2-1B](https://huggingface.co/sbintuitions/sarashina2.2-1b)
- **Maximum Sequence Length:** 8,192 tokens
- **Output Dimensionality:** 1,792 dimensions
- **Similarity Function:** Cosine Similarity
- **Language:** Japanese
- **License:** [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v2-1b/blob/main/LICENSE)
### Full Model Architecture
```
SentenceTransformer(
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: LlamaModel
(1): Pooling({'word_embedding_dimension': 1792, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': False})
)
```
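Note the `pooling_mode_lasttoken: True` setting: the sentence embedding is the hidden state of the last non-padding token, not a mean over token embeddings. The snippet below is a minimal sketch of what that pooling step computes for a right-padded batch; the helper name `last_token_pool` is ours, and the real logic lives in `sentence_transformers.models.Pooling`.

```python
import torch

def last_token_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Select the hidden state of each sequence's last non-padding token.

    hidden_states: (batch, seq_len, hidden_dim) token embeddings from the LLM
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    last_idx = attention_mask.sum(dim=1) - 1         # index of the last real token per sequence
    batch_idx = torch.arange(hidden_states.size(0))  # 0..batch-1
    return hidden_states[batch_idx, last_idx]        # (batch, hidden_dim)

# Toy example: batch of 2, seq_len 4, hidden_dim 1792 (matching the model's output size)
hidden = torch.randn(2, 4, 1792)
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 1, 1]])
print(last_token_pool(hidden, mask).shape)  # torch.Size([2, 1792])
```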
## Usage
First install the [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) library:
```bash
pip install sentence-transformers==4.0.2
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("sbintuitions/sarashina-embedding-v2-1b")
# Run inference
query = [
    # EN: "task: Given a query, retrieve relevant passages that answer the given web search query.\nquery: Is there a Sarashina text embedding model?"
    'task: クエリを与えるので、与えられたWeb検索クエリに答える関連文章を検索してください。\nquery: Sarashinaのテキスト埋め込みモデルはありますか?'
]
texts = [
    # EN: "Sarashina Nikki is a memoir written in the mid-Heian period by the daughter of Sugawara no Takasue."
    'text: 更級日記は、平安時代中期に菅原孝標女によって書かれた回想録です。',
    # EN: "Sarashina is a Japanese large language model developed by SB Intuitions. 7B, 13B, 70B, and 8x70B models have been released so far."
    'text: Sarashinaは、SB Intuitionsが開発した日本語大規模言語モデルです。これまでに7B, 13B, 70B, 8x70Bのモデルが公開されています。',
    # EN: "Sarashina Embedding is a Japanese embedding model based on a Japanese language model."
    'text: サラシナエンベディングは日本語言語モデルをベースにした日本語埋め込みモデルです。'
]
query_embedding = model.encode(query)
text_embeddings = model.encode(texts)
# Get the similarity scores between the embeddings
similarities = model.similarity(query_embedding, text_embeddings)
print(similarities)
# tensor([[0.7403, 0.8651, 0.8775]])
```
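In this example, the query is formatted with a retrieval instruction (`task: {Instruction}\nquery: {Query}`), while each document carries the plain `text: ` prefix. The next section explains these prefixes and lists recommended templates for each task.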
### How to add instructions and prefixes
Use different prefix formats for the query side and the document side. On the query side, add the `task:` prefix followed by an instruction, then the `query:` prefix. (For the STS task only, both sentences are treated as queries and should be prefixed with the same instruction.)
- Query side: ```task: {Instruction}\nquery: {Query}```
- Document side: ```text: {Document}```
### Templates for instructions and prefixes
The table below provides instruction and prefix templates for five main tasks.
|Task|Query Side|Document Side|
|:-:|:-|:-|
|Retrieval<br>Reranking|task: 質問を与えるので、その質問に答えるのに役立つ関連文書を検索してください。\nquery: |text: |
|Clustering|task: 与えられたドキュメントのトピックまたはテーマを特定してください。\nquery: | - |
|Classification|task: 与えられたレビューを適切な評価カテゴリに分類してください。\nquery: | - |
|STS|task: クエリを与えるので,もっともクエリに意味が似ている一節を探してください。\nquery: |task: クエリを与えるので,もっともクエリに意味が似ている一節を探してください。\nquery: |
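As an illustration, a query and documents can be formatted with the Retrieval template from the table before encoding. The helpers `format_query` and `format_document` below are our own convenience functions, not part of the model's API:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sbintuitions/sarashina-embedding-v2-1b")

# Retrieval/Reranking instruction from the table above
RETRIEVAL_INSTRUCTION = "質問を与えるので、その質問に答えるのに役立つ関連文書を検索してください。"

def format_query(query: str, instruction: str) -> str:
    # Query side: "task: {Instruction}\nquery: {Query}"
    return f"task: {instruction}\nquery: {query}"

def format_document(document: str) -> str:
    # Document side: "text: {Document}"
    return f"text: {document}"

query_emb = model.encode([format_query("Sarashinaのテキスト埋め込みモデルはありますか?", RETRIEVAL_INSTRUCTION)])
doc_emb = model.encode([format_document("サラシナエンベディングは日本語言語モデルをベースにした日本語埋め込みモデルです。")])
print(model.similarity(query_emb, doc_emb))
```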
## Training
Sarashina-Embedding-v2-1B is created through the following three-stage learning process:
### Stage 1: Weakly-supervised Learning
To build a general-purpose and high-performance embedding model for a wide range of domains, we employed contrastive learning using weak supervision data, which consists of our own web-crawled data and open datasets.
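This card does not spell out the exact training objective, but contrastive learning over (query, document) pairs is typically implemented as an InfoNCE loss with in-batch negatives. The sketch below shows that standard recipe; the temperature value and the use of in-batch negatives are illustrative assumptions, not confirmed details of this model's training.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """In-batch-negative contrastive loss over (query, positive document) pairs.

    Row i of doc_emb is the positive for row i of query_emb; every other row
    in the batch serves as a negative.
    """
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                     # (batch, batch) scaled cosine similarities
    labels = torch.arange(q.size(0), device=q.device)  # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

# Toy example with random embeddings
print(info_nce_loss(torch.randn(8, 1792), torch.randn(8, 1792)))
```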
### Stage 2: Supervised Fine-tuning
To train the model to better capture the similarity between queries and documents, we fine-tuned it on higher-quality data than that used in Stage 1. We additionally trained multiple models by varying parts of this data.
### Stage 3: Model Merging
To enhance performance, we linearly merged the weights of the two Stage 2 models with the highest JMTEB scores.
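Linear merging interpolates two checkpoints parameter by parameter. A minimal sketch follows; the 50/50 ratio and the file paths are placeholders, since the actual merge ratio is not stated in this card.

```python
import torch

def linear_merge(state_a: dict, state_b: dict, alpha: float = 0.5) -> dict:
    """Return alpha * A + (1 - alpha) * B for every parameter tensor."""
    assert state_a.keys() == state_b.keys(), "checkpoints must share the same parameters"
    return {name: alpha * state_a[name] + (1.0 - alpha) * state_b[name] for name in state_a}

# Hypothetical usage with two fine-tuned checkpoints:
# merged = linear_merge(torch.load("model_a.bin"), torch.load("model_b.bin"), alpha=0.5)
# torch.save(merged, "merged_model.bin")
```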
## Evaluation Results with [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB) (*)
|Model|Avg.|Retrieval|STS|Classification|Reranking|Clustering|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|Sarashina-Embedding-v2-1B (This model)|**76.38**|**76.48**|**84.22**|77.14|**86.28**|52.56|
|[cl-nagoya/ruri-v3-310m](https://huggingface.co/cl-nagoya/ruri-v3-310m)|75.85|76.03|81.59|**77.65**|85.84|50.52|
|[sbintuitions/sarashina-embedding-v1-1b](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b)|74.87|74.53|81.71|77.20|84.36|50.30|
|[OpenAI/text-embedding-3-large](https://openai.com/ja-JP/index/new-embedding-models-and-api-updates/)|73.86|71.95|82.52|77.27|83.06|51.82|
(*) Evaluated on July 28, 2025.
## License
This model is licensed under [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v2-1b/blob/main/LICENSE).
**If you are interested in using this model for commercial purposes, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/contact/).**