|
--- |
|
language: |
|
- ja |
|
license_name: sarashina-non-commercial-license
|
license_link: LICENSE |
|
base_model: |
|
- sbintuitions/sarashina2.2-1b |
|
tags: |
|
- transformers |
|
- sentence-similarity |
|
- feature-extraction |
|
- sentence-transformers |
|
inference: false |
|
--- |
|
|
|
# Sarashina-Embedding-v2-1B |
|
|
|
**[日本語のREADME/Japanese README](https://huggingface.co/sbintuitions/sarashina-embedding-v2-1b/blob/main/README_JA.md)** |
|
|
|
"Sarashina-Embedding-v2-1B" is a Japanese text embedding model, based on the Japanese LLM "[Sarashina2.2-1B](https://huggingface.co/sbintuitions/sarashina2.2-1b)". |
|
We trained this model with multi-stage contrastive learning, and it achieved the state-of-the-art average score across the 28 datasets of [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB) (Japanese Massive Text Embedding Benchmark), benchmarked on July 28, 2025.
|
|
|
This model maps sentences & paragraphs to a 1792-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and other applications. |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
- **Model Type:** Sentence Transformer |
|
- **Base model:** [Sarashina2.2-1B](https://huggingface.co/sbintuitions/sarashina2.2-1b) |
|
- **Maximum Sequence Length:** 8,192 tokens |
|
- **Output Dimensionality:** 1,792 dimensions |
|
- **Similarity Function:** Cosine Similarity |
|
- **Language:** Japanese |
|
- **License:** [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v2-1b/blob/main/LICENSE) |
|
|
|
### Full Model Architecture |
|
|
|
``` |
|
SentenceTransformer( |
|
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: LlamaModel |
|
(1): Pooling({'word_embedding_dimension': 1792, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': False}) |
|
) |
|
``` |
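As the configuration shows, the model produces one vector per input with last-token pooling (`pooling_mode_lasttoken: True`) and excludes the prompt from pooling (`include_prompt: False`). Below is a minimal sketch of last-token pooling over a right-padded batch; the function and variable names are ours, and prompt exclusion is omitted for brevity:

```python
import torch


def last_token_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Pick the hidden state of the last non-padding token of each sequence.

    hidden_states:  (batch, seq_len, hidden_dim) output of the underlying LlamaModel
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding (right padding assumed)
    """
    last_indices = attention_mask.sum(dim=1) - 1  # position of the last real token
    batch_indices = torch.arange(hidden_states.size(0), device=hidden_states.device)
    return hidden_states[batch_indices, last_indices]  # (batch, hidden_dim)
```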
|
|
|
## Usage |
|
|
|
First install the [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) library: |
|
|
|
```bash |
|
pip install sentence-transformers==4.0.2 |
|
``` |
|
|
|
Then you can load this model and run inference. |
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
|
|
# Download from the 🤗 Hub |
|
model = SentenceTransformer("sbintuitions/sarashina-embedding-v2-1b") |
|
# Run inference |
|
query = [
    # "task: Given a query, retrieve passages that answer the given web search query.\nquery: Is there a Sarashina text embedding model?"
    'task: クエリを与えるので、与えられたWeb検索クエリに答える関連文章を検索してください。\nquery: Sarashinaのテキスト埋め込みモデルはありますか?'
]
texts = [
    # "The Sarashina Diary is a memoir written in the mid-Heian period by Sugawara no Takasue's daughter."
    'text: 更級日記は、平安時代中期に菅原孝標女によって書かれた回想録です。',
    # "Sarashina is a Japanese large language model developed by SB Intuitions. 7B, 13B, 70B, and 8x70B models have been released so far."
    'text: Sarashinaは、SB Intuitionsが開発した日本語大規模言語モデルです。これまでに7B, 13B, 70B, 8x70Bのモデルが公開されています。',
    # "Sarashina Embedding is a Japanese embedding model based on a Japanese language model."
    'text: サラシナエンベディングは日本語言語モデルをベースにした日本語埋め込みモデルです。'
]
|
query_embedding = model.encode(query) |
|
text_embeddings = model.encode(texts) |
|
# Get the similarity scores between the embeddings |
|
similarities = model.similarity(query_embedding, text_embeddings) |
|
print(similarities) |
|
# tensor([[0.7403, 0.8651, 0.8775]]) |
|
``` |
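The scores above are cosine similarities, the model's default similarity function. They can also be reproduced directly from the embeddings; a minimal sketch, assuming the default `encode` output (NumPy arrays) and reusing the variables from the example above:

```python
import torch
import torch.nn.functional as F

# L2-normalize the embeddings, then take dot products: this is cosine similarity.
q = F.normalize(torch.from_numpy(query_embedding), dim=-1)
t = F.normalize(torch.from_numpy(text_embeddings), dim=-1)
print(q @ t.T)  # matches model.similarity(query_embedding, text_embeddings)
```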
|
### How to add instructions and prefixes |
|
|
|
Use different prefix formats for the query side and the document side. On the query side, add the prefix `task:` followed by an instruction, then `query:` followed by the input text. (For the STS task only, both sentences are treated as queries and should be prefixed with the same instruction.) The two formats are listed below, with a small helper sketch after the list.
|
|
|
- Query Side: ```task: {Instruction}\nquery: {Query}```
|
- Document Side: ```text: {Document}``` |
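A minimal sketch of these two formats as Python helpers (`format_query` and `format_document` are names introduced here for illustration; they are not part of the library):

```python
def format_query(instruction: str, query: str) -> str:
    # Query side: "task: {Instruction}\nquery: {Query}"
    return f"task: {instruction}\nquery: {query}"


def format_document(document: str) -> str:
    # Document side: "text: {Document}"
    return f"text: {document}"
```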
|
|
|
### Templates for instructions and prefixes |
|
|
|
The table below provides instruction and prefix templates for five main tasks; a sketch of applying them follows the table.
|
|Task|Query Side|Document Side| |
|
|:-:|:-|:-| |
|
|Retrieval<br>Reranking|task: 質問を与えるので、その質問に答えるのに役立つ関連文書を検索してください。\nquery: |text: | |
|
|Clustering|task: 与えられたドキュメントのトピックまたはテーマを特定してください。\nquery: | - | |
|
|Classification|task: 与えられたレビューを適切な評価カテゴリに分類してください。\nquery: | - | |
|
|STS|task: クエリを与えるので,もっともクエリに意味が似ている一節を探してください。\nquery: |task: クエリを与えるので,もっともクエリに意味が似ている一節を探してください。\nquery: | |
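These templates can be kept in a small mapping and combined with the helpers above. The dictionary below simply transcribes the query-side instructions from the table, and the usage reuses the query and document from the earlier example (a sketch; the variable names are ours):

```python
# Query-side instructions transcribed from the table above
TASK_INSTRUCTIONS = {
    "retrieval": "質問を与えるので、その質問に答えるのに役立つ関連文書を検索してください。",
    "clustering": "与えられたドキュメントのトピックまたはテーマを特定してください。",
    "classification": "与えられたレビューを適切な評価カテゴリに分類してください。",
    "sts": "クエリを与えるので,もっともクエリに意味が似ている一節を探してください。",
}

query = format_query(TASK_INSTRUCTIONS["retrieval"], "Sarashinaのテキスト埋め込みモデルはありますか?")
document = format_document("サラシナエンベディングは日本語言語モデルをベースにした日本語埋め込みモデルです。")
```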
|
|
|
## Training |
|
|
|
Sarashina-Embedding-v2-1B is created through the following three-stage learning process: |
|
|
|
### Stage 1: Weakly-supervised Learning |
|
To build a general-purpose, high-performance embedding model covering a wide range of domains, we employed contrastive learning on weakly supervised data consisting of our own web-crawled data and open datasets.
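A common way to implement this kind of contrastive objective is the in-batch-negatives (InfoNCE) loss. The sketch below is purely illustrative and does not reflect the exact training code or hyperparameters used for this model:

```python
import torch
import torch.nn.functional as F


def in_batch_contrastive_loss(query_emb: torch.Tensor,
                              doc_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives.

    query_emb, doc_emb: (batch, dim) embeddings of paired queries and documents;
    the i-th document is the positive for the i-th query, all others are negatives.
    """
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                        # (batch, batch) scaled cosine similarities
    labels = torch.arange(q.size(0), device=q.device)     # the diagonal holds the positives
    return F.cross_entropy(logits, labels)
```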
|
|
|
### Stage 2: Supervised Fine-tuning

To teach the model a finer-grained notion of similarity between queries and documents, we fine-tuned it on higher-quality data than that used in Stage 1. In addition, we trained multiple models on variants of this data.
|
|
|
### Stage 3: Model Merging |
|
To further enhance performance, we linearly merged the weights of the two Stage 2 models with the highest JMTEB scores.
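Linear merging amounts to a weighted average of corresponding parameters. A minimal sketch follows; equal weighting is an assumption here, since the actual merge ratio is not stated:

```python
def linear_merge(state_dict_a: dict, state_dict_b: dict, alpha: float = 0.5) -> dict:
    """Interpolate two compatible state dicts: alpha * A + (1 - alpha) * B."""
    return {name: alpha * state_dict_a[name] + (1.0 - alpha) * state_dict_b[name]
            for name in state_dict_a}

# Example: merged = linear_merge(model_a.state_dict(), model_b.state_dict())
```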
|
|
|
## Evaluation Results with [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB) (*)
|
|
|
|Model|Avg.|Retrieval|STS|Classification|Reranking|Clustering| |
|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:| |
|
|Sarashina-Embedding-v2-1B (This model)|**76.38**|**76.48**|**84.22**|77.14|**86.28**|52.56| |
|
|[cl-nagoya/ruri-v3-310m](https://huggingface.co/cl-nagoya/ruri-v3-310m)|75.85|76.03|81.59|**77.65**|85.84|50.52| |
|
|[sbintuitions/sarashina-embedding-v1-1b](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b)|74.87|74.53|81.71|77.20|84.36|50.30| |
|
|[OpenAI/text-embedding-3-large](https://openai.com/ja-JP/index/new-embedding-models-and-api-updates/)|73.86|71.95|82.52|77.27|83.06|51.82| |
|
|
|
(*) Evaluated on July 28, 2025. |
|
|
|
## License |
|
|
|
This model is licensed under [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v2-1b/blob/main/LICENSE). |
|
|
|
**If you are interested in using this model for commercial purposes, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/contact/).** |