---
language:
- ja
tags:
- sentence-similarity
- feature-extraction
base_model: sbintuitions/modernbert-ja-310m
widget: []
pipeline_tag: sentence-similarity
license: apache-2.0
datasets:
- cl-nagoya/ruri-v3-dataset-pt
---

# Ruri: Japanese General Text Embeddings

**⚠️ Note:**
**This model is a pretrained version and has not been fine-tuned.**  
For the fine-tuned version, please use [cl-nagoya/ruri-v3-310m](https://huggingface.co/cl-nagoya/ruri-v3-310m)!

## Fine-tuned Model Series

**Ruri v3** is a general-purpose Japanese text embedding model built on top of [**ModernBERT-Ja**](https://huggingface.co/collections/sbintuitions/modernbert-ja-67b68fe891132877cf67aa0a).
We provide Ruri v3 in several model sizes. The table below summarizes the fine-tuned models.

|ID| #Param. | #Param.<br>w/o Emb.|Dim.|#Layers|Avg. JMTEB|
|-|-|-|-|-|-|
|[cl-nagoya/ruri-v3-30m](https://huggingface.co/cl-nagoya/ruri-v3-30m)|37M|10M|256|10|74.51|
|[cl-nagoya/ruri-v3-70m](https://huggingface.co/cl-nagoya/ruri-v3-70m)|70M|31M|384|13|75.48|
|[cl-nagoya/ruri-v3-130m](https://huggingface.co/cl-nagoya/ruri-v3-130m)|132M|80M|512|19|76.55|
|[cl-nagoya/ruri-v3-310m](https://huggingface.co/cl-nagoya/ruri-v3-310m)|315M|236M|768|25|77.24|


## Usage

You can use our models with the sentence-transformers library, which requires transformers v4.48.0 or higher:

```bash
pip install -U "transformers>=4.48.0" sentence-transformers
```

If your GPU supports Flash Attention 2, we recommend installing it as well:

```bash
pip install flash-attn --no-build-isolation
```
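
As a minimal sketch of how Flash Attention 2 might be opted into at load time, the helper below builds keyword arguments for `SentenceTransformer`. It assumes that `model_kwargs` is forwarded to the underlying transformers `from_pretrained` call and that `attn_implementation="flash_attention_2"` is accepted for this model. The name `build_load_kwargs` is hypothetical and not part of any library.

```python
import importlib.util
from typing import Optional


def build_load_kwargs(has_cuda: bool, has_flash_attn: Optional[bool] = None) -> dict:
    """Hypothetical helper: keyword arguments for SentenceTransformer(...)."""
    if has_flash_attn is None:
        # Detect whether the flash-attn package is importable.
        has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if has_cuda and has_flash_attn:
        # Assumption: model_kwargs is forwarded to transformers' from_pretrained.
        return {
            "device": "cuda",
            "model_kwargs": {"attn_implementation": "flash_attention_2"},
        }
    # Otherwise fall back to the library defaults.
    return {}


# Usage (not executed here, since it downloads the model):
# import torch
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer(
#     "cl-nagoya/ruri-v3-pt-310m", **build_load_kwargs(torch.cuda.is_available())
# )
```

Flash Attention 2 kernels generally run in half precision, so you may also need to load the model in fp16 or bf16. The sketch leaves the dtype choice to you.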

Then load the model and run inference:

```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("cl-nagoya/ruri-v3-pt-310m")

# Ruri v3 employs a 1+3 prefix scheme to distinguish between different types of text inputs:
# "" (empty string) is used for encoding semantic meaning.
# "トピック: " is used for classification, clustering, and encoding topical information.
# "検索クエリ: " is used for queries in retrieval tasks.
# "検索文書: " is used for documents to be retrieved.
sentences = [
    "川べりでサーフボードを持った人たちがいます",
    "サーファーたちが川べりに立っています",
    "トピック: 瑠璃色のサーファー",
    "検索クエリ: 瑠璃色はどんな色？",
    "検索文書: 瑠璃色（るりいろ）は、紫みを帯びた濃い青。名は、半貴石の瑠璃（ラピスラズリ、英: lapis lazuli）による。JIS慣用色名では「こい紫みの青」（略号 dp-pB）と定義している[1][2]。",
]

embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.size())
# torch.Size([5, 768])

similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
```
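
The prefix scheme described in the comments above can be wrapped in a small helper so that prefixes are not typed by hand. This is only a sketch: the prefix strings are taken from the example above, while the task keys (`"semantic"`, `"topic"`, `"query"`, `"document"`) and the function name `with_prefix` are hypothetical names chosen for illustration.

```python
# Prefix strings as listed in the usage example; the dictionary keys are
# hypothetical labels, not identifiers defined by the model or any library.
PREFIXES = {
    "semantic": "",
    "topic": "トピック: ",
    "query": "検索クエリ: ",
    "document": "検索文書: ",
}


def with_prefix(texts, kind):
    """Prepend the prefix for the given input type to each text."""
    if kind not in PREFIXES:
        raise ValueError(f"unknown input type: {kind!r}")
    prefix = PREFIXES[kind]
    return [prefix + text for text in texts]


queries = with_prefix(["瑠璃色はどんな色？"], "query")
documents = with_prefix(["瑠璃色（るりいろ）は、紫みを帯びた濃い青。"], "document")
```

The resulting lists can be passed to `model.encode(...)` in the same way as `sentences` in the example above.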

## Citation

```bibtex
@misc{Ruri,
  title={{Ruri: Japanese General Text Embeddings}}, 
  author={Hayato Tsukagoshi and Ryohei Sasano},
  year={2024},
  eprint={2409.07737},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.07737}, 
}
```


## License

This model is published under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).