---
language:
- ja
tags:
- sentence-similarity
- feature-extraction
base_model: sbintuitions/modernbert-ja-70m
widget: []
pipeline_tag: sentence-similarity
license: apache-2.0
datasets:
- cl-nagoya/ruri-v3-dataset-pt
---
# Ruri: Japanese General Text Embeddings
**⚠️ Note:**
**This model is a pretrained version and has not been fine-tuned.**
For the fine-tuned version, please use [cl-nagoya/ruri-v3-70m](https://huggingface.co/cl-nagoya/ruri-v3-70m)!
## Fine-tuned Model Series
**Ruri v3** is a general-purpose Japanese text embedding model built on top of [**ModernBERT-Ja**](https://huggingface.co/collections/sbintuitions/modernbert-ja-67b68fe891132877cf67aa0a).
We provide Ruri-v3 in several model sizes. Below is a summary of each model.
|ID| #Param. | #Param.<br>w/o Emb.|Dim.|#Layers|Avg. JMTEB|
|-|-|-|-|-|-|
|[cl-nagoya/ruri-v3-30m](https://huggingface.co/cl-nagoya/ruri-v3-30m)|37M|10M|256|10|74.51|
|[cl-nagoya/ruri-v3-70m](https://huggingface.co/cl-nagoya/ruri-v3-70m)|70M|31M|384|13|75.48|
|[cl-nagoya/ruri-v3-130m](https://huggingface.co/cl-nagoya/ruri-v3-130m)|132M|80M|512|19|76.55|
|[cl-nagoya/ruri-v3-310m](https://huggingface.co/cl-nagoya/ruri-v3-310m)|315M|236M|768|25|77.24|
## Usage
You can use our models directly with Sentence Transformers; the underlying ModernBERT architecture requires the transformers library v4.48.0 or higher:
```bash
pip install -U "transformers>=4.48.0" sentence-transformers
```
Additionally, if your GPU supports it, we recommend running our models with Flash Attention 2:
```bash
pip install flash-attn --no-build-isolation
```
Then you can load this model and run inference.
```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("cl-nagoya/ruri-v3-pt-70m")
# Ruri v3 employs a 1+3 prefix scheme to distinguish between different types of text inputs:
# "" (empty string) is used for encoding semantic meaning.
# "トピック: " is used for classification, clustering, and encoding topical information.
# "検索クエリ: " is used for queries in retrieval tasks.
# "検索文書: " is used for documents to be retrieved.
sentences = [
"川べりでサーフボードを持った人たちがいます",
"サーファーたちが川べりに立っています",
"トピック: 瑠璃色のサーファー",
"検索クエリ: 瑠璃色はどんな色?",
"検索文書: 瑠璃色(るりいろ)は、紫みを帯びた濃い青。名は、半貴石の瑠璃(ラピスラズリ、英: lapis lazuli)による。JIS慣用色名では「こい紫みの青」(略号 dp-pB)と定義している[1][2]。",
]
embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.size())
# torch.Size([5, 384])
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
```
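To make the broadcasting step transparent, here is a dependency-free sketch of the same pairwise cosine-similarity computation (toy 2-dimensional vectors stand in for the 384-dimensional embeddings):

```python
import math

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarities: L2-normalize each row, then take
    all row-by-row dot products. Equivalent to the broadcasted
    F.cosine_similarity call above."""
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        normed.append([x / norm for x in vec])
    return [[sum(a * b for a, b in zip(u, v)) for v in normed] for u in normed]

# Toy 2-dimensional "embeddings":
sims = cosine_similarity_matrix([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print([[round(s, 4) for s in row] for row in sims])
# [[1.0, 0.0, 0.7071], [0.0, 1.0, 0.7071], [0.7071, 0.7071, 1.0]]
```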
## Citation
```bibtex
@misc{ruri,
title={{Ruri: Japanese General Text Embeddings}},
author={Hayato Tsukagoshi and Ryohei Sasano},
year={2024},
eprint={2409.07737},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.07737},
}
```
## License
This model is published under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).