metadata
language:
- ja
tags:
- sentence-similarity
- feature-extraction
base_model: sbintuitions/modernbert-ja-70m
widget: []
pipeline_tag: sentence-similarity
license: apache-2.0
datasets:
- cl-nagoya/ruri-v3-dataset-pt
Ruri: Japanese General Text Embeddings
⚠️Notes:
This model is a pretrained version and has not been fine-tuned.
For the fine-tuned version, please use cl-nagoya/ruri-v3-70m!
Fine-tuned Model Series
Ruri v3 is a general-purpose Japanese text embedding model built on top of ModernBERT-Ja. We provide Ruri v3 in several model sizes; the table below summarizes each one.
ID | #Param. | #Param. w/o Emb. | Dim. | #Layers | Avg. JMTEB |
---|---|---|---|---|---|
cl-nagoya/ruri-v3-30m | 37M | 10M | 256 | 10 | 74.51 |
cl-nagoya/ruri-v3-70m | 70M | 31M | 384 | 13 | 75.48 |
cl-nagoya/ruri-v3-130m | 132M | 80M | 512 | 19 | 76.55 |
cl-nagoya/ruri-v3-310m | 315M | 236M | 768 | 25 | 77.24 |
Usage
You can use our models with the sentence-transformers library, which requires transformers v4.48.0 or higher:
pip install -U "transformers>=4.48.0" sentence-transformers
If your GPU supports Flash Attention 2, we recommend installing it as well:
pip install flash-attn --no-build-isolation
Then you can load this model and run inference.
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("cl-nagoya/ruri-v3-pt-70m")
# Ruri v3 employs a 1+3 prefix scheme to distinguish between different types of text inputs:
# "" (empty string) is used for encoding semantic meaning.
# "トピック: " is used for classification, clustering, and encoding topical information.
# "検索クエリ: " is used for queries in retrieval tasks.
# "検索文書: " is used for documents to be retrieved.
sentences = [
"川べりでサーフボードを持った人たちがいます",
"サーファーたちが川べりに立っています",
"トピック: 瑠璃色のサーファー",
"検索クエリ: 瑠璃色はどんな色?",
"検索文書: 瑠璃色(るりいろ)は、紫みを帯びた濃い青。名は、半貴石の瑠璃(ラピスラズリ、英: lapis lazuli)による。JIS慣用色名では「こい紫みの青」(略号 dp-pB)と定義している[1][2]。",
]
embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.size())
# torch.Size([5, 384])
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
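The prefix scheme described in the comments above can be wrapped in a small helper so that queries and documents are always tagged consistently before encoding. The following is a minimal sketch, not part of the Ruri codebase: the names `PREFIXES` and `add_prefix` and the keys `"semantic"`, `"topic"`, `"query"`, and `"document"` are hypothetical and chosen only for illustration. The prefix strings themselves are taken from the usage example.

```python
# Prefix strings copied from the usage example; the dictionary keys are
# hypothetical labels introduced for this sketch.
PREFIXES = {
    "semantic": "",            # encoding semantic meaning
    "topic": "トピック: ",       # classification, clustering, topical information
    "query": "検索クエリ: ",     # queries in retrieval tasks
    "document": "検索文書: ",    # documents to be retrieved
}


def add_prefix(texts, kind):
    """Return a new list with the prefix for `kind` prepended to each text."""
    if kind not in PREFIXES:
        raise ValueError(f"unknown kind {kind!r}; expected one of {sorted(PREFIXES)}")
    prefix = PREFIXES[kind]
    return [prefix + text for text in texts]


queries = add_prefix(["瑠璃色はどんな色?"], "query")
documents = add_prefix(["瑠璃色(るりいろ)は、紫みを帯びた濃い青。"], "document")
print(queries[0])
print(documents[0])
```

The resulting lists can be passed to `model.encode(...)` in the same way as `sentences` in the example above.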
Citation
@misc{
Ruri,
title={{Ruri: Japanese General Text Embeddings}},
author={Hayato Tsukagoshi and Ryohei Sasano},
year={2024},
eprint={2409.07737},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.07737},
}
License
This model is published under the Apache License, Version 2.0.