Ruri: Japanese General Text Embeddings

⚠️ Note: This model is the pretrained version and has not been fine-tuned.
For the fine-tuned version, please use cl-nagoya/ruri-v3-70m.

Fine-tuned Model Series

Ruri v3 is a general-purpose Japanese text embedding model built on top of ModernBERT-Ja. We provide Ruri v3 in several model sizes, summarized in the table below.

ID                       #Param.  #Param. w/o Emb.  Dim.  #Layers  Avg. JMTEB
cl-nagoya/ruri-v3-30m    37M      10M               256   10       74.51
cl-nagoya/ruri-v3-70m    70M      31M               384   13       75.48
cl-nagoya/ruri-v3-130m   132M     80M               512   19       76.55
cl-nagoya/ruri-v3-310m   315M     236M              768   25       77.24

Usage

Our models can be used through sentence-transformers and require the transformers library v4.48.0 or higher:

pip install -U "transformers>=4.48.0" sentence-transformers

Additionally, if your GPUs support Flash Attention 2, we recommend using our models with Flash Attention 2.

pip install flash-attn --no-build-isolation
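
Once flash-attn is installed, the attention backend is typically selected when the model is loaded. The sketch below shows one way to do this. It assumes a recent sentence-transformers release whose SentenceTransformer constructor accepts model_kwargs and forwards them to transformers. build_model_kwargs and load_model are hypothetical helper names, and the bfloat16 setting is an assumption based on Flash Attention 2 running only in half precision.

```python
def build_model_kwargs(use_flash_attention: bool) -> dict:
    """Return keyword arguments forwarded to the underlying transformers model."""
    if not use_flash_attention:
        # Fall back to the library's default attention implementation and dtype.
        return {}
    # Assumption: half precision is requested because Flash Attention 2 needs fp16/bf16.
    return {"attn_implementation": "flash_attention_2", "torch_dtype": "bfloat16"}


def load_model(model_id: str = "cl-nagoya/ruri-v3-pt-70m", use_flash_attention: bool = False):
    """Hypothetical loader; Flash Attention 2 requires a supported CUDA GPU."""
    # Imported lazily so build_model_kwargs can be used without the heavy dependency.
    from sentence_transformers import SentenceTransformer

    # None lets sentence-transformers choose the device on its own.
    device = "cuda" if use_flash_attention else None
    return SentenceTransformer(
        model_id,
        device=device,
        model_kwargs=build_model_kwargs(use_flash_attention),
    )
```

Calling load_model(use_flash_attention=True) on a machine without a compatible GPU or without flash-attn installed is expected to fail, so the flag defaults to False.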

Then you can load this model and run inference.

import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("cl-nagoya/ruri-v3-pt-70m")

# Ruri v3 employs a 1+3 prefix scheme to distinguish between different types of text inputs:
# "" (empty string) is used for encoding semantic meaning.
# "トピック: " is used for classification, clustering, and encoding topical information.
# "検索クエリ: " is used for queries in retrieval tasks.
# "検索文書: " is used for documents to be retrieved.
sentences = [
    "川べりでサーフボードを持った人たちがいます",
    "サーファーたちが川べりに立っています",
    "トピック: 瑠璃色のサーファー",
    "検索クエリ: 瑠璃色はどんな色?",
    "検索文書: 瑠璃色(るりいろ)は、紫みを帯びた濃い青。名は、半貴石の瑠璃(ラピスラズリ、英: lapis lazuli)による。JIS慣用色名では「こい紫みの青」(略号 dp-pB)と定義している[1][2]。",
]

embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.size())
# torch.Size([5, 384])

similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
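
The prefix scheme described in the comments above can be wrapped in a small helper so that callers do not hard-code the prefix strings. This is a sketch only: PREFIXES, add_prefix, and the kind names ("semantic", "topic", "query", "document") are hypothetical and not part of the Ruri or sentence-transformers API. Only the four prefix strings come from the example above.

```python
# Prefix strings taken from the usage example; the dictionary keys are hypothetical labels.
PREFIXES = {
    "semantic": "",              # plain semantic encoding, no prefix
    "topic": "トピック: ",        # classification, clustering, topical information
    "query": "検索クエリ: ",      # queries in retrieval tasks
    "document": "検索文書: ",     # documents to be retrieved
}


def add_prefix(texts, kind):
    """Prepend the prefix for the given input kind to every text."""
    if kind not in PREFIXES:
        raise ValueError(f"unknown kind {kind!r}; expected one of {sorted(PREFIXES)}")
    prefix = PREFIXES[kind]
    return [prefix + text for text in texts]


queries = add_prefix(["瑠璃色はどんな色?"], "query")
documents = add_prefix(["瑠璃色(るりいろ)は、紫みを帯びた濃い青。"], "document")
```

The resulting lists can be passed to model.encode in the same way as the sentences list in the example above.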

Citation

@misc{
  Ruri,
  title={{Ruri: Japanese General Text Embeddings}}, 
  author={Hayato Tsukagoshi and Ryohei Sasano},
  year={2024},
  eprint={2409.07737},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.07737}, 
}

License

This model is published under the Apache License, Version 2.0.
