Sentence Similarity
Safetensors
Japanese
modernbert
feature-extraction
hpprc committed on
Commit b7044e8 · verified · 1 Parent(s): 93e4fdc

Update README.md

Files changed (1): README.md +125 -100
README.md CHANGED
@@ -1,141 +1,166 @@
  ---
  tags:
- - sentence-transformers
  - sentence-similarity
  - feature-extraction
  base_model: cl-nagoya/ruri-v3-pt-130m
  pipeline_tag: sentence-similarity
- library_name: sentence-transformers
  ---

- # SentenceTransformer based on cl-nagoya/ruri-v3-pt-130m

- This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [cl-nagoya/ruri-v3-pt-130m](https://huggingface.co/cl-nagoya/ruri-v3-pt-130m). It maps sentences & paragraphs to a 512-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

- ## Model Details

- ### Model Description
- - **Model Type:** Sentence Transformer
- - **Base model:** [cl-nagoya/ruri-v3-pt-130m](https://huggingface.co/cl-nagoya/ruri-v3-pt-130m) <!-- at revision 086588f9006908c33be10a4898cd0d6d0867b009 -->
- - **Maximum Sequence Length:** 8192 tokens
- - **Output Dimensionality:** 512 dimensions
- - **Similarity Function:** Cosine Similarity
- <!-- - **Training Dataset:** Unknown -->
- <!-- - **Language:** Unknown -->
- <!-- - **License:** Unknown -->

- ### Model Sources

- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

- ### Full Model Architecture
-
- ```
- MySentenceTransformer(
-   (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel
-   (1): Pooling({'word_embedding_dimension': 512, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
- )
- ```

  ## Usage

- ### Direct Usage (Sentence Transformers)
-
- First install the Sentence Transformers library:

  ```bash
- pip install -U sentence-transformers
  ```

  Then you can load this model and run inference.
  ```python
  from sentence_transformers import SentenceTransformer

  # Download from the 🤗 Hub
- model = SentenceTransformer("hpprc/ruri-v3-130m-default14-1")
- # Run inference
  sentences = [
-     'The weather is lovely today.',
-     "It's so sunny outside!",
-     'He drove to the stadium.',
  ]
- embeddings = model.encode(sentences)
- print(embeddings.shape)
- # [3, 512]
-
- # Get the similarity scores for the embeddings
- similarities = model.similarity(embeddings, embeddings)
- print(similarities.shape)
- # [3, 3]
- ```
-
- <!--
- ### Direct Usage (Transformers)
-
- <details><summary>Click to see the direct usage in Transformers</summary>
-
- </details>
- -->

- <!--
- ### Downstream Usage (Sentence Transformers)
-
- You can finetune this model on your own dataset.
-
- <details><summary>Click to expand</summary>
-
- </details>
- -->
-
- <!--
- ### Out-of-Scope Use
-
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->

- <!--
- ## Bias, Risks and Limitations
-
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->

- <!--
- ### Recommendations
-
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->

- ## Training Details

- ### Framework Versions
- - Python: 3.10.13
- - Sentence Transformers: 3.4.1
- - Transformers: 4.48.3
- - PyTorch: 2.5.1+cu124
- - Accelerate: 1.3.0
- - Datasets: 3.3.0
- - Tokenizers: 0.21.0

  ## Citation

- ### BibTeX
-
- <!--
- ## Glossary
-
- *Clearly define terms in order to be accessible across audiences.*
- -->
-
- <!--
- ## Model Card Authors
-
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
- -->

- <!--
- ## Model Card Contact
-
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
- -->
  ---
+ language:
+ - ja
  tags:
  - sentence-similarity
  - feature-extraction
  base_model: cl-nagoya/ruri-v3-pt-130m
+ widget: []
  pipeline_tag: sentence-similarity
+ license: apache-2.0
+ datasets:
+ - cl-nagoya/ruri-v3-dataset-ft
  ---

+ # Ruri: Japanese General Text Embeddings

+ **Ruri v3** is a general-purpose Japanese text embedding model built on top of [**ModernBERT-Ja**](https://huggingface.co/collections/sbintuitions/modernbert-ja-67b68fe891132877cf67aa0a).
+ Ruri v3 offers several key technical advantages:
+ - **State-of-the-art performance** for Japanese text embedding tasks.
+ - **Supports sequence lengths up to 8192 tokens**
+   - Previous versions of Ruri (v1, v2) were limited to 512.
+ - **Expanded vocabulary of 100K tokens**, compared to 32K in v1 and v2
+   - The larger vocabulary makes input sequences shorter, improving efficiency.
+ - **Integrated FlashAttention**, following ModernBERT's architecture
+   - Enables faster inference and fine-tuning.
+ - **Tokenizer based solely on SentencePiece**
+   - Unlike previous versions, which relied on Japanese-specific BERT tokenizers and required pre-tokenized input, Ruri v3 performs tokenization with SentencePiece only; no external word segmentation tool is required (see the sketch after this list).
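+
+ As a quick illustration of the SentencePiece-only pipeline, raw Japanese text can be passed straight to the Hugging Face tokenizer. This is a minimal sketch of ours, not code from the card; the example string is arbitrary:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ # Raw text goes in directly; no MeCab or other pre-segmentation step is needed.
+ tokenizer = AutoTokenizer.from_pretrained("cl-nagoya/ruri-v3-130m")
+ print(tokenizer.tokenize("瑠璃色はどんな色?"))  # SentencePiece subword pieces
+ ```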

+ ## Model Series

+ We provide Ruri-v3 in several model sizes. Below is a summary of each model.

+ |ID| #Param. | #Param.<br>w/o Emb.|Dim.|#Layers|Avg. JMTEB|
+ |-|-|-|-|-|-|
+ |[cl-nagoya/ruri-v3-30m](https://huggingface.co/cl-nagoya/ruri-v3-30m)|37M|10M|256|10|74.51|
+ |[cl-nagoya/ruri-v3-70m](https://huggingface.co/cl-nagoya/ruri-v3-70m)|70M|31M|384|13|75.48|
+ |[**cl-nagoya/ruri-v3-130m**](https://huggingface.co/cl-nagoya/ruri-v3-130m)|132M|80M|512|19|**76.55**|
+ |[cl-nagoya/ruri-v3-310m](https://huggingface.co/cl-nagoya/ruri-v3-310m)|315M|236M|768|25|77.24|

  ## Usage

+ You can use our models directly with the transformers library v4.48.0 or higher:

  ```bash
+ pip install -U "transformers>=4.48.0"
+ ```
+
+ Additionally, if your GPUs support Flash Attention 2, we recommend using our models with Flash Attention 2.
+
+ ```bash
+ pip install flash-attn --no-build-isolation
  ```
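+
+ With flash-attn installed, the ModernBERT backbone can pick up FlashAttention automatically. If you want to request it explicitly, here is a minimal sketch; the `model_kwargs` values are our suggestion, not prescribed by this card, and assume a recent sentence-transformers that forwards these kwargs to transformers:
+
+ ```python
+ import torch
+ from sentence_transformers import SentenceTransformer
+
+ # Explicitly opt in to FlashAttention 2 and half precision on a CUDA device.
+ model = SentenceTransformer(
+     "cl-nagoya/ruri-v3-130m",
+     model_kwargs={"attn_implementation": "flash_attention_2", "torch_dtype": torch.bfloat16},
+     device="cuda",
+ )
+ ```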

  Then you can load this model and run inference.
  ```python
+ import torch.nn.functional as F
  from sentence_transformers import SentenceTransformer

  # Download from the 🤗 Hub
+ model = SentenceTransformer("cl-nagoya/ruri-v3-130m")
+
+ # Ruri v3 employs a 1+3 prefix scheme to distinguish between different types of text inputs:
+ # "" (empty string) is used for encoding semantic meaning.
+ # "トピック: " is used for classification, clustering, and encoding topical information.
+ # "検索クエリ: " is used for queries in retrieval tasks.
+ # "検索文書: " is used for documents to be retrieved.
  sentences = [
+     "川べりでサーフボードを持った人たちがいます",  # "There are people with surfboards on the riverbank."
+     "サーファーたちが川べりに立っています",  # "Surfers are standing on the riverbank."
+     "トピック: 瑠璃色のサーファー",  # "Topic: the lapis-lazuli-blue surfer"
+     "検索クエリ: 瑠璃色はどんな色?",  # "Search query: What kind of color is ruri-iro (lapis lazuli blue)?"
+     "検索文書: 瑠璃色(るりいろ)は、紫みを帯びた濃い青。名は、半貴石の瑠璃(ラピスラズリ、英: lapis lazuli)による。JIS慣用色名では「こい紫みの青」(略号 dp-pB)と定義している[1][2]。",  # "Search document: Ruri-iro is a deep, purple-tinged blue named after the semi-precious stone lapis lazuli. The JIS conventional color name defines it as 'deep purplish blue' (code dp-pB)."
  ]

+ embeddings = model.encode(sentences, convert_to_tensor=True)
+ print(embeddings.size())
+ # [5, 512]
+
+ similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
+ print(similarities)
+ # [[1.0000, 0.9564, 0.8183, 0.7000, 0.7108],
+ #  [0.9564, 1.0000, 0.8112, 0.6994, 0.7117],
+ #  [0.8183, 0.8112, 1.0000, 0.8788, 0.8514],
+ #  [0.7000, 0.6994, 0.8788, 1.0000, 0.9448],
+ #  [0.7108, 0.7117, 0.8514, 0.9448, 1.0000]]
+ ```
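+
+ For retrieval, encode queries with the "検索クエリ: " prefix and documents with the "検索文書: " prefix, then rank documents by cosine similarity. A minimal sketch of ours (the query and document strings are illustrative, not from the card):
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("cl-nagoya/ruri-v3-130m")
+
+ query = "検索クエリ: 瑠璃色はどんな色?"  # "What kind of color is ruri-iro?"
+ docs = [
+     "検索文書: 瑠璃色は、紫みを帯びた濃い青。",  # on-topic document
+     "検索文書: サーファーたちが川べりに立っています。",  # off-topic document
+ ]
+
+ # model.similarity applies the model's similarity function (cosine for this model).
+ scores = model.similarity(model.encode([query]), model.encode(docs))  # shape: [1, 2]
+ print(docs[scores.argmax().item()])
+ ```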

+ ## Benchmarks
+
+ ### JMTEB
+ Evaluated with [JMTEB](https://github.com/sbintuitions/JMTEB).
+
+ |Model|#Param.|Avg.|Retrieval|STS|Classification|Reranking|Clustering|PairClassification|
+ |:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
+ ||||||||||
+ |[Ruri-v3-30m](https://huggingface.co/cl-nagoya/ruri-v3-30m)|37M|74.51|78.08|82.48|74.80|93.00|52.12|62.40|
+ |[Ruri-v3-70m](https://huggingface.co/cl-nagoya/ruri-v3-70m)|70M|75.48|79.96|79.82|76.97|93.27|52.70|61.75|
+ |[**Ruri-v3-130m**](https://huggingface.co/cl-nagoya/ruri-v3-130m)<br/>(this model)|**132M**|**76.55**|81.89|79.25|77.16|93.31|55.36|62.26|
+ |[Ruri-v3-310m](https://huggingface.co/cl-nagoya/ruri-v3-310m)|315M|77.24|81.89|81.22|78.66|93.43|55.69|62.60|
+ ||||||||||
+ |[sbintuitions/sarashina-embedding-v1-1b](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b)|1.22B|75.50|77.61|82.71|78.37|93.74|53.86|62.00|
+ ||||||||||
+ |OpenAI/text-embedding-ada-002|-|69.48|64.38|79.02|69.75|93.04|48.30|62.40|
+ |OpenAI/text-embedding-3-small|-|70.86|66.39|79.46|73.06|92.92|51.06|62.27|
+ |OpenAI/text-embedding-3-large|-|73.97|74.48|82.52|77.58|93.58|53.32|62.35|
+ ||||||||||
+ |[pkshatech/GLuCoSE-base-ja](https://huggingface.co/pkshatech/GLuCoSE-base-ja)|133M|70.44|59.02|78.71|76.82|91.90|49.78|66.39|
+ |[pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2)|133M|72.23|73.36|82.96|74.21|93.01|48.65|62.37|
+ |[retrieva-jp/amber-base](https://huggingface.co/retrieva-jp/amber-base)|130M|72.12|73.40|77.81|76.14|93.27|48.05|64.03|
+ |[retrieva-jp/amber-large](https://huggingface.co/retrieva-jp/amber-large)|315M|73.22|75.40|79.32|77.14|93.54|48.73|60.97|
+ ||||||||||
+ |[sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE)|472M|64.70|40.12|76.56|72.66|91.63|44.88|62.33|
+ |[intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small)|118M|69.52|67.27|80.07|67.62|93.03|46.91|62.19|
+ |[intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base)|278M|70.12|68.21|79.84|69.30|92.85|48.26|62.26|
+ |[intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)|560M|71.65|70.98|79.70|72.89|92.96|51.24|62.15|
+ ||||||||||
+ |[Ruri-Small](https://huggingface.co/cl-nagoya/ruri-small)|68M|71.53|69.41|82.79|76.22|93.00|51.19|62.11|
+ |[Ruri-Small v2](https://huggingface.co/cl-nagoya/ruri-small-v2)|68M|73.30|73.94|82.91|76.17|93.20|51.58|62.32|
+ |[Ruri-Base](https://huggingface.co/cl-nagoya/ruri-base)|111M|71.91|69.82|82.87|75.58|92.91|54.16|62.38|
+ |[Ruri-Base v2](https://huggingface.co/cl-nagoya/ruri-base-v2)|111M|72.48|72.33|83.03|75.34|93.17|51.38|62.35|
+ |[Ruri-Large](https://huggingface.co/cl-nagoya/ruri-large)|337M|73.31|73.02|83.13|77.43|92.99|51.82|62.29|
+ |[Ruri-Large v2](https://huggingface.co/cl-nagoya/ruri-large-v2)|337M|74.55|76.34|83.17|77.18|93.21|52.14|62.27|

+ ## Model Details

+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [cl-nagoya/ruri-v3-pt-130m](https://huggingface.co/cl-nagoya/ruri-v3-pt-130m)
+ - **Maximum Sequence Length:** 8192 tokens
+ - **Output Dimensionality:** 512
+ - **Similarity Function:** Cosine Similarity
+ - **Language:** Japanese
+ - **License:** Apache 2.0
+ - **Paper:** https://arxiv.org/abs/2409.07737

+ ### Full Model Architecture

+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel
+   (1): Pooling({'word_embedding_dimension': 512, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+ )
+ ```
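+
+ The Pooling module above is plain mean pooling over token embeddings, so the model can also be run with transformers directly. A minimal sketch of ours with the pooling written out by hand (not code from this card):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+ from transformers import AutoModel, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("cl-nagoya/ruri-v3-130m")
+ model = AutoModel.from_pretrained("cl-nagoya/ruri-v3-130m")
+
+ texts = ["川べりでサーフボードを持った人たちがいます", "サーファーたちが川べりに立っています"]
+ batch = tokenizer(texts, padding=True, return_tensors="pt")
+
+ with torch.no_grad():
+     token_embs = model(**batch).last_hidden_state  # [batch, seq_len, 512]
+
+ # Mean pooling: average the token embeddings, masking out padding positions.
+ mask = batch["attention_mask"].unsqueeze(-1).float()
+ embeddings = (token_embs * mask).sum(dim=1) / mask.sum(dim=1)
+
+ print(F.cosine_similarity(embeddings[0], embeddings[1], dim=0))
+ ```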
 
 
  ## Citation

+ ```bibtex
+ @misc{Ruri,
+   title={{Ruri: Japanese General Text Embeddings}},
+   author={Hayato Tsukagoshi and Ryohei Sasano},
+   year={2024},
+   eprint={2409.07737},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/2409.07737},
+ }
+ ```

+ ## License
+ This model is published under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).