Sentence Similarity · Safetensors · Japanese · modernbert · feature-extraction
hpprc committed on
Commit dfeaaa4 · verified · 1 Parent(s): e656703

Update README.md

Files changed (1)
  1. README.md +60 -107
README.md CHANGED
@@ -1,141 +1,94 @@
  ---
  tags:
- - sentence-transformers
  - sentence-similarity
  - feature-extraction
  base_model: sbintuitions/modernbert-ja-70m
  pipeline_tag: sentence-similarity
- library_name: sentence-transformers
  ---

- # SentenceTransformer based on sbintuitions/modernbert-ja-70m
-
- This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [sbintuitions/modernbert-ja-70m](https://huggingface.co/sbintuitions/modernbert-ja-70m). It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
-
- ## Model Details
-
- ### Model Description
- - **Model Type:** Sentence Transformer
- - **Base model:** [sbintuitions/modernbert-ja-70m](https://huggingface.co/sbintuitions/modernbert-ja-70m) <!-- at revision 4c79ac9aad6f8399f7493c35dde895e7a5d79bf8 -->
- - **Maximum Sequence Length:** 8192 tokens
- - **Output Dimensionality:** 384 dimensions
- - **Similarity Function:** Cosine Similarity
- <!-- - **Training Dataset:** Unknown -->
- <!-- - **Language:** Unknown -->
- <!-- - **License:** Unknown -->
-
- ### Model Sources
-
- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
-
- ### Full Model Architecture
-
- ```
- MySentenceTransformer(
-   (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel
-   (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
- )
- ```

  ## Usage

- ### Direct Usage (Sentence Transformers)
-
- First install the Sentence Transformers library:

  ```bash
- pip install -U sentence-transformers
  ```

  Then you can load this model and run inference.
  ```python
  from sentence_transformers import SentenceTransformer

  # Download from the 🤗 Hub
- model = SentenceTransformer("hpprc/ruri-v3-70m-pt-2")
- # Run inference
  sentences = [
-     'The weather is lovely today.',
-     "It's so sunny outside!",
-     'He drove to the stadium.',
  ]
- embeddings = model.encode(sentences)
- print(embeddings.shape)
- # [3, 384]
-
- # Get the similarity scores for the embeddings
- similarities = model.similarity(embeddings, embeddings)
- print(similarities.shape)
- # [3, 3]
- ```
-
- <!--
- ### Direct Usage (Transformers)
-
- <details><summary>Click to see the direct usage in Transformers</summary>
-
- </details>
- -->
-
- <!--
- ### Downstream Usage (Sentence Transformers)
-
- You can finetune this model on your own dataset.
-
- <details><summary>Click to expand</summary>
-
- </details>
- -->
-
- <!--
- ### Out-of-Scope Use
-
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->
-
- <!--
- ## Bias, Risks and Limitations
-
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->
-
- <!--
- ### Recommendations
-
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->
-
- ## Training Details
-
- ### Framework Versions
- - Python: 3.10.13
- - Sentence Transformers: 3.4.1
- - Transformers: 4.48.3
- - PyTorch: 2.5.1+cu124
- - Accelerate: 1.3.0
- - Datasets: 3.3.0
- - Tokenizers: 0.21.0

  ## Citation

- ### BibTeX
-
- <!--
- ## Glossary
-
- *Clearly define terms in order to be accessible across audiences.*
- -->
-
- <!--
- ## Model Card Authors
-
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
- -->

- <!--
- ## Model Card Contact
-
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
- -->

  ---
+ language:
+ - ja
  tags:
  - sentence-similarity
  - feature-extraction
  base_model: sbintuitions/modernbert-ja-70m
+ widget: []
  pipeline_tag: sentence-similarity
+ license: apache-2.0
+ datasets:
+ - cl-nagoya/ruri-v3-dataset-pt
  ---

+ # Ruri: Japanese General Text Embeddings

+ **⚠️ Note:**
+ **This model is a pretrained version and has not been fine-tuned.**
+ For the fine-tuned version, please use [cl-nagoya/ruri-v3-70m](https://huggingface.co/cl-nagoya/ruri-v3-70m)!

+ ## Fine-tuned Model Series

+ **Ruri v3** is a general-purpose Japanese text embedding model built on top of [**ModernBERT-Ja**](https://huggingface.co/collections/sbintuitions/modernbert-ja-67b68fe891132877cf67aa0a).
+ We provide Ruri v3 in several model sizes. Below is a summary of each model.

+ | ID | #Param. | #Param.<br>w/o Emb. | Dim. | #Layers | Avg. JMTEB |
+ |:-|-:|-:|-:|-:|-:|
+ | [cl-nagoya/ruri-v3-30m](https://huggingface.co/cl-nagoya/ruri-v3-30m) | 37M | 10M | 256 | 10 | 74.51 |
+ | [cl-nagoya/ruri-v3-70m](https://huggingface.co/cl-nagoya/ruri-v3-70m) | 70M | 31M | 384 | 13 | 75.48 |
+ | [cl-nagoya/ruri-v3-130m](https://huggingface.co/cl-nagoya/ruri-v3-130m) | 132M | 80M | 512 | 19 | 76.55 |
+ | [cl-nagoya/ruri-v3-310m](https://huggingface.co/cl-nagoya/ruri-v3-310m) | 315M | 236M | 768 | 25 | 77.24 |
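+
+ As a quick sanity check, the Dim. column is the embedding width that sentence-transformers reports after loading; a minimal sketch using this card's pretrained checkpoint, which shares the 70m architecture in the table (see Usage below for installation):
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Sketch: ruri-v3-pt-70m shares the 70m architecture above, so the
+ # reported embedding width should match the table's Dim. column.
+ model = SentenceTransformer("cl-nagoya/ruri-v3-pt-70m")
+ print(model.get_sentence_embedding_dimension())  # 384
+ ```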
  ## Usage

+ You can use our models directly with the `transformers` library, v4.48.0 or higher:

  ```bash
+ pip install -U "transformers>=4.48.0"
+ ```
+
+ Additionally, if your GPU supports Flash Attention 2, we recommend running our models with it:
+
+ ```bash
+ pip install flash-attn --no-build-isolation
  ```
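+
+ With flash-attn installed, one way to enable it is to forward `attn_implementation="flash_attention_2"` to the underlying ModernBERT model via `model_kwargs`; a minimal sketch, assuming a CUDA GPU (Flash Attention 2 also needs a half-precision dtype):
+
+ ```python
+ import torch
+ from sentence_transformers import SentenceTransformer
+
+ # Sketch: pass attn_implementation and a bf16 dtype through to the
+ # underlying Hugging Face model; requires a GPU and flash-attn installed.
+ model = SentenceTransformer(
+     "cl-nagoya/ruri-v3-pt-70m",
+     model_kwargs={"attn_implementation": "flash_attention_2", "torch_dtype": torch.bfloat16},
+     device="cuda",
+ )
+ ```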

  Then you can load this model and run inference.
  ```python
+ import torch.nn.functional as F
  from sentence_transformers import SentenceTransformer

  # Download from the 🤗 Hub
+ model = SentenceTransformer("cl-nagoya/ruri-v3-pt-70m")
+
+ # Ruri v3 employs a 1+3 prefix scheme to distinguish between different types of text inputs:
+ # "" (empty string) is used for encoding semantic meaning.
+ # "トピック: " ("Topic: ") is used for classification, clustering, and encoding topical information.
+ # "検索クエリ: " ("Search query: ") is used for queries in retrieval tasks.
+ # "検索文書: " ("Search document: ") is used for documents to be retrieved.
  sentences = [
+     "川べりでサーフボードを持った人たちがいます",  # "There are people with surfboards on the riverbank."
+     "サーファーたちが川べりに立っています",  # "Surfers are standing on the riverbank."
+     "トピック: 瑠璃色のサーファー",  # "Topic: a lapis lazuli surfer"
+     "検索クエリ: 瑠璃色はどんな色?",  # "Search query: what kind of color is lapis lazuli?"
+     "検索文書: 瑠璃色(るりいろ)は、紫みを帯びた濃い青。名は、半貴石の瑠璃(ラピスラズリ、英: lapis lazuli)による。JIS慣用色名では「こい紫みの青」(略号 dp-pB)と定義している[1][2]。",  # "Search document: lapis lazuli blue (ruri-iro) is a deep blue tinged with purple, named after the semi-precious stone lapis lazuli; the JIS conventional color name defines it as 'deep purplish blue' (dp-pB)."
  ]

+ embeddings = model.encode(sentences, convert_to_tensor=True)
+ print(embeddings.size())
+ # torch.Size([5, 384])
+
+ # Pairwise cosine similarities between all five embeddings
+ similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
+ print(similarities)
+ ```
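+
+ As an illustration of the retrieval prefixes above, a minimal sketch that ranks prefixed documents against a prefixed query using the model's built-in cosine similarity (the strings here are illustrative, reusing the `model` loaded above):
+
+ ```python
+ # Illustrative retrieval sketch: prepend "検索クエリ: " to the query and
+ # "検索文書: " to each document, then rank documents by cosine similarity.
+ query = "検索クエリ: 瑠璃色はどんな色?"  # "What kind of color is lapis lazuli?"
+ docs = [
+     "検索文書: 瑠璃色は、紫みを帯びた濃い青。",  # "Lapis lazuli is a deep blue tinged with purple."
+     "検索文書: サーファーたちが川べりに立っています。",  # "Surfers are standing on the riverbank."
+ ]
+ query_emb = model.encode([query], convert_to_tensor=True)
+ doc_embs = model.encode(docs, convert_to_tensor=True)
+ scores = model.similarity(query_emb, doc_embs)  # cosine similarity, shape [1, 2]
+ best = int(scores.argmax())
+ print(docs[best], float(scores[0, best]))
+ ```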
 
  ## Citation

+ ```bibtex
+ @misc{Ruri,
+     title={{Ruri: Japanese General Text Embeddings}},
+     author={Hayato Tsukagoshi and Ryohei Sasano},
+     year={2024},
+     eprint={2409.07737},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL},
+     url={https://arxiv.org/abs/2409.07737},
+ }
+ ```

+ ## License
+ This model is published under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).