hpprc committed (verified) · commit 4c40393 · 1 parent: 2fb8b46

Update README.md

Files changed (1): README.md (+86 −101)
---
language:
- ja
tags:
- sentence-similarity
- feature-extraction
base_model: cl-nagoya/ruri-pt-small-v2
widget: []
pipeline_tag: sentence-similarity
license: apache-2.0
---

# Ruri: Japanese General Text Embeddings

## Usage

First install the Sentence Transformers library, along with the dependencies for the Japanese tokenizer:

```bash
pip install -U sentence-transformers fugashi sentencepiece unidic-lite
```
Then you can load this model and run inference.

```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("cl-nagoya/ruri-small-v2")

# Don't forget to add the prefix "クエリ: " to query-side texts or "文章: " to passage-side texts.
sentences = [
    "クエリ: 瑠璃色はどんな色?",
    "文章: 瑠璃色(るりいろ)は、紫みを帯びた濃い青。名は、半貴石の瑠璃(ラピスラズリ、英: lapis lazuli)による。JIS慣用色名では「こい紫みの青」(略号 dp-pB)と定義している[1][2]。",
    "クエリ: ワシやタカのように、鋭いくちばしと爪を持った大型の鳥類を総称して「何類」というでしょう?",
    "文章: ワシ、タカ、ハゲワシ、ハヤブサ、コンドル、フクロウが代表的である。これらの猛禽類はリンネ前後の時代(17~18世紀)には鷲類・鷹類・隼類及び梟類に分類された。ちなみにリンネは狩りをする鳥を単一の目(もく)にまとめ、vultur(コンドル、ハゲワシ)、falco(ワシ、タカ、ハヤブサなど)、strix(フクロウ)、lanius(モズ)の4属を含めている。",
]

embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.size())
# [4, 768]

# Pairwise cosine similarities between all four embeddings
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
```
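The same prefix convention carries over to retrieval: encode the query and the passages separately, then rank passages by cosine similarity to the query. A minimal sketch, assuming a tiny hypothetical corpus (the two passages are abbreviated from the example above):

```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("cl-nagoya/ruri-small-v2")

# "クエリ: " marks the query side, "文章: " the passage side.
query = "クエリ: 瑠璃色はどんな色?"
passages = [
    "文章: 瑠璃色(るりいろ)は、紫みを帯びた濃い青。",
    "文章: ワシ、タカ、ハゲワシ、ハヤブサ、コンドル、フクロウが代表的である。",
]

query_emb = model.encode(query, convert_to_tensor=True)        # shape [768]
passage_embs = model.encode(passages, convert_to_tensor=True)  # shape [2, 768]

# Rank passages by cosine similarity to the query; higher = more relevant.
scores = F.cosine_similarity(query_emb.unsqueeze(0), passage_embs, dim=1)
print(passages[scores.argmax().item()])  # the 瑠璃色 passage should win
```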
 
 
 
## Benchmarks

### JMTEB
Evaluated with [JMTEB](https://github.com/sbintuitions/JMTEB).

|Model|#Param.|Avg.|Retrieval|STS|Classification|Reranking|Clustering|PairClassification|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|[cl-nagoya/sup-simcse-ja-base](https://huggingface.co/cl-nagoya/sup-simcse-ja-base)|111M|68.56|49.64|82.05|73.47|91.83|51.79|62.57|
|[cl-nagoya/sup-simcse-ja-large](https://huggingface.co/cl-nagoya/sup-simcse-ja-large)|337M|66.51|37.62|83.18|73.73|91.48|50.56|62.51|
|[cl-nagoya/unsup-simcse-ja-base](https://huggingface.co/cl-nagoya/unsup-simcse-ja-base)|111M|65.07|40.23|78.72|73.07|91.16|44.77|62.44|
|[cl-nagoya/unsup-simcse-ja-large](https://huggingface.co/cl-nagoya/unsup-simcse-ja-large)|337M|66.27|40.53|80.56|74.66|90.95|48.41|62.49|
|[pkshatech/GLuCoSE-base-ja](https://huggingface.co/pkshatech/GLuCoSE-base-ja)|133M|70.44|59.02|78.71|76.82|91.90|49.78|66.39|
||||||||||
|[sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE)|472M|64.70|40.12|76.56|72.66|91.63|44.88|62.33|
|[intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small)|118M|69.52|67.27|80.07|67.62|93.03|46.91|62.19|
|[intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base)|278M|70.12|68.21|79.84|69.30|92.85|48.26|62.26|
|[intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)|560M|71.65|70.98|79.70|72.89|92.96|51.24|62.15|
||||||||||
|OpenAI/text-embedding-ada-002|-|69.48|64.38|79.02|69.75|93.04|48.30|62.40|
|OpenAI/text-embedding-3-small|-|70.86|66.39|79.46|73.06|92.92|51.06|62.27|
|OpenAI/text-embedding-3-large|-|73.97|74.48|82.52|77.58|93.58|53.32|62.35|
||||||||||
|[Ruri-Small](https://huggingface.co/cl-nagoya/ruri-small)|68M|71.53|69.41|82.79|76.22|93.00|51.19|62.11|
|[**Ruri-Small v2**](https://huggingface.co/cl-nagoya/ruri-small-v2) (this model)|68M|73.30|73.94|82.91|76.17|93.20|51.58|62.32|
|[Ruri-Base](https://huggingface.co/cl-nagoya/ruri-base)|111M|71.91|69.82|82.87|75.58|92.91|54.16|62.38|
|[Ruri-Base v2](https://huggingface.co/cl-nagoya/ruri-base-v2)|111M|72.48|72.33|83.03|75.34|93.17|51.38|62.35|
|[Ruri-Large](https://huggingface.co/cl-nagoya/ruri-large)|337M|73.31|73.02|83.13|77.43|92.99|51.82|62.29|
|[Ruri-Large v2](https://huggingface.co/cl-nagoya/ruri-large-v2)|337M|74.55|76.34|83.17|77.18|93.21|52.14|62.27|
 
 
 
## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [cl-nagoya/ruri-pt-small-v2](https://huggingface.co/cl-nagoya/ruri-pt-small-v2)
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768
- **Similarity Function:** Cosine Similarity (see the note after this list)
- **Language:** Japanese
- **License:** Apache 2.0
- **Paper:** https://arxiv.org/abs/2409.07737
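Since cosine similarity is the intended comparison function, it may help to recall that it equals the dot product of L2-normalized vectors, which is why embeddings are often normalized before being indexed. A tiny self-contained check with random stand-in vectors (not model outputs):

```python
import torch
import torch.nn.functional as F

a, b = torch.randn(768), torch.randn(768)  # stand-ins for two 768-dim embeddings

# Cosine similarity computed directly...
cos = F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0))
# ...equals the dot product after L2 normalization.
dot = F.normalize(a, dim=0) @ F.normalize(b, dim=0)
print(torch.allclose(cos.squeeze(), dot))  # True
```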
### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
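The architecture is a BERT encoder followed by mean pooling over token embeddings. For reference, a sketch of the same pooling done with 🤗 Transformers directly, assuming the checkpoint loads as a plain BERT encoder via AutoModel (as the printout above indicates); the SentenceTransformer path above remains the recommended one:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cl-nagoya/ruri-small-v2")
model = AutoModel.from_pretrained("cl-nagoya/ruri-small-v2")

texts = ["クエリ: 瑠璃色はどんな色?", "文章: 瑠璃色(るりいろ)は、紫みを帯びた濃い青。"]
batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    token_embs = model(**batch).last_hidden_state  # [batch, seq_len, 768]

# Mean pooling: average token embeddings, excluding padding via the attention mask.
mask = batch["attention_mask"].unsqueeze(-1).float()           # [batch, seq_len, 1]
embeddings = (token_embs * mask).sum(dim=1) / mask.sum(dim=1)  # [batch, 768]
print(embeddings.shape)  # torch.Size([2, 768])
```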
### Framework Versions
- Python: 3.10.13
- Sentence Transformers: 3.0.0
- Transformers: 4.41.2
- PyTorch: 2.3.1+cu118
- Accelerate: 0.30.1
- Datasets: 2.19.1
- Tokenizers: 0.19.1
 
## Citation

```bibtex
@misc{Ruri,
  title={{Ruri: Japanese General Text Embeddings}},
  author={Hayato Tsukagoshi and Ryohei Sasano},
  year={2024},
  eprint={2409.07737},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.07737},
}
```

## License
This model is published under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).