sergeyzh committed · verified · Commit 5b8b53a · Parent(s): 8569849

Upload 10 files
1_Pooling/config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "word_embedding_dimension": 256,
+   "pooling_mode_cls_token": true,
+   "pooling_mode_mean_tokens": false,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false
+ }
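
The pooling config above enables only `pooling_mode_cls_token`, so the sentence embedding is simply the transformer's output vector for the first (`[CLS]`) token rather than a mean over tokens. A minimal numpy sketch of that step, using toy token embeddings (values are illustrative, not model output):

```python
import numpy as np

# Toy transformer output: 4 tokens, word_embedding_dimension = 256.
rng = np.random.default_rng(0)
token_embeddings = rng.standard_normal((4, 256)).astype(np.float32)

def cls_pool(token_embeddings: np.ndarray) -> np.ndarray:
    # pooling_mode_cls_token=true: keep only the first token's vector.
    return token_embeddings[0]

sentence_embedding = cls_pool(token_embeddings)
print(sentence_embedding.shape)  # (256,)
```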
README.md CHANGED
@@ -1,3 +1,94 @@
  ---
+ language:
+ - ru
+ 
+ pipeline_tag: sentence-similarity
+ 
+ tags:
+ - russian
+ - pretraining
+ - embeddings
+ - tiny
+ - feature-extraction
+ - sentence-similarity
+ - sentence-transformers
+ - transformers
+ - mteb
+ 
+ datasets:
+ - IlyaGusev/gazeta
+ - zloelias/lenta-ru
+ - HuggingFaceFW/fineweb-2
+ - HuggingFaceFW/fineweb
+ 
  license: mit
+ 
+ 
  ---
+ 
+ 
+ A fast BERT model for Russian with an embedding size of 256 and a context length of 512. The model was obtained by sequential distillation of [sergeyzh/rubert-tiny-turbo](https://huggingface.co/sergeyzh/rubert-tiny-turbo) and [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3). It matches [rubert-tiny-turbo](https://huggingface.co/sergeyzh/rubert-tiny-turbo) in quality while running ~1.4x faster on CPU and ~1.2x faster on GPU.
+ 
+ 
+ ## Usage
+ ```python
+ from sentence_transformers import SentenceTransformer
+ 
+ model = SentenceTransformer('sergeyzh/rubert-tiny-lite')
+ 
+ sentences = ["привет мир", "hello world", "здравствуй вселенная"]
+ embeddings = model.encode(sentences)
+ 
+ print(model.similarity(embeddings, embeddings))
+ ```
+ 
+ ## Metrics
+ 
+ Model scores on the [encodechka](https://github.com/avidale/encodechka) benchmark:
+ 
+ | model | STS | PI | NLI | SA | TI |
+ |:-----------------------------------|:---------|:---------|:---------|:---------|:---------|
+ | BAAI/bge-m3 | 0.864 | 0.749 | 0.510 | 0.819 | 0.973 |
+ | intfloat/multilingual-e5-large | 0.862 | 0.727 | 0.473 | 0.810 | 0.979 |
+ | **sergeyzh/rubert-tiny-lite** | 0.839 | 0.712 | 0.488 | 0.788 | 0.949 |
+ | intfloat/multilingual-e5-base | 0.835 | 0.704 | 0.459 | 0.796 | 0.964 |
+ | [sergeyzh/rubert-tiny-turbo](https://huggingface.co/sergeyzh/rubert-tiny-turbo) | 0.828 | 0.722 | 0.476 | 0.787 | 0.955 |
+ | intfloat/multilingual-e5-small | 0.822 | 0.714 | 0.457 | 0.758 | 0.957 |
+ | cointegrated/rubert-tiny2 | 0.750 | 0.651 | 0.417 | 0.737 | 0.937 |
+ 
+ Model scores on the [ruMTEB](https://habr.com/ru/companies/sberdevices/articles/831150/) benchmark:
+ 
+ |Model Name | Metric | rubert-tiny2 | [rubert-tiny-turbo](https://huggingface.co/sergeyzh/rubert-tiny-turbo) | rubert-tiny-lite | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large |
+ |:----------------------------------|:--------------------|----------------:|------------------:|------------------:|----------------------:|---------------------:|----------------------:|
+ |CEDRClassification | Accuracy | 0.369 | 0.390 | 0.407 | 0.401 | 0.423 | **0.448** |
+ |GeoreviewClassification | Accuracy | 0.396 | 0.414 | 0.423 | 0.447 | 0.461 | **0.497** |
+ |GeoreviewClusteringP2P | V-measure | 0.442 | 0.597 | **0.611** | 0.586 | 0.545 | 0.605 |
+ |HeadlineClassification | Accuracy | 0.742 | 0.686 | 0.652 | 0.732 | 0.757 | **0.758** |
+ |InappropriatenessClassification | Accuracy | 0.586 | 0.591 | 0.588 | 0.592 | 0.588 | **0.616** |
+ |KinopoiskClassification | Accuracy | 0.491 | 0.505 | 0.507 | 0.500 | 0.509 | **0.566** |
+ |RiaNewsRetrieval | NDCG@10 | 0.140 | 0.513 | 0.617 | 0.700 | 0.702 | **0.807** |
+ |RuBQReranking | MAP@10 | 0.461 | 0.622 | 0.631 | 0.715 | 0.720 | **0.756** |
+ |RuBQRetrieval | NDCG@10 | 0.109 | 0.517 | 0.511 | 0.685 | 0.696 | **0.741** |
+ |RuReviewsClassification | Accuracy | 0.570 | 0.607 | 0.615 | 0.612 | 0.630 | **0.653** |
+ |RuSTSBenchmarkSTS | Pearson correlation | 0.694 | 0.787 | 0.799 | 0.781 | 0.796 | **0.831** |
+ |RuSciBenchGRNTIClassification | Accuracy | 0.456 | 0.529 | 0.544 | 0.550 | 0.563 | **0.582** |
+ |RuSciBenchGRNTIClusteringP2P | V-measure | 0.414 | 0.481 | 0.510 | 0.511 | 0.516 | **0.520** |
+ |RuSciBenchOECDClassification | Accuracy | 0.355 | 0.415 | 0.424 | 0.427 | 0.423 | **0.445** |
+ |RuSciBenchOECDClusteringP2P | V-measure | 0.381 | 0.411 | 0.438 | 0.443 | 0.448 | **0.450** |
+ |SensitiveTopicsClassification | Accuracy | 0.220 | 0.244 | **0.282** | 0.228 | 0.234 | 0.257 |
+ |TERRaClassification | Average Precision | 0.519 | 0.563 | 0.574 | 0.551 | 0.550 | **0.584** |
+ 
+ |Model Name | Metric | rubert-tiny2 | [rubert-tiny-turbo](https://huggingface.co/sergeyzh/rubert-tiny-turbo) | rubert-tiny-lite | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large |
+ |:----------------------------------|:--------------------|----------------:|------------------:|------------------:|----------------------:|----------------------:|---------------------:|
+ |Classification | Accuracy | 0.514 | 0.535 | 0.536 | 0.551 | 0.561 | **0.588** |
+ |Clustering | V-measure | 0.412 | 0.496 | 0.520 | 0.513 | 0.503 | **0.525** |
+ |MultiLabelClassification | Accuracy | 0.294 | 0.317 | 0.344 | 0.314 | 0.329 | **0.353** |
+ |PairClassification | Average Precision | 0.519 | 0.563 | 0.574 | 0.551 | 0.550 | **0.584** |
+ |Reranking | MAP@10 | 0.461 | 0.622 | 0.631 | 0.715 | 0.720 | **0.756** |
+ |Retrieval | NDCG@10 | 0.124 | 0.515 | 0.564 | 0.697 | 0.699 | **0.774** |
+ |STS | Pearson correlation | 0.694 | 0.787 | 0.799 | 0.781 | 0.796 | **0.831** |
+ |Average | Average | 0.431 | 0.548 | 0.567 | 0.588 | 0.594 | **0.630** |
+ 
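
In the usage example above, `model.similarity` returns pairwise cosine similarity; because the pipeline L2-normalizes its embeddings, this reduces to a dot product. A small numpy sketch of the same computation on toy vectors (not actual model embeddings):

```python
import numpy as np

def cosine_similarity_matrix(emb: np.ndarray) -> np.ndarray:
    # Normalize each row to unit length, then take all pairwise dot products.
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return unit @ unit.T

# Toy 2-d "embeddings": two orthogonal vectors and a 45-degree mix.
emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sim = cosine_similarity_matrix(emb)
print(np.round(sim, 3))  # diagonal is 1.0; sim[0, 2] ≈ 0.707
```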
config.json ADDED
@@ -0,0 +1,27 @@
+ {
+   "_name_or_path": ".sergeyzh/rubert-tiny-lite",
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "emb_size": 256,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 256,
+   "initializer_range": 0.02,
+   "intermediate_size": 384,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 3,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.46.3",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 83828
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1deef831e2634e8f074e2520eebc18d4308e72ca74adccca1ffc23beb2554cd3
+ size 92174712
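
As a sanity check, the parameter count implied by `config.json` above can be recomputed and compared against the `model.safetensors` size (float32 weights, 4 bytes per parameter; the few leftover kilobytes are the safetensors header). A rough sketch, assuming the standard Hugging Face `BertModel` layout (embeddings, `num_hidden_layers` encoder layers, and a pooler):

```python
# Dimensions taken from config.json:
# vocab, hidden, intermediate, layers, positions, token types.
V, H, I, L, P, T = 83828, 256, 384, 3, 512, 2

embeddings = V * H + P * H + T * H + 2 * H  # word/position/type embeddings + LayerNorm
per_layer = (
    4 * (H * H + H)   # Q, K, V and attention-output projections (weights + biases)
    + 2 * H           # attention LayerNorm
    + (H * I + I)     # FFN up-projection
    + (I * H + H)     # FFN down-projection
    + 2 * H           # FFN LayerNorm
)
pooler = H * H + H

params = embeddings + L * per_layer + pooler
print(params, params * 4)  # ~23.0M parameters, ~92 MB in float32
```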
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,64 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": false,
+   "mask_token": "[MASK]",
+   "max_length": 512,
+   "model_max_length": 2048,
+   "never_split": null,
+   "pad_to_multiple_of": null,
+   "pad_token": "[PAD]",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "[SEP]",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff