---
language:
  - ru
pipeline_tag: sentence-similarity
tags:
  - russian
  - pretraining
  - embeddings
  - tiny
  - feature-extraction
  - sentence-similarity
  - sentence-transformers
  - transformers
  - mteb
datasets:
  - IlyaGusev/gazeta
  - zloelias/lenta-ru
  - HuggingFaceFW/fineweb-2
license: mit
---

A fast BERT model for Russian with an embedding size of 256 and a context length of 512. It was obtained by sequential distillation of sergeyzh/rubert-tiny-turbo and BAAI/bge-m3. At comparable quality it is faster than rubert-tiny-turbo by roughly 1.4x on CPU and 1.2x on GPU.
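The claimed speedup is easy to sanity-check. A minimal timing sketch (the workload below is synthetic, and the exact ratio will vary with hardware, batch size, and sentence length):

```python
# Rough CPU timing comparison between rubert-tiny-lite and
# rubert-tiny-turbo; treat the numbers as illustrative only.
import time

from sentence_transformers import SentenceTransformer

sentences = ["пример короткого предложения"] * 512  # synthetic workload

for name in ("sergeyzh/rubert-tiny-lite", "sergeyzh/rubert-tiny-turbo"):
    model = SentenceTransformer(name, device="cpu")
    model.encode(sentences[:32])  # warm-up pass
    start = time.perf_counter()
    model.encode(sentences, batch_size=32)
    print(f"{name}: {time.perf_counter() - start:.2f} s")
```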

## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sergeyzh/rubert-tiny-lite')

sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)  # shape: (3, 256)

# Pairwise cosine similarity matrix between the three sentences
print(model.similarity(embeddings, embeddings))
```
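The same embeddings support simple semantic search. A minimal sketch using only the calls shown above (the corpus and query strings are illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sergeyzh/rubert-tiny-lite')

# Illustrative corpus; any list of Russian (or mixed) strings works.
corpus = [
    "Кошка спит на диване.",
    "Курс доллара снова вырос.",
    "Сборная выиграла матч со счётом 2:1.",
]
query = "новости спорта"

corpus_emb = model.encode(corpus)
query_emb = model.encode([query])

# Rank corpus sentences by cosine similarity to the query.
scores = model.similarity(query_emb, corpus_emb)[0]
for idx in scores.argsort(descending=True).tolist():
    print(f"{float(scores[idx]):.3f}  {corpus[idx]}")
```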

## Metrics

Model scores on the encodechka benchmark (STS: semantic textual similarity, PI: paraphrase identification, NLI: natural language inference, SA: sentiment analysis, TI: toxicity identification):

| model | STS | PI | NLI | SA | TI |
|:------|----:|---:|----:|---:|---:|
| BAAI/bge-m3 | 0.864 | 0.749 | 0.510 | 0.819 | 0.973 |
| intfloat/multilingual-e5-large | 0.862 | 0.727 | 0.473 | 0.810 | 0.979 |
| sergeyzh/rubert-tiny-lite | 0.839 | 0.712 | 0.488 | 0.788 | 0.949 |
| intfloat/multilingual-e5-base | 0.835 | 0.704 | 0.459 | 0.796 | 0.964 |
| sergeyzh/rubert-tiny-turbo | 0.828 | 0.722 | 0.476 | 0.787 | 0.955 |
| intfloat/multilingual-e5-small | 0.822 | 0.714 | 0.457 | 0.758 | 0.957 |
| cointegrated/rubert-tiny2 | 0.750 | 0.651 | 0.417 | 0.737 | 0.937 |

Model scores on the ruMTEB benchmark:

| Model Name | Metric | rubert-tiny2 | rubert-tiny-turbo | rubert-tiny-lite | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large |
|:-----------|:-------|-------------:|------------------:|-----------------:|----------------------:|---------------------:|----------------------:|
| CEDRClassification | Accuracy | 0.369 | 0.390 | 0.407 | 0.401 | 0.423 | 0.448 |
| GeoreviewClassification | Accuracy | 0.396 | 0.414 | 0.423 | 0.447 | 0.461 | 0.497 |
| GeoreviewClusteringP2P | V-measure | 0.442 | 0.597 | 0.611 | 0.586 | 0.545 | 0.605 |
| HeadlineClassification | Accuracy | 0.742 | 0.686 | 0.652 | 0.732 | 0.757 | 0.758 |
| InappropriatenessClassification | Accuracy | 0.586 | 0.591 | 0.588 | 0.592 | 0.588 | 0.616 |
| KinopoiskClassification | Accuracy | 0.491 | 0.505 | 0.507 | 0.500 | 0.509 | 0.566 |
| RiaNewsRetrieval | NDCG@10 | 0.140 | 0.513 | 0.617 | 0.700 | 0.702 | 0.807 |
| RuBQReranking | MAP@10 | 0.461 | 0.622 | 0.631 | 0.715 | 0.720 | 0.756 |
| RuBQRetrieval | NDCG@10 | 0.109 | 0.517 | 0.511 | 0.685 | 0.696 | 0.741 |
| RuReviewsClassification | Accuracy | 0.570 | 0.607 | 0.615 | 0.612 | 0.630 | 0.653 |
| RuSTSBenchmarkSTS | Pearson correlation | 0.694 | 0.787 | 0.799 | 0.781 | 0.796 | 0.831 |
| RuSciBenchGRNTIClassification | Accuracy | 0.456 | 0.529 | 0.544 | 0.550 | 0.563 | 0.582 |
| RuSciBenchGRNTIClusteringP2P | V-measure | 0.414 | 0.481 | 0.510 | 0.511 | 0.516 | 0.520 |
| RuSciBenchOECDClassification | Accuracy | 0.355 | 0.415 | 0.424 | 0.427 | 0.423 | 0.445 |
| RuSciBenchOECDClusteringP2P | V-measure | 0.381 | 0.411 | 0.438 | 0.443 | 0.448 | 0.450 |
| SensitiveTopicsClassification | Accuracy | 0.220 | 0.244 | 0.282 | 0.228 | 0.234 | 0.257 |
| TERRaClassification | Average Precision | 0.519 | 0.563 | 0.574 | 0.551 | 0.550 | 0.584 |
The same results averaged by task type:

| Task Type | Metric | rubert-tiny2 | rubert-tiny-turbo | rubert-tiny-lite | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large |
|:----------|:-------|-------------:|------------------:|-----------------:|----------------------:|---------------------:|----------------------:|
| Classification | Accuracy | 0.514 | 0.535 | 0.536 | 0.551 | 0.561 | 0.588 |
| Clustering | V-measure | 0.412 | 0.496 | 0.520 | 0.513 | 0.503 | 0.525 |
| MultiLabelClassification | Accuracy | 0.294 | 0.317 | 0.344 | 0.314 | 0.329 | 0.353 |
| PairClassification | Average Precision | 0.519 | 0.563 | 0.574 | 0.551 | 0.550 | 0.584 |
| Reranking | MAP@10 | 0.461 | 0.622 | 0.631 | 0.715 | 0.720 | 0.756 |
| Retrieval | NDCG@10 | 0.124 | 0.515 | 0.564 | 0.697 | 0.699 | 0.774 |
| STS | Pearson correlation | 0.694 | 0.787 | 0.799 | 0.781 | 0.796 | 0.831 |
| Average | Average | 0.431 | 0.548 | 0.567 | 0.588 | 0.594 | 0.630 |
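For reference, the STS rows above are the Pearson correlation between model cosine similarities and human similarity judgments. A toy illustration of that computation (the sentence pairs and gold scores below are made up; the official numbers come from the benchmark harnesses, not this snippet):

```python
# Toy STS-style evaluation: Pearson correlation between model
# cosine similarities and (hypothetical) human scores.
from scipy.stats import pearsonr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sergeyzh/rubert-tiny-lite')

pairs = [
    ("кошка спит", "кот дремлет"),
    ("кошка спит", "доллар подорожал"),
    ("идёт дождь", "на улице льёт"),
    ("идёт дождь", "светит солнце"),
]
gold = [0.9, 0.1, 0.85, 0.2]  # hypothetical human judgments

emb_a = model.encode([a for a, _ in pairs])
emb_b = model.encode([b for _, b in pairs])
cos = model.similarity(emb_a, emb_b).diag()  # element-wise cosine

print(pearsonr(cos.tolist(), gold))
```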