README.md · s-nlp/ruRoberta-large-paraphrase-v1 at 4f6879b3d9e1f3a34fdc7244424e8b34bda2f640


language:
  - ru
tags:
  - sentence-similarity
  - text-classification
datasets:
  - merionum/ru_paraphraser
  - RuPAWS

This is a cross-encoder model trained to predict semantic equivalence of two Russian sentences.

It classifies text pairs as paraphrases (class 1) or non-paraphrases (class 0). Its scores can be used as a metric of content preservation for paraphrasing or text style transfer.
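A minimal usage sketch follows. It assumes that the checkpoint is published under the id `s-nlp/ruRoberta-large-paraphrase-v1` (taken from the page title), that it loads with the standard `transformers` sequence-classification classes, and that it outputs two logits with index 1 as the paraphrase class. The helper names `paraphrase_probability` and `score_pair` are illustrative and not part of any library.

```python
import math


def paraphrase_probability(logits):
    """Softmax over the two class logits; returns P(class 1 = paraphrase)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)


def score_pair(text1, text2, model_name="s-nlp/ruRoberta-large-paraphrase-v1"):
    """Score one sentence pair; the model id is an assumption from the page title."""
    # Heavy dependencies are imported lazily so the helper above works without them.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()
    # A cross-encoder reads both sentences in a single input sequence.
    inputs = tokenizer(text1, text2, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return paraphrase_probability(logits)


# Example call (downloads the model on first use):
# score_pair("Сегодня на улице хорошая погода", "Сегодня на улице отвратительная погода")
```

For scoring many pairs, it would be more efficient to load the tokenizer and model once and reuse them rather than reloading inside the function.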

It is a sberbank-ai/ruRoberta-large model fine-tuned on a union of 3 datasets:

- RuPAWS (https://github.com/ivkrotova/rupaws_dataset), with pairs derived from QQP (Quora Question Pairs) and Wikipedia;
- ru_paraphraser (https://huggingface.co/merionum/ru_paraphraser);
- results of the manual content-preservation check performed while collecting the RUSSE-2022 text detoxification dataset (content_5.tsv).

The task was formulated as binary classification: do the two sentences have the same meaning (1) or different meanings (0)?

The table shows the training dataset size after duplication, where each pair is included in both orders (text1 + text2 and text2 + text1):

| source \ label | 0 | 1 |
|---|---|---|
| detox | 1412 | 3843 |
| paraphraser | 5539 | 1688 |
| rupaws_qqp | 1112 | 792 |
| rupaws_wiki | 3526 | 2166 |
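The duplication step described above can be sketched as follows. This is an illustration only: the function name `duplicate_symmetric` and the `(text1, text2, label)` tuple layout are assumptions, not the authors' preprocessing code.

```python
def duplicate_symmetric(pairs):
    """Return every (text1, text2, label) example in both sentence orders."""
    out = []
    for text1, text2, label in pairs:
        out.append((text1, text2, label))
        out.append((text2, text1, label))  # same label, swapped order
    return out


# Toy examples with placeholder sentences.
examples = [("a", "b", 1), ("c", "d", 0)]
augmented = duplicate_symmetric(examples)
```

Training on both orders encourages the cross-encoder to give similar scores regardless of which sentence comes first.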

The model was trained with the Adam optimizer and the following hyperparameters:

- `learning_rate = 1e-5`
- `batch_size = 8`
- `gradient_accumulation_steps = 4`
- `n_epochs = 3`
- `max_grad_norm = 1.0`
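To show how `batch_size` and `gradient_accumulation_steps` interact, the sketch below derives the effective batch size and an approximate number of optimizer updates. It assumes that the counts in the training-size table are the complete training set and that the last partial batch is kept; the real training script may differ.

```python
import math

batch_size = 8
gradient_accumulation_steps = 4
n_epochs = 3

# Counts copied from the training-size table (label 0 + label 1 per source).
train_counts = {
    "detox": 1412 + 3843,
    "paraphraser": 5539 + 1688,
    "rupaws_qqp": 1112 + 792,
    "rupaws_wiki": 3526 + 2166,
}
n_examples = sum(train_counts.values())

# Gradients from several small batches are summed before each optimizer step.
effective_batch_size = batch_size * gradient_accumulation_steps
batches_per_epoch = math.ceil(n_examples / batch_size)
updates_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
total_updates = updates_per_epoch * n_epochs

print(effective_batch_size, n_examples, updates_per_epoch, total_updates)
```

Under these assumptions, each optimizer update sees 32 examples.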

After training, the model had the following ROC AUC scores on the test sets:

| set | ROC AUC |
|---|---|
| detox | 0.857112 |
| paraphraser | 0.858465 |
| rupaws_qqp | 0.859195 |
| rupaws_wiki | 0.906121 |
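For readers unfamiliar with the metric, ROC AUC equals the probability that a randomly chosen paraphrase pair receives a higher score than a randomly chosen non-paraphrase pair. The sketch below computes it with a plain pairwise comparison. It is an illustration on made-up scores, not the evaluation code behind the table; in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and model scores, for illustration only.
labels = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6]
auc = roc_auc(labels, scores)  # 3 of 4 positive/negative pairs are ordered correctly
```

The pairwise loop is quadratic in the number of examples, so it suits small checks rather than large test sets.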