---
language:
- ru
tags:
- sentence-similarity
- text-classification
datasets:
- merionum/ru_paraphraser
- RuPAWS
---

This is a cross-encoder model trained to predict the semantic equivalence of two Russian sentences.

It classifies text pairs as paraphrases (class 1) or non-paraphrases (class 0). Its scores can also be used as a content preservation metric for paraphrasing or text style transfer.
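A minimal inference sketch with the `transformers` library (hedged: the checkpoint id below is a placeholder, since this card does not name the repository, and we assume logit index 1 corresponds to the paraphrase class):

```python
# Hypothetical usage sketch: the checkpoint id is a placeholder, and we assume
# logit index 1 is the "paraphrase" class (1) described above.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "path/to/this-checkpoint"  # placeholder: substitute the actual repo id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def paraphrase_probability(text1: str, text2: str) -> float:
    """Return P(class 1), i.e. the probability that the pair is a paraphrase."""
    inputs = tokenizer(text1, text2, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

print(paraphrase_probability("Он пришёл домой.", "Он вернулся домой."))
```

The softmax over the two logits yields a paraphrase probability that can serve directly as the content preservation score mentioned above.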

It is a [sberbank-ai/ruRoberta-large](https://huggingface.co/sberbank-ai/ruRoberta-large) model fine-tuned on the union of 3 datasets:
1. `RuPAWS`: https://github.com/ivkrotova/rupaws_dataset, built from Quora Question Pairs (QQP) and Wikipedia sentence pairs;
2. `ru_paraphraser`: https://huggingface.co/merionum/ru_paraphraser;
3. Results of the manual check of content preservation for the [RUSSE-2022](https://www.dialog-21.ru/media/5755/dementievadplusetal105.pdf) text detoxification dataset collection (`content_5.tsv`).

The task was formulated as binary classification: whether the two sentences have the same meaning (1) or different meanings (0).

The table below shows the training dataset sizes after duplication (including both `text1 + text2` and `text2 + text1` orders of each pair):

source \ label | 0 | 1
-- | -- | --
detox | 1412 | 3843
paraphraser | 5539 | 1688
rupaws_qqp | 1112 | 792
rupaws_wiki | 3526 | 2166
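The duplication step above can be sketched in plain Python (the function name is illustrative, not taken from the training code):

```python
# Sketch of the pair-duplication step: each labeled pair (text1, text2, label)
# is kept in both orders, doubling the training set while keeping labels intact.
def duplicate_pairs(pairs):
    """pairs: list of (text1, text2, label) tuples."""
    out = []
    for t1, t2, y in pairs:
        out.append((t1, t2, y))
        out.append((t2, t1, y))
    return out

data = [("мама мыла раму", "раму мыла мама", 1)]
print(duplicate_pairs(data))
```

Since semantic equivalence is symmetric, this augmentation also discourages the model from learning an order-dependent decision rule.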

The model was trained with the Adam optimizer and the following hyperparameters:

```
learning_rate = 1e-5
batch_size = 8
gradient_accumulation_steps = 4
n_epochs = 3
max_grad_norm = 1.0
```
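A hedged sketch of how these values could map onto `transformers` `TrainingArguments` (the card does not state which training code was used, and the `Trainer` default optimizer is AdamW rather than plain Adam):

```python
# Illustrative config fragment only: the card lists these hyperparameter values
# but not the training script; output_dir is a hypothetical path.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="checkpoints",        # hypothetical path
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,   # effective batch size: 8 * 4 = 32
    num_train_epochs=3,
    max_grad_norm=1.0,
)
```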

After training, the model had the following ROC AUC scores on the test sets:

set | ROC AUC
-- | --
detox | 0.857112
paraphraser | 0.858465
rupaws_qqp | 0.859195
rupaws_wiki | 0.906121
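For reference, ROC AUC can be computed from first principles as the fraction of (positive, negative) example pairs that the score ranks correctly; the numbers below are toy data, not the model's outputs:

```python
# Toy illustration of the ROC AUC metric reported in the table above,
# computed as the pairwise ranking accuracy (ties count as half a win).
def roc_auc(y_true, y_score):
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0, 0, 1, 1, 1, 0], [0.1, 0.4, 0.8, 0.9, 0.35, 0.2]))  # 8 of 9 pairs ranked correctly
```

Because ROC AUC depends only on the ranking of scores, it evaluates the model's paraphrase probabilities without committing to a fixed classification threshold.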