---
language:
- ru
tags:
- sentence-similarity
- text-classification
datasets:
- merionum/ru_paraphraser
- RuPAWS
---

This is a cross-encoder model trained to predict semantic equivalence of two Russian sentences.

It classifies text pairs as paraphrases (class 1) or non-paraphrases (class 0). Its scores can be used as a metric of content preservation for paraphrasing or text style transfer.
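
Scoring a pair could look like the sketch below. Note that `MODEL_ID` is a placeholder, since this card does not state the model's Hub id, and the helper names are illustrative, not part of the released code:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "<this-model's-hub-id>"  # placeholder: fill in the actual Hub id

def paraphrase_probability(logits: torch.Tensor) -> float:
    """Probability of class 1 (paraphrase) from the 2-class logits."""
    return torch.softmax(logits, dim=-1)[0, 1].item()

def score_pair(text1: str, text2: str, model, tokenizer) -> float:
    """Score a sentence pair; higher means more likely paraphrases."""
    inputs = tokenizer(text1, text2, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    return paraphrase_probability(logits)
```

With `tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)` and `model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)`, `score_pair("...", "...", model, tokenizer)` returns the content-preservation score in [0, 1].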
16 |
+
|
17 |
+
It is a [sberbank-ai/ruRoberta-large](https://huggingface.co/sberbank-ai/ruRoberta-large) model fine-tuned on a union of 3 datasets:
|
18 |
+
1. `RuPAWS`: https://github.com/ivkrotova/rupaws_dataset based on Quora and QQP;
|
19 |
+
2. `ru_paraphraser`: https://huggingface.co/merionum/ru_paraphraser;
|
20 |
+
3. Results of the manual check of content preservation for the [RUSSE-2022](https://www.dialog-21.ru/media/5755/dementievadplusetal105.pdf) text detoxification dataset collection (`content_5.tsv`).
|
21 |
+
|
22 |
+
The task was formulated as binary classification: whether the two sentences have the same meaning (1) or different (0).
|
23 |
+
|
24 |
+
The table shows the training dataset size after duplication (joining `text1 + text2` and `text2 + text1` pairs):
|
25 |
+
|
26 |
+
source \ label | 0 | 1
|
27 |
+
-- | -- | --
|
28 |
+
detox | 1412| 3843
|
29 |
+
paraphraser |5539 | 1688
|
30 |
+
rupaws_qqp |1112 | 792
|
31 |
+
rupaws_wiki |3526 | 2166
|
32 |
+
|
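
The duplication step can be sketched as a simple symmetrization of the labeled pairs (a hypothetical helper, not the authors' actual preprocessing code):

```python
def symmetrize(pairs):
    """Duplicate each labeled pair in both orders: (a, b, y) and (b, a, y).

    `pairs` is an iterable of (text1, text2, label) tuples; the result is
    twice as long, so the cross-encoder sees each pair in both directions.
    """
    doubled = []
    for text1, text2, label in pairs:
        doubled.append((text1, text2, label))
        doubled.append((text2, text1, label))
    return doubled
```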

The model was trained with the Adam optimizer and the following hyperparameters:

```
learning_rate = 1e-5
batch_size = 8
gradient_accumulation_steps = 4
n_epochs = 3
max_grad_norm = 1.0
```

After training, the model had the following ROC AUC scores on the test sets:

set | ROC AUC
-- | --
detox | 0.857112
paraphraser | 0.858465
rupaws_qqp | 0.859195
rupaws_wiki | 0.906121
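
ROC AUC treats the model's class-1 probability as a ranking score over positive and negative pairs. A minimal sketch of how such a score is computed with scikit-learn, on toy data rather than the actual test sets:

```python
from sklearn.metrics import roc_auc_score

# Toy illustration: gold labels and model probabilities for class 1 (paraphrase).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Fraction of (negative, positive) pairs ranked in the correct order.
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75 for this toy example
```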