cointegrated committed on
Commit
4f6879b
1 Parent(s): 5de60ad

Create README.md

  1. README.md +49 -0
README.md ADDED
@@ -0,0 +1,49 @@
+ ---
+ language:
+ - ru
+ tags:
+ - sentence-similarity
+ - text-classification
+ datasets:
+ - merionum/ru_paraphraser
+ - RuPAWS
+ ---
+
+
+ This is a cross-encoder model trained to predict the semantic equivalence of two Russian sentences.
+
+ It classifies text pairs as paraphrases (class 1) or non-paraphrases (class 0). Its scores can be used as a measure of content preservation in paraphrasing or text style transfer.
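As a minimal sketch of how such a score might be derived, assuming the classification head emits two logits per pair ordered as (class 0, class 1), the snippet below turns them into a paraphrase probability with a softmax. Averaging the two input orders is an assumption made here for illustration, not something the model card prescribes. The function names and the example logits are hypothetical; in practice the logits would come from running the model on `(text1, text2)` and `(text2, text1)`.

```python
import math


def paraphrase_probability(logits):
    """Softmax probability of class 1 (paraphrase) from [logit_0, logit_1]."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    return exps[1] / sum(exps)


def symmetric_score(logits_forward, logits_backward):
    """Average the paraphrase probability over both input orders (an assumption)."""
    return 0.5 * (
        paraphrase_probability(logits_forward)
        + paraphrase_probability(logits_backward)
    )


# Hypothetical logits, standing in for real model outputs.
example_forward = [-1.2, 2.3]   # model(text1, text2)
example_backward = [-0.8, 1.9]  # model(text2, text1)

score = symmetric_score(example_forward, example_backward)
print(f"content preservation score: {score:.3f}")
```

A score close to 1 would suggest that the second sentence preserves the meaning of the first, and a score close to 0 that it does not.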
+
+ It is a [sberbank-ai/ruRoberta-large](https://huggingface.co/sberbank-ai/ruRoberta-large) model fine-tuned on the union of three datasets:
+ 1. `RuPAWS`: https://github.com/ivkrotova/rupaws_dataset, based on QQP (Quora Question Pairs) and Wikipedia;
+ 2. `ru_paraphraser`: https://huggingface.co/merionum/ru_paraphraser;
+ 3. The results of a manual check of content preservation for the [RUSSE-2022](https://www.dialog-21.ru/media/5755/dementievadplusetal105.pdf) text detoxification dataset collection (`content_5.tsv`).
+
+ The task was formulated as binary classification: whether the two sentences have the same meaning (1) or different meanings (0).
+
+ The table shows the size of the training set after duplication, that is, after including every pair in both orders (`text1 + text2` and `text2 + text1`):
+
+ source \ label | 0 | 1
+ -- | -- | --
+ detox | 1412 | 3843
+ paraphraser | 5539 | 1688
+ rupaws_qqp | 1112 | 792
+ rupaws_wiki | 3526 | 2166
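The duplication step mentioned above the table can be sketched as follows. This is an illustration only: the helper name `add_swapped_pairs`, the `(text1, text2, label)` tuple layout, and the toy sentences are all assumptions, not taken from the training code or the datasets.

```python
def add_swapped_pairs(rows):
    """Return every (text1, text2, label) row plus its swapped (text2, text1, label) copy."""
    augmented = []
    for text1, text2, label in rows:
        augmented.append((text1, text2, label))
        # Paraphrase is a symmetric relation, so the label stays the same.
        augmented.append((text2, text1, label))
    return augmented


# Made-up toy rows, not drawn from any of the datasets.
toy_rows = [
    ("Он пришёл домой.", "Он вернулся домой.", 1),
    ("Он пришёл домой.", "Она ушла из дома.", 0),
]

augmented_rows = add_swapped_pairs(toy_rows)
print(len(augmented_rows))
```

Training on both orders encourages the cross-encoder to give similar scores regardless of which sentence comes first.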
+
+ The model was trained with the Adam optimizer and the following hyperparameters:
+
+ ```
+ learning_rate = 1e-5
+ batch_size = 8
+ gradient_accumulation_steps = 4
+ n_epochs = 3
+ max_grad_norm = 1.0
+ ```
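To make these settings concrete, the arithmetic below estimates the effective batch size and the number of optimizer updates. It is a rough sketch under two assumptions that the README does not confirm: that the counts in the training-size table are exactly the examples seen in each epoch, and that a final partial accumulation window still triggers an update.

```python
import math

# Hyperparameters as listed in the README.
batch_size = 8
gradient_accumulation_steps = 4
n_epochs = 3

# Training-set sizes from the README table: source -> (label 0, label 1).
train_counts = {
    "detox": (1412, 3843),
    "paraphraser": (5539, 1688),
    "rupaws_qqp": (1112, 792),
    "rupaws_wiki": (3526, 2166),
}

n_train = sum(neg + pos for neg, pos in train_counts.values())

# Gradients are accumulated over several mini-batches before each update.
effective_batch_size = batch_size * gradient_accumulation_steps

# Assumption: the last, incomplete accumulation window still produces an update.
updates_per_epoch = math.ceil(n_train / effective_batch_size)
total_updates = updates_per_epoch * n_epochs

print(n_train, effective_batch_size, updates_per_epoch, total_updates)
```

Under these assumptions, each optimizer update is based on 32 text pairs rather than 8.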
+
+ After training, the model had the following ROC AUC scores on the test sets:
+
+ set | ROC AUC
+ -- | --
+ detox | 0.857112
+ paraphraser | 0.858465
+ rupaws_qqp | 0.859195
+ rupaws_wiki | 0.906121
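For readers unfamiliar with the metric, ROC AUC is the probability that a randomly chosen paraphrase pair gets a higher score than a randomly chosen non-paraphrase pair. The pure-Python sketch below computes it directly from that definition. It is not the evaluation code behind the numbers above (the README does not say how they were computed), and the labels and scores are made-up toy values.

```python
def roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one example of each class")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: 1 = paraphrase, 0 = non-paraphrase; scores stand in for class-1 probabilities.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.2, 0.4, 0.6]

auc = roc_auc(toy_labels, toy_scores)
print(auc)
```

Because ROC AUC depends only on how the scores rank the pairs, it does not require choosing a decision threshold. The quadratic loop is fine for illustration, but a library implementation would be preferable for test sets of realistic size.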