# robertuito-base-uncased

# RoBERTuito
## A pre-trained language model for social media text in Spanish

[**READ THE FULL PAPER**](https://arxiv.org/abs/2111.09453)

[GitHub repository](https://github.com/pysentimiento/robertuito)

*RoBERTuito* is a pre-trained language model for user-generated content in Spanish, trained following RoBERTa guidelines on 500 million tweets. *RoBERTuito* comes in three flavors: cased, uncased, and uncased+deaccented.

We tested *RoBERTuito* on a benchmark of tasks involving user-generated text in Spanish. It outperforms other pre-trained language models for Spanish, such as *BETO*, *BERTin*, and *RoBERTa-BNE*. The four tasks selected for evaluation were: hate speech detection (SemEval 2019 Task 5, the HatEval dataset), sentiment analysis and emotion analysis (the TASS 2020 datasets), and irony detection (the IrosVa 2019 dataset).

| model              | hate speech   | sentiment analysis | emotion analysis | irony detection | score  |
|:-------------------|:--------------|:-------------------|:-----------------|:----------------|-------:|
| robertuito-uncased | 0.801 ± 0.010 | 0.707 ± 0.004      | 0.551 ± 0.011    | 0.736 ± 0.008   | 0.6987 |
| robertuito-deacc   | 0.798 ± 0.008 | 0.702 ± 0.004      | 0.543 ± 0.015    | 0.740 ± 0.006   | 0.6958 |
| robertuito-cased   | 0.790 ± 0.012 | 0.701 ± 0.012      | 0.519 ± 0.032    | 0.719 ± 0.023   | 0.6822 |
| roberta-bne        | 0.766 ± 0.015 | 0.669 ± 0.006      | 0.533 ± 0.011    | 0.723 ± 0.017   | 0.6726 |
| bertin             | 0.767 ± 0.005 | 0.665 ± 0.003      | 0.518 ± 0.012    | 0.716 ± 0.008   | 0.6666 |
| beto-cased         | 0.768 ± 0.012 | 0.665 ± 0.004      | 0.521 ± 0.012    | 0.706 ± 0.007   | 0.6651 |
| beto-uncased       | 0.757 ± 0.012 | 0.649 ± 0.005      | 0.521 ± 0.006    | 0.702 ± 0.008   | 0.6571 |

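The `score` column appears to be the plain average of the four task scores. The quick check below (the averaging rule is our assumption, not something stated in the table itself) reproduces it from the rounded values:

```python
# Sanity check (assumption): the "score" column is the mean of the
# four per-task metrics reported for each RoBERTuito flavor.
rows = {
    "robertuito-uncased": ([0.801, 0.707, 0.551, 0.736], 0.6987),
    "robertuito-deacc":   ([0.798, 0.702, 0.543, 0.740], 0.6958),
    "robertuito-cased":   ([0.790, 0.701, 0.519, 0.719], 0.6822),
}

for name, (tasks, reported) in rows.items():
    avg = sum(tasks) / len(tasks)
    # Averaging already-rounded table cells should agree to ~4 decimals.
    assert abs(avg - reported) < 5e-4, (name, avg, reported)
    print(f"{name}: mean={avg:.4f} reported={reported}")
```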
We release the pre-trained models on the Hugging Face model hub:

- [RoBERTuito uncased](https://huggingface.co/pysentimiento/robertuito-base-uncased)
- [RoBERTuito cased](https://huggingface.co/pysentimiento/robertuito-base-cased)
- [RoBERTuito deacc](https://huggingface.co/pysentimiento/robertuito-base-deacc)

## Masked LM

```
Este es un día<mask>
```

Don't put a space between `día` and `<mask>`.

## Usage

**IMPORTANT -- READ THIS FIRST**

*RoBERTuito* is not yet fully integrated into `huggingface/transformers`. To use it, first install `pysentimiento`:

```bash
pip install pysentimiento
```

and preprocess text using `pysentimiento.preprocessing.preprocess_tweet` before feeding it into the tokenizer:

```python
from transformers import AutoTokenizer
from pysentimiento.preprocessing import preprocess_tweet

tokenizer = AutoTokenizer.from_pretrained('pysentimiento/robertuito-base-cased')

text = "Esto es un tweet estoy usando #Robertuito @pysentimiento 🤣"
preprocessed_text = preprocess_tweet(text)

tokenizer.tokenize(preprocessed_text)
# ['<s>','▁Esto','▁es','▁un','▁tweet','▁estoy','▁usando','▁','▁hashtag','▁','▁ro','bert','uito','▁@usuario','▁','▁emoji','▁cara','▁revolviéndose','▁de','▁la','▁risa','▁emoji','</s>']
```

We are working on integrating this preprocessing step into a tokenizer within the `transformers` library.

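To give a feel for what `preprocess_tweet` does, here is a deliberately simplified sketch of three of its normalizations (user mentions, hashtags, and emojis), with replacement strings chosen to mirror the tokenizer output above. The real function covers many more cases, so treat this only as an illustration, not the actual implementation:

```python
import re

# Toy lookup: pysentimiento spells emojis out in Spanish, wrapped in
# "emoji ... emoji" markers (see the tokenizer output above).
EMOJI_NAMES = {"🤣": "cara revolviéndose de la risa"}

def toy_preprocess(text: str) -> str:
    """Simplified illustration of preprocess_tweet-style normalization."""
    # Replace user mentions with the generic token @usuario
    text = re.sub(r"@\w+", "@usuario", text)
    # Replace '#Tag' with 'hashtag tag'
    text = re.sub(r"#(\w+)", lambda m: f"hashtag {m.group(1).lower()}", text)
    # Spell out known emojis between 'emoji ... emoji' markers
    for emo, name in EMOJI_NAMES.items():
        text = text.replace(emo, f"emoji {name} emoji")
    return text

print(toy_preprocess("Esto es un tweet estoy usando #Robertuito @pysentimiento 🤣"))
# -> Esto es un tweet estoy usando hashtag robertuito @usuario emoji cara revolviéndose de la risa emoji
```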
## Development

### Installing

We use `python==3.7` and `poetry` to manage dependencies:

```bash
pip install poetry
poetry install
```

### Benchmarking

To run benchmarks:

```bash
python bin/run_benchmark.py <model_name> --times 5 --output_path <output_path>
```

Check [RUN_BENCHMARKS](RUN_BENCHMARKS.md) for all experiments.

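Each benchmark is repeated several times (`--times 5` above), and the results table reports mean ± standard deviation across runs. A minimal sketch of that aggregation, using made-up run scores for illustration:

```python
from statistics import mean, stdev

# Hypothetical macro-F1 scores from 5 repeated runs of one task
runs = [0.795, 0.803, 0.810, 0.798, 0.799]

m, s = mean(runs), stdev(runs)  # sample standard deviation
print(f"{m:.3f} ± {s:.3f}")
# prints: 0.801 ± 0.006 (formatted like the rows in the results table)
```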
### Smoke test

To smoke-test the benchmark, run:

```bash
./smoke_test.sh
```

## Citation

If you use *RoBERTuito*, please cite our paper:

```bibtex
@misc{perez2021robertuito,
      title={RoBERTuito: a pre-trained language model for social media text in Spanish},
      author={Juan Manuel Pérez and Damián A. Furman and Laura Alonso Alemany and Franco Luque},
      year={2021},
      eprint={2111.09453},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```