# robertuito-base-deacc

# RoBERTuito

## A pre-trained language model for social media text in Spanish

[**READ THE FULL PAPER**](https://arxiv.org/abs/2111.09453)

[Github Repository](https://github.com/pysentimiento/robertuito)
*RoBERTuito* is a pre-trained language model for user-generated content in Spanish, trained following RoBERTa guidelines on 500 million tweets. *RoBERTuito* comes in 3 flavors: cased, uncased, and uncased+deaccented.

We tested *RoBERTuito* on a benchmark of tasks involving user-generated text in Spanish. It outperforms other pre-trained language models for this language, such as *BETO*, *BERTin* and *RoBERTa-BNE*. The four tasks selected for evaluation were: Hate Speech Detection (using the SemEval 2019 Task 5, HatEval dataset), Sentiment and Emotion Analysis (using the TASS 2020 datasets), and Irony Detection (using the IrosVa 2019 dataset).
| model              | hate speech   | sentiment analysis | emotion analysis | irony detection | score  |
|:-------------------|:--------------|:-------------------|:-----------------|:----------------|-------:|
| robertuito-uncased | 0.801 ± 0.010 | 0.707 ± 0.004      | 0.551 ± 0.011    | 0.736 ± 0.008   | 0.6987 |
| robertuito-deacc   | 0.798 ± 0.008 | 0.702 ± 0.004      | 0.543 ± 0.015    | 0.740 ± 0.006   | 0.6958 |
| robertuito-cased   | 0.790 ± 0.012 | 0.701 ± 0.012      | 0.519 ± 0.032    | 0.719 ± 0.023   | 0.6822 |
| roberta-bne        | 0.766 ± 0.015 | 0.669 ± 0.006      | 0.533 ± 0.011    | 0.723 ± 0.017   | 0.6726 |
| bertin             | 0.767 ± 0.005 | 0.665 ± 0.003      | 0.518 ± 0.012    | 0.716 ± 0.008   | 0.6666 |
| beto-cased         | 0.768 ± 0.012 | 0.665 ± 0.004      | 0.521 ± 0.012    | 0.706 ± 0.007   | 0.6651 |
| beto-uncased       | 0.757 ± 0.012 | 0.649 ± 0.005      | 0.521 ± 0.006    | 0.702 ± 0.008   | 0.6571 |
We release the pre-trained models on the huggingface model hub:

- [RoBERTuito uncased](https://huggingface.co/pysentimiento/robertuito-base-uncased)
- [RoBERTuito cased](https://huggingface.co/pysentimiento/robertuito-base-cased)
- [RoBERTuito deacc](https://huggingface.co/pysentimiento/robertuito-base-deacc)
## Masked LM

To test the masked LM, try a prompt like the following:

```
Este es un día<mask>
```

Don't put a space between `día` and `<mask>`.
## Usage

**IMPORTANT -- READ THIS FIRST**

*RoBERTuito* is not yet fully integrated into `huggingface/transformers`. To use it, first install `pysentimiento`:

```bash
pip install pysentimiento
```

and preprocess text using `pysentimiento.preprocessing.preprocess_tweet` before feeding it into the tokenizer:
```python
from transformers import AutoTokenizer
from pysentimiento.preprocessing import preprocess_tweet

tokenizer = AutoTokenizer.from_pretrained('pysentimiento/robertuito-base-cased')

text = "Esto es un tweet estoy usando #Robertuito @pysentimiento 🤣"
preprocessed_text = preprocess_tweet(text)

tokenizer.tokenize(preprocessed_text)
# ['<s>','▁Esto','▁es','▁un','▁tweet','▁estoy','▁usando','▁','▁hashtag','▁','▁ro','bert','uito','▁@usuario','▁','▁emoji','▁cara','▁revolviéndose','▁de','▁la','▁risa','▁emoji','</s>']
```
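As the token list above shows, `preprocess_tweet` anonymizes user mentions to `@usuario`, rewrites hashtags as a `hashtag` marker plus the lowercased hashtag text, and replaces emoji with textual descriptions wrapped in `emoji` tokens. A rough, self-contained sketch of the first two rules, for illustration only (this is not pysentimiento's actual implementation, which also handles emoji, laughter, repeated characters, and more):

```python
import re

def toy_preprocess(text: str) -> str:
    """Simplified illustration of two preprocess_tweet rules:
    mentions -> @usuario, hashtags -> 'hashtag <lowercased text>'."""
    text = re.sub(r"@\w+", "@usuario", text)  # anonymize user mentions
    text = re.sub(r"#(\w+)", lambda m: "hashtag " + m.group(1).lower(), text)
    return text

print(toy_preprocess("Esto es un tweet estoy usando #Robertuito @pysentimiento"))
# Esto es un tweet estoy usando hashtag robertuito @usuario
```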
We are working on integrating this preprocessing step into a tokenizer within the `transformers` library.
## Development

### Installing

We use `python==3.7` and `poetry` to manage dependencies:

```bash
pip install poetry
poetry install
```
### Benchmarking

To run the benchmarks:

```bash
python bin/run_benchmark.py <model_name> --times 5 --output_path <output_path>
```

Check [RUN_BENCHMARKS](RUN_BENCHMARKS.md) for all experiments.
### Smoke test

To test that the benchmark runs end to end:

```bash
./smoke_test.sh
```
## Citation

If you use *RoBERTuito*, please cite our paper:

```bibtex
@misc{perez2021robertuito,
      title={RoBERTuito: a pre-trained language model for social media text in Spanish},
      author={Juan Manuel Pérez and Damián A. Furman and Laura Alonso Alemany and Franco Luque},
      year={2021},
      eprint={2111.09453},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```