ieumermo committed · verified
Commit 9d2fffb · 1 Parent(s): fc987c8

Upload 3 files


1. All copyrights are and remain with the Projeto/LegalNLP project.
2. Copied here because of the error currently occurring in Projeto/LegalNLP:
   YAML Metadata Error: "language" with value "pt-br" is not valid. It must be an ISO 639-1, 639-2 or 639-3 code (two/three letters), or a special value like "code", "multilingual". If you want to use BCP-47 identifiers, you can specify them in language_bcp47.
3. The error occurs in the README.md file, where "language" is set to "pt-br"; apparently the system no longer accepts this, requiring "ptb" (three letters) or "pt" (two letters) instead. A sketch of the fix is shown below.
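
For reference, a minimal front-matter sketch that should satisfy the validator, assuming the Hub accepts an ISO 639-1 code together with the optional `language_bcp47` field mentioned in the error message (the diff below opts for the three-letter form "ptb" instead):

```yaml
---
# ISO 639-1 code (two letters), accepted by the validator
language: pt
# BCP-47 identifier preserved via the field the error message suggests
language_bcp47:
- pt-BR
license: mit
---
```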

.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ w2v_d2v_dbow_size_100_window_15_epochs_20 filter=lfs diff=lfs merge=lfs -text
+ w2v_d2v_dm_size_100_window_15_epochs_20 filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,249 @@
- ---
- license: mit
- ---
+ ---
+ language: ptb
+ license: mit
+ tags:
+ - LegalNLP
+ - NLP
+ - legal field
+ - python
+ - word2vec
+ - doc2vec
+ ---
+
+ # ***LegalNLP*** - Natural Language Processing Methods for the Brazilian Legal Language ⚖️
+
+ ### *LegalNLP*, a Natural Language Processing library for the Brazilian legal language, was born in a partnership between Brazilian researchers and the legal tech [Tikal Tech](https://www.tikal.tech), based in São Paulo, Brazil. Besides containing pre-trained language models for the Brazilian legal language, ***LegalNLP*** provides functions that facilitate the manipulation of legal texts in Portuguese, as well as demonstrations/tutorials to help people in their own work.
+
+ You can access our paper by clicking [**here**](https://arxiv.org/abs/2110.15709).
+
+ If you use our library in your academic work, please cite us as follows:
+
+     @article{polo2021legalnlp,
+       title={LegalNLP--Natural Language Processing methods for the Brazilian Legal Language},
+       author={Polo, Felipe Maia and Mendon{\c{c}}a, Gabriel Caiaffa Floriano and Parreira, Kau{\^e} Capellato J and Gianvechio, Lucka and Cordeiro, Peterson and Ferreira, Jonathan Batista and de Lima, Leticia Maria Paz and Maia, Ant{\^o}nio Carlos do Amaral and Vicente, Renato},
+       journal={arXiv preprint arXiv:2110.15709},
+       year={2021}
+     }
+
+ --------------
+
+ ## Summary
+
+ 0. [Accessing the Language Models](#0)
+ 1. [Introduction / Installing package](#1)
+ 2. [Language Models (Details / How to use)](#2)
+     1. [Word2Vec/Doc2Vec](#2.1)
+ 3. [Demonstrations / Tutorials](#3)
+ 4. [References](#4)
+
+ --------------
+
+ <a name="0"></a>
+ ## 0\. Accessing the Language Models
+
+ All our models can be found [here](https://drive.google.com/drive/folders/1tCccOXPLSEAEUQtcWXvED3YaNJi3p7la?usp=sharing).
+
+ Please contact *[email protected]* if you have any problem accessing the language models.
+
+ --------------
+
+ <a name="1"></a>
+ ## 1\. Introduction / Installing package
+
+ *LegalNLP* is a promising library, given the scarcity of Natural Language Processing resources focused on the Brazilian legal language. It is worth mentioning that our library was made for Python, one of the most well-known programming languages for machine learning.
+
+ You first need to install the `huggingface_hub` library by running the following command in a terminal:
+
+ ```sh
+ $ pip install huggingface_hub
+ ```
+
+ Import `hf_hub_download`:
+
+ ```python
+ from huggingface_hub import hf_hub_download
+ ```
+
+ Then you can download our Word2Vec(SG)/Doc2Vec(DBOW) and Word2Vec(CBOW)/Doc2Vec(DM) models with the following commands:
+
+ ```python
+ w2v_sg_d2v_dbow = hf_hub_download(repo_id="Projeto/LegalNLP", filename="w2v_d2v_dbow_size_100_window_15_epochs_20")
+ w2v_cbow_d2v_dm = hf_hub_download(repo_id="Projeto/LegalNLP", filename="w2v_d2v_dm_size_100_window_15_epochs_20")
+ ```
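+
+ Each call to `hf_hub_download` returns the local path of the downloaded file, so the variables above (`w2v_sg_d2v_dbow`, `w2v_cbow_d2v_dm`) hold paths that the loading snippets below pass directly to Gensim.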
+
+ --------------
+
+ <a name="2"></a>
+ ## 2\. Language Models (Details / How to use)
+
+ <a name="2.1"></a>
+ ### 2.1\. Word2Vec/Doc2Vec
+
+ Our first models for generating vector representations of tokens and texts (embeddings) are variations of the Word2Vec [1, 2] and Doc2Vec [3] methods. In short, Word2Vec methods generate embeddings for tokens that capture the meaning of the various textual elements, based on the contexts in which these elements appear. Doc2Vec methods are extensions/modifications of Word2Vec for generating representations of whole texts.
+
+ Remember to at least lowercase all letters in your texts before using the models; see the sketch below. Please check our paper or the [Gensim page](https://radimrehurek.com/gensim_3.8.3/models/doc2vec.html) for more details. Preferably use Gensim version 3.8.3.
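+
+ As a minimal preprocessing sketch (the `normalize` helper below is our own illustration, not part of LegalNLP; only the lowercasing requirement comes from the paragraph above):
+
+ ```python
+ # Hypothetical helper: lowercase and whitespace-tokenize a text
+ # before querying the models. Only lowercasing is prescribed above.
+ def normalize(text):
+     return text.lower().split()
+
+ normalize("Sentença proferida pelo Juiz")
+ # ['sentença', 'proferida', 'pelo', 'juiz']
+ ```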
+
+ Below is a summary table with some important information about the trained models:
+
+ | Filenames | Doc2Vec | Word2Vec | Size | Window |
+ |:-------------------:|:--------------:|:--------------:|:--------------:|:--------------:|
+ | ```w2v_d2v_dm*``` | Distributed Memory (DM) | Continuous Bag-of-Words (CBOW) | 100, 200, 300 | 15 |
+ | ```w2v_d2v_dbow*``` | Distributed Bag-of-Words (DBOW) | Skip-Gram (SG) | 100, 200, 300 | 15 |
+
+ Here we have made available both models with size 100 and window 15.
+
+ #### Using *Word2Vec*
+
+ Installing Gensim:
+
+ ```python
+ !pip install gensim=='3.8.3'
+ ```
+
+ Loading W2V:
+
+ ```python
+ from gensim.models import KeyedVectors
+
+ # Loading a W2V model and keeping only its word vectors
+ w2v = KeyedVectors.load(w2v_cbow_d2v_dm)
+ w2v = w2v.wv
+ ```
+
+ Viewing the first 10 entries of the 'juiz' vector:
+
+ ```python
+ w2v['juiz'][:10]
+ ```
+
+     array([ 6.570131  , -1.262787  ,  5.156106  , -8.943866  , -5.884408  ,
+            -7.717058  ,  1.8819941 , -8.02803   , -0.66901577,  6.7223144 ],
+           dtype=float32)
+
+ Viewing the closest tokens to 'juiz':
+
+ ```python
+ w2v.most_similar('juiz')
+ ```
+
+     [('juíza', 0.8210258483886719),
+      ('juiza', 0.7306275367736816),
+      ('juíz', 0.691645085811615),
+      ('juízo', 0.6605231165885925),
+      ('magistrado', 0.6213295459747314),
+      ('mmª_juíza', 0.5510469675064087),
+      ('juizo', 0.5494943261146545),
+      ('desembargador', 0.5313084721565247),
+      ('mmjuiz', 0.5277603268623352),
+      ('fabíola_melo_feijão_juíza', 0.5043971538543701)]
+
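+ Cosine similarities can also be computed directly; `similarity` and `n_similarity` are standard Gensim `KeyedVectors` methods, but the example tokens below are our own choice, so the exact values depend on the model and the tokens must be in its vocabulary:
+
+ ```python
+ # Cosine similarity between two single tokens
+ w2v.similarity('juiz', 'magistrado')
+
+ # Cosine similarity between two (lowercased) token sequences;
+ # raises KeyError if any token is out of vocabulary
+ w2v.n_similarity(['recurso', 'especial'], ['agravo', 'regimental'])
+ ```
+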
+ #### Using *Doc2Vec*
+
+ Installing Gensim:
+
+ ```python
+ !pip install gensim=='3.8.3'
+ ```
+
+ Loading D2V:
+
+ ```python
+ from gensim.models import Doc2Vec
+
+ # Loading a D2V model
+ d2v = Doc2Vec.load(w2v_cbow_d2v_dm)
+ ```
+
+ Inferring the vector for a text:
+
+ ```python
+ txt = 'direito do consumidor origem : bangu regional xxix juizado especial civel ação : [processo] - - recte : fundo de investimento em direitos creditórios'
+ tokens = txt.split()
+
+ txt_vec = d2v.infer_vector(tokens, epochs=20)
+ txt_vec[:10]
+ ```
+
+     array([ 0.02626514, -0.3876521 , -0.24873355, -0.0318402 ,  0.3343679 ,
+            -0.21307918,  0.07193747,  0.02030687,  0.407305  ,  0.20065512],
+           dtype=float32)
+
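+ Note that `infer_vector` is stochastic, so repeated calls return slightly different vectors. Below is a minimal sketch, assuming the `d2v` model loaded above, for comparing two texts by the cosine similarity of their inferred vectors (the example texts are our own):
+
+ ```python
+ import numpy as np
+
+ # Infer vectors for two lowercased texts...
+ v1 = d2v.infer_vector('direito do consumidor juizado especial civel'.split(), epochs=20)
+ v2 = d2v.infer_vector('fundo de investimento em direitos creditórios'.split(), epochs=20)
+
+ # ...and compare them with cosine similarity.
+ cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
+ ```
+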
+
+ --------------
+
+ <a name="3"></a>
+ ## 3\. Demonstrations / Tutorials
+
+ For a better understanding of how to apply these models, below are links to notebooks where we apply them to a legal dataset using various classification models, such as Logistic Regression and CatBoost:
+
+ - **BERT notebook**:
+ [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/felipemaiapolo/legalnlp/blob/main/demo/BERT/BERT_TUTORIAL.ipynb)
+
+ - **Word2Vec notebook**:
+ [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/felipemaiapolo/legalnlp/blob/main/demo/Word2Vec/Word2Vec_TUTORIAL.ipynb)
+
+ - **Doc2Vec notebook**:
+ [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/felipemaiapolo/legalnlp/blob/main/demo/Doc2Vec/Doc2Vec_TUTORIAL.ipynb)
+
+ --------------
+
+ <a name="4"></a>
+ ## 4\. References
+
+ [1] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
+
+ [2] Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
+
+ [3] Le, Q. and Mikolov, T. (2014). Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196. PMLR.
+
+ [4] Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
+
+ [5] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
+
+ [6] Souza, F., Nogueira, R., and Lotufo, R. (2020). BERTimbau: pretrained BERT models for Brazilian Portuguese. In 9th Brazilian Conference on Intelligent Systems, BRACIS, Rio Grande do Sul, Brazil, October 20-23.
w2v_d2v_dbow_size_100_window_15_epochs_20 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bc31120c587429584c41fc227448d413b206b1c34faf3979808d2bf602c1ce7d
+ size 130371884
w2v_d2v_dm_size_100_window_15_epochs_20 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a2e58df1b447ac81423870bb4fbc9b6d659fad87ff375e6a9e0f4b818dda5dd7
+ size 130205330