---
language: 
- tr
thumbnail: 
tags:
- gpt2
- turkish

license: apache-2.0
datasets:
- wikipedia-turkish
metrics:
- perplexity
- accuracy

widget:
- text: "Bu yazıyı bir bilgisayar yazdı. Yazarken"
  context: ""
- text: "İnternete kolay erişim sayesinde dünya daha da küçüldü. Bunun sonucunda"
  context: ""
  
---

# gpt2-small-turkish

## Model description

This is a GPT2-small English-based model, fine-tuned and further trained on Turkish Wikipedia articles as of 28-10-2020.

This work is based on Pierre Guillou's tutorial, available at
https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb

The code was adapted to work with fastai 2.x.

Training was done on Google Colab.

An additional tutorial and the source code will be published at https://github.com/gorkemgoknar at a later stage.

Current accuracy: 28.9%, perplexity: 86.71

Models are available:

* [gpt2-small-tuned-tr](https://huggingface.co/gorkemgoknar/gpt2-small-turkish)

## Intended uses & limitations

#### How to use

#### Install

```python
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch

tokenizer = AutoTokenizer.from_pretrained("gorkemgoknar/gpt2-small-turkish")
model = AutoModelWithLMHead.from_pretrained("gorkemgoknar/gpt2-small-turkish")

# Set the maximum sequence length to 1024
tokenizer.model_max_length = 1024

model.eval()  # disable dropout (or leave in train mode to finetune)

```
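
As a quick sanity check, the high-level `pipeline` API can also be used. This is only an illustrative alternative; the prompt comes from the widget examples above and the generation parameters are arbitrary:

```python
from transformers import pipeline

# text-generation pipeline around the same model and tokenizer
generator = pipeline("text-generation", model="gorkemgoknar/gpt2-small-turkish")
result = generator("Bu yazıyı bir bilgisayar yazdı. Yazarken",
                   max_length=50, do_sample=True, top_k=40)
print(result[0]["generated_text"])
```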

#### Generate 1 word
```python
# input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")

# model output
outputs = model(**inputs, labels=inputs["input_ids"])
loss, logits = outputs[:2]
predicted_index = torch.argmax(logits[0, -1, :]).item()
predicted_text = tokenizer.decode([predicted_index])

# results
print('input text:', text)
print('predicted text:', predicted_text)

# input text: Bu yazıyı bilgisayar yazdı.
# predicted text: (the model's single most likely next Turkish token)

```

#### Generate Full Sequence
```python
# input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")

# model output using Top-k sampling text generation method
sample_outputs = model.generate(inputs.input_ids,
                                pad_token_id=50256,
                                do_sample=True, 
                                max_length=50, # put the token number you want
                                top_k=40,
                                num_return_sequences=1)

# generated sequence
for i, sample_output in enumerate(sample_outputs):
    print(">> Generated text {}\n\n{}".format(i+1, tokenizer.decode(sample_output.tolist())))

# >> Generated text 1
#
# (the input text followed by a sampled Turkish continuation)

```
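
`generate()` supports further sampling controls as well. As an illustrative variation (not settings recommended by this model card), top-k can be combined with nucleus (top-p) sampling and a temperature:

```python
# illustrative only: top-k combined with top-p sampling and a temperature
sample_outputs = model.generate(inputs.input_ids,
                                pad_token_id=50256,
                                do_sample=True,
                                max_length=50,
                                top_k=40,
                                top_p=0.95,
                                temperature=0.8,
                                num_return_sequences=3)
```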

#### Limitations and bias

The training data used for this model comes from Turkish Wikipedia. We know it contains a lot of unfiltered content from the internet, which is far from neutral.


## Training data

Turkish Wikipedia article dump as of 28-10-2020.

## Training procedure
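
The English GPT2-small checkpoint was fine-tuned on the Turkish Wikipedia dump with fastai 2.x on Google Colab, following Pierre Guillou's notebook linked above. The snippet below is only a minimal, framework-agnostic sketch of the same causal-language-modelling objective in plain PyTorch; the actual run used fastai, and the tiny corpus here is a placeholder for the Wikipedia dump:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# start from the English GPT-2 small checkpoint
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.train()

# placeholder corpus; the real training data is the Turkish Wikipedia dump
corpus = ["Bu yazıyı bir bilgisayar yazdı.",
          "İnternete kolay erişim sayesinde dünya daha da küçüldü."]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
for text in corpus:
    batch = tokenizer(text, return_tensors="pt")
    # labels = input_ids: the model learns to predict each next token
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"loss {outputs.loss.item():.3f}  perplexity {torch.exp(outputs.loss).item():.1f}")
```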


## Eval results

| epoch | train_loss | valid_loss | accuracy | perplexity | time |
|-------|------------|------------|----------|------------|---------|
| 0 | 6.922922 | 6.653488 | 0.148002 | 775.484253 | 2:26:41 |
| 1 | 4.799396 | 4.633522 | 0.277028 | 102.875755 | 3:03:38 |
| 2 | 4.610025 | 4.462641 | 0.289884 | 86.716248 | 2:34:50 |
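
The reported perplexity is simply the exponential of the validation loss; for example, for epoch 2:

```python
import math

# perplexity = exp(validation loss); epoch 2 values from the table above
print(math.exp(4.462641))  # ≈ 86.716, matching the reported perplexity
```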


