---
language:
- tr
thumbnail:
tags:
- gpt2
- turkish
license: apache-2.0
datasets:
- wikipedia-turkish
metrics:
- perplexity
- accuracy
widget:
- text: "Bu yazıyı bir bilgisayar yazdı. Yazarken"
  context: ""
- text: "İnternete kolay erişim sayesinde dünya daha da küçüldü. Bunun sonucunda"
  context: ""
---

# Turkish GPT2 Model Finetuned
# Türkçe GPT2 Modeli

## Model description

This is a GPT2-Small English-based model, fine-tuned and additionally trained on Turkish Wikipedia articles as of 28-10-2020.

Live demo based on this work: https://www.metayazar.com/

Writer model fine-tuned on top of this model: https://huggingface.co/gorkemgoknar/gpt2-turkish-writer

The work is based on Pierre Guillou's tutorial (https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb). The code was converted to work with fastai 2.x, and training was done on Google Colab. An additional tutorial and the source code will be published at https://github.com/gorkemgoknar at a later stage.

Current accuracy: 33%, perplexity: 51.88

Models are available:

* [gpt2-small-tuned-tr](https://huggingface.co/gorkemgoknar/gpt2-small-turkish)

## Intended uses & limitations

#### How to use

#### Install

```python
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch

tokenizer = AutoTokenizer.from_pretrained("gorkemgoknar/gpt2-small-turkish")
model = AutoModelWithLMHead.from_pretrained("gorkemgoknar/gpt2-small-turkish")

# Set the maximum sequence length to 1024 tokens
tokenizer.model_max_length = 1024

model.eval()  # disable dropout (or leave in train mode to finetune)
```

#### Generate 1 word

```python
# input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")

# model output: forward pass with labels to also get the loss
outputs = model(**inputs, labels=inputs["input_ids"])
loss, logits = outputs[:2]

# pick the most likely token following the input
predicted_index = torch.argmax(logits[0, -1, :]).item()
predicted_text = tokenizer.decode([predicted_index])

# results
print('input text:', text)
print('predicted text:', predicted_text)

# input text:
# predicted text:
```

#### Generate Full Sequence

```python
# input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")

# model output using the top-k sampling text generation method
sample_outputs = model.generate(inputs.input_ids,
                                pad_token_id=50256,
                                do_sample=True,
                                max_length=50,  # put the token number you want
                                top_k=40,
                                num_return_sequences=1)

# generated sequence
for i, sample_output in enumerate(sample_outputs):
    print(">> Generated text {}\n\n{}".format(i + 1, tokenizer.decode(sample_output.tolist())))

# >> Generated text
#
```

#### Limitations and bias

The training data used for this model comes from Turkish Wikipedia. We know it contains a lot of unfiltered content from the internet, which is far from neutral.

## Training data

Wikipedia Turkish article dump as of 28-10-2020

## Training procedure

## Eval results

| epoch | train_loss | valid_loss | accuracy | perplexity | time    |
| ----- | ---------- | ---------- | -------- | ---------- | ------- |
| 0     | 4.777015   | 4.621834   | 0.292547 | 101.680367 | 2:42:05 |
| 1     | 4.509412   | 4.403999   | 0.305574 | 81.777267  | 1:09:38 |
| 2     | 4.169529   | 4.120755   | 0.324908 | 61.605747  | 1:07:45 |
| 3     | 4.293973   | 4.177899   | 0.317211 | 65.228653  | 1:07:02 |
| 4     | 4.049848   | 3.949103   | 0.338347 | 51.888783  | 1:05:53 |

Note: epoch 0 was trained on a Tesla T4, the remaining epochs on a V100.
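
The perplexity column above matches the exponential of the validation loss, so the two can be cross-checked against each other. A minimal sketch in Python, using the epoch-4 numbers from the table:

```python
import math

# perplexity = exp(valid_loss); using the epoch-4 row from the table above
valid_loss = 3.949103
perplexity = math.exp(valid_loss)
print(perplexity)  # ~51.888, matching the reported 51.888783
```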