Bulgarian language poetry generation

A GPT-2-based model pretrained with a causal language modeling (CLM) objective.
Developed by Radostin Cholakov as part of the AzBuki.ML initiatives.

How to use

>>> from transformers import AutoModel, AutoTokenizer
>>>
>>> model_id = "radi-cho/poetry-bg"
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
>>>
>>> input_ids = tokenizer.encode(
...     "[HED]Суетата на живота[NEL][BDY]",
...     add_special_tokens=False,
...     return_tensors='pt')
>>>
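>>> output_ids = model.generate(
...     input_ids,
...     do_sample=True,      # sample instead of greedy decoding
...     max_length=250,      # prompt plus generated tokens
...     top_p=0.98,          # nucleus sampling threshold
...     top_k=0,             # 0 disables top-k filtering
...     pad_token_id=2,
...     eos_token_id=50258)  # id of [SEP], the end of sequence token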
>>>
>>> output = tokenizer.decode(output_ids[0])
>>>
>>> output = output.replace('[NEL]', '\n')
>>> output = output.replace('[BDY]', '\n')
>>> output = output.replace('[HED]', '')
>>> output = output.replace('[SEP]', '')
>>>
>>> print(output)
Суетата на живота

Да страдам ли?
Да страдам ли за това?
Не, не за това, че умирам...
Но само за това,
че миговете ми са рани.

Аз съм сам и търся утеха.

Custom Tokens

We introduced three custom tokens in the tokenizer: [NEL], [BDY], and [HED].

[HED] denotes where the title of the poem begins.
[BDY] denotes where the body of the poem begins.
[NEL] marks the end of a verse and should be decoded as a new line.

[SEP] (with id 50258) is the end of sequence token.
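
As a minimal sketch of how these tokens fit together, the two helpers below build a prompt from a title and turn decoded output back into readable text. The names build_prompt and format_poem are hypothetical and not part of the model repository; they only manipulate strings, so they run without loading the model. Cutting the text at the first [SEP] is an assumption made here for tidier output, not something the usage example does.

```python
# Hypothetical helpers (not part of the model repository). They rely only
# on the token names documented above: [HED], [BDY], [NEL] and [SEP].

def build_prompt(title: str) -> str:
    """Wrap a poem title in the prompt layout used in the usage example."""
    return f"[HED]{title}[NEL][BDY]"


def format_poem(decoded: str) -> str:
    """Turn decoded model output into readable text."""
    # Assumption: keep only the text before the first end-of-sequence token.
    text = decoded.split("[SEP]", 1)[0]
    # [NEL] and [BDY] both become line breaks; [HED] is simply removed.
    text = text.replace("[NEL]", "\n").replace("[BDY]", "\n")
    text = text.replace("[HED]", "")
    return text.strip()


prompt = build_prompt("Суетата на живота")
# A hand-written stand-in for tokenizer.decode(output_ids[0]).
sample = prompt + "Да страдам ли?[NEL]Да страдам ли за това?[SEP]"
print(format_poem(sample))
```

In practice, build_prompt(title) would be passed to tokenizer.encode with add_special_tokens=False, and format_poem would be applied to the string returned by tokenizer.decode.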

Credits

Inspired by rmihaylov/gpt2-medium-bg.
Data: https://chitanka.info/texts/type/poetry