Bulgarian language poetry generation
A pretrained model based on GPT-2, trained with a causal language modeling (CLM) objective.
Developed by Radostin Cholakov as a part of the AzBuki.ML initiatives.
How to use?
>>> from transformers import AutoModel, AutoTokenizer
>>>
>>> model_id = "radi-cho/poetry-bg"
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
>>>
>>> input_ids = tokenizer.encode(
...     "[HED]Суетата на живота[NEL][BDY]",
...     add_special_tokens=False,
...     return_tensors='pt')
>>>
>>> output_ids = model.generate(
...     input_ids,
...     do_sample=True,
...     max_length=250,
...     top_p=0.98,
...     top_k=0,
...     pad_token_id=2,
...     eos_token_id=50258)
>>>
>>> output = tokenizer.decode(output_ids[0])
>>>
>>> output = output.replace('[NEL]', '\n')
>>> output = output.replace('[BDY]', '\n')
>>> output = output.replace('[HED]', '')
>>> output = output.replace('[SEP]', '')
>>>
>>> print(output)
Суетата на живота
Да страдам ли?
Да страдам ли за това?
Не, не за това, че умирам...
Но само за това,
че миговете ми са рани.
Аз съм сам и търся утеха.
Custom Tokens
We introduced 3 custom tokens in the tokenizer: [NEL], [BDY], and [HED].
- [HED] denotes where the title of the poem begins;
- [BDY] denotes where the body of the poem begins;
- [NEL] marks the end of a verse and should be decoded as a new line.
[SEP] (with id 50258) is the end of sequence token.
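The token conventions above can be wrapped in two small string helpers, one to build a prompt from a title and one to turn decoded text back into readable verses. This is only a minimal sketch: the function names `build_prompt` and `clean_output` are hypothetical and not part of the model or of `transformers`, and the trailing whitespace strip is an added convenience rather than something the usage example does.

```python
# Token strings as described in this section.
HED, BDY, NEL, SEP = "[HED]", "[BDY]", "[NEL]", "[SEP]"


def build_prompt(title: str) -> str:
    """Format a poem title as a generation prompt (hypothetical helper).

    Mirrors the prompt in the usage example: title marker, title,
    a verse break, then the body marker.
    """
    return f"{HED}{title}{NEL}{BDY}"


def clean_output(decoded: str) -> str:
    """Convert decoded model text to plain text (hypothetical helper)."""
    text = decoded.replace(NEL, "\n")  # each verse ends with a new line
    text = text.replace(BDY, "\n")     # body marker also becomes a new line
    text = text.replace(HED, "")       # title marker is dropped
    text = text.replace(SEP, "")       # end of sequence token is dropped
    return text.strip()                # assumption: trim surrounding whitespace


if __name__ == "__main__":
    print(build_prompt("Суетата на живота"))
    print(clean_output("[HED]Title[NEL][BDY]first verse[NEL]second verse[NEL][SEP]"))
```

The string returned by `build_prompt` would be passed to `tokenizer.encode(..., add_special_tokens=False)`, and `clean_output` would be applied to the result of `tokenizer.decode(output_ids[0])`.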
Credits
- Inspired by rmihaylov/gpt2-medium-bg.
- Data: https://chitanka.info/texts/type/poetry