Text generation strategies

Text generation is essential to many NLP tasks, such as open-ended text generation, summarization, translation, and more. It also plays a role in a variety of mixed-modality applications that have text as an output, like speech-to-text and vision-to-text. Some of the models that can generate text include GPT2, XLNet, OpenAI GPT, CTRL, Transformer-XL, XLM, Bart, T5, GIT, and Whisper.

Check out a few examples that use the [~transformers.generation_utils.GenerationMixin.generate] method to produce text outputs for different tasks:

* Text summarization
* Image captioning
* Audio transcription

Note that the inputs to the generate method depend on the model's modality. They are returned by the model's preprocessor class, such as AutoTokenizer or AutoProcessor. If a model's preprocessor creates more than one kind of input, pass all the inputs to generate(). You can learn more about an individual model's preprocessor in the corresponding model's documentation.
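
For example, here is a minimal sketch of passing a multimodal preprocessor's outputs straight to generate(), using a Whisper checkpoint for audio transcription; the openai/whisper-tiny model and the dummy LibriSpeech dataset are only illustrative choices:

```python
from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

# Load a small speech model and its processor (feature extractor + tokenizer)
processor = AutoProcessor.from_pretrained("openai/whisper-tiny")
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny")

# Grab a short audio sample; any 16 kHz waveform would do here
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio = dataset[0]["audio"]["array"]

# The processor returns model-specific inputs (here, input_features); pass them all to generate()
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=40)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```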

The process of selecting output tokens to generate text is known as decoding, and you can customize the decoding strategy that the generate() method will use. Modifying a decoding strategy does not change the values of any trainable parameters. However, it can have a noticeable impact on the quality of the generated output, for example by reducing repetition in the text and making it more coherent.

This guide describes:

* default generation configuration
* common decoding strategies and their main parameters
* saving and sharing custom generation configurations with your fine-tuned model on 🤗 Hub

Default text generation configuration

A decoding strategy for a model is defined in its generation configuration. When using pre-trained models for inference within a [pipeline], the models call the PreTrainedModel.generate() method that applies a default generation configuration under the hood. The default configuration is also used when no custom configuration has been saved with the model.

When you load a model explicitly, you can inspect the generation configuration that comes with it through model.generation_config:

```python
>>> from transformers import AutoModelForCausalLM

>>> model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
>>> model.generation_config
GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256
}
```

Printing out the model.generation_config reveals only the values that are different from the default generation configuration, and does not list any of the default values.

The default generation configuration limits the size of the output combined with the input prompt to a maximum of 20 tokens to avoid running into resource limitations. The default decoding strategy is greedy search, the simplest decoding strategy, which picks the token with the highest probability as the next token. For many tasks and small output sizes this works well. However, when used to generate longer outputs, greedy search can start producing highly repetitive results.
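
If you are curious what those defaults look like, one quick way to check is to instantiate a bare GenerationConfig and inspect it; the values in the comments below reflect the library defaults at the time of writing and may change between versions:

```python
from transformers import GenerationConfig

default_config = GenerationConfig()

# The combined prompt + output length defaults to 20 tokens
print(default_config.max_length)  # 20

# num_beams=1 together with do_sample=False corresponds to greedy search
print(default_config.num_beams)   # 1
print(default_config.do_sample)   # False
```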

Customize text generation

You can override any generation_config by passing the parameters and their values directly to the [generate] method:

```python
>>> my_model.generate(**inputs, num_beams=4, do_sample=True)  # doctest: +SKIP
```

Even if the default decoding strategy mostly works for your task, you can still tweak a few things. Some of the commonly adjusted parameters include the following (a short example combining several of them follows this list):

* max_new_tokens: the maximum number of tokens to generate. In other words, the size of the output sequence, not including the tokens in the prompt. As an alternative to using the output's length as a stopping criterion, you can choose to stop generation whenever the full generation exceeds some amount of time. To learn more, check [StoppingCriteria].
* num_beams: by specifying a number of beams higher than 1, you are effectively switching from greedy search to beam search. This strategy evaluates several hypotheses at each time step and eventually chooses the hypothesis that has the overall highest probability for the entire sequence. This has the advantage of identifying high-probability sequences that start with lower-probability initial tokens and would've been ignored by greedy search.
* do_sample: if set to True, this parameter enables decoding strategies such as multinomial sampling, beam-search multinomial sampling, Top-K sampling and Top-p sampling. All these strategies select the next token from the probability distribution over the entire vocabulary with various strategy-specific adjustments.
* num_return_sequences: the number of sequence candidates to return for each input. This option is only available for the decoding strategies that support multiple sequence candidates, e.g. variations of beam search and sampling. Decoding strategies like greedy search and contrastive search return a single output sequence.
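
As a quick illustration, the sketch below combines several of these parameters in a single call; the checkpoint and prompt are arbitrary choices, and max_time is an optional time-based stopping condition (in seconds) related to the stopping criteria mentioned above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "distilbert/distilgpt2"  # any causal LM checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer("The secret to good documentation is", return_tensors="pt")

# Sample 3 candidate continuations of up to 40 new tokens each,
# stopping early if generation takes longer than roughly 5 seconds
outputs = model.generate(
    **inputs,
    do_sample=True,
    max_new_tokens=40,
    num_return_sequences=3,
    max_time=5.0,
)
for candidate in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(candidate)
```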

Save a custom decoding strategy with your model

If you would like to share your fine-tuned model with a specific generation configuration, you can:

* Create a [GenerationConfig] class instance
* Specify the decoding strategy parameters
* Save your generation configuration with [GenerationConfig.save_pretrained], making sure to leave its config_file_name argument empty
* Set push_to_hub to True to upload your config to the model's repo

```python
>>> from transformers import AutoModelForCausalLM, GenerationConfig

>>> model = AutoModelForCausalLM.from_pretrained("my_account/my_model")  # doctest: +SKIP
>>> generation_config = GenerationConfig(
...     max_new_tokens=50, do_sample=True, top_k=50, eos_token_id=model.config.eos_token_id
... )
>>> generation_config.save_pretrained("my_account/my_model", push_to_hub=True)  # doctest: +SKIP
```

You can also store several generation configurations in a single directory, making use of the config_file_name argument in [GenerationConfig.save_pretrained]. You can later instantiate them with [GenerationConfig.from_pretrained]. This is useful if you want to store several generation configurations for a single model (e.g. one for creative text generation with sampling, and one for summarization with beam search). You must have the right Hub permissions to add configuration files to a model.

```python
>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig

>>> tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")

>>> translation_generation_config = GenerationConfig(
...     num_beams=4,
...     early_stopping=True,
...     decoder_start_token_id=0,
...     eos_token_id=model.config.eos_token_id,
...     pad_token_id=model.config.pad_token_id,
... )

>>> # Tip: add push_to_hub=True to push to the Hub
>>> translation_generation_config.save_pretrained("/tmp", "translation_generation_config.json")

>>> # You could then use the named generation config file to parameterize generation
>>> generation_config = GenerationConfig.from_pretrained("/tmp", "translation_generation_config.json")
>>> inputs = tokenizer("translate English to French: Configuration files are easy to use!", return_tensors="pt")
>>> outputs = model.generate(**inputs, generation_config=generation_config)
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
['Les fichiers de configuration sont faciles à utiliser!']
```

Streaming

The generate() method supports streaming through its streamer input. The streamer input is compatible with any instance of a class that has the following methods: put() and end(). Internally, put() is used to push new tokens and end() is used to flag the end of text generation.

The API for the streamer classes is still under development and may change in the future.

In practice, you can craft your own streaming class for all sorts of purposes (a minimal custom streamer is sketched after the example below)! We also have basic streaming classes ready for you to use. For example, you can use the [TextStreamer] class to stream the output of generate() into your screen, one word at a time:

```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

>>> tok = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
>>> inputs = tok(["An increasing sequence: one,"], return_tensors="pt")
>>> streamer = TextStreamer(tok)

>>> # Despite returning the usual output, the streamer will also print the generated text to stdout.
>>> _ = model.generate(**inputs, streamer=streamer, max_new_tokens=20)
An increasing sequence: one, two, three, four, five, six, seven, eight, nine, ten, eleven,
```
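
As mentioned above, any object that implements put() and end() works as a streamer. Below is a minimal sketch of a custom streamer that simply counts the token ids it receives; the class name and behavior are made up for illustration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer


class TokenCountingStreamer:
    """Toy streamer: counts token ids as they are generated."""

    def __init__(self):
        self.num_tokens = 0

    def put(self, value):
        # generate() calls put() with each new batch of token ids (including the prompt on the first call)
        self.num_tokens += value.numel()

    def end(self):
        # generate() calls end() once generation is finished
        print(f"Done, saw {self.num_tokens} token ids in total.")


tok = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
inputs = tok(["An increasing sequence: one,"], return_tensors="pt")

streamer = TokenCountingStreamer()
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=20)
```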

Decoding strategies

Certain combinations of the generate() parameters, and ultimately generation_config, can be used to enable specific decoding strategies. If you are new to this concept, we recommend reading this blog post that illustrates how common decoding strategies work.

Here, we'll show some of the parameters that control the decoding strategies and illustrate how you can use them.

Greedy Search

[generate] uses greedy search decoding by default, so you don't have to pass any parameters to enable it. This means that num_beams is set to 1 and do_sample=False.

```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> prompt = "I look forward to"
>>> checkpoint = "distilbert/distilgpt2"

>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> outputs = model.generate(**inputs)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['I look forward to seeing you all again!\n\n\n\n\n\n\n\n\n\n\n']
```

Contrastive search

The contrastive search decoding strategy was proposed in the 2022 paper A Contrastive Framework for Neural Text Generation. It demonstrates superior results for generating non-repetitive yet coherent long outputs. To learn how contrastive search works, check out this blog post.

The two main parameters that enable and control the behavior of contrastive search are penalty_alpha and top_k:

```python
>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> checkpoint = "openai-community/gpt2-large"
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)

>>> prompt = "Hugging Face Company is"
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> outputs = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=100)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Hugging Face Company is a family owned and operated business. We pride ourselves on being the best
in the business and our customer service is second to none.\n\nIf you have any questions about our
products or services, feel free to contact us at any time. We look forward to hearing from you!']
```

Multinomial sampling

As opposed to greedy search, which always chooses the token with the highest probability as the next token, multinomial sampling (also called ancestral sampling) randomly selects the next token based on the probability distribution over the entire vocabulary given by the model. Every token with a non-zero probability has a chance of being selected, thus reducing the risk of repetition.

To enable multinomial sampling, set do_sample=True and num_beams=1.

```python
>>> from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed

>>> set_seed(0)  # For reproducibility

>>> checkpoint = "openai-community/gpt2-large"
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)

>>> prompt = "Today was an amazing day because"
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> outputs = model.generate(**inputs, do_sample=True, num_beams=1, max_new_tokens=100)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Today was an amazing day because when you go to the World Cup and you don\'t, or when you don\'t get invited,
that\'s a terrible feeling."']
```

Beam-search decoding

Unlike greedy search, beam-search decoding keeps several hypotheses at each time step and eventually chooses the hypothesis that has the overall highest probability for the entire sequence. This has the advantage of identifying high-probability sequences that start with lower-probability initial tokens and would've been ignored by greedy search.

To enable this decoding strategy, set num_beams (the number of hypotheses to keep track of) to a value greater than 1.

```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> prompt = "It is astonishing how one can"
>>> checkpoint = "openai-community/gpt2-medium"

>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> outputs = model.generate(**inputs, num_beams=5, max_new_tokens=50)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['It is astonishing how one can have such a profound impact on the lives of so many people in such a short period of
time."\n\nHe added: "I am very proud of the work I have been able to do in the last few years.\n\n"I have']
```

Beam-search multinomial sampling

As the name implies, this decoding strategy combines beam search with multinomial sampling. To use it, set num_beams to a value greater than 1 and do_sample=True.

```python
>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, set_seed

>>> set_seed(0)  # For reproducibility

>>> prompt = "translate English to German: The house is wonderful."
>>> checkpoint = "google-t5/t5-small"

>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
>>> outputs = model.generate(**inputs, num_beams=5, do_sample=True)
>>> tokenizer.decode(outputs[0], skip_special_tokens=True)
'Das Haus ist wunderbar.'
```

Diverse beam search decoding

The diverse beam search decoding strategy is an extension of the beam search strategy that allows for generating a more diverse set of beam sequences to choose from. To learn how it works, refer to Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models.

This approach has three main parameters: num_beams, num_beam_groups, and diversity_penalty. The diversity penalty ensures the outputs are distinct across groups, and beam search is used within each group.

```python
>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

>>> checkpoint = "google/pegasus-xsum"
>>> prompt = (
...     "The Permaculture Design Principles are a set of universal design principles "
...     "that can be applied to any location, climate and culture, and they allow us to design "
...     "the most efficient and sustainable human habitation and food production systems. "
...     "Permaculture is a design system that encompasses a wide variety of disciplines, such "
...     "as ecology, landscape design, environmental science and energy conservation, and the "
...     "Permaculture design principles are drawn from these various disciplines. Each individual "
...     "design principle itself embodies a complete conceptual framework based on sound "
...     "scientific principles. When we bring all these separate principles together, we can "
...     "create a design system that both looks at whole systems, the parts that these systems "
...     "consist of, and how those parts interact with each other to create a complex, dynamic, "
...     "living system. Each design principle serves as a tool that allows us to integrate all "
...     "the separate parts of a design, referred to as elements, into a functional, synergistic, "
...     "whole system, where the elements harmoniously interact and work together in the most "
...     "efficient way possible."
... )

>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
>>> outputs = model.generate(**inputs, num_beams=5, num_beam_groups=5, max_new_tokens=30, diversity_penalty=1.0)
>>> tokenizer.decode(outputs[0], skip_special_tokens=True)
'The Design Principles are a set of universal design principles that can be applied to any location, climate and
culture, and they allow us to design the'
```

This guide illustrates the main parameters that enable the various decoding strategies. More advanced parameters exist for the [generate] method, giving you even further control over its behavior. For the complete list of the available parameters, refer to the API documentation.

Speculative Decoding

Speculative decoding (also known as assisted decoding) is a modification of the decoding strategies above that uses an assistant model (ideally a much smaller one) with the same tokenizer to generate a few candidate tokens. The main model then validates the candidate tokens in a single forward pass, which speeds up the decoding process. If do_sample=True, then the token validation with resampling introduced in the speculative decoding paper is used.

Currently, only greedy search and sampling are supported with assisted decoding, and assisted decoding doesn't support batched inputs. To learn more about assisted decoding, check this blog post.

To enable assisted decoding, set the assistant_model argument to a model.

```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> prompt = "Alice and Bob"
>>> checkpoint = "EleutherAI/pythia-1.4b-deduped"
>>> assistant_checkpoint = "EleutherAI/pythia-160m-deduped"

>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)
>>> outputs = model.generate(**inputs, assistant_model=assistant_model)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']
```

When using assisted decoding with sampling methods, you can use the temperature argument to control the randomness, just like in multinomial sampling. However, in assisted decoding, reducing the temperature may help improve latency.

```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

>>> set_seed(42)  # For reproducibility

>>> prompt = "Alice and Bob"
>>> checkpoint = "EleutherAI/pythia-1.4b-deduped"
>>> assistant_checkpoint = "EleutherAI/pythia-160m-deduped"

>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)
>>> outputs = model.generate(**inputs, assistant_model=assistant_model, do_sample=True, temperature=0.5)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Alice and Bob are going to the same party. It is a small party, in a small']
```