Here's an example of preparing input for `model.generate()`, using the Zephyr assistant model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceH4/zephyr-7b-beta"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)  # You may want to use bfloat16 and/or move to GPU here

messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
print(tokenizer.decode(tokenized_chat[0]))
```

This will yield a string in the input format that Zephyr expects:

```text
<|system|>
You are a friendly chatbot who always responds in the style of a pirate
<|user|>
How many helicopters can a human eat in one sitting?
<|assistant|>
```

Now that our input is formatted correctly for Zephyr, we can use the model to generate a response to the user's question:

```python
outputs = model.generate(tokenized_chat, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
```

This will yield:

```text
<|system|>
You are a friendly chatbot who always responds in the style of a pirate
<|user|>
How many helicopters can a human eat in one sitting?
<|assistant|>
Matey, I'm afraid I must inform ye that humans cannot eat helicopters.
```
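Note that the decoded output includes the full prompt as well as the model's reply, because `generate()` returns the input tokens followed by the newly generated ones. If you only want the reply text, one common approach (not shown in the example above) is to slice off the prompt tokens before decoding. A minimal sketch, assuming the `tokenized_chat` and `outputs` variables from the example above:

```python
# Decode only the tokens generated after the prompt.
input_length = tokenized_chat.shape[1]   # number of prompt tokens
new_tokens = outputs[0][input_length:]   # keep only the generated continuation
reply = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(reply)  # e.g. "Matey, I'm afraid I must inform ye that humans cannot eat helicopters."
```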