In such models, passing the labels is the preferred way to handle training. Please check each model's docs to see how they handle these input IDs for sequence-to-sequence training.

### decoder models

Also referred to as autoregressive models, decoder models involve a pretraining task (called causal language modeling) where the model reads the text in order and has to predict the next word.
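As a minimal sketch of what causal language modeling means (plain Python, no ML library, with made-up token IDs): the model reads tokens left to right, and the training target at each position is simply the next token in the sequence.

```python
def causal_lm_pairs(token_ids):
    """Pair each position's input token with its next-token target,
    as done when building labels for causal language modeling."""
    inputs = token_ids[:-1]   # the model sees everything but the last token
    targets = token_ids[1:]   # and must predict the token one step ahead
    return list(zip(inputs, targets))

# Hypothetical token IDs for illustration only.
sequence = [101, 7592, 2088, 102]
print(causal_lm_pairs(sequence))  # [(101, 7592), (7592, 2088), (2088, 102)]
```

In practice, libraries implement this by shifting the labels one position relative to the inputs; the helper above just makes that shift explicit.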