A few days later, here is an answer to my own question based on what I could see in the code. Feel free to chime in if I missed something.
From what I see in the current code of SFTTrainer:
- the current packing option of SFTTrainer (packing=True) does not build an attention mask that keeps the packed examples separate;
- nor does it handle positional encoding (position ids are not restarted for each packed example).
See:
- https://github.com/huggingface/trl/blob/64aa06499b2e71537a8e701fad076873b0f3603f/trl/trainer/sft_trainer.py#L351: preparation of the dataset with the packing option;
- https://github.com/huggingface/trl/blob/64aa06499b2e71537a8e701fad076873b0f3603f/trl/trainer/sft_trainer.py#L663: the actual packing, calling pack_dataset on input_ids only;
- https://github.com/huggingface/trl/blob/e0dd5250217305f7f8c2f4a153a6939a2f16e2bf/trl/data_utils.py#L475: the pack_dataset function itself.
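To make that concrete, here is a minimal sketch of what this kind of packing amounts to (this is not TRL's actual code, and the helper pack_examples is just a name I made up for the illustration): tokenized examples are joined with an eos token and cut into fixed-size blocks, and only input_ids plus an all-ones attention mask come out, so nothing in the mask prevents a token from attending to tokens of a previous example in the same block.

```python
# Minimal sketch of eos-separated packing (illustration only, not TRL's code).

def pack_examples(tokenized_examples, block_size, eos_token_id):
    """Concatenate tokenized examples and split the stream into fixed-size blocks."""
    stream = []
    for ids in tokenized_examples:
        stream.extend(ids)
        stream.append(eos_token_id)  # the eos token is the only separator between examples

    blocks = []
    for start in range(0, len(stream) - block_size + 1, block_size):
        chunk = stream[start:start + block_size]
        blocks.append({
            "input_ids": chunk,
            # all-ones mask: every token can attend to every earlier token in the
            # block, even across the eos boundaries between packed examples
            "attention_mask": [1] * block_size,
            # note: no position_ids either, so positions run over the whole block
            # instead of restarting at the beginning of each packed example
        })
    return blocks


if __name__ == "__main__":
    examples = [[101, 102, 103], [201, 202], [301, 302, 303, 304]]
    for block in pack_examples(examples, block_size=6, eos_token_id=0):
        print(block)
```

A packing that really isolated the examples would instead produce something like a block-diagonal attention mask, or position ids restarting at 0 for each packed example, which is exactly what the code linked above does not do.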
From what I understand, the "do not attend to tokens outside the current sentence" behaviour is inferred only from the eos tokens separating one sentence from the next. In that respect, SFTTrainer follows the approach chosen by the authors of the GPT-3 paper ("Language Models are Few-Shot Learners"). See the following extract: