Totototo's activity


A few days later, I'll answer my own question based on what I could see in the code. Feel free to add to this if I missed anything.

From what I see in the current code of SFTTrainer:

  • the current packing option of SFTTrainer (packing=True) does not handle attention masks;
  • nor does it handle positional encoding.

See:

  • https://github.com/huggingface/trl/blob/64aa06499b2e71537a8e701fad076873b0f3603f/trl/trainer/sft_trainer.py#L351: preparation of the dataset with the packing option
  • https://github.com/huggingface/trl/blob/64aa06499b2e71537a8e701fad076873b0f3603f/trl/trainer/sft_trainer.py#L663: the actual packing, done with the pack_dataset function on input_ids only
  • https://github.com/huggingface/trl/blob/e0dd5250217305f7f8c2f4a153a6939a2f16e2bf/trl/data_utils.py#L475: the pack_dataset function itself
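As a rough illustration (a simplified sketch, not TRL's actual implementation), packing on input_ids only amounts to concatenating the tokenized examples and cutting the stream into fixed-length blocks; the attention mask is all ones and position ids run continuously across document boundaries:

```python
from itertools import chain

eos_id = 2        # hypothetical EOS token id
block_size = 8    # hypothetical packed sequence length

# Hypothetical tokenized examples, each already ending with EOS.
examples = [[5, 6, 7, eos_id], [11, 12, eos_id], [21, 22, 23, 24, eos_id]]

# Concatenate everything, then cut the stream into fixed-length blocks.
stream = list(chain.from_iterable(examples))
blocks = [stream[i:i + block_size] for i in range(0, len(stream), block_size)]

for block in blocks:
    print({
        "input_ids": block,
        "attention_mask": [1] * len(block),  # all ones: no per-document masking
        # position ids are implicitly 0..len(block)-1 and do not restart at EOS
    })
```

A document-aware approach would instead build a block-diagonal attention mask and restart the position ids after each EOS; with plain packing, the EOS tokens are the model's only hint that the documents are unrelated.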

From what I understand, the "do not attend to tokens outside the current sentence" behaviour is inferred only from the EOS tokens separating the sentences from one another. In that respect, SFTTrainer follows the approach chosen by the authors of the GPT-3 paper ("Language Models are Few-Shot Learners"). See the following extract:

[Screenshot: extract from the GPT-3 paper ("Language Models are Few-Shot Learners") on packing multiple documents into a sequence, delimited by an end-of-text token.]


Thanks @sirluk for the great article!
One thing is still unclear to me:

  • SFTTrainer from TRL has a "packing" option. Does it handle both the masked attention and the position ids that you mention above? (See the sketch after this list for how the option is enabled.)
  • Are the concerns raised in the comment above by @shantanuagarwal still relevant if one uses SFTTrainer?
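For reference, here is a minimal sketch of how the packing option is enabled (model and dataset names are placeholders, and argument names may differ between TRL versions):

```python
# Minimal sketch: enabling packing in TRL's SFTTrainer.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_dataset = load_dataset("trl-lib/Capybara", split="train")  # example dataset

config = SFTConfig(
    output_dir="sft-packed",
    packing=True,  # concatenate examples into fixed-length blocks before training
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",  # any causal LM checkpoint
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```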

Great article: the subject is clearly more than hot, the overall method looks good, and the code is neat. Thanks for sharing!

But there are two things I would have improved:

  • The test set seems too small: Llama 3.1 8B drops 8% on the test set compared to the whole dataset, so there is too much variability in the test, and it therefore seems to me that the claimed 7% gain with the fine-tuned small model does not hold. (A rough order of magnitude is sketched after this list.)
  • The cost comparison lacks the inference times of the small model and the LLMs. So, unless I'm mistaken, saying directly that it is 80 times cheaper because each inference is 80 times cheaper does not hold: the inference time of a small model on a small-capacity endpoint may be longer than that of a large model on a large-capacity endpoint, and the overall gain would then be less than 80x.
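To give a rough order of magnitude for both points, here is a back-of-the-envelope sketch (all numbers are hypothetical, not taken from the article):

```python
import math

# --- Point 1: sampling error on a small test set (hypothetical numbers) ---
n_test = 100      # hypothetical test-set size
accuracy = 0.80   # hypothetical measured accuracy
stderr = math.sqrt(accuracy * (1 - accuracy) / n_test)
print(f"95% CI half-width: ±{1.96 * stderr:.1%}")  # ≈ ±7.8 points here,
# i.e. the same order as a reported 7-8% difference, so it could be noise.

# --- Point 2: effective cost per request (hypothetical numbers) ---
# "80x cheaper per hour" only becomes "80x cheaper per request"
# if both endpoints take the same time per request.
price_small, price_large = 1.0, 80.0      # $/hour, hypothetical
latency_small, latency_large = 2.0, 0.5   # seconds/request, hypothetical
cost_small = price_small * latency_small / 3600
cost_large = price_large * latency_large / 3600
print(f"effective gain: {cost_large / cost_small:.0f}x")  # 20x here, not 80x
```

With those hypothetical figures, the sampling error alone is about ±8 points, and the effective cost advantage drops from 80x to 20x once latency is included.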

With such a great job done, it's a pity that these two points slightly blur the conclusions; maybe they would be easy to address?