Sequence packing logic

#3
by orrzohar - opened

Could you give some details about the sequence packing logic? Besides packing sequences together, did you do anything to prevent attention from crossing between different sequences?

Either way, did you ablate this design choice?

We ensure that packing has no effect on the computed results by using 1) a cross-sample attention mask and 2) position ID resetting. So our packing strategy should not influence the results compared to no packing.
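For concreteness, here is a minimal sketch of what those two pieces could look like: a block-diagonal (cross-sample) attention mask so tokens only attend within their own original sequence, plus position IDs that restart at each sequence boundary. The function name, shapes, and framework (PyTorch) are illustrative assumptions, not the actual implementation from this repo.

```python
import torch

def build_packed_inputs(seq_lens):
    """Sketch: build a cross-sample attention mask and reset position IDs
    for several sequences packed into a single row.

    seq_lens: lengths of the individual sequences, in packing order.
    Returns (attention_mask, position_ids); the mask is True only where
    query and key tokens belong to the same original sequence.
    """
    total = sum(seq_lens)
    # Block-diagonal mask: no attention across sequence boundaries.
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Position IDs restart from 0 for each packed sequence.
    position_ids = torch.empty(total, dtype=torch.long)

    start = 0
    for length in seq_lens:
        end = start + length
        mask[start:end, start:end] = True
        position_ids[start:end] = torch.arange(length)
        start = end

    return mask, position_ids


# Example: three sequences of lengths 3, 2, 4 packed into one row of 9 tokens.
mask, pos = build_packed_inputs([3, 2, 4])
print(mask.int())
print(pos)  # tensor([0, 1, 2, 0, 1, 0, 1, 2, 3])
```

A causal mask would be intersected with this block-diagonal mask for decoder-style training; with both in place, packed and unpacked batches produce the same per-token results.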
