Sequence packing logic
#3
by orrzohar · opened
Could you give some details about the sequence packing logic? Besides packing sequences, did you do anything to prevent attention between different sequences?
Either way, did you ablate this design choice?
We ensure that packing has no effect on the computed results by using 1) a cross-sample attention mask and 2) position id resetting. So our packing strategy should not change the results compared to no packing.
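
For illustration, here is a minimal sketch (not from this repo; names and shapes are assumptions) of what those two pieces typically look like: a block-diagonal, causal attention mask so tokens can only attend within their own sample, and position ids that restart at 0 for each packed sequence.

```python
# Minimal sketch: block-diagonal cross-sample attention mask + per-sequence
# position ids for packed sequences. Function name and layout are illustrative.
import torch

def pack_mask_and_positions(seq_lengths, device="cpu"):
    total_len = sum(seq_lengths)

    # Cross-sample attention mask: allowed only where query and key belong to
    # the same original sequence (block-diagonal), combined with causality.
    sample_ids = torch.repeat_interleave(
        torch.arange(len(seq_lengths), device=device),
        torch.tensor(seq_lengths, device=device),
    )
    same_sample = sample_ids[:, None] == sample_ids[None, :]
    causal = torch.tril(
        torch.ones(total_len, total_len, dtype=torch.bool, device=device)
    )
    attention_mask = same_sample & causal  # [total_len, total_len]

    # Position id resetting: positions restart at 0 for every packed sequence.
    position_ids = torch.cat(
        [torch.arange(n, device=device) for n in seq_lengths]
    )
    return attention_mask, position_ids

# Example: two sequences of lengths 3 and 2 packed into one row of 5 tokens.
mask, pos = pack_mask_and_positions([3, 2])
print(pos)          # tensor([0, 1, 2, 0, 1])
print(mask.int())   # block-diagonal causal mask
```

With both in place, each packed sample sees exactly the same attention pattern and positional encoding it would have seen unpacked, which is why packing should be results-neutral.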