Padding token for batched embedding in Transformers?

#24
by ChrisCrass - opened

Wondering if there are any best practices or special considerations for embedding batches of documents with this model. In my own testing I have found that the presence of extra items in a batch (if it causes any padding to occur) can change the resulting embedding compared to the single-document case.

The tokenizer in the Transformers approach always appends an eos token, but it doesn't add any bos tokens (the bos token is the same as the eos token), and it further uses the eos/bos token as the padding token... Is that by design?
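
For concreteness, a minimal sketch of the tokenizer behaviour described above (the checkpoint name is only a placeholder for this model):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint name; substitute the actual model from this repo.
tokenizer = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-Qwen2-7B-instruct")

# The pad token is set to the same id as the eos token.
print(tokenizer.eos_token_id, tokenizer.pad_token_id)

docs = ["a short query", "a considerably longer document about padding behaviour"]

# With padding=True, the shorter sequence is padded up to the longest one and
# the attention mask flags the pad positions with 0.
batch = tokenizer(docs, padding=True, return_tensors="pt")
print(batch["input_ids"])
print(batch["attention_mask"])

# Tokenizing a document on its own adds no pad tokens at all.
single = tokenizer(docs[0], return_tensors="pt")
print(single["input_ids"])
```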

Tips would be much appreciated

Alibaba-NLP org

Whether batch inference uses padding tokens depends on whether the tokenizer's padding parameter is set to True or False. We recommend not using padding mode.
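
One way to follow that recommendation is to encode each document separately, so no pad tokens ever reach the model. A rough sketch; the checkpoint name and the last-token pooling choice below are assumptions, not the official usage snippet, so adapt them to the model you are actually using:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

name = "Alibaba-NLP/gte-Qwen2-7B-instruct"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModel.from_pretrained(name, trust_remote_code=True)
model.eval()

@torch.no_grad()
def embed(doc: str) -> torch.Tensor:
    # A single sequence is never padded, so the final position is the eos token.
    inputs = tokenizer(doc, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state    # (1, seq_len, dim)
    return F.normalize(hidden[0, -1], dim=-1)     # assumed last-token pooling

docs = ["what is the capital of China?", "The capital of China is Beijing."]
embeddings = torch.stack([embed(d) for d in docs])
print(embeddings @ embeddings.T)                  # cosine similarities
```

Encoding documents one at a time trades throughput for exact parity with the single-document case; if throughput matters, grouping documents of similar length into the same batch keeps the amount of added padding small.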
