Hey @shantanuagarwal, glad you enjoyed the article! I haven't tried it myself, but you should be able to use PyTorch's FlexAttention API for this. Have a look at the tutorial here: https://pytorch.org/blog/flexattention/. The section "Document Masking/Jagged Sequences" covers exactly these packed-sequence masks; a rough sketch of what it could look like is below.
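Again, I haven't run this exact snippet, but following that section of the tutorial it would look roughly like this. The shapes, document lengths, and the `document_ids` tensor are made up for illustration; it assumes PyTorch >= 2.5 and a CUDA device:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# Toy shapes, just for illustration
B, H, SEQ_LEN, HEAD_DIM = 1, 8, 16, 64

# document_ids[i] says which packed document token i belongs to,
# e.g. three documents of length 6, 6 and 4 packed into one sequence.
document_ids = torch.tensor([0] * 6 + [1] * 6 + [2] * 4, device="cuda")

def document_causal_mask(b, h, q_idx, kv_idx):
    # causal within the packed sequence AND no attention across document boundaries
    causal = q_idx >= kv_idx
    same_doc = document_ids[q_idx] == document_ids[kv_idx]
    return causal & same_doc

# Build the block mask once per packing layout; B=None / H=None broadcasts it
# over the batch and head dimensions.
block_mask = create_block_mask(
    document_causal_mask, B=None, H=None, Q_LEN=SEQ_LEN, KV_LEN=SEQ_LEN, device="cuda"
)

q, k, v = (torch.randn(B, H, SEQ_LEN, HEAD_DIM, device="cuda") for _ in range(3))
out = flex_attention(q, k, v, block_mask=block_mask)  # (B, H, SEQ_LEN, HEAD_DIM)
```

For real training you'd want to wrap `flex_attention` with `torch.compile` as the tutorial suggests, and rebuild the block mask whenever the packing layout of the batch changes.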
Lukas