Mistral sliding_window implementation and flash_attn_func
I am trying to fine-tune Mistral-7B using the Hugging Face Trainer and flash-attention, and I have observed strange behaviour with sliding_window: changing its size has no effect on training at all. I assumed that by adjusting the sliding_window size I might be able to fit longer sequences into my model, but neither VRAM usage nor training time appears to be affected by it.
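To make the setup concrete, this is roughly how I load the model (a simplified sketch; the model id, dtype and window value are placeholders, and depending on your transformers version the flash-attention switch may be `use_flash_attention_2=True` instead of `attn_implementation`):

```python
# Simplified sketch of the setup; sliding_window is the value I vary per run.
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"

config = AutoConfig.from_pretrained(model_id)
config.sliding_window = 512  # tried 1, 512 and 2048

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```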
sliding_window in the transformers library
I added some checks to the MistralFlashAttention2 class (tested with a sequence length of 2048 and sliding_window sizes of 1, 512 and 2048) and found that the sliding_window value was being passed through correctly (a sketch of the checks follows the list):
- the use_sliding_window parameter was True
- the sliding_window size was displayed correctly
- flash_attn_func was called with the sliding_window parameter
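The checks were simple debug prints along these lines (a rough sketch of lines inserted into the class; the exact variable names inside MistralFlashAttention2 depend on the transformers version, so treat them as approximations):

```python
# Debug lines inserted into MistralFlashAttention2's forward path
# (approximate variable names; they may differ between transformers versions):
print("use_sliding_window:", use_sliding_window)
print("config.sliding_window:", self.config.sliding_window)

# ...and immediately before the library's call to flash_attn_func, which
# takes the window as `window_size=(left, right)` in the flash-attn package:
print("window_size passed to flash_attn_func:",
      (self.config.sliding_window, self.config.sliding_window))
```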
Memory usage and training time in SageMaker
I measured the peak GPU memory allocated and the wall-clock time for each training step (a sketch of how I collect these numbers is at the end of the post). Example output:
Peak GPU memory allocated at step 92: 8239454208 bytes
Step 92 took 472.382896900177 seconds.
These measurements did not change regardless of what sliding_window was set to (the same was true of the system logs on wandb). This seems odd to me; can someone help me understand this behaviour?
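For reference, the per-step numbers above are collected with a small Trainer callback roughly like this (a simplified sketch; MemoryTimingCallback is my own helper, not part of transformers):

```python
import time

import torch
from transformers import TrainerCallback

class MemoryTimingCallback(TrainerCallback):
    """Logs peak GPU memory and wall-clock time for every training step."""

    def on_step_begin(self, args, state, control, **kwargs):
        torch.cuda.reset_peak_memory_stats()
        self.step_start = time.time()

    def on_step_end(self, args, state, control, **kwargs):
        peak = torch.cuda.max_memory_allocated()
        print(f"Peak GPU memory allocated at step {state.global_step}: {peak} bytes")
        print(f"Step {state.global_step} took {time.time() - self.step_start} seconds.")

# passed to the Trainer via: Trainer(..., callbacks=[MemoryTimingCallback()])
```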