Flash Attention

#6
by skymeng

Hello,
When training two separately initialized base models (Qwen2.5-1.5B and Qwen2.5-0.5B) on identical datasets with Flash Attention enabled, the validation loss is consistently and markedly higher than when training without Flash Attention. Has this issue been documented or studied in existing research?
Thanks
