Flash Attention

#6
by skymeng

Hello,
When training two separately initialized base models (Qwen2.5-1.5B and Qwen2.5-0.5B) on identical datasets with Flash Attention enabled, the validation loss is consistently and markedly higher than when training without Flash Attention. Has this issue been documented or studied in existing research?
Thanks
