Flash Attention
#6 · by skymeng
Hello,
When I train two separately initialized base models (Qwen2.5-1.5B and Qwen2.5-0.5B) on identical datasets with Flash Attention enabled, the validation loss is consistently and markedly higher than when training the same models without Flash Attention. Has this issue been documented or studied in existing research?
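For context, here is a minimal sketch of the kind of A/B comparison I mean, assuming the models are loaded with Hugging Face Transformers; the `load_model` helper is just for illustration, and the only thing that changes between the two runs is the `attn_implementation` flag:

```python
import torch
from transformers import AutoModelForCausalLM

def load_model(model_name: str, use_flash_attention: bool):
    # The only difference between the two runs is the attention backend:
    # "flash_attention_2" uses the fused FlashAttention-2 kernel (requires a
    # CUDA GPU and fp16/bf16 weights), "eager" is the reference PyTorch path.
    return AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,  # FA2 does not support fp32
        attn_implementation="flash_attention_2" if use_flash_attention else "eager",
    )

# Same base model, same data, same hyperparameters; only the backend differs.
model_fa2 = load_model("Qwen/Qwen2.5-1.5B", use_flash_attention=True)
model_ref = load_model("Qwen/Qwen2.5-1.5B", use_flash_attention=False)
```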
Thanks