Batch size 'auto' leads to hanging jobs

#1110
by gcamp - opened

Hello
I am trying to reproduce the leaderboard evaluation for Qwen2.5-72B on 8 H100 GPUs.
I am noticing some differences when running with an explicit batch size value (as expected), but when running with 'auto' the evaluation job hangs while computing the batch size. How can I overcome this problem?
Also, for a model like this, how do you set your parallelization in accelerate, such as num_processes, etc.?
Thanks
