How to get such a large global batch size
Hello, great work!
One of my questions is how you achieve such a large global batch size during training. Table 3 of your paper mentions a global batch size of 16384 for the base model.
On my side, I use 8x 32G V100 GPUs with fp16, gradient checkpointing, and ZeRO-1 to reduce memory usage. With a max sequence length of 128, my per-GPU batch size can reach 1024. This is without synchronizing negatives across devices.
However, once I synchronize the negative samples across the 8 cards, I run out of memory: with negative-sample synchronization turned on, my per-GPU batch size drops from 1024 to about 192, and it drops even further if I use more GPUs.
Is this normal? Do you use cross-device negative synchronization during training? (A rough sketch of what I mean is below.)
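For context, here is a minimal sketch of what I mean by "synchronizing negatives", assuming an already initialized `torch.distributed` process group and a standard in-batch contrastive (InfoNCE-style) loss; the names `gather_tensor`, `contrastive_loss`, `q_emb`, and `p_emb` are my own placeholders, not your code. Gathering passage embeddings from every rank makes the logits matrix `[local_batch, world_size * local_batch]`, which is where the extra memory goes as the GPU count grows.

```python
# Sketch only: cross-device negative sharing for a contrastive loss.
# Assumes torch.distributed is initialized (e.g. via torchrun + init_process_group).
import torch
import torch.distributed as dist
import torch.nn.functional as F


def gather_tensor(t: torch.Tensor) -> torch.Tensor:
    """All-gather `t` across ranks, keeping gradients for the local shard."""
    gathered = [torch.zeros_like(t) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, t.contiguous())
    gathered[dist.get_rank()] = t  # re-insert local tensor so it keeps its grad
    return torch.cat(gathered, dim=0)


def contrastive_loss(q_emb: torch.Tensor, p_emb: torch.Tensor, temperature: float = 0.05):
    # q_emb, p_emb: [local_batch, dim] query / passage embeddings on this rank
    all_p = gather_tensor(p_emb)              # [world_size * local_batch, dim]
    logits = q_emb @ all_p.t() / temperature  # [local_batch, world_size * local_batch]
    # the positives for this rank sit at an offset of rank * local_batch
    offset = dist.get_rank() * q_emb.size(0)
    labels = torch.arange(q_emb.size(0), device=q_emb.device) + offset
    return F.cross_entropy(logits, labels)
```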
We train our model on 8 A100-80G GPUs. On 32G V100s, you can use gradient accumulation to increase the overall batch size.
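For reference, a self-contained sketch of gradient accumulation in plain PyTorch (the toy model, data, and numbers are illustrative placeholders, not the paper's setup): gradients from several small micro-batches are summed before a single optimizer step, so the effective batch per update is `micro_batch * accumulation_steps`.

```python
# Sketch only: gradient accumulation to raise the effective batch size per GPU.
import torch
import torch.nn as nn

model = nn.Linear(128, 128)                      # stand-in for the real encoder
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()                           # stand-in for the real training loss

micro_batch, accumulation_steps = 192, 8         # effective batch = 1536 per GPU

optimizer.zero_grad()
for step in range(64):
    x = torch.randn(micro_batch, 128)            # stand-in for a real micro-batch
    loss = loss_fn(model(x), x)
    (loss / accumulation_steps).backward()       # scale so summed grads match one big batch
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                         # one update per accumulation_steps micro-batches
        optimizer.zero_grad()
```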