Hardware Question - Single system or multiple?
Nice work! From the readme:
Training took ~30hrs on 5x3090s and used almost 23gb of vram on each. DDP was used for pytorch parallelism.
I think I know the answer to this, but given that you used DDP, does that mean this was trained across multiple CPUs/systems? (I am on the hunt for a motherboard/system that can support several GPUs in one system and was initially excited that you may have used such a system.)
If this was on a single system, do you happen to know what the motherboard/system specs were that support 5x 3090s? And if not, then the search continues...
1 CPU, 1 system, multiple GPUs. It's server components: a Tyan S8030GM2NE motherboard with an Epyc 7532 32-core.
DDP is PyTorch's distributed data-parallel implementation for multiple GPUs or multiple systems: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
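To make that concrete, here's a minimal, hedged sketch of the DDP pattern (one process per GPU, gradients all-reduced during backward). This is illustrative only, not this repo's training script; the toy model and hyperparameters are placeholders:

```python
# Minimal DDP sketch: one process per GPU, NCCL backend, toy model.
# Not the repo's actual training code -- just the general pattern.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(10, 10).to(rank)    # placeholder model
    ddp_model = DDP(model, device_ids=[rank])   # gradients are synced across GPUs

    opt = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)
    for _ in range(10):
        opt.zero_grad()
        out = ddp_model(torch.randn(8, 10, device=rank))
        out.sum().backward()                    # backward triggers the all-reduce
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()      # e.g. 5 for 5x 3090s
    mp.spawn(train, args=(world_size,), nprocs=world_size)
```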
Just a newbie question: why did it take so many resources to train an 8-bit 7B LoRA? Usually in text generation webui I can do the same with 8GB of VRAM at batch size 1. Is it due to the data size?
Wow, only 8GB of VRAM, that's impressive. There are a number of factors: one is the batch size, as well as the gradient accumulation steps. I also used the adamw_bnb_8bit optimizer, and the optimizer can affect VRAM usage. My sequence length was also 1000, and padded, so every sample would be 1000 tokens... that contributes for sure. I'm not sure about the total data size; I don't know if the entire dataset is loaded into VRAM. Higher lora_r and lora_alpha would also contribute to higher VRAM usage.
So to be fully honest, I'm not 100% certain on everything; your VRAM usage seems surprisingly low, and mine seemed surprisingly high.
Edit: update, dataset size seems to have zero effect on VRAM usage, which makes sense, because it's loaded in batches of batch_size.
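For what it's worth, here's a rough sketch of where those knobs (batch size, gradient accumulation, the 8-bit optimizer, lora_r/lora_alpha) typically live in a peft + transformers LoRA fine-tune. This is not the actual config from this repo, and the values are just placeholders for illustration:

```python
# Hedged sketch of the VRAM-relevant knobs in a typical peft + transformers
# LoRA fine-tune. Values are illustrative, not this repo's settings.
from transformers import TrainingArguments
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                  # lora_r: higher rank = more trainable params = more VRAM
    lora_alpha=128,        # lora_alpha: scaling factor paired with r
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,   # per-GPU batch size: a big VRAM lever
    gradient_accumulation_steps=8,   # raises effective batch size without more VRAM
    optim="adamw_bnb_8bit",          # 8-bit optimizer states reduce optimizer VRAM
    max_steps=1000,
)

# Sequence length / padding is set at tokenization time, e.g.
# tokenizer(text, max_length=1000, padding="max_length", truncation=True),
# so padding everything to 1000 tokens raises activation memory per sample.
```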