How to understand the graph "Tensor parallelism with column linear + row Linear"
#109, opened by Yihel
Yeah, but I think they probably just copied and pasted from the previous section.
I think they are trying to say that in a transformer block, the FFN has one hidden layer, which means two matrix multiplications. So it should be X * W_1 * W_2.
X * W_1 = Y_1 --> use column parallel (no all-gather afterwards; Y_1 stays split by columns across devices)
Y_1 * W_2 = Y_2 --> use row parallel (now one all-reduce sums the partial results into the full Y_2)
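
If it helps, here is a small sketch that checks the two steps above on toy numbers. It is only an illustration under my own assumptions: two simulated "devices", made-up integer matrices, plain Python lists instead of real tensors, and no activation between the two matmuls (an elementwise activation could be applied to each Y_1 shard locally without changing the picture). The final sum stands in for what a real multi-GPU setup would do with an all-reduce.

```python
# Toy check: column-parallel W_1 followed by row-parallel W_2.
# Two simulated devices; all values are made up for illustration.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Elementwise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

X = [[1, 2], [3, 4]]                     # (batch=2, d_model=2)
W1 = [[1, 0, 2, 1], [0, 1, 1, 3]]        # (d_model=2, d_hidden=4)
W2 = [[1, 2], [0, 1], [2, 0], [1, 1]]    # (d_hidden=4, d_model=2)

# Column parallel: each device holds half of W_1's columns.
W1_shards = [[row[:2] for row in W1], [row[2:] for row in W1]]
# Row parallel: each device holds the matching half of W_2's rows.
W2_shards = [W2[:2], W2[2:]]

# Step 1: each device computes its own column slice of Y_1. No communication.
Y1_shards = [matmul(X, w) for w in W1_shards]

# Step 2: each device multiplies its Y_1 slice by its W_2 rows,
# which gives a full-shaped but partial Y_2.
Y2_partials = [matmul(y, w) for y, w in zip(Y1_shards, W2_shards)]

# The only communication: sum the partial results across devices.
Y2_parallel = add(Y2_partials[0], Y2_partials[1])

# Reference: the unsharded X * W_1 * W_2.
Y2_reference = matmul(matmul(X, W1), W2)
assert Y2_parallel == Y2_reference
```

The point is that the column split of Y_1 lines up exactly with the row split of W_2, so nothing needs to be gathered between the two matmuls and only one collective is needed at the end.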