MLP Design Choice
#47, opened by scissorstail
Why does this model series, unlike other models, always split a tensor into two chunks to compute the up_states and gate in the MLP? I'm not sure this is the right place to ask, but I didn't have anywhere else to turn.
Hello @scissorstail!
We used a few performance tricks to increase throughput and MFU (model FLOPs utilization) during pre-training. One of them is using a single matrix to compute both the up and gate states, which is why the result is then split into two chunks.
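For anyone who wants to see why the fused form is equivalent, here is a minimal sketch in plain Python (no framework). It is not the model's actual code: the names `mlp_separate`, `mlp_fused`, `w_gate`, `w_up`, `w_gate_up`, the tiny sizes, and the SiLU activation are all assumptions for illustration, and the down projection is left out. The point is only that concatenating the gate and up weights column-wise, doing one matmul, and splitting the output in half gives the same values as two separate matmuls.

```python
import math
import random


def matmul(x, w):
    """Multiply a (rows x inner) matrix by an (inner x cols) matrix (nested lists)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def silu(v):
    """SiLU activation, assumed here purely for illustration."""
    return v / (1.0 + math.exp(-v))


def mlp_separate(x, w_gate, w_up):
    """Two projections: one matmul for the gate, one for the up states."""
    gate = matmul(x, w_gate)
    up = matmul(x, w_up)
    return [[u * silu(g) for g, u in zip(g_row, u_row)]
            for g_row, u_row in zip(gate, up)]


def mlp_fused(x, w_gate_up, inter):
    """One projection of width 2 * inter, then split each row into two chunks."""
    fused = matmul(x, w_gate_up)
    out = []
    for row in fused:
        gate, up = row[:inter], row[inter:]  # first half is gate, second half is up
        out.append([u * silu(g) for g, u in zip(gate, up)])
    return out


random.seed(0)
hidden, inter, tokens = 4, 6, 3  # hypothetical toy sizes
x = [[random.uniform(-1, 1) for _ in range(hidden)] for _ in range(tokens)]
w_gate = [[random.uniform(-1, 1) for _ in range(inter)] for _ in range(hidden)]
w_up = [[random.uniform(-1, 1) for _ in range(inter)] for _ in range(hidden)]

# Fused weight: gate columns followed by up columns, shape (hidden x 2 * inter).
w_gate_up = [g_row + u_row for g_row, u_row in zip(w_gate, w_up)]

separate_out = mlp_separate(x, w_gate, w_up)
fused_out = mlp_fused(x, w_gate_up, inter)
```

The arithmetic is the same either way. The practical difference is that the fused version issues one larger matmul instead of two smaller ones, which tends to use the hardware more efficiently.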
gugarosa changed discussion status to closed
gugarosa changed discussion status to open