MLP Design Choice
#47, opened by scissorstail
Why does this model series, unlike other models, always compute the up_states and gate in the MLP together and then split the result into two chunks? I’m not sure if this is the right place to ask, but I didn’t have anywhere else to turn.
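For anyone unfamiliar with the pattern I mean, here is a minimal pure-Python sketch, not the model's actual code. It assumes a gated MLP in which one fused projection produces both halves and the output is split down the middle, and compares it with the two-projection layout used elsewhere. The names `w_gate_up`, `w_gate`, `w_up`, `w_down`, the SiLU activation, and the tiny dimensions are all my own assumptions for illustration.

```python
import math
import random


def matvec(weight, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def silu(v):
    """SiLU activation; assumed here, the real model may use another one."""
    return v / (1.0 + math.exp(-v))


def mlp_fused(x, w_gate_up, w_down):
    # One projection yields 2 * d_ff values, which are split into two chunks.
    gate_up = matvec(w_gate_up, x)
    half = len(gate_up) // 2
    gate, up_states = gate_up[:half], gate_up[half:]
    hidden = [u * silu(g) for g, u in zip(gate, up_states)]
    return matvec(w_down, hidden)


def mlp_separate(x, w_gate, w_up, w_down):
    # Two independent projections, the layout many other models use.
    gate = matvec(w_gate, x)
    up_states = matvec(w_up, x)
    hidden = [u * silu(g) for g, u in zip(gate, up_states)]
    return matvec(w_down, hidden)


def random_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d_model, d_ff = 4, 6  # hypothetical toy sizes
x = [random.uniform(-1.0, 1.0) for _ in range(d_model)]
w_gate = random_matrix(d_ff, d_model)
w_up = random_matrix(d_ff, d_model)
w_down = random_matrix(d_model, d_ff)

# Stacking the gate rows on top of the up rows gives the fused weight.
w_gate_up = w_gate + w_up

out_fused = mlp_fused(x, w_gate_up, w_down)
out_separate = mlp_separate(x, w_gate, w_up, w_down)
same = all(math.isclose(a, b) for a, b in zip(out_fused, out_separate))
print("outputs match:", same)
```

As far as I can tell the two layouts are mathematically equivalent once the weights are stacked, so the difference looks like an implementation choice rather than a modeling one. One commonly cited motivation for fusing is that a single larger matrix multiply replaces two smaller ones, but I don't know whether that is the reason here.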