MLP Design Choice

#47
by scissorstail - opened

Why does this model series, unlike most other models, always compute one combined projection in the MLP and then split it into two chunks to get the gate and up_states? I’m not sure this is the right place to ask, but I didn’t have anywhere else to turn.

https://github.com/huggingface/transformers/blob/40a493c7ed4f19f08eadb0639cf26d49bfa5e180/src/transformers/models/phi4_multimodal/modeling_phi4_multimodal.py#L763
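
To make the question concrete, here is a minimal sketch of the two layouts in plain Python, not the actual transformers code. It assumes a SiLU-gated MLP, leaves out the final down projection, and uses made-up weights. The function names (`fused_mlp_hidden`, `separate_mlp_hidden`) are hypothetical. The chunk order (gate first, then up) is also an assumption for illustration.

```python
import math


def matvec(weight, x):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def fused_mlp_hidden(gate_up_weight, x):
    # One projection yields 2 * intermediate_size values,
    # which are then split into two equal chunks.
    fused = matvec(gate_up_weight, x)
    half = len(fused) // 2
    gate, up_states = fused[:half], fused[half:]
    return [u * silu(g) for g, u in zip(gate, up_states)]


def separate_mlp_hidden(gate_weight, up_weight, x):
    # Two independent projections, the layout many other models use.
    gate = matvec(gate_weight, x)
    up_states = matvec(up_weight, x)
    return [u * silu(g) for g, u in zip(gate, up_states)]


# Illustrative weights: hidden_size = 2, intermediate_size = 3.
gate_weight = [[0.5, -1.0], [0.25, 0.75], [1.0, 0.0]]
up_weight = [[1.0, 2.0], [-0.5, 0.5], [0.0, 1.5]]
x = [2.0, -1.0]

# Stacking the rows of the two matrices gives the fused weight.
gate_up_weight = gate_weight + up_weight

fused_out = fused_mlp_hidden(gate_up_weight, x)
separate_out = separate_mlp_hidden(gate_weight, up_weight, x)
```

With the fused weight built by stacking the gate rows on top of the up rows, both functions return the same values, so the split is a difference in how the parameters are stored and multiplied, not in what the MLP computes. The sketch does not show why this series chose the fused layout.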
