MLP Design Choice

#47

by scissorstail - opened 21 days ago


Why does this model series, unlike other models, always split a projection output into two chunks to obtain gate and up_states in the MLP? I'm not sure this is the right place to ask, but I didn't have anywhere else to turn.

https://github.com/huggingface/transformers/blob/40a493c7ed4f19f08eadb0639cf26d49bfa5e180/src/transformers/models/phi4_multimodal/modeling_phi4_multimodal.py#L763
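
For readers comparing the two styles, here is a minimal dependency-free sketch of what the chunking amounts to numerically. It assumes the fused weight stacks the gate rows before the up rows and that the activation is SiLU; the helper names (`matvec`, `mlp_separate`, `mlp_fused`) are hypothetical and not taken from the linked file, and the final down projection is left out. It only shows that, under those assumptions, one fused projection followed by a two-way split gives the same values as two separate projections. It does not explain why the authors chose this layout.

```python
import math
import random


def matvec(weight, x):
    # Multiply a (rows x cols) weight matrix by a vector of length cols.
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def silu(z):
    # SiLU activation: z * sigmoid(z). Assumed here for illustration.
    return z / (1.0 + math.exp(-z))


def mlp_separate(x, w_gate, w_up):
    # Two independent projections, as in models with gate_proj and up_proj.
    gate = matvec(w_gate, x)
    up = matvec(w_up, x)
    return [u * silu(g) for g, u in zip(gate, up)]


def mlp_fused(x, w_gate_up):
    # One projection to twice the intermediate size, then split in two.
    fused = matvec(w_gate_up, x)
    half = len(fused) // 2
    gate, up = fused[:half], fused[half:]  # assumed order: gate first, then up
    return [u * silu(g) for g, u in zip(gate, up)]


random.seed(0)
hidden_size, intermediate_size = 4, 6
x = [random.uniform(-1, 1) for _ in range(hidden_size)]
w_gate = [[random.uniform(-1, 1) for _ in range(hidden_size)]
          for _ in range(intermediate_size)]
w_up = [[random.uniform(-1, 1) for _ in range(hidden_size)]
        for _ in range(intermediate_size)]

# Stacking the two weight matrices row-wise yields the fused weight.
w_gate_up = w_gate + w_up

out_separate = mlp_separate(x, w_gate, w_up)
out_fused = mlp_fused(x, w_gate_up)
max_diff = max(abs(a - b) for a, b in zip(out_separate, out_fused))
print(max_diff)
```

Since the two forms produce the same values in this sketch, any difference between them would be about how the weights are stored and how many matrix multiplications are launched, not about what the MLP computes.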
