Salesforce/xgen-mm-phi3-mini-instruct-interleave-r-v1.5 · Question About BLIP-3’s Replacement of Q-Former with Perceiver Resampler

Hi,
I noticed in the BLIP-3 paper that you replaced the Q-Former from BLIP-2 with Flamingo’s original Perceiver Resampler. However, I’m struggling to fully understand the motivation behind this change.
From my perspective, there doesn’t seem to be a fundamental difference between the Q-Former and the Perceiver Resampler. As I understand it, the Perceiver Resampler mixes self-attention and cross-attention within a single attention layer, whereas the Q-Former keeps them as separate self-attention and cross-attention layers that it alternates between. There are some subtle architectural differences, but in both cases a fixed set of learned queries pulls information out of the image features, so the two seem largely functionally equivalent to me.
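To make my mental model concrete, here is a toy sketch of how I picture the two blocks. It is only my reading of the two designs, not code from either paper or from this repo: the names `attend`, `resampler_layer`, and `qformer_block` are made up for illustration, and I have left out the learned projections, multiple heads, layer norms, feed-forward sublayers, and the Q-Former's text branch.

```python
import math


def attend(queries, keys, values):
    """Single-head scaled dot-product attention over plain lists of vectors."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return outputs


def residual(x, update):
    """Element-wise residual connection: x + update."""
    return [[a + b for a, b in zip(row, upd)] for row, upd in zip(x, update)]


def resampler_layer(latents, visual):
    # Perceiver-Resampler style (my reading): ONE attention call whose
    # keys/values are the visual features concatenated with the latents,
    # so "self" and "cross" attention happen in the same softmax.
    context = visual + latents
    return residual(latents, attend(latents, context, context))


def qformer_block(queries, visual, use_cross_attention):
    # Q-Former style (my reading): self-attention among the queries first,
    # then a SEPARATE cross-attention to the visual features, which is only
    # present in some blocks (hence the flag).
    queries = residual(queries, attend(queries, queries, queries))
    if use_cross_attention:
        queries = residual(queries, attend(queries, visual, visual))
    return queries


# Toy inputs: 5 visual tokens and 2 learned queries, all of width 4.
visual = [[0.1 * (i + j) for j in range(4)] for i in range(5)]
latents = [[0.5, -0.5, 0.25, 0.0], [-0.25, 0.75, 0.0, 0.5]]

resampled = resampler_layer(latents, visual)
q_out = qformer_block(latents, visual, use_cross_attention=True)
print(len(resampled), len(resampled[0]))
print(len(q_out), len(q_out[0]))
```

In both cases the output has one vector per learned query no matter how many visual tokens come in, which is why the two look interchangeable to me at this level of detail.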
I’m wondering if you could elaborate on the reasoning behind this design choice. Did Perceiver Resampler provide specific advantages over Q-Former? Or is there something I might be misunderstanding about the differences between the two?
Looking forward to your insights. Thank you!