Abstract
While Multimodal Large Language Models (MM-LLMs) have recently made exciting strides, they mostly suffer from the limitation of input-side multimodal understanding only, without the ability to produce content in multiple modalities. As we humans always perceive the world and communicate with people through various modalities, developing any-to-any MM-LLMs capable of accepting and delivering content in any modality becomes essential to human-level AI. To fill the gap, we present an end-to-end general-purpose any-to-any MM-LLM system, NExT-GPT. We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio. By leveraging existing well-trained, highly performing encoders and decoders, NExT-GPT is tuned with only a small number of parameters (1%) in certain projection layers, which not only lowers training cost but also facilitates convenient expansion to more potential modalities. Moreover, we introduce modality-switching instruction tuning (MosIT) and manually curate a high-quality dataset for it, based on which NExT-GPT is empowered with complex cross-modal semantic understanding and content generation. Overall, our research showcases the promising possibility of building an AI agent capable of modeling universal modalities, paving the way for more human-like AI research in the community.
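To make the tuning recipe in the abstract concrete, below is a minimal PyTorch-style sketch of training only the projection layers while the multimodal encoders, the LLM, and the diffusion decoders stay frozen. The module names, stand-in layers, and dimensions are illustrative assumptions, not the actual NExT-GPT code.

```python
# Minimal sketch: only the projection layers between the frozen encoders, the
# frozen LLM, and the frozen diffusion decoders receive gradients (~1% of the
# total parameters). All component names here are hypothetical stand-ins.
import torch
import torch.nn as nn

class ProjectionLayer(nn.Module):
    """Maps encoder features into the LLM embedding space (or LLM hidden
    states into a diffusion decoder's conditioning space)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                  nn.Linear(out_dim, out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

def freeze(module: nn.Module) -> nn.Module:
    for p in module.parameters():
        p.requires_grad = False
    return module

# Frozen components (stand-ins for ImageBind-style encoders, a Vicuna-style
# LLM, and Stable-Diffusion-style decoders).
multimodal_encoder = freeze(nn.Linear(1024, 1024))
llm = freeze(nn.Linear(4096, 4096))
decoder_conditioner = freeze(nn.Linear(768, 768))

# Trainable projections only.
input_proj = ProjectionLayer(1024, 4096)   # encoder features -> LLM space
output_proj = ProjectionLayer(4096, 768)   # LLM signal tokens -> decoder conditioning

optimizer = torch.optim.AdamW(
    list(input_proj.parameters()) + list(output_proj.parameters()), lr=1e-4)
```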
Community
How to fine tune this model?
@cutmasta-kun , seeing as the code and data repos are empty, you don't right now.
Once they aren't empty, the same way you fine-tune any other pre-trained model. I'd recommend starting with LoRA.
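As a starting point, here is a minimal LoRA sketch using the Hugging Face peft library, assuming the released checkpoint exposes a standard causal-LM backbone. The model path and target module names are placeholders; adjust them to the actual checkpoint once it is available.

```python
# Minimal LoRA fine-tuning sketch with peft. The model path and target modules
# are placeholders, not the actual NExT-GPT release.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("path/to/nextgpt-llm-backbone")
tokenizer = AutoTokenizer.from_pretrained("path/to/nextgpt-llm-backbone")

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # typical attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
# ...then train with the usual Trainer or a custom loop on instruction data.
```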
Are the special tokens like "AUD" appended to the text response, or does the LLM itself generate them? That is my point of confusion. Also, in terms of visual question answering, how does this compare to BLIP-2? The paper reports image-text performance where NExT-GPT surpasses BLIP-2, but not specifically for visual question answering.
Hi @mattbarr, thanks for the attention; the codebase is now published at https://github.com/NExT-GPT/NExT-GPT.
Hi @UncleanCode, thanks for the interest.
- Regarding your first question, the special tokens are generated by the LLM itself when users ask NExT-GPT to show images, videos, or sounds. During training, we insert the pre-defined special tokens into the LLM's vocabulary. For more details, please check the code (a minimal sketch follows below).
- For the second question, you are right that we have not yet run experiments comparing on the VQA task; we expect the community can do this with our released code.
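For illustration, a minimal transformers sketch of inserting modality signal tokens into the LLM vocabulary, as described in the reply above. The exact token strings and model path are assumptions, not necessarily what NExT-GPT uses.

```python
# Minimal sketch: add modality signal tokens to the tokenizer and resize the
# LLM embeddings so the model can generate them like ordinary tokens.
# Token strings and model path are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/vicuna-backbone")
model = AutoModelForCausalLM.from_pretrained("path/to/vicuna-backbone")

signal_tokens = ["[IMG]", "[VID]", "[AUD]"]
tokenizer.add_special_tokens({"additional_special_tokens": signal_tokens})
model.resize_token_embeddings(len(tokenizer))

# At inference time, the decoded response is scanned for these tokens; the
# hidden states at their positions condition the corresponding diffusion decoder.
```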
Looking forward to its inclusion in Hugging Face.
Looks very interesting; excited to try it out on Hugging Face.
Hey, could I download the model files, put them into the ckpt folder, and run demo_app.py from your GitHub to run it locally?
NExT-GPT: The Future of Any-to-Any Multimodal AI!
Links:
Subscribe: https://www.youtube.com/@Arxflix
Twitter: https://x.com/arxflix
LMNT (Partner): https://lmnt.com/