Abstract
While Multimodal Large Language Models (MM-LLMs) have recently made exciting strides, they mostly suffer from the limitation of input-side multimodal understanding only, without the ability to produce content in multiple modalities. As we humans always perceive the world and communicate with people through various modalities, developing any-to-any MM-LLMs capable of accepting and delivering content in any modality becomes essential to human-level AI. To fill the gap, we present an end-to-end general-purpose any-to-any MM-LLM system, NExT-GPT. We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio. By leveraging existing well-trained, highly performing encoders and decoders, NExT-GPT is tuned with only a small number of parameters (1%) in certain projection layers, which not only lowers training cost but also facilitates convenient expansion to more potential modalities. Moreover, we introduce modality-switching instruction tuning (MosIT) and manually curate a high-quality dataset for it, based on which NExT-GPT is empowered with complex cross-modal semantic understanding and content generation. Overall, our research showcases the promising possibility of building an AI agent capable of modeling universal modalities, paving the way for more human-like AI research in the community.
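To make the tuning recipe in the abstract concrete, below is a minimal PyTorch-style sketch of training only the projection layers while the multimodal encoders, the LLM, and the diffusion decoders stay frozen. The module names, stand-in layers, and dimensions are illustrative assumptions, not the actual NExT-GPT code.

```python
# Minimal sketch: only the projection layers between the frozen encoders, the
# frozen LLM, and the frozen diffusion decoders receive gradients (~1% of the
# total parameters). All component names here are hypothetical stand-ins.
import torch
import torch.nn as nn

class ProjectionLayer(nn.Module):
    """Maps encoder features into the LLM embedding space (or LLM hidden
    states into a diffusion decoder's conditioning space)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                  nn.Linear(out_dim, out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

def freeze(module: nn.Module) -> nn.Module:
    for p in module.parameters():
        p.requires_grad = False
    return module

# Frozen components (stand-ins for ImageBind-style encoders, a Vicuna-style
# LLM, and Stable-Diffusion-style decoders).
multimodal_encoder = freeze(nn.Linear(1024, 1024))
llm = freeze(nn.Linear(4096, 4096))
decoder_conditioner = freeze(nn.Linear(768, 768))

# Trainable projections only.
input_proj = ProjectionLayer(1024, 4096)   # encoder features -> LLM space
output_proj = ProjectionLayer(4096, 768)   # LLM signal tokens -> decoder conditioning

optimizer = torch.optim.AdamW(
    list(input_proj.parameters()) + list(output_proj.parameters()), lr=1e-4)
```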
Community
How to fine tune this model?
@cutmasta-kun , seeing as the code and data repos are empty, you don't right now.
Once they aren't empty, the same way you fine-tune any other pre-trained model. I'd recommend starting with LoRA.
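As a starting point, here is a minimal LoRA sketch using the Hugging Face peft library, assuming the released checkpoint exposes a standard causal-LM backbone. The model path and target module names are placeholders; adjust them to the actual checkpoint once it is available.

```python
# Minimal LoRA fine-tuning sketch with peft. The model path and target modules
# are placeholders, not the actual NExT-GPT release.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("path/to/nextgpt-llm-backbone")
tokenizer = AutoTokenizer.from_pretrained("path/to/nextgpt-llm-backbone")

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # typical attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
# ...then train with the usual Trainer or a custom loop on instruction data.
```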
Are the special tokens like "AUD" appended to the text response, or does the LLM itself generate them? That is my point of confusion. Also, in terms of visual question answering, how does this compare to BLIP-2? The paper reports image-text performance where NExT-GPT surpasses BLIP-2, but not specifically for visual question answering.
Hi @mattbarr, thanks for the attention; the codebase is now published at https://github.com/NExT-GPT/NExT-GPT.
Hi @UncleanCode, thanks for the interest.
- Regarding your first question, the special tokens are generated by the LLM itself when users ask NExT-GPT to show images, videos, or sounds. During training, we insert the pre-defined special tokens into the LLM's vocabulary. For more details, please check the code (a minimal sketch follows below).
- For the second question, you are right that we have not yet run experiments comparing on the VQA task; we expect the community can do this with our released code.
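For illustration, a minimal transformers sketch of inserting modality signal tokens into the LLM vocabulary, as described in the reply above. The exact token strings and model path are assumptions, not necessarily what NExT-GPT uses.

```python
# Minimal sketch: add modality signal tokens to the tokenizer and resize the
# LLM embeddings so the model can generate them like ordinary tokens.
# Token strings and model path are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/vicuna-backbone")
model = AutoModelForCausalLM.from_pretrained("path/to/vicuna-backbone")

signal_tokens = ["[IMG]", "[VID]", "[AUD]"]
tokenizer.add_special_tokens({"additional_special_tokens": signal_tokens})
model.resize_token_embeddings(len(tokenizer))

# At inference time, the decoded response is scanned for these tokens; the
# hidden states at their positions condition the corresponding diffusion decoder.
```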
Looking forward to its inclusion in Hugging Face.
Looks very interesting; excited to try it out on Hugging Face.
Hey, could I download the model files, put them into the ckpt folder, and run demo_app.py from your GitHub to run it locally?
NExT-GPT: The Future of Any-to-Any Multimodal AI!
Links:
Subscribe: https://www.youtube.com/@Arxflix
Twitter: https://x.com/arxflix
LMNT (Partner): https://lmnt.com/