Ada-LLaVA Model Card

AdaLLaVA-PruMerge: This model is a variant of Ada-LLaVA-7B combined with the LLaVA-PruMerge token pruning technique.
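
For intuition, below is a minimal, hypothetical sketch of attention-based visual token pruning and merging in the spirit of LLaVA-PruMerge. The `prune_and_merge` function, the random inputs, and the keep ratio are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical, simplified sketch of PruMerge-style token reduction:
# keep the visual tokens with the highest CLS-attention scores and
# merge each dropped token into its most similar kept token.
import torch

def prune_and_merge(tokens: torch.Tensor, cls_attn: torch.Tensor, keep: int) -> torch.Tensor:
    """tokens: (N, D) visual tokens; cls_attn: (N,) attention from the CLS token."""
    keep_idx = cls_attn.topk(keep).indices          # indices of most salient tokens
    drop_mask = torch.ones(tokens.size(0), dtype=torch.bool)
    drop_mask[keep_idx] = False
    kept, dropped = tokens[keep_idx], tokens[drop_mask]
    # Assign each dropped token to its most similar kept token, then average in.
    sim = dropped @ kept.T                          # (N - keep, keep) similarity
    assign = sim.argmax(dim=1)
    merged = kept.clone()
    for j in range(keep):
        members = dropped[assign == j]
        if members.numel() > 0:
            merged[j] = (kept[j] + members.sum(dim=0)) / (1 + members.size(0))
    return merged                                   # (keep, D) reduced token set

vis = torch.randn(576, 1024)                # 24x24 CLIP-ViT-L-336px patch tokens
attn = torch.rand(576)                      # placeholder CLS-attention scores
reduced = prune_and_merge(vis, attn, keep=144)   # ~75% token reduction
```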

Ada-LLaVA-7B-PruMerge is an open-source adaptive inference framework for multimodal Large Language Models (MLLMs) that dynamically adjusts its operations based on available computational resources and latency requirements.

See the paper for more details: Learning to Inference Adaptively for Multimodal Large Language Models

Model details: https://zhuoyan-xu.github.io/ada-llava/

Model Details

Model Type: Ada-LLaVA 7B follows the LLaVA-v1.5 stage-2 training pipeline. It uses CLIP-ViT-L-336px as the visual encoder (336×336 image resolution), Vicuna-v1.5-7B as the base LLM, a two-layer MLP as the vision-language connector, and a customized embedding model plus MLP as the latency scheduler.
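
As a rough illustration of the latency-scheduler idea (not the official Ada-LLaVA code), the sketch below assumes a hypothetical MLP scheduler that maps a scalar latency budget to a binary execution mask over optional blocks; all class and parameter names here are invented for illustration:

```python
# Minimal sketch of latency-conditioned adaptive execution: a scheduler
# predicts which optional blocks to run given a latency budget, and
# masked-out blocks are skipped at inference time.
import torch
import torch.nn as nn

class LatencyScheduler(nn.Module):
    """Hypothetical scheduler: maps a scalar latency budget to a 0/1 mask
    over optional blocks (1 = run, 0 = skip)."""
    def __init__(self, num_blocks: int, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, num_blocks)
        )

    def forward(self, budget: torch.Tensor) -> torch.Tensor:
        # Hard 0/1 decisions at inference; a real implementation would use a
        # differentiable relaxation (e.g., Gumbel-softmax) during training.
        return (torch.sigmoid(self.mlp(budget)) > 0.5).float()

class AdaptiveStack(nn.Module):
    """Runs a stack of blocks, skipping those the scheduler masks out."""
    def __init__(self, dim: int, num_blocks: int):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_blocks))
        self.scheduler = LatencyScheduler(num_blocks)

    def forward(self, x: torch.Tensor, budget: torch.Tensor) -> torch.Tensor:
        mask = self.scheduler(budget)
        for i, block in enumerate(self.blocks):
            if mask[i] > 0:               # skip this block when mask is 0
                x = x + block(x)          # residual form keeps shapes valid
        return x

x = torch.randn(1, 16)
out = AdaptiveStack(dim=16, num_blocks=8)(x, torch.tensor([0.5]))
```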

It was trained with the same stage-2 pipeline as LLaVA:

Instruction tuning: freeze the vision encoder and train the remaining model with multimodal instruction-following data covering tabular and non-tabular tasks.

Code Base: We use the official LLaVA-v1.5 code for model training and inference; the saved model checkpoint is uploaded to this repository.
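
Assuming the Ada-LLaVA fork preserves the standard LLaVA-v1.5 loading interface, the checkpoint could be loaded as in the sketch below; the repository's own instructions take precedence over this example:

```python
# Sketch of loading the checkpoint through the standard LLaVA-v1.5 builder
# (assumes the Ada-LLaVA fork keeps this interface unchanged).
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

model_path = "zhuoyanxu/ada-llava-L-v1.5-7b-prumerge"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
)
```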

Model Date: Ada-LLaVA 7B was trained in Oct 2024.

License

AdaLLaVA is based on LLaVA-1.5 and thus follows its license. Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.

Intended use

Primary intended uses: The primary use of Ada-LLaVA is research on multimodal large language models and chatbots, especially for resource-constrained inference and deployment.

Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Training dataset

  • 665K image-level instruction-following data from LLaVA-1.5 stage-2; see details in the original LLaVA repo.

Limitations

While Ada-LLaVA is currently limited to processing one image at a time and applies adaptive operations only in the later half of its layers, future work could explore multi-image input support and extend the adaptive mechanisms throughout the entire model architecture, including the vision encoder. These improvements would make the model more versatile and applicable to a broader range of real-world scenarios.
