# InternVL Family: Closing the Gap to Commercial Multimodal Models with Open-Source Suites -- A Pioneering Open-Source Alternative to GPT-4o
[\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🤔 FAQs\]](https://internvl.readthedocs.io/en/latest/tutorials/faqs.html) [\[🚀 InternVL2 Blog\]](https://internvl.github.io/blog/2024-07-02-InternVL-2.0/) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[📖 Document\]](https://internvl.readthedocs.io/en/latest/) [\[🌐 API\]](https://internvl.readthedocs.io/en/latest/get_started/internvl_chat_api.html) [\[🚀 Quick Start\]](#quick-start-with-huggingface)
[\[📜 InternVL 1.0 Paper\]](https://arxiv.org/abs/2312.14238) [\[📜 InternVL 1.5 Report\]](https://arxiv.org/abs/2404.16821) [\[📜 1.0 Interpretation (Chinese)\]](https://zhuanlan.zhihu.com/p/702946079) [\[📜 1.5 Interpretation (Chinese)\]](https://zhuanlan.zhihu.com/p/699439759) [\[📜 2.0 Interpretation (Chinese)\]](https://zhuanlan.zhihu.com/p/706547971)
[Switch to the Chinese version](/README_zh.md)


## News 🚀🚀🚀
- `2024/08/01`: The [Chartmimic](https://chartmimic.github.io/) team evaluated the InternVL2 series models on their benchmark. The InternVL2-26B and 76B models achieved the top two performances among open-source models, with the InternVL2 76B model surpassing GeminiProVision and exhibiting comparable results to Claude-3-opus.
- `2024/08/01`: InternVL2-Pro achieved the SOTA performance among open-source models on the [CharXiv](https://charxiv.github.io/#leaderboard) dataset, surpassing some well-known closed-source models such as GPT-4V, Gemini 1.5 Flash, and Claude 3 Sonnet.
- `2024/07/24`: The [MLVU](https://github.com/JUNJIE99/MLVU) team evaluated InternVL-1.5 on their benchmark. The average performance on the multiple-choice task was 50.4%, while the performance on the generative tasks was 4.02. The performance on the multiple-choice task ranked #1 among all open-source MLLMs.
- `2024/07/18`: 🔥🔥 InternVL2-40B achieved SOTA performance among open-source models on the [Video-MME](https://github.com/BradyFU/Video-MME) dataset, scoring 61.2 with 16 input frames and 64.4 with 32 input frames. It significantly outperforms other open-source models and is the open-source model closest to GPT-4o mini.
- `2024/07/18`: 🔥 InternVL2-Pro achieved the SOTA performance on the [DocVQA](https://rrc.cvc.uab.es/?ch=17&com=evaluation&task=1) and [InfoVQA](https://rrc.cvc.uab.es/?ch=17&com=evaluation&task=3) benchmarks.
- `2024/07/04`: 🎉 We release the [InternVL2 series](https://huggingface.co/collections/OpenGVLab/internvl-20-667d3961ab5eb12c7ed1463e). InternVL2-Pro achieved 62.0% accuracy on the MMMU benchmark, matching the performance of leading closed-source commercial models such as GPT-4o. A free API for this model can be requested by filling out the [application form](https://docs.google.com/forms/d/e/1FAIpQLSfMCzhPr1OOEKau_6jwTU0EiZMSFckDo-HMlc_hUudhF_97rw/viewform?usp=sf_link) (English) / [application form](https://wj.qq.com/s2/14910502/25a4/) (Chinese). The other models are available at [this HF collection](https://huggingface.co/collections/OpenGVLab/internvl-20-667d3961ab5eb12c7ed1463e).
- `2024/06/19`: We propose Needle In A Multimodal Haystack ([MM-NIAH](https://github.com/OpenGVLab/MM-NIAH)), the first benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.
- `2024/05/30`: We release [ShareGPT-4o](https://sharegpt4o.github.io/), a large-scale dataset that we plan to open-source, containing 200K images, 10K videos, and 10K audio clips, all with detailed descriptions.
- `2024/05/29`: We release the Mini-InternVL series, which includes two chat models: [Mini-InternVL-Chat-2B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-2B-V1-5) and [Mini-InternVL-Chat-4B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-4B-V1-5). These models achieve impressive performance with minimal size: the 2B model delivers 80% of the performance with only 8% of the model size, and the 4B model achieves 90% of the performance with just 16% of the model size. For more details, please check our [blog](https://internvl.github.io/blog/2024-05-25-Mini-InternVL-1.5/).
- `2024/05/28`: Thanks to the [lmdeploy](https://github.com/InternLM/lmdeploy) team for providing AWQ quantization support. The 4-bit model is available at [OpenGVLab/InternVL-Chat-V1-5-AWQ](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5-AWQ).
- `2024/05/13`: InternVL 1.0 can now be used as the [text encoder](https://huggingface.co/OpenGVLab/InternVL-14B-224px) for diffusion models to support multilingual generation natively in over 110 languages worldwide. See [MuLan](https://github.com/mulanai/MuLan) for more details.
- `2024/04/18`: InternVL-Chat-V1-5 has been released at [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5), approaching the performance of GPT-4V and Gemini Pro on various benchmarks such as MMMU, DocVQA, ChartQA, and MathVista (a minimal loading sketch is shown after this list).
- `2024/02/27`: InternVL has been accepted by CVPR 2024 as an Oral presentation! 🎉
- `2024/02/24`: InternVL-Chat models have been included in the [VLMEvalKit](https://github.com/open-compass/VLMEvalKit).
- `2024/02/21`: [InternVL-Chat-V1-2-Plus](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus) achieved SOTA performance on MathVista (59.9), MMBench (83.8), and MMVP (58.7). See our [blog](https://internvl.github.io/blog/2024-02-21-InternVL-1.2/) for more details.
- `2024/02/12`: InternVL-Chat-V1-2 has been released. It achieves 51.6 on MMMU val and 82.3 on MMBench test. For more details, please refer to our [blog](https://internvl.github.io/blog/2024-02-21-InternVL-1.2/) and [SFT data](./internvl_chat#prepare-training-datasets). The model is now available on [HuggingFace](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2), and both training / evaluation data and scripts are open-sourced.
- `2024/01/24`: InternVL-Chat-V1-1 has been released; it supports Chinese and has stronger OCR capability. See [here](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1).
- `2024/01/16`: We release our [customized mmcv/mmsegmentation/mmdetection code](https://github.com/OpenGVLab/InternVL-MMDetSeg), integrated with DeepSpeed, which can be used for training large-scale detection and segmentation models.
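For readers who want to try the chat models announced above, here is a minimal loading sketch based on the usage documented on the HuggingFace model cards. The `model.chat()` helper comes from the checkpoint's `trust_remote_code` implementation; the single-tile 448x448 preprocessing below is a simplification of the dynamic-tiling `load_image` helper shown in the model cards and is meant as an illustration, not the reference pipeline.

```python
import torch
from PIL import Image
from torchvision import transforms
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL-Chat-V1-5"  # other InternVL chat checkpoints follow the same pattern

# Load the model and tokenizer; trust_remote_code pulls in the InternVL-specific chat logic.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Simplified single-tile preprocessing (ImageNet normalization at 448x448).
# The model cards use a dynamic-tiling `load_image` helper for high-resolution input.
preprocess = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
pixel_values = preprocess(Image.open("example.jpg").convert("RGB"))  # placeholder image path
pixel_values = pixel_values.unsqueeze(0).to(torch.bfloat16).cuda()

# Single-round conversation; `<image>` marks where the image is injected into the prompt.
question = "<image>\nPlease describe the image in detail."
generation_config = dict(max_new_tokens=512, do_sample=False)
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```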
## TODO List
- [ ] Support vLLM and Ollama
- [x] Rebuild documents using readthedocs
- [x] Support fine-tuning different LLMs with LoRA
- [ ] Support video and PDF input in online demo
- [ ] Release InternVL2 with VisionLLMv2 integration
- [x] Release `requirements.txt` for InternVL2
- [x] Release training / evaluation code for InternVL2 series
- [x] Release Streamlit web UI for InternVL1.5 and InternVL2
## Documents
- Get Started
  - Installation: [\[Environment\]](https://internvl.readthedocs.io/en/latest/get_started/installation.html) [\[requirements.txt\]](./requirements.txt)
  - Evaluation Data Preparation: [\[InternVL Evaluation\]](https://internvl.readthedocs.io/en/latest/get_started/eval_data_preparation.html)
  - Chat Data Format: [\[Meta File\]](https://internvl.readthedocs.io/en/latest/get_started/chat_data_format.html#meta-file) [\[Pure Text\]](https://internvl.readthedocs.io/en/latest/get_started/chat_data_format.html#pure-text-data) [\[Single-Image\]](https://internvl.readthedocs.io/en/latest/get_started/chat_data_format.html#single-image-data) [\[Multi-Image\]](https://internvl.readthedocs.io/en/latest/get_started/chat_data_format.html#multi-image-data) [\[Video\]](https://internvl.readthedocs.io/en/latest/get_started/chat_data_format.html#video-data)
  - InternVL-Chat API: [\[InternVL2-Pro\]](https://internvl.readthedocs.io/en/latest/get_started/internvl_chat_api.html#official-api-of-internvl2-pro)
  - Local Chat Demo: [\[Streamlit Demo\]](https://internvl.readthedocs.io/en/latest/get_started/local_chat_demo.html#streamlit-demo) [\[Gradio Demo\]](https://internvl.readthedocs.io/en/latest/get_started/local_chat_demo.html#gradio-demo) [\[LMDeploy Demo\]](https://internvl.readthedocs.io/en/latest/get_started/local_chat_demo.html#lmdeploy-demo) (see the LMDeploy pipeline sketch after this list)
  - Tutorials: [\[Enhancing InternVL2 on COCO Caption Using LoRA Fine-Tuning\]](https://internvl.readthedocs.io/en/latest/tutorials/coco_caption_finetune.html)
- InternVL Family
  - InternVL 2.0: [\[Introduction\]](https://internvl.readthedocs.io/en/latest/internvl2.0/introduction.html) [\[Quick Start\]](https://internvl.readthedocs.io/en/latest/internvl2.0/quick_start.html) [\[Finetune\]](https://internvl.readthedocs.io/en/latest/internvl2.0/finetune.html) [\[Evaluation\]](https://internvl.readthedocs.io/en/latest/internvl2.0/evaluation.html) [\[Deployment\]](https://internvl.readthedocs.io/en/latest/internvl2.0/deployment.html)
  - InternVL 1.5: [\[Introduction\]](https://internvl.readthedocs.io/en/latest/internvl1.5/introduction.html) [\[Quick Start\]](https://internvl.readthedocs.io/en/latest/internvl1.5/quick_start.html) [\[Finetune\]](https://internvl.readthedocs.io/en/latest/internvl1.5/finetune.html) [\[Evaluation\]](https://internvl.readthedocs.io/en/latest/internvl1.5/evaluation.html) [\[Deployment\]](https://internvl.readthedocs.io/en/latest/internvl1.5/deployment.html)
  - InternVL 1.2: [\[Introduction\]](https://internvl.readthedocs.io/en/latest/internvl1.2/introduction.html) [\[Quick Start\]](https://internvl.readthedocs.io/en/latest/internvl1.2/quick_start.html) [\[Finetune\]](https://internvl.readthedocs.io/en/latest/internvl1.2/finetune.html) [\[Evaluation\]](https://internvl.readthedocs.io/en/latest/internvl1.2/evaluation.html)
  - InternVL 1.1: [\[Introduction\]](https://internvl.readthedocs.io/en/latest/internvl1.1/introduction.html) [\[Quick Start\]](https://internvl.readthedocs.io/en/latest/internvl1.1/quick_start.html) [\[Evaluation\]](https://internvl.readthedocs.io/en/latest/internvl1.1/evaluation.html)
  - InternVL 1.0: [\[Classification\]](https://internvl.readthedocs.io/en/latest/internvl1.0/classification.html) [\[CLIP-Benchmark\]](https://internvl.readthedocs.io/en/latest/internvl1.0/clip_benchmark.html) [\[Segmentation\]](https://internvl.readthedocs.io/en/latest/internvl1.0/segmentation.html) [\[InternVL-Chat-LLaVA\]](https://internvl.readthedocs.io/en/latest/internvl1.0/internvl_chat_llava.html) [\[InternVL-G\]](https://internvl.readthedocs.io/en/latest/internvl1.0/internvl_g.html)
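For the LMDeploy demo and the AWQ checkpoint mentioned above, the following is a minimal pipeline sketch. It assumes a recent `lmdeploy` release with vision-language support; `model_format="awq"` is only needed for the quantized checkpoint, and exact arguments may vary across versions, so treat the LMDeploy documentation as authoritative.

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# The 4-bit AWQ checkpoint released with help from the lmdeploy team.
model_id = "OpenGVLab/InternVL-Chat-V1-5-AWQ"

# model_format="awq" tells the TurboMind backend to load AWQ-quantized weights.
pipe = pipeline(model_id, backend_config=TurbomindEngineConfig(model_format="awq"))

# Any local path or URL works here; this file name is just a placeholder.
image = load_image("example.jpg")
response = pipe(("Describe this image.", image))
print(response.text)
```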
## Compared with SOTA VLLMs

## Model Zoo
#### Multimodal Large Language Model (InternVL 1.0-1.5)
| Model                      | Date       | HF Link | MS Link | Note                                                                                                                                                 |
| -------------------------- | ---------- | ------- | ------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- |
| Mini-InternVL-Chat-4B-V1-5 | 2024.05.28 | 🤗 link | 🤖 link | 16% of the model size, 90% of the performance                                                                                                          |
| Mini-InternVL-Chat-2B-V1-5 | 2024.05.19 | 🤗 link | 🤖 link | 8% of the model size, 80% of the performance                                                                                                           |
| InternVL-Chat-V1-5         | 2024.04.18 | 🤗 link | 🤖 link | Supports 4K images; very strong OCR; approaches the performance of GPT-4V and Gemini Pro on benchmarks such as MMMU, DocVQA, ChartQA, and MathVista |
| InternVL-Chat-V1-2-Plus    | 2024.02.21 | 🤗 link | 🤖 link | More SFT data and stronger performance                                                                                                                 |
| InternVL-Chat-V1-2         | 2024.02.11 | 🤗 link | 🤖 link | Scaling up the LLM to 34B                                                                                                                              |
| InternVL-Chat-V1-1         | 2024.01.24 | 🤗 link | 🤖 link | Supports Chinese and stronger OCR                                                                                                                      |
| InternVL-Chat-19B          | 2023.12.25 | 🤗 link | 🤖 link | English multimodal dialogue                                                                                                                            |
| InternVL-Chat-13B          | 2023.12.25 | 🤗 link | 🤖 link | English multimodal dialogue                                                                                                                            |
#### Vision Foundation Model (InternVL 1.0-1.5)