|
Metadata-Version: 2.1 |
|
Name: deepseek_vl2 |
|
Version: 1.0.0 |
|
Summary: DeepSeek-VL2 |
|
Author: DeepSeek-AI |
|
License: MIT License |
|
|
|
Copyright (c) 2023 DeepSeek |
|
|
|
Permission is hereby granted, free of charge, to any person obtaining a copy |
|
of this software and associated documentation files (the "Software"), to deal |
|
in the Software without restriction, including without limitation the rights |
|
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell |
|
copies of the Software, and to permit persons to whom the Software is |
|
furnished to do so, subject to the following conditions: |
|
|
|
The above copyright notice and this permission notice shall be included in all |
|
copies or substantial portions of the Software. |
|
|
|
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR |
|
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, |
|
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE |
|
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER |
|
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, |
|
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE |
|
SOFTWARE. |
|
|
|
Project-URL: homepage, https://github.com/deepseek-ai/DeepSeek-VL2 |
|
Requires-Python: >=3.8 |
|
Description-Content-Type: text/markdown |
|
License-File: LICENSE-CODE |
|
License-File: LICENSE-MODEL |
|
Requires-Dist: torch>=2.0.1 |
|
Requires-Dist: transformers>=4.38.2 |
|
Requires-Dist: timm>=0.9.16 |
|
Requires-Dist: accelerate |
|
Requires-Dist: sentencepiece |
|
Requires-Dist: attrdict |
|
Requires-Dist: einops |
|
Provides-Extra: gradio |
|
Requires-Dist: gradio==3.48.0; extra == "gradio" |
|
Requires-Dist: gradio-client==0.6.1; extra == "gradio" |
|
Requires-Dist: mdtex2html==1.3.0; extra == "gradio" |
|
Requires-Dist: pypinyin==0.50.0; extra == "gradio" |
|
Requires-Dist: tiktoken==0.5.2; extra == "gradio" |
|
Requires-Dist: tqdm==4.64.0; extra == "gradio" |
|
Requires-Dist: colorama==0.4.5; extra == "gradio" |
|
Requires-Dist: Pygments==2.12.0; extra == "gradio" |
|
Requires-Dist: markdown==3.4.1; extra == "gradio" |
|
Requires-Dist: SentencePiece==0.1.96; extra == "gradio" |
|
Provides-Extra: lint |
|
Requires-Dist: isort; extra == "lint" |
|
Requires-Dist: black[jupyter]>=22.6.0; extra == "lint" |
|
Requires-Dist: pylint[spelling]>=2.15.0; extra == "lint" |
|
Requires-Dist: flake8; extra == "lint" |
|
Requires-Dist: flake8-bugbear; extra == "lint" |
|
Requires-Dist: flake8-comprehensions; extra == "lint" |
|
Requires-Dist: flake8-docstrings; extra == "lint" |
|
Requires-Dist: flake8-pyi; extra == "lint" |
|
Requires-Dist: flake8-simplify; extra == "lint" |
|
Requires-Dist: ruff; extra == "lint" |
|
Requires-Dist: pyenchant; extra == "lint" |
|
Requires-Dist: pre-commit; extra == "lint" |
|
|
|
<!-- markdownlint-disable first-line-h1 --> |
|
<!-- markdownlint-disable html --> |
|
<!-- markdownlint-disable no-duplicate-header --> |
|
|
|
<div align="center"> |
|
<img src="images/logo.svg" width="60%" alt="DeepSeek LLM" /> |
|
</div> |
|
<hr> |
|
<div align="center"> |
|
|
|
<a href="https://www.deepseek.com/" target="_blank"> |
|
<img alt="Homepage" src="images/badge.svg" /> |
|
</a> |
|
<a href="" target="_blank"> |
|
<img alt="Chat" src="https://img.shields.io/badge/🤖%20Chat-DeepSeek%20VL-536af5?color=536af5&logoColor=white" /> |
|
</a> |
|
<a href="https://huggingface.co/deepseek-ai" target="_blank"> |
|
<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-DeepSeek%20AI-ffc107?color=ffc107&logoColor=white" /> |
|
</a> |
|
|
|
</div> |
|
|
|
|
|
<div align="center"> |
|
|
|
<a href="https://discord.gg/Tc7c45Zzu5" target="_blank"> |
|
<img alt="Discord" src="https://img.shields.io/badge/Discord-DeepSeek%20AI-7289da?logo=discord&logoColor=white&color=7289da" /> |
|
</a> |
|
<a href="images/qr.jpeg" target="_blank"> |
|
<img alt="Wechat" src="https://img.shields.io/badge/WeChat-DeepSeek%20AI-brightgreen?logo=wechat&logoColor=white" /> |
|
</a> |
|
<a href="https://twitter.com/deepseek_ai" target="_blank"> |
|
<img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-deepseek_ai-white?logo=x&logoColor=white" /> |
|
</a> |
|
|
|
</div> |
|
|
|
<div align="center"> |
|
|
|
<a href="LICENSE-CODE"> |
|
<img alt="Code License" src="https://img.shields.io/badge/Code_License-MIT-f5de53?&color=f5de53"> |
|
</a> |
|
<a href="LICENSE-MODEL"> |
|
<img alt="Model License" src="https://img.shields.io/badge/Model_License-Model_Agreement-f5de53?&color=f5de53"> |
|
</a> |
|
</div> |
|
|
|
|
|
<p align="center"> |
|
<a href="https://github.com/deepseek-ai/DeepSeek-VL2/tree/main?tab=readme-ov-file#3-model-download"><b>📥 Model Download</b></a> | |
|
<a href="https://github.com/deepseek-ai/DeepSeek-VL2/tree/main?tab=readme-ov-file#4-quick-start"><b>⚡ Quick Start</b></a> | |
|
<a href="https://github.com/deepseek-ai/DeepSeek-VL2/tree/main?tab=readme-ov-file#5-license"><b>📜 License</b></a> | |
|
<a href="https://github.com/deepseek-ai/DeepSeek-VL2/tree/main?tab=readme-ov-file#6-citation"><b>📖 Citation</b></a> <br> |
|
<a href="./DeepSeek_VL2_paper.pdf"><b>📄 Paper Link</b></a> | |
|
<a href="https://arxiv.org/abs/2412.10302"><b>📄 Arxiv Paper Link</b></a> | |
|
<a href=""><b>👁️ Demo</b></a> |
|
</p> |
|
|
|
|
|
|
|
Introducing DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL. DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series is composed of three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated parameters respectively. |
|
DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models. |
|
|
|
|
|
[DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding]() |
|
|
|
Zhiyu Wu*, Xiaokang Chen*, Zizheng Pan*, Xingchao Liu*, Wen Liu**, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, Chong Ruan*** (* Equal Contribution, ** Project Lead, *** Corresponding author) |
|
|
|
 |
|
|
|
|
|
✅ <b>2024-12-25</b>: Gradio Demo Example, Incremental Prefilling and VLMEvalKit Support. |
|
|
|
✅ <b>2024-12-13</b>: DeepSeek-VL2 family released, including <code>DeepSeek-VL2-tiny</code>, <code>DeepSeek-VL2-small</code>, <code>DeepSeek-VL2</code>. |
|
|
|
|
|
|
|
We release the DeepSeek-VL2 family, including <code>DeepSeek-VL2-tiny</code>, <code>DeepSeek-VL2-small</code>, <code>DeepSeek-VL2</code>. |
|
To support a broader and more diverse range of research within both academic and commercial communities. |
|
Please note that the use of this model is subject to the terms outlined in [License section](#5-license). |
|
|
|
### Huggingface |
|
|
|
| Model | Sequence Length | Download | |
|
|--------------|-----------------|-----------------------------------------------------------------------------| |
|
| DeepSeek-VL2-tiny | 4096 | [🤗 Hugging Face](https://huggingface.co/deepseek-ai/deepseek-vl2-tiny) | |
|
| DeepSeek-VL2-small | 4096 | [🤗 Hugging Face](https://huggingface.co/deepseek-ai/deepseek-vl2-small) | |
|
| DeepSeek-VL2 | 4096 | [🤗 Hugging Face](https://huggingface.co/deepseek-ai/deepseek-vl2) | |
|
|
|
|
|
## 4. Quick Start |
|
|
|
### Installation |
|
|
|
On the basis of `Python >= 3.8` environment, install the necessary dependencies by running the following command: |
|
|
|
```shell |
|
pip install -e . |
|
``` |
|
|
|
### Simple Inference Example with One Image |
|
|
|
**Note: You may need 80GB GPU memory to run this script with deepseek-vl2-small and even larger for deepseek-vl2.** |
|
|
|
```python |
|
import torch |
|
from transformers import AutoModelForCausalLM |
|
|
|
from deepseek_vl2.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM |
|
from deepseek_vl2.utils.io import load_pil_images |
|
|
|
|
|
# specify the path to the model |
|
model_path = "deepseek-ai/deepseek-vl2-tiny" |
|
vl_chat_processor: DeepseekVLV2Processor = DeepseekVLV2Processor.from_pretrained(model_path) |
|
tokenizer = vl_chat_processor.tokenizer |
|
|
|
vl_gpt: DeepseekVLV2ForCausalLM = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True) |
|
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval() |
|
|
|
## single image conversation example |
|
conversation = [ |
|
{ |
|
"role": "<|User|>", |
|
"content": "<image>\n<|ref|>The giraffe at the back.<|/ref|>.", |
|
"images": ["./images/visual_grounding_1.jpeg"], |
|
}, |
|
{"role": "<|Assistant|>", "content": ""}, |
|
] |
|
|
|
|
|
pil_images = load_pil_images(conversation) |
|
prepare_inputs = vl_chat_processor( |
|
conversations=conversation, |
|
images=pil_images, |
|
force_batchify=True, |
|
system_prompt="" |
|
).to(vl_gpt.device) |
|
|
|
|
|
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs) |
|
|
|
|
|
outputs = vl_gpt.language.generate( |
|
inputs_embeds=inputs_embeds, |
|
attention_mask=prepare_inputs.attention_mask, |
|
pad_token_id=tokenizer.eos_token_id, |
|
bos_token_id=tokenizer.bos_token_id, |
|
eos_token_id=tokenizer.eos_token_id, |
|
max_new_tokens=512, |
|
do_sample=False, |
|
use_cache=True |
|
) |
|
|
|
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=False) |
|
print(f"{prepare_inputs['sft_format'][0]}", answer) |
|
``` |
|
|
|
And the output is something like: |
|
``` |
|
<|User|>: <image> |
|
<|ref|>The giraffe at the back.<|/ref|>. |
|
|
|
<|Assistant|>: <|ref|>The giraffe at the back.<|/ref|><|det|>[[580, 270, 999, 900]]<|/det|><|end▁of▁sentence|> |
|
``` |
|
|
|
|
|
|
|
**Note: You may need 80GB GPU memory to run this script with deepseek-vl2-small and even larger for deepseek-vl2.** |
|
|
|
```python |
|
import torch |
|
from transformers import AutoModelForCausalLM |
|
|
|
from deepseek_vl2.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM |
|
from deepseek_vl2.utils.io import load_pil_images |
|
|
|
|
|
|
|
model_path = "deepseek-ai/deepseek-vl2-tiny" |
|
vl_chat_processor: DeepseekVLV2Processor = DeepseekVLV2Processor.from_pretrained(model_path) |
|
tokenizer = vl_chat_processor.tokenizer |
|
|
|
vl_gpt: DeepseekVLV2ForCausalLM = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True) |
|
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval() |
|
|
|
|
|
conversation = [ |
|
{ |
|
"role": "<|User|>", |
|
"content": "This is image_1: <image>\n" |
|
"This is image_2: <image>\n" |
|
"This is image_3: <image>\n Can you tell me what are in the images?", |
|
"images": [ |
|
"images/multi_image_1.jpeg", |
|
"images/multi_image_2.jpeg", |
|
"images/multi_image_3.jpeg", |
|
], |
|
}, |
|
{"role": "<|Assistant|>", "content": ""} |
|
] |
|
|
|
|
|
pil_images = load_pil_images(conversation) |
|
prepare_inputs = vl_chat_processor( |
|
conversations=conversation, |
|
images=pil_images, |
|
force_batchify=True, |
|
system_prompt="" |
|
).to(vl_gpt.device) |
|
|
|
|
|
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs) |
|
|
|
|
|
outputs = vl_gpt.language.generate( |
|
inputs_embeds=inputs_embeds, |
|
attention_mask=prepare_inputs.attention_mask, |
|
pad_token_id=tokenizer.eos_token_id, |
|
bos_token_id=tokenizer.bos_token_id, |
|
eos_token_id=tokenizer.eos_token_id, |
|
max_new_tokens=512, |
|
do_sample=False, |
|
use_cache=True |
|
) |
|
|
|
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=False) |
|
print(f"{prepare_inputs['sft_format'][0]}", answer) |
|
``` |
|
|
|
And the output is something like: |
|
``` |
|
<|User|>: This is image_1: <image> |
|
This is image_2: <image> |
|
This is image_3: <image> |
|
Can you tell me what are in the images? |
|
|
|
<|Assistant|>: The images show three different types of vegetables. Image_1 features carrots, which are orange with green tops. Image_2 displays corn cobs, which are yellow with green husks. Image_3 contains raw pork ribs, which are pinkish-red with some marbling.<|end▁of▁sentence|> |
|
``` |
|
|
|
|
|
|
|
**Note: We use incremental prefilling to inference within 40GB GPU using deepseek-vl2-small.** |
|
|
|
```python |
|
import torch |
|
from transformers import AutoModelForCausalLM |
|
|
|
from deepseek_vl2.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM |
|
from deepseek_vl2.utils.io import load_pil_images |
|
|
|
|
|
|
|
model_path = "deepseek-ai/deepseek-vl2-small" |
|
vl_chat_processor: DeepseekVLV2Processor = DeepseekVLV2Processor.from_pretrained(model_path) |
|
tokenizer = vl_chat_processor.tokenizer |
|
|
|
vl_gpt: DeepseekVLV2ForCausalLM = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True) |
|
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval() |
|
|
|
|
|
conversation = [ |
|
{ |
|
"role": "<|User|>", |
|
"content": "This is image_1: <image>\n" |
|
"This is image_2: <image>\n" |
|
"This is image_3: <image>\n Can you tell me what are in the images?", |
|
"images": [ |
|
"images/multi_image_1.jpeg", |
|
"images/multi_image_2.jpeg", |
|
"images/multi_image_3.jpeg", |
|
], |
|
}, |
|
{"role": "<|Assistant|>", "content": ""} |
|
] |
|
|
|
|
|
pil_images = load_pil_images(conversation) |
|
prepare_inputs = vl_chat_processor( |
|
conversations=conversation, |
|
images=pil_images, |
|
force_batchify=True, |
|
system_prompt="" |
|
).to(vl_gpt.device) |
|
|
|
with torch.no_grad(): |
|
|
|
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs) |
|
|
|
|
|
inputs_embeds, past_key_values = vl_gpt.incremental_prefilling( |
|
input_ids=prepare_inputs.input_ids, |
|
images=prepare_inputs.images, |
|
images_seq_mask=prepare_inputs.images_seq_mask, |
|
images_spatial_crop=prepare_inputs.images_spatial_crop, |
|
attention_mask=prepare_inputs.attention_mask, |
|
chunk_size=512 # prefilling size |
|
) |
|
|
|
|
|
outputs = vl_gpt.generate( |
|
inputs_embeds=inputs_embeds, |
|
input_ids=prepare_inputs.input_ids, |
|
images=prepare_inputs.images, |
|
images_seq_mask=prepare_inputs.images_seq_mask, |
|
images_spatial_crop=prepare_inputs.images_spatial_crop, |
|
attention_mask=prepare_inputs.attention_mask, |
|
past_key_values=past_key_values, |
|
|
|
pad_token_id=tokenizer.eos_token_id, |
|
bos_token_id=tokenizer.bos_token_id, |
|
eos_token_id=tokenizer.eos_token_id, |
|
max_new_tokens=512, |
|
|
|
do_sample=False, |
|
use_cache=True, |
|
) |
|
|
|
answer = tokenizer.decode(outputs[0][len(prepare_inputs.input_ids[0]):].cpu().tolist(), skip_special_tokens=False) |
|
|
|
print(f"{prepare_inputs['sft_format'][0]}", answer) |
|
``` |
|
|
|
And the output is something like: |
|
``` |
|
<|User|>: This is image_1: <image> |
|
This is image_2: <image> |
|
This is image_3: <image> |
|
Can you tell me what are in the images? |
|
|
|
<|Assistant|>: The first image contains carrots. The second image contains corn. The third image contains meat.<|end▁of▁sentence|> |
|
``` |
|
|
|
|
|
```shell |
|
|
|
CUDA_VISIBLE_DEVICES=0 python inference.py --model_patn "deepseek-ai/deepseek-vl2" |
|
|
|
|
|
CUDA_VISIBLE_DEVICES=0 python inference.py --model_patn "deepseek-ai/deepseek-vl2-small" --chunck_size 512 |
|
|
|
``` |
|
|
|
|
|
|
|
|
|
* Install the necessary dependencies: |
|
```shell |
|
pip install -e .[gradio] |
|
``` |
|
|
|
* then run the following command: |
|
|
|
```shell |
|
|
|
CUDA_VISIBLE_DEVICES=2 python web_demo.py \ |
|
--model_name "deepseek-ai/deepseek-vl2-tiny" \ |
|
--port 37914 |
|
|
|
|
|
|
|
|
|
|
|
CUDA_VISIBLE_DEVICES=2 python web_demo.py \ |
|
--model_name "deepseek-ai/deepseek-vl2-small" \ |
|
--port 37914 \ |
|
--chunk_size 512 |
|
|
|
|
|
CUDA_VISIBLE_DEVICES=2 python web_demo.py \ |
|
--model_name "deepseek-ai/deepseek-vl2" \ |
|
--port 37914 |
|
``` |
|
|
|
* **Important**: This is a basic and native demo implementation without any deployment optimizations, which may result in slower performance. For production environments, consider using optimized deployment solutions, such as vllm, sglang, lmdeploy, etc. These optimizations will help achieve faster response times and better cost efficiency. |
|
|
|
|
|
|
|
This code repository is licensed under [MIT License](./LICENSE-CODE). The use of DeepSeek-VL2 models is subject to [DeepSeek Model License](./LICENSE-MODEL). DeepSeek-VL2 series supports commercial use. |
|
|
|
## 6. Citation |
|
|
|
``` |
|
@misc{wu2024deepseekvl2mixtureofexpertsvisionlanguagemodels, |
|
title={DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding}, |
|
author={Zhiyu Wu and Xiaokang Chen and Zizheng Pan and Xingchao Liu and Wen Liu and Damai Dai and Huazuo Gao and Yiyang Ma and Chengyue Wu and Bingxuan Wang and Zhenda Xie and Yu Wu and Kai Hu and Jiawei Wang and Yaofeng Sun and Yukun Li and Yishi Piao and Kang Guan and Aixin Liu and Xin Xie and Yuxiang You and Kai Dong and Xingkai Yu and Haowei Zhang and Liang Zhao and Yisong Wang and Chong Ruan}, |
|
year={2024}, |
|
eprint={2412.10302}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CV}, |
|
url={https://arxiv.org/abs/2412.10302}, |
|
} |
|
``` |
|
|
|
## 7. Contact |
|
|
|
If you have any questions, please raise an issue or contact us at [[email protected]](mailto:[email protected]). |
|
|