Transformers documentation
DeepseekVL
This model was released on 2024-03-08 and added to Hugging Face Transformers on 2025-07-25.
DeepseekVL
Deepseek-VL was introduced by the DeepSeek AI team. It is a vision-language model (VLM) designed to process both text and images for generating contextually relevant responses. The model leverages LLaMA as its text encoder, while SigLip is used for encoding images.
You can find all the original Deepseek-VL checkpoints under the DeepSeek-community organization.
Click on the Deepseek-VL models in the right sidebar for more examples of how to apply Deepseek-VL to different vision and language tasks.
The example below demonstrates how to generate text based on an image with Pipeline or the AutoModel class.
import torch
from transformers import pipeline
pipe = pipeline(
    task="image-text-to-text",
    model="deepseek-community/deepseek-vl-1.3b-chat",
    device=0,
    dtype=torch.float16
)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
            },
            { "type": "text", "text": "Describe this image."},
        ]
    }
]
pipe(text=messages, max_new_tokens=20, return_full_text=False)Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for more available quantization backends.
The example below uses torchao to only quantize the weights to int4.
import torch
from transformers import TorchAoConfig, DeepseekVLForConditionalGeneration, AutoProcessor
quantization_config = TorchAoConfig(
    "int4_weight_only",
    group_size=128
)
model = DeepseekVLForConditionalGeneration.from_pretrained(
    "deepseek-community/deepseek-vl-1.3b-chat",
    dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config
)Notes
- Do inference with multiple images in a single conversation. - import torch from transformers import DeepseekVLForConditionalGeneration, AutoProcessor model = DeepseekVLForConditionalGeneration.from_pretrained( "deepseek-community/deepseek-vl-1.3b-chat", dtype=torch.float16, device_map="auto", attn_implementation="sdpa" ) processor = AutoProcessor.from_pretrained("deepseek-community/deepseek-vl-1.3b-chat") messages = [ [ { "role": "user", "content": [ {"type": "text", "text": "What’s the difference between"}, {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"}, {"type": "text", "text": " and "}, {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"} ] } ], [ { "role": "user", "content": [ {"type": "image", "url": "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"}, {"type": "text", "text": "What do you see in this image?"} ] } ] ] inputs = processor.apply_chat_template( messages, add_generation_prompt=True, padding=True, truncation=True, tokenize=True, return_dict=True, return_tensors="pt" ).to(model.device, dtype=model.dtype) generated_ids = model.generate(**inputs, max_new_tokens=128) generated_ids_trimmed = [ out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) ] output_text = processor.batch_decode( generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False ) print(output_text)
DeepseekVLConfig
class transformers.DeepseekVLConfig
< source >( text_config: typing.Optional[transformers.models.auto.configuration_auto.AutoConfig] = None vision_config: typing.Optional[transformers.models.auto.configuration_auto.AutoConfig] = None image_token_id: int = 100015 **kwargs )
Parameters
-  text_config (Union[AutoConfig, dict], optional, defaults toLlamaConfig) — The config object or dictionary of the text backbone.
-  vision_config (Union[AutoConfig, dict], optional, defaults toSiglipVisionConfig) — The config object or dictionary of the vision backbone.
-  image_token_id (int, optional, defaults to 100015) — The index representing image tokens in the model’s token vocabulary.
This is the configuration class to store the configuration of a DeepseekVLModel. It is used to instantiate a DeepseekVL model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the DeepseekVL deepseek-community/deepseek-vl-1.3b-chat architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import DeepseekVLConfig, DeepseekVLModel
>>> # Initializing a DeepseekVL deepseek-community/deepseek-vl-1.3b-chat style configuration
>>> configuration = DeepseekVLConfig()
>>> # Initializing a model (with random weights) from the deepseek-community/deepseek-vl-1.3b-chat style configuration
>>> model = DeepseekVLModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.configDeepseekVLProcessor
class transformers.DeepseekVLProcessor
< source >( image_processor tokenizer chat_template = None num_image_tokens = 576 )
Parameters
- image_processor (DeepseekVLImageProcessor) — The image processor is a required input.
- tokenizer (LlamaTokenizerFast) — The tokenizer is a required input.
-  chat_template (str, optional) — A Jinja template which will be used to convert lists of messages in a chat into a tokenizable string.
-  num_image_tokens (int, optional, defaults to 576) — The number of special image tokens used as placeholders for visual content in text sequences.
Constructs a DeepseekVL processor which wraps a DeepseekVL Image Processor and a Llama tokenizer into a single processor.
DeepseekVLProcessor offers all the functionalities of DeepseekVLImageProcessor and LlamaTokenizerFast. See the
__call__() and decode() for more information.
This method forwards all its arguments to LlamaTokenizerFast’s batch_decode(). Please refer to the docstring of this method for more information.
This method forwards all its arguments to LlamaTokenizerFast’s decode(). Please refer to the docstring of this method for more information.
DeepseekVLImageProcessor
class transformers.DeepseekVLImageProcessor
< source >( do_resize: bool = True size: typing.Optional[dict[str, int]] = None min_size: int = 14 resample: Resampling = <Resampling.BICUBIC: 3> do_rescale: bool = True rescale_factor: typing.Union[int, float] = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, list[float], NoneType] = None image_std: typing.Union[float, list[float], NoneType] = None do_convert_rgb: typing.Optional[bool] = None do_pad: typing.Optional[bool] = True **kwargs )
Parameters
-  do_resize (bool, optional, defaults toTrue) — Whether to resize the image’s (height, width) dimensions to the specifiedsize. Can be overridden by thedo_resizeparameter in thepreprocessmethod.
-  size (dict, optional, defaults to{"height" -- 384, "width": 384}): Size of the output image after resizing. Can be overridden by thesizeparameter in thepreprocessmethod.
-  min_size (int, optional, defaults to 14) — The minimum allowed size for the resized image. Ensures that neither the height nor width falls below this value after resizing.
-  resample (PILImageResampling, optional, defaults toResampling.BICUBIC) — Resampling filter to use if resizing the image. Only has an effect ifdo_resizeis set toTrue. Can be overridden by theresampleparameter in thepreprocessmethod.
-  do_rescale (bool, optional, defaults toTrue) — Whether to rescale the image by the specified scalerescale_factor. Can be overridden by thedo_rescaleparameter in thepreprocessmethod.
-  rescale_factor (intorfloat, optional, defaults to1/255) — Scale factor to use if rescaling the image. Only has an effect ifdo_rescaleis set toTrue. Can be overridden by therescale_factorparameter in thepreprocessmethod.
-  do_normalize (bool, optional, defaults toTrue) — Whether to normalize the image. Can be overridden by thedo_normalizeparameter in thepreprocessmethod. Can be overridden by thedo_normalizeparameter in thepreprocessmethod.
-  image_mean (floatorlist[float], optional, defaults toIMAGENET_STANDARD_MEAN) — Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by theimage_meanparameter in thepreprocessmethod. Can be overridden by theimage_meanparameter in thepreprocessmethod.
-  image_std (floatorlist[float], optional, defaults toIMAGENET_STANDARD_STD) — Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by theimage_stdparameter in thepreprocessmethod. Can be overridden by theimage_stdparameter in thepreprocessmethod.
-  do_convert_rgb (bool, optional, defaults toTrue) — Whether to convert the image to RGB.
-  do_pad (bool, optional, defaults toTrue) — Whether to pad the image to square or not.
Constructs a DEEPSEEK_VL image processor.
pad_to_square
< source >( image: ndarray background_color: typing.Union[int, tuple[int, int, int]] = 0 data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None  ) → np.ndarray
Parameters
-  image (np.ndarray) — The image to pad.
-  background_color (intortuple[int, int, int], optional, defaults to 0) — The color to use for the padding. Can be an integer for single channel or a tuple of integers representing for multi-channel images. If passed as integer in multi-channel mode, it will default to0in subsequent channels.
-  data_format (strorChannelDimension, optional) — The channel dimension format for the output image. Can be one of:- "channels_first"or- ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last"or- ChannelDimension.LAST: image in (height, width, num_channels) format. If unset, will use same as the input image.
 
-  input_data_format (strorChannelDimension, optional) — The channel dimension format for the input image. Can be one of:- "channels_first"or- ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last"or- ChannelDimension.LAST: image in (height, width, num_channels) format.
 
Returns
np.ndarray
The padded image.
Pads an image to a square based on the longest edge.
preprocess
< source >( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] do_resize: typing.Optional[bool] = None size: typing.Optional[dict[str, int]] = None resample: typing.Optional[PIL.Image.Resampling] = None do_rescale: typing.Optional[bool] = None rescale_factor: typing.Optional[float] = None do_normalize: typing.Optional[bool] = None image_mean: typing.Union[float, list[float], NoneType] = None image_std: typing.Union[float, list[float], NoneType] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None do_convert_rgb: typing.Optional[bool] = None background_color: typing.Union[int, tuple[int, int, int], NoneType] = None do_pad: typing.Optional[bool] = None data_format: ChannelDimension = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None )
Parameters
-  images (ImageInput) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, setdo_rescale=False.
-  do_resize (bool, optional, defaults toself.do_resize) — Whether to resize the image.
-  size (dict[str, int], optional, defaults toself.size) — Controls the size of the image afterresize. The shortest edge of the image is resized tosize["shortest_edge"]whilst preserving the aspect ratio. If the longest edge of this resized image is >int(size["shortest_edge"] * (1333 / 800)), then the image is resized again to make the longest edge equal toint(size["shortest_edge"] * (1333 / 800)).
-  resample (PILImageResampling, optional, defaults toself.resample) — Resampling filter to use if resizing the image. Only has an effect ifdo_resizeis set toTrue.
-  do_rescale (bool, optional, defaults toself.do_rescale) — Whether to rescale the image values between [0 - 1].
-  rescale_factor (float, optional, defaults toself.rescale_factor) — Rescale factor to rescale the image by ifdo_rescaleis set toTrue.
-  do_normalize (bool, optional, defaults toself.do_normalize) — Whether to normalize the image.
-  image_mean (floatorlist[float], optional, defaults toself.image_mean) — Image mean to normalize the image by ifdo_normalizeis set toTrue.
-  image_std (floatorlist[float], optional, defaults toself.image_std) — Image standard deviation to normalize the image by ifdo_normalizeis set toTrue.
-  do_convert_rgb (bool, optional, defaults toself.do_convert_rgb) — Whether to convert the image to RGB.
-  background_color (tuple[int, int, int]) — The background color to use for the padding.
-  do_pad (bool, optional, defaults toself.do_pad) — Whether to pad the image to square or not.
-  return_tensors (strorTensorType, optional) — The type of tensors to return. Can be one of:- Unset: Return a list of np.ndarray.
- TensorType.TENSORFLOWor- 'tf': Return a batch of type- tf.Tensor.
- TensorType.PYTORCHor- 'pt': Return a batch of type- torch.Tensor.
- TensorType.NUMPYor- 'np': Return a batch of type- np.ndarray.
- TensorType.JAXor- 'jax': Return a batch of type- jax.numpy.ndarray.
 
- Unset: Return a list of 
-  data_format (ChannelDimensionorstr, optional, defaults toChannelDimension.FIRST) — The channel dimension format for the output image. Can be one of:- "channels_first"or- ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last"or- ChannelDimension.LAST: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format of the input image.
 
-  input_data_format (ChannelDimensionorstr, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:- "channels_first"or- ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last"or- ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none"or- ChannelDimension.NONE: image in (height, width) format.
 
Preprocess an image or batch of images.
resize
< source >( image: ndarray size: typing.Union[dict[str, int], int] resample: Resampling = <Resampling.BICUBIC: 3> data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None **kwargs  ) → np.ndarray
Parameters
-  image (np.ndarray) — Image to resize.
-  size (dict[str, int]orint) — The size to resize the image to. If a dictionary, it should have the keys"height"and"width".
-  resample (PILImageResampling, optional, defaults toPILImageResampling.BICUBIC) —PILImageResamplingfilter to use when resizing the image e.g.PILImageResampling.BICUBIC.
-  data_format (ChannelDimensionorstr, optional) — The channel dimension format for the output image. If unset, the channel dimension format of the input image is used. Can be one of:- "channels_first"or- ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last"or- ChannelDimension.LAST: image in (height, width, num_channels) format.
- None: will be inferred from input
 
-  input_data_format (ChannelDimensionorstr, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:- "channels_first"or- ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last"or- ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none"or- ChannelDimension.NONE: image in (height, width) format.
 
Returns
np.ndarray
The resized image.
Resize an image to dynamically calculated size.
DeepseekVLImageProcessorFast
class transformers.DeepseekVLImageProcessorFast
< source >( **kwargs: typing_extensions.Unpack[transformers.models.deepseek_vl.image_processing_deepseek_vl_fast.DeepseekVLFastImageProcessorKwargs] )
Constructs a fast Deepseek Vl image processor.
pad_to_square
< source >( images: torch.Tensor background_color: typing.Union[int, tuple[int, int, int]] = 0  ) → torch.Tensor
Parameters
-  images (torch.Tensor) — The images to pad.
-  background_color (intortuple[int, int, int], optional, defaults to 0) — The color to use for the padding. Can be an integer for single channel or a tuple of integers representing for multi-channel images. If passed as integer in multi-channel mode, it will default to0in subsequent channels.
Returns
torch.Tensor
The padded images.
Pads an image to a square based on the longest edge.
DeepseekVLModel
class transformers.DeepseekVLModel
< source >( config )
Parameters
- config (DeepseekVLModel) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare Deepseek Vl Model outputting raw hidden-states without any specific head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( input_ids: typing.Optional[torch.LongTensor] = None pixel_values: typing.Optional[torch.FloatTensor] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[transformers.cache_utils.Cache] = None cache_position: typing.Optional[torch.LongTensor] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None use_cache: typing.Optional[bool] = None logits_to_keep: typing.Union[int, torch.Tensor] = 0 **kwargs )
Parameters
-  input_ids (torch.LongTensorof shape(batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. 
-  pixel_values (torch.FloatTensorof shape(batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using DeepseekVLImageProcessor. See DeepseekVLImageProcessor.call() for details (DeepseekVLProcessor uses DeepseekVLImageProcessor for processing images).
-  attention_mask (torch.Tensorof shape(batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
 
-  position_ids (torch.LongTensorof shape(batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.n_positions - 1].
-  past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in thepast_key_valuesreturned by the model at a previous stage of decoding, whenuse_cache=Trueorconfig.use_cache=True.Only Cache instance is allowed as input, see our kv cache guide. If no past_key_valuesare passed, DynamicCache will be initialized by default.The model will output the same cache format that is fed as input. If past_key_valuesare used, the user is expected to input only unprocessedinput_ids(those that don’t have their past key value states given to this model) of shape(batch_size, unprocessed_length)instead of allinput_idsof shape(batch_size, sequence_length).
-  cache_position (torch.LongTensorof shape(sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily toposition_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
-  inputs_embeds (torch.FloatTensorof shape(batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passinginput_idsyou can choose to directly pass an embedded representation. This is useful if you want more control over how to convertinput_idsindices into associated vectors than the model’s internal embedding lookup matrix.
-  use_cache (bool, optional) — If set toTrue,past_key_valueskey value states are returned and can be used to speed up decoding (seepast_key_values).
-  logits_to_keep (Union[int, torch.Tensor], defaults to0) — If anint, compute logits for the lastlogits_to_keeptokens. If0, calculate logits for allinput_ids(special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes pretty significant for long sequences or large vocabulary size. If atorch.Tensor, must be 1D corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).
The DeepseekVLModel forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the
Moduleinstance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
DeepseekVLForConditionalGeneration
forward
< source >( input_ids: typing.Optional[torch.LongTensor] = None pixel_values: typing.Optional[torch.FloatTensor] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[transformers.cache_utils.Cache] = None cache_position: typing.Optional[torch.LongTensor] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None logits_to_keep: typing.Union[int, torch.Tensor] = 0 **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] )
Parameters
-  input_ids (torch.LongTensorof shape(batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. 
-  pixel_values (torch.FloatTensorof shape(batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using DeepseekVLImageProcessor. See DeepseekVLImageProcessor.call() for details (DeepseekVLProcessor uses DeepseekVLImageProcessor for processing images).
-  attention_mask (torch.Tensorof shape(batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
 
-  position_ids (torch.LongTensorof shape(batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.n_positions - 1].
-  past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in thepast_key_valuesreturned by the model at a previous stage of decoding, whenuse_cache=Trueorconfig.use_cache=True.Only Cache instance is allowed as input, see our kv cache guide. If no past_key_valuesare passed, DynamicCache will be initialized by default.The model will output the same cache format that is fed as input. If past_key_valuesare used, the user is expected to input only unprocessedinput_ids(those that don’t have their past key value states given to this model) of shape(batch_size, unprocessed_length)instead of allinput_idsof shape(batch_size, sequence_length).
-  cache_position (torch.LongTensorof shape(sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily toposition_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
-  inputs_embeds (torch.FloatTensorof shape(batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passinginput_idsyou can choose to directly pass an embedded representation. This is useful if you want more control over how to convertinput_idsindices into associated vectors than the model’s internal embedding lookup matrix.
-  labels (torch.LongTensorof shape(batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in[0, ..., config.vocab_size]or -100 (seeinput_idsdocstring). Tokens with indices set to-100are ignored (masked), the loss is only computed for the tokens with labels in[0, ..., config.vocab_size].
-  use_cache (bool, optional) — If set toTrue,past_key_valueskey value states are returned and can be used to speed up decoding (seepast_key_values).
-  logits_to_keep (Union[int, torch.Tensor], defaults to0) — If anint, compute logits for the lastlogits_to_keeptokens. If0, calculate logits for allinput_ids(special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes pretty significant for long sequences or large vocabulary size. If atorch.Tensor, must be 1D corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).
The DeepseekVLForConditionalGeneration forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the
Moduleinstance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Example:
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, DeepseekVLForConditionalGeneration
>>> model = DeepseekVLForConditionalGeneration.from_pretrained("deepseek-community/deepseek-vl-1.3b-chat")
>>> processor = AutoProcessor.from_pretrained("deepseek-community/deepseek-vl-1.3b-chat")
>>> messages = [
...     {
...         "role": "user", "content": [
...             {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
...             {"type": "text", "text": "Where is the cat standing?"},
...         ]
...     },
... ]
>>> inputs = processor.apply_chat_template(
...     messages,
...     tokenize=True,
...     return_dict=True,
...     return_tensors="pt",
...     add_generation_prompt=True
... )
>>> # Generate
>>> generate_ids = model.generate(**inputs)
>>> processor.batch_decode(generate_ids, skip_special_tokens=True)[0]