Visualizing How VLMs Work
Introduction
Visual Language Models (VLMs) are autoregressive AI models that process both text and images as input. In this post, we’ll take a closer look at how VLMs like Idefics3 and SmolVLM operate under the hood, exploring how they merge visual and textual information to generate coherent outputs.
As the reference model in this blog post we will use HuggingFaceTB/SmolVLM-256M-Instruct.

Processor
Image Processor
The processor prepares both text and image data into a unified format suitable for the model. For images, it performs a sequence of transformations before converting them into token-like representations the model can understand.
The image pipeline can be visualized as follows:
A key step in this process is image splitting, as seen in this code snippet. Each image is divided into smaller patches (or “splits”), which are individually encoded.
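For intuition, here is a simplified sketch of the splitting idea. It is not the exact transformers implementation; it assumes the image has already been resized so that both sides are multiples of the tile size (512 for this checkpoint), and it appends a downscaled copy of the whole image as the "global" split.

from PIL import Image

def split_image(image: Image.Image, tile_size: int = 512):
    """Simplified sketch: cut the resized image into non-overlapping
    tile_size x tile_size crops, row by row, then append a downscaled
    copy of the whole image as the global split."""
    width, height = image.size
    splits = []
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            splits.append(image.crop((left, top, left + tile_size, top + tile_size)))
    splits.append(image.resize((tile_size, tile_size)))  # global image
    return splits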
When text accompanies images, each image is first represented by a single <image> token in the text sequence. Depending on the number of splits, each image is then expanded into num_splits × 64 <image> tokens (the global image counts as one of the splits), interleaved with <fake_token_around_image> and <row_r_col_c> / <global-img> markers, as shown in the encoding example below.
The constant 64 comes from the relation:
64 = (image_size / patch_size)² / scale_factor² = (512 / 16)² / 4² = 1024 / 16
We’ll get back to this formula later, but for now, think of it as:
Each image split is represented by 64 tokens.
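We can sanity-check this number with the values used throughout this post: 512-pixel splits, 16-pixel patches, and the 1024 → 64 pixel-shuffle reduction we will meet again in the connector.

# Back-of-the-envelope check of the 64-tokens-per-split relation,
# using the values from the SmolVLM-256M setup discussed in this post.
image_size = 512        # side length of each split
patch_size = 16         # vision encoder patch size
scale_factor = 4        # pixel-shuffle factor used by the connector

patches_per_split = (image_size // patch_size) ** 2       # 32 * 32 = 1024
tokens_per_split = patches_per_split // scale_factor**2   # 1024 / 16 = 64
print(patches_per_split, tokens_per_split)                 # 1024 64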
Text Processor
When an image is included, the text must also reflect it using the <image> placeholder. The processor follows these steps:
- Insert placeholders – Each image in the input is represented by a <image> token within the text.
- Count splits – The processor determines how many splits each image has after preprocessing.
- Expand tokens – Each <image> token is then replaced (or "expanded") into a sequence based on the number of image splits, according to the formula above.
To fully grasp how image tokens are expanded, check the encoding example below and review the following references:
- Documentation example – shows a practical example of how expansion happens.
- Main function – the core implementation that defines the expansion logic.
Encoding Example
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
# Initialize processor and model
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct")
model = AutoModelForVision2Seq.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct")
# Load images
image = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you describe this image?"}
        ]
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
out = processor.decode(inputs["input_ids"][0])
print(out.replace("<image>", "."))  # printing the full <image> tokens would bloat the screen
<|im_start|>User:<fake_token_around_image><row_1_col_1>................................................................<fake_token_around_image><row_1_col_2>................................................................<fake_token_around_image><row_1_col_3>................................................................<fake_token_around_image><row_1_col_4>................................................................
<fake_token_around_image><row_2_col_1>................................................................<fake_token_around_image><row_2_col_2>................................................................<fake_token_around_image><row_2_col_3>................................................................<fake_token_around_image><row_2_col_4>................................................................
<fake_token_around_image><row_3_col_1>................................................................<fake_token_around_image><row_3_col_2>................................................................<fake_token_around_image><row_3_col_3>................................................................<fake_token_around_image><row_3_col_4>................................................................
<fake_token_around_image><global-img>................................................................<fake_token_around_image>Can you describe this image?<end_of_utterance>
Assistant:
Data Preparation
Before being fed into the model, both the image and text data are prepared and aligned. As in most autoregressive models, the output sequence is shifted to the right to teach the model how to predict the next token based on all previous ones.
However, since image representations cannot be directly predicted, they are masked with the <pad> token in the target. This prevents the model from computing a loss over non-text (visual) tokens.
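Here is a minimal sketch of that masking step. The helper and its arguments are illustrative, not the library implementation; in practice the masked positions are set to an ignore value (such as -100 for PyTorch's cross-entropy) so no loss is computed over them.

import torch

def build_labels(input_ids: torch.Tensor, image_token_id: int, ignore_index: int = -100):
    """Illustrative helper: copy the input ids as targets and mask every
    <image> position so the cross-entropy loss skips it. The actual shift
    (predicting token t+1 from tokens up to t) happens inside the model's
    loss computation."""
    labels = input_ids.clone()
    labels[labels == image_token_id] = ignore_index
    return labels

# Hypothetical usage with the processor from the encoding example:
# image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")
# labels = build_labels(inputs["input_ids"], image_token_id)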
Model Architecture
Embedding Layer

From a high-level view, SmolVLM consists of five main components (illustrated on the right). The text processing starts with the prompt, which is tokenized and passed through an embedding layer. This layer transforms discrete tokens into high-dimensional vector representations, producing a tensor of shape [batch_size, seq_len, embed_dim].
This tensor now encodes the vector representation of the input text and serves as the foundation for all subsequent computations. Note that it already contains 13 × 64 placeholder tokens for the upcoming processed image (where 13 is the number of image splits).
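Continuing from the encoding example above, we can inspect this tensor directly; the last dimension is the hidden size of the text backbone defined in the checkpoint's config.

# Continuing from the encoding example: inspect the text embeddings directly.
input_embeds = model.get_input_embeddings()(inputs["input_ids"])
print(inputs["input_ids"].shape)  # [1, seq_len]
print(input_embeds.shape)         # [1, seq_len, embed_dim] — the text hidden size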

Vision Model
Patch Embedding
For the visual branch, the input tensor has the shape [splits, num_channels, height, width]. For example, an image split into 13 RGB splits is represented as [13, 3, 512, 512].
The patch embedding layer transforms this tensor into a sequence of visual tokens, making it compatible with the Transformer architecture. It does this by dividing the image into small, non-overlapping patches and projecting each one into a high-dimensional vector space. The core principle is to treat the channel dimension as the embedding dimension and expand it from 3 (RGB) to 768.
This operation is implemented as a 2D convolution, where kernel_size and stride are both set to the patch size. This ensures each convolutional window processes exactly one patch without overlap:
self.patch_embedding = nn.Conv2d(
    in_channels=config.num_channels,  # 3 (RGB channels)
    out_channels=self.embed_dim,      # 768
    kernel_size=self.patch_size,      # 16
    stride=self.patch_size,           # 16
    padding="valid",
)
When applied to [13, 3, 512, 512], the transformation proceeds as follows:
Step | Operation | Output Shape | Description |
---|---|---|---|
1 | Conv2d | [13, 768, 32, 32] | Each 16×16 patch is embedded into a 768-dimensional vector (512 / 16 = 32). |
2 | Reshape | [13, 1024, 768] | The 32×32 grid is flattened into 1024 patches (visual tokens). |
The resulting tensor [13, 1024, 768] represents each image as a sequence of 1024 embedded patches, ready to be processed alongside text embeddings within the Transformer.
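The two steps in the table can be reproduced with a standalone sketch using the same shapes (illustrative code, not the library implementation):

import torch
import torch.nn as nn

# Standalone sketch of the patch embedding, using the shapes from the table.
patch_embedding = nn.Conv2d(in_channels=3, out_channels=768, kernel_size=16, stride=16)

pixel_values = torch.randn(13, 3, 512, 512)         # [splits, channels, H, W]
patches = patch_embedding(pixel_values)             # [13, 768, 32, 32]
visual_tokens = patches.flatten(2).transpose(1, 2)  # [13, 1024, 768]
print(patches.shape, visual_tokens.shape)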
Positional Encoder
The positional encoder injects information about patch order and spatial layout, just as positional encodings do for words in a text model. Since transformers have no inherent sense of order, these encodings allow the model to understand where each patch is located within the image.
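A minimal sketch of the common ViT-style approach, a learned embedding per patch position that is simply added to the patch embeddings (sizes match the example above):

import torch
import torch.nn as nn

# Sketch: learned position embeddings added to the patch embeddings.
num_patches, embed_dim = 1024, 768
position_embedding = nn.Embedding(num_patches, embed_dim)

visual_tokens = torch.randn(13, num_patches, embed_dim)            # from the patch embedding
position_ids = torch.arange(num_patches)                            # 0 .. 1023
visual_tokens = visual_tokens + position_embedding(position_ids)    # broadcast over the 13 splits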
Encoder
The encoder operates in a straightforward manner:
- A Multi-Head Attention (MHA) layer captures relationships between image patches, enabling the model to reason about spatial dependencies.
- The output is then passed through a feed-forward network (MLP) to refine and project the features.
This attention implementation is close to how BERT functions: it is bidirectional, with no causal mask. It is essential to note that the output of the vision model is [13, 1024, 768] for the example input we are using.
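A single encoder block therefore looks roughly like the sketch below. Layer sizes (number of heads, MLP width) are illustrative assumptions; the real model stacks many such blocks.

import torch
import torch.nn as nn

class EncoderBlockSketch(nn.Module):
    """Simplified vision encoder block: bidirectional self-attention + MLP."""
    def __init__(self, embed_dim=768, num_heads=12, mlp_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(nn.Linear(embed_dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, embed_dim))

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)  # no causal mask: every patch attends to every patch
        x = x + attn_out
        return x + self.mlp(self.norm2(x))

x = torch.randn(13, 1024, 768)
print(EncoderBlockSketch()(x).shape)  # [13, 1024, 768]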
Connector
The connector serves as the bridge between the vision encoder and the language model, ensuring both modalities share a compatible embedding space. Its two main functions are:
- Compressing the visual output (reducing the number of tokens).
- Casting the embedding dimension to match that of the text embeddings.
Pixel Shuffle
The pixel shuffle operation compresses the spatial dimension of the visual features while preserving important spatial relationships. In practical terms, it reduces the number of image tokens from 1024 → 64, drastically lowering the sequence length while maintaining representational richness.
This transformation unfolds as follows:
By doing so, every 4 × 4 group of neighbouring patches is folded into the channel dimension: the embedding dimension grows by a factor of 4² = 16 (768 → 12288), resulting in an output of shape [13, 64, 12288].
By progressively shuffling and regrouping pixels, this operation compresses both height and width dimensions while maintaining the spatial order of visual information in each direction. The result is a more compact tensor that preserves essential image features.
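A self-contained sketch of the shuffle, following the space-to-depth rearrangement described above (variable names and exact steps are illustrative, not copied from the library):

import torch

def pixel_shuffle_sketch(x: torch.Tensor, scale: int = 4) -> torch.Tensor:
    """Space-to-depth sketch: trade spatial positions for channel depth.
    [splits, 1024, 768] -> [splits, 64, 768 * 16] for scale = 4."""
    splits, seq, dim = x.shape
    side = int(seq**0.5)                                     # 32
    x = x.view(splits, side, side, dim)                      # back to a 32 x 32 grid
    x = x.view(splits, side, side // scale, dim * scale)     # group columns into channels
    x = x.permute(0, 2, 1, 3)
    x = x.reshape(splits, side // scale, side // scale, dim * scale * scale)  # group rows too
    x = x.permute(0, 2, 1, 3)
    return x.reshape(splits, seq // scale**2, dim * scale**2)

out = pixel_shuffle_sketch(torch.randn(13, 1024, 768))
print(out.shape)  # [13, 64, 12288]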
Modality Projection
The connector also applies a modality_projection layer, a linear transformation that maps these visual features to the embedding_dimension of the text tokens. By doing so we get a tensor of shape [13, 64, text_embed_dim] that lives in the same space as the text embeddings.
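In code this is just a linear layer. In the sketch below, 12288 = 768 × 4² comes from the pixel shuffle above, and 576 is assumed to be the text hidden size of the SmolVLM-256M checkpoint.

import torch
import torch.nn as nn

# Sketch of the modality projection: visual features -> text embedding width.
modality_projection = nn.Linear(12288, 576)  # 576 assumed from the checkpoint's text config

image_hidden_states = modality_projection(torch.randn(13, 64, 12288))
print(image_hidden_states.shape)  # [13, 64, 576] — same width as the text embeddings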
Input Merger
The input merger is not a learnable layer; it is a function responsible for integrating visual and textual embeddings into a single input sequence.
It replaces each <image> placeholder token in the text sequence with the corresponding visual embeddings produced by the connector. In practice, this function scans the token IDs for <image> tokens and substitutes them with their preprocessed image representations. This step effectively merges both modalities into a single, continuous tensor that can be fed directly into the decoder.
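A sketch of that substitution, assuming image_token_id is the id of the <image> placeholder and that the connector produced exactly one embedding per placeholder position:

import torch

def merge_inputs(input_ids, text_embeds, image_embeds, image_token_id):
    """Sketch: drop the visual embeddings into the <image> placeholder slots.
    input_ids:    [batch, seq_len]
    text_embeds:  [batch, seq_len, dim]  (output of the text embedding layer)
    image_embeds: [splits, 64, dim]      (output of the connector)"""
    merged = text_embeds.clone()
    mask = input_ids == image_token_id  # True at every placeholder position
    merged[mask] = image_embeds.reshape(-1, image_embeds.shape[-1]).to(merged.dtype)
    return merged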
Decoder
The decoder functions similarly to that of a traditional autoregressive language model. It is composed of stacked Masked Multi-Head Attention (MHA) layers followed by a Language Modeling (LM) head.
- The Masked MHA ensures that each token can only attend to previous tokens (including visual ones), preserving causality during text generation.
- The LM head maps the decoder’s hidden states back to vocabulary logits, enabling the model to predict the next token.
This allows the VLM to generate coherent multimodal outputs, seamlessly grounding textual predictions in visual context.
The key thing to keep in mind here: since there is no way to predict an output for the image tokens, a pad token is used in the target to skip calculating loss over them.
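Putting it all together, generation works like any other causal LM in transformers. Continuing from the encoding example above:

# Continuing from the encoding example: run the full pipeline end to end.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text[0])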
Conclusion
In this post, we explored the inner workings of Visual Language Models (VLMs) like SmolVLM, breaking down how they process and integrate multimodal data from raw pixels and text to coherent, grounded outputs.
Here’s a quick recap of each stage:
- Processor: Prepares and aligns raw text and image inputs.
- Vision Module: Converts pixel data into high-dimensional patch embeddings.
- Connector: Compresses and projects visual features into the same embedding space as text tokens.
- Input Merger: Replaces placeholder tokens with visual embeddings to form a unified multimodal sequence.
- Decoder: Generates context-aware text by attending to both visual and textual information.
At their core, VLMs don’t just see and read; they reason across modalities. This architecture allows them to handle multiple images, text-only prompts, or even image-only inputs, making SmolVLM a flexible and powerful foundation for a wide range of multimodal applications.