Building Multimodal RAG Systems: Supercharging Retrieval with MultiModal Embeddings and LLMs

Community Article Published May 1, 2025

For more details, see: https://www.youtube.com/watch?v=BI2ROqd38t4

Summary:

The project discusses the development of a multimodal retrieval-augmented generation system capable of processing images, text, and tables, moving beyond traditional text-based systems. It highlights the limitations of converting images into text descriptions and introduces a new approach using [Cohere's Embed v4 model](https://cohere.com/blog/embed-4), which generates fixed-size multimodal embeddings for efficient retrieval. The workflow involves creating embeddings for document images, storing them in a vector store, and using a multimodal LLM like Gemini to generate answers based on user queries.

--

Introduction

Most Retrieval Augmented Generation (RAG) systems today are text-only, but real-world documents contain a rich mix of content types, including images, tables, and charts. Traditional approaches to handling non-text content involve generating text descriptions (captions) of visual elements and then embedding those captions, essentially converting everything to text. This approach, however, loses substantial contextual information and depends heavily on the quality of the captioning prompts.

This article explores building a true multimodal RAG system that directly processes images alongside text, enabling more accurate and comprehensive information retrieval from documents containing visual elements.

The Problems with Text-Only RAG for Visual Content

When dealing with documents containing images, tables, and infographics, traditional text-based RAG systems face several limitations:

  1. Information loss - Converting rich visual data to text descriptions loses contextual information
  2. Caption quality dependency - Results heavily depend on the quality of AI-generated captions
  3. Prompt engineering burden - Requires careful prompt design for the vision-to-text conversion
  4. Inability to reason visually - Text descriptions struggle to capture spatial relationships and visual patterns

Consider financial reports with complex charts, medical documents with diagnostic images, or technical documentation with diagrams — text descriptions simply cannot capture all relevant details.

Two Approaches to Vision-Enhanced RAG

Vision-enhanced RAG can be built either with cloud APIs or with fully local models. In this article, we'll focus on an API-based solution using Cohere's Embed v4 for retrieval and Gemini for answer generation.

Let's dive into the implementation details!

Cohere Embed v4 + Gemini

Cohere's Embed v4 model offers state-of-the-art multimodal embeddings that can directly encode images for retrieval. Unlike some approaches that create multi-level embeddings (which have high memory requirements), Embed v4 generates fixed-size embeddings compatible with standard vector stores.
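
Because each document image maps to a single fixed-size vector, the embeddings can be dropped into any off-the-shelf vector index. As a minimal sketch (assuming the faiss-cpu package is installed, and reusing the doc_embeddings matrix and query_emb vector built in the steps below), an index could look like this:

import numpy as np
import faiss  # pip install faiss-cpu

# doc_embeddings: (num_images, dim) float matrix produced in step 3 below
index = faiss.IndexFlatIP(doc_embeddings.shape[1])  # inner product, matching the dot-product scoring used in step 4
index.add(doc_embeddings.astype(np.float32))

# query_emb: query vector from co.embed(..., input_type="search_query"), as in step 4
scores, ids = index.search(query_emb.astype(np.float32).reshape(1, -1), k=3)
print(ids[0], scores[0])  # indices and scores of the top-3 document images

For the small collections used in this walkthrough, the plain NumPy matrix built below is already sufficient; a dedicated index mainly matters at larger scale.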

Implementation Steps

  1. Setup and Authentication
import cohere
cohere_api_key = "" # Replace with your Cohere API key
co = cohere.ClientV2(api_key=cohere_api_key)

from google import genai
gemini_api_key = "" # Replace with your Gemini API key
client = genai.Client(api_key=gemini_api_key)
  2. Image Processing Functions
import PIL
import io
import base64

max_pixels = 1568*1568  # Max resolution for images

# Resize too large images
def resize_image(pil_image):
    org_width, org_height = pil_image.size
    if org_width * org_height > max_pixels:
        scale_factor = (max_pixels / (org_width * org_height)) ** 0.5
        new_width = int(org_width * scale_factor)
        new_height = int(org_height * scale_factor)
        pil_image.thumbnail((new_width, new_height))

# Convert images to a base64 string
def base64_from_image(img_path):
    pil_image = PIL.Image.open(img_path)
    img_format = pil_image.format if pil_image.format else "PNG"
    
    resize_image(pil_image)
    
    with io.BytesIO() as img_buffer:
        pil_image.save(img_buffer, format=img_format)
        img_buffer.seek(0)
        img_data = f"data:image/{img_format.lower()};base64,"+base64.b64encode(img_buffer.read()).decode("utf-8")
    
    return img_data
  3. Create Embeddings for Images
import numpy as np
import os
import requests
import tqdm

# `img_folder` is the local folder for downloaded page images; `images` maps
# file names to their source URLs, e.g. images = {"page_1.png": "https://..."}
img_folder = "img"
os.makedirs(img_folder, exist_ok=True)

img_paths = []
doc_embeddings = []

for name, url in tqdm.tqdm(images.items()):
    img_path = os.path.join(img_folder, name)
    img_paths.append(img_path)
    
    # Download the image if needed
    if not os.path.exists(img_path):
        response = requests.get(url)
        response.raise_for_status()
        
        with open(img_path, "wb") as fOut:
            fOut.write(response.content)
    
    # Get the base64 representation of the image
    api_input_document = {
        "content": [
            {"type": "image", "image": base64_from_image(img_path)},
        ]
    }
    
    # Call the Embed v4.0 model
    api_response = co.embed(
        model="embed-v4.0",
        input_type="search_document",
        embedding_types=["float"],
        inputs=[api_input_document],
    )
    
    # Store embedding
    emb = np.asarray(api_response.embeddings.float[0])
    doc_embeddings.append(emb)

doc_embeddings = np.vstack(doc_embeddings)
  4. Implement Search Function
def search(question):
    # Compute the embedding for the query
    api_response = co.embed(
        model="embed-v4.0",
        input_type="search_query",
        embedding_types=["float"],
        texts=[question],
    )
    
    query_emb = np.asarray(api_response.embeddings.float[0])
    
    # Compute cosine similarities
    cos_sim_scores = np.dot(query_emb, doc_embeddings.T)
    
    # Get the most relevant image
    top_idx = np.argmax(cos_sim_scores)
    hit_img_path = img_paths[top_idx]
    
    return hit_img_path
  5. Answer Generation with Gemini
def answer(question, img_path):
    prompt = [
        f"Answer the question based solely on the information from the image.\nQuestion: {question}",
        PIL.Image.open(img_path),
    ]
    
    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-04-17",
        contents=prompt
    )
    
    return response.text
  6. Putting It All Together
question = "What is the net profit for Nike?"
top_image_path = search(question)
answer_text = answer(question, top_image_path)
print(answer_text)

Use Case: Arabic Dictionary Navigation and Definition Retrieval

I implemented this VisionRAG system and extensively tested it with Arabic dictionary pages, achieving impressive results across various queries.


I created a database of Arabic dictionary page images and tested the system with natural language queries in both Arabic and English:

  1. Basic Definition Lookups
    • Query: "ما معنى البروز؟" (What is the meaning of "brose"?)
    • Process: The system searched through dictionary pages and retrieved the exact page containing the definition
    • Result: The system returned the definition of "brose" from the retrieved dictionary page, explaining that it refers to a dish made with oatmeal, exactly as listed under the entry for "brose" in that page's glossary of terms (a minimal query sketch follows this list)
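
For reference, a query in this use case runs through exactly the same two functions defined above. A minimal, illustrative sketch (assuming the dictionary page images have already been embedded in step 3):

# Illustrative query flow for the dictionary use case,
# reusing search() and answer() from the implementation steps above
question = "ما معنى البروز؟"  # "What is the meaning of 'brose'?"
top_image_path = search(question)               # retrieves the most relevant dictionary page
answer_text = answer(question, top_image_path)  # Gemini reads the page image and answers
print(top_image_path)
print(answer_text)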

Advanced Considerations

Vector Quantization for Efficiency

Embedding vectors can be quantized (reduced from 32-bit to 4-bit or 8-bit precision) to significantly decrease storage requirements while maintaining most of the retrieval performance. This makes multimodal RAG systems more practical for large-scale deployments.
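
As a rough sketch of the idea, here is a simple symmetric scalar quantization of the float embeddings down to int8 (roughly 4x less storage). Note that Cohere's embed endpoint can also return lower-precision vectors directly via the embedding_types parameter (for example "int8" or "binary"), which is usually the more convenient route:

import numpy as np

def quantize_int8(embeddings):
    """Symmetric scalar quantization: float -> int8 (~4x less storage)."""
    scale = np.abs(embeddings).max() / 127.0
    quantized = np.clip(np.round(embeddings / scale), -127, 127).astype(np.int8)
    return quantized, scale

def dequantize(quantized, scale):
    return quantized.astype(np.float32) * scale

# Quantize the document embeddings from step 3 and check the reconstruction error
q_docs, q_scale = quantize_int8(doc_embeddings)
approx_docs = dequantize(q_docs, q_scale)
print("max reconstruction error:", np.abs(approx_docs - doc_embeddings).max())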


Hybrid Approaches

For optimal performance, consider:

  • Embedding both images and associated text together
  • Creating a reranker setup to improve retrieval quality (see the sketch after this list)
  • Using local models for privacy-sensitive applications
  • Combining multiple modalities (text, images, tables) in the same system
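
As a sketch for the reranker idea: retrieve the top-k candidate pages with the same embeddings, then let a second stage (a reranking model or the multimodal LLM itself) pick the best one. The helper below is a hypothetical variant of the search() function from step 4:

def search_top_k(question, k=3):
    """Return the k best-matching image paths with their scores (hypothetical helper)."""
    api_response = co.embed(
        model="embed-v4.0",
        input_type="search_query",
        embedding_types=["float"],
        texts=[question],
    )
    query_emb = np.asarray(api_response.embeddings.float[0])

    scores = np.dot(doc_embeddings, query_emb)   # one score per indexed image
    top_idx = np.argsort(-scores)[:k]            # best k candidates, highest score first
    return [(img_paths[i], float(scores[i])) for i in top_idx]

Embed v4 is also designed to handle documents that combine text and images, so the first point above (embedding a page image together with its associated text) fits the same pipeline without changing the retrieval or answer-generation steps.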

Conclusion

Multimodal RAG represents the next evolution in information retrieval systems, enabling far more powerful document understanding and query answering capabilities. By directly processing visual elements rather than converting everything to text, these systems can:

  1. Preserve contextual information in visual data
  2. Answer complex questions requiring visual reasoning
  3. Process documents more naturally, as humans do
  4. Handle information-dense visualizations like charts and tables

Whether you choose a cloud API-based approach or a fully local implementation, multimodal RAG opens up new possibilities for applications in finance, healthcare, technical documentation, education, and beyond.


This article was prepared based on research into multimodal RAG systems and implementation details from various sources. The code examples are for educational purposes and may need adjustments for production use. For more, check: https://www.youtube.com/watch?v=V1VOdoEFaDw
