OmniEmb-v1: Multi-Modal Embeddings for Unified Retrieval

A compact multi-modal embedding model that creates unified embeddings for text and images, enabling efficient retrieval across modalities without intermediate VLM transformations.

Features

  • 1536d unified embedding space
  • Text2Text, Text2Image, and Image2Image retrieval support (see the retrieval sketch after this list)
  • Direct embedding without VLM conversion steps
  • Layout preservation for image data

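Because text and images land in the same 1536-dimensional space, one index can serve Text2Text, Text2Image, and Image2Image retrieval alike. The sketch below illustrates the idea with plain NumPy; the placeholder vectors stand in for real encode_texts / encode_images outputs (shown in the usage section further down), and the helper is not part of the package.

import numpy as np

# Placeholder embeddings standing in for encode_texts()/encode_images() outputs;
# with the real model both come out as (batch, 1536) vectors in the same space.
rng = np.random.default_rng(0)
text_corpus = rng.normal(size=(100, 1536))   # e.g. 100 text passages
image_corpus = rng.normal(size=(50, 1536))   # e.g. 50 images

# One mixed-modality index: text and image vectors are stored together.
corpus = np.vstack([text_corpus, image_corpus])
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

def retrieve(query_emb, k=5):
    """Indices of the k most similar corpus items by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    return np.argsort(-(corpus @ q))[:k]

# A single query (text or image) can surface items of either modality.
query = rng.normal(size=1536)                # placeholder query embedding
print(retrieve(query))
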
Performance

Cross-Modal Retrieval (relative improvement over CLIP-ViT-B/32 in parentheses)

  • Hits@1: 0.428 (+60.8%)
  • Hits@5: 0.651 (+38.9%)

Correlation Metrics (relative improvement over LaBSE in parentheses)

  • STS-B Pearson: 0.800 (+9.7%)
  • STS-B Spearman: 0.795 (+7.3%)
  • SICK Pearson: 0.782 (+6.3%)

Error Metrics (lower is better; change relative to LaBSE in parentheses)

  • STS-B MSE: 3.222 (-19.6%)
  • SICK MSE: 0.750 (-41.5%)
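
For reference, this is a minimal sketch of how metrics like the ones above are commonly computed. The evaluation scripts mentioned under Training are not yet released, so the helpers below are illustrative assumptions rather than the actual evaluation code.

import numpy as np
from scipy.stats import pearsonr, spearmanr

def hits_at_k(sim, k):
    """sim[i, j] = similarity of query i to candidate j; the correct candidate
    for query i is assumed to sit on the diagonal (index i)."""
    top_k = np.argsort(-sim, axis=1)[:, :k]
    return float(np.mean([i in top_k[i] for i in range(sim.shape[0])]))

def sts_metrics(predicted_sim, gold_scores):
    """Pearson/Spearman correlation and MSE against gold similarity scores
    (after whatever rescaling the benchmark prescribes)."""
    predicted_sim = np.asarray(predicted_sim)
    gold_scores = np.asarray(gold_scores)
    return (pearsonr(predicted_sim, gold_scores)[0],
            spearmanr(predicted_sim, gold_scores)[0],
            float(np.mean((predicted_sim - gold_scores) ** 2)))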

Installation & Usage

Install package:

pip install sportsvision

Basic usage:

import torch
from sportsvision.research.configs import UnifiedEmbedderConfig
from sportsvision.research.models import UnifiedEmbedderModel
from transformers import AutoConfig, AutoModel
from PIL import Image

# Register the custom configuration and model
AutoConfig.register("unified_embedder", UnifiedEmbedderConfig)
AutoModel.register(UnifiedEmbedderConfig, UnifiedEmbedderModel)

# Initialize the model from the pretrained repository
emb_model = AutoModel.from_pretrained("sportsvision/omniemb-v1")

# Determine the device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Move the model to the device
emb_model = emb_model.to(device)

# Set the model to evaluation mode
emb_model.eval()

# Sample texts
texts = [
    "Playoff season is exciting!",
    "Injury updates for the team."
]

# Encode texts to obtain embeddings (disable gradient tracking for inference)
with torch.no_grad():
    text_embeddings = emb_model.encode_texts(texts)
print("Text Embeddings:", text_embeddings)

# Sample images
image_paths = [
    "path_to_image1.jpg",
    "path_to_image2.jpg"
]

# Load images using PIL
images = [Image.open(img_path).convert('RGB') for img_path in image_paths]

# Encode images to obtain embeddings (disable gradient tracking for inference)
with torch.no_grad():
    image_embeddings = emb_model.encode_images(images)
print("Image Embeddings:", image_embeddings)
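
Since both sets of embeddings live in the same space, cross-modal similarity reduces to one matrix product. The snippet continues the example above and assumes encode_texts / encode_images return (batch, 1536) tensors or arrays that torch.as_tensor can handle:

# Cosine similarity between every text and every image.
text_emb = torch.nn.functional.normalize(torch.as_tensor(text_embeddings), dim=-1)
image_emb = torch.nn.functional.normalize(torch.as_tensor(image_embeddings), dim=-1)
similarity = text_emb @ image_emb.T               # shape: (num_texts, num_images)
print("Similarity matrix:", similarity)
print("Best image per text:", similarity.argmax(dim=-1))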

Training

  • Fine-tuned CLIP architecture
  • Trained on the VisRAG dataset with a contrastive objective (a sketch of a typical CLIP-style loss follows this list)
  • Evaluation scripts and detailed methodology documentation coming soon
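
The objective is not documented beyond "contrastive loss"; as an illustration of the general recipe (not the actual training code), here is the symmetric CLIP-style InfoNCE formulation commonly used when fine-tuning CLIP-like encoders:

import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings; row i of each
    tensor is assumed to be a matching (text, image) pair."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.T / temperature   # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_t2i = F.cross_entropy(logits, targets)     # text -> image direction
    loss_i2t = F.cross_entropy(logits.T, targets)   # image -> text direction
    return (loss_t2i + loss_i2t) / 2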

Limitations

  • Benchmarking against ImageBind and other comparable multi-modal models is still in progress
  • Model extensions are under development

Citation

If you use this model in your research, please cite:

@misc{kodathala2024omniemb,
  author = {Kodathala, Varun},
  title = {OmniEmb-v1: Multi-Modal Embeddings for Unified Retrieval},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/sportsvision/omniemb-v1}}
}