OmniEmb-v1: Multi-Modal Embeddings for Unified Retrieval
A compact multi-modal embedding model that maps text and images into a single shared embedding space, enabling efficient retrieval across modalities without intermediate vision-language model (VLM) transformations.
Features
- 1536-dimensional unified embedding space
- Text2Text, Text2Image, and Image2Image retrieval support
- Direct embedding without VLM conversion steps
- Layout preservation for image data
Performance
Cross-Modal Retrieval (vs CLIP-ViT-B/32)
- Hits@1: 0.428 (+60.8%)
- Hits@5: 0.651 (+38.9%)
Correlation Metrics (vs LaBSE)
- STS-B Pearson: 0.800 (+9.7%)
- STS-B Spearman: 0.795 (+7.3%)
- SICK Pearson: 0.782 (+6.3%)
Error Metrics (vs LaBSE; lower is better)
- STS-B MSE: 3.222 (-19.6%)
- SICK MSE: 0.750 (-41.5%)
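The retrieval numbers above follow the standard Hits@k definition: the fraction of queries whose paired item appears in the top-k results by cosine similarity. Below is a minimal sketch of that computation, not the project's released evaluation code; the tensor names and the assumption that query i is paired with document i are illustrative.
import torch
import torch.nn.functional as F

def hits_at_k(query_emb: torch.Tensor, doc_emb: torch.Tensor, k: int) -> float:
    """Fraction of queries whose paired document (same row index) ranks in the top-k by cosine similarity."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    sim = q @ d.T                                                    # (num_queries, num_docs) cosine similarities
    topk = sim.topk(k, dim=-1).indices                               # indices of the k most similar documents per query
    targets = torch.arange(q.size(0), device=q.device).unsqueeze(-1) # each query's paired document index
    return (topk == targets).any(dim=-1).float().mean().item()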
Installation & Usage
Install package:
pip install sportsvision
Basic usage:
import torch
from sportsvision.research.configs import UnifiedEmbedderConfig
from sportsvision.research.models import UnifiedEmbedderModel
from transformers import AutoConfig, AutoModel
from PIL import Image
# Register the custom configuration and model
AutoConfig.register("unified_embedder", UnifiedEmbedderConfig)
AutoModel.register(UnifiedEmbedderConfig, UnifiedEmbedderModel)
# Initialize the model from the pretrained repository
emb_model = AutoModel.from_pretrained("sportsvision/omniemb-v1")
# Determine the device
device = "cuda" if torch.cuda.is_available() else "cpu"
# Move the model to the device
emb_model = emb_model.to(device)
# Set the model to evaluation mode
emb_model.eval()
# Sample texts
texts = [
    "Playoff season is exciting!",
    "Injury updates for the team."
]
# Encode texts to obtain embeddings
text_embeddings = emb_model.encode_texts(texts)
print("Text Embeddings:", text_embeddings)
# Sample images
image_paths = [
    "path_to_image1.jpg",
    "path_to_image2.jpg"
]
# Load images using PIL
images = [Image.open(img_path).convert('RGB') for img_path in image_paths]
# Encode images to obtain embeddings
image_embeddings = emb_model.encode_images(images)
print("Image Embeddings:", image_embeddings)
Training
- Fine-tuned from the CLIP architecture (base model: openai/clip-vit-large-patch14)
- Trained on the VisRAG dataset with a contrastive loss (see the sketch after this list)
- Evaluation scripts and detailed methodology documentation are coming soon
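The full training recipe has not been published yet. As a rough illustration, a CLIP-style symmetric contrastive (InfoNCE) objective over paired text and image embeddings looks like the sketch below; the temperature value, batch construction, and function name are assumptions, not the actual training code.
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(text_emb: torch.Tensor,
                               image_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style InfoNCE loss; row i of text_emb is assumed to pair with row i of image_emb."""
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = t @ v.T / temperature                      # (batch, batch) similarity logits
    targets = torch.arange(t.size(0), device=t.device)  # positive pairs lie on the diagonal
    loss_t2i = F.cross_entropy(logits, targets)         # text -> image direction
    loss_i2t = F.cross_entropy(logits.T, targets)       # image -> text direction
    return (loss_t2i + loss_i2t) / 2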
Limitations
- Benchmarking against ImageBind and other comparable models is still in progress
- Model extensions are under development
Citation
If you use this model in your research, please cite:
@misc{kodathala2024omniemb,
  author       = {Kodathala, Varun},
  title        = {OmniEmb-v1: Multi-Modal Embeddings for Unified Retrieval},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/sportsvision/omniemb-v1}}
}