πŸ–ΌοΈπŸ“ OneEncoder: A Unified Text & Image & Audio Model

OneEncoder is a lightweight framework for cross-modal alignment, focusing on efficiently integrating text and images (with future extensions to other modalities). Unlike traditional methods relying on massive modality-specific encoders, OneEncoder progressively aligns different data types, making it cost-effective and performant even on small paired datasets.
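The alignment idea can be pictured as keeping pretrained modality-specific encoders frozen and training only a small shared projection on top of their features. The PyTorch sketch below is a minimal conceptual illustration of that pattern, not OneEncoder's actual architecture; the module names, feature dimensions, and the random stand-in features are assumptions for demonstration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedProjection(nn.Module):
    """Small shared module mapping frozen encoder features into one space."""
    def __init__(self, text_dim: int, image_dim: int, shared_dim: int = 256):
        super().__init__()
        self.text_in = nn.Linear(text_dim, shared_dim)    # per-modality adapter
        self.image_in = nn.Linear(image_dim, shared_dim)  # per-modality adapter
        self.shared = nn.Sequential(                      # lightweight shared core
            nn.Linear(shared_dim, shared_dim),
            nn.GELU(),
            nn.Linear(shared_dim, shared_dim),
        )

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor):
        t = self.shared(self.text_in(text_feats))
        v = self.shared(self.image_in(image_feats))
        return F.normalize(t, dim=-1), F.normalize(v, dim=-1)

# Frozen backbones would produce these features; random tensors stand in here.
text_feats = torch.randn(4, 768)    # e.g. from a frozen text encoder
image_feats = torch.randn(4, 1024)  # e.g. from a frozen image encoder

proj = SharedProjection(text_dim=768, image_dim=1024)
t, v = proj(text_feats, image_feats)
similarity = t @ v.T  # cosine similarities between captions and images
print(similarity.shape)  # torch.Size([4, 4])
```

Because only the small projection is trained, adding a new modality amounts to plugging in another frozen encoder and adapter rather than retraining everything.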

πŸš€ Key Features

βœ… Multimodal Alignment: Initially supports text and image, with progressive extension to other modalities.
βœ… Lightweight & Efficient: Avoids full retraining when adding new modalities.
βœ… Strong Performance: Outperforms approaches trained on large modality-specific datasets, despite relying only on small paired datasets.

🎯 Applications

  • Visual Question Answering (VQA)
  • Image-Text Retrieval
  • Multimodal Content Understanding

πŸ“ Authors

πŸ“Œ Bilal FAYE, Hanane AZZAG, Mustapha LEBBAH, Djamel BOUCHAFFRA

πŸ“„ Research Paper

πŸ“œ arXiv: OneEncoder: Progressive Cross-Modal Alignment

πŸ“Œ Resources

πŸ”— GitHub Repo: OneEncoder
πŸš€ Hugging Face Demo: OneEncoder Retriever
πŸ““ Demo Notebook: OneEncoder Demos
πŸ”Š OneEncoder for Text, Image: HF Model
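The checkpoint is published as Safetensors without a library tag, so there is no one-line AutoModel load. One option (an assumption, not an official recipe) is to pull the repository snapshot with huggingface_hub and inspect its files before wiring them into the code from the GitHub repo:

```python
from pathlib import Path

from huggingface_hub import snapshot_download

# Download the full model repository (Safetensors weights plus any configs).
local_dir = snapshot_download(repo_id="bilalfaye/OneEncoder-text-image-audio")

# List what the snapshot contains; the exact file layout is not documented here.
for path in sorted(Path(local_dir).rglob("*")):
    if path.is_file():
        print(path.relative_to(local_dir))
```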
