We are very proud to introduce jinaai/jina-clip-v1, aka "jina-embeddings-multimodal".
OpenAI's CLIP model openai/clip-vit-base-patch32 aligns the text and image modalities well, so users can build cross-modal text-image retrieval or image classification on top of it. However, due to its training data and recipe, it cannot:
1. Model longer text inputs (it is limited to 77 tokens).
2. Align text representations (the CLIP text tower is weak for text search).
JinaCLIP improves on this in three ways:
1. Stronger cross-modal performance than the OpenAI CLIP baseline: 2% and 6% improvement in cross-modal retrieval recall@5.
2. The JinaCLIP text tower is a strong text encoder that reaches the same performance as jinaai/jina-embeddings-v2-base-en, a 165% improvement in MTEB[BEIR] recall@5.
3. The JinaCLIP image tower also performs strongly in image-to-image search (CBIR), with a 12% recall improvement on the CIFAR-100 test set.
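For readers less familiar with the metric, recall@5 can be sketched as below. This is a minimal illustration, not the evaluation code behind the numbers above: `recall_at_k` is a hypothetical helper, the similarity scores are made up, and each query is assumed to have exactly one relevant item.

```python
def recall_at_k(similarities, relevant, k=5):
    """Fraction of queries whose relevant item appears among the top-k results.

    similarities[i][j] is the score between query i and candidate j.
    relevant[i] is the index of the single correct candidate for query i
    (an assumption of this sketch; real benchmarks may have several).
    """
    hits = 0
    for scores, target in zip(similarities, relevant):
        # Rank candidate indices by descending similarity and keep the top k.
        top_k = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
        if target in top_k:
            hits += 1
    return hits / len(relevant)


# Toy text-to-image scores: 2 text queries against 3 candidate images.
scores = [
    [0.9, 0.1, 0.3],  # query 0, correct image is 0
    [0.2, 0.4, 0.8],  # query 1, correct image is 1
]
print(recall_at_k(scores, relevant=[0, 1], k=1))  # 0.5
print(recall_at_k(scores, relevant=[0, 1], k=2))  # 1.0
```

With k=1 only the first query finds its image, so recall is 0.5. Widening to k=2 recovers the second one as well.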
If you are working on MuRAG (multimodal retrieval-augmented generation), try it out!
In the vector search setup, we normally combine a fast embedding model and an accurate but slow reranker model.
The newly released @jinaai rerankers are small and almost as accurate as our base reranker. This means that, within a given time budget, they can score more candidate documents from the embedding model and have a better chance of feeding the LLM the correct context for RAG.
These models are available on Hugging Face and have been integrated into the latest SentenceTransformers 2.7.0. Check it out!
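The embed-then-rerank setup described above can be sketched roughly as follows. To keep it runnable without downloading any model, the two scorers are hypothetical toy functions: `word_overlap` stands in for a fast embedding model and `ordered_overlap` for a slower, more accurate reranker. None of these names come from the Jina or SentenceTransformers APIs.

```python
def retrieve_then_rerank(query, docs, fast_score, slow_score, top_k=3):
    """Two-stage search: cheap scoring over everything, costly scoring over a shortlist."""
    # Stage 1: the fast scorer (stand-in for an embedding model) ranks all documents.
    shortlist = sorted(docs, key=lambda d: fast_score(query, d), reverse=True)[:top_k]
    # Stage 2: the slow scorer (stand-in for a reranker) reorders only the shortlist.
    return sorted(shortlist, key=lambda d: slow_score(query, d), reverse=True)


# Hypothetical toy scorers so the sketch runs without any model download.
def word_overlap(query, doc):
    """Fast and crude: count shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def ordered_overlap(query, doc):
    """Slower and stricter: count query bigrams that appear in the document."""
    q = query.lower().split()
    d = doc.lower().split()
    doc_bigrams = set(zip(d, d[1:]))
    return sum(1 for pair in zip(q, q[1:]) if pair in doc_bigrams)


docs = [
    "search vector databases are fast",
    "vector search with a reranker",
    "cooking pasta at home",
    "a reranker improves vector search results",
]
results = retrieve_then_rerank("vector search", docs, word_overlap, ordered_overlap, top_k=3)
print(results[0])  # vector search with a reranker
```

The `top_k` parameter is where reranker size matters: a smaller reranker can afford a larger shortlist within the same time budget, so the right document is less likely to be cut in stage 1.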
This demo shows why on-device ML is so important:
1. Privacy - local inference means no user data is sent to the cloud.
2. No server latency - empowers developers to build real-time applications.
3. Lower costs - no need to pay for bandwidth and processing of streamed video.
At @jinaai, we've recently launched an interesting model: jinaai/jina-colbert-v1-en. In this post, I'd like to give you a quick introduction to ColBERT: the multi-vector search & late interaction retriever.
As you may already know, we've been developing embedding models such as jinaai/jina-embeddings-v2-base-en for some time. These models, often called 'dense retrievers', generate a single representation for each document.
Embedding models like Jina-v2 have the advantage of quick integration with vector databases and good performance within a specific domain.
Performing well "within a specific domain" means that embedding models do very well on data whose distribution resembles what they saw during training. However, it also suggests that they might only perform "okay" on tasks outside of that domain and may require fine-tuning.
Now, let's delve into multi-vector search and late-interaction models. The idea is quite simple:
1. During model training, you apply dimensionality reduction to shrink each token vector from 768 to 128 dimensions to save storage.
2. Given one query and one document, you match the first query token embedding against every token embedding in the document and keep the maximum similarity score. Repeat this for each remaining query token, from the second to the last, and then sum up all the maximum similarity scores.
This process is called multi-vector search because you keep one embedding per token: if your query has 5 tokens, you're keeping 5 token embeddings of 128 dimensions each (a 5 x 128 matrix). The "max similarity sum-up" procedure is termed late interaction.
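The "max similarity sum-up" can be sketched in a few lines. This is a minimal illustration rather than the jina-colbert-v1-en implementation: cosine similarity is assumed as the similarity function, `late_interaction_score` is a hypothetical name, and the 2-dimensional token vectors are made up to keep the arithmetic easy to follow.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def late_interaction_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best similarity to any document token."""
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-dimensional token embeddings (the post describes 128 dimensions per token).
query = [[1.0, 0.0], [0.0, 1.0]]              # 2 query tokens
doc_a = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]  # has a close match for both query tokens
doc_b = [[1.0, 0.0]]                          # matches only the first query token

print(late_interaction_score(query, doc_a))  # 2.0
print(late_interaction_score(query, doc_b))  # 1.0
```

Each query token contributes at most 1.0 here, so doc_a, which has a good match for both tokens, scores higher than doc_b.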
Multi-vector, late-interaction retrievers have two advantages:
1. Excellent performance outside of a specific domain, since they match at token-level granularity.
2. Explainability: you can inspect the token-level matching and understand why a score is higher or lower.
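To make the explainability point concrete, here is a small sketch that reports, for each query token, which document token it matched best. The tokens and unit-length vectors are invented for illustration, and `explain_matches` is a hypothetical helper, not part of any Jina library.

```python
def explain_matches(query_tokens, query_vecs, doc_tokens, doc_vecs):
    """For each query token, return (query_token, best_doc_token, similarity)."""
    matches = []
    for q_tok, q_vec in zip(query_tokens, query_vecs):
        # Dot product equals cosine similarity here because the toy vectors are unit length.
        sims = [sum(x * y for x, y in zip(q_vec, d_vec)) for d_vec in doc_vecs]
        best = max(range(len(sims)), key=lambda j: sims[j])
        matches.append((q_tok, doc_tokens[best], sims[best]))
    return matches


# Made-up unit-length embeddings, for illustration only.
query_tokens = ["cheap", "flights"]
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = ["low", "cost", "airfare"]
doc_vecs = [[0.8, 0.6], [0.6, 0.8], [0.0, 1.0]]

for q_tok, d_tok, sim in explain_matches(query_tokens, query_vecs, doc_tokens, doc_vecs):
    print(f"{q_tok!r} best matches {d_tok!r} (similarity {sim:.2f})")
```

Reading the pairs side by side shows which document tokens drove the final score, which a single-vector dense retriever cannot offer.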
We've been busy cooking up some interesting models at @jinaai, with a recent highlight being the release of our first batch of bilingual embedding models.
Internally labeled as X+EN, where X represents the target language and EN stays fixed, these models specialize in both monolingual tasks and cross-lingual retrieval tasks, crossing from X to EN.
We're also excited to announce that a Spanish bilingual embedding will be released in approximately two weeks.
Our evaluation across various MLM tasks has demonstrated that the Bilingual Backbone consistently outperforms state-of-the-art Multilingual Backbones like XLM-Roberta (given its focus on just two languages).
Despite being three times smaller than the leading multilingual model (e5-multilingual-large), our released bilingual embedding models outperform it in both monolingual and cross-lingual search tasks.
Currently, we're putting the finishing touches on the technical report, which should be available on arXiv by next week.
Looking ahead, the embedding team is gearing up for jina-embeddings-v3 with some initial groundwork already underway. Stay tuned for more updates!