New state-of-the-art BERT-size retrieval model: *cde-small-v2* 🥳🍾
Hi everyone! We at Cornell are releasing a new retrieval model this week. It uses the contextual embeddings framework, is based on the ModernBERT backbone, and gets state-of-the-art results on the MTEB benchmark for its model size (140M parameters). cde-small-v2 gets an average score of 65.6 across the 56 datasets and improves on our previous model in *every* task domain (retrieval, classification, etc.).
We made a lot of changes to make this model work. First of all, ModernBERT has a better tokenizer, which probably helped this work out of the box. We also followed the principles from the CDE paper and used harder clusters and better hard-negative filtering, which gave a small performance improvement. And we made a few small changes that have been shown to work on the larger models: we disabled weight decay, masked out the prefix tokens during pooling, and added a residual connection from the first stage to the second stage for better gradient flow.
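To make the "masked out the prefix tokens during pooling" change concrete, here is a minimal sketch in plain Python. It is not our training code: it assumes the prefix occupies the first `num_prefix_tokens` positions of the sequence and that pooling is a simple mean over the remaining non-padding tokens. The function name and arguments are hypothetical, and in a real model this would be a batched tensor operation.

```python
from typing import List, Sequence


def masked_mean_pool(
    token_embeddings: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
    num_prefix_tokens: int,
) -> List[float]:
    """Mean-pool token vectors, skipping padding and the leading prefix positions.

    Assumption: the prefix sits in the first `num_prefix_tokens` positions.
    """
    if not token_embeddings:
        raise ValueError("token_embeddings is empty")
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for position, (vector, keep) in enumerate(zip(token_embeddings, attention_mask)):
        # Skip prefix positions and padded positions (mask value 0).
        if position < num_prefix_tokens or not keep:
            continue
        for i, value in enumerate(vector):
            total[i] += value
        count += 1
    if count == 0:
        raise ValueError("no tokens left to pool after masking")
    return [value / count for value in total]


# Toy sequence: 2 prefix tokens, 2 document tokens, 1 padding token.
tokens = [[9.0, 9.0], [9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 1, 1, 0]
pooled = masked_mean_pool(tokens, mask, num_prefix_tokens=2)
print(pooled)  # [2.0, 3.0]
```

Only the two document tokens contribute to the pooled vector here, so the large prefix values never leak into the final embedding.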
We're still looking for a compute sponsor to help us scale CDE to larger models. Since it's now state-of-the-art at the 100M parameter scale, it seems to be a reasonable bet that we could train a state-of-the-art large model if we had the GPUs. If you're interested in helping with this, please reach out!
Here's a link to the model: jxm/cde-small-v2
And here's a link to the paper: Contextual Document Embeddings (2410.02525)