SPLADE-Index python package: An ultra-fast search index for SPLADE sparse retrieval models
#8, opened by rasyosef
SPLADE-Index⚡
splade-index: https://github.com/rasyosef/splade-index
SPLADE is a neural retrieval model that learns sparse expansions of queries and documents. Compared to dense approaches, sparse representations offer several advantages: efficient use of an inverted index, explicit lexical matching, and interpretability. They also seem to generalize better to out-of-domain data (BEIR benchmark).
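To make the inverted-index and lexical-match points concrete, here is a minimal pure-Python sketch of how sparse vectors can be scored with a dot product over shared terms. The term weights are made up for illustration (a real SPLADE model would produce them), and `build_inverted_index` and `score` are hypothetical helper names, not part of splade-index.

```python
from collections import defaultdict

# Toy sparse vectors (term -> weight). The weights are invented for
# illustration; a real SPLADE model would compute them.
doc_vectors = {
    0: {"cat": 1.8, "feline": 1.2, "purr": 0.9},
    1: {"dog": 1.7, "friend": 0.8, "play": 0.6},
    2: {"fish": 1.9, "water": 1.1, "swim": 0.7},
}


def build_inverted_index(vectors):
    """Map each term to a list of (doc_id, weight) postings."""
    index = defaultdict(list)
    for doc_id, vec in vectors.items():
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index


def score(query_vec, index):
    """Dot product between the query and every document sharing a term with it."""
    scores = defaultdict(float)
    for term, q_weight in query_vec.items():
        # Only documents that contain this term are touched.
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return dict(scores)


inverted_index = build_inverted_index(doc_vectors)
query_vector = {"fish": 1.5, "purr": 1.0, "cat": 0.5}
scores = score(query_vector, inverted_index)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

Because each score is a sum of per-term products, one can read off which terms contributed to a match, which is the interpretability benefit mentioned above. Documents that share no term with the query are never visited.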
For more information about SPLADE models, see the following resources:
- SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking
- List of Pretrained Sparse Encoder (Sparse Embeddings) Models
- Training and Finetuning Sparse Embedding Models with Sentence Transformers v5
Installation
You can install splade-index with pip:
pip install splade-index
Recommended (but optional) dependencies:
# To speed up the top-k selection process, you can install `jax`
pip install "jax[cpu]"
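For readers unfamiliar with the term, top-k selection means picking the k highest-scoring documents without fully sorting all scores. The sketch below illustrates the idea with the standard library's `heapq`; it is not the package's implementation and does not use `jax`, and `top_k` is a hypothetical helper name.

```python
import heapq


def top_k(scores, k):
    """Return (indices, values) of the k highest scores, best first."""
    best = heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])
    indices = [i for i, _ in best]
    values = [v for _, v in best]
    return indices, values


# Hypothetical per-document scores for a single query
example_scores = [0.4, 2.9, 0.0, 1.7, 2.1]
indices, values = top_k(example_scores, k=2)
print(indices, values)
```

With a large corpus and a small k, selecting this way avoids the cost of sorting every score, which is why a faster top-k routine can noticeably speed up retrieval.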
Quickstart
Here is a simple example of how to use splade-index:
from sentence_transformers import SparseEncoder
from splade_index import SPLADE
# Download a SPLADE model from the 🤗 Hub
model = SparseEncoder("rasyosef/splade-tiny")
# Create your corpus here
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]
# Create the SPLADE retriever and index the corpus
retriever = SPLADE()
retriever.index(model=model, documents=corpus)
# Query the corpus
queries = ["does the fish purr like a cat?"]
# Get top-k results as a tuple of (doc ids, documents, scores). All three are arrays of shape (n_queries, k).
results = retriever.retrieve(queries, k=2)
doc_ids, result_docs, scores = results.doc_ids, results.documents, results.scores
for i in range(doc_ids.shape[1]):
    doc_id, doc, score = doc_ids[0, i], result_docs[0, i], scores[0, i]
    print(f"Rank {i+1} (score: {score:.2f}) (doc_id: {doc_id}): {doc}")
# You can save the index to a directory
retriever.save("animal_index_splade")
# ...and load it when you need it
import splade_index
reloaded_retriever = splade_index.SPLADE.load("animal_index_splade", model=model)
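The quickstart prints results for the first query only. Assuming the three result arrays have shape (n_queries, k) as described above, a loop over every query might look like the following sketch. `format_results` is a hypothetical helper, and the nested lists are placeholder values standing in for real retriever output.

```python
def format_results(doc_ids, documents, scores):
    """Build one line per (query, rank) pair from (n_queries, k) results."""
    lines = []
    rows = zip(doc_ids, documents, scores)
    for q, (ids_row, docs_row, scores_row) in enumerate(rows):
        ranked = zip(ids_row, docs_row, scores_row)
        for rank, (doc_id, doc, score) in enumerate(ranked, start=1):
            lines.append(
                f"Query {q} rank {rank} (score: {score:.2f}) (doc_id: {doc_id}): {doc}"
            )
    return lines


# Placeholder values, not real model output (2 queries, k=2)
doc_ids = [[3, 0], [1, 2]]
documents = [["fish doc", "cat doc"], ["dog doc", "bird doc"]]
scores = [[2.5, 1.25], [3.0, 0.5]]

lines = format_results(doc_ids, documents, scores)
for line in lines:
    print(line)
```

Iterating row by row works the same way on 2-D arrays, so the helper should also accept the arrays unpacked from `results` in the quickstart.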