BM25S Index

This is a BM25S index created with the bm25s library (version 0.0.1dev0), an ultra-fast implementation of BM25. It can be used for lexical retrieval tasks.

Installation

You can install the bm25s library with pip:

pip install "bm25s==0.1.3"

# Include extra dependencies like stemmer
pip install "bm25s[full]==0.1.3"

# For Hugging Face Hub usage (loading and saving indexes)
pip install huggingface_hub
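
The [full] extra pulls in optional dependencies such as PyStemmer for stemming. As a minimal sketch (assuming PyStemmer is installed, e.g. via the [full] extra), passing a stemmer to the bm25s tokenizer typically looks like this:

import bm25s
import Stemmer  # provided by the PyStemmer package

# Create an English stemmer (assumes PyStemmer is available)
stemmer = Stemmer.Stemmer("english")

# Tokenize text with stopword removal and stemming before indexing or querying
tokens = bm25s.tokenize("a cat is a feline", stopwords="en", stemmer=stemmer)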

Loading a bm25s index

You can use this index for information retrieval tasks. Here is an example:

import bm25s
from bm25s.hf import BM25HF

# Load the index
retriever = BM25HF.load_from_hub("xhluca/bm25s-nq-index", revision="main")

# Query the index
query = "a cat is a feline"

results = retriever.retrieve(bm25s.tokenize(query), k=3)
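
The retrieval call returns the top-k results. Below is a sketch of loading the corpus alongside the index and inspecting documents and scores; the load_corpus and mmap flags and the (documents, scores) unpacking follow the general bm25s documentation, so treat them as assumptions rather than guarantees for this particular index:

# Sketch: load the corpus with the index, memory-mapped to reduce RAM usage
retriever = BM25HF.load_from_hub(
    "xhluca/bm25s-nq-index", revision="main", load_corpus=True, mmap=True
)

# retrieve() can be unpacked into documents and their BM25 scores,
# each with shape (n_queries, k)
docs, scores = retriever.retrieve(bm25s.tokenize(query), k=3)
for i in range(docs.shape[1]):
    print(f"Rank {i + 1} (score={scores[0, i]:.2f}): {docs[0, i]}")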

Saving a bm25s index

You can save a bm25s index to the Hugging Face Hub. Here is an example:

import bm25s
from bm25s.hf import BM25HF

# Create a BM25 index and add documents
retriever = BM25HF()
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]
corpus_tokens = bm25s.tokenize(corpus)
retriever.index(corpus_tokens)

token = None  # You can get a token from the Hugging Face website
retriever.save_to_hub("xhluca/bm25s-nq-index", token=token)
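
To make the uploaded index self-contained, the raw documents can be attached when constructing the retriever. This is a minimal sketch continuing the example above (passing corpus= and saving locally first follow the general bm25s API and are assumptions for this workflow):

# Sketch: keep the raw documents with the index so retrieval can return
# full texts instead of only document ids
retriever = BM25HF(corpus=corpus)
retriever.index(bm25s.tokenize(corpus))

# Optionally save locally to inspect the generated files before uploading
retriever.save("bm25s-nq-index-local")
retriever.save_to_hub("xhluca/bm25s-nq-index", token=token)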

Stats

The index was built from a corpus with the following statistics:

Statistic                      Value
Number of documents            2,681,468
Number of tokens               116,237,970
Average tokens per document    43.35

Parameters

The index was created with the following parameters:

Parameter     Value
k1            1.5
b             0.75
delta         0.5
method        lucene
IDF method    lucene
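
For reference, k1 controls term-frequency saturation and b controls document-length normalization. A sketch of the standard Lucene-style BM25 score for a query q and document d is shown below (this is the textbook formulation; how bm25s applies the delta parameter in the lucene method is not specified here):

\[
\mathrm{score}(q, d) = \sum_{t \in q} \ln\!\left(1 + \frac{N - \mathrm{df}(t) + 0.5}{\mathrm{df}(t) + 0.5}\right) \cdot \frac{\mathrm{tf}(t, d)}{\mathrm{tf}(t, d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}
\]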