The BM25SRetriever for the wiki2021 corpus

The corpus was created by the Atlas project, and the index was built with the FlexRAG library.

Corpus Attribute	Value
Language	English
Domain	Wikipedia
Size	37.5M (33.1M text, 4.3M infobox)
Dump Date	Dec 2021
Provider	Atlas
License	CC-BY-SA 3.0

Index Attribute	Value
Index Type	BM25S
Index Method	Lucene
Preprocessing	LengthFilter(min_char=10, max_char=4096)
Provider	FlexRAG
License	CC-BY-SA 3.0
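To make the "Lucene" index method more concrete, here is a minimal pure-Python sketch of the Lucene-style BM25 formula, in which the IDF is `ln(1 + (N - df + 0.5) / (df + 0.5))` and the term-frequency component omits the `(k1 + 1)` numerator. It is only an illustration, not FlexRAG's or the index's actual implementation; the function name `bm25_lucene_scores`, the toy documents, and the values `k1=1.5` and `b=0.75` are assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_lucene_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query tokens (Lucene-style BM25).

    Hypothetical helper for illustration; k1 and b are assumed defaults.
    """
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization shared by every term of this document.
        norm = k1 * (1 - b + b * len(d) / avgdl)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] / (tf[term] + norm)
        scores.append(score)
    return scores


# Toy corpus, already lowercased and whitespace-tokenized.
docs = [
    "bruce wayne is batman".split(),
    "gotham city is fictional".split(),
    "clark kent is superman".split(),
]
scores = bm25_lucene_scores("who is bruce wayne".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Terms that occur in every document (here "is") still receive a small positive weight, because the Lucene-style IDF never drops below zero.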

Installation

You can install the FlexRAG library with pip:

pip install flexrag

Loading a `FlexRAG` retriever

You can use this retriever for information retrieval tasks. Here is an example:

from flexrag.retriever import LocalRetriever

# Load the retriever from the HuggingFace Hub
retriever = LocalRetriever.load_from_hub("FlexRAG/wiki2021_atlas_bm25s")

# Run a query against the index
results = retriever.search("Who is Bruce Wayne?")

Running the RAG application with the retriever

You can run the GUI application of the RAG assistant with this retriever. The example below assumes that the OPENAI_KEY environment variable holds your OpenAI API key:

python -m flexrag.entrypoints.run_interactive \
    assistant_type=modular \
    modular_config.used_fields=[title,text] \
    modular_config.retriever_type="FlexRAG/wiki2021_atlas_bm25s" \
    modular_config.response_type=original \
    modular_config.generator_type=openai \
    modular_config.openai_config.model_name='gpt-4o-mini' \
    modular_config.openai_config.api_key=$OPENAI_KEY \
    modular_config.do_sample=False

You can also run FlexRAG's RAG evaluation pipeline with this retriever. The following example evaluates the ModularAssistant with this retriever on the Natural Questions test split:

OUTPUT_PATH=<path_to_output>
DB_PATH=<path_to_database>
OPENAI_KEY=<your_openai_key>

python -m flexrag.entrypoints.run_assistant \
    name=nq \
    split=test \
    output_path=${OUTPUT_PATH} \
    assistant_type=modular \
    modular_config.used_fields=[title,text] \
    modular_config.retriever_type="FlexRAG/wiki2021_atlas_bm25s" \
    modular_config.generator_type=openai \
    modular_config.openai_config.model_name='gpt-4o-mini' \
    modular_config.openai_config.api_key=$OPENAI_KEY \
    modular_config.do_sample=False \
    eval_config.metrics_type=[retrieval_success_rate,generation_f1,generation_em] \
    eval_config.retrieval_success_rate_config.context_preprocess.processor_type=[simplify_answer] \
    eval_config.retrieval_success_rate_config.eval_field=text \
    eval_config.response_preprocess.processor_type=[simplify_answer]
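For intuition about the three metrics requested above, the following sketch shows how exact match, token-level F1, and retrieval success are commonly computed in open-domain QA, using SQuAD-style answer normalization. It is an assumption that FlexRAG's `simplify_answer`, `generation_em`, `generation_f1`, and `retrieval_success_rate` behave this way; the function names below are hypothetical stand-ins, not FlexRAG's API.

```python
import re
import string
from collections import Counter


def simplify_answer(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, golds):
    """1.0 if the normalized prediction equals any normalized gold answer."""
    pred = simplify_answer(prediction)
    return float(any(pred == simplify_answer(g) for g in golds))


def token_f1(prediction, golds):
    """Best token-overlap F1 between the prediction and any gold answer."""
    pred_tokens = simplify_answer(prediction).split()
    best = 0.0
    for g in golds:
        gold_tokens = simplify_answer(g).split()
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


def retrieval_success(contexts, golds):
    """1.0 if any normalized gold answer is a substring of any retrieved text."""
    simplified = [simplify_answer(c) for c in contexts]
    for g in golds:
        gold = simplify_answer(g)
        if gold and any(gold in c for c in simplified):
            return 1.0
    return 0.0


em = exact_match("The Batman.", ["batman"])
f1 = token_f1("Bruce Wayne is Batman", ["Batman"])
hit = retrieval_success(["Bruce Wayne is the secret identity of Batman."], ["Batman"])
print(em, f1, hit)
```

Averaging these per-question values over the test split gives dataset-level scores; normalizing both sides before comparison keeps casing and punctuation differences from counting as errors.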

License

Because the corpus is released under the CC-BY-SA 3.0 license, the retriever is distributed under the same license.
