The BM25SRetriever for the wiki2021 corpus

The corpus was created by the Atlas project and the index was built using the FlexRAG library.

| Corpus Attribute | Value |
| --- | --- |
| Language | English |
| Domain | Wikipedia |
| Size | 37.5M (33.1M text, 4.3M infobox) |
| Dump Date | Dec 2021 |
| Provider | Atlas |
| License | CC-BY-SA 3.0 |

| Index Attribute | Value |
| --- | --- |
| Index Type | BM25S |
| Index Method | Lucene |
| Preprocessing | LengthFilter(min_char=10, max_char=4096) |
| Provider | FlexRAG |
| License | CC-BY-SA 3.0 |
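
To make the index attributes above concrete, here is a minimal sketch of a character-length filter followed by Lucene-style BM25 scoring. It is an illustration only, not FlexRAG's or BM25S's actual implementation: the helper names `length_filter` and `bm25_lucene_scores`, the inclusive length bounds, the whitespace tokenizer, and the parameters `k1=1.5` and `b=0.75` are all assumptions made for this example.

```python
import math
from collections import Counter


def length_filter(docs, min_char=10, max_char=4096):
    """Keep documents whose character length lies in [min_char, max_char].

    Inclusive bounds are an assumption made for this sketch.
    """
    return [d for d in docs if min_char <= len(d) <= max_char]


def bm25_lucene_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a Lucene-style BM25 formula."""
    # Naive lowercase whitespace tokenization (assumption for illustration).
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        # Length normalization term for this document.
        norm = k1 * (1 - b + b * len(toks) / avgdl)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Lucene-style IDF stays non-negative even for very common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] / (tf[term] + norm)
        scores.append(score)
    return scores


corpus = [
    "Bruce Wayne is the secret identity of Batman.",
    "Gotham City is a fictional city in DC Comics.",
    "short",  # removed by the length filter (fewer than 10 characters)
]
kept = length_filter(corpus)
scores = bm25_lucene_scores("bruce wayne", kept)
best = kept[scores.index(max(scores))]
print(best)
```

The filter runs before indexing, so passages that are too short or too long never receive a score.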

Installation

You can install the FlexRAG library with pip:

pip install flexrag

Loading a FlexRAG retriever

You can use this retriever for information retrieval tasks. Here is an example:

from flexrag.retriever import LocalRetriever

# Load the retriever from the HuggingFace Hub
retriever = LocalRetriever.load_from_hub("FlexRAG/wiki2021_atlas_bm25s")

# Run a query against the index
results = retriever.search("Who is Bruce Wayne?")

Running the RAG application with the retriever

You can run the GUI application of the RAG assistant with this retriever. Here is an example:

python -m flexrag.entrypoints.run_interactive \
    assistant_type=modular \
    modular_config.used_fields=[title,text] \
    modular_config.retriever_type="FlexRAG/wiki2021_atlas_bm25s" \
    modular_config.response_type=original \
    modular_config.generator_type=openai \
    modular_config.openai_config.model_name='gpt-4o-mini' \
    modular_config.openai_config.api_key=$OPENAI_KEY \
    modular_config.do_sample=False

You can also run FlexRAG's RAG evaluation pipeline with this retriever. The following example evaluates the ModularAssistant with this retriever on the Natural Questions test split:

OUTPUT_PATH=<path_to_output>
DB_PATH=<path_to_database>
OPENAI_KEY=<your_openai_key>

python -m flexrag.entrypoints.run_assistant \
    name=nq \
    split=test \
    output_path=${OUTPUT_PATH} \
    assistant_type=modular \
    modular_config.used_fields=[title,text] \
    modular_config.retriever_type="FlexRAG/wiki2021_atlas_bm25s" \
    modular_config.generator_type=openai \
    modular_config.openai_config.model_name='gpt-4o-mini' \
    modular_config.openai_config.api_key=$OPENAI_KEY \
    modular_config.do_sample=False \
    eval_config.metrics_type=[retrieval_success_rate,generation_f1,generation_em] \
    eval_config.retrieval_success_rate_config.context_preprocess.processor_type=[simplify_answer] \
    eval_config.retrieval_success_rate_config.eval_field=text \
    eval_config.response_preprocess.processor_type=[simplify_answer]
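
For readers unfamiliar with the three metrics requested above, the sketch below shows how exact match, token-level F1, and retrieval success are commonly computed in open-domain QA evaluation. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and articles); whether FlexRAG's `simplify_answer` processor and metric classes behave exactly this way is an assumption, and every function name here is hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, answers):
    """1.0 if the normalized prediction equals any normalized gold answer."""
    return float(any(normalize(prediction) == normalize(a) for a in answers))


def token_f1(prediction, answers):
    """Best token-overlap F1 between the prediction and any gold answer."""
    pred = normalize(prediction).split()
    best = 0.0
    for answer in answers:
        gold = normalize(answer).split()
        # Multiset intersection counts shared tokens with multiplicity.
        overlap = sum((Counter(pred) & Counter(gold)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred)
        recall = overlap / len(gold)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


def retrieval_success(contexts, answers):
    """1.0 if any gold answer appears in any retrieved passage."""
    return float(
        any(normalize(a) in normalize(c) for c in contexts for a in answers)
    )


gold = ["Batman"]
passages = ["Bruce Wayne is the secret identity of Batman."]
print(exact_match("The Batman", gold))
print(token_f1("Bruce Wayne, the Batman", gold))
print(retrieval_success(passages, gold))
```

Normalizing both sides before comparison keeps formatting differences such as a leading "The" or trailing punctuation from being counted as errors.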

License

Because the corpus is licensed under CC-BY-SA 3.0, the retriever is released under the same license.

Related Links

FlexRAG Related Links:
