# Wiki-RAG

This repository hosts a prebuilt FAISS index and metadata for Retrieval-Augmented Generation (RAG) over English Wikipedia.

Hugging Face Hub: [royrin/wiki-rag](https://huggingface.co/royrin/wiki-rag)

A quick start for downloading all of English Wikipedia and loading it into a RAG setup. Given a query, the retriever returns the relevant Wikipedia article directly. It runs entirely offline, which saves requests to Wikipedia.

Note: The index is built from the first 3 paragraphs of each Wikipedia page. To get the full page, read it from a local copy of Wikipedia or make an API call for that page.
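
As a rough illustration of the "make an API call" option, the sketch below builds a request to the public MediaWiki API for the plain text of one page. The helper names (`build_extract_url`, `fetch_full_page`) and the `User-Agent` string are hypothetical and not part of this repository. The query parameters assume the `extracts` property of the English Wikipedia API.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title: str) -> str:
    """Build a MediaWiki API URL that requests the plain text of `title`."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,  # plain text instead of HTML
        "redirects": 1,    # follow redirects to the target page
        "format": "json",
        "titles": title,
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def fetch_full_page(title: str) -> str:
    """Fetch the full plain text of one page (performs one HTTP request)."""
    request = Request(
        build_extract_url(title),
        headers={"User-Agent": "wiki-rag-example/0.1"},  # hypothetical agent string
    )
    with urlopen(request, timeout=30) as response:
        data = json.load(response)
    # The response is keyed by page id; take the single page we asked for.
    page = next(iter(data["query"]["pages"].values()))
    return page.get("extract", "")
```

Calling `fetch_full_page("Alan Turing")` issues a single request, so it only makes sense for the handful of pages the retriever returns, not for bulk access.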

Similar tools exist, but none work quite like this. Others require many HTTP requests to Wikipedia, for example the [LlamaIndex Wikipedia reader](https://llamahub.ai/l/readers/llama-index-readers-wikipedia?from=).

Wikipedia download date: April 10, 2025. Page-view data comes from `https://dumps.wikimedia.org/other/pageviews/2024/2024-12/`.

## Usage
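
```python
import pickle

import faiss
from huggingface_hub import hf_hub_download

# PATH_TO_INDEX and PATH_TO_METADATA are placeholders: set them to the
# index and metadata filenames listed in the Hub repository.

# Download index and metadata
index_path = hf_hub_download("royrin/wiki-rag", PATH_TO_INDEX)
meta_path = hf_hub_download("royrin/wiki-rag", PATH_TO_METADATA)

# Load FAISS index
index = faiss.read_index(index_path)

# Load metadata (assumes a pickle file, as the pickle import suggests)
with open(meta_path, "rb") as f:
    metadata = pickle.load(f)
```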
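
Once the index is loaded, `index.search(query_vectors, k)` returns a pair of arrays, distances and ids, each with one row per query. The sketch below shows one way to map those ids back to metadata entries. It assumes the metadata is a sequence in which position `i` describes vector `i`, which this README does not confirm. `collect_hits` is a hypothetical helper, and the embedding model that produces the query vectors is not specified here.

```python
from typing import Any, Sequence


def collect_hits(
    distances: Sequence[Sequence[float]],
    ids: Sequence[Sequence[int]],
    metadata: Sequence[Any],
) -> list[list[tuple[Any, float]]]:
    """Pair each returned id with its metadata entry, one list per query."""
    results = []
    for row_distances, row_ids in zip(distances, ids):
        hits = []
        for distance, idx in zip(row_distances, row_ids):
            if idx < 0:  # FAISS pads with -1 when fewer than k results exist
                continue
            hits.append((metadata[int(idx)], float(distance)))
        results.append(hits)
    return results


# Usage sketch (query_vectors must be a float32 array of shape
# (n_queries, index.d), embedded with the same model as the index):
#   distances, ids = index.search(query_vectors, 5)
#   hits = collect_hits(distances, ids, metadata)
```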

# Do it yourself, from scratch

1. Download the full Wikipedia dump (~22 GB; about 2 hours with wget, ~30 minutes with aria2c):

   `aria2c -x 16 -s 16 https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2`

2. Extract Wikipedia into machine-readable JSON:

   `python3 WikiExtractor.py ../enwiki-latest-pages-articles.xml.bz2 -o extracted --json`

3. Get the list of the top 100k or 1M articles by page views from:

   `https://dumps.wikimedia.org/other/pageviews/2024/2024-12/`

4. Load the abstracts (the first 3 paragraphs of each article) into the RAG index.
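
Steps 3 and 4 can be sketched roughly as follows. This is not the repository's actual pipeline. It assumes each page-view line has the form `domain_code page_title view_count response_size` and that the extractor output has one JSON object per line with `title` and `text` fields. All function names are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def top_titles(lines: Iterable[str], n: int, domain: str = "en") -> list[str]:
    """Sum view counts per title for one domain code; return the n most viewed."""
    views = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4 or parts[0] != domain:
            continue  # skip malformed lines and other projects/languages
        _, title, count, _ = parts
        if count.isdigit():
            # Page-view titles use underscores; extracted titles use spaces.
            views[title.replace("_", " ")] += int(count)
    return [title for title, _ in views.most_common(n)]


def abstract(text: str, n_paragraphs: int = 3) -> str:
    """Keep the first n non-empty paragraphs of an extracted article."""
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    return "\n".join(paragraphs[:n_paragraphs])


def abstracts_for(json_lines: Iterable[str], wanted: set[str]) -> dict[str, str]:
    """Map title -> abstract for extracted articles whose title is in `wanted`."""
    out = {}
    for line in json_lines:
        article = json.loads(line)
        if article["title"] in wanted:
            out[article["title"]] = abstract(article["text"])
    return out
```

With the default `domain="en"`, only desktop English Wikipedia views are counted. Mobile views appear under a separate domain code and would need to be summed in as well for a complete ranking. The resulting abstracts still have to be embedded and added to a FAISS index, which is outside this sketch.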

# Helpful links

1. Wikipedia downloads: `https://dumps.wikimedia.org/enwiki/latest/`

2. Wikipedia page views: `https://dumps.wikimedia.org/other/pageviews/2024/2024-12/`