Delete wiki_index__top_100000__2025-04-11
# Wiki-RAG

This repository hosts a prebuilt FAISS index and metadata for Retrieval-Augmented Generation (RAG) over English Wikipedia.

📍 Hugging Face Hub: [royrin/wiki-rag](https://huggingface.co/royrin/wiki-rag)

This gives you a quick start: download all of English Wikipedia and load it into a RAG that returns the relevant Wikipedia article directly. It runs entirely offline, so it saves on requests to Wikipedia.

Note: the index is built from the first 3 paragraphs of each Wikipedia page. To get the full page, you can read it from a local copy of Wikipedia or make an API call for that page.
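For the API-call route, one option is the MediaWiki `extracts` endpoint. The helper below is a hypothetical sketch that only builds the request (endpoint and parameter names are standard MediaWiki; the actual HTTP call is left commented out so the snippet stays offline):

```python
def build_fullpage_request(title: str):
    """Build a MediaWiki API request for the plain-text extract of one page."""
    url = "https://en.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "prop": "extracts",     # TextExtracts extension
        "explaintext": 1,       # plain text instead of HTML
        "titles": title,
        "format": "json",
    }
    return url, params

url, params = build_fullpage_request("Alan Turing")
# import requests
# full_text = requests.get(url, params=params).json()
```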
Similar tools exist, but nothing quite like this: alternatives such as [llama-index-readers-wikipedia](https://llamahub.ai/l/readers/llama-index-readers-wikipedia?from=) require many HTTP requests to Wikipedia.
Wikipedia was downloaded on April 10, 2025; page-view counts come from `https://dumps.wikimedia.org/other/pageviews/2024/2024-12/`.

## 🛠️ Usage

```python
from huggingface_hub import hf_hub_download
import faiss
import pickle

# Download index and metadata
index_path = hf_hub_download("royrin/wiki-rag", PATH_TO_INDEX)
meta_path = hf_hub_download("royrin/wiki-rag", PATH_TO_METADATA)

# Load FAISS index
index = faiss.read_index(index_path)

# Load metadata (assumed to map index rows back to article info)
with open(meta_path, "rb") as f:
    metadata = pickle.load(f)
```

# Do it for yourself, from Scratch

1. Download the full Wikipedia dump (~22 GB; about 2 hours over wget, ~30 min using aria2c):
   `aria2c -x 16 -s 16 https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2`
2. Extract Wikipedia into machine-readable JSON:
   `python3 WikiExtractor.py ../enwiki-latest-pages-articles.xml.bz2 -o extracted --json`
3. Get the list of the top 100k or 1M articles, by page views, from
   `https://dumps.wikimedia.org/other/pageviews/2024/2024-12/`
4. Load the abstracts into the RAG.
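Step 3 can be sketched as follows. Each line of a pageviews dump has the form `domain page_title views bytes`; the snippet ranks titles from a small inline sample (the sample data is invented for illustration):

```python
from collections import Counter

# Hypothetical excerpt of a pageviews dump: "domain page_title views bytes"
sample = """\
en Albert_Einstein 120 0
en Python_(programming_language) 300 0
de Albert_Einstein 50 0
en Albert_Einstein 80 0
"""

counts = Counter()
for line in sample.splitlines():
    domain, title, views, _ = line.split(" ")
    if domain == "en":                # keep English Wikipedia only
        counts[title] += int(views)

top = [t for t, _ in counts.most_common(2)]
# top == ['Python_(programming_language)', 'Albert_Einstein']
```

Against a real dump you would stream every hourly file for the month and keep the top 100k (or 1M) titles by summed views.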

# Helpful Links

1. Wikipedia downloads: `https://dumps.wikimedia.org/enwiki/latest/`
2. Wikipedia page views: `https://dumps.wikimedia.org/other/pageviews/2024/2024-12/`