Delete wiki_index__top_100000__2025-04-11
# Wiki-RAG

This repository hosts a prebuilt FAISS index and metadata for Retrieval-Augmented Generation (RAG) over English Wikipedia.

📍 Hugging Face Hub: [royrin/wiki-rag](https://huggingface.co/royrin/wiki-rag)

This gives you a quick start: download all of English Wikipedia and load it into a RAG that returns the relevant Wikipedia article directly. It runs entirely offline, so it saves on requests to Wikipedia.

Note: the index is built from the first 3 paragraphs of each Wikipedia page. To get the full page, you can read it from a local copy of Wikipedia or make an API call for that page.
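For the API-call route, one option is the MediaWiki `extracts` endpoint. The helper below is a hypothetical sketch that only builds the request (endpoint and parameter names are standard MediaWiki; the actual HTTP call is left commented out so the snippet stays offline):

```python
def build_fullpage_request(title: str):
    """Build a MediaWiki API request for the plain-text extract of one page."""
    url = "https://en.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "prop": "extracts",     # TextExtracts extension
        "explaintext": 1,       # plain text instead of HTML
        "titles": title,
        "format": "json",
    }
    return url, params

url, params = build_fullpage_request("Alan Turing")
# import requests
# full_text = requests.get(url, params=params).json()
```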
Similar tools exist, but nothing quite like this: alternatives such as [llama-index-readers-wikipedia](https://llamahub.ai/l/readers/llama-index-readers-wikipedia?from=) require many HTTP requests to Wikipedia.
Wikipedia was downloaded on April 10, 2025; page-view counts come from `https://dumps.wikimedia.org/other/pageviews/2024/2024-12/`.

## 🛠️ Usage

```python
from huggingface_hub import hf_hub_download
import faiss
import pickle

# Download index and metadata
index_path = hf_hub_download("royrin/wiki-rag", PATH_TO_INDEX)
meta_path = hf_hub_download("royrin/wiki-rag", PATH_TO_METADATA)

# Load FAISS index
index = faiss.read_index(index_path)

# Load metadata (assumed to map index rows back to article info)
with open(meta_path, "rb") as f:
    metadata = pickle.load(f)
```

# Do it for yourself, from Scratch

1. Download the full Wikipedia dump (~22 GB; about 2 hours over wget, ~30 min using aria2c):
   `aria2c -x 16 -s 16 https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2`
2. Extract Wikipedia into machine-readable JSON:
   `python3 WikiExtractor.py ../enwiki-latest-pages-articles.xml.bz2 -o extracted --json`
3. Get the list of the top 100k or 1M articles, by page views, from
   `https://dumps.wikimedia.org/other/pageviews/2024/2024-12/`
4. Load the abstracts into the RAG.
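Step 3 can be sketched as follows. Each line of a pageviews dump has the form `domain page_title views bytes`; the snippet ranks titles from a small inline sample (the sample data is invented for illustration):

```python
from collections import Counter

# Hypothetical excerpt of a pageviews dump: "domain page_title views bytes"
sample = """\
en Albert_Einstein 120 0
en Python_(programming_language) 300 0
de Albert_Einstein 50 0
en Albert_Einstein 80 0
"""

counts = Counter()
for line in sample.splitlines():
    domain, title, views, _ = line.split(" ")
    if domain == "en":                # keep English Wikipedia only
        counts[title] += int(views)

top = [t for t, _ in counts.most_common(2)]
# top == ['Python_(programming_language)', 'Albert_Einstein']
```

Against a real dump you would stream every hourly file for the month and keep the top 100k (or 1M) titles by summed views.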

# Helpful Links

1. Wikipedia downloads: `https://dumps.wikimedia.org/enwiki/latest/`
2. Wikipedia page views: `https://dumps.wikimedia.org/other/pageviews/2024/2024-12/`