royrin committed (verified)
Commit 229f6d5 · 1 Parent(s): 24f2dd5

Delete wiki_index__top_100000__2025-04-11

Files changed (1):
  1. wiki_index__top_100000__2025-04-11 +0 -52
wiki_index__top_100000__2025-04-11 DELETED
@@ -1,52 +0,0 @@
# Wiki-RAG

This repository hosts a prebuilt FAISS index and metadata for Retrieval-Augmented Generation (RAG) over English Wikipedia.

📍 Hugging Face Hub: [royrin/wiki-rag](https://huggingface.co/royrin/wiki-rag)

Quick start: download all of English Wikipedia and load it into a RAG. The code gives you a RAG that returns the relevant Wikipedia article directly. It runs entirely offline, so it saves requests to Wikipedia.

Note: the index is built from the first 3 paragraphs of each Wikipedia page. To get the full page, read it from a local copy of Wikipedia or make an API call for that page.
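For the API-call route, here is a minimal sketch using the standard MediaWiki Action API; the `requests` dependency, the helper name, and the example title are illustrative choices, not part of this repository.

```python
import requests

def fetch_full_page(title: str) -> str:
    """Fetch the plain text of one Wikipedia article via the MediaWiki Action API."""
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "extracts",   # TextExtracts: plain-text page content
            "explaintext": 1,
            "redirects": 1,
            "titles": title,
            "format": "json",
        },
        timeout=30,
    )
    response.raise_for_status()
    pages = response.json()["query"]["pages"]
    # "pages" is keyed by page id; we asked for a single title
    return next(iter(pages.values())).get("extract", "")

# The title would normally come from the RAG's retrieved metadata
print(fetch_full_page("Alan Turing")[:500])
```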
There are similar tools, but nothing quite like this: most alternatives make many HTTP requests to Wikipedia (for example, https://llamahub.ai/l/readers/llama-index-readers-wikipedia?from=).

Date of the Wikipedia download: April 10, 2025, from `https://dumps.wikimedia.org/other/pageviews/2024/2024-12/`.

## 🛠️ Usage
```python
from huggingface_hub import hf_hub_download
import faiss
import pickle

# Download the index and metadata from the Hub
# (PATH_TO_INDEX / PATH_TO_METADATA are the index and metadata filenames in the repo)
index_path = hf_hub_download("royrin/wiki-rag", PATH_TO_INDEX)
meta_path = hf_hub_download("royrin/wiki-rag", PATH_TO_METADATA)

# Load the FAISS index
index = faiss.read_index(index_path)

# Load the metadata (assumed to be pickled, per the import above)
with open(meta_path, "rb") as f:
    metadata = pickle.load(f)
```
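Querying the index then looks roughly like the sketch below. The README does not say which embedding model was used to build the index or what exact structure the metadata has, so the sentence-transformers model and the row-id → title lookup here are placeholder assumptions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # placeholder: must match the model the index was built with

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in model name
query = "Who broke the Enigma cipher?"
query_vec = np.asarray(model.encode([query], normalize_embeddings=True), dtype="float32")

# Retrieve the 5 nearest abstracts from the FAISS index loaded above
distances, ids = index.search(query_vec, 5)

for rank, row_id in enumerate(ids[0], start=1):
    # Assumes `metadata` maps FAISS row ids to article titles; adapt to the real schema
    print(rank, metadata[row_id])
```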
# Do it yourself, from scratch
1. Download the full Wikipedia dump (~22 GB; about 2 hours over wget, ~30 minutes with aria2c):
   `aria2c -x 16 -s 16 https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2`
2. Extract Wikipedia into machine-readable JSON:
   `python3 WikiExtractor.py ../enwiki-latest-pages-articles.xml.bz2 -o extracted --json`
3. Get the list of the top 100k or 1M articles by page views from
   `https://dumps.wikimedia.org/other/pageviews/2024/2024-12/` (see the page-view sketch after this list).
4. Load the abstracts into the RAG index (see the index-building sketch after this list).
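For step 3, a minimal sketch for ranking articles by page views. It assumes the standard layout of the hourly pageview dumps (space-separated `domain_code page_title count_views total_response_size`, with domain code `en` for English Wikipedia desktop traffic), and it reads one hypothetical hourly file; a real top-100k list would sum counts across all hourly files for the month.

```python
import gzip
from collections import Counter

# Hypothetical local path to one hourly pageviews file from
# https://dumps.wikimedia.org/other/pageviews/2024/2024-12/
PAGEVIEWS_FILE = "pageviews-20241201-000000.gz"
TOP_N = 100_000

counts = Counter()
with gzip.open(PAGEVIEWS_FILE, "rt", encoding="utf-8", errors="replace") as f:
    for line in f:
        parts = line.split(" ")
        if len(parts) < 3:
            continue
        domain, title, views = parts[0], parts[1], parts[2]
        if domain == "en":  # English Wikipedia, desktop
            counts[title] += int(views)

top_titles = [title for title, _ in counts.most_common(TOP_N)]
print(top_titles[:10])
```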
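For step 4, a sketch of building the index from the extracted abstracts. WikiExtractor's `--json` mode writes one JSON object per line with `title` and `text` fields; the sentence-transformers model, the output filenames, and the flat inner-product index are placeholder choices, not necessarily what the hosted index uses.

```python
import glob
import json
import pickle

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer  # placeholder embedder

# Collect (title, abstract) pairs from WikiExtractor's JSON-lines output,
# keeping only the top articles found in the page-view step above.
keep = {t.replace("_", " ") for t in top_titles}  # dump titles use underscores
docs = []
for path in glob.glob("extracted/**/wiki_*", recursive=True):
    with open(path, encoding="utf-8") as f:
        for line in f:
            article = json.loads(line)
            if article["title"] not in keep:
                continue
            paragraphs = [p for p in article["text"].split("\n") if p.strip()]
            abstract = "\n".join(paragraphs[:3])  # first 3 paragraphs, per the note above
            docs.append((article["title"], abstract))

# Embed the abstracts and build a flat inner-product index (cosine on normalized vectors)
model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in model name
embeddings = np.asarray(
    model.encode([a for _, a in docs], normalize_embeddings=True, show_progress_bar=True),
    dtype="float32",
)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# Persist the index and a row-id -> title metadata list (placeholder filenames)
faiss.write_index(index, "wiki_index.faiss")
with open("wiki_metadata.pkl", "wb") as f:
    pickle.dump([t for t, _ in docs], f)
```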
# Helpful Links:
1. Wikipedia downloads: `https://dumps.wikimedia.org/enwiki/latest/`
2. Wikipedia page views: `https://dumps.wikimedia.org/other/pageviews/2024/2024-12/`