updated-index-july (#1)
Opened by ffuuugor
Updated the wiki FAISS indices to address a few data quality issues:
- Corrupted outputs from wikiextractor are fixed (PR #345).
- The minimum article length is increased from 100 to 500 characters, as a catch-all against corrupted data.
RAG seems to misbehave on very short articles with corrupted content: it assigns them a very high similarity score for almost any query. For instance, with the old index the article "Jo Marie Payton-France" is ranked:
- top-2 for query "Virulence"
- top-2 for query "Influenza"
- top-5 for query "Apple"
- top-7 for query "UK"
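For illustration, the minimum-length filter described above could look roughly like the sketch below. This is not the code from this PR: the function name `filter_short_articles`, the dict layout with a `text` field, the whitespace stripping, and the inclusive boundary are all assumptions. Only the 500-character threshold comes from the description.

```python
# Threshold from this PR's description (previously 100).
MIN_ARTICLE_CHARS = 500


def filter_short_articles(articles, min_chars=MIN_ARTICLE_CHARS):
    """Keep articles whose stripped text has at least `min_chars` characters.

    `articles` is assumed to be a list of dicts with a "text" key;
    this layout is hypothetical and chosen only for the example.
    """
    return [a for a in articles if len(a["text"].strip()) >= min_chars]


# Toy data, not real Wikipedia articles.
articles = [
    {"id": 1, "title": "Long enough", "text": "x" * 600},
    {"id": 2, "title": "Too short", "text": "x" * 120},
]
kept = filter_short_articles(articles)
print([a["title"] for a in kept])
```

Filtering before embedding keeps very short (and often corrupted) articles out of the index entirely, so they cannot be retrieved for unrelated queries.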
Index comparison
=== Query: Influenza ===
Old index results (April):
- Influenza (ID: 19572217)
- Jo Marie Payton-France (ID: 18393596)
- Human mortality from H5N1 (ID: 10615296)
- Super monkey ball adventure (ID: 22947958)
- Influenza A virus subtype H3N2 (ID: 2957262)
- J-TREC (ID: 36794400)
- 2011 Svalbard Polar Bear Attack (ID: 32990749)
- 2014 world series (ID: 79606406)
- Homages (ID: 27175720)
- Obsidian (disambiguation) (ID: 2501281)
New index results (July):
- Influenza (ID: 19572217)
- Human mortality from H5N1 (ID: 10615296)
- Influenza A virus subtype H3N2 (ID: 2957262)
- Influenza A virus (ID: 440479)
- Avian influenza (ID: 442916)
- 2009 swine flu pandemic (ID: 22555940)
- Influenza C virus (ID: 3833671)
- List of academic databases and search engines (ID: 1005923)
- United States influenza statistics by flu season (ID: 63599668)
- Influenza-like illness (ID: 19980133)
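A comparison like the one above can also be summarized programmatically. The sketch below is not the script used to produce these results; the helper name `compare_top_k` is hypothetical. It takes the article IDs listed for the "Influenza" query and reports which results were kept, dropped, and added between the two indices.

```python
# Article IDs copied from the "Influenza" results listed above.
old_ids = [19572217, 18393596, 10615296, 22947958, 2957262,
           36794400, 32990749, 79606406, 27175720, 2501281]
new_ids = [19572217, 10615296, 2957262, 440479, 442916,
           22555940, 3833671, 1005923, 63599668, 19980133]


def compare_top_k(old, new):
    """Split two ranked ID lists into kept, dropped, and added entries.

    Order within each list follows the original ranking.
    """
    old_set, new_set = set(old), set(new)
    return {
        "kept": [i for i in old if i in new_set],
        "dropped": [i for i in old if i not in new_set],
        "added": [i for i in new if i not in old_set],
    }


diff = compare_top_k(old_ids, new_ids)
for name, ids in diff.items():
    print(f"{name}: {len(ids)} -> {ids}")
```

For this query, the "dropped" list contains "Jo Marie Payton-France" (ID 18393596) along with the other unrelated results from the April index.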
ffuuugor changed pull request status to open
royrin changed pull request status to merged