Updated the wiki FAISS indices with two data quality fixes:

  1. Fixed corrupted outputs from wikiextractor (PR #345).
  2. Raised the minimum article length from 100 to 500 characters, as a catch-all against corrupted data.
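
A minimal sketch of the length filter in item 2, for illustration only. The record layout (dicts with "title" and "text" keys), the function name, and measuring length on whitespace-stripped text are assumptions, not the actual pipeline code.

```python
# Threshold taken from this PR (previously 100).
MIN_ARTICLE_CHARS = 500


def filter_short_articles(articles, min_chars=MIN_ARTICLE_CHARS):
    """Keep only articles whose stripped text has at least min_chars characters.

    `articles` is assumed to be an iterable of dicts with "title" and "text" keys.
    """
    return [a for a in articles if len(a["text"].strip()) >= min_chars]


# Toy records, not real Wikipedia data.
articles = [
    {"title": "Long article", "text": "x" * 600},
    {"title": "Stub", "text": "x" * 120},
]
kept = filter_short_articles(articles)
print([a["title"] for a in kept])
```

A threshold like this is a blunt instrument: it also drops legitimate stubs, which is the trade-off implied by calling it a catch-all.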

RAG retrieval seems to break down on very short articles with corrupted content: it assigns them a very high similarity score for almost any query. For instance, the article "Jo Marie Payton-France" is currently ranked as

top-2 for query "Virulence"
top-2 for query "Influenza"
top-5 for query "Apple"
top-7 for query "UK"
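
One way to spot such articles systematically is to count how often each title shows up in the top-k results across unrelated queries. The sketch below assumes the retrieval results are already available as a mapping from query to a list of titles. The function name, the 0.5 cutoff, and the data are hypothetical placeholders, not output from the real index.

```python
from collections import Counter


def find_hub_articles(results_by_query, min_fraction=0.5):
    """Return titles that appear in the top-k of at least min_fraction of queries."""
    counts = Counter()
    for titles in results_by_query.values():
        counts.update(set(titles))  # count each title at most once per query
    threshold = min_fraction * len(results_by_query)
    return sorted(t for t, c in counts.items() if c >= threshold)


# Toy placeholder data, not real index output.
results_by_query = {
    "query_a": ["doc_1", "hub_doc", "doc_2"],
    "query_b": ["doc_3", "hub_doc", "doc_4"],
    "query_c": ["doc_5", "doc_6", "hub_doc"],
    "query_d": ["doc_7", "doc_8", "doc_9"],
}
hubs = find_hub_articles(results_by_query)
print(hubs)
```

The check is only meaningful when the queries are topically unrelated, because a genuinely relevant article can legitimately appear for many related queries.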

Index comparison

=== Query: Influenza ===

Old index results (April):

  1. Influenza (ID: 19572217)
  2. Jo Marie Payton-France (ID: 18393596)
  3. Human mortality from H5N1 (ID: 10615296)
  4. Super monkey ball adventure (ID: 22947958)
  5. Influenza A virus subtype H3N2 (ID: 2957262)
  6. J-TREC (ID: 36794400)
  7. 2011 Svalbard Polar Bear Attack (ID: 32990749)
  8. 2014 world series (ID: 79606406)
  9. Homages (ID: 27175720)
  10. Obsidian (disambiguation) (ID: 2501281)

New index results (July):

  1. Influenza (ID: 19572217)
  2. Human mortality from H5N1 (ID: 10615296)
  3. Influenza A virus subtype H3N2 (ID: 2957262)
  4. Influenza A virus (ID: 440479)
  5. Avian influenza (ID: 442916)
  6. 2009 swine flu pandemic (ID: 22555940)
  7. Influenza C virus (ID: 3833671)
  8. List of academic databases and search engines (ID: 1005923)
  9. United States influenza statistics by flu season (ID: 63599668)
  10. Influenza-like illness (ID: 19980133)
ffuuugor changed pull request status to open
royrin changed pull request status to merged
