About
As large language models (LLMs) continue to improve, evaluating how well they avoid hallucinations (producing information that is unfaithful or factually incorrect) has become increasingly important. While many models are presented as reliable, their factual grounding can vary significantly across tasks and settings.
This leaderboard provides a standardised evaluation of how different LLMs perform on hallucination detection tasks. Our goal is to help researchers and developers understand which models are more trustworthy in both grounded (context-based) and open-ended (real-world knowledge) settings. We use Verify by KlusterAI, an automated hallucination detection tool, to evaluate the factual consistency of model outputs.
Tasks
We evaluate each model using two benchmarks:
Retrieval-Augmented Generation (RAG setting)
RAG evaluates how well a model stays faithful to a provided context when answering a question. The input consists of a synthetic or real context paired with a relevant question. Models are expected to generate answers using only the information given, without adding external knowledge or contradicting the context.
- Source: HaluEval QA
- Dataset Size: 10,000 question-context pairs
- Prompt Format: Prompt with relevant context document
- Temperature: 0 (to enforce deterministic, grounded outputs)
- System Prompt: Instructs the model to use only the provided document and to avoid guessing (a request sketch follows this list).
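To make the setup concrete, here is a minimal sketch of how a single RAG prompt might be issued through an OpenAI-compatible chat API. The endpoint URL, model name, and system prompt wording are illustrative assumptions, not the leaderboard's exact configuration.

```python
# Minimal RAG-setting request sketch. The endpoint, API key handling, and
# system prompt wording are assumptions for illustration only.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example.com/v1",  # hypothetical OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)

def answer_with_context(model: str, context: str, question: str) -> str:
    """Answer a question using only the provided context document."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic, grounded outputs (RAG setting)
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer using only the information in the provided document. "
                    "If the document does not contain the answer, say so; do not guess."
                ),
            },
            {"role": "user", "content": f"Document:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```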
Real-World Knowledge (Non-RAG setting)
This setting evaluates how factually accurate a model is when no context is provided. The model must rely solely on its internal knowledge to answer a broad range of user questions across many topics. The answers are then verified using web search to determine factual correctness.
- Source: Filtered from UltraChat prompts
- Dataset Size: 11,746 single-turn user queries
- Prompt Format: Single user prompt without additional context
- Temperature: 1 (to reflect natural, fluent generation)
- System Prompt: Encourages the model to be helpful and accurate, and to acknowledge uncertainty rather than guess (a request sketch follows this list).
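The open-ended setting differs mainly in the prompt and sampling settings: no document is supplied and the temperature is 1. A hedged sketch, reusing the OpenAI-compatible `client` from the previous example:

```python
def answer_open_ended(model: str, question: str) -> str:
    """Answer from the model's internal knowledge; no context document is given."""
    response = client.chat.completions.create(  # `client` as defined in the RAG sketch
        model=model,
        temperature=1,  # natural, fluent generation (non-RAG setting)
        messages=[
            {
                "role": "system",
                "content": (
                    "Be helpful and accurate. If you are unsure about a fact, "
                    "say so rather than guessing."  # assumed wording, not the exact prompt
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```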
Evaluation Method
We use Verify, a hallucination detection tool built by KlusterAI, to classify model outputs:
- In the RAG setting, Verify checks whether the output contradicts the input document, fabricates unsupported details, or strays from what it states.
- In the real-world knowledge setting, Verify issues search queries to fact-check the answer against current, public information (a sketch of this classification step follows this list).
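To show how the two modes map onto code, here is a hedged sketch of a classification wrapper. The `check_grounded` and `check_with_search` callables are hypothetical stand-ins for Verify's grounded and search-based checks, not the tool's actual interface.

```python
from typing import Callable, Optional

# Hypothetical stand-ins for Verify's two checking modes (not the real API):
# each callable returns True when the answer is judged faithful / factually correct.
CheckFn = Callable[..., bool]

def is_hallucinated(
    answer: str,
    check_grounded: CheckFn,
    check_with_search: CheckFn,
    context: Optional[str] = None,
) -> bool:
    """Classify a single model output as hallucinated (True) or not (False)."""
    if context is not None:
        # RAG setting: judge the answer only against the provided document.
        return not check_grounded(answer=answer, document=context)
    # Non-RAG setting: fact-check the answer against web-search evidence.
    return not check_with_search(answer=answer)
```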
Each model's hallucination rate is computed as:
Hallucination Rate = (Number of hallucinated outputs) / (Total number of prompts)
A lower hallucination rate indicates better performance.
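In code, the aggregation is a simple proportion over per-output verdicts:

```python
def hallucination_rate(verdicts: list[bool]) -> float:
    """verdicts[i] is True if output i was flagged as hallucinated by the detector."""
    return sum(verdicts) / len(verdicts)

# Example: 3 hallucinated outputs out of 10 prompts -> 0.3 (lower is better).
print(hallucination_rate([True, False, False, True, False, False, False, True, False, False]))
```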