cleanup text
docs.md CHANGED
@@ -2,14 +2,6 @@
 keywords: hallucination detection documentation, LLM hallucination benchmark, RAG evaluation guide, Verify API, kluster.ai, retrieval-augmented generation evaluation, large language model accuracy
 -->
 
-# About
-
-As large language models (LLMs) continue to improve, evaluating how well they avoid hallucinations (producing information that is unfaithful or factually incorrect) has become increasingly important. While many models claim to be reliable, their factual grounding can vary significantly across tasks and settings.
-
-This leaderboard provides a standardised evaluation of how different LLMs perform on hallucination detection tasks. Our goal is to help researchers and developers understand which models are more trustworthy in both grounded (context-based) and open-ended (real-world knowledge) settings. We use [Verify](https://platform.kluster.ai/verify) by [kluster.ai](https://platform.kluster.ai/), an automated hallucination detection API, to evaluate the factual consistency of model outputs.
-
----
-
 # Tasks
 
 We evaluate each model using two benchmarks: