Ryan McConville
committed on
Commit · 27960a9
1 Parent(s): db0d8a2
fix char display
- introduction.md +5 -5
- submit.md +3 -3
introduction.md
CHANGED
@@ -6,16 +6,16 @@ The **LLM Hallucination Detection Leaderboard** is a public, continuously update

 ### Why does hallucination detection matter?

-* **User Trust & Safety
-* **Retrieval-Augmented Generation (RAG) Quality
-* **Regulatory & Compliance Pressure
+* **User Trust & Safety**: Hallucinations undermine confidence and can damage reputation.
+* **Retrieval-Augmented Generation (RAG) Quality**: In enterprise workflows, LLMs must remain faithful to supplied context. Measuring hallucinations highlights which models respect that constraint.
+* **Regulatory & Compliance Pressure**: Upcoming AI regulations require demonstrable accuracy standards. Reliable hallucination metrics can help you meet these requirements.

 ### How we measure hallucinations

 We evaluate each model on two complementary benchmarks and compute a *hallucination rate* (lower = better):

-1. **HaluEval-QA (RAG setting)
-2. **UltraChat Filtered (Non-RAG setting)
+1. **HaluEval-QA (RAG setting)**: Given a question *and* a supporting document, the model must answer *only* using the provided context.
+2. **UltraChat Filtered (Non-RAG setting)**: Open-domain questions with **no** extra context test the model's internal knowledge.

 Outputs are automatically verified by [Verify](https://platform.kluster.ai/verify) from [kluster.ai](https://kluster.ai/), which cross-checks claims against the source document or web results.

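Reviewer note on the `introduction.md` hunk: the text names a *hallucination rate* (lower = better) but does not give its formula. The minimal Python sketch below assumes the rate is the share of verified outputs flagged as hallucinated. The function name `hallucination_rate` and the boolean `flags` input are hypothetical and are not taken from the leaderboard's code.

```python
def hallucination_rate(flags):
    """Return the fraction of outputs flagged as hallucinated (lower is better).

    `flags` is an iterable of booleans, one per verified output, where True
    means the verifier judged the output to be a hallucination. This definition
    is an assumption; the source does not state the exact formula.
    """
    flags = list(flags)
    if not flags:
        raise ValueError("need at least one verified output")
    return sum(1 for flag in flags if flag) / len(flags)


# One hallucinated answer out of four verified outputs.
rate = hallucination_rate([True, False, False, False])
print(f"{rate:.2%}")
```

Under this assumption, a model would get one rate per benchmark (HaluEval-QA and UltraChat Filtered), each computed over that benchmark's prompts.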
submit.md
CHANGED
@@ -18,15 +18,15 @@ Please email **[email protected]** with the subject line:

 Attach **one ZIP file** that contains **all of the following**:

-1. **`model_card.md
+1. **`model_card.md`**: A short Markdown file describing your model:
 • Name and version
 • Architecture / base model
 • Training or finetuning procedure
 • License
 • Intended use & known limitations
 • Contact information
-2. **`results.csv
-3. (Optional) **`extra_notes.md
+2. **`results.csv`**: A CSV file with **one row per prompt** and **one column per field** (see schema below).
+3. (Optional) **`extra_notes.md`**: Anything else you would like us to know (e.g., additional analysis).

 ---

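Reviewer note on the `submit.md` hunk: the list above requires one ZIP file containing `model_card.md` and `results.csv`, with `extra_notes.md` optional. A submitter could check an archive before emailing it with something like the sketch below. The helper name `check_submission` is hypothetical, and treating files inside a top-level folder as valid is an assumption; the source does not say how the archive should be laid out. The `results.csv` columns are not checked because the schema is not shown in this diff.

```python
import zipfile

# File names taken from the submit.md list; extra_notes.md is optional.
REQUIRED = {"model_card.md", "results.csv"}
OPTIONAL = {"extra_notes.md"}


def check_submission(zip_path):
    """Return the sorted list of required files missing from the ZIP."""
    with zipfile.ZipFile(zip_path) as zf:
        # Compare base names so files inside a folder still count
        # (an assumption about the expected archive layout).
        names = {name.rsplit("/", 1)[-1] for name in zf.namelist()}
    return sorted(REQUIRED - names)
```

An empty list means both required files are present; any names returned are the ones still to be added.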