Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information
Abstract
As Large Language Models (LLMs) become widely accessible, a detailed understanding of their knowledge within specific domains becomes necessary for successful real-world use. This is particularly critical in public health, where failure to retrieve relevant, accurate, and current information could significantly impact UK residents. However, little is currently known about LLM knowledge of UK Government public health information. To address this gap, this paper introduces a new benchmark, PubHealthBench, with over 8,000 questions, created via an automated pipeline, for evaluating LLMs' Multiple Choice Question Answering (MCQA) and free-form responses to public health queries. We also release a new dataset of the extracted UK Government public health guidance documents used as source text for PubHealthBench. Assessing 24 LLMs on PubHealthBench, we find that the latest private LLMs (GPT-4.5, GPT-4.1, and o1) have a high degree of knowledge, achieving >90% accuracy in the MCQA setup and outperforming humans with cursory search engine use. However, in the free-form setup we see lower performance, with no model scoring >75%. Therefore, whilst there are promising signs that state-of-the-art (SOTA) LLMs are an increasingly accurate source of public health information, additional safeguards or tools may still be needed when providing free-form responses on public health topics.
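To make the MCQA setup concrete, below is a minimal sketch of how MCQA accuracy could be scored, assuming each model response has already been reduced to a single option letter; this is an illustration only, not the paper's actual evaluation pipeline.

```python
# Hypothetical scoring helper (not from the paper): MCQA accuracy is the
# fraction of questions where the model's chosen option letter matches the
# gold answer letter.
def mcqa_accuracy(predictions: list[str], answers: list[str]) -> float:
    assert len(predictions) == len(answers), "one prediction per question"
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)

# Example: 2 of 3 questions answered correctly -> ~0.67
print(mcqa_accuracy(["A", "C", "B"], ["A", "C", "D"]))
```

Free-form responses, by contrast, cannot be scored by exact match, which is one reason the free-form setup is the harder of the two.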
Community
PubHealthBench is a benchmark designed to provide a broad assessment of LLM knowledge of current UK Government public health guidance. The full benchmark dataset is available at https://huggingface.co/datasets/Joshua-Harris/PubHealthBench.
Any comments or feedback would be really appreciated!
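For anyone who wants to explore the benchmark directly, here is a minimal sketch for loading it with the Hugging Face `datasets` library; the split and column names are not documented on this page, so the snippet inspects them at runtime rather than assuming a schema.

```python
# Load PubHealthBench from the Hugging Face Hub and inspect its structure.
from datasets import load_dataset

ds = load_dataset("Joshua-Harris/PubHealthBench")

print(ds)                               # available splits and sizes
first_split = next(iter(ds.values()))   # grab whichever split comes first
print(first_split.features)             # column names and types
print(first_split[0])                   # one example question record
```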
This is an automated message from the Librarian Bot. The following papers, similar to this paper, were recommended by the Semantic Scholar API:
- BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text (2025)
- MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks (2025)
- A Scalable Framework for Evaluating Health Language Models (2025)
- LLMs Outperform Experts on Challenging Biology Benchmarks (2025)
- GPBench: A Comprehensive and Fine-Grained Benchmark for Evaluating Large Language Models as General Practitioners (2025)
- Automatic Legal Writing Evaluation of LLMs (2025)
- Med-CoDE: Medical Critique based Disagreement Evaluation Framework (2025)