Comparing Open-source and Proprietary LLMs in Medical AI
Co-authored with: Nasir Hayat, Svetlana Maslenkova, Clément Christophe, Praveenkumar Kanithi, Ronnie Rajan
In the rapidly evolving world of artificial intelligence (AI), large language models (LLMs) are making waves across various industries, including healthcare. But how well do these models actually perform on medical tasks? In this post, we give a brief overview of recent evaluations of both closed- and open-source LLMs on popular medical benchmark datasets, and describe the methods we followed, the costs involved, and other relevant factors in obtaining these performance results.
Figure 1: Performance of LLMs in Medical Benchmarks
Key Takeaways
The proprietary edge persists: Closed-source models, led by GPT-4o and Claude Sonnet, maintain a performance lead in medical benchmarks; however, the gap is narrowing as open-source models continue to improve.
Size matters, but it's not everything: While larger models generally performed better, some smaller open-source models showed surprising competitiveness; this suggests that data, architecture and training strategies play crucial roles alongside model size.
General vs. specialized knowledge: These general-purpose LLMs demonstrate impressive medical knowledge, but their limitations in complex medical scenarios underscore the ongoing need for specialized medical AI development (e.g., see our proposed clinical LLMs [1]).
Beyond benchmarks: These benchmarks only scratch the surface of LLM capabilities in real-world clinical applications; more comprehensive evaluation frameworks, such as MEDIC [2], are required for responsible AI deployment in healthcare.
The open-source challenge: While currently trailing proprietary models in these benchmarks, open-source LLMs are rapidly evolving. Their competitive performance, combined with transparency and accessibility, positions them as important players in the medical AI landscape.
Why this matters
Assessing the performance of LLMs in medical contexts isn't just an academic exercise. Many of these models, especially proprietary ones, aren't easily accessible due to cost barriers. Moreover, comprehensive information about their performance is often not readily available. This lack of transparency can be problematic when considering their potential use in healthcare applications.
Medical benchmarks
To evaluate LLMs in medical contexts, researchers rely on a variety of benchmark datasets. In this brief overview, we focus on some of the most widely used multiple-choice question-answering datasets in the medical domain:
MedQA (USMLE): a dataset that contains questions similar to those on the United States Medical Licensing Examination (USMLE), and covers a wide range of medical topics and specialties (n = 1,273).
NEJM-QA: a dataset of board residency exam questions covering specialties such as internal medicine, general surgery, pediatrics, psychiatry, and obstetrics & gynecology (n = 614).
MMLU: this dataset includes only the medical-related subsets (clinical knowledge, college biology, college medicine, medical genetics, professional medicine and anatomy) from the broader MMLU benchmark (n = 1,089).
MMLU-Pro: an extension of MMLU that includes more challenging questions; only the health-related subset was kept (n = 818).
These datasets are popular choices for assessing medical AI capabilities as they cover a range of medical knowledge, from basic health information to more complex professional-level questions. The multiple-choice format allows for straightforward evaluation and comparison across different models.
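To make the setup more concrete, here is a minimal sketch of how such benchmarks might be loaded with the Hugging Face `datasets` library. The dataset identifiers, configuration names, and splits below are assumptions for illustration; the actual hosting and schema of each benchmark may differ.

```python
from datasets import load_dataset, concatenate_datasets

# Hypothetical Hub identifiers -- the actual dataset IDs and splits may differ.
MEDQA_ID = "GBaker/MedQA-USMLE-4-options"
MMLU_MEDICAL_SUBSETS = [
    "clinical_knowledge", "college_biology", "college_medicine",
    "medical_genetics", "professional_medicine", "anatomy",
]

# Load the MedQA test split (n = 1,273 questions).
medqa = load_dataset(MEDQA_ID, split="test")

# Load and concatenate the medical-related MMLU subsets (n = 1,089 questions).
mmlu_medical = concatenate_datasets([
    load_dataset("cais/mmlu", subset, split="test")
    for subset in MMLU_MEDICAL_SUBSETS
])
```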
Evaluation Methods
When assessing LLMs on this type of task, the choice of evaluation method can significantly impact the results, so it's important to understand these methods in order to interpret performance figures accurately. In our evaluation, we focused on a simple approach, but it's worth noting that various strategies exist: zero-shot, few-shot (which involves providing the model with a few example question-answer pairs before asking it to respond to the question of interest), and prompting strategies such as chain-of-thought prompting, which encourages the model to show its reasoning.
We used a zero-shot approach: the model is given a question without any additional context or examples. The prompt template used is shown below:
You are an AI designed to answer multiple-choice questions. For each question, select exactly one answer option. Do not provide explanations or commentary unless explicitly requested. Base your selection solely on the information given in the question and answer choices. If uncertain, choose the most likely correct answer based on the available information.
Question:
{question}
Options:
(A) {option_1}
(B) {option_2}
(C) {option_3}
(D) {option_4}
...
Correct answer: (
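To illustrate the zero-shot setup, here is a small sketch of how a question could be formatted with this template and sent to a model. The helper names and the use of an OpenAI-style chat client are illustrative assumptions; the post does not specify which client library was used, and other providers' APIs differ.

```python
from openai import OpenAI  # assumes an OpenAI-compatible API key in the environment

SYSTEM_PROMPT = (
    "You are an AI designed to answer multiple-choice questions. For each question, "
    "select exactly one answer option. Do not provide explanations or commentary "
    "unless explicitly requested. Base your selection solely on the information given "
    "in the question and answer choices. If uncertain, choose the most likely correct "
    "answer based on the available information."
)

def build_prompt(question: str, options: list[str]) -> str:
    """Format a question and its options using the zero-shot template above."""
    letters = "ABCDEFGH"
    option_lines = "\n".join(f"({letters[i]}) {opt}" for i, opt in enumerate(options))
    return f"Question:\n{question}\n\nOptions:\n{option_lines}\n\nCorrect answer: ("

client = OpenAI()

def answer_question(question: str, options: list[str], model: str = "gpt-4o") -> str:
    """Send a single zero-shot query and return the raw model response."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,   # deterministic decoding for reproducible evaluation
        max_tokens=5,    # only a single option letter is expected
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": build_prompt(question, options)},
        ],
    )
    return response.choices[0].message.content.strip()
```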
For our evaluation metric, we used exact matching based on the LLM's response. This means the model's answer must exactly match the correct answer (or option) to be considered correct. We report the accuracy as the proportion of questions correctly answered by each model.
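As an illustration of the scoring step, here is a minimal sketch of exact-match accuracy. The answer-extraction regex and the way responses and gold labels are represented are assumptions, not a description of our exact pipeline.

```python
import re

def extract_choice(response: str) -> str | None:
    """Pull the first option letter (A-H) out of the model's response, if any."""
    match = re.search(r"\(?([A-H])\)?", response.strip())
    return match.group(1) if match else None

def exact_match_accuracy(responses: list[str], gold_letters: list[str]) -> float:
    """Proportion of questions where the extracted letter equals the gold answer."""
    correct = sum(
        extract_choice(resp) == gold
        for resp, gold in zip(responses, gold_letters)
    )
    return correct / len(gold_letters)

# Example: three model responses scored against the gold answers (2 of 3 correct).
print(exact_match_accuracy(["A) Aspirin", "(C", "B"], ["A", "C", "D"]))
```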
While the method we used is straightforward, it's worth noting that other evaluation methods exist (as mentioned above). For an overview of evaluation methods for multiple-choice question answering, including exact matching, you can refer to [3]. It demonstrates that different evaluation methods can yield varying results; the specific approach is therefore important to consider when interpreting performance figures.
Performance Deep Dive
Now, let's dive into the core of our findings. Our evaluation encompassed both open-source and proprietary LLMs; the latter were accessed through their providers' APIs. The models evaluated are listed in the table below.
Model | Provider | License | Release Date |
---|---|---|---|
Claude 3.5 Sonnet | Anthropic | Proprietary | Jun 2024 |
Claude 3 Opus | Anthropic | Proprietary | Mar 2024 |
Gemini 1.5 Pro | Google | Proprietary | Feb 2024 |
GPT 4o | OpenAI | Proprietary | May 2024 |
GPT 4o mini | OpenAI | Proprietary | Jul 2024 |
Llama 3.1 405B | Meta | Llama 3.1 Community | Jul 2024 |
Llama 3.1 70B | Meta | Llama 3.1 Community | Jul 2024 |
Mistral Large 2 | Mistral | Mistral Research | Jul 2024 |
Nemotron 4 340B | Nvidia | NVIDIA Open Model | Jun 2024 |
Qwen 2.5 72B | Alibaba | Qianwen LICENSE | Sep 2024 |
Benchmark Performance Overview
Looking at the results across all benchmarks (see Figure 1), we can make a few interesting observations:
In general, closed-source models, particularly GPT 4o and Claude 3.5 Sonnet, demonstrated superior performance across the benchmarks; open-source models showed competitive results but generally lagged behind their proprietary counterparts.
Larger models tended to perform better overall, with some notable exceptions among open-source models; it's important to note that the sizes of proprietary models are not disclosed, making direct size-to-performance comparisons challenging.
All models performed relatively well on the medical-related subset of the MMLU benchmark. MMLU has become a popular benchmark featured on various AI leaderboards, and the widespread strong performance may indicate that LLMs are heavily fine-tuned and/or optimized for it. This raises questions about its continued ability to differentiate LLMs' capabilities in the medical domain.
Finally, while these general-purpose LLMs showed decent performance across medical benchmarks, they demonstrated limitations in certain areas, particularly with challenging medical cases (e.g., as observed with the NEJM-QA benchmark).
The Cost Factor: API Access and Evaluation Expenses
When considering the use of these models, it's essential to factor in the cost of API access for closed-source models. Providers like OpenAI, Google, and Anthropic offer powerful models, but their pricing can vary widely. Costs typically depend on factors like the number of tokens processed and the specific model used.
To give an idea of the expenses involved, let's look at the cost of evaluating models on the MedQA benchmark:
Anthropic's models: approximately $7.65 with Claude 3 Opus and $1.60 with Claude 3.5 Sonnet to generate answers for all MedQA questions.
OpenAI's models: around $0.65 with GPT 4o, and $0.02 when using the batch API, for the same benchmark dataset.
These figures highlight the cost considerations when working with proprietary LLMs at scale. For open-source models, we leveraged a high-performance computing cluster, utilizing two nodes equipped with 16 NVIDIA H100 GPUs each, to deploy and evaluate the models efficiently.
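To show where such figures come from, here is a back-of-the-envelope cost estimate based on token counts and per-million-token prices. The prices and token counts in the example are placeholder assumptions, not the actual rates charged by any provider.

```python
def estimate_cost(
    n_questions: int,
    avg_input_tokens: float,
    avg_output_tokens: float,
    input_price_per_m: float,   # USD per 1M input tokens (hypothetical)
    output_price_per_m: float,  # USD per 1M output tokens (hypothetical)
) -> float:
    """Estimate API cost in USD for answering a benchmark of multiple-choice questions."""
    input_cost = n_questions * avg_input_tokens * input_price_per_m / 1_000_000
    output_cost = n_questions * avg_output_tokens * output_price_per_m / 1_000_000
    return input_cost + output_cost

# Placeholder example: 1,273 MedQA questions, ~300 input tokens and ~5 output tokens each,
# at hypothetical prices of $2.50 / $10.00 per 1M input/output tokens.
print(f"${estimate_cost(1273, 300, 5, 2.50, 10.00):.2f}")
```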
The Need for Comprehensive Evaluations
While these benchmark results provide valuable insights, it's crucial to remember that real-world medical applications often require a more detailed evaluation. That's why we've proposed MEDIC [2], a new evaluation framework designed to assess LLMs more comprehensively across various medical use cases.
MEDIC aims to:
- Evaluate models on a broader range of medical tasks and clinical use-cases
- Consider factors like safety, bias, reasoning and data understanding
- Provide more actionable insights for specific healthcare applications using refined metrics to better capture the complexities of clinical language and decision-making
In conclusion, current benchmarks show promising results for LLMs in medical contexts. Nevertheless, there's still much work to be done in thoroughly evaluating these models for real-world healthcare applications. As these technologies continue to evolve, frameworks like MEDIC may play a crucial role in ensuring their safe and effective deployment in the medical field.
References
[1] Med42-v2: A Suite of Clinical LLMs.
[2] MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications.
[3] Beyond Metrics: A Critical Analysis of the Variability in LLM Evaluation Frameworks.