Performance Comparison: Llama-3.2 vs. Llama-3.1 LLMs and Smaller Models (3B, 1B) in Medical and Healthcare AI Domains πŸ©ΊπŸ§¬πŸ’Š

Community article · Published September 26, 2024

Exploring the performance of Meta's Llama models on medical knowledge tasks, comparing the new Llama-3.2 releases against Llama-3.1 and covering both the larger models (Llama-3.2 90B, Llama-3.1 70B) and the smaller ones (3B, 1B). Here's an overview of our findings, with no fine-tuning involved.
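The exact evaluation setup isn't spelled out here, but if you want a starting point for reproducing this kind of zero-shot benchmark run, here is a minimal sketch assuming the open-source lm-evaluation-harness (`pip install lm-eval`). The task names, Hub model id, and settings below are assumptions, not the exact configuration behind these numbers.

```python
# Hedged reproduction sketch with lm-evaluation-harness (v0.4.x); not necessarily
# the harness or settings used for the scores in this post. Requires access to
# the gated meta-llama checkpoints and multiple GPUs for a 70B model.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=meta-llama/Meta-Llama-3.1-70B-Instruct,"  # assumed Hub id
        "dtype=bfloat16,parallelize=True"
    ),
    tasks=[
        "mmlu_college_biology",
        "mmlu_professional_medicine",
        "mmlu_medical_genetics",
        "mmlu_clinical_knowledge",
        "pubmedqa",
    ],
    num_fewshot=0,        # assumption: zero-shot prompting
    batch_size="auto",
)

# In harness v0.4.x the accuracy metric is keyed as "acc,none".
for task, metrics in results["results"].items():
    print(task, metrics.get("acc,none"))
```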

*(Figure: benchmark results across the evaluated models.)*

Key Observations

Llama-3.1 70B Outperforms Llama-3.2 90B

Despite its higher parameter count, Llama-3.2 90B was outperformed by Llama-3.1 70B, especially on specialized tasks such as MMLU College Biology and MMLU Professional Medicine.

Meta-Llama-3.2-90B Vision Instruct and Base Models: Are They the Same?

One intriguing finding was the identical performance of the Meta-Llama-3.2-90B Vision Instruct and Base models across all datasets, which is unusual for instruction-tuned models. More on that below.

Detailed Comparison

Here’s how the larger models performed on the medical benchmarks, which include datasets such as MMLU College Biology, MMLU Professional Medicine, and PubMedQA (a short look at what these datasets contain follows the comparison).

πŸ₯‡ Meta-Llama-3.1-70B-Instruct

  • Average Score: 84%
  • MMLU College Biology: 95.14%
  • MMLU Professional Medicine: 91.91%
  • This model demonstrated strong performance across the board, making it the overall top performer.

πŸ₯ˆ Meta-Llama-3.2-90B-Vision (Instruct & Base)

  • Average Score: 83.95% (tied for second place)
  • MMLU College Biology: 93.06%
  • MMLU Professional Medicine: 91.18%
  • Interestingly, both the Instruct and Base versions of this model performed identically across all datasets.

πŸ₯‰ Meta-Llama-3-70B-Instruct

  • Average Score: 82.24%
  • MMLU Medical Genetics: 93%
  • MMLU College Biology: 90.28%
  • This model performed exceptionally well in Medical Genetics.
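If you want to see what these benchmarks actually contain, here is a hedged sketch that pulls two of them from the Hugging Face Hub with the `datasets` library. The dataset ids (`cais/mmlu`, `qiaojin/PubMedQA`) are the standard public copies and an assumption about which versions sit behind the numbers above.

```python
# Peek at the benchmark data: MMLU subsets are 4-way multiple choice,
# PubMedQA (pqa_labeled) expects a yes/no/maybe decision for a research question.
from datasets import load_dataset

college_bio = load_dataset("cais/mmlu", "college_biology", split="test")
pubmedqa = load_dataset("qiaojin/PubMedQA", "pqa_labeled", split="train")

q = college_bio[0]
print(q["question"])
print(q["choices"], "-> correct index:", q["answer"])

p = pubmedqa[0]
print(p["question"], "-> gold answer:", p["final_decision"])
```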

Small Models Analysis

We also evaluated the smaller models to see how they compare on the same medical tasks.

*(Figure: benchmark results for the smaller models.)*

πŸ₯‡ Phi-3-4k

  • Average Score: 68.93%
  • MMLU College Biology: 84.72%
  • MMLU Clinical Knowledge: 75.85%
  • Thanks to the efforts of Sébastien Bubeck and his team at Microsoft, this model came out on top in the smaller-models category.

πŸ₯ˆ Meta-Llama-3.2-3B-Instruct

  • Average Score: 64.15%
  • MMLU College Biology: 70.83%
  • PubMedQA: 70.6%

πŸ₯‰ Meta-Llama-3.2-3B

  • Average Score: 60.36%
  • MMLU College Biology: 63.89%
  • PubMedQA: 72.8%

Identical Performance in Vision Models: What's Going On?

The most surprising result from this study was the identical performance of the Meta-Llama-3.2-90B Vision Instruct and Base versions. Typically, Instruct models are fine-tuned to follow instructions and are expected to behave differently from their Base counterparts. However, both versions achieved exactly the same average score of 83.95%, with identical results across all 9 datasets.

Unusual Consistency Across Models

We noticed the same pattern with the Meta-Llama-3.2-11B Vision models: both the Instruct and Base versions scored 72.8% on average, with no variation in performance. This raises an interesting question: could the vision tuning of these models be less dependent on task-specific instruction?
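One cheap way to probe that question, offered here purely as a hypothetical diagnostic rather than something done for this post, is to check whether the two checkpoints actually differ on disk: if the weight shards are byte-identical, identical scores follow trivially; if they differ, the identical scores are a genuine evaluation result worth re-checking (prompt template, chat formatting, and so on). The repo ids below are assumptions about the exact Hub names, and the repos are gated, so an authenticated token is needed.

```python
# Hypothetical diagnostic: compare the LFS sha256 hashes of the safetensors
# shards of the Base and Instruct vision checkpoints without downloading weights.
from huggingface_hub import HfApi

api = HfApi()

def shard_hashes(repo_id: str) -> dict:
    """Map each .safetensors shard in a repo to its LFS sha256 hash."""
    info = api.model_info(repo_id, files_metadata=True)
    return {
        s.rfilename: (s.lfs.sha256 if s.lfs else None)
        for s in info.siblings
        if s.rfilename.endswith(".safetensors")
    }

base = shard_hashes("meta-llama/Llama-3.2-90B-Vision")               # assumed id
instruct = shard_hashes("meta-llama/Llama-3.2-90B-Vision-Instruct")  # assumed id

shared = sorted(set(base) & set(instruct))
identical = [f for f in shared if base[f] == instruct[f]]
print(f"{len(identical)} of {len(shared)} shared shard names are byte-identical")
```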

Conclusion

In summary, Llama-3.1-70B remains the top performer on medical tasks, outperforming the larger Llama-3.2-90B model. Among the smaller models, Phi-3-4k led the pack, while the Meta-Llama-3.2 Vision models, both Instruct and Base, performed identically, which raises questions about how their instruction tuning interacts with text-only medical benchmarks.

For the detailed results, check out the JSON file on GitHub.

If you find this useful in your work, please cite the article as follows:

@misc{MedLLama3,
  author = {Ankit Pal},
  title = {Performance Comparison: Llama-3 Models in Medical and Healthcare AI Domains},
  year = {2024},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/blog/aaditya/llama3-in-medical-domain}}
}

Interested in Medical AI? Follow @OpenlifesciAI for daily updates on Medical LLM papers and models. Join our Discord community of 500+ experts to discuss Medical LLMs, datasets, benchmarks, and more!