Can an open-source large language model make the grade in radiology?

Meta’s Llama 3 70B, an open-source large language model (LLM), offers performance comparable to proprietary models in answering multiple-choice radiology test questions, according to research published August 13 in Radiology.

A team led by Lisa Adams, MD, of Technical University Munich in Germany found that Llama 3 70B's performance was not inferior to OpenAI's GPT-4, Google DeepMind's Gemini Ultra, or Anthropic's Claude models.

“This demonstrates the growing capabilities of open-source LLMs, which offer privacy, customization, and reliability comparable to that of their proprietary counterparts, but with far fewer parameters, potentially lowering operating costs when using optimization techniques such as quantization,” the group wrote.
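Quantization, mentioned in the quote above, stores model weights at lower numeric precision to shrink memory use and cost. As an illustration only (not code from the study), here is a minimal sketch of symmetric int8 weight quantization, using a small random matrix to stand in for an LLM layer; all names and values are hypothetical:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ~= scale * q."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor from int8 codes."""
    return q.astype(np.float32) * scale

# Toy example: a 4x4 float32 matrix standing in for a weight tensor.
w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(w.nbytes, q.nbytes)           # int8 storage is 4x smaller than float32
print(float(np.max(np.abs(w - w_hat))))  # small reconstruction error
```

The 4x memory saving (and larger savings with 4-bit schemes) is what makes running a 70-billion-parameter model on modest hardware plausible, at the cost of a bounded rounding error per weight.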

The researchers tested the models -- including versions of Mistral AI's open-source Mixtral LLM -- on 50 multiple-choice test questions from a publicly available 2022 in-training test from the American College of Radiology (ACR) as well as 85 additional board-style examination questions. Images were excluded from the analysis.

Accuracy on ACR diagnostic in-training exam questions and radiology board exam-style questions

Model            ACR in-training    Board exam-style
GPT-3.5 Turbo    58%                61%
Mixtral 8x22B    64%                72%
Gemini Ultra     72%                72%
Claude 3 Opus    78%                76%
GPT-4 Turbo      78%                82%
Llama 3 70B      74%                80%

With the exception of the open-source Mixtral 8x22B model (p = 0.15), the differences in performance between Llama 3 70B and the other LLMs did not reach statistical significance on the ACR in-training exam questions. On the radiology board exam-style questions, however, Llama 3 70B significantly outperformed GPT-3.5 Turbo (p = 0.05).

The authors emphasized that important limitations remain for these types of models in radiology applications.

“Multiple-choice formats test only specific knowledge, missing broader clinical complexities,” they wrote. “More nuanced benchmarks are needed to assess LLM skill in radiology, including disease and treatment knowledge, guideline adherence, and real-world case ambiguities. The lack of multimodality in open-source models is a critical shortcoming in the image-centric field of radiology.”

What’s more, all LLMs face the challenge of producing unreliable outputs, including false-positive findings and hallucinations, they said.

“However, open-source LLMs offer important advantages for radiology by allowing deep customization of architecture and training data,” they wrote. “This adaptability enables the creation of specialized models that can outperform generalist proprietary models, supporting the development of tailored clinical assistants and decision support tools.”

Nonetheless, the results highlight the potential and growing competitiveness of open-source LLMs in healthcare, according to the authors. A larger version of Llama 3 with 400 billion parameters is expected to be released later this year.

