LLMs decrease in accuracy over time on radiology exams

Large language models (LLMs) demonstrate high accuracy on radiology exams, yet their accuracy declines over time, according to research published November 20 in the European Journal of Radiology.

The study provides a foundational benchmark for future LLM performance evaluations in the field, noted lead author Mitul Gupta, a medical student at the University of Texas at Austin, and colleagues.

“Before integrating large language models (LLMs) into clinical or educational settings, a thorough understanding of their accuracy, consistency, and stability over time is paramount,” the group wrote.

Since their introduction, LLMs like GPT-4, GPT-3.5, Claude, and Google Bard have demonstrated near-expert-level performance on radiology exams, the authors noted. Yet there is little to no comparative information on model performance, accuracy, and reliability over time, they wrote.

Thus, the group evaluated and monitored the performance and internal reliability of LLMs in radiology over a three-month period.

The researchers queried GPT-4, GPT-3.5, Claude, and Google Bard monthly from November 2023 to January 2024, using multiple-choice practice questions from the ACR Diagnostic Radiology In-Training (DXIT) exam (n = 172). Questions covered various radiology disciplines, including breast, cardiothoracic, gastrointestinal, genitourinary, musculoskeletal, neuroradiology, nuclear medicine, pediatrics, ultrasound, interventional radiology, and radiology physics.
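
The article does not include the authors' querying code; the following is a minimal, hypothetical sketch of the kind of repeated-testing loop described above. The model identifiers, question format, and the query_model() helper are illustrative assumptions, not the study's actual implementation.

```python
from datetime import date

# Models compared in the study; identifiers here are purely illustrative.
MODELS = ["gpt-4", "gpt-3.5", "claude", "bard"]

def query_model(model: str, stem: str, choices: list[str]) -> str:
    """Hypothetical placeholder: send one multiple-choice question to the
    given model and return the letter of the answer it selects."""
    raise NotImplementedError("wire this to the vendor API of your choice")

def run_monthly_test(questions: list[dict], run_label: str) -> dict:
    """Ask every model every question once and record correctness."""
    results = {model: [] for model in MODELS}
    for q in questions:
        for model in MODELS:
            answer = query_model(model, q["stem"], q["choices"])
            results[model].append(
                {"id": q["id"], "answer": answer, "correct": answer == q["key"]}
            )
    return {"run": run_label, "date": date.today().isoformat(), "results": results}

# One run per month, e.g.:
# november = run_monthly_test(questions, "2023-11")
```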

The researchers assessed overall model accuracy over the period by subspecialty and evaluated internal consistency through answer mismatch, or intramodel discordance, between test runs. If a model's answer to a question changed from one time point to another, it was deemed a "mismatch," regardless of whether either answer was correct, the researchers noted.
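
As a concrete illustration of that metric, here is a minimal sketch, assuming each test run is stored as a mapping from question ID to the answer the model chose; the function and example answers are hypothetical, not taken from the study.

```python
def mismatch_rate(run_a: dict[str, str], run_b: dict[str, str]) -> float:
    """Fraction of shared questions whose answer changed between two runs,
    regardless of whether either answer was correct."""
    shared = run_a.keys() & run_b.keys()
    changed = sum(1 for qid in shared if run_a[qid] != run_b[qid])
    return changed / len(shared)

# Example: one model's answers to three questions in consecutive monthly runs
november = {"q1": "A", "q2": "C", "q3": "B"}
december = {"q1": "A", "q2": "D", "q3": "B"}
print(f"{mismatch_rate(november, december):.0%} of answers changed")  # 33%
```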

Overall, GPT-4 performed with the highest average accuracy (78%), followed by Google Bard (73%), Claude (71%), and GPT-3.5 (63%), while each model's performance varied over the three months, according to the analysis.

LLM performance on 172 DXIT questions over three months and at each time point

Month            GPT-3.5    Claude    Google Bard    GPT-4
November 2023    71%        70%       76%            82%
December 2023    58%        72%       70%            77%
January 2024     62%        73%       74%            74%
Average          63%        71%       73%            78%

In addition, LLM performance varied significantly across radiology subspecialties, with significant differences between models on questions related to chest (p = 0.0161), physics (p = 0.0201), ultrasound (p = 0.007), and pediatrics (p = 0.0225), the researchers found. The other eight subspecialties showed no significant differences between models.
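
The article does not state which statistical test produced these p-values. As a hedged sketch of one common way to compare models within a subspecialty, the example below runs a chi-square test of independence on correct/incorrect counts per model; the counts are invented for illustration only.

```python
from scipy.stats import chi2_contingency

# Rows = models, columns = (correct, incorrect) on one subspecialty's questions.
# These counts are made up for illustration, not the study's data.
chest_counts = [
    [20, 5],   # GPT-4
    [18, 7],   # Google Bard
    [17, 8],   # Claude
    [13, 12],  # GPT-3.5
]
chi2, p, dof, expected = chi2_contingency(chest_counts)
print(f"p-value for differences between models: {p:.4f}")
```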

“The observed strengths and weaknesses of LLMs across different radiology subspecialties suggest that their use might be more appropriate in some areas than others, necessitating a targeted approach to their integration in curricula,” the researchers wrote.

While LLM performance approaches “passing” levels on radiology exams, the risk of model deterioration and the potential for inaccurate responses over time pose significant concerns, the researchers wrote.

To address these challenges, the group suggested that standardized ground-truth benchmarking tools are needed to gauge LLM performance and to overcome the “black box” nature of decision-making by LLMs.

“Further work is needed to continue developing and refining these initial, standardized radiology performance benchmarking test metrics,” the researchers concluded.

The full study is available here.
