ChatGPT-4.0 bests other large language models on ACR exam questions

ChatGPT-4.0 performs well on the image-independent American College of Radiology Diagnostic In-Training Exam (ACR DXIT) practice questions, a study presented November 29 at the RSNA 2023 annual meeting found.

Presenter Christopher Kaufmann, MD, from the University of Texas at Austin talked about results from his team’s comparative study of large language models, which showed that ChatGPT-4.0 achieved the highest scores.

“The results demonstrate the powerful efficiency and improving accuracy of evolving publicly available AI tools when applied to the radiology-specific domain,” Kaufmann said.

Large language models such as ChatGPT and Google Bard have become an area of interest for radiologists within the past year. Previous studies have examined the utility of these models in clinical- and patient-facing settings. However, Kaufmann pointed out that radiologists need to know the output accuracy, relevance, and reliability of these models in a specific domain before determining their clinical utility there.

Kaufmann and colleagues compared the latest publicly accessible large language models across multiple subspecialty areas of radiology.

They used the 2022 ACR DXIT practice question set, specifically image-independent questions distributed across various radiology disciplines. The team tested three publicly available large language model platforms: ChatGPT (versions 3.5 and 4.0), Google Bard, and Microsoft Bing Chat. The questions were entered into each AI interface in their original text format.
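
The study entered each question by hand into the public chat interfaces. For readers who want to run a similar image-independent, text-only comparison at scale, the sketch below shows one way the ChatGPT portion could be scripted through the OpenAI Python SDK; the question file, model names, prompt wording, and answer-grading step are illustrative assumptions, not part of the study.

```python
# Illustrative sketch only: the study pasted questions manually into each chat
# interface. This shows how a similar text-only comparison could be scripted
# against the OpenAI API. File name, model list, and answer parsing are assumptions.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Hypothetical file of image-independent multiple-choice questions:
# [{"stem": "...", "choices": {"A": "...", "B": "..."}, "answer": "A"}, ...]
with open("dxit_questions.json") as f:
    questions = json.load(f)

for model in ["gpt-3.5-turbo", "gpt-4"]:
    correct = 0
    for q in questions:
        prompt = (
            q["stem"] + "\n"
            + "\n".join(f"{letter}. {text}" for letter, text in q["choices"].items())
            + "\nAnswer with the letter of the single best choice."
        )
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        reply = response.choices[0].message.content.strip()
        if reply and reply[0].upper() == q["answer"]:
            correct += 1
    print(f"{model}: {correct}/{len(questions)} "
          f"({100 * correct / len(questions):.1f}%) correct")
```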

In total, the team included 42 ACR DXIT questions in the study. The group found that ChatGPT-4.0 answered 90.5% of the questions correctly (n = 38), while ChatGPT-3.5 answered 79% correctly (n = 33).

The researchers also found that, despite ChatGPT-4.0's overall advantage, two questions answered correctly by ChatGPT-3.5 were answered incorrectly by ChatGPT-4.0.

Meanwhile, Google Bard answered 71% of the questions correctly, with all three of its draft responses correct (n = 30). Bard also produced partially correct responses for 14% of the questions (n = 6). Finally, the team found that Bing Chat performed the lowest, with 60% answered correctly (n = 25).
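
For reference, the reported percentages follow directly from the raw counts out of the 42 questions; a quick check (in Python, purely as an illustration of the arithmetic):

```python
# Reported correct-answer counts out of 42 image-independent ACR DXIT questions.
total = 42
counts = {
    "ChatGPT-4.0": 38,
    "ChatGPT-3.5": 33,
    "Google Bard": 30,
    "Bing Chat": 25,
}
for model, n in counts.items():
    print(f"{model}: {n}/{total} = {100 * n / total:.1f}%")
# ChatGPT-4.0: 90.5%, ChatGPT-3.5: 78.6%, Google Bard: 71.4%, Bing Chat: 59.5%
```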

Kaufmann said these results demonstrate the importance of determining the radiology domain-specific performance and output reliability of large language models before their potential use in practice. He added that updates to these models should be monitored continuously, since each update can change their performance on such exams.

“With their rapid evolution, up-to-date accuracy and trustworthiness of AI technologies will remain key criteria for their ultimate clinical adoption and specific uses in practice,” Kaufmann said.

He concluded that future studies should include radiology-specific inputs on validated data and specialty-specific benchmarking, among other needs.
