Direct imaging inputs improve performance of large language models

Using direct radiologic image inputs can improve the diagnostic accuracy of large language models, according to research published July 9 in Radiology.

A team led by Pae Sun Suh, MD, from the University of Ulsan in Seoul, South Korea, found that while ChatGPT-4V did not achieve as high an accuracy as radiologists, it showed promise as a supportive diagnostic tool.

“The large language models generated differential diagnoses rapidly with substantial performance, highlighting the potential supportive role of a large language model in diagnostic decision-making,” Suh and colleagues wrote.

Large language models have undergone steady upgrades since the first publicly available versions were released in late 2022. ChatGPT, for instance, is on its fourth version, while Google has been working to improve its Gemini (formerly Bard) model.

The researchers highlighted that while previous studies have analyzed the performance of these models in radiologic settings, they used text-based inputs without images.

ChatGPT-4V is a preview version of GPT-4 with vision, available through OpenAI’s application programming interface (API), which means it can accept image inputs in prompts. Google’s Gemini Pro Vision, meanwhile, is a recently developed multimodal model from Google DeepMind that can process imaging data.
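For context on what an image-capable prompt looks like, below is a minimal sketch of passing a radiologic image to GPT-4 with vision through OpenAI’s Python client. The model name matches OpenAI’s preview release; the file name and prompt wording are illustrative assumptions, not the study’s actual inputs.

    # Minimal sketch: sending an image plus text to GPT-4 with vision via
    # OpenAI's chat completions API. The file name and prompt are
    # illustrative assumptions, not the study's actual inputs.
    import base64
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open("case_image.png", "rb") as f:  # hypothetical case image
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # GPT-4 with vision preview model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Provide three differential diagnoses for this case."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    print(response.choices[0].message.content)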

Suh and co-authors studied the performance of these two models in generating differential diagnoses at different temperatures. They compared the results with those of radiologists using Radiology Diagnosis Please cases that were published between 2008 and 2023.

Inputs to the models included the original images, along with captures of the textual patient history and figure legends (without imaging findings) taken from the PDF file of each case. The team tasked the models with providing three differential diagnoses, repeated five times at "temperatures" of 0, 0.5, and 1 (the higher the temperature, the more varied and creative the model's generated responses). The team also set a statistical significance threshold of p < 0.007 using Bonferroni adjustment. A rough sketch of this repetition protocol follows.
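The sketch below queries a hypothetical query_model wrapper (standing in for an image-capable API call like the one above) five times at each temperature; it is an assumption about the structure of the protocol, not the authors’ code.

    # Sketch of the repetition protocol: five queries per case at each of
    # the three temperature settings. `query_model` is a hypothetical
    # wrapper around an image-capable API call, not the authors' code.
    TEMPERATURES = [0, 0.5, 1]
    REPETITIONS = 5

    def collect_responses(case, query_model):
        """Return five generated differential-diagnosis lists per temperature."""
        return {
            temp: [query_model(case, temperature=temp) for _ in range(REPETITIONS)]
            for temp in TEMPERATURES
        }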

Meanwhile, eight subspecialty-trained radiologists solved the same cases. An experienced radiologist compared the generated diagnoses with the final diagnoses, considering a result correct if the generated diagnoses included the final diagnosis in any of the five repetitions.
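In code form, that scoring rule might look like the sketch below, assuming each repetition yields a list of three diagnosis strings; simple string matching stands in here for the radiologist’s judgment.

    # Sketch of the scoring rule: a case counts as correct if the final
    # diagnosis appears in any of the five repetitions. Case-insensitive
    # substring matching is a simplification; in the study an experienced
    # radiologist judged whether diagnoses matched.
    def is_correct(repetitions, final_diagnosis):
        target = final_diagnosis.lower()
        return any(
            target in diagnosis.lower()
            for diagnoses in repetitions
            for diagnosis in diagnoses
        )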

The study included 190 cases, which included the following subspecialties: neuroradiology (n = 53), multisystem (n = 27), gastrointestinal (n = 25), genitourinary (n = 23), musculoskeletal (n = 17), chest (n = 16), cardiovascular (n = 12), pediatric (n = 12), and breast (n = 5).

The models demonstrated improved overall accuracy as the temperature setting increased. However, the researchers reported that after adjustment, these improvements did not reach statistical significance.

Performance of large language models on Radiology Diagnosis Please cases

                  ChatGPT-4V             Gemini Pro Vision
Temperature       0      0.5    1        0      0.5    1
Accuracy          41%    45%    49%      29%    36%    39%

The radiologists, meanwhile, achieved an overall accuracy of 61%, significantly higher than that of Gemini Pro Vision at temperature 1 (p < 0.001). The radiologists' advantage over ChatGPT-4V at temperature 1, however, did not reach the Bonferroni-adjusted significance threshold of p < 0.007 (p = 0.02).

Finally, at temperature 1, the radiologists outperformed the large language models in most subspecialties, with an accuracy range of 45% to 88% versus 24% to 75% for the models.

The study authors called for future research into the clinical decision support that large language models can provide when used alongside radiologists, compared with radiologists alone or with other medical diagnostic models.

In an accompanying editorial, Mizuki Nishino, MD, from Harvard Medical School, and David Ballard, MD, from the Mallinckrodt Institute of Radiology in St. Louis, wrote that while questions remain for future research, “it is clear the next big wave of multimodal large language models has already arrived in diagnostic radiology, and it is time to figure out how to ride the wave.”

