Large language models underperform in breast imaging

Apr 30, 2024

Large language models such as ChatGPT and Google Gemini fall short in breast imaging, a study published April 30 in Radiology found.

Researchers led by Andrea Cozzi, MD, PhD, of the Imaging Institute of Southern Switzerland in Lugano reported that these models show only moderate agreement with radiologists when classifying breast imaging findings by BI-RADS category.

“Simply put, we cannot use large language models as a medical device,” Cozzi told AuntMinnie.com. “Eventually, I am convinced we will get there. But we will need a standard, reliable development process.”

Radiologists have explored the potential of large language models in interpreting mammography images. Although previous studies have shown that models such as ChatGPT can generate appropriate responses to patients’ questions and improve patient education, their ability to classify suspicious findings on imaging has been suspect.

Cozzi and colleagues evaluated the agreement between human readers and large language models in assigning BI-RADS categories. The categories were assigned based on breast imaging reports written in three languages: English, Italian, and Dutch. The models included were ChatGPT-4, ChatGPT-3.5, and Google Gemini (formerly Bard). The researchers also assessed the impact of discordant category assignments on clinical management.

Across 2,400 reports included in the study, agreement between the original and reviewing radiologists was high, with a Gwet agreement coefficient (AC1 value) of 0.91. Meanwhile, agreement between the original radiologists and the large language models was moderate. This included AC1 values of 0.52 for GPT-4, 0.48 for GPT-3.5, and 0.42 for Gemini.
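For readers unfamiliar with the statistic, the two-rater form of Gwet's AC1 can be sketched in a few lines of Python. This is an illustrative sketch only, not the study's code: the function name `gwet_ac1` and the toy ratings below are hypothetical and do not come from the paper.

```python
from collections import Counter


def gwet_ac1(ratings_a, ratings_b, categories):
    """Two-rater Gwet AC1 over a fixed list of possible categories.

    Hypothetical helper for illustration; not taken from the study.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    q = len(categories)
    if q < 2:
        raise ValueError("at least two categories are required")
    n = len(ratings_a)

    # Observed agreement: share of reports given the same category by both raters.
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: based on each category's average share across both raters.
    counts = Counter(ratings_a) + Counter(ratings_b)
    shares = [counts[c] / (2 * n) for c in categories]
    p_e = sum(s * (1 - s) for s in shares) / (q - 1)

    return (p_a - p_e) / (1 - p_e)


# Toy data (made up): four reports, two possible categories.
original_reader = [1, 1, 2, 2]
second_reader = [1, 2, 2, 2]
ac1 = gwet_ac1(original_reader, second_reader, categories=[1, 2])
```

A value of 1 means the two raters agree on every report, and values near 0 mean agreement is no better than the chance level the coefficient assumes.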

The team also observed differences between human readers and large language models in how often BI-RADS category upgrades or downgrades would change clinical management or negatively affect it.

Frequency of BI-RADS category changes in clinical management of breast imaging findings
Finding	Human readers	Gemini	GPT-3.5	GPT-4	p-value (between human readers and large language models)
Changes in clinical management	4.9%	25.5%	23.9%	18.1%	< 0.001
Negatively impacted clinical management	1.5%	18.1%	14.3%	10.6%	< 0.001

The study authors highlighted that these results raise concerns about how the “unwarranted” use of large language models by patients and healthcare professionals could lead to negative consequences. They added that this puts the need for regulation of publicly available, generically trained large language models in the spotlight, along with the rapid development of context-trained extensions of these tools.

Cozzi echoed that sentiment, saying that radiologists should be “cautious stakeholders” while the technology grows and becomes more accurate in health and medicine.

“We need to … push for regulation at every level, national, internationally, and at the global level,” he told AuntMinnie.com. “We need flexible, but … very well-defined regulations for the use of large language models in the healthcare sector.”

The full study can be found here.