ChatGPT demonstrates mixed results in assigning BI-RADS categories

ChatGPT demonstrates modest accuracy when assigning BI-RADS scores for mammograms and breast ultrasound exams, according to research published October 30 in Clinical Imaging.

A team led by Marc Succi, MD, from Mass General Brigham in Boston found that two iterations of the large language model (LLM) could correctly assign BI-RADS scores in two out of every three cases, with better performance seen for BI-RADS 5 cases. The models achieved the lowest scores in assigning lower BI-RADS categories.

“These findings provide breast radiologists with a valuable foundation for understanding the current capabilities and limitations of off-the-shelf LLMs in image interpretation,” Succi told AuntMinnie.com.

Previous reports suggest that large language models can correctly recommend appropriate imaging modalities for patients based on their clinical presentation. They can also correctly determine BI-RADS categories based on textual imaging reports, according to an earlier 2024 study.

Succi and colleagues conducted a pilot study that explored whether ChatGPT-4 and ChatGPT-4o, the latter of which adds multimodal processing, can assist with generating BI-RADS scores from mammographic and breast ultrasound images.

The team tested both models using 77 breast cancer images from radiopaedia.org and analyzed images in separate sessions to avoid bias.
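The article doesn't describe the team's exact prompting setup, but an evaluation of this kind can be scripted against the OpenAI API. Below is a minimal sketch of how a single case might be submitted to GPT-4o for a BI-RADS assignment; the prompt wording and file name are illustrative assumptions, not the study's protocol:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def encode_image(path: str) -> str:
    """Read an image file and return its base64 encoding as a string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


# Hypothetical prompt; the study's actual instructions were not published
# in this article.
prompt = ("Assign a BI-RADS category (0-6) to this breast imaging study. "
          "Reply with the category number only.")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            # Images can be passed inline as base64 data URLs.
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{encode_image('case_001.png')}"}},
        ],
    }],
)

print(response.choices[0].message.content)
```

Running each model in a fresh session, as the authors did, avoids one case's conversation history influencing the next assignment.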

Both ChatGPT-4 and ChatGPT-4o achieved 66.2% accuracy across all BI-RADS cases, although performance varied by category. The models scored highest on BI-RADS 5 cases, at 84.4% for GPT-4 and 88.9% for GPT-4o. Both models, however, scored 0% when assigning BI-RADS 3 and struggled with the BI-RADS 1 and 2 categories.

“The models were able to handle high-risk cases effectively but tended to overestimate the severity of lower-risk cases,” Succi said.

Of the incorrect gradings for BI-RADS 1 to 3 cases, 64.2% for GPT-4 and 76.4% for GPT-4o were two grades higher than the correct score. Compared with the ground truth, the models achieved interrater agreement of 0.72 and 0.68, respectively.
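The article doesn't specify which agreement statistic the authors used; Cohen's kappa is a common choice when comparing a rater's labels against ground truth. A minimal sketch, using made-up labels, of how the per-category accuracies and a chance-corrected agreement score might be computed:

```python
from collections import defaultdict
from sklearn.metrics import cohen_kappa_score

# Hypothetical data: ground-truth and model-assigned BI-RADS categories.
truth = [5, 5, 3, 1, 2, 4, 5, 0]
preds = [5, 5, 5, 3, 4, 4, 5, 0]

# Overall and per-category accuracy.
overall = sum(t == p for t, p in zip(truth, preds)) / len(truth)
per_cat = defaultdict(lambda: [0, 0])  # category -> [correct, total]
for t, p in zip(truth, preds):
    per_cat[t][1] += 1
    per_cat[t][0] += t == p

print(f"overall accuracy: {overall:.1%}")
for cat in sorted(per_cat):
    correct, total = per_cat[cat]
    print(f"BI-RADS {cat}: {correct / total:.1%} ({correct}/{total})")

# Agreement with ground truth, corrected for chance agreement.
print(f"Cohen's kappa: {cohen_kappa_score(truth, preds):.2f}")
```

Unlike raw accuracy, a chance-corrected statistic discounts agreement that would occur by guessing, which matters when some categories (such as BI-RADS 5 here) dominate the sample.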

Finally, both models achieved higher accuracy for mammograms at 67.6% compared with 55.6% for ultrasound images.

Succi said that the subtle differences in lower-risk cases may be harder for the LLMs to distinguish.

“Additionally, the models might have been trained on datasets that contain more high-risk cases, potentially influencing their accuracy,” he added.

Succi said that the research team is committed to discovering ways that LLMs can effectively support clinicians, with current projects spanning a range of applications, both in and outside of radiology.

“We're particularly interested in applications of AI for patient triage and patient education,” he told AuntMinnie.com.

