Radiologists best ChatGPT, Gemini in assigning PI-RADS categories

Large language models (LLMs) are not ready to assign PI-RADS classifications for prostate cancer, suggest findings published November 13 in the British Journal of Radiology.

A team led by Kang-Lung Lee, MD, from the University of Cambridge in England found that radiologists outperformed all LLMs analyzed in its study, including ChatGPT and Google Gemini, in terms of accuracy for PI-RADS classification based on prostate MRI text reports.

“While LLMs, including online models, may be a valuable tool, it's essential to be aware of their limitations and exercise caution in their clinical application,” Lee told AuntMinnie.com.

Since ChatGPT's introduction in late 2022, LLMs have demonstrated potential for clinical use, including in radiology departments. Radiology researchers continue to explore their capabilities as well as their current limitations.

Lee and colleagues tested the chatbots' ability to assign PI-RADS categories from clinical text reports. They included 100 consecutive multiparametric prostate MRI reports for patients who had not undergone biopsy. Two radiologists classified the reports, and their assignments were compared with responses generated by four models: ChatGPT-3.5, ChatGPT-4, Google Bard, and Google Gemini.
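As a rough illustration of this kind of workflow, the minimal sketch below shows how a report could be passed to an LLM for a PI-RADS category. The model name, prompt wording, and assign_pi_rads helper are illustrative assumptions; the article does not describe the study's exact prompts.

```python
# Illustrative sketch only -- the prompt, model name, and helper are assumptions,
# not the study's actual setup.
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY is set in the environment

def assign_pi_rads(report_text: str, model: str = "gpt-4") -> str:
    """Ask the model to return a single overall PI-RADS category (1-5) for a report."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": ("You are a radiologist. Given a prostate MRI text report, "
                         "reply with only the overall PI-RADS category (1-5).")},
            {"role": "user", "content": report_text},
        ],
    )
    return response.choices[0].message.content.strip()
```

In a study setting such as this one, the returned category would then be compared against the radiologists' reference classifications.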

Out of the total reports, 52 were originally reported as PI-RADS 1-2, nine as PI-RADS 3, 19 as PI-RADS 4, and 20 as PI-RADS 5.

The radiologists outperformed all the LLMs. However, the researchers observed that the successor models (ChatGPT-4 and Gemini) outperformed their predecessors.

Accuracy of radiologists, large language models in PI-RADS classification
Reader              Accuracy
Senior radiologist  95%
Junior radiologist  90%
ChatGPT-4           83%
Gemini              79%
ChatGPT-3.5         67%
Bard                67%

Bard and Gemini bested the ChatGPT models for PI-RADS 1-2 cases, posting F1 scores of 0.94 and 0.98, while GPT-3.5 and GPT-4 achieved F1 scores of 0.77 and 0.94, respectively.

However, for PI-RADS 4 and 5 cases, GPT-3.5 and GPT-4 (F1, 0.95 and 0.98, respectively) outperformed Bard and Gemini (F1, 0.71 and 0.87, respectively).

Bard also hallucinated, assigning a nonexistent PI-RADS 6 category to two patients; the PI-RADS scale contains only five categories.

“This hallucination phenomenon, however, was not observed in ChatGPT-3.5, ChatGPT-4, or Gemini,” Lee said.

Finally, the team observed varying levels of agreement with the original reports across the radiologists and models, with the following kappa values: senior radiologist, 0.93; junior radiologist, 0.84; GPT-4, 0.86; Gemini, 0.81; GPT-3.5, 0.65; Bard, 0.57.
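For context on the metrics reported above, a brief sketch, assuming scikit-learn and made-up category lists, shows how per-category F1 scores and Cohen's kappa can be computed from paired PI-RADS assignments:

```python
# Made-up example data -- not the study's results -- to show how the metrics work.
from sklearn.metrics import cohen_kappa_score, f1_score

original = [2, 2, 3, 4, 5, 1, 4, 5, 2, 3]  # categories from the original reports
llm      = [2, 2, 3, 4, 5, 2, 4, 4, 2, 3]  # categories assigned by an LLM

# Per-category F1 scores (e.g., how reliably PI-RADS 4 cases are identified)
print(f1_score(original, llm, average=None, labels=[1, 2, 3, 4, 5]))

# Cohen's kappa: chance-corrected agreement between the two sets of assignments
print(cohen_kappa_score(original, llm))
```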

Lee said that despite the results, LLMs have the potential to assist radiologists in assigning or verifying PI-RADS categories after completing text reports. This includes offering significant support to less experienced readers in making accurate decisions.

“Furthermore, not all radiologists include PI-RADS scores in their reports, which can create challenges when patients are referred to another hospital,” Lee told AuntMinnie.com. “In such cases, LLMs can streamline the process for healthcare professionals at referral centers by efficiently generating PI-RADS categories from existing text reports.”

The researchers called for future research to study the utility of LLMs in assisting residents with reading reports, as well as investigating where these models may still be lagging. This could offer further insights into how these models may be applied in training environments, they noted.

The full study can be found here.
