Large language models (LLMs) show promise for automating PI-RADS classifications of prostate lesions from MRI report data, according to a study published January 2 in BMC Urology.
The study findings could translate to better patient care, as correct classification of prostate lesions is key to appropriate treatment, wrote Betul Akdal Dolek, MD, of Ankara Bilkent City Hospital, and colleague Muhammed Said Besler, MD, of Istanbul Medeniyet University, both in Turkey.
"An accurate PI-RADS classification is crucial for making the right decision about whether to perform a biopsy," they explained. "Misclassifying low-risk lesions as high-risk can lead to unnecessary procedures, while misclassifying high-risk lesions as low-risk can result in a delayed diagnosis of clinically relevant prostate cancer."
Prostate cancer is one of the leading causes of cancer-related illness and death among men worldwide, making appropriate risk assessment crucial. The PI-RADS system is used with MR imaging and has become an "important decision-support tool in urology, directly impacting biopsy recommendations, treatment plans and follow-up strategies," the authors wrote, noting as well that the PI-RADS 3 category can be a challenging one, "as its equivocal nature may lead to unnecessary biopsies or delayed diagnosis, with important clinical implications."
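For readers unfamiliar with the system, the grouping of PI-RADS categories into the risk bands referenced later in this article can be expressed as a simple lookup. This is a sketch assuming the conventional PI-RADS v2.1 interpretation (1-2 = clinically significant cancer unlikely, 3 = equivocal, 4-5 = likely or highly likely); the study's exact grouping is not spelled out in this report.

```python
# Conventional mapping of PI-RADS categories (1-5) to risk bands,
# assuming standard PI-RADS v2.1 interpretation (not confirmed by the study text).
RISK_GROUP = {1: "low", 2: "low", 3: "equivocal", 4: "high", 5: "high"}

def risk_group(pi_rads: int) -> str:
    """Return the risk band for a PI-RADS category (1-5)."""
    if pi_rads not in RISK_GROUP:
        raise ValueError(f"PI-RADS category must be 1-5, got {pi_rads}")
    return RISK_GROUP[pi_rads]
```

The equivocal band contains only PI-RADS 3, which is exactly where the study found all four LLMs performed weakest.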
Advances in AI and LLMs show promise for automating and standardizing radiology reporting, and "identifying PI-RADS categories directly from MRI reports through generative AI could potentially streamline clinical workflows, ensure reporting consistency, and reduce interobserver variability," Dolek and Besler wrote.
They conducted a study to evaluate the performance of LLMs for assigning PI-RADS categories based on 146 structured prostate MRI reports produced between October 2023 and October 2024. The reports included prostate-specific antigen (PSA) values (ng/mL) and PSA density values (ng/mL/cm³); the team included the following LLMs in the study: GPT-4o, GPT-o1, Google Gemini 1.5 Pro, and Google Gemini 2.0 Experimental Advanced, and used radiologist consensus as the reference standard. Dolek and Besler measured agreement between the LLMs and the radiologist readers using Cohen's kappa, and calculated accuracy and F1 scores for three PI-RADS risk groups (low, intermediate, and high).
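The agreement and per-group metrics the authors report can be computed directly from paired label lists. A minimal sketch in plain Python, with toy labels that are purely illustrative (not data from the study):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    ca, cb = Counter(a), Counter(b)
    # expected agreement if both raters assigned labels independently by chance
    p_e = sum(ca[k] * cb.get(k, 0) for k in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)

def f1_per_class(truth, pred, cls):
    """One-vs-rest F1 score for a single class label."""
    tp = sum(t == cls and p == cls for t, p in zip(truth, pred))
    fp = sum(t != cls and p == cls for t, p in zip(truth, pred))
    fn = sum(t == cls and p != cls for t, p in zip(truth, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative toy labels: radiologist consensus vs. an LLM's assignments
consensus = ["low", "low", "equivocal", "high", "high", "equivocal"]
llm_preds = ["low", "low", "equivocal", "high", "low", "high"]
```

With these six toy cases, `cohens_kappa(consensus, llm_preds)` evaluates to 0.5, well below the 0.87 the study reports for GPT-o1 against the radiologist consensus.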
Overall, the authors found that GPT-o1 showed the highest agreement with the radiologist readers, while all of the LLMs performed weakly on the PI-RADS 3 category.
Comparison of 4 LLMs for classifying MRI findings into PI-RADS categories

| Performance measure | Google Gemini 2.0 Experimental Advanced | Google Gemini 1.5 Pro | GPT-4o | GPT-o1 |
| --- | --- | --- | --- | --- |
| Radiologist reader agreement (Cohen's kappa) | 0.66 | 0.73 | 0.74 | 0.87 |
| F1 score, low-risk PI-RADS category | 0.81 | 0.93 | 0.92 | 0.93 |
| F1 score, equivocal-risk PI-RADS category | 0.57 | 0.54 | 0.53 | 0.75 |
| F1 score, high-risk PI-RADS category | 0.88 | 0.86 | 0.91 | 1.00 |
| Accuracy, low-risk PI-RADS category | 81.3% | 93.3% | 92% | 93.3% |
| Accuracy, equivocal-risk PI-RADS category | 57.1% | 53.6% | 53.6% | 75% |
| Accuracy, high-risk PI-RADS category | 88.4% | 86% | 90.7% | 100% |
Although the study findings are positive, more research is needed, according to the authors.
"[LLMs'] failure in PI-RADS 3 lesions indicates that multicenter validation, larger datasets, and multimodality integration are needed before they can be used clinically for prostate cancer diagnosis and urological decision-making," they concluded.