ChatGPT-4 can effectively extract recommendations for additional imaging (RAIs) from radiology reports, according to a study published January 29 in the American Journal of Roentgenology.
The findings suggest that use of the tool could "facilitate tracking and timely completion of clinically necessary recommendations for additional imaging and thereby potentially reduce diagnostic delays," wrote a team led by Kathryn Li of Brigham and Women's Hospital and Harvard Medical School, both in Boston.
More than 10% of radiology reports include an RAI, and delays in, or failure to follow through on, these recommendations can lead not only to diagnostic error but also to patient harm, the group explained. Reported rates of adherence to RAIs range from 29% to 77%.
"RAIs are more likely to be completed in a timely fashion when presented using actionable phrasing," the team noted. "However, only a minority of RAIs in radiology reports are crafted in such a manner."
Using large language models (LLMs) with radiology reports could address this problem, Li and colleagues wrote. They assessed LLM performance for flagging actionable details of RAIs in 231 free-text impression sections of diagnostic radiology reports; the exams were performed in August 2023 across modalities and care settings within five subspecialties (abdominal imaging, musculoskeletal imaging, neuroradiology, nuclear medicine, and thoracic imaging).
Of these 231 reports, the team used 25 to develop a prompt instructing two large language models (ChatGPT-3.5 and ChatGPT-4) to extract details about the modality, body part, time frame, and rationale of the RAI, and used the remaining 206 to test the prompt with both models.
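The article does not reproduce the study's actual prompt or code. As a rough illustration only, the sketch below shows how a single impression might be sent to an OpenAI chat model to pull out those four details; the prompt wording, JSON keys, model name, and example impression are assumptions made for the illustration, not the authors' materials.

```python
# A rough sketch (not the study's prompt or code) of how one report impression
# could be sent to an OpenAI chat model to extract the four RAI details.
# Model name, prompt wording, and JSON keys are illustrative assumptions.
import json

from openai import OpenAI  # assumes the openai Python SDK (v1+) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = """You will be given the impression section of a radiology report.
If it contains a recommendation for additional imaging (RAI), extract the
recommended modality, body part, time frame, and rationale.
Respond with only a JSON object with the keys "modality", "body_part",
"time_frame", and "rationale"; use null for any detail not stated.

Impression:
{impression}"""


def extract_rai_details(impression: str, model: str = "gpt-4") -> dict:
    """Ask the model for the four actionable RAI details of one impression."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(impression=impression)}],
        temperature=0,  # keep the extraction as deterministic as possible
    )
    # Assumes the model follows the instruction to return bare JSON.
    return json.loads(response.choices[0].message.content)


# Made-up example impression, not from the study's dataset:
example = ("1. 8-mm right lower lobe pulmonary nodule. Recommend chest CT in "
           "6-12 months to assess stability.")
print(extract_rai_details(example))
```

In practice, a call like this would be looped over each test impression, and the parsed fields would then be checked against the impression text, which is the comparison the study's reviewers performed.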
Two reviewers -- a fourth-year medical student and a radiologist from the relevant subspecialty -- classified the LLM output for each of the four actionable RAI details as correct or incorrect in comparison with the report impressions, the team noted. A third reviewer resolved any discrepancies.
The investigators found:
Performance of two LLMs (GPT-3.5 and GPT-4) for classifying report impressions for RAI by a variety of measures

| RAI measure | ChatGPT-3.5 | ChatGPT-4 |
| --- | --- | --- |
| Modality | 94.2% | 95.6% |
| Body part | 88.3% | 89.3% |
| Time frame | 95.1% | 96.1% |
| Rationale | 88.8% | 89.8% |
Both ChatGPT-3.5 and ChatGPT-4 had an accuracy of 91.7% (189/206) for extracting the rationale of RAIs, but the group reported that, according to consensus assessments, ChatGPT-4 was more accurate than ChatGPT-3.5 for flagging the following (one way to run such paired comparisons is sketched after the list):
- RAIs by modality (94.2% vs. 85.4%, p < 0.001)
- RAIs by body part (86.9% vs. 77.2%, p = 0.004)
- RAIs by time frame (99.0% vs. 95.6%, p = 0.02)
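The article does not say which statistical test produced these p values. For two models judged correct or incorrect on the same 206 impressions, one standard choice is McNemar's test on the discordant pairs; the sketch below, using invented labels rather than the study's data, shows how such a paired comparison could be run.

```python
# A minimal sketch of computing per-detail accuracy and an exact McNemar test
# for two models scored on the same reports. The labels below are invented for
# illustration; the study's analysis code and test choice are not published here.
from scipy.stats import binomtest


def accuracy(labels):
    """Fraction of extractions judged correct (labels are 1/0)."""
    return sum(labels) / len(labels)


def mcnemar_exact(correct_a, correct_b):
    """Exact McNemar test on paired correct/incorrect (1/0) labels."""
    a_only = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    b_only = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    # Under the null of equal accuracy, discordant pairs split 50/50.
    return binomtest(a_only, a_only + b_only, p=0.5).pvalue


# Invented consensus labels for 10 impressions (the study used 206).
gpt4_modality = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
gpt35_modality = [1, 1, 1, 1, 1, 1, 0, 0, 1, 0]

print(f"GPT-4 accuracy:   {accuracy(gpt4_modality):.1%}")
print(f"GPT-3.5 accuracy: {accuracy(gpt35_modality):.1%}")
print(f"McNemar p value:  {mcnemar_exact(gpt4_modality, gpt35_modality):.3f}")
```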
The study suggests that ChatGPT-4 is well suited to processing the free-text impression sections of radiology reports, according to the authors.
"By accurately extracting actionable details of RAIs, [ChatGPT-4] could enable semiautomated or automated tracking of completion of the recommended follow-up, as part of initiatives to improve adherence," they concluded. "[It] could also be used to improve the inclusion of these details in the RAI in the first place."