An AI model matched the performance of a board-certified radiologist and outperformed a radiology resident in recommending follow-up imaging from routine radiology reports, according to a study published April 16 in Nature: Scientific Reports.
The results "support [the model's] role as decision support for standardized, guideline-aligned follow-up," wrote a team led by Kenan Kaya, MD, of the University of Cologne in Germany.
Follow-up imaging recommendations can vary across radiologists, despite established guidelines, the group explained. Kaya and colleagues evaluated whether a large language model (LLM) -- GPT-4o -- could standardize follow-up timing and modality selection from radiology reports.
They conducted a study that included a random sample of 100 CT/MRI cases drawn equally from four oncologic subspecialties -- head and neck, liver, lung, and pancreas -- from two German academic medical centers. GPT-4o and two radiologist readers (R1, resident; R2, board-certified) generated follow-up recommendations from report text. Two senior radiologists then assessed all reader results (blinded to source), rating completeness (i.e., whether all pathologies warranting follow-up were addressed), modality appropriateness, timing accuracy, and overall quality on a five-point scale.
The group reported that GPT-4o generated follow-up recommendations with overall quality comparable to that of the experienced radiologist and superior to that of the trainee, with high completeness and generally appropriate follow-up timing and modality.
Performance of GPT-4o compared to radiologist readers for recommending follow-up imaging from routine radiology reports

| Measure | Resident reader (R1) | Board-certified reader (R2) | GPT-4o |
| --- | --- | --- | --- |
| Median global quality (five-point rating scale)* | 4 | 4 | 4 |
| Relative treatment effect (value between 0 and 1) | 0.43 | 0.51 | 0.56 |
| Correctness of follow-up timing | 75% | 90% | 96% |
| Completeness of follow-up | 91% | 80% | 92% |

*GPT-4o exceeded R1 (p < 0.01) but did not differ significantly from R2 (p = 0.06).
The team found no significant differences among readers in appropriateness of imaging modality. GPT-4o showed its strongest results in lung imaging findings, where timing correctness reached 100%, as well as in pancreas and liver cases, while head and neck cases yielded lower accuracy for all raters.
"Our findings suggest that GPT-4o can reliably identify reportable pathologies that warrant follow-up and map them to appropriate guideline frameworks to propose modality and interval … [and] provide evidence supporting further evaluation and potential implementation of GPT-4o as a clinical decision support tool in medical workflows," the authors concluded.