An AI model matched the performance of a board-certified radiologist and outperformed a radiology resident in recommending follow-up imaging from routine radiology reports, according to a study published April 16 in Nature: Scientific Reports.
The results "support [the model's] role as decision support for standardized, guideline-aligned follow-up," wrote a team led by Kenan Kaya, MD, of the University of Cologne in Germany.
Follow-up imaging recommendations can vary across radiologists, despite established guidelines, the group explained. Kaya and colleagues evaluated whether a large language model (LLM) -- GPT-4o -- could standardize follow-up timing and modality selection from radiology reports.
They conducted a study that included a random sample of 100 CT/MRI cases drawn equally from four oncologic subspecialties -- head and neck, liver, lung, and pancreas -- from two German academic medical centers. GPT-4o and two radiologist readers (R1, resident; R2, board-certified) generated follow-up recommendations from report text. Two senior radiologists then assessed all reader results (blinded to source), rating completeness (i.e., whether all pathologies warranting follow-up were addressed), modality appropriateness, timing accuracy, and overall quality on a five-point scale.
The group reported that GPT-4o generated follow-up recommendations with overall quality comparable to that of the experienced radiologist and superior to that of the trainee, with high completeness and generally appropriate follow-up timing and modality.
Performance of GPT-4o compared to radiologist readers for recommending follow-up imaging from routine radiology reports

| Measure | Resident reader (R1) | Board-certified reader (R2) | GPT-4o |
| --- | --- | --- | --- |
| Median global quality (five-point rating scale)* | 4 | 4 | 4 |
| Relative treatment effect (value between 0 and 1) | 0.43 | 0.51 | 0.56 |
| Correctness of follow-up timing | 75% | 90% | 96% |
| Completeness of follow-up | 91% | 80% | 92% |

*GPT-4o exceeded R1 (p < 0.01) but did not differ significantly from R2 (p = 0.06).
The team found no significant differences among readers in appropriateness of imaging modality. GPT-4o showed its strongest results in lung imaging findings, where timing correctness reached 100%, as well as in pancreas and liver cases, while head and neck cases yielded lower accuracy for all raters.
"Our findings suggest that GPT-4o can reliably identify reportable pathologies that warrant follow-up and map them to appropriate guideline frameworks to propose modality and interval … [and] provide evidence supporting further evaluation and potential implementation of GPT-4o as a clinical decision support tool in medical workflows," the authors concluded.