GPT-4 could prove useful for proofreading CT reports

GPT-4 showed strong performance in identifying factual errors in radiology reports but struggled to prioritize clinically significant findings, according to a study published January 28 in Radiology.

The results are from experiments involving more than 10,000 head CT reports and ultimately illustrate the chatbot’s strengths and weaknesses, noted corresponding author Dukyong Yoon, MD, PhD, of Ajou University in Suwon, South Korea, and colleagues.

“Recognizing these strengths and limitations, GPT-4 can serve as a tool for proofreading radiology reports, thereby enhancing radiologist accuracy and efficiency in generating reports,” the group wrote.

Escalating demand for imaging has increased radiologists’ workloads, contributing to burnout and to more errors in reports, and researchers are now exploring whether AI models can help catch and correct those errors. Given that GPT-4 has shown potential in radiology for generating impressions, data mining, formatting, labeling, and detecting speech recognition errors, the authors hypothesized that it may also have potential for proofreading reports.

Thus, to test its feasibility for the task, the group first optimized GPT-4 using 100 unaltered head CT reports and 100 reports into which interpretive and factual errors had been deliberately introduced. Interpretive errors included omissions of findings, while factual errors included discrepancies in the numerical measurements of findings.

Next, using 400 reports containing undetected errors, the researchers evaluated GPT-4’s performance in detecting errors and compared it with that of eight human readers. Finally, they tested GPT-4 on 10,000 unaltered reports that physicians had deemed error-free and analyzed the false-positive results.
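As an illustration of the general approach (and not the study’s actual prompt, model settings, or pipeline), a single report could be submitted to GPT-4 for proofreading through the OpenAI chat completions API roughly as sketched below; the prompt wording, model name, helper function, and sample report are assumptions made for the example.

    # Minimal sketch: proofreading one head CT report with GPT-4.
    # The prompt, model name, and sample report are illustrative assumptions,
    # not the study's actual protocol.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    SYSTEM_PROMPT = (
        "You are proofreading a head CT radiology report. "
        "List any interpretive errors (e.g., omitted findings) and factual errors "
        "(e.g., inconsistent measurements), and suggest a corrected sentence for each."
    )

    def proofread_report(report_text: str) -> str:
        """Return GPT-4's error analysis for a single report (illustrative only)."""
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": report_text},
            ],
            temperature=0,  # keep output as deterministic as possible
        )
        return response.choices[0].message.content

    sample = "Findings: 5-mm hyperdense lesion in the right frontal lobe. Impression: 15-mm lesion noted."
    print(proofread_report(sample))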

GPT-4 demonstrated “commendable performance” in error detection, the authors found, with a sensitivity of 0.84 for interpretive errors and 0.89 for factual errors. Compared with GPT-4, human readers had lower sensitivity for factual errors (0.33 to 0.69 vs. 0.89) and took longer to review each report (82 to 121 seconds vs. 16 seconds).

In addition, among the 10,000 error-free reports, GPT-4 detected 96 errors, with a low positive predictive value of 0.05; even so, 14% of the false-positive responses were potentially beneficial, the authors noted.
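For readers less familiar with the metrics, sensitivity and positive predictive value are simple ratios over flagged and missed errors. The short Python sketch below uses hypothetical counts, chosen only to mirror the reported rates rather than taken from the study’s actual tallies, to show how they are computed.

    # Hypothetical counts (not the study's data), chosen to mirror the reported rates.
    true_positives = 89     # reports with a real error that GPT-4 flagged
    false_negatives = 11    # reports with a real error that GPT-4 missed
    false_positives = 1691  # error-free passages that GPT-4 flagged anyway

    sensitivity = true_positives / (true_positives + false_negatives)
    ppv = true_positives / (true_positives + false_positives)

    print(f"Sensitivity: {sensitivity:.2f}")  # 0.89
    print(f"PPV: {ppv:.2f}")                  # 0.05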

“OpenAI’s GPT-4 could detect, reason, and revise errors in head CT reports, demonstrating its feasibility as a tool for proofreading radiology reports,” the group wrote.

Ultimately, the impression section of a radiology report is not merely a summary but also the radiologist’s assessment of the relative importance of the findings, the authors noted. While GPT-4 excelled at checking factual consistency, it failed to prioritize the clinical significance of multiple findings, they added. As the number of impressions increased, GPT-4 struggled to organize these elements coherently, which ultimately led to lower detection sensitivity and a higher rate of false-positive results, according to the authors.

Nonetheless, the study highlights the possibility of synergistic collaboration between radiologists and large language models like GPT-4 in report proofreading, the group suggested.

“Intervention by GPT-4 in the early stages of proofreading could greatly lessen the workload of both the trainee and the attending radiologist,” the researchers concluded.

