Large language models help patients understand radiology reports

Summaries generated by large language models could help patients better understand their radiology reports, though human oversight is needed, according to research published July 1 in the Journal of the American College of Radiology.

A team led by Kayla Berigan, MD, from the University of Vermont Medical Center in Burlington, found that patients who received patient-friendly reports, with or without a large language model-generated summary, were more likely to understand their reports than those who received the standard of care.

“This small pilot study supports a rationale for patient-friendly reports, including those generated by large language models,” the Berigan team wrote.

The 21st Century Cures Act mandated immediate patient access to radiology reports. Previous studies suggest that patients prefer immediate access to results via the patient portal but may not understand them, and that they prefer lay-language or summary versions.

Since large language models such as ChatGPT and Gemini (formerly Google Bard) have become publicly available, patients have sought answers to medical questions from these chatbots. However, the models' answers may contain incorrect information or conflict with radiologists' recommendations.

Berigan and colleagues studied the impact of patient-friendly reports, with or without large language model-generated summaries, on patient comprehension in a prospective clinical setting.

The study included data from 99 patients, each assigned to one of four cohorts: standard of care, Gemini-generated summary, Scanslated report, or a combined approach with a Gemini-generated summary plus a Scanslated report. All groups had access to the standard report in MyChart (Epic Systems) and completed surveys gauging their understanding.

The researchers found a significant group effect on level of understanding (p = 0.04). Patients who received Scanslated reports or the combined approach reported a higher level of understanding than those who received the standard of care, with odds ratios of 3.09 and 5.02, respectively.

However, the team observed no significant group effect on patients' need to search the contents of their reports online (p = 0.07).

Of the 51 large language model-generated summaries provided to patients, 80.4% (n = 41) required editing before release, "usually" to remove suggestions of prognosis, treatment, or causality, the study authors noted.

The authors highlighted that this finding underscores the need for human oversight before widespread clinical deployment.

“Future work should measure impact in larger, more diverse populations, expand to various clinical settings, evaluate potential return on investment, and refine large language model performance,” they wrote.

