Large language models simplify radiology report impressions

Mar 26, 2024

Large language models (LLMs) can significantly simplify radiology report impressions, enhancing readability for patients, according to research published March 26 in Radiology.

In a study involving 750 radiology reports, a team of researchers from Yale University tested four different LLMs, including ChatGPT, Bard, and Bing, and found that all four were able to significantly simplify report impressions. Model performance did differ, however, based on the wording of the prompt used.

“Our study highlights how radiology reports, which are complex medical documents that implement language and style above the college graduate reading level, can be simplified by LLMs,” wrote the team led by Rushabh Doshi. “Patients may use publicly available LLMs at home to simplify their reports, or medical practices could adapt automatic simplification into their workflow.”

Because the complex medical terminology in radiology reports can confuse patients or induce anxiety, the researchers sought to assess whether LLMs could make these reports more readable. They gathered 150 CT, 150 MRI, 150 ultrasound, and 150 diagnostic mammography reports from the Medical Information Mart for Intensive Care (MIMIC-IV) database.

Next, they queried ChatGPT-3.5, ChatGPT-4, Bing (Microsoft), and Bard (Google; now known as Gemini) using three different prompts:

“Simplify this radiology report.”
“I am a patient. Simplify this radiology report.”
“Simplify this radiology report at the 7th grade level.”
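
As a rough illustration of this setup, the three prompts can each be paired with a report impression before being sent to a model. The sketch below is not the study's code: the function name `build_queries`, the blank-line separator, and the sample impression are all hypothetical, and no particular chatbot API is assumed.

```python
# The three prompts quoted in the article.
PROMPTS = [
    "Simplify this radiology report.",
    "I am a patient. Simplify this radiology report.",
    "Simplify this radiology report at the 7th grade level.",
]


def build_queries(impression: str) -> list[str]:
    """Return one query per prompt, with the impression appended.

    Joining prompt and impression with a blank line is an assumption;
    the article does not describe the exact formatting.
    """
    text = impression.strip()
    return [f"{prompt}\n\n{text}" for prompt in PROMPTS]


if __name__ == "__main__":
    # Made-up impression, for illustration only.
    sample = "No acute intracranial hemorrhage or mass effect."
    for query in build_queries(sample):
        print(query)
        print("---")
```

Each resulting string would then be submitted to each of the four models, giving one simplified output per model and prompt.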

Each prompt was followed by the radiology report impression. Compared with the original reports, the output of all four models was more readable for each of the prompts (p < 0.001). However, depending on the prompt, some models produced simpler output than others (p < 0.05).
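
The article does not say which readability measure was used. One common choice for grade-level comparisons of this kind is the Flesch-Kincaid grade level, which is assumed here purely for illustration. The sketch below uses a crude vowel-group syllable counter, so its scores are approximate, and the function names are hypothetical.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count runs of vowels, ignoring a silent final 'e'."""
    word = word.lower()
    if len(word) > 2 and word.endswith("e") and not word.endswith("le"):
        word = word[:-1]
    return max(1, len(re.findall(r"[aeiouy]+", word)))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


if __name__ == "__main__":
    # Both sentences are made up for illustration.
    original = "Heterogeneous hypoattenuating hepatic lesion, indeterminate."
    simplified = "There is a spot on the liver."
    print(round(flesch_kincaid_grade(original), 1))
    print(round(flesch_kincaid_grade(simplified), 1))
```

A lower score corresponds to a lower estimated U.S. school grade, so a successful simplification should score below the original impression.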

The researchers also observed that prompts with additional context (i.e., “I am a patient” and “simplify at a 7th-grade level”) yielded better performance across all four models. Their findings should not be considered an endorsement of any particular LLM, as each has advantages and disadvantages, according to the authors.

“Careful fine-tuning and customization for each LLM may ensure optimal simplification, while maintaining the clinical integrity of reports,” the authors wrote. “A longitudinal study design, as well as a more diverse data set, are recommended to improve the validity and generalizability of these results.”

In an accompanying editorial, Amir Ali Rahsepar, MD, of Northwestern University in Chicago, noted the need for careful development and oversight of these models. Expert medical advice should be incorporated during development to ensure reliable, patient-centered outputs, he added.

“A large, multicenter diverse study is necessary to pinpoint areas requiring LLM fine-tuning: to identify suitable prompts for generating accurate, simplified, and empathetic reports; and to establish specific guidance where needed,” Rahsepar wrote. “Such efforts can translate the promise of LLMs into tangible benefits for patient understanding, communication, and ultimately, health care outcomes.”
