A large language model (LLM)-based educational tool could help patients better understand complex imaging terms, according to findings published June 10 in the Journal of the American College of Radiology.
The novel tool, named RadGPT, earned high ratings from radiologists for the concept-based explanations it generated about imaging findings, wrote a team led by Sanna Herwald, MD, PhD, from Stanford Health Care in California.
“RadGPT creates consistently high-quality LLM-generated explanations and question-and-answer pairs that are tailored to individual radiology reports,” Herwald and colleagues wrote.
The Cures Act Final Rule requires that patients have real-time access to their radiology reports. However, patients may have a hard time understanding the technical language that these reports contain.
Patients also continue to seek medical advice from LLM-based chatbots such as OpenAI’s ChatGPT and Google’s Gemini. Previous studies of these models as patient-education tools report mixed results: some show that they can make imaging reports easier for patients to read, while others suggest patients are better served by educational materials on imaging societies’ websites.
The Herwald team developed RadGPT with the goal of integrating concept extraction with an LLM (ChatGPT-4) to help patients understand their radiology reports.
For the study, RadGPT generated 150 concept explanations and 390 question-and-answer pairs from 30 radiology report impressions dated between 2012 and 2020. The reports came from CT, MRI, and x-ray exams.
From the concepts extracted from each impression, the tool created concept-based explanations and concept-based question-and-answer pairs, with questions generated using either a fixed template or the LLM.
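The paper does not publish RadGPT's code, but the workflow it describes (take an extracted concept plus the report impression, then prompt an LLM for an explanation, a fixed-template question, and an LLM-generated question) can be illustrated with a minimal sketch. The snippet below is a hypothetical approximation, assuming the OpenAI Python client (openai>=1.0) and made-up prompts, function names, and example text; it is not the authors' implementation.

```python
# Illustrative sketch only -- RadGPT's actual implementation is not public.
# Assumes the OpenAI Python client (openai>=1.0) with an API key in the environment.
from openai import OpenAI

client = OpenAI()


def explain_concept(concept: str, impression: str) -> str:
    """Ask the LLM for a patient-friendly explanation of one extracted concept."""
    prompt = (
        f"A radiology report impression reads:\n{impression}\n\n"
        f"Explain the term '{concept}' to a patient in plain, non-alarming language."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def template_question(concept: str) -> str:
    """Fixed-template question: the same wording for every concept."""
    return f"What does '{concept}' mean in my radiology report?"


def llm_question(concept: str, impression: str) -> str:
    """LLM-generated question tailored to the concept and report context."""
    prompt = (
        f"Given this report impression:\n{impression}\n\n"
        f"Write one question a patient might ask about '{concept}'."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Example usage with a made-up impression; in the described pipeline,
# concepts would come from a separate concept-extraction step.
impression = "1. Stable 4 mm pulmonary nodule in the right lower lobe."
for concept in ["pulmonary nodule"]:
    print(explain_concept(concept, impression))
    print(template_question(concept))
    print(llm_question(concept, impression))
```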
One board-certified radiologist and four radiology residents rated the quality of the generated material using a standardized rubric; explanation and answer quality was measured on a five-point Likert scale.
The team reported the following findings:
The radiologists on average rated the RadGPT-generated concept-level explanations at 4.8 out of 5, with 95% of concept explanations receiving an average rating of 4 or higher.
The radiologists gave half of all concept explanations the highest possible rating of 5, while 5% of concept-level explanations had an average rating of less than 4.
On a three-point scale, LLM-generated questions were on average rated significantly higher in quality than template-based questions (2.9 vs. 2.6, p < 0.001 from a mixed-effects model).
On the five-point Likert scale, the quality of answers to LLM-generated questions was rated significantly higher on average than the quality of answers to template-based questions, although the absolute difference was small (4.7 vs. 4.6, p = 0.001 from a mixed-effects model).
RadGPT also generated three question-and-answer pairs tailored to each individual radiology report, without being limited to a predesignated concept. On the three-point scale, the 90 report-level LLM-generated questions received an overall average rating of 3, the highest possible score, and 92% of these questions received the highest rating from all raters, according to the study.
Finally, the researchers reported high inter-rater agreement for all types of RadGPT-generated material: a Fleiss’ kappa of 0.66 with 51% complete rater agreement across all answer and explanation ratings, and a Fleiss’ kappa of 0.65 with 71% complete rater agreement across all question ratings.
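For context on the agreement statistic, Fleiss’ kappa summarizes how consistently multiple raters assign the same score to the same item, with values near 1 indicating strong agreement beyond chance. The short sketch below, which uses statsmodels and made-up ratings rather than the study’s data, shows how such a value is typically computed.

```python
# Minimal illustration of Fleiss' kappa on hypothetical ratings (not the study's data).
# Assumes numpy and statsmodels are installed.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical ratings: 6 explanations scored by 5 raters on a 1-5 Likert scale.
ratings = np.array([
    [5, 5, 5, 4, 5],
    [4, 4, 5, 4, 4],
    [5, 5, 5, 5, 5],
    [3, 4, 4, 4, 3],
    [5, 4, 5, 5, 5],
    [4, 4, 4, 4, 4],
])

# Convert the items-by-raters matrix into per-item category counts, then compute kappa.
table, _ = aggregate_raters(ratings)
print(f"Fleiss' kappa: {fleiss_kappa(table):.2f}")
```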
The findings support RadGPT as a safe tool with the potential to improve patient engagement and health outcomes, the study authors wrote. They added that RadGPT can do so “without increasing the burden of healthcare workers and promote health equity regardless of patients’ education level or medical literacy.”
The full study can be read here.