ChatGPT, Google Bard simplify lung cancer screening answers, but at high reading level


Large language models like ChatGPT and Google Bard can simplify baseline AI responses to patient questions on lung cancer screening, but not quite to an optimal reading level, suggest findings published June 21 in the American Journal of Roentgenology.

Researchers led by Hana Haver, MD, from the University of Maryland in Baltimore found that while ChatGPT's baseline responses to patient questions were written at about a 12th-grade level, asking the large language models to simplify them produced responses at roughly a junior-high to high-school-sophomore reading level.

"However, for all ... models, simplified responses' overall mean readability remained too difficult for the average adult patient," Haver and colleagues wrote.

Radiologists and patients alike are showing interest in the potential of large language models, though in different ways. Radiologists see them as clinical assistants for patient communication, while patients are submitting health-related questions to the models directly; drawing on information from various sources, the models generate answers for those patients.

While early diagnosis of lung cancer leads to better health outcomes, less than 6% of eligible U.S. residents have reportedly undergone screening via low-dose CT. Previous reports indicate that low health literacy may play a role in this trend.

Meanwhile, the average U.S. adult reads at an eighth-grade level, and the American Medical Association recommends that patient health materials be written at a sixth-grade level or lower.

Haver and co-authors wanted to explore the use of three large language models for simplifying generated responses to common questions about lung cancer and lung cancer screening. They first had ChatGPT answer such questions to produce baseline responses, then used ChatGPT, ChatGPT-4, and Google Bard to simplify those baseline responses into new ones.
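The paper does not publish the prompts or code behind this step, but the general workflow, feeding a baseline answer back to a chat model with an instruction to simplify it, can be sketched roughly as follows. The model name, prompt wording, and helper function are illustrative assumptions rather than the authors' actual method, using the OpenAI Python SDK as one example interface.

```python
# Illustrative sketch only: the study does not publish its prompts or code.
# The prompt text and model choice below are assumptions for demonstration.
from openai import OpenAI  # OpenAI Python SDK v1.x

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def simplify_response(baseline_text: str, model: str = "gpt-4") -> str:
    """Ask a chat model to rewrite a baseline answer in plainer language."""
    prompt = (
        "Rewrite the following answer about lung cancer screening so that "
        "an average adult can understand it. Keep the medical content accurate.\n\n"
        f"{baseline_text}"
    )
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content


# Example usage with a hypothetical baseline response
baseline = (
    "Low-dose computed tomography is recommended annually for eligible "
    "high-risk adults to detect lung cancer at an early stage."
)
print(simplify_response(baseline))
```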

Three fellowship-trained cardiothoracic radiologists independently reviewed both the baseline and simplified responses. To measure readability, the researchers used the Flesch Reading Ease scale, which ranges from 0 to 100, with higher scores indicating easier reading. They also used an online tool to measure readability at the U.S. grade level.
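For context, the Flesch Reading Ease score and the related Flesch-Kincaid grade level are computed from average sentence length and average syllables per word. The sketch below applies the standard published formulas with a crude syllable heuristic; it is only an approximation for illustration and is not the online tool the researchers used.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable count: runs of vowels, minus a trailing silent 'e'."""
    word = word.lower()
    groups = re.findall(r"[aeiouy]+", word)
    count = len(groups)
    if word.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_scores(text: str) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid grade level) for a text."""
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0, 0.0
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / sentences
    syllables_per_word = syllables / len(words)
    ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return ease, grade


# Example: short, plain sentences score high on reading ease and low on grade level
ease, grade = flesch_scores("A CT scan takes pictures of your lungs. It can find cancer early.")
print(f"Reading ease: {ease:.1f}, grade level: {grade:.1f}")
```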

The team found that the average reading ease score and grade-level readability for ChatGPT's baseline responses were 49.7 and 12.6, respectively. However, both improved with the simplified responses.

Comparison of large language model performance in simplifying responses

Metric               ChatGPT   ChatGPT-4   Bard
Reading ease score   62        68          71
U.S. grade level     10        9.6         8.2

*All data achieved statistical significance.

The team also reported that Bard and ChatGPT-4 produced significantly simpler responses than ChatGPT (p = 0.003 and p = 0.02, respectively). However, the difference between GPT-4 and Bard was not statistically significant (p = 0.14).

Additionally, the researchers assessed the proportion of questions with "adequately" readable responses based on a reading ease score of 60 or higher. They found that this was 21% (4/19) for baseline responses. For simplified responses, the proportion was 58% (11/19) for ChatGPT, 79% (15/19) for GPT-4, and 95% (18/19) for Bard.

Based on a threshold of eighth-grade reading level or lower, the researchers also found that the proportion was 5% (1/19) for baseline responses. For simplified responses, this was 16% (3/19) for ChatGPT, 32% (6/19) for GPT-4, and 53% (10/19) for Bard.

While the simplified responses showed improvement in reading ease and overall readability, the researchers suggested that large language models still have a way to go before they can answer patient questions at a reading level appropriate for the average adult.

They added that since some of the simplified responses were deemed clinically inappropriate, "physician oversight remains critical if using the large-language models for this purpose."

