ChatGPT-4 shows improvement over GPT-3.5 in text-based imaging cases

ChatGPT-4 has greater diagnostic accuracy than ChatGPT-3.5 when assessing text-based medical imaging cases, a study published January 16 in Radiology found.

Researchers led by David Li, MD, from the London Health Sciences Centre in London, Ontario, Canada, found that the newer version of ChatGPT was superior across all subspecialties and snapshots for the journal’s “Diagnosis Please” cases.

“Our study highlights the pressing need for more robust and continuous large language model monitoring systems before clinical deployment,” Li and colleagues wrote.

Large language models such as ChatGPT have drawn radiologists’ interest over the past year for their ability to comprehend and generate human-like text. However, the researchers pointed out that it remains to be seen how upgrades to ChatGPT affect its performance in providing diagnoses for radiology cases.

The team investigated the respective diagnostic accuracy of ChatGPT-3.5 and ChatGPT-4 in solving text-based Radiology “Diagnosis Please” cases, selecting 287 cases published between 1998 and 2023. It assessed the diagnostic accuracy of March and June 2023 snapshots of both models, using the top five differential diagnoses generated from text inputs of patient history, imaging findings, and both combined. The imaging findings had originally been characterized by radiologists.
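For readers curious how such text inputs can be turned into a ranked differential, the following is a minimal sketch, not the study’s actual pipeline or prompts. It assumes the OpenAI Python client (openai 1.x) with an API key in the environment; the model names, function name, and prompt wording are illustrative assumptions.

    # Illustrative sketch only; the study's prompts and evaluation code are not published here.
    # Assumes the OpenAI Python client (openai>=1.0) and OPENAI_API_KEY set in the environment.
    from openai import OpenAI

    client = OpenAI()

    def top_five_differential(history: str, findings: str, model: str = "gpt-4") -> str:
        """Ask the model for a ranked five-item differential from case text."""
        prompt = (
            "You are assisting with a radiology teaching case.\n"
            f"History: {history}\n"
            f"Imaging findings: {findings}\n"
            "List the five most likely diagnoses, ranked, one per line."
        )
        response = client.chat.completions.create(
            model=model,  # e.g., "gpt-3.5-turbo" or "gpt-4" to compare versions
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    # Hypothetical example case text:
    # print(top_five_differential(
    #     "55-year-old with progressive headache",
    #     "Enhancing dural-based mass along the cerebral convexity",
    # ))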

For the March snapshots, ChatGPT-4’s overall accuracy was 19.8 percentage points higher than ChatGPT-3.5’s. For the June snapshots, ChatGPT-4’s accuracy was 11 percentage points higher than that of ChatGPT-3.5.

Between the March and June snapshots, the team reported a decrease in accuracy of 5.92 percentage points for ChatGPT-4 and an increase of 2.79 percentage points for ChatGPT-3.5.

The researchers called this an “unexpected” finding, writing that it echoes similar reports of ChatGPT-4’s performance varying between snapshots.

“This variability could stem from optimization on competing metrics, such as safety or inference speed, potentially leading to instability in real-world performance,” Li and co-authors wrote.

The researchers also assessed the performance of the large language models across 10 radiology subspecialties, using breast imaging as the reference. They found that head and neck imaging was the only subspecialty significantly associated with greater accuracy in favor of ChatGPT-4.

Finally, across all subspecialties and snapshots, diagnostic accuracy was 17.3% higher for ChatGPT-4 than for ChatGPT-3.5.

The study authors suggested that despite differences between their study’s experimental setting and clinical practice, large language models such as ChatGPT “could potentially serve as a decision support tool in future diagnostic workflows, particularly for creatively broadening differential diagnoses under supervision by radiologists.”

The full study can be found here.
