ChatGPT-3.5 and ChatGPT-4 can produce differential diagnoses from transcribed radiologic findings of patient cases across a wide range of subspecialties, according to a study published October 15 in Radiology.
A team led by Shawn Sun, MD, of the University of California, Irvine, tested the models on 339 cases from the textbook Top 3 Differentials in Radiology. The group found that GPT-3.5 achieved an overall accuracy of 53.7% for the final diagnosis, while GPT-4 reached 66.1%. False statements remain an issue, however.
“The hallucination effect poses a major concern moving forward, but the significant improvement with the newer model, GPT-4, is encouraging,” Sun and colleagues noted.
The burgeoning interest in ChatGPT as a potentially useful tool in medicine highlights the need for systematic evaluations of its capabilities and limitations, according to the authors. In their study, the group evaluated the accuracy, reliability, and repeatability of differential diagnoses produced by ChatGPT from transcribed radiologic findings.
The investigators culled 339 cases across multiple radiologic subspecialties from Top 3 Differentials in Radiology. They converted the cases into standardized prompts and analyzed the responses for accuracy by comparing them with the final diagnosis and the top three differential diagnoses provided in the textbook, which served as the ground truth.
They then tested the algorithms’ reliability by identifying factually incorrect statements and fabricated references, and they measured test-retest repeatability by obtaining 10 independent responses from both algorithms for 10 cases in each subspecialty.
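The article describes these metrics only at a high level, and the authors' scoring code is not reproduced here. As a rough illustration, the sketch below shows one way the differential score and the average pairwise percent agreement could be computed in Python; the substring-match rubric and the exact-match agreement criterion are assumptions made for this example, not the study's actual implementation.

```python
from itertools import combinations

def differential_score(model_differentials, textbook_top3):
    """Hypothetical rubric: fraction of the textbook's top three
    differentials that appear anywhere in the model's differential list.
    (The study's actual scoring rubric may differ.)"""
    model_text = " ".join(d.lower() for d in model_differentials)
    hits = sum(dx.lower() in model_text for dx in textbook_top3)
    return hits / len(textbook_top3)

def pairwise_percent_agreement(repeated_answers):
    """Average pairwise percent agreement across repeated runs of the same
    prompt: the share of answer pairs that match. (Assumes exact string
    matching; the study may have matched diagnoses more leniently.)"""
    pairs = list(combinations(repeated_answers, 2))
    matches = sum(a.strip().lower() == b.strip().lower() for a, b in pairs)
    return matches / len(pairs) if pairs else 1.0

# Hypothetical example for a single case
print(differential_score(
    ["osteoid osteoma", "stress fracture", "osteomyelitis"],
    ["osteoid osteoma", "Brodie abscess", "stress fracture"],
))  # ~0.67: two of the three textbook differentials were mentioned

runs = ["osteoid osteoma"] * 7 + ["stress fracture"] * 3  # 10 repeated top answers
print(f"{pairwise_percent_agreement(runs):.0%}")  # 53%: 24 of 45 pairs agree
```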
Key findings included the following:
- In 339 radiologic cases, ChatGPT-3.5 and ChatGPT-4 achieved top-1 diagnosis accuracies of 53.7% and 66.1%, respectively (p < 0.001), and mean differential scores of 0.5 and 0.54 (p = 0.06).
- ChatGPT-3.5 hallucinated references 39.9% of the time and produced false statements in 16.2% of cases, whereas ChatGPT-4 hallucinated references 14.3% of the time (p < 0.001) and produced false statements in 4.7% of cases (p < 0.001).
- Repeatability testing of ChatGPT-4 showed a range of average pairwise percent agreement across subspecialties of 59% to 93% for the most likely diagnosis and a range of 26% to 49% for the top three differential diagnoses.
“ChatGPT produced accurate diagnoses from transcribed radiologic findings for a majority of cases; however, hallucinations and repeatability issues were present,” the researchers wrote.
In the end, ChatGPT-4’s false-statement rate of less than 5% is likely acceptable for most adjunct educational purposes, according to the researchers. Using these algorithms with expert oversight, while expecting a certain level of falsehoods, may be the best current strategy, they suggested.
“Most radiology trainees and physicians will be able to spot these false statements if it is understood that hallucinations occur despite the confident tone of the algorithm response,” the group concluded.
In an accompanying editorial, Paul Chang, MD, of the University of Chicago, suggested that while such feasibility studies with generative AI in radiology have been welcome and useful, they may have already served their primary role.
“If we are to effectively cross the chasm between proof-of-concept feasibility and real-world application, it is probably time to start addressing more challenging problems and hypotheses using more advanced approaches,” he wrote.