ChatGPT-4 outperformed human clinicians in estimating pretest and post-test disease probability after a negative test result in cases involving chest radiographs and mammograms, according to a research letter published December 11 in JAMA Network Open.
Investigators led by Adam Rodman, MD, from Beth Israel Deaconess Medical Center in Boston found, however, that ChatGPT-4 did not perform as well after positive test results.
“However, even if imperfect, probabilistic recommendations from [large language models] might improve human diagnostic performance through collective intelligence, especially if AI diagnostic aids can combine probabilistic, narrative, and heuristic approaches to diagnosis,” Rodman and colleagues wrote.
Imaging tests are a first-line tool for determining diagnoses, but the researchers underscored that health practitioners “often perform poorly” at estimating probabilities of disease before and after imaging exams are performed.
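To make the task concrete, estimating pretest and post-test probability amounts to a Bayesian update: a starting probability of disease is revised up or down depending on the test result and the test's accuracy. The sketch below is purely illustrative and is not drawn from the study; the sensitivity, specificity, and pretest figures are made-up example values.

```python
# Illustrative sketch (not from the study): updating a pretest probability
# to a post-test probability with Bayes' theorem, given a test's
# sensitivity and specificity. All numbers below are invented for illustration.

def post_test_probability(pretest: float, sensitivity: float,
                          specificity: float, result_positive: bool) -> float:
    """Return the post-test probability of disease after a test result."""
    if result_positive:
        # Likelihood ratio for a positive result: sensitivity / (1 - specificity)
        lr = sensitivity / (1 - specificity)
    else:
        # Likelihood ratio for a negative result: (1 - sensitivity) / specificity
        lr = (1 - sensitivity) / specificity
    pretest_odds = pretest / (1 - pretest)
    post_test_odds = pretest_odds * lr
    return post_test_odds / (1 + post_test_odds)

# Example: a hypothetical imaging test with 90% sensitivity and 80% specificity,
# applied to a patient with a 20% pretest probability of disease.
print(post_test_probability(0.20, 0.90, 0.80, result_positive=True))   # ~0.53
print(post_test_probability(0.20, 0.90, 0.80, result_positive=False))  # ~0.03
```

This is the kind of reasoning that both the surveyed clinicians and ChatGPT-4 were asked to perform, though respondents gave direct probability estimates rather than working through the formula.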
Medical researchers over the past year have experimented with using large language models to help with clinical workflows and assist with disease diagnosis, and previous reports suggest these models can understand clinical reasoning to an extent.
Rodman and co-authors explored the ability of one such model, ChatGPT-4, to perform probabilistic reasoning. They compared its performance against survey responses from 553 human clinicians across a range of specialties.
The clinicians performed probabilistic reasoning in a series of five cases with scientific reference standards. The researchers copied each case into ChatGPT-4 along with a prompt designed to make the AI commit to specific pretest and post-test probabilities.
The cases included the following exams: chest radiography for pneumonia, mammography for breast cancer, stress testing for coronary artery disease, urine culture for urinary tract infection, and a hypothetical test.
ChatGPT-4 made fewer errors in estimating pretest and post-test probability after a negative result in all five cases, the team reported. This held even when the model's median estimate differed more from the correct answer than the median human estimate did.
“For example, for the asymptomatic bacteriuria case, the median pretest probability was 26% for the [large language model] versus 20% for humans and the mean absolute error was 26.2 [5,240%] versus 32.2 [6,450%],” the researchers wrote.
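The mean absolute error cited in the quote is simply the average gap, in percentage points, between each respondent's probability estimate and the reference-standard answer, so a lower value means estimates cluster closer to the correct probability. A minimal sketch of the calculation, using made-up estimates and a made-up reference value rather than the study's data:

```python
# Illustrative sketch (not the study's code): mean absolute error of
# probability estimates, i.e. the average absolute difference between each
# respondent's estimate and the reference-standard answer, in percentage points.

estimates = [20.0, 35.0, 10.0, 50.0]   # hypothetical pretest estimates (%)
reference = 26.0                        # hypothetical reference-standard answer (%)

mae = sum(abs(e - reference) for e in estimates) / len(estimates)
print(mae)  # 13.75 percentage points
```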
ChatGPT-4 also had a narrower distribution of responses than humans, the team found. It demonstrated higher accuracy than humans in estimating post-test probability after a positive test result in two cases, breast cancer and the hypothetical test; performed comparably to clinicians in the chest radiograph pneumonia and cardiac ischemia cases; and was less accurate in the urinary tract infection case.
“Other than the fifth test case, when the AI formally solved a basic statistical reasoning question, the range of the [large language model’s] probability outputs in response to clinical vignettes appeared to be emergent from its stochastic nature,” the study authors wrote.
They also called for future studies to investigate the performance of large language models in more complex cases.
The full study can be found here.