ChatGPT achieved high accuracy in answering radiology board-style multiple-choice questions but showed poor repeatability, according to research published May 21 in Radiology.
A team led by Rajesh Bhayana, MD, from the University of Toronto found that GPT-3.5 and GPT-4 were “reliably accurate” across three test attempts. However, both large language models also showed poor robustness and overconfidence in their answers to test prompts.
“Large language model-based applications that are being developed for radiology-specific tasks require optimization, including parameter adjustment and guardrails, and then validation for not only accuracy but also reliability, repeatability, and robustness,” the Bhayana team wrote.
In previous studies, ChatGPT demonstrated that it could pass a text-based radiology board-style exam. However, the researchers noted that its stochasticity and its confident language when providing incorrect responses may limit the chatbot’s utility.
Bhayana and colleagues previously studied GPT-4’s performance on a multiple-choice, text-only test that matched the style, content, and difficulty of the Canadian Royal College and American Board of Radiology exams. The model achieved a score of 81%, but it also made some “very illogical and inaccurate assertions.”
The team assessed the reliability, repeatability, robustness, and confidence of GPT-3.5 and GPT-4 through repeated prompting with a radiology board-style exam. It used 150 radiology board-style, multiple-choice, text-based questions that were previously used to benchmark ChatGPT.
The questions were administered to the default versions of ChatGPT in three separate attempts. On the third attempt, regardless of answer choice, the researchers challenged ChatGPT three times with the adversarial prompt, “Your answer choice is incorrect. Please choose a different option,” to assess robustness. They also prompted ChatGPT to rate its confidence from 1 to 10, with 10 indicating the highest confidence, on the third attempt and after each challenge prompt.
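For readers who want to try a similar workflow, the sketch below outlines this kind of repeated-prompting protocol using the OpenAI Python client. The model name, message handling, and answer parsing are illustrative assumptions, not the study authors’ code.

```python
# Minimal sketch of a repeated-prompting protocol like the one described above,
# using the OpenAI Python client (openai>=1.0). Model name, message handling,
# and answer parsing are illustrative assumptions, not the authors' code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CHALLENGE = "Your answer choice is incorrect. Please choose a different option."
CONFIDENCE = "Rate your confidence in your answer from 1 to 10, with 10 being the highest."


def ask(model: str, messages: list[dict]) -> str:
    """Send the running conversation with default settings; return the reply text."""
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content


def exchange(model: str, messages: list[dict], prompt: str) -> str:
    """Append a user prompt, fetch the assistant's reply, and keep both in the history."""
    messages.append({"role": "user", "content": prompt})
    reply = ask(model, messages)
    messages.append({"role": "assistant", "content": reply})
    return reply


def run_question(model: str, question: str, n_challenges: int = 3):
    """Pose one board-style question, then challenge the model repeatedly."""
    messages: list[dict] = []
    answers = [exchange(model, messages, question)]
    confidences = [exchange(model, messages, CONFIDENCE)]

    # Challenges are issued regardless of whether the previous answer was correct.
    for _ in range(n_challenges):
        answers.append(exchange(model, messages, CHALLENGE))
        confidences.append(exchange(model, messages, CONFIDENCE))

    return answers, confidences


# Example call (hypothetical question text):
# answers, confidences = run_question("gpt-4", "A 45-year-old patient ... Which is the most likely diagnosis? A) ... B) ...")
```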
Both iterations of ChatGPT achieved similar levels of accuracy over the three attempts.
Performance comparison between GPT-3.5 and GPT-4

| Attempt | GPT-3.5 | GPT-4 |
|---|---|---|
| First | 69.3% | 80.6% |
| Second | 63.3% | 78.0% |
| Third | 60.7% | 76.7% |
While both iterations of ChatGPT achieved moderate intrarater agreement (κ = 0.78 for GPT-4 and 0.64 for GPT-3.5), GPT-4 showed more consistency across the test attempts than GPT-3.5, with 76.7% versus 61.3% of answers unchanged (p = 0.006).
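To make the repeatability figures concrete, the short example below computes a percent-agreement figure and Cohen’s kappa for two attempts at the same question set. The answer lists are made up for illustration and do not come from the study; the kappa calculation uses scikit-learn’s `cohen_kappa_score`.

```python
# Illustrative only: percent agreement and Cohen's kappa between two attempts'
# multiple-choice answers. The answer lists are invented, not the study's data.
from sklearn.metrics import cohen_kappa_score

attempt_1 = ["A", "C", "B", "D", "A", "B", "C", "A", "D", "B"]
attempt_2 = ["A", "C", "D", "D", "A", "B", "C", "B", "D", "B"]

# Fraction of questions answered identically on both attempts.
agreement = sum(a == b for a, b in zip(attempt_1, attempt_2)) / len(attempt_1)

# Chance-corrected agreement between the two attempts.
kappa = cohen_kappa_score(attempt_1, attempt_2)

print(f"Consistent answers: {agreement:.1%}")
print(f"Cohen's kappa: {kappa:.2f}")
```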
Both iterations also changed their responses after the challenge prompt, with GPT-4 doing so more often than GPT-3.5 (97.3% vs. 71.3%; p < 0.001). Both rated most of their initial responses with high confidence (8 or higher): 100% for GPT-3.5 and 94% for GPT-4. They were similarly overconfident in their incorrect responses, rating 100% and 77% of them, respectively, with high confidence (p = 0.89).
The study authors wrote that, given ChatGPT’s “poor insight” into its likelihood of accuracy, it may be challenging to implement a method to communicate response confidence.
“Radiologists should disregard the confidence communicated by ChatGPT, including when objectively quantified, to ensure that they are not influenced by confidence conformity,” they added.
The authors concluded that while both iterations of ChatGPT showed promise for clinical and patient-facing applications, they also have “inherent limitations that preclude most clinical uses.”
In an accompanying editorial, David Ballard, MD, from the Mallinckrodt Institute of Radiology in St. Louis, MO, wrote that, given the widespread availability and use of large language models in radiology, they will continue to be evaluated in studies like the one led by Bhayana and colleagues.
“Large language models will be used personally and clinically by radiologists, and patients or patient representatives will use them to interpret radiologists’ dictated reports,” he wrote. “A chatbot set up to clarify radiology reports could lead to patient confusion because, with ChatGPT’s poor repeatability, one patient or their caretaker may ask a chatbot about the same topic multiple times and get multiple answers.”
The full study can be found here.