Sunday, November 26 | 1:30 p.m.-1:40 p.m. | S4-SSCH02-4 | Room N228
While ChatGPT and Google Bard were both able to answer nonexpert questions about lung cancer prevention, screening, and terminology commonly used in radiology reports, ChatGPT won this battle of large language models (LLMs), according to this study.
Forty questions were created, with two radiologists assisting for accuracy, to compare ChatGPT and Bard. Each model answered every question three times, and the two LLMs were evaluated for both accuracy and consistency, with consistency defined as agreement among the three answers a model gave to a question, regardless of whether the concept conveyed was correct or incorrect.
In comparing the model outputs, UCLA Health cardiothoracic imaging fellow Amir Ali Rahsepar, MD, and team found that ChatGPT's answers were consistent 90% of the time (36 of 40 questions), while Bard's answers were consistent only 57.5% of the time (23 of 40).
Of ChatGPT's 120 answers, 70.8% were correct (85), 11.7% were partially correct (14), and 17.5% were incorrect (21). Bard answered only 97 of the 120 questions; out of the 120, 51.7% were correct (62), 9.2% were partially correct (11), and 20% were incorrect (24), according to Rahsepar's findings.
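For readers who want to check the arithmetic, the sketch below is a purely illustrative Python snippet (not the study's code) that reproduces the reported rates from the raw counts, assuming the accuracy percentages are taken over each model's 120 total prompts and the consistency rate over the 40 questions.

```python
# Illustrative only: recomputing the reported consistency and accuracy rates.
# Counts come from the reported results; variable names are hypothetical.
from collections import Counter

# Consistency: a question counts as consistent if all three answers a model
# gave to it conveyed the same concept, whether right or wrong.
def consistency_rate(consistent_questions: int, total_questions: int) -> float:
    return 100 * consistent_questions / total_questions

print(f"ChatGPT consistency: {consistency_rate(36, 40):.1f}%")  # 90.0%
print(f"Bard consistency:    {consistency_rate(23, 40):.1f}%")  # 57.5%

# Accuracy: 40 questions x 3 attempts = 120 graded responses per model.
def accuracy_breakdown(counts: Counter, total: int) -> dict:
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

chatgpt = Counter(correct=85, partially_correct=14, incorrect=21)  # 120 answers
bard = Counter(correct=62, partially_correct=11, incorrect=24)     # 97 answered of 120

print(accuracy_breakdown(chatgpt, 120))  # {'correct': 70.8, 'partially_correct': 11.7, 'incorrect': 17.5}
print(accuracy_breakdown(bard, 120))     # {'correct': 51.7, 'partially_correct': 9.2, 'incorrect': 20.0}
```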
Although the use of AI offers new possibilities, according to Rahsepar, it also presents challenges that must be carefully reviewed by experts to prevent undue burden on patients and healthcare workers.
“It is essential that LLM developers be aware of the complexity of healthcare decision-making and implement serious guardrails for all healthcare-related interactions,” Rahsepar wrote.
Learn more at this Sunday afternoon session.