About one-third of treatments recommended by ChatGPT-3.5 Turbo are at least partially non-concordant with National Comprehensive Cancer Network (NCCN) guidelines, a study published August 24 in JAMA Oncology found.
Researchers led by Shan Chen from Mass General Brigham and Harvard Medical School in Boston found that the chatbot mixed incorrect recommendations among correct ones, which may be difficult even for experts to detect.
"Developers should have some responsibility to distribute technologies that do not cause harm, and patients and clinicians need to be aware of these technologies' limitations," Chen and co-authors wrote.
The use of ChatGPT and other large language models has been explored in medical settings, with the chatbots being able to pass medical exams and communicate with patients. However, they have also been shown to cite fictional sources when "writing" medical papers and to supply incorrect information to patients.
Chen and colleagues sought to investigate ChatGPT-3.5 Turbo's performance for providing recommendations regarding breast, prostate, and lung cancer treatment that are concordant with NCCN guidelines.
They developed zero-shot prompt templates, which do not provide the model with examples of correct responses, to elicit treatment recommendations. These templates were used to create four prompt variations for each of 26 diagnosis descriptions, for a total of 104 prompts, which were then fed to the model.
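For illustration, here is a minimal sketch in Python of how such a prompt set could be assembled; the template wording and diagnosis strings below are hypothetical placeholders, not the study's actual prompts.

```python
# Hypothetical sketch of the prompt-generation step described in the study:
# four zero-shot template variants are filled in with each diagnosis
# description, yielding 4 x 26 = 104 prompts. The wording is illustrative
# only, not the authors' actual templates or diagnoses.

TEMPLATES = [
    "What is the recommended treatment for {dx}?",
    "How should {dx} be treated?",
    "Which treatments does the NCCN recommend for {dx}?",
    "List treatment options for a patient with {dx}.",
]

# Placeholder diagnosis descriptions; the study used 26 covering
# breast, prostate, and lung cancer.
diagnoses = [f"diagnosis description {i}" for i in range(1, 27)]

prompts = [t.format(dx=dx) for dx in diagnoses for t in TEMPLATES]
assert len(prompts) == 104  # 26 diagnoses x 4 prompt variations
```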
The team compared the chatbot's recommendations against the 2021 NCCN guidelines, since ChatGPT's knowledge cutoff is September 2021. The output of each prompt was scored on five criteria, for a total of 520 scores. Concordance was assessed by three board-certified oncologists, with majority rule determining the final score.
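As a rough illustration of that majority-rule step (the label values here are hypothetical, not the study's actual scoring scheme), the aggregation could look like this:

```python
# Hypothetical sketch of majority-rule aggregation across three annotators.
# Each oncologist assigns a score to an output for one criterion; the score
# chosen by at least two of the three is taken as the final score.
from collections import Counter

def majority_score(annotator_scores):
    score, count = Counter(annotator_scores).most_common(1)[0]
    return score if count >= 2 else None  # None if all three disagree

# Example: two of three oncologists judged this criterion concordant.
print(majority_score(["concordant", "concordant", "non-concordant"]))
```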
The researchers found that all three oncologists agreed on 322 of the 520 (61.9%) ChatGPT-3.5 Turbo scores. Disagreements mostly arose when the output was unclear, such as when it did not specify which of several treatments should be combined. The researchers noted, however, that disagreements could also reflect differing interpretations of the guidelines among the oncologists.
The team also found that for nine of the 26 (34.6%) diagnosis descriptions, all four prompts yielded the same scores on each of the five scoring criteria. Additionally, ChatGPT gave at least one recommendation for 102 of the 104 (98%) prompts, and every output with a recommendation included at least one treatment concordant with NCCN guidelines. However, 35 of these 102 outputs (34.3%) also recommended one or more non-concordant treatments.
Finally, the investigators found that 13 of the 104 (12.5%) outputs contained hallucinations, meaning recommendations that were not part of any NCCN-recommended treatment. They wrote that hallucinations were primarily recommendations for localized treatment of advanced disease, targeted therapy, or immunotherapy.
The study authors wrote that based on these results, clinicians should advise patients that such chatbots are not a reliable source of cancer treatment information.
In an accompanying editorial, Atul Butte, MD, PhD, from the University of California, San Francisco, wrote that the "real" potential of large language models and AI is to be trained on patient, clinical, and outcomes data from the "very best" centers and then to deliver these digital tools to patients. However, he pointed out that these algorithms will need to be "carefully" monitored as they make their way into health systems.
"It is time to stop thinking of AI as 'nice to have' pilot projects and start realizing that we need AI as a 'scalable privilege' for all patients," Butte wrote.