ChatGPT shows potential for diagnosing nuclear medicine cases, yet needs further development before it can be implemented in practice, according to a study presented November 28 at RSNA in Chicago.
Gillean Cortes, DO, a resident at the University of California, Irvine, presented a study that put ChatGPT-3.5 and ChatGPT-4 to the test on nuclear medicine differential diagnosis cases transcribed from two textbooks. The two versions achieved accuracies of 60% and 70%, respectively, but were prone to “hallucinations,” Cortes noted.
“While ChatGPT has shown some potential in generating accurate diagnoses, this technology requires further development before it can be implemented into clinical and educational practice,” Cortes said.
In the study, Cortes and colleagues culled a sample of 50 cases specific to nuclear medicine imaging from the textbooks “Top 3 Differentials in Radiology” and “Top 3 Differentials in Nuclear Medicine.” The researchers converted the cases into standardized prompts that contained purely descriptive language and queried ChatGPT-3.5 and ChatGPT-4 for the most likely diagnosis, the top three differential diagnoses, and corresponding explanations and references from the medical literature.
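The abstract does not describe the exact prompt wording or how the models were queried. As a minimal sketch only, assuming the OpenAI Python client was used, a standardized descriptive case prompt might be submitted to both model versions along these lines (the prompt template, model names, and case text are illustrative assumptions, not the study’s actual materials):

```python
# Hypothetical sketch of the querying step; the prompt wording and model
# identifiers are assumptions, not taken from the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "A nuclear medicine case is described below using purely descriptive "
    "imaging and clinical findings.\n\n{case}\n\n"
    "Provide: (1) the most likely diagnosis, (2) the top three differential "
    "diagnoses, and (3) brief explanations with references from the medical "
    "literature."
)

def query_model(model_name: str, case_description: str) -> str:
    """Send one standardized case prompt to a given ChatGPT model."""
    response = client.chat.completions.create(
        model=model_name,
        messages=[
            {"role": "user", "content": PROMPT_TEMPLATE.format(case=case_description)}
        ],
    )
    return response.choices[0].message.content

# Query both versions with the same case, as the study did across 50 cases.
case = "Hypermetabolic cavitary lesion in the lung ..."  # truncated illustrative description
for model in ("gpt-3.5-turbo", "gpt-4"):
    print(model, query_model(model, case)[:200])
```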
The large language models’ output diagnoses were analyzed for accuracy based on comparisons with the original literature, while reliability was assessed through manual verification of the generated explanations and citations.
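The abstract also does not spell out how matches against the textbook answers were judged or how the differential diagnosis score in the table below was calculated. Purely as an illustration, a case-level accuracy tally of the kind reported could be computed along these lines (the data structure and the exact-match rule are assumptions):

```python
# Illustrative scoring sketch; field names and the simple exact-match rule
# are assumptions, not the study's actual evaluation criteria.
from dataclasses import dataclass

@dataclass
class CaseResult:
    reference_top: str          # top diagnosis from the textbook
    reference_top3: set[str]    # top three differentials from the textbook
    model_top: str              # model's most likely diagnosis
    model_top3: set[str]        # model's top three differentials

def tally(results: list[CaseResult]) -> tuple[float, float]:
    """Return (% correct top diagnosis, % correct top-three differentials)."""
    n = len(results)
    top_hits = sum(r.model_top.lower() == r.reference_top.lower() for r in results)
    top3_hits = sum(
        {d.lower() for d in r.model_top3} == {d.lower() for d in r.reference_top3}
        for r in results
    )
    return 100 * top_hits / n, 100 * top3_hits / n
```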
ChatGPT-3.5 generated the correct top diagnosis in 30 cases (60%) and the correct top three differentials in six cases (12%); ChatGPT-4 generated the correct top diagnosis in 35 cases (70%) and the correct top three differentials in five cases (10%).
Accuracy of nuclear medicine differential diagnoses generated by ChatGPT (n = cases)

| | Top diagnosis | Top 3 diagnoses | Differential diagnosis score |
| --- | --- | --- | --- |
| ChatGPT-3.5 | n = 30 (60%) | n = 6 (12%) | 58% |
| ChatGPT-4 | n = 35 (70%) | n = 5 (10%) | 59% |
| p-value | 0.15 | 0.37 | 0.48 |
However, ChatGPT-3.5 hallucinated 41.5% of the references it provided and generated six false statements, while ChatGPT-4 hallucinated 8.3% of its references and gave four false statements.
For instance, in a case with a prompt that did not give the patient’s smoking history, ChatGPT generated the following: “The most likely diagnosis for this scenario is lung cancer, given the presence of a hypermetabolic cavitary lesion in the lung and the patient’s age and smoking history.”
“It is important to acknowledge that the most recent version of ChatGPT has made slight improvements in the accuracy of its diagnoses as well as reducing the hallucination effect,” Cortes noted.
Ultimately, given that ChatGPT developer OpenAI has not fully released the training data for the algorithm, it is difficult to determine whether the textbook material used in this study was included in its training, she added. Nonetheless, the study’s results were quite promising, especially given that the group did not fine-tune the prompts or adjust any input parameters, she said.
“Knowledge of the accuracy and possible errors of these algorithms can provide a better understanding of the limitations of these tools,” Cortes concluded.