Large language models integrated with image-to-text approaches could improve diagnostic thyroid ultrasound interpretation, according to research published March 12 in Radiology.
A team led by Li-Da Chen, MD, PhD, from the First Affiliated Hospital of Sun Yat-sen University in Guangzhou found that ChatGPT 4.0 showed the highest consistency and diagnostic accuracy among the models tested, which also included Google Bard and ChatGPT 3.5, when interpreting ultrasound images of thyroid nodules. The image-to-text large language model strategy also performed comparably to a human-large language model interaction strategy involving two senior and two junior readers.
“The results indicate that combining image-to-text models and large language models could advance medical imaging and diagnostics research and practice, informing secure deployment for enhanced clinical decision-making,” the Chen team wrote.
While previous studies have explored the potential of large language models in medical imaging interpretation, the researchers noted a lack of studies investigating the feasibility of the models in handling reasoning questions tied to medical diagnosis.
Chen and colleagues studied the viability of leveraging three publicly available models in this area: ChatGPT 4.0, ChatGPT 3.5, and Bard. The researchers explored how the models could improve consistency and diagnostic accuracy in medical imaging based on standardized reporting, with pathology as the reference standard.
The study included 1,161 ultrasound images of thyroid nodules collected in 2022 from 725 patients. Of the nodules, 498 were benign and 663 were malignant.
ChatGPT 4.0 and Bard achieved substantial to almost perfect intra-large language model agreement (κ range, 0.65-0.86), while ChatGPT 3.5 showed fair to substantial agreement (κ range, 0.36-0.68).
The researchers found that ChatGPT 4.0 achieved an accuracy of 78% to 86% and a sensitivity of 86% to 95%, compared with 74% to 86% and 74% to 91%, respectively, for Bard.
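The paper's analysis code is not reproduced here, but as context for these figures, the following minimal Python sketch, using made-up labels, shows how intra-model agreement (Cohen's κ) and diagnostic accuracy, sensitivity, and specificity against a pathology reference standard are conventionally computed. All of the label lists are hypothetical illustrations, not study data.

```python
# Minimal sketch (not the study's code): computing agreement and diagnostic
# metrics of the kind reported above. All labels below are hypothetical.
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Hypothetical repeated LLM outputs for the same nodules (1 = malignant, 0 = benign)
run_1 = [1, 0, 1, 1, 0, 1, 0, 1]
run_2 = [1, 0, 1, 0, 0, 1, 0, 1]

# Intra-model consistency across repeated runs (Cohen's kappa)
kappa = cohen_kappa_score(run_1, run_2)

# Diagnostic performance against the pathology reference standard
pathology = [1, 0, 1, 1, 0, 1, 1, 1]
tn, fp, fn, tp = confusion_matrix(pathology, run_1).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)  # share of malignant nodules called malignant
specificity = tn / (tn + fp)  # share of benign nodules called benign

print(f"kappa={kappa:.2f}, accuracy={accuracy:.2%}, "
      f"sensitivity={sensitivity:.2%}, specificity={specificity:.2%}")
```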
The team also compared three model deployment strategies in its study: human-large language model interaction, in which human readers first interpreted the images and the large language model then rendered a diagnosis from the reader-recorded Thyroid Imaging Reporting and Data System (TI-RADS) signs; an image-to-text strategy, in which an image-to-text model analyzed the images and the large language model diagnosed from its text output; and an end-to-end convolutional neural network (CNN) that handled both image analysis and diagnosis.
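To make the second strategy concrete, here is a minimal Python sketch of an image-to-text plus large language model pipeline. The function names, canned findings, and keyword rule are hypothetical stand-ins, not the authors' implementation; a real deployment would call a vision model and an LLM API at the marked points.

```python
# Illustrative sketch of the image-to-text + LLM strategy described above.
# describe_nodule() and diagnose_from_text() are hypothetical stand-ins for an
# image-to-text model and a large language model; neither reflects the
# study authors' actual implementation.

def describe_nodule(image_path: str) -> str:
    """Stand-in for an image-to-text model that emits TI-RADS-style findings."""
    # A real system would run a vision model on the image here;
    # this stub returns a canned example description.
    return ("Solid composition, hypoechoic, taller-than-wide shape, "
            "irregular margin, punctate echogenic foci.")

def diagnose_from_text(findings: str) -> str:
    """Stand-in for an LLM prompted to reason over the text findings."""
    prompt = (
        "You are a thyroid radiologist. Based on these ultrasound findings, "
        f"is the nodule benign or malignant?\n\nFindings: {findings}"
    )
    # A real system would send `prompt` to an LLM API; this stub
    # applies a toy keyword rule so the sketch runs end to end.
    suspicious = ("hypoechoic", "taller-than-wide", "irregular margin",
                  "punctate echogenic foci")
    hits = sum(term in findings.lower() for term in suspicious)
    return "malignant" if hits >= 3 else "benign"

findings = describe_nodule("nodule_001.png")  # hypothetical file name
print(diagnose_from_text(findings))           # -> "malignant"
```

Decoupling image description from diagnosis is what lets the same language model serve both strategies: it reasons over text supplied either by an image-to-text model or by a human reader.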
With the image-to-text large language model strategy, ChatGPT 4.0 achieved an area under the curve (AUC) comparable to or exceeding those of the readers in the human-large language model interaction strategy. And while the end-to-end CNN outperformed ChatGPT 4.0 on most measures, the two achieved the same sensitivity.
Performance of ChatGPT 4.0 (image-to-text strategy) versus readers (human-large language model strategy) and the end-to-end CNN

| Measure | Junior reader 1 | Junior reader 2 | Senior reader 1 | Senior reader 2 | CNN | ChatGPT 4.0 |
| --- | --- | --- | --- | --- | --- | --- |
| AUC | 0.82 | 0.76 | 0.84 | 0.85 | 0.88 | 0.83 |
| Accuracy | 82% | 78% | 85% | 86% | 89% | 84% |
| Sensitivity | 86% | 93% | 91% | 92% | 95% | 95% |
| Specificity | 78% | 59% | 77% | 78% | 81% | 71% |
The study authors highlighted that their results affirm the feasibility of large language models for addressing reasoning questions tied to medical diagnosis, with pathologic findings serving as the reference standard for structured, ultrasound imaging-based diagnosis.
“Interestingly, language arts encompass both emotional intelligence and intelligence quotient,” the authors wrote. “This indicates that large language models possess a stable and superior emotional intelligence, making them potentially helpful for fostering patient equity.”
Still, they cautioned that large language models cannot interpret images on their own; they rely on image-to-text techniques or human interpretation to supply text descriptions of image features. And citing the performance of senior reader 2 in the study, the authors noted that radiologist expertise remains indispensable despite advances in AI.
“Further research is required to investigate applicability across different models, techniques, and medical image types,” they wrote.