Techniques that show users how a radiology artificial intelligence (AI) algorithm arrives at its findings may not be that reliable, according to research published online on October 10 in Nature Machine Intelligence.
A multi-institutional team led by Pranav Rajpurkar, PhD, of Harvard Medical School compared seven saliency methods (techniques that produce "heat maps" showing which image regions drove a prediction) with radiologists for localizing 10 common pathologies on chest x-rays. They found that all of the saliency methods consistently underperformed the radiologists.
"Our analysis shows that saliency maps are not yet reliable enough to validate individual clinical decisions made by an AI model," Rajpurkar said in a statement. "We identified important limitations that raise serious safety concerns for use in current practice."
Black box
The "black box" nature of many radiology AI algorithms can affect how clinicians view the reliability of the tool and therefore discourage its use, according to the researchers. Conversely, some clinicians may even overtrust the analysis of AI algorithms.
In an effort to build trust in AI technology, saliency maps have been incorporated into a variety of medical imaging AI algorithms to show users the relevant elements of the images that influenced the software's results. The goal is to help clinicians ascertain if the model is focusing on a clinically irrelevant part of the image or even on confounding aspects, according to the researchers.
"However, recent work has shown that saliency methods used to validate model predictions can be misleading in some cases and may lead to increased bias and loss of user trust in high-stakes contexts such as healthcare," the authors wrote. "Therefore, a rigorous investigation of the accuracy and reliability of these strategies is necessary before they are integrated into the clinical setting."
In the study, Rajpurkar and colleagues from Stanford University and New York University sought to evaluate the performance of these techniques for the interpretation of chest x-rays. The researchers used three common convolutional neural network (CNN) models to evaluate seven different saliency methods: Grad-CAM, Grad-CAM++, Integrated Gradients, Eigen-CAM, DeepLIFT, Layer-Wise Relevance Propagation, and Occlusion. An ensemble of 30 CNNs was trained and evaluated for each combination of saliency method and model architecture.
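As a rough illustration of how one of these methods works, the sketch below implements Grad-CAM from scratch using a forward hook on a torchvision DenseNet-121. The backbone, the hooked layer, and the class index are assumptions chosen for illustration, not the study's exact setup.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Stand-in backbone: torchvision DenseNet-121 with a 10-way multi-label head (assumption).
model = models.densenet121(weights=None)
model.classifier = torch.nn.Linear(1024, 10)
model.eval()

feats = {}

def fwd_hook(module, inputs, output):
    # Keep the last conv-block activations and ask autograd to retain their gradient.
    output.retain_grad()
    feats["value"] = output

model.features.register_forward_hook(fwd_hook)

x = torch.randn(1, 3, 224, 224)          # placeholder for a preprocessed chest x-ray
scores = model(x)                        # one logit per pathology
scores[0, 3].backward()                  # e.g. index 3 = "consolidation" (illustrative)

acts, grads = feats["value"], feats["value"].grad
weights = grads.mean(dim=(2, 3), keepdim=True)           # global-average-pooled gradients
cam = F.relu((weights * acts).sum(dim=1, keepdim=True))  # weighted sum of feature maps
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalized [0, 1] heat map
```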
Each of these algorithms processed chest x-rays from the holdout test set of the publicly available CheXpert chest x-ray dataset to obtain image-level predictions for 10 pathologies: airspace opacity, atelectasis, cardiomegaly, consolidation, edema, enlarged cardiomediastinum, lung lesion, pleural effusion, pneumothorax, and support devices.
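A minimal sketch of that image-level prediction step, assuming a multi-label classifier with one sigmoid output per pathology (the label order is an assumption):

```python
import torch

PATHOLOGIES = [
    "airspace opacity", "atelectasis", "cardiomegaly", "consolidation", "edema",
    "enlarged cardiomediastinum", "lung lesion", "pleural effusion",
    "pneumothorax", "support devices",
]

def predict(model: torch.nn.Module, image: torch.Tensor) -> dict:
    """Return a probability per pathology for one preprocessed chest x-ray tensor."""
    with torch.no_grad():
        logits = model(image.unsqueeze(0))         # shape: (1, 10)
        probs = torch.sigmoid(logits).squeeze(0)   # independent multi-label probabilities
    return {name: float(p) for name, p in zip(PATHOLOGIES, probs)}
```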
The segmentations derived from the saliency maps were compared with ground-truth segmentations established by two board-certified radiologists with 18 and 27 years of experience, respectively. In addition, segmentations produced by a separate group of three radiologists from Vietnam with 9, 10, and 18 years of experience, respectively, were compared with the ground-truth segmentations to establish a human benchmark for localization performance, according to the authors.
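One plausible way to score a saliency map against a radiologist's segmentation is to binarize the normalized heat map and compute its overlap (intersection over union, IoU) with the ground-truth mask, as sketched below. The threshold and metric details are assumptions, not the paper's exact protocol.

```python
import numpy as np

def saliency_to_mask(cam: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Binarize a [0, 1]-normalized saliency map into a segmentation mask."""
    return cam >= threshold

def iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Overlap between predicted and ground-truth masks (1.0 = perfect agreement)."""
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(intersection) / float(union) if union > 0 else 0.0

# Toy usage with random stand-ins for a heat map and a radiologist mask.
cam = np.random.rand(224, 224)
gt = np.zeros((224, 224), dtype=bool)
gt[60:120, 80:160] = True
print(iou(saliency_to_mask(cam), gt))
```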
Underperformance
Although Grad-CAM localized pathologies better than the six other saliency methods, all performed significantly worse than the human readers. Measured by average overlap with the ground-truth segmentations, Grad-CAM was 24% worse than the human benchmark overall, and for individual pathologies the gap was as large as 76.2%.
In other findings, the researchers noted that the gap in performance between Grad-CAM and the human benchmark was largest for smaller pathologies and for those with more complex shapes. Furthermore, they found that the AI model's confidence in its findings was positively correlated with Grad-CAM localization performance.
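That kind of relationship can be checked with a rank correlation between per-image model confidence and localization scores. The snippet below is a hedged illustration on synthetic placeholder values, not the study's actual analysis or data.

```python
import numpy as np
from scipy.stats import spearmanr

confidences = np.array([0.92, 0.35, 0.71, 0.10, 0.88, 0.55])  # model output probabilities (synthetic)
ious = np.array([0.44, 0.12, 0.30, 0.05, 0.51, 0.22])          # per-image Grad-CAM IoU (synthetic)

# Spearman rank correlation: positive rho means higher confidence tends to
# accompany better localization overlap.
rho, pvalue = spearmanr(confidences, ious)
print(f"Spearman rho = {rho:.2f}, p = {pvalue:.3f}")
```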
Although clinical practices are already using saliency maps as a quality assurance tool for computer-aided detection (CAD) models such as chest x-ray algorithms, the maps should be applied with caution and a healthy dose of skepticism in light of the new results, according to the researchers.
"This work is a reminder that care should be taken when leveraging common saliency methods to validate individual clinical decisions in deep learning-based workflows for medical imaging," the authors wrote.
To assist in further research, the group has released CheXlocalize, a development dataset containing its expert segmentations. It can be downloaded from GitHub.