Vision transformer AI model boosts PET/CT imaging of cancer

A vision transformer (ViT) AI model outperformed convolutional neural networks when classifying tumors on PET/CT images as benign or malignant, according to a study published April 9 in Scientific Reports.

In a head-to-head comparison, researchers at Osaka University in Osaka, Japan, demonstrated that a ViT model performed better than convolutional neural network (CNN) models when classifying findings on PET/CT imaging. The result supports the clinical value of the approach, the group noted.

“We expect that the ViT model will help users to differentiate between benign and malignant slices in PET/CT images and prevent overlooking lesions with insignificant FDG uptake,” wrote lead author Daiki Nishigaki, MD, and colleagues.

ViT models are based on the transformer architecture, which was originally developed for natural language processing tasks and has since been adapted to process images as sequences of patch tokens. One advantage of ViTs is that they can integrate information across the entire image, whereas CNNs extract more localized features, according to the authors.
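The article describes this token mechanism only at a high level. As a rough illustration, here is a minimal, hypothetical PyTorch sketch of a ViT-style classifier; the dimensions, depth, and layer sizes are invented for brevity and are not those of the study's ViT-B/16:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT-style classifier: image -> patch tokens -> self-attention."""
    def __init__(self, img_size=224, patch=16, dim=192, depth=2, heads=3, classes=2):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # A strided conv is the standard trick to turn each 16x16 patch into one token.
        self.to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        # Self-attention lets every patch token attend to every other patch --
        # global context in a single layer, unlike a CNN's local kernels.
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, classes)

    def forward(self, x):  # x: (B, 3, 224, 224)
        t = self.to_tokens(x).flatten(2).transpose(1, 2)           # (B, 196, dim)
        t = torch.cat([self.cls.expand(len(x), -1, -1), t], 1) + self.pos
        t = self.encoder(t)
        return self.head(t[:, 0])  # classify from the [CLS] token
```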

Few studies have compared the two approaches, they added. Thus, the researchers put them to the test on a difficult clinical task: classifying F-18 FDG-PET/CT slices as benign or malignant, including lesions with poor F-18 FDG radiotracer uptake.

“In daily medical practice, it is often difficult to make clinical decisions because of physiological FDG uptake or cancers with poor FDG uptake,” the authors noted.

This study included imaging from 143 patients with active abdominopelvic cancer and 64 patients without any active cancer who underwent whole-body PET/CT scans at Osaka University Hospital between January 2020 and August 2021.

Using this data, the researchers fine-tuned a previously developed ViT-B/16 model, without architectural modifications, to classify the PET/CT images as “positive” or “negative” (malignant or benign). Next, they compared the ViT’s performance with that of two baseline CNN models, DenseNet and EfficientNet, on 4,852 test PET/CT images.
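The article does not detail the training recipe. Assuming a standard transfer-learning setup, a fine-tuning sketch might look like the following, using torchvision's ImageNet-pretrained ViT-B/16 as a stand-in; the optimizer, learning rate, and input handling here are assumptions, not the study's settings:

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load an ImageNet-pretrained ViT-B/16 and swap its 1,000-class head
# for a two-class ("positive"/"negative") head.
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
model.heads.head = nn.Linear(model.heads.head.in_features, 2)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # assumed hyperparameters
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One fine-tuning step; `images` are (B, 3, 224, 224) slice tensors."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```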

Figure: Predictions and Grad-CAMs of ViT-based models on sample PET/CT, PET, and CT test images from the “positive” class. ViT was fine-tuned using training data for each modality. The bounding boxes indicate malignant lesions. The top row of the Grad-CAMs shows important areas for “positive” predictions, and the bottom row shows areas for “negative” predictions. Image courtesy of Scientific Reports.

According to the findings, the ViT model achieved an area under the receiver operating characteristic curve (AUC) of 90%, superior to that of the EfficientNet (87%) and DenseNet (87%) models.

Moreover, even when F-18 FDG radiotracer uptake was low in the images, the ViT model produced an AUC of 81%, higher than that of the DenseNet model (65%), the authors added.
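For context, slice-level AUCs like these are typically computed from each model's predicted malignancy scores on the test set. A toy scikit-learn example (the labels and scores below are invented, not the study's data):

```python
from sklearn.metrics import roc_auc_score

# Ground-truth slice labels (1 = malignant) and two models' predicted
# probabilities of malignancy for the same hypothetical test slices.
y_true     = [0, 0, 1, 1, 1, 0, 1, 0]
vit_scores = [0.1, 0.3, 0.8, 0.7, 0.9, 0.2, 0.6, 0.4]
cnn_scores = [0.2, 0.4, 0.7, 0.5, 0.8, 0.3, 0.4, 0.6]

print("ViT AUC:", roc_auc_score(y_true, vit_scores))
print("CNN AUC:", roc_auc_score(y_true, cnn_scores))
```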

Ultimately, the value of the study is that it demonstrated the usefulness of ViT for classifying F-18 FDG uptake on PET/CT images, the researchers noted. Extending the study to other institutions is an important future task, they wrote.

“We demonstrated the clinical value of ViT by showing its sensitive analysis of easy-to-miss cases of oncological diseases,” the researchers wrote.

The full study is available in Scientific Reports.
