Tuesday, December 2 | 3:10 p.m.-3:20 p.m. | SSIN04-2 | Room E450B
This session will introduce a method for postdeployment monitoring of commercial radiology AI software and flagging potential failures across multiple products.
The approach identifies missed findings without requiring structured labels during training or accompanying reports during inference, according to presenter Camila Gonzalez, PhD, and a team from the Stanford AI Development and Evaluation (AIDE) Lab. With the goal of real-time algorithm monitoring after deployment, Gonzalez and colleagues trained a foundation model using in-house CT scans and radiology reports.
Targeting an intracranial hemorrhage (ICH) detection algorithm, the group retrieved 4,648 noncontrast head CT studies; 43.7% were from female patients, and the median patient age was 69. Model development involved extracting ICH findings from radiology reports using a large language model and evaluating those findings against manually curated labels, as well as retraining an existing vision-language model using the corresponding reports.
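As a rough illustration of that label-checking step (this is a sketch, not the AIDE Lab's code), the snippet below compares LLM-extracted ICH labels against a manually curated reference set; the label lists and their binary encoding (1 = ICH present, 0 = absent) are assumptions for illustration.

```python
# Sketch: agreement between LLM-extracted ICH labels and curated reference labels.
# The label encoding and example values are illustrative assumptions.

def label_agreement(extracted: list[int], curated: list[int]) -> dict:
    """Compute simple agreement metrics between two binary label lists."""
    assert len(extracted) == len(curated)
    tp = sum(1 for e, c in zip(extracted, curated) if e == 1 and c == 1)
    tn = sum(1 for e, c in zip(extracted, curated) if e == 0 and c == 0)
    fp = sum(1 for e, c in zip(extracted, curated) if e == 1 and c == 0)
    fn = sum(1 for e, c in zip(extracted, curated) if e == 0 and c == 1)
    return {
        "accuracy": (tp + tn) / len(curated),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }

# Example: LLM-extracted labels vs. a hand-curated subset of the same reports.
print(label_agreement([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))
```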
AI tools are increasingly being integrated into radiologic workflows, the group noted, yet manufacturers rarely provide oversight strategies. The session will explain the role and potential of extracted "model embeddings" in evaluating and monitoring the performance of vendor ICH AI.
For each scan, the group extracted model embeddings and computed the mean cosine distance to true-negative cases (i.e., ICH neither present nor detected by the vendor model) in the training set.
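A minimal sketch of that scoring idea is shown below, assuming per-scan embeddings are available as NumPy arrays; the embedding dimension, bank size, and flagging threshold are illustrative assumptions, not values from the presentation.

```python
# Sketch: score a new scan by its mean cosine distance to the embeddings of
# true-negative training cases (ICH neither present nor flagged by the vendor model).
import numpy as np

def mean_cosine_distance(scan_emb: np.ndarray, true_neg_embs: np.ndarray) -> float:
    """Mean cosine distance from one embedding (D,) to a bank of embeddings (N, D)."""
    scan = scan_emb / np.linalg.norm(scan_emb)
    bank = true_neg_embs / np.linalg.norm(true_neg_embs, axis=1, keepdims=True)
    cosine_sim = bank @ scan                 # (N,) cosine similarities
    return float(np.mean(1.0 - cosine_sim))  # cosine distance = 1 - similarity

# Scans whose embeddings sit far from the true-negative cluster are candidates
# for a missed finding and can be flagged for review.
rng = np.random.default_rng(0)
true_negatives = rng.normal(size=(1000, 512))  # hypothetical embedding bank
new_scan = rng.normal(size=512)                # hypothetical scan embedding
score = mean_cosine_distance(new_scan, true_negatives)
flag_for_review = score > 0.9                  # threshold would be set on validation data
```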
The vendor model obtained a sensitivity of 64% and a specificity of 83% on the in-house test set, missing 206 of 567 ICH-positive studies, according to the group. However, by selecting decision thresholds on validation data, the team was able to raise test-set sensitivity from 64% to 75%, and as high as 81%, with moderate increases in false positives.
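One way such a threshold might be chosen on validation data and then carried over unchanged to the test set is sketched below; the scoring convention (higher score = more suspicious) and the synthetic data are assumptions for illustration, not the group's method.

```python
# Sketch: pick the highest score threshold whose validation sensitivity meets a
# target (e.g., 75% or 81%), then apply that fixed threshold to the test set.
import numpy as np

def threshold_for_sensitivity(scores: np.ndarray, labels: np.ndarray, target: float) -> float:
    """Return the highest candidate threshold whose sensitivity on (scores, labels) >= target."""
    positives = scores[labels == 1]
    # Flagging rule: score >= threshold means "possible missed ICH".
    for thr in np.sort(positives)[::-1]:
        if np.mean(positives >= thr) >= target:
            return float(thr)
    return float(positives.min())  # flag all positives if the target is otherwise unreachable

# Synthetic validation split for illustration only.
rng = np.random.default_rng(0)
val_labels = rng.integers(0, 2, size=500)
val_scores = rng.normal(loc=val_labels * 0.5, scale=1.0)
thr = threshold_for_sensitivity(val_scores, val_labels, target=0.75)
# Applying thr unchanged to the held-out test set trades higher sensitivity
# for a moderate rise in false positives.
```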
For those interested in calibrating black-box AI tools to clinically acceptable sensitivity thresholds, this session is a must.