Tuesday, December 2 | 3:10 p.m.-3:20 p.m. | SSIN04-2 | Room E450B
This session will introduce a method for postdeployment monitoring of commercial radiology AI software and flagging potential failures across multiple products.
The approach identifies missed findings without requiring structured labels during training or accompanying reports during inference, according to presenter Camila Gonzalez, PhD, and a team from the Stanford AI Development and Evaluation (AIDE) Lab. With the goal of real-time algorithm monitoring after deployment, Gonzalez and colleagues trained a foundation model using in-house CT scans and radiology reports.
Targeting an intracranial hemorrhage (ICH) detection algorithm, the group retrieved 4,648 noncontrast head CT studies; 43.7% were from female patients, and the median patient age was 69. Model development involved extracting ICH findings from radiology reports using a large language model and evaluating those findings against manually curated labels, as well as retraining an existing vision-language model using the corresponding reports.
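As a rough illustration of that label-checking step (this is a sketch, not the AIDE Lab's code), the snippet below compares LLM-extracted ICH labels against a manually curated reference set; the label lists and their binary encoding (1 = ICH present, 0 = absent) are assumptions for illustration.

```python
# Sketch: agreement between LLM-extracted ICH labels and curated reference labels.
# The label encoding and example values are illustrative assumptions.

def label_agreement(extracted: list[int], curated: list[int]) -> dict:
    """Compute simple agreement metrics between two binary label lists."""
    assert len(extracted) == len(curated)
    tp = sum(1 for e, c in zip(extracted, curated) if e == 1 and c == 1)
    tn = sum(1 for e, c in zip(extracted, curated) if e == 0 and c == 0)
    fp = sum(1 for e, c in zip(extracted, curated) if e == 1 and c == 0)
    fn = sum(1 for e, c in zip(extracted, curated) if e == 0 and c == 1)
    return {
        "accuracy": (tp + tn) / len(curated),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }

# Example: LLM-extracted labels vs. a hand-curated subset of the same reports.
print(label_agreement([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))
```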
AI tools are increasingly being integrated into radiologic workflows, the group noted, yet manufacturers rarely provide oversight strategies. The session will explain the role and potential of extracted "model embeddings" in evaluating and monitoring the performance of vendor ICH AI.
For each scan, the group extracted model embeddings and computed the mean cosine distance to true-negative cases (i.e., ICH neither present nor detected by the vendor model) in the training set.
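A minimal sketch of that scoring idea is shown below, assuming per-scan embeddings are available as NumPy arrays; the embedding dimension, bank size, and flagging threshold are illustrative assumptions, not values from the presentation.

```python
# Sketch: score a new scan by its mean cosine distance to the embeddings of
# true-negative training cases (ICH neither present nor flagged by the vendor model).
import numpy as np

def mean_cosine_distance(scan_emb: np.ndarray, true_neg_embs: np.ndarray) -> float:
    """Mean cosine distance from one embedding (D,) to a bank of embeddings (N, D)."""
    scan = scan_emb / np.linalg.norm(scan_emb)
    bank = true_neg_embs / np.linalg.norm(true_neg_embs, axis=1, keepdims=True)
    cosine_sim = bank @ scan                 # (N,) cosine similarities
    return float(np.mean(1.0 - cosine_sim))  # cosine distance = 1 - similarity

# Scans whose embeddings sit far from the true-negative cluster are candidates
# for a missed finding and can be flagged for review.
rng = np.random.default_rng(0)
true_negatives = rng.normal(size=(1000, 512))  # hypothetical embedding bank
new_scan = rng.normal(size=512)                # hypothetical scan embedding
score = mean_cosine_distance(new_scan, true_negatives)
flag_for_review = score > 0.9                  # threshold would be set on validation data
```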
The vendor model obtained a sensitivity of 64% and a specificity of 83% on the in-house test set, missing 206 of 567 ICH-positive studies, according to the group. However, by selecting decision thresholds on validation data, the team was able to raise test-set sensitivity from 64% to 75%, and as high as 81%, with moderate increases in false positives.
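One way such a threshold might be chosen on validation data and then carried over unchanged to the test set is sketched below; the scoring convention (higher score = more suspicious) and the synthetic data are assumptions for illustration, not the group's method.

```python
# Sketch: pick the highest score threshold whose validation sensitivity meets a
# target (e.g., 75% or 81%), then apply that fixed threshold to the test set.
import numpy as np

def threshold_for_sensitivity(scores: np.ndarray, labels: np.ndarray, target: float) -> float:
    """Return the highest candidate threshold whose sensitivity on (scores, labels) >= target."""
    positives = scores[labels == 1]
    # Flagging rule: score >= threshold means "possible missed ICH".
    for thr in np.sort(positives)[::-1]:
        if np.mean(positives >= thr) >= target:
            return float(thr)
    return float(positives.min())  # flag all positives if the target is otherwise unreachable

# Synthetic validation split for illustration only.
rng = np.random.default_rng(0)
val_labels = rng.integers(0, 2, size=500)
val_scores = rng.normal(loc=val_labels * 0.5, scale=1.0)
thr = threshold_for_sensitivity(val_scores, val_labels, target=0.75)
# Applying thr unchanged to the held-out test set trades higher sensitivity
# for a moderate rise in false positives.
```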
For those interested in calibrating black-box AI tools to clinically acceptable sensitivity thresholds, this session is a must.