An ensembled monitoring model that runs in the background, in the manner of a clinical consensus review, can serve as a quality control mechanism for "black box" radiology AI tools, according to a use case published October 16 in npj Digital Medicine.
Applied to a U.S. Food and Drug Administration (FDA)-cleared intracranial hemorrhage (ICH) detection algorithm, the approach generated case-by-case confidence measurements to help physicians recognize low-confidence AI predictions, explained a team from Stanford University led by Stanford Radiology's AI Development and Evaluation (AIDE) lab co-directors Akshay Chaudhari, PhD, and David Larson, MD, and principal machine-learning scientist Zhongnan Fang, PhD.
One motivation behind the proposed ensembled monitoring model (EMM) is that most FDA clearances consider premarket performance data. "There is a broad mismatch between premarket evaluation and then postmarket evaluation and postdeployment monitoring," Larson told AuntMinnie.
To close the gap, the group developed EMM to monitor FDA-cleared radiological AI devices as a complement to retrospective monitoring based on concordance between AI model outputs and labor-intensive manual labeling.
"Prior approaches have used large language models (LLMs) to compare the outputs of a deployed model to a finalized radiology report," Chaudhari noted. "However, this can only be performed in a retrospective manner."
Instead, EMM is designed for real-time, patient-specific assessment, Chaudhari added. For the physician, EMM produces a confidence level (red, yellow, or green) for the deployed model's prediction at the time of interpretation.
"At the end, we're trying to look at the agreement levels between an ‘expert committee’ (EMM) and the primary black-box AI, and translating those into uncertainty measures," Fang said.
Zhongnan Fang, PhD, David Larson, MD, and Akshay Chaudhari, PhD, explain their ensembled monitoring model (EMM).
Orchestrated like "multiple expert reviews," EMM quantifies confidence in the primary AI model's predictions. It consists of five independently trained submodels that mirror the primary model's task; each submodel runs in parallel with the primary model, independently processing the same input to generate its own prediction.
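To make the mechanism concrete, here is a minimal sketch of how such a committee could translate submodel agreement into the red, yellow, or green levels described above. The class name, the binary-prediction interface, and the vote cutoffs are illustrative assumptions, not the group's published implementation:

```python
from dataclasses import dataclass
from typing import Callable, List

# A submodel maps a head CT scan to a binary ICH prediction: 1 = positive, 0 = negative.
# This interface is an assumption for illustration.
Predictor = Callable[[object], int]

@dataclass
class EnsembledMonitor:
    submodels: List[Predictor]  # e.g., five independently trained submodels

    def confidence(self, scan, primary_prediction: int) -> str:
        # Each submodel independently processes the same input the primary model saw.
        votes = [model(scan) for model in self.submodels]
        agreement = sum(v == primary_prediction for v in votes) / len(votes)
        # Translate committee agreement into a traffic-light confidence level.
        # These cutoffs are hypothetical, not the published thresholds.
        if agreement >= 0.8:   # 4 or 5 of 5 submodels agree with the primary model
            return "green"     # increased confidence
        if agreement >= 0.6:   # 3 of 5 agree
            return "yellow"
        return "red"           # decreased confidence: flag for detailed review
```

The traffic-light abstraction matters here: rather than surfacing raw agreement fractions, the monitor reduces them to a signal a radiologist can act on at the time of interpretation.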
In this use case -- a primary AI model that detects ICH on head CT images -- the majority of cases EMM analyzed were classified as having increased confidence, Larson noted. When EMM indicates decreased confidence, a more detailed radiologist review is warranted, he said.
The group also accounted for unnecessary reviews triggered by false alarms, defining a false alarm as a case flagged with decreased confidence even though the primary model's prediction was actually correct.
For ICH-positive cases detected by the black-box AI, EMM increased detection accuracy by up to 38.57%, while maintaining a low false-alarm rate of under 1% across ICH prevalence levels ranging from 30% (in emergency settings) to 5% (in outpatient settings), according to the group.
For ICH-negative predictions, the primary model already had high baseline accuracy at the lower prevalences (accuracy of 93% and 98% for prevalences of 15% and 5%, respectively), the group reported. As a result, the most favorable balance between improved accuracy and low false-alarm rate was observed at the 30% prevalence level, Larson said.
"We looked at how well EMM improves accuracy for cases flagged as positive and negative (meaning, essentially, PPV and NPV) versus how often it incorrectly returns a 'decreased confidence' result (false alarm)," Larson explained. "Since the purpose of EMM is to help differentiate whether the primary model is correct or incorrect, we evaluated EMM according to how well it does that."
Importantly, the ensembled model worked best with five submodel assessments. It was especially helpful in addressing false positives, which tend to be a particular problem in low-prevalence environments such as routine outpatient imaging, Larson added.
The group also observed that EMM, when trained on only 25% of the training data (4,592 subjects), achieved near-optimal performance across disease prevalences of 30%, 15%, and 5% (emergency, inpatient, and outpatient settings, respectively).
When trained with smaller submodels using only 5% of the training data (918 subjects), EMM also maintained optimal performance at the 5% prevalence level. These results demonstrate EMM’s strong generalizability in data-scarce settings across different disease prevalences, the group noted.
"We expect that there's going to be multiple generations of these [black box] AI tools that will be helping the physicians," Chaudhari said. "If we can benchmark where our current models are successful and where they are failing, hopefully, we can provide that information to the model creators so that next time around, they can patch some of the failure modes that these models can have."
Find the complete paper here.