AI performance in clinical settings may differ from the literature

CHICAGO -- AI algorithms in real-world clinical settings may not achieve the high performance described in published research, according to findings presented December 2 at RSNA 2025.

This especially applies to imaging centers that see high volumes of women with a history of breast-conserving therapy, suggest findings presented by Sarah Eskreis-Winkler, MD, PhD, from Memorial Sloan Kettering Cancer Center in New York.

Sarah Eskreis-Winkler, MD, PhD, discusses results from a study she and colleagues led evaluating the performance of FDA-approved AI tools in real-world clinical settings. Sitting next to her are session moderators Ritse Mann, MD, PhD (left), and Linda Moy, MD.

“The field of breast imaging is in flux,” Eskreis-Winkler said. “Most of the breast imagers in this room are probably either using one of these [AI] tools or are thinking about it.”

AI algorithms approved by the U.S. Food and Drug Administration (FDA) are becoming more widespread, and large-scale trials show that AI assistance improves diagnostic performance in breast imaging.

Eskreis-Winkler and colleagues studied the performance of an AI tool (Transpara version 1.74A, ScreenPoint Medical) for screening mammography prior to its clinical rollout. The study population included “many” women with a personal history of breast cancer and/or high lifetime risk.

The team selected four cohorts of 200 screening mammograms each: Cohort 1, true positives (BI-RADS 0, cancer diagnosed within three months); Cohort 2, false negatives (BI-RADS 1 or 2, cancer diagnosed within 18 months); Cohort 3, true negatives (BI-RADS 1 or 2, no cancer diagnosed for at least 18 months, no history of breast-conserving therapy); and Cohort 4, true negatives (BI-RADS 1 or 2, no cancer diagnosed for at least 18 months, in women with prior breast-conserving therapy).
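
For readers who want the study design at a glance, here is a minimal Python sketch encoding the four cohort definitions as data. The structure and field names are hypothetical illustrations, not taken from the study itself:

```python
from dataclasses import dataclass

@dataclass
class Cohort:
    """One of the four 200-exam screening-mammography cohorts described in the talk."""
    label: str             # reader outcome category
    birads: tuple          # BI-RADS assessment of the screening exam
    cancer_criterion: str  # cancer-diagnosis window defining the cohort
    bct_history: str       # breast-conserving therapy (BCT) history
    n: int = 200           # each cohort contained 200 screening mammograms

COHORTS = [
    Cohort("true positive",  (0,),   "cancer diagnosed within 3 months",  "any"),
    Cohort("false negative", (1, 2), "cancer diagnosed within 18 months", "any"),
    Cohort("true negative",  (1, 2), "no cancer for at least 18 months",  "none"),
    Cohort("true negative",  (1, 2), "no cancer for at least 18 months",  "prior BCT"),
]
```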

The AI tool assigned each exam a risk level of low, intermediate, or elevated. Per vendor benchmarks, the algorithm was expected to flag more than 90% of true positives as elevated risk and more than 60% of true negatives as low risk.
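
To make the benchmark comparison concrete, here is a minimal Python sketch (all names hypothetical, not from the study or the vendor) that checks a reported rate against those thresholds, treating the talk's 73% elevated-risk rate among true positives as the relevant proportion, which is an assumption about how the categories map:

```python
# Vendor benchmarks quoted in the presentation.
BENCHMARKS = {
    "true_positives_elevated": 0.90,  # >90% of true positives flagged as elevated risk
    "true_negatives_low": 0.60,       # >60% of true negatives flagged as low risk
}

def meets_benchmark(observed_rate: float, threshold: float) -> bool:
    """Return True if the observed proportion exceeds the vendor threshold."""
    return observed_rate > threshold

# Reported result: 73% of the 200 true-positive exams scored elevated risk.
print(meets_benchmark(0.73, BENCHMARKS["true_positives_elevated"]))  # False
```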

Of the exams with cancer, the AI tool assigned an elevated-risk score to 73% of cases and an intermediate-risk score to 19%. For false-negative and interval cancers, it assigned an elevated-risk score to 36% of cases.

The AI designated 69% of screen-detected cancer cases as elevated risk with correct localization, and another 11% as intermediate risk with correct localization. The remaining 20% of cases were split among elevated, intermediate, and low risk, as well as cases with incorrect AI localization.

Eskreis-Winkler said the results illustrate the challenge of evaluating AI performance.

“It’s very easy to evaluate the performance of AI the standalone way, but when we’re talking about a radiologist consulting with [the AI] and deciding whether to accept [its output], it’s more in the psychological domain,” she said.

She also highlighted the importance of internally validating AI before clinical use and of monitoring its performance over time.

“This is something we’re working actively on as well,” Eskreis-Winkler said.

Visit our RADCast for full coverage of RSNA 2025.