Much has been said of what AI algorithms in radiology can do and how well they have performed during their validation, but much less has been said about inevitable performance degradation over time and what to do about it.
Monitoring the performance of AI algorithms is analogous to post-market surveillance, according to Stuart Pomerantz, MD, a neuroradiologist affiliated with Massachusetts General Hospital, who like others has emphasized three areas of risk in AI algorithm performance:
- Poor generalizability, in which the model is overfit to its training data
- Validation on limited datasets, such as a few sites in one region, that often fail to account for regional variations and demographic factors
- Data drift, which occurs with changes in demographics, imaging equipment, and scan protocols
Monitoring performance
"Algorithms often fail to meet performance claims that are derived from validation," Pomerantz said at RSNA 2023, adding that the clinical performance gap undermines trust in AI, slows AI adoption, and can harm patients.
Ultimately, these risks around the use of AI algorithms in radiology and other medical specialties have drawn attention to the issue of performance degradation. When degradation occurs, retraining the AI model is generally beyond the capability of most radiology practices, Pomerantz said.
However, it is possible to recalibrate thresholds for AI performance, essentially "localizing" the AI so that it adapts as conditions change. To that end, Pomerantz and his team developed a novel method for calibrating AI algorithm performance to local conditions, applying what is known as conformal uncertainty quantification.
A nonheuristic approach
To ensure AI models produce expected results, conformal uncertainty quantification provides a straightforward, nonheuristic approach to calibrating an AI algorithm's performance to local conditions. It relies on selective classification and is being tested for its value in enhancing prediction confidence, supporting better decision-making, and managing the risk of using AI algorithms in high-stakes settings such as medical diagnostics.
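To make the idea concrete, the following is a minimal sketch of split-conformal calibration with selective classification, assuming a binary classifier that outputs a positive-class probability and a small, locally labeled calibration set; the function names and the choice of nonconformity score are illustrative and are not drawn from Pomerantz's published method.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.05):
    """Split-conformal calibration on a local, labeled calibration set.

    cal_probs:  model probability of the positive class for each calibration case
    cal_labels: ground-truth labels (0 or 1)
    alpha:      target miscoverage rate (0.05 aims for roughly 95% coverage)
    """
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels, dtype=int)
    # Nonconformity score: 1 minus the probability the model assigned to the true class
    scores = np.where(cal_labels == 1, 1.0 - cal_probs, cal_probs)
    n = len(scores)
    # Finite-sample-corrected (1 - alpha) quantile of the calibration scores
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return float(np.quantile(scores, level, method="higher"))

def classify_with_abstention(prob, q):
    """Return 'positive', 'negative', or 'abstain' for a new case."""
    plausible_pos = (1.0 - prob) <= q  # 'disease present' is consistent with calibration data
    plausible_neg = prob <= q          # 'disease absent' is consistent with calibration data
    if plausible_pos and not plausible_neg:
        return "positive"
    if plausible_neg and not plausible_pos:
        return "negative"
    return "abstain"                   # ambiguous (or neither): defer to the radiologist
```

Because the threshold is computed from the local calibration set, it adapts when the site's case mix, scanners, or protocols change, provided the calibration set is periodically refreshed.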
Pomerantz and colleagues applied conformal uncertainty quantification to an intracranial hemorrhage (ICH) detection algorithm for CT. The conformal methodology was applied to the algorithm's results on consecutive CT exams and compared with ground truth, with a deep-learning feature extractor run on the array of per-image probabilities for each case.
By setting thresholds under which the AI calls a case positive, calls it negative, or abstains due to uncertainty, the researchers achieved high positive and negative predictive values. They also applied the method on a per-patient basis with zero tolerance for false positives or false negatives.
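For the per-patient analysis with zero tolerance for errors, the simplified, hypothetical sketch below pools per-slice probabilities into a single exam-level score (max-pooling stands in for the study's deep-learning feature extractor) and picks local thresholds so that no calibration exam is misclassified; exams between the thresholds are deferred for review. It illustrates the abstention idea only and is not the published method.

```python
import numpy as np

def exam_score(slice_probs):
    """Collapse per-slice ICH probabilities into one exam-level score.

    The study used a deep-learning feature extractor on the per-image
    probability array; max-pooling here is only a simple stand-in.
    """
    return float(np.max(slice_probs))

def zero_error_thresholds(cal_scores, cal_labels):
    """Choose exam-level thresholds with zero errors on the local calibration set.

    Exams scoring above t_pos are auto-flagged positive, those below t_neg
    negative, and everything in between is deferred to the radiologist.
    """
    cal_scores = np.asarray(cal_scores, dtype=float)
    cal_labels = np.asarray(cal_labels, dtype=int)
    t_pos = cal_scores[cal_labels == 0].max()  # no ICH-negative calibration exam scores above this
    t_neg = cal_scores[cal_labels == 1].min()  # no ICH-positive calibration exam scores below this
    return t_pos, t_neg

def triage(score, t_pos, t_neg):
    """Apply the locally chosen thresholds to a new exam's score."""
    if score > t_pos:
        return "positive"
    if score < t_neg:
        return "negative"
    return "abstain"
```

The width of the abstention band is set entirely by local data, which is the sense in which the AI is "localized" rather than retrained.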
"By providing rigorous and mathematically provable guarantees for high accuracy on the local population, this method can improve confidence in AI system reliability across diverse settings," Pomerantz said. Conformal uncertainty quantification avoids the need to retrain an algorithm, and it provides performance guarantees from local-level conditions and data. Conformal uncertainty quantification may also be simpler to use than other uncertainty quantification methods.
Increased monitoring demands
The importance of this topic has drawn the attention of the American College of Radiology (ACR), Canadian Association of Radiologists (CAR), European Society of Radiology (ESR), Royal Australian and New Zealand College of Radiologists (RANZCR), and RSNA. Representatives from these societies published a joint paper January 22 suggesting methods for monitoring performance of AI tools in clinical use. Among the points made were the following:
- Where an AI system is used and the standard of care is not met, accountability and liability may extend to the developer and to the healthcare entity that implemented the AI system in addition to the clinician.
- Society-developed resources, such as the ACR Data Science Institute's Define-AI directory, often serve as a good starting place to ensure the technology being developed meets genuine clinical needs.
- Two common errors in performance reporting are failure to report a range of expected performance (lower-quality applications often report a single summary accuracy figure) and failure to report specific failure conditions and errors (lower-quality algorithms selectively highlight the best diagnoses made by their systems). In the broader AI safety community, model cards or system cards are widely embraced; these explicitly report in-depth analyses of limitations, errors, and biases, often entirely separate from the primary report of system performance. (A minimal, hypothetical model-card sketch appears after this list.)
- Solutions must have an explicit post-market quality assurance plan. This matters mainly because of concept drift, driven by changes in the patient population or, occasionally, by upgrades to successive versions of the AI software.
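As a concrete illustration of the reporting points above, here is a minimal, hypothetical model-card-style record that captures a performance range rather than a single figure, along with known failure conditions and a post-market QA plan; all field names and values are invented for illustration.

```python
# A minimal, hypothetical model-card-style record for an imaging AI tool.
# All field names and values are illustrative only.
model_card = {
    "model": "ICH-detector (example)",
    "version": "2.1.0",
    "intended_use": "Triage of noncontrast head CT for intracranial hemorrhage",
    "performance": {
        # A range across sites and subgroups, not a single summary accuracy figure
        "sensitivity_range": (0.87, 0.95),
        "specificity_range": (0.90, 0.97),
        "validation_sites": 5,
    },
    "known_failure_conditions": [
        "Severe motion or metal artifact",
        "Postoperative change or prior craniotomy",
        "Very small or subtle hemorrhage",
    ],
    "post_market_qa": {
        "monitoring": "Monthly positive-rate review and control-sample retesting",
        "retraining_trigger": "Sustained deviation beyond agreed bounds",
    },
}
```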
What it means for radiology
"In practice, what this may entail is prospective performance monitoring of the AI model, for example monitoring for major deviations in month-to-month diagnostic event frequencies, with alerts raised when normal bounds are exceeded, or a control sample approach where a constant reserved held-out set of test case examples is routinely evaluated with the algorithm, to ensure no major deviations on known difficult or borderline cases," authors of the joint statement wrote.
The statement also noted that an ideal monitoring solution collects real-time data on model performance and aggregates and analyzes the results, comparing them against expected performance at the local, regional, or national benchmark level, which requires access to ground truth and well-defined performance benchmarks.