Presenter Ran Zhang, PhD, a research scientist in medical physics at the University of Wisconsin-Madison, and colleagues investigated the poor generalizability of AI algorithms developed to classify COVID-19 on chest x-rays. After training a model on a small but high-quality dataset, they concluded that data quality matters far more than data quantity for producing generalizable performance.
The researchers first collected 5,201 COVID-19-positive and 9,185 COVID-19-negative chest x-ray images acquired from the Henry Ford Health System in 2020. To study the impact of data size on their AI algorithm’s performance and generalization, they sampled data sets of various sizes, ranging from roughly 200 to roughly 2,500 patients. For each sampled data set, 20% of the data was held out for model testing, and cross-validation was performed to evaluate internal test performance. To evaluate external test performance, three large external test sets comprising a total of 17,000 cases were included.
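The study itself trained a deep learning model on chest x-ray images; as a rough illustration of the sampling-and-evaluation protocol described above, the sketch below uses scikit-learn with synthetic features standing in for images. The data shapes, the logistic-regression stand-in, and the specific subset sizes are assumptions for illustration, not the authors' pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in for image-derived features and COVID-19 labels
# (the actual study trained a deep learning model on chest x-rays).
X_all = rng.normal(size=(14_386, 64))
y_all = rng.integers(0, 2, size=14_386)
X_ext = rng.normal(size=(17_000, 64))
y_ext = rng.integers(0, 2, size=17_000)

def evaluate_at_size(n_patients: int, n_splits: int = 5):
    """Sample a data set of a given size, hold out 20% for internal testing,
    cross-validate on the rest, and score on an external test set."""
    idx = rng.choice(len(y_all), size=n_patients, replace=False)
    X, y = X_all[idx], y_all[idx]

    # 20% of the sampled data reserved for internal testing
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )

    # Cross-validation on the training portion for internal performance
    cv_aucs = []
    for tr, va in StratifiedKFold(n_splits=n_splits).split(X_tr, y_tr):
        model = LogisticRegression(max_iter=1000).fit(X_tr[tr], y_tr[tr])
        cv_aucs.append(roc_auc_score(y_tr[va], model.predict_proba(X_tr[va])[:, 1]))

    # Refit on all training data, then test internally and externally
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    internal_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    external_auc = roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1])
    return np.mean(cv_aucs), internal_auc, external_auc

for size in (200, 500, 1000, 2500):
    cv_auc, int_auc, ext_auc = evaluate_at_size(size)
    print(f"n={size}: CV AUC={cv_auc:.2f}, internal={int_auc:.2f}, external={ext_auc:.2f}")
```

On random synthetic labels the AUCs will hover near 0.5; the point of the sketch is the structure of the experiment, with the data size as the only variable that changes across runs.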
For internal test performance, the area under the curve (AUC) for the algorithm increased from 0.72 to 0.78 as the data size increased. In the three external test sets, the AUCs for the model similarly increased from 0.74 to 0.80, from 0.79 to 0.82, and from 0.72 to 0.78. In addition, for all studied data sizes, the researchers found that external test performance was not inferior to the corresponding internal test performance, indicating good model generalizability.
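The report does not specify how the noninferiority of external versus internal performance was assessed; one common approach is a bootstrap confidence interval on the external-minus-internal AUC difference, checked against a prespecified margin. The sketch below is a minimal illustration of that idea; the function name, the 0.05 margin, and the synthetic scores are assumptions, not the authors' analysis.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_noninferiority(y_int, p_int, y_ext, p_ext,
                                 n_boot=2000, margin=0.05, seed=0):
    """Bootstrap the external-minus-internal AUC difference and compare its
    lower 95% confidence bound against a noninferiority margin (assumed here)."""
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_boot):
        i = rng.integers(0, len(y_int), len(y_int))  # resample internal test set
        e = rng.integers(0, len(y_ext), len(y_ext))  # resample external test set
        try:
            diffs.append(
                roc_auc_score(y_ext[e], p_ext[e]) - roc_auc_score(y_int[i], p_int[i])
            )
        except ValueError:
            continue  # a resample happened to contain only one class; skip it
    lower = np.percentile(diffs, 2.5)
    # Noninferior if the external AUC drops by less than the margin
    return lower, lower > -margin

# Example usage with synthetic scores (illustration only)
rng = np.random.default_rng(1)
y_i = rng.integers(0, 2, 400)
p_i = np.clip(0.6 * y_i + 0.5 * rng.random(400), 0, 1)
y_e = rng.integers(0, 2, 1000)
p_e = np.clip(0.55 * y_e + 0.5 * rng.random(1000), 0, 1)
print(bootstrap_auc_noninferiority(y_i, p_i, y_e, p_e))
```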
“Good data is more important than big data for generalizable AI in medical imaging,” the authors wrote. “Decent external test performance and generalizability can be achieved with a small, high-quality data set.”