Federated learning may boost AI generalizability

Dec 2, 2020


Training deep-learning algorithms with a federated-learning approach could help radiology artificial intelligence (AI) software perform more consistently across institutions, according to a presentation in a scientific session at this week's virtual RSNA 2020 meeting.

A multi-institutional team led by Karthik Sarma of the University of California, Los Angeles (UCLA) used federated learning -- a method that enables algorithms to be trained using data from multiple sites without having to share patient data -- to train three deep-learning models, each at a site holding a separate portion of a prostate MRI dataset. Each of the models performed essentially on par with a standalone algorithm trained on the entire dataset.

"We believe that our results demonstrate that federated learning may be a ready-to-use tool to enable the development of multi-institutional deep-learning models, which we believe will really unlock the power of academic deep learning to enable better performance on medical imaging datasets," Sarma said.

Poor generalizability

Deep-learning algorithms in radiology often have poor generalizability across institutions, a shortcoming that can occur when models are developed using data collected only at a single institution, according to Sarma. One way around this problem is to aggregate data from multiple sites into a multi-institutional dataset that could enable the development of better and more generalizable algorithms, Sarma said.

"But creating these datasets is a challenge for regulatory reasons and also because significant concern exists about the privacy implications of data sharing and clinical data -- even deidentified clinical data," he said.

Another option is federated learning, which involves training AI models at multiple participating institutions -- each with its own private dataset. Instead of collecting these datasets together in a single location, a global copy of the deep-learning model is created and placed on a single federated server that all participating institutions can access, according to Sarma.

Ensuring data privacy

Each institution then makes a local copy of the deep-learning model and trains it using its own local data. Once training is complete, each site sends its new model weights to the federated server without sharing its local data, Sarma said.

"And once all of the institutions have set their local model weights, they're aggregated into a new global model, which is then redistributed to each of the participating sites," he said.

This cycle of local training, aggregation, and redistribution is then repeated until training of the overall model is complete.
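The round-by-round process described above can be illustrated with a minimal Python sketch. It is not the team's implementation: the presentation did not specify how weights were combined, so the sketch assumes a simple size-weighted average (the approach commonly called federated averaging), and it swaps the segmentation network for a toy two-parameter linear model. The names `local_update`, `aggregate`, `federated_training`, and `SITE_DATA` are hypothetical.

```python
from typing import Dict, List, Tuple

Weights = Dict[str, List[float]]     # parameter name -> parameter values
Dataset = List[Tuple[float, float]]  # toy (x, y) pairs; never leaves its site


def local_update(global_weights: Weights, data: Dataset, lr: float = 0.1) -> Weights:
    """Toy stand-in for local training: one SGD pass fitting y = w*x + b."""
    w, b = global_weights["w"][0], global_weights["b"][0]
    for x, y in data:
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return {"w": [w], "b": [b]}


def aggregate(site_weights: List[Weights], site_sizes: List[int]) -> Weights:
    """Assumed aggregation rule: average weights, weighted by cases per site."""
    total = sum(site_sizes)
    return {
        name: [
            sum(w[name][i] * n for w, n in zip(site_weights, site_sizes)) / total
            for i in range(len(site_weights[0][name]))
        ]
        for name in site_weights[0]
    }


def federated_training(sites: List[Dataset], rounds: int) -> Weights:
    """Run the server loop: distribute, train locally, aggregate, repeat."""
    global_weights: Weights = {"w": [0.0], "b": [0.0]}
    for _ in range(rounds):
        # Each site trains its own copy; only the weights are returned.
        local = [local_update(global_weights, data) for data in sites]
        global_weights = aggregate(local, [len(data) for data in sites])
    return global_weights


# Hypothetical private datasets, all drawn from y = 2x + 1.
SITE_DATA: List[Dataset] = [
    [(x, 2 * x + 1) for x in (0.0, 0.5, 1.0)],
    [(x, 2 * x + 1) for x in (-1.0, -0.5, 0.25)],
    [(x, 2 * x + 1) for x in (0.75, -0.25, 0.1)],
]

final = federated_training(SITE_DATA, rounds=300)
print(final)
```

In this toy setup the global weights approach w = 2 and b = 1, even though the server function only ever handles weights and never the `(x, y)` pairs themselves.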

"What you have in the end is a global model that has seen the benefit of aggregation of models that have seen all of the different private datasets, but it's never directly seen any of that private data," he said. "And as a result, none of that private data has ever had to change hands; only the trained model weights have had to travel between the different institutions."

Performance evaluation

To evaluate federated learning for training an AI algorithm, the researchers gathered 343 T2-weighted MRI exams from the ProstateX Challenge, a quantitative image analysis challenge held by the SPIE, the American Association of Physicists in Medicine, and the U.S. National Cancer Institute in 2017. An experienced radiologist annotated prostate contours on all of these images, according to Sarma.

Of these 343 MRI exams, 100 were distributed to each of the three participating sites: the U.S. National Institutes of Health, the State University of New York (SUNY), and UCLA. The 3D Anisotropic Hybrid Network (AH-Net) deep-learning model was chosen for the project due to its superior segmentation performance, Sarma said.
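The arithmetic of the split (three sites of 100 exams each, leaving 43 of the 343 as a hold-out set) can be sketched as follows. How the exams were actually assigned to sites was not described, so the seeded random shuffle here is an assumption, and `split_cases` and the integer case IDs are hypothetical.

```python
import random
from typing import Dict, Iterable, List, Tuple


def split_cases(
    case_ids: Iterable[int], sites: List[str], per_site: int, seed: int = 0
) -> Tuple[Dict[str, List[int]], List[int]]:
    """Shuffle the cases, give each site `per_site` of them, hold out the rest."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)  # assumed: random assignment
    assignment = {
        site: ids[k * per_site:(k + 1) * per_site]
        for k, site in enumerate(sites)
    }
    holdout = ids[len(sites) * per_site:]
    return assignment, holdout


assignment, holdout = split_cases(range(343), ["NIH", "SUNY", "UCLA"], per_site=100)
for site, cases in assignment.items():
    print(site, len(cases))
print("hold-out", len(holdout))
```

Each exam lands in exactly one group, so no site's training cases overlap with another site's or with the hold-out set.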

A federated learning server was then deployed using the Nvidia Clara Train toolkit on AWS and connected to the three training institutions to enable sharing of model weights, Sarma said. Three different federated learning models were created -- one at each institution -- and were trained over 300 epochs, or rounds. To provide a point of comparison, the researchers also trained a standalone model using all 300 MRI exams at once.

Equivalent performance

The researchers then tested the performance of all three federated learning models and the standalone model on the 43 hold-out cases that made up the test set.

Performance of federated learning models on test set
Model	Mean Dice score
Federated learning model (SUNY)	0.910
Federated learning model (UCLA)	0.905
Federated learning model (NCI)	0.909
Mean of all 3 federated learning models	0.908
Standalone model (trained on entire dataset)	0.911
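For readers unfamiliar with the metric, the Dice score measures overlap between a predicted segmentation and the reference annotation: twice the number of voxels the two masks share, divided by the total number of voxels in both masks. A minimal sketch follows, using flat lists of 0s and 1s as hypothetical masks. The function `dice_score` is illustrative rather than the researchers' evaluation code, and returning 1.0 when both masks are empty is a convention chosen for this sketch. The last lines recompute the mean of the three federated scores reported in the table.

```python
from typing import Dict, Sequence


def dice_score(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Dice = 2 * |pred AND truth| / (|pred| + |truth|) for binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        return 1.0  # convention for this sketch: two empty masks agree fully
    return 2.0 * intersection / total


# Hypothetical 6-voxel masks: 3 predicted, 3 annotated, 2 in common.
pred = [0, 1, 1, 1, 0, 0]
truth = [0, 0, 1, 1, 1, 0]
example = dice_score(pred, truth)  # 2 * 2 / (3 + 3)

# Scores reported in the table, averaged across the three federated models.
federated_scores: Dict[str, float] = {"SUNY": 0.910, "UCLA": 0.905, "NCI": 0.909}
mean_federated = sum(federated_scores.values()) / len(federated_scores)
print(round(example, 3), round(mean_federated, 3))
```

A score of 1.0 indicates perfect overlap and 0.0 indicates none, so the 0.003 gap between the federated mean and the standalone model is small on this scale.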

Overall, the researchers found that each of the models had essentially equivalent performance, Sarma said.

"This result does suggest that the [federated learning] approach works in that it enabled learning from the entire dataset, even though it didn't see all of the data and didn't require direct data sharing," he said. "If it hadn't, then we would expect the standalone full data model to have done better on the held-out test set, because the local [federated learning] models only received one-third of the data that the standalone model did."

Sarma acknowledged that this case study was limited by the use of a single underlying dataset that was split among the institutions. As such, the experiment didn't really re-create a circumstance where participating institutions have their own private data with unique characteristics -- perhaps the best use case for federated learning, he said.

"We're actually currently working on a future work that explores using separate institution-specific datasets in order to make sure that the results that we see aren't a fluke caused by using the same underlying dataset," he said.