Smaller, "fine-tuned" large language models (LLMs) used for imaging applications are more sustainable than large general-purpose LLMs, using less energy without negatively affecting accuracy, researchers have reported.
A team led by Florence Doo, MD, of the University of Maryland Medical Intelligent Imaging (UM2ii) Center in Baltimore found that a small, task-specific LLM with seven billion parameters used 0.13 kilowatt-hours (kWh), compared with 0.59 kWh for a general-purpose LLM of the same size -- a 78% difference. Their findings were published August 27 in Radiology.
"Radiologists can make a difference by choosing the 'optimal' AI model for a task -- or as a mentor has said, you don't need a sledgehammer for a nail," Doo told AuntMinnie.com.
The energy used by LLMs for medical applications, including imaging, contributes to the overall carbon footprint of the healthcare system, according to Doo and colleagues. LLM size is defined by the number of "parameters" it has; these are "akin to the weighted neurons in the human brain," Doo and colleagues explained, noting that the "size of an LLM refers to its complexity and learning capacity such that more parameters mean the model can potentially recognize more nuanced patterns in the data, which could translate into higher accuracy for tasks such as diagnosing diseases from radiographs."
Because the energy LLMs consume had not previously been measured, Doo's team explored the balance between accuracy and energy use across different LLM types for medical imaging applications, specifically chest x-rays. Their study included five open-source LLMs of different billion (B)-parameter sizes: Meta's Llama 2 7B, 13B, and 70B, all general-purpose models, and LMSYS Org's Vicuna v1.5 7B and 13B, which Doo's group described as "specialized, fine-tuned models." The study used 3,665 chest radiograph reports culled from the National Library of Medicine's Indiana University Chest X-ray collection.
The investigators tested the models using local compute clusters with graphics processing units (GPUs); a single-task prompt directed each model to confirm the presence or absence of 13 CheXpert disease labels. (CheXpert is a large dataset of chest x-rays and a competition for automated chest x-ray interpretation developed by Stanford University doctoral candidate Jeremy Irvin and colleagues in 2019.) They measured each LLM's energy use in kilowatt-hours and assessed accuracy against the 13 CheXpert disease labels for diagnostic findings on chest x-ray exams, with overall accuracy calculated as the mean of each label's individual accuracy. The researchers also calculated the LLMs' efficiency ratios (i.e., accuracy per kWh; higher values mean higher efficiency).
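To make the scoring concrete, here is a minimal Python sketch of how overall accuracy and the efficiency ratio could be computed under the study's definitions. The label set (assumed here to be CheXpert's 13 disease labels, excluding "No Finding") and the data structures are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch (not the authors' code) of the study's two metrics:
# overall accuracy = mean of per-label accuracies, and
# efficiency ratio = accuracy (%) per kilowatt-hour of GPU energy.

# Assumed label set: CheXpert's 13 disease labels (excluding "No Finding").
CHEXPERT_LABELS = [
    "Enlarged Cardiomediastinum", "Cardiomegaly", "Lung Opacity",
    "Lung Lesion", "Edema", "Consolidation", "Pneumonia", "Atelectasis",
    "Pneumothorax", "Pleural Effusion", "Pleural Other", "Fracture",
    "Support Devices",
]

def overall_accuracy(predictions: dict, ground_truth: dict) -> float:
    """Mean of each label's individual accuracy across all reports.

    Both arguments map report ID -> {label: True/False (present/absent)}.
    """
    per_label_acc = []
    for label in CHEXPERT_LABELS:
        correct = sum(
            predictions[report_id][label] == ground_truth[report_id][label]
            for report_id in ground_truth
        )
        per_label_acc.append(correct / len(ground_truth))
    return sum(per_label_acc) / len(per_label_acc)

def efficiency_ratio(accuracy_pct: float, energy_kwh: float) -> float:
    """Accuracy percentage per kWh consumed; higher means more efficient."""
    return accuracy_pct / energy_kwh
```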
They reported the following:
Comparison of LLMs for chest x-ray interpretation efficiency and accuracy

| Measure | Llama 2 7B | Llama 2 13B | Llama 2 70B | Vicuna 1.5 7B | Vicuna 1.5 13B |
|---|---|---|---|---|---|
| Efficiency ratio (accuracy per kWh) | 13.39 | 40.9 | 22.3 | 737.2 | 331.4 |
| Overall labeling accuracy | 7.9% | 74% | 92.7% | 93.8% | 93% |
| GPU energy consumed (kWh) | 0.59 | 1.81 | 4.16 | 0.13 | 0.28 |
The team highlighted that Vicuna 1.5 7B had the highest efficiency ratio, at 737.2, compared with Llama 2 7B's 13.39, the lowest. They also reported that the Llama 2 70B model used more than seven times the energy of its 7B counterpart (4.16 kWh vs. 0.59 kWh) while still posting slightly lower overall accuracy than the smaller fine-tuned Vicuna models (92.7% vs. 93.8% and 93%).
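Those figures can be sanity-checked directly from the table: the efficiency ratio is simply accuracy divided by energy, and the "seven times" comparison is the ratio of the two kWh values. A quick sketch using the published numbers (small discrepancies for the Vicuna models likely reflect rounding in the published kWh figures):

```python
# Reproducing the reported figures from the table's (accuracy %, kWh) pairs.
results = {
    "Llama 2 7B": (7.9, 0.59),
    "Llama 2 13B": (74.0, 1.81),
    "Llama 2 70B": (92.7, 4.16),
    "Vicuna 1.5 7B": (93.8, 0.13),
    "Vicuna 1.5 13B": (93.0, 0.28),
}

for model, (acc_pct, kwh) in results.items():
    print(f"{model}: {acc_pct / kwh:.1f} accuracy points per kWh")
# -> 13.4, 40.9, 22.3, 721.5, 332.1 (the published 737.2 and 331.4 imply
#    slightly different unrounded kWh values for the Vicuna models)

print(f"Llama 2 70B vs. 7B energy: {4.16 / 0.59:.2f}x")  # -> 7.05x
```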
"[We were surprised to see how much more energy the larger models used with only a slight bump in accuracy," Doo said.
Bigger isn't always better, according to Doo.
"We don’t always need the biggest, flashiest AI models to get great results," she told AuntMinnie.com. "When selecting an LLM or other AI tools, we can consider sustainability and make smart choices that benefit both our patients and the planet."