The large language model (LLM) GPT-4o (OpenAI) is effective for protocoling abdominal and pelvic CT scans, choosing optimal protocols more frequently than radiologists when augmented with "detailed prompting," researchers have reported.
The study findings could improve department workflow, allowing radiologists to focus on "core interpretive responsibilities," according to a team led by Bryan Buckley, MBBCh, of the University of Toronto in Canada. The results were published January 6 in Radiology.
"Protocolling is important, but is a labor and time intensive process for radiologists," corresponding author Rajesh Bhayana, MD, also of the university, told AuntMinnie.com. "It's also prone to variability and error. Incorrect protocols can make imaging tests non-diagnostic, sometimes preventing us from answering the clinical question. Therefore, using AI to help assign protocols in line with institutional guidelines is of interest to both improve efficiency and quality."
Protocoling of medical imaging involves selecting the optimal study technique for a given clinical indication and is an important step in the medical imaging workflow, the group explained, noting that accuracy is critical because incorrect protocols can lead to nondiagnostic exams. Protocoling is also a time-consuming task and a source of interruptions for radiologists, which can lead to increased diagnostic errors, the authors wrote. The team suggested that the use of LLMs for this task could prove helpful.
As there is little data on the performance of these models for protocoling, Buckley and colleagues conducted a study evaluating GPT-4o's ability to automatically assign protocols for abdominal and pelvic CT scans. One version of the model was augmented with context engineering (that is, "providing an LLM with all of the context it requires to solve a problem, which has been likened to constructing the mental world it operates in"); the researchers used a prompting set of 300 cases to do this. Another version was optimized through fine-tuning (that is, training on labeled examples to improve the model's performance on this particular task); the researchers used Microsoft Foundry to do this.
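To make the prompting-only idea concrete, here is a minimal sketch of what a context-engineered protocoling assistant could look like. The study's actual prompt, protocol list, and rules are not public; the guideline text, protocol names, and example indication below are illustrative placeholders, and the sketch assumes the standard OpenAI Python client.

```python
# Hypothetical sketch of a prompting-only ("context engineering")
# protocoling assistant. All guideline content below is invented for
# illustration and is NOT the study's actual prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder institutional guidelines embedded in the model's context.
INSTITUTIONAL_GUIDELINES = """
Available abdominal/pelvic CT protocols (illustrative only):
- CT abdomen/pelvis with IV contrast, portal venous phase
- CT abdomen/pelvis without contrast
- Multiphase liver CT (late arterial + portal venous + delayed)
- CT urogram (noncontrast + nephrographic + excretory phases)
Rules: prefer portal venous phase for nonspecific abdominal pain;
use the multiphase liver protocol for suspected hepatocellular carcinoma.
"""

def assign_protocol(clinical_indication: str) -> str:
    """Ask GPT-4o to pick one protocol from the institutional list."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # favor reproducible selections
        messages=[
            {"role": "system",
             "content": "You are a radiology protocoling assistant. "
                        "Select exactly one protocol from the list below, "
                        "following the institutional rules.\n"
                        + INSTITUTIONAL_GUIDELINES},
            {"role": "user",
             "content": f"Clinical indication: {clinical_indication}"},
        ],
    )
    return response.choices[0].message.content.strip()

print(assign_protocol("Cirrhosis, rising AFP; assess for HCC."))
```

The key point of this design is that all institution-specific knowledge lives in the prompt, not in the model weights, which is what distinguishes it from the fine-tuning arm of the study.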
The research included 1,448 patients who underwent abdominal or pelvic CT scans between January and June 2024. The investigators tracked the human-selected protocol for each exam and the training level (resident, fellow, or radiologist) of the protocoler. Reference-standard protocols were defined by radiologists in consultation with institutional guidelines. Buckley and colleagues graded the LLMs' and the radiologists' protocol selections into the following categories: exact match, equal alternative, reasonable but inferior, and inappropriate (with exact match and equal alternative considered optimal).
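The article's four grading categories lend themselves to a simple tally. Here is a minimal sketch of how such gradings could be rolled up into the "optimal" and "inappropriate" rates the study reports; the category labels come from the article, while the counting code and example gradings are assumptions (the real gradings were assigned by radiologists against the reference standard).

```python
from collections import Counter

# Categories from the article; "optimal" = exact match or equal alternative.
OPTIMAL = {"exact match", "equal alternative"}

# Hypothetical gradings of protocol selections, for illustration only.
gradings = [
    "exact match", "equal alternative", "reasonable but inferior",
    "exact match", "inappropriate", "exact match",
]

counts = Counter(gradings)
n = len(gradings)
optimal_rate = sum(counts[c] for c in OPTIMAL) / n
inappropriate_rate = counts["inappropriate"] / n

print(f"optimal: {optimal_rate:.1%}, inappropriate: {inappropriate_rate:.1%}")
```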
They found that GPT-4o with prompting chose optimal protocols for abdominal and pelvic CT exams more often than radiologists did, with no significant difference in the rate of inappropriate protocols.
Performance comparison between radiologists and GPT-4o for selecting abdominal and pelvic CT protocols

| Outcome | Radiologists | GPT-4o (with prompting instructions) | p-value |
| --- | --- | --- | --- |
| Selection of optimal protocols | 88.3% | 96.2% | < 0.001 |
| Inappropriate protocols | 2.4% | 1.3% | 0.21 |
Fine-tuning the LLM did not increase the proportion of optimal protocols over the prompting approach (both 96.2%; p > 0.99). The team also reported that the proportion of protocols matching the reference standard was similar across all training levels (radiologists, 79.4%; fellows, 74.9%; residents, 72.1%; p = 0.3).
Overall performance of the prompting-only strategy, the fine-tuned model, and the original human protocolers. GPT-4o (OpenAI) with prompting only selected protocols exactly matching the reference standard more frequently than the original human protocolers (90.7% [497 of 548 patients] versus 76.1% [417 of 548 patients]; p < 0.001). Optimal protocols were also selected more frequently by the prompting-only model than by the original human protocolers (96.2% [527 of 548 patients] versus 88.3% [484 of 548 patients]; p < 0.001). However, there was no evidence of a difference in the proportion of inappropriate protocols selected between the prompting-only model and human protocolers (1.3% [7 of 548 patients] versus 2.4% [13 of 548 patients]; p = 0.21). When comparing the two GPT-4o optimization strategies, fine-tuning did not significantly improve performance over prompting only (optimal protocols: 96.2% [527 of 548 patients] for both; p > 0.99). Figure and caption courtesy of the RSNA.
The study results suggest that "meticulous prompting with state-of-the-art LLMs is all that is required to unlock automated protocoling," according to the group.
"A prompting-only approach offers several advantages over fine-tuning on labeled examples, including facilitating efficient adaptation to institutional-specific protocols, new protocols, and changes in practice over time," it concluded. "LLMs could facilitate widespread automated protocoling, which could significantly improve workflow and reduce radiologist time spent on noninterpretive tasks."
The full study can be found here.