LLMs need to cite sources for decision-making


A large language model (LLM) is most helpful to radiologists when it can explain or cite a source for every decision in the management plans it outputs, according to research presented December 1 at RSNA 2025.

The recommendation comes from University of Toronto medical imaging researchers who studied how to optimize clinical decision-making using LLMs in pancreatic cancer. Specifically, they examined the role of the OpenAI GPT-4o and DeepSeek V3 large language models.

Karthik Gupta, MD, said LLMs have a habit of denying requests. Gupta and colleagues at the University of Toronto studied the usefulness of LLMs generating management plans for pancreatic ductal adenocarcinoma (PDAC). Liz Carey

Pancreatic ductal adenocarcinoma (PDAC) is an aggressive malignancy that requires complex, time-sensitive decisions, noted presenter Karthik Gupta, MD. The research matters because LLMs -- such as GPT, DeepSeek, and others -- are increasingly used to support radiology decision-making but offer little explainability, so their outputs require additional radiologist review, Gupta said.

"There's emerging evidence suggesting that LLMs can automate clinical decision-making with concise summaries and potentially incorporate these existing guidelines," Gupta said.

With PDAC, deviations from guideline-concordant therapy can meaningfully affect survival and quality of life for patients, Gupta added.

Multidisciplinary tumor boards (MTBs) that include radiology, oncology, pathology, surgery, and allied health services play a key role in PDAC care, but the operational reality is that MTB proceedings are resource-intensive, logistically complex, and can be inconsistent across different settings as guidelines continually evolve, he said.

"This leaves a lot of gaps where patients at high-volume centers just might not get reviewed in time," Gupta noted. Also, "depending on the resource limitations of the specific center, the guidelines might not be followed exactly."

Toward resource efficiency in pancreatic cancer care, Gupta and colleagues evaluated open-weight and closed-source LLMs on drafting National Comprehensive Cancer Network (NCCN)-aligned MTB management plans from case summaries.

Radiologists play a major role in deciding next steps for PDAC management, including guiding staging and vascular involvement, which in turn guides resectability in PDAC tumors, Gupta explained. Therefore, the group used radiologist MTB case summaries and recommendations for 328 MTB cases to establish ground truth for their study.

In their retrospective paired analysis, they also informed the LLMs using NCCN v2.2025 (February 3, 2025) guidelines fed through the Microsoft Azure API, and as a final step, two radiologists reviewed concordance to ground truth MTB case plans.
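The guideline-grounded prompting step can be sketched as follows. This is a hypothetical illustration of feeding NCCN guideline text plus a radiologist case summary to a model deployed behind the Azure OpenAI API; the prompt wording, deployment names, and guideline excerpt are assumptions, not the authors' actual code.

```python
def build_messages(case_summary: str, guideline_text: str) -> list[dict]:
    """Assemble a chat prompt grounding the model in guideline text."""
    return [
        {
            "role": "system",
            "content": (
                "You are assisting a pancreatic cancer tumor board. "
                "Draft an NCCN-aligned management plan and cite the "
                "guideline section supporting each recommendation.\n\n"
                f"NCCN v2.2025 excerpt:\n{guideline_text}"
            ),
        },
        {"role": "user", "content": f"Case summary:\n{case_summary}"},
    ]


def draft_plan(client, deployment: str,
               case_summary: str, guideline_text: str) -> str:
    """Request one management plan from a deployed model.

    `client` is assumed to be an openai.AzureOpenAI instance and
    `deployment` an Azure deployment name (e.g., for GPT-4o).
    """
    response = client.chat.completions.create(
        model=deployment,
        messages=build_messages(case_summary, guideline_text),
        temperature=0,  # deterministic drafts ease concordance review
    )
    return response.choices[0].message.content
```

Pinning the guideline excerpt into the system prompt and asking for per-recommendation citations is one plausible way to get the explainability Gupta describes; the study itself does not publish its prompts.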

First, the group looked at overall completion rates. LLMs have a habit of declining requests, as some are designed to err on the side of caution, according to Gupta.

For this study, DeepSeek V3 offered PDAC management plans at a 100% case completion rate, while GPT-4o demonstrated a 96.3% completion rate. According to the results, DeepSeek demonstrated discordance in 1.5% of cases overall, while GPT-4o demonstrated discordance in 8.8% of cases (a statistically significant difference), Gupta noted.
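A paired discordance comparison like this one can be reproduced with a few lines of stdlib Python. The raw counts below are illustrative back-calculations from the reported percentages (5/328 is about 1.5% and 29/328 is about 8.8%), not the study's actual data, and the exact McNemar test shown is a standard choice for paired analyses rather than the authors' confirmed method.

```python
from math import comb


def discordance_rate(n_discordant: int, n_total: int) -> float:
    """Fraction of cases where a model's plan diverged from the MTB plan."""
    return n_discordant / n_total


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from off-diagonal pair counts.

    b = cases where only model A disagreed with the tumor board,
    c = cases where only model B disagreed.
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Illustrative counts consistent with the reported rates:
deepseek_pct = round(100 * discordance_rate(5, 328), 1)   # -> 1.5
gpt4o_pct = round(100 * discordance_rate(29, 328), 1)     # -> 8.8
p_value = mcnemar_exact_p(5, 29)                          # well below 0.05
```

With these illustrative counts the exact test easily clears conventional significance thresholds, consistent with the statistically significant difference Gupta reported.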

The group also generated category-specific concordance measures for resectability, neoadjuvant therapy, locally advanced unresectable cancer, and palliative care.

"Across the board, both models do quite well with DeepSeek consistently outperforming GPT-4o," Gupta stated. All category-specific concordance metrics were over 91%, except for GPT-4o's 86% in the category of locally advanced unresectable cancer.

However, Gupta also highlighted that DeepSeek occasionally misclassified vascular involvement, "which can have serious implications," and GPT-4o tended to offer more overtreatment recommendations across the board.

Additionally, "if these things are ever to be copilots ... in the future where they act independently to try to draft management plans, it's quite risky for any sort of model to be recommending more aggressive treatment rather than more conservative, which would just increase the work for the tumor boards themselves," Gupta said.

Overall, local models performed better than proprietary models and offer more patient information privacy, Gupta noted. "These models are getting better and better and smaller and smaller and easier to run," he said.

As a proof of concept, this LLM project holds promise. Gupta also pointed out that nonreasoning models were selected for the study because auditing the output of reasoning models can be quite difficult.

"In terms of workflow implications, first of all, can these LLMs actually cite where in the guidelines that they're grounding, they're making decisions on, to help with the LLM explainability aspect, which is a big deal," Gupta said.

"Then, on top of that, can they be an assistant in the room where they help track guideline deviations and discussions, and that's always useful data to have," he continued.

Finally, can they work in the background and be used to establish a triage system to speed the most urgent cases to tumor board?

It is too early to tell, but Gupta hopes to expand the dataset, study performance with other cancers, and trial prospectively in "a feasible pilot with human oversight and appropriate guardrails." The group also has plans to test numerous local models.

For full coverage of RSNA 2025, visit our RADCast.
