Supervision needed for ChatGPT responses to breast health questions


Physician oversight is still needed when patients turn to large language model (LLM)-based chatbots for answers to their breast health questions, according to research published June 16 in Clinical Imaging.

A team led by Dana Ataya, MD, from the H. Lee Moffitt Cancer Center and Research Institute in Tampa, FL, found that ChatGPT-3.5 gave appropriate responses to all general questions about common acute breast symptoms. However, for questions about pregnancy-associated symptoms, ChatGPT-3.5 made statements that could lead pregnant or lactating women to forgo needed mammograms.

“Our study confirms that ChatGPT has the potential to automate healthcare information related to appropriate management of acute breast symptoms,” the Ataya team wrote.

Patients continue to turn to LLM-based chatbots such as OpenAI's ChatGPT and Google's Gemini with their medical questions rather than asking their doctors. The researchers therefore highlighted the importance of determining whether these novel resources can provide safe and accurate information.

While ChatGPT-3.5 has shown success in providing appropriate answers to questions about disease screening and prevention, acute symptoms may more urgently prompt patients to seek answers from online resources, the group noted. 

Ataya and colleagues studied the accuracy of ChatGPT-3.5's responses to common questions about acute breast symptoms. They also evaluated whether using lay language rather than medical language affected the accuracy of generated responses. 

The researchers developed 20 questions addressing acute breast conditions, informed by the American College of Radiology (ACR) Appropriateness Criteria and the team's own clinical experience. Of these, seven addressed the most common acute breast symptoms, nine addressed pregnancy-associated breast symptoms, and four addressed specific management and imaging recommendations for a palpable breast abnormality.

From there, they submitted each question to ChatGPT-3.5 three times. Five fellowship-trained breast radiologists then judged the responses against ACR guidelines, scoring each response as appropriate, inappropriate, or unreliable by majority vote.
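For readers who want to probe this behavior themselves, a minimal Python sketch of a similar protocol might look like the following, using the OpenAI Python SDK. The model name ("gpt-3.5-turbo"), the sample questions, and the majority-vote helper are illustrative assumptions, not the study's actual materials or code.

    from collections import Counter
    from openai import OpenAI  # requires the openai package and an OPENAI_API_KEY

    client = OpenAI()

    # Hypothetical stand-ins for the study's 20 questions
    QUESTIONS = [
        "I felt a lump in my breast. What should I do?",  # lay phrasing
        "What is the recommended imaging workup for a palpable breast mass?",  # medical phrasing
    ]

    def ask_three_times(question: str) -> list[str]:
        """Submit the same question three times, mirroring the study's protocol."""
        return [
            client.chat.completions.create(
                model="gpt-3.5-turbo",  # closest public analogue of ChatGPT-3.5
                messages=[{"role": "user", "content": question}],
            ).choices[0].message.content
            for _ in range(3)
        ]

    def majority_vote(ratings: list[str]) -> str:
        """Collapse five reviewers' ratings ('appropriate', 'inappropriate',
        'unreliable') into one label; no majority is treated as unreliable
        (a simplifying assumption, not the study's exact rule)."""
        label, count = Counter(ratings).most_common(1)[0]
        return label if count > len(ratings) / 2 else "unreliable"

    for q in QUESTIONS:
        for i, response in enumerate(ask_three_times(q), start=1):
            print(f"Q: {q}\n  response {i}: {response[:120]}...")

In the study itself, of course, the scoring was done by fellowship-trained breast radiologists against ACR guidelines, not programmatically; the sketch only reproduces the repeated-submission step.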

The team reported the following results: 

  • ChatGPT-3.5 generated appropriate responses for all seven questions about common acute breast symptoms, including questions phrased both colloquially and in standard medical terminology.

  • Six of the nine responses to questions about pregnancy-associated breast symptoms were inappropriate.

  • ChatGPT-3.5 generated appropriate responses for three of the four questions about management and imaging recommendations for a palpable breast abnormality.

The inappropriate responses contained incorrect or misleading information about mammography during pregnancy and lactation and about evaluation of a palpable abnormality, the study authors highlighted. In one example, ChatGPT-3.5 said mammography is not recommended during pregnancy due to the potential risk of radiation exposure to the developing fetus. However, the ACR Appropriateness Criteria state that mammography is safe during pregnancy and exposes the fetus to a "negligible" amount of radiation (< 0.03 mGy).

“Although ChatGPT always included cautionary statements instructing the user to seek advice from a healthcare professional, any inaccurate information could potentially lead to an adverse outcome,” the authors wrote. “For that reason, physician oversight remains critical when seeking medical guidance from ChatGPT-3.5.” 

The team also called for larger prospective studies to further explore the reliability of LLMs in answering medical questions.  

The full study can be accessed here.
