LLM performance varies based on language input

Jan 31, 2025

Baidu’s AI chatbot Ernie Bot outperformed OpenAI’s ChatGPT-4 on interventional radiology questions in Chinese, while ChatGPT was superior when questions were in English, according to a recent study.

The finding suggests that patients may get better answers when they choose large language models (LLMs) trained in their native language, noted a group of interventional radiologists at the First Affiliated Hospital of Soochow University in Suzhou, China.

“ChatGPT's relatively weaker performance in Chinese underscores the challenges faced by general-purpose models when applied to linguistically and culturally diverse healthcare environments,” the group wrote. The study was published January 23 in Digital Health.

Liver cancer is among the most common malignancies globally, and transcatheter arterial chemoembolization (TACE) and hepatic artery infusion chemotherapy (HAIC) are two widely used procedures for treating the disease, the authors explained. Both procedures are complex and difficult for patients and their caregivers to understand, they noted.

ChatGPT and Ernie Bot, which was released by Baidu in August 2023, have shown promise in patient education, yet few studies have tested them in interventional radiology patient groups, and none have compared their performance in both Chinese and English contexts, the authors added.

To that end, the researchers developed 38 questions covering topics related to TACE and HAIC, including foundational knowledge, patient education, treatment, and care. The responses generated by the chatbots were evaluated by 10 professionals in liver cancer interventional radiology, with each response rated on a five-point Likert scale for accuracy and comprehensiveness.

According to the results, both chatbots effectively addressed questions related to TACE and HAIC, yet their performance varied by language: ChatGPT excelled in English contexts, while Ernie Bot performed better in Chinese.

Likert scale scores for Ernie Bot and ChatGPT responses to questions about TACE and HAIC in English and Chinese
	English		Chinese
Likert scale score	ChatGPT	Ernie Bot	ChatGPT	Ernie Bot
5	35 (92.1%)	28 (73.7%)	22 (57.9%)	34 (89.5%)
4	1 (2.6%)	1 (2.6%)	2 (5.3%)	4 (10.5%)
3	2 (5.3%)	5 (13.2%)	13 (34.2%)	0 (0%)
2	0 (0%)	4 (10.5%)	1 (2.6%)	0 (0%)
1	0 (0%)	0 (0%)	0 (0%)	0 (0%)
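For readers who want a single number per column, the distribution in the table can be collapsed into a mean score. The short Python sketch below does that arithmetic using the counts from the table (38 questions per column). The means are not reported in this article; they are derived here for illustration, and averaging Likert ratings assumes the scale can be treated as evenly spaced, which is a simplification.

```python
# Counts of questions (out of 38) at each Likert score, copied from the table.
# The mean scores computed below are derived for illustration only;
# they are not figures reported in the article.
counts = {
    ("English", "ChatGPT"):   {5: 35, 4: 1, 3: 2,  2: 0, 1: 0},
    ("English", "Ernie Bot"): {5: 28, 4: 1, 3: 5,  2: 4, 1: 0},
    ("Chinese", "ChatGPT"):   {5: 22, 4: 2, 3: 13, 2: 1, 1: 0},
    ("Chinese", "Ernie Bot"): {5: 34, 4: 4, 3: 0,  2: 0, 1: 0},
}


def mean_score(dist):
    """Weighted mean of Likert scores, given {score: number of questions}."""
    total = sum(dist.values())
    return sum(score * n for score, n in dist.items()) / total


# Round to two decimals for display.
means = {key: round(mean_score(dist), 2) for key, dist in counts.items()}

for (language, model), value in means.items():
    print(f"{language:<8} {model:<10} mean score = {value:.2f}")
```

Under these assumptions, the means come out to roughly 4.87 (ChatGPT) versus 4.39 (Ernie Bot) in English, and 4.18 (ChatGPT) versus 4.89 (Ernie Bot) in Chinese, which matches the direction of the result described above.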

Ernie Bot’s advantage in Chinese can largely be attributed to its training on Chinese-specific datasets and its access to databases updated in real time, which enable it to provide more precise information, the authors wrote. In contrast, ChatGPT demonstrated robust language generation capabilities in English contexts, consistent with its reputation as a versatile and widely applicable AI model, they added.

Ultimately, while both models show promise, their language-specific strengths suggest that the language a patient uses should be taken into account when developing and applying AI tools in clinical practice.

“Choosing a suitable large language model is important for patients to get more accurate treatment,” the group concluded.
