Artificial intelligence may be the future of healthcare, but it's not there yet. That could be the lesson from a head-to-head test that pitted a group of physicians against a commonly used symptom-checker app in diagnosing a set of clinical conditions.
In a research letter published October 10 in JAMA Internal Medicine, researchers from Harvard Medical School described their experiment comparing how often doctors made the correct diagnosis versus Human Dx, a web- and app-based symptom-checker platform. The senior author of the letter is Dr. Ateev Mehrotra, an associate professor of healthcare policy at the school.
The group noted that computer-based checklists and other apps are increasingly being used for tasks such as reducing medication errors, and that physicians' diagnostic error rates typically run 10% to 15%, according to published literature. Human Dx is used by more than 2,700 doctors and trainees from over 40 countries, and it includes more than 100,000 clinical vignettes.
In the experiment, 234 internal medicine, family practice, and pediatric physicians evaluated 45 clinical vignettes covering both common and uncommon clinical conditions of varying severity. The doctors were asked to identify the most likely diagnosis and to provide two additional possible diagnoses. Their performance was compared with that of the symptom-checker app.
The doctors put the correct diagnosis first at a rate of 72.1%, compared with only 34% for Human Dx (p < 0.001). When the correct diagnosis only had to appear among the three listed possibilities, the physicians got it right at a rate of 84.3%, compared with 51.2% for the app (p < 0.001).
The researchers found that the gap between the doctors and the digital app was wider for more severe conditions and narrower for more common illnesses. Still, the doctors were wrong in 15% of cases, in line with previously published estimates of physician diagnostic error.
While the physicians triumphed this time, the researchers acknowledged a number of limitations to the experiment: clinical vignettes do not reflect the real world of healthcare, the doctors who use Human Dx might not be a representative sample of physicians, and the app itself might not represent the state of the art in artificial intelligence.
"Symptom checkers are only one form of computer diagnostic tools, and other tools may have superior performance," they concluded.