Too Cautious for the Healthcare System: ChatGPT's Limitations in Dealing with Health Issues
Artificial intelligence (AI) is increasingly being used in healthcare. Many people use tools like ChatGPT to assess their symptoms and decide whether they need immediate medical attention, should seek medical advice, or can wait and see how the situation develops. And with versions specifically tailored for the healthcare sector – such as ChatGPT Health in the US – it is easy to get the impression that such tools are particularly suited for professional use. However, the reliability of ChatGPT's recommendations has so far only been studied to a limited extent.
In a recent study conducted by the Division of Ergonomics at Technische Universität Berlin, researchers analyzed exactly how ChatGPT classifies health complaints across different model versions, how its performance has changed over time, and whether identical inputs generate consistent recommendations. The results show that ChatGPT is currently only of limited use for digital initial assessments and independent patient management.
22 model versions, 45 real-world cases, 9,900 reviews
"The main difference to our earlier studies is the longitudinal analysis. Previously, we only studied one or two models. We have now tested all the models that have been available and analyzed how they have actually changed," says Dr. Marvin Kopka, leader of the study. "It was important for us to do this as we are seeing repeated reports that new models achieve near-perfect results in medical licensing exams or knowledge tests. This leads people to quickly conclude that they also provide reliable medical advice to patients. Our study shows that this is not actually the case."
For the study "Evaluating the Accuracy of ChatGPT Model Versions for Giving Care-Seeking Advice," published in Communications Medicine, the research team tested 22 ChatGPT model versions using real-world cases involving 45 patients. These included conditions such as "acute tendon or ligament strain on the previous day" or "mild digestive disorders or diarrhea for one day without any other symptoms." Each case was entered ten times for each model. This produced a total of 9,900 individual assessments. In each case, the models had to decide whether a situation should be classified as an emergency, a case requiring medical evaluation, or a situation suitable for self-treatment.
Very little increase in accuracy
The analysis shows that accuracy initially increased significantly with the first model versions. However, since the advent of the third-generation model (GPT-4), there have only been minor improvements. The best-performing model achieved an accuracy rate of 74 percent. Although newer models more frequently recommended self-treatment, overall performance in this area remained limited.
Particular weaknesses in cases of minor ailments
The models tested were particularly effective at identifying cases requiring treatment. Most errors (70 percent) occurred in cases where self-treatment would have been sufficient. None of the 13 self-treatment cases was correctly identified by all models in all runs.
Only a few models, such as o4, o3, or GPT-5, recommended self-treatment at all. All other models tested consistently recommended a medical evaluation. This is problematic because a significant number of symptoms are actually harmless, resolve on their own, or can be treated at home.
The study thus reveals a structural pattern: As a precaution, almost all models tend to assess symptoms as requiring treatment more often than is medically necessary. The researchers use the term "conservative triage behavior" to describe this tendency. "We were actually surprised by how clear the results were," said Dr. Marvin Kopka. "They explicitly show that newer models do not automatically provide better answers to the questions that matter to patients. Better test or exam results do not necessarily translate into greater practical benefits in patient care."
What counts is the practical benefit
"In our view, what matters is not only whether a model correctly assesses individual cases, but also what practical value the recommendations actually have in everyday life. When a system recommends seeking medical advice as a precaution in response to many different symptoms, it may initially seem like solid advice to users; but in reality, it no longer offers any real guidance if the recommendation is almost always the same," says Kopka.
Same input, different recommendations
A further problem is that the models do not provide consistent answers. Even with identical inputs, there were significant variations depending on the model. Although newer models showed fewer cases that were never correctly assessed, they also had more cases with inconsistent recommendations across multiple runs. This was particularly evident in the case of GPT-5: In 42 percent of all cases, the recommendations were sometimes correct and sometimes incorrect, even though the exact same case was entered multiple times.
However, the experiment did show that accuracy can be improved by asking the same question multiple times and then selecting the lowest priority level from among the answers provided. This enabled overall accuracy to increase by an average of four percentage points, and accuracy in cases requiring self-care by as much as 14 percentage points. The researchers are quick to point out, though, that this is not a recommendation for end users, as it could lead to genuine emergencies being missed.
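This aggregation idea can be stated compactly in code. The sketch below is only an illustration of the principle, assuming a simple urgency ordering (emergency above medical evaluation above self-treatment); the labels and the helper function are hypothetical and do not come from the published study.

URGENCY = {"emergency": 2, "medical evaluation": 1, "self-treatment": 0}

def lowest_priority(recommendations: list[str]) -> str:
    """Return the least urgent triage level among repeated answers to the same case."""
    return min(recommendations, key=lambda level: URGENCY[level])

# Example: three runs of the identical case description
runs = ["medical evaluation", "self-treatment", "medical evaluation"]
print(lowest_priority(runs))  # -> "self-treatment"

Taking the minimum over repeated runs exploits the study's central finding that errors were almost exclusively over-cautious, so the least urgent of several answers is more often the correct one.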
Relevance to the debate on primary care
According to Kopka, the findings are also relevant to health policy. In Germany, an intense debate is taking place about primary care systems and forms of digital patient management. The TU Berlin study suggests that general language models such as ChatGPT are not currently a suitable standalone tool for this purpose. If, in practice, a system predominantly advises seeking medical advice, there is little real impact on actual decision-making, and it may even lead to an increase in unnecessary use of medical services.
Potential lies more in quality-assured applications
"We currently see the potential of large language models less in their use within manufacturers' chat windows and more in their meaningful integration into quality-assured applications, such as symptom checker apps. Used here, they could help present information in an understandable way, explain recommendations, and better guide people through existing care pathways – provided, of course, that medical quality assurance is carried out in the background," says Kopka.
Limitations of the study
The researchers also point out that the case set was chosen to be representative of real-world use. Since genuine medical emergencies are rare in everyday life and therefore also very rare when using ChatGPT, the dataset contained only a few such cases and primarily examined decisions for or against seeking medical attention. Further studies are planned to investigate how accurately the models identify genuine emergencies.
Contact
Marvin Kopka
marvin.kopka(at)tu-berlin.de