Even the Best AI Chatbots Get Health Questions Wrong 1 in 5 Times, Study Finds
With the rise of AI chatbots like ChatGPT, Claude, and Gemini, many people turn to them for quick health advice instead of visiting a doctor. While these tools are convenient and accessible anytime, a recent study reveals a significant drawback: even the most advanced AI chatbots make mistakes on health-related questions about 20% of the time.
Overview of the Study
Researchers from Penn State evaluated four popular large language model (LLM) chatbots to assess the accuracy and safety of their responses to medical questions. The study, released as a preprint and not yet peer-reviewed, used real and hypothetical health queries submitted by university students, staff, and faculty. A panel of nine board-certified physicians reviewed and graded 212 AI-generated responses on validity, quality, reasoning, and potential to cause harm.
Chatbots Tested
- ChatGPT-4o
- ChatGPT-3.5
- Gemini-1.5 Pro
- Llama3-8b
The physicians' grades were then used to compare the strengths and weaknesses of each model in handling medical information and advice.
Key Findings on AI Chatbot Performance
The overall accuracy of AI-generated medical answers was around 76%. While this might seem high, it means nearly one in four responses contained invalid or potentially harmful information.
Among the chatbots, ChatGPT-4o performed the best with 84.6% of responses rated as valid. However, its error rate of over 15% still highlights substantial risk. At the other end, Llama3-8b had only half of its answers rated as valid.
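To make these percentages easier to compare, the short sketch below converts each reported validity rate into an error rate and a rough "1 in N" figure. The overall and ChatGPT-4o numbers come from the article; the 50.0 for Llama3-8b is an assumption standing in for "only half", and the function name `error_summary` is purely illustrative.

```python
# Validity rates as reported in the article (percent of responses rated valid).
# The Llama3-8b value is an assumed stand-in for "only half".
validity_pct = {
    "All chatbots (overall)": 76.0,
    "ChatGPT-4o": 84.6,
    "Llama3-8b": 50.0,
}


def error_summary(valid_pct: float) -> tuple[float, float]:
    """Return (percent not rated valid, N such that roughly 1 in N answers is not valid)."""
    error_pct = 100.0 - valid_pct
    return round(error_pct, 1), round(100.0 / error_pct, 1)


for name, pct in validity_pct.items():
    err, one_in = error_summary(pct)
    print(f"{name}: {err}% not rated valid, roughly 1 in {one_in}")
```

Run as written, the overall figure works out to roughly 1 in 4.2 responses, ChatGPT-4o to roughly 1 in 6.5, and Llama3-8b to 1 in 2.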
Impact of Medical Specialty and Question Length
Accuracy varied by medical field. Questions about obstetrics and gynecology were answered most accurately, whereas neurology, internal medicine, and dermatology scored lower. Neurology cases often involved rare and complex conditions difficult to diagnose, and dermatology relies on visual assessment, which text-only chatbots cannot perform effectively.
Question length also influenced performance. Medium-length queries (60 to 250 characters) yielded the best results, while very short or very long questions often led to weaker answers. More specific and focused questions helped the AI reason better and provide higher-quality responses.
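For readers who want to check where their own question falls, here is a minimal sketch of the length buckets described above. The 60 and 250 character thresholds come from the article; the bucket names, the choice to treat both boundaries as "medium", and the decision to ignore leading and trailing whitespace are assumptions, since the article does not spell them out.

```python
def length_bucket(question: str, low: int = 60, high: int = 250) -> str:
    """Classify a question as 'short', 'medium', or 'long' by character count.

    Assumption: the 60 and 250 character boundaries count as 'medium',
    and surrounding whitespace is not counted.
    """
    n = len(question.strip())
    if n < low:
        return "short"
    if n > high:
        return "long"
    return "medium"


examples = [
    "Is ibuprofen safe?",
    "I have had a dull headache behind my left eye for three days. "
    "It gets worse in bright light. Should I see a doctor?",
]
for q in examples:
    print(length_bucket(q), "-", len(q.strip()), "characters")
```

The second example illustrates the article's point: adding a few specifics about symptoms and duration is usually enough to move a question out of the short bucket.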
Retrieval-Augmented Generation (RAG) Technique Yields Mixed Results
The study explored the effect of augmenting AI chatbots with a specialized medical encyclopedia using Retrieval-Augmented Generation (RAG). This method provides AI access to a curated library of medical textbooks, clinical guidelines, and research articles before generating answers, theoretically improving reliability.
For Gemini-1.5 Pro and Llama3-8b, however, the physician reviewers preferred the standard answers generated without RAG. For the ChatGPT models, there was no significant difference between the two. The researchers suggested that the impact of RAG may depend on the specific AI model and recommended further study.
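The general idea behind RAG can be illustrated with a toy sketch: look up the most relevant passage in a reference library, then place it in the prompt ahead of the user's question. Everything here is hypothetical and is not the study's pipeline: the three-sentence `LIBRARY`, the function names, and the prompt wording are invented for illustration, and simple word overlap stands in for the more sophisticated search that production systems typically use. No chatbot is actually called.

```python
import re

# Hypothetical mini "library"; the study's actual reference corpus is not reproduced here.
LIBRARY = [
    "Ibuprofen is a nonsteroidal anti-inflammatory drug used for pain and fever.",
    "Migraine is a headache disorder that can involve nausea and light sensitivity.",
    "Eczema is a skin condition that causes dry, itchy, inflamed patches.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question: str, passages: list[str], k: int = 1) -> list[str]:
    """Rank passages by how many words they share with the question; return the top k."""
    q_words = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q_words & tokenize(p)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, passages: list[str]) -> str:
    """Prepend the retrieved reference text to the question (the 'augmented' prompt)."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages))
    return (
        "Use the reference material to answer.\n\n"
        f"Reference material:\n{context}\n\n"
        f"Question: {question}"
    )


prompt = build_prompt("What helps with a migraine headache?", LIBRARY)
print(prompt)  # This text would then be sent to the chatbot in place of the bare question.
```

In this toy version the migraine passage is retrieved because it shares the most words with the question. The quality of that lookup step is one plausible reason results could differ from model to model, though the article does not say what drove the differences the researchers observed.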
Implications for AI Use in Health Guidance
Nearly one-fourth of adults under 30 already use AI monthly for health-related advice, a figure cited in the study. While AI chatbots provide fast and accessible information, the risk of incorrect or harmful advice is notable. This underscores the importance of cautious interpretation and the continued necessity of consulting qualified healthcare providers for critical health decisions.
This study highlights both the promise and current limitations of AI in healthcare. Advances in AI technology could improve accuracy and safety, but users must remain aware of the potential pitfalls until these tools are more refined.
Conclusion
AI chatbots are transforming how people seek medical information, offering convenience and speed. However, this study reveals that even the best AI can provide incorrect health answers about 20% of the time. The findings emphasize that AI should complement, not replace, professional medical advice. Patients should use AI chatbots cautiously and always verify significant health concerns with licensed healthcare practitioners.
As AI technology evolves and research continues, future developments may enhance chatbot reliability and safety in supporting medical decision-making.
Source: studyfinds.com