Geoffrey A. Fowler, tech columnist for the Washington Post, was keen to find out. He gave the new tool access to a decade’s worth of data – including 29 million steps and 6 million heartbeat measurements stored in his Apple Health app. He then asked the bot to rate his health, specifically his heart health.

The bot gave him a flat 6 (an F in the American grading system, in other words the worst possible mark).

“I panicked and went jogging,” Fowler writes in his article. He then sent the ChatGPT report to his real doctor, whose response was far more reassuring: Fowler’s risk of a heart attack is actually so low that his insurance company would probably not even pay for an additional test to rule out the AI’s verdict.

ChatGPT Health: “This is not ready for any medical advice.”

When Fowler asked the AI to assess his heart health again, the verdict changed dramatically: suddenly it was a “4” (D). He kept asking and watched the grade fluctuate between a “6” (F) and a “2” (B). The journalist showed the results to cardiologist Eric Topol, an expert on the use of AI in medicine. His verdict was unequivocal:

This is baseless. This is not ready for any medical advice.

Topol was equally harsh about the arbitrary fluctuations in the ratings, calling this kind of randomness “completely unacceptable”.

The danger is obvious: such unsophisticated tools could cause extreme anxiety in healthy people or, conversely, lull people with genuine health problems into a false sense of security.

Misinterpretation: Why AI often fails because of the data

ChatGPT’s flawed analysis had specific causes. First, the AI based its negative assessment heavily on the VO2 max figure. Yet Apple itself describes this value only as an “estimate”, and independent researchers have found that these estimates are, on average, around 13 percent too low.

Second, the AI interpreted fluctuations in Fowler’s resting heart rate as worrying health signals and advised him to speak to his doctor about a “notable increase”. In doing so, it overlooked a crucial factor: the changes always coincided with the journalist switching to a new Apple Watch model with improved sensors.

An AI doctor with memory loss

ChatGPT Health treated this “fuzzy data” as if it were unambiguous medical fact, missing the critical context that a human expert would recognize immediately. Even when Fowler connected his official medical records from his doctor’s office to the tool, the grade only improved to a “5” (D), further proof that the core problem is not the amount of data but the quality of the analysis itself.

Across various chats, ChatGPT repeatedly forgot important personal information about Fowler, including his gender, age, and current vital signs. Even when the AI had access to his latest blood test results, it sometimes failed to include them in the analysis. For a tool intended to provide personal health insights, this is a fundamental flaw. A consistent understanding of the individual is the foundation of any meaningful health assessment.

Putting it in context: Where do we stand with AI health tools?

This experiment fits into a worrying trend: technology companies are bringing AI products to market that are, at best, inaccurate and, at worst, downright “dangerous”. The problem is by no means limited to OpenAI and ChatGPT Health: Anthropic’s competing bot Claude also rated Fowler’s heart health a “4” (D), relying on the same flawed logic.

Furthermore, these services are not covered by strict health data privacy laws such as HIPAA in the US. The experiment is therefore an important wake-up call and raises a crucial question for the future: Who should we trust when an AI gives bad health advice, and how can we ensure that future innovations are safe and reliable?

What about you? Are you already using AI to track your health data or to answer health questions?