This article gave an LLM a bunch of health metrics and then asked it to reduce them to a single score, didn't tell us any of the actual metric values, and then compared that to a doctor's opinion. Why anyone would expect these to align is beyond my understanding.
The most obvious thing that jumps out to me is that I've noticed doctors generally, for better or worse, consider "health" much differently than the fitness community does. It's different toolsets and different goals. If this person's VO2 max estimate was under 30, that's objectively a poor VO2 max by most standards, and an LLM trained on the internet's entire repository of fitness discussion is likely going to give this person a bad score in terms of cardio fitness. But a doctor who sees a person come in who isn't complaining about anything in particular, moves around fine, doesn't have risk factors like age or family history, and has good metrics on a blood test is probably going to say they're in fine cardio health regardless of what their wearable says.
I'd go so far as to say this is probably the case for most people. Your average person is in really poor fitness-shape but just fine health-shape.
Instrumentation and testing become primarily useful at an individual level to explain or investigate someone's disease or disorder, or to screen for major risk factors, and the hazards and consequences of unnecessary testing outweigh the benefits in all but a few cases. For those few cases, your GP and/or government will (or should) routinely screen those at actual risk, which is why I pooped in a jar last week and mailed it.
An athlete chasing an ever-better VO2max or FTP hasn't necessarily got it wrong, however. We can say something like, "Bjorn Daehlie’s results are explained by extraordinary VO2max", with an implication that you should go get results some other way because you're not a five-sigma outlier. But at the pointy end of elite sport, there's a clear correlation between marginal improvement of certain measures and competitive outcomes, and if you don't think the difference of 0.01sec between first and third matters then you've never stood on a podium. Or worse, next to one. When mistakes are made and performance deteriorates, it's often due to chasing the wrong metric(s) for the athlete at hand, generally a failure of coaching.
BMI works fine for people who aren't very muscular, which is the great majority of people. Waist to height ratio might be more informative for people with higher muscle mass.
I'm 5'8" and weigh on average 210lbs. My BMI isn't even morbidly obese, it is 31, which is just "regular" obese, but on top of that, a DEXA scan shows that I am actually only 25% body fat, with only 1lb of visceral fat.
Doctors don't care about that; they see on the Epic chart that my BMI is > 30 and have to tell me some spiel about a healthier lifestyle so they can check off a checkbox and continue to the next screen.
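For what it's worth, both numbers in this subthread are trivial arithmetic; here's a small Python sketch using the figures above (the 36-inch waist is a made-up example, not the commenter's):

    def bmi(weight_lb: float, height_in: float) -> float:
        # standard imperial BMI formula: 703 * lb / in^2
        return 703 * weight_lb / height_in ** 2

    def waist_to_height(waist_in: float, height_in: float) -> float:
        # common guidance is to keep this ratio under ~0.5
        return waist_in / height_in

    print(round(bmi(210, 68), 1))             # ~31.9 -> "obese" band, regardless of composition
    print(round(waist_to_height(36, 68), 2))  # ~0.53, with an assumed 36" waist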
You would consider incorrectly then.
This person has ~155 pounds of lean body mass. 164 would put him at roughly a body builder level of fat, which basically requires a part time job in cooking and nutrition to maintain.
For reference, I’m in a similar situation to this person. I’m 5’11” (180cm) and about 200 lbs (91kg) with about 170 lbs of lean body mass. My dexa scan says that I’m 15% body fat, but I get the same lectures from doctors about being obese and needing a lifestyle change, all based on BMI and (I assume) my size (I’m barrel chested). It’s completely absurd.
Ideal body fat percentage is 18-24% - I'm at 25% (or was in November - might be +/- 2% since then - gained a few pounds weight, but not waist size).
So I would say I'm not morbidly obese or even regular obese based on the percentage of my body that is muscle vs fat.
It's really hard to tell with the data provided.
Somehow my body is just amazing at working without any help from me. I don't even exercise much. Maybe a few pushups a day, up and down my stairs at my house a couple dozen times a day, and probably 5-10k steps a day max.
On top of that, I'm not sure if that is a real indication of anything, either.
The reason to do that is to get an idea of your abdominal fat (which is the more dangerous place for fat to store), but there are two types of abdominal fat, one is dangerous (visceral fat) and one is completely benign (subcutaneous fat). And a measurement around your waist won't tell you which you have.
I personally have almost all of my fat subcutaneous, with only 1lb of visceral fat (which is right in the perfect range).
Literally all of them?
Follow that rule next time you read such a statement in a context that's not formal math.
That is not even true. We are talking anecdotal evidence here.
And a near universal experience with doctors for anybody paying attention is that.
One can reject it or accept it and improve upon it after checking its predictive power, or they can pause their thinking and wait for some authority to give them the official numbers on that.
All humans?
Sorry :)
Though, on second thought: yes, all humans, and not merely as a generalization. 100% of humans do it.
Almost every single start of every single appointment (including a follow up from just a couple days prior), they comment about my BMI. It is the rare time they don't that I remember. My last urology appointment the doctor was very congenial, didn't even go over the lab work, just said, everything is looking good, asked how I was feeling, everything good, alright, refilled my prescriptions and left.
When people say "oh BMI isn't accurate" it means you are more overweight than it suggests, unless you are literally an extreme body builder.
An individual learns nothing from its calculation and it has no clinical value. I receive more constructive feedback from an auntie jabbing me in the chest and saying "you got fat".
> the great majority of people
There is wide morphological variety across human populations, so, no.
This is true of many metrics and even lab results. Good doctors will counsel you and tell you that the lab results are just one metric and one input. The body acclimates to its current conditions over time, and quite often achieves homeostasis.
My grandma was living for years with an SpO2 in the 90-95% range as measured by pulse oximetry, but this was just one metric measured with one method. It doesn't mean her blood oxygen was actually repeatedly dropping, it just meant that her body wasn't particularly suited to pulse oximetry.
Muscle mass obviously does, though. Cystatin C is a better marker if your body composition differs from the "average".
https://www.nice.org.uk/guidance/ng203/chapter/rationale-and...
My creatinine levels are high because my body mass - including muscle mass - is well above average. On the basic kidney tests my GP did, my numbers indicated kidney disease. Doing a Cystatin C test showed very clearly that my numbers were firmly in the normal range.
The page does go on to point out the muscle mass issue:
> The committee highlighted the 2008 recommendation, which states that caution should be used when interpreting eGFR and in adults with extremes of muscle mass and on those who consume protein supplements (this was added to recommendation 1.1.1).
Further down they do mention Cystatin C, and seem to have basically decided that a risk of false positives is acceptable because of a lower risk of false negatives. That part is interesting, and it may well be the right decision at a population level.
But if your muscle mass is sufficiently above average, the regular kidney tests done will flag up possible kidney disease every single damn time you do one, and my experience is that UK doctors are totally oblivious to the fact that this is not necessarily cause for concern for a given patient and will often just assume a problem and it will be up to the patient to educate them.
EDIT: What's worse, actually, is the number of times I've had doctors or nurses try to help me "game" this test by telling me to e.g. drink more before the test next time, seemingly oblivious that, irrespective of precision, changing the test conditions also invalidates it as a way to track changes in eGFR.
Creatinine is the standard marker used for eGFR. It is also a byproduct of muscle metabolism. People who regularly lift weights or have lifestyles that otherwise result in a higher-than-normal muscularity will almost universally have higher creatinine levels than those who don't, assuming similar baseline kidney function. It's also problematic for people with extremely low muscle mass, for the opposite reason.
It's one of the reasons enhanced bodybuilders can get bit with failing kidney function - they know that their eGFR is going to look worse and worse based on creatinine formulas so they ignore it, when the elevated blood pressure from all the dbol they're popping is killing their kidneys.
Cystatin C is the better option for people with too much (or too little) muscle for creatinine to be accurate.
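To make the creatinine dependence concrete, here is a rough Python sketch of the 2021 race-free CKD-EPI creatinine equation (coefficients are from the published formula; the example creatinine values are illustrative, and this is obviously not medical software):

    def egfr_ckd_epi_2021(scr_mg_dl: float, age: int, female: bool) -> float:
        # 2021 CKD-EPI creatinine equation (race-free version)
        kappa = 0.7 if female else 0.9
        alpha = -0.241 if female else -0.302
        return (142
                * min(scr_mg_dl / kappa, 1.0) ** alpha
                * max(scr_mg_dl / kappa, 1.0) ** -1.200
                * 0.9938 ** age
                * (1.012 if female else 1.0))

    # same kidneys, different muscle mass: a muscular 35-year-old man with a baseline
    # creatinine of 1.4 mg/dL "scores" far worse than a less muscular peer at 1.0 mg/dL,
    # even though neither value necessarily reflects disease
    print(round(egfr_ckd_epi_2021(1.0, 35, female=False)))  # ~101
    print(round(egfr_ckd_epi_2021(1.4, 35, female=False)))  # ~67, close to the CKD flag threshold

Cystatin C based equations avoid that muscle-derived input entirely, which is why they behave better at the extremes of muscle mass.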
This gets to one of LLMs' core weaknesses: they blindly respond to your requests and rarely push back against the premise.
There's a reason why Oura rings are expensive and it's not the hardware - you can get similar stuff for 50€ on Aliexpress.
But none of them predicted my Covid infection days in advance. Oura did.
A device like the Apple Watch that's on you 24/7 is good with TRENDS, not absolute measurements. It can tell you if your heart rate, blood oxygen or something else is more or less than before, statistically. For absolute measurements it's OK, but not exact.
And from that we can make educated guesses on whether a visit to a doctor is necessary.
It actually warned you, or retrospectively looking at the metrics you could see that there was a pattern in advance of symptoms? (If the latter, same here with my Garmin watch - precipitous HRV decline in the 7 days before symptoms. But no actual warning.)
Of course it didn't tell me "you have COVID19-B variant C" - but it did tell me I'm probably sick and should seek care.
It somehow takes all that and gave me a "you might be sick" notification.
Like a car: it starts with a small noise first, one you can't hear. But in time the small noise becomes a big noise, just before things break.
If you catch it in the small noise part, you can proactively prepare.
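A toy sketch of what that "small noise" detection can look like for wearable data: compare a short recent window against the wearer's own longer baseline and flag a sustained drop. The window lengths and the 15% threshold below are made-up numbers, purely for illustration.

    import random

    # simulated HRV: a stable baseline followed by a week-long pre-illness dip
    hrv = [random.gauss(55, 5) for _ in range(60)] + \
          [random.gauss(40, 5) for _ in range(7)]

    def trend_alert(series, short=7, long=60, drop_fraction=0.15):
        if len(series) < long + short:
            return False
        baseline = sum(series[-(long + short):-short]) / long
        recent = sum(series[-short:]) / short
        return recent < baseline * (1 - drop_fraction)

    print(trend_alert(hrv))  # True for this simulated decline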
The standard risk models for CVD based on SCORE-2 and PREVENT-like parameters are very poor, as reported in the recently published paper on their accuracy by a Swedish team [1]. As with all CVD risk stratification subject to cardiologist review, the most important measure of accuracy is sensitivity (avoiding false negatives that will escape review); for SCORE-2 and PREVENT it is 48% and 26%, respectively.
The paper's alternative proposal increased the sensitivity to 58% by performing clustering instead of the conventional regression models practiced in the standard SCORE-2 (Europe) and PREVENT (US).
These types of models, including the latest proposal, performed very poorly, as indicated by the paper's otherwise excellent and intuitive graphical abstract [1].
[1] Risk stratification for cardiovascular disease: a comparative analysis of cluster analysis and traditional prediction models:
https://academic.oup.com/eurjpc/advance-article/doi/10.1093/...
Modern medicine has failed to move into the era of subtlety and small problems and many people suffer as a result. Fitness nerds and general non-scientists fill the gap poorly so we get a ton of guessing and anecdotal evidence and likely a whole lot of bad advice.
Doctors won't say there's a problem until you're SICK and usually pretty late in the process when there's not a lot of room to make improvements.
At the same time, doctors won't do anything if you're 5% off optimal, but they'll happily give you a medicine that improves one symptom that's 50% off optimal and comes along with 10 side effects. Although unless you're dying or have something really straightforward wrong with you, doctors don't do much at all besides giving you a sedative and/or a stimulant.
Doctors don't know what to do with small problems because they're barely studied and the people who DO try to do something don't do it scientifically.
I have a close friend who works in conservative care, and it’s astonishing what they see. For example, someone went to a number of specialists and doctors about a throat condition where they really struggled swallowing. They even had to swallow a radioactive pill to do some kind of imaging. Unnecessary exposure, and an expensive process to go through, and ultimately went exactly nowhere.
Meanwhile, it was a simple musculoskeletal issue which my friend was able to resolve in a single visit with absolutely no risk to the patient.
Medical schools need to stop producing MDs who reach for pills as the first line of defense without trying to root-cause issues. Do you really need addictive painkillers, or maybe some PT, exercise, massage, etc. to help resolve your pain?
As someone who is fit and active, in their 60s, with zero obvious symptoms, but is nonetheless on cholesterol and blood pressure medication, this isn't true (in the UK, at least).
And there's the problem: that they are "customers" who pay, either directly or via insurance, or via government insurance, as opposed to a nationalized healthcare system, and I mean healthcare, not nationalized health insurance.
Instead I use the health benefits programs of my health care insurer. My insurer has an interest in prevention, so I can get consulting for free (or very low fees), and even kickbacks if I regularly participate in fitness courses and maintain my yearly check-up routine. Now, I live in Germany and it probably is different in other countries, but it just makes economic sense from the insurer's point of view so that I would be surprised if it were very different elsewhere.
I think this creates a huge knowledge gap.
On the way to the hospital, ChatGPT was pretty confident it was an issue with my gallbladder due to me having a fatty meal for lunch (but it was delicious).
After an extended wait to be seen, they didn't ask about anything like that, and at the end, when they asked if there was anything else to add, I added in the bit about ChatGPT / gallbladder... discharged 5 minutes later with suspicion of gallbladder, as they couldn't do anything that night.
Over the next few weeks, got test after test after test to try and figure out what's going on. MRI. CT. Ultrasound, etc., etc. They all came back negative for the gallbladder.
ChatGPT was persistent. It said to get a HIDA scan, a more specialised scan. My GP was a bit reluctant but agreed. Got it, and was diagnosed with a hyperkinetic gallbladder. It is still not fully recognised as an issue, but mostly accepted. So much so that my surgeon initially said it wasn't a thing (then, after doing research about it, said it is a thing)... and a gastroenterologist also said it wasn't a thing.
Had it taken out a few weeks ago, and it was chronically inflamed. Which means the removal was the correct path to go down.
It just sucks that your wife was on the other end of things.
There are probably ("good") reasons for this. But your own persistence, and today the help of AI, can potentially help you. The problem with it is the same problem as previously: "charlatans". Just that today the charlatan and the savior are both one and the same: The AI.
I do recognize that most people probably can't tell one from the other. In both cases ;)
You'll find this in my post history a few times now but essentially: I was lethargic all the time, got migraine type headaches "randomly" a lot. Having the feeling I'd need to puke. One time I had to stop driving as it just got so bad. I suddenly was no longer able to tolerate alcohol either.
I went to multiple doctors, was sent to specialists, who all told me that they could maaaaaybe do test XYX but essentially: It wasn't a thing, I was crazy.
Through a lot of online research I "figured out" (and that's an over-statement) that it was something about the gut microbiome. Something to do with histamine. I tried a bunch of things, like I suspected it might be DAO (diamine oxidase) insufficiency. I tried a bunch of probiotics, both the "heals all your stuff" and "you need to take a single strain or it won't work" type stuff. Including "just take Actimel". Actimel gave me headaches! Turns out one of the (prominent) strains in there makes histamine. Guess what, alcohol, especially some kinds, has histamines, and your "hangover" is also essentially histamines (made worse by the dehydration). And guess what else, some foods, especially some I love, contain or break down into histamines.
So I figured that somehow it's all about histamines and how my current gut microbiome does not deal well with excess histamines (through whichever source). None of the doctors I went to believed this to be a "thing" nor did they want to do anything about it. Then I found a pro-biotic that actually helped. If you really want to check what I am taking, check the history. I'm not a marketing machine. What I do believe is that one particular bacterium helped, because it's the one thing that wasn't in any of the other ones I took: Bacillus subtilis.
A soil based bacterium, which in the olden times, you'd have gotten from slightly not well enough cleaned cabbage or whatever vegetable du jour you were eating. Essentially: if your toddler stuffs his face with a handful of dirt, that's one thing they'd be getting and it's for the better! I'm saying this, because the rest of the formulation was essentially the same as the others I tried.
I took three pills per day, breakfast, lunch and dinner. I felt like shit for two weeks, even getting headaches again. I stuck with it. After about two weeks I started feeling better. I think that's when my gut microbiome got "turned around". I was no longer lethargic and I could eat blue cheese and lasagna three days in a row with two glasses of red wine and not get a headache any longer! Those are all foods that contain or make lots of histamine. I still take one per day and I have no more issues.
But you gotta get to this, somehow, through all of the bullshit people that try to sell you their "miracle cure" stuff. And it's just as hard as trying to suss out where the AI is bullshitting you.
There was exactly one doctor in my life whom I would consider good in that regard. I had already figured the above out by that time, but I was doing keto and it got all of my blood markers, except for cholesterol, back into the normal range. She literally "googled" with me about keto a few times, did a blood test to confirm that I was in ketosis, and in general was just awesome about this. She was notoriously difficult to book and ran later than any doctor for scheduled appointments, but she took her time, and even that would not really ever have been enough to suss out the stuff I figured out through research myself, if you ask me. While doctors are the "half gods in white", I think there's just way too much stuff and way too little time for them. It's like: all the bugs at your place of work. Now imagine you had exactly one doctor across a multitude of companies. Of course they only figure out the "common" ones ...
In practice it means you often have to escalate from GP to local specialist to even more narrow specialist all the way to one of the regional big city specialist that almost exclusively get the weird cases.
This is because every hop is an increasingly narrow area of speciality.
Instead of just “cancer doctor” its the “GI cancer doctor” then its “GI cancer doctor of this particular organ” then its “an entire department of cancer doctors who work exclusively on this organ who will review the case together”, etc.
FWIW, as I understand it, many probiotics aren't going to colonize on their own and "stick around" for a prolonged period of time when you stop taking them, even under good circumstances but you can't quote me on that so to speak. And in the past we would've gotten many of them through one way or another through our diet as well, just not through a probiotic but naturally.
I tried multiple probiotics. Both blends of multiple types as well as things like "Saccharomyces Boulardii"-only preparation. I don't recall all the exact ones I tried though.
I don't get it... a doctor ordered the blood work, right? And surely they did not have this opinion or you would have been sent to a specialist right away. In this case, the GP who ordered the blood work was the gatekeeper. Shouldn't they have been the person to deal with this inquiry in the first place?
I would be a lot more negative about "the medical establishment" if they had been the ones who put you through the trauma. It sounds like this story is putting yourself through trauma by believing "Dr. GPT" instead of consulting a real doctor.
I will take it as a cautionary tale, and remember it next time I feed all of my test results into an LLM.
Also, the regular bloodwork is around $50-$100 (for noninsured or without a prescription), so many people just do this out of pocket once in a while and only bring to doctor if anything looks suspicious.
Finally, there is EU regulation about data that applies to medical field as well - you always have the right to view all the data that any company has stored about you. Gatekeeping is forbidden by law.
It gave you a probabilistic output. There were no guns and nothing to stick to. If you had disrupted the context with enough countervailing opinion it would have "relented" simply because the conversational probabilities changed.
For the general public, these tools have been advertised this way.
So if a good subset of HN still gets fooled, the layperson is screwed.
In fact, I can now easily access even my doctor's appointment notes. I have my entire chart unless my doctor specifically writes private notes.
It is so unfortunate that a general chatbot designed to answer anything was the first use case pushed. I get it when people are pissed.
Everyone that encounters this needs to do a clean/fresh prompt with memory disabled to really know if the LLM is going to consistently come to the same conclusion or not.
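Something like the sketch below is the cheapest way to run that check, assuming you have API access; the model name and prompt are placeholders, and each call starts with a completely fresh context:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    prompt = "Given these metrics <same pasted data every time>, grade my heart health A-F."

    answers = []
    for _ in range(5):
        resp = client.chat.completions.create(
            model="gpt-4o",  # assumption: use whichever model you actually care about
            messages=[{"role": "user", "content": prompt}],
        )
        answers.append(resp.choices[0].message.content)

    for a in answers:
        print(a[:80])  # if the grade swings from F to B across runs, it isn't telling you much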
I would never let an LLM make an amputate or not decision, but it could convince me to go talk with an expert who sees me in person and takes a holistic view.
You should be happy that it's not "the thing", specifically when the signs pointed towards it being "the thing"?
When it doesn't happen will you still be happy?
But to answer directly... yes? yes, I am.
[edit]
To make it a bit more real: my blood pressure monitor says my bp is 200/160. Chat says you're dead, get yourself to a hospital.
Get to the hospital and they say, oh, your bp monitor is wrong.
I'm happy? I would say that I am. Sure I'm annoyed at my machine, but way happier it's wrong than right.
"Yes I'm happy I'm not dying" ignores that "go to the hospital [and waste a day, maybe some financial cost]" because a machine was wrong. This is still pretty inconvenient because a machine wasn't accurate/calibrated/engineered weak. Not dying is good, but the emotions and fear for a period of time is still bad.
I 100% understand those frustrations. That the "detectors" should've been more accurate, or the fears, battery of tests, and costs associated of time and money. But, if you have the means to find out something that could have been extremely concerning is actually "nothing wrong" - isn't that worth it?
My friend is 45, had bloody stool -> colonoscopy -> polyps removed -> benign. Isn't that way better than colon cancer?
Maybe it's a glass half-empty/half-full thing.
What model?
Care to share the conversation? Or try again and see how the latest model does?
Why would you consult a known bullshit generator for anything this important?
I mean, at some point we have to admit that LLMs aren’t designed for correctness but utility.
It's an interesting cognitive dissonance that you both trusted it enough to go to a specialist but not enough to admit using it.
Why did you do the thing people calmly explained you should not do? Why are you pissed about experiencing the obvious and known outcome?
In medicine, even a test with "Worrying" results is rarely an actual condition requiring treatment. One reason doctors are so bad at long tail conditions is that they have been trained, both by education and literal direct experience, that chasing down test results without any symptoms is a reliable way to waste money, time, and emotions.
It's a classic statistics 101 topic to look at screening tests and notice that the majority of "positive" outcomes are false positives.
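The arithmetic, with made-up but typical-looking numbers, fits in a few lines of Python:

    prevalence = 0.01    # 1% of the screened population actually has the condition
    sensitivity = 0.90   # P(positive test | disease)
    specificity = 0.95   # P(negative test | no disease)

    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    ppv = true_pos / (true_pos + false_pos)

    print(f"P(disease | positive test) = {ppv:.1%}")  # ~15%: most positives are false positives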
And probably the same people laugh at ancient folks carefully listening to shamans.
There's plenty of blame to go around for everyone, but at least for some of it (such as the above) I think the blame more rests on Apple for falsely representing the quality of their product (and TFA seems pretty clearly to be blasting OpenAI for this, not others like Apple).
What would you expect the behavior of the AI to be? Should it always assume bad data or potentially bad data? If so, that seems like it would defeat the point of having data at all, as you could never draw any conclusions from it. Even disregarding statistical outliers, it's not at all clear what part of the data is "good" vs "unreliable", especially when the company that collected that data claims that it's good data.
Behind the scenes, it's using a pretty cool algorithm that combines deep learning with physiological ODEs: https://www.empirical.health/blog/how-apple-watch-cardio-fit...
Then there's confounders like altitude, elevation gain that can sully the numbers.
It can be pretty great, but it needs a bit of control in order to get a proper reading.
Seems like Apple's 95% accuracy estimate for VO2 max holds up.
Thirty participants wore an Apple Watch for 5-10 days to generate a VO2 max estimate. Subsequently, they underwent a maximal exercise treadmill test in accordance with the modified Åstrand protocol. The agreement between measurements from Apple Watch and indirect calorimetry was assessed using Bland-Altman analysis, mean absolute percentage error (MAPE), and mean absolute error (MAE).
Overall, Apple Watch underestimated VO2 max, with a mean difference of 6.07 mL/kg/min (95% CI 3.77–8.38). Limits of agreement indicated variability between measurement methods (lower -6.11 mL/kg/min; upper 18.26 mL/kg/min). MAPE was calculated as 13.31% (95% CI 10.01–16.61), and MAE was 6.92 mL/kg/min (95% CI 4.89–8.94).
These findings indicate that Apple Watch VO2 max estimates require further refinement prior to clinical implementation. However, further consideration of Apple Watch as an alternative to conventional VO2 max prediction from submaximal exercise is warranted, given its practical utility.
https://pmc.ncbi.nlm.nih.gov/articles/PMC12080799/
There was plenty of other concerning stuff in that article. And from a quick read it wasn't suggested or implied that the VO2 max issue was the deciding factor for the original F score the author received. The article did suggest many times over that ChatGPT is really not equipped for the task of health diagnosis.
> There was another problem I discovered over time: When I tried asking the same heart longevity-grade question again, suddenly my score went up to a C. I asked again and again, watching the score swing between an F and a B.
Yeah for sure, I probably didn't make it clear enough but I do fault OpenAI for this as much as or maybe more than Apple. I didn't think that needed to be stressed since the article is already blasting them for it and I don't disagree with most of that criticism of OpenAI.
Yes. You, and every other reasoning system, should always challenge the data and assume it’s biased at a minimum.
This is better described as “critical thinking” in its formal form.
You could also call it skepticism.
That impossibility of drawing conclusions assumes there’s a correct answer and is called the “problem of induction.” I promise you a machine is better at avoiding it than a human.
Many people freeze up or fail with too much data - put someone with no experience in front of 500 ppl to give a speech if you want to watch this live.
Well, I would expect the AI to provide the same response as a real doctor did from the same information. Which, as the article went over, the doctors were able to do.
I also would expect the AI to provide the same answer every time for the same data, unlike what it did (from F to B over multiple attempts in the article).
OpenAI is entirely to blame here when they are putting out faulty products (hallucinations even on accurate data are their fault).
1. Trust the source of the data to be honest about its quality
Or
2. Distrust the source
Approach number 2 basically means we can never do any analysis on it.
Personally I'd rather have a product that might be wrong than none at all, but that's a personal preference.
There is this constant debate about how accurately VO2max is measured, and it's highly dependent on actually doing exercise to determine your VO2max using your watch. But yes, if you want a lab/medically precise measure you need to do a test that measures your actual oxygen uptake.
I will also preface this with saying I do not think any LLM is better than the average doctor and that you are far better served going to your doctor than asking ChatGPT what your health is like on any factor.
But I'll also say that the quality of doctors varies massively, and that a good amount of doctors learn what they learn in school and do not keep up with the latest advances in research, particularly those that have broad spectrums such as GPs. LLMs that search scientific literature, etc., might point you in the direction of this research that the doctors are not aware of. Or hallucinate you into having some random disease that impacts 3 out of every million people and send you down a rabbithole for months.
Unfortunately, it's difficult to resolve this without extremely good insurance or money to burn. The depth you get and the level of information that a good preventative care cardiologist has is just miles ahead of where your average family medicine practitioner is at. Statins are an excellent example - new prescriptions for atorvastatin are still insanely high despite it being a fairly poor choice in comparison to rosuvastatin or pitavastatin for a good chunk of the people on it. They often are behind on the latest recommendations from the NLA and AHA, etc.
There's a world where LLMs or similar can empower everyday people to talk to their doctor about their options and where they stand on health, where they don't have to hope their doc is familiar with where the science has shifted over the past 5-10 years, or cough up the money for someone who specializes in it. But that's not the world of today.
In the mean time, I do think people should be comfortable being their own advocates with their doctors. I'm lucky enough that my primary care doc is open to reading the studies I send over to him on things and work with me. Or at least patient enough to humor me. But it's let me get on medications that treat my symptoms without side effects and improved my quality of life (and hopefully life/healthspan). There's also been things I've misinterpreted - I don't pick a fight with him if we come to opposite conclusions. He's shown good faith in agreeing with me where it makes sense to me, and pushed back where it hasn't, and I acknowledge he's the expert.
I wonder what it’s like now. Any time I use it for a diagnosis I get outlandish results, and then I’ll head to my GP and turns out it was something rather simple.
For more ambiguous situations where you need actual tests, I am skeptical of using LLMs.
That said if it's showing someone as having 30 I don't imagine they're going to be in spectacular condition
Now, 2 years later, I don’t run due to injury and a kid, and it’s resting at 34. For reference, when I went to the gym almost everyday and ran once or twice a week, the value was 32.
I don’t get much utility out of it, even looking at the trends. Not sure what Apple is doing behind the scenes to get the score.
I'm not that bothered of course. For me it's just a fun metric I can attempt to optimise when training.
Health and fitness correlate but are different things. VO2max is more about fitness than about health.
Also, looking at https://en.wikipedia.org/wiki/VO2_max#Reference_values, 30 is about average for men in their 40s/50s, which, from a quick google, I estimate is the author's age range.
And the average man in his 40s or 50s is in... not especially good aerobic shape.
My hypothesis is that the Apple Watch estimates higher if you are running rather than pedaling. I definitely don't think my cardiovascular fitness went from poor to great over a month. It seems more likely that it was maybe underestimating, and perhaps now is overestimating.
Incidentally I got rid of mine recently. It is bliss not having one.
Also VO2 max is a crappy measure of fitness. I apparently had "average" VO2 max after a treadmill test. I can hike 50km with a 2km elevation gain in one go and not die. People with higher VO2 max I know, dropped out.
I’ll quote:
“This is a measurement of your VO2 max, which is the maximum amount of oxygen your body can consume during exercise. Also called cardiorespiratory fitness, this is a useful measurement for everyone from the very fit to those managing illness.
A higher VO2 max indicates a higher level of cardio fitness and endurance.
Measuring VO2 max requires a physical test and special equipment. You can also get an estimated VO2 max with heart rate and motion data from a fitness tracker. Apple Watch can record an estimated VO2 max when you do a brisk walk, hike, or run outdoors.
VO2 max is classified for users 20 and older. Most people can improve their VO2 max with more intense and more frequent cardiovascular exercise. Certain conditions or medications that limit your heart rate may cause an overestimation of your VO2 max. VO2 max is not validated for pregnant users. You can indicate you're taking certain medications or add a current pregnancy in Health Details.”
And thru-hikers can do this for days. It’s more related to fatigue resistance, mitochondrial density, and walking efficiency. But VO2 max still matters in high-intensity sports, you can’t ignore it when you’re pedaling a bike at high Zone 4 in a race.
For it to get better, it needs to know outcomes of its diagnosis.
Are people just typing back to ChatGPT saying "you're wrong / you're right"?
First of all, wrist-based HR measurements are not reliable. If you feed ChatGPT a ton of HR data that is just plain wrong, expect a bad result. Everyone who wants to track HR reliably should invest in a chest strap. The VO2 max calculation is heavily based on your pace at a given heart rate. It makes some generalizations about your running biomechanics. For example, if your "real" lab-tested VO2 max stays constant, but you improve your biomechanics / running efficiency, you can run faster at the same effort, and your Apple Watch will increase your VO2 max number.
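To illustrate how much the HR inputs matter (this is not the formula Apple uses, which as noted above leans on pace at a given heart rate): one widely cited rule of thumb from Uth et al. estimates VO2 max as roughly 15.3 × HRmax/HRrest, and a wrist sensor that misreads resting HR by a handful of bpm moves the result a lot.

    def vo2max_hr_ratio(hr_max: float, hr_rest: float) -> float:
        # Uth et al. heart-rate-ratio estimate, ml/kg/min (rough rule of thumb only)
        return 15.3 * hr_max / hr_rest

    print(round(vo2max_hr_ratio(185, 55), 1))  # ~51.5
    print(round(vo2max_hr_ratio(185, 62), 1))  # same person, noisy resting HR: ~45.7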
You can't feed an LLM years of time-series meteorological data, and expect it to work as a specialized weather model, you can't feed it years of medical time-series and expect it to work as a model specifically trained, and validated on this specific kind of data.
An LLM generates a stream of tokens. You feed it a giant set of CSVs, if it was not RL'd to do something useful with it, it will just try to make whatever sense of it and generate something that will most probably have no strong numerical relationship to your data, it will simulate an analysis, it won't do it.
You may have a giant context window, but attention is sparse; the attention mechanism doesn't see your whole data at the same time. It can do some simple comparisons, like figuring out that if I say my current pressure is 210/180 I should call an ER immediately. But once I send it a time series of my twice-a-day blood pressure measurements for the last 10 years, it can't make any real sense of it.
Indeed, it would have been better for the author to ask the LLM to generate a python notebook to do some data analysis on it, and then run the notebook and share the result with the doctor.
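For example, something like the sketch below (the column names and file path are assumptions) gives the doctor deterministic monthly summaries and a trend line instead of the model's impression of a wall of numbers:

    import numpy as np
    import pandas as pd

    # assumed columns: timestamp, systolic, diastolic
    df = pd.read_csv("bp_log.csv", parse_dates=["timestamp"])

    monthly = (df.set_index("timestamp")
                 .resample("MS")[["systolic", "diastolic"]]
                 .agg(["mean", "max"]))
    print(monthly.tail(12))

    # crude long-term trend in systolic pressure, in mmHg per year
    t_years = (df["timestamp"] - df["timestamp"].min()).dt.days / 365.25
    slope = np.polyfit(t_years, df["systolic"], 1)[0]
    print(f"systolic trend: {slope:+.1f} mmHg/year")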
That is not a plausible outcome given the current technology or of any of OpenAI's demonstrated capabilities.
"If Bob's Hacksaw Surgery Center wants to stay in business they have to stop killing patients!"
Perhaps we should just stop him before it goes too far?
OpenAI has said that medical advice was one of their biggest use-cases they saw from users. It should be assumed they're investigating how to build out this product capability.
Google has LLMs fine tuned on medical data. I have a friend who works at a top-tier US medical research university, and the university is regularly working with ML research labs to generate doctor-annotated training data. OpenAI absolutely could be involved in creating such a product using this sort of source.
You can feed an LLM text, pictures, videos, audio, etc - why not train a model to accept medical-time-series data as another modality? Obviously this could have a negative performance impact on a coding model, but could potentially be valuable for a consumer-oriented chat bot. Or, of course, they could create a dedicated model and tool-call that model.
They are going to hire armies of developing world workers to massage those models on post-training to have some acceptable behaviors, and they will create the appropriate agents with the appropriate tools to have something that will simulate the real thing in a most plausible way.
Problem is, RLVR is cheap with code, but it can get very expensive with human physiology.
An LLM has structures in its latent space which allow it to do basic math; it has also seen enough data that it probably has structures in it to detect basic trends.
An LLM doesn't just generate a stream of tokens. It generates an embedding and searches/does something in its latent space, then returns tokens.
And you don't even know at all what LLM Interfaces do in the background. Gemini creates sub-agents. There can easily be already a 'trend detector'.
I even did a test and generated random data with a trend and fed it to ChatGPT. The output was very coherent and right.
Here is the paper where I read about it: https://arxiv.org/html/2601.04480v1
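That kind of test is easy to reproduce yourself: generate noisy data with a known trend, paste it in, and check whether the model's description matches the slope you actually baked in. A sketch, with an arbitrary drift:

    import numpy as np

    rng = np.random.default_rng(0)
    days = np.arange(120)
    values = 70 + 0.05 * days + rng.normal(0, 2, size=days.size)  # known upward drift

    print(", ".join(f"{v:.1f}" for v in values))                   # paste this into the chat
    print("ground-truth slope:", np.polyfit(days, values, 1)[0])   # ~0.05 per day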
I would claim that ignoring the "ChatGPT is AI and can make mistakes. Check important info." text, right under the query box in the client, is clearly more irresponsible.
I think that a disclaimer like that is the most useful and reasonable approach for AI.
"Here's a tool, and it's sometimes wrong." means the public can have access to LLMs and AI. The alternative, that you seem to be suggesting (correct me if I'm wrong), means the public can't have access to an LLM until they are near perfect, which means the public can't ever have access to an LLM, or any AI.
What do you see as a reasonable approach to letting the public access these imperfect models? Training? Popups/agreement after every question "I understand this might be BS"? What's the threshold for quality of information where it's no longer considered "broken"? Is that threshold as good as or better than humans/news orgs/doctors/etc?
This marketing obscures what the software is _actually_ good at and gives users a poor mental model of what's going on under the hood. Dumping years worth of un-differentiated health data into a generic chatGPT chat window seems like a fundamental misunderstanding of the strengths of large language models.
A reasonable approach would be to try to explain what kind of tasks these models do well at and what kind of situations they behave poorly in.
I live in a place where getting a blood test requires a referral from a doctor, who is also required to discuss the results with you.
Could you tell me which source of information you see as "perfect" (or acceptable), as a good example of the threshold for what you think the public should and should not have access to?
Also, what if a tool still provides value to the user, in some contexts, but not to others, in different contexts (for example, using the tool wrong)?
For the "tool" perspective, I've personal never seen a perfect tool. Do you have an example?
> I live in a place where getting a blood test requires a referral from a doctor, who is also required to discuss the results with you.
I don't see how this is relevant. In the above article, the user went to their doctor for advice and a referral. But, in the US (and, many European countries) blood tests aren't restricted, and can be had from private labs out of pocket, since they're just measurements of things that exist in your blood, and not allowing you to know what's inside of you would be considered government overreach/privacy violation. Medical interpretations/advice from the measurements is what's restricted, in most places.
I know it when I see it.
> I don't see how this is relevant.
It's relevant because blood testing is an imperfect tool. Laypeople lack the knowledge/experience to identify imperfections and are likely to take results at face value. Like the author of the article did when ChatGPT gave them an F for their cardiac health.
> Medical interpretations/advice from the measurements is what's restricted, in most places.
Do you agree with that restriction?
This isn't a reasonable answer. No action can be taken and no conclusion/thought can be made from it.
> Do you agree with that restriction?
People should be able to perform and be informed about their own blood measurements, and possibly bring something up with their doctors outside of routine exams (which they may not even be insured for in the US). I think the restriction on medical advice/conclusion, that results in treatment, is very good, otherwise you end up with "Wow, look at these results! you'll have to buy my snake oil or you'll die!".
I don't believe in reducing society to a level that completely protects the most stupid of us.
Sure it is. The world runs on human judgement. If you want me to rephrase I could say that the threshold for imperfection should reflect contemporary community standards, but Stewart's words are catchier.
> I think the restriction on medical advice/conclusion, that results in treatment, is very good, otherwise you end up with "Wow, look at these results! you'll have to buy my snake oil or you'll die!".
Some people would describe this as an infringement on their free speech and bodily autonomy.
Which is to say that I think you and I agree that people in general need the government to apply some degree of restriction to medicine, we just disagree about where the line is.
But I think if I asked you to describe to me exactly where the line is you'd ultimately end up at some incarnation of "I know it when I see it".
Which is fine. Even good, I think.
> I don't believe in reducing society to a level that completely protects the most stupid of us.
This seems at odds with what you said above. A non-stupid person would seek multiple consistent opinions before accepting medical treatment, after all.
What's the most complex (in an information rich way) tool that you have seen?
To me, this is horrific. I am the advocate for my own health. I trust my doctor - he's a great guy. I have spoken to him extensively around a variety of health matters and I greatly trust his opinion.
But I also recognize that he has many other patients and by necessity has to work within the general lines of probability. There is no way for him to know every confounding and contributing factor of my health, no matter how diligent I am in filling out my chart.
I get my own bloodwork done regularly. This has let me make significant changes in my life to improve health markers. I can also get a much broader spectrum of tests done than the standard panel. This has directly led to productive conversations with my doctor!
And from a more philosophical standpoint, this is about understanding my own body. The source of the data is me. Why should this be gatekept behind a physician referral? I find it insane to think that I could be in a position where I am not allowed to find out the cholesterol serum levels in my blood unless a doctor OKs it! What the fuck?
You’re saying it like it’s a good thing.
Allow it to answer general questions about health, medicine and science.
It can’t practice medicine, it can only be a talking encyclopedia that tells you how the heart works and how certain biomarkers are used. Analyzing your specific case or data is off limits.
And then when the author asks his question, it says it’s not designed to do that.
Considering the number of people who take LLM responses as authoritative Truth, that wouldn't be the worst thing in the world.
The product itself is telling you in plain English that it’s ABSOLUTELY CERTAIN about its answer… even when you challenge it and try to rebut it. And the text of the product itself is much more prominent than the little asterisk “oh no, it’s actually lying because the LLM can never be that certain.” That’s clearly not a responsible product.
I opened the ChatGPT app right now and there is literally nothing about double checking results. It just says “ask anything,” in no uncertain terms, with no fine print.
Here’s a recent ad from OpenAI: https://youtu.be/uZ_BMwB647A, and I quote “Using ChatGPT allowed us to really feel like we have the facts and our doctor is giving us his expertise, his experience, his gut instinct” related to a severe health question.
And another recent ad related to analyzing medical scans: “What’s wonderful about ChatGPT is that it can be that cumulative source of information, so that we can make the best choices.” (https://youtu.be/rXuKh4e6gw4)
And yet another recent ad, where lots of users are using ChatGPT to get authoritative answers to health questions. They even say you can take a picture of a meal before you eat and after you eat, and have it generate the amount of calories you ate! Just based on the difference between the pictures! How has that been tested and verified? (https://youtu.be/305lqu-fmbg)
Now, some of the ads have users talking to their doctors, which is great.
But they are clearly marketing ChatGPT as the tool to use if you want to arrive at the truth. No asterisks. No “but sometimes it’s wrong and you won’t be able to tell.” There’s nothing to misunderstand about these ads: OpenAI is telling you that ChatGPT is trustworthy.
So I reject the premise that it’s the user’s fault for not using enough caution with these tools. OpenAI is practically begging you to jump in and use it for personal, life or death type decisions, and does very little to help you understand when it may be wrong.
It's the same thing that can be said about any human:
> "Doctor is human and can make mistakes"
Therefore it's really not sufficient, since it doesn't make clear that it is wrong in different ways than a human, and worse.
If you are lucky, maybe it was fine-tuned to see a long comma-delimited sequence of values as a table and then emit a series of tool calls to generate some deterministic code to calculate a set of descriptive statistics that will then be close in the latent space to some hopefully current medical literature, and it will generate some things that make sense and are not absurdly wrong.
It is a fucking LLM, it is not 2001's HAL.
Imagine if as a dev someone came to you and told you everything that is wrong with your tech stack because they copy pasted some console errors into ChatGPT. There's a reason doctors need to spend almost a decade in training to parse this kind of info. If you do the above then please do it with respect for their profession.
My wife is a lawyer and sees the same thing at her job. People "writing" briefs or doing legal "research" with GPT and then insisting that their document must be right because the magic AI box produced it.
When reading news stories on topics you know well, you notice inaccuracies or poor reporting - but then immediately forget that lesson when reading the next article on a topic you are not familiar with.
It's very similar to what happens with AI.
“A little knowledge is a dangerous thing” is not new, it’s a quote/observation that goes back hundreds of years.
> Imagine if as a dev someone came to you and told you everything that is wrong with your tech stack because they copy pasted some console errors into ChatGPT.
You mean the PHB? They don’t need ChatGPT for that, they can cite Gartner.
The basic idea was to adapt JEPA (Yann LeCun's Joint-Embedding Predictive Architecture) to multivariate time series, in order to learn a latent space of human health from purely unlabeled data. Then, we tested the model using supervised fine tuning and evaluation on on a bunch of downstream tasks, such as predicting a diagnosis of hypertension (~87% accuracy). In theory, this model could be also aligned to the latent space of an LLM--similar to how CLIP aligns a vision model to an LLM.
IMO, this shows that accuracy in consumer health will require specialized models alongside standard LLMs.
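For readers unfamiliar with JEPA-style training, here is a deliberately minimal PyTorch sketch of the general idea applied to multivariate time series; the layer sizes, masking scheme, and channel count are invented for illustration and are not the model described above.

    import copy
    import torch
    import torch.nn as nn

    class TSEncoder(nn.Module):
        def __init__(self, n_channels=8, d_model=64):
            super().__init__()
            self.proj = nn.Linear(n_channels, d_model)
            layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, num_layers=2)

        def forward(self, x):                    # x: (batch, time, channels)
            return self.backbone(self.proj(x))   # (batch, time, d_model)

    context_enc = TSEncoder()
    target_enc = copy.deepcopy(context_enc)      # updated by EMA in practice, not by gradients
    for p in target_enc.parameters():
        p.requires_grad_(False)
    predictor = nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 64))

    x = torch.randn(16, 96, 8)                   # e.g. 96 time steps of 8 wearable signals
    mask = torch.zeros(96, dtype=torch.bool)
    mask[48:72] = True                           # hide a contiguous span from the context

    ctx = context_enc(x.masked_fill(mask[None, :, None], 0.0))
    with torch.no_grad():
        tgt = target_enc(x)[:, mask]             # latent targets for the hidden span
    pred = predictor(ctx)[:, mask]               # predict those latents from the visible context
    loss = nn.functional.mse_loss(pred, tgt)     # loss lives in latent space, no value reconstruction
    loss.backward()

The pretrained encoder is then what gets fine-tuned on labeled downstream tasks, like the hypertension prediction mentioned above.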
They usually require more data. It is not a great idea to diagnose anything with so little information. But in general I am optimistic about the use of LLMs in health.
A family member recently passed away from a rare, clinically diagnosed disease. ChatGPT knew what it was a couple months before the relevant specialists diagnosed it.
I'm definitely not going with Apple. Are there any minimally obtrusive trackers that provide downloadable data?
... and you won't believe what happened next!
Can we do away with the clickbait from MSN? The article is about LLMs misdiagnosing cardiovascular status when given fitness tracker data
Of course, the real solution is to stop using Microsoft products, which I did.
Sure, LLM companies and proponents bear responsibility for the positioning of LLM tools, and particularly their presentation as chat bots.
But from a systems point of view, it's hard to ignore the inequity and inconvenience of the US health system driving people to unrealistic alternatives.
(I wonder if anyone's gathering comparable stats on "Doctor LLM" interactions in different countries... there were some interesting ones that showed how "Doctor Google" was more of a problem in the US than elsewhere.)
At the end of the day, it’s yet another tool that people can use to help their lives. They have to use their brain. The culture of seeing doctor as a god doesn’t hold up anymore. So many people have had bad experiences when the entire health care industry at least in US is primarily a business than helping society get healthy.
Paywall-free version at https://archive.ph/k4Rxt
Look, AI Healthbros, I'll tell you quite clearly what I want from your statistical pattern analyzers, and you don't even have to pay me for the idea (though I wouldn't say no to a home or Enterprise IT gig at your startup):
I want an AI/ML tool to not merely analyze my medical info (ON DEVICE, no cloud sharing kthx), but also extrapolate patterns involving weather, location, screen time, and other "non-health" data.
Do I record taking tylenol when the barometric pressure drops? Start alerting me ahead of time so I can try to avoid a headache.
Does my screen time correlate to immediately decreased sleep scores? Send me a push notification or webhook I can act upon/script off of, like locking me out of my device for the night or dimming my lights.
Am I recording higher-intensity workouts in colder temperatures or inclement weather? Start tracking those metrics and maybe keep better track of balance readings during those events for improved mobility issue detection.
Got an app where I track cannabis use or alcohol consumption? Tie that to my mental health journal or biological readings to identify red flags or concerns about misuse.
Stop trying to replace people like my medical care team, and instead equip them with better insights and datasets they can more quickly act upon. "Subject has been reporting more negative moods in his mental health journal, an uptick in alcohol consumption above his baseline, and inconsistent cannabis use compared to prior patterns" equips the care team with a quick, verifiable blurb from larger datasets that can accelerate care and improve patient outcomes - without the hallucinations of generative AI.
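Most of that wishlist is ordinary, inspectable statistics rather than generative AI. A toy example of the screen-time/sleep item, with an assumed daily_log.csv and made-up column names:

    import pandas as pd

    # assumed columns: date, screen_minutes_evening, sleep_score
    df = pd.read_csv("daily_log.csv", parse_dates=["date"]).sort_values("date")

    # compare tonight's screen time with the following morning's sleep score
    df["next_sleep_score"] = df["sleep_score"].shift(-1)
    corr = df["screen_minutes_evening"].corr(df["next_sleep_score"], method="spearman")
    print(f"evening screen time vs next-day sleep score (Spearman): {corr:+.2f}")

    # a simple flag a care team (or a webhook) could consume
    uptick = df["screen_minutes_evening"].tail(14).mean() > 1.5 * df["screen_minutes_evening"].mean()
    print("screen-time uptick vs baseline:", uptick)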
I strongly dislike the author conflating HIPAA with PHI but this is a losing battle for me. And clearly editors don’t spot it, neither do AI systems - where is Clippy?! It simply serves as an indicator the author is a pretty ignorant medical consumer in the US, and this case study is stunning. Some people really should not be allowed to engage with magic.
Maybe a very small bit of Information Theory (a couple of Shannon's papers are enough) and some classical books on Natural Language Processing from the late 90s and early 2000 so you have an idea of what Language Models are outside the modern Deep Learning driven approach.
An average human doctor has maybe 15 minutes allotted to getting to know you, analyse and determine a course of action. Which is usually "take some ibuprofen and let's see if it goes away". Then you go again in two weeks with the same thing, it's a different doctor and the context has been reset unless you do an info dump from the previous visits and try not to forget anything.
And if you infodump too much or use actual medical diagnosis terms, the Dr gets defensive because you're stepping on THEIR area of expertise and will start pushing back even from the obvious just because they can.
Time spent in a medical encounter is tied to patient satisfaction but there is rapid drop off for clinical benefit especially in the current day where investigations are more important than a physical exam in most cases and more than history in a substantial portion.
15 minutes is what we book as follow-ups or minor assessments in US+Canada, usually sufficient for most things. New consults or complex patients are 30-60 minutes.
Infodumping is not particularly helpful. Doctors are trained to use a combination of open and closed questions to guide the encounter based on their thinking and understanding of medicine. It’s relevant past medical history as not every symptom or past disease is necessarily useful in assessing what’s wrong today.
This is not how doctors work in most of the world. Not having an actual primary care physician that is able to keep track of each patient over multiple years means they are skipping out on one of their most important duties. You should advocate for a better standard of care rather than resorting to hallucinating chatbots.
Nobody sees the same doctor twice except in very rare cases - usually when the doctor is a specialist with no alternative
What would you prefer instead?
Or how VO2 max is hard to measure, or how not wearing a wearable or wearing it loose changes results, down to: I gave an LLM a range to rate without really giving it context for what I want the range to represent, or the methods used to gather the data.
Tldr; author bought everything, read nothing, complained to an expensive professional, and now hopes that we read his article.