Thanks for reporting this. We fixed a Claude Code harness issue that was introduced on 1/26. This was rolled back on 1/28 as soon as we found it.
Run `claude update` to make sure you're on the latest version.
They don't have to be malicious operators in this case. It just happens.
It doesn't have to be malicious. If my workflow is to send a prompt once and hopefully accept the result, then degradation matters a lot. If degradation is causing me to silently get worse code output on some of my commits it matters to me.
I care about -expected- performance when picking which model to use, not optimal benchmark performance.
The non-determinism means that even with a temperature of 0.0, you can’t expect the outputs to be the same across API calls.
In practice people tend to index to the best results they’ve experienced and view anything else as degradation. In practice it may just be randomness in either direction from the prompts. When you’re getting good results you assume it’s normal. When things feel off you think something abnormal is happening. Rerun the exact same prompts and context with temperature 0 and you might get a different result.
To say that this measurement is bad because the server might just be overloaded completely misses the point. The point is to see if the model sometimes silently performs worse. If I get a response from "Opus", I want a response from Opus. Or at least want to be told that I'm getting slightly-dumber-Opus this hour because the server load is too much.
When you add A then B then C, you get a different answer than C then A then B, because floating point, approximation error, subnormals etc.
Unsurprising given they amount to explicit synchronization to make the order of operations deterministic.
If you're going to call a PRNG deterministic then the outcome of a complicated concurrent system with no guaranteed ordering is going to be deterministic too!
e.g
if (batch_size > 1024): kernel_x else: kernel_y
I think the more likely explanation is again with the extremely heterogeneous compute platforms they run on.
“… To state it plainly: We never reduce model quality due to demand, time of day, or server load. …”
So according to Anthropic they are not tweaking quality setting due to demand.
And according to Meta, they always give you ALL the data they have on you when requested.
However, the request form is on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying ‘Beware of the Leopard'.
I've seen sporadic drops in reasoning skills that made me feel like it was January 2025, not 2026 ... inconsistent.
these things are by definition hard to reason about
Sure, I'll take a cup of coffee while I wait (:
at least i would KNOW it’s overloaded and i should use a different model, try again later, or just skip AI assistance for the task altogether.
welcome to the Silicon Valley, I guess. everything from Google Search to Uber is fraud. Uber is a classic example of this playbook, even.
If you use the subscriptions, the terms specifically says that beyond the caps they can limit your "model and feature usage, at our discretion".
> How do I know which model Gemini is using in its responses?
> We believe in using the right model for the right task. We use various models at hand for specific tasks based on what we think will provide the best experience.
... for Google :)
So if you want batching + determinism, you need the same batch with the same order which obviously don't work when there are N+1 clients instead of just one.
I don't know if they do this or not, but the nature of the API is such you could absolutely load balance this way. The context sent at each point is not I believe "sticky" to any server.
TLDR you could get a "stupid" response and then a "smart" response within a single session because of heterogeneous quantization / model behaviour in the cluster.
assume this is because of model costs. anthropic could either throw some credits their way (would be worthwhile to dispel the 80 reddit posts a day about degrading models and quantization) or OP could throw up a donation / tip link
E.g. some binomial interval proportions (aka confidence intervals).
How do you pay for those SWE-bench runs?
I am trying to run a benchmark but it is too expensive to run enough runs to get a fair comparison.
"trust but verify" ofc . https://latent.space/p/artificialanalysis do api keys but also mystery shopper checks
I agree.
I'll also add that when my startup got acquired into a very large, well-known valley giant with a sterling rep for integrity and I ended up as a senior executive - over time I got a first-hand education on the myriad ways genuinely well-intentioned people can still end up being the responsible party(s) presiding over a system doing net-wrong things. All with no individual ever meaning to or even consciously knowing.
It's hard to explain and I probably wouldn't have believed myself before I saw and experienced it. Standing against an overwhelming organizational tide is stressful and never leads to popularity or promotion. I think I probably managed to move on before directly compromising myself but preventing that required constant vigilance and led to some inter-personal and 'official' friction. And, frankly, I'm not really sure. It's entirely possible I bear direct moral responsibility for a few things I believe no good person would do as an exec in a good company.
That's the key take-away which took me a while to process and internalize. In a genuinely good organization with genuinely good people, it's not "good people get pressured by constraints and tempted by extreme incentives, then eventually slip". I still talk with friends who are senior execs there and sometimes they want to talk about whether something is net good or bad. I kind of dread the conversation going there because it's inevitably incredibly complex and confusing. Philosopher's trolley car ethics puzzles pale next to these multi-layered, messy conundrums. But who else are they going to vent to who might understand? To be clear, I still believe that company and its leadership to be one of the most moral, ethical and well-intentioned in the valley. I was fortunate to experience the best case scenario.
Bottom line: if you believe earnest, good people being in charge is a reliable defense against the organization doing systemically net-wrong things - you don't comprehend the totality of the threat environment. And that's okay. Honestly, you're lucky. Because the reality is infinitely more ambiguously amoral than white hats vs black hats - at the end of the day the best the 'very good people' can manage is some shade of middle gray. The saddest part is that good people still care, so they want to check the shade of their hat but no one can see if it's light enough to at least tell yourself "I did good today."
We already know large graphics card manufacturers tuned their drivers to recognize specific gaming benchmarks. Then when that was busted, they implemented detecting benchmarking-like behavior. And the money at stake in consumer gaming was comparatively tiny compared to current AI valuations. The cat-and-mouse cycle of measure vs counter-measure won't stop and should be a standard part of developing and administering benchmark services.
Beyond hardening against adversarial gaming, benchmarkers bear a longer term burden too. Per Goodhart's Law, it's inevitable good benchmarks will become targets. The challenge is the industry will increasingly target performing well on leading benchmarks, both because it drives revenue but also because it's far clearer than trying to glean from imprecise surveys and fuzzy metrics what helps average users most. To the extent benchmarks become a proxy for reality, they'll bear the burden of continuously re-calibrating their workloads to accurately reflect reality as user's needs evolve.
Thanks!
"You can't measure my Cloud Service's performance correctly if my servers are overloaded"?
"Oh, you just measured me at bad times each day. On only 50 different queries."
So, what does that mean? I have to pick specific times during the day for Claude to code better?
Does Claude Code have office hours basically?
Basically the paper showed methods for how to handle heavy traffic load by changing model requirements or routing to different ones. This was awhile ago and I'm sure it's massively more advanced now.
Also why some of AI's best work for me is early morning and weekends! So yes, the best time to code with modern LLM stacks is when nobody else is. It's also possibly why we go through phases of "they neutered the model" some time after a new release.
Yes. Now pay up or you will be replaced.
https://www.anthropic.com/engineering/a-postmortem-of-three-...
It’s a terrific idea to provide this. ~Isitdownorisitjustme for LLMs would be the parakeet in the coalmine that could at least inform the multitude of discussion threads about suspected dips in performance (beyond HN).
What we could also use is similar stuff for Codex, and eventually Gemini.
Really, the providers themselves should be running these tests and publishing the data.
The availability status information is no longer sufficient to gauge the service delivery because it is by nature non-deterministic.
Are you suggesting result accuracy varies with server load?
Aha, so the models do degrade under load.
1. The percentage drop is too low and oscillating, it goes up and down.
2. The baseline of Sonnet 4.5 (the obvious choice for when they have GPU busy for the next training) should be established to see Opus at some point goes Sonnet level. This was not done but likely we would see a much sharp decline in certain days / periods. The graph would look like dominated by a "square wave" shape.
3. There are much better explanations for this oscillation: A) They have multiple checkpoints and are A/B testing, CC asks you feedbacks about the session. B) Claude Code itself gets updated, as the exact tools version the agent can use change. In part it is the natural variability due to the token sampling that makes runs not equivalent (sometimes it makes suboptimal decisions compared to T=0) other than not deterministic, but this is the price to pay to have some variability.
It's almost as if, as tool use and planning capabilities have expanded, Claude (as a singular product) is having a harder time coming up with simple approaches that just work, instead trying to use tools and patterns that complicate things substantially and introduce much more room for errors/errors of assumption.
It also regularly forgets its guidelines now.
I can't tell you how many times it's suggested significant changes/refactors to functions because it suddenly forgets we're working in an FP codebase and suggests inappropriate imperative solutions as "better" (often choosing to use language around clarity/consistency when the solutions are neither).
Additionally, it has started taking "initiative" in ways it did not before, attempting to be helpful but without gathering the context needed to do so properly when stepping outside the instruction set. It just ends up being much messier and inaccurate.
I have to regularly just clear my prompt and start again with guardrails that have either: already been established, or have not been needed previously / are only a result of the over-zealousness of the work its attempting to complete.
1pm EST time it’s all down hill until around 8 or 9pm EST time.
Late nights and weekends is smooth sailing.
It’s not so much that the implementations are bad because the code is bad (the code is bad). It’s that it gets extremely confused and starts to frantically make worse and worse decisions and questioning itself. Editing multiple files, changing its mind and only fixing one or two. Reseting and overriding multiple batches of commits without so much as a second thought and losing days of work (yes, I’ve learned my lesson).
It, the model, can’t even reason with the decisions it’s making from turn to turn. And the more opaque agentic help it’s getting the more I suspect that tasks are being routed to much lesser models (not the ones we’ve chosen via /model or those in our agent definitions) however Anthropic chooses.
In these moments I mind as well be using Haiku.
like these models are nondeterministic right? (besides the fact that rng things like top k selection and temperature exist)
say with every prompt there is 2% odds the AI gets it massively wrong. what if i had just lucked out the past couple weeks and now i had a streak of bad luck?
and since my expectations are based on its previous (lucky) performance i now judge it even though it isn't different?
or is it giving you consistenly worse performance, not able to get it right even after clearing context and trying again, on the exact same problem etc?
But it’s impossible to actually determine if it’s model variance, polluted context (if I scold it, is it now closer in latent space to a bad worker, and performs worse?), system prompt and tool changes, fine tunes and AB tests, variances in top P selection…
There’s too many variables and no hard evidence shared by Anthropic.
There's little incentive to throttle the API. It's $/token.
Either way, if true, given the cost I wish I could opt-out or it were more transparent.
Put out variants you can select and see which one people flock to. I and many others would probably test constantly and provide detailed feedback.
All speculation though
I know it’s more random sampling than not. But they are definitely using our codebases (and in some respects our livelihoods) as their guinea pigs.
Why January 8? Was that an outlier high point?
IIRC, Opus 4.5 was released late november.
A benchmark like this ought to start fresh from when it is published.
I don't entirely doubt the degradation, but the choice of where they went back to feels a bit cherry-picked to demonstrate the value of the benchmark.
If anything it's coherent with the fact that they very likely didn't have data earlier than January the 8th.
& it would be easy for them to start with a very costly inference setup for a marketing / reputation boost, and slowly turn the knobs down (smaller model, more quantized model, less thinking time, fewer MoE experts, etc)
How do you define “too low”, they make sure to communicate about the statistical significance of their measurements, what's the point if people can just claim it's “too low” based on personal vibes…
They're going to need to provide a lot more detail on their methodology, because that doesn't make a lot of sense. From their graphs, they seem to be calculating the confidence interval around the previous value, then determining whether the new value falls outside of it. But that's not valid for establishing the statistical significance of a difference. You need to calculate the confidence interval of the difference itself, and then see if all the values within that confidence interval remain positive (if it excludes 0). This is because both the old and new measurement have uncertainty. Their approach seems to be only considering uncertainty for one of them.
They should also really be more specific about the time periods. E.g. their graphs only show performance over the past 30 days, but presumably the monthly change is comparing the data from 60 to 31 days ago, to the data from 30 days ago until yesterday? In which case the weekly graph really ought to be displaying the past two months, not one month.
It was probably 3x faster than usual. I got more done in the next hour with it than I do in half a day usually. It was definitely a bit of a glimpse into a potential future of “what if these things weren’t resource constrained and could just fly”.
It’s not always then, but it often follows it.
It's not my fault, they set high standards!
This last week it seems way dumber than before.
Anthropic does not exactly act like they're constrained by infra costs in other areas, and noticeably degrading a product when you're in tight competition with 1 or 2 other players with similar products seems like a bad place to start.
I think people just notice the flaws in these models more the longer they use them. Aka the "honeymoon-hangover effect," a real pattern that has been shown in a variety of real world situations.
Ultimately I can understand if a new model is coming in without as much optimization then it'll add pressure to the older models achieving the same result.
Nice plausible deniability for a convenient double effect.
"You have a bug in line 23." "Oh yes, this solution is bugged, let me delete the whole feature." That one-line fix I could make even with ChatGPT 3.5 can't just happen. Workflows that I use and are very reproducible start to flake and then fail.
After a certain number of tokens per day, it becomes unusable. I like Claude, but I don't understand why they would do this.
More: probably don't know if they've got a good answer 100% of the time.
It is interesting to note that this trickery is workable only where the best answers are sufficiently poor. Imagine they ran almost any other kind of online service such email, stock prices or internet banking. Occasionally delivering only half the emails would trigger a customer exodus. But if normal service lost a quarter of emails, they'd have only customers who'd likely never notice half missing.
Yet vendor's costs to deliver these services are skyrocketing, competition is intense and their ability to subsidize with investor capital is going away. The pressure on vendors to reduce costs by dialing back performance a few percent or under-resourcing peak loads will be overwhelming. And I'm just a hobbyist now. If I was an org with dozens or hundreds of devs I'd want credible ways to verify the QoS and minimum service levels I'm paying for are being fulfilled long after a vendor has won the contract.
The larger monthly scale should be the default, or you should get more samples.
How do you actually use these in production pipelines in practice then?
Are LLMs even well suited for some of the document parsing / data scrubbing automation people are throwing at them now?
Wouldn't this just be "our test isn't powerful enough to find a signal if there were one here?"
People will see this and derive strong conclusions that the data don't support and you, `qwesr123`, or "JB" from your blogs, will be responsible.
That would be a nice paper.
You have to test inter-day variation. Many have noticed a sudden drop off at certain times.
On HN a few days ago there was a post suggesting that Claude gets dumber throughout the day: https://bertolami.com/index.php?engine=blog&content=posts&de...
I would be curious to see on how it fares against a constant harness.
There were thread claiming that Claude Code got worse with 2.0.76, with some people going back to 2.0.62. https://github.com/anthropics/claude-code/issues/16157
So it would be wonderful to measure these.
I wouldn't be surprised if the thing this is actually testing is benchmarking just claude codes constant system prompt changes.
I wouldn't really trust this to be able to benchmark opus itself.
TikTok used to give new uploaders a visibility boost (i.e., an inflated number of likes and comments) on their first couple of uploads, to get them hooked on the the service.
In Anthropic/Claude's case, the strategy is (allegedly) to give new users access to the premium models on sign-up, and then increasingly cut the product with output from cheaper models.
Anthropic did sell a particular model version.
I would suggest adding some clarification to note that longer measure like 30 pass rate is raw data only while the statistically significant labels apply only to change.
Maybe something like Includes all trials, significance labels apply only to confidence in change vs baseline.
Is exacerbating this issue ... if the load theory is correct.
[1] https://thebullshitmachines.com/lesson-16-the-first-step-fal...
They should be transparent and tell customers that they're trying to not lose money, but that'd entail telling people why they're paying for service they're not getting. I suspect it's probably not legal to do a bait and switch like that, but this is pretty novel legal territory.
Just ignore the continual degradation of service day over day, long after the "infrastructure bugs" have reportedly been solved.
Oh, and I've got a bridge in Brooklyn to sell ya, it's a great deal!
Forgive me, but as a native English speaker, this sentence says exactly one thing to me; We _do_ reduce model quality, just not for these listed reasons.
If they don't do it, they could put a full stop after the fifth word and save some ~~tokens~~ time.
Very simple queries, even those easily answered via regular web searching, have begun to consistently not result accurate results with Opus 4.5, despite the same prompts previously yielding accurate results.
One of the tasks that I already thought was fully saturated as most recent releases had no issues in solving it was to request a list of material combinations for fabrics used in bag constructions that utilise a specific fabric base. In the last two weeks, Claude has consistently and reproducibly provided results which deviate from the requested fabric base, making the results inaccurate in a way that a person less familiar with the topic may not notice instantly. There are other queries of this type for other topics I am nerdily familiar with to a sufficient degree to notice such deviations from the prompt like motorcycle history specific queries that I can say this behaviour isn't limited to the topic of fabrics and bag construction.
Looking at the reasoning traces, Opus 4.5 even writes down the correct information, yet somehow provides an incorrect final output anyways.
What makes this so annoying is that in coding tasks, with extensive prompts that require far greater adherence to very specific requirements in a complex code base, Opus 4.5 does not show such a regression.
I can only speculate what may lead to such an experience, but for none coding tasks I have seen regression in Opus 4.5 whereas for coding I did not. Not saying there is none, but I wanted to point it out as such discussions are often primarily focused on coding, where I find it can be easier to see potential regressions where their are none as a project goes on and tasks become inherently more complex.
My coding benchmarks are a series of very specific prompts modifying a few existing code bases in some rather obscure ways, with which I regularly check whether a model does severely deviate from what I'd seen previously. Each run starts with a fresh code base with some fairly simple tasks, then gets increasingly complex with later prompts not yet being implemented by any LLM I have gotten to test. Partly that originated from my subjective experience with LLMs early on, where I found a lot of things worked very well but then as the project went on and I tried more involved things with which the model struggled, I felt like the model was overall worse when in reality, what had changed were simply the requirements and task complexity as the project grew and easier tasks had been completed already. In this type of testing, Opus 4.5 this week got as far and provided a result as good as the model did in December. Of course, past regressions were limited to specific users, so I am not saying that no one is experiencing reproducible regressions in code output quality, merely that I cannot reproduce them in my specific suite.
I didn't "try 100 times" so it's unclear if this is an unfortunate series of bad runs on Claude Code and Gemini CLI or actual regression.
I shouldn't have to benchmark this sort of thing but here we are.
Claude-Code is terrible with context compaction. This solves that problem for me.
I would imagine a sort of hybrid qualities of volunteer efforts like wikipedia, new problems like advent of code and benchmarks like this. The goal? It would be to study the collective effort on the affects of usage to so many areas where AI is used.
[MedWatch](https://www.fda.gov/safety/medwatch-fda-safety-information-a...)
[VAERS](https://www.cdc.gov/vaccine-safety-systems/vaers/index.html)
[EudraVigilance](https://www.ema.europa.eu/en/human-regulatory-overview/resea...)
TikTok used to give new uploaders a visibility boost (i.e., an inflated number of likes and comments) on their first couple of uploads, to get them hooked on the the service.
In Anthropic/Claude's case, the strategy is (allegedly) to give new users access to the premium models on sign-up, and then increasingly cut the product with output from cheaper models.
Of course, your suggestion (better service for users who know how to speak Proper English) would be the cherry on top of this strategy.
From what I've seen on HackerNews, Anthropic is all-in on social media manipulation and social engineering, so I suspect that your assumption holds water.
If this measure were hardened up a little, it would be really useful.
It feels like an analogue to an employee’s performance over time - you could see in the graphs when Claude is “sick” or “hungover”, when Claude picks up a new side hustle and starts completely phoning it in, or when it’s gunning for a promotion and trying extra hard (significant parameter changes). Pretty neat.
Obviously the anthropomorphising is not real, but it is cool to think of the model’s performance as being a fluid thing you have to work with, and that can be measured like this.
I’m sure some people, most, would prefer that the model’s performance were fixed over time. But come on, this is way more fun.
They were fighting an arms race that was getting incredibly expensive and realized they could get away with spending less electricity and there was nothing the general population could do about it.
Grok/Elon was left out of this because he would leak this idea at 3am after a binge.
"No no yeah bro no I'm good like really the work's done and all yeah sorry I missed that let me fix it"
It wouldn’t be the first time companies have secret shadow algorithms running to optimize things and wouldn’t it be obvious to limit power users as matter of cost/profit and not tell them. (See history of “Shadow ban” though that’s for different reasons)
Doesn't really work like that. I'd remove the "statistically significant" labelling because it's misleading.
I've been using CC more or less 8 hrs/day for the past 2 weeks, and if anything it feels like CC is getting better and better at actual tasks.
Edit: Before you downvote, can you explain how the model could degrade WITHOUT changes to the prompts? Is your hypothesis that Opus 4.5, a huge static model, is somehow changing? Master system prompt changing? Safety filters changing?
Is CC getting better, or are you getting better at using it? And how do you know the difference?
I'm an occasional user, and I can definitely see improvements in my prompts over the past couple of months.
For me I've noticed it getting nothing but better over the past couple months, but I've been working on my workflows and tooling.
For example, I used to use plan mode and would put everything in a single file and then ask it to implement it in a new session.
Switching to the 'superpowers' plugin with its own skills to brainstorm and write plans and execute plans with batches and tasks seems to have made a big improvement and help catch things I wouldn't have before. There's a "get shit done" plugin that's similar that I want to explore as well.
The code output always looks good to me for the most part though and I've never thought that it's getting dumber anything, so I feel like a lot of the improvements I see are because of a skill issue on my part trying to use everything. Obviously it doesn't help there's a new way to do things every two weeks though.
My initial prompting is boilerplate at this point, and looks like this:
(Explain overall objective / problem without jumping to a solution)
(Provide all the detail / file references / past work I can think of)
(Ask it "what questions do you have for me before we build a plan?")
And then go back and forth until we have a plan.
Compared to my work with CC six months ago, it's just much more capable, able to solve more nuanced bugs, and less likely to generate spaghetti code.
Thumbs up or down? (could be useful for trends) Usage growth from the same user over time? (as an approximation) Tone of user responses? (Don't do this... this is the wrong path... etc.)
> Edit: Before you downvote, can you explain how the model could degrade WITHOUT changes to the prompts?
The article actually links to this fine postmortem by anthropic that demonstrates one way this is possible - software bugs affecting inference: https://www.anthropic.com/engineering/a-postmortem-of-three-...
Another way this is possible is the model reacting to "stimuli", e.g. the hypothesis at the end of 2023 that the (then current) ChatGPT was getting lazy because it was finding out the date was in december and it associated winter with shorter lazier responses.
A third way this is possible is the actual conspiracy version - Anthropic might make changes to make inference cheaper at the expense of the quality of the responses. E.g. quantizing weights further or certain changes to the sampling procedure.