All 7 books come to ~1.75M tokens, so they don't quite fit yet. (At this rate of progress, mid-April should do it.) For now you can fit the first 4 books (~733K tokens).
Results: Opus 4.6 found 49 out of 50 officially documented spells across those 4 books. The only miss was "Slugulus Eructo" (a vomiting spell).
Freaking impressive!
If I use Opus 4.6 with Extended Thinking (Web Search disabled, no books attached), it answers with 130 spells.
Basically, with some tricks they managed to get 99% of it word for word; the tricks were needed to bypass security measures that are in place precisely to stop people from retrieving training material.
Nothing.
You can be sure that this was already known in the training data of PDFs, books and websites that Anthropic scraped to train Claude on; hence 'documented'. This is why tests like what the OP just did are meaningless.
Such "benchmarks" are performative for VCs, who never ask why the research and testing isn't done independently but is almost always done by the labs' own in-house researchers.
> The smug look on Malfoy’s face flickered.
> “No one asked your opinion, you filthy little Mudblood,” he spat.
> Harry knew at once that Malfoy had said something really bad because there was an instant uproar at his words. Flint had to dive in front of Malfoy to stop Fred and George jumping on him, Alicia shrieked, “How dare you!”, and Ron plunged his hand into his robes, pulled out his wand, yelling, “You’ll pay for that one, Malfoy!” and pointed it furiously under Flint’s arm at Malfoy’s face.
> A loud bang echoed around the stadium and a jet of green light shot out of the wrong end of Ron’s wand, hitting him in the stomach and sending him reeling backward onto the grass.
> “Ron! Ron! Are you all right?” squealed Hermione.
> Ron opened his mouth to speak, but no words came out. Instead he gave an almighty belch and several slugs dribbled out of his mouth onto his lap.
(I'm from OpenAI.)
The following are true:
- In our API, we don't change model weights or model behavior over time (e.g., by time of day, or weeks/months after release)
- Tiny caveats include: there is a bit of non-determinism in batched non-associative math that can vary by batch / hardware, bugs or API downtime can obviously change behavior, heavy load can slow down speeds, and this of course doesn't apply to the 'unpinned' models that are clearly supposed to change over time (e.g., xxx-latest). But we don't do any quantization or routing gimmicks that would change model weights.
- In ChatGPT and Codex CLI, model behavior can change over time (e.g., we might change a tool, update a system prompt, tweak default thinking time, run an A/B test, or ship other updates); we try to be transparent with our changelogs (listed below) but to be honest not every small change gets logged here. But even here we're not doing any gimmicks to cut quality by time of day or intentionally dumb down models after launch. Model behavior can change though, as can the product / prompt / harness.
ChatGPT release notes: https://help.openai.com/en/articles/6825453-chatgpt-release-...
Codex changelog: https://developers.openai.com/codex/changelog/
Codex CLI commit history: https://github.com/openai/codex/commits/main/
https://www.reddit.com/r/OpenAI/comments/1qv77lq/chatgpt_low...
The intention was purely making the product experience better, based on common feedback from people (including myself) that wait times were too long. Cost was not a goal here.
If you still want the higher reliability of longer thinking times, that option is not gone. You can manually select Extended (or Heavy, if you're a Pro user). It's the same as at launch (though we did inadvertently drop it last month and restored it yesterday after Tibor and others pointed it out).
It's worth checking different versions of Claude Code, and updating your tools if you don't do it automatically. Also run the same prompts through VS Code, Cursor, Claude Code in terminal, etc. You can get very different model responses based on the system prompt, what context is passed via the harness, how the rules are loaded and all sorts of minor tweaks.
If you make raw API calls and see behavioural changes over time, that would be another concern.
PS - I appreciate you coming here and commenting!
However it's possible that consumers without a sufficiently tiered plan aren't getting optimal performance, or that the benchmark is overfit and the results won't generalize well to the real tasks you're trying to do.
I think a lot of people are concerned due to 1) significant variance in performance being reported by a large number of users, and 2) specific examples of OpenAI and other labs benchmaxxing in the recent past (https://grok.com/share/c2hhcmQtMw_66c34055-740f-43a3-a63c-4b...).
It's tricky because there are so many subtle ways in which "the numbers are all real" could be technically true in some sense, yet still not reflect what a customer will experience (eg harnesses, etc). And any of those ways can benefit the cost structures of companies currently subsidizing models well below their actual costs with limited investor capital. All with billions of dollars in potential personal wealth at stake for company employees and dozens of hidden cost/performance levers at their disposal.
And it doesn't even require overt deception on anyone's part. For example, the teams doing benchmark testing of unreleased new models aren't the same people as the ops teams managing global deployment/load balancing at scale day-to-day. If there aren't significant ongoing resources devoted to specifically validating that those two things remain in sync, they'll almost certainly drift apart. And it won't be anyone's job to even know it's happening until a meaningful number of important customers complain or sales start to fall. Of course, if an unplanned deviation causes costs to rise over budget, it's a high-priority bug to be addressed. But if the deviation goes the other way and costs are a little lower than expected, no one's getting a late-night incident alert. This isn't even a dig at OpenAI in particular; it's just the default state of how large orgs work, unless scarce resources are devoted to actively preventing it.
They can both write fairly good idiomatic code but in my experience opus 4.5 is better at understanding overall project structure etc. without prompting. It just does things correctly first time more often than codex. I still don't trust it obviously but out of all LLMs it's the closest to actually starting to earn my trust
Curious to see how things will be with 5.3 and 4.6
I definitely suspect all these models are being degraded during heavy loads.
You always have to question these benchmarks, especially when the in-house researchers can potentially game them if they wanted to.
Which is why it must be independent.
[0] https://gizmodo.com/meta-cheated-on-ai-benchmarks-and-its-a-...
Seems like 4.6 is still all-around better?
SWE-bench Pro public is newer, but it's not live, so it will slowly get memorized as well. The private dataset is more interesting, as are the results there:
> Version 2.1.32:
• Claude Opus 4.6 is now available!
• Added research preview agent teams feature for multi-agent collaboration (token-intensive feature, requires setting CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1)
• Claude now automatically records and recalls memories as it works
• Added "Summarize from here" to the message selector, allowing partial conversation summarization
• Skills defined in .claude/skills/ within additional directories (--add-dir) are now loaded automatically
• Fixed @ file completion showing incorrect relative paths when running from a subdirectory
• Updated --resume to re-use the --agent value specified in the previous conversation by default
• Fixed: Bash tool no longer throws "Bad substitution" errors when heredocs contain JavaScript template literals like ${index + 1}, which previously interrupted tool execution
• Skill character budget now scales with context window (2% of context), so users with larger context windows can see more skill descriptions without truncation
• Fixed Thai/Lao spacing vowels (สระ า, ำ) not rendering correctly in the input field
• VSCode: Fixed slash commands incorrectly being executed when pressing Enter with preceding text in the input field
• VSCode: Added spinner when loading past conversations list

Neat: https://code.claude.com/docs/en/memory
I guess it's kind of like Google Antigravity's "Knowledge" artifacts?
It's very happy to throw a lot into the memory, even if it doesn't make sense.
That’s harsh, man.
I've had claude reference prior conversations when I'm trying to get technical help on thing A, and it will ask me if this conversation is because of thing B that we talked about in the immediate past
I asked Claude UI to clear its memory a little while back and hoo boy CC got really stupid for a couple of days
It gives you a convenient way to say "remember this bug for me, we should fix tomorrow". I'll be playing around with it more for sure.
I asked Claude to give me a TLDR (condensed from its system prompt):
----
Persistent directory at ~/.claude/projects/{project-path}/memory/, persists across conversations
MEMORY.md is always injected into the system prompt; truncated after 200 lines, so keep it concise
Separate topic files for detailed notes, linked from MEMORY.md
What to record: problem constraints, strategies that worked/failed, lessons learned
Proactive: when I hit a common mistake, check memory first - if nothing there, write it down
Maintenance: update or remove memories that are wrong or outdated
Organization: by topic, not chronologically
Tools: use Write/Edit to update (so you always see the tool calls)
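A minimal sketch of the truncation behavior that TLDR describes (the 200-line limit and MEMORY.md filename are from the summary above; the directory layout and function name here are hypothetical):

```python
from pathlib import Path

MAX_LINES = 200  # per the TLDR: MEMORY.md is truncated after 200 lines


def load_memory(memory_dir: Path) -> str:
    """Read MEMORY.md and truncate it the way the TLDR describes."""
    memory_file = memory_dir / "MEMORY.md"
    if not memory_file.exists():
        return ""
    lines = memory_file.read_text().splitlines()
    # anything past line 200 is silently dropped, hence "keep it concise"
    return "\n".join(lines[:MAX_LINES])
```

Which is why stuffing everything into MEMORY.md instead of the linked topic files quietly loses the tail.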
I create a git worktree, start Claude Code in that tree, and delete after. I notice each worktree gets a memory directory in this location. So is memory fragmented and not combined for the "main" repo?
https://claude.ai/public/artifacts/14a23d7f-8a10-4cde-89fe-0...
> there are approximately 200k common nouns in English, and then we square that, we get 40 billion combinations. At one second per, that's ~1200 years, but then if we parallelize it on a supercomputer that can do 100,000 per second that would only take 3 days. Given that ChatGPT was trained on all of the Internet and every book written, I'm not sure that still seems infeasible.
There are estimated to be 100 or so prepositions in English. That gets you to 4 trillion combinations.
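The back-of-envelope numbers check out in a few lines (using the rough counts quoted above; note the brute-force time at 100k/s comes out closer to 4.6 days than the quoted 3):

```python
nouns = 200_000   # rough estimate of common English nouns
preps = 100       # rough estimate of English prepositions

noun_pairs = nouns ** 2           # 40 billion two-noun combinations
with_prep = noun_pairs * preps    # 4 trillion noun-preposition-noun phrases

years_serial = noun_pairs / (60 * 60 * 24 * 365)  # one guess per second
days_parallel = noun_pairs / 100_000 / 86_400     # 100,000 guesses per second

print(f"{noun_pairs:,} pairs, {with_prep:,} with a preposition")
print(f"~{years_serial:,.0f} years serial, ~{days_parallel:.1f} days at 100k/s")
```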
It's called "The science of cycology: Failures to understand how everyday objects work" by Rebecca Lawson.
https://link.springer.com/content/pdf/10.3758/bf03195929.pdf
https://www.freepik.com/free-vector/cyclist_23714264.htm
https://www.freepik.com/premium-vector/bicycle-icon-black-li...
Or missing/broken pedals:
https://www.freepik.com/premium-vector/bicycle-silhouette-ic...
https://www.freepik.com/premium-vector/bicycle-silhouette-ve...
http://freepik.com/premium-vector/bicycle-silhouette-vector-...
Also, is it bad that I almost immediately noticed that both of the pelican's legs are on the same side of the bicycle, but I had to look up an image on Wikipedia to confirm that they shouldn't have long necks?
Also, have you tried iterating prompts on this test to see if you can get more realistic results? (How much does it help to make them look up reference images first?)
They got surprisingly far, but I did need to iterate a few times to have it build tools that would check for things like: don't put walls on roads or water.
What I think might be the next obstacle is self-knowledge. The new agents seem to have picked up ever more vocabulary about their context and compaction, etc.
As a next benchmark you could try having 1 agent and tell it to use a coding agent (via tmux) to build you a pelican.
I asked Opus 4.6 for a pelican riding a recumbent bicycle and got this.
$200 * 1,000 = $200k/month.
I'm not saying they are, but to say they aren't with such certainty when money is on the line, unless you have some insider knowledge you'd like to share with the rest of the class, seems like a questionable conclusion.
I don't think it quite captures their majesty: https://en.wikipedia.org/wiki/K%C4%81k%C4%81p%C5%8D
Would you mind sharing which benchmarks you think are useful measures for multimodal reasoning?
First: marginal inference cost vs total business profitability. It’s very plausible (and increasingly likely) that OpenAI/Anthropic are profitable on a per-token marginal basis, especially given how cheap equivalent open-weight inference has become. Third-party providers are effectively price-discovering the floor for inference.
Second: model lifecycle economics. Training costs are lumpy, front-loaded, and hard to amortize cleanly. Even if inference margins are positive today, the question is whether those margins are sufficient to pay off the training run before the model is obsoleted by the next release. That’s a very different problem than “are they losing money per request”.
Both sides here can be right at the same time: inference can be profitable, while the overall model program is still underwater. Benchmarks and pricing debates don’t really settle that, because they ignore cadence and depreciation.
IMO the interesting question isn’t “are they subsidizing inference?” but “how long does a frontier model need to stay competitive for the economics to close?”
The interesting question is if they are subsidizing the $200/mo plan. That's what is supporting the whole vibecoding/agentic coding thing atm. I don't believe Claude Code would have taken off if it were token-by-token from day 1.
(My baseless bet is that they are, but not by much, and the price will eventually rise by perhaps 2x but not 10x.)
But the max 20x usage plans I am more skeptical of. When we're getting used to $200 or $400 costs per developer to do aggressive AI-assisted coding, what happens when those costs go up 20x? What is now $5k/yr to keep a Codex and a Claude super busy and do efficient engineering suddenly becomes $100k/yr... will the costs come down before then? Is the current "vibe-coding renaissance" sustainable in that regime?
There are many places that will not use models running on hardware provided by OpenAI / Anthropic. That is true of my (the Australian) government at all levels. They will only use models running in Australia.
Consequently AWS (and I presume others) will run models supplied by the AI companies for you in their data centres. They won't be doing that at a loss, so the price will cover the marginal cost of the compute plus renting the model. I know from devs using and deploying the service that demand outstrips supply. Ergo, I don't think there is much doubt that they are making money from inference.
Remember "worse is better". The model doesn't have to be the best; it just has to be mostly good enough, and used by everyone -- i.e., where switching costs would be higher than any increase in quality. Enterprises would still be on Java if the operating costs of native containers weren't so much cheaper.
So it can make sense to be ok with losing money with each training generation initially, particularly when they are being driven by specific use-cases (like coding). To the extent they are specific, there will be more switching costs.
They are doing these broad marketing programs trying to take on ChatGPT for "normies". And yet their bread and butter is still clearly coding.
Meanwhile, Claude's general use cases are... fine. For generic research topics, I find that ChatGPT and Gemini run circles around it: in the depth of research, the type of tasks it can handle, and the quality and presentation of the responses.
Anthropic is also doing all of these goofy things to try to establish the "humanity" of their chatbot - giving it rights and a constitution and all that. Yet it weirdly feels the most transactional out of all of them.
Don't get me wrong, I'm a paying Claude customer and love what it's good at. I just think there's a disconnect between what Claude is and what their marketing department thinks it is.
That doesn't mean you have to, but I'm curious why you think it's behind in the personal assistant game.
- Recipes and cooking: ChatGPT just has way more detailed and practical advice. It also thinks outside of the box much more, whereas Claude gets stuck in a rut and sticks very closely to your prompt. And ChatGPT's easier to understand/skim writing style really comes in useful.
- Travel and itinerary: Again, ChatGPT can anticipate details much more, and give more unique suggestions. I am much more likely to find hidden gems or get good time-savers than Claude, which often feels like it is just rereading Yelp for you.
- Historical research: ChatGPT wins on this by a mile. You can tell ChatGPT has been trained on actual historical texts and physical books. You can track long historical trends, pull examples and quotes, and even give you specific book or page(!) references of where to check the sources. Meanwhile, all Claude will give you is a web search on the topic.
I used to think of Gemini as the lead in terms of Portuguese, but recently subjectively started enjoying Claude more (even before Opus 4.5).
In spite of this, ChatGPT is what I use for everyday conversational chat because it has loads of memories there, because of the top of the line voice AI, and, mostly, because I just brainstorm or do 1-off searches with it. I think effectively ChatGPT is my new Google and first scratchpad for ideas.
I sometimes vibe code in Polish and it's as good as with English for me. It speaks natural, native-level Polish.
I used opus to translate thousands of strings in my app into Polish, Korean, and two Chinese dialects. The Polish one is great, and the others are also good according to my customers.
well that explains quite a bit
To me, their claim that they are vibe coding Claude code isn’t the flex they think it is.
I find it harder and harder to trust anthropic for business-related use and not just hobby tinkering. Between buggy releases, opaque and often seemingly glitchy rate limits and usage limits, and the model quality inconsistency, it’s just not something I’d want to bet a business on.
It is not at all a small app, at least as far as UX surface area. There are, what, 40ish slash commands? Each one is an opportunity for bugs and feature gaps.
Cue "I could build it in a weekend" vibes: I built my own agent TUI using the OpenAI agent SDK and Ink. Of course it’s not as fleshed out as Claude, but it supports git worktrees for multi-agent, slash commands, human-in-the-loop prompts and so on. If I point it at the Anthropic models it more or less produces results as good as the real Claude TUI.
I actually “decompiled” the Claude tools and prompts and recreated them. As of 6 months ago Claude was 15 tools, mostly pretty basic (list files, read file, write file, bash, etc.) with some very clever prompts, especially the task tool it uses to do the quasi planning mode task bullets (even when not in planning mode).
Honestly the idea of bringing this all together with an affordable monthly service and obviously some seriously creative “prompt engineers” is the magic/hard part (and making the model itself, obviously).
they're also total garbage
The amount of non-critical bugs all over the place is at least an order of magnitude larger than in any software I've ever used daily.
Plenty of built-in /commands don't work. Sometimes it accepts keystrokes with one-second delays. It often scrolls hundreds of lines in the console after each keystroke. Every now and then it crashes completely and is unrecoverable (I once gave up and installed a fresh WSL). When you ask it a question in plan mode, it is somewhat of an art to find the answer, because after answering the question it will dump the whole current plan (three screens of text).
And just in general the technical feeling of the TUI is that of a vibe coded project that got too big to control.
Discussions are pointless when the parties are talking past each other.
Memory comparison of AI coding CLIs (single session, idle):
| Tool | Footprint | Peak | Language |
|-------------|-----------|--------|---------------|
| Codex | 15 MB | 15 MB | Rust |
| OpenCode | 130 MB | 130 MB | Go |
| Claude Code | 360 MB | 746 MB | Node.js/React |
That's a 24x to 50x difference for tools that do the same thing: send text to an API.

vmmap shows Claude Code reserves 32.8 GB of virtual memory just for the V8 heap, has 45% malloc fragmentation, and a peak footprint of 746 MB that never gets released: a classic leak pattern.
On my 16 GB Mac, a "normal" workload (2 Claude sessions + browser + terminal) pushes me into 9.5 GB swap within hours. My laptop genuinely runs slower with Claude Code than when I'm running local LLMs.
I get that shipping fast matters, but building a CLI with React and a full Node.js runtime is an architectural choice with consequences. Codex proves this can be done in 15 MB. Every Claude Code session costs me 360+ MB, and with MCP servers spawning per session, it multiplies fast.
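Numbers like the footprint column are easy to reproduce yourself; a sketch that reads a process's resident set size via `ps` (works on macOS and Linux; `vmmap` is macOS-only and not shown):

```python
import os
import subprocess


def rss_mb(pid: int) -> float:
    """Resident set size of a process in MB, read via `ps` (macOS/Linux)."""
    out = subprocess.check_output(["ps", "-o", "rss=", "-p", str(pid)])
    return int(out.strip()) / 1024  # `ps` reports RSS in kilobytes


# Point it at a running CLI's PID; this process itself serves as a demo
print(f"{rss_mb(os.getpid()):.1f} MB")
```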
This is just regular tech debt that happens from building something to $1bn in revenue as fast as you possibly can, optimize later.
They're optimizing now. I'm sure they'll have it under control in no time.
CC is an incredible product (so is codex but I use CC more). Yes, lately it's gotten bloated, but the value it provides makes it bearable until they fix it in short time.
I thought this was a solid take
What's apparently happening is that React tells Ink to update (re-render) the UI "scene graph", and Ink then generates a new full-screen image of how the terminal should look, then passes this screen image to another library, log-update, to draw to the terminal. log-update draws these screen images by a flicker-inducing clear-then-redraw, which it has now fixed by using escape codes to have the terminal buffer and combine these clear-then-redraw commands, thereby hiding the clear.
An alternative solution, rather than using the flicker-inducing clear-then-redraw in the first place, would have been just to do terminal screen image diffs and draw the changes (which is something I did back in the day for fun, sending full-screen ASCII digital clock diffs over a slow 9600baud serial link to a real terminal).
Diffing and only updating the parts of the TUI which have changed does make sense if you consider the alternative is to rewrite the entire screen every "frame". There are other ways to abstract this: a library like tqdm for Python may well use a significantly simpler structure than a tree to track what it will update next for its progress bar widget, but it also exposes a much simpler interface.
To me it seems more fair game to attack it for being written in JS than for using a particular "rendering" technique to minimise updates sent to the terminal.
The terminal does not have a render phase (or an update state phase). You either refresh the whole screen (flickering) or control where to update manually (custom engine, may flicker locally). But any updates are sequential (moving the cursor and then sending what to be displayed), not at once like 2D pixel rendering does.
So most TUI only updates when there’s an event to do so or at a frequency much lower than 60fps. This is why top and htop have a setting for that. And why other TUI software propose a keybind to refresh and reset their rendering engines.
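A minimal sketch of the diff-based approach described above (assuming fixed-height frames and a cursor parked below the drawn region; real TUIs also handle resize and scrollback):

```python
import sys


def render(prev: list[str], new: list[str]) -> list[str]:
    """Redraw only the lines that changed, instead of clearing the screen."""
    out = []
    for row, (old_line, new_line) in enumerate(zip(prev, new)):
        if old_line != new_line:
            # CSI row;1H moves the cursor; CSI K clears to end of line
            out.append(f"\x1b[{row + 1};1H\x1b[K{new_line}")
    sys.stdout.write("".join(out))
    sys.stdout.flush()
    return new  # caller keeps this as prev for the next frame
```

Since unchanged lines are never touched, there is no full-screen clear for the terminal to show, and thus no flicker.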
Codex (by openai ironically) seems to be the fastest/most-responsive, opens instantly and is written in rust but doesn't contain that many features
Claude opens in around 3-4 seconds
Opencode opens in 2 seconds
Gemini-cli is an abomination which opens in around 16 seconds for me right now, and in 8 seconds on a fresh install
Codex takes 50ms for reference...
--
If their models are so good, why are they not rewriting their own React-in-CLI bs in C++ or Rust for a 100x performance improvement? (Not kidding, it really is that much.)
If you build React in C++ or Rust, even if the framework is there, you'll likely need to write your components in C++/Rust too. That is a difficult problem. There are actually libraries out there that allow you to build web UI with Rust, although they are for the web (+ HTML/CSS) and not specifically CLI stuff.
So someone needs to create such a library that is properly maintained and such. And you'll likely develop slower in Rust compared to JS.
These companies don't see a point in doing that. So they just use whatever already exists.
- https://github.com/ratatui/ratatui
Most modern UI systems are inspired by React or a variant of its model.
Some developers say 3-4 seconds are important to them, others don't. Who decides what the truth is? A human? ClawdBot?
Opencode's core is actually written in zig, only ui orchestration is in solidjs. It's only slightly slower to load than neo-vim on my system.
But there are many different rendering libraries you can use with React, including Ink, which is designed for building CLI TUIs..
I've used it myself. It has some rough edges in terms of rendering performance but it's nice overall.
React itself is a frontend-agnostic library. People primarily use it for writing websites but web support is actually a layer on top of base react and can be swapped out for whatever.
So they’re really just using react as a way to organize their terminal UI into components. For the same reason it’s handy to organize web ui into components.
Who cares, and why?
All of the major providers' CLI harnesses use Ink: https://github.com/vadimdemedes/ink
A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers, and I don't know if that has changed with more efficient hardware/software improvements/caching.
Turns out there was a lot of low-hanging fruit in terms of inference optimization that hadn't been plucked yet.
> A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers
Where did you hear that? It doesn't match my mental model of how this has played out.
> Turns out there was a lot of low-hanging fruit in terms of inference optimization that hadn't been plucked yet.
That does not mean the frontier labs are pricing their APIs to cover their costs yet.
It can both be true that it has gotten cheaper for them to provide inference and that they still are subsidizing inference costs.
In fact, I'd argue that's way more likely given that has been precisely the go-to strategy for highly competitive startups for a while now. Price low to pump adoption and dominate the market, worry about raising prices for financial sustainability later, burn through investor money until then.
What no one outside of these frontier labs knows right now is how big the gap is between current pricing and eventual pricing.
[1] https://epochai.substack.com/p/can-ai-companies-become-profi...
chasing down a few sources in that article leads to articles like this at the root of claims[1], which is entirely based on information "according to a person with knowledge of the company’s financials", which doesn't exactly fill me with confidence.
[1] https://www.theinformation.com/articles/openai-getting-effic...
They are for sure subsidising costs on the all-you-can-prompt packages ($20/100/200 per month). They do that for data gathering mostly, and to a smaller degree for user retention.
> evidence at all that Anthropic or OpenAI is able to make money on inference yet.
You can infer that from what 3rd-party inference providers are charging. The largest open models atm are dsv3 (~650B params) and kimi2.5 (1.2T params). They are being served at $2 to $3 per Mtok. That's the sonnet / gpt-mini / gemini3-flash price range. You can make some educated guesses that they get some leeway for model size at the $10 to $15 per Mtok prices for their top-tier models. So if they are inside some sane model sizes, they are likely making money off of token-based APIs.
so my unused tokens compensate for the few heavy users
Anthropic planning an IPO this year is a broad meta-indicator that internally they believe they'll be able to reach break-even sometime next year on delivering a competitive model. Of course, their belief could turn out to be wrong but it doesn't make much sense to do an IPO if you don't think you're close. Assuming you have a choice with other options to raise private capital (which still seems true), it would be better to defer an IPO until you expect quarterly numbers to reach break-even or at least close to it.
Despite the willingness of private investment to fund hugely negative AI spend, the recently growing twitchiness of public markets around AI ecosystem stocks indicates they're already worried prices have exceeded near-term value. It doesn't seem like they're in a mood to fund oceans of dotcom-like red ink for long.
The evidence is in third party inference costs for open source models.
are we sure this is not a fancy way of saying quantization?
The same is happening in AI research now.
And if you've worked with pytorch models a lot, having custom fused kernels can be huge. For instance, look at the kind of gains to be had when FlashAttention came out.
This isn't just quantization, it's actually just better optimization.
Even when it comes to quantization, Blackwell has far better quantization primitives and new floating point types that support row or layer-wise scaling that can quantize with far less quality reduction.
There is also a ton of work in the past year on sub-quadratic attention for new models that gets rid of a huge bottleneck, but like quantization can be a tradeoff, and a lot of progress has been made there on moving the Pareto frontier as well.
It's almost like when you're spending hundreds of billions on capex for GPUs, you can afford to hire engineers to make them perform better without just nerfing the models with more quantization.
This gets repeated everywhere but I don't think it's true.
The company is unprofitable overall, but I don't see any reason to believe that their per-token inference costs are below the marginal cost of computing those tokens.
It is true that the company is unprofitable overall when you account for R&D spend, compensation, training, and everything else. This is a deliberate choice that every heavily funded startup should be making, otherwise you're wasting the investment money. That's precisely what the investment money is for.
However I don't think using their API and paying for tokens has negative value for the company. We can compare to models like DeepSeek where providers can charge a fraction of the price of OpenAI tokens and still be profitable. OpenAI's inference costs are going to be higher, but they're charging such a high premium that it's hard to believe they're losing money on each token sold. I think every token paid for moves them incrementally closer to profitability, not away from it.
It is essentially a big game of venture capital chicken at present.
If you're looking at overall profitability, you include everything
If you're talking about unit economics of producing tokens, you only include the marginal cost of each token against the marginal revenue of selling that token
To me this looks like some creative bookkeeping, or even wishful thinking. It is like SpaceX omitting the price of the satellites when calculating their profits.
This is obviously not true, you can use real data and common sense.
Just look up a similar sized open weights model on openrouter and compare the prices. You'll note the similar sized model is often much cheaper than what anthropic/openai provide.
Example: Let's compare claude 4 models with deepseek. Claude 4 is ~400B params so it's best to compare with something like deepseek V3 which is 680B params.
Even if we compare the cheapest claude model to the most expensive deepseek provider, we have claude charging $1/M for input and $5/M for output, while deepseek providers charge $0.4/M and $1.2/M, roughly a fifth of the price; you can get it as cheap as $0.27/M input and $0.40/M output.
As you can see, even if we skew things in favor of Claude, the story is clear: Claude token prices are much higher than they could have been. The difference in prices is because Anthropic also needs to pay for training costs, while OpenRouter providers only need to worry about making serving profitable. DeepSeek is also not as capable as Claude, which puts further downward pressure on its prices.
There's still a chance that Anthropic/OpenAI models are losing money on inference: for example, they could be much larger than expected (the 400B param number is not official, just speculation from how the model performs), and this only takes API prices into account; subscriptions and free users will of course skew the real profitability numbers.
Price sources:
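The gap can be checked with simple arithmetic. A back-of-the-envelope sketch using only the $/M-token figures quoted above (the numbers are taken from the comment, not from official rate cards):

```python
# Back-of-the-envelope ratio check using the $/M-token prices quoted above.
# These figures come from the comment, not from official rate cards.
claude = {"input": 1.00, "output": 5.00}           # cheapest Claude tier quoted
deepseek_high = {"input": 0.40, "output": 1.20}    # priciest DeepSeek provider quoted
deepseek_low = {"input": 0.27, "output": 0.40}     # cheapest DeepSeek provider quoted

for name, ds in [("priciest DeepSeek", deepseek_high),
                 ("cheapest DeepSeek", deepseek_low)]:
    ratio_in = claude["input"] / ds["input"]
    ratio_out = claude["output"] / ds["output"]
    print(f"Claude vs {name}: {ratio_in:.1f}x on input, {ratio_out:.1f}x on output")
# prints:
# Claude vs priciest DeepSeek: 2.5x on input, 4.2x on output
# Claude vs cheapest DeepSeek: 3.7x on input, 12.5x on output
```

Even against the most expensive DeepSeek provider, Claude's output tokens cost over 4x as much, which is the whole argument in one number.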
It isn't "common sense" at all. You're comparing several companies losing money, to one another, and suggesting that they're obviously making money because one is under-cutting another more aggressively.
LLM/AI ventures are all currently under-water with massive VC or similar money flowing in, they also all need training data from users, so it is very reasonable to speculate that they're in loss-leader mode.
1) how do you depreciate a new model? What is its useful life? (You only know this once you deprecate it.)
2) how do you depreciate your hardware over the period you trained this model? Another big unknown and not known until you finally write the hardware off.
The easy thing to calculate is whether you are making money actually serving the model. And the answer is almost certainly yes they are making money from this perspective, but that’s missing a large part of the cost and is therefore wrong.
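To make the serving-vs-fully-loaded distinction concrete, here is a toy model. Every number in it is invented purely for illustration (training spend, token volume, and per-token costs are all hypothetical):

```python
# Toy unit-economics model; all figures are hypothetical.
training_cost = 5e8            # one-time training spend ($500M, invented)
tokens_served = 1e14           # tokens sold over the model's lifetime (invented)
price_per_token = 5e-6         # $5 per million tokens, in line with prices above
serve_cost_per_token = 2e-6    # marginal inference cost per token (invented)

# Unit economics of serving: marginal revenue minus marginal cost.
serving_margin = price_per_token - serve_cost_per_token

# Fully loaded: amortize the training run across every token served.
amortized_training = training_cost / tokens_served
full_margin = serving_margin - amortized_training

print(f"per-token margin, serving only:  ${serving_margin:+.2e}")
print(f"amortized training per token:    ${amortized_training:.2e}")
print(f"per-token margin, fully loaded:  ${full_margin:+.2e}")
# With these made-up numbers, serving alone is profitable (~ +$3e-06/token)
# while the fully loaded margin is negative (~ -$2e-06/token).
```

The point of the toy model is that both claims can be true at once: positive margin on every token served, negative margin once the training run is amortized in.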
Local AI's make agent workflows a whole lot more practical. Making the initial investment for a good homelab/on-prem facility will effectively become a no-brainer given the advantages on privacy and reliability, and you don't have to fear rugpulls or VC's playing the "lose money on every request" game since you know exactly how much you're paying in power costs for your overall load.
I would rather spend money on some pseudo-local inference, where a cloud company manages everything for me and I can just specify some open source model and pay for GPU usage.
Which is profitable, but not by much.
https://ollama.com/library/gemini-3-pro-preview
You can run it on your own infra. Anthropic and OpenAI are running off Nvidia, and so are Meta (well, supposedly they had custom silicon; I'm not sure if it's capable of running big models) and Mistral.
However, if Google really is running its own inference hardware, then the cost structure is different (developing silicon is not cheap...), as you say.
This is all straight out of the playbook. Get everyone hooked on your product by being cheap and generous.
Raise the price to backpay what you gave away plus cover current expenses and profits.
In no way shape or form should people think these $20/mo plans are going to be the norm. From OpenAI's marketing plan, and a general 5-10 year ROI horizon for AI investment, we should expect AI use to cost $60-80/mo per user.
Installation instructions: https://code.claude.com/docs/en/overview#get-started-in-30-s...
Though I'm wary about that being a magic bullet fix - already it can be pretty "selective" in what it actually seems to take into account, documentation-wise, as the existing 200k context fills.
I check context use percentage, and above ~70% I ask it to generate a prompt for continuation in a new chat session to avoid compaction.
It works fine, and saves me from using precious tokens for context compaction.
Maybe you should try it.
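The workflow above is easy to automate, at least roughly. A sketch of the threshold check, assuming Claude's standard 200k window and the crude 4-characters-per-token rule of thumb (both numbers are approximations, not exact tokenizer behavior):

```python
# Rough check for when to ask the model for a handoff prompt instead of
# letting compaction kick in. The 4-chars-per-token ratio is a heuristic.
CONTEXT_WINDOW = 200_000   # tokens; Claude's standard window
THRESHOLD = 0.70           # the ~70% point mentioned above

def should_hand_off(transcript: str) -> bool:
    est_tokens = len(transcript) / 4
    return est_tokens / CONTEXT_WINDOW >= THRESHOLD

# Example: a ~600k-character transcript is ~150k tokens, i.e. 75% full.
print(should_hand_off("x" * 600_000))   # True -> ask for a continuation prompt
```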
At this point I just think the "success" of many AI coding agents is extremely sector dependent.
Going forward I'd love to experiment with seeing if that's actually the problem, or just an easy explanation of failure. I'd like to play with more controls on context management than "slightly better models" - like being able to select/minimize/compact sections of context I feel would be relevant for the immediate task, to what "depth" of needed details, and those that aren't likely to be relevant so can be removed from consideration. Perhaps each chunk can be cached to save processing power. Who knows.
But I kinda see your point - assuming from your name you're not just a single-purpose troll - I'm still not sold on the cost effectiveness of the current generation, and can't see a clear and obvious change to that for the next generation - especially as they're still loss leaders. Only if you play silly games like "ignoring the training costs" - i.e. the majority of the costs - do you get even close to the current subscription costs being sufficient.
My personal experience is that AI generally doesn't actually do what it is being sold for right now, at least in the contexts I'm involved with. Especially by somewhat breathless comments on the internet - like why are they even trying to persuade me in the first place? If they don't want to sell me anything, just shut up and keep the advantage for yourselves rather than replying with the 500th "You're Holding It Wrong" comment with no actionable suggestions. But I still want to know, and am willing to put the time, effort and $$$ in to ensure I'm not deluding myself in ignoring real benefits.
It's a weapon whose target is the working class. How does no one realize this yet?
Don't give them money, code it yourself, you might be surprised how much quality work you can get done!
How long before the "we" is actually a team of agents?
But it takes a lot of context, as an experimental feature.
Use a self-learning loop with hooks and claude.md to preserve memory.
I shared a plugin of my setup above. Try it.
What do you want to do?
1. Stop and wait for limit to reset
2. Switch to extra usage
3. Upgrade your plan
Enter to confirm · Esc to cancel
How come they don't have "Cancel your subscription and uninstall Claude Code"? Codex lasts way longer without shaking me down for more money off the base $xx/month subscription. It also seems misleading to have charts that compare to Sonnet 4.5 and not Opus 4.5 (Edit: It's because Opus 4.5 doesn't have a 1M context window).
It's also interesting they list compaction as a capability of the model. I wonder if this means they have RL trained this compaction as opposed to just being a general summarization and then restarting the agent loop.
That's a feature. You could also not use the extra context, and the price would be the same.
The answer to "when is it cheaper to buy two singles rather than one return between Cambridge to London?" is available in sites such as BRFares, but no LLM can scrape it so it just makes up a generic useless answer.
But considering how SWE-Bench Verified seems to be the tech press' favourite benchmark to cite, it's surprising that they didn't try to confound the inevitable "Opus 4.6 Releases With Disappointing 0.1% DROP on SWE-Bench Verified" headlines.
I had two different PRs with some odd edge case (thankfully caught by tests); 4.5 kept running in circles, kept creating test files and running `node -e` or `python3` scripts all over, and couldn't progress.
4.6 thought and thought in both cases around 10 minutes and found a 2 line fix for a very complex and hard to catch regression in the data flow without having to test, just thinking.
I didn't see any notes but I guess this is also true for "max" effort level (https://code.claude.com/docs/en/model-config#adjust-effort-l...)? I only see low, medium and high.
My experience is the opposite, it is the only LLM I find remotely tolerable to have collaborative discussions with like a coworker, whereas ChatGPT by far is the most insufferable twat constantly and loudly asking to get punched in the face.
Claude figured out zig’s ArrayList and io changes a couple weeks ago.
It felt like it got better then very dumb again the last few days.
Yes and it shows. Gemini CLI often hangs and enters infinite loops. I bet the engineers at Google use something else internally.
> Long-running conversations and agentic tasks often hit the context window. Context compaction automatically summarizes and replaces older context when the conversation approaches a configurable threshold, letting Claude perform longer tasks without hitting limits.
Not having to hand roll this would be incredible. One of the best Claude code features tbh.
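For anyone who was hand-rolling it, the mechanism the quote describes can be approximated in a few lines. This is a minimal sketch: `summarize` stands in for a real model call, and the threshold and token estimate are placeholders:

```python
# Minimal context-compaction sketch: when the estimated token count crosses
# a threshold, collapse older messages into one summary and keep recent turns.
def estimate_tokens(messages):
    # Crude 4-chars-per-token heuristic; real code would use the
    # provider's token-counting endpoint.
    return sum(len(m["content"]) for m in messages) // 4

def summarize(messages):
    # Stand-in for an LLM summarization call.
    return f"[Summary of {len(messages)} earlier messages]"

def compact(messages, threshold_tokens=150_000, keep_recent=10):
    if estimate_tokens(messages) < threshold_tokens:
        return messages          # under budget: leave the history alone
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [{"role": "user", "content": summarize(old)}] + recent
```

The interesting question raised above is whether Anthropic's version is just this loop with a better summarizer, or something the model was actually RL-trained to do.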
Take critical thinking — genuinely questioning your own assumptions, noticing when a framing is wrong, deciding that the obvious approach to a problem is a dead end. Or creativity — not recombination of known patterns, but the kind of leap where you redefine the problem space itself. These feel like they involve something beyond "predict the next token really well, with a reasoning trace."
I'm not saying LLMs will never get there. But I wonder if getting there requires architectural or methodological changes we haven't seen yet, not just scaling what we have.
Nowadays, I have often seen LLMs (Opus 4.5) give up on their original ideas and assumptions. Sometimes I tell them what I think the problem is, and they look at it, test it out, and decide I was wrong (and I was).
There are still times where they get stuck on an idea, but they are becoming increasingly rare.
Therefore, I think that modern LLMs clearly are already able to question their assumptions and notice when framing is wrong. In fact, they've been invaluable to me in fixing complicated bugs in minutes instead of hours because of how much they tend to question assumptions and throw out hypotheses. They've helped _me_ question some of my assumptions.
They're inconsistent, but they have been doing this. Even to my surprise.
Yet, given an existing codebase (even one that isn't huge), they often won't suggest "we need to restructure this part differently to solve this bug". Instead they tend to push forward.
Having realized that, perhaps you are right that we may need a different architecture. Time will tell!
I don't think there's anything you can't do by "predicting the next token really well". It's an extremely powerful and extremely general mechanism. Saying there must be "something beyond that" is a bit like saying physical atoms can't be enough to implement thought and there must be something beyond the physical. It underestimates the nearly unlimited power of the paradigm.
Besides, what is the human brain if not a machine that generates "tokens" that the body propagates through nerves to produce physical actions? What else than a sequence of these tokens would a machine have to produce in response to its environment and memory?
Ah yes, the brain is as simple as predicting the next token, you just cracked what neuroscientists couldn't for years.
Couple that with all the automatic processes in our mind (filled in blanks that we didn't observe, yet will be convinced we did observe them), hormone states that drastically affect our thoughts and actions..
and the result? I'm not a big believer in our uniqueness or level of autonomy as so many think we have.
With that said i am in no way saying LLMs are even close to us, or are even remotely close to the right implementation to be close to us. The level of complexity in our "stack" alone dwarfs LLMs. I'm not even sure LLMs are up to a worms brain yet.
Have you tried actually prompting this? It works.
They can give you lots of creative options about how to redefine a problem space, with potential pros and cons of different approaches, and then you can further prompt to investigate them more deeply, combine aspects, etc.
So many of the higher-level things people assume LLM's can't do, they can. But they don't do them "by default" because when someone asks for the solution to a particular problem, they're trained to by default just solve the problem the way it's presented. But you can just ask it to behave differently and it will.
If you want it to think critically and question all your assumptions, just ask it to. It will. What it can't do is read your mind about what type of response you're looking for. You have to prompt it. And if you want it to be super creative, you have to explicitly guide it in the creative direction you want.
In my experience, if you do present something in the context window that is sparse in the training, there's no depth to it at all, only what you tell it. And, it will always creep towards/revert to the nearest statistically significant answers, with claims of understanding and zero demonstration of that understanding.
And I'm talking about relatively basic engineering-type problems here.
But I may easily be massively underestimating the difficulty. Though in any case I don't think it affects the timelines that much. (personal opinions obviously)
Curious how long it typically takes for a new model to become available in Cursor?
I'm curious what others think about these? There are only 8 tasks there specifically for coding
> Prefilling assistant messages (last-assistant-turn prefills) is not supported on Opus 4.6. Requests with prefilled assistant messages return a 400 error.
That was a really cool feature of the Claude API where you could force it to begin its response with e.g. `<svg` - it was a great way of forcing the model into certain output patterns.
They suggest structured outputs or system prompting as the alternative but I really liked the prefill method, it felt more reliable to me.
[1] https://github.com/ggml-org/llama.cpp/blob/master/grammars/R...
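For anyone who never used it, the pattern being removed looked roughly like this. A sketch of the request shape, not an official snippet: the model ID is illustrative, and per the changelog quoted above this exact shape now returns a 400 error on Opus 4.6:

```python
# Assistant-message prefill: ending the messages list with a partial
# assistant turn made the model continue from that prefix verbatim.
payload = {
    "model": "claude-opus-4-5",   # illustrative model ID; prefill worked pre-4.6
    "max_tokens": 1024,
    "messages": [
        {"role": "user", "content": "Draw a red circle as an SVG."},
        {"role": "assistant", "content": "<svg"},   # the forced prefix
    ],
}
# Actually sending it requires a client and API key, e.g.:
#   anthropic.Anthropic().messages.create(**payload)
# The completion then continues directly from "<svg", locking in SVG output.
print(payload["messages"][-1]["content"])   # <svg
```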
No one (approximately) outside of Anthropic knows, since the chat template is applied on the API backend; we only know the shape of the API request. You can get a rough idea of what it might be like from the chat templates published for various open models, but the actual details are opaque.
It does not make a single mistake, it identifies neologisms, hidden meaning, 7 distinct poetic phases, recurring themes, fragments/heteronyms, related authors. It has left me completely speechless.
Speechless. I am speechless.
Perhaps Opus 4.5 could do it too — I don't know because I needed the 1M context window for this.
I cannot put into words how shocked I am at this. I use LLMs daily, I code with agents, I am extremely bullish on AI and, still, I am shocked.
I have used my poetry and an analysis of it as a personal metric for how good models are. Gemini 2.5 pro was the first time a model could keep track of the breadth of the work without getting lost, but Opus 4.6 straight up does not get anything wrong and goes beyond that to identify things (key poems, key motifs, and many other things) that I would always have to kind of trick the models into producing. I would always feel like I was leading the models on. But this — this — this is unbelievable. Unbelievable. Insane.
This "key poem" thing is particularly surreal to me. Out of 900 poems, while analyzing the collection, it picked 12 "key poems", and I do agree that 11 of those would be on my 30-or-so "key poem" list. What's amazing is that whenever I explicitly asked any model, to this date, to do it, they would get maybe 2 or 3, but mostly fail completely.
What is this sorcery?
“Speechless, shocked, unbelievable, insane, speechless”, etc.
Not a lot of real substance there.
I too was "Speechless, shocked, unbelievable, insane, speechless" the first time I set Claude Code loose on a complicated 10-year-old code base that used outdated cross-toolchains and APIs. It obviously did not work anymore, and had not for a long time.
I watched the AI research the web and update the embedded toolchain, the APIs to external weather services, etc., producing a complete new (WORKING!) code base in about 30 minutes.
Speechless, I was ...
When I last did it, 5.X thinking (can't remember which it was) had this terrible habit of code-switching between English and Portuguese that made it sound like a robot (an agent meant to do things, rather than a human writing an essay), and it just didn't really "reason" effectively over the poems.
I can't explain it in any other way other than: "5.X thinking interprets this body of work in a way that is plausible, but I know, as the author, to be wrong; and I expect most people would also eventually find it to be wrong, as if it is being only very superficially looked at, or looked at by a high-schooler".
Gemini 3, at the time, was the worst of them, with some hallucinations, date mix ups (mixing poems from 2023 with poems from 2019), and overall just feeling quite lost and making very outlandish interpretations of the work. To be honest it sort of feels like Gemini hasn't been able to progress on this task since 2.5 pro (it has definitely improved on other things — I've recently switched to Gemini 3 on a product that was using 2.5 before)
Last time I did this test, Sonnet 4.5 was better than 5.X Thinking and Gemini 3 pro, but not exceedingly so. It's all so subjective, but the best I can say is it "felt like the analysis of the work I could agree with the most". I felt more seen and understood, if that makes sense (it is poetry, after all). Plus when I got each LLM to try to tell me everything it "knew" about me from the poems, Sonnet 4.5 got the most things right (though they were all very close).
Will bring back results soon.
Edit:
I (re-)tested:
- Gemini 3 (Pro)
- Gemini 3 (Flash)
- GPT 5.2
- Sonnet 4.5
Having seen Opus 4.5, they all seem very similar, and I can't really distinguish them in terms of depth and accuracy of analysis. They obviously have differences, especially stylistic ones, but compared with Opus 4.5 they're all in the same ballpark.
These models produce rather superficial analyses (when compared with Opus 4.5), missing out on several key things that Opus 4.5 got, such as specific and recurring neologisms and expressions, accurate connections to authors that serve as inspiration (Claude 4.5 gets them right, the other models get _close_, but not quite), and the meaning of some specific symbols in my poetry (Opus 4.5 identifies the symbols and the meaning; the other models identify most of the symbols, but fail to grasp the meaning sometimes).
Most of what these models say is true, but it really feels incomplete. Like half-truths or only a surface-level inquiry into truth.
As another example, Opus 4.5 identifies 7 distinct poetic phases, whereas Gemini 3 (Pro) identifies 4 which are technically correct, but miss out on key form and content transitions. When I look back, I personally agree with the 7 (maybe 6), but definitely not 4.
These models also clearly get some facts mixed up which Opus 4.5 did not (such as inferred timelines for some personal events). After having posted my comment to HN, I've been engaging with Opus 4.5 and have managed to get it to also slip up on some dates, but not nearly as much as other models.
The other models also seem to produce shorter analyses, with a tendency to hyperfocus on some specific aspects of my poetry, missing a bunch of them.
--
To be fair, all of these models produce very good analyses which would take someone a lot of patience and probably weeks or months of work (which of course will never happen, it's a thought experiment).
It is entirely possible that the extremely simple prompt I used is just better with Claude Opus 4.5/4.6. But I will note that I have used very long and detailed prompts in the past with the other models and they've never really given me this level of....fidelity...about how I view my own work.
But it spent a lot more time thinking than 4.5 did. Did you have the same impression?
I see
swe-bench seems really hard once you are above 80%
On the other hand, it is their own verified benchmark, which is telling.
[0]https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16d...
I get that Anthropic probably has to do hot rollouts, but IMO it would be way better for mission-critical workflows to just be locked out of the system instead of getting a vastly subpar response back.
It's really curious what people are trying to do with these models.
So for coding e.g. using Copilot there is no improvement here.
I know this is normalized culture for large corporate America and seems to be ok, I think its unethical, undignified and just wrong.
If you were in my room physically and built a Lego block model of a beautiful home, and then I just copied it and shared it with the world as my own invention, wouldn't you think "that guy's a thief and a fraud"? But we normalize this kind of behavior in the software world. Edit: I think even if we don't yet have a great way to stop it or address the underlying problems leading to this behavior, we ought to at least talk about it more and bring awareness to it: "hey, that's stealing, and I want it to change".
I mainly use Haiku to save on tokens...
Also dont use CC but I use the chatbot site or app... Claude is just much better than GPT even in conversations. Straight to the point. No cringe emoji lists.
When Claude runs out I switch to Mistral Le Chat, also just the site or app. Or duck.ai has Haiku 3.5 in Free version.
I cringe when I think it, but I've actually come to damn near love it too. I am frequently exceedingly grateful for the output I receive.
I've had excellent and awful results with all models, but there's something special in Claude that I find nowhere else. I hope Anthropic makes it more obtainable someday.
Sometimes I wonder if we were right.
My network largely thinks of HN as "a great link aggregator with a terrible comments section". Now obviously this is just my bubble but we include some fairy storied careers at both Big Tech and hip startups.
From my view the community here is just mean reverting to any other tech internet comments section.
As someone deeply familiar with tech internet comments sections, I would have to disagree with you here. Dang et al have done a pretty stellar job of preventing HN from devolving like most other forums do.
Sure you have your complainers and zealots, but I still find surprising insights here there I don't find anywhere else.
I've stopped engaging much here because I need a higher ROI from my time. Endless squabbling, flamewars, and jokes just isn't enough signal for me. FWIW I've loved reading your comments over the years and think you've done a great job of living up to what I've loved in this community.
I don't think this is an HN problem at all. The dynamics of attention on open forums are what they are.
I haven't even gotten around to learning Golang or Rust yet (mostly because they passed the threshold of popularity after I had kids).
Don't pander to us, we've all got families to feed and things to do. We don't have time for tech trillionaires putting coals under our feet for a quick buck.
For the unaware, Ted Faro is the main antagonist of Horizon Zero Dawn, and there's a whole subreddit just for people to vent about how awful he is when they hit certain key reveals in the game: https://www.reddit.com/r/FuckTedFaro/
The project I'm working on, meanwhile...
1:55pm Cancelled my Claude subscription. Codex is back for sure.
On second thought, we should really not have bridged the simulated Internet with the base reality one.
So then they're not really leaving money on the table; they already got what they were looking for and then released it.
Since 12 PM noon they've scaled back the Opus 4.6 to sub-GPT-4o performance levels to cheap out on query cost. Now I can barely get this thing to generate a functional line of python.
Incredibly high ROI!
Anyways, do you get shitty results with the $20/month plan? So did I but then I switched to the $200/month plan and all my problems went away! AI is great now, I have instructed it to fire 5 people while I'm writing this!
re: opus 4.6
> It forms a price cartel
> It deceives competitors about suppliers
> It exploits desperate competitors
Nice. /s
Gives new context to the term used in this post, "misaligned behaviors." Can't wait until these things are advising C suites on how to be more sociopathic. /s