The kind of code that Claude produces looks almost exactly like the code I would write myself. It's like it's reading my mind. This is a game changer because I can maintain the code that Claude produces.
With Claude Code, there are no surprises. I can pretty much guess what its code will look like 90% to 95% of the time, but it writes it a lot faster than I could. This is an amazing innovation.
Gemini is quite impressive as well. Nano banana in particular is very useful for graphic design.
I haven't tried Gemini with coding yet, but TBH, Claude Code does such a great job; if I could code any faster, I would get decision fatigue. I don't like rushing into architecture or UX decisions. I like to sit on certain decisions for a day or two before starting implementation. Once you start in a particular direction, it's hard to undo, and you may try to double down on the mistake due to the sunk cost fallacy. I try hard to avoid that.
(GLM etc. get surprisingly close with good prompting but... $0.60/day to not worry about that is a no brainer.)
I couldn’t tell it apart from the real thing, and I have a great eye for AI images.
Though for more automated work, one thing you miss with Cursor is subagents, and then to a lesser extent skills (these are pretty easy to emulate in other tools). I'm sure it's only a matter of time though.
Cursor has an agent, but that's like whoever else tried to copy the Model T while Ford was developing it.
If you mostly have small codebases that fit in context, or make many small changes interactively, it's not really great for that (though it can handle it too). It'll just spend most of its time poking around the codebase when the whole thing should have just been loaded... (Too bad there's no small-repo mode. I made a startup hook that just dumps the output of cat over the directory into context, but yeah, it should be a toggle.)
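For reference, a minimal sketch of that kind of hook, assuming something like Claude Code's session-start hook whose stdout gets added to the context; the script name and file patterns here are hypothetical and would need tuning per repo:

    #!/bin/sh
    # dump-repo.sh: print every tracked source file with a header so the
    # whole (small) repo lands in context at the start of a session.
    git ls-files '*.py' '*.js' '*.md' | while read -r f; do
      printf '\n===== %s =====\n' "$f"
      cat "$f"
    done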
Claude Code is overrated: it uses many of its features and modalities to compensate for model shortcomings, and those features are not as necessary for steering state-of-the-art models like GPT 5.2.
> See: https://artificialanalysis.ai
The field moves fast. Per artificialanalysis, Opus 4.5 is currently behind GPT-5.2 (x-high) and Gemini 3 Pro. Even Google's cheaper Gemini 3 Flash model seems to be slightly ahead of Opus 4.5.
LM Arena shows Claude Opus 4.5 on top
In addition to whatever they are exposed to as part of pre-training, it'd be interesting to know what kinds of coding tasks these models are being RL-trained on. Are things like web development and maybe Python/ML coding overemphasized, or are they also being trained on things like Linux/Windows/embedded development in different languages?
https://x.com/METR_Evals/status/2002203627377574113
> Even Google's cheaper Gemini 3 Flash model seems to be slightly ahead of Opus 4.5.
What an insane take for anybody who uses these models daily.
It is also out of date as it does not include 5.2 Codex.
Per my point about steerability compensated for by modalities and other harness features: Opus 4.5 scores 58% while GPT 5.2 scores 75% on the instruction-following benchmark in your link! Thanks for the hard evidence: GPT 5.2 is about 30% ahead of Opus 4.5 there (17 percentage points). No wonder Claude Code needs those harness features for the user to manually rein in its instruction following.
GPT 5.2 simply obeys an instruction to assemble a plan, and avoids the need to compensate for poor steerability that would otherwise require the user to manually manage modalities.
Opus has improved, though, so plan mode is less necessary than it was before, but it is still far behind state-of-the-art steerability.
Someone sell me on Claude Code; I just don't get it.
Fundamentally, I don’t like having my agent and my IDE be split. Yes, I know there are CC plugins for IDEs, but you don’t get the same level of tight integration.
> it's not just a website you go to like Google, it's a little spirit/ghost that "lives" on your computer
> it's not just about the image generation itself, it's about the joint capability coming from text generation
There would have been no reaction from me to this 3 years ago, but now this sentence structure is ruined for me.
But I had to change how I write because people started calling my writing “AI generated”
Jk jk, now that you pointed it out I can’t unsee it.
We're embarking on a ginormous planetary experiment here.
Many of the speeches given by MPs are likely to have been written beforehand, in whole or in part. Wouldn’t the more likely explanation be that they, or their staff, are using LLMs to write their speeches?
Joking aside, as a non-native English speaker who spent quite a bit of time learning to write English "properly", this trend of needing to write baad Engrish to avoid being called out in public for "written by an LLM" is frustrating...
> it's not just a website you go to like Google, it's a little spirit/ghost that "lives" on your computer
This type of sentence, I call rhetorical fat. Get rid of this fat and you obtain a boring sentence that repeats what has been said in the previous one.
Not all rhetorical fat is equal, and I must admit I find myself eye-rolling at the "little spirit" part more than at the fatness.
I understand the author wants to decorate things and emphasize key elements, and the hate I feel is only caused by the incompatible projection of my ideals onto a text that doesn't belong to me.
> it's not just about the image generation itself, it's about the joint capability coming from text generation.
That's unjustified conceptual stress.
That could be a legitimate answer to a question ("No, no, it's not just about that, it's more about this"), but it's a text. Maybe the text wants you to be focused, maybe the text wants to hype you; this is the shape of the hype without the hype.
"I find image generation is cooler when paired with text generation."
You might find this statement non-informative, but without the two parts there's no comparison. That is really the semantics Karpathy is trying to express with the statement.
ChatGPT-ish "it's not just" is annoying because the first part is usually a strawman, something reader considers trite. But it's not the case here.
You're right! The strawman theory is based.
But I think there's more to it: I dislike the structure of these sentences (which I find a bit sensationalist for no reason; I don't know, maybe I am still grumpy).
So it might just be a natural reaction to over-use of a particular pattern. This kind of stuff has been driving language evolution for millennia. Besides that, pompous style is often used in 'copy' (slogans and ads), which is something most people don't like.
After all, he's been an "influencer" for a long time, starting with the "software 2.0" essay.
I realized that's what bothered me. It's not "oh my god, they used ChatGPT." But "oh my god, they couldn't even be bothered to use Claude."
It'll still sound like AI, but 90% of the cringe is gone.
If you're going to use AI for writing, it's just basic decency to use the one that isn't going to make your audience fly into a fit of rage every ten seconds.
That being said, I feel very self-conscious using em dashes in the current decade ;)
I mostly use them in Telegram because it auto-converts -- into an em dash. They are a pain to type everywhere else though!
codex --oss -m gpt-oss:20b
Or 120b if you can fit the larger model. I don't think gpt-oss:20b is strong enough, to be honest, but 120b can do an OK job.
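E.g., assuming the larger gpt-oss:120b tag is pulled locally:

    codex --oss -m gpt-oss:120b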
Nowhere NEAR as good as the big hosted models though.
It runs on your computer because of its tooling. It can call Bash. It can literally do anything on the operating system and file system. That's what makes it different. You should think of it like a mech suit. The model is just the brain in a vat connected far away.
> I think OpenAI got this wrong because I think they focused their codex / agent efforts on cloud deployments in containers orchestrated from ChatGPT instead of localhost. [...] CC got this order of precedence correct and packaged it into a beautiful, minimal, compelling CLI form factor that changed what AI looks like - it's not just a website you go to like Google, it's a little spirit/ghost that "lives" on your computer. This is a new, distinct paradigm of interaction with an AI.
However, if so, this is definitely a distinction that needs to be made far more clearly.
You think every Electron app out there re-inventing application UX from scratch is bad, wait until LLMs are generating their own custom UX for every single action for every user for every device. What does command-W do in this app? It's literally impossible to predict, try it and see!
It's the best UI ever.
It understands a lot of languages and abstract concepts.
It will not be necessary at all to let LLMs generate random UIs.
I'm not a native English speaker. I sometimes just throw in a German word and it just works.
If you look at how humans actually communicate, I'd guess #1 is text/speech and #2 is pictures.
I’m also sold on his take on "vibe coding" leading to ephemeral software; the idea of spinning up a custom, one-off tokenizer or app just to debug a single issue, and then deleting it, feels like a real shift.
I don't see these descriptions as very insightful.
The difference between general/animal intelligence and jagged/LLM intelligence is simply that humans/animals really ARE intelligent (the word was created to describe this human capability), while LLMs are just echoing narrow portions of the intelligent output of humans (those portions that are amenable to RLVR capture).
For an artificial intelligence to be intelligent in its own right, and therefore be generally intelligent, it would need - like an animal - to be embodied (even if only virtually), autonomous, predicting the outcomes of its own actions (not auto-regressively trained), learning incrementally and continually, built with innate traits like curiosity and boredom to put and keep itself in learning situations, etc.
Of course not all animals are generally intelligent - many (insects, fish, reptiles, many birds) just have narrow "hard coded" instinctual behaviors, but others like humans are generalists whom evolution has therefore honed for adaptive lifetime learning and general intelligence.
But they aren't just echoing, that's the point. You really need to stop ignoring the extrapolation abilities in these domains. The point of the jagged analogy is that they match or exceed human intelligence in specific areas in a way that is not just parroting.
Would "riffing" upset you less than "echoing"? Or an explicit "echoing statistics" rather than "echoing training samples"? Does "Mashups of statistical patterns" do it for you?
The jagged frontier of LLM capability is just a way of noting the fact that they act more like a collection of narrow intelligences than like a general intelligence whose performance might be expected to be more even.
Of course LLMs are built and trained to generate based on language statistics, not to parrot individual samples, but given your objection it's amusing to note that some of the areas where LLMs do best, such as math and programming, are the ones where they have been RL-trained to override these more general language patterns and instead more closely follow the training data.
https://tech.lgbt/@graeme/115749759729642908
It's a stack based on finishing the job Jupyter started. Fences as functions, callable and composable.
Same shape as an MCP. No training required, just walk them through the patterns.
Literally, it's spatially organized. Turns out a woman named Mrs Curwen and I share some thoughts on pedagogy.
There does in fact exist a functor that maps 18th-century piano instruction to context engineering. We play with it.
We should keep in mind that our LLM use is currently subsidized. When the money dries up and we have to pay the real prices, I’ll be interested to see if we can still consider whipping up one-time apps as basically free.
Karpathy hints at one major capability unlock being UI generation, so instead of interacting with text, the AI can present different interfaces depending on the kind of problem. That seems like a severely underexplored problem domain. Who are the key figures innovating in this space so far?
In the most recent Demis interview, he suggests that one of the key problems that must be solved is online / continuous learning.
Aside from that, another major issue is probably reducing hallucinations and increasing reliability. Ideally you should be able to deploy an LLM to work on a problem domain, and if it encounters an unexpected scenario it reaches out to you to figure out what to do. But for standard problems it should function reliably 100% of the time.
I spent 5 minutes trying to find a way to unsubscribe and couldn't. Finally, I found it buried in the plan page behind one of those low-contrast ellipsis menus on the plan card.
Instead of unsubscribing me or taking me to a form, it opened a conversation with an AI chatbot with a preconfigured "unsubscribe" prompt. I have never felt more angry with a UI: I had to waste more time talking to a robot before it would render the unsubscribe button in the chat.
Why would we bring the most hated feature of automated phone calls to apps? As a frontend engineer I am horrified by these trends.
The idea of jaggedicity seems useful to advancing epistemology. If we could identify the domains that have useful data that we fail to extract, we could fill those holes and eventually become a general intelligence ourselves. The task may be as hard as making a list of your blind spots. But now we have an alien intelligence with an outside perspective. While making AI less jagged it might return the favor.
If we keep inventing different kinds of intelligence the sum of the splats may eventually become well rounded.
What is he referring to here? Is nano banana not just an image gen model? Is it because it's an LLM-based one, and not diffusion?
Give it an image of a maze, it can output that same image with the maze completed (maybe).
There's a fantastic article about that for image-to-video models here: https://video-zero-shot.github.io/
> We demonstrate that Veo 3 can zero-shot solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and much more.
NB (Gemini 2.5 Flash Image) isn't the first major-vendor LLM-based image gen model, after all; GPT Image 1 was first.
> LLMs are emerging as a new kind of intelligence, simultaneously a lot smarter than I expected and a lot dumber than I expected
Isn't this concerning? How can we know which one we get? In the realm of code it's easier to tell when mistakes are being made.
> regular people benefit a lot more from LLMs compared to professionals, corporations and governments
We thought this would happen with things like AppleScript, VB, visual programming. But instead, AI is currently used as a smarter search engine. The issue is that's also the area where it hallucinates the most. What do you think is the solution?
Whereas we just got the incremental progress with gpt-5 instead and it was very underwhelming. (Plus like 5 other issues at launch, but that's a separate story ;)
I'm not sure if o4-mini would have made a good default gpt though. (Most use is conversational and its language is very awkward.) So they could have just called it gpt-5 pro or something, and put it on the $20 tier. I don't know.
This would be a 100 kLOC legacy project written in C++, Python, and jQuery-era JavaScript circa 2010. The original devs have long left. I would rather avoid C++ as much as possible.
I've been a GitHub Copilot (in VS Code) user since June 2021 and still use it heavily, but the "more powerful IntelliSense" approach is limiting me on legacy projects.
Presumably I need to provide more context on larger projects.
I can get pretty far with just ChatGPT Plus and feeding it bits and pieces of the project. However, that seems like using the wrong tool.
Codex seems better for building things but not sure about grokking existing things.
Would Cursor be more suitable for just dumping in the whole project (all languages, basically 4 different sub-projects) and then selectively activating what to include in queries?
Sometimes the point of the software is to make an app with 2 buttons for your mom, to make her grocery shopping easier.
Big media agencies that claim to use AI rely on strong creative teams who fine-tune prompts and spend weeks doing so. Even then, they don’t fully trust AI to slice long videos into shorter clips for social media.
Heavy administrative functions like HR or Finance still don’t get approval to expose any of their data to LLMs.
What I’m trying to say is that we are still in the early stages of LLM development, and as promising as this looks, it’s still far from delivering the real value that is often claimed.
It took a long time to computerize businesses and it might take some time to adopt/adapt to LLMs.
Similarly, we’re all talking to ghosts now, which aren’t real, and yet there is something there that we can talk about. There are obvious behavioral differences depending on what persona the LLM is generating text for.
I also like the hint of danger in “talking to ghosts.” It’s difficult to see how a rational adult could be in any danger from just talking, but I believe the news reports that some people who get too deep into it get “possessed.”