Like there have been multiple times now where I wanted the code to look a certain way, but it kept pulling back to the way it wanted to do things. Like if I had stated certain design goals recently it would adhere to them, but after a few iterations it would forget again and go back to its original approach, or mix the two, or whatever. Eventually it was easier just to quit fighting it and let it do things the way it wanted.
What I've seen is that after the initial dopamine rush of being able to do things that would have taken much longer manually, a few iterations of this kind of interaction have slowly led to disillusionment with the whole project, as the AI keeps pushing it in a direction I didn't want.
I think this is especially true if you're trying to experiment with new approaches to things. LLMs are, by definition, biased by what was in their training data. You can shock them out of it momentarily, which is awesome for a few rounds, but over time the gravitational pull of what's already in their latent space becomes inescapable. (I picture it as working like a giant Sierpinski triangle).
I want to say the end result is very akin to doom scrolling. Doom tabbing? It's like, yeah I could be more creative with just a tad more effort, but the AI is already running and the bar to seeing what the AI will do next is so low, so....
This would be fine if not for one thing: the meta-skill of learning to use the LLM depreciates too. Today's LLM is gonna go away someday, the way you have to use it will change. You will be on a forever treadmill, always learning the vagaries of using the new shiny model (and paying for the privilege!)
I'm not going to make myself dependent, let myself atrophy, run on a treadmill forever, for something I happen to rent and can't keep. If I wanted a cheap high that I didn't mind being dependent on, there's more fun ones out there.
This isn't to say LLMs won't change software development forever, I think they will. But I doubt anyone has any idea what kind of tools and approaches everyone will be using 5 or 10 years from now, except that I really doubt it will be whatever is being hyped up at this exact moment.
My gripe with AI tools in the past is that the kind of work I do is large and complex and with previous models it just wasn't efficient to either provide enough context or deal with context rot when working on a large application - especially when that application doesn't have a million examples online.
I've been trying to implement a multiplayer game with server authoritative networking in Rust with Bevy. I specifically chose Bevy as the latest version was after Claude's cut off, it had a number of breaking changes, and there aren't a lot of deep examples online.
Overall it's going well, but one downside is that I don't really understand the code "in my bones". If you told me tomorrow that I had to optimize latency or that there was a 1 in 100 edge case, not only would I not know where to look, I don't think I could tell you how the game engine works.
In the past, I could not have ever gotten this far without really understanding my tools. Today, I have a semi functional game and, truth be told, I don't even know what an ECS is and what advantages it provides. I really consider this a huge problem: if I had to maintain this in production, if there was a SEV0 bug, am I confident enough I could fix it? Or am I confident the model could figure it out? Or is the model good enough that it could scan the entire code base and intuit a solution? One of these three questions has to be answered or else brain atrophy is a real risk.
I am interested in doing something similar (Bevy, not multiplayer).
I had the thought that you ought to be able to provide a cargo doc or rust-analyzer equivalent over MCP? This... must exist?
I'm also curious how you test if the game is, um... fun? Maybe it doesn't apply so much for a multiplayer game, I'm thinking of stuff like the enemy patterns and timings in a soulslike, Zelda, etc.
I did use ChatGPT to get some rendering code for a retro RCT/SimCity-style terrain mesh in Bevy and it basically worked, though several times I had to tell it "yeah uh nothing shows up", at which point it said "of course! the problem is..." and then I learned about mesh winding, fine, okay... felt like I was in over my head and decided to go to a 2D game instead so didn't pursue that further.
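For anyone else who hits the "nothing shows up" wall: a rough sketch of what mesh winding means, in plain Rust with no engine involved. The assumption here is a renderer that treats counter-clockwise triangles as front-facing and culls the rest, which is a common default but worth checking for your setup. `Vec3` and every function name below are made up for illustration, not Bevy API.

```rust
/// Hypothetical 3D vector type for this sketch (not an engine type).
type Vec3 = [f32; 3];

fn sub(a: Vec3, b: Vec3) -> Vec3 {
    [a[0] - b[0], a[1] - b[1], a[2] - b[2]]
}

fn cross(a: Vec3, b: Vec3) -> Vec3 {
    [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]
}

fn dot(a: Vec3, b: Vec3) -> f32 {
    a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
}

/// Face normal implied by the vertex order (right-hand rule), not normalized.
fn face_normal(tri: [Vec3; 3]) -> Vec3 {
    cross(sub(tri[1], tri[0]), sub(tri[2], tri[0]))
}

/// True when the implied normal points back toward the viewer, i.e. the
/// triangle would survive back-face culling under the assumed convention.
fn is_front_facing(tri: [Vec3; 3], view_dir: Vec3) -> bool {
    dot(face_normal(tri), view_dir) < 0.0
}

/// Swapping any two vertices reverses the winding and flips the normal.
fn flip_winding(tri: [Vec3; 3]) -> [Vec3; 3] {
    [tri[0], tri[2], tri[1]]
}
```

With a flat terrain triangle in the XZ plane and a camera looking straight down, one vertex order gives a normal pointing down (culled, so invisible) and the swapped order gives a normal pointing up (visible). Same three points, different order.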
I've found that there are two issues that arise that I'm not sure how to solve. You can give it docs and point to it and it can generally figure out syntax, but the next issue I see is that without examples, it kind of just brute forces problems like a 14 year old.
For example, the input system originally just let you move left and right, and it popped it into an observer function. As I added more and more controls, it began to litter that function with more and more code, until it was a ~600-line function responsible for a large chunk of game logic.
While trying to parse it I then had it refactor the code - but I don't know if the current code is idiomatic. What would be the cargo doc or rust-analyzer equivalent for good architecture?
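Not an answer to the architecture question, but one common way to keep an input handler from ballooning is to split it into two small steps: translate raw keys into game-level actions, then apply actions to state. A minimal sketch in plain Rust follows. Every name in it (`Key`, `Action`, `Player`, `map_key`, `apply_action`, `handle_input`) is hypothetical, it deliberately avoids Bevy's API, and I'm not claiming it is what idiomatic Bevy code looks like.

```rust
/// Hypothetical raw key codes; a real engine supplies its own type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Key {
    Left,
    Right,
    Space,
    Other,
}

/// Game-level intents, decoupled from physical keys.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Action {
    MoveLeft,
    MoveRight,
    Jump,
}

/// Hypothetical player state, kept tiny for the sketch.
#[derive(Debug, Default, PartialEq)]
struct Player {
    x: i32,
    jumps: u32,
}

/// Step 1: the only place that knows about physical keys.
fn map_key(key: Key) -> Option<Action> {
    match key {
        Key::Left => Some(Action::MoveLeft),
        Key::Right => Some(Action::MoveRight),
        Key::Space => Some(Action::Jump),
        Key::Other => None,
    }
}

/// Step 2: the only place that mutates state in response to an action.
fn apply_action(player: &mut Player, action: Action) {
    match action {
        Action::MoveLeft => player.x -= 1,
        Action::MoveRight => player.x += 1,
        Action::Jump => player.jumps += 1,
    }
}

/// The handler itself stays short no matter how many controls are added.
fn handle_input(player: &mut Player, keys: &[Key]) {
    for action in keys.iter().copied().filter_map(map_key) {
        apply_action(player, action);
    }
}
```

Adding a control then means one new `Action` variant and one arm in each `match`, and the compiler flags any arm you forgot, which is at least a cheap guard against the 600-line version creeping back.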
I'm running into this same problem when trying to use Claude Code for internal projects. Some parts of the codebase just have really intuitive internal frameworks and Claude Code can rip through them and provide great idiomatic code. Others are bogged down by years of tech debt and performance hacks and Claude Code can't be trusted with anything other than multi-paragraph prompts.
>I'm also curious how you test if the game is, um... fun?
Lucky enough for me this is a learning exercise, so I'm not optimizing for fun. I guess you could ask claude code to inject more fun.
I wouldn't have believed it a few years ago if you told me the industry would one day, in lockstep, decide that shipping more tech-debt is awesome. If the unstated bet (that is, that AI development will outpace the rate at which it generates cruft) doesn't pay off, then there will be hell to pay.
Once we realize the kind of mess _those_ models created, well, we'll need even more capable models.
It's a variation on the theme of Kernighan's insight that the more "clever" you are while coding, the harder it will be to debug.
EDIT: Simplicity is a way out but it's hard under normal circumstances, now with this kind of pressure to ship fast because the colleague with the AI chimp can outperform you, aiming at simplicity will require some widespread understanding
This isn't anything new of course. Previously it was with projects built by looking for the cheapest bidder and letting them loose on an ill-defined problem. And you can just imagine what kind of code that produced. Except the scale is much larger.
My favorite example of this was a project that simply stopped working due to the amount of bugs generated from layers upon layers of bad code that was never addressed. That took around 2 years of work to undo. Roughly 6 months to un-break all the functionality and 6 more months to clean up the core and then start building on top.
(except where it's been stated, championed, enforced, and ultimated in no uncertain terms by every executive in the tech industry)
Using an LLM is almost exactly the same. You get the occasional "wow! I've never seen it do that before!" moments (whether that thing it just did was even useful or not), get a short hit of feel goods, and then we keep using it trying to get another hit. It keeps providing them at just the right intervals to keep people going, just like they do with TikTok.
As in if the LLM doesn't know about it, some devs are basically giving up and not even going to RTFM. I literally had to explain to someone today how something works by...reading through the docs and linking them the docs with screenshots and highlighted paragraphs of text.
Still got push back along the lines of "not sure if this will work". It's. Literally. In. The. Docs.
I think the way you’re using these tools that makes you feel this way is a choice. You’re choosing to not be in control and do as little as possible.
I found the setting and turned it off for real. Good riddance. I’ll use the hotkey on occasion.
The time it happened for me was rather abrupt, with no training in between, and the feeling was eerily similar.
You know _exactly_ what the best solution is, you talk to your reports, but they have minds of their own, as well as egos, and they do things … their own way.
At some point I stopped obsessing with details and was just giving guidance and direction only in the cases where it really mattered, or when asked, but let people make their own mistakes.
Now LLMs don’t really learn on their own or anything, but the feeling of “letting go of small trivial things” is sorta similar. You concentrate on the bigger picture, and if it chose to do an iterative for loop instead of using a functional approach the way you like it … well the tests still pass, don’t they.
Not trusting the ML's output is step one here, that keeps you intellectually involved - but it's still a far cry from solving the majority of problems yourself (instead you only solve problems ML did a poor job at).
Step two: I delineate interesting and uninteresting work, and Claude becomes a pair programmer without keyboard access for the latter - I bounce ideas off of it etc. making it an intelligent rubber duck. [Edit to clarify, a caveat is that] I do not bore myself with trivialities such as retrieving a customer from the DB in a REST call (but again, I do verify the output).
Context management, proper prompting and clear instructions, proper documentation are still relevant.
Yeah, exactly. Like, we are just waiting for it to get completed, and after it gets completed, then what? We ask it to do new things again.
Just as how if we are doom scrolling, we watch something for a minute then scroll down and watch something new again.
The whole notion of progress feels completely fake with this. Somehow I guess I was in a bubble of time where I had always ended up using AI in web browsers (just as when ChatGPT 3 came out) and my workflow didn't change because it was free, but I recently changed it when some new free services dropped.
"Doom-tabbing", or completely out-of-the-loop agentic AI programming, just feels really weird to me, sucking the joy out of it, & I wouldn't even consider myself a guy particularly interested in writing code, as I had been using AI to write code for a long time.
I think the problem for me was that I always considered myself a computer tinkerer before a coder. So when AI came for coding, my tinkering skills were given a boost (I could make projects of curiosity I couldn't earlier), but now with AI agents working in this autonomous-esque way, it has come for my tinkering, and I do feel replaced, or just feel like my ability to tinker, my interests, my knowledge and my experience are not taken into account if an AI agent will write the whole code in a multi-file structure, run commands and then deploy it straight to a website.
I mean, my point is tinkering was an active hobby; now it's becoming a passive hobby. Doom-tinkering? I feel like I caught on to the feeling a bit earlier, just from a vibe in my heart, but is it just me who feels this or?
What could be a name for what I feel?
Have to really look out for the crap.
I’ve always said I’m a builder even though I’ve also enjoyed programming (but for an outcome, never for the sake of the code)
This perfectly sums up what I’ve been observing between people like me (builders) who are ecstatic about this new world and programmers who talk about the craft of programming, sometimes butting heads.
One viewpoint isn’t necessarily more valid, just a difference of wiring.
"I got into programming because I like programming, not whatever this is..."
Yes, I'm building stupid things faster, but I didn't get into programming because I wanted to build tons of things. I got into it for the thrill of defining a problem in terms of data structures and instructions a computer could understand, entering those instructions into the computer, and then watching victoriously while those instructions were executed.
If I was intellectually excited about telling something to do this for me, I'd have gotten into management.
>If I was intellectually excited about telling something to do this for me, I'd have gotten into management.
Exactly this. This is the simplest and tersest way of explaining it yet.
I used Claude Code to implement an OpenAI 4o-vision powered receipt scanning feature in an expense tracking tool I wrote by hand four years ago. It did it in two or three shots while taking my codebase into account.
It was very neat, and it works great [^0], but I can't latch onto the idea of writing code this way. Powering through bugs while implementing a new library or learning how to optimize my test suite in a new language is thrilling.
Unfortunately (for me), it's not hard at all to see how the "builders" that see code as a means to an end would LOVE this, and businesses want builders, not crafters.
In effect, knowing the fundamentals is getting devalued at a rate I've never seen before.
[^0] Before I used Claude to implement this feature, my workflow for processing receipts looked like this: Tap iOS Shortcut, enter the amount, snap a pic of the receipt, type up the merchant, amount and description for the expense, then have the shortcut POST that to my expenses tracking toolkit which, then, POSTs that into a Google Sheet. This feature eliminated the need for me to enter the merchant and amount. Unfortunately, it often took more time to confirm that the merchant, amount and date details OpenAI provided were correct (and correct them when details were wrong, which was most of the time) than it did to type out those details manually, so I just went back to my manual workflow. However, the temptation to just glance at the details and tap "This looks correct" was extremely high, even if the info it generated was completely wrong! It's the perfect analogue to what I've been witnessing throughout the rise of the LLMs.
> with AI that can happen faster.
well, not exactly that.
There is a strange insistence on not helping the LLM arrive at the best outcome in the subtext to this question a lot of times. I feel like we are living through the John Henry legend in real time
So maybe our common ground is that we are direct problem solvers. :-)
What I mean by that: you had compiled vs interpreted languages, you had types vs untyped, testing strategies, all that, at least in some part, was a conversation about the tradeoffs between moving fast/shipping and maintainability.
But it isn't just tech, it is also in methodologies and the words used, from "build fast and break things" and "yagni" to "design patterns" and "abstractions".
As you say, it is a different viewpoint... but my biggest concern with where we are as an industry is that these are not just "equally valid" viewpoints of how to build software... it is quite literally different stages of software, that, AFAICT, pretty much all successful software has to go through.
Much of my career has been spent in teams at companies with products that are undergoing the transition from "hip app built by scrappy team" to "profitable, reliable software" and it is painful. Going from something where you have 5 people who know all the ins and outs and can fix serious bugs or ship features in a few days to something that has easy clean boundaries to scale to 100 engineers of a wide range of familiarities with the tech, the problem domain, skill levels, and opinions is just really hard. I am not convinced yet that AI will solve the problem, and I am also unsure it doesn't risk making it worse (at least in the short term)
This perspective is crucial. Scale is the great equalizer / demoralizer, scale of the org and scale of the systems. Systems become complex quickly, and verifiability of correctness and function becomes harder. For companies that built from day one with AI and have AI influencing them as they scale, where does complexity begin to run up against the limitations of AI and cause regression? Or if all goes well, amplification?
And accountability can still exist? Is the engineer that created or reviewed a Pull Request using Claude Code less accountable than one that used PICO?
The point is that in the human scenario, you can hold the human agents accountable. You cannot do that with AI. Of course, you as the orchestrator of agents will be accountable to someone, but you won't have the benefit of holding your "subordinates" accountable, which is what you do in a human team. IMO, this renders the whole situation vastly different (whether good or bad I'm not sure).
We have services deployed globally serving millions of customers where rigor is really important.
And we have internal users who're building browser extensions with AI that provide valuable information about the interface they're looking at including links to the internal record management, and key metadata that's affecting content placement.
These tools could be handed out on Zip drives in the street and it would just show our users some of the metadata already being served up to them, but it's amazing to strip out 75% of the process of certain things and just have our user (in this case though, it's one user who is driving all of this, so it does take some technical inclination) build out these tools that save our editors so much time when doing this before would have been months and months and months of discovery and coordination and designs that probably wouldn't actually be as useful in the end after the wants of the user are diluted through 18 layers of process.
This distinction to me separates the two primary camps
I deliberately avoid full vibe coding since I think doing so will rust my skills as a programmer. It also really doesn’t save much time in my experience. Once I have a design in mind, implementation is not the hard part.
Managers and project managers are valuable roles and have important skill sets. But there's really very little connection with the role of software development that used to exist.
It's a bit odd to me to include both of these roles under a single label of "builders", as they have so little in common.
EDIT: this goes into more detail about how coding (and soon other kinds of knowledge work) is just a management task now: https://www.oneusefulthing.org/p/management-as-ai-superpower...
The fact of the matter is LLMs produce lower quality at higher volumes in more time than it would take to write it myself, and I’m a very mediocre engineer.
I find this separation of “coding” vs “building” so offensive. It’s basically just saying some people are only concerned with “inputs”, while others with “outputs”. This kind of rhetoric is so toxic.
It’s like saying LLM art is separating people into people who like to scribble, and people who like to make art.
I had felt like this and still do but man, at some point, I feel like the management churn feels real & I just feel like I'm suffering from a new problem.
Suppose I actually end up having services literally deployed from a single prompt, nothing else. Earlier I used to have AI write code, but I was interested in the deployment and everything around it; now there are services which do that really neatly for you (I also really didn't give in to the agent hype and mostly used browser LLMs).
Like on one hand you feel more free to build projects, but the whole joy of the project has been completely reduced.
I mean, I guess I am one of the junior devs, so to me AI writing code on topics I didn't know/prototyping felt awesome.
I mean I was still involved in say copy pasting or looking at the code it generates. Seeing the errors and sometimes trying things out myself. If AI is doing all that too, idk
For some reason, recently I have been disinterested in AI. I have used it quite a lot for prototyping, but I feel like this completely out-of-the-loop programming just feels very off to me with recent services.
I also feel like there is this sense that if I pay for some AI thing, I have to maximally extract "value" out of it.
I guess the issue could be that I can have vague terms or have a very small text file as input (like just do X alternative in Y lang) and I am now unable to understand the architectural decisions and the overwhelmed-ness out of it.
Probably gonna take either spec-driven development, where I clearly define the architecture, or the kind of development I saw Primeagen do recently, where the AI will only manipulate the code of one particular function (I am imagining it for a file as well), and somehow I feel like it's something that I could enjoy more, because right now it feels like I don't know what I have built at times.
When I prototype with single file projects using say browser for funsies/any idea. I get some idea of what the code kind of uses with its dependencies and functions names from start/end even if I didn't look at the middle
A bit of a ramble I guess, but the thing which is kind of making me feel this is that I was talking to somebody and showcasing them some service where AI + server is there, and they asked for something in a prompt and I wrote it. Then I let it do its job, but I was also thinking how I would architect it (it was some detect food and then find BMR, and I was thinking first to use any api but then I thought that meh it might be hard, why not use AI vision models, okay what's the best, gemini seems good/cheap)
and I went to the coding thing to see what it did and it actually went even beyond by using the free tier of gemini (which I guess didn't end up working could be some rate limit of my own key but honestly it would've been the thing I would've tried too)
So like, I used to pride myself on the architectural decisions I make even if AI could write code faster but now that is taken away as well.
I really don't want to read AI code so much so honestly at this point, I might as well write code myself and learn hands on but I have a problem with build fast in public like attitude that I have & just not finding it fun.
I feel like I should do a more active job in my projects & I am really just figuring out what's the perfect way to use AI in such contexts & when to use how much.
Thoughts?
I was thinking about this the other day as relates to the DevOps movement.
The DevOps movement started as a way to accelerate and improve the results of dev<->ops team dynamics. By changing practices and methods, you get acceleration and improvement. That creates "high-performing teams", which is the team form of a 10x engineer. Whether or not you believe in '10x engineers', a high-performing team is real. You really can make your team deploy faster, with fewer bugs. You have to change how you all work to accomplish it, though.
To get good at using AI for coding, you have to do the same thing: continuous improvement, changing workflows, different designs, development of trust through automation and validation. Just like DevOps, this requires learning brand new concepts, and changing how a whole team works. This didn't get adopted widely with DevOps because nobody wanted to learn new things or change how they work. So it's possible people won't adapt to the "better" way of using AI for coding, even if it would produce a 10x result.
If we want this new way of working to stick, it's going to require education, and a change of engineering culture.
This is true... Equally I've seen it dive into a rabbit hole, make some changes that probably aren't the right direction... and then keep digging.
This is way more likely with Sonnet, Opus seems to be better at avoiding it. Sonnet would happily modify every file in the codebase trying to get a type error to go away. If I prompt "wait, are you off track?" it can usually course correct. Again, Opus seems way better at that part too.
Admittedly this has improved a lot lately overall.
There has been a lot of research that shows that grit is far more correlated to success than intelligence. This is an interesting way to show something similar.
AIs have endless grit (or at least as endless as your budget). They may outperform us simply because they don't ever get tired and give up.
Full quote for context:
Tenacity. It's so interesting to watch an agent relentlessly work at something. They never get tired, they never get demoralized, they just keep going and trying things where a person would have given up long ago to fight another day. It's a "feel the AGI" moment to watch it struggle with something for a long time just to come out victorious 30 minutes later. You realize that stamina is a core bottleneck to work and that with LLMs in hand it has been dramatically increased.
So I think this tracks with Karpathy's defense of IDEs still being necessary?
Has anyone found it practical to forgo IDEs almost entirely?
Mind you copilot has only supported agent mode relatively recently.
I really like the way copilot does changes in such a way you can accept or reject and even revert to point in time in the chat history without using git. Something about this just fits right with how my brain works. Using Claude plugin just felt like I had one hand tied behind my back.
But what I like about this setup is that I have almost all the context I need to review the work in a single PR. And I can go back and revisit the PR if I ever run into issues down the line. Plus you can run sessions in parallel if needed, although I don't do that too much.
This stuff gets a whole lot more interesting when you let it start making changes and testing them by itself.
Also note that with Claude models, Copilot might allocate a different number of thinking tokens compared to Claude Code.
Things may have changed now compared to when I tried it out, these tools are in constant flux. In general I've found that harnesses created by the model providers (OpenAI/Codex CLI, Anthropic/Claude Code, Google/Gemini CLI) tend to be better than generalist harnesses (cheaper too, since you're not paying a middleman).
It's not about the model. It's about the harness
A lot of these things sound cool but sometimes I'm curious what they're actually building
Like, is their bottleneck creativity now then? Are they building anything interesting or using agents to build... things that don't appeal to me, anyway?
Until you struggle to review it as well. Simple exercise to prove it - ask LLM to write a function in familiar programming language, but in the area you didn't invest learning and coding yourself. Try reviewing some code involving embedding/SIMD/FPGA without learning it first.
No one has ever learned a skill just by reading/observing.
After certain experience threshold of making things from scratch, “coding” (never particularly liked that term) has always been 99% building, or architecture, and I struggle to see how often a well-architected solution today, with modern high-level abstractions, requires so much code that you’d save significant time and effort by not having to just type, possibly with basic deterministic autocomplete, exactly what you mean (especially considering you would have to also spend time and effort reviewing whatever was typed for you if you used a non-deterministic autocomplete).
Asking it to do entire projects? Dumb. You end up with spaghetti, unless you hand-hold it to a point that you might as well be using my autocomplete method.
Somewhere, there are GPUs/NPUs running hot. You send all the necessary data, including information that you would never otherwise share. And you most likely do not pay the actual costs. It might become cheaper or it might not, because reasoning is a sticking plaster on the accuracy problem. You and your business become dependent on this major gatekeeper. It may seem like a good trade-off today. However, the personal, professional, political and societal issues will become increasingly difficult to overlook.
The “tenacity” referenced here has been, in my opinion, the key ingredient in the secret sauce of a successful career in tech, at least in these past 20 years. Every industry job has its intricacies, but for every engineer who earned their pay with novel work on a new protocol, framework, or paradigm, there were 10 or more providing value by putting the myriad pieces together, muddling through the ever-waxing complexity, and crucially never saying die.
We all saw others weeded out along the way for lacking the tenacity. Think the boot camp dropouts or undergrads who changed majors when first grappling with recursion (or emacs). The sole trait of stubbornness to “keep going” outweighs analytical ability, leetcode prowess, soft skills like corporate political tact, and everything else.
I can’t tell what this means for the job market. Tenacity may not be enough on its own. But it’s the most valuable quality in an employee in my mind, and Claude has it.
Claude isn't tenacious. It is an idiot that never stops digging because it lacks the meta cognition to ask 'hey, is there a better way to do this?'. Chain of thought's whole raison d'etre was so the model could get out of the local minima it pushed itself in. The issue is that after a year it still falls into slightly deeper local minima.
This is fine when a human is in the loop. It isn't what you want when you have a thousand idiots each doing a depth first search on what the limit of your credit card is.
Recently had an AI tell me this code (that it wrote) is a mess and suggest wiping it and starting from scratch with a more structured plan. That seems to hint at some outlines of metacognition.
At a company I worked for, lots of senior engineers become managers because they no longer want to obsess over whether their algorithm has an off by one error. I think fewer will go the management route.
(There was always the senior tech lead path, but there are far more roles for management than tech lead).
Otherwise you'd be in the senior staff to principal range and doing architecture, mentorship, coordinating cross team work, interviewing, evaluating technical decisions, etc.
I got to code this week a bit and it's been a tremendous joy! I see many peers at similar and lower levels (and higher) who have more years and less technical experience and still write lots of code and I suspect that is more what you're talking about. In that case, it's not so much that you've peaked, it's that there's not much to learn and you're doing a bunch of the same shit over and over and that's of course tiring.
I think it also means that everything you interact with outside your space does feel much harder because of the infrequency with which you have interacted with it.
If you've spent your whole career working the whole stack from interfaces to infrastructure then there's really not going to be much that hits you as unfamiliar after a point. Most frameworks recycle the same concepts and abstractions, same thing with programming languages, algorithms, data management etc.
But if you've spent most of your career in one space cranking tickets, those unknown corners are going to be as numerous as the day you started and be much more taxing.
So although I don't think he should have won the Nobel Prize, because it's not really physics, I felt his perseverance and hard work should merit something.
Then even if you do catch it, AI: "ah, now I see exactly the problem. just insert a few more coins and I'll fix it for real this time, I promise!"
Remember Google?
Once it was far-fetched that they would make the search worse just to show you more ads. Now, it is a reality.
With tokens, it is even more direct. The more tokens users spend, the more money for providers.
What are the details of this? I'm not playing dumb, and of course I've noticed the decline, but I thought it was a combination of losing the battle with SEO shite and leaning further and further into a 'give the user what you think they want, rather than what they actually asked for' philosophy.
Unless you’re paying by the token.
Switching costs are currently low. Once you're committed to the workflow the providers will switch to prepaying for a year's worth of tokens.
The way agents work right now though just sometimes feels that way; they don't have a good way of saying "You're probably going to have to figure this one out yourself".
I feel like saying "the market will fix the incentives" handwaves away the lack of information on internals. After all, look at the market response to Google making their search less reliable - sure, an invested nerd might try Kagi, but Google's still the market leader by a long shot.
In a market for lemons, good luck finding a lime.
After any agent run, I always look at the git diff between the new version and the previous one. This helps catch things that you might otherwise not notice.
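One way to make that post-run review routine is a small script that lists what changed, biggest changes first. This is only a sketch: it assumes the agent's work landed as the latest commit (`HEAD`) with the previous state at `HEAD~1`, and the helper names are made up for illustration. `git diff --numstat` is a real git option that prints added and deleted line counts per file.

```python
import subprocess


def parse_numstat(text):
    """Parse `git diff --numstat` output into (added, deleted, path) tuples.

    git prints "-" in both count columns for binary files; those are kept
    with counts of None so they still show up in the review list.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        added, deleted, path = line.split("\t", 2)
        rows.append((
            None if added == "-" else int(added),
            None if deleted == "-" else int(deleted),
            path,
        ))
    return rows


def changed_files(old="HEAD~1", new="HEAD"):
    """Run git in the current repository and return parsed per-file stats."""
    out = subprocess.run(
        ["git", "diff", "--numstat", old, new],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_numstat(out)


def print_review(rows):
    """Print files with the largest total change first."""
    for added, deleted, path in sorted(
        rows, key=lambda r: -((r[0] or 0) + (r[1] or 0))
    ):
        print(f"{added!s:>6} {deleted!s:>6}  {path}")
```

Calling `print_review(changed_files())` inside a repository gives a quick list of where to look first; reading the actual diff is still the point.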
That said, more and more people seem to be arriving at the conclusion that if you want a fairly large-sized, complex task in a large existing codebase done right, you'll have better odds with Codex GPT-5.2-Codex-XHigh than with Claude Code Opus 4.5. It's far slower than Opus 4.5 but more likely to get things correct, and complete, in its first turn.
For instance, I know some people have had success with getting claude to do game development. I have never bothered to learn much of anything about game development, but have been trying to get claude to do the work for me. Unsuccessful. It works for people who understand the problem domain, but not for those who don't. That's my theory.
It also works for problems that have been solved a thousand times before, which impresses people and makes them think it is actually solving those problems.
"Reasoning", however, is a feature that has been bolted on with a hacksaw and duct tape. Their ability to pattern match makes reasoning seem more powerful than it actually is. If your bug is within some reasonable distance of a pattern it has seen in training, reasoning can get it over the final hump. But if your problem is too far removed from what it has seen in its latent space, it's not likely to figure it out by reasoning alone.
What do you mean by this? Especially for tasks like coding where there is a deterministic correct or incorrect signal it should be possible to train.
So you mean it works on almost all problems?
If it does not, this is going to be the first technology in the history of mankind that has not become cheaper.
(But anyway, it already costs half compared to last year)
You could not have bought Claude Opus 4.5 at any price one year ago I'm quite certain. The things that were available cost half of what they did then, and there are new things available. These are both true.
I'm agreeing with you, to be clear.
There are two pieces I expect to continue: inference for existing models will continue to get cheaper. Models will continue to get better.
Three things, actually.
The "hitting a wall" / "plateau" people will continue to be loud and wrong. Just as they have been since 2018[0].
[0]: https://blog.irvingwb.com/blog/2018/09/a-critical-appraisal-...
This is harmless when it comes to tech opinions but causes real damage in politics and activism.
People get really attached to ideals and ideas, and keep sticking to those after they fail to work again and again.
I went back to tell them (do not know them at all just everyone is chattier digging out of a storm) and they were not there. Feel terrible and no real viable remedy. Hope they check themselves and realize I am an idiot. Even harder on the internet.
Everybody who bet against Moore's Law was wrong ... until they weren't.
And AI is the reaction to Moore's Law having broken. Nobody gave one iota of damn about trying to make programming easier until the chips couldn't double in speed anymore.
However, most people don't know the difference between the proper Moore's Law scaling (the cost of a transistor halves every 2 years) which is still continuing (sort of) and the colloquial version (the speed of a transistor doubles every 2 years) which got broken when Dennard scaling ran out. To them, Moore's Law just broke.
Nevertheless, you are reinforcing my point. Nobody gave a damn about improving the "programming" side of things until the hardware side stopped speeding up.
And rather than try to apply some human brainpower to fix the "programming" side, they threw a hideous number of those free (except for the electricity--but we don't mention that--LOL) transistors at the wall to create a broken, buggy, unpredictable machine simulacrum of a "programmer".
(Side note: And to be fair, it looks like even the strong form of Moore's Law is finally slowing down, too)
And in fact, the agentic looped LLMs are executing much better than that today. They could stop advancing right now and still be revolutionary.
check out whether clocks have gotten cheaper in general. the answer is that they have.
there is no economy of scale here in repairing a single clock. it's not relevant to bring it up here.
You can buy one for 90 cents on temu.
of course it's silly to talk about manufacturing methods and yield and cost efficiency without having an economy to embed all of this into, but ... technology got cheaper means that we have practical knowledge of how to make cheap clocks (given certain supply chains, given certain volume, and so and so)
we can make very cheap very accurate clocks that can be embedded into whatever devices, but it requires the availability of fabs capable of doing MEMS components, supply materials, etc.
but inflation is the general price level increase, this can be used as a deflator to get the price of whatever product in past/future money amount to see how the price of the product changed in "real" terms (ie. relative to the general price level change)
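As a rough sketch of the deflator idea: to compare a past price with a present one, scale the past price by the ratio of the general price index between the two periods. The function name and all numbers below are hypothetical, chosen only to make the arithmetic easy to follow; they are not real CPI data.

```python
def real_price(nominal_price, index_then, index_base):
    """Express a past nominal price in base-period money.

    index_then and index_base are values of a general price index
    (for example a CPI) at the time of the price and at the base period.
    """
    return nominal_price * (index_base / index_then)


# Hypothetical numbers, for illustration only (not real CPI data):
# a clock sold for 20 when the index was 50; the index is 100 today.
then_in_todays_money = real_price(20.0, index_then=50.0, index_base=100.0)
todays_price = 15.0

# Real change: compares like with like after removing general inflation.
real_change = todays_price / then_in_todays_money - 1.0
print(f"then, in today's money: {then_in_todays_money:.2f}")
print(f"real price change: {real_change:+.1%}")
```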
Getting a bespoke flintstone axe is also pretty expensive, and has also absolutely no relevance to modern life.
These discussions must, if they are to be useful, center in a population experience, not in unique personal moments.
Not much has gone down in price over the last few years.
Meanwhile the overall price of storage has been going down consistently: https://ourworldindata.org/grapher/historical-cost-of-comput...
https://marylandmatters.org/2025/11/17/key-bridge-replacemen...
In general, there are several things that are true for bridges that aren't true for most technology:
* Technology has massively improved, but most people are not realizing that. (E.g. the Bay Bridge cost significantly more than the previous version, but that's because we'd like to not fall down again in the next earthquake)
* We still have little idea how to reason about the cost of bridges in general. (Seriously. It's an active research topic)
* It's a tiny market, with the major vendors forming an oligopoly
* It's infrastructure, not a standard good
* The buy side is almost exclusively governments.
All of these mean expensive goods that are completely non-repeatable. You can't build the same bridge again. And on top of that, in a distorted market.
But sure, the cost of "one bridge, please" has gone up over time.
Even if you adjust for inflation?
'84 Motorola DynaTAC - ~$12k AfI (adjusted for inflation)
'89 MicroTAC ~$8k AfI
'96 StarTAC ~$2k AfI
'07 iPhone ~$673 AfI
The current average smartphone sells for around $280. Phones are getting cheaper.
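To put a rate on that trend, here is a small sketch that turns the inflation-adjusted figures quoted above into a compound annual change. The prices are the approximate ones from the comment; the $280 figure is left out because the comment gives no year for it, and the helper name is made up.

```python
# Approximate inflation-adjusted prices as quoted in the comment
# (year, model, USD).
phones = [
    (1984, "DynaTAC", 12_000),
    (1989, "MicroTAC", 8_000),
    (1996, "StarTAC", 2_000),
    (2007, "iPhone", 673),
]


def annual_change(price_start, price_end, years):
    """Compound annual rate of change between two prices."""
    return (price_end / price_start) ** (1.0 / years) - 1.0


# Rate between each pair of consecutive entries.
for (y0, name0, p0), (y1, name1, p1) in zip(phones, phones[1:]):
    rate = annual_change(p0, p1, y1 - y0)
    print(f"{name0} ({y0}) -> {name1} ({y1}): {rate:+.1%} per year")

# Rate across the whole dated range.
first_year, _, first_price = phones[0]
last_year, _, last_price = phones[-1]
overall = annual_change(first_price, last_price, last_year - first_year)
print(f"overall {first_year}-{last_year}: {overall:+.1%} per year")
```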
(Oil rampdown is a survival imperative due to the climate catastrophe so there it's a very positive thing of course, though not sufficient...)
this is accounting for the fact that more tokens are used.
> Newer models cost more than older models
where did you see this?
There’s no such thing as “same task by old model”: you might get comparable results or you might not (and this is why the comparison fails, it’s not a comparison). The reason you pick the newer models is to increase your chances of getting a good result.
This should answer. In your case, GPT-3.5 definitely is cheaper per token than 4o but much much less capable. So they used a model that is cheaper than GPT-3.5 that achieved better performance for the analysis.
Not according to their pricing table. Then again I’m not sure what OpenAI model versions even mean anymore, but I would assume 5.2 is in the same family as 5 and 5.2-pro as 5-pro
LLMs will face their own challenges with respect to reducing costs, since self-attention grows quadratically. These are still early days, so there remains a lot of low hanging fruit in terms of optimizations, but all of that becomes negligible in the face of quadratic attention.
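A back-of-the-envelope sketch of what "quadratic" means here: full self-attention scores every token against every other token, so the number of pairwise scores grows with the square of the context length. This counts only the score matrix for a single head in a single layer and ignores real-world optimizations; the function name is made up.

```python
def attention_score_entries(seq_len):
    """Number of query-key pairs in full self-attention (one head, one layer)."""
    return seq_len * seq_len


for n in (1_000, 2_000, 4_000, 8_000):
    print(f"{n:>6} tokens -> {attention_score_entries(n):>12,} pairwise scores")

# Doubling the context multiplies this term by four.
ratio = attention_score_entries(8_000) / attention_score_entries(4_000)
```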
so close! that is a commodity
There have been plenty of technologies in history which do not in fact become cheaper. LLMs are very likely to become such, as I suspect their usefulness will be superseded by cheaper (much cheaper in fact) specialized models.
Eating burgers and driving cars around costs a lot more than whatever # of watts the human brain consumes.
This is one of the weakest anti AI postures. "It's a bubble and when free VC money stops you'll be left with nothing". Like it's some kind of mystery how expensive these models are to run.
You have open weight models right now like Kimi K2.5 and GLM 4.7. These are very strong models, only months behind the top labs. And they are not very expensive to run at scale. You can do the math. In fact there are third parties serving these models for profit.
The money pit is training these models (and not even that much if you are efficient, like the Chinese models). Once they are trained, they are served with large profit margins compared to the inference cost.
OpenAI and Anthropic are without a doubt selling their API for a lot more than the cost of running the model.
Running at their designed temperature.
> You send all the necessary data, including information that you would never otherwise share.
I've never sent the type of data that isn't already either stored by GitHub or a cloud provider, so no difference there.
> And you most likely do not pay the actual costs.
So? Even if costs double once investor subsidies stop, that doesn't change much of anything. And the entire history of computing is that things tend to get cheaper.
> You and your business become dependent on this major gatekeeper.
Not really. Switching between Claude and Gemini or whatever new competition shows up is pretty easy. I'm no more dependent on it than I am on any of another hundred business services or providers that similarly mostly also have competitors.
Oh my lord you absolutely do not. The costs to oai per token inference ALONE are at least 7x. AT LEAST and from what I’ve heard, much higher.
There’s often a better faster way to do it, and while it might get to the short term goal eventually, it’s often created some long term problems along the way.
[1]: https://developer-blogs.nvidia.com/wp-content/uploads/2026/0...
Like... bro that's THE foundation of CS. That's the principle of the Bombe in Turing's time. One can still marvel at it but it's been with us since the beginning.
you might think I'm kidding but Search redox on github, you will find that project and the anonymous contributions
Decided to figure out what this "vibe coding" nonsense is, and now there's a certain level of joy to all of this again. Being able to clearly define everything using markdown contexts before any code is even written has been a great way to brain dump those 25 years of experience and actually watch something sane get produced.
Here are the stats Claude Code gave me:
Overview
┌───────────────┬────────────────────────────┐
│ Metric        │ Value                      │
├───────────────┼────────────────────────────┤
│ Total Commits │ 365                        │
│ Project Age   │ 7 days (Jan 20 - 27, 2026) │
│ Open Issues   │ 5                          │
│ Contributors  │ 1                          │
└───────────────┴────────────────────────────┘
Lines of Code by Language
┌───────────────────────────┬───────┬────────┬───────────┐
│ Language                  │ Files │ Lines  │ % of Code │
├───────────────────────────┼───────┼────────┼───────────┤
│ Rust (Backend)            │ 94    │ 31,317 │ 51.8%     │
│ TypeScript/TSX (Frontend) │ 189   │ 29,167 │ 48.2%     │
│ SQL (Migrations)          │ 34    │ 1,334  │ —         │
│ CSS                       │ —     │ 1,868  │ —         │
│ Markdown (Docs)           │ 37    │ 9,485  │ —         │
├───────────────────────────┼───────┼────────┼───────────┤
│ Total Source              │ 317   │ 60,484 │ 100%      │
└───────────────────────────┴───────┴────────┴───────────┘
I then realized I could feed it everything it ever needed to know. Just create a docs/* folder and tell it to read that every session.
Through discovery I learned about CLAUDE.md, and adding skills.
Now I have an /analyst, /engineer, and /devops that I talk to all day with their own logic and limitations, as well as the more general project CLAUDE.md, and dozens of docs/* files we collaborate on.
I'm at the point I'm running happy.engineering on my phone and don't even need to sit in front of the computer anymore.
Starcraft and Factorio are exactly what it is not. Starcraft has a loooot of micro involved at any level beyond mid level play, despite all the "pro macros and beats gold league with mass queens" meme videos. I guess it could be like Factorio if you're playing it by plugging together blueprint books from other people but I don't think that's how most people play.
At that level of abstraction, it's more like grand strategy if you're to compare it to any video game? You're controlling high level pushes and then the units "do stuff" and then you react to the results.
I've been working in the mobile space since 2009, though primarily as a designer and then product manager. I work in kinda a hybrid engineering/PM job now, and have never been a particularly strong programmer. I definitely wouldn't have thought I could make something with that polish, let alone in 3 months.
That code base is ~98% Claude code.
Not sure if it's an American pronunciation thing, but I had to stare at that long and hard to see the problem and even after seeing it couldn't think of how you could possibly spell the correct word otherwise.
It's a bad American pronunciation thing like "Febuwary" and "nuculer".
If you pronounce the syllables correctly, "an-ec-dote", "Feb-ru-ar-y", "nu-cle-ar" the spellings follow.
English has its fair share of spelling stupidities, but if people don't even pronounce the words correctly there is no hope.
I'm not sure how big your repos are but I've been effective working with repos that have thousands of files and tens of thousands of lines of code.
If you're just prototyping it will hit a wall when things get unwieldy, but that's normally a sign that you need to refactor a bit.
Super strict compiler settings, static analysis, comprehensive tests, and documentation help a lot. As does basic technical design. After a big feature is shipped I do a refactor cycle with the LLM where we do a comprehensive code review and patch things up. This does require human oversight because the LLMs are still lacking judgement on what makes for good code design.
The places where I've seen them be useless is working across repositories or interfacing with things like infrastructure.
It's also very model-dependent. Opus is a good daily driver but Codex is much better at writing tests for some reason. I'll often also switch to it for hard problems that Claude can't solve. Gemini is nice for 'I need a prototype in the next 10 minutes', especially for making quick and dirty bespoke front-ends where you don't care about the design, just the functionality.
Perhaps this is part of it? Tens of thousands of lines of code seems like a very small repo to me.
For this the LLM struggles a bit, but so does a human. The main issues are that it messes up some state that it didn't realise was used elsewhere, and our test coverage is not great. We've seen humans make exactly the same kind of mistakes. We use MCP for Figma so most of the time it can get a UI 95% done, with just a few tweaks needed by the operator.
On the backend (Typescript + Node, good test coverage) it can pretty much one-shot - from a plan - whatever feature you give it.
We use opus-4.5 mostly, and sometimes gpt-5.2-codex, through Cursor. You aren't going to get ChatGPT (the web interface) to do anything useful, switch to Cursor, Codex or Claude Code. And right now it is worth paying for the subscription, you don't get the same quality from cheaper or free models (although they are starting to catch up, I've had promising results from GLM-4.7).
I never paid any attention to different models, because they all felt roughly equal to me. But Opus 4.5 is really and truly different. It's not a qualitative difference, it's more like it just finally hit that quantitative edge that allows me to lean much more heavily on it for routine work.
I highly suggest trying it out, alongside a well-built coding agent like the one offered by Claude Code, Cursor, or OpenCode. I'm using it on a fairly complex monorepo and my impressions are much the same as Karpathy's.
I had never used Swift before that and was able to use AI to whip up a fairly full-featured and complex application with a decent amount of code. I had to make some cross-cutting changes along the way as well that impacted quite a few files and things mostly worked fine with me guiding the AI. Mind you this was a year ago so I can only imagine how much better I would fare now with even better AI models. That whole month was spent not only on coding but on learning Swift enough to fix problems when AI started running into circles and then learning about Xcode profiler to optimize the application for speed and improving perf.
Trying to incorporate it in existing codebases (esp when the end user is a support interaction or more away) is still folly, except for closely reviewed and/or non-business-logic modifications.
That said, it is quite impressive to set up a simple architecture, or just list the filenames, and tell some agents to go crazy to implement what you want the application to do. But once it crosses a certain complexity, I find you need to prompt closer and closer to the weeds to see real results. I imagine a non-technical prompter cannot proceed past a certain prototype fidelity threshold, let alone make meaningful contributions to a mature codebase via LLM without a human engineer to guide and review.
It's been especially helpful in explaining and understanding arcane bits of legacy code behavior my users ask about. I trigger Claude to examine the code and figure out how the feature works, then tell it to update the documentation accordingly.
And how do you verify its output isn't total fabrication?
I really enjoyed the process. As TFA says, you have to keep a close eye on it. But the whole process was a lot less effort, and I ended up doing more than I would otherwise have done.
What type of documents do you have explaining the codebase and its messy interactions, and have you provided that to the LLM?
Also, have you tried giving someone brand new to the team the exact same task and information you gave to the LLM, and how effective were they compared to the LLM?
> I don't know how much better Claude is than ChatGPT, but I can't get ChatGPT to do much useful with an existing large codebase.
As others have pointed out, from your comment, it doesn't sound like you've used a tool dedicated for AI coding.
(But even if you had, it would still fail if you expect LLMs to do stuff without sufficient context).
E.g. macros exist in Clojure but not Python/JS, and I've definitely been plenty stumped by seeing them in the codebase. They tend to be used in very "clever" patterns.
On the other hand, I'm a bit surprised Claude can tackle a complex Clojure codebase. It's been a while since I attempted using an LLM for Clojure, but at the time it failed completely (I think because there is relatively little training data compared to other mainstream languages). I'll have to check that out myself
Commercial codebases, especially private internal ones, are often messy. It seems this is mostly due to the iterative nature of development in response to customer demands.
As a product gets larger, and addresses a wider audience, there’s an ever increasing chance of divergence from the initial assumptions and the new requirements.
We call this tech debt.
Combine this with a revolving door of developers, and you start to see Conway’s law in action, where the system resembles the organization of the developers rather than the “pure” product spec.
With this in mind, I’ve found success in using LLMs to refactor existing codebases to better match the current requirements (i.e. splitting out helpers, modularizing, renaming, etc.).
Once the legacy codebase is “LLMified”, the coding agents seem to perform more predictably.
YMMV here, as it’s hard to do large refactors without tests for correctness.
(Note: I’ve dabbled with a test first refactor approach, but haven’t gone to the lengths to suggest it works, but I believe it could)
Claude by default, unless I tell it not to, will write stuff like:

  // we need something to be true
  somethingPasses = something()
  if (!somethingPasses) {
    return false
  }
  // we need somethingElse to be true
  somethingElsePasses = somethingElse()
  if (!somethingElsePasses) {
    return false
  }
  return true

instead of the very simple boolean logic that could express this in one line, with the "this code does what it obviously does" comments added all over the place.
generally unless you tell it not to, it does things in very verbose ways that most humans would never do, and since there's an infinite number of ways that it can invent absurd verbosity, it is hard to preemptively prompt against all of them.
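For comparison, here is roughly what the one-line version would look like next to the verbose shape from the snippet above. It is rendered in Python rather than the original's JavaScript-like pseudocode, and `something` and `something_else` are hypothetical stand-ins for whatever the real checks are.

```python
def something():
    # Stand-in for the first check (hypothetical).
    return True


def something_else():
    # Stand-in for the second check (hypothetical).
    return False


def verbose_check():
    # The step-by-step shape from the snippet above.
    something_passes = something()
    if not something_passes:
        return False
    something_else_passes = something_else()
    if not something_else_passes:
        return False
    return True


def concise_check():
    # Same behavior: `and` short-circuits, so something_else() is only
    # called when something() is truthy, exactly as in the verbose version.
    return bool(something() and something_else())
```

Both functions call the checks in the same order and return the same result; the concise one just says so in a single expression.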
to be clear, I am getting a huge amount of value out of it for executing a bunch of large refactors and "modernization" of a (really) big legacy codebase at scale and in parallel. but it's not outputting the sort of code that I see when someone prompts it "build a new feature ...", and a big part of my prompts is screaming at it not to do certain things or to refuse the task if it at any point becomes unsure.
Meaning if you ask it “handle this new condition” it will happily throw in a hacky conditional and get the job done.
I’ve found the most success in having it reason about the current architecture (explicitly), and then to propose a set of changes to accomplish the task (2-5 ways), review, and then implement the changes that best suit the scope of the larger system.
The LLM is onboarding to your codebase with each context window, all it knows is what it’s seen already.
Which is to say you have to learn to use the tools. I've only just started, and cannot claim to be an expert. I'll keep using them - in part because everyone is demanding I do - but to use them you clearly need to know how to do it yourself.
I also find pointing it to an existing folder full of code that conforms to certain standards can work really well.
There's basically a "brainstorm" /slash command that you go back and forth with, and it places what you came up with in docs/plans/YYYY-MM-DD-<topic>-design.md.
Then you can run a "write-plan" /slash command on the docs/plans/YYYY-MM-DD-<topic>-design.md file, and it'll give you a docs/plans/YYYY-MM-DD-<topic>-implementation.md file that you can then feed to the "execute-plan" /slash command, where it breaks everything down into batches, tasks, etc, and actually implements everything (so three /slash commands total.)
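The file naming described above is simple enough to sketch. This is not how the plugin itself builds paths, just an illustration of the `docs/plans/YYYY-MM-DD-<topic>-design.md` and `-implementation.md` pattern; the function name and the slug rule (lowercase, hyphen-separated) are assumptions.

```python
from datetime import date
import re


def plan_path(topic, kind, day=None):
    """Build docs/plans/YYYY-MM-DD-<topic>-<kind>.md.

    kind is "design" or "implementation", matching the two files described
    above. The slug rule (lowercase, hyphens) is an assumption.
    """
    if kind not in ("design", "implementation"):
        raise ValueError(f"unknown plan kind: {kind!r}")
    day = day or date.today()
    slug = re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")
    return f"docs/plans/{day.isoformat()}-{slug}-{kind}.md"
```

For example, `plan_path("Auth Refactor", "design", date(2026, 1, 27))` gives `docs/plans/2026-01-27-auth-refactor-design.md`.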
There's also "GET SHIT DONE" (GSD) [1] that I want to look at, but at first glance it seems to be a bit more involved than Superpowers with more commands. Maybe it'd be better for larger projects.
2. Put your important dependencies' source code in the same directory. E.g. put a `_vendor` directory in the project, and in it put each codebase at the same tag you're using: postgres, redis, vue, whatever.
3. Write good plans and requirements. Acceptance criteria, context, user stories, etc. Save them in markdown files. Review those multiple times with LLMs trying to find weaknesses. Then move to implementation files: make it write a detailed plan of what it's gonna change and why, and what it will produce.
4. Write very good prompts. LLMs follow instructions well if they are clear. "You should proactively do X" is a weak instruction if you mean "you must do X".
5. LLMs are far from perfect, and full of limits. Karpathy sums their cons very well in his long list. If you don't know their limits you'll mismanage the expectations and not use them when they are a huge boost and waste time on things they don't cope well with. On top of that: all LLMs are different in their "personality", how they adhere to instruction, how creative they are, etc.
After you tried it, come back.
I tried a website which offered the Opus model in their agentic workflow & I felt something different too I guess.
Currently trying out Kimi code (using their recent kimi 2.5), the first time I've bought any AI product, because I got it for like 1.49$ per month. It does feel a bit less powerful than claude code but I feel like monetarily it's worth it.
Y'know you have to like bargain with an AI model to reduce its pricing which I just felt really curious about. The psychology behind it feels fascinating because I think even as a frugal person, I already felt invested enough in the model and that became my sunk cost fallacy
Shame for me personally, because they use it as a hook to get people using their tool and then charge 19$ the next month (I mean, really cheaper than claude code for the most part, but still, compared to 1.49$)
I guess this is fine when you don’t have customers or stakeholders that give a shit lol.
AI assisted coding has never been like that, which would be atrocious. The typical workflow was using Cursor with some model of your choice (almost always an Anthropic model like sonnet before opus 4.5 released). Nowadays (in addition to IDEs) it's often a CLI tool like Claude Code with Opus or Codex CLI with GPT Codex 5.2 high/xhigh.
If you're using plain vanilla chatgpt, you're woefully, woefully out of touch. Heck, even plain claude code is now outdated
At a base level, people are “upgrading” their Claude Code with custom skills and subagents - all text files saved in .claude/agents|skills.
You can also use their new tasks primitive to basically run a Ralph-like loop
But at the edges, people are using multiple instances, each handling different aspects in parallel - stuff like Gas Town
Tbf you can still get a lot of mileage out of vanilla Claude Code. But I’ve found that even adding a simple frontend design skill improves the output substantially
Anthropic’s own repo is as good a place as any
I have a professor who has researched auto generated code for decades and about six months ago he told me he didn't think AI would make humans obsolete but that it was like other incremental tools over the years and it would just make good coders even better than other coders. He also said it would probably come with its share of disappointments and never be fully autonomous. Some of what he said was a critique of AI and some of it was just pointing out that it's very difficult to have perfect code/specs.
Billionaire coder: a person who has "written" a billion lines.
Ordinary coders: people with only a couple of thousand to their git blame.
> I am bracing for 2026 as the year of the slopacolypse across all of github, substack, arxiv, X/instagram, and generally all digital media
It has arrived. Github will be most affected thanks to git-terrorists at Apna College refusing to take down that stupid tutorial. IYKYK.
He ran Tesla's ML division, but still doesn't know what a simple Kalman filter is (in the sense that he claimed lidar would be hard to integrate with cameras).
I actually disagree with Andrej here re: "Generation (writing code) and discrimination (reading code) are different capabilities in the brain." and I would argue that the only reason he can read code fluently, find issues, etc. is because he has spent years in a non-AI-assisted world writing code. As time goes on, he will become substantially worse.
This also bodes incredibly poorly for the next generation, who will now mostly avoid writing code in their formative years and thus fail to even develop an idea of what good code is, how it works/why it works, why you make certain decisions and not others, etc. Ultimately you will see them become utterly dependent on AI, unable to make progress without it.
IMO outsourcing thinking is going to have incredibly negative consequences for the world at large.
1. hand arithmetic -> using a calculator
2. assembly -> using a high level language
3. writing code -> making an LLM write code
Number 3 does not belong. Number 3 is a fundamentally different leap because it's not based on deterministic logic. You can't depend on an LLM like you can depend on a calculator or a compiler. LLMs are totally different.
It often doesn't work. That's the point. A calculator works 100% of the time. An LLM might work 95% of the time, or 80%, or 40%, or 99% depending on what you're doing. This is the difference, and it's a key one.
It doesn't matter how good you are at calculations the answer to 2 + 2 is always 4. There are no methods of solving 2 + 2 which could result in you accidentally giving everyone who reads the result of your calculation write access to your entire DB. But there are different ways to code a system even if the UI is the same, and some of these may neglect to consider permissions.
I think a good parallel here would be to imagine that tomorrow we had access to humanoid robots who could do construction work. Would we want them to just go build skyscrapers and bridges and view all construction businesses which didn't embrace the humanoid robots as akin to doing arithmetic by hand?
You could of course argue that there's no problem here so long as trained construction workers are supervising the robots to make sure they're getting tolerances right and doing good welds, but then what happens 10 years down the road when humans haven't built a building in years? If people are not writing code any more then how can people be expected to review AI generated code?
I think the optimistic picture here is that humans just won't be needed in the future. In theory when models are good enough we should be able to trust the AI systems more than humans. But the less optimistic side of me questions a future in which humans no longer do, or even know how to do such fundamental things.
It's going to feel literally like playing God, where you type in what you want and it happens ~instantly.
- "OpenAI is partnering with Cerebras to add 750MW of ultra low-latency AI compute"
- Sam Altman saying that users want faster inference more than lower cost in his interview.
- My understanding that many tasks are serial in nature.
This is about where I'm at. I love pure claude code for code I don't care about, but for anything I'm working on with other people I need to audit the results - which I much prefer to do in an IDE.
No doubt that good engineers will know when and how to leverage the tool, both for coding and for improving processes (design-to-code, requirement collection, task tracking, basic code review, etc), improving their own productivity and that of those around them.
Motivated individuals will also leverage these tools to learn more and faster.
And yes, of course it's not the only tool one should use, of course there's still value in talking with proper human experts to learn from, etc, but 90% of the time you're looking for info the LLM will dig it up for you by reading the source code of e.g. Postgres and its tests, rather than you asking on chats/Stack Overflow.
This is a transformative technology that will make great engineers even stronger, but it will weed out those who were merely valued for their very basic capability of churning something out but never cared about either engineering or coding, which is 90% of our industry.
I'm honestly considering throwing away my JetBrains subscription and this is year 9 or 10 of me having one. I only open Zed and start yappin' at Claude Code. My employer doesn't even want me using ReSharper because some contractor ruined it for everyone else by auto running all code suggestions and checking them in blindly, making for really obnoxious code diffs and probably introducing countless bugs and issues.
Meanwhile tasks that I know would take any developer months, I can hand-craft with Claude in a few hours, with the same level of detail, but no endless weeks of working on things that'll be done SoonTM.
This makes it sound like we're back in the days of FrontPage/Dreamweaver WYSIWYG. Goodness.
Slopacolypse. I am bracing for 2026 as the year of the slopacolypse across all of github, substack, arxiv, X/instagram, and generally all digital media.
Did he coin the term "slopacolypse"? It's a useful one.
Adding/prompting features one by one, reviewing code and then testing the resulting binary feels like the new programming workflow.
Prompt/REview/Test - PRET.
I can supervise maybe three agents in parallel before a task requiring significant hand-holding means I'm likely blocking an agent.
And the time an agent is 'restlessly working' on something is usually inversely correlated with the likelihood of success. Usually if it's going down a rabbit hole, the correct thing to do is to intervene and reorient it.
I'm still a little iffy on the agent swarm idea. I think I will need to see it in action in an interface that works for me. To me it feels like we are anthropomorphizing agents too much, and that results in this idea that we can put agents into roles and then combine them into useful teams. I can't help seeing all agents as the same automatons and I have trouble understanding why giving an agent different guidelines to follow, and then having it follow along after another agent, would give me better results than just fixing the context in the first place. Either that or just working more on the code pipeline to spot issues early on - all the stuff we already test for.
For as fast as this is all moving, it's good to remember that most of us are actually a lot closer to the tip of the spear than we think.
I expect interviews will evolve into "build project X with an LLM while we watch" and audit of agent specs
fun stats: correlation is real, people who were good at vibe coding also had offer(s) from other companies that didn't run vibe code interviews.
It doesn’t work; you can’t be productive without an agent capable of running queries against the db etc
OP mentions that they are actually doing the “babysitting”
use many simultaneously, and bounce between them to unblock them as needed
build good tools and tests. you will soon learn all the things you did manually -- script them all
We’re about a year deep into “AI is changing everything” and I don’t see 10x software quality or output.
Now don’t get me wrong I’m a big fan of AI tooling and think it does meaningfully increase value. But I’m damn tired of all the talk with literally nothing to show for it or back it up.
Interesting.
Any qualified guesses?
I'm not convinced more traders on wall street will allocate capital more effectively leading to economic growth.
Will more programmers grow the economy? Or should we get real jobs ;)
It does hurt, that's why all programmers now need an entrepreneurial mindset... you become if you use your skills + new AI power to build a business.
I've been increasingly using LLMs to code for nearly two years now - and I can definitely notice my brain atrophy. It bothers me. Actually over the last few weeks I've been looking at a major update to a product in production & considered doing the edits manually - at least typing out the code from the LLM & also being much more granular with my instructions (i.e. focus on one function at a time). I feel in some ways like my brain is turning into slop & I've been coding for at least 35 years... I feel validated by Karpathy.
1. Manual coding may be less relevant (albeit ability to read code, interpret it and understand it will be more) in the future. Likely already is.
2. Any skill you don't practice becomes "weaker". Gonna give you an example. I've played chess since childhood, but sometimes I go months without playing it, even years. When I get back I start losing Elo fast. If I was in the top 10% of chess.com, I drop to top 30% in the weeks after. But after a few months I'm back at top 10%. Takeaway: your relative ability is more or less the same compared to other practitioners, you're simply rusty.
The bits left unsaid:
1. Burning tokens, which we charge you for
2. My CPU does this when I tell it to do bogosort on a million 32-bit integers, it doesn't mean it's a good thing
> TLDR
This should be at the start?
I actually have been thinking of trying out ClaudeCode/OpenCode over this past week… can anyone provide experience, tips, tricks, ref docs?
My normal workflow is using Free-tier ChatGPT to help me interrogate or plan my solution/approach or to understand some docs/syntax/best practice with which I’m not familiar, then doing the implementation myself.
Anyone wondering what exactly is he actually building? What? Where?
> The mistakes have changed a lot - they are not simple syntax errors anymore, they are subtle conceptual errors that a slightly sloppy, hasty junior dev might do.
I would LOVE to have just syntax errors produced by LLMs, "subtle conceptual errors that a slightly sloppy, hasty junior dev might do." are neither subtle nor slightly sloppy, they actually are serious and harmful, and junior devs have no experience to fix those.
> They will implement an inefficient, bloated, brittle construction over 1000 lines of code and it's up to you to be like "umm couldn't you just do this instead?"
Why not just hand-write 100 loc with the help of an LLM for tests, documentation and some autocomplete instead of making it write 1000 loc and then clean it up? Also very difficult to do, 1000 lines is a lot.
> Tenacity. It's so interesting to watch an agent relentlessly work at something. They never get tired, they never get demoralized, they just keep going and trying things where a person would have given up long ago to fight another day.
It's a computer program running in the cloud, what exactly did he expect?
> Speedups. It's not clear how to measure the "speedup" of LLM assistance.
See above
> 2) I can approach code that I couldn't work on before because of knowledge/skill issue. So certainly it's speedup, but it's possibly a lot more an expansion.
mmm not sure, if you don't have domain knowledge you could take an initial stab at the problem, but what about when you need to iterate on it? You can't if you don't have domain knowledge of your own
> Fun. I didn't anticipate that with agents programming feels more fun because a lot of the fill in the blanks drudgery is removed and what remains is the creative part.
No, it's not fun, e.g. LLMs produce uninteresting UIs, mostly bloated with React/HTML
> Atrophy. I've already noticed that I am slowly starting to atrophy my ability to write code manually.
My bet is that sooner or later he will get back to coding by hand for periods of time to avoid that, like many others; the damage that overreliance on these tools brings is serious.
> Largely due to all the little mostly syntactic details involved in programming, you can review code just fine even if you struggle to write it.
No, programming is not "syntactic details"; the practice of programming is everything but "syntactic details". One should learn how to program, not language X or Y
> What happens to the "10X engineer" - the ratio of productivity between the mean and the max engineer? It's quite possible that this grows a lot.
Yet no measurable economic effects so far
> Armed with LLMs, do generalists increasingly outperform specialists? LLMs are a lot better at fill in the blanks (the micro) than grand strategy (the macro).
Did people with a smartphone outperform photographers?
All of the real world code I have had to review created by AI is buggy slop (often with subtle, but weird bugs that don't show up for a while). But on HN I'm told "this is because your co-workers don't know how to AI right!!!!" Then when someone who supposedly must be an expert in getting things done with AI posts, it's always big claims with hand-wavy explanations/evidence.
Then the comments section is littered with no effort comments like this.
Yet oddly whenever anyone asks "show me the thing you built?" Either it looks like every other half-working vibe coded CRUD app... or it doesn't exist/can't be shown.
If you tell me you have discovered a miracle tool, just show me the results. Not taking increasingly ridiculous claims at face value is not "fear". What I don't understand is where comments like yours come from? What makes you need this to be more than it is?
I've worked extensively in the AI space, and believe that it is extremely useful, but these weird claims (even from people I respect a lot) that "something big and mysterious is happening, I just can't show you yet!" set off my alarms.
When sensible questions are met with ad hominems by supporters it further sets off alarm bells.
They have to maintain the hype until a somewhat credible exit appears and therefore lash out with boomer memes, FOMO, and the usual insane talking points like "there are builders and coders".
>Anyone wondering what exactly is he actually building? What? Where?
this is trivially answerable. it seems like they did not do even the slightest bit of research before asking question after question to seem smart and detailed.
On the contrary, if it was for a job in the public sector I would just let the LLM spit out some output and play stupid, since the salary is very low.
as the former, i've never felt _more ahead_ than now due to all of the latter succumbing to the llm hype
If current LLMs are ever deployed in systems harboring the big red button, they WILL most definitely somehow press that button.
If instead we believe in fantasies of a single all-knowing machine god that is 100% correct at all times, then... we really just have ourselves to blame. Might as well just have spammed that button by hand.
> "IDEs/agent swarms/fallability. Both the "no need for IDE anymore" hype and the "agent swarm" hype is imo too much for right now. The models definitely still make mistakes and if you have any code you actually care about I would watch them like a hawk, in a nice large IDE on the side. The mistakes have changed a lot - they are not simple syntax errors anymore, they are subtle conceptual errors that a slightly sloppy, hasty junior dev might do. The most common category is that the models make wrong assumptions on your behalf and just run along with them without checking. They also don't manage their confusion, they don't seek clarifications, they don't surface inconsistencies, they don't present tradeoffs, they don't push back when they should, and they are still a little too sycophantic. Things get better in plan mode, but there is some need for a lightweight inline plan mode. They also really like to overcomplicate code and APIs, they bloat abstractions, they don't clean up dead code after themselves, etc. They will implement an inefficient, bloated, brittle construction over 1000 lines of code and it's up to you to be like "umm couldn't you just do this instead?" and they will be like "of course!" and immediately cut it down to 100 lines. They still sometimes change/remove comments and code they don't like or don't sufficiently understand as side effects, even if it is orthogonal to the task at hand. All of this happens despite a few simple attempts to fix it via instructions in CLAUDE . md. Despite all these issues, it is still a net huge improvement and it's very difficult to imagine going back to manual coding. TLDR everyone has their developing flow, my current is a small few CC sessions on the left in ghostty windows/tabs and an IDE on the right for viewing the code + manual edits."
As an added plus: those who already have wealth will benefit the most, instead of the masses. Since the distribution and dissemination of new projects is at the same level as before, you would still need a lot of money. So no matter how clever you are with an LLM, if you don't have the means to distribute it you will be left in the dirt.
Writing code in many cases is faster to me than writing English (that is how PLs are designed, btw!) LLM/agentic is very “neat” but still a toy to the professional, I would say. I doubt reports like this one. For those of us building real world products with shelf-lives (Is Andrej representative of this archetype?), I just don’t see the value-add touted out there. I’d love to be proven wrong. But writing code (in code, not English), to me and many others, is still faster than reading/proving it.
I think there’s a combination of fetishizing and Stockholm syndroming going on in these enthusiastic self-reports. PMW.
I started by copy pasting more and more stuff in chatgpt. Then using more and more in-IDE prompting, then more and more agent tools (Claude etc). And suddenly I realise I barely hand code anymore
For sure there's still a place for manual coding, especially schemas/queries or other fiddly things where a tiny mistake gets amplified, but the vast majority of "basic work" is now just prompting, and honestly the code quality is _better_ than it was before, all kinds of refactors I didn't think about or couldn't be bothered with have almost automatically happened
And people still call them stochastic parrots
ChatGPT 3.5/4 (2023-2024): The chat interface was verbose and clunky and it was just... wrong... like 70+% of the time. Not worth using.
CoPilot autocomplete and Gitlab Duo and Junie (late 2024-early 2025): Wayyy too aggressive at guessing exactly what I wasn't doing and hijacked my tab complete when pre-LLM type-tetris autocomplete was just more reliable.
Copilot Edit/early Cursor (early 2025): Ok, I can sort of see uses here but god is picking the right files all the time such a pain as it really means I need to have figured out what I wanted to do in such detail already that what was even the point? Also the models at that time just quickly descended into incoherency after like three prompts, if it went off track good luck ever correcting it.
Copilot Agent mode / Cursor (late 2025): Ok, great, if the task is narrowly scoped, and I'm either going to write the tests for it or it's refactoring existing code, it could do something. Something mechanical, like the library has a migration where we need to replace the use of methods A/B/C with a different combination of X/Y/Z. Great, it can do that. Or like CRUD controller #341. I mean, sure, if my boss is going to pay for it, but not life changing.
Zed Agent mode / Cursor agent mode / Claude code (early 2026): Finally something where I can like describe the architecture and requirements of a feature, let it code, review that code, give it written instructions on how to clean it up / refactor / missing tests, and iterate.
But that was like 2 years of "really it's better and revolutionary now" before it actually got there. Now maybe in some languages or problem domains, it was useful for people earlier but I can understand people who don't care about "but it works now" when they're hearing it for the sixth time.
And I mean, what one hand gives the other takes away. I have a decent amount of new work dealing with MRs from my coworkers where they just grabbed the requirements from a stakeholder, shoved it into Claude or Cursor and it passed the existing tests and it's shipped without much understanding. When they wrote them themselves, they tested it more and were more prepared to support it in production...
Both can be true. You're tapping into every line of code publicly available, and your day-to-day really isn't that unique. They're really good at this kind of work.
Empowering people to do 10 times as much as they could before means they hit 100 times the roadblocks. Again, in a lot of ways we've already lived in that reality for the past many years. On a task-by-task basis programming today is already a lot easier than it was 20 years ago, and we just grew our desires and the amount of controls and process we apply. Problems arise faster than solutions. Growing our velocity means we're going to hit a lot more problems.
I'm not saying you're wrong, so much as saying, it's not the whole story and the only possibility. A lot of people today are kept out of programming just because they don't want to do that much on a computer all day, for instance. That isn't going to change. There's still going to be skills involved in being better than other people at getting the computers to do what you want.
Also on a long term basis we may find that while we can produce entry-level coders that are basically just proxies to the AI by the bucketful that it may become very difficult to advance in skills beyond that, and those who are already over the hurdle of having been forced to learn the hard way may end up with a very difficult to overcome moat around their skills, especially if the AIs plateau for any period of time. I am concerned that we are pulling up the ladder in a way the ladder has never been pulled up before.
The juniors though will radically have to upskill. The standard junior dev portfolio can be replicated by claude code in like three prompts
The game has changed and I don't think all the players are ready to handle it
I personally think the barrier is going to get higher, not lower. And we will be back to being expected to do more.
Day after day the global quality of software and learning resources will degrade as LLM grey goo consumes every single nook and cranny of the Internet. We will soon see the first signs of pure cargo cult design patterns, conventions and schemes that LLMs made up and then regurgitated. Only people who learned before LLMs became popular will know that they are not to be followed.
People who aren't learning to program without LLMs today are getting left behind.
That is assuming that LLMs plateau in capability, if they haven't already, which I think is highly likely.
It's the opposite: now, in addition to all the other skills, you need the skill of handling giant codebases of vibe-coded mess using AI.
Quite insightful.
Nevermind the fact he became successful _because_ of his skills and his brain.
It's such a visual and experiential thing that writing true success criteria it can iterate on seems like borderline impossible ahead of time.
Or slower, when the LLM doesn't understand what I want, which is a bigger issue when you spawn experiments from scratch (and have given limited context around what you are about to do).
Like, do these guys actually dog food real user experience, or are they all admins with the fast lane to the real model while everyone outside the org has to go through the 10 layers of model shedding, caching and other means and methods of saving money?
We all know these models are expensive as fuck to run and these companies are degrading service, A+B testing, and the rest. Do they actually ponder these things directly?
Just always seems like people are on drugs when they talk about the capabilities, and like, the drugs could be pure shit (good) or ditch weed, and we all just act like the pipeline for drugs is a consistent thing but it's really not, not at this stage where they're all burning cash through infrastructure. Definitely, like drug dealers, you know they're cutting the good stuff with low cost cached gibberish.
Can confirm. My partner's ChatGPT wouldn't return anything useful for her given a specific query involving web use, while I got the desired result sitting side by side. She contacted support and they said there was nothing they could do about it, her account is in an A/B test group with some features removed. I imagine this saves them considerable resources despite still billing customers for them.
how much this is occurring is anyone's guess
The underlying models are all actually really undifferentiated under the covers except for the post-training and base prompts. If you eliminate the base prompts the models behave near identically.
A conspiracy would be a helluva lot more interesting and fun, but I've spoken to these folks firsthand and it seems they already have enough challenges keeping the beast running.
Great idea! Let's pathologize another thing! I love quickly othering whole concepts and putting them in my brain's "bad" box so I can feel superior.
https://github.com/karpathy/llm.c
The proof is in the pudding. Let's see your code
He said “…who has never written any production software…” yet you show toy projects instead.
Well done.
HN used to be a proper place for people actually curious about technology
Otherwise, I think you're incidentally right, your "ego" /is/ bruised, and you're looking for a way out by trying to prognosticate on the future of the technology. You're failing in two different ways.