I have never heard anybody successfully using LLMs say this before. Most of what I've learned from talking to people about their workflows is counterintuitive and subtle.
It's a really weird way to open up an article concluding that LLMs make one a worse programmer: "I definitely know how to use this tool optimally, and I conclude the tool sucks". Ok then. Also: the piano is a terrible, awful instrument; what a racket it makes.
I find it funny when people ask me if it's true that they can build an app using an LLM without knowing how to code. I think of this... that it took me months before I started feeling like I "got it" with fitting LLMs into my coding process. So, not only do you need to learn how to code, but getting to the point that the LLM feels like a natural extension of you has its own timeline on top.
What does this even mean?
In the first year and a half after ChatGPT was released, the rate at which these tools lied to me was 100%, so I completely missed this honeymoon phase. The first time one of them answered without problems was about 2 months ago, and that was also the first time one of them (ChatGPT) answered better than Google/Kagi/DDG could. Even yesterday, I tried to force Claude Opus to answer when the next concert at Arena Wien is, and it failed miserably. I tried other models from Anthropic too, and all failed. It successfully parsed the venue's page of upcoming events, then failed miserably anyway. Sometimes it answered with events from the past, sometimes with events in October. The closest it got was 21 August. When I asked what's on 14 August, it apologized and admitted I was right. When I asked about “events”, it simply ignored all of the movie nights. When I asked about them specifically, it was as if I had started a new conversation.
The only time they produced anything comparable to my code in quality was when they had a ton of examples of tests that looked almost the same. Even then, they made mistakes… in a case where I basically had to change two lines, so copy-pasting would have been faster.
There was an AI advocate here who was so confident in his AI skills that he did exactly the thing most people here try to avoid: he recorded how he works with AIs. Here is the catch: he showed the same situation. There were already examples, and he needed only minimal modifications for the new code. Even then, copy-pasting would have been quicker and would have contained fewer mistakes… mistakes which he kept in the code, because it didn't fail right away.
I feel like every time I write a prompt or use a new tool, I'm experimenting with how to make fire for the first time. It's not to say that I'm bad at it. I'm probably better than most people. But knowing how to use this tool is by far the largest challenge, in my opinion.
Insane stuff. It’s clear you can’t review that many changes in a day, so you’re just flooding your code base with code that you barely read.
Or is your job just re-doing the same boilerplate over and over again?
> ...thousands of lines of code ... quite high quality
A contradiction in terms.
Edit: it is Sunday. As I relax and spend time writing answers on HN, I'm keeping a lazy eye on the progress of an LLM at work too. I got stuff done that would have taken me a few days of work by just clicking a "Continue" button now and then.
That's a wild statement. I'm now extremely productive with LLMs in my core codebases, but it took a lot of practice to get it right and repeatable. There's a lot of little contextual details you need to learn how to control so the LLM makes the right choices.
Whenever I start working in a new code base, it takes a non-trivial amount of time to ramp back up to full LLM productivity.
I am still hesitant to use AI to solve problems for me. Either it hallucinates and misleads me, or it does a great job and I worry that my ability to reason through complex problems with rigor will degenerate. Once my ability to solve complex problems has degenerated, my patience has diminished, and my attention span is destroyed, I will have become reliant on a service that other entities own just to perform in my daily life. Genuine question - are people comfortable with this?
My comment is specifically in contrast to working in a codebase where I'm at "max AI productivity". In a new codebase, it just takes a bit of time to work out kinks and figure out tendencies of the LLMs in those codebases. It's not that I'm slower than I'd be without AI, I'm just not at my "usual" AI-driven productivity levels.
You use it when you know how to do something and know exactly what the solution looks like, but can't be arsed to do it. Like most UI work where you just want something in there with the basic framework to update content etc. There's nothing challenging in doing it, you know what has to be done, but figuring out the weird-ass React footguns takes time. Most LLMs can one-shot it with enough information.
You can also use it as a rubber duck, ask it to analyse some code, read and see if you agree. Ask for improvements or modifications, read and see if you agree.
It's a question of degree, but in general, yeah. I'm totally comfortable being reliant on other entities to solve complex problems for me.
That's how economies work [1]. I neither have nor want to acquire the lifetime of experience I would need to learn how to produce the tea leaves in my tea, or the clean potable water in it, or the mug they are contained within, or the concrete walls 50 meters up from ground level I am surrounded by, or so on and so forth. I can live a better life by outsourcing the need for this specialized knowledge to other people, and trade with them in exchange for my own increasingly-specialized knowledge. Even if I had 100 lifetimes to spend, and not the 1 I actually have, I would probably want to put most of them to things that, you know, aren't already solved-enough problems.
Everyone doing anything interesting works like this, with vanishingly few exceptions. My dad doesn't need to know how to do algebra to get his taxes done, he just has an accountant. And his accountant doesn't need to know how to rewire his turn of the century New England home. And if you look at the exceptions, like that really cute 'self sufficient' family who uploads weekly YouTube videos called "Our Homestead Life"... It often turns out that the revenue from that YouTube stream is nontrivial to keeping the whole operation running. In other words, even if they genuinely no longer go to Costco, it's kind of a gyp.
This is not quite the same thing. The AI is not perfect, it frequently makes mistakes or suboptimal code. As a software engineer, you are responsible for finding and fixing those. This means you have to review and fully understand everything that the AI has written.
Quite a different situation than your dad and his accountant.
Is the accountant perfect? If not, then they must themselves make mistakes or do things suboptimally sometimes. Whose responsibility is that - my dad, or my dad's accountant?
If it is my dad, does that then mean my dad has an obligation to review and fully understand everything the accountant has written?
And do we have to generalize that responsibility to everything and everyone my dad has to hand off work to in order to get something done? Clearly not, that's absurd. So where do we draw the line? You draw it in the same place I do for right now, but I don't see why we expect that line to be static.
Yes, and people who care and are knowledgeable do this already. I do this, for one.
Writing code should never have been a bottleneck. And since it wasn’t, any massive gains are due to being OK with trusting the AI.
And so if you don't use it then someone else will... But as for the models, we already have some pretty good open source ones like Qwen, and it'll only get better from here, so I'm not sure why the last part would be a dealbreaker.
Getting 80% of the benefit of LLMs is trivial. You can ask it for some functions or to write a suite of unit tests and you’re done.
The last 20%, while possible to attain, is ultimately not worth it for the amount of time you spend in context hells. You can just do it yourself faster.
I'm arguing that there's a skill that has to be learned in order to break through this. As you start in a new code base, you should be quick to jump in when you hit that 20%. But, as you spend more time in it, you learn how to avoid the same "context hell" issues and move that number down to 15%, 10%, 5% of the time.
You're still going to need to jump in, but when you can learn to get the LLM to write 95% of the code for you, that's incredibly powerful.
I’m making this a bit contrived, but I’m simplifying it to demonstrate the underlying point.
When an LLM is 80% effective, I’m limited to doing 5 things in parallel, since I still need to jump in 20% of the time.
When an LLM is 90% effective, I can do 10 things at once. When it’s 95%, 20 things. 99%, 100 things.
Now, obviously I can’t actually juggle 10 or 20 things at once. However, the point is there are massive productivity gains to be had when you can reduce your involvement in a task from 20% to even 10%. You’re effectively 2x as productive.
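The arithmetic behind those numbers is just the reciprocal of your intervention rate. A quick back-of-the-envelope sketch (the rates mirror the ones above):

```
# rough parallel-capacity estimate: intervening a fraction f of the time
# leaves room for roughly 1/f interleaved tasks
for f in (0.20, 0.10, 0.05, 0.01):
    print(f"intervention rate {f:.0%} -> ~{1 / f:.0f} tasks in parallel")
```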
Or do you mean you are using long running agents to do tasks and then review those? I haven't seen such a workflow be productive so far.
Right now, I still need to intervene enough that I'm not actually doing a second coding project in parallel. I tend to focus on communication, documentation, and other artifacts that support the code I'm writing.
However, I am very close to hitting that point and occasionally do on easier tasks. There's a _very_ real tipping point in productivity when you have confidence that an LLM can accomplish a certain task without your intervention. You can start to do things legitimately in parallel when you're only really reviewing outputs and doing minor tweaks.
The problem is that you're learning a skill that will need refinement each time you switch to a new model. You will redo some of this learning on each new model you use.
This actually might not be a problem anyway, as all the models seem to be converging asymptotically towards "programming".
The better they do on the programming benchmarks, the further away from AGI they get.
> Whenever I start working in a new code base, it takes a non-trivial amount of time to ramp back up to full LLM productivity.
Do you find that these details translate between models? Sounds like it doesn't translate across codebases for you?
I have mostly moved away from this sort of fine-tuning approach because of experience a while ago with OpenAI's ChatGPT 3.5 and 4. Extra work on my end that was necessary with the older model wasn't needed with the new one, and sometimes it counterintuitively caused worse performance by pointing the model at the way I'd do it rather than the way it might have the best luck with. ESPECIALLY for the sycophantic models, which will heavily index on "if you suggested that this thing might be related, I'll figure out some way to make sure it is!"
So more recently I generally stick to the "we'll handle a lot of the prompt nitty-gritty for you" IDE or CLI agent stuff, but I find they still fall apart with large complex codebases, and also that the tricks don't translate across codebases.
* Business context - these are things like code quality/robustness, expected spec coverage, expected performance needs, domain specific knowledge. These generally translate well between models, but can vary between code bases. For example, a core monolith is going to have higher standards than a one-off auxiliary service.
* Model focuses - Different models have different tendencies when searching a code base and building up their context. These are specific to each code base, but relatively obvious when they happen. For example, in one code base I work in, one model always seems to pick up our legacy notification system while another model happens to find our new one. It's not really a skill issue. It's just luck of the draw how files are named and how each of them search. They each just find a "valid" notification pattern in a different order.
LLMs are massively helpful for orienting to a new codebase, but it just takes some time to work out those little kinks.
Because for all our posturing about being skeptical and data driven we all believe in magic.
Those "counterintuitive non-trivial workflows"? They work about as well as just prompting "implement X" with no rules, agents.md, careful lists etc.
Because 1) literally no one actually measures whether the magical incantations work, and 2) it's impossible to make such measurements due to non-determinism.
Either I've wasted significant chunks of the past ~3 years of my life or you're missing something here. Up to you to decide which you believe.
I agree that it's hard to take solid measurements due to non-determinism. The same goes for managing people, and yet somehow many good engineering managers can judge if their team is performing well and figure out what levers they can pull to help them perform better.
I talk to extremely experienced programmers whose opinions I have valued for many years before the current LLM boom who are now flying with LLMs - I trust their aggregate judgement.
Meanwhile my own https://tools.simonwillison.net/colophon collection has grown to over 120 in just a year and a half, most of which I wouldn't have built at all - and that's a relatively small portion of what I've been getting done with LLMs elsewhere.
Hard to measure productivity on a "wouldn't exist" to "does exist" scale.
You want me to say "this stuff is really useful, here's why I think that. But lots of people on the internet have disagreed with me, here's links to their comments"?
> I talk to extremely experienced programmers whose opinions I have valued for many years before the current LLM boom who are now flying with LLMs - I trust their aggregate judgement.
but every time i've seen you comment on this website or other similar websites on the topic of using LLMs for coding, at least half of the responses you get express precisely the opposite perspective -- that they are not flying with LLMs at all. so i think it is disingenuous to make claims like that without at least acknowledging the differences in experience which are pretty clearly demonstrated
this wouldn't be particularly worth mentioning, if it weren't for the fact that you comment extensively, prolifically, on agentic coding topics, on this website and many others, and your comments are generally effusively and uncritically positive, no matter what responses you get, over time, from anyone
What in the wooberjabbery is this even.
List of single-commit LLM generated stuff. Vibe coded shovelware like animated-rainbow-border [1] or unix-timestamp [2].
Calling these tools seems to be overstating it.
1: https://gist.github.com/simonw/2e56ee84e7321592f79ceaed2e81b...
2: https://gist.github.com/simonw/8c04788c5e4db11f6324ef5962127...
I wrote more about it here: https://simonwillison.net/2024/Oct/21/claude-artifacts/ - and a lot of them have explanations in posts under my tools tag: https://simonwillison.net/tags/tools/
It might also be the largest collection of published chat transcripts for this kind of usage from a single person - though that's not hard since most people don't publish their prompts.
Building little things like this is a really effective way of gaining experience using prompts to get useful code results out of LLMs.
these are absolutely trivial, toy example programs. they've got nothing to do with anything that anyone is meaningfully talking about when they talk about using LLMs for coding stuff.
is this kind of stuff what you're referring to when you comment on using LLMs for programming?
clone, i dunno, https://github.com/minio/minio, and ask the LLM to implement a non-trivial feature -- this is what everyone else is talking about! not "implement a YAML to JSON converter in the browser"
100s of single commit AI generated trash in the likes of "make the css background blue".
On display.
Like it's something.
You can't be serious.
Also literally hundreds of smaller plugins and libraries and CLI tools, see https://github.com/simonw?tab=repositories (now at 880 repos, though a few dozen of those are scrapers and shouldn't count) and https://pypi.org/user/simonw/ (340 published packages).
Unlike my tools.simonwillison.net stuff the vast majority of those products are covered by automated tests and usually have comprehensive documentation too.
What do you mean by my script?
But it was already a warning before LLMs because, as you wrote, people are bad at measuring productivity (among many things).
I think many in the industry have absolutely no clue what they're doing and are bad at evaluating productivity, often prioritising short term delivery over longterm maintenance.
LLMs can absolutely be useful but I'm very concerned that some people just use them to churn out code instead of thinking more carefully about what and how to build things. I wish we had at least the same amount of discussions about those things I mentioned above as we have about whether Opus, Sonnet, GPT5 or Gemini is the best model.
I mean we do. I think programmers are more interested in long term maintainable software than its users are. Generally that makes sense, a user doesn't really care how much effort it takes to add features or fix bugs, these are things that programmers care about. Moreover the cost of mistakes of most software is so low that most people don't seem interested in paying extra for more reliable software. The few areas of software that require high reliability are the ones regulated or are sold by companies that offer SLAs or other such reliability agreements.
My observation over the years is that maintainability and reliability are much more important to programmers who comment in online forums than they are to users. It usually comes with the pride of work that programmers have but my observation is that this has little market demand.
Please talk to your users
It's quite possible you do. Do you have any hard data justifying the claims of "this works better", or is it just a soft fuzzy feeling?
> The same goes for managing people, and yet somehow many good engineering managers can judge if their team is performing well
It's actually really easy to judge if a team is performing well.
What is hard is finding what actually makes the team perform well. And that is just as much magic as "if you just write the correct prompt everything will just work"
---
wait. why are we fighting again? :) https://dmitriid.com/everything-around-llms-is-still-magical...
In this video (https://www.youtube.com/watch?v=EO3_qN_Ynsk) they present a slide by the company DX that surveyed 38,880 developers across 184 organizations, and found the surveyed developers claiming a 4 hour average time savings per developer per week. So all of these LLM workflows are only making the average developer 10% more productive in a given work week, with a bunch of developers getting less. Few developers are attaining productivity higher than that.
In this video by Stanford researchers actively researching productivity using GitHub commit data for private and public repositories (https://www.youtube.com/watch?v=tbDDYKRFjhk) they have a few very important data points in there:
1. They've found zero correlation between how productive respondents claim to be and how productive they actually measure as, meaning people are poor judges of their own productivity. This does refute the claims in my previous point, but only if you assume people are on average wildly more productive than they claim.
2. They have been able to measure an actual increase in rework and refactoring commits in the measured repositories as AI tools come into greater use in those organizations. So even with being able to ship things faster, they are observing an increased number of pull requests that need to fix those previous pushes.
3. They have measured that greenfield low complexity systems have pretty good measurements for productivity gains, but once you get more towards higher complexity systems or brownfield systems they start to measure much lower productivity gains, and even negative productivity with AI tools.
This goes hand in hand with this research paper: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o... in which experienced devs working on significant long-term projects lost productivity when using AI tools, while believing the AI tools were making them even more productive.
Yes, all of these studies have their flaws and nitpicks we can go over that I'm not interested in rehashing. However, there's a lot more data and studies that show AI having very marginal productivity boost compared to what people claim than vice versa. I'm legitimately interested in other studies that can show significant productivity gains in brownfield projects.
https://www.youtube.com/watch?v=tbDDYKRFjhk&t=4s is one of the largest studies I've seen so far and it shows that when the codebase is small or engineered for AI use, >20% productivity improvements are normal.
Really wild to hear someone say out loud "there's no learning curve to using this stuff".
He is actually recommending Copilot for price/performance reasons and his closing statement is "Don’t fall for the hype, but also, they are genuinely powerful tools sometimes."
So, it just seems like he never really gave a try at how to engineer better prompts that these more advanced models can use.
I've tried a few things that have mostly been positive. Starting with Copilot in-line "predictive text on steroids", which works really well. It's definitely faster and more accurate than me typing on a traditional IntelliSense IDE. For me, this level of AI is can't-lose: it's very easy to see if a few lines of prediction are what you want.
I then did Cursor for a while, and that did what I wanted as well. Multi-file edits can be a real pain. Sometimes, it does some really odd things, but most of the time, I know what I want, I just don't want to find the files, make the edits on all of them, see if it compiles, and so on. It's a loop that you have to do as a junior dev, or you'll never understand how to code. But now I don't feel I learn anything from it, I just want the tool to magically transform the code for me, and it does that.
Now I'm on Claude. Somehow, I get a lot fewer excursions from what I wanted. I can do much more complex code edits, and I barely have to type anything. I sort of tell it what I would tell a junior dev. "Hey let's make a bunch of connections and just use whichever one receives the message first, discarding any subsequent copies". If I was talking to a real junior, I might answer a few questions during the day, but he would do this task with a fair bit of mess. It's a fiddly task, and there are assumptions to make about what the task actually is.
Somehow, Claude makes the right assumptions. Yes, indeed I do want a test that can output how often each of the incoming connections "wins". Correct, we need to send the subscriptions down all the connections. The kinds of assumptions a junior would understand and come up with himself.
I spend a lot of time with the LLM critiquing, rather than editing. "This thing could be abstracted, couldn't it?" and then it looks through the code and says "yeah I could generalize this like so..." and it means instead of spending my attention on finding things in files, I look at overall structure. This also means I don't need my highest level of attention, so I can do this sort of thing when I'm not even really able to concentrate, eg late at night or while I'm out with the kids somewhere.
So yeah, I might also say there's very little learning curve. It's not like I opened a manual or tutorial before using Claude. I just started talking to it in natural language about what it should do, and it's doing what I want. Unlike seemingly everyone else.
The blogging output on the other hand ...
That is not what that paper said, lol.
Which shows that LLMs, when given to devs who are inexperienced with LLMs but are very experienced with the code they're working on, don't provide a speedup even though it feels like it.
Which is of course a very constrained scenario. IME the LLM speedup is mostly in greenfield projects using APIs and libraries you're not very experienced with.
I read these articles and I feel like I am taking crazy pills sometimes. The person, enticed by the hype, makes a transparently half-hearted effort for just long enough to confirm their blatantly obvious bias. They then act like they now have ultimate authority on the subject to proclaim their pre-conceived notions were definitely true beyond any doubt.
Not all problems yield well to LLM coding agents. Not all people will be able or willing to use them effectively.
But I guess "I gave it a try and it is not for me" is a much less interesting article compared to "I gave it a try and I have proved it is as terrible as you fear".
I have also written C++ code that has to have a runtime of years, meaning there can be absolutely no memory leaks or bugs whatsoever, or TV stops working. I wouldn't have a language model write any of that, at least not without testing the hell out of it and making sure it makes sense to myself.
It's not all or nothing here. These things are tools and should be used as such.
Ahh, sweet summer child, if I had a nickel for every time I've heard "just hack something together quickly, that's throwaway code", that ended up being a critical lynchpin of a production system - well, I'd probably have at least like a buck or so.
Obviously, to emphasize, this kind of thing happens all the time with human-generated code, but LLMs make the issue a lot worse because it lets you generate a ton of eventual mess so much faster.
Also, I do agree with your primary point (my comment was a bit tongue in cheek) - it's very helpful to know what should be core and what can be thrown away. It's just in the real world whenever "throwaway" code starts getting traction and getting usage, the powers that be rarely are OK with "Great, now let's rebuild/refactor with production usage in mind" - it's more like "faster faster faster".
So in the other camp you have seasoned engineers who will have a 5x longer design and planning process. But they also never get it right the first several iterations. And by the time their “properly-engineered” design gets its chance to shine, the business needs already changed.
They exist.
Because this is the first pass on any project, any component, ever. Design is done with iterations. One can and should throw out the original rough lynchpin and replace it with a more robust solution once it becomes evident that it is essential.
If you know that ahead of time and want to make it robust early, the answer is still rarely a single diligent one-shot to perfection - you absolutely should take multiple quick rough iterations to think through the possibility space before settling on your choice. Even that is quite conducive to LLM coding - and the resulting synthesis after attacking it from multiple angles is usually the strongest of all. Should still go over it all with a fine toothed comb at the end, and understand exactly why each choice was made, but the AI helps immensely in narrowing down the possibility space.
Not to rag on you though - you were being tongue in cheek - but we're kidding ourselves if we don't accept that like 90% of the code we write is rough throwaway code at first and only a small portion gets polished into critical form. That's just how all design works though.
I have been reprimanded, though, and we tediously spent more time collectively combing over said quick prototype code than was originally provided to work on it, as proof of my incompetence! Does that count?
I then manually declare some functions, JSDoc comments for the return types, and imports, and stop halfway. By then the agent is able to think: aha, you plan to replace all the API calls to this composable under the so-and-so namespace.
It's iterations and context. I don't use them for everything but I find that they help when my brain bandwidth begins to lag or I just need a boilerplate code before engineering specific use cases.
└── Dey well
> Learning how to use LLMs in a coding workflow is trivial. There is no learning curve. You can safely ignore them if they don’t fit your workflows at the moment.
Learning how to use LLMs in a coding workflow is trivial to start, but you find you get a bad taste early if you don't learn how to adapt both your workflow and its workflow. It is easy to get a trivially good result and then be disappointed in the followup. It is easy to try to start on something it's not good at and think it's worthless.
The pure dismissal of cursor, for example, means that the author didn't learn how to work with it. Now, it's certainly limited and some people just prefer Claude code. I'm not saying that's unfair. However, it requires a process adaptation.
Not everyone with a different opinion is dumber than you.
Just like I can recognize a clueless frontend developer when they say "React is basically just a newer jquery". Recognizing clueless engineers when they talk about AI can be pretty easy.
It's a sector that is both old and new: AI has been around forever, but even people who worked in the sector years ago are taken aback by what is suddenly possible, the workflows that are happening... hell, I've even seen cases where it's the very people who have been following GenAI forever that have a bias towards believing it's incapable of what it can do.
For context, I lead an AI R&D lab in Europe (https://ingram.tech/). I've seen some shit.
It seems to me the biggest barrier is that the person driving the tool needs to be experienced enough to recognize and assist when it runs into issues. But that's little different from any sophisticated tool.
It seems to me a lot of the criticism comes from placing completely unrealistic expectations on an LLM. "It's not perfect, therefore it sucks."
If you want to use a tool like Claude Code (or Gemini CLI or Cursor agent mode or Code CLI or Qwen Code) to solve complex problems you need to give them an environment they can operate in where they can solve that problem without causing too much damage if something goes wrong.
You need to think about sandboxing, and what tools to expose to them, and what secrets (if any) they should have access to, and how to control the risk of prompt injection if they might be exposed to potentially malicious sources of tokens.
The other week I wanted to experiment with some optimizations of configurations on my Fly.io hosted containers. I used Claude Code for this by:
- Creating a new Fly organization which I called Scratchpad
- Assigning that a spending limit (in case my coding agent went rogue or made dumb expensive mistakes)
- Creating a Fly API token that could only manipulate that organization - so I could be sure my coding agent couldn't touch any of my production deployments
- Putting together some examples of how to use the Fly CLI tool to deploy an app with a configuration change - just enough information that Claude Code could start running its own deploys
- Running Claude Code such that it had access to the relevant Fly command authenticated with my new Scratchpad API token
With all of the above in place I could run Claude in --dangerously-skip-permissions mode and know that the absolute worst that could happen is it might burn through the spending limit I had set.
This took a while to figure out! But now... any time I want to experiment with new Fly configuration patterns I can outsource much of that work safely to Claude.
There are plenty of useful LLM workflows that are possible to create pretty trivially.
The example you gave is hardly the first thing a beginning LLM user would need. Yes, more sophisticated uses of an advanced tool require more experience. There's nothing different from any other tool here. You can find similar debates about programming languages.
Again, what I said in my original comment applies: people place unrealistic expectations on LLMs.
I suspect that this is at least partly is a psychological game people unconsciously play to try to minimize the competence of LLMs, to reduce the level of threat they feel. A sort of variation of terror management theory.
edit: I don’t have personal experience around spending limits, but I vaguely recall them being useful for folks who want to set up AWS resources and swing for the fences in startups, without thinking too deeply about the infra. Again this isn’t a failure mode unique to LLMs, although I can appreciate it not mapping perfectly to your scenario above
edit #2: fwict the LLM specific context of your scenario above is: providing examples, setting up API access somehow (eg maybe invoking a CLI?). The rest to me seems like good old software engineering
Yea, there’s some grunt work involved but in terms of learned ability all of that is obvious to someone who knew only a little bit about LLMs.
It’s not exactly a groundbreaking line of reasoning that leads one to the conclusion of “I shouldn’t let this non-deterministic system access production servers.”
Now, setting up an LLM so that they can iterate without a human in the loop is a learned skill, but not a huge one.
I don’t have to debug Emacs every day to write code. My CI workflow just runs every time a PR is created. When I type ‘make tests’, I get a report back. None of those things are perfect, but they are reliable.
What you're describing is a case of mismatched expectations.
Copilot isn't an LLM, for a start. You _combine_ it with a selection of LLMs. And it absolutely has severe limitations compared to something like Claude Code in how it can interact with the programming environment.
"Hallucinations" are far less of a problem with software that grounds the AI to the truth in your compiler, diagnostics, static analysis, a running copy of your project, runnning your tests, executing dev tools in your shell, etc.
You're being overly pedantic here and moving goalposts. Copilot (for coding) without an LLM is pretty useless.
I stand by my assertion that these tools are all basically the same fundamental tech - LLMs.
Over generalizing. The synergy between the LLM and the client (cursor, Claude code, copilot, etc) make a huge difference in results.
With LLMs, the point is to eliminate tedious work in a trivial way. If it’s tedious to get an LLM to do tedious work, you have not accomplished anything.
If the work is not trivial enough for you to do yourself, then using an LLM will probably be a disaster, as you will not be able to judge the final output yourself without spending nearly the same amount of time it takes for you to develop the code on your own. So again, nothing is gained, only the illusion of gain.
The reason people think they are more productive using LLMs to tackle non-trivial problems is because LLMs are pretty good at producing “office theatre”. You look like you’re busy more often because you are in a tight feedback loop of prompting and reading LLM output, vs staring off into space thinking deeply about a problem and occasionally scribbling or typing something out.
We are learning that this is not going to be magic. There are some cases where it shines. If I spend the time, I can put out prototypes that are magic and I can test with users in a fraction of the time. That doesn't mean I can use that for production.
I can try three or four things during a meeting where I am generally paying attention, and look afterwards to see if any of it is worth pursuing.
I can have it work through drudgery if I provide it an example. I can have it propose a solution to a problem that is escaping me, and I can use it as a conversational partner for the best rubber duck I've ever seen.
But I'm adapting myself to the tool and I'm adapting the tool to me through learning how to prompt and how to develop guardrails.
Outside of coding, I can write chicken scratch and provide an example of what I want, and have it write a proposal for a PRD. I can have it break down a task, generate a list of proposed tickets, and after I've gone through them have it create them in Jira (or anything else with an API). But the more I invest into learning how to use the tool, the less I have to clean up after.
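The "create them in Jira via the API" step is mundane plumbing. A minimal sketch, assuming Jira Cloud's REST API with basic auth (the URL, project key, and credentials are placeholders):

```
# minimal sketch: create one ticket via Jira Cloud's REST API
# (instance URL, project key, and auth are placeholders; adjust for your setup)
import requests

resp = requests.post(
    "https://your-domain.atlassian.net/rest/api/2/issue",
    auth=("you@example.com", "API_TOKEN"),
    json={
        "fields": {
            "project": {"key": "PROJ"},
            "summary": "Break down: implement rate limiting",
            "description": "Generated from the PRD breakdown.",
            "issuetype": {"name": "Task"},
        }
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["key"])  # e.g. PROJ-123
```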
Maybe one day in the future it will be better. However, with the time invested into the tool, 40 bucks of investment (20 into Cursor, 20 into GPT) can add a 10-15% boost in productivity. Putting 200 into Claude might get you another 10%, and it can get you 75% in greenfield and prototyping work. I bet that agency work can be sped up as much as 40% for that 200-buck investment into Claude.
That's a pretty good ROI.
And maybe some workloads can do even better. I haven't seen it yet but some people are further ahead than me.
Everything you mentioned is also fairly trivial, just a couple of one shot prompts needed.
Improving LLM output through better inputs is neither an illusion, nor as easy as learning how to google (entire companies are being built around improving llm outputs and measuring that improvement)
Keep in mind that the first reasoning model (o1) was released less than 8 months ago and Claude Code was released less than 6 months ago.
Slot machines on the other hand are truly random and success is luck based with no priors (the legal ones in the US anyways)
I have used neural networks since the 1980s, and modern LLM tech simply makes me happy, but there are strong limits to what I will use the current tech for.
The power-user tricks like "double quote phrase searches" and exclusion with -term are treated more as gentle guidelines now, because regular users aren't expected to figure them out.
There's always "verbatim" mode, though amusingly that appears to be almost entirely undocumented! I tried using Google to find the official documentation for that feature just now and couldn't do better than their 2011 blog entry introducing it: https://search.googleblog.com/2011/11/search-using-your-term...
Maybe if I was more skilled at Google I'd be able to use it to find documentation on its own features?
Pseudo-random number generators remain one of the most amazing things in computing IMO. Knuth volume 2. One of my favourite books.
LLMs will always suck at writing code that has not been written millions of times before. As soon as you venture slightly offroad, they falter.
That right there is your learning curve! Getting LLMs to write code that's not heavily represented in their training data takes experience and skill and isn't obvious to learn.
People are claiming that it takes time to build the muscles and train the correct footing to push, while I'm here learning mechanical theory and drawing up levers. If one manages to push the rock for one meter, he comes clamoring, ignoring the many who were injured doing so, saying that one day he will be able to pick the rock up and throw it at the moon.
Anything up to 250,000 tokens I pipe into GPT-5 (prior to that o3), and beyond that I'll send them to Gemini 2.5 Pro.
For even larger code than that I'll fire up Codex CLI or Claude Code and let them grep their way to an answer.
This stuff has gotten good enough now that I no longer get stuck when new tools lack decent documentation - I'll pipe in just the source code (filtered for .go or .rs or .c files or whatever) and generate comprehensive documentation for myself from scratch.
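A minimal sketch of that gather-and-filter step (the extensions and size cap are illustrative, not the exact commands anyone uses):

```
# minimal sketch: gather source files by extension into one big prompt payload
# (extensions and size cap are illustrative; adjust per project and model)
from pathlib import Path

EXTENSIONS = {".go", ".rs", ".c"}
MAX_CHARS = 1_000_000  # stay comfortably inside the model's context window

chunks = []
for path in sorted(Path(".").rglob("*")):
    if path.is_file() and path.suffix in EXTENSIONS:
        chunks.append(f"--- {path} ---\n{path.read_text(errors='ignore')}")

payload = "\n\n".join(chunks)[:MAX_CHARS]
print(payload)  # pipe this into whichever model or CLI you prefer
```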
You don't have the luxury of having someone who is deeply familiar with the code sanity-check your perceived understanding of it, i.e. you don't see where the LLM is horribly off-track because you don't have sufficient understanding of that code to see the error. In enterprise contexts this is very common though, so it's quite likely that a lot of the haters here have seen PRs submitted by vibecoders to their own work that were inadequate enough that they started to blame the tool. For example, I have seen someone reinvent the wheel of session handling in a client library because they were unaware that the existing session came batteries included, and the LLM didn't hesitate to write the code again for them. The code worked, everything checked out, but because the developer didn't know what they didn't know, they submitted a janky mess.
What are those things that they are good for? And consistently so?
If you have complex objects and you're doing complex operations on them, then setup code can get rather long.
Also I started in the pre-agents era and so I ended up with a pair-programming paradigm. Now everytime I conceptualize a new task in my head -- whether it is a few lines of data wrangling within a function, or generating an entire feature complete with integration tests -- I instinctively do a quick prompt-vs-manual coding evaluation and seamlessly jump to AI code generation if the prompt "feels" more promising in terms of total time and probability of correctness.
I think one of the skills is learning this kind of continuous evaluation and the judgement that goes with it.
Effective LLM usage these days is about a lot more than just the prompts.
it helps dramatically on finding bugs and issues. perhaps that's trivial to you, but it feels novel as we've only had effective agents in the last couple weeks.
Right now, you are entirely dependent on the random, flawed information on the internet that you often can't reproduce in trials, or on your own structured ideas about how to improve a thing.
That is difficult. It is difficult to take the information available right now, and come up with a reasonable way to improve the performance of LLMs through your ingenuity.
At some point it will be figured out, and every corporation will be following the same ideal setup, but at the moment it is a green field opportunity for the human brain to come up with novel and interesting ideas.
I believe that's difficult, and not just what google prefers. I guess we feel differently about it.
I recently started a fresh project, and until I got to the desired structure I only used AI to ask questions or for suggestions. I organized and wrote most of the code.
Once it started to get into the shape that felt semi-permanent to me, I started a lot of queries like:
```
- Look at existing service X at folder services/x
- see how I deploy the service using k8s/services/x
- see how the docker file for service X looks like at services/x/Dockerfile
- now, I started service Y that does [this and that]
- create all that is needed for service Y to be skaffolded and deployed, follow the same pattern as service X
```
And it would go, read existing stuff for X, then generate all of the deployment/monitoring/readme/docker/k8s/helm/skaffold for Y
With zero to no mistakes. Both Claude and Gemini are more than capable of doing such a task. I had both of them generate 10-15 files with no errors, with code that could be deployed right after (of course the service will just answer and not do much more than that).
Then, I will take over again for a bit, do some business logic specific to Y, then again leverage AI to fill in missing bits, review, suggest stuff etc.
It might look slow, but it actually cuts out the most boring and most error-prone steps when developing a medium-to-large k8s-backed project.
Whipping up greenfield projects is almost magical, of course. But that’s not most of my work.
My personal experience has been that AI has trouble keeping the scope of the change small and targeted. I have only been using Gemini 2.5 Pro though, as we don’t have access to other models at my work. My friend tells me he uses Claude for coding and Gemini for documentation.
Most people I've seen espousing LLMs and agentic workflows as a silver bullet have limited experience with the frameworks and languages they use with these workflows.
My view currently is one of cautious optimism; that LLM workflows will get to a more stable point whereby they ARE close to what the hype suggests. For now, that quote that "LLMs raise the floor, not the ceiling" I think is very apt.
LinkedIn is full of BS posturing, ignore it.
If you go by MBA types on LinkedIn that aren’t really developers or haven’t been in a long time, now they can vibe out some react components or a python script so it’s a revolution.
I tend to strongly agree with the "unpopular opinion" about the IDEs mentioned versus CLI (specifically, aider.chat and Claude Code).
Assuming (this is key) you have mastery of the language and framework you're using, working with the CLI tool in 25 year old XP practices is an incredible accelerant.
Caveats:
- You absolutely must bring taste and critical thinking, as the LLM has neither.
- You absolutely must bring systems thinking, as it cannot keep deep weirdness "in mind". By this I mean the second and third order things that "gotcha" about how things ought to work but don't.
- Finally, you should package up everything new about your language or frameworks since a few months or year before the knowledge cutoff date, and include a condensed synthesis in your context (e.g., Swift 6 and 6.1 versus the 5.10 and 2024's WWDC announcements that are all GPT-5 knows).
For this last one I find it useful to (a) use OpenAI's "Deep Research" to first whitepaper the gaps, then another pass to turn that into a Markdown context prompt, and finally bring that over to your LLM tooling to include as needed when doing a spec or in architect mode. Similarly, (b) use repomap tools on dependencies if creating new code that leverages those dependencies, and have that in context for that work.
I'm confused why these two obvious steps aren't built into leading agentic tools, but maybe handling the LLM as a naive and outdated "Rain Man" type doesn't figure into mental models at most KoolAid-drinking "AI" startups, or maybe vibecoders don't care, so it's just not a priority.
Either way, context based development beats Leroy Jenkins.
It seems to me that currently there are 2 schools of thought:
1. Use repomap and/or LSP to help the models navigate the code base
2. Let the models figure things out with grep
Personally, I am 100% a grep guy, and my editor doesn't even have LSP enabled. So, it is very interesting to see how many of these agentic tools do exactly the same thing.
And Claude Code /init is a great feature that basically writes down the current mental model after the initial round of grep.
Using one strategy or the other leaves different big gaps, and requires context or prompt work to compensate.
They should be using 1 to keep overall lay of the land, and 2 before writing any code.
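For what it's worth, the "repo map" half of that can start out as naive as a few lines; an illustrative Python sketch (not what any particular tool actually does):

```
# naive "repo map": list top-level functions and classes per Python file
# (illustrative only; real mappers use tree-sitter/LSP and rank by relevance)
import ast
from pathlib import Path

for path in sorted(Path(".").rglob("*.py")):
    try:
        tree = ast.parse(path.read_text(errors="ignore"))
    except SyntaxError:
        continue
    names = [
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    if names:
        print(f"{path}: {', '.join(names)}")
```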
On the management side, however, we have all sorts of AI mandates, workshops, social media posts hyping our AI stuff, our whole "product vision" is some AI-hallucinated nightmare that nobody understands, you'd genuinely think we've been doing nothing but AI for the last decade the way we're contorting ourselves to shove "AI" into every single corner of the product. Every day I see our CxOs posting on LinkedIn about the random topic-of-the-hour regarding AI. When GPT-5 launched, it was like clockwork, "How We're Using GPT-5 At $COMPANY To Solve Problems We've Never Solved Before!" mere minutes after it was released (we did not have early access to it lol). Hilarious in retrospect, considering what a joke the launch was like with the hallucinated graphs and hilarious errors like in the Bernoulli's Principle slide.
Despite all the mandates and mandatory shoves coming from management, I've noticed the teams I'm close with (my team included) are starting to push back themselves a bit. They're getting rid of the spam-generating PR bots that have never, not once, provided a useful PR comment. People are asking for the various subscriptions they were granted to be revoked because they're not using them and it's a waste of money. Our own customers' #1 piece of feedback is to focus less on stupid AI shit nobody ever asked for, and to instead improve the core product (duh). I'm even seeing our CTO, who was fanboy number 1, start dialing it back a bit and relenting.
It's good to keep in mind that HN is primarily an advertisement platform for YC and their startups. If you check YC's recent batches, you would think that the 1 and only technology that exists in the world is AI, every single one of them mentions AI in one way or another. The majority of them are the lowest effort shit imaginable that just wraps some AI APIs and is calling it a product. There is a LOT of money riding on this hype wave, so there's also a lot of people with vested interests in making it seem like these systems work flawlessly. The less said about LinkedIn the better, that site is the epitome of the dead internet theory.
> Learning how to use LLMs in a coding workflow is trivial. There is no learning curve. You can safely ignore them if they don’t fit your workflows at the moment.
How much of your workflow or intuition from 6 months ago is still relevant today? How long would it take to learn the relevant bits today?
Keep in mind that Claude Code was released less than 6 months ago.
If I was starting from fresh today I expect it would take me months of experimentation to get back to where I am now.
Working thoughtfully with LLMs has also helped me avoid a lot of the junk tips ("Always start with 'you are the greatest world expert in X', offer to tip it, ...") that are floating around out there.
Speaking mostly from experience of building automated, dynamic data processing workflows that utilize LLMs:
Things that work with one model, might hurt performance or be useless with another.
Many tricks that used to be necessary in the past are no longer relevant, or only applicable for weaker models.
This isn't me dismissing anyone's experience. It's ok to do things that become obsolete fairly quickly, especially if you derive some value from it. If you try to stay on top of a fast moving field, it's almost inevitable. I would not consider it a waste of time.
The reach is big enough to not care about our feelings. I wish it wasn't this way.
I recall The Mythical Man-Month stating a rough calculation that the average software developer writes about 10 net lines of new, production-ready code per day. For a tool like this going up an order of magnitude to about 100 lines of pretty good internal tooling seems reasonable.
OP sounds a few cuts above the 'average' software developer in terms of skill level. But here we also need to point out a CLI log viewer and querier is not the kind of thing you actually needed to be a top tier developer to crank out even in the pre-LLM era, unless you were going for lnav [1] levels of polish.
[1]: https://lnav.org/
I would rather qualify this statement a bit more - I would say "you can safely ignore them if you are not building anything greenfield or building tools for yourself". In my experiments over the last month or so, it is very efficient at building new components (small & medium). Making it efficient on an existing code base is a bit more tricky - you need to make sure it adheres to the way things are already coded, that it doesn't leak .env contents to LLMs, that it builds a context from the existing components so that it does not read code every time (leading to cost and time escalations), and so on.
My main issue so far has been understanding the code that is generated. As of now that is the biggest bottleneck to increasing productivity - i.e. it takes a long time to review the code and push. In the usual workflow of building, by the time the code complexity has increased in the system I would have built up a sufficient mental model to handle that complexity; I would know the inner workings of the code. However, if AI generates a large piece of code, getting into that code takes a long time.
The other thing I disagree with is the coverage of gemini-cli: if you use gemini-cli for a single long work session, then you must set your Google API key as an environment variable when starting gemini-cli, otherwise you end up after a short while using Gemini 2.5 Flash, and that leads to unhappy results. So, use gemini-cli for free for short and focused 3 or 4 minute work sessions and you are good, or pay for longer work sessions, and you are good.
I do have a random off topic comment: I just don’t get it: why do people live all day in an LLM-infused coding environment? LLM based tooling is great, but I view it as something I reach for a few times a day for coding and that feels just right. Separately, for non-coding tasks, reaching for LLM chat environments for research and brainstorming is helpful, but who really needs to do that more than once or twice a day?
The current state of LLM-driven development is already several steps down the path of an end-game where the overwhelming majority of code is written by the machine; our entire HCI for "building" is going to be so far different to how we do it now that we'll look back at the "hand-rolling code era" in a similar way to how we view programming by punch-cards today. The failure modes, the "but it SUCKS for my domain", the "it's a slot machine" etc etc are not-even-wrong. They're intermediate states except where they're not.
The exceptions to this end-game will be legion and exist only to prove the end-game rule.
Do they? I’ve found Clojure-MCP[1] to be very useful. OTOH, I’m not attempting to replace myself, only augment myself.
I like your phrasing of “OTOH, I’m not attempting to replace myself, only augment myself.” because that is my personal philosophy also.
I work mostly in C/C++.
The most valuable improvement of using this kind of tools, for me, is to easily find help when I have to work on boring/tedious tasks or when I want to have a Socratic conversation about a design idea with a not-so-smart but extremely knowledgeable colleague.
But for anything requiring a brain, it is almost useless.
* I let the AI do something
* I find bad bug or horrifying code
* I realize I gave it too much slack
* hand code for a while
* go back to narrow prompts
* get lazy, review code a bit less, add more complexity
* GOTO 1, hopefully with a better instinct for where/how to trust this model
Then over time you hone your instinct on what to delegate and what to handle yourself. And how deeply to pay attention.
It makes your existing strength and mobility greater, but don't be surprised, if you fly into space, that you will suffocate,
or if you fly over an ocean and run out of gas, that you'll sink to the bottom,
or if you fly the suit in your fine glassware shop with patrons in the store, that you're going to break and burn everything/everyone in there.
In case it matters, I was using Copilot that is for 'free' because my dayjob is open source, and the model was Claude Sonnet 3.7. I've not yet heard anyone else saying the same as me which is kind of peculiar.
I haven't found that to be true with my most recent usage of AI. I do a lot of programming in D, which is not popular like Python or Javascript, but Copilot knows it well enough to help me with things like templates, metaprogramming, and interoperating with GCC-produced DLL's on Windows. This is true in spite of the lack of a big pile of training data for these tasks. Importantly, it gets just enough things wrong when I ask it to write code for me that I have to understand everything well enough to debug it.
And promoting your own startup is usually okay if it is phrased okay :)
Devin is perhaps the one that is most fully featured and I believe has been around the longest. Other examples that seem to be getting some attention recently are Warp, Cursor's own background agent implementation, Charlie Labs, Codegen, Tembo, and OpenAI's Codex.
I do not work for any of the aforementioned companies.
Ah yes. An unverifiable claim followed by "just google them yourself".
> Devin is perhaps the one that is most fully featured and I believe has been around the longest.
And it had been hilariously bad the longest. Is it better now? Maybe? I don't really know anyone even mentioning Devin anymore
> examples that seem to be getting some attention recently
So, "some attention", but you could "easily find them by searching".
> Charlie Labs, Codegen, Tembo
Never heard of them, but will take a look.
See how easy it was to mention them?
Some agent scaffolding performs better on benchmarks than others given the same underlying base model - see SWE Bench and Terminal Bench for examples.
Some may find certain background agents better than others simply because of UX. Some background agents have features that others don't - like memory systems, MCP, 3rd party integrations, etc.
I maintain it is easy to search for examples of background coding agents that are not Jules or Copilot. For me, searching "background coding agents" on google or duckduckgo returns some of the other examples that I mentioned.
Either I'm extremely lucky, or I was lucky enough to find the guy who said it must all be test-driven and guided by the usual principles of DRY etc. Claude Code works absolutely fantastically nine out of 10 times, and when it doesn't we just roll back the three hours of nonsense it did, postpone the feature, or give it extra guidance.
If there's a test suite for the thing to run it's SO much less likely to break other features when it's working. Plus it can read the tests and use them to get a good idea about how everything is supposed to work already.
Telling Claude to write the test first, then execute it and watch it fail, then write the implementation has been giving me really great results.
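What that loop produces looks roughly like this (a made-up sketch; the function and names are placeholders, and the point is that the test exists, and fails, before the implementation does):

```
# step 1: the test, written and run first (it failed with a NameError before slugify existed)
def test_slugify_lowercases_and_joins_with_dashes():
    assert slugify("Hello World") == "hello-world"

# step 2: the implementation, written only after watching that test fail
def slugify(text: str) -> str:
    return "-".join(text.lower().split())
```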
Almost like hiring and scaling a team? There are also benchmarks that specifically measure this, and it's in theory a very temporary problem (the Aider Polyglot Benchmark is one such).
It’s mostly on point though. Although, in recent years I’ve been assigned to manage and plan projects at work, and the skills I’ve learnt from that greatly help to get effective results from an LLM I think.
It’s not perfect but it’s okay.
Like if you need to crap out a UI based on a JSON payload, make a service call, add a server endpoint, LLMs will typically do this correctly in one shot. These are common operations that are easily extrapolated from their training data. Where they tend to fail are tasks like business logic which have specific requirements that aren’t easily generalized.
I’ve also found that writing the scaffolding for the code yourself really helps focus the agent. I’ll typically add stubs for the functions I want, and create overall code structure, then have the agent fill the blanks. I’ve found this is a really effective approach for preventing the agent from going off into the weeds.
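Concretely, the handoff to the agent looks roughly like this (a hypothetical sketch; the types and names are placeholders):

```
# scaffolding written by hand; the agent is asked to fill in the bodies only
from dataclasses import dataclass

@dataclass
class Invoice:
    customer_id: str
    line_items: list[tuple[str, float]]  # (description, amount)

def total(invoice: Invoice) -> float:
    """Sum the line item amounts. TODO: agent fills this in."""
    raise NotImplementedError

def apply_discount(invoice: Invoice, percent: float) -> float:
    """Return the total after a percentage discount. TODO: agent fills this in."""
    raise NotImplementedError
```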
I also find that if it doesn’t get things right on the first shot, the chances are it’s not going to fix the underlying problems. It tends to just add kludges on top to address the problems you tell it about. If it didn’t get it mostly right at the start, then it’s better to just do it yourself.
All that said, I find enjoyment is an important aspect as well and shouldn’t be dismissed. If you’re less productive, but you enjoy the process more, then I see that as a net positive. If all LLMs accomplish is to make development more fun, that’s a good thing.
I also find that there's use for both terminal based tools and IDEs. The terminal REPL is great for initially sketching things out, but IDE based tooling makes it much easier to apply selective changes exactly where you want.
As a side note, got curious and asked GLM-4.5 to make a token field widget with React, and it did it in one shot.
It's also strange not to mention DeepSeek and GLM as options given that they cost orders of magnitude less per token than Claude or Gemini.
I use clojure for my day-to-day work, and I haven't found this to be true. Opus and GPT-5 are great friends when you start pushing limits on Clojure and the JVM.
> Or 4.1 Opus if you are a millionaire and want to pollute as much as possible
I know this was written tongue-in-cheek, but at least in my opinion it's worth it to use the best model if you can. Opus is definitely better on harder programming problems.
> GPT 4.1 and 5 are mostly bad, but are very good at following strict guidelines.
This was interesting. At least in my experience GPT-5 seemed about as good as Opus. I found it to be _less_ good at following strict guidelines though. In one test Opus avoided a bug by strictly following the rules, while GPT-5 missed it.
I'm sorry, but I disagree with this claim. That is not my experience, nor that of many others. It's true that you can make them do something without learning anything. However, it takes time to learn what they are good and bad at, what information they need, and what nonsense they'll do without express guidance. It also takes time to know what to look for when reviewing results.
I also find that they work fine for languages without static types. You do need tests, yes, but you need them anyway.
Some comments here are reminiscent of antiquated discourse: "how many angels dance on the head of a pin?"
We somehow are trying to agree on some factual ramp-up time required for a dev to become competent coding with LLM's. This is inherently subjective! Why bother?
Perhaps certain LLMs are blessed with disproportionately more angels (nee "bugs") in the machines.
I enjoyed reading the article:
"The model looks good, but Google’s enshittification has won and it looks like no competent software developers are left. I would know, many of my friends work there."
Yikes!
Credit to the author for having the courage to post publically.
That was an unnecessary guilt-shaming remark.
It becomes farcical when not only are you missing the big thing but you're also proud of your ignorance and this guy is both.
When it is mentioned that LLMs "have terrible code organization skills", I think they are referring mainly to the size of the context. It is not the same to develop a module with hundreds of LoCs, one with thousands or one with tens of thousands of LoCs.
I am not really convinced by the skill-degradation argument; I am not aware of a study that validates it. On the other hand, it is true that agents are constantly evolving, and I don't see any difficulties that cannot be overcome by the current evolutionary race, given that, in the end, coding is one of the most accessible tasks for artificial intelligence.