Features are vertical slices through the software cake, but the cake is actually made out of horizontal layers. Creating a bunch of servings of cake and then trying to stick them together just results in a fragile mess that's difficult to work with and easy to break.
But... reviewing code is harder than writing code. Expressing how I want something to be done in natural language is incredibly hard.
So over time I'm spending a lot of energy on those things, and only getting it 80% right.
Not to mention I'm constantly in this highly suspicious mode, trying to pierce through the veil of my own prompt and the generated code, because it's the edge cases that make work hard.
The end result is exhaustion. There is no recharge. Plans are front-loaded, and then you switch to auditing mode.
Whereas with code you front-load a good amount of design, but you can make changes as you go, and since you know your own code the effort to make them is much lower.
Working with LLM-generated code is mostly the same. The more sophisticated the autocomplete, the more mental overhead is spent on understanding its output. There is an advantage: you are spared having to argue with a possibly defensive peer about what you believe is best. There is also a disadvantage: you don't feel like you are helping someone grow; instead you are an unpaid contributor (unpaid for that work in particular, anyway) to a product by Microsoft (or similar) generally intended, in the longer term, to push you and/or your peers out of a job. And there is no single mind you can build rapport with and, after a while, learn to read the approaches and vibes of.
Surprise, surprise… that is why programming languages were created.
Programming languages were created because of the different problem of "it's very hard to get computers to understand natural language even if you know how to express what you want in it".
Any difficulty with clearly expressing things in natural language (a real problem, and one for which there have long been solutions between humans that are different from programming languages) was technically unreachable at the user->machine interface for most of the history of computing, because of that more fundamental problem. It's arguable that LLMs are now at the level where it is potentially an issue with computers, but it took decades of having programming languages to build up the technical capacity to even experience the problem; it is not the problem programming languages address.
I much prefer to choose tasks that can be done with 25%+ context left and then just start the next task with fresh context.
If I'm getting low on context, I have it summarize the plan and progress into a text file rather than use /compact, then start a fresh context and reference that file, which I can edit and retry from if I'm not getting good results.
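Something like this (a made-up sketch; the file name and headings are arbitrary, use whatever works for you):

    # progress.md (hypothetical)
    Goal: swap the hand-rolled config parser for a single loader
    Done:
      - loader written, unit tests passing
    Next:
      - wire the loader into the CLI entry point
    Gotchas:
      - old parser silently ignored unknown keys; new one errors

Then the fresh session just starts with "read progress.md and continue from Next".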
thanks for the article, it's a good one
yes, just as was said each and every previous time OpenAI/Anthropic shit out a new model
"now it doesn't suck!"
The hedonic treadmill ensures it feels the same way each time.
But that doesn’t mean the models aren’t improving, nor that the scope isn’t expanding. If you compare today’s tools to those a year ago, the difference is stark.
They know that it's a significant, but not revolutionary, improvement.
If you supervise and manage your agents closely on well-scoped (small) tasks, they are pretty handy.
If you need a prototype and don't care about code quality or maintenance, they are great.
Anyone claiming 2x, 5x, 10x, etc. is absolutely kidding themselves for any non-trivial software.
Compared to just doing it yourself though?
Imagine having to micromanage a junior developer like this to get good results
Ridiculous tbh
I'd rather use it the other way around: I'm the one in charge, and the AI reviews for logical flaws or things I would have missed. I don't even have to think about the context window, since it only needs to look at my new code logic.
So yeah, 3 years after the first ChatGPT and Copilot, I don't feel huge changes regarding "automated" AI programming, and I don't have any AI tool in my IDE; I prefer to have a chat using their website, to brainstorm, or occasionally to find a solution to something I'm stuck on.
It's good enough that it helps, particularly in areas or languages that I'm unfamiliar with. But I'm constantly fighting with it.
Impressively, it recognized the structure of the code and correctly identified it as a component of an audio codec library, and provided a reasonably complete description of many minute details specific to this codec and the work that the function was doing.
Rather less impressively, it decided to ignore my request and write a function that used C++ features throughout, such as type inference and lambdas, or should I say "lambdas", because it was actually just a function-defined-within-a-function that tried to access and mutate variables outside of its own scope, as if we were writing JavaScript or something. Even apart from that, the code was rife with the sorts of issues that even a default invocation of gcc would flag.
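To give a flavor of the "lambda" problem, the shape of it was roughly this (a made-up reconstruction, not the actual generated code): a nested function mutating a local from the enclosing scope, which only compiles at all because GCC supports nested functions as a nonstandard extension:

    #include <stdio.h>

    static int sum_abs(const int *buf, int n) {
        int total = 0;               /* outer local that the "lambda" mutates */
        void accumulate(int x) {     /* nested function: a GCC extension, not ISO C */
            total += x < 0 ? -x : x; /* closes over the enclosing scope, JS-style */
        }
        for (int i = 0; i < n; i++)
            accumulate(buf[i]);
        return total;
    }

    int main(void) {
        int samples[] = {3, -1, 4, -1};
        printf("%d\n", sum_abs(samples, 4)); /* prints 9 */
        return 0;
    }

clang rejects this outright, and gcc with -Wpedantic warns that ISO C forbids nested functions.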
I can see why people would be wowed by this on its face. I wouldn't expect any average developer to have such a depth of knowledge and breadth of pattern-matching ability to be able to identify the specific task that this specific function in this specific audio codec was performing.
At the same time, this is clearly not a tool that's suitable for letting loose on a codebase without EXTREME supervision. This was a fresh session (no prior context to confuse it) using a tightly crafted prompt (a small, self-contained C program doing one thing) with a clear goal, and it still required constant handholding.
At the end of the day, I got the code working by editing it manually, but in an honest retrospective I would have to admit that the overall process actually didn't save me any time at all.
Ironically, despite how they're sold, these tools are infinitely better at going from code to English than going the other way around.
Brainstorming, ideation, and small, well-defined tasks where I can quickly vet the solution: these feel like the sweet spot for current frontier model capabilities.
(Unless you are pumping out some sloppy React SPA where you don't care about anything except getting it working as fast as possible - fine, get Claude Code to one-shot it)
Just two questions, if you don’t mind satisfying my curiosity.
- Did you tell it to write C? Or better yet, what was the prompt? You can use claude --resume to easily find that.
- Which model (Sonnet or Opus)? Though I'd have expected either one to work.
There's a big difference between their benchmarks and real-world coding.
It feels like part of my journey to being an "AI developer" is being present for those tradeoffs, metabolizing each one into my craft.
AI is a fickle but powerful horse. I'm finding it a privilege to learn how to be a rider.
It's amazing at reviewing code. It will identify what you fear, the horrors that lie within the codebase, and it'll bring them out into the sunlight and give you a 7-step plan for fixing them. And the coding model is good; it can write a function. But it can't follow a plan worth shit. And if I have to be extremely detailed at the function-by-function level, then I should be in the editor coding. Claude Code is an amazing niche tool for code reviews and dialogue and debugging and coping with new technologies and tools, but it is not a productivity enhancement for daily coding.
> most SWE folks still have no idea how big the difference is between the coding agents they tried a year ago and declared as useless and chatgpt 5 paired with Codex or Cursor today
Also liszper: oh, you tried the current approach and don’t agree with me? Well you just don’t know what you are doing.
For context before that I had ~15 years of experience coding the traditional way.
I think the crucial difference is that I do actually see evidence (i.e. the codebase) posted sometimes for the former; the latter could well be entirely mythos -- a 24-day-old account evangelizing for the legion-of-agents story does kind of fit the theme.
Let me guess this has something to do with AI?
The difference from an actual junior developer, of course, is that the human junior developer learns from his mistakes and gets better, but Claude seems to be stuck at the level of expertise of its model, and you have to wait for the model to improve before Claude improves.
This, this is you. This is the entire charade. It seems poetic somehow.