This mirrors my experience using LLMs on personal projects. They can provide good advice only to the extent that your project stays within the bounds of well-known patterns. As soon as your codebase gets a little bit "weird" (ie trying to do anything novel and interesting), the model chokes, starts hallucinating, and makes your job considerably harder.
Put another way, LLMs make the easy stuff easier, but royally screw up the hard stuff. The gap does appear to be widening, not shrinking. They work best where we need them the least.
The other use case is targeted code review/improvement. "Suggest how I could improve this" fills a niche which is currently filled by linters, but can be more flexible and robust. It has its place.
The fundamental problem with LLMs is that they follow patterns, rather than doing any actual reasoning. This is essentially the observation made by the article; AI coding tools do a great job of following examples, but their usefulness is limited to the degree to which the problem to be solved maps to a followable example.
What does this mean?
After a few years off from this project, I refactored it all, and part of that refactoring was building a test suite that I can run. When run, it will rebuild, normalize, and verify all the data in my app (scraped data).
When I deploy, it will also run these tests and then email if something breaks, but skip the seeding portion.
I had plans to do this before but the firebase emulator still had a lot of issues a few years ago, and refactoring this project gave me the freedom to finally build a proper testing environment and make my entire app make full use of my local firebase emulator without issue.
I like giving it my test cases in plain English. It still gets them wrong sometimes but 90% of the time they are good to go.
However what they said was they write code then send that to an LLM to generate tests.
There’s no other way to interpret what they wrote. There’s no way to get “I write tests using plain English and an LLM, then use those tests to help me write code” from “I write code, I send that code to an LLM to generate tests, those tests give me a satisfying series of green checks”.
This is a case of the OP writing something that is close to the opposite of what they meant and you can’t reasonably steelman that to get to a stronger position for them.
I get that my wording was too harsh, but I estimated a 50% chance it was actually satire.
It also screwed up the imports of my tests pretty bad, some imports that worked before got changed for no good reason. It replaced the JetBrains NotNull annotation with a totally different annotation.
It was able to figure out how to update a DAO object when I added a new field. It got the type of the field wrong when updating the object corresponding to a row in that database column, even though it wrote the liquibase migration and should have known the type; we had chatted plenty about that migration.
It got many things right but I had to fix a lot of mistakes. It is not clear that it really saves time.
Cross between actually useful autocomplete, personalized StackOverflow and error diagnosis (just paste an error message in chat). I know I am just scratching the surface of its usefulness and I pretty much never do changes across multiple files, but I definitely see firm net positives at this point.
I'll add that my experience with the Codium plugin for IntelliJ is night and day different from the Windsurf editor from Codium.
The first one "just doesn't work" and struggles to see files that are in my project, the second basically works.
>The first one "just doesn't work"
Haha. You're on a roll.
>Please tell me this is satire.
No. I started doing TDD. It's fun to think about a piece of functionality, write out some tests, and then slowly make it pass. Removes a lot of cognitive load for me and serves as a micro todo. It's also nice that when you're working and think of something to add, you can just write out a quick test for it and add it to kanban later.
I can't tell you how many times I've worked on projects that are gigantic in complexity and don't do any testing, or use typescript, or both. You're always like 25% paranoid about everything you do and it's the worst.
Yeah that’s not TDD.
But I also have a day job and I can’t even begin to imagine how much extra work someone doing “TDD” by writing a function and then fixing it in place with a whole suite of generated tests would cause me.
I’m fine with TDD. I do it myself fairly often. I also go back in and delete the tests that I used to build it that aren’t actually going to be useful a year from now.
That’s not test driven development.
You write code, then you send the code to the LLM to create tests for you.
How can this possibly be interpreted to mean the reverse?
That you write tests first by asking the LLM in English to help you without “sending the code” you wrote because you haven’t written it yet. Then you use those tests to help you write the code.
Now if you misspoke then my comment isn’t relevant to your situation, but don’t pretend that I somehow interpreted what you said uncharitably. There’s no other way to interpret it.
As a test I just asked "ChatGPT 4o with canvas" to "Can you write a set of tests to test glBufferData and all of its edge cases?"
glBufferData is a 32 year old API so there are clearly plenty of examples for it to have looked at. There are even multiple public tests for it, including the official tests, which are open source and so easily scannable. It failed.
It wrote 8 tests; 7 of those tests were wrong in that they did something wrong intentionally and then asserted that they got no error. It wasn't close to comprehensive. It didn't test that the function actually put data in the buffer, for example, nor did it check the set of valid enums to see that they work. Nor did it check that the target parameter actually works and affects the correct buffer bound to that target.
This is my experience with LLMs for code so far. I do get answers quicker from LLMs sometimes for tech questions vs searching via Google and reading stack overflow. But that's only sometimes. As a recent example, I was trying to add TypeScript types to some JavaScript and it failed. I went round and round telling it that it had failed, but it got stuck in a loop and just kept saying "Oh, sorry. How about this -- repeat of previous code"
/s aside, it’s what we all experience too. There’s a great divide between programming pre-around-2015 and thereafter. LLMs can only do recent programming, which is a product of tons of money getting loaded into the industry and creating jobs that made no sense ten years ago. Basically, the more repetitive boilerplate patterns configuration options import blocks row-obj-dto-obj conversion typecheck bullshit you write per day, the more LLMs help. I mean, one could abstract all that away using regular programming, but how would they sell their work for $^6 an AI for $^9 then?
Just yesterday, after reading yet another “oh you must try again” comment, I asked 4o about how to stop puppeteer from dumping errors into console and exit gracefully when I close the headful browser (all logs and code provided). Right away it slid into nonsense. I always finish my chats with what I think about it uncut, just in case someone uses these for further learning.
I’ll also often add hints at the top of the file in the form of comments or sample data to help keep it on the right track.
# Given a data set of size `size` >= 0, and a `text` string describing
# the subset size, return a 2-element tuple containing a text string
# describing the complement size and the actual size as an integer. The
# text string can be in one of four forms (after stripping leading and
# trailing whitespace):
#
# 1) the empty string, in which case return ("", 0)
# 2) a stringified integer, like "123", where 0 <= n <= size, in
# which case return (str(size-int(n)), size-int(n))
# 3) a stringified decimal value like "0.25" where 0 <= x <= 1.0, in
# which case compute the complement string as str(1 - x) and
# the complement size as size - (int(x * size)). Exponential
# notation is not supported, only numbers like "3.0", ".4", and "3.14"
# 4) a stringified fraction value like "1/3", where 0 <= x <= 1,
# in which case compute the complement string and value as #3
# but using a fraction instead of a decimal. Note that "1/2" of
# 51 must return ("1/2", 26), not ("1/2", 25).
#
# Otherwise, return ("error", -1)
def get_complement(text: str, size: int) -> tuple[str, int]:
    ...
For example:
get_complement("1/2", 100) == ("1/2", 50)
get_complement("0.6", 100) == ("0.4", 40)
get_complement("100", 100) == ("0", 0)
get_complement("0/1", 100) == ("1/1", 100)
Some of the harder test cases I came up with were:
get_complement("0.8158557553804697", 448_525_430): this tests that the underlying system uses decimal.Decimal rather than a float, because float64 ends up on a 0.5 boundary and applies round-half-even, resulting in a different value than the true decimal calculation, which does not end up with a 0.5. (The value is "365932053.4999999857944710")
get_complement("nan", 100): this is a valid decimal.Decimal but not allowed by the spec.
get_complement("1/0", 100): handle division-by-zero in fractions.Fraction
get_complement("0.", 100): this tests that the string complement is "1." or "1.0" and not "1"
get_complement("0.999999999999999", 100): this tests the complement is "0.000000000000001" and not "1E-15".
get_complement("0.5E0", 100): test that decimal parsing isn't simply done by decimal.Decimal(text) wrapped in an exception handler.
Also, this isn't the full spec. The real code reports parse errors (like recognizing the "1/" is an incomplete fraction) and if the value is out of range it uses the range boundary (so "-0.4" for input is treated as "0.0" and the complement is "1.0"), along with an error flag so the GUI can display the error message appropriately.
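For reference, here is a minimal sketch of the four-form spec as written in the comment block above. It is not the real code, which also reports parse errors and clamps out-of-range values. Several choices are my own assumptions rather than part of the spec: the regular expressions used to recognize each form, keeping the denominator unreduced in the fraction complement (so "2/4" would give "2/4"), rendering the complement of "0." as "1.0", and returning ("error", -1) for any out-of-range value.

```python
import re
from decimal import Decimal, localcontext

# Assumed grammars for the three non-empty forms (ASCII digits only).
_INTEGER = re.compile(r"[0-9]+")
_DECIMAL = re.compile(r"[0-9]+\.[0-9]*|\.[0-9]+")
_FRACTION = re.compile(r"([0-9]+)/([0-9]+)")

ERROR = ("error", -1)


def get_complement(text: str, size: int) -> tuple[str, int]:
    text = text.strip()
    if text == "":
        return ("", 0)

    # Form 2: a plain integer count, 0 <= n <= size.
    if _INTEGER.fullmatch(text):
        n = int(text)
        if n > size:
            return ERROR
        return (str(size - n), size - n)

    # Form 3: a decimal without exponent. The regex rejects "nan" and
    # "0.5E0" before Decimal ever sees them.
    if _DECIMAL.fullmatch(text):
        with localcontext() as ctx:
            # Enough digits that neither 1 - x nor x * size is rounded.
            ctx.prec = len(text) + len(str(size)) + 2
            x = Decimal(text)
            if x > 1:
                return ERROR
            # "f" formatting avoids exponent output such as "1E-15".
            complement = format(Decimal(1) - x, "f")
            if "." not in complement:
                complement += ".0"  # assumption: "0." maps to "1.0"
            return (complement, size - int(x * size))

    # Form 4: a fraction, handled with exact integer arithmetic.
    match = _FRACTION.fullmatch(text)
    if match:
        num, den = int(match.group(1)), int(match.group(2))
        if den == 0 or num > den:
            return ERROR
        return (f"{den - num}/{den}", size - (num * size) // den)

    return ERROR
```

With this sketch, get_complement("1/2", 51) gives ("1/2", 26) and get_complement("0.999999999999999", 100) gives ("0.000000000000001", 1), because neither path goes through float64.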
Also with your example above I probably would break the function down into smaller parts, for two reasons 1) you can more easily unit test the components; 2) generally I find AI performs better with more focused problems.
So I would probably first write a signature like this:
# input examples = "1/2" "100" "0.6" "0.99999" "0.5E0" "nan"
def string_ratio_to_decimal(text: str) -> number
Pasting that into Claude, without any other context, produces this result: https://claude.site/artifacts/58f1af0e-fe5b-4e72-89ba-aeebad...
Sure. Internally I have multiple functions. Though I don't like unit testing below the public API as it inhibits refactoring and gives false coverage feedback, so all my tests go through the main API.
> Pasting that into Claude, without any other context
The context is the important part. Like the context which says "0.5E0" and "nan" are specifically not supported, and how the calculations need to use decimal arithmetic, not IEEE 754 float64.
Also, the hard part is generating the complement with correct formatting, not parsing float-or-fraction, which is first-year CS assignment.
> # Handle special values
Python and C accept "Infinity" as an alternative to "Inf". The correct way is to defer to the underlying system then check if the returned value is infinite or a NaN. Which is what will happen here because when those string checks fail, and the check for "/" fails, it will correctly process through float().
Yes, this section isn't needed.
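To illustrate the "defer to the underlying system, then check" approach, here is a small sketch. The function name parse_finite_float is hypothetical, and rejecting non-finite values by returning None is an assumption about what the caller wants; the point is only that float() already understands every spelling of infinity and NaN, so no string comparisons are needed.

```python
import math
from typing import Optional


def parse_finite_float(text: str) -> Optional[float]:
    """Parse with float(), then reject inf/nan by inspecting the value."""
    try:
        value = float(text)
    except ValueError:
        return None
    # float() accepts "inf", "Infinity", "nan" and so on, in any letter
    # case, so the check is done on the result, not on the input string.
    if not math.isfinite(value):
        return None
    return value
```

This handles "Infinity" and "-Inf" the same way as "inf", without having to list the spellings by hand.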
> # Handle empty string
My spec says the empty string is not an error.
> numerator, denominator = text.split("/"); num = float(numerator); den = float(denominator)
This allows "1.2/3.4" and "inf/nan", which were not in the input examples and therefore support for them should be interpreted as accidental scope creep.
They were also not part of the test suite, which means the tests cannot distinguish between these two clearly different implementations:
num = float(numerator)
den = float(denominator)
and:
num = int(numerator)
den = int(denominator)
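As a rough sketch of what a distinguishing test would look like: the two helpers below are hypothetical stand-ins for the float-based and int-based parsing, and accepts() is a small helper I made up for the comparison. Both agree on "1/2", so a test suite containing only such inputs passes either way; an input like "1.2/3.4" is what separates them.

```python
from typing import Callable


def ratio_via_float(text: str) -> float:
    # Stand-in for the float-based parsing.
    numerator, denominator = text.split("/")
    return float(numerator) / float(denominator)


def ratio_via_int(text: str) -> float:
    # Stand-in for the int-based parsing.
    numerator, denominator = text.split("/")
    return int(numerator) / int(denominator)


def accepts(parse: Callable[[str], float], text: str) -> bool:
    """Return True if `parse` handles `text` without raising ValueError."""
    try:
        parse(text)
    except ValueError:
        return False
    return True
```

A test asserting that "1.2/3.4" or "inf/nan" is rejected would pass with the int version and fail with the float version.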
Here's a version which follows the same style as the linked-to code, but is easier to understand:
if not isinstance(text, str):
    return None
# Remove whitespace
text = text.strip()
# Handle empty string
if not text:
    return None
# Handle ratio format (e.g., "1/2")
if "/" in text:
    try:
        numerator, denominator = text.split("/")
        num = int(numerator)
        den = int(denominator)
        if den == 0:
            return float("inf") if num > 0 else float("-inf") if num < 0 else float("nan")
        return num / den
    except ValueError:
        return None
# Handle regular numbers (inf, nan, scientific notation, etc.)
try:
    return float(text)
except ValueError:
    return None
It still doesn't come anywhere near handling the actual problem spec I gave.
That said, if you are using Gen AI without an advanced RAG system feeding it lots of constraints and patterns/templates I wish you luck.
"Unlock a codebase that your engineers and AI love."
I think they do often act opinionated and show some decision-making ability, so AI alignment really is important.
Casually, we usually think of a programming language as being one thing, but in reality a programming language generally only specifies a syntax. All of the other features of a language emerge from the people using it. And because of that, two different people can end up speaking two completely different languages even when sharing the same syntax.
This is especially apparent when you witness someone who is familiar with programming in language X, who then starts learning language Y. You'll notice, at least at first, they will still try to write their programs in language X using Y syntax, instead of embracing language Y in all its glory. Now, multiply that by the millions of developers who will touch code in a popular language like Python, Java, or Typescript and things end up all over the place.
So while you might have a lot more code to train on overall, you need a lot more code for the LLM to be able to discern the different dialects that emerge out of the additional variety. Quantity doesn't imply quality.
Basically, for any statement about AI helpfulness, you need to quantify how far it can help you. Depending on your personality, anything else is likely either always a success (if you have a positive outlook) or a failure (if you focus on the negative).
It would be great if they were good at the hard stuff too, but if I had to pick, the basics are where I want them the most. My brain just really dislikes that stuff, and I find it challenging to stay focused and motivated on those things.
But these tools often don't generate working, let alone bug-free, code. Even for simple things, you still need to review and fix it, or waste time re-prompting them. All this takes time and effort, so I wonder how much time you're actually saving in the long run.
This is my experience with generation as well - but I still don't trust it for the easy stuff and thus the model ends up being a hindrance in all scenarios. It is much easier for me to comprehend something I'm actively writing so making sure a generative AI isn't hallucinating costs more than me just writing it myself in the first place.
I never use it for something that really requires knowledge of the code base, so the quality of the code base doesn't really matter. Also, I don't think it has ever provided me something I wouldn't have been able to do myself pretty quickly.
To be clear, our context window can be really huge if you are living the project. But not if you are new to it or even getting back to it after a few years.
In theory, the codebase should be, as it is, understandable (and it is, with a great deal of rigorous study). In reality, that's simply not the case, not for any non-trivial software system.
What I'm really saying is that our software development software is missing a very important dimension.
And all of that understanding will come from people complaining about you fixing a bug.
Au contraire. I hate writing boilerplate. I hate digging through APIs. I hate typing the same damn thing over and over again.
The easy stuff is mind numbing. The hard stuff is fun.
Coincidentally this also happens with developers in unfamiliar territory.
I'd like to think most developers know how to say "I don't know, let's do some research" but in reality, many probably just take a similar approach to the LLM - feign competence and just hack out whatever is needed for today's goal, don't worry about tomorrow.
Really smart junior developers actually have a shot at learning better and moving on from this stage.
If high quality closed off codebases were used in training, would we see an improvement in LLM quality for more complex use cases?
I disagree, but it’s largely a matter of expectations. I don’t expect them to solve hard problems for me. That’s currently still my job. But when I’m writing new code, even for a legacy system, they can save a lot of time in getting the initial coding done, helping write comments, unit tests, and so on.
It’s not doing difficult work, but it saves a lot of toil.
I agree, but I find it's still a great productivity boost for certain tasks. Cutting through the hype, figuring out which tasks are well suited to these tools, and prompting optimally has taken me a long time.
E.g. pointing the AI at your code and getting it to write unit tests or writing more boilerplate, faster.
I'm a cofounder at www.ellipsis.dev - we tried to build code generation for a LONG time before we realized that AI Code Review is way more doable with SOTA
Any sources? Seems unlikely that LLMs would be good at something with so little training data in the widely available internet.
it's like you are building a website that's not using MVC and complaining that LLM advice is garbage...
because they are easier to maintain, there should be no clever tricks or arch.
all software arch should be boring and simple, with as few tricks as possible, unless it is absolutely warranted
I read somewhere that 1/6 of the time should be allocated to refactoring (every 6th cycle). I wonder how that should be done with LLMs.
Good luck to anyone having to maintain legacy LLM-generated codebases in the future, I won't.
Just like most of the web frameworks and ORMs I've been forced to use over the years.
Or, for a joke, LLMs plagiarize!
But they still fail at devops because so many config scripts are at newer versions than the training set
I never wanted the LLM to take over the (fun) part - thinking through the hard/unusual parts of the problem - but you’re also not wrong that they’re needed the least for the boilerplate. It’s still nice :)
The LLM gains are in efficiency for rote tasks, not solving the other hard problems that make up 98% of the day. The idea that LLMs are going to advance software in any substantial way seems implausible to me - It's an efficiency tool in the same category as other IDE features, an autocomplete search engine on steroids, not even remotely approaching AGI (yet).
I disagree. They won't do that for existing developers. But they will make it so that tech-savvy people will be able to do much more. And they might even make it so that one-off customization per person will become feasible.
Imagine you want to sort hackernews comments by number of characters, inline in your browser. Tell the AI to add this feature and maybe it will work (just for you). That's one of the ways I can see substantial changes happening in the future.
This is the thing about the kind of free advertising so many on this site provide for these llm corpos.
I’ve seen so many comparisons between “ai” and “stack overflow” that mirror this sentiment of “it’s still nice :)”.
Who’s laying off and replacing thousands of working staff for “still nice :)” or because of “stack overflow”?
Who’s hiring former alphabet agency heads to their board for “still nice :)”?
Who’s forcing these services into everything for “still nice :)”?
Who’s raising billions for “still nice :)”?
So while developers argue tooth and nail for these tools that they seemingly think everyone only sees through their personal lens of a “still nice :)” developer tool, the companies are leveraging that effort to oversell their product beyond the scope of “still nice :)”.
I'd argue that a lot of this is not "tech debt" but just signs of maturity in a codebase. Real world business requirements don't often map cleanly onto any given pattern. Over time codebases develop these "scars", little patches of weirdness. It's often tempting for the younger, less experienced engineer to declare this as tech debt or cruft or whatever, and that a full re-write is needed. Only to re-learn the lessons those scars taught in the first place.
Fast forward to now and we're basically back to where we started. Only now they're working on code that was written in a different language, which I suppose is (to misappropriate a Royce quote) "worth something, but not much."
That said, this is also a great example of why I get so irritated with colleagues who believe it's possible for code to be "self-documenting" on anything larger than a micro-scale. That's what the original code tried to do, and it meant that its current maintainers were left without any frickin' clue why all those epicycles were in there. Sure, documentation can go stale, but even a slightly inaccurate accounting for the reason would have, at the very least, served as a clear reminder that a reason did indeed exist. Without that, there wasn't much to prevent them from falling into the perennially popular assumption that one's esteemed predecessors were idiots who had no clue what they were doing.
Just to emphasize the point: even if it's not obvious why there is a line of code, it should at least be obvious that the line of code does something. It's important to find out what that something is and remember it for a refactor. At the very least, the knowledge could help you figure out a bug a day or two before you decide to pore over every line in the diff.
https://www.chesterton.org/taking-a-fence-down/ has the full cite on the names.
I never let that happen again.
This idea of easy, worry-free database replatforming strikes me as kind of a shibboleth for identifying people who’ve never done it before. In reality they all have subtle differences in semantics and query optimization behavior that mean that every touch point needs close attention to make sure you understand how the behavior in that part of the system changes (assume it will change) and if that change is acceptable. Thinking abstraction layers can eliminate the need for close attention to a DBMS port is the software engineering equivalent of thinking adaptive cruise control means you can play Slay the Spire while driving to the office.
Yeah like Oracle would ever let that happen
I leave myself notes when I do bug fixes for this exact reason.
Which is borderline the reason for version control: Do a git/svn blame on that line, find what commit it was added, and see what the commit message was. Bonus points if it links to a case on a system you still use. Sure the commit message can be useless, but it's at least something you're forced to enter when committing code, rather than external documentation that can be missed and now be misleading. Version control can even show you that codebase at time that change was made so you can see it in context (which has saved me a few times, showing what something was added for so I could confirm a suspicion).
https://www.joelonsoftware.com/2000/04/06/things-you-should-...
Big Open Source is plotting against the working class developer.
My code doesn't acquire bugs by sitting there in 2024 any more than it did in 2004. On most projects these days I'm using Django + Preact + HTM. Preact and HTM get loaded from static files by my root Django template. My PyPi dependencies are pinned to specific versions, and usually I have <10 (usually it's just Django and Django REST framework, sometimes it's even just Django).
> Back to that two page function. Yes, I know, it’s just a simple function to display a window, but it has grown little hairs and stuff on it and nobody knows why. Well, I’ll tell you why: those are bug fixes. One of them fixes that bug that Nancy had when she tried to install the thing on a computer that didn’t have Internet Explorer. Another one fixes that bug that occurs in low memory conditions. Another one fixes that bug that occurred when the file is on a floppy disk and the user yanks out the disk in the middle. That LoadLibrary call is ugly but it makes the code work on old versions of Windows 95.
> Each of these bugs took weeks of real-world usage before they were found. The programmer might have spent a couple of days reproducing the bug in the lab and fixing it. If it’s like a lot of bugs, the fix might be one line of code, or it might even be a couple of characters, but a lot of work and time went into those two characters.
> When you throw away code and start from scratch, you are throwing away all that knowledge. All those collected bug fixes. Years of programming work.
As a funny aside, I actually noticed this in a completely different field, serial stories on the web (mostly on RoyalRoad)!
Occasionally an author will attempt a rewrite of a story either because the feedback was very critical, or they did not like where their own story had ended up.
I have yet to see a single example for a truly successful rewrite, where the rewrite was really significantly (or at all) better than the original. Usually the rewrite will not get any better ratings or more readers than the first draft - and for good reasons.
There will be improvements, but it will be on the edges. At the core it still remains the same story with the same problems, and some style changes or some improved dialogs don't change that.
----
By the way, there is an old 2016 HN thread with 106 comments "When to Rewrite from Scratch – Autopsy of Failed Software" -- https://news.ycombinator.com/item?id=11553813
----
A rewrite story I heard a long time ago and that I think would actually work best when the issues are severe was from a company that lost all their code (I don't remember the context, it was not data loss). They had spent many years to get to where they had been when they lost everything. They thought it would take almost as many years to get there again, but they started anyway. Turned out they were done in only half a year this time, and much better!
I think having to work with and around your old code (or story, in the RoyalRoad example) is a severe limit on how much you can improve. Your thoughts are not free, most of your mental effort will be around reusing the old code.
That is my own experience too: Writing the software is not my bottleneck. It's finding out what to write in the first place, and the many many small agonizing decisions along the way. I now see that meta knowledge is far more important. For very large projects it may be more difficult though.
I did this myself once, in the early Internet growth days. The company had its own equivalent of PHP (which was still pretty new at the time) and a business software based on it. I was tasked with refactoring the 1.0 version. I threw the code away after a brief look and rewrote from scratch. I did it because I believed having to consider the existing code would be much slower than writing new.
I have no complaints about the 1.0 version; a first version is always limited by the still-low comprehension of the problem at that point. I think version 2.0 releases might benefit the most from just throwing the 1.x code away and starting fresh, if the understanding of the problem evolved substantially during, and through, the development.
Do you have an opinion when this maturity is too mature?
Let's say, you would need to add a major feature that would drastically change the existing code base. On top of that, by changing the language, this major feature would be effortless to add.
When is it worth fighting with the scars, and when should you just rewrite?
Rewrites tend to focus all in on implementation.
Wow. It's hard to believe that people are earnestly supposing this. From everything we have evidence of so far, AI generated code is destined to be a prolific font of tech debt. It's irregular, inconsistent, highly sensitive to specific prompting and context inputs, and generally produces "make do" code at best. It can be extremely "cheap" vs traditional contributions, but gets to where it's going by the shortest path rather than the most forward-looking or comprehensive.
And so it does indeed work best with young projects where the prevailing tech debt load remains low enough that the project can absorb large additions of new debt and incoherence, but that's not to the advantage of young projects. It's setting those projects up to be young and debt-swamped much sooner than they would otherwise be.
If mature projects can't use generative AI as extensively, that's going to be to their advantage, not their detriment -- at least in terms of tech debt. They'll be forced to continue plodding along at their lumbering pace while competitors bloom and burst in cycles of rapid initial development followed by premature seizure/collapse.
And to be clear: AI generated code can have real value, but the framing of this article is bonkers.
> I let AI write the parsing and hoooo boy do I regret it.
He's kindly fixed the server 500's now though xD
Instead of genAI doing the rubbish, boring, low status part of the job, you should do the bits of the job no one will reward you for, and then watch as your boss waxes lyrical about how genAI is amazing once you've done all the hard work for it?
It just feels like if you're re-directing your efforts to help the AI, because the AI isn't very good at actual complex coding tasks then... what's the benefit of AI in the first place? It's nice that it helps you with the easy bit, but the easy bit shouldn't be that much of your actual work and at the end of the day... it's easy?
This gives very similar vibes to: "I wanted machines to do all the soul crushing monotonous jobs so we would be free to go and paint and write books and fulfill our creative passions but instead we've created a machine to trivially create any art work but can't work a till"
Machine learning is the high interest credit card of technical debt.
So every time we generate the same boilerplate we really do copy/paste adding to maintenance costs.
We are amazed looking at the code generation capabilities of LLMs forgetting the goal is to have less code - not more.
Do you package it in a reusable library so that you don't have to do the same prompting again?
Or rather - just because it is so easy to do - you don't bother?
If that's the latter - that's exactly the pattern I am talking about.
We are a long ways from automating our jobs away, instead our expertise evolves.
I suspect doctors go through a similar evolution as surgical methods are updated.
I would love to read or participate in the discussion of how to be strategic in this new world. Specifically, how to best utilize code generating tools as a SWE. I suppose I can wait a couple of years for new school SWEs to teach me, unless anyone is aware of content on this?
The blind copy-paste has generally been a bad idea though. Still need to read the code spit out, ask for explanations, do some iterating.
But if you have a code base with predictable software architectural patterns, the AI will likely recognise and help with all the boilerplate.
Of course there is a lot of middle ground between bad and good.
For additional context I have not been a software engineer professionally for over a decade but still am in the engineering field.
Usually I will feed in a few functions (or just 1), sometimes a whole module if it is small enough, and prompt it for general performance and maintainability improvements. I just kinda iterate from there. I also restart chats often.
> This experience has lead most developers to “watch and wait” for the tools to improve until they can handle ‘production-level’ complexity in software.
You will be waiting until the heat death of the universe.
If you are unable to articulate the exact nature of your problem, it won't ever matter how powerful the model is. Even a nuclear weapon will fail to have effect on target if you can't approximate its location.
Ideas like dumpstering all of the codebase into a gigantic context window seem insufficient, since the reason you are involved in the first place is because that heap is not doing what the customer wants it to do. It is currently a representation of where you don't want to be.
Because with AI you can turn any problem into a black box. You build a model, and call it "solved". But then reality hits ...
I thought that at the beginning the code might be a bit messy because there is a need to iterate fast, and that quality comes with time. What's the experience of the crowd on this?
Coding haphazardly can be a lot more thrilling, though! I certainly don't enjoy the process of maintaining high quality code. It is lovely in hindsight, but an awful slog in the moment. I suspect that is why startups often need to sacrifice quality: The aforementioned thrill is the motivation to build something that has a high probability of being a complete waste of time. It doesn't matter how fast you can theoretically iterate if you can't compel yourself to work on it.
Anecdotally, I find you can get about 3 days of speed from cutting corners - after that, as you say, you get slowed down more than you got sped up. First day, you get massive speed from going haphazard; second day, you're running out of corners to cut, and on the third day you start running into problems you created for yourself on the first day.
There's little reason to try to go straight for the final product when you don't know exactly how to get there, and that's frequently the case. Build toys to learn what you need efficiently, toss them, and then build the real thing. Trying to shoot for the final product while also changing direction multiple times along the way tends to create code with multiple conflicting goals subtly encoded in it, and it'll just confuse you and others later.
You gotta make the right trade-off at the right time.
Active tradeoff analysis and a structure that allows for honest reflection on current needs is the holy grail.
Choices are rarely about what is best and are rather about finding the least worst option.
A user deleted their account and there’s now a request to register a new account with that username? We didn’t think of that (UX has concerns about impostors and abuse that need handling). Better add a catch in the code and handle it. Do this 100 times and your code has 100 pieces of custom branching logic that potentially interact in n^2 ways, since each exceptional event could occur in conjunction with other exceptional events.
It’s why I caution strongly against rewrites. It’s easy to look at code and say it’s too complex for what it does, but is the complexity actually needless? Can you think of a way to refactor the complexity out? If so, do that refactor; if not, a rewrite won't solve it.
If the new codebase is messy because the team is moving fast as parent describes, that means the dev team is doing sloppy work in order to move fast. That type of speed is very short lived, because it's a lot harder to add 100 bugfixes to an already-messy codebase.
Young companies tend to have systems that are small enough, or backed by enough institutional knowledge, to pivot when needed, and they tend to have small teams with good lines of communication that allow for a shared purpose and values.
Architectural erosion is a long tailed problem typically.
Large legacy companies that can avoid architectural erosion do better than some startups who don't actively target maintainability, but it tends to require stronger commitment from Leadership than most orgs can maintain.
In my experience most large companies confuse the need to maintain adaptability with a need to impose silly policies that are applied irrespective of the long term impacts.
Integration and disintegration drivers are too fluid, context sensitive, and long term for prescription at a central layer.
The possibly mythical Amazon API edict is an example where focusing on separation and product focus could work, with high costs if you never get to the scale where it pays off.
The runways and guardrails concept seems to be a good thing in the clients I have worked for.
Maybe… to put it another way, it’s that time spent on quality isn’t time spent on discovery, but it’s only time spent on quality that gets you quality. So while a company is heavily focused on discovery - iteration, p/m fit, engineers figuring it out, etc - it’s not making a good codebase, and if they never carve out time to focus on quality, that won’t change.
That’s not entirely true - IMO, there’s a synergistic, not exclusionary relationship between the two - but it gets the idea across, I think.
That's the point when a ton of disinterested, inexperienced, and less handpicked people start pushing code in - driven not by the need to build good software, but to close jira tickets.
This invariably results in stagnating productivity at best, and upper management wondering why they are often not delivering on the pre-expansion level, let alone one that would be expected of 3x the headcount.
It's very hard to retrofit quality into existing code. It really should be there from the very start.
The codebases that use the MOST COMMONLY USED LIBRARIES benefit the most from generative AI tools
That means one might find themselves using deprecated but still supported features.
If LLMs came out during the Python 2/3 schism for example, they'd be generating an ever increasing pile of Python 2 code.
But unless you mention that pagination needs to be handled as well, the LLM will naively just implement the bare minimum.
Context matters. And supplying enough context is what makes all the difference when interacting with these kind of solutions.
> I asked the AI to write me some code to get a list of all the objects in an S3 bucket
they didn’t ask for all the objects in the first returned page of the query
they asked for all the objects.
the necessary context is there.
LLMs are just on par with devs who don’t read tickets properly / don’t pay attention to the API they’re calling (i’ve had this exact case happen with someone in a previous team and it was a combination of both).
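For reference, here is a minimal sketch of what "all the objects" implies in practice. It assumes a boto3-style S3 client whose `list_objects_v2` responses carry `Contents`, `IsTruncated`, and `NextContinuationToken`; the helper name `iter_all_keys` is made up for illustration.

```python
from typing import Any, Iterator


def iter_all_keys(client: Any, bucket: str) -> Iterator[str]:
    """Yield every object key in the bucket, following continuation tokens."""
    kwargs = {"Bucket": bucket}
    while True:
        page = client.list_objects_v2(**kwargs)
        # "Contents" is absent when a page (or the whole bucket) is empty.
        for obj in page.get("Contents", []):
            yield obj["Key"]
        # One response holds at most 1,000 keys, so keep going until
        # the service reports that the listing is no longer truncated.
        if not page.get("IsTruncated"):
            break
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
```

With boto3 installed this would be called as `list(iter_all_keys(boto3.client("s3"), "my-bucket"))`, where the bucket name is a placeholder. boto3's own `get_paginator("list_objects_v2")` does the same job; the explicit loop just makes visible the step that the "bare minimum" version leaves out.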
In other, more obscure cases I just add the documentation to its context and let it work based on that.
I completely agree. That's why my stance is to wait and see, and in the meanwhile get our shit together, as in make our code maintainable by any intelligent being, human or not.
Missing test? Great, I'll get help identifying what the code should be doing, then use AI to write a boatload of tests in service towards those goals. Then I'll use it to help refactor some of the code.
But unlike the article, this requires actively engaging with the tool rather than taking, as they say, a "sit and wait" (i.e., lazy) approach to developing.
For example a RAG pipeline. People are rushing things to market that are not built to last. The likes of LangChain etc. offer little software engineering polishing. I wish there were a more mature enterprise framework. Spring AI is still in the making and Go is lagging behind.
Asking it for higher level planning / architecture is just asking for pain
The world is complex and we have to write a lot of code to capture that complexity. LLMs are good at the first 20% but balk at the 80% effort to match reality
http://jmc.stanford.edu/articles/lisp.html
> This paper concentrates on the development of the basic ideas of LISP... when the programming language was implemented and applied to problems of artificial intelligence.
OTOH if devs are getting the simpler stuff done faster maybe they have more time to work on debt.
LLMs can’t understand why your firewall rules have strange forwards for ancient enterprise systems, nor can they “automate” Operations on legacy systems or custom implementations. The only way to fix those issues is to throw money and political will behind addressing technical debt in a permanent sense, which no organization seemingly wants to do.
These things aren’t silver bullets, and throwing more technology at an inherently political problem (tech debt) won’t ever solve it.
I find this works because it's much easier to debug a subtle GPT bug in a well-validated interface than the same bug buried in a nested for loop somewhere.
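A rough sketch of what a validated interface around generated code could look like. Both function names are hypothetical, and `normalize_scores` merely stands in for whatever the model wrote.

```python
def normalize_scores(scores: list[float]) -> list[float]:
    """Stand-in for a model-written helper: scale scores so they sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]


def checked_normalize(scores: list[float]) -> list[float]:
    """Validate inputs and outputs around the generated helper."""
    # Preconditions: reject inputs the helper was never meant to handle.
    if not scores:
        raise ValueError("scores must not be empty")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    if sum(scores) == 0:
        raise ValueError("scores must not all be zero")

    result = normalize_scores(scores)

    # Postconditions: a subtle bug in the helper surfaces here, at the
    # boundary, instead of somewhere downstream.
    if len(result) != len(scores):
        raise AssertionError("output length differs from input length")
    if abs(sum(result) - 1.0) > 1e-9:
        raise AssertionError("normalized scores do not sum to 1")
    return result
```

Callers only ever touch `checked_normalize`, so when something breaks, the failure points at one small function rather than at whatever consumed its output.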
This is for tiny code snippets, hello-world size, stringing together some primitives to render relatively simple objects.
Turns out, if the codebase / framework is a bit obscure and poorly documented, even the genie can't help.
So you say, but {citation needed}. Stuff like this is simply not known yet.
AI can easily be applied in legacy codebases, like to help with time-consuming refactoring.
Or, y'know, just not bother with any of this bullshit. "We must rewrite everything so that CoPilot will sometimes give correct answers!" I mean, is this worth the effort? Why? This seems bonkers, on the face of it.
It doesn't matter, it's the new hotness. Look at scrum, how shit it is for software and for devs, yet it's absolutely everywhere.
Remember "move fast and break things?" Everyone started taking that as gospel and writing garbage code. It seems the industry is run by toddlers.
/rant
This isn't AI's doing.
It's the doing of adding any new feature to a product with existing tech debt.
And since AI for most companies is a feature, like any feature, it only makes the tech debt worse.
Sheesh! The Lizard People walk among us.
While there is no guarantee that the same trajectory is true for programming, we need to heed how emotionally attached we can be to denying the possibility.
Creating react pages is the new COBOL
How does one determine if that's even possible, much less estimate the work involved to get there?
After all, 'subtle control flow, long-range dependencies, and unexpected patterns' do not always indicate tech-debt.
"GARBAGE IN -- GARBAGE OUT!!"
The tooling for this will only improve.
They're trying other techniques to improve what we already have atm, but we're almost at the limit of its capabilities.
I'm sure nothing will change in the future either.
https://www.reuters.com/technology/artificial-intelligence/o...
They're trying other techniques to improve what we already have atm.
The moment you need to do something novel or complicated they choke up.
This is why I'm not very confident that tools like Vercel's v0 (https://v0.dev/) are useful for more than just playing around. It seems very impressive at first glance - but it's a mile wide and only an inch deep.
Some boilerplate is good.
Boilerplate code exists when the next step is often to start customizing it in a unique and unpredictable way.
- 80% of your work is easy, and is accomplished in 20% of the time
- 20% of your work is hard, and takes 80% of the time
If you believe AI can x2 that easy 80% of your work, you have only managed to reduce that 20% of your time to 10%. Someone else can work out the overall improvement (x1.11?), but it's nowhere near x2.
This means you're getting paid 2x more, right?
...Right?
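The 80/20 arithmetic above can be checked in a few lines. This is just an Amdahl's-law-style estimate under the comment's own assumptions (easy work takes 20% of the time, AI makes it 2x faster); the function name is made up.

```python
def overall_speedup(easy_time_share: float, easy_speedup: float) -> float:
    """Amdahl-style estimate: only the easy share of the time gets faster."""
    hard_time_share = 1.0 - easy_time_share
    new_total = hard_time_share + easy_time_share / easy_speedup
    return 1.0 / new_total


# Easy work takes 20% of the time and becomes 2x faster.
print(round(overall_speedup(0.20, 2.0), 2))  # 1.11

# Even if the easy work became free, the hard 80% caps the gain.
print(round(overall_speedup(0.20, float("inf")), 2))  # 1.25
```

So the x1.11 guess holds, and the ceiling is x1.25 no matter how good the tool gets at the easy part.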