And people just sit around, unimpressed, and complain that ... what ... it isn't a perfect superintelligence that understands everything perfectly? This is the most amazing technology I've experienced as a 50+ year old nerd who has been sitting deep in tech for basically my whole life. This is the stuff of science fiction, and while there totally are limitations, the speed at which it is progressing is insane. And people are like, "Wah, it can't write code like a Senior engineer with 20 years of experience!"
Crazy.
And this Tweeter's complaints do not sound like a demand for superintelligence. They sound like a demand for something far more basic than the hype has been promising for years now.
- "They continue to fabricate links, references, and quotes, like they did from day one."
- "I ask them to give me a source for an alleged quote, I click on the link, it returns a 404 error." (Why have these companies not manually engineered out a problem like this by now? Just do a check to make sure links are real. That's pretty unimpressive to me.)
- "They reference a scientific publication, I look it up, it doesn't exist."
- "I have tried Gemini, and actually it was even worse in that it frequently refuses to even search for a source and instead gives me instructions for how to do it myself."
- "I also use them for quick estimates for orders of magnitude and they get them wrong all the time."
- "Yesterday I uploaded a paper to GPT to ask it to write a summary and it told me the paper is from 2023, when the header of the PDF clearly says it's from 2025."
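The parenthetical suggestion (just check that cited links actually resolve) is simple to sketch. Here's a minimal post-hoc validator using only the Python standard library; a real deployment would need retries, rate limiting, and handling for sites that block HEAD requests, so treat this as an illustration of the idea rather than a production check:

```python
import urllib.error
import urllib.request


def link_resolves(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with a non-error HTTP status.

    A cited link that 404s (or fails to parse at all) is exactly the
    fabrication failure mode complained about above; a generation
    pipeline could drop or re-prompt any citation failing this check.
    """
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except urllib.error.HTTPError as err:
        return err.code < 400  # e.g. 404 -> dead or fabricated link
    except (urllib.error.URLError, ValueError):
        return False  # malformed URL, DNS failure, timeout, etc.
```

Filtering a model's cited sources through something like this before showing them to the user would catch the "click the link, get a 404" case, though not a real link that says something different from the claimed quote.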
A company I worked for spent millions on a customer service solution that never worked. I wouldn’t say that contracted software is useless.
A lot of the "LLMs are worthless" talk I see tends to follow this pattern:
1. Someone gets an idea, like feeding papers into an LLM, and asks it to do something beyond its scope and proper use-case.
2. The LLM, predictably, fails.
3. Users declare not that they misused the tool, but that the tool itself is fundamentally corrupted.
In my mind it's no different from the steamroller being invented, and people remarking how well it flattens asphalt. Then a vocal group tries to use this flattening device to iron clothing in bulk, and declares steamrollers useless when it fails at that task.
If the data and relationships in those insert queries matter, at some unknown future date you may find yourself cursing your choice to use an LLM for this task. On the other hand you might not ever find out and just experience a faint sense of unease as to why your customers have quietly dropped your product.
Maybe then they’ll snap out of it.
I’ve already seen people completely mess things up. It’s hilarious. Someone who thinks they’re in “founder mode” and a “software engineer” because ChatGPT or Cursor vomited out 800 lines of Python code.
I'd say maybe up to 5-10 years ago, there was an attitude of learning something to gain mastery of it.
Today, it seems like people want to skip levels which eventually leads to catastrophic failure. Might as well accelerate it so we can all collectively snap out of it.
And is a credulous executive class en masse buying into that steam roller industry marketing and the demos of a cadre of influencer vibe ironers who’ve never had to think about the longer term impacts of steam rolling clothes?
Thank you for mentioning that! What a great example of something an LLM can do pretty well that can otherwise take a lot of time looking up Ansible docs to figure out the best way to do things. I'm guessing the outputs aren't as good as what someone really familiar with Ansible could do, but it's a great place to start! It's such a good idea that it seems obvious in hindsight now :-)
I'm not sure if it's a facet of my ADHD, or mild dyslexia, but I find reading documentation very hard. It's actually a wonder I've managed to learn as much as I have, given how hard it is for me to parse large amounts of text on a screen.
Having the ability to interact with a conversational-type documentation system, then bullshit-check it against the docs afterwards, is a game changer for me.
Then there's also the issue that examples in documentation are often very contrived, and sometimes more confusing. So there's value in "work up this to do such and such an operation" sometimes. Then you can interrogate the functionality better.
1. LLMs have been massively overhyped, including by some of the major players.
2. LLMs have significant problems and limitations.
3. LLMs can do some incredibly impressive things and can be profoundly useful for some applications.
I would go so far as to say that #2 and #3 are hardly even debatable at this point. Everyone acknowledges #2, and the only people I see denying #3 are people who either haven't investigated or are so annoyed by #1 that they're willing to sacrifice their credibility as an intellectually honest observer.
This is not the type of report I'd use an LLM to generate. I'd use a database or spreadsheet.
Blindly using and trusting LLMs is a massive minefield that users really don't take seriously. These mistakes are amusing, but eventually someone is going to use an LLM for something important and hallucinations are going to be deadly. Imagine a pilot or pharmacist using an LLM to make decisions.
Some information needs to come from authoritative sources in an unmodified format.
But as soon as your use case goes beyond that LLMs are almost useless.
The main complaint is that, yes, it's extremely helpful in that specific subset of problems, but it's not actually pushing human knowledge forward. Nothing novel is being created with it.
It has created this illusion of being extremely helpful when in reality it is a shallow kind of help.
Not true. It's only worthless for the things you can't easily verify. If you have a test for a function and ask an LLM to generate the function, it's very easy to say whether it succeeded or not.
In some cases, just being able to generate the function with the right types will mostly mean the LLM's solution is correct. Want a `List(Maybe a) -> Maybe(List(a))`? There's a very good chance an LLM will either write the right function or fail the type check.
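The `List(Maybe a) -> Maybe(List(a))` shape translates directly to Python's `Optional`. A sketch of the target function, just to make concrete how a few assertions (or, in Haskell, the type signature alone) pin down whether a generated implementation succeeded:

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def sequence(items: list[Optional[T]]) -> Optional[list[T]]:
    """Collapse a list of maybe-values: if every element is present,
    return the list of values; if any element is None, return None.

    This is the List(Maybe a) -> Maybe(List(a)) shape from the comment
    above. The contract is narrow enough that a handful of test cases
    make it trivial to judge an LLM's attempt as pass or fail.
    """
    result: list[T] = []
    for item in items:
        if item is None:
            return None
        result.append(item)
    return result
```

The point being argued: ask an LLM for this function, run `sequence([1, None, 3])` and `sequence([1, 2, 3])` against it, and the verification cost is near zero, which is exactly when the tool is easy to use safely.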
Are you speaking for yourself or everyone?
In a research context, it provides pointers, and keywords for further investigation. In a report-writing context it provides textual content.
Neither of these, nor the thousand other uses, is worthless. It's when you expect a working and complete work product that it's (subjectively, maybe) worthless, but frankly aiming for that with current-gen technology is a fool's errand.
It's (currently) like an ad saying "this product can improve your stuff up to 300%"
* LLMs are dangerously useless for certain domains.
* ... but can be quite useful for others.
* The real problem is: they make it really tricky to tell, because most of all they are trained to sound professional and authoritative. They hallucinate papers because that's what authoritative answers look like.
That already means I think LLMs are far less useful than they appear to be. It doesn't matter how amazing a technology is: If it has failure modes and it is very difficult to know what they are, it's dangerous technology no matter how awesome it is when it is working well. It's far simpler to deal with tech that has failure modes but you know about them / once things start failing it's easy to notice.
Add to it the incessant hype, and, oh boy. I am not at all surprised that LLMs have a ridiculously wide range as to detractors/supporters. Supporters of it hype the everloving fuck out of it, and that hype can easily seem justified due to how LLMs can produce conversational, authoritative sounding answers that are explicitly designed to make your human brain go: Wow, this is a great answer!
... but experts read it and can see the problems there. Which lots of tech suffers from: as a random example: Plenty of highly upvoted apparently fantastically written Stack Overflow answers have problems. For example, it's a great answer... for 10 years ago; it is a bad idea today because the answer has been obsoleted.
But between the fact that it's overhyped and that it's particularly hard to determine whether an LLM answer is hallucinated drivel, it's logical to me that experts are hyperbolic when highlighting the problems. That's a natural reaction to a thing that SEEMS amazing but actually isn't.
To be fair, that's a huge problem with stack overflow and its culture. A better version of stack overflow wouldn't have that particular issue.
That's nice and impressive, but there are still important issues and shortcomings. Obligatory, semi-related xkcd: https://xkcd.com/937/
You're the first in the thread to have brought that up; there are far more charitable ways to have interpreted the post you're replying to.
LLMs are _tools_, not oracles. They require thought and skill to use, and not every LLM is fungible with every other one, just like flathead, Phillips, and hex-head screwdrivers aren't freely interchangeable.
Far better to just get these problems resolved.
The secondary-market snake oil salesmen <cough>Manus</cough>? That's another matter entirely, and a very high degree of skepticism about their claims is certainly warranted. But that's no different from many other huckster-saturated domains.
You're right that they're tools, but I think the complaint here is that they're bad tools, much worse than they are hyped to be, to the point that they actually make you less efficient because you have to do more legwork to verify what they're saying. And I'm not sure that "prompt training," which is what I think you're suggesting, is an answer.
I had several bad experiences lately. With Claude 3.7 I asked how to restore a running database in AWS to a snapshot (RDS, if anyone cares). It basically said "Sure, just go to the db in the AWS console and select 'Restore from snapshot' in the actions menu." There was no such button. I later read AWS docs that said you cannot restore a running database to a snapshot, you have to create a new one.
I'm not sure that any amount of prompting will make me feel confident that it's finally not making stuff up.
I agree that hallucination is still a problem, albeit a lot less of one than it was in the recent past. If you're using LLMs for tasks where you are not directly providing it the context it needs, or where it doesn't have solid tooling to find and incorporate that context itself, that risk is increased.
Hard disagree, and I feel like this assumption might be at the root of why some people seem so down on LLMs.
They’re a tool. When they’re useful to me, they’re so useful they save me hours (sometimes days) and allow me to do things I couldn’t otherwise, and when they’re not they’re not.
It never takes me very long to figure out which scenario I’m in, but I 100% understand and accept that figuring that out is on me and part of the deal!
Sure, if you think you can “vibe code” (or “vibe founder”) your way to massive success by getting LLMs to do stuff you’re clueless about without any way to check, you’re going to have a bad time, but the fact they can’t (so far) do that doesn’t make them worthless.
I recently did this with a (pretty large) exported CSV of calories/exercise data from MyFitnessPal and asked it to evaluate it against my goals/past bloodwork etc (which I have in a "Claude Project" so that it has access to all that information + info I had it condense and add to the project context from previous convos).
It wrote a script to extract out extremely relevant metrics (like ratio of macronutrients on a daily basis for example), then ran it and proceeded to talk about the result, correlating it with past context.
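A minimal sketch of the kind of script described, for anyone curious what "ratio of macronutrients on a daily basis" from an exported CSV looks like in code. The column names (`Date`, `Carbohydrates (g)`, `Fat (g)`, `Protein (g)`) are assumptions about a MyFitnessPal-style export, not taken from the original comment:

```python
import csv
from collections import defaultdict


def daily_macro_ratios(csv_path: str) -> dict:
    """Sum grams of carbs/fat/protein per day, then return each macro's
    share of that day's total macro grams.

    Assumes a MyFitnessPal-style export with one row per logged food and
    columns: Date, Carbohydrates (g), Fat (g), Protein (g).
    """
    totals = defaultdict(lambda: {"carbs": 0.0, "fat": 0.0, "protein": 0.0})
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            day = totals[row["Date"]]
            day["carbs"] += float(row["Carbohydrates (g)"] or 0)
            day["fat"] += float(row["Fat (g)"] or 0)
            day["protein"] += float(row["Protein (g)"] or 0)

    ratios = {}
    for date, day in totals.items():
        total = day["carbs"] + day["fat"] + day["protein"]
        ratios[date] = {k: (v / total if total else 0.0) for k, v in day.items()}
    return ratios
```

The interesting part of the workflow isn't this script itself; it's that the model wrote and ran something like it, then discussed the numeric output against prior context, rather than hallucinating statistics directly from the raw CSV.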
Use the tools properly and you will get the desired results.
Likely the most accurate measure of progress would be watching detractors' goalposts move over time.
Why have these companies not manually engineered out a problem like this by now? Just do a check to make sure links are real. That's pretty unimpressive to me.
There are no fabricated links, references, or quotes, in OpenAI's GPT 4.5 + Deep Research.
It's unfortunate the cost of a Deep Research bespoke white paper is so high. That mode is phenomenal for pre-work domain research. You get an analyst's two-week writeup in under 20 minutes, for the low cost of $200/month (though I've seen estimates that each white paper costs OpenAI over USD 3,000 to produce for you, which explains the monthly limits).
You still need to be a domain expert to make use of this, just as you need to be to make use of an analyst. Both the analyst and Deep Research can generate flawed writeups with similar misunderstandings: mis-synthesizing, misapplication, or omitting something essential.
Neither analyst nor LLM is a substitute for mastery.
What are the use cases where the expected performance is high?
>What are the use cases where the expected performance is high?
https://openai.com/index/introducing-chatgpt-pro/
o1-pro is probably at top tier human level performance on most small coding tasks and definitely at answering STEM questions. o3 is even better but not released outside of it powering Deep Research.
https://codeforces.com/blog/entry/137543 o3 is top 200 on Codeforces for example.
Yet the hucksters hyping AI are falling all over themselves saying AI can do all this stuff. This is where the centi-billion dollar valuations are coming from. It's been years and these super hyped AIs still suck at basic tasks.
When pre-AI shit Google gave wrong answers, it at least linked to the source of the wrong answers. LLMs just output something that looks like a link and call it a day.
https://marginalrevolution.com/marginalrevolution/2025/02/de...
> I've had the hallucination problem too, which renders it less than useful on any complex research project as far as I'm concerned.
These quotes are from the link you posted. There are a lot more.
Hosted, free or subscription-based Deep Research-like tools that integrate LLMs with search functionality (the whole domain of "RAG", or Retrieval-Augmented Generation) will remain elementary for a long time yet, simply because the cost of the average query starts to go up exponentially and there isn't that much money in it yet. Many people have built, and will continue to build, their own research tools where they can determine how much compute time and API-access cost they're willing to spend on a given query. OCR remains a hard problem, let alone appropriately chunking potentially hundreds of long documents into the context length and synthesizing potentially thousands of LLM outputs into a single response.
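The chunking problem mentioned here can be sketched with a naive character-budget splitter. Real pipelines count tokens with the model's actual tokenizer and split on semantic boundaries (sections, sentences), so treat this purely as an illustration of why "appropriately chunking" is harder than it sounds:

```python
def chunk_document(text: str, max_chars: int) -> list[str]:
    """Split text into chunks no longer than max_chars, preferring to
    break on paragraph boundaries.

    A crude stand-in for token-aware chunking: characters approximate
    tokens, and any single paragraph longer than the budget is split
    bluntly mid-paragraph (one of the lossy compromises real RAG
    pipelines have to manage more carefully).
    """
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        # Hard-split any single paragraph that exceeds the budget.
        while len(para) > max_chars:
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Multiply this by hundreds of OCR'd documents, then add a synthesis pass over every chunk's LLM output, and the per-query cost curve described above starts to make sense.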
More than marketing, I think from my experience it's chat, with little control over context, as the primary interface of most non-engineers with LLMs that leads to (mis)expectations of the tool in front of them. Having so little control over what is actually being input to the model makes it difficult to learn to treat a prompt as something more like a program.
OpenAI did similar things by focusing to the point of absurdity on 'safety' for what was basically a natural language search engine that has a habit of inventing nonsensical stuff. But on that same note (and also as you alluded to), I do agree that LLMs have a lot of use as natural language search engines in spite of their proclivity to hallucinate. Being able to describe e.g. a function call (or some esoteric piece of history) and then often get the precise term/event that I'm looking for is just incredibly useful.
But LLMs obviously are not sentient, are not setting us on the path to AGI, or any other such nonsense. They're arguably what search engines should have been 10 or 15 years ago, but anti-competitive monopolization of the industry meant that search engine technology progress basically stalled out, if not regressed for the sake of ads (and individual 'entrepreneurs' becoming better at SEO), about the time Google fully established itself.
I presume you are referring to this Google engineer, who was sacked for making the claim. Hardly an example of AI companies overhyping the tech; precisely the opposite, in fact. https://www.bbc.co.uk/news/technology-62275326
It seems to be a common human hallucination to imagine that large organisations are conspiring against us.
I wasn't making a political point. You see similar evidence-free allegations against international organisations and national government bodies.
That leaves the question of whether the organization is commensal, symbiotic or predatory towards any given "us".
That's not what happened. Google stomped hard on Lemoine, saying clearly that he was wrong about LaMDA being sentient ... and then they fired him for leaking the transcripts.
Your whole argument here is based on false information and faulty logic.
Were people capable of lifting concrete pillars, cranes would not be sought.
Edit: #@!! snipers. Speak up.
Noting that people are imperfect is not a justification for the weaknesses in LLMs. Since around late 2022 some people started stating LLMs are "smart like their cousin", to which the answer remains "we hope that your cousin has a proportionate employment".
If you built a crane that only lifts 15kg, it's no justification that "many people lift 10". The purpose of the crane is to lift as needed, with abundance for safety.
If we build cranes, it is because people are not sufficient: the relative weakness of people is, far from a consolation of weak cranes, the very reason why we want strong cranes. Similarly for intelligence and other qualities.
People are known to use «false information and faulty logic»: but they are not being called "adequate".
> angry at
There's a subculture around here that thinks it normal to downvote without any rebuttal - equivalent to "sneering and leaving" (quite impolite), almost all times it leaves us without a clue about what could be the point of disapproval.
I mean, you're right that he's silly and Google didn't want to be part of it, but it was (and is?) taken seriously that: LLMs are nascent AGI, companies are pouring money to get there first, we might be a year or two away. Take these as true, it's at least possible that Google might have something chained up in their basement.
In retrospect, Google dismissed him because he was acting in a strange and destructive way. At the time, it could be spun as just further evidence: they're silencing him because he's right. Could it have created such hysteria and silliness if the environment hadn't been so poisoned by the talk of imminent AGI/sentience?
I don't think they were, but I think it's pretty clear they were marketed as being the imminent path to super-intelligence, or something like it. OpenAI were saying GPT-(n-1) is as intelligent as a high school student, GPT-(n) is a university student, GPT-(n+1) will be.. something.
The focus on safety, and the concept of "AI", preexisted the product. An LLM was just the thing they eventually made; it wasn't the thing they were hoping to make. They applied their existing beliefs to it anyway.
No, first time I hear about it. I guess the secret to happiness is not following leaks. I had very low expectations before trying LLMs and I’m extremely impressed now.
https://www.theguardian.com/technology/2022/jun/12/google-en...
He was fired and a casual browse of his blog makes it quite clear that he was a few fries short of a Happy Meal all along.
Not following leaks, or just the news; not living in the real world; not caring about the consequences of reality: anybody can think he's """happy""" with psychedelia and with just living in a private world. But it is the same kind of "happy" that comes with "just smile".
If you did not get the information that there are severe pitfalls (which, by the way, is quite unrelated to the "it's sentient" thing, as we are talking about the faults in the products, not the faults in human fools), you are supposed to see them through your own judgement.
You mean like actual literature, textbooks and scientific papers? You can't get them in bulk without pirating. Thank intellectual property laws.
> from social media clouds the companies control
I.e. conversations of real people about matters of real life.
But if it satisfies your elitist, ivory-towerish vision of "healthy information diet" for LLMs, then consider that e.g. Twitter is where, until now, you'd get most updates from the best minds in several scientific fields. Or that besides r/All, the Reddit dataset also contains r/AskHistorians and other subreddits where actual experts answer questions and give first-hand accounts of things.
The actually important bit though, is that LLM training manages to extract value from both the "bullshit" and whatever you'd call "not bullshit", as the model has to learn to work with natural language just as much as it has to learn hard facts or scientific theories.
Did many people overhype LLMs? Yes, like with everything else (transhumanist ideas, quantum physics). It helps being more picky who one listens to, and whether they're just painting pretty pictures with words, or actually have something resembling a rational argument in there.
For some tasks they're still next to useless, and people who do those tasks understandably don't get the hype.
Tell a lab biologist or chemist to use an LLM to help them with their work and they'll get very little useful out of it.
Ask an attorney to use it and it's going to miss things that are blindingly obvious to the attorney.
Ask a professional researcher to use it and it won't come up with good sources.
For me, I've had a lot of those really frustrating experiences where I'm having difficulty on a topic and it gives me utterly incorrect junk because there just isn't a lot already published on that topic.
I've fed it tricky programming tasks and gotten back code that doesn't work, and that I can't debug because I have no idea what it's trying to do, or I'm not familiar with the libraries it used.
But truthfully, 90% of work-related programming is not problem solving; it's implementing business logic, and dealing with poor, ever-changing customer specs. Which an LLM will not help with.
Au contraire, these are exactly things LLMs are super helpful at - most of business logic in any company is just doing the same thing every other company is doing; there's not that many unique challenges in day-to-day programming (or business in general). And then, more than half of the work of "implementing business logic" is feeding data in and out, presenting it to the user, and a bunch of other things that boil down to gluing together preexisting components and frameworks - again, a kind of work that LLMs are quite a big time-saver for, if you use them right.
If you think "it can't quite do what I need, I'll wait a little longer until it can" you may still be waiting 50 years from now.
Most programmers understand reading code is often harder than writing it. Especially when someone else wrote the code. I'm a bit amused by the cognitive dissonance of programmers understanding that and then praising code handed to them by an LLM.
It's not that LLMs are useless for programming (or other technical tasks) but they're very junior practitioners. Even when they get "smarter" with reasoning or more parameters their nature of confabulation means they can't be fully trusted in the way their proponents suggest we trust them.
It's not that people don't make mistakes, but they often make reasonable mistakes. LLMs make unreasonable mistakes at random. There's no way to predict the distribution of their mistakes. I can learn that a human junior developer sucks at memory management or something. I can ask them to improve areas they're weak in and check those areas of their work in more detail.
I have to spend a lot of time reviewing all output from LLMs because there's rarely rhyme or reason to their errors. They save me a bunch of typing but replace a lot of my savings with reviews and debugging.
For many industries/people, work is a means to earn, not something to be passionate about for its own sake. It's a means to provide for the other things in life you are actually passionate about (e.g. family, lifestyle, etc.). In the end AI may get your job eventually, but if it gets yours much later than in other industries/domains, you win from a capital perspective, as other goods get cheaper and you still command your pre-AI scarcity premium. This makes it easier for such people to acquire more assets from the early-disrupted industries and shields them from AI's eventual takeover.
I'm seeing this directly in software: fewer new frameworks/libraries/etc. outside the AI domain being published, IMO; more apprehension from companies to open source their work and/or expose what they do; etc. Attracting talent is also no longer as strong a reason to showcase what you do to prospective employees; economic conditions and/or AI make that less necessary as well.
As with all LLM usage right now, it's a tool and not fit for every purpose. But it has legit uses for some attorney tasks.
That's a terrible use for an LLM. There are several deterministic search engines attorneys use to find relevant case law, where you don't have to check to see if the cases actually exist after it produces results. Plus, the actual text of the case is usually very important, and isn't available if you're using an LLM.
Which isn't to say they're not useful for attorneys. I've had success getting them to do some secretarial and administrative things. But for the core of what attorneys do, they're not great.
The orchestration of LLMs that will be reading transcripts, reading emails, reading case law, and preparing briefs with sources is unavoidable in the next 3 years. I don’t doubt multiple industry-specialized solutions are already under development.
Just asking ChatGPT to make your case for you is missing the opportunity.
If anyone is unable to get Claude 3.7 or Gemini 2.5 to accelerate their development work, I have to doubt their sentience at this point. (Or, more likely, doubt that they’re actively testing these things regularly.)
I don’t trust it blindly, and I often don’t use most of what it suggests; but I do apply critical thinking to evaluate what might be useful.
The simplest example is using it as a reverse dictionary. If I know there’s a word for a concept, I’ll ask an LLM. When I read the response, I either recognize the word or verify it using a regular dictionary.
I think a lot of the contention in these discussions is because people are using it for different purposes: it's unreliable for some purposes and it is excellent at others.
Wait, why would a law firm create its own repository of case law? It's not like it has access to secret case law that other lawfirms do not.
Only if you're okay with it missing stuff. If I hired a lawyer, and they used a magic robot rather than doing proper research, and thus missed relevant information, and this later came to light, I'd be going after them for malpractice, tbh.
https://legal.thomsonreuters.com/en/products/westlaw-edge https://www.lexisnexis.com/en-us/products/protege.page
This is because programmers talk on the forums that programmers scrape to get data to train the models.
I’m not a biologist (good or bad) but the scientists I know (who I think are good) often complain that most of the work is drudgery unrelated to the science they love.
Edit to add: and regardless, I'm less interested in the "LLM's aren't ever useful to science" part of the point. The point that actual LLM usage in science will mostly be for cases where they seem useful but actually introduce subtle problems is much more important. I have observed this happening with trainees.
The technology is indeed amazing and very amusing, but like all the good things in the hands of corporate overlords, it will be slowly turning into a profit-milking abomination.
This is your interpretation of what these companies are saying. I'd love to see if some company specifically said anything like that.
Out of the last 100 years, how many inventions have been made that could make any human awe like LLMs do right now? How many things from today, when brought back into 2010, would make the person using them feel like they're being tricked or pranked? We already take them for granted even though they've only been around for less than half a decade.
LLMs aren't a catch-all solution to the world's problems, or something that is going to help us in every facet of our lives, or an accelerator for every industry that exists out there. But at no point in history could you talk to your phone about general topics, get information, practice language skills, build an assistant that teaches your kid the basics of science, use something to accelerate your work in many different ways, etc...
Looking at LLMs shouldn't be boolean; it shouldn't be a choice between "they're the best thing ever invented" and "they're useless". But it seems like everyone presents the issue in this manner, and Sabine is part of that problem.
>I'd love to see if some company specifically said anything like that.
1. Microsoft researchers: Sparks of Artificial General Intelligence: Early experiments with GPT-4 - https://arxiv.org/abs/2303.12712
2. "GPT-4 is not AGI, but it does exhibit more general intelligence than previous models." - Sam Altman
3. Musk has claimed that AI is on the path to "understanding the universe." His branding of Tesla's self-driving AI as "Full Self-Driving" (FSD) also misleadingly suggests a level of autonomous reasoning that doesn't exist.
4. Meta's AI chief scientist, Yann LeCun, has repeatedly said they are working on giving AI "common sense" and "world models" similar to how humans think.
>Out of the last 100 years how many inventions have been made that could make any human awe like llms do right now?
ELIZA is an early natural language processing computer program, developed from 1964 to 1967.
ELIZA's creator, Weizenbaum, intended the program as a method to explore communication between humans and machines. He was surprised and shocked that some people, including his own secretary, attributed human-like feelings to the computer program. 60 years ago.
So as you can see, us humans are not too hard to fool with this.
Also,
"4. Meta's AI chief scientist, Yann LeCun, has repeatedly said they are working on giving AI "common sense" and "world models" similar to how humans think."
completely misses the mark. That LLMs don't do this is a criticism from old-school AI researchers like Gary Marcus; LeCun is saying that they are addressing the criticism by developing the sorts of technology that Marcus says are necessary.
As do all companies in the world. If you want to buy a hammer, the company will sell it as the best hammer in the world. It's the norm.
I don't know exactly what your point is with ELIZA?
> So as you can see, us humans are not too hard to fool with this.
I mean ok? How is that related to having a 30 minute conversation with ChatGPT where it teaches you a language? Or Claude outputting an entire application in a single go? Or having them guide you through fixing your fridge by uploading the instructions? Or using NotebookLM to help you digest a scientific paper?
Your example actually highlights this well. AI excels at language, so it’s naturally strong in teaching (especially for language learning ;)). But coding is different. It’s not just about syntax; it requires problem-solving, debugging, and system design — areas where AI struggles because it lacks true reasoning.
There’s no denying that when AI helps you achieve or learn something new, it’s a fascinating moment — proof that we’re living in 2025, not 1967. But the more commercialised it gets, the more mythical and misleading the narrative becomes
Others addressed code, but with system design specifically - this is more of an engineering field now, in that there's established patterns, a set of components at various levels of abstraction, and a fuck ton of material about how to do it, including but not limited to everything FAANG publishes as preparatory material for their System Design interviews. At this point in time, we have both a good theoretical framework and a large collection of "design patterns" solving common problems. The need for advanced reasoning is limited, and almost no one is facing unique problems here.
I've tested it recently, and suffice it to say, Claude 3.7 Sonnet can design systems just fine - in fact much better than I'd expect a random senior engineer to. Having the breadth of knowledge and being really good at fitting patterns is a big advantage it has over people.
> They push the narrative that they’ve created something akin to human cognition
I am saying they're not doing that, they're doing sales and marketing and it's you that interprets this as possible/true. In my analogy if the company said it's a hammer that can do anything, you wouldn't use it to debug elixir. You understand what hammers are for and you realize the scope is different. Same here. It's a tool that has its uses and limits.
> Your example actually highlights this well. AI excels at language, so it’s naturally strong in teaching (especially for language learning ;)). But coding is different. It’s not just about syntax; it requires problem-solving, debugging, and system design — areas where AI struggles because it lacks true reasoning.
I disagree, since I use it daily and Claude is really good at coding. It's saving me a lot of time. It's not gonna build a new Waymo, but I don't expect it to. But this is beside the point. In the original tweet, what Sabine is implying is that it's useless and OpenAI should be worth less than a shoe factory. In fact that is a very poor way to look at LLMs and their value, and both ends of the spectrum are problematic (those who say it's a catch-all AGI and those who say hurr, it couldn't solve P versus NP, it's trash).
You've moved the goalpost from "they're not saying it" to "they're saying, but you're not supposed to believe it."
Person you replied to: they intentionally use suggestive language that leads people to think AI is approaching human cognition. This helps with hype, investment, and PR.
Your response: As do all companies in the world. If you want to buy a hammer, the company will sell it as the best hammer in the world. It's the norm.
> proof that we’re living in 2025, not 1967. But the more commercialised it gets, the more mythical and misleading the narrative becomes
You seem to be living in 2024, or 2023. People generally have far more pragmatic expectations these days, and the companies are doing a lot less overselling ... in part because it's harder to come up with hype that exceeds the actual performance of these systems.
A statement on their personal Twitter might not be "the company's" statement, but who cares?
Sam Altman's social media IS OpenAI marketing.
That's reallyyyy trying hard to minimise the capability of LLMs and the potential we're still discovering. But you do you, I guess.
> In either case your example is showing what? That lying is normal in the business world and should be done by the CEOs as part of their job description? That they should or should not go to jail for it? I am really missing your point here, no offence.
If you run through the message chain you'll see first that the comment OP is claiming companies market LLMs as AGI, and then the next guy quotes Altman's tweet to support it. I am saying companies don't claim LLMs are AGI and that CEOs are doing CEO things; my examples are Elon (didn't go to jail, btw) and the other two that did.
> For all that we know, OpenAI and their ilk are not doing that really.
I am on the same page here.
> Anyone who finds them 'mindblowing' clearly does not have a complex enough use case.
What is the point of LLMs? If their only point is complex use cases, then they're useless; let's throw them away. If their point/scope/application is wider and they're doing something for a non-negligible percentage of people, then who are you to gauge whether they deserve to be mind-blowing to someone, regardless of their use case?
Lots e.g. vacuum cleaners.
> But at no point in history could you talk to your phone
You could always "talk" to your phone just like you could "talk" to a parrot or a dog. What does that even mean?
If we're talking about LLMs, I still haven't been able to have a real conversation with one. There's too much of a lag to feel like a conversation, and it often doesn't reply with anything related.
> If we're talking about LLMs, I still haven't been able to have a real conversation with one. There's too much of a lag to feel like a conversation, and it often doesn't reply with anything related.
I don't believe this one bit. But keep on trucking.
Not necessarily. (Some aphonic, adactyl downvoters seem to have possibly tried to nudge you into noticing that your idea above is against some entailed spirit of the guidelines.)
The poster may have meant that, for the use natural to him, the results feel about as useful as a discussion with a good animal. "Clarifying one's prompts" may be effective in some cases, but it's probably not what others seek. It is possible that many want the good old combination of "informative" and "insightful": in practice there may be issues with both.
It's not even that. Can the LLM run away, stop the conversation, or even say no? It's like your boss "talking" to you about a task without giving you a chance to respond. Is that a talk? It's one-way.
E.g. ask the LLM who invented Wikipedia. It will respond with "facts". If I ask a friend, the reply might be "look it up yourself". That's a real conversation. Until then, it isn't.
Even parrots and dogs can respond differently than a forced reply exactly how you need it.
A German Onion-like magazine has a wrapper around ChatGPT that behaves like that called „DeppGPT“ (IdiotGPT), likely implemented with a decent prompt.
Imagine the LLM is halfway through its journey to the Moon, and mentally correct for ~1.5 seconds of light lag.
> and often doesn't reply with anything related.
Use a better microphone, or stop mumbling.
What is the layman to make of the claim that we now have “reasoning” models? Certainly sounds like a claim of human-like cognition, even though the reality is different.
I think you’re going too far in imagining what one group of people will make of what another group of people is saying, without actually putting yourself in either group.
Yes, it hallucinates and if you replace your brain with one of these things, you won't last too long. However, it can do things which, in the hands of someone experienced, are very empowering. And it doesn't take an expert to see the potential.
As it stands, it sounds like a case of "it's great in practice but the important question is how good it is in theory."
I use LLMs. They're somewhat useful if you're on a non-niche problem. They're also useful instead of search engines, but that's more because search has been enshittified than because an LLM is better.
However 90% of the marketing material about them is simply disgusting. The bigwigs sound like they're spreading a new religion, and most enthusiasts sound like they're new converts to some sect.
If you're marketing it as a tool, fine. If you're marketing it as the third and fourth coming of $DEITY, get lost.
The problem for me is that I could use that type of assistance precisely when I hit that "niche problem" zone. Non-niche problems are usually already solved.
Like search. Popular search engines like Google and Bing are mostly garbage because they keep trying to shove gen AI in my face with made up answers. I have no such problems with my SearxNG instance.
Tough luck. On the other hand, we're still justified in asking for money to do the niche problems with our fleshy brains, right? In spite of the likes of Altman saying every week that we'll be obsoleted in 5 years by his products. Like ... cold fusion? Always 5 years away?
[I have more hope for cold fusion than these "AIs" though.]
> Popular search engines like Google and Bing are mostly garbage because they keep trying to shove gen AI in my face with made up answers.
No, they became garbage significantly before "AI". Google, at least, has gradually reduced the number of results returned and expanded the search scope to the point that when you want a reminder of the i2c API syntax on a Raspberry Pi, they return 20 beginner tutorials that show you how to unpack the damn thing and do the first login instead.
I'm not marketing it. I'm not a marketer. I'm a developer trying to create an informed opinion on its utility and the marketing speak you criticize is far away from the truth.
The problem is this notion that it's just completely bullshit. The way it's worded irks me. "I genuinely don't understand...". It's quite easy to see the utility, and acknowledging that doesn't, in any way, detract from valid criticisms of the technology and the people who peddle it.
So someone who isn't already invested can genuinely draw the conclusion they're similar and not worth the time.
Edit: oh wait
>because some marketing people are over-promising
all "AI" marketing people that I've seen. Ok 98%. And all my LLM info is from what gets posted on HN.
> I will retaliate by choosing to believe false things
"I will retaliate by cataloguing them as pathological liars and not waste my time with them any more".
It hurts nobody but the person choosing ignorance.
And 100% of the marketers of course.
Pinch of salt & all.
They actually remind me of myself, as I experience being a native English speaker now living in Berlin and attempting to use a language I mainly learned as an adult.
I can often appear competent in my use of the language, but then I'll do something stupid like asking someone in the office if we have a "Gabelstapler" I can borrow — Gabelstapler is "forklift truck", I meant to ask for a stapler, which is "Tacker" or "Hefter", and I somehow managed to make this mistake directly after carefully looking up the word. (Even this is a big improvement for me, as I started off like Officer Crabtree from Allo' Allo').
Akin to human cognition but still a few bricks short of a load, as it were.
Are you trying to say that LLMs are useful now but you think that will stop being the case at some point in the future?
The tech industry, especially big corporations, doesn’t chase innovation; it chases repeatable, predictable profit.
I mean fine, argue that they're mistaken to be concerned, if that's your belief. But dismissing it all as obvious shilling is not that argument.
I'm not a functionalist and my belief is that AI — especially LLMs — will never achieve real understanding or consciousness, no matter how much we scale them. Language prediction is just a computation, but human thought is more than that.
Above you wrote "we all know the only real Intelligence ... is" as your support for attributing venal motives to people taking AI progress seriously. OK, now I know your basis for that claim. I've read three of the guys you mention, agree they're intelligent and except for Searle have some good things to say. But it's really unconvincing as support for an AI-is-fake claim, and especially for an everyone-knows claim.
But, if you spend too much time fawning over how impressive these things are, you might forget that something being impressive doesn't translate into something being useful.
Well, are they useful? ... Yeah, of course LLMs are useful, but we need to remain somewhat grounded in reality. How useful are LLMs? Well, they can dump out a boilerplate React frontend to a CRUD API, so I can imagine it could very well be harmful to a lot of software jobs, but I hope it doesn't bruise too many egos to point out that dumping out yet another UI that does the same thing we've done 1,000,000 times before isn't exactly novel. So it's useful for some software engineering tasks. Can it debug a complex crash? So far I'm around zero for ten and believe me, I'm trying. From Claude 3.7 to Gemini 2.5, Cursor to Claude Code, it's really hard to get these things to work through a problem the way anyone above the junior dev level can. Almost universally, they just keep digging themselves deeper until they eventually give up and try to null out the code so that the buggy code path doesn't execute.
So when Sabine says they're useless for interpreting scientific publications, I have zero trouble believing that. Scoring high on some shitty benchmarks whose solutions are in the training set is not akin to generalized knowledge. And these huge context windows sound impressive, but dump a moderately large document into them and it's often a challenge to get them to actually pay attention to the details that matter. The best shot you have by far is if the document you need it to reference definitely was already in the training data.
It is very cool and even useful to some degree what LLMs can do, but just scoring a few more points on some benchmarks is simply not going to fix the problems current AI architecture has. There is only one Internet, and we literally lit it on fire to try to make these models score a few more points. The sooner the market catches up to the fact that they ran out of Internet to scrape and we're still nowhere near the singularity, the better.
Hardly. I pretty much have been using LLMs at least weekly (most of the time daily) since GPT-3.5. I am still amazed. It's really, really hard for me not to be bullish.
It kinda reminds me of the days when I learned the Unix-like command line. At least once a week, I shouted to myself: "What? There is a one-liner that does that? People use awk/sed/xargs this way??" That's how I feel about LLMs so far.
Yesterday Gemini 2.5 Pro suggested running "ps aux | grep filename.exe" to find a Wine process (pgrep is the much better way to go for that, but it's still wrong here) and get the PID, then pass that into "winedbg --attach" which is wrong in two different ways, because there is no --attach argument and the PID you pass into winedbg needs to be the Win32 one not the UNIX one. Not an impressive showing. (I already knew how to do all of this, but I was curious if it had any insights I didn't.)
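(For what it's worth, the pgrep half of that is easy to get right. Here's a rough Python sketch of what `pgrep -f filename.exe` effectively does on Linux: scan /proc for processes whose command line matches. Nothing winedbg-specific is assumed here.)

```python
import os

def find_pids(pattern: str) -> list[int]:
    """Return UNIX PIDs whose command line contains `pattern`,
    roughly what `pgrep -f` does on Linux."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # skip non-process entries like /proc/meminfo
        try:
            with open(f"/proc/{entry}/cmdline", "rb") as f:
                # argv entries are NUL-separated in /proc/<pid>/cmdline
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
        except OSError:
            continue  # process exited while we were scanning
        if pattern in cmdline:
            pids.append(int(entry))
    return pids
```

Note this yields the UNIX PID; as the parent says, winedbg wants the Win32 PID, which neither this nor pgrep can give you.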
For people with less experience I can see how getting e.g. tailored FFmpeg commands generated is immensely useful. On the other hand, I spent a decent amount of effort learning how to use a lot of these tools and for most of the ways I use them it would be horrific overkill to ask an LLM for something that I don't even need to look anything up to write myself.
Will people in the future simply not learn to write CLI commands? Very possible. However, I've come to a different, related conclusion: I think that these areas where LLMs really succeed in are examples of areas where we're doing a lot of needless work and requiring too much arcane knowledge. This counts for CLI usage and web development for sure. What we actually want to do should be significantly less complex to do. The LLM actually sort of solves this problem to the extent that it works, but it's a horrible kludge solution. Literally converting video files and performing basic operations on them should not require Googling reference material and Q&A websites for fifteen minutes. We've built a vastly overly complicated computing environment and there is a real chance that the primary user of many of the interfaces will eventually not even be humans. If the interface for the computer becomes the LLM, it's mostly going to be wasted if we keep using the same crappy underlying interfaces that got us into the "how do I extract tar file" problem in the first place.
And yet, that's exactly what people get paid to do every day. And if it saves them time, they won't exactly get bored of that feature.
That’s why every low code solution and boilerplate generator for the last 30 years failed to deliver on the promises they made.
If your site has users, it will evolve. I’ve seen users take what was a simple trucking job posting form and repurpose an unused “trailer type” field to track the status of the job req.
Every single app that starts out as a low code/no code solution given enough time and users will evolve beyond that low code solution. They may keep using it, but they’ll move beyond being able to maintain it exclusively through a low code interface.
- Architecture (making it easy to adjust part of the codebase and understanding it)
- Testing (making sure the current version works and future version won't break it)
- Requirements (describing the current version and the planned changes)
- ...
If a project were just a clone, I'm sure people would just buy the existing version and be done with it. And sometimes they do; then a unique requirement comes along and the whole process comes back into play.
They are useful enough that they can passably replace (much more expensive) humans in a lot of noncritical jobs, thus being a tangible tool for securing enterprise bottom lines.
They're still useful, but they're not going to make cheap employees wildly more productive, and outside maybe a rare, perfect niche, they're not going to increase expensive employees' productivity so much that you can lay off a bunch of the cheap ones. Like, they're not even close to that, and haven't really been getting much closer despite improvements.
This is so clearly biased that it borders on parody. You can only get out what you put in. The real use case of current LLMs is that any project that would previously require collaboration can now be done solo, with a much faster turnaround. Of course in 20 years when compute finally catches up they will just be super intelligent AGI
Despite the ridiculous hype, though, I have found that these things have crossed into usefulness. I imagine for people with less experience, these tools are a godsend, enabling them to do things they definitely couldn't do on their own before. Cool.
Beyond that? I definitely struggle to find things I can do with these tools that I couldn't do better without. The main advantage so far is that these tools can do these things very fast and relatively cheaply. Personally, I would love to have a tool that I can describe what I want in detailed but plain English and have it be done. It would probably ruin my career, but it would be amazing for building software. It'd be like having an army of developers on your desktop computer.
But, alas, a lot of the cool shit I'd love to do with LLMs doesn't seem to pan out. They're really good at TypeScript and web stuff, but their proficiency definitely tapers off as you veer out. It seems to work best when you can find tasks that basically amount to translation, like converting between programming languages in a fuzzy way (e.g. trying to translate idioms). What's troubling me the most is that they can generate shitloads of code but basically can't really debug the code they write beyond the most entry-level problem-solving. Reverse engineering also seems like an amazing use case, but the implementations I've seen so far definitely are not scratching the itch.
> Of course in 20 years when compute finally catches up they will just be super intelligent AGI
I am betting against this. Not the "20 years" part, it could be months for all we know; but the "compute finally catches up" part. Our brains don't burn kilowatts of power to do what they do, yet given basically unbounded time and compute, current AI architectures are simply unable to do things that humans can, and there aren't many benchmarks that are demonstrating how absolutely cataclysmically wide the gap is.
I'm certain there's nothing magical about the meat brain, as much as that is existentially challenging. I'm not sure that this follows through to the idea that you could replicate it on a cluster of graphics cards, but I'm also not personally betting against that idea, either. On the other hand, getting the absurd results we have gotten out of AI models today didn't involve modest increases. It involved explosive investment in every dimension. You can only explode those dimensions out so far before you start to run up against the limitations of... well, physics.
Maybe understanding what LLMs are fundamentally doing to replicate what looks to us like intelligence will help us understand the true nature of the brain or of human intelligence, hell if I know, but what I feel most strongly about is this: I do not believe LLMs are replicating some portion of human intelligence. They are very obviously neither a subset nor a superset, nor particularly close to either. They are some weird entity that overlaps in ways we don't fully comprehend yet.
The big problem with being bullish in the stock market sense is that OpenAI isn't selling the LLMs that currently exist to their investors, they're selling AGI. Their pitch to investors is more or less this:
> If we accomplish our goal we (and you) will have infinite money. So the expected value of any investment in our technology is infinite dollars. No, you don't need to ask what the odds are of us accomplishing our goal, because any percent times infinity is infinity.
Since OpenAI and all the founders riding on their coat tails are selling AGI, you see a natural backlash against LLMs that points out that they are not AGI and show no signs of asymptotically approaching AGI—they're asymptotically approaching something that will be amazing and transformative in ways that are not immediately clear, but what is clear to those who are watching closely is that they're not approaching Altman's promises.
The AI bubble will burst, and it's going to be painful. I agree with the author that that is inevitable, and it's shocking how few people see it. But also, we're getting a lot of cool tech out of it and plenty of it is being released into the open and heavily commoditized, so that's great!
For your edification:
https://en.wikipedia.org/wiki/Artificial_general_intelligenc...
For reference, 1997 original: By advanced artificial general intelligence, I mean AI systems that rival or surpass the human brain in complexity and speed, that can acquire, manipulate and reason with general knowledge, and that are usable in essentially any phase of industrial or military operations where a human intelligence would otherwise be needed.
2014 wiki requirements: reason, use strategy, solve puzzles, and make judgments under uncertainty; represent knowledge, including commonsense knowledge; plan; learn; communicate in natural language; and integrate all these skills towards common goals.
Although if you look at the trajectory of goalposts since https://web.archive.org/web/20140327014303/https://en.wikipe... , I could concede the point, that people discarded the obvious definition in the light of recent events.
A dog in the sun may be hot, but that doesn't make it a hot dog.
You can use a towel to dry your hair, but that doesn't make the towel a hair dryer.
Putting coffee on a dining room table doesn't turn it into a coffee table.
Spreading Elmer's glue on your teeth doesn't make it tooth paste.
The White House is, in fact, a white house, but my neighbor's white house is not The White House.
I could go on, but I think the above is a sufficient selection to show that language does not, in fact, work that way. You can't decompose a compound noun into its component morphemes and expect to be able to derive the compound's meaning from them.
> in most cases
What do you think will happen if we will start comparing the lengths of the list ["hot dog", ...] and the list ["blue bird", "aeroplane", "sunny March day", ...]?
A bluebird is a specific species. A blue parrot is not a bluebird.
An aeroplane is a vehicle that flies through the air at high speeds, but if you broke it down into morphemes and tried to reason it out that way you could easily argue that a two-dimensional flat surface that extends infinitely in all directions and intersects the air should count.
Sunny March day isn't a compound noun, it's a noun phrase.
Can you point me to a single compound noun (that is, a two-or-more-part word that is widely used enough to earn a definition in a dictionary, like AGI) that can be subjected to the kind of breaking apart into morphemes that you're doing without yielding obviously nonsensical re-interpretations?
A paste made of a ground up tooth is clearly tooth paste because it is both a tooth and paste.
For example, “homophobia” literally means “same-fear” - are homophobes afraid of sameness? Do they have an unusual need for variety and novelty?
Firetruck — It's not a truck that is on fire nor is it a truck that delivers fire.
Butterfly — Not a fly made of butter.
Starfish — Not a fish, not a star.
Pineapple — Neither a pine nor an apple.
Guinea pig — A rodent, not a pig.
Koala bear — A marsupial, not a bear.
Silverfish — An insect, not a fish.
In the 90s, Robert Metcalfe infamously wrote "Almost all of the many predictions now being made about 1996 hinge on the Internet’s continuing exponential growth. But I predict the Internet, which only just recently got this section here in InfoWorld, will soon go spectacularly supernova and in 1996 catastrophically collapse." I feel like we are just hearing LLM versions of this quote over and over now, but they will prove to be equally accurate.
Generic. For the Internet, more complex questions would have been "What are the potential benefits, what the potential risks, what will grow faster" etc. The problem is not the growth but what that growth means. For LLMs, the big clear question is "will they stop just being LLMs, and when will they". Progress is seen, but we seek a revolution.
I think this is the source of a lot of the hype. There are people salivating at the thought of no longer needing to employ the peasant class. They want it so badly that they'll say anything to get more investment in LLMs even if it might only ever allow them to fire a fraction of their workers, and even if their products and services suffer because the output they get with "AI" is worse than what the humans they throw away were providing.
They know they're overselling it, but they're also still on their knees praying that by some miracle their LLMs trained on the collective wisdom of facebook and youtube comments will one day gain actual intelligence and they can stop paying human workers.
In the meantime, they'll shove "AI" into everything they can think of for testing and refinement. They'll make us beta test it for them. They don't really care if their AI makes your customer service experience go to shit. They don't care if their AI screws up your bill. They don't care if their AI rejects your claims or you get denied services you've been paying for and are entitled to. They don't care if their AI unfairly denies you parole or mistakenly makes you the suspect of a crime. They don't care if Dr. Sbaitso 2.0 misdiagnoses you. Your suffering is worth it to them as long as they can cut their headcount by any amount and can keep feeding the AI more and more information because just maybe with enough data one day their greatest dream will become reality, and even if that never happens a lot of people are currently making massive amounts of money selling that lie.
The problem is that the bubble will burst eventually. The more time goes by and AI doesn't live up to the hype the harder that hype becomes to sell. Especially when by shoving AI into everything they're exposing a lot of hugely embarrassing shortcomings. Repeating "AI will happen in just 10 more years" gives people a lot of time to make money and cash out though.
On the plus side, we do get some cool toys to play with and the dream of replacing humans has sparked more interest in robotics so it's not all bad.
Something important happened when we turned the tables around, and I don't feel it gets the credit it should. It used to be humans telling machines what to do. Now we're doing the opposite.
Sometimes the means are just as important as the ends, if not more so.
So going back to apples-and-apples comparison, i.e. assuming that "spend a lot of money to get it done for you" is not on the table, I'd trust current SOTA LLM to do a typical person's taxes better than they themselves would.
If a person is making a smaller income their tax situation is probably very simple, and can be handled by automated tools like TurboTax (as the sibling comment suggests).
I don't see a lot of value add from LLMs in this particular context. It's a situation where small mistakes can result in legal trouble or thousands of dollars of losses.
People who paste undisclosed AI slop in forums deserve their own place in hell, no argument there. But what are some good examples of simple tax questions where current models are dangerously wrong? If it's not a private forum, can you post any links to those questions?
Anyway, the magic robot 'knew' all that. Where it slipped up was in actually _working_ with it. Someone asked for a comparison of taxation on a 20 year investment in individual stocks vs ETFs, assuming re-investment of dividends and the same overall growth rate. The machine happily generated a comparison showing individual stocks doing massively better... On closer inspection, it was comparing growth for 20 years for the individual stocks to growth of 8 years for the ETFs. (It also got the marginal income tax rate wrong.)
But the nonsense it spat out _looked_ authoritative on first glance, and it was a couple of replies before it was pointed out that it was completely wrong. The problem isn't that the machine doesn't know the rules; insofar as it 'knows' anything, it knows the rules. But it certainly can't reliably apply them.
(I'd post a link, but they deleted it after it was pointed out that it was nonsense.)
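(The arithmetic involved is trivial to get right once the horizon is fixed. A rough sketch of an apples-to-apples version; all rates here are invented for illustration, and real ETF taxation with gross roll-up, deemed disposal and so on is more involved than a flat dividend tax.)

```python
def final_value(principal: float, years: int, growth: float,
                dividend_yield: float, dividend_tax: float) -> float:
    """Compound `principal` for `years`, reinvesting dividends net of tax.
    All rates are annual fractions (e.g. 0.05 for 5%)."""
    value = principal
    for _ in range(years):
        value *= 1 + growth  # price growth
        value += value * dividend_yield * (1 - dividend_tax)  # reinvest net dividend
    return value

# The model's error, per the post, was 20 years for stocks vs 8 for ETFs.
# A fair comparison fixes the horizon for both:
YEARS = 20
stocks = final_value(10_000, YEARS, growth=0.05, dividend_yield=0.02, dividend_tax=0.33)
etf    = final_value(10_000, YEARS, growth=0.05, dividend_yield=0.02, dividend_tax=0.41)
```

With the same horizon and growth assumptions, only the tax treatment moves the needle, which is the comparison the machine was actually asked for.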
This failure seems similar to a case that someone brought up earlier ( https://news.ycombinator.com/item?id=43466531 ). While better than expected at computation, the transformer model ultimately overestimates its own ability, running afoul of Dunning-Kruger much like humans tend to.
Replying here due to rate-limiting:
One interesting thing is that when one model fails spectacularly like that, its competitors often do not. If you were to cut/paste the same prompt and feed it to o1-pro, Claude 3.7, and Gemini 2.5, it's possible that they would all get it wrong (after all, I doubt they saw a lot of Irish tax law during training.) But if they do, they will very likely make different errors.
Unfortunately it doesn't sound like that experiment can be run now, but I've run similar tests often enough to tell me that wrong answers or faulty reasoning are more likely model-specific shortcomings rather than technology-specific shortcomings.
That's why I get triggered when people speak authoritatively on here about what AI models "can't do" or "will never be able to do." These people have almost always, almost without exception, been proven dead wrong in the past, but that never seems to bother them.
Things could stall out and we'll have bumps and delays ... I hope. If this thing progresses at the same pace, or speeds up, well ... reality will change.
Or not. Even as they are, we can build some cool stuff with them.
The trouble is that, while incredibly amazing, mind blowing technology, it falls down flat often enough that it is a big gamble to use. It is never clear, at least to me, what it is good at and what it isn't good at. Many things I assume it will struggle with, it jumps in with ease, and vice versa.
As the failures mount, I admittedly do find it becoming harder and harder to compel myself to see if it will work for my next task. It very well might succeed, but by the time I go to all the trouble to find out it often feels that I may as well just do it the old fashioned way.
If I'm not alone, that could be a big challenge in seeing long-term commercial success. Especially given that commercial success for LLMs is currently defined as 'take over the world' and not 'sustain mom and pop'.
> the speed at which it is progressing is insane.
But same goes for the users! As a result the failure rate appears to be closer to a constant. Until we reach the end of human achievement, where the humans can no longer think of new ways to use LLMs, that is unlikely to change.
The author says they use several LLMs every day and they always produce incorrect results. That "feels" weird, because it seems like you'd develop an intuition fairly quickly for the kinds of questions you'd ask that LLMs can and can't answer. If I want something with links to back up what is being said, I know I should ask Perplexity or maybe just ask a long-form prompt-like question of Google or Kagi. If I want a Python or bash program I'm probably going to ask ChatGPT or Gemini. If I want to work on some code I want to be in Cursor and am probably using Claude. For general life questions, I've been asking Claude and ChatGPT.
Running into the same issue with LLMs over and over for years, with all due respect, seems like the "doing the same thing and expecting different results" situation.
If you work on stuff that is at all niche (as in, stack overflow was probably not going to have the answer you needed even before LLMs became popular), then it's not surprising when LLMs can't help because they've not been trained.
For people that were already going fast and needed or wanted to put out more code more quickly, I'm sure LLMs will speed them up even more.
For those of us working on niche stuff, we weren't going fast in the first place or being judged on how quickly we ship in all likelihood. So LLMs (even if they were trained on our stuff) aren't going to be able to speed us up because the bottleneck has never been about not being able to write enough code fast enough. There are architectural and environmental and testing related bottlenecks that LLMs don't get rid of.
I don't think I'm working on anything particularly niche, but nor is it cookie-cutter generic either, and that could be enough to drastically reduce their utility.
I just tried to use the latest Gemini release to help me figure out how to do some very basic Google Cloud setup. I thought my own ignorance in this area was to blame for the 30 minutes I spent trying to follow its instructions - only to discover that Gemini had wildly hallucinated key parts of the plan. And that’s Google’s own flagship model!
I think it’s pretty telling that companies are still struggling to find product-market fit in most fields outside of code completion.
But ask it to solve some leet code and it’s brilliant.
I should start collecting examples, if only for threads like this. Recently I tried to llm a tsserver plugin that treats lines ending with "//del" as empty. You can only imagine all the sneaky failures in the chat and the total uselessness of these results.
Anything that does not appear literally millions (billions?) of times in the training set is doomed to be fantasized about by an LLM, in various ways, tones, etc. After many such threads I came to the conclusion that people who find it mostly useful are simply treading water, as they probably have done for most of their career. Their average product is a React form with a CRUD endpoint, and excitement about it. I can't explain their success reports otherwise, because it rarely works on anything beyond that.
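For what it's worth, the core text transformation that "//del" plugin would need is tiny; a real tsserver plugin would have to hook the TypeScript language service in TypeScript, but here is a minimal sketch of just the intended behavior, in Python (the function name is mine):

```python
def blank_del_lines(source: str) -> str:
    """Replace every line ending in '//del' with an empty line,
    preserving the line count so positions still map back."""
    return "\n".join(
        "" if line.rstrip().endswith("//del") else line
        for line in source.split("\n")
    )
```

The hard part the LLM kept fumbling, presumably, was wiring a transform like this into the language service plugin API, not the string manipulation itself.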
If your job is copy-pasting from Stack Overflow then LLMs are an upgrade.
I wonder how much this affects our fundraising, for example. No VC understands the science here, so they turn to advisors (which is great!) or to LLMs… which has us starting off on the wrong foot.
And how many actual humans, with a fair bit of training, can become a little bit less than useless?
I mean, my parents used to have this dog that would just look at you like "go get your own damn ball, stupid human" if you threw a ball around him.
--edit--
and, yes, the dog also made grammatical mistakes.
When your user says that your product doesn’t work for them, saying they’re using it wrong is not an excuse.
Because it has a sample size of our collective human knowledge and language big enough to trick our brains into believing that.
As a parallel thought, it reminds me of a trick Derren Brown did. He picked every horse correctly across six races. The person he was picking for was obviously stunned, as was the audience watching it.
The reality of course is just that people couldn't comprehend that he just had to go to extreme and tedious lengths to make this happen. They started with 7000 people and filmed every one like it was going to be the "one" and then the probability pyramid just dropped people out. It was such a vast undertaking of time and effort that we're biased towards believing there must be something really happening here.
LLMs currently are a natural language interface to a Microsoft Encarta like system that is so unbelievably detailed and all encompassing that we risk accepting that there's something more going on there. There isn't.
Yes, it's artificial intelligence. It's not the real thing, it's artificial.
There is no meaningful interpretation of the word intelligence that applies, psychologically or philosophically, to what is going on. Machine Learning is far more apt and far less misleading.
I saw the transition from ML to AI happen in academic papers and then pitch decks in real time. It was to refill the well when investors were losing faith that ML could deliver on the promises. It was not progress driven.
this doesn't make any more sense than calling LLMs "intelligence". There is no "our intelligence" beyond a concept or an idea that you or someone else may have about the collective, which is an abstraction.
What we do each have is our own intelligence, and that intelligence is, and likely always will be, no matter how science progresses, ineffable. So my point is you can't say your made-up/ill-defined concept is any realer than any other made-up/ill-defined concept.
No, that's not my problem with it. My problem with it is that inbuilt into the models of all LLMs is that they'll fabricate a lot. What's worse, people are treating them as authoritative.
Sure, sometimes it produces useful code. And often, it'll simply call the "doTheHardPart()" method. I've even caught it literally writing the wrong algorithm when asked to implement a specific and well known algorithm. For example, asking it "write a selection sort" and watching it write a bubble sort instead. No amount of re-prompts pushes it to the right algorithm in those cases either, instead it'll regenerate the same wrong algorithm over and over.
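For reference, the distinguishing feature of selection sort, as opposed to bubble sort, is that each pass finds the minimum of the unsorted suffix and makes a single swap, rather than repeatedly swapping adjacent out-of-order pairs. A minimal Python sketch:

```python
def selection_sort(items):
    """Selection sort: one swap per pass. Bubble sort, by contrast,
    repeatedly swaps adjacent out-of-order pairs as it scans."""
    a = list(items)  # work on a copy
    for i in range(len(a) - 1):
        # index of the smallest element in the unsorted suffix a[i:]
        m = min(range(i, len(a)), key=a.__getitem__)
        a[i], a[m] = a[m], a[i]
    return a
```

If the generated code's inner loop is doing adjacent-pair swaps, it's bubble sort no matter what the comments claim.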
Outside of programming, this is much worse. I've both seen online and heard people quote LLM output as if it were authoritative. That to me is the bigger danger of LLMs to society. People just don't understand that LLMs aren't high-powered attorneys or world-renowned doctors. And, unfortunately, the incorrect perception of LLMs is being hyped both by LLM companies and by "journalists" who are all too ready to simply run with and discuss the press releases from said LLM companies.
Still the elephant in the room. We need an AI technology that can output "don't know" when appropriate. How's that coming along?
This remains very far from proven.
The null hypothesis that would be necessary to reject, therefore, is a most unfortunate one, viz. that by training for plausibility we are creating the world's most convincing bullshit machines.
That is the most horribly dangerous idea, as we demand that the agent guesses not, even - and especially - when the agent is a champion at guessing - we demand that the agent checks.
If G guesses from the multiplication table with remarkable success, we more strongly demand that G computes its output accurately instead.
Oracles so accurate on average that people may forget they are not computers are dangerous.
The best example of this was an argument I had a little while ago about self-driving. I mentioned that I have a hard time trusting any system that relies only on cameras, and was told that I didn't understand how machine learning works, that obviously they were correct and I was wrong, and that every car would be self-driving within 5 years. All of these things could easily be verified independently.
Suffice to say that I am not sure that the "bullshit-radar" is that adaptive...
Mind you, this is not limited to the particular issue at hand, but I think those situations need to be highlighted, because we get fooled easily by authoritative delivery...
Humans don't, and LLMs are essentially trained to resemble most humans.
To make another parallel: that's why we have automated testing in software (long before LLMs). Because you can't trust without checking.
Unless you are in sales or marketing, getting caught lying is really detrimental to your career.
Too few "don't know"s and you end up being wrong, an idiot.
Seems to work for many people. I suspect my career has been hampered by a higher-than-average willingness to say "I don't know"...
People know that computers are deterministic, but most don't realize that determinism and accuracy are orthogonal. Most non-IT people give computers authoritative deference they do not deserve. This has been a huge issue with things like Shot Spotter, facial recognition, etc.
One thing I see a lot on X is people asking Grok what movie or show a scene is from.
LLMs must be really, really bad at this because not only is it never right, it actually just makes something up that doesn't exist. Every, single, time.
I really wish it would just say "I'm not good at this, so I do not know."
I mean, it basically does the same thing if you ask it to do anything racist or offensive, so that override ability is obviously there.
So if it identifies the request as identifying a movie scene, just say 'I don't know', for example.
No different than when asking ChatGPT to generate images or videos or whatever before it could, it would just tell you it was unable to.
So it can say “I don’t know”
But the most likely thing to continue a paper with is not to say "I don't know" at the end. It is actually to provide sources, which it proceeds to do wrongly.
Heh. Easiest answer in the world. To be able to say "don't know", one first has to be able to "know". And we aren't there yet, by a long way. Not within a million miles of it.
If a lawyer consistently makes stuff up on legal filings, in the worst cases they can lose their license (though they'll most likely end up getting fines).
If a doctor really sucks, they become uninsurable and ultimately could lose their medical license.
Devs that don't double-check their work will cause havoc with the product and, not only will they earn low opinions from their colleagues, they could face termination.
Again, not perfect, but also not unfigured out.
The "dunno" must not be hardcoded in the data, it must be an output of judgement.
You have an amount of material that speaks of the endeavours in some sport of some "Michael Jordan", the logic in the system decides that if a "Michael Jordan" in context can be construed to be "that" "Michael Jordan" then there will be sound probabilities he is a sportsman; you have very little material about a "John R. Brickabracker", the logic in the system decides that the material is insufficient to take a good guess.
This exists, each next token has a probability assigned to it. High probability means "it knows", if there's two or more tokens of similar probability, or the prob of the first token is low in general, then you are less confident about that datum.
Of course there's areas where there's more than one possible answer, but both possibilities are very consistent. I feel LLMs (chatgpt) do this fine.
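The mechanics here are straightforward when an API exposes the per-token logprobs (several providers do). A rough heuristic is to softmax the next-token logits and look at the top probability and its margin over the runner-up; a minimal Python sketch with made-up logit values:

```python
import math

def top_prob_and_margin(logits):
    """Softmax the next-token logits and return the top probability
    plus its margin over the runner-up. A crude 'confidence' signal,
    not a calibrated truth estimate. Assumes at least two logits."""
    mx = max(logits)
    exps = [math.exp(x - mx) for x in logits]  # shift max to 0 for stability
    total = sum(exps)
    probs = sorted((e / total for e in exps), reverse=True)
    return probs[0], probs[0] - probs[1]
```

As the replies below point out, though, a peaked token distribution only tells you the model wasn't torn between alternatives; it says nothing about whether the confident continuation is actually true.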
Also, can we stop pretending with the generic name for ChatGPT? It's like calling Viagra sildenafil instead of Viagra. Cut it out: there's the real deal and there are imitations.
It’s very rarely clear or explicit enough when that’s the case. Which makes sense considering that the LLMs themselves do not know the actual probabilities
Sure but they often are not necessarily easily interpretable or reliable.
You can use it to compare a model’s confidence of several different answers to the same question but anything else gets complicated and not necessarily that useful.
What? I use several LLMs, including ChatGPT, every day. It's not like they have it all cornered.
It isn't. LLMs are autocomplete with a huge context. It doesn't know anything.
nobody freaks out when humans make mistakes, but we assume our nascent AIs, being machines, should always function correctly all the time
The latter option every single time
A tool that does not function is a defective tool. When I issue a command, it had better do it correctly or it will be replaced.
It's a different type of tool - a person has to treat it that way.
I wouldn't bat an eye if people were taking code suggestions, then reviewing and editing them to make them correct. But from what I see, it's pretty much a direct push to production if they got it to compile, which is different from correct.
That's not an LLM problem. But indeed quite bothersome. Don't tell me what ChatGPT told you. Tell me what you know. Maybe you got it from ChatGPT and verified it. Great. But my jaw kind of drops when people cite an LLM and just assume it's correct.
Branding for current products has this property today - for example, Apple products are seen as being used by creatives and such.
people used to say the exact same thing with wikipedia back when it first started.
In fact, the error might even be a good thing; it reminds attentive readers that Wikipedia is an unreliable source and you always have to check if citations actually say the thing which is being said in the sentence they're attached to.
Comments like these honestly make me much more concerned than LLM hallucinations. There have been numerous times when I've tracked down the source for a claim, only to find that the source was saying something different, or that the source was completely unreliable (sometimes on the crackpot level).
Currently, there's a much greater understanding that LLMs are unreliable. Whereas I often see people treat Wikipedia, posts on AskHistorians, YouTube videos, studies from advocacy groups, and other questionable sources as if they can be relied on.
The big problem is that people in general are terrible at exercising critical thinking when they're presented with information. It's probably less of an issue with LLMs at the moment, since they're new technology and a certain amount of skepticism gets applied to their output. But the issue is that once people have gotten more used to them, they'll turn off their critical thinking in the same manner that they turn it off when absorbing information from other sources that they're used to.
See the Wikipedia page on the subject :)
You can check them, but Wikipedia doesn't care what they say. When I checked a citation on the French Toast page, and noted that the source said the opposite of what Wikipedia did by annotating that citation with [failed verification], an editor showed up to remove that annotation and scold me that the only thing that mattered was whether the source existed, not what it might or might not say.
The weird part is when people get really concerned that someone might treat the former as a reliable source, but then turn around and argue that people should treat the latter as a reliable source.
[0]: https://en.wikipedia.org/wiki/Equivocation
[1]: https://chatgpt.com/share/67e6adf3-3598-8003-8ccd-68564b7194...
For politically loaded topics, though, Wikipedia has become increasingly biased towards one side over the past 10-15 years.
One of these things is not like the others! Almost always, when I see somebody claiming Wikipedia is wrong about something, it's because they're some kind of crackpot. I find errors in Wikipedia several times a year; probably the majority of my contribution history to Wikipedia https://en.wikipedia.org/wiki/Special:Contributions/Kragen consists of me correcting errors in it. Occasionally my correction is incorrect, so someone corrects my correction. This happens several times a decade.
By contrast, I find many YouTube videos and studies from advocacy groups to be full of errors, and there is no mechanism for even the authors themselves to correct them, much less for someone else to do so. (I don't know enough about posts on AskHistorians to comment intelligently, but I assume that if there's a major factual error, the top-voted comments will tell you so—unlike YouTube or advocacy-group studies—but minor errors will generally remain uncorrected; and that generally only a single person's expertise is applied to getting the post right.)
But none of these are in the same league as LLM output, which in my experience usually contains more falsehoods than facts.
Wikipedia being world-editable and thus unreliable has been beaten into everyone's minds for decades.
LLMs just popped into existence a few years ago, backed by much hype and marketing about "intelligence". No, normal people you find on the street do not in fact understand that they are unreliable. Watch some less computer literate people interact with ChatGPT - it's terrifying. They trust every word!
If you read a non-fiction book on any topic, you can probably assume that half of the information in it is just extrapolated from the author's experience.
Even scientific articles are full of inaccurate statements, the only thing you can somewhat trust are the narrow questions answered by the data, which is usually a small effect that may or may not be reproducible...
Nonfiction books and scientific papers generally only have one person, or at best a dozen or so (with rare exceptions like CERN papers), giving attention to their correctness. Email messages and YouTube videos generally only have one. This limits the expertise that can be brought to bear on them. Books can be corrected in later printings, an advantage not enjoyed by the other three. Email messages and YouTube videos are usually displayed together with replies, but usually comments pointing out errors in YouTube videos get drowned in worthless me-too noise.
But popular Wikipedia articles are routinely corrected by hundreds or thousands of people, all of whom must come to a rough consensus on what is true before the paragraph stabilizes.
Consequently, although you can easily find errors in Wikipedia, they are much less common in these other media.
Newspaper articles? It really depends. I wouldn't take paraphrased quotes or "sources say" as fact.
But as you move to generally more reliable sources, you also have to be aware that they can mislead in different ways, such as constructing the information in a particular way to push a particular narrative, or leaving out inconvenient facts.
It behaves more like an accountable mediator of authority.
Perhaps LLMs offering those (among other) features would be reasonably matched in an authoritativeness comparison.
Basically at the level of other publishers, meaning they can be as biased as MSNBC or Fox News, depending on who controls them.
So what is your point? You seem to have placed assumptions there. And broad ones, so that differences between the two things, and complexities, the important details, do not appear.
It is, if the purpose of LLMs was to be AI. "Large language model" as a choir of pseudorandom millions converged into a voice - that was achieved, but it is by definition out of the professional realm. If it is to be taken as "artificial intelligence", then it has to have competitive intelligence.
Yes but they're literally told by allegedly authoritative sources that it's going to change everything and eliminate intellectual labor, so is it totally their fault?
They've heard about the uncountable sums of money spent on creating such software, why would they assume it was anything short of advertised?
Why does this imply that they’re always correct? I’m always genuinely confused when people pretend like hallucinations are some secret that AI companies are hiding. Literally every chat interface says something like “LLMs are not always accurate”.
In small, de-emphasized text, relegated to the far corner of the screen. Yet, none of the TV advertisements I've seen have spent any significant fraction of the ad warning about these dangers. Every ad I've seen presents someone asking a question to the LLM, getting an answer and immediately trusting it.
So, yes, they all have some light-grey 12px disclaimer somewhere. Surprisingly, that disclaimer does not carry nearly the same weight as the rest of the industry's combined marketing efforts.
I just opened ChatGPT.com and typed in the question “When was Mr T born?”.
When I got the answer there were these things on screen:
- A menu trigger in the top-left.
- Log in / Sign up in the top right
- The discussion, in the centre.
- A T&Cs disclaimer at the bottom.
- An input box at the bottom.
- “ChatGPT can make mistakes. Check important info.” directly underneath the input box.
I dislike the fact that it’s low contrast, but it’s not in a far corner, it’s immediately below the primary input. There’s a grand total of six things on screen, two of which are tucked away in a corner.
This is a very minimal UI, and they put the warning message right where people interact with it. It’s not lost in a corner of a busy interface somewhere.
Though, my real point is we need to weigh that disclaimer, against the combined messaging and marketing efforts of the AI industry. No TV ad gives me that disclaimer.
Here's an Apple Intelligence ad: https://www.youtube.com/watch?v=A0BXZhdDqZM. No disclaimer.
Here's a Meta AI ad: https://www.youtube.com/watch?v=2clcDZ-oapU. No disclaimer.
Then we can look at people's behavior. Look at the (surprisingly numerous) cases of lawyers getting taken to the woodshed by a judge for submitting filings to a court with ChatGPT-introduced fake citations! Or someone like Ana Navarro confidently repeating an incorrect fact and, when people pushed back, saying "take it up with ChatGPT" (https://x.com/ananavarro/status/1864049783637217423).
I just don't think the average person who isn't following this closely understands the disclaimer. Hell, they probably don't even really read it, because most people skip over de-emphasized text in most UIs.
So, in my opinion, whether it's right next to the text-box or not, the disclaimer simply cannot carry the same amount of cultural impact as the "other side of the ledger" that are making wild, unfounded claims to the public.
That was necessary to build trust until they had enough power to convert that trust into money and power.
> Literally every chat interface says something like “LLMs are not always accurate”.
Thank you.
Well... yeah.
No disclaimer is gonna change that.
The underlying cause: 3rd order ignorance:
3rd Order Ignorance (3OI)—Lack of Process. I have 3OI when I don't know a suitably efficient way to find out I don't know that I don't know something. This is lack of process, and it presents me with a major problem: If I have 3OI, I don't know of a way to find out there are things I don't know that I don't know.
-- not from an LLM
My process: use llms and see what I can do with them while taking their Output with a grain of salt.
Symptom: "Response was, 'Use the `solvetheproblem` command'". // Cause: "It has no method to know that there is no `solvetheproblem` command". // Alarm: "It is suggested that it is trying to guess a plausible world through lacking wisdom and data". // Fault: "It should have a database of what seems to be states of facts, and it should have built the ability to predict the world more faithfully to facts".
I’m counting down the days until some AI hallucination makes its way all the way to the C-suite. People will get way too comfortable with AI and not understand just how wrong it can be.
Some assumption will come from AI, no one will check it and it’ll become a basic business input. Then suddenly one day someone smart will say “thats not true” and someone will trace it back to AI. I know it.
I assume at that point in time there will be some general directive on using AI and not assuming it’s correct. And then AI will slowly go out of favor.
Claude is cheaper, faster, produces better code.
If your junior developer is just "junior", that is one matter; if your junior developer hallucinates documentation details, that's different.
--
Edit: new information may contribute to even this exchange, see https://www.anthropic.com/research/tracing-thoughts-language...
> It turns out that, in Claude, refusal to answer is the default behavior
I.e., boxes that incline to different approaches to heuristic will behave differently and offer different value (to be further assessed within a framework of complexity, e.g. "be creative but strict" etc.)
Claude is substantially cheaper for me, per reviewed, fixed change committed. More importantly to me, it demands less of my limited time per reviewed, fixed change committed.
Having a junior dev working with me at this point wouldn't be worth it to me if it wasn't for the training aspect: We still need pipelines of people who will learn to use the AI models, and who will learn to do the things it can't do well.
Just to be clear, since that expression may reveal a misunderstanding, I meant the sophisticated version of
((gain_jd-loss_jd)>(gain_llm-loss_llm))?(jd):(llm)
But my point was: it's good that Claude has become a rightful legend in the realm of coding, but before and regardless, a candidate who told you "that class will have a .SolveAnyProblem() method: I want to believe" presents a handicap. As you said, no assistant has turned out to be perfect, but assistants who mix coding sessions with creative fiction writing raise alarms. The solution is to be selective and careful, like always.
The same is true about the internet, and people even used to use these arguments to try to dissuade people from getting their information online (back when Wikipedia was considered a running joke, and journalists mocked blogs). But today it would be considered silly to dissuade someone from using the internet just because the information there is extremely unreliable.
Many programmers will say Stack Overflow is invaluable, but it's also unreliable. The answer is to use it as a tool and a jumping-off point to help you solve your problem, not to assume that it's authoritative.
The strange thing to me these days is the number of people who will talk about the problems with misinformation coming from LLMs, but then who seem to uncritically believe all sorts of other misinformation they encounter online, in the media, or through friends.
Yes, you need to verify the information you're getting, and this applies to far more than just LLMs.
I can peruse your previous posts to see how truthful you are, I can tell if your post has been down/upvoted, I can read responses to your post to see if you've been called out on anything, etc.
This applies tenfold in real life where over time you get to build comprehensive mental models of other people.
Regardless of anything else, it's far too early to make such claims. We have to wait until people start allowing "AI agents" to make autonomous black-box decisions with minimal supervision, since nobody has any clue what's happening.
Even if we tone down the sci-fi dystopia angle, not that many people really use LLMs in non-superficial ways yet. What I'm most afraid of would be the next generation growing up without the ability to critically synthesize information on their own.
But the implication of what you are saying is that academic rigour is going to be ditched overnight because of LLMs.
That’s a little bit odd. Has the scientific community ever thrown up its collective hands and said “ok, there are easier ways to do things now, we can take the rest of the decade off, phew what a relief!”
Not across all levels, and certainly not overnight. But a lot of children entering the pipeline might end up having a very different experience than anyone else before LLMs (unless they are very lucky to be in an environment that provides them better opportunities).
> cannot critically synthesize information on their own.
That’s true, but if even fewer people try to do that, or even know where to start, it will get even worse.
Because in people's experience, LLMs are often correct.
You are right LLMs are not authoritative, but people trust it exactly because they often do produce correct answers.
Happened to me as well. Wanted it to quickly write an algorithm for standard deviation over a stream of data, which is a text-book algorithm. It did it almost right, but messed up the final formula and the code gave wrong answers. Weird, considering some correct codes exist for that problem in Wikipedia.
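The textbook single-pass method being described is presumably Welford's online algorithm (the one on Wikipedia's "Algorithms for calculating variance" page); a minimal sketch of the correct version:

```python
import math

class RunningStats:
    """Welford's online algorithm: numerically stable mean and
    standard deviation over a stream, one value at a time."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)  # note: uses the *updated* mean

    def std(self, sample=False):
        """Population std by default; pass sample=True for the n-1 divisor."""
        denom = self.n - 1 if sample else self.n
        return math.sqrt(self.m2 / denom) if denom > 0 else 0.0
```

The final formula is exactly where a subtle bug bites: `m2` must be divided by `n` (or `n - 1`) and then square-rooted, and the second factor in the `m2` update must use the already-updated mean.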
FWIW, here's 4o writing a selection sort: https://chatgpt.com/share/67e60f66-aacc-800c-9e1d-303982f54d...
And all the models are identical in not being able to discern what is real or something it just made up.
I mean asking a straightforward question like: https://chatgpt.com/share/67e60f66-aacc-800c-9e1d-303982f54d... is entirely pointless as a test
"I've even caught it literally writing the wrong algorithm when asked to implement a specific and well known algorithm. For example, asking it "write a selection sort" and watching it write a bubble sort instead. No amount of re-prompts pushes it to the right algorithm in those cases either, instead it'll regenerate the same wrong algorithm over and over."
Code created by LLMs doesn't compile: hallucinated APIs, invalid syntax, and completely broken logic. Why would you trust it with someone's life?
I imagine we're talking about it as an extra resource rather than trusting it as final in a life or death decision.
I'd like to think so. Trust is also one of those non-concrete terms that have different meanings to different people. I'd like to think that doctors use their own judgement to include the output from their trained models, I just wonder how long it is till they become the default judgement when humans get lazy.
Black-box decisions I absolutely have a problem with. But an extra resource considered by people with an understanding of risks is fine by me. Like I've said in other comments, I understand what it is and isn't good at, and have a great time using ChatGPT for feedback or planning or extrapolating or brainstorming. I automatically filter out the "Good point! This is a fantastic idea..." response it inevitably starts with...
In fact, the phenomenon of pseudo-intelligence scares those who were hoping to get tools that limited the original problem, as opposed to potentially boosting it.
See, now that is something I don't know why I should trust: a random person on the internet citing a statistic that they saw someone else say.
Unlike the LLM, i'm willing to be truthful about my memory.
The irony...
So what? People are wrong all the time. What happens when people are wrong? Things go wrong. What happens then? People learn that the way they got their information wasn't robust enough and they'll adapt to be more careful in the future.
This is the way it has always worked. But people are "worried" about LLMs... Because they're new. Don't worry, it's just another tool in the box, people are perfectly capable of being wrong without LLMs.
For those sensitive use cases, it is imperative we create regulation, like every other technology that came before it, to minimize the inherent risks.
In an unrelated example, I saw someone saying recently they don't like a new version of an LLM because it no longer has "cool" conversations with them, so take that as you will from a psychological perspective.
The LLM doesn’t need to be perfect. Just needs to beat a typical human.
LLM opponents aren’t wrong about the limits of LLMs. They vastly overestimate humans.
And many, many companies are proposing and implementing uses for LLMs that intentionally obscure that accountability.
If a person makes up something, innocently or maliciously, and someone believes it and ends up getting harmed, that person can have some liability for the harm.
If an LLM hallucinates something that someone believes, and they end up getting harmed, there's no accountability. And it seems that AI companies are pushing for laws and regulations that further protect them from this liability.
These models can be useful tools, but the targets these AI companies are shooting for are going to be actively harmful in an economy that insists you do something productive for the continued right to exist.
1. To make those harmed whole. On this, you have a good point. The desire of AI firms or those using AI to be indemnified from the harms their use of AI causes is a problem as they will harm people. But it isn't relevant to the question of whether LLMs are useful or whether they beat a human.
2. To incentivize the human to behave properly. This is moot with LLMs. There is no laziness or competing incentive for them.
That’s not a positive at all; it’s the complete opposite. It’s not about laziness but about being able to somewhat accurately estimate and balance the risk/benefit ratio.
The fact that making a wrong decision would have significant costs for you and other people should have a significant influence on decision making.
The incentives for the LLM are dictated by the company, at the moment it only seems to be 'whatever ensures we continue to get sales'.
An airline tried to blame its chatbot for inaccurate advice it gave (whether a discount could be claimed after a flight). Tribunal said no, its chatbot was not a separate legal entity.
https://www.bbc.com/travel/article/20240222-air-canada-chatb...
Imagine a chatbot making false promises to prospective customers. Your claim gets denied, you fight it out only to learn their ToS absolves them of "AI hallucinations".
On the contrary. Humans can earn trust, learn, and can admit to being wrong or not knowing something. Further, humans are capable of independent research to figure out what it is they don't know.
My problem isn't that humans are doing similar things to LLMs, my problem is that humans can understand consequences of bullshitting at the wrong time. LLMs, on the other hand, operate purely on bullshitting. Sometimes they are right, sometimes they are wrong. But what they'll never do or tell you is "how confident am I that this answer is right". They leave the hard work of calling out the bullshit on the human.
There's a level of social trust that exists which LLMs don't follow. I can trust when my doctor says "you have a cold" that I probably have a cold. They've seen it a million times before and they are pretty good at diagnosing that problem. I can also know that doctor is probably bullshitting me if they start giving me advice for my legal problems, because it's unlikely you are going to find a doctor/lawyer.
> Just needs to beat a typical human.
My issue is we can't even measure accurately how good humans are at their jobs. You now want to trust that the metrics and benchmarks used to judge LLMs are actually good measures? So many of the LLM advocates try to pretend that you can objectively measure goodness in subjective fields by just writing some unit tests. It's literally the "Oh look, I have an Oracle Java certificate" or "AWS Solutions Architect" method of determining competence.
And so many of these tests aren't being written by experts. Perhaps the coding tests, but the legal tests? Medical tests?
The problem is LLM companies are bullshiting society on how competently they can measure LLM competence.
Some humans can, certainly. Humans as a race? Maybe, ish.
You can do the same with an LLM; I gaslight ChatGPT all the time so it doesn't hallucinate.
In fact we do know how good doctors and lawyers are at their jobs, and the answer is "not very." Medical negligence claims are a huge problem. Claims against lawyers are harder to win - for obvious reasons - but there is plenty of evidence that lawyers cannot be presumed competent.
As for coding, it took a friend of mine three days to go from a cold start with zero dev experience to creating a usable PDF editor with a basic GUI for a specific small set of features she needed for ebook design.
No external help, just conversations with ChatGPT and some Googling.
Obviously LLMs have issues, but if we're now in the "Beginners can program their own custom apps" phase of the cycle, the potential is huge.
This is actually an interesting one - I’ve seen a case where some copy/pasted PDF saving code caused hundreds of thousands of subtly corrupted PDFs (invoices, reports, etc.) over the span of years. It was a mistake that would be very easy for an LLM to make, but I sure wouldn’t want to rely on chatgpt to fix all of those PDFs and the production code relying on them.
> days to go from a cold start with zero dev experience
How is that relevant?
This paragraph makes little sense. A negligence claim is based on a deviation from some reasonable standard, which is essentially a proxy for the level of care/service that most practitioners would apply in a given situation. If doctors were as regularly incompetent as you are trying to argue then the standard for negligence would be lower because the overall standard in the industry would reflect such incompetence. So the existence of negligence claims actually tells us little about how good a doctor is individually or how good doctors are as a group, just that there is a standard that their performance can be measured against.
I think most people would agree with you that medical negligence claims are a huge problem, but I think that most of those people would say the problem is that so many of these claims are frivolous rather than meritorious, resulting in doctors paying more for malpractice insurance than necessary and also resulting in doctors asking for unnecessarily burdensome additional testing with little diagnostic value so that they don’t get sued.
I won’t defend lawyers. They’re generally scum.
If LLM output is like a magic 8 ball you shake, that is not very valuable unless it is workload management for a human who will validate the fitness of the output.
It's one of those "quantity is so fascinating, let's ignore how we got here in the first place" situations.
If you're lucky it figures it out. If you aren't, it makes stuff up in a way that seems almost purposefully calculated to fool you into assuming that it's figured everything out. That's the real problem with LLMs: they fundamentally cannot be trusted because they're just a glorified autocomplete; they don't come with any inbuilt sense of when they might be getting things wrong.
What matters is speeding up how fast I can find information. Not only will LLMs sometimes answer my obscure questions perfectly themselves, but they also help to point me to the jargon I need to use to find that information online. In many areas this has been hugely valuable to me.
Sometimes you do just have to cut your losses. I've given up on asking LLMs for help with Zig, for example. It is just too obscure a language I guess, because the hallucination rate is too high to be useful. But for webdev, Python, matplotlib, or bash help? It is invaluable to me, even though it makes mistakes every now and then.
What is the point of limiting delegation to such an extreme dichotomy? As opposed to getting more things done?
The vast majority of useful things we delegate, or do for others ourselves, are not as well specified, or as legally liable for any imperfections, as an accountant doing accounting.
The entire pitch behind AI is that it can automate these jobs. If you can’t trust it, then AI is useless. Excluding art theft obviously.
I will take LLM over real person anytime! At least it does not get b*hurt when I double check!
Spend some time with current reasoning models. Your experience is obsolete if you still hold this belief.
Sounds like your experiences, along with zozbot234's, are different enough from mine that they are worth repeating and understanding. I'll report back with the results I see on the current models.
- LLMs are a miraculous technology that are capable of tasks far beyond what we believed would be achievable with AI/ML in the near future. Playing with them makes me constantly feel like "this is like sci-fi, this shouldn't be possible with 2025's technology".
- LLMs are fairly clueless for many tasks that are easy enough for humans, and they are nowhere near AGI. It's also unclear whether they scale up towards that goal. They are also worse programmers than people make them to be. (At least I'm not happy with their results.)
- Achieving AGI doesn't seem impossibly unlikely any more, and doing so is likely to be an existentially disastrous event for humanity, and the worst fodder of my nightmares. (Also in the sense of an existential doomsday scenario, but even just the thought of becoming... irrelevant is depressing.)
Having one of these beliefs makes me the "AI hyper" stereotype, another makes me the "AI naysayer" stereotype and yet another makes me the "AI doomer" stereotype. So I guess I'm all of those!
In my opinion, there can exist no AI, person, tool, ultra-sentient omniscient being, etc. that would ever render you irrelevant. Your existence, experiences, and perception of reality are all literally irreplaceable, and (again, just my opinion) inherently meaningful. I don't think anyone's value comes from their ability to perform any particular feat to any particular degree of skill. I only say this because I had similar feelings of anxiety when considering the idea of becoming "irrelevant", and I've seen many others say similar things, but I think that fear is largely a product of misunderstanding what makes our lives meaningful.
It seems like she was given a drill with a flathead bit, and she just complains for months on end that it often fails (she didn't charge the drill) or gives her useless results (she uses it on Phillips heads). How about figuring out what works and what doesn't, and adjusting your use of the tool accordingly? If she is a painter, don't blame the drill for messing up her painting.
When I tried to use that technology, its 90% accuracy meant 1 out of every 10 things I wrote was incorrect. If it had been a keyboard I would have thrown it in the trash. That is where my Palm ended up.
People expect their technology to do things better, not almost as well as a human. Waymo, with LIDAR, hasn't killed people. Tesla, with cameras only, has done so multiple times. I will ride in a Waymo, never in a Tesla self-driving car.
Almost every counter-criticism of LLMs boils down to
1. you're holding it wrong
2. Well, I use it at $DAYJOB and it works great for me! (And $DAYJOB is software engineering.)
I'm glad your wife was able to save 2 hours of work, but forgive me if that doesn't translate to the trillion dollar valuation OpenAI is claiming. It's strange you don't see the inherent irony in your post. Instead of your wife just directly uploading the dataset and a prompt, she first has to prompt it to write code. There are clear limitations and it looks like LLMs are stuck at some sort of wall.
When computers/the internet first came about, there were (and still are!) people who would struggle with basic tasks. Without knowing the specific task you are trying to do, it's hard to judge whether it's a problem with the model or with you.
I would also say that prompting isn't as simple as made out to be. It is a skill in itself and requires you to be a good communicator. In fact, I would say there is a reasonable chance that even if we end up with AGI level models, a good chunk of people will not be able to use it effectively because they can't communicate requirements clearly.
> It's strange you don't see the inherent irony in your post. Instead of your wife just directly uploading the dataset and a prompt, she first has to prompt it to write code. There are clear limitations and it looks like LLMs are stuck at some sort of wall.
What's ironic about that? That's such a tiny imperfection. If that's anything near the biggest flaw then things look amazing. (Not that I think it is, but I'm not here to talk about my opinion, I'm here to talk about your irony claim.)
This reply is 4 comments deep into such cases, and the OP is about a well educated person who describes their difficulties.
>What's ironic about that? That's such a tiny imperfection.
I'd argue it's not tiny - it highlights the limitations of LLMs. LLMs excel at writing basic code but seem to struggle, or are untrustworthy, outside of those tasks.
Imagine generalizing his case: his wife goes to work and tells other bookkeepers "ChatGPClaudeSeek is amazing, it saved 2 hours for me". A coworker, married to a lawyer, instead of a software engineer, hearing this tries it for himself, and comes up short. Returning to work the next day and talking about his experience is told - "oh you weren't holding it right, ChatGPClaudeSeek can't do the work for you, you have to ask it to write code, that you must then run". Turns out he needs an expert to hold it properly and from the coworker's point of view he would probably need to hire an expert to help automate the task, which will likely only be marginally less expensive than it was 5 years ago.
From where I stand, things don't look amazing; at least as amazing as the fundraisers have claimed. I agree that LLMs are awesome tools - but I'm evaluating from a point of a potential future where OpenAI is worth a trillion dollars and is replacing every job. You call it a tiny imperfection, but that comes across as myopic to me - large swaths of industries can't effectively use LLMs! How is that tiny?
The LLM wrote the code, then used the code itself, without needing a coder around. So the only negative was needing to ask it specifically to use code, right? In that case, with code being the thing it's good at, "tell the LLM to make and use code" is going to be in the basic tutorials. It doesn't need an expert. It really is about "holding it right" in a non-mocking way, the kind of instructions you expect to go through for using a new tool.
If you can go through a one hour or less training course while only half paying attention, and immediately save two hours on your first use, that's a great return on the time investment.
It'd take more time for me to flesh this out than I want to give but the basic idea is I am not just sitting there "expecting things". I've been puzzled too at why so many people don't seem to get it or are so frustrated like this lady, and in my observation this is their common element. It just looks very passive to me, the way they seem to use the machines and expect a result to be "given" to them.
PS. It reminds me very strongly of how our parent generation uses computers. Like the whole way of thinking is different, I cannot even understand why they would act certain ways or be afraid of acting in other ways, it's like they use a different compass or have a very different (and wrong) model in their head of how this thing in front of them works.
But I mirror the confusion why people are still bullish on it. The current valuation for it is because the market thinks that it's able to write code like a senior engineer and have AGI, because that's how they're marketed by the LLM providers.
I'm not even certain if they'll be ubiquitous after the venture capital investments are gone and the service needs to actually be priced without losing money, because they're (at least currently) mostly pretty expensive to run.
In markets, perception is reality, and the perception is that these companies are innovative. That’s it.
“In the short run, the market is a voting machine but in the long run, it is a weighing machine.”
- Benjamin Graham, 1949
NFT is still a great tool if you want a bunch of unique tokens as part of a blockchain app. ERC-721 was proven a capable protocol in a variety of projects. What it isn't, and never will be, is an amazing investment opportunity, or a method to collect cool rare apes and go to yacht parties.
LLMs will settle in and have their place too, just not in the forefront of every investors mind.
The technology is useful, for some people, in some situations. It will get more useful for more people in more situations as it improves.
Current valuations are too high (Gartner hype cycle), after they collapse valuations will be too low (again, hype cycle), then it'll settle down and the real work happens.
There's almost a negligible chance any one of these shops stays truly independent, unless propped up by a state-level actor (China/EU)
You might have some consulting/service companies that will promise to tailor big models to your specific needs, but they will be valued accordingly (nowhere near billions).
I don't know if the survivors are going to be in consulting - there is some kind of LLM-base product capability, you could conceivably see a set of LLM-based products building companies emerge. But it'll probably be a bit different, like the mobile app boom was a bit different from the web boom.
The endgame for these AI companies is to create intelligence equal to a human's. Imagine not paying a 23k-person workforce; that's a lot of money to be made.
If machines can easily replace all of your workers, that means other people's machines can also replace your workers.
LLMs might lead to AGI. Eventually.
Meanwhile every company that is spruiking that, and betting their business that that's going to happen before they run out of VC funding, is going to fail.
I expect it to eventually be a duopoly like android and iOS. At world scale, it might divide us in a way that politics and nationalities never did. Humans will fall into one of two AI tribes.
Additionally, you can use reasoning model thinking with non-reasoning models to improve output, so I wouldn't be surprised if the common pattern was routing hard queries to reasoning models to solve at a high level, then routing the solution plan to a smaller on device model for faster inference.
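That routing pattern can be sketched in a few lines. Everything here is illustrative: the model names, the `call_model` stub, and the difficulty heuristic are all hypothetical placeholders, not a real API.

```python
# Illustrative query router: send hard queries to a large reasoning model
# for a solution plan, then hand the plan to a small local model for fast
# inference. Model names and call_model are stand-ins, not real endpoints.

def is_hard(query: str) -> bool:
    # Naive heuristic: treat long or explicitly multi-step questions as "hard".
    markers = ("prove", "derive", "step by step", "plan")
    return len(query) > 200 or any(m in query.lower() for m in markers)

def call_model(model: str, prompt: str) -> str:
    # Stand-in for a real inference call; echoes which model was used.
    return f"[{model}] {prompt[:40]}"

def answer(query: str) -> str:
    if is_hard(query):
        plan = call_model("reasoning-large", f"Outline a solution plan: {query}")
        return call_model("local-small", f"Execute this plan: {plan}")
    return call_model("local-small", query)
```

The design choice is that the expensive model only ever produces a plan, while the cheap model does all user-facing generation.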
If I'm wrong though and some digital alchemy finally manages to turn our facebook comments into a super-intelligence we'll only have a few years of an increasingly hellish dystopia before the machines do the smart thing and humanity gets what we deserve.
I feel like a lot of people mean that OpenAI is burning through venture capital money. It's debatable, but it's a huge jump to go from that to thinking it's going to crash the stock market (OpenAI isn't even publicly traded).
I'm not going to check all of the companies, but at least looking at the first two, I'm not really seeing anything out of the ordinary.
They just need to stay a bit ahead of the open source releases, which is basically the status quo. The leading AI firms have a lot of accumulated know-how wrt. building new models and training them, that the average "no-name cloud" vendor doesn't.
No, OpenAI alone additionally need approximately $5B of additional cash each and every year.
I think Claude is useful. But if they charged enough money to be cashflow positive, it's not obvious enough people would think so. Let alone enough money to generate returns to their investors.
Based on that alone it’s worth quite a lot.
Is there a shortage of React apps out there that companies are desperate for?
I'm not having a go at you--this is a genuine inquiry.
How many average people are feeling like they're missing some software that they're able to prompt into existence?
I think if anything, the last few years have revealed the opposite, that there's a large/huge surplus of people in the greater software business at large that don't meet the demand when money isn't cheap.
I think anyone in the "average" range of skill looking for a job can attest to the difficulties in finding a new/any job.
I have a friend in IT who never knew how to code but is now saving a lot of time every month with just basic scripts.
IMO if coding models get good enough to replace devs, we will see an explosion of software before it flattens out.
Tech always has a shortage because it's a kitchen-sink problem; corporations want to reduce headcount so it saves money in the long run.
We're several years in now, and have lots of A:B comparisons to study across orgs that allowed and prohibited AI assistants. Is one of those groups running away with massive productivity gains?
Because I don't think anybody's noticed that yet. We see layoffs that make sense on their own after a boom, and that cut across AI-friendly and -unfriendly orgs. But we don't seem to see anybody suddenly breaking out with 2x or 5x or 10x productivity gains on actual deliverables. In contrast, the enshittening just seems to be continuing as it has for years, and the pace of new products and features is holding steady. No?
You mean... two years in? Where was the internet 2 years into it?
You may be intending to refer to 1971 (about two years after the creation of ARPANet) but really the more accurate comparison would be to 1995 (about two years since ISPs started offering SLIP/PPP dialup to the general public for $50/month or less).
And I think the comparison to 1995, the year of the Netscape IPO and URLs starting to appear in commercials and on packaging for consumer products, is apt: LLMs have been a research technology for a while, it’s their availability to the general public that’s new in the last couple of years. Yet while the scale of hype is comparable, the products aren’t: LLMs still don’t do anything remotely like what their boosters claim, and have done nothing to justify the insane amounts of money being poured into them. With the Internet, however, there were already plenty of retailers starting to make real money doing electronic commerce by 1995, not just by providing infrastructure and related services.
It’s worth really paying attention to Ed Zitron’s arguments here: The numbers in the real world just don’t support the continued amount of investment in LLMs. They’re a perfectly fine area of advanced research but they’re not a product, much less a world-changing one, and they won’t be any time soon due to their inherent limitations.
I think it's pretty fair to say that they have close to doubled my productivity as a programmer. My girlfriend uses ChatGPT daily for her work, which is not "tech" at all. It's fair to be skeptical of exactly how far they can go but a claim like this is pretty wild.
It remains to be seen how viable this casual usage actually is once this money dries up and you actually need to pay per prompt. We'll just have to see where the pricing will eventually settle, before that we're all just speculating.
My grandfather didn’t care about these and you don’t care about LLMs, we get it
> They’re a perfectly fine area of advanced research but they’re not a product
lol come on man
But these are people that wanted to be in programming in the first place.
This "my mom can now code and got a job because of LLMs" myth, does this creature really exist in the wild?
They aren’t building anything themselves. I find this to be disingenuous at best, and a sign to me of bubble attribution.
I also think that re-branding Machine Learning as AI to also be disingenuous.
These technologies of course have their use cases and excel at some things, but this isn’t the ushering in of actual, sapient intelligence, which for the majority of the term’s existence was the de facto agreed standard for “AI”. This technology lacks the actual markers of what is generally accepted as intelligence to begin with.
> IBM had developed a paper plan for such a machine and took this paper plan across the country to some 20 concerns that we thought could use such a machine. I would like to tell you that the [IBM 701] machine rents for between $12,000 and $18,000 a month, so it was not the type of thing that could be sold from place to place. But, as a result of our trip, on which we expected to get orders for five machines, we came home with orders for 18.
LLMs today feel like the former, but are being marketed as the latter. Fully believe that advancements will make them better, but in their current state they're being touted for their possibilities, not their actual capabilities.
I'm for using AI now as the tool they are, but AI is a while off taking senior development jobs. So when I see them being hyped for doing that it just feels like a hype bubble.
I personally think Tesla is positioned well for EV first world compared to other brands. But Chinese companies are catching up.
They have neither FSD nor the cars at the moment.
We saw this with the web. Pets.com was not a billion-dollar company, but the web was real.
I am actually of the belief that LLMs will be amazing but that rank and file companies are going to be the ones that benefit the most.
Just like the internet.
But Moore's law should kick in, shouldn't it?
No it's not. If it was valued for that it'd be at least 10X what it is now.
Blockchains are becoming real-time data structures where everyone has admin level read-only access to everyone.
HN doesn't believe in blockchain the same way Apple, Microsoft, Google, etc. do.
It reminds me a lot of when I first started playing No Man's Sky (the video game). Billions of galaxies! Exotic, one of a kind life forms on every planet! Endless possibilities! I poured hundreds of hours into the game! But, despite all the variety and possibilities, the patterns emerge, and every 'new' planet just feels like a first-person fractal viewer. Pretty, sometimes kinda nifty, but eventually very boring and repetitive. The illusion wore off, and I couldn't really enjoy it anymore.
I have played with a LOT of models over the years. They can be neat, interesting, and kinda cool at times, but the patterns of output and mistakes shatters the illusion that I'm talking to anything but a rather expensive auto-complete.
IMO there are two distinct reasons for this:
1. You've got the Sam Altmans of the world claiming that LLMs are or nearly are AGI and that ASI is right around the corner. It's obvious this isn't true even if LLMs are still incredibly powerful and useful. But Sam doing the whole "is it AGI?" dance gets old really quick.
2. LLMs are an existential threat to basically every knowledge worker job on the planet. Peoples' natural response to threats is to become defensive.
Just off the top of my head there are plenty of knowledge worker jobs where the knowledge isn’t public, nor really in written form anywhere. There just simply wouldn’t be anything for AI to train on.
Given the typical problems of LLMs they are not. You still need them to check the results. It’s like FSD, impressive when it works, bad if not, scary because you never known beforehand when it’s failing
My wife and I both work on and with LLMs and they seem to be, like… 5-10% productivity boosters on a good day. I’m not sure they’re even that good averaged over a year. And they don’t seem to be getting a lot better in ways that change that. Also, they’re that good if you’re good at using them and I can tell you most people really, really are not.
I remember when it was possible to be “good at Google”. It was probably a similar productivity boost. I was good at Google. Most (like, over 95% of) people were not, and didn’t seem to be able to get there, and… also remained entirely employable despite that.
Even if they fail 1% of the time, the cost savings are too great. Businesses will take the risk.
I feel bad for people who haven't yet experienced how useful these models are for programming.
Some also just prefer manually entering everything. Those people I will never understand.
For reference, I program systems code in C/C++ in a large, proprietary codebase.
My experiences with OpenAI (a year ago or more), and more recently, Cursor, Grok-v3 and Deepseek-r1, were all failures. The latter two started out OK and got worse over time.
What I haven't done is asked "AI" to whip up a more standard application. I have some ideas(an ncurses frontend to p4 written in python similar to tig, for instance), but haven't gotten around to it.
I want this stuff to work, but so far it hasn't. Now I don't think "programming" a computer in english is a very good idea anyway, but I want a competent AI assistant to pair program with. To the degree that people are getting results, to me it seems they are leveraging very high-level APIs/libraries of code which are not written by AI and solving well-solved, "common" problems(simple games, simple web or phone apps). Sort of like how people gloss over the heavy lifting done by language itself when they praise the results from LLMs in other fields.
I know it eventually will work. I just don't know when. I also get annoyed by the hype of folks who think they can become software engineers because they can talk to an LLM. Most of my job isn't programming. Most of my job is thinking about what the solution should be, talking to other people like me in meetings, understanding what customers really want beyond what they are saying, and tracking what I'm doing in various forms(which is something I really do want AI to help me with).
Vibe coding is aptly named because it's sort of the VB6 of the modern era. Holy cow! I wrote a Windows GUI App!!!. It's letting non-programmers and semi-programmers(the "I write glue code in Python to munge data and API ins/outs" crowd) create usable things. Cool! So did spreadsheets. So did Hypercard. Andrej tweeting that he made a phone app was kinda cool but also kinda sad. If this is what the hundreds of billions spent on AI(and my bank account thanks you for that) delivers then the bubble is going to pop soon.
Usually that's because of context: LLMs are not very good at understanding a very large amount of context, but if you don't give LLMs enough context, they can't magically figure it out on their own. This relegates AI to only really being useful for pretty self-contained examples where the amount of code is small, and you can provide all the context it needs to do its job in a relatively small amount of text (few thousand words or lines of code at most).
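A rough illustration of that constraint: before handing code to a model, you can estimate whether it fits a context budget. The 4-characters-per-token ratio below is a common rule of thumb, not an exact tokenizer, and the function names are my own.

```python
# Rough check that a set of source snippets fits in a model's context window.
# Uses the common ~4 characters per token heuristic; real tokenizers differ.

def estimate_tokens(text: str) -> int:
    return len(text) // 4

def fits_in_context(snippets: list[str], context_limit: int,
                    reserve: int = 1000) -> bool:
    # Reserve room for the prompt itself and for the model's reply.
    total = sum(estimate_tokens(s) for s in snippets)
    return total + reserve <= context_limit
```

If the check fails, you either trim the context by hand or accept that the model will be guessing at what it can't see.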
That's why I think LLMs are only useful right now in real-world software development for things like one-off functions, new prototypes, writing small scripts, or automating lots of manual changes you have to do. For example, I love using o3-mini-high to take existing tests that I have and modifying them to make a new test case. Often this involves lots of tiny changes that are annoying to write, and o3-mini-high can make those changes pretty reliably. You just give it a TODO list of changes, and it goes ahead and does it. But I'm not asking these models how to implement a new feature in our codebase.
I think this is why a lot of software developers have a bad view of AI. It's just not very good at the core software development work right now, but it's good enough at prototypes to make people freak out about how software development is going to be replaced.
That's not to mention that often when people first try to use LLMs for coding, they don't give the LLMs enough context or instructions to do well. Sometimes I will spend 2-3 minutes writing a prompt, but I often see other people putting the bare minimum effort into it, and then being surprised when it doesn't work very well.
And in terms of time saved, if I am just changing string constants, it’s not going to help much. But if I’m restructuring the test to verify things in a different way, then it is helpful. For example, recently I was writing tests for the JSON output of a program, using jq. In this case, it’s pretty easy to describe the tests I want to make in English, but translating that to jq commands is annoying and a bit tedious. But o3-mini-high can do it for me from the English very well.
Annoying to do myself, but easy to describe, is the sweet spot. It is definitely not universally useful, but when it is useful it can save me 5 minutes of tedium here or there, which is quite helpful. I think for a lot of this, you just have to learn over time what works and what doesn't.
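The kind of check being described might look like this. The spec, the jq filter in the comment, and the function are all hypothetical examples of the workflow, with a plain-Python equivalent of what the generated filter would assert.

```python
import json

# English spec: "every item in the output has a positive 'price' field".
# An LLM can translate that spec into a jq filter along the lines of
#   jq -e 'all(.items[]; .price > 0)'
# which is tedious to write by hand. Below is the equivalent check in Python,
# run against the raw JSON a program printed.

def all_prices_positive(raw_output: str) -> bool:
    data = json.loads(raw_output)
    return all(item["price"] > 0 for item in data["items"])
```

The test is trivial to state in English and trivial to verify once written, which is exactly the "easy to describe, annoying to type" niche.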
Maybe one of my problems is that I tend to jump into writing simple code or tests without actually having the end point clearly in mind. Often that works out pretty well. When it doesn’t, I’ll take a step back and think things through. But when I’m in the midst of it, it feels like being interrupted almost, to go figure out how to say what I want in English.
Will definitely keep toying with it to see where I can find some utility.
The areas that I've found LLMs work well for are usually small simple tasks I have to do where I would end up Googling something or looking at docs anyway. LLMs have just replaced many of these types of tasks for me. But I continue to learn new areas where they work well, or exceptions where they fail. And new models make it a moving target too.
Good luck with it!
Maybe that's why I don't like them. I'm always in a flow state, or reading docs and/or a lot of code to understand something. By the time I'm typing, I already know what exactly to write, and thanks to my vim-fu (and emacs-fu), getting it done is a breeze. Then comes the edit-compile-run, or edit-test cycle, and by then it's mostly tweaks.
I get why someone would generate boilerplate, but most of the time, I don't want the complete version from the get go. Because later changes are more costly, especially if I'm not fully sure of the design. So I want something minimal that's working, then go work on things that are dependent, then get back when I'm sure of what the interface should be. I like working iteratively which then means small edits (unless refactoring). Not battling with a big dump of code for a whole day to get it working.
If I've got a clear idea of what I want to write, there's no way I'm touching an LLM. I'm just going to smash out the code for exactly what I need. However, often I don't get that luxury as I'll need to learn different file system APIs, different sets of commands, new jargon, different standard libraries for the new languages, new technologies, etc...
Mostly I use it for simple template stuff, for which it isn't bad. It's not the second coming, but it definitely speeds you up.
None of this is particularly unique to software engineering. So if someone can already do this and add the missing component with some future LLM why shouldn’t they think they can become a software engineer?
Did you catch the sarcasm there?
Are you a manager by any chance? The non-coding parts of my job largely require domain experience. How does an LLM provide you with that?
That's okay.
It's not my responsibility to convince or convert them.
I prefer to just let them be and not engage.
It's like showing someone from 1980 a modern smart phone and them saying, yeah but it can't read my mind.
This leads me to believe that the issue is not that LLM skeptics refuse to see, but that you are simply unaware of what is possible without them: that sort of fuzzy search was SOTA for information retrieval, and commonplace, about 15 years ago (it was one of the early accomplishments of the "big data"/"data science" era), long before LLMs and deep nets were the new hotness.
This is the problem I have with the current crop of AI tools: what works isn't new and what's new isn't good.
> It's so self-evident, that I don't know how to take the request for examples seriously
Do you see why people are hesitant to believe people with outrageous claims and no examples?
They "hallucinate", they "know", they "think".
They're just the result of matrix calculus, and your own pattern-recognition capacities fool you into thinking there is intelligence there. There isn't. They don't hallucinate; their output is just wrong.
The worst example of anthropomorphism I've seen was the blog post from a researcher working on adversarial prompting. The tool spewing "help me" words made them think they were hurting a living organism: https://www.lesswrong.com/posts/MnYnCFgT3hF6LJPwn/why-white-...
Speaking with AI proponents feels like speaking with cryptocurrency proponents: the more you learn about how things work, the more you understand that they don't, and that the proponents just live in la-la land.
Maybe hype is overly beneficial to them, but if you promise me 1500 and I get 1100, then I will be underwhelmed.
And the marketing hype around LLMs especially is fairly extreme.
When businessmen sell me "artificial intelligence", I come prepared for lots of fuckery.
Stitching together well-known web technologies and protocols in well-known patterns, probably a good success rate.
Solving issues in legacy codebases using proprietary technologies and protocols, and non-standard patterns. Probably not such a good success rate.
As the parent says, while far from perfect, they're an incredible aid in so many areas. When used well, they help you produce not just faster but also better results. The only trick really is that you need to treat it as a (very knowledgeable but overconfident) collaborator rather than an oracle.
I say "intern" in the sense that its error-prone and kind of inexperienced, but also generally useful. I can ask it to automatically create a lot of the bootstrapping or tedious code that I always dread writing so that I can focus on the fun stuff, which is often the stuff that's pawned off onto interns and junior-level engineers. I think for the most part, when you treat it like that, it lives up to and sometimes even surpasses expectations.
I mean, I can't speak for everyone, but whenever I begin a new project, a large percentage of the first ~3 hours is simply copying and pasting and editing from documentation, either an API I have to call or some bootstrapping code from a framework or just some cruft to make built-in libraries work how you want. I hate doing all that, it actively makes me not want to start a new project. Being able to get ChatGPT to give me stuff that I need to actually get started on my project has made coding a lot more fun for me again. At this point, you can take my LLM from my cold dead hands.
I do think it will keep getting better, but I'm also at a point where even if it never improves I will still keep using it.
"As of today, March 27, 2025, the latest stable version of Laravel is Laravel 11, which was released in March 2024. Laravel 12 has not been released yet (it's expected roughly in Q1 2026 based on the usual schedule).
Could you please double-check the exact Laravel version you are using?" So it did not believe me and I had to convince it first that I was using a real version. This went on for a while, with Gemini not only hallucinating stuff, but also being very persistent and difficult to convince of anything else.
Well, in the end it was still certain that this method should exist, even though it could not provide any evidence for it and my searching through the internet and the Git history of the related packages did also not provide any results.
So I gave up and tried it with Claude 3.7 which could also not provide any working solution.
In the end, I found an entirely different solution for my problem, but that wasn't based on anything the AIs told me, but just my own thinking and talking to other software developers.
I would not go so far as to call these AIs useless. In software development they can help with simple stuff and boilerplate code, and I found them a lot more helpful in creative work. This is basically the opposite of what I would have expected 5 years ago ^^
But for any important tasks, these LLMs are still far too unreliable. They often feel like they have a lot of knowledge, but no wisdom. They don't know how to apply their knowledge well, and they often basically brute-force it with a mix of strange creativity and statistical patterns apparently drawn from a vast amount of internet content, a big part of which is troll posts and satire.
But instead, my productivity is hampered by issues with org communication, structure, siloed knowledge, lack of documentation, tech debt, and stale repos.
I have for years tried to provide feedback and get leadership to do something about these issues, but they do nothing and instead ask "How have you used AI to improve your productivity?"
Thing is, the LLMs that I use are all freeware, and they run on my gaming PC. Two to six tokens per second are alright honestly. I have enough other things to take care of in the meantime. Other tools to work with.
I don't see the billion dollar business. And even if that existed, the means of production would be firmly in the hands of the people, as long as they play video games. So, have we all tripled our salaries?
If we haven't, is that because knowledge work is a limited space that we are competing in, and LLMs are an equalizer because we all have them? Because I was taught that knowledge work was infinite. And the new tool should allow us to create more, and better, and more thoroughly. And that should get us all paid better.
Right?
The problems start when people start hyperventilating because they think since LLMs can generate tests for a function for you, that they'll be replacing engineers soon. They're only suitable for generating output that you can easily verify to be correct.
LLM training is designed to distill a massive corpus of facts, in the form of token sequences, into a much, much smaller bundle of information that encodes (somehow!) the deep structure of those facts minus their particulars.
They’re not search engines, they’re abstract pattern matchers.
1. People creating or dealing with imprecise information. People doing SEO spam, people dealing with SEO spam, almost all creative arts people, people writing corporatese- or legalese- documents or mails, etc. For these tasks LLMs are god-like.
2. People dealing with precise information and/or facts. For these people, LLMs are no better than a parrot.
3. Subset of 2: programmers. Because of the huge amount of stolen training data, plus almost-perfect proofing software in the form of compilers, static analyzers, etc., for this case LLMs are more or less usable; the more data was used, the better (JS is the best, as I understand).
This is why people's reaction is so polarizing. Their results differ.
The crisis in programming hasn’t been writing code. It has been developing languages and tools so that we can write less of it that is easy to verify as correct. These tools generate more code. More than you can read and more than you will want to before you get bored and decide to trust the output. It is trained on the most average code available that could be sucked up and ripped off the Internet. It will regurgitate the most subtle errors that humans are not good at finding. It only saves you time if you don’t bother reading and understanding what it outputs.
I don’t want to think about the potential. It may never materialize. And much of what was promised even a few years ago hasn’t come to fruition. It’s always a few years away. Always another funding round.
Instead we have massive amounts of new demand for liquid methane, infrastructure struggling to keep up, billions of gallons of fresh water wasted, all so that rich kids can vibe code their way to easy money and realize three months later they’ve been hacked and they don’t know what to do. The context window has been lost and they ran out of API credits. Welcome to the future.
- AI is great for disinformation
- AI is great at generating porn of women without their consent.
- Open source projects massively struggle as AI scrapers DDOS them.
- AI uses massive amounts of energy and water; most importantly, the expectation is that energy usage will rise drastically in a world where we need to lower it. If Sam Altman gets his way, we're toast.
- AI makes us intellectually lazy and worse thinkers. We were already learning less and less in school because of our impoverished attention span. This is even worse now with AI.
- AI makes us even more dependent on cloud vendors and third-parties, further creating a fragile supply chain.
Like AI ostensibly empowers us as individuals, but in reality I think it's a disservice, and the ones it truly empowers are the tech giants, as citizens become dumber and even more dependent on them and tech giants amass more and more power.
I have yet to see an AI-generated image that was "really cool".
AI images and videos strike me as the coffee pods of the digital world -- we're just absolutely littering the internet with garbage. And as a bonus, it's also environmentally devastating to the real world!
I live nearby a landfill, and go there often to get rid of yard waste, construction materials, etc. The sheer volume of perfectly serviceable stuff people are throwing out in my relatively small city (<200k) is infuriating and depressing. I think if more people visited their local landfills, they might get a better sense for just how much stuff humans consume and dispose. I hope people are noticing just how much more full of trash the internet has become in the last few years. It seems like it, but then I read this thread full of people that are still hyped about it all and I wonder.
This isn't even to mention the generated text... it's all just so inane and I just don't get it. I've tried a few times to ask for relatively simple code and the results have been laughable.
I don't have a proposal for what a better name would have been, naming things is hard, but AI carries quite a bit of baggage and expectations with it.
1. Some people are just uncomfortable with it because it “could” replace their jobs.
2. Some people are warning that the ecosystem bubble is significantly out of proportion. They are right, and having the whole stock market, companies, and US economy attached to LLMs is just downright irresponsible.
What jobs are seriously at risk of being totally replaced by LLMs? Even in things like copywriting and natural-language translation, which are somewhat of a natural "best case" for the underlying tech, their output is quite subpar compared to the average human's.
Hossenfelder is a scientist. There's a certain level of rigour that she needs to do her job, which is where current LLMs often fall down. Arguably it's not accelerating her work to have to check every single thing the LLM says.
I think some people just aren't using them correctly or don't understand their limitations.
They are especially helpful for getting me over thought paralysis when starting a new project.
But while they are fun to play with, anything that requires a real answer, but can’t be directly and immediately checked, like customer support, scientific research, teaching, legal advice, identifying humans, correctly summarizing text - LLMs are very bad at these things, make up answers, mix contexts inappropriately, and more.
I’m not sure how you can have played with LLMs so much and missed this. I hope you don’t trust what they say about recipes or how to handle legal problems or how to clean things or how to treat disease or any fact-checking whatsoever.
This is like a GPT3.5 level criticism. o1-pro is probably better at pure fact retrieval than most PhDs in any given field. I challenge you to try it.
In fact take the GPQA test yourself and see how you do then give the same questions to o1. https://arxiv.org/pdf/2311.12022
I wonder if people who are amazed by LLMs lack this information-gathering skill.
After all, I've met plenty of architect- and senior-level people who just… had zero Google and research skills.
To someone who doesn't actually check or have the knowledge or experience to check the output, it sounds like they've been given a real, useful answer.
When you tell the LLM that the API it tried to call doesn't exist it says "Oh, you're right, sorry about that! Here's a corrected version that should work!" and of course that one probably doesn't work either.
One takeaway from this is that labelling LLMs as "intelligent" is a total misnomer. They're more like super parrots.
For software development, there's also the problem of how up to date they are. If they could learn on the fly (or be constantly updated) that would help.
They are amazing in some ways, but they've been over-hyped tremendously.
When I saw GPT-3 in action in 2023, I couldn’t believe my eyes. I thought I was being tricked somehow. I’d seen ads for “AI-powered” services and it was always the same unimpressive stuff. Then I saw GPT-3 and within minutes I knew it was completely different. It was the first thing I’d ever seen that felt like AI.
That was only a few years ago. Now I can run something on my 8GB MacBook Air that blows GPT-3 out of the water. It’s just baffling to me when people say LLM’s are useless or unimpressive. I use them constantly and I can still hardly believe they exist!!
Exactly how I feel. I probably write 50 prompts/day, and a few times a week I still think, "I can't believe this is real tech."
It is an impressive technology but is it US$244.22bn [1] impressive (I know this stat is supposed to account for computer vision as well but seeing as to how LLMs are now a big chunk of that I think it's a safe assumption)? It's projected to grow to over US$1tr by 2031. That's higher than the market size of commercial aviation at its peak [2]. I'm sorry if I agree that a cool chatbot is not approximately as important as flying.
[1] https://www.statista.com/outlook/tmo/artificial-intelligence...
[2] https://www.statista.com/markets/419/topic/490/aviation/#sta...
It's bad technology because it wastes a lot of labor, electricity, and bandwidth in a struggle to achieve what most human beings can with minimal effort. It's also a blatant thief of copyrighted materials.
If you want to like it, guess what, you'll find a way to like it. If you try to view it from another persons use case you might see why they don't like it.
You no longer have the console as the primary interface, but a GUI, which 99.9+% of computer users control via a mouse.
You no longer have the screen as the primary interface, but an AUI, which 99.9+% of computer users control via a headset, earbuds, or a microphone and speaker pair.
You mostly speak and listen to other humans, and if you're not reading something they've written, you could have it read to you in order to detach from the screen or paper.
You'll talk with your computer while in the car, while walking, or while sitting in the office.
An LLM makes the computer understand you, and it allows you to understand the computer.
Even if you use smart glasses, you'll mostly talk to the computer generating the displayed results, and it will probably also talk to you, adding information to the displayed results. It's LLMs that enable this.
Just don't focus too much on whether the LLM knows how high Mount Kilimanjaro is; its knowledge of that fact is simply a hint that it can properly handle language.
Still, it's remarkable how useful they are at analyzing things.
LLMs have a bright future ahead, or whatever technology succeeds them.
LLMs are very useful for some (mostly linguistic) tasks, but the areas where they're actually reliable enough to provide more value than just doing it yourself are narrow. But companies really need this tech to be profitable and so they try to make people use it for as many things as possible and shove it in everyone's face[0] in hopes that someone finds a use-case where the benefits are indeed immediately obvious and revolutionary.
[0] For example my dad's new Android phone by default opens a Gemini AI assistant when you hold the power button and it took me minutes of googling to figure out how to make it turn off the damn thing. Whoever at Google thought that this would make people like AI more is in the wrong profession.
It used to be annoying enough just having to clean the trackball, but at least you knew when it wasn't working.
Personally, I look back at how many years ago it was that we were seeing claims that truck drivers were all going to lose their jobs and society would tear itself apart over it within the next few years… and yet here we still are.
That said, I do experience frustrations:
- Getting enraged when it messes up perfectly good code it wrote just 10 minutes ago
- Constantly reminding it we're NOT using jest to write tests
- Discovering it's created duplicate utilities in different folders
There's definitely a lot of hand-holding required, and I've encountered limitations I initially overlooked in my optimism.
But here's what makes it worthwhile: LLMs have significantly eased my imposter syndrome when it comes to coding. I feel much more confident tackling tasks that would have filled me with dread a year ago.
I honestly don't understand how everyone isn't completely blown away by how cool this technology is. I haven't felt this level of excitement about a new technology since I discovered I could build my own Flash movies.
So far the industrial applications haven't been that promising, code writing and documentation is probably the most promising but even there it's not like it can replace a human or even substantially increase their productivity.
But for larger tasks—say, around 2,000 lines of code—it often fails in a lot of small ways. It tends to generate a lot of dead code after multiple iterations, and might repeatedly fail on issues you thought were easy to fix. Mentally, it can get exhausting, and you might end up rewriting most of it yourself. I think people are just tired of how much we expect LLMs to deliver, only for them to fail us in unexpected ways. The LLM is good, but we really need to push to understand its limitations.
If you don’t constantly look for information, they might be less useful.
I did have a eureka moment the other day with deepseek and a very obscure bug I was trying to tackle. One api query was having a very weird, unrelated side effect. I loaded up cursor with a very extensive prompt and it actually figured out the call path I hadn't been able to track down.
Today, I had a very simple task that eventually only took me half an hour to manually track. But I started with cursor using very similar context as the first example. It just kept repeatedly dreaming up non-existent files in the PR and making suggestions to fix code that doesn't exist.
So what's the worth to my company of my very expensive time? Should I spend 10,20,50 percent of my time trying to get answers from a chatbot, or should I just use my 20 years of experience to get the job done?
The quote about books being a mirror reflecting genius or idiocy seems to apply.
I see LLMs as a kind of hyper-keyboard: speeding up typing AND structuring content, completing thoughts, and inspiring ideas.
Unlike a regular keyboard, an LLM transforms input contextually. One no longer merely types but orchestrates concepts and modulates language, almost like music.
Yet mastery is key. Just as a pianist turns keystrokes into a symphony through skill, a true virtuoso wields LLMs not as a crutch but as an amplifier of thought.
In the 70's I read in some science book for kids about how one day we will likely be able to use light emitting diodes for illumination instead of light bulbs, and this "cold light" will save us lots of energy. Waited out that one too; it turned out so.
More like we note the frequency with which these tools produce shallow bordering on useless responses, note the frequency with which they produce outright bullshit, and conclude their output should not be taken seriously. This smells like the fervor around ELIZA, but with several multinational marketing campaigns behind it pushing.
1. I'm aware that LLMs can generate images and video as well. The point applies.
By the way, you don't need to be a 50+ year old nerd. Nerds are a special culture-pen where smart straight-A students from schools are placed so they can work, increase stakeholder revenues, and not even accidentally be able to do anything truly worthwhile that could redistribute wealth in society.
https://www.ycombinator.com/companies/domu-technology-inc/jo...
If we judge a technology by how it transforms our lives, LLMs and GenAI have mostly been a net negative (at least that is how it feels).
Anyone who remembers further back than a decade or so remembers when the height of AI research was chess programs that could beat grandmasters. Yes, LLMs aren't C3PO or the like, but they are certainly more like that than anything we could imagine just a few years ago.
I remember seeing an AI lab in the late 1980's and thinking "that's never going to work" but here we are, 40 years later. It's finally working.
I feel like if teleportation was invented tomorrow, people would complain that it can't transport large objects so it's useless.
Same vibe.
Basically people just doubling down on everything you just described. I can't quite put a finger on it, but it has a tinge of insecurity or something like that; hope that's not the case and I'm just misinterpreting.
Choose a very narrow domain that you know well, and you quickly realize they are just repeating the training data.
> And people are like, "Wah, it can't write code like a Senior engineer with 20 years of experience!"
But LLMs should be good enough to resolve this confusion, ask them!
... But I do not believe we're on the cusp of a Lawnmower-Man future where someone's Metaverse eats all the retail-conference-halls and movie-theaters and retail-stores across the entire globe in an unbridled orgy of mind-shattering investor returns.
Similarly, LLMs are neat and have some sane uses, but the fervor about how we're about to invent the Omnimind and usher in the singularity and take over the (economic) world? Nah.
What next, "This Internet thing was just a fad" or "The industrial age was a fad"?
> Ah yes, the "computer graphics, which has generated billions upon billions of revenue and change are just a fad" argument.
in reference to this thing that OP said:
> It's like computer graphics and VR: Amazing advances over the years, very impressive, fun, cool, and by no means a temporary fad...
Either you, yourself are an LLM, or you need to slow the fuck down and read.
As far as breaking our reality and society? Absolutely :(
On the other hand, I saw github recently added Copilot as a code reviewer. For fun I let it review my latest pull request. I hated its suggestions but could imagine a not too distant future where I'm required by upper management to satisfy the LLM before I'm allowed to commit. Similarly, I've asked ChatGPT questions and it's been programmed to only give answers that Silicon Valley workers have declared "correct".
The thing I always find frustrating about the naysayers is that they seem to think how it works today is the end of it. Like, I recently listened to an episode of EconTalk interviewing someone on AI and education. She lives in the UK and used Tesla FSD as an example of how bad AI is. Yet I live in California and see Waymo mostly working today and lots of people using it. I believe she wouldn't have used the Tesla FSD example, and would possibly have changed her world view at least a little, if she'd updated on seeing self-driving work.
Except this isn't true. The code quality varies dramatically depending on what you're doing, the length of the chat/context, etc. It's an incredible productivity booster, but even earlier today, I wasted time debugging hallucinated code because the LLM mixed up methods in a library.
The problem isn't so much that it's not an amazing technology, it's how it's being sold. The people who stand to benefit are speaking as though they've invented a god and are scaring the crap out of people making them think everyone will be techno-serfs in a few years. That's incredibly careless, especially when as a technical person, you understand how the underlying system works and know, definitively, that these things aren't "intelligent" the way they're being sold.
Like the startups of the 2010s, everyone is rushing, lying, and huffing hopium deluding themselves that we're minutes away from the singularity.
It was, more or less, the same narrative arc as Bitcoin, and was (is) headed for a crash.
That said, I've spent a few weeks with augment, and it is revelatory, certainly. All the marketing - aimed at a suite I have no interest in - managed to convince me it was something it wasn't. It isn't a replacement, any more than a power drill is a replacement for a carpenter.
What it is, is very helpful. "The world's most fully functioning scaffolding script", an upgrade from copilot's "the world's most fully functioning tab-completer". I appreciate it usefulness as a force multiplier, but I am already finding corners and places where I'd just prefer to do it myself. And this is before we get into the craft of it all - I am not excited by the pitch "worse code, faster", but the utility is undeniable in this capitalistic hell planet, and I'm not a huge fan of writing SQL queries anyway, so here we are!
Thank goodness for that too. I want it to help me with my job, not replace me.
Both are Markov chains; that you erroneously thought of a Markov chain as a way to make a chatbot, rather than as a general mathematical process, is on you, not them.
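For what it's worth, the "general mathematical process" is easy to show in miniature: a toy bigram Markov chain over words. The training sentence is invented, and this is an illustration of the concept, not a claim about how LLMs are actually implemented:

```python
import random

# Toy bigram Markov chain: count which word follows which in a
# (made-up) training text, then sample a random walk through those counts.
text = "the cat sat on the mat and the cat ate the fish".split()

transitions = {}
for a, b in zip(text, text[1:]):
    transitions.setdefault(a, []).append(b)

random.seed(0)
word, generated = "the", ["the"]
for _ in range(6):
    # Each next word is drawn from what followed the current word in training.
    word = random.choice(transitions.get(word, text))
    generated.append(word)
print(" ".join(generated))  # plausible-looking word sequence, no understanding
```

The output is locally plausible because each step matches the training statistics, which is exactly the sense in which the comment means "general mathematical process".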
Chatbots like in the sci-fi of your nostalgia? I never dreamed about that shit, sorry.
it isn't ANY form of intelligence.
Maybe Freud could explain.
That's how I see LLMs and the hype surrounding them.
To quote Joel Spolsky, "When you’re working on a really, really good team with great programmers, everybody else’s code, frankly, is bug-infested garbage, and nobody else knows how to ship on time.", and that's the state we end up if we believe in the hype and use LLMs willy-nilly.
That's why people are annoyed: not because LLMs cannot code like a senior engineer, but because lots of content marketing and company valuations depend on making people believe they can.
And people keep forgetting how new this stuff is
This is like trashing video games in 1980 because Pong has awful graphics
No, it provides responses. It does not talk.
It's incredibly frustrating when people think they're a miracle tool and blindly copy/paste output without doing any kind of verification. This is especially frustrating when someone who's supposed to be a professional in the field is doing it (copy-pasting non-working AI-generated code and putting it up for review).
That said, on one hand, they multiply productivity and useful information. On the other hand, they kill productivity and spread misinformation. Still, I see them as useful, but not a miracle.
I can ask Claude the most inane programming question and got an answer. If I were to do that on StackOverflow, I'd get downvoted, rude comments, and my question closed for being off-topic. I don't have to be super knowledgeable about the thing I'm asking about with Claude (or any LLM for that matter).
Even if you ignore the rudeness and elitism of power-users of certain platforms, there's no more waiting for someone to respond to your esoteric questions. Even if the LLM spews bullshit, you can ask it clarifying questions or rephrase until you see something that makes sense.
I love LLMs, I don't care what people say. Even when I'm just spitballing ideas[1], the output is great.
---
[1]: https://blog.webb.page/2025-03-27-spitball-with-claude.txt
Truly amazing technology which is very good at generating and correcting texts is marketed as senior developer, talented artist, and black box that has solution to all your problems. This impression shatters on the first blatant mistake, e.g. counting elephant legs: https://news.ycombinator.com/item?id=38766512
https://youtu.be/aGnMbKwP36U?si=WbXzphhhP8Hak1OQ
It’s a human nature thing - we’re supposed to be collecting nuts in the forest.
But I will admit the dora muckbang feet shit is fucking insane. And that just flat out scares the pants off me.
Sorry but this is a total skill issue lol. 80% code failure rate is just total nonsense. I don't think 1% of the code I've gotten from LLMs has failed to execute correctly.
Almost every time I've tried using LLMs, I've fallen into the pattern of calling out, correcting, and arguing with them, which is of course silly in itself, because they don't learn; they don't really "get it" when they are wrong. There's none of the benefit of talking to a human.
It's also a slow-burn issue: you have to use it for a while for what is obvious to users to become obvious to people who are tech-first.
The primary issue is the hype and forecasted capabilities vs actual use cases. People want something they can trust as much as an authority, not as much as a consultant.
If I were to put it in a single sentence? These are primarily narrative tools, being sold as factual /scientific tools.
When this is pointed out, the conversation often shifts to “well people aren’t that great either”. This takes us back to how these tools are positioned and sold. They are being touted as replacements to people in the future. When this claim is pressed, we get to the start of this conversation.
Frankly, people on HN aren't pessimistic enough about what is coming down the pipe. I've started looking at how to work in zero-truth scenarios, not even zero-trust. This is a view held by everyone I have spoken to in fraud, misinformation, and online safety.
There’s a recent paper which showed that GAI tools improved the profitability of Phishing attempts by something like 50x in some categories, and made previously loss making (in $/hour terms) targets, profitable. Schneier was one of the authors.
A few days ago I found out someone I know who works in finance, had been deepfaked and their voice/image used to hawk stock tips. People were coming to their office to sue them.
I love tech, but this is the dystopia part of cyberpunk being built. These are narrative tools, good enough to make people think they are experts.
If you ask it random things the output looks amazing, yes. At least at first glance. That's what they do. It's indeed magical, a true marvel that should make you go: Woooow, this is amazing tech: Coming across as convincing, even if based on hallucinations, is in itself a neat trick!
But is it actually useful? The things they come up with are untrustworthy and on the whole far less good than previously available systems. In many ways, insidiously worse: It's much harder to identify bad information than it was before.
It's almost as if we designed a system to pass Turing tests with flying colours but forgot that usefulness is what we actually wanted, not authoritative, human-sounding bullshit.
I don't think the LLM naysayers are 'unimpressed', or that they demand perfection. I think they are trying to make statements aimed at balancing things:
Both the LLMs themselves, and the humans parroting the hype, are severely overstating the quality of what such systems produce. Hence, and this is a natural phenomenon you can observe in all walks of life, the more skeptical folks tend to swing the pendulum the other way, and thus it may come across to you as them being overly skeptical instead.
I'm trans, and I don't disagree that this technology has aspects that are problematic. But for me at least, LLMs have been a massive equalizer in the context of a highly contentious divorce where the reality is that my lawyer will not move a finger to defend me. And he's lawyer #5 - the others were some combination of worse, less empathetic, and more expensive. I have to follow up a query several times to get a minimally helpful answer - it feels like constant friction.
ChatGPT was a total game-changer for me. I told it my ex was using our children to create pressure, feeding it snippets of chat transcripts. ChatGPT suggested this might be indicative of coercive control abuse. It sounded very relevant (in a rare, candid moment, my ex even admitted that she feels a need to control everyone around her), so I googled the term: essentially all the components were there except physical violence (with two notable exceptions).
Once I figured that out, I asked it to tell me about laws related to controlling relationships, and it suggested laws directly addressing it (in the UK and Australia) and the closest equivalents in Germany (Nötigung, Nachstellung, violations of dignity, etc., translating them to English, my best language). Once you name specific laws broken and provide a rationale for why there's a Tatbestand (i.e. the criterion for a violation is fulfilled), your lawyer has no option but to take you more seriously. Otherwise he could face a malpractice suit.
Sadly, even after naming specific law violations and pointing to email and chat evidence, my lawyer persists in dragging his feet - so much so that the last legal letter he sent wasn't drafted by him - it was ChatGPT. I told my lawyer: read, correct, and send to X. All he did was to delete a paragraph and alter one or two words. And the letter worked.
Without ChatGPT, I would be even more helpless and screwed than I am. It's far from clear I will get justice in a German court, but at least ChatGPT gives me hope, a legal strategy. Lastly - and this is a godsend for a victim of coercive control - it doesn't degrade you. Lawyers do. It completely changed the dynamics of my divorce (4 years - still no end in sight, lost my custody rights, then visitation rights, was subjected to confrontational and gaslighting tactics by around a dozen social workers - my ex is a social worker -, and then I literally lost my hair: telogen effluvium, tinea capitis, alopecia areata... if it's stress-related, I've had it), it gave me confidence when confronting my father and brother about their family violence.
It's been the ONLY reliable help, frankly, so much so I'm crying as I write this. For minorities that face discrimination, ChatGPT is literally a lifeline - and that's more true the more vulnerable you are.
WhY aRe PeOpLe BuLlIsH
LLMs produce midwit answers. If you are an expert in your domain, the results are kind of what you would expect for someone who isn’t an expert. That is occasionally useful but if I wanted a mediocre solution in software I’d use the average library. No LLM I have ever used has delivered an expert answer in software. And that is where all the value is.
I worked in AI for a long time, I like the idea. But LLMs are seemingly incapable of replacing anything of value currently.
The elephant in the room is that there is no training data for the valuable skills. If you have to rely on training data to be useful, LLMs will be of limited use.
If this were true, no one would hire junior employees and assistants. There's a huge amount of work that requires more time than expertise.
When an AI can say “Here’s how you make better, smaller, more powerful batteries, follow these plans”, then we will have a reason to worship AI.
When AI can bring us wonders like room-temperature superconductors, fast interstellar travel, anti-gravity tech, solutions to world hunger and energy consumption, then it will have fulfilled the promise of what AI could do for humanity.
Until then, LLMs are just fancy search and natural language processors. Puppets with strings. It’s about as impressive as Google was when it first came out.
I think that there are two kinds of people who use AI: people who are looking for the ways in which AIs fail (of which there are still many) and people who are looking for the ways in which AIs succeed (of which there are also many).
A lot of what I do is relatively simple one off scripting. Code that doesn't need to deal with edge cases, won't be widely deployed, and whose outputs are very quickly and easily verifiable.
LLMs are almost perfect for this. It's generally faster than me looking up syntax/documentation, when it's wrong it's easy to tell and correct.
Look for the ways that AI works, and it can be a powerful tool. Try and figure out where it still fails, and you will see nothing but hype and hot air. Not every use case is like this, but there are many.
-edit- Also, when she says "none of my students has ever invented references that just don't exist"...all I can say is "press X to doubt"
The problem is that I feel I am constantly being bombarded by people bullish on AI saying "look how great this is" but when I try to do the exact same things they are doing, it doesn't work very well for me
Of course I am skeptical of positive claims as a result.
The only use case that would beat yours is the type of office worker that cannot write professional sounding emails but has to send them out regularly manually.
I think that HN has a lot of people who are working on large software projects that are incredibly complex and have a huge numbers of interdependencies etc., and LLMs aren't quite to the point that they can very usefully contribute to that except around the edges.
But I don't think that generalizing from that failure is very useful either. Most things humans do aren't that hard. There is a reason that SWE is one of the best paid jobs in the country.
Real programming is on a totally different scale than what you're describing.
I think that's true for most jobs. Superficially an AI looks like it can do good.
But LLMs:
1. Hallucinate all the time. If they were human we'd call them compulsive liars
2. They are consistently inconsistent, so are useless for automation
3. Are only good at anything they can copy from their data set. They can't create, only regurgitate other people's work
4. AI influencing hasn't happened yet, but will very soon start making LLMs useless, much like SEO has ruined search. You can bet there are a load of people already seeding the internet with a load of advertising and misinformation aimed solely at AIs and AI reinforcement
For what it's worth, I mostly work on projects in the 100-200 files range, at 20-40k LoC. When using proper tooling with appropriate models, it boosts my productivity by at least 2x (being conservative). I've experimented with this by going a few days without using them, then using them again.
Definitely far from the massive codebases many on here work on, small beans by HN standards. But also decidedly not just writing one-off scripts.
How "real" are we talking?
When I think of "real programming" I think of flight control software for commercial airplanes and, I can assure you, 1 month != 5,000 LoC in that space.
It's actually extremely irritating that I'm only half talking to the person when I email with these people.
Annoying response of course. But I’d never used an LLM to debug before, so I figured I’d give it a try.
First: it regurgitated a bunch of documentation and basic debugging tips, which might have actually been helpful if I had just encountered this problem and had put no thought into debugging it yet. In reality, I had already spent hours on the problem. So not helpful
Second: I provided some further info on environment variables I thought might be the problem. It latched on to that. “Yes that’s your problem! These environment variables are (causing the problem) because (reasons that don’t make sense). Delete them and that should fix things.” I deleted them. It changed nothing.
Third: It hallucinated a magic numpy function that would solve my problem. I informed it this function did not exist, and it wrote me a flowery apology.
Clearly AI coding works great for some people, but this was purely an infuriating distraction. Not only did it not solve my problem, it wasted my time and energy, and threw tons of useless and irrelevant information at me. Bad experience.
If I give it all my information and add "I think the problem might be X, but I'm not sure", the LLM always agrees that the problem is X and will reinterpret everything else I've said to 'prove' me right.
Then the conversation is forever poisoned and I have to restart an entirely new chat from scratch.
98% of the utility I've found in LLMs is getting it to generate something nearly correct, but which contains just enough information for me to go and Google the actual answer. Not a single one of the LLMs I've tried have been any practical use editing or debugging code. All I've ever managed is to get it to point me towards a real solution, none of them have been able to actually independently solve any kind of problem without spending the same amount of time and effort to do it myself.
I'm seeing this sentiment a lot in these comments, and frankly it shows that very few here have actually gone and tried the variety of models available. Which is totally fine, I'm sure they have better stuff to do, you don't have to keep up with this week's hottest release.
To be concrete - the symptom you're talking about is very typical of Claude (or earlier GPT models). o3-mini is much less likely to do this.
Secondly, prompting absolutely goes a huge way to avoiding that issue. Like you're saying - if you're not sure, don't give hints, keep it open-minded. Or validate the hint before starting, in a separate conversation.
And on "prompting", I think this is a point of friction between LLM boosters and haters. To the uninitiated, most AI hype sounds like "it's amazing magic!! just ask it to do whatever you want and it works!!" When they try it and it's less than magic, hearing "you're prompting it wrong" seems more like a circular justification of a cult follower than advice.
I understand that it's not - that, genuinely, it takes some experience to learn how to "prompt good" and use LLMs effectively. I buy that. But some more specific advice would be helpful. Cause as is, it sounds more like "LLMs are magic!! didn't work for you? oh, you must be holding it wrong, cause I know they infallibly work magic".
I don't buy this at all.
At best "learning to prompt" is just hitting the slot machine over and over until you get something close to what you want, which is not a skill. This is what I see when people "have a conversation with the LLM"
At worst you are a victim of sunk cost fallacy, believing that because you spent time on a thing that you have developed a skill for this thing that really has no skill involved. As a result you are deluding yourself into thinking that the output is better.. not because it actually is, but because you spent time on it so it must be
I spent like a week trying to figure out why a livecd image I was working on wasn't initializing devices correctly. Read the docs, read source code, tried strace, looked at the logs, found forums of people with the same problem but no solution, you know the drill. In desperation I asked ChatGPT. ChatGPT said "Use udevadm trigger". I did. Things started working.
For some problems it's just very hard to express them in a googleable form, especially if you're doing something weird almost nobody else does.
if it's "dumb and annoying" i ask the AI, else i do it myself.
since that AI has been saving me a lot of time on dumb and annoying things.
also a few models are pretty good for basic physics/modeling stuff (getting basic formulas, fetching constants, doing some calculations). these are also pretty useful. i recently used it for ventilation/co2 related stuff in my room and the calculations matched observed values pretty well, then it pumped out a broken desmos-syntax formula, and i fixed that by hand and we were good to go!
---
(dumb and annoying thing -> time-consuming to generate with no "deep thought" involved, easy to check)
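To make the kind of sanity check described above concrete: room CO2 at steady state follows a simple mass-balance formula. This is my own toy illustration with made-up numbers, not the commenter's actual calculation.

```python
def steady_state_co2_ppm(outdoor_ppm, generation_m3_per_h, ventilation_m3_per_h):
    """Steady-state CO2 concentration: C_ss = C_out + G / Q.

    G is the CO2 generation rate and Q the ventilation rate, both in m^3/h;
    the G/Q ratio is converted from a volume fraction to ppm.
    """
    return outdoor_ppm + (generation_m3_per_h / ventilation_m3_per_h) * 1e6

# One person at rest exhales roughly 0.02 m^3 of CO2 per hour.
# With 420 ppm outdoors and 50 m^3/h of ventilation:
print(round(steady_state_co2_ppm(420, 0.02, 50)))  # -> 820
```

Easy to check against a cheap CO2 sensor, which is exactly what makes this a good LLM use case: time-consuming to derive, trivial to verify.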
I had an issue where my Mac would report that my tethered iPhone's batteries were running low when the battery was in fact fine. I had tried googling an answer, and found many similar-but-not-quite-the-same questions and answers. None of the suggestions fixed the issue.
I then asked the 'MacOS Guru' model for chatGPT my question, and one of the suggestions worked. I feel like I learned something about chatGPT vs Google from this - the ability of an LLM to match my 'plain English question without a precise match for the technical terms' is obviously superior to a search engine. I think google etc try synonyms for words in the query, but to me it's clear this isn't enough.
When I google "linux device not initializing correctly", someone suggesting "udevadm trigger" is the 5th result
I may also have accidentally made it harder by using the wrong word somewhere. A good part of the difficulty of googling for a vague problem is figuring out how to even word it properly.
Also of course it's much easier now that I tracked down what the actual problem was and can express it better. I'm pretty sure I wasn't googling for "devices not initializing" at the time.
But this is where I think LLMs offer a genuine improvement -- being able to deal with vagueness better. Google works best if you know the right words, and sometimes you don't.
And it might not have been the first and only thing ChatGPT said. It got there fast but 5th result isn't too slow either.
At many points, the code would have an error; to deal with this, I just supply the error message, as-is to the LLM, and it proposes a fix. Sometimes the fix works, and sometimes I have to intervene to push the fix in the right direction. It's OK - the whole process took a couple hours, and probably would have been a whole day if I were doing it on my own, since I usually only need to remember anything about SQL syntax once every year or three.
A key part of the workflow, imo, was that we were working in the medium of the actual code. If the code is broken, we get an error, and can iterate. Asking for opinions doesn't really help...
It will happily spin forever responding in whatever tone is most directly relevant to your last message: provide an error and it will suggest you change something (it may even be correct every once in a while!), suggest a change and it'll tell you you're obviously right, suggest the opposite and you will be right again, ask if you've hit a dead end and yeah, here's why. You will not learn anything or get anywhere.
A conversation will only be useful if the response you got just needs tweaks. If you can't tell what it needs feel free to let it spin a few times, but expect to be disappointed. Use it for code you can fully test without much effort, actual test code often works well. Then a brief conversation will be useful.
(I've written a fair bit about this: https://simonwillison.net/tags/ai-assisted-programming/ and https://simonwillison.net/2025/Mar/11/using-llms-for-code/ and 80+ examples of tools I've built mostly with LLMs on https://tools.simonwillison.net/colophon )
I've used a whole bunch of techniques.
Most of the code in there is directly copied and pasted in from https://claude.ai or https://chatgpt.com - often using Claude Artifacts to try it out first.
Some changes are made in VS Code using GitHub Copilot
I've used Claude Code for a few of them https://docs.anthropic.com/en/docs/agents-and-tools/claude-c...
Some were my own https://llm.datasette.io tool - I can run a prompt through that and save the result straight to a file
The commit messages usually link to either a "share" transcript or my own Gist showing the prompts that I used to build the tool in question.
(I say "most" because GPT-4.5 is 1000x the price of GPT-4o-mini, which implies to me that it burns a whole lot more energy.)
Typing speed is not usually the constraint for programming, for a programmer that knows what they are doing
Creating the solution is the hard work, typing it out is just a small portion of it
(I get boosts from LLMs to a bunch of activities too, like researching and planning, but those are less obvious than the coding acceleration.)
This explains it then. You aren't a software developer
You get a productivity boost from LLMs when writing code because it's not something you actually do very much
That makes sense
I write code for probably between 50-80% of any given week, which is pretty typical for any software dev I've ever worked with at any company I've ever worked at
So we're not really the same. It's no wonder LLMs help you, you code so little that you're constantly rusty
I very much doubt you spend 80% of your working time actively typing code into a computer.
My other activities include:
- Researching code. This is a LOT of my time - reading my own code, reading other code, reading through documentation, searching for useful libraries to use, evaluating if those libraries are any good.
- Exploratory coding in things like Jupyter notebooks, Firefox developer tools etc. I guess you could call this "coding time", but I don't consider it part of that 10% I mentioned earlier.
- Talking to people about the code I'm about to write (or the code I've just written).
- Filing issues, or updating issues with comments.
- Writing documentation for my code.
- Straight up thinking about code. I do a lot of that while walking the dog.
- Staying up-to-date on what's new in my industry.
- Arguing with people about whether or not LLMs are useful on Hacker News.
"I agree, only 10% of what I do is typing code"
"that explains it, you aren't a software developer"
What the hell?
Sure, it often spits out incomplete, non-ideal, or plain wrong answers, but that's where having SWE experience comes into play to recognize it.
In the middle of this thought, you changed the context from "learning new things" to "not being faster than an LLM"
It's easy to guess why. When you use the LLM you may be productive quicker, but I don't think you can argue that you are really learning anything
But yes, you're right. I don't learn new things from scratch very often, because I'm not changing contexts that frequently.
I want to be someone who had 10 years of experience in my domain, not 1 year of experience repeated 10 times, which means I cannot be starting over with new frameworks, new languages and such over and over
Here's some code I threw together without even looking at yesterday: https://github.com/simonw/tools/blob/main/incomplete-json-pr... (notes here: https://simonwillison.net/2025/Mar/28/incomplete-json-pretty... )
Reading it now, here are the things it can teach me:
:root {
--primary-color: #3498db;
--secondary-color: #2980b9;
--background-color: #f9f9f9;
--card-background: #ffffff;
--text-color: #333333;
--border-color: #e0e0e0;
}
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
line-height: 1.6;
color: var(--text-color);
background-color: var(--background-color);
padding: 20px;
}
That's a very clean example of CSS variables, which I've not used before in my own projects. I'll probably use that pattern myself in the future.
textarea:focus {
outline: none;
border-color: var(--primary-color);
box-shadow: 0 0 0 2px rgba(52, 152, 219, 0.2);
}
Really nice focus box shadow effect there, another one for me to tuck away for later.
<button id="clearButton">
<svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round">
<rect x="3" y="3" width="18" height="18" rx="2" ry="2"></rect>
<line x1="9" y1="9" x2="15" y2="15"></line>
<line x1="15" y1="9" x2="9" y2="15"></line>
</svg>
Clear
</button>
It honestly wouldn't have crossed my mind that embedding a tiny SVG inline inside a button could work that well for simple icons.
// Copy to clipboard functionality
copyButton.addEventListener('click', function() {
const textToCopy = outputJson.textContent;
navigator.clipboard.writeText(textToCopy).then(function() {
// Success feedback
copyButton.classList.add('copy-success');
copyButton.textContent = ' Copied!';
setTimeout(function() {
copyButton.classList.remove('copy-success');
copyButton.innerHTML = '<svg xmlns="http://www.w3.org/2000/svg" width="14" height="14" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><rect x="9" y="9" width="13" height="13" rx="2" ry="2"></rect><path d="M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1"></path></svg> Copy to Clipboard';
}, 2000);
});
});
Very clean example of clipboard interaction using navigator.clipboard.writeText. And the final chunk of code on the page is a very pleasing implementation of a simple character-by-character non-validating JSON parser which indents as it goes: https://github.com/simonw/tools/blob/1b9ce52d23c1335777cfedf...
That's half a dozen little tricks I've learned from just one tiny LLM project which I only spent a few minutes on.
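The character-by-character indenting idea mentioned above can be sketched in a few lines. This is my own minimal illustration of the technique, not the linked code; it tracks only string state and nesting depth, and does no validation.

```python
def pretty(json_text, indent="  "):
    """Pretty-print JSON one character at a time, indenting as we go.

    Non-validating: it assumes the input is well-formed and never looks ahead,
    which is what makes the approach work on a stream.
    """
    out, depth, in_str, esc = [], 0, False, False
    for ch in json_text:
        if in_str:                       # inside a string: copy verbatim
            out.append(ch)
            if esc:
                esc = False
            elif ch == "\\":
                esc = True
            elif ch == '"':
                in_str = False
        elif ch == '"':
            in_str = True
            out.append(ch)
        elif ch in "{[":                 # open bracket: newline + deeper indent
            depth += 1
            out.append(ch + "\n" + indent * depth)
        elif ch in "}]":                 # close bracket: newline + shallower indent
            depth -= 1
            out.append("\n" + indent * depth + ch)
        elif ch == ",":
            out.append(",\n" + indent * depth)
        elif ch == ":":
            out.append(": ")
        elif ch in " \t\r\n":            # drop existing insignificant whitespace
            continue
        else:
            out.append(ch)
    return "".join(out)
```

For example, `pretty('{"a":1,"b":[2,3]}')` reformats the one-liner into an indented block without ever parsing it into a data structure.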
My point here is that if you actively want to learn things, LLMs are an extraordinary gift.
I was trying to prototype a system and created a one-pager describing the main features, objectives, and restrictions. This took me about 45 minutes.
Then I feed it into Claude and asked to develop said system. It spent the next 15 minutes outputting file after file.
Then I ran "npm install" followed by "npm run" and got a "fully" (API was mocked) functional, mobile-friendly, and well documented system in just an hour of my time.
It'd have taken me an entire day of work to reach the same point.
However, he has a very black and white approach to things and he also finds interacting with a lot of humans frustrating, weird and uncomfortable.
The more conversations I see about LLMs the more I’m beginning to feel that “LLM-whispering” is a soft skill that some people find very natural and can excel at, while others find it completely foreign, confusing and frustrating.
If you have any reasonable understanding of SQL, I guarantee you could brush up on it and write it yourself in less than a couple of hours unless you're trying to do something very complex
SQL is absolutely trivial to write by hand
Think of it as managing cognitive load. Wandering off to relearn SQL boilerplate is a distraction from my medium-term goal.
edit: I also believe I'm less likely to get a really dumb 'gotcha' if I start from the LLM rather than cobbling together knowledge from some random docs.
You might also consider that you may be over-indexing on your own capabilities rather than evaluating the LLM’s capabilities.
Lets say an llm is only 25% as good as you but is 10% the cost. Surely you’d acknowledge there may be tasks that are better outsourced to the llm than to you, strictly from an ROI perspective?
It seems like your claim is that since you’re better than LLMs, LLMs are useless. But I think you need to consider the broader market for LLMs, even if you aren’t the target customer.
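The ROI argument above is just arithmetic. A toy version, with the comment's numbers normalized so the human's quality is 1.0 (all figures illustrative):

```python
# Hypothetical quality-per-dollar comparison from the "25% as good at
# 10% the cost" framing in the comment above.
human_quality, human_cost = 1.00, 100.0   # baseline: the expert
llm_quality, llm_cost = 0.25, 10.0        # 25% of the quality, 10% of the cost

human_value_per_dollar = human_quality / human_cost   # 0.01
llm_value_per_dollar = llm_quality / llm_cost         # 0.025
```

On these numbers the LLM delivers 2.5x the quality per dollar, which is the whole point: worse output can still win on cheap, low-stakes tasks where expert time is the scarce resource.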
However, whether SQL is "trivial to write by hand" very much depends on exactly what you are trying to do with it.
(The details: I was working with running a Bayesian sampler across multiple compute nodes with MPI. There seemed to be a pathological interaction between the code and MPI where things looked like they were working, but never actually progressed.)
FWIW, I've seen people online refer to this as "vibe coding".
I read the linked article when it was posted, and I suspect a few things that are skewing your own view of the general applicability of LLMs for programming. One, your projects are small enough that you can reasonably provide enough context for the language model to be useful. Two, you’re using the most common languages in the training data. Three, because of those factors, you’re willing to put much more work into learning how to use it effectively, since it can actually produce useful content for you.
I think it’s great that it’s a technology you’re passionate about and that it’s useful for you, but my experience is that in the context of working in a large systems codebase with years of history, it’s just not that useful. And that’s okay, it doesn’t have to be all things to all people. But it’s not fair to say that we’re just holding it wrong.
It's possible that changed this week with Gemini 2.5 Pro, which is equivalent to Claude 3.7 Sonnet in terms of code quality but has a 1 million token context (with excellent scores on long context benchmarks) and an increased output limit too.
I've been dumping hundreds of thousands of tokens of codebase into it and getting very impressive results.
It’s exhausting. At this point I’m mostly just waiting for it to stabilize and plateau, at which point it’ll feel more worth the effort to figure out whether it’s now finally useful for me.
Just on Tuesday this week we got the first widely available high quality multi-modal image output model (GPT-4o images) and a new best-overall model (Gemini 2.5) within hours of each other. https://simonwillison.net/2025/Mar/25/
Take a look at the 2024 StackOverflow survey.
70% of professional developer respondents had only done extensive work over the last year in one of:
JS 64.6%, SQL 54.1%, HTML/CSS 52.9%, Python 46.9%, TS 43.4%, Bash/Shell 34.2%, Java 30%
LLMs are of course very strong in all of these. 70% of developers only code in languages LLMs are very strong at.
If anything, for the developer population at large, this number is even higher than 70%. The survey respondents are overwhelmingly American (where the dev landscape is more diverse), and self-select to those who use niche stuff and want to let the world know.
Similar argument can be made for median codebase size, in terms of LOC written every year. A few days ago he also gave Gemini Pro 2.5 a whole codebase (at ~300k tokens) and it performed well. Even in huge codebases, if any kind of separation of concerns is involved, that's enough to give all context relevant to the part of the code you're working on. [1]
But really that’s the vision of actual utility that I imagined when this stuff first started coming out and that I’d still love to see: something that integrates with your editor, trains on your giant legacy codebase, and can actually be useful answering questions about it and maybe suggesting code. Seems like we might get there eventually, but I haven’t seen that we’re there yet.
The "reasoning" thing is important because it gives models the ability to follow execution flow and answer complex questions that span many different files and classes. I'm finding it incredible for debugging, e.g.: https://gist.github.com/simonw/03776d9f80534aa8e5348580dc6a8...
I built a files-to-prompt tool to help dump entire codebases into the larger models and I use it to answer complex questions about code (including other people's projects written in languages I don't know) several times a week. There's a bunch of examples of that here: https://simonwillison.net/search/?q=Files-to-prompt&sort=dat...
Whatever the amount may be, it definitely fits into 300k tokens.
<troll>Have you considered that asking it to solve problems in areas it's bad at solving problems is you holding it wrong?</troll>
But, actually seriously, yeah, I've been massively underwhelmed with the LLM performance I've seen, and just flabbergasted with the subset of programmer/sysadmin coworkers who ask it questions and take those answers as gospel. It's especially frustrating when it's a question about something that I'm very knowledgeable about, and I can't convince them that the answer they got is garbage because they refuse to so much as glance at supporting documentation.
Go look up what happens in history when tons of people are unemployed at the same time with no hope of getting work. What happens when the unemployed masses become desperate?
Naw I'm sure it will be fine, this time will be different
Alien 1: I gave Jeff Dean a giant complex system to build, he crushed it! Humans are so smart.
Alien 2: I gave a random human a simple programming problem and he just stared at me like an idiot. Humans suck.
I see people say, "Look how great this is," and show me an example, and the example they show me is just not great. We're literally looking at the same thing, and they're excited that this LLM can do a college grads's job to the level of a third grader, and I'm just not excited about that.
Treat the AI as a freelancer working on your project. How would you ask a freelancer to create a Kanban system for you? By simply asking "Create a Kanban system", or by providing them a 2-3 pages document describing features, guidelines, restrictions, requirements, dependencies, design ethos, etc?
Which approach will get you closer to your objective?
The same applies to LLM (when it comes to code generation). When well instructed, it can quickly generate a lot of working code, and apply the necessary fixes/changes you request inside that same context window.
It still can't generate senior-level code, but it saves hours when doing grunt work or prototyping ideas.
"Oh, but the code isn't perfect".
Nor is the code of the average jr dev, but their code still makes it to production in thousands of companies around the world.
About 2 weeks ago I started on a streaming markdown parser for the terminal because none really existed. I've switched to human coding now but the first version was basically all llm prompting and a bunch of the code is still llm generated (maybe 80%). It's a parser, those are hard. There's stacks, states, lookaheads, look behinds, feature flags, color spaces, support for things like links and syntax highlighting... all forward streaming. Not easy
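The flavor of problem being described can be sketched with a toy example. This is my own illustration, not the commenter's parser: a forward-streaming state machine that detects **bold** spans one chunk at a time, never looking backwards in the input.

```python
def stream_bold(chunks):
    """Yield (text, is_bold) pairs from an incoming stream of text chunks."""
    bold = False
    star_run = 0   # consecutive '*' characters seen so far
    buf = []
    for chunk in chunks:
        for ch in chunk:
            if ch == "*":
                star_run += 1
                if star_run == 2:          # '**' toggles bold on/off
                    if buf:
                        yield ("".join(buf), bold)
                        buf = []
                    bold = not bold
                    star_run = 0
            else:
                buf.extend("*" * star_run)  # a lone '*' is literal text
                star_run = 0
                buf.append(ch)
    buf.extend("*" * star_run)              # flush a trailing lone '*'
    if buf:
        yield ("".join(buf), bold)
```

Note the state (`bold`, `star_run`, `buf`) has to survive chunk boundaries, since a `**` delimiter can be split across two network reads. Scale that up to links, code spans, nesting stacks, and ANSI color output and you have the "not easy" part.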
Exactly this.
I once had a function that would generate several .csv reports. I wanted these reports to then be uploaded to s3://my_bucket/reports/{timestamp}/*.csv
I asked ChatGPT "Write a function that moves all .csv files in the current directory to an old_reports directory, calls a create_reports function, then uploads all the csv files in the current directory to s3://my_bucket/reports/{timestamp}/*.csv with the timestamp in YYYY-MM-DD format"
And it created the code perfectly. I knew what the correct code would look like, I just couldn't be fucked to look up the exact calls to boto3, whether moving files was os.move or os.rename or something from shutil, and the exact way to format a datetime object.
It created the code far faster than I would have.
Like, I certainly wouldn't use it to write a whole app, or even a whole class, but individual blocks like this, it's great.
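For illustration, here is roughly the function that prompt describes. This is my sketch, not the commenter's actual output: the `create_reports` callable and the S3 client are passed in (in real use the client would be `boto3.client("s3")`), and the bucket name comes from the example path in the comment.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path

def rotate_and_upload_reports(s3_client, create_reports,
                              bucket="my_bucket", workdir="."):
    """Move existing .csv files aside, regenerate reports, then upload
    the fresh ones to s3://{bucket}/reports/{YYYY-MM-DD}/."""
    workdir = Path(workdir)
    old = workdir / "old_reports"
    old.mkdir(exist_ok=True)
    for f in workdir.glob("*.csv"):
        shutil.move(str(f), str(old / f.name))  # shutil.move, not os.move

    create_reports()  # expected to write fresh .csv files into workdir

    ts = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    for f in workdir.glob("*.csv"):
        # boto3's S3 client takes (Filename, Bucket, Key)
        s3_client.upload_file(str(f), bucket, f"reports/{ts}/{f.name}")
```

The point of the anecdote holds: none of this is hard, but remembering that it's `shutil.move`, `strftime("%Y-%m-%d")`, and `upload_file(Filename, Bucket, Key)` is exactly the lookup drudgery an LLM removes.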
I use it to produce whole classes, large sql queries, terraform scripts, etc etc. I then look over that output, iterate on it, adjust it to my needs. It's never exactly right at first, but that's fine - neither is code I write from scratch. It's still a massive time saver.
I think this is the most important bit many people miss. It is advertised as an autonomous software developer, or something that can take a junior to senior levels, but that's just advertising.
It is actually most useful for senior developers, as it does the grunt work for them, while grunt work is actually useful work for a junior developer as a learning tool.
These are power tools for the mind. We've been working with the equivalent of hand tools, now something new came along. And yeah, a hole hawg will throw you clear off a ladder if you're not careful -- does that mean you're going to bore 6" holes in concrete ceilings by hand? Think not.
By a few currently niche VC players, I guess. I don't see Anthropic, the overwhelming revenue leader in dollars spent on LLM-related tools for SWE, claiming that.
Are you sure about that? [1]:
> "I think we will be there in three to six months, where AI is writing 90% of the code. And then, in 12 months, we may be in a world where AI is writing essentially all of the code," Amodei said at a Council of Foreign Relations event on Monday.
[1] https://www.entrepreneur.com/business-news/anthropic-ceo-pre...
I have capital allocator friends warning me about vibe coding taking my job
Yes, it WON'T produce senior-level code for complex tasks, but it's great at tackling junior-to-mid-level code generation/refactoring, with minor adjustments (just like a code review).
So, it's basically the same thing as having a freelancer jr dev at your disposal, but it can generate working code in 5 min instead of 5 hours.
It doesn't just save me a ton of time, it results in me building automations that I normally wouldn't have taken on at all because the time spent fiddling with os.move/boto3/etc wouldn't have been worthwhile compared to other things on my plate.
But if you can do the task well enough to at least recognize likely-to-be-correct output, then you can get a lot done in less time than you would do it without their assistance.
Is that worth the second order effects we're seeing? I'm not convinced, but it's definitely changed the way we do work.
I'm sure spending more time fiddling with the setup of LLM tools can yield better results, but that doesn't mean that it will be worth it for everyone. In my experience LLMs fail often enough at modestly complex problems that they are more hassle than benefit for a lot of the work I do. I'll still use them for simple tasks, like if I need some standard code in a language I'm not too familiar with. At the same time, I'm not at all surprised that others have a different experience and find them useful for larger projects they work on.
As you said, examples where I wouldn't expect LLMs to be good at from people who dismiss the scenarios where LLMs are great at. I don't want to convince anyone, to be honest - I just want to say they are incredibly useful for me and a huge time saver. If people don't want to use LLMs, it's fine for me as I'll have an edge over them in the market. Thanks for the cash, I guess.
I'm growing weary of trying to help people use these tools properly.
One day I came up with a joke and wondered whether people would "get it". I told the joke to ChatGPT and asked it to explain it back to me. ChatGPT did a great job and nailed what's supposedly funny about the joke. I used it in an email so I have no idea whether anyone found it funny, but at least I know it wasn't too obscure. If an AI can understand a joke, there's a good chance people will understand it too.
This might not be super useful but demonstrates that LLMs aren't only about generating text for copy-and-paste or retrieving information. It's "someone" you can bounce ideas with, ask opinions and that's how I use it most frequently.
Which is fine for actual testing you're doing internally, since that cost burden is then remedied by you fixing those issues. However, no feature is as free as you're making it sound, not even the "nice to have" additions that seem so insignificant.
Of course the tradeoffs should be well considered. That's why it may get out of hand real bad if software will be created (or vibe coded) by people with little understanding of these metrics and tradeoffs. I'm absolutely not advocating for that.
The point is more that everyone seems to acknowledge that a) output is spotty, and b) it’s difficult to provide enough context to work on anything that’s not fairly self-contained. And yet we also constantly have people saying that they’re using AI for some ridiculous percentage of their actual job output. So, I’m just curious how one reconciles those two things.
Either most people’s jobs consist of a lot more small, self-contained mini-projects than my jobs generally have, or people’s jobs are more accepting of incorrect output than I’m used to, or people are overstating their use of the tool.
Or something else!
Automating the easy 80% sounds useful, but in practice I'm not convinced that's all that helpful. Reading and putting together code you didn't write is hard enough to begin with.
I’ve never seen it from my students. Why do you think this? It’s trivial to pick a real book/article. No student is generating fake material whole cloth and fake references to match. Even if they could, why would they risk it?
Look for the ways that AI works, and it can be a powerful tool. Try and figure out where it still fails, and you will see nothing but hype and hot air.
Perfectly put, IMO. I know arguments from authority aren't primary, but I think this point highlights some important context: Dr. Hossenfelder has gained international renown by publishing clickbait-y YouTube videos that ostensibly debunk scientific and technological advances of all kinds. She's clearly educated and thoughtful (not to mention otherwise gainfully employed), but her whole public persona kinda relies on assuming the exclusively-critical standpoint you mention.
I doubt she necessarily feels indebted to her large audience expecting this take (it's not new...), but that certainly does seem like a hard cognitive habit to break.
"Garbage in, garbage out" as the law says.
Of course, it took a lot of trial and error for me to get to my current level of effectiveness with LLMs. It's probably our responsibility to teach those who are willing.
A Mitre Saw is an amazing thing to have in a woodshop, but if you don't learn how to use it you're probably going to cut off a finger.
The problem is that LLMs are power tools that are sold as being so easy to use that you don't need to invest any effort in learning them at all. That's extremely misleading.
* Know how they work
* Are legally liable for defects in design or manufacture that cause injury, death, or property damage
* Provide manuals that instruct the operator how to effectively and safely use the power tool
Except when you use them for purposes other than declared by them - then it's on you. Similarly, you get plenty of warnings about the limitations and suitability of LLMs from the major vendors, including warnings directly in the UI. The limitations of LLMs are common knowledge. Like almost everyone, you ignore them, but then the consequences are on you too.
> Provide manuals that instruct the operator how to effectively and safely use the power tool
LLMs come with manuals much, much more extensive than any power tool ever (or at least since 1960s or such, as back then hardware was user-serviceable and manuals weren't just generic boilerplate).
As for:
> Know how they work
That is a real difference between power tool manufacturers and LLM vendors, but then if you switch to comparing against pharmaceutical industry, then they don't know how most of their products work either. So it's not a requirement for useful products that we benefit from having available.
My favorite example: you ask the LLM for "most recent restaurant opened in California", give it a schema and it tries "select * from restaurants where state = 'California' order by open_date desc" - but that returns 0 results, because it turns out the state column uses two-letter state abbreviations like CA instead.
There are tricks that can help here - I've tried sending the LLM an example row from each table, or you can set up a proper loop where the LLM gets to see the results and iterate on them - but it reflects the fact that interacting with databases can easily go wrong no matter how "smart" the model you are using is.
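That failure mode is easy to reproduce with sqlite3; the table contents below are assumed for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE restaurants (name TEXT, state TEXT, open_date TEXT)")
conn.execute("INSERT INTO restaurants VALUES ('Taqueria Nueva', 'CA', '2025-01-15')")

# What an LLM plausibly guesses from the schema alone: zero rows, no error,
# so nothing signals that the guess was wrong.
guess = conn.execute(
    "SELECT * FROM restaurants WHERE state = 'California' "
    "ORDER BY open_date DESC LIMIT 1").fetchall()

# After seeing even one example row, the abbreviation is obvious.
fixed = conn.execute(
    "SELECT * FROM restaurants WHERE state = 'CA' "
    "ORDER BY open_date DESC LIMIT 1").fetchall()

print(len(guess), len(fixed))  # → 0 1
```

The nasty part is that the wrong query is syntactically valid and returns silently empty results, which is exactly why sending example rows, or letting the model iterate on query output, helps so much.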
As you’ve identified, rather than just giving it the schema you give it the schema and a some data when you tell it what you want.
A human might make exactly the same error - based on misassumption - and would then look at the data to see why it was failing.
If we expect an LLM to magically realise that a query you explicitly told it to base on 'California' should really use 'CA', then its failure to do so is not really the fault of the LLM.
Those people, likely, will never change their opinion.
And that’s fine, because they won’t get the huge benefits that come from spending time learning how to use the tool properly.
Every once in a while I send a query off to ChatGPT and I'm often disappointed and jam on the "this was hallucinated" feedback button (or whatever it is called). I have better luck with Claude's chat interface but nowhere near the quality of response that I get with Cline driving.
What I am seeing is fanboys who offer me examples of things working well that fail any close scrutiny— with the occasional example that comes out actually working well.
I agree that for prototyping unimportant code LLMs do work well. I definitely get to unimportant point B from point A much more quickly when trying to write something unfamiliar.
Benchmarks could track that too - I don't know if they do, but that information should actually be available and easy to get.
When models are scored on e.g. "pass@10", i.e. passing the challenge in under 10 attempts, and the benchmark is rerun periodically, that literally produces the information you're asking for: how frequently a given model fails at a particular task.
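For reference, the widely used unbiased pass@k estimator (popularized by the HumanEval/Codex methodology): given n sampled attempts of which c passed, it estimates the probability that at least one of k randomly drawn attempts passes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k.

    n: total samples generated for the task
    c: how many of them passed
    k: budget of attempts being scored
    Equals 1 - C(n-c, k) / C(n, k): one minus the chance that all k
    draws (without replacement) land on failures.
    """
    if n - c < k:
        return 1.0  # fewer than k failures exist, so some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

So `pass_at_k(n, c, 1)` is just the empirical per-attempt success rate, and one minus it, tracked across benchmark reruns, is exactly the per-task failure frequency the parent comment wants.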
> A computer that will randomly produce an incorrect result to my calculation is useless to me because now I have to separately validate the correctness of every result.
For many tasks, validating a solution is order of magnitudes easier and cheaper than finding the solution in the first place. For those tasks, LLMs are very useful.
> If I need to ask an LLM to explain to me some fact, how do I know if this time it's hallucinating? There is no "LLM just guessed" flag in the output. It might seem to people to be "miraculous" that it will summarize a random scientific paper down to 5 bullet points, but how do you know if its output is correct? No LLM proponent seems to want to answer this question.
How can you be sure whether a human you're asking isn't hallucinating/guessing the answer, or straight up bullshitting you? Apply the same approach to LLMs as you apply to navigating this problem with humans - for example, don't ask it to solve high-consequence problems in areas where you can't evaluate proposed solutions quickly.
A good example that I use frequently is a reverse dictionary.
It's also useful for suggesting edits to text that I have written. It's easy for me to read its suggestions and accept/reject them.
I asked Gemini for the lyrics to a song that I knew was on all the lyrics sites. To make a long story short, it gave me the wrong lyrics three times, apparently making up new ones the last two times. Someone here said LLMs may not be allowed to look at those sites for copyright reasons, which is fair enough; but then it should have just said so, not "pretended" it was giving me the right answer.
I have a python script that processes a CSV file every day, using DictReader. This morning it failed, because the people making the CSV changed it to add four extra lines above the header line, so DictReader was getting its headers from the wrong line. I did a search and found the fix on Stack Overflow, no big deal, and it had the upvotes to suggest I could trust the answer. I'm sure an LLM could have told me the answer, but then I would have needed to do the search anyway to confirm it--or simply implemented it, and if it worked, assume it would keep working and not cause other problems.
That was just a two-line fix, easy enough to try out and see if it worked, and guess how it worked. I can't imagine implementing a 100-line fix and assuming the best.
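My guess at the kind of two-line fix described, assuming the four junk lines sit above the real header: just consume them before handing the file object to DictReader, which then reads its header from wherever the iterator stands.

```python
import csv
import io
from itertools import islice

def read_report(f, skip=4):
    """Discard `skip` leading junk lines, then parse with csv.DictReader."""
    for _ in islice(f, skip):
        pass  # advance past the lines above the real header
    return list(csv.DictReader(f))

# Example: four garbage lines, then the real header and data.
data = "junk1\njunk2\njunk3\njunk4\nid,name\n1,alpha\n2,beta\n"
rows = read_report(io.StringIO(data))
print(rows[0]["name"])  # → alpha
```

This works because DictReader takes any iterable of lines and treats the first line it sees as the header; of course it silently breaks again if the vendor changes the number of preamble lines.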
It seems to me that some people are saying, "It gives me the right thing X% of the time, which saves me enough developer time (mine or someone else's) that it's worth the other (100-X)% of the time when it gives me garbage that takes extra time to fix." And that may be a fair trade for some folks. I just haven't found situations where it is for me.
This doesn’t include lying and cheating, which LLMs can’t do.
On the other hand, AI is used to solve problems that are already solved. I recently got an ad for process-modeling software where one claim was that you don't always need to start from the ground up; you can tell the AI "give me the customer order process" and start from that point. That is basically what templates are for, with much less energy consumption.
You hit the nail on this one. Around me I noticed that the bashing of LLMs come from the smart people that want others to know they are smart.
It doesn't always correlate with narcissism, but it happens much more than chance.
Yes, somewhat. It's good for PowerShell/bash/cmd scripts and configs, but early models would hallucinate PowerShell cmdlets especially.
The use cases are vastly different and the first is just _not_ world changing. It’s great, don’t get me wrong, but it won’t change the world.
"I write code all day with LLMs, it's amazing!" is in the exact same category. The code you (general you, I'm not picking on you in particular) write using LLMs, and the code I write apart from LLMs: they are not the same. They are categorically different artifacts.
Yup. Not to mention, we don't even have time to figure out how to effectively work with one generation of models before the next generation gets released and raises the bar. If development stopped right now, I'd still expect LLMs to get better for years, as people slowly figure out how to use them well.
"I ask them to give me a source for an alleged quote, I click on the link, it returns a 404 error. I Google for the alleged quote, it doesn't exist. They reference a scientific publication, I look it up, it doesn't exist."
To experienced LLM users that's not surprising at all - providing citations, sources for quotes, useful URLs are all things that they are demonstrably terrible at.
But it's a computer! Telling people "this advanced computer system cannot reliably look up facts" goes against everything computers have been good at for the last 40+ years.
And that’s honestly unfair to you since you do awesome realistic and level headed work with LLM.
But I think it’s important when having discussions to understand the context within which they are occurring.
Without the bulls she might very well be saying what you are in your last paragraph. But because of the bulls the conversation becomes this insane stratified nonsense.
Teachers are there to observe and manage behavior, resolve conflict, identify psychological risks and get in front of fixing them, set and maintain a positive tone (“setting the weather”), lift pupils up to that tone, and to summarize, assess and report on progress.
They are also there to grind through papers, tests, lesson plans, reports, marking, and letter writing. All of that will get easier with machine assistance.
Teaching is one of the most human-nature centric jobs in the world and will be the last to go. If AI can help focus the role of teacher more on using expert people skills and less on drudgery it will hopefully even improve the prospects of teaching as a career, not eliminate it.
Use google AI studio with search grounding. Provides correct links and citations every time. Other companies have similar search modes, but you have to enable those settings if you want good results.
Grounding isn't very different from that.
It’s been a common mantra - at least in my bubble of technologists - that a good majority of the software engineering skill set is knowing how to search well. Knowing when search is the right tool, how to format a query, how to peruse the results and find the useful ones, what results indicate a bad query you should adjust… these all sort of become second nature the longer you’ve been using Search, but I also have noticed them as an obvious difference between people that are tech-adept vs not.
LLMs seems to have a very similar usability pattern. They’re not always the right tool, and are crippled by bad prompting. Even with good prompting, you need to know how to notice good results vs bad, how to cherry-pick and refine the useful bits, and have a sense for when to start over with a fresh prompt. And none of this is really _hard_ - just like Search, none of us need to go take a course on prompting - IMO folks jusr need to engage with LLMs as a non-perfect tool they are learning how to wield.
The fact that we have to learn a tool doesn’t make it a bad one. The fact that a tool doesn’t always get it 100% on the first try doesn’t make it useless. I strip a lot of screws with my screwdriver, but I don’t blame the screwdriver.
On a side note, this lady is a fraud: https://www.youtube.com/watch?v=nJjPH3TQif0&themeRefresh=1
In no way am I credentialing her, lots of people can make astute observations about things they weren't trained in, but she both mastered sounding authoritative and at the same time, presenting things to go the most engagement possible.
This trap reminds me of the Perry Bible Fellowship comic "Catch Phrase" which has been removed for being too dark but can still be found with a search.
Wow, thank you. I rarely get a good cultural recommendation here, and I didn't know about PBF.
I raise you, Joan Cornellà
If you don't have that experience in this domain, you will spend approximately as much effort validating output as you would have creating it yourself, but the process is less demanding of your critical skills.
> you don't have that experience in this domain, you will spend approximately as much effort validating output as you would have creating it yourself,
Not true.
LLMs are amazing tutors. You have to use outside information, they test you, you test them, but they aren't pathologically wrong in the way that they are trying to do a Gaussian magic smoke psyop against you.
Even when you lack subject matter expertise about something, there are certain universal red flags that skeptics key in on. One of the biggest ones is: “There’s no such thing as a free lunch” and its corollary: “If it sounds too good to be true, it probably is.”
Since reasoning models came about I've been significantly more bullish on them purely because they are less bad. They are still not amazing but they are at a poiny where I feel like including them in my workflow isn't an impediment.
They can now reliably complete a subset of tasks without me needing to rewrite large chunks of it myself.
They are still pretty terrible at edge cases (uncommon patterns/libraries, etc.), but when on the beaten path they can actually pretty decently improve productivity. I still don't think it's 10x (well, today was the first time I felt a 10x improvement, but I was moving frontend code from a custom framework to React, more tedium than anything else, and the AI did a spectacular job).
Of late, deaf tech forums are taken over by language model debates over which works best for speech transcription. (Multimodal language models are the state of the art in machine transcription. Everyone seems to forget that when complaining they can't cite sources for scientific papers yet.) The debates have gotten to the point that it's become annoying how much space they take up, just like here on HN.
But then I remember, oh yeah, there was no such thing as live machine transcription ten years ago. And now there is. And it's going to continue to get better. It's already good enough to be very useful in many situations. I have elsewhere complained about the faults of AI models for machine transcription - in particular when they make mistakes they tend to hallucinate something that is superficially grammatical and coherent instead - but for a single phrase in an audio transcription sporadically that's sometimes tolerable. In many cases you still want a human transcriber but the cost of that means that the amount of transcription needed can never be satisfied.
It's a revolutionary technology. I think in a few years I'm going have glasses that continuously narrate the sounds around me and transcribe speech and it's going to be so good I can probably "pass" as a hearing person in some contexts. It's hard not to get a bit giddy and carried away sometimes.
If everyone is using them wrong, I would argue that says something more about them than the users. Chat-based interfaces are the thing that kicked LLMs into the mainstream consciousness and started the cycle/trajectory we’re on now. If this is the wrong use case, everything the author said is still true.
There are still applications made better by LLMs, but they are a far cry from AGI/ASI in terms of being all-knowing problem solvers that don’t make mistakes. Language tasks like transcription and translation are valuable, but by no stretch do they account for the billions of dollars of spend on these platforms, I would argue.
Yes the costs of training AI models these days are really high too, but now we're just making a quantitative argument, not a qualitative one.
The fact that we've discovered a near-magical tech that everyone wants to experiment with in various contexts, is evidence that the tech is probably going somewhere.
Historically speaking, I don't think any scientific invention or technology has been adopted and experimented with so quickly and on such a massive scale as LLMs.
It's crazy that people like you dismiss the tech simply because people want to experiment with it. It's like some of you are against scientific experimentation for some reason.
What? Then what the hell do you call Dragon NaturallySpeaking and other similar software in that niche?
I have a minor speech impediment because of the hearing loss. They never worked for me very well. I don't speak like a standard American - I have a regional accent and I have a speech impediment. Modern speech recognition doesn't seem to have a problem with that anymore.
IBM's ViaVoice from 1997 in particular was a major step. It was really impressive in a lot of ways but the accuracy rate was like 90 - 95% which in practice means editing major errors with almost every sentence. And that was for people who could speak clearly. It never worked for me very well.
You also needed to speak in an unnatural way [pause] comma [pause] and it would not be fair to say that it transcribed truly natural speech [pause] full stop
Such voice recognition systems before about 2016 also required training on the specific speaker. You would read many pages of text to the recognition engine to tune it to you specifically.
It could not just be pointed at the soundtrack to an old 1980s TV show then produce a time-sync'd set of captions accurate enough to enjoy the show. But that can be done now.
> ...there was no such thing as live machine transcription ten years ago.
Now you're saying that live machine transcription existed thirty years ago, but it has gotten substantially better in the intervening decades.
I agree with your amended claim.
These critics don't seem to have learned the lesson that the perfect is the enemy of the good.
I use ChatGPT all the time for academic research. Does it fabricate references? Absolutely, maybe about a third of the time. But has it pointed me to important research papers I might never have found otherwise? Absolutely.
The rate of inaccuracies and falsehoods doesn't matter. What matters is, is it saving you time and increasing your productivity. Verifying the accuracy of its statements is easy. While finding the knowledge it spits out in the first place is hard. The net balance is a huge positive.
People are bullish on LLM's because they can save you days' worth of work, like every day. My research productivity has gone way up with ChatGPT -- asking it to explain ideas, related concepts, relevant papers, and so forth. It's amazing.
For single statements, sometimes, but not always. For all of the many statements, no. Having the human attention and discipline to mindfully verify every single one without fail? Impossible.
Every software product/process that assumes the user has superhuman vigilance is doomed to fail badly.
> Automation centaurs are great: they relieve humans of drudgework and let them focus on the creative and satisfying parts of their jobs. That's how AI-assisted coding is pitched [...]
> But a hallucinating AI is a terrible co-pilot. It's just good enough to get the job done much of the time, but it also sneakily inserts booby-traps that are statistically guaranteed to look as plausible as the good code (that's what a next-word-guessing program does: guesses the statistically most likely word).
> This turns AI-"assisted" coders into reverse centaurs. The AI can churn out code at superhuman speed, and you, the human in the loop, must maintain perfect vigilance and attention as you review that code, spotting the cleverly disguised hooks for malicious code that the AI can't be prevented from inserting into its code. As qntm writes, "code review [is] difficult relative to writing new code":
-- https://pluralistic.net/2025/03/18/asbestos-in-the-walls/
I mean, how do you live life?
The people you talk to in your life say factually wrong things all the time.
How do you deal with it?
With common sense, a decent bullshit detector, and a healthy level of skepticism.
LLM's aren't calculators. You're not supposed to rely on them to give perfect answers. That would be crazy.
And I don't need to verify "every single statement". I just need to verify whichever part I need to use for something else. I can run the code it produces to see if it works. I can look up the reference to see if it exists. I can Google the particular fact to see if it's real. It's really very little effort. And the verification is orders of magnitude easier and faster than coming up with the information in the first place. Which is what makes LLM's so incredibly helpful.
Well put.
Especially this:
> I can run the code it produces to see if it works.
You can get it to generate tests (and easy ways for you to verify correctness).
And you don't have concerns about that? What kind of damage is that doing to our society, long term, if we have a system that _everyone_ uses and it's just accepted that a third of the time it is just making shit up?
Like, I can ask a friend and they'll mistakenly make up a reference. "Yeah, didn't so-and-so write a paper on that? Oh they didn't? Oh never mind, I must have been thinking of something else." Does that mean I should never ask my friend about anything ever again?
Nobody should be using these as sources of infallible truth. That's a bonkers attitude. We should be using them as insanely knowledgeable tutors who are sometimes wrong. Ask and then verify.
The net benefit is huge.
And I'm talking about references when doing deep academic research. Looking them up is absolutely a productive use of time -- I'm asking for the references so I can read them. I'm not asking for them for fun.
Remember, it's hundreds of times easier to verify information than it is to find it in the first place. That's the basic principle of what makes LLM's so incredibly valuable.
A third of the time is an insane number; if 30% of the code I wrote contained nonexistent headers I would have been fired long ago.
You're really underestimating the difficulty of getting 70% accuracy for general open-ended questions.
And while you might think you're better than 70%, I'm pretty sure if you didn't run your code through compilers and linters, and testing for at least a couple times, your code doesn't get anywhere near 70% correct.
That's exactly what the OP is saying. Verify everything.
Having lived a decent chunk of my life pre-internet, or at least fast and available internet, looking back at those days you realize just how often people were wrong about things. Old wives tales, made up statistics, imagined scenarios, people really do seem to confabulate a lot of information.
Main problem with our society is that two thirds of what _everyone_ says is made up shit / motivated reasoning. The random errors LLMs make are relatively benign, because there is no motivation behind them. They are just noise. Look through them.
Could it end up being a net benefit? will the realistic sounding but incorrect facts generated by A.I. make people engage with arguments more critically, and be less likely to believe random statements they're given?
Now, I don't know, or even think it is likely that this will happen, but I find it an interesting thought experiment.
LLMs will spit out responses with zero backing with 100% conviction. People see citations and assume it's correct. We're conditioned for it thanks to....everything ever in history. Rarely do I need to check a wikipedia entry's source.
So why do people not understand that: this is absolutely going to pour jet fuel on misinformation in the world. And we as a society are allowed to hold a bar higher for what we'll accept get shoved down our throats by corporate overlords that want their VC payout.
The solution is to set expectations, not to throw away one of the most valuable tools ever created.
If you read a supermarket tabloid, do you think the stories about aliens are true? No, because you've been taught that tabloids are sensationalist. When you listen to campaign ads, do you think they're true? When you ask a buddy about geography halfway across the world, do you assume every answer they give is right?
It's just about having realistic expectations. And people tend to learn those fast.
> Rarely do I need to check a wikipedia entry's source.
I suggest you start. Wikipedia is full of citations that don't back up the text of the article. And that's when there are even citations to begin with. I can't count the number of times I've wanted to verify something on Wikipedia, and there either wasn't a citation, or there was one related to the topic but that didn't have anything related to the specific assertion being made.
Imagine there is a probabilistic oracle that can answer any yes/no question with success probability p. If p=100% or p=0% then it is obviously very useful. If p=50% then it is absolutely worthless. In other cases, such an oracle can be utilized in different ways to get the answer we want, and it is still a useful thing.
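One standard way to make that precise: if the oracle's errors are independent across queries, ask it n times and take the majority vote (flip every answer first when p is below 50%). For any p away from 50%, the majority's error shrinks exponentially in n. A small sketch computing the exact majority-vote success probability from the binomial distribution:

```python
from math import comb

def majority_success(p: float, n: int) -> float:
    """Probability that the majority of n independent queries to an
    oracle with per-query success probability p is correct (n odd)."""
    assert n % 2 == 1, "use an odd n to avoid ties"
    # Majority is correct iff more than half of the n answers are correct.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# A 70% oracle queried 21 times is right in the majority roughly 98%
# of the time; one query is right only 70% of the time.
print(majority_success(0.7, 1), majority_success(0.7, 21))
```

Whether the analogy carries over to LLMs is debatable, since their errors on a given question are often correlated rather than independent, but it shows why "better than a coin flip" is already far from worthless.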
Unreliability is something we live in. It is the world. Controlling error, increasing signal over noise, extracting energy from the fluctuations. This is life, man. This is what we are.
I can use LLMs very effectively. I can use search engines very effectively. I can use computers.
Many others can’t. Imagine the sheer fortune to be born in the era where I was meant to be: tools transformative and powerful in my hands; useless in others’.
I must be blessed by God.
Its true success rate is by no means 100%, and sometimes is 0%, but it always tries to make you feel confident.
I’ve had to catch myself surrendering too much judgment to it. I worry a high school kid learning to write will have fewer qualms about surrendering judgment.
So we're trying to use tools like this to help solve deeper problems, and they aren't up to the task. This is the point where we need to start over and build better tools. Sharpening a bronze knife will never make it as sharp as a steel knife, or hold an edge as long. Same basic elements, very different material.
It's completely up to your ability to both find what you need without them and verify the information they give you to evaluate their usefulness. If you put that on a matrix, this makes them useful in the quadrant of information that is both hard to find, but very easy to verify. Which at least in my daily work is a reasonable amount.
There’s no question that we’re in a bubble which will eventually subside, probably in a “dot com” bust kind of way.
But let me tell you…last month I sent several hundred million requests to AI, as a single developer, and got exactly what I needed.
Three things are happening at once in this industry… (1) executives are over-promising a literal unicorn with AGI, which is totally unnecessary for the ongoing viability of LLMs and is pumping the bubble. (2) the technology is improving and delivery costs are changing as we figure out what works and who will pay. (3) the industry's instincts are developing, so it's common for people to think "AI" can do something it absolutely cannot do today.
But again…as one guy, for a few thousand dollars, I sent hundreds of millions of requests to AI that are generating a lot of value for me and my team.
Our instincts have a long way to go before we’ve collectively internalized the fact that one person can do that.
There are 2.6 million seconds in a month. You are claiming to have sent hundreds of requests per second to AI.
It is trivial for a server to send/receive 150 requests per second to the API.
This is what I mean by instincts...we're used to thinking of developers-pressing-keys as a fundamental bottleneck, and it still is to a point. But as soon as the tracks are laid for the AI to "work", things go from speed-of-human-thought to speed-of-light.
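To ground the 150-requests-per-second point, a minimal sketch of a single process holding 150 requests in flight with a semaphore. `call_model` here is a stub standing in for a real HTTP call, not any actual client library:

```python
import asyncio

async def call_model(i: int) -> str:
    # Stand-in for a real API call; a real version would use an
    # async HTTP client here instead of sleeping.
    await asyncio.sleep(0.01)  # simulate network latency
    return f"response-{i}"

async def run_batch(n: int, concurrency: int = 150) -> list[str]:
    # The semaphore caps how many requests are in flight at once.
    sem = asyncio.Semaphore(concurrency)
    async def limited(i: int) -> str:
        async with sem:
            return await call_model(i)
    return await asyncio.gather(*(limited(i) for i in range(n)))

results = asyncio.run(run_batch(1000))
print(len(results))  # 1000
```

One developer-pressing-keys writes this once; after that, throughput is limited by the API provider, not by the human.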
If you have a lot of GPU's and you're doing massive text processing like spam detection for hundreds of thousands of users, sure.
But "as a single developer", "value for me and my team"... I'm confused.
In general terms, we had to call the OpenAI API X00,000,000 times for a large-scale data processing task. We ended up with about 2,000,000 records in a database, using data created, classified, and cleaned by the AI.
There were multiple steps involved, so each individual record was the result of many round trips between the AI and the server, and not all calls are 1-to-1 with a record.
None of this is rocket science, and I think any average developer could pull off a similar task given enough time...but I was the only developer involved in the process.
The end product is being sold to companies who benefit from the data we produced, hence "value for me and the team."
The real point is that generative AI can, under the right circumstances, create absurd amounts of "productivity" that wouldn't have been possible otherwise.
1. Write python code for a new type of loss function I was considering
2. Perform lots of annoying CSV munging ("split this CSV into 4 equal parts", "convert paths in this column into absolute paths", "combine these and then split into 4 distinct subsets based on this field.." - they're great for that)
3. Expedite some basic shell operations like "generate softlinks for 100 randomly selected files in this directory"
4. Generate some summary plots of the data in the files I was working with
5. Not to mention extensive use in Cursor & GH Copilot
The tool (Claude 3.7 mostly, integrated with my shell so it can execute shell commands and run python locally) worked great in all cases. Yes I could've done most of it myself, but I personally hate CSV munging and bulk file manipulations and its super nice to delegate that stuff to an LLM agent
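For comparison, here is roughly what "split this CSV into 4 equal parts" (item 2) looks like hand-rolled with only the standard library; the sample data is made up:

```python
import csv, io

def split_csv(text: str, parts: int = 4) -> list[str]:
    """Split a CSV into roughly equal parts, repeating the header in each."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    size = -(-len(body) // parts)  # ceiling division
    out = []
    for i in range(0, len(body), size):
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerows([header] + body[i:i + size])
        out.append(buf.getvalue())
    return out

sample = "id,path\n" + "\n".join(f"{i},file{i}.txt" for i in range(8))
chunks = split_csv(sample, 4)
print(len(chunks))  # 4
```

Nothing hard, but it's exactly the sort of ten-minute chore that an LLM agent turns into a one-sentence request.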
edit: formatting
I guess the author can understand now?
When something was impossible only 3 years ago, barely worked 2 years ago, but works well now, there are very good reasons to be bullish, I suppose?
The hype cuts both ways.
What exactly are you talking about? What are you saying works well now that did not years ago? Claude as a milestone of code writing?
Also, in that case: if the current apparent successes come from a realm of tentative responses, we would need proof that the unreliable has become reliable. The observer will say, "they were tentative before, they often look tentative now, so why should we think they will pass the threshold to a radical change?"
llm cmd extract first frame of movie.mp4 as a jpeg using ffmpeg
I use that all the time, it works really well (defaulting to GPT-4o-mini because it's so cheap, but it works with Claude too): https://simonwillison.net/2024/Mar/26/llm-cmd/
There are third party tools that do the same, though
Please dispense with anyone's "expectations" when critiquing things! (Expectations are not a fault or property of the object of the expectations.)
Today's models (1) do things that are unprecedented. Their generality of knowledge, and ability to weave completely disparate subjects together sensibly, in real time (and faster if we want), is beyond any other artifact in existence. Including humans.
They are (2) progressing quickly. AI has been an active field (even through its famous "winters") for several decades, and they have never moved forward this fast.
Finally and most importantly (3), many people, including myself, continue to find serious new uses for them in daily work, that no other tech or sea of human assistants could replace cost effectively.
The only way I can make sense out of anyone's disappointment is to assume they simply haven't found the right way to use them for themselves. Or are unable to fathom that what is not useful for them is useful for others.
They are incredibly flexible tools, which means a lot of value, idiosyncratic to each user, only gets discovered over time with use and exploration.
That they have many limits isn't surprising. What doesn't? Who doesn't? Zeus help us the day AI doesn't have obvious limits to complain about.
Very well said. That’s perhaps the area where I have found LLMs most useful lately. For several years, I have been trying to find a solution to a complex and unique problem involving the laws of two countries, financial issues, and my particular individual situation. No amount of Googling could find an answer, and I was unable to find a professional consultant whose expertise spans the various domains. I explained the problem in detail to OpenAI’s Deep Research, and six minutes later it produced a 20-page report—with references that all checked out—clearly explaining my possible options, the arguments for and against each, and why one of those options was probably best. It probably saved me thousands of dollars.
I tried using AI coding assistants. My longest stint was 4 months with Copilot. It sucked. At its best, it does the same job as IntelliSense but slower. Other times it insisted on trying to autofill 25 lines of nonsense I didn't ask for. All the time I saved using Copilot was lost debugging the garbage Copilot wrote.
Perplexity was nice to bounce plot ideas off of for a game I'm working on... until I kept asking for more and found that it'll only generate the same ~20ish ideas over and over, rephrased every time, and half the ideas are stupid.
The only use case that continues to pique my interest is Notion's AI summary tool. That seems like a genuinely useful application, though it remains to be seen if these sorts of "sidecar" services will justify their energy costs anytime soon.
Now, I ask: if these aren't the "right" use cases for LLMs, then what is, and why do these companies keep putting out products that aren't the "right" use case?
This might appear to be a shallow answer, but I do not think it is. AI has taken a very long road from early conceptions, by Turing and others, to a tool whose value we can argue about, but which is getting attention and use everywhere.
The mere fact that "are they progressing rapidly" is a question, is a testament to an incredible uptick in speed of progression.
"Is AI progressing quickly?" is the new "Are we there yet?"
For example, the other day I was chatting with it about the health risks associated with my high consumption of farmed salmon. It then generated a small program to simulate the accumulation of PCBs in my body. I could review the program, ask questions about the assumptions, etc. It all seemed very reasonable. It called it a toxicokinetic analysis.
It then struck me how immensely valuable this is to a curious and inquisitive mind. This is essentially my gold standard of intelligence: take a complex question and break it down in a logical way, explaining every step of the reasoning process to me, and be willing to revise the analysis if I point out errors / weaknesses.
Now try that with your doctor. ;)
Can it make mistakes? Sure, but so can your doctor. The main difference is that here the responsibility is clearly on you. If you do not feel comfortable reviewing the reasoning then you shouldn’t trust it.
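The kind of program described above might look something like this one-compartment model. Every number below is an illustrative placeholder I've made up, not real toxicology data; the point is that the reasoning is inspectable:

```python
# Minimal one-compartment toxicokinetic sketch: daily intake plus
# first-order elimination. All parameters are assumed, not sourced.
HALF_LIFE_DAYS = 5 * 365      # assumed elimination half-life of PCBs
K = 0.693 / HALF_LIFE_DAYS    # first-order elimination rate per day
DAILY_INTAKE_NG = 800         # assumed intake per day from salmon meals

body_burden = 0.0             # ng of PCB in the body
for day in range(20 * 365):   # simulate 20 years of daily meals
    body_burden += DAILY_INTAKE_NG
    body_burden -= K * body_burden

steady_state = DAILY_INTAKE_NG / K
print(round(body_burden / steady_state, 2))  # ~0.94: four half-lives reaches ~94% of steady state
```

Because every assumption is a named constant, you can challenge any of them, change a number, and rerun, which is exactly the revise-on-pushback loop the comment describes.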
With an LLM it should be "Don't trust, verify" - but it isn't that hard to verify LLM claims, just ask it for original sources.
Compare to ye olde scientific calculators (90s), they were allowed in tests because even though they could solve equations, they couldn't show the work. And showing the work was 90% of the score. At best you could use one to verify your solution.
But then tech progressed and now calculators can solve equations step by step -> banned from tests at school.
Have you tried actually doing this? Most of the time it makes up urls that don't exist or contradict the answer it just gave.
There's a bunch of scientific papers talking about that.
Bet that plenty of those researchers have written python programs they've uploaded to GitHub and you just got one of their programs regurgitated
It's not intelligence mate, it's just copying an existing program.
Isn't the 'intelligence' part, the bit that gets a previously-constructed 'thing', and makes it work in 'situation'. Pretty sure that's how humans work, too.
On the other hand, for trivial technical problems with well known solutions, LLMs are great. But those are in many senses the low value problems; you can throw human bodies against that question cheaply. And honestly, before Google results became total rubbish, you could just Google it.
I try to use LLMs for various purposes. In almost all cases where I bother to use them, which are usually subject matters I care about, the results are poorer than I can quickly produce myself because I care enough to be semi-competent at it.
I can sort of understand the kinds of roles that LLMs might replace in the next few years, but there are many roles where it isn’t even close. They are useless in domains with minimal training data.
This is also my experience. My day job isn't programming, but when I can feed an LLM secretarial work, or simple coding prompts to automate some work, it does great and saves me time.
Most of my day is spent getting into the details on things for which there's no real precedent. Or if there is, it hasn't been widely published on. LLMs are frustrating useless for these problems.
In programming circles it's also annoying when you try to help and you get fed garbage outputted by LLMs.
I believe models for generating visuals (image, video, and sound generation) are much more interesting, as that's an area where errors do not matter as much. Though the ethicality of how these models have been trained is another matter.
I feel humans should be held to account for the work they produce irrespective of the tools they used to produce it.
The junior engineer who copied code he didn't understand from Stackoverflow should face the consequences as much as the engineer who used LLM generated code without understanding it.
I’m sure that in the future there will be a really good search tool that utilises an LLM but for now a plain model just isn’t designed for that. There are a ton of other uses for them, so I don’t think that we should discount them entirely based on their ability to output citations.
Category 1: people who don't like to admit that anything trendy can also be good at what it does.
Category 2: people who don't like to admit that anything made by for-profit tech companies can also be good at what it does.
Category 3: people who don't like to admit that anything can write code better than them.
Category 4: people who don't like to admit that anything which may put people out of work who didn't deserve to be put out of work, and who already earn less than the people creating the thing, can also be good at what it does
Category 5: people who aren't using llms for things they are good at
Category 6: people who can't bring themselves to communicate with AIs with any degree of humility
Category 7: people to whom none of the above applies
I once was in an environment where A got along with everyone, and B was hated by everyone else except for A. This wasn't because B saw qualities in A that no one else recognized; it was just that A was oblivious to/wasn't personally affected by all the valid reasons why everyone else disliked B. A to an extent thought of themselves as being able to see the good in B, but in reality they simply lacked the understanding of the effects of B's behavior on others.
Write code to pull down a significant amount of public data using an open API. (That took about 30 seconds - I just gave it the swagger file and said “here’s what I want”)
Get the data (an hour or so), clean the data (barely any time, gave it some samples, it wrote the code), used the cleaned data to query another API, combined the data sources, pulled down a bunch of PDFs relating to the data, had the AI write code to use tesseract to extract data from the PDFs, and used that to build a dashboard. That’s a mini product for my users.
I also had a play with Mistral’s OCR and have tested a few things using that against the data. When I was out walking my dogs I thought about that more, and have come up with a nice workflow for a problem I had, which I’ll test in more detail next week.
That was all while doing an entirely different series of tasks, on calls, in meetings. I literally checked the progress a few times and wrote a new prompt or copy/pasted some stuff in from dev tools.
For the calls I was on, I took the recordings of those calls, passed them into my local Whisper instance, fed the transcript into Claude with a prompt I use to extract action points, pasted those into a Google doc, and circulated them.
One of the calls was an interview with an expert. The transcript + another prompt has given me the basis for an article (bulleted narrative + key quotes) - I will refine that tomorrow, and write the article, using a detailed prompt based on my own writing style and tone.
I needed to gather data for a project I’m involved in, so had Claude write a handful of scrapers for me (HTML source > here is what I need).
I downloaded two podcasts I need to listen to - but only need to listen to five minutes of each - and fed them into whisper then found the exact bits I needed and read the extracts rather than listening to tedious podcast waffle.
I turned an article I’d written into an audio file using elevenlabs, as a test for something a client asked me about earlier this week.
I achieved about three times as much today as I would have done a year ago. And finished work at 3pm.
So yeah, I don’t understand why people are so bullish about LLMs. Who knows?
Where do you think expert analysis comes from?
Talk to experts, gather data, synthesize, output. Researchers have been doing this for a long time. There's a lot of grunt work LLM's can really help with, like writing scripts to collect data from webpages.
However, as this thread demonstrates repeatedly, using LLMs effectively is about knowing what questions to ask, and what to put into the LLM alongside the questions.
The people who pay me to do what I do could do it themselves, but they choose to pay me to do it for them because I have knowledge they don’t have, I can join the dots between things that they can’t, and I have access to people they don’t have access to.
AI won’t change any of that - but it allows me to do a lot more work a lot more quickly, with more impact.
So yeah, at the point that there’s an AI model that can find and select the relevant datasets, and can tell the user what questions to ask - when often they don’t know the questions they need to have answered, then yes, I’ll be out of a job.
But more likely I’ll have built that tool for my particular niche. Which is more and more what I’m doing.
AI gives me the agency to rapidly test and prototype ideas and double down on the things that work really well, and refine the things that don’t work so brilliantly.
The data extraction via tesseract worked too.
The whisper transcript was pretty good. Not perfect, but when you do this daily you are easily able to work around things.
The summaries of the calls were very useful. I could easily verify those because I was on the calls.
The interview - again, the transcript is great. The bulleted narrative was guided - again - by me having been on the call. I verify the quotes against the transcript, and the audio if I’ve got any doubts.
Scrapers - again, they worked fine. The LLM didn’t misinterpret anything.
Podcasts - as before. Easy.
Article to voice - what’s to misinterpret?
Your criticism sounds like a lot of waffle with no understanding of how to use these tools.
But even if I was: because I do this multiple times a day, and have been for quite some time, I know how to check for errors.
One part of that is a “fact check” built into the prompt, another part is feeding the results of that prompt back into the API with a second prompt and the source material and asking it to verify that the output of the first prompt is accurate.
However the level of hallucination has dropped massively over time, and when you’re using LLMs all the time you quickly become attuned to what’s likely to cause them and how to mitigate them.
I don’t mean this in an unpleasant way but this question - and many of the other comments responding to my initial description of how I use LLMs - feel like the story is things that people who have slightly hand wavey experience of LLMs think, having played with the free version of ChatGPT back in the day.
Claude 3.7 is far removed from ChatGPT at launch, and even now ChatGPT feels like a consumer-facing product while Claude 3.7 feels like a professional tool.
And when you couple that with detailed tried and tested prompts via the api in a multistage process, it is incredibly powerful.
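The two-stage check described above can be sketched like this, with `llm` left as an abstract callable (any API client wrapper would do). The prompts are illustrative, not the commenter's actual ones:

```python
def verified_answer(llm, source: str, question: str) -> tuple[str, str]:
    """Two-pass pattern: generate an answer, then feed it back with the
    source material and ask the model to verify its own output."""
    draft = llm(
        f"Source material:\n{source}\n\n"
        f"Task: {question}\n"
        "Fact-check your answer against the source before replying."
    )
    verdict = llm(
        f"Source material:\n{source}\n\n"
        f"Claimed answer:\n{draft}\n\n"
        "Verify that every claim above is supported by the source. "
        "Reply VERIFIED or list the unsupported claims."
    )
    return draft, verdict

# Usage with a trivial stub in place of a real model client:
fake = lambda prompt: "VERIFIED" if "Verify" in prompt else "Paris"
draft, verdict = verified_answer(fake, "France's capital is Paris.", "Capital of France?")
print(draft, verdict)  # Paris VERIFIED
```

The second pass isn't a guarantee, but in practice it catches a lot of hallucinations cheaply because the verifier sees the source and the claim side by side.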
It's a hammer -- sometimes it works well. It summarizes the user reviews on a site... cool, not perfect, but useful.
And like every tool, it is useless for 90% of life's situations.
And I know when it's useful because I've already tried a hammer on 1000 things and have figured out what I should be using a hammer on.
Forget bug fixes and new feature rollouts, every department and product team at Microsoft needs to add Copilot. Microsoft customers MUST jump on the AI-bandwagon!
If someone says, "This new type of hammer will increase productivity in the construction industry by 25%", it's something else in addition to being a tool. It's either a lie, or it's an incredible advance in technology.
Why do we burden ourselves with such expectations on us? Look at cities like Dallas. It is designed for cars. Not for human walking. The buildings are far apart, workplaces are far away from homes and everything looks like designed for some king-kong like creatures.
The burden of expectations on humans is driven by technology. Technology makes you work harder than before. It didn't make your life easier. Check how hectic life has become for you now vs a laid-back village peasant a century back.
The bullishness on LLMs is betting on this trend of self-inflicted human agony and dependency on tech. Man is going back to the cradle. LLMs are the milk feeder.
It's 100x easier to see how LLMs change everything. It takes very little vision to see what an advancement they are. I don't understand how you can NOT be bullish about LLMs (whether you happen to like them or not is a different question).
Or how about computer graphics? Early efforts to move 3D graphics hardware into the PC realm were met with extreme skepticism by my colleagues who were “computer graphics researchers” armed with the latest Silicon Graphics hardware. One researcher I was doing some work with in the mid-nineties remarked about PC graphics at the time: “It doesn’t even have a frame buffer. Look how terrible the refresh rate is. It flickers in a nauseating way.” Etc.
It’s interesting how people who are actual experts in a field where there is a major disruption going on often take a negative view of the remarkable new innovation simply because it isn’t perfect yet. One day, they all end up eating their words. I don’t think it’s any different with LLMs. The progress is nothing short of astonishing, yet very smart people continue to complain about this one issue of hallucination as if it’s the “missing framebuffer” of 1990s PC graphics…
On one hand, LLMs are overhyped and not delivering on promises made by their biggest advocates.
On the other hand, any other type of technology (not so overhyped) would be massively celebrated in significantly improving a subset of niche problems.
It’s worth acknowledging that LLMs do solve a good set of problems well, while also being overhyped as a silver bullet by folks who are generally really excited about its potential.
Reality is that none of us know what the future is, and whether LLMs will have enough breakthroughs to solve more problems then today, but what they do solve today is still very impressive as is.
But even if it does work, you still need to doublecheck everything it does.
Anyway, my RPG group is going to try roleplaying with AI generated content (not yet as GM). We'll see how it goes.
> Eisegesis is "the process of interpreting text in such a way as to introduce one's own presuppositions, agendas or biases". LLMs feel very smart when you do the work of making them sound smart on your own end: when the interpretation of their output has a free parameter which you can mentally set to some value which makes it sensible/useful to you.
> This includes e. g. philosophical babbling or brainstorming. You do the work of picking good interpretations/directions to explore, you impute the coherent personality to the LLM. And you inject very few bits of steering by doing so, but those bits are load-bearing. If left to their own devices, LLMs won't pick those obviously correct ideas any more often than chance.
I think LLMs work best when they are used as a "creative" tool. They're good for the brainstorming part of a task, not for the finishing touches.
They are too unreliable to be put in front of your users. People don't want to talk to unpredictable chatbots. Yes, they can be useful in customer service chats because you can put them on rails and map natural language to predetermined actions. But generally speaking I think LLMs are most effective when used _by_ someone who's piloting them instead of wrapped in a service offered _to_ someone.
I do think we've squeezed 90%+ of what we could from current models. Throwing more dollars of compute at training or inference won't make much difference. The next "GPT moment" will come from some sufficiently novel approach.
LLMs did the "styling" first. They generate high-quality language output of the sort most of us would take as a sign of high intelligence and education in a human. A human who can write well can probably reason well, and probably even has some knowledge of facts they write confidently about.
> I use GPT, Grok, Gemini, Mistral etc every day in the hope they'll save me time searching for information and summarizing it.
Even worse, you're continually waiting for it to get better. If the present is bright and the future is brighter, bullishness is justified.
She became more of an influencer than a scientist. And that is nothing wrong with that unless she doesn't try to pose as an authority on subjects she doesn't have a clue. It's OK to have an opinion as an outsider but it's not OK to pretend you are right and that you are an expert on every scientific or technical subject that happens to make you want to make a tweet about.
I don't believe OP's thesis is properly backed by the rest of her tweet, which seems to boil down to "LLMs can't properly cite links".
If LLMs performing poorly on an arbitrary small-scoped test case makes you bearish on the whole field, I don't think that falls on the LLMs.
As soon as LLMs were introduced into the IDE, it began to feel like LLM autocomplete was almost reading my mind. With some context built up over a few hundred lines of initial architecture, autocomplete now sees around the same corners I am. It’s more than just “solve this contrived puzzle” or “write snake”. It combines the subject matter use case (informed by variable and type naming) with the underlying architecture, and sometimes produces really breathtaking and productive results. Like I said, it took some time, but when it happened, it was pretty shocking.
If I see what Copilot suggests most of the time, I would be very uncomfortable using it for vibe coding though. I think it's going to be... entertaining watching this trend take off. I don't really fear I'm going to lose my job soon.
I'm skeptical that you can build a business on a calculator that's wrong 10% of the time when you're using it 24/7. You're gonna need a human who can do the math.
New technologies typically require multiple generations of refinement—iterations that optimize hardware, software, cost-efficiency, and performance—to reach mainstream adoption. Similarly, AI, Large Language Models (LLMs), and Machine Learning (ML) technologies are poised to become permanent fixtures across industries, influencing everything from automotive systems and robotics to software automation, content creation, document review, and broader business operations.
Considering the immense volume of new information generated and delivered to us constantly, it becomes evident that we will increasingly depend on automated systems to effectively process and analyze this data. Current challenges—such as inaccuracies and fabrications in AI-generated content—parallel the early imperfections of digital photography. These issues, while significant today, represent evolutionary hurdles rather than permanent limitations, suggesting that patience and continuous improvement will ultimately transform these AI systems into indispensable tools.
I don't think you can even be bullish or bearish about this tech. It's here and it's changing pretty much every sector you can think of. It would be like saying you're not bullish about the Internet.
I honestly can't imagine life without one of these tools. I have a subscription to pretty much all of them because I get so excited to try out new models.
The Internet is great, but it did not usher in a golden age utopia for mankind. So it was certainly possible to overhype it.
Is that "bullish on LLMs" or not?
Joining the mess of freeform, redundant, and sometimes self contradicting data into JSON lines, and feeding it into AI with a big explicit prompt containing example conversions and corrections for possible pitfalls has resulted in almost magically good output. I added a 'notes' field to the output and instructed the model to call out anything unusual and it caught lots of date typos by context, ambiguously attributed notes, and more.
It would have been a man month or so of soul drowningly tedious and error prone intern level work, but now it was 40 minutes and $15 of Gemini usage.
So, even if it's not a galaxy brained super intelligence yet, it is a massive change to be able to automate what was once exclusively 'people' work.
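The prompt construction described above might look roughly like this; the records and the few-shot example are made up for illustration:

```python
import json

# Hypothetical messy records of the kind described above.
records = [
    {"customer": "Acme Corp", "date": "2024-13-02", "note": "renewal??"},
    {"customer": "acme corp.", "date": "2024-03-12", "note": "renewal confirmed"},
]

# One explicit example conversion, including a correction for a known pitfall.
FEW_SHOT = """Example input: {"customer": "Foo Inc", "date": "2023-02-31", "note": "paid"}
Example output: {"customer": "Foo Inc", "date": "2023-02-28", "notes": "day 31 does not exist in February; assumed end of month"}"""

def build_prompt(record: dict) -> str:
    """One JSON line per record, plus an instruction to flag anything
    unusual in a 'notes' field rather than silently guessing."""
    return (
        "Normalize this record to clean JSON. Use a 'notes' field to call "
        "out anything unusual (impossible dates, ambiguous names, etc).\n"
        f"{FEW_SHOT}\n"
        f"Input: {json.dumps(record)}\nOutput:"
    )

prompt = build_prompt(records[0])
print("2024-13-02" in prompt)  # True: the typo'd month is in the prompt for the model to catch
```

The 'notes' field is the key trick: it gives the model a sanctioned place to surface date typos and ambiguous attributions instead of papering over them.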
When I use LLM’s to explore applications of cutting edge nonlinear optics, I too am appalled about the quality of the output. When I use an LLM to implement a React program, something that has been done hundreds of times before by others, I find it performs well.
The present path of AI is nothing short of revolutionary; a lot of jobs and industries are going to suffer a major upheaval, and a lot of people are just living in some wishful-thinking moment where it will all go away.
I see people complaining it gives them bad results. Sure it does, but so does all the other information we consume. It’s our job to check it ourselves. Still, the amount of time it saves me, even if I have to correct it, is huge.
I can give an example that has nothing to do with work. I was searching for the smallest miniATX computer cases that would accept at least 3 HDDs (3.5”). The amount of time LLMs saved me is staggering.
Sure, there was one wrong result in the mix, and sure, I had to double check all the cases myself, but, just not having to go through dozens of cases, find the dimensions, calculate the volume, check the HDDs in difficult to read (and sometimes obtain) pages, saved days of work - yes I had done a similar search completely manually about 5 years ago.
This is a personal example, I also have others at work.
It’s truly revolutionary and it’s just starting.
This isn't a technology problem, it's a product problem - and one that may not be solvable with better models alone.
Another issue: people communicate uncertainty naturally. We say "maybe", "it seems", "I'm not sure, but...". LLMs suppress that entirely, for structural reasons. The output sounds confident and polished, which warps perception - especially when the content is wrong.
Also, my gf, who's not particularly tech savvy, relies heavily on ChatGPT for her work. It's very useful for a variety of text tasks (translation, summaries, answering some emails).
Maybe Sabine Hossenfelder tries to use them for things they can't do well and she's not aware that they work for other use cases.
Yeah. That was my first thought. There’s probably orders of magnitude more training data for software engineering than for theoretical physics (her field). Also, how much of software engineering is truly novel? Probably someone else has already come up with a decent solution to your problem, it’s “just” a matter of finding it.
This is the most common 'smart person' fallacy out there.
As for my 2 cents, LLMs can do sequence modeling and prediction tasks, so as long as a problem can be reduced to sequence modeling (which is a lot of them!), they can do the job.
This is like saying that the Fourier Transform is played out because you can only do so much with manipulating signal frequencies.
But even its creators, who acknowledge it is not AGI, are trying to use it as if it were. They want to sell you LLMs as "AI" writ large; that is, they want you to use it as your research assistant, your secretary, your lawyer, your doctor, and so on. LLMs on their own simply cannot do those tasks. They are great for other uses: troubleshooting, assisting with creativity and ideation, prototyping concepts, and correlating lots of information, so long as a human then verifies the results.
LLMs right now are flour, sugar, and salt, mixed in a bowl and sold as a cake. Because they have no reasoning capability, only rote generation via prediction, LLMs cannot process contextual information the way required for them to be trustworthy or reliable for the tasks people are trying to use them for. No amount of creative prompting can resolve this totally. (I'll note that I just read the recent Anthropic paper, which uses terms like "AI biology" and "concept" to imply that the AI has reasoning capacity - but I think these are misused terms. An LLM's "concept" of something bears no referent to the real world, only a set of weights to other related concepts.)
What LLMs need is some sort of intelligent data store, tuned for their intended purpose, that can generate programmatic answers for the LLMs to decipher and present. Even then, their tendency to hallucinate makes things tough - they might imagine the user requested something they didn't, for instance. I don't have a clear solution to this problem. I suspect whoever does will have solved a much bigger, more complex problem than the already massive one that LLMs have solved, and if they are able to do so, will have brought us much, much closer to AGI.
I am tired of seeing every company under the sun claim otherwise to make a buck.
I haven't tried using LLMs for much else, but I am curious as long as I can run it on my own hardware.
I also totally get having a problem with the massive environmental impact of the technology. That's not AI's fault per se, but it's a valid objection.
I’ve recently been using Gemini (mostly 2.0 flash) a lot and I’ve noticed it sometimes will challenge me to try doing something by myself. Maybe it’s something in my system prompt or the way I worded the request itself. I am a long time user of 4o so it felt annoying at first.
Since my purpose was to learn how to do something, being open minded I tried to comply with the request, and I can say that… it's been a really great experience in terms of retention of knowledge. Even if I'm making mistakes, Gemini will point them out and explain them nicely.
One is - Google, Facebook, OpenAI, Anthropic, Deepseek etc. have put a lot of capital expenditure into training frontier large language models, and are continuing to do so. There is a current bet that growing the size of LLMs, with more or maybe even synthetic data, with some minor breakthroughs (nothing as big as the AlexNet deep learning breakthrough, or transformers), will have a payoff for at least the leading frontier model. Similar to Moore's law for ICs, the bet is that more data and more parameters will yield a more powerful LLM - without that much more innovation needed. So the question for this is whether the capital expenditure for this bet will pay off.
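The "more data, more parameters" bet has a concrete empirical form: the compute-optimal scaling law from the Chinchilla paper (Hoffmann et al., 2022), which models loss as a function of parameter count and training tokens, roughly:

```latex
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Here $N$ is parameter count, $D$ is training tokens, and $E$ is an irreducible loss floor; the fitted exponents (roughly $\alpha \approx 0.34$, $\beta \approx 0.28$) are what the bet rests on - each doubling of scale buys a predictable but shrinking loss reduction, so whether that curve keeps paying for itself is exactly the capital-expenditure question.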
Then there's the question of how useful current LLMs are, whether we expect to see breakthroughs at the level of Alexnet or transformers in the coming decades, whether non-LLM neural networks will become useful - text-to-image, image-to-text, text-to-video, video-to-text, image-to-video, text-to-audio and so on.
So there's the business side question, of whether the bet that spending a lot of capital expenditure training a frontier model will be worth it for the winner in the next few years - with the method being an increase in data, perhaps synthetic data, and increasing the parameter numbers - without much major innovation expected. Then there's every other question around this. All questions may seem important but the first one is what seems important to business, and is connected to a lot of the capital spending being done on all of this.
It's cliche at this point to say "you're using it wrong" but damn... it really is a thing. It's kind of like how some people can find something online in one Google query and others somehow manage to phrase things just wrong enough that they struggle. It really is two worlds. I can have AI pump out 100k tokens with a nearly 0% error rate, meanwhile my friends with equally high engineering skill struggle to get AI to edit 2 classes in their codebase.
There are a lot of critical skills and a lot of fluff out there. I think the fluff confuses things further. The variety of models and model versions confuses things EVEN MORE! When someone says "I tried LLMs and they failed at task xyz" ... what version was it? How long was the session? How did they prompt it? Did they provide sufficient context around what they wanted performed or answered? Did they have the LLM use tools if that is appropriate (web/deepresearch)?
It's never a like-for-like comparison. Today's cutting-edge models are nothing like even 6-months ago.
Honestly, with models like Claude 3.7 Sonnet (thinking mode) and OpenAI o3-mini-high, I'm not sure how people fail so hard at prompting and getting quality answers. The models practically predict your thoughts.
Maybe that's the problem, poor specifications in (prompt), expecting magic that conforms to their every specification (out).
I genuinely don't understand why some people are still pessimistic about LLMs.
We are going through a societal change. There will always be the people who reject AI no matter the capabilities. I'm at the point where if ANYTHING tells me that it's conscious... I just have to believe them and act accordingly to my own morals
Their strengths and flaws differ from our brains, to be sure, but some of these flaws are being mitigated and improved on by the month. Similarly, unaided humans cannot operate successfully in many situations. We build tools, teams, and institutions to help us deal with them.
Including the arrogance to confidently deliver a wrong answer. Which is the opposite of the reasons we use computers in the first place. Why this is worth billions of dollars is utterly beyond me.
> unaided humans cannot operate successfully in many situations
Absolute nonsense driven by a total lack of historical perspective or knowledge.
> We build tools, teams, and institutions to help us deal with them.
And when they lie to us we immediately correct that problem or disband them recognizing that they are more trouble than they could be worth.
> Absolute nonsense driven by a total lack of historical perspective or knowledge.
An LLM can give you a list of examples:
Historical Examples:
- During historical epidemics, structured record-keeping and statistical analysis (such as John Snow’s cholera maps in 1854) significantly improved outcomes.
- Development of physics, architecture, and engineering depended heavily on tools such as abacus, logarithmic tables, calculators, slide rules, to supplement human cognitive limitations.
- Astronomical calculations in ancient civilizations (Babylonian, Greek, Mayan) depended heavily on abacuses, tables, and other computational tools.
- The pyramids in ancient Egypt required extensive use of tools, mathematics, coordinated human labor, and sophisticated organization.
For people interested in understanding the possibilities of LLM for use in a specific domain see The AI Revolution in Medicine: GPT-4 and Beyond by Peter Lee (Microsoft Research VP), Isaac Kohane (Harvard Biomedical Informatics MD) et al. It is an easy read showing the authors systematic experiments with using the OpenAI models via the ChatGPT interface for the medical/healthcare domain.
For a current-status follow-up to the above book, here is Peter Lee's podcast series The AI Revolution in Medicine, Revisited - https://www.microsoft.com/en-us/research/story/the-ai-revolu...
Instead of reading trivial blogs/tweets etc. which are useless, read the above to get a much better idea of an LLM's actual capabilities.
https://www.nerdwallet.com/article/investing/bullish-vs-bear...
The current discussion about LLMs guarantees that both positive and negative expectations are a valid title for an article xD
'But you're such a killjoy.'
Yes, it is an evil technology in its current shape. So we should focus on fixing it, instead of making it worse.
But I don't understand how you can come to this conclusion when using SOTA models like Claude Sonnet 3.7, it's response has always been useful and when it doesn't get it right first time you can keep prompting it with clarifications and error responses. On the rare occasion it's unable to get it right, I'm still left with a bulk of useful code that I can manually fix and refactor.
Either way my interactions with Sonnet is always beneficial. Maybe it's a prompt issue? I only ask it to perform small, specific deterministic tasks and provide the necessary context (with examples when possible) to achieve it.
I don't vibe code or unleash an LLM on an entire code base since the context is not large enough and I don't want it to refactor/break working code.
I just updated my company's commercial PPT. ChatGPT helped me with: - Deep Research on great examples and references of such presentations. - Restructuring my argument and slides according to some articles I found in the previous step and thought were pretty good. - Coming up with copy for each slide. - Iterating on new ideas as I was progressing.
Now, without proper context and grounding, LLMs wouldn't be so helpful at this task, because they don't know my company, clients, product and strategy, and would be generic at best. The key: I provided it with my support portal documentation and a brain dump I recorded to text on ChatGPT with key strategic information about my company. Those are two bits of info I keep always around, so ChatGPT can help me with many tasks in the company.
From that grounding to the final PPT, it's pretty much a trivial and boring transformation task that would have cost me many, many hours to do.
An LLM can do some pretty interesting things, but the actual applicability is narrow. It seems to me that you have to know a fair amount about what you're asking it to do.
For example, last week I dusted off my very rusty coding skills to whip up a quick and dirty Python utility to automate something I'd done by hand a few too many times.
My first draft of the script worked, but was ugly and lacked any trace of good programming practices; it was basically a dumb batch file, but in Python. Because it worked part of me didn't care.
I knew what I should have done -- decompose it into a few generic functions; drive it from an intelligent data structure; etc -- but I don't code all the time anymore, and I never coded much in Python, so I lack the grasp of Python syntax and conventions to refactor it well ON MY OWN. Stumbling through with online references was intellectually interesting, but I also have a whole job to do and lack the time to devote to that. And as I said, it worked as it was.
But I couldn't let it go, and then had the idea "hey, what if I ask ChatGPT to refactor this for me?" It was very short (< 200 lines), so it was easy to paste into the Chat buffer.
Here's where the story got interesting. YES, the first pass of its refactor was better, but in order to get it to where I wanted it, I had to coach the LLM. It took a couple passes through before it had made the changes I wanted while still retaining all the logic I had in it, and I had to explicitly tell it "hey, wouldn't it be better to use a data structure here?" or "you lost this feature; please re-add it" and whatnot.
In the end, I got the script refactored the way I wanted it, but in order to get there I had to understand exactly what I wanted in the first place. A person trying to do the same thing without that understanding wouldn't magically get a well-built Python script.
The tech isn't there yet, clearly. And stock valuations are over the board way too much. But, LLMs as a tech != the stock valuations of the companies. And, LLMs as a tech are here to stay and improve and integrate into everyday life more and more - with massive impacts on education (particularly K-12) as models get better at thinking and explaining concepts for example.
1. AI inventing false information that then gets built into their foundational knowledge.
2. There is a lot less problem solving for them once they are used to AI.
I think education fundamentally needs to take AI, or the current LLM chatbots, seriously and start asking and planning how to react. We have already witnessed Gen Z, raised in the era of Google, thinking they know everything, and if not, that they can google it; thinking "they know it all," only to be battered in the real world.
AI may make it even worse.
300/5290 functions decompiled and analyzed in less than three hours off of a huge codebase. By next weekend, a binary that had lost source code will have tests running on a platform it wasn't designed for.
On one hand, every new technology that comes about unregulated creates a set of ethical and in this particular case, existential issues.
- What will happen to our jobs?
- Who is held accountable when that car navigation system designed by an LLM went haywire and caused an accident?
- What will happen with education if we kill all entry level jobs and make technical skills redundant?
In a sense they're not new concerns in science, we research things to make life easier, but as technology advances, critical thinking takes a hit.
So yeah, I would say people are still right to be wary and 'bearish' on LLMs, as that's the normal reaction to disruptive technology, and it will help us create adequate regulations to safeguard the future.
https://www.youtube.com/watch?v=70vYj1KPyT4 https://www.youtube.com/watch?v=6P_tceoHUH4 https://www.youtube.com/watch?v=nJjPH3TQif0
She's youtuber who is happy to take whatever position she thinks will get her the most views and ad revenue, all while crying woe is me about how the scientific establishment shuns her.
I do not understand how you can be bearish on LLMs. Data analysis, data entry, agents controlling browsers, browsing the web, doing marketing, doing much of customer support, writing BS React code for a promo that will be obsolete in 3 months anyway.
The possibilities are endless, and almost every week, there is a new breakthrough.
That being said, OpenAI has no moat, and there definitely is a bubble. I'm not bullish on AI stocks. I'm bullish on the tech.
LLMs are the most impactful technology we've had since the internet, that is why people are bullish on them; anyone who fails to see that probably cannot tie their own shoes without a "peer-reviewed" mechanism, lol.
Also, I like to think for myself. Writing code and thinking through what I am writing often exposes edge cases that I wouldn’t otherwise realize.
1) LLMs are a wonderous technology, capable of doing some really ingenious things
2) The hundreds of billions spent on them will not meet a positive ROI
Bonus 3) they are not good at everything & they are very bad at some things, but they are sold as good at everything.
Sure, it does middle-of-the-road stuff, but it comments the code well, I can manually tweak things at the various levels of granularity to guide it along, and the design doc is on par with something a senior principal would produce.
I do in a week what a team of four would take a month and a half to do. It's insane.
Sure, don't be bullish. I'm frantically piecing together enough hardware to run a decent sized LLM at home.
It's just another version of someone who is relatively incompetent but can produce something vaguely convincing.
Her primary work interest is in the truth, not the statistically plausible.
Her point is that using an LLM to generate truth is pointless, and that people should stop advertising LLMs as "intelligent", since, to a scientist, being "intelligent" and being "dead wrong" are polar opposites.
Other use cases have feedback loops - it does not matter so much if Claude spits out wrong code, provided you have a compiler and automated tests.
Scientists _are_ acting as compilers to check truth. And they rely on truths compiled by other scientists, just like your programs rely on code written by other people.
What if I tell you that, from now on, any third party library that you call will _statistically_ work 76% of the time, and I have no clue what it does in the remaining X%? (I don't know what X is; I haven't asked ChatGPT yet.)
In the meantime, I still have to see a headline "AI X discovered life-changing new Y on its own" (the closest thing I know of is alpha fold, which I both know is apparently "changing the world of scientists", and yet feel has "not changed the world of your average joe, so far" - emphasis on the "so far" ) ; but I've already seen at least one headline of "dumb mistake made because an AI hallucinated".
I suppose we have to hope the trend will revert at some point? Hope, on a Friday...
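The 76% figure, hypothetical as it is, is worth taking seriously because statistical reliability compounds: chain several such components together and the odds of everything working collapse fast. A back-of-the-envelope sketch (0.76 is the commenter's made-up number, and the independence assumption is mine):

```python
# If each of n independent steps succeeds with probability p,
# the whole chain succeeds with probability p ** n.
def chain_reliability(p: float, n: int) -> float:
    return p ** n

for n in (1, 3, 5, 10):
    print(f"{n} steps: {chain_reliability(0.76, n):.1%}")
# A 10-step chain at 76% per step succeeds only about 6% of the time.
```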
They try to add a new feature or change some behavior in a large existing codebase and it does something dumb and they write it off as a waste of time for that use case. And that's understandable. But if they had tweaked the prompt just a bit it actually might've done it flawlessly.
It requires patience and learning the best way to guide it and iterate with it when it does something silly.
Although you undoubtedly will lose some time re-attempting prompts and fixing mistakes and poor design choices, on net I believe the frontier models can currently make development much more productive in almost any codebase.
Thank you Sabine. Every time I have mentioned Gemini is the worst, and not even worth of consideration, I have been bombarded with downvotes, and told I am using it wrong.
One is the worker's view, looking at AI to be a powerful tool that can leverage one's productivity. I think that is looking promising.
I don't really care for the chat bot to give me accurate sources. I care about an AI that can provide likely places to look for sources and I'll build the tool chain to lookup and verify the sources.
vs
"Bullish" The LLM is going to revolutionise human behaviour and thought and bring about a new golden age.
Former is justifiable, the latter is just reinforcing the bubble.
People are looking for perfect instead of better.
For those who have Twitter blocked.
From a sub-tweet:
>> no LLM should ever output a url that gives a 404 error. How hard can it be?
As a developer, I'm just imagining a server having to call up all the URLs to check that they still exist (and the extra costs/latency incurred there)... And if any URLs are missing, getting the AI to re-generate a different variant of the response, until you find one which does not contain the missing links.
And no, you can't do it from the client side either... It would just be confusing if you removed invalid URLs from the middle of the AI's sentence without re-generating the sentence.
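The validation half of "just check the links" is at least cheap to prototype on the host side, even if it doesn't solve the regeneration problem described above. A minimal sketch (the regeneration loop is deliberately left out; `HEAD` requests and a timeout keep the cost down, though some servers reject `HEAD`, so this would over-flag in practice):

```python
import re
import urllib.request
import urllib.error

# Rough URL matcher; good enough for scanning model output.
URL_RE = re.compile(r"https?://[^\s)\]>\"']+")

def dead_links(text: str, timeout: float = 5.0) -> list[str]:
    """Return URLs in `text` that do not answer with a 2xx/3xx status."""
    dead = []
    for url in URL_RE.findall(text):
        req = urllib.request.Request(url, method="HEAD",
                                     headers={"User-Agent": "link-checker"})
        try:
            urllib.request.urlopen(req, timeout=timeout)
        except (urllib.error.URLError, ValueError):
            # HTTPError (404 etc.) is a subclass of URLError.
            dead.append(url)
    return dead
```

If `dead_links` returns anything, the host could re-prompt the model with the offending URLs, which is exactly the extra latency and cost the parent comment is pointing at.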
You almost need to get the LLM to engineer/pre-process its own prompts in a way which guesses what the user is thinking in order to produce great responses...
Worse than that though... A fundamental problem of 'prompt engineering' is that people (especially non-tech people) often don't actually fully understand what they're asking. Contradictions in requirements are extremely common. When building software especially, people often have a vague idea of what they want... They strongly believe that they have a perfectly clear idea but once you scope out the feature in detail, mapping out complex UX interactions, they start to see all these necessary tradeoffs and limitations rise to the surface and suddenly they realize that they were asking for something they don't want.
It's hard to understand your own needs precisely; even harder to communicate them.
If I recall correctly, that is one of Dilbert's management axioms: if I don't understand it, it cannot be difficult.
And I have used the following response to pointy haired bosses on a couple of occasions ( though I don't recommend it ).
'If it's so easy - feel free to do it yourself'.
Very easy. We can replace the error handler with a bullshit generator, and these people will be satisfied, as the whole idea is bullshit by the way.
I use them almost daily in my job and get tremendous use out of them. I guess you could accuse me of lying, but what do I stand to gain from that?
I've also seem people claim that only people who don't know how to code or people doing super simple done a million times apps can get value out of LLMs. I don't believe that applies to my situation, but even if it did, so what? I do real work for a real company delivering real value, and the LLM delivers value to me. It's really as simple as that.
1. Asked ChatGPT for a table showing monthly daily max temp, rainfall in mm and numbers of rain days, for Vietnam, Cambodia and Thailand. And colour coded based on the temperatures. Then suggest two times of year, and a route direction, to hit the best conditions on a multi-week trip.
It took a couple of seconds, and it helpfully split Vietnam at Hanoi and HCM given their weather differences.
2. I'm trying to work out how I will build a chicken orchard - post material, spacing, mesh, etc. I asked ChatGPT for a comparison table of steel posts versus timber, and then to cost it out with varying scales of post spacing. Plus pros and cons of each, and likely effort to build. Again, it took a few seconds, including browsing local stores for indicative pricing.
On top of that, I've been even more impressed by a first week testing Cursor.
If the strongest model is the 3-month-old o3-based one, that's proof there were quite significant improvements in the last 2 years.
I can't name any other technology that improved as much in 2 years.
O1 and o1 pro helped me with filing tax returns and answered questions that (probably quite bad) tax accountants (and less smart models) weren't able to (of course I read the referenced laws; I don't trust the output either).
I am hoping that the LLM approach will face increasingly diminished returns however. So I am biased toward Sabine's griping. I don't want LLM to go all the way to "AGI".
I hope she is aware of the limited context window and ability to retrieve older tokens from conversations.
I have used LLMs for the exact same purpose she has, to summarize chapters or whole books and to find the source of a quote, both with success.
I think the key to a successful output lies in the way you prompt it.
Hallucinations should be expected though; as we all hopefully know, LLMs are more of an autocomplete than an intelligence, and we should stick to that mindset.
LLMs are a little bit magical but they are still a square peg. The fact they don't fit in a round hole is uninteresting. The interesting thing to debate is how useful they are at the things they are good at, not at the things they are bad at.
If people aren't linking the conversation, it's really hard to take the complaint seriously.
The keyword in title is "bullish". It's about the future.
Specifically I think it's about the potential of the transformer architecture & the idea that scaling is all that's needed to get to AGI (however you define AGI).
> Companies will keep pumping up LLMs until the day a newcomer puts forward a different type of AI model that will swiftly outperform them.
If I were to be cynical, I think we've seen over the last decade the descent of most of academia, humanities as much as natural sciences, to a rather poor state, drawing entirely on a self-contained loop of references without much use or interest. Especially in the natural sciences, one can today with little effort obtain an infinitely more insightful and, yes, accurate synthesis of the present state of a field from an LLM than 99% of popular science authors.
The best solution to hallucination and inaccuracy is to give the LLM mechanisms for looking up the information it lacks. Tools, MCP, RAG, etc are crucial for use cases where you are looking for factual responses.
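One common shape for this is a simple tool loop: instead of answering directly, the model may emit a tool call, the host executes it, and the result is fed back in. A hand-rolled sketch with a hypothetical `model` callable and a toy `lookup` backend standing in for real search/RAG (not any particular vendor's API):

```python
def lookup(query: str) -> str:
    """Stand-in for a real search / RAG backend."""
    kb = {"boiling point of water": "100 C at 1 atm"}
    return kb.get(query.lower(), "no result")

TOOLS = {"lookup": lookup}

def run(model, user_msg: str, max_steps: int = 5):
    """Drive the model, executing any tool calls it emits."""
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        reply = model(messages)  # plain string, or {"tool": ..., "args": ...}
        if isinstance(reply, dict) and reply.get("tool") in TOOLS:
            result = TOOLS[reply["tool"]](reply["args"])
            messages.append({"role": "tool", "content": result})
        else:
            return reply
    return "gave up after too many tool calls"
```

The point of the design is that factual content comes from the tool result, not from the model's weights; the model only has to decide when to look something up and how to phrase the answer.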
If you look at any company on earth, especially large ones, they all share the same line item as their biggest expense: labor. Any technology that can reduce that cost represents an overwhelmingly huge opportunity.
Also great for brainstorming and quick drafting grant proposals. Anything prototyping and quickly glued together I'll go for LLMs (or LLM agents). They are no substitute for your own brain though.
I'm also curious about the hallucinated sources. I've recently read some papers on using LLM-agents to conduct structured literature reviews, and they do it quite well and fairly reproducibly. I'm quite willing to build some LLM-agents to reproduce my literature review process in the near future since it's fairly algorithmic. Check for surveys and reviews on the topic, scan for interesting papers within, check sources of sources, go through A-tier conference proceedings for the last X years and find relevant papers. Rinse, repeat.
I'm mostly bullish because of LLM-agents, not because of using stock models with the default chat interface.
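That review process - start from surveys, scan for interesting papers, follow sources of sources - is essentially a bounded traversal of the citation graph, which is why it agent-ifies so cleanly. A structural sketch, where `fetch_references` and `is_relevant` are hypothetical stand-ins for the paper-database and LLM-judgment calls:

```python
from collections import deque

def literature_crawl(seed_surveys, fetch_references, is_relevant,
                     max_depth=2, max_papers=200):
    """Breadth-first crawl of the citation graph from survey papers."""
    seen, found = set(seed_surveys), []
    queue = deque((p, 0) for p in seed_surveys)
    while queue and len(found) < max_papers:
        paper, depth = queue.popleft()
        if is_relevant(paper):
            found.append(paper)
        if depth < max_depth:
            for ref in fetch_references(paper):
                if ref not in seen:
                    seen.add(ref)
                    queue.append((ref, depth + 1))
    return found
```

Because every paper the crawl returns was fetched from a real database rather than generated, hallucinated sources drop out of the loop by construction; the LLM only ranks and summarizes.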
Chat is such a “leaky” abstraction for LLMs
I think most people share the same negative experience as they only interact with LLMs through the chat UI by OpenAI and Anthropic. The real magic moment for me was still the autocompletion moment from the gh copilot.
We really should stop reinforcing our echo bubbles and learn from other people. And sometimes be cool in the face of criticism.
When I evaluate them against areas where I possess professional expertise, I become convinced LLMs produce the Gell-Mann amnesia effect for any area I don't know.
What is a good tutorial / training, to learn about LLMs from scratch ?
I guess that’s the main problem, even more so for non-developers and -tech people. The learning curve is too steep, and people don’t know where to start.
They are impressive for what they are… but do you know what they are? I do, and that’s why I’m not that hyped about them.
I've been looking at these recently but not specifically to solve "LLM issues".
Actually I think they work really well, in any case where you can detect errors, which the OP apparently can.
If you have tried to use LLMs and find them useless or without value, you should seriously consider learning how to correctly use them and doing more research. It is literally a skill issue and I promise you that you are in the process of being left behind. In the coming years Human + AI cooperatives are going to far surpass you in terms of efficiency and output. You are handicapping yourself by not becoming good at using them. These things can deliver massive value NOW. You are going to be a grumbling gray beard losing your job to 22 year old zoomers who spend 10 hours a day talking to LLMs.
If the future is only LLMs then we’re all cooked.
If the future is a hybrid, then those “grey beard” skills are a much higher barrier to entry than a few months tinkering with ChatGPT.
Please stop spreading FUD.
Just like in the 2000-2010's knowing how to effectively Google things (while undoubtedly a skill) wasn't what made someone economically valuable.
I mean I challenge you to show you're as skilled at using LLMs as Janus or Pliny the Liberator.
the bull case very obviously speaks for itself!
It’s like a parrot and if you know what you’re doing you can catch tons of mistakes.
He's a person with money and he wants AI programmers. I bet there are millions like him.
Don't get me wrong though, I do believe in a future with LLMs. But I believe they will become more and more specialized for specific tasks. The more general an AI is, the more it's likely to fail.
Also, people in general quickly adapt. LLMs are absolute sci-fi magic but you forget that easily. Here's a comedian's view on that phenomenon https://www.youtube.com/watch?v=nUBtKNzoKZ4
They excel in spitballing more than accurate citations.
With that said, I find that they are very helpful for a lot of tasks, and improve my productivity in many ways. The types of things that I do are coding and a small amount of writing that is often opinion-based. I will admit that I am somewhat of a hacker, and more broad than deep. I find that LLMs tend to be good at extending my depth a little bit.
From what I can tell, Sabine Hossenfelder is an expert in physics, and I would guess that she already is pretty deep in the areas that she works in. LLMs are probably somewhat less useful at this type of deep, fact-based work, particularly because of the issue where LLMs don't have access to paywalled journal articles. They are also less likely to find something that she doesn't know (unlike with my use cases, where they are very likely to find things that I don't know).
What I have been hearing recently is that it will take a long time before LLMs are better than humans at everything. However, they are already better than many, many humans at a lot of things.
1. Any low hanging fruits that could easily be solved by an LLM easily probably would have been solved by someone already using standard methods.
2. Humans and LLMs have to spend some particular amount of energy to solve problems. Now, there are efficiencies that can lower/raise that amount of energy, but at the end of the day, TANSTAAFL. Humans spend this in a lifetime of learning and eating, and LLMs spend this in GPU time and power. Even when AI gets to human level, it's never going to abstract this cost away; energy still needs to be spent to learn.
I have a very specific esoteric question like: "What material is both electrically conductive and good at blocking sound?" I could type this into google and sift through the titles and short descriptions of websites and eventually maybe find an answer, or I can put the question to the LLM and instantly get an answer that I can then research further to confirm.
This is significantly faster, more informative, more efficient, and a rewarding experience.
As others have said, its a tool. A tool is as good as how you use it. If you expect to build a house by yelling at your tools I wouldn't be bullish either.
I don't see how an LLM is significantly faster or more informative, since you still have to do the legwork to validate the answer. I guess if you're google-phobic (which a lot of people seem to be, especially on HN) then I can see how it's more rewarding to put it off until later in the process.
The validity of the answers is not 1:1 with its potential profitability.
Like James Baldwin said "people love answers, but hate questions."
Getting an answer faster is exponentially better than getting the more precise, more right, more nuanced answer for most people every time. Doing the due diligence is smart, but it's also after the fact.
idk if you have noticed, but Google is clearly using LLM technology in conjunction with its search results, so the assumption that they are just using traditional tech and not LLMs to inform or modify the result set is, I think, unlikely.
There are, but I haven’t found LLMs to be useful for them either.
Today, I had the question of “where can I buy firewood within a ten minute drive from my house, and what’s the cost at each place?”
There's no real way to get that info without going for a drive, or calling every potential location to ask.
The Google AI summary suggests MLV which is wrong.
ChatGPT suggests using copper which is also wrong.
I call bullshit on the entire affair.
A material that is both electrically conductive and good at blocking sound is:
Lead (Pb)
• Electrical conductivity: Lead is a metal, so it conducts electricity, although it's not the most conductive (lower than copper or silver).
• Sound blocking: Lead is excellent at blocking sound due to its high density and mass, which help attenuate airborne sound effectively.
Other options depending on application:
Composite materials:
• Metal-rubber composites or metal-polymer composites can be engineered to conduct electricity (via embedded conductive metal layers or fillers) and block sound (due to the damping properties of the polymer/rubber layer).
Graphene or carbon-filled rubber:
• Electrically conductive due to graphene/carbon content.
• Sound damping from rubber base.
• Used in some specialized industrial or automotive applications.
Let me know if you need it optimized for a specific use case (e.g., lightweight, flexible, non-toxic).
...........
This took me less than 10 seconds.
Pretty damn good if you ask me.
That’s… weird. Definitely doesn’t inspire confidence.
If that disappoints you to such a degree that you simply won't use them, you might find yourself in a position some years ahead - could be 1...could be 2...could be 5...could be 10 - who knows, but when the time comes, you might just be outdated and replaced yourself.
When you closely follow the incremental improvements of tech, you don't really fall for the same hype hysteria. If you on the other hand only look into it when big breakthroughs are made, you'll get caught in the hype and FOMO.
And even if you don't want to explicitly use the tools, at least try to keep some surface-level attention to the progress and improvements.
I honestly believe that there are many, many senior engineers and scientists out there who currently just scoff at these models and view them as some sort of toy tech that is completely overblown and overhyped. They simply refuse to use the tools. They'll point to some specific time an LLM didn't deliver, roll their eyes, and call it useless.
Then when these tools progress, and finally meet their standards, they will panic and scramble to get into the loop. Meanwhile their non-tech bosses and executives will see the tech as some magic that can be used to reduce headcount.
Just today, I was thinking of making changes to my home theater audio setup and there are many ways to go about that, not to mention lots of competing products, so I asked ChatGPT for options and gave it a few requirements. I said I want 5.1 surround sound, I like the quality and simplicity of Sonos, but I want separate front left and right speakers instead of “virtual” speakers from a soundbar. I waited years thinking Sonos would add that ability, but they never did. I said I’d prefer to use the TV as the hub and do audio through eARC to minimize gaming latency and because the TV has enough inputs anyway, so I really don’t need a full blown AV receiver. Basically just a DAC/preamp that can handle HDMI eARC input and all of the channels.
It proceeded to tell me that audio-only eARC receivers that support surround sound don’t really exist as an off-the-shelf product. I thought, “What? That can’t be right, this seems like an obvious product. I can’t be the first one to have thought of this.” Turns out it was right, there are some stereo DAC/preamps that have an eARC input and I could maybe cobble together one as a DIY project, but nothing exactly like what I wanted. Interesting!
ChatGPT suggested that it’s probably because by the time a manufacturer fully implements eARC and all of the format decoding, they might as well just throw in a few video inputs for flexibility and mass-market appeal, plus one less SKU to deal with. And that kind of makes sense, though it adds excess buttons and bothers me from a complexity standpoint.
It then suggested WISA as a possible solution, which I had never heard of, and as a music producer I pay a lot of attention to speaker technology, so that was interesting to me. I’m generally pretty skeptical of wireless audio, as it’s rarely done well, and expensive when it is done well. But WISA seems like a genuine alternative to an AV receiver for someone who only wants it to do audio. I’m probably going to go with the more traditional approach, but it was fun learning about new tech in a brainstorming discussion. Google struggles with these sorts of broad research queries in my experience. I may or may not have found out about it if I had posted on Reddit, depending on whether someone knowledgeable happened to see my post. But the LLM is faster and knows quite a bit about many subjects.
I also can’t remember the last time it hallucinated when having a discussion like this. Whereas, when I ask it to write code, it still hallucinates and makes plenty of mistakes.
What worked:
- generated a mostly working PoC with minimal input, improvising the UI layout, color scheme, etc. This is amazing because it did not bombard me with detailed questions. It just carried on and provided me with a baseline that I could then fine-tune
- corrected build issues when I simply copy-pasted the errors from Xcode
- got APIs working
- added debug code when it could not fix an issue after a few rounds
- resolved an API issue after I pointed it to a TypeScript SDK for the API (I literally gave it a link to the file and told it to use that to work out where the problem was)
- produces code very fast
What is not working great yet:
- it started off with one large file and crashed soon after because it hit a timeout when regenerating the file. I needed to ask it to split the code into files following a typical project structure
- some logic I asked it to implement explicitly got changed at some point during an unrelated task. To prevent this in future, I asked it to mark this code part as important and to change it only at explicit request. I don’t know yet how long this code will stay protected for
- by the time enough context has built up, usage warnings pop up in Claude
- only so many files are supported at the moment
So my takeaway is that it is very good at translating, i.e. API docs into code, and errors into fixes. There is also a fine line between providing enough context and running out of tokens.
I am planning to continue my project to see how far I can push it. As I am getting close to the limit of the token size now, I am thinking of structuring my app in a Claude friendly way:
- clear internal APIs, kind of like header files, so that I can tell Claude what functions it can use without allowing it to change them or needing to tokenize the full source code
- adversarial testing. I don’t have tests yet, but I am thinking of asking one dedicated instance of Claude to generate tests. I will use other Claude instances for coding and provide them with failing test outputs like I do now with build errors. I hope it will fix itself similarly.
https://euromaidanpress.com/2025/03/27/russian-propaganda-ne...
I will cite myself as Exhibit A. I am the sort of person who takes almost nothing at face value. To me, physiotherapy, and oenology, and musicology, and bed marketing, and mineral-water benefits, and very many other such things, are all obviously pseudoscience, worthy of no more attention than horoscopes. If I saw a ghost I would assume it was a hallucination caused by something I ate.
So it seems like no coincidence that I reflexively ignore the AI babble at the top of search results. After all, an LLM is a language-rehashing machine which (as we all know by now) does not understand facts. That's terribly relevant.
I remember reading, a couple of years back, about some Very Serious Person (i.e. a credible voice, I believe some kind of scientist) who, after a three-hour conversation with ChatGPT, had become convinced that the thing was conscious. Rarely have I rolled my eyes so hard. It occurred to me then that skepticism must be (even) less common a mindset than I assumed.
Maybe that's why?
Also, I find it disingenuous when apologists say things close to "you are using it wrong," while LLM-based AI is advertised as something to be trusted more and more (because it's more accurate, by some arbitrary metrics) and as a time-saver (on some undescribed tasks).
Of course, in that use case most would say to use your judgment and verify whatever is generated. But for a generation that uses LLMs as a source of knowledge (the way some people use Wikipedia or Stack Overflow as a source of truth), verification will be difficult when LLM-generated content is all they have ever known.
I hope that realization happens before "vibe coding" is accepted as standard practice by software teams (especially when you consider the poor quality of software before the LLM era). If not, it's only a matter of time before we refer to the internet as "something we used to enjoy."
The article makes a lot of good points. I get a lot of slop responses to both coding and non-coding prompts, but I've also gotten some really really good responses, especially code completion from Copilot. Even today, ChatGPT saved me a ton of Google searches.
I'm going to continue using it and taking every response with a grain of salt. It can only get better and better.
Wait until the investors want their returns
Uhm, I dismiss this statement here. If you call 4o the best, that means you haven't genuinely explored other models before making such claims...
saying "why are people bullish" only to continue with bullying does not add any clarity to this world
I just hope they keep feeling that way and avoid LLMs. Less competition for those of us who are using them to make our jobs/lives easier every day.
...that was the LLM responding, and it did not set an alarm.
Same with the top reply from Teortaxes containing zero relevant information, which Twitter in its infinite wisdom has decided is the “most relevant” reply. (The second “most relevant” reply is some ad for some bs crypto newsletter.)
It's a tool. It can be useful, it doesn't always work. Some people claim it's better than it is, some people claim it's worse. This isn't exactly rocket science.
Regardless of the tweet in question, Sabine is a grifter. Her novel takes, that academia is some kind of conspiracy of people milking the system and that physicists aren't interested in making new discoveries, are nonsensical and only serve to raise her own profile. Look at this video of her trying to convince the world she received an email that apparently proves all her points correct. My BS detector tells me she wrote that email herself, but you be the judge: https://www.youtube.com/watch?v=shFUDPqVmTg
I think it is unethical of you to post comments like this one without disclosing you have no expertise to judge and personal stakes that make you desire the article to be wrong.