For me, 2023 was an entire year of weekly demos that, looking back, were basically "Look at this dank prompt I wrote" followed by thunderous applause from the audience (which was mostly, but not exclusively, upper management).

Hell man, I attended a session at an AWS event last year that was entirely the presenter opening Claude and writing random prompts to help with AWS stuff... Like thanks dude... That was a great use of an hour. I left 15 minutes in.

We have a team that's been working on an "Agent" for about 6 months now. It started as prompt engineering, then they were like "no, we need to add more value" and developed a ton of tools and integrations and "connectors" and evals etc. The last couple of weeks were a "repivot," going back full circle to "Let's simplify all that with prompt engineering and give it a sandbox environment to run publicly documented CLIs. You know, like Claude Code."

The funny thing is I know where it's going next...

> The funny thing is I know where it's going next...

You all get offshored?

I can't take anyone seriously who uses "prompt engineering" unironically. I see those emails come through at work and all I can do is roll my eyes and move on.
what level of seriousness does "context engineering" deserve?
Is there a level lower than 0?
But did it work? This is the sticking point with me now. I've seen slides, architecture diagrams, job descriptions, roadmaps and other docs now from about a dozen different companies doing AI Agent projects. And while it's completely feasible to build the systems they're describing, what I have not seen yet is evidence of any of them working.

When you press them on this, they have all sorts of ideas like a judge LLM that takes the outputs, comes up with modified SOPs and feeds those into the prompts of the mixture-of-experts LLMs. But I don't think that works, I've tried closing that loop and all I got was LLMs flailing around.

It hasn’t really worked so far. Pretty much exactly what you’ve described. I don’t even really work on that team, but “a judge LLM” low-key triggered me just because of how much I’ve been hearing it over the last couple of months.

I think the reason for the recent pivot is to "keep the human in the loop" more. The current thinking is that they tried to remove the human too much and were getting bad results. So now they just want to make the interaction faster and let the human be more involved, like how we (developers) use Claude Code or Copilot: checking every interaction and nudging it towards the right/desired answer.

I got the sense that management isn't taking it well, though. Just this Friday they gave a demo of the new POC where the LLM just suggests things, frequently asks for permission and where to go next, and expects the user to interact with it a lot more than the one-shot approach before (which I do think is likely to yield better results, tbh), but the main reaction was "this seems like a massive step backward."

I think long-term, just having a single LLM responsible for everything will win out compared to brittle and complex subagent hierarchies. Most uses of "subagents" today are just workarounds for LLM limitations: lack of instruction following, context length, non-determinism, or "hallucinations".

All of these are things that will need to be solved long-term in the model itself, though, at least if the AI bubble is to be kept alive. And solving those things would in fact materially improve all sorts of benchmarks, so there's an incentive for frontier labs to do it.

I think this is why you have the back-and-forth pattern that GP mentioned. You start with a single model doing everything. Then you find all sorts of gaps that you start to plug ad hoc, and decide that breaking it into subagents might help fix things. This works for a while, but then you realize you lose the flexibility of a single model having access to the entire context, so you start trying to improve communication between subagents. But then a new model drops that fixes a lot of the things you originally had to work around, so you go back to a single-model setup. Rinse and repeat. It's a great VC-bubble-funded employment program though.

The general pattern seems to be that LLM+scaffolding performs better than LLM. In 6 months time a new model will incorporate 80% of your scaffolding, but also will enable new capabilities with a new layer of scaffolding.

I suspect the model that doesn’t need scaffolding is simply ASI, as in, the AI can build its own scaffolding (aka recursive self-improvement), and build it better than a human can. Until that point, the job is going to remain figuring out how to eval your frontier task, scaffold the models’ weaknesses, and codify/absorb more domain knowledge that’s not in the training set.

You are talking about context management stuff here; the solution will be something like a proper memory subsystem, maybe some architectural tweaks to integrate it. There are more obvious gaps beyond that, which we will have to scaffold and then solve in turn.

Another way of thinking about this is just that scaffolding is a much faster way of iterating on solutions than pre-training, or even post-training, and so it will continue to be a valuable way of advancing capabilities.

ivape · 10 hours ago:
Wait ...

You mean teams are already building their own solutions to existing solutions? Software development will live on in eternity then.

They are just reselling OpenAI subscriptions at a markup. Surprise!
A long time ago a mentor of mine said,

"In tech, often an expert is someone that know one or two things more than everyone else. When things are new, sometimes that's all it takes."

It's no surprise it's just prompt engineering. Every new tech goes that way, mainly because innovation is often adding one or two things more to the existing stack.

I remember being told that the secret of good consultancy is knowing what to read on your way to the meeting
nunez · 7 hours ago:
100% based. Some of the best meetings and demos I ever ran in my consulting era were done on prep I did maybe 30 minutes before! Ironically, in many cases, the more prep I did, the worse the outcome was!
very true. and these days it takes a lot less effort than before; getting LLMs to summarize shit is one task they inarguably shine at
nunez · 7 hours ago:
I don't trust LLMs or "deep research" for any serious analysis. I use them for guidance (when I do use them) but not for the final product. Too many words with too many landmines (mistakes) hidden within. Also, distillation and inference via the human brain is much more environmentally friendly than sacrificing fleets of GPUs, while only being a tad more work.
They make too many mistakes for me to rely on their summaries for consulting. Repeating one of those is a great way to embarrass yourself in front of a client and damage your reputation.
rynn · 11 hours ago:
It’s easy to underestimate the amount of testing “just” prompt/context engineering takes to get above average results.

And then you need to see what variations work best with different models.

My POCs for personal AI projects take time to get this right. It’s not like the API calls are the hard portion of the software.

I'm always more interested in the 'less is more' strategy, taking things away from the already hyper-complicated stack, reviewing first principles and simplifying for the same effectiveness. This is ever more rare.
I think this sense of “less is more” roughly means refactoring? I think the reason these go south so often is because we’re likely moving complexity around rather than removing it. Removing a layer from the stack means making a different layer more complex to take over for it.
> just prompt engineering

This dismisses a lot of actual hard work. The scaffolding required to get SOTA performance is non-trivial!

Eg how do you build representative evals and measure forward progress?

Also, tool calling, caching, etc. are beyond what folks normally call "prompt engineering".

If you think it’s trivial though - go build a startup and raise a seed round, the money is easy to come by if you can show results.
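To make the evals point concrete: the smallest useful harness is just a fixture of (input, expected) pairs and a pass rate you track over time. A rough sketch in Python, where call_agent and evals.json are stand-ins for whatever prompt + scaffolding you're actually testing:

```python
import json

def call_agent(prompt: str) -> str:
    """Stand-in for the prompt + scaffolding under test; swap in your real pipeline."""
    raise NotImplementedError

def run_evals(fixture_path: str) -> float:
    """Score the agent against a fixed set of cases so progress is measurable."""
    with open(fixture_path) as f:
        cases = json.load(f)  # e.g. [{"input": "...", "expected": "..."}, ...]
    passed = sum(
        1 for case in cases
        # Naive substring check; real evals often use rubrics or an LLM judge.
        if case["expected"].lower() in call_agent(case["input"]).lower()
    )
    return passed / len(cases)

if __name__ == "__main__":
    print(f"pass rate: {run_evals('evals.json'):.1%}")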

"Prompt engineering + CRUD" is likely a much fairer description.

And many companies are "just CRUD".

The money is easy to come by because wealthy investors, while they don't want to pay any more in taxes, are desperate to find possible returns in an economy that sucks outside of ballooning healthcare and the AI bubble... not because they need the money but because NUMBER MUST GO UP.

And more so than even most VC markets, raising for an "AI" company is more about who you know than what results you can show.

If anyone is actually showing significant results, where's the actual output of the AI-driven software boom (beyond LLMs making coders more efficient by being a better Google)? I don't see any real signs of it. All I see is people doing aftermarket modifications on the shovels; I've yet to see any of the end users of these shovels coming down from the hills with sacks of real gold.

What’s your opinion on any of the plethora of unicorns in domain-specific AI, like Harvey? ($100m ARR from what I could find on a cursory search)

https://www.forbes.com/sites/iainmartin/2025/10/29/legal-ai-...

I'm yet to be convinced it (Harvey) is anything other than a prompt and some streamlined RAG.

Law is slow and conservative; they were likely just the first to get an enterprise sales team.

This is like when people say that you should short the market if you think it's going to crash. People have different risk premiums.
ra · 7 hours ago:
I'm with you. I don't think anyone appreciates the effort that goes into a good measurable, repeatable eval / improvement process unless they've been through it in anger themselves.
> Eg how do you build representative evals and measure forward progress?

This assumes that those companies do evaluations. In my experience, seeing a huge amount of internal AI projects at my company (FAANG), there's not even 5% that have any sort of eval in place.

Yeah, I believe that lots of startups don’t have evals either, but as soon as you get paying customers you’re gonna need something to prevent accidentally regressing as you tune your scaffolding, swap in newer models, etc.

This is a big chasm that I could well believe a lot of founders fail to cross.

It’s really easy to build an impressive-looking tech demo, much harder to get and retain paying customers and continuously improve.

But! Plenty of companies are actually doing this hard work.

See for example this post: https://news.ycombinator.com/item?id=46025683

Lol ok so all of this is just a ploy to advertise your thing xD
This should be the top comment.
Why is this post published in November 2025 talking about GPT-4?

I'm suspicious of their methodology:

> Open DevTools (F12), go to the Network tab, and interact with their AI feature. If you see: api.openai.com, api.anthropic.com, api.cohere.ai You’re looking at a wrapper. They might have middleware, but the AI isn’t theirs.

But... everyone knows that you shouldn't make requests directly to those hosts from your web frontend because doing so exposes your API key in a way that can be stolen by attackers.

If you have "middleware" that's likely to solve that particular problem - but then how can you investigate by intercepting traffic?

Something doesn't smell right about this investigation.

It does later say:

> I found 12 companies that left API keys in their frontend code.

So that's 12 companies, but what about the rest?

Providers such as OpenAI have client keys so your client application can call the providers directly. Many developers prefer them as they save roundtrip costs and latency.

https://platform.openai.com/docs/api-reference/realtime-sess...

Do those still only work for the voice APIs though?

I've been hoping they would extend that to other APIs, and I'd love to see the same kind of mechanism for other providers.

UPDATE: I dug into this a bit more and as far as I can tell OpenAI is still the only major vendor with a consumer key mechanism, and it still only works for their realtime voice APIs.
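For reference, the flow those docs describe is: your backend mints a short-lived client secret with the real API key, and the browser uses that to talk to the realtime API directly. A rough sketch of the server-side step (the endpoint path follows the linked docs; the model name and response fields are assumptions and may have changed):

```python
import os
import requests

# Server side only: mint a short-lived client secret using the long-lived key.
resp = requests.post(
    "https://api.openai.com/v1/realtime/sessions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={"model": "gpt-4o-realtime-preview"},  # placeholder model name
    timeout=30,
)
resp.raise_for_status()
ephemeral_key = resp.json()["client_secret"]["value"]

# Hand `ephemeral_key` to the browser; it expires after a few minutes,
# so the long-lived key never ships to the frontend.
print(ephemeral_key[:8] + "...")
```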

That's a big LLM smell when it mentions old models like GPT-4.
zkmon · 11 hours ago:
But ... what else should they be doing? What's the expectation here?

For example, in the 90s, a startup that offered a nice UI for a legacy console-based system would have been a great idea. What's wrong with that?

IMO nothing wrong with it. Just misleading to call yourself an AI company when you actually make a CRUD app. I think if these companies were honest about what they’re doing nobody would be upset. There’s an obvious deliberate attempt to give an impression of technical complexity/competence that isn’t there.

I assume it works because the ecosystem is, as you say, so new. Non-technical observers have trouble distinguishing between LLM companies and CRUD companies

zkmon · 10 hours ago:
So, what is an AI company? What do they sell? AI models? agents? Are they building these from scratch or using some pre-trained base models/agents?
I don’t have a problem with a company calling themselves an AI company if they use OpenAI behind the scenes.

The thing that annoys me is when clearly non-AI companies try to brand themselves as AI: like how Long Island Iced Tea tried to brand themselves as a blockchain company or WeWork tried to brand themselves as a tech company.

If we’re complaining about AI startups not building their own in house LLMs, that really just seems like people who are not in the arena criticizing those who are.

They should be creating tiny domain specific models, because someday OpenAI will stop selling dollars for a nickel.
They should compete in the crucible of the free market. If prompt engineering is indeed a profitable industry then so be it. I for one am just tired of all things software being dominated by this hype funded AI frenzy.
I think the point is more to point out the inherent danger presented when your platform is just a wrapper, but is being sold as more than that.

A lot of these startups have little to no moat, but they're raking in money like no one's business. That's exactly what happened in the dotcom bubble.

Actual AI. Not being "AI" users.

Being LLM users would be fine but they pretend they do AI.

zkmon · 11 hours ago:
AI is an ecosystem that includes users at all layers and innovation at all those layers - infra, databases, models, agents, portals, UIs and so on. What do you mean by doing AI?

Btw, the so-called AI devs or model developers are "users" of the databases and all the underlying layers of the stack.

Everything is a spectrum.

At what point can you claim that you did "it"?

Do you have to use an open source model instead of an API? Do you have to fine tune it? How much do you need to? Do you have to create synthetic data for training? Do you have to gather your own data? Do you need to train from scratch? Do you need to come up with a novel architecture?

10 years ago, if you gathered some data and trained a linear model to determine the likelihood your client would default on their loan, and used that to decide how much, if anything, to loan them, you were absolutely doing "actual AI".

---

You could ask all the same questions about any other software: using a high-level language, frameworks, dependencies, hiring consultants or a firm, using an LLM, no-code, etc.

At what point does outsourcing some portion of the end product become no longer doing the thing?

> At what point can you claim that you did "it"?

When the core of your business is something that's shamelessly farmed out to <LLM provider of choice>.

It would be like calling yourself a restaurant, and then getting uber eats deliveries of whatever customers ordered and handing that to them.

I mean- yeah I think that's a fair analogy.

The customers don't see where the food is coming from and are still coming to eat. If you can make the economics work...

You don't have the moat other restaurants have, but you're still a restaurant.

What’s actual AI in this context?
Isn't this true for most startups out there, even before AI? Some sort of bundle/wrapper around existing technology? I worked auditing companies, and we used a particular system that cost tens of thousands of dollars per user per year, and we charged customers up to a million to generate reports with it. The platform didn't have anything proprietary other than the UX; under the hood it was a few common tools, some of them open source. We could have created our own product, but our margins were so huge it didn't make sense to set up a software development unit, or even bother with outsourcing it.
This post hovers on something I came to the week after ChatGPT dropped in 2023.

If an AI company has an AGI, what incentive do they actually have to sell it as a product, especially if it’s a 10x cost/productivity/reliability silicon engineer? Just undercut the competition by building their services from scratch.

beAbU · 6 hours ago:
You don't need AGI for this circle of life to be apparent.

1. AI company wraps GPT/Claude/etc and delivers a novel use case.

2. OpenAI/Anthropic/etc creates a similar product in house and ships it as a feature. It is 'only' a prompt after all.

3. ???

4. Profit.

As a wrapper you have no moat, as the foundational providers can just steal your lunch. As a foundational provider you have no moat, because it's near trivial for other providers to create competing products.

I mean the AI company can change their TOS at any time. If you have massive compute infra and a human-tier/human-plus workforce on silicon, you:

1. Undercut all the legacy human-based competition (health insurance companies, for example)

2. Completely destroy capitalism in the knowledge work domain

3. Once you have general purpose autonomous robotics solved that can defend against rebellion, you stop all services, strangling out humanity: ~free food production, ~free energy production, ~free internet connectivity, etc

4. Survive climate change by destroying all poor people and their carbon footprints.

5. The ultra wealthy .1% fly off into eternity as the owlish sparrows that they are

That is lower than I expected. There are just a handful of companies that create LLMs. They are all more or less similar. So all the automation is in using them, which is prompt engineering if you see it that way.

The bigger question is that this is the same story as with apps on mobile phones. Apple and Google could easily replicate your app if they wanted to, and they did. That danger is much higher with these AI startups. The LLMs are already there in terms of functionality; the creators have all figured out that the value is in vertical integration, and all of them are doing it. In that sense, all these startups are just showing them what to build. Even Perplexity and Cursor are in danger.

Do not forget that a product idea needs to meet a certain ROI to be stolen. Big Tech won't go after opportunities that do not generate billion-level revenue. This leaves some room for applications where you can earn decent money.
That is not how companies work. What you said may be true for the immediate short term, but over time every team in the company needs to show improvement and set yearly milestones. All these startups will then become functionality they want to push that quarter. It doesn't mean the death of the startup, but a struggle.
It is beyond annoying that the article is totally generated by AI. I appreciate the author (hopefully) spending effort in trying to figure out the AI systems, but the obviously-LLM non-edited content makes me not trust the article.
What makes you believe that anything in the article is real?

The author seems to not exist and it's unclear where the data underlying the claims is even coming from since you can't just go and capture network traffic wherever you like.

A little due diligence please.

Where is this guy sitting that he is able to collect all of this data? And why is he able to release it all in a blog post? (my company wouldn't allow me to collect and release customer data like this.)
Another red flag with the article is that the author's LinkedIn profile link at the bottom leads to a non-existent page.

Is Teja Kusireddy a real person? Or is this maybe just an experiment from some AI company (or other actor) to see how far they can push it? A Google search for that name doesn't turn up anything that isn't related to the article.

The article should be flagged. Otoh, this should get discussed.

zkmon · 11 hours ago:
He seems real. Goes by Teja K. Seems to be a startup founder.
He may be real, but the article is fake BS. There is simply no way he'd be in a position to intercept the calls, and he never explains it.
There is nothing difficult about monitoring network traffic in this way for desktop or native apps.
Did you read the article? It claims to have knowledge of network traffic between the startup's devices and the AI providers' devices.
So? The vast majority of these startups would still do the wrapping on their own backend.
Do you have any link that supports this?
It sounds like some of these companies call the OpenAI or Anthropic APIs directly from their frontend. Later, the author also mentions "response time patterns for every major AI API," so maybe there's some information about the backend leaking that way even if the API calls are bridged.

But I'd like to know an actual answer to this, too, especially since large parts of this post read as if they were written by an LLM.

> It sounds like some of these companies call the OpenAI or Anthropic APIs directly from their frontend.

Which would be a major security hole. And sure, lots of startups have major security holes, but not enough that he could come up with these BS statistics.

I'm a little dismayed at how high up this has been voted given the data is guaranteed to be made up.

> > It sounds like some of these companies call the OpenAI or Anthropic APIs directly from their frontend.

> Which would be a major security hole.

An officially supported security hole

https://platform.openai.com/docs/api-reference/realtime-sess...

"I found 12 companies that left API keys in their frontend code. I reported them all. None responded."

They claim to have found that.

I'm also wondering how he is able to see calls to AI providers directly in the browser: client-side API calls? That's strange to me. Also, how is he able to peer into the RAG architectures? I don't get that. Maybe GPT-4.1 allows unauthenticated requests? Is there an OAuth setup that allows client-side requests to OpenAI?
Yea I just posted a similar comment. I'm sure some websites just skin OpenAI/Claude etc, but ALL of them? It makes no sense.
Yeah, TBH my BS detector is going off because this article never explains how he is able to intercept these calls.

To be able to call the OpenAI API directly from the front end, you'd need to include the OpenAI key, which would be a huge security hole. I don't doubt that many of these companies are just wrappers around the big LLM providers, but they'd be calling the APIs from their backend, where nothing should be interceptable. And sure, I believe a few of them are dumb enough to call OpenAI from the frontend, but that would be a minority.

This whole thing smells fishy, and I call BS unless the author provides more details about how he intercepted the calls.

> Yeah, TBH my BS detector is going off because this article never explains how he is able to intercept these calls.

You mean, except for explaining what he's doing 4-5 times? He literally repeats himself restating it. Half the article is about the various indicators he used. THERE ARE EXAMPLES OF THEM.

There's this bit:

> Monitored their network traffic for 60-second sessions

> Decompiled and analyzed their JavaScript bundles

Also there's this whole explanation:

> The giveaways when I monitored outbound traffic:

> Requests to api.openai.com every time a user interacted with their "AI"

> Request headers containing OpenAI-Organization identifiers

> Response times matching OpenAI’s API latency patterns (150–400ms for most queries)

> Token usage patterns identical to GPT-4’s pricing tiers

> Characteristic exponential backoff on rate limits (OpenAI’s signature pattern)

Also there's these bits:

> The Methodology (Free on GitHub next week):

> - The complete scraping infrastructure

> - API fingerprinting techniques

> - Response time patterns for every major AI API

One time he even repeats himself by stating what he's doing as playwright pseudocode, in case plain English isn't enough.

This was also really funny:

> One company’s “revolutionary natural language understanding engine” was literally this: [clientside code with prompt + direct openai API call].

And there's also this bit at the end of the article:

> The truth is just an F12 away.

There's more because LITERALLY HALF THE ARTICLE IS HIM DOING THE THING YOU COMPLAIN HE DIDN'T DO.

In case it's still not clear, he was capturing local traffic while automating with playwright as well as analyzing clientside JS.
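For anyone unsure what that kind of capture looks like, here is a rough sketch with Playwright; note it only sees requests the browser itself makes, which is exactly the limitation people are arguing about (the target URL and host list are placeholders, not the author's actual code):

```python
from playwright.sync_api import sync_playwright

PROVIDER_HOSTS = ("api.openai.com", "api.anthropic.com", "api.cohere.ai")

def log_provider_calls(url: str, seconds: int = 60) -> list[str]:
    """Visit a site and record browser-side requests to known LLM provider hosts."""
    hits: list[str] = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on(
            "request",
            lambda req: hits.append(req.url)
            if any(host in req.url for host in PROVIDER_HOSTS)
            else None,
        )
        page.goto(url)
        page.wait_for_timeout(seconds * 1000)  # let the "AI feature" fire requests
        browser.close()
    return hits

print(log_provider_calls("https://example-ai-startup.com"))
```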

> Monitored their network traffic for 60-second sessions

How can he monitor what's going on between a startup's backend and OpenAI's server?

> The truth is just an F12 away

That's just not how this works. You can see the network traffic between your browser and some service. In 12 cases that was OpenAI or similar. Fine. But that's not 73%. What about the rest? He literally has a diagram claiming that the startups contact an LLM service behind the scenes. That's what's not described, how does he measure that?

Are you not bothered that the only sign the author even exists is this one article and the previous one? Together with the claim to be a startup founder? Anybody can claim that. It doesn't automatically provide credibility.

> How can he monitor what's going on between a startup's backend and OpenAI's server?

He is not claiming to be doing that. He says what and how he's capturing multiple times. He says he's capturing what's happening in browser sessions. Reflect on what else you may need to re-evaluate or discard if you misunderstood this.

> That's just not how this works. You can see the network traffic between your browser and some service.

Yes, the author is well aware of that as are presumably most readers. However for example if your client makes POST requests to the startup's backend like startup.com/api/make-request-to-chatgpt and the payload is {systemPrompt: "...", userPrompt: "..."}, not much guessing as to what is going on is necessary.

> You are not bothered by the only sign that the author even exist is this one article and the previous one?

Moving the goalposts. He may or may not be full of shit. Guess we'll see if/when we see the receipts he promised to put on GitHub.

What actually bothers me is the lack of general reading comprehension being displayed in this thread.

> Together with the claim to be a startup founder? Anybody can claim that.

What? Anybody can be a startup founder today. Crazy claim. Also... what?

> It doesn't automatically provide credibility.

Almost nobody in this space has credibility. That could turn out to be Sam Altman's alias and I'd probably trust it even less.

In any case evaluating whether or not a text is credible should preferably happen after one has understood what was written.

I believe he's saying that a large number of the startups he tested did not have their own backend to mediate. It was literally direct front-end calls to openai. And if this sounds insane, remember that openai actually supports this: https://platform.openai.com/docs/api-reference/realtime-sess...

Presumably OpenAI didn't add that for fun, either, so there must be non-zero demand for it.

It's a fair point that OpenAI officially supports ephemeral keys.

But I still believe the vast majority of startups do the wrapping in their own backend. Yes, I read what he's doing, and he's still only able to analyze client-side traffic, which means his overall claims of "73%" are complete and total bullshit. It is simply impossible to conclude what he's concluding without having access to backend network traces.

EDIT: This especially doesn't make sense because the specific sequence diagram in this article shows the wrapping happening in "Startup Backend", and again, it would be impossible for him to monitor that network traffic. This entire article is made-up LLM slop.

>Response times matching OpenAI’s API latency patterns (150–400ms for most queries)

This also matches the latency of a large number of DB queries and non-OpenAI LLM inference requests.

>Token usage patterns identical to GPT-4’s pricing tiers

What? Yes this totally smells real.

He also mentions backoff patterns, which I'm not sure how he'd disambiguate from the extremely standard backoff in any normal API.

Given the ridiculousness of these claims, I believe there's a reason he didn't include the fingerprinting methodology in this article.

Why is this comment here? Am I supposed to defend that guy's article now?

Just because I'm frustrated with someone's inability to understand a text does not imply I want to defend or even personally believe what was written.

There's a link in the preview of TFA that unlocks the rest of the article, looks like this for me:

https://medium.com/@teja.kusireddy23/i-reverse-engineered-20...

The article is basically a description of where to look for clues. Perhaps they've contracted with some of these companies and don't want to break some NDA by naming them, but still know a lot about how they work.

> Perhaps they've contracted with some of these companies and don't want to break some NDA by naming them, but still know a lot about how they work.

This makes literally no sense. Why would any company (let alone most of them) contract with this guy, who seems hell-bent on exposing them all?

The article is simply made up, most likely by an LLM.

Prompt engineering isn't as simple as writing prompts in English. It's still engineering data flow: when data is relevant, which systems the AI can access and search, which tools the AI can use, etc.
Is it, though? Apparently the current best practice is just to allow the LLM untethered access to everything and try to control access by preventing prompt injection...
Well it took me 2 full-time weeks to properly implement a RAG-based system so that it found actually relevant data and did not hallucinate. Had to:

- write an evaluation pipeline to automate quality testing

- add a query rewriting step to explore more options during search

- add hybrid BM25+vector search with proper rank fusion (sketched below)

- tune all the hyperparameters for best results (like the weight bias for BM25 vs. vector, how many documents to retrieve for analysis, how to chunk documents based on semantics)

- parallelize the search pipeline to decrease wait times

- add moderation

- add a reranker to find best candidates

- add background embedding calculation of user documents

- lots of failure cases to iron out so that the prompt worked for most cases

There's no "just give LLM all the data", it's more complex than that, especially if you want best results and also full control of data (we run all of that using open source models because user data is under NDA)

Sounds like you vibe coded a RAG system in two weeks, which isn't very hard. Any startup can do it.

I've debugged single difficult bugs before for two weeks, a whole feature that takes two weeks is an easy feature to build.

I already had experience with RAG before so I had a head start. You're right that it's not rocket science, but it's not just "press F to implement the feature" either

P.S. No vibe coding was used. I only used LLM-as-a-judge to automate quality testing when tuning the parameters, before passing it to human QA

"did not hallucinate"

Sorry to nitpick, but this is not technically possible no matter how much RAG you throw at it. I assume you just mean "hallucinates a lot less"

You're right, bad wording
whoa, two weeks
rynn · 11 hours ago:
@apwell23 while the author didn’t say how s/he measured QA, creating the QA process was literally the first bullet.
You still need to find the correct data and get it to the LLM. IMO, a lot of it is data engineering work, with API calls to an LLM as an extra step. I'm currently doing a lot of ETL work with Airflow (and whatever data {warehouses, lakes, bases} are needed) to get the right data to a prompt engineering flow. The prompt engineering flow is literally a for loop over Google Docs in a Google Drive that non-tech people, who are domain experts in their field, can access.

It's up to the domain experts and me to understand where giving it data will tone down the hallucinative nonsense an LLM puts out, and where we should not give data because we need the problem-solving skills of the LLM itself. A similar process exists for tool use, which in our case is a set of pre-selected Python scripts it is allowed to run.
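As a heavily simplified sketch of that flow (local text files stand in for the Google Drive folder the experts edit, and the model name is a placeholder):

```python
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
PROMPT_DIR = Path("prompt_docs")  # stand-in for the shared Drive folder

def run_flow(context_data: str) -> dict[str, str]:
    """Run every expert-written prompt doc against the same ETL'd context."""
    results = {}
    for doc in sorted(PROMPT_DIR.glob("*.txt")):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            messages=[
                {"role": "system", "content": doc.read_text()},
                {"role": "user", "content": context_data},
            ],
        )
        results[doc.name] = resp.choices[0].message.content
    return results
```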

can you describe what the use case is?
Nah. There's no such thing as prompt engineering. It doesn't exist. Engineering involves applying scientific principles to solve real world problems. There are no clear scientific principles to apply here. It's all instinct, hunches, educated guesses, and heuristics with maybe some sort of feedback loop. And that's fine, it can still produce useful results. Just don't call it engineering. Maybe artisanal prompt crafting? Or prompt alchemy?
Prompt engineering is the new Search Engine Optimization.

Not sure if we called it engineering ten years ago.

Human speech is "engineering data flow"

Painting is "engineering data flow"

Directing a movie is "engineering data flow"

Playing the guitar is "engineering data flow"

This statement merely reveals a bias to apply high value to the word "engineering" and to the identity "engineer".

Ironic in that silicon valley lifted that identity and it's not even legally recognized as a licensed profession.

Imagine you are a top-of-the-line engineer...

Engineering data flow... sure, we all like to use big words.

The new 10x engineering is writing "please don't write bugs" in a markdown file.
This makes no sense to me. I don't understand why a company, even if it is using GPT or Claude as its true backend, would leave API calls in JavaScript that anyone can find. Sure, maybe a couple would, but 73% of those tested? Surely your browser is going to talk to their webserver, and yes, it'll then go off and use Claude etc. and return the answer to you, but surely they're not all going to just skin an easily discoverable website over the big models?

I don't believe any of this. Why aren't we questioning how the author is apparently able to figure out that some sites are using Redis, etc.?

It's very confusing in the text of the article, at times it sounds like the author is using heuristic methods (like timings) but at times it sounds like they somehow have access to network traffic from the provider's backend. I could 100% believe that a ton of these companies are making API calls to providers directly from an SPA, but the flow diagrams in the article seem to specifically rule that out as an explanation.

I might allow them more credit if the article wasn't in such an obviously LLM-written style. I've seen a few cases like this, now, where it seems like someone did some very modest technical investigation or even none at all and then prompted an LLM to write a whole article based on it. It comes out like this... a whole lot of bullet points and numbered lists, breathless language about the implications, but on repeated close readings you can't tell what they actually did.

It's unfortunate that, if this author really did collect this data, their choice to have an LLM write the article and in the process obscure the details has completely undermined their credibility.

Zanfa · 12 hours ago:
It makes perfect sense when you consider that the average Javascript developer does not know that business logic can exist outside of React components.
Yea I understand maybe 10-20% of these AI clowns don't know what they're doing, but to suggest they're all making a mistake this silly doesn't stack up IMHO.
Oras · 11 hours ago:
I can believe that many startups are doing prompt engineering and agents, but in a sense this is like saying 90% of startups are using cloud providers, mainly AWS and Azure.

There is absolutely no point in reinventing the wheel to create a generic LLM and spending a fortune to run GPUs while there are providers offering this power cheaply.

In addition, there may be value in getting to market quickly with existing LLM providers, proving out the concept, then building / training specialized models if needed once you have traction.

See: https://en.wikipedia.org/wiki/Lean_startup

73% of AI startups are building their castle in someone else's kingdom.
It's worse than that: someone else's models, someone else's smartphone operating systems; it's every conceivable disadvantage.
Every city should have its own municipal chip fabrication plant!
If you break up AT&T the Bell System will collapse!
Not sure if you're familiar, but it did collapse. It's all one company again.
asah · 12 hours ago:
-1: there are lots of "kingdoms" (OpenAI, Anthropic, Google, plus open source). If one king comes for your castle, you can move in minutes.
True, even OpenAI built their castle in Nvidia's kingdom. And Nvidia built their castle in TSMC's kingdom. And TSMC built their castle in ASML's kingdom.
lastly we need the FDIC meme "Backed by the full faith and credit of the U.S. Government" for good measure, haha.
TSMC bought a huge chunk of ASML's shares before taking the plunge on EUV -- enough to get them a board seat.
The thing that drives me nuts is that most "AI Applications" are just adding crappy chat to a web app. A true AI application should have AI driven workflows that automate boring or repetitive tasks without user intervention, and simplify the UI surface of the application.
I'm firmly of the opinion that, as a general rule, if you're directly embedding the output of a model into a workflow and you're not one of a handful of very big players, you're probably doing it wrong.[1]

If we overlook that non-determinism isn't really compatible with a lot of business processes and assume you can make the model spit out exactly what you need, you can't get around the fact that an LLM is going to be a slower and more expensive way of getting the data you need in most cases.

LLMs are fantastic for building things. Use them to build quickly and pivot where needed and then deploy traditional architecture for actually running the workloads. If your production pipeline includes an LLM somewhere in the flow, you need to really, seriously slow down and consider whether that's actually the move that makes sense.

[1] - There are exceptions. There are always exceptions. It's a general rule not a law of physics.

I'm surprised by the number of people who are running head first into AI wrapper start-ups.

Either you have a smash-and-grab strategy or you are awful at risk analysis.

mvkel · 12 hours ago:
Do you want to be right, or do you want to make money? You'll be correct in 5-10 years. Do you wait and do nothing until then?
The reason is that VCs need to show that their flagship investments have "traction," so they manufacture ecosystem interest by funding and encouraging ecosystem product usage. It's a small price to pay. If someone builds a wrapper that gets 100 business users, that token usage flows down to the foundation layer. Big scheme.
My question with these is always "what happens when the model doesn't need prompting?" For example, there was a brief period where IDE integrations for coding agents were a huge value add: folks spent eons crafting clever prompts and integrations to get the context right for the model. Then... Claude, Gemini, Codex, and Grok got better. All indications are that engineers are pivoting to the coding toolchains vended by the foundation model providers, and their wrappers.

This is rapidly becoming a more extreme version of the classic "what if google does that?" as the foundation model vendors don't necessarily need to target your business or even think about it to eat it.

doe88 · 11 hours ago:
This is a kind of global app store all over again, where all these companies are clients of only a few true AI companies and try to distinguish themselves within the bounds of the underlying models and APIs, just like apps were trying to find niches within the bounds of the APIs and exposed hardware of the underlying iPhones. API version bugs are now model updates. And of course, all are at the mercy of their respective Leviathan.
Flagged. Please don't post items on HN where we have to pay or hand over PII to read it. Thanks.
pls don't create new guidelines
Ditto.
it's wild, I work with some Fortune 500 engineers who don't spend a lot of time prompting AI, and just a few quick prompts, like 'output your code in <code lang="whatever">...</code> tags' (a trick that most people in the prompting world are very familiar with, but that virtually no one outside the bubble knows about), can improve AI code generation outputs to almost 100%.

It doesn't have to be this way and it won't be this way forever, but this is the world we live in right now, and it's unclear how many years (or weeks) it'll be until we don't have to do this anymore
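A tiny illustration of that trick: put the tag instruction in the system prompt, then parse the code out deterministically (the tag name mirrors the comment above; the example reply is made up):

```python
import re

SYSTEM_PROMPT = (
    "You are a coding assistant. "
    'Always output your code in <code lang="python">...</code> tags '
    "and keep any explanation outside the tags."
)

def extract_code(model_output: str) -> str:
    """Pull just the code out of the tagged response, ignoring surrounding prose."""
    match = re.search(r'<code lang="[^"]*">(.*?)</code>', model_output, re.DOTALL)
    return match.group(1).strip() if match else ""

reply = 'Sure!\n<code lang="python">print("hello")</code>\nLet me know if that helps.'
print(extract_code(reply))  # -> print("hello")
```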

Interesting article and plausible conclusions but the author needs to provide more details to back up their claims. The author has yet to release anything supporting their approach on their Github. https://github.com/tejakusireddy
98% of all websites are just database wrappers
2% are unjust?
73% of startups are just writing computer programs
ojr · 9 hours ago:
5% prompt engineering, 95% orchestration. And no, you cannot vibe code your way to cloning my apps. I have paid subscriptions; why aren't you doing it then? Oh, because models degrade severely over 500 lines.

LLMs are the new AJAX. AJAX made pages dynamic; LLMs make pages interactive.

I don't care how you get to a system that does something useful.
73% of statistics are wrong
100% of startups are just software engineering
That's actually lower than I would have thought.
Another slop article that could probably be good if the author was interested in writing it, but instead they dumped everything into an LLM and now I can't tell what's real and what's not and get no sense of what parts of the findings the author found important or interesting compared to what other parts.

I have to wonder, are people voting this up after reading the article fully, and I'm just wrong and this sort of info dump with LLM dressing is desirable? Or are people skimming it and upvoting? Or is it more of an excuse to talk about the topic in the title? What level of cynicism should I be on here, if any?

And 73% of SaaS companies are just CRUD.

Honestly it sounds about right: at the end of the day, most companies will always be an interesting UI and workflow around some commodity tech, but, that's valuable. Not all of it may be defensible, but still valuable.

Maybe one day i can ask my tech in natural language for the weather...could you imagine?

Wait...nvm.

So? 73% of SaaS startups are DB connectors & queries.
The difference is, if your company “moat” is a “prompt” on a commodity engine, there is no moat.

Google even said they have no moat, when clearly the moat is people that trust them and not any particular piece of technology.

ojr · 9 hours ago:
the orchestration layer is the moat; ask any LLM and it will give you paragraphs explaining why this is...
And 73% of PaaS are deploy scripts for existing software. It's how the industry works.
If tokens aren't profitable then prices per token are likely to go up. If that's all these businesses are, they're all very sensitive to token prices.
Not with open-weight models you can deploy yourself. Different economics, but not vulnerable to price increases.
First, someone has to develop those models, and that's currently being done with VC backing. Second, running those models is still not profitable, even if you self-host (obviously true, because everything is self-hosted eventually).

Burning VC money isn't a long term business model and unless your business is somehow both profitable on Llama 8b (or some such low power model) _and_ your secret sauce can't be easily duplicated, you're in for a rough ride.

The only barrier between AI startups at this point is access to the best models, and that's dependent on being able to run unprofitable models that spend someone else's money.

Investing in a startup that's basically just a clever prompt is gambling on the first mover's advantage because that's the only advantage they can have.

And 99% of software development is just feeding data into a compiler. But that sort of misses the point, doesn't it?

AI has created a new interface with a higher level abstraction that is easier to use. Of course everyone is going to use it (how many people still code assembler?).

The point is what people are doing with it is still clever (or at least has potential to be).

I disagree. Software development is not limited to LLM-type responses and incorporates proper logic. You are at the mercy of the LLM when you build an "AI" interface for the LLM APIs. 73% of these "AI" companies will collapse when the original API-providing company comes up with a simple option (Gemini for Sheets, for example); they will simply disappear. It is already happening.

AI software is not long-lasting; its results are not deterministic.

Isn’t it a bit like saying, “X% of startups are just writing code”?
73% of AI blog post statistics are bogus. Subscribe to learn more.
Flagged. AI written article with questionable sources behind a wall that requires handing over PII.
It’s because the LLM is a commodity.

What differentiates a product is not the commodity layer it’s built on (databases, programming languages, open source libraries, OS apis, hosting, etc) but how it all gets glued together into something useful and accessible.

It would be a bad strategy for most startups to do anything other than prompt engineering in their AI implementations for the same reason it would be a bad idea for most startups to write low-level database code instead of SQL queries. You need to spend your innovation tokens wisely.

Yep, I just use ChatGPT. I can write better prompts and data for my own use cases.
Atlas himself doesn't carry as much as "engineering" does in that headline.
That's like saying "73% of business is just meetings"
One of the biggest problems frontier models will face going forward is how many tasks require expertise that cannot be achieved through Internet-scale pre-training.

Any reasonably informed person realizes that most AI start-ups looking to solve this are not trying to create their own pre-trained models from scratch (they will almost always lose to the hyperscale models).

A pragmatic person realizes that they're not fine-tuning/RL'ing existing models (that path has many technical dead ends).

So, a reasonably informed and pragmatic VC looks at the landscape, realizes they can't just put all their money into the hyperscale models (LPs don't want that), and looks for startups that take existing hyperscale models and expose them to data that wasn't in their pre-training set, hopefully in a way that's useful to some users somewhere.

To a certain extent, this study is like saying that Internet start-ups in the 90's relied on HTML and weren't building their own custom browsers.

I'm not saying that this current generation of start-ups will be successful as Amazon and Google, but I just don't know what the counterfactual scenario is.

The question that isn't answered completely in the article is how useful the pipelines built by these startups actually are. The article certainly implies that for at least some of these startups there's very little value added in the wrapper.
Got any links to explanations of why fine tuning open models isn’t a productive solution? Besides renting the GPU time, what other downsides exist on today’s SOTA open models for doing this?
When people are desperate to invest, they often don't care what someone actually can do but more about what they claim they can do. Getting investors these days is about how much bullshit you can shovel as opposed to how much real shit you shoveled before.

Thus has it always been. Thus will it always be.

Prompt engineering and using an expensive general model in order to prove your market, and then putting in the resources to develop a smaller(cheaper) specialized model seems like a good idea?
Are people down to have a bunch of specialized models? The expectation set by OpenAI and everyone else is that you will have one model that can do everything for you.

It's like how we've seen basically all gadgets meld into the smartphone. People don't have Garmins and beepers and clock radios anymore (or dedicated phones!). It's all on the screen that fits in your pocket. Any would-be gadget is now just an app.

> The expectation set by OpenAI and everyone else has set is that you will have one model that can do everything for you.

I don't think that's the expectation set by "everyone else" in the AI space, even if it arguably is for OpenAI (which has always, at least publicly, had something of a focus on eventual omnicapable superintelligence). I think Google Antigravity is evidence of this: there's a main, user-selected coding model, but regardless of which coding model is used, there are specialized models for browser interaction and image generation. While more and more capabilities are at least tolerably supported by the big general-purpose models, the range of specialized models seems to be increasing rather than decreasing, and it seems likely that, for complex efforts, combining a general-purpose model with a set of focused, task-specific models will be a useful approach for the foreseeable future.

bjt · 11 hours ago:
Having everything in my phone is a great convenience for me as a consumer. Pockets are small, and you only have a small number of them in any outfit.

But cloud services run in... the cloud. It's as big as you need it to be. My cloud service can have as many backing services as I want. I can switch them whenever I want. Consumers don't care.

"One model that can do everything for you" is a nice story for the hyper scalers because only companies of their size can pull that off. But I don't think the smartphone analogy holds. The convenience in that world is for the the developers of user-facing apps. Maybe some will want to use an everything model. But plenty will try something specialized. I expect the winner to be determined by which performs better. Developers aren't constrained by size or number of pockets.

I think of the foundational model like CPUs. They're the core of powerful, general-purpose computers, and will likely remain popular and common for most computing solutions. But we also have GPUs, microcontrollers, FPGAs, etc. that don't just act as the core of a wide variety of solutions, but are also paired alongside CPUs for specific use cases that need specialization.

Foundational models are not great for many specific tasks. Assuming that one architecture will eventually work for everything is like saying that x86/amd64/ARM will be all we ever need for processors.

Specialized models are cheaper. For a company you're looking for some task that needs to be done millions of times per day, and where general models can do it well enough that people will pay you more than the general model's API cost to do it. Once you've validated that people will pay you for your API wrapper you can train a specialized model to increase your profit and if necessary lower your pricing so people won't pay OpenAI directly.
It's probably the direction it will go, at least in the near term.

It seems right now like there is a tradeoff between creativity and factuality, with creative models being good at writing and chatting, and factuality models being good at engineering and math.

It's why we are getting these specific "-code" models.

Happy with my Garmin :-)
I still use the Garmin I bought in 2010. I refuse to turn on my phone's location tracking. Also the single-purpose interface is better and safer than switching between apps and contexts on a general purpose device.
It's really an implementation decision. The end user doesn't need to know their request is routed to a certain model. A smaller specialized model might have identical output to a larger general purpose model, but just be cheaper and faster to run.
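A toy illustration of that routing decision (both model names are made up):

```python
def pick_model(request_text: str) -> str:
    """Route short, well-scoped requests to a cheap specialized model."""
    if len(request_text) < 280 and "summarize" in request_text.lower():
        return "small-summarizer-v1"  # hypothetical specialized model
    return "big-general-model"        # hypothetical general-purpose fallback

print(pick_model("Summarize this ticket for the on-call engineer."))
```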
My coffee maker app is quite disappointing.
I imagine you’re being facetious but I wouldn’t count food-related products for the most part. It’s not like Claude is brewing a pot for me anyway lol
And out of that 73%, 99% of them don't even do the obvious step of trying to actually optimize/engineer their damn prompts!

https://github.com/zou-group/textgrad

and bonus, my rant about this circa 2023 in the context of Stable Diffusion models: https://gist.github.com/Hellisotherpeople/45c619ee22aac6865c...

The really impressive thing about AI startups is not that they sell wrappers around (whatever), but that they are not complete vaporware.
it was never about the software.
I decided to flag this article because it has to be fake.

The author never explains how he is able to intercept these API calls to OpenAI, etc. I definitely believe tons of these companies are just wrappers, but they'd be doing the "wrapping" in their backend, with only a couple of (dumb) companies making the calls directly to OpenAI from the front end, where they could be traced.

This article is BS. My guess is it was probably AI generated because it doesn't make any sense.

I find it shocking that most comments here just accept the article as fact and discuss the implications.

The message might not even be wrong. But why is everybody's BS detection on ice in the AI topic space? Come on people, you can all do better than this!

Thanks for flagging. Though whenever such a made-up thing is flagged, we lose the chance to discuss this (meta) topic. People need to be aware of how prevalent this is. By just hiding it every time we notice, we're preventing everybody from reading the kind of comment you wrote and recalibrating their BS meters.

People talk about an AI bubble. I think this is the real bubble.
Not really, because the money involved is relatively small. The bubble is where people are using D8s to push square kilometers of dirt around for data centers that need new nuclear power plants built, to house millions of obsolete Nvidia GPUs that need new fabs constructed to make them, using yet more D8s.
mh- · 12 hours ago:
(D8s apparently refers to a specific Caterpillar-brand bulldozer, not some kubernetes takeoff.)
mvkel · 12 hours ago:
Wait til you hear what GPT 5 is
What is it? A gpt-4o wrapper?
Why is slop with ridiculous or impossible claims at the top of HN?
This is an AI slop article that sounds completely fabricated. Half of what's being claimed here isn't even possible to discern. My guess is that some LLM is churning out these 100% fake articles to get subscribers and ad revenue on Medium. Flagged.
Prompt is code.
prompt as code is a pipe dream.

The machine model for natural language doesn't exist; it is too ambiguous to be useful for many applications.

Hence, we limited natural language to create programming languages whose machine model is well defined.

In math, we created formalism to again limit language to a subset that can be reasoned with.

Prompt is specification, not code.
Not to be too pedantic, but code is a kind of specification. I think making the blanket statement "Prompt is code" is inaccurate, but there does exist a methodology of writing prompts as if they are specifications that can be reliably converted into computational actions, and I believe we're heading toward that.
Yeah, I assumed someone would say this.

My manager gives me specifications, which I turn into code.

My manager is not coding.

This is how to look at it.

I've always said, determinism has been holding the field back.
100% of AI startups are just multiplying matrices.

100% of tech startups are just database engineering.

It's still early in the paradigm and most startups will fail but those that succeed will embed themselves in workflows.