So yah, cool, caching all of that... but give it a couple of months and a better technique will come out - or more capable models.
Many years ago, when disk encryption on AWS was not an option, my team and I spent three months coming up with a way to encrypt the disks, and to do it well, because at the time there was no standard way. It was very difficult, as it required pushing encrypted images (as far as I remember). Soon after we started, AWS introduced standard disk encryption that you could turn on by clicking a button. We wasted three months for nothing. We should have waited!
What I've learned from this is that oftentimes it is better to do absolutely nothing.
That's a risky bet. It is more likely that the user interface of AI will evolve. Some things will stick, some will not. Three years from now, many things that are clunky now will be replaced by more intuitive things. But some things that already work now will still be in place. People who have been heavy users of AI between now and then will definitely have a head start on those who only start then.
On the “wasting three months” remark (GP), if it’s a key value proposition, just do it. Don’t wait. If it’s not a key value prop, then don’t do it at all. Oftentimes what I’ve built has been better tailored to our product than what AWS built.
I'd wager the same for AI agent techniques.
You have a positive cash flow from sales of agents? Your revenue exceeds your operating costs?
I've been very skeptical that it is possible to make money from agents, having seen how difficult it was for the current well-known players to do so.
What is your secret sauce?
Imo the key is to serve one use case really well rather than overgeneralize.
Here's a nice article about it: https://www.oneusefulthing.org/p/the-lazy-tyranny-of-the-wai...
It all comes down to trying to predict your vendors' roadmap (or, if you're savvy, getting a peek into it) and whether the feature you want to create is fundamental to your application's behavior (I doubt encryption is, unless you're a storage company).
Any framework you build around the model is just behaviour that can be trained into the model itself
What technology shifts have happened for LLMs in the last 2 years?
Now, I'm open to the idea that I'm just using it wrong, but I have seen several reports around the web that the best tool-calling accuracy people get is around 80%, which is unusable for any production system. For info retrieval, I have also seen it lose coherence as more data becomes available overall.
Is there a model that actually achieved 100% tool calling accuracy?
So far I've built systems for that myself, surrounding the LLM, and only then did it work well in production.
When we were at 4,000 and 16,000 context windows, a lot of effort was spent on nailing down text splitting, chunking, and reduction.
For all intents and purposes, the size of current context windows obviates all of that work.
What else changed?
- Multimodal LLMs - Text extraction from PDFs was a major issue for rag/document intelligence. A lot of time was wasted trying to figure out custom text extraction strategies for documents. Now, you can just feed the image of a PDF page into an LLM and get back a better transcription.
- Reduced emphasis on vector search. People have found that for most purposes, having an agent grep your documents is cheaper and better than using a more complex rag pipeline. Boris Cherny created a stir when he talked about claude code doing it that way[0]
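A minimal sketch of what that looks like as a tool, purely for illustration (the folder layout, tool name, and shape are assumptions, not Claude Code's implementation):

```python
# Minimal "let the agent grep your documents" tool, instead of a vector-search
# pipeline. Pure-Python regex search over a folder of text files.
import pathlib
import re

def grep_docs(pattern: str, root: str = "./docs", max_hits: int = 20) -> str:
    rx = re.compile(pattern, re.I)
    hits = []
    for path in pathlib.Path(root).rglob("*.txt"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if rx.search(line):
                hits.append(f"{path}:{lineno}: {line.strip()}")
                if len(hits) >= max_hits:
                    return "\n".join(hits)
    return "\n".join(hits) or "no matches"
```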
Large context windows can make some problems easier or go away for sure. But you may still have the same issue of getting the right information to the model. If your data is much larger than e.g. 256k tokens you still need to filter it. Either way, it can still be beneficial (cost, performance, etc.) to filter out most of the irrelevant information.
>Reduced emphasis on vector search. People have found that for most purposes, having an agent grep your documents is cheaper and better than using a more complex rag pipeline
This has been obvious from the beginning for anyone familiar with information retrieval (R in RAG). It's very common that search queries are looking for exact matches, not just anything with similar meaning. Your linked example is code search. Exact matches/regex type of searches are generally what you are looking for there.
These last few years, I've noticed that the tone around AI on HN changes quite a bit by waking time zone.
EU waking hours have comments that seem disconnected from genAI. And, while the US hours show a lot of resistance, it's more fear than a feeling that the tools are worthless.
It's really puzzling to me. This is the first time I noticed such a disconnect in the community about what the reality of things are.
To answer your question personally, genAI has changed the way I code drastically about every 6 months in the last two years. The subtle capability differences change what sorts of problems I can offload. The tasks I can trust them with get larger and larger.
It started with better autocomplete, and now, well, agents are writing new features as I write this comment.
Both sides have valid observations in their experiences and circumstances. And perhaps this is simply another engineering "it depends" phenomenon.
My skepticism, and my intuition that AI innovations are not exponential but sigmoid, are not because I don't understand what gradient descent, transformers, RAG, CoT, or multi-head attention are. My statement of faith is: the ROI economics are going to catch up with the exuberance well before AGI/ASI is achieved; sure, you're getting improving agents for now, but that's not going to justify the 12- or 13-digit USD investments. The music will stop, and improvements will slow to a drip.
Edit: I think at its root, the argument is between folk who think AI will follow the same curve as past technological trends, and those who believe "it's different this time".
I did neither of these two things... :) I personally could not care less about:
- (over)hype
- 12/13/14/15 ... digit USD investment
- exponential vs. sigmoid
There are basically two groups of industry folk:
1. those that see technology as absolutely transformational and are already doing amazeballs shit with it
2. those that argue how it is bad/not-exponential/ROI/...
If I was a professional (I am) I would do everything in my power to learn everything there is to learn (and then more) and join the Group #1. But it is easier to be in Group #2 as being in Group #1 requires time and effort and frustrations and throwing laptop out the window and ... :)
>> ...there are people that are professionals and are willing to put the time in to learn and then there’s vast majority of others who don’t...
tl;dr version: having negative view of the industry is decoupled from one's familiarity with, and usage of the tools, or the willingness to learn.
I hack for a living. I could hardly give two hoots about “false promises” or “hucksters” or some “impending economic reckoning…” I made a general comment that a whole lot of people simply discount technology on technical grounds (a favorite here on HN)…
I suppose this is the crux of our misunderstanding: I deeply care about the long-term health and future of the field that gave me a hobby that continues to scratch a mental itch with fractal complexity/details, a career, and more money than I ever imagined.
> or some “impending economic reckoning…”
I'm not going to guess if you missed the last couple of economic downturns or rode them out, but an economic reckoning may directly impact your ability to hack for a living, that's the thing you prize.
Your whole point isn't supported by anything but ... a guess?
If given the chance to work with an AI who hallucinates sometimes or a human who makes logical leaps like this
I think I know what I'd pick.
Seriously, just what even? "I can imagine a scenario where AI was involved, therefore I will treat my imagination as evidence."
I’d make the distinction between these systems and what they’re used for. The systems themselves are amazing. What people do with them is pretty mundane so far. Doing the same work somewhat faster is nice, and it’s amazing that computers can do it, but the result is just a little more of the same output.
Remember, a logistic curve is an exponential (so, roughly, a process whose outputs feed its growth, the classic example being population growth, where more population makes more population) with a carrying capacity (the classic example is again population, where you need to eat to be able to reproduce).
Singularity 2026 is open and honest, wearing its heart on its sleeve. It's a much more respectable wrong position.
Just in my office, I have seen “small tools” like Charles Proxy almost entirely disappear. Everyone writes/shares their AI-generated solutions now rather than asking cyber to approve a 3rd party envfile values autoloader to be whitelisted across the entire organization.
I do lower level operating systems work. My bread and butter is bit-packing shenanigans, atomics, large-scale system performance, occasionally assembly language. It’s pretty bad at those things. It comes up with code that looks like what you’d expect, but doesn’t actually work.
It’s good for searching big codebases. “I’m crashing over here because this pointer has the low bit set, what would do that?” It’s not consistent, but it’s easy to check what it finds and it saves time overall. It can be good for making tests, especially when given an example to work from. And it’s really good for helper scripts. But so far, production code is a no-go for me.
I wouldn't want to have to review the output of an agent going wild for an hour.
It’s the worst kind of disposable software.
But all the times I tried using LLMs to help me coding, the best it performs is when I give it a sample code (more or less isolated) and ask it for a certain modification that I want.
More often than not, it does make seemingly random mistakes and I have to be looking at the details to see if there’s something I didn’t catch, so the smallest scope there better.
If I ask for something more complex or more broad, it’s almost certain it will make many things completely wrong.
At some point, it’s such hard work to detail exactly what you want, with all the context, that it’s better to just do it yourself, since you’re writing a wall of text for a one-time thing.
But anyway, I guess I remain waiting. Waiting until FreeBSD catches up with Linux, because it should be easy, right? The code is there in the Linux kernel, just tell an agent to port it to FreeBSD.
I’m waiting for the explosion of open source software that aren’t bloated and that can run optimized, because I guess agents should be able to optimize code? I’m waiting for my operating system to get better over time instead of worse.
Instead I noticed the last move from WhatsApp was to kill the desktop app to keep a single web wrapper. I guess maintaining different codebases didn’t get cheaper with the rise of LLMs? Who knows. Now Windows releases updates that break localhost. Ever since the rise of LLMs I haven’t seen software release features any faster, or any Cambrian explosion of open source software copying old commercial leaders.
The usefulness of your comment, on the other hand, is beyond any discussion.
"Anyone who disagrees with me is dishonest" is some kindergarten level logic.
Ridiculous statement. Is Google also not good for humanity as a whole? Is Internet not good for humanity as a whole? Wikipedia?
It seems pretty clear to me that culture, politics and relationships are all objectively worse.
Even remote work, I am not completely sure I am happier than when I used to go to the office. I know I am certainly not walking as much as I did when I would go to the office.
Amazon is vastly more efficient than any kind of shopping in the pre-internet days but I can remember shopping being far more fun. Going to a store and finding an item I didn't know I wanted because I didn't know it existed. That experience doesn't exist for me any longer.
Information retrieval has been made vastly more efficient so I instead of spending huge amounts of time at the library, I get that all back in free time. What I would have spent my free time doing though before the internet has largely disappeared.
I think we want to take the internet for granted because the idea that the internet is a long term, giant mistake is unthinkable to the point of almost having a blasphemous quality.
Childhood? Wealth inequality?
It is hard to see how AI as an extension of the internet makes any of this better.
It's a defensible claim I think. Things that people want are not always good for humanity as a whole, therefore things can be useful and also not good for humanity as a whole.
I don't think it's because the audience is different but because the moderators are asleep when Europeans are up. There are certain topics which don't really survive on the frontpage when moderators are active.
This would mean it is because the audience is different.
The by far more common action is for the mods to restore a story which has been flagged to oblivion by a subset of the HN community, where it then lands on the front page because it already has sufficient pointage
What I'm pointing out is just that moderation isn't the same at different times of the day and that this sometimes can explain what content you see during EU and US waking hours. If you're active during EU daytime hours and US morning hours, you can see the pattern yourself. Tools like hnrankings [1] make it easy to watch how many top-10 stories fall off the front page at different times of day over a few days.
This is what you said. There has only been one until this year, so now we have two.
The moderation patterns you see are the community and certainly have significant time factors that play into that. The idea that someone is going into the system and making manual changes to remove content is the conspiracy theory
This fru-fru about how "we all play a part" is only serving to obscure the reality.
There's dang who I've seen edit headlines to match the site rules. Then there's the army of users upvoting and flagging stories, voting (up and down) and flagging comments. If you have some data to backup your sentiments, please do share it - we'd certainly like to evaluate it.
My email exchanges with Dang, as part of the moderation that happens around here, have all been positive
1. I've been moderated, got a slowdown timeout for a while
2. I've emailed about specific accounts, (some egregious stuff you've probably never seen)
3. Dang once emailed me to ask why I flagged a story that was near the top, but getting heavily flagged by many users. He sought understanding before making moderation choices
I will defend HN moderation people & policies 'til the cows come home. There is nothing close to what we have here on HN, which is largely about us being involved in the process and HN having a unique UX and size
Emphasis mine. The question is does the paid moderation team disappear unfavorable posts and comments, or are they merely downranked and marked dead (which can still be seen by turning on showdead in your profile).
When Paul Graham was more active and respected here, I spoke negatively about how revered he was. I was upvoted.
I also think VC-backed companies are not good for society. And have expressed as much. To positive response here.
We shouldn't shit on one of the few bastions of the internet we have left.
I regret my negativity around pg - he was right about a lot and seems to be a good guy.
On the application layer, connecting with sandboxes/VMs is one of the biggest shifts (Cloudflare's Code Mode, etc.). Giving an LLM a sandbox unlocks on-the-fly computation, calculations, RPA, anything really (rough sketch below).
MCPs, or rather standardized function calling, are another one.
Also, local LLMs are becoming almost viable thanks to better and better distillation, relying on quick web search for facts, etc.
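As a rough sketch of the sandbox point, with the heavy caveat that a bare subprocess here stands in for a real isolated VM/container (e2b, Code Mode, etc.), which would also add network and filesystem isolation:

```python
# Execute model-written Python in a separate process and return its output.
# Illustrative only; not a real sandbox.
import subprocess
import sys
import tempfile

def run_python_tool(code: str, timeout: int = 10) -> str:
    """Run the code in a child process; return stdout on success, stderr on failure."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True,
                              text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "error: execution timed out"
    return proc.stdout if proc.returncode == 0 else proc.stderr
```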
I think the next period of high and rapid growth will be in media (image, video, sound, 3D), not text.
It's much harder to adapt LLMs to solving business use cases with text. Each problem is niche, you have to custom tailor the solution, and the tooling is crude.
The media use cases, by contrast, are low hanging fruit and result in 10,000x speedups and cost reductions almost immediately. The models are pure magic.
I think more companies would be wise to ignore text for now and focus on visual domain problems.
Nano Banana has so much more utility than agents. And there are so many low hanging fruit ways to make lots of money.
Don't sleep on image and video. That's where the growth salient is.
I am so far removed from multimedia spaces that I truly can't imagine a universe where this could be true. Agents have done incredible things for me and Nano Banana has been a cool gimmick for making memes.
Anyone have a use case for media models that'll expand my mind here?
As someone in the film space, here's just one example: we are getting extremely close to being able to make films with only AI tools.
Nano Banana makes it easy to create character and location consistent shots that adhere to film language and the rules of storytelling. This still isn't "one shot", and considerable effort still needs to be put in by humans. Not unlike AI assistance in IDEs requiring a human engineer pilot.
We're entering the era of two person film studios. You'll undoubtedly start seeing AI short films next year. I had one art school professor tell me that film seems like it's turning into animation, and that "photorealism" is just style transfer or an aesthetic choice.
The film space is hardly the only space where these models have utility. There are so many domains. News, shopping, gaming, social media, phone and teleconference, music, game NPCs, GIS, design, marketing, sales, pitching, fashion, sports, all of entertainment, consumer, CAD, navigation, industrial design, even crazy stuff like VTubing, improv, and LARPing. So much of what we do as humans is non-text based. We haven't had effective automation for any of this until this point.
This is a huge percentage of the economy. This is actually the beating heart of it all.
The two are kind of orthogonal concepts.
AI still can’t reliably render text in background details. It can’t get shadows right. If you ask it to shoot things from a head-on perspective, for example a bookshelf, it fails to keep proportions accurate enough. The bookshelf will not have parallel shelves. The books won’t have text. If in a library, the labels will not be in Dewey decimal order.
It still lacks a huge amount of understanding about how the world works necessary to make a film. It has its uses, but pretending like it can make a whole movie is laughable.
I'd say the image and video tools are much further along and much more useful than AI code gen (not to dunk on code autocomplete). They save so much time and are quite incredible at what they can do.
In terms of cinema tech, it took us arguably until the early 1940s to achieve "deep focus in artificial light". About 50 years!
The last couple of years of development in generative video looks, to me, like the tech is improving more quickly than the tech it is mimicking did. This seems unsurprising - one was definitely a hardware problem, and the other is most likely a mixture of hardware and software problems.
Your complaints (or analogous technical complaints) would have been acceptable issues - things one had to work around - for a good deal of cinema history.
We've already reached people complaining about "these book spines are illegible", which feels very close to "it's difficult to shoot in focus, indoors". Will that take four or five decades to achieve, based on the last 3 - 5 years of development?
The tech certainly isn't there yet, nor am I pretending like it is, and nor was the comment you replied to. To call it close is not laughable, though, in the historical context.
The much more interesting question is: At what point is there an audience for the output? That's the one that will actually matter - not whether it's possible to replicate Citizen Kane.
I still occasionally see a blip of activity but I can't say it's anything like what we witnessed in the past.
Though I will agree that gen AI trends feel reminiscent of that period of JS dev history.
I settled on what seemed like the most “standard” set of things (marketable skills blabla) and every week I read an article about how that stack is dead, and everybody supposedly uses FancyStack now.
Adding insult to injury, I have relearned the fine art of inline styles. I assume table layouts are next.
To lurch back on topic: I’m doing this for AI-related stuff and yes, the AI pace of change is much worse, but they sure do make a nice feedback loop.
You could use the likes of Amazon/Anthropic, or use Google, which has had transparent disk encryption for 10+ years, and Gemini, which already has the transparent caching discussed built in.
I've never had the downtime or service busy situations I've heard others complain about with other vendors.
They did pricing based on chars back in the day, but now they are token based like everyone else.
I like that they are building custom hardware that is industry leading in terms of limiting how much my AI usage impacts the environment.
What do you think I shouldn't be enthusiastic about?
Here's what I found:
- Claude Code SDK (now called Agent SDK) is amazing, but I think they are still in the process of decoupling it from Claude Code, and that's why a few things are weird. E.g., you can define a subagent programmatically, but not skills. Skills have to be placed in the filesystem and then referenced in the plugin. Also, only Anthropic models are supported :(
- OpenAI's SDK's tight coupling with their platform is a plus point, i.e., you get agent and tool-use traces by default in your dashboard, which you can later use for evaluation, distillation, or fine-tuning. But: 1. It's not easy to use a third-party model provider; their docs provide sample code, but it's not as easy as that. 2. They have agent handoffs (which work in some cases), but not subagents. You can use tools as subagents, though.
- Google Agent Kit doesn't provide a TypeScript SDK yet, so I didn't try it.
- Mastra, even though it looks pretty sweet, spins up a server for your agent, which you can then use via REST API. umm.. why?
- SmythOS SDK is the one I'm currently testing because it provides flexibility in terms of choosing the model provider and defining your own architecture (handoffs or subagents, etc.). It has its quirks, but I think it'll work for now.
Question: If you don't mind sharing, what is your current architecture? Agent -> SubAgents -> SubSubAgents? Linear? or a Planner-Executor?
I'll write a detailed post about my learnings from architectures (fingers crossed) soon.
As for the actual agent I just do the following:
- Get metadata from initial query
- Pass relevant metadata to agent
- Agent is a reasoning model with tools and output
- Agent runs in a loop (max of n times). It will reason which tool calls to use
- If there is a tool call, execute it and continue the loop
- Once the agent outputs content, the loop is effectively finished and you have your output
This is effectively a ReAct agent. Thanks to the reasoning being built in, you don't need an additional evaluator step.
Tools can be anything. It can be a subagent with subagents, a database query, etc. Need to do an agent handoff? Just output the result of the agent into a different agent. You don't need an sdk to do a workflow.
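For illustration, here's a minimal sketch of that loop in Python. Everything in it is a stand-in (the `call_model` helper, the message dicts, the turn limit), not any particular SDK's API:

```python
# Minimal ReAct-style agent loop (illustrative sketch, not tied to any SDK).
# `call_model` is a stand-in for your provider's chat call; it returns either
# {"tool": name, "args": {...}} or {"content": "..."}.
MAX_TURNS = 10

def run_agent(call_model, tools: dict, messages: list) -> str:
    for _ in range(MAX_TURNS):
        reply = call_model(messages)          # reasoning model decides: tool or done
        if "tool" in reply:                   # tool call requested
            result = tools[reply["tool"]](**reply["args"])
            messages.append({"role": "tool",
                             "name": reply["tool"],
                             "content": str(result)})
            continue                          # feed the result back and loop again
        return reply["content"]               # model produced final output
    raise RuntimeError("agent did not finish within MAX_TURNS")
```

A subagent is then just another entry in `tools` whose implementation happens to call `run_agent` again with its own fresh `messages` list.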
I've tried some other SDKs/frameworks (Eino and langchaingo), and personally found it quicker to do it manually (as described above) than fight against the framework.
A "sub agent" is just a tool. It's implantation should be abstracted away from the main agent loop. Whether the tool call is deterministic, has human input, etc, is meaningless outside of the main tool contract (i.e Params in Params out, SLA, etc)
If you want to rewrite the behavior per instance you totally can, but there is a definite concept here that is different than “get_weather”.
I think that existing tools don’t work very well here or leave much of this as an exercise for the user. We have tasks that can take a few days to finish (just a huge volume of data and many non deterministic paths). Most people are doing way too much or way too little. Having subagents with traits that can be vended at runtime feels really nice.
Up to a point. You're obviously right in principle, but if that task itself has the ability to call into "adjacent" tools then the behavior changes quite a bit. You can see this a bit with how the Oracle in Amp surfaces itself to the user. The oracle as sub-agent has access to the same tools as the main agent, and the state changes (rare!) that it performs are visible to itself as well as the main agent. The tools that it invokes are displayed similarly to the main agent loop, but they are visualized as calls within the tool.
I think this is a meaningful distinction, because it impacts control flow, regardless of what they are called. The lexicon varies quite a bit vendor to vendor.
https://google.github.io/adk-docs/agents/workflow-agents/
I haven't gotten there yet, still building out the basics like showing diffs instead of blind writing and supporting rewind in a session
In the past, the agent-type flows would work better if you prompted the LLM to write down a plan, or reasoning steps, on how to accomplish the task with the available tools. These days, the new models are trained to do this without prompting.
It has a ton of day 2 features, really nice abstractions, and positioned itself well in terms of the building blocks and constructing workflows.
ADK supports working with all the vendors and local LLMs
I thought the same thing about the artifact service, which could have a nice local FS option.
I'm pretty new to ADK, so we'll see how long the honeymoon phase lasts. Generally very optimistic that I found a solid foundation and framework
edit: opened an issue to track it
You will be happy you did
save it for more interesting tasks
For example, instead of a JSON schema in your prompt, use an OpenAPI subagent with API tools to keep your primary contexts clean
https://google.github.io/adk-docs/tools-custom/openapi-tools...
(Disclaimer: I work for AWS, but not for any team involved. Opinions are my own.)
This idea has been floating around in my head, but it wasn't refined enough to implement. It's so wild that what you're thinking of may have already been done by someone else on the internet.
https://huggingface.co/docs/smolagents/conceptual_guides/int...
For those who are wondering, it's kind of similar to the 'Code Mode' idea implemented by Cloudflare and now being explored by Anthropic: write code to discover and call MCPs instead of stuffing the context window with their definitions.
1. If your agent needs to write a lot of code, it's really hard to beat Claude Code (cc) / Agent SDK. We've tried many approaches and frameworks over the past 2 years (e.g. PydanticAI), but using cc is the first that has felt magic.
2. Vendor lock-in is a risk, but the bigger risk is having an agent that is less capable than what a user gets out of ChatGPT because you're hand-rolling every aspect of your agent.
3. cc is incredibly self aware. When you ask cc how to do something in cc, it instantly nails it. If you ask cc how to do something in framework xyz, it will take much more effort.
4. Give your agent a computer to use. We use e2b.dev, but Modal is great too. When the agent has a computer, it makes many complex features feel simple.
0 - For context, Definite (https://www.definite.app/) is a data platform with agents to operate it. It's like Heroku for data with a staff of AI data engineers and analysts.
For brownfield work, hard problems, or big complex codebases, you'll save yourself a lot of pain if you use Codex instead of CC.
Codex is stronger out of the box but properly customized Claude can't be matched at the moment
1. Poor long context performance compared to GPT5.1, so Claude gets confused about things when it has to do exploration in the middle of a task.
2. Claude is very completion driven, and very opinionated, so if your codebase has its own opinions Claude will fight you, and if there are things that are hard to get right, rather than come back and ask for advice, Claude will try to stub/mock it ("let's try a simpler solution...") which would be fine, except that it'll report that it completed the task as written.
focus on the tools and context, let claude handle the execution.
this will surely end up better than where big tech has already brought our current society...
For real though, where did the dreamers about ai / agentic free of the worst companies go? Are we in the seasons of capitulation?
My opinion... build, learn, share. The frameworks will improve, the time to custom agent will be shortened, the knowledge won't be locked in another unicorn
anecdotally, I've come quite far in just a week with ADK and VS Code extensions, having never done extensions before, which has been a large part of the time spent
I'd stay clear of any llm abstraction. There are so many companies with open source abstractions offering the panacea of a single interface that are crumbling under their own weight due to the sheer futility of supporting every permutation of every SDK evolution, all while the same companies try to build revenue generating businesses on top of them.
I’ve had good luck with using PydanticAI which does these core operations well (due to the underlying Pydantic library), but still struggles with too many MCP servers/Tools and composability.
I’ve built an open-source agent framework called OpusAgents that makes the process of creating agents, subagents, and tools simpler than MCP servers, without overloading the context. Check it out here, along with tutorials/demos showing how it’s more reliable than generic agents with MCP servers in Cursor/Claude Desktop - https://github.com/sathish316/opus_agents
It’s built on top of PydanticAI and FastMCP, so that all non-core operations of Agents are accessible when I need them later.
1. A way to create function tools
2. A way to create specialised subagents that can use their own tool or their own model. The main agent can delegate to subagent exposed as a tool. Subagents don’t get confused because they have their own context window, tools and even models (mix and match Remote LLM with Local LLM if needed)
3. Don’t use all tools of the MCP servers you’ve added. Filter out and select only the most relevant ones for the problem you’re trying to solve
4. HigherOrderTool is a way to callMCPTool(toolName, input) in places where the Agent to MCP interface can be better suited for the problem than what’s exposed as a generic interface by the MCP provider - https://github.com/sathish316/opus_agents/blob/main/docs/GUI... . This is similar to Anthropic’s recent blogpost on Code tools being better than MCP - https://www.anthropic.com/engineering/code-execution-with-mc...
5. MetaTool is a way to use ready made patterns like OpenAPI and not having to write a tool or add more MCP servers to solve a problem - https://github.com/sathish316/opus_agents/blob/main/docs/GUI... . This is similar to a recent HN post on Bash tools being better for context and accuracy than MCP - https://mariozechner.at/posts/2025-11-02-what-if-you-dont-ne...
Other than AgentBuilder, CustomTool, HigherOrderTool, MetaTool, SubagentBuilder the framework does not try to control PydanticAI’s main agent behaviour. The high level approach is to use fewer, more relevant tools and let LLM orchestration and prompt tool references drive the rest. This approach has been more reliable and predictable for a given Agent based problem.
Have you experimented with using a semantic cache on the chain of thought (what we get back from the providers anyway) and sending that to a dumb model for similar queries to “simulate” thinking?
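A sketch of what that could look like, purely illustrative: `embed` stands in for whatever embedding endpoint you use, and the 0.92 similarity threshold is an arbitrary assumption.

```python
# Cache chain-of-thought keyed by a query embedding; on a near-duplicate
# query, reuse the cached reasoning as context for a cheaper model instead of
# re-running the expensive "thinking" model.
import math

cot_cache: list[tuple[list[float], str]] = []   # (query embedding, cached CoT)

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def cached_cot(query: str, embed, threshold: float = 0.92) -> str | None:
    qv = embed(query)
    best = max(cot_cache, key=lambda item: cosine(qv, item[0]), default=None)
    if best is not None and cosine(qv, best[0]) >= threshold:
        return best[1]              # prepend this to the cheap model's prompt
    return None

def store_cot(query: str, cot: str, embed) -> None:
    cot_cache.append((embed(query), cot))
```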
If you have to use some of the client side SDKs, another good idea is to have a proxy where you can also add functionality without having to change the frontend.
tl;dr, Both sides are broadcasting messages and listening for the ones they care about.
My bet is that agent frameworks and platforms will become more like game engines. You can roll your own engine for sure, and it is fun and rewarding, but AAA studios will most likely decide to use a ready-to-go platform with all the batteries included.
The non-deterministic nature of LLMs already makes the performance of agents so difficult to interpret. Building agents on top of code that you cannot mentally trace through leads to so much frustration when addressing model underperformance and failure.
It's hard to argue that, after the dust settles, companies will default to batteries-included frameworks, but right now a lot of people I've seen have regretted adopting a large framework off the bat.
- Read: gather context (user input + tool outputs).
- Eval: LLM inference (decides: do I need a tool, or am I done?).
- Print: execute the tool (the side effect) or return the answer.
- Loop: feed the result back into the context window.
Rolling a lightweight implementation around this concept has been significantly more robust for me than fighting with the abstractions in the heavy-weight SDKs.
1. Sub-agents are just stack frames. When the main loop encounters a complex task, it "pushes" a new scope (a sub-agent with a fresh, empty context). That sub-agent runs its own REPL loop, returns only the clean result without any context pollution, and is then "popped".
2. Shared Data is the heap. Instead of stuffing "shared data" into the context window (which is expensive and confusing), I pass a shared state object by reference. Agents read/write to the heap via tools, but they only pass "pointers" in the conversation history. In the beginning this was just a Python dictionary and the "pointers" were keys.
My issue with the heavy SDKs isn't that they try to solve these problems, but that they often abstract away the state management. I’ve found that explicitly managing the "stack" (context) and "heap" (artifacts) makes the system much easier to debug.
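A stripped-down sketch of the "heap" idea; the helper names are made up for illustration, but the point is that agents pass only the short keys around in conversation history, and tools dereference them:

```python
# Large artifacts live in the "heap", keyed by short handles; only the handles
# ("pointers") ever enter the context window.
import uuid

heap: dict[str, object] = {}

def heap_put(value: object) -> str:
    key = f"ref-{uuid.uuid4().hex[:8]}"    # the pointer that goes into context
    heap[key] = value
    return key

def heap_get(key: str) -> object:
    return heap[key]

# A tool stores a big result and returns only the key to the agent:
#   key = heap_put(big_query_result)
# and a sub-agent later dereferences it through its own tool call:
#   rows = heap_get(key)
```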
So that's my point (and that of the article): it's not "just a loop", it quickly gets much more complicated than that. I haven't used any framework, so I can't tell if they're good or not; but for sure I ended up building my own. Calling tools in a loop is enough for a cool demo but doesn't work well enough for production.
I don't want to keep up with all the new model releases. I don't want to read every model card. I don't want to feel pressured to update immediately (if it's better). I don't want to run evals. I don't want to think about when different models are better for different scenarios. I don't want to build obvious/common subagents. I don't want to manage N > 1 billing entities.
I just want to work.
Paying an agentic coding company to do this makes perfect sense for me.
I'm curious about the solutions the op has tried so far here.
In general, a more generic eval setup is needed, with minimal requirements from AI engineers, if we want to move forward from vibes-based reliability engineering practices as a sector.
We believe you need to both automatically create the evaluation policies from OTEL data (data-first) and to bring in rigorous LLM judge automation from the other end (intent-first) for the truly open-ended aspects.
In my case I was until recently working on TTS and this was a huge barrier for us. We used all the common signal quality and MOS-simulation models that judged so called "naturalness" and "expressiveness" etc. But we found that none of these really helped us much in deciding when one model was better than another, or when a model was "good enough" for release. Our internal evaluations correlated poorly with them, and we even disagreed quite a bit within the team on the quality of output. This made hyperparameter tuning as well as commercial planning extremely difficult and we suffered greatly for it. (Notice my use of past tense here..)
Having good metrics is just really key and I'm now at the point where I'd go as far as to say that if good metrics don't exist it's almost not even worth working on something. (Almost.)
https://google.github.io/adk-docs/evaluate/
tl;dr - challenging because different runs produce different output, also how do you pass/fail (another LLM/agent is what people do)
One example is input/output types of function tools. Frameworks offer some flexibility, and seemingly I can use fundamental types and simple data structures (list, dict/map). But on the other hand, I know all data types are eventually stringified, and this has implications.
I have recently observed two issues when my agent calls a function that simply takes some int64 numeric IDs: (1) when the IDs are presented as hexadecimal in the context, the LLM attempts to convert them to decimal itself but messes it up because it doesn't really calculate; (2) some big IDs are not passed precisely in the Google ADK framework [1], presumably because its JSON serialization failed to keep the precision. I ended up changing the function to take string args instead. I also wasn't sure if the tool should return the data as close to the original as possible in a moderately deeply nested dict, or go a step further and organize the output in a more human-readable text format for model ingestion.
OpenAI's doc [2] writes: "A result must be a string, but the format is up to you (JSON, error codes, plain text, etc.). The model will interpret that string as needed." -- But that clearly contradicts the frameworks' capabilities and some official examples where dicts/numbers are returned.
[1] https://github.com/google/adk-python/issues/3592 [2] https://platform.openai.com/docs/guides/function-calling
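A sketch of the string-ID workaround described above (the tool and its toy record store are hypothetical): JSON numbers above 2^53 can't round-trip through float64, and the model is tempted to "convert" hex itself, so the tool accepts decimal strings and parses them on our side.

```python
# IDs travel as decimal strings so neither the model nor JSON float parsing
# can mangle them; conversion happens exactly once, in our code.
RECORDS = {9007199254740993: "example row"}   # 2**53 + 1: not representable as float64

def get_records_by_id(ids: list[str]) -> str:
    """ids: decimal ID strings, e.g. ["9007199254740993"].
    Returning a plain string keeps the output format fully under our control."""
    parsed = [int(s, 10) for s in ids]
    lines = [f"{i}: {RECORDS.get(i, 'not found')}" for i in parsed]
    return "\n".join(lines)
```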
https://github.com/wrale/mcp-server-tree-sitter
To talk about where it's _actually_ at:
Agentic IDEs have LSP support built in, which is better than just tree-sitter, such as Copilot in VS Code, which, contrary to what you might expect, can actually use arbitrary models and has BYOK.
There's also OpenCode from the CLI side etc. etc.
From the MCP side, there are large community efforts such as https://github.com/oraios/serena
> Claude Code doesn't use RAG currently. In our testing we found that agentic search out-performed RAG for the kinds of things people use Code for.
source thread: https://news.ycombinator.com/item?id=43163011#43164253
I would guess the biggest reason is that there is no RL happening on the base models with tree-sitter as a tool, but there is a lot of RL with bash, so it knows how to grep. I did experiment with giving tree-sitter and ast-grep to agents, and in my experience the results were mixed.
We ended up open sourcing that runtime if anyone is interested:
(1) LLM forgets to call a tool (and instead outputs plain text). Contrary to some of the comments here saying that these concerns will disappear as frontier models improve, there will always be a need for having your agent scaffolding work well with weaker LLMs (cost, privacy, etc).
(2) Determining when a task is finished. In some cases, we want the LLM to decide that (e.g search with different queries until desired info found), but in others, we want to specify deterministic task completion conditions (e.g., end the task immediately after structured info extraction, or after acting on such info, or after the LLM sees the result of that action etc).
After repeatedly running into these types of issues in production agent systems, we’ve added mechanisms for these in the Langroid[1] agent framework (I’m the lead dev), which has blackboard-like loop architecture that makes it easy to incorporate these.
For issue (1) we can configure an agent with a `handle_llm_no_tool` [2] set to a “nudge” that is sent back to the LLM when a non-tool response is detected (it could also be set as a lambda function to take other possible actions)
For issue (2) Langroid has a DSL[3] for specifying task termination conditions. It lets you specify patterns that trigger task termination, e.g.
- "T" to terminate immediately after a tool-call,
- "T[X]" to terminate after calling the specific tool X,
- "T,A" to terminate after a tool call, and agent handling (i.e. tool exec)
- "T,A,L" to terminate after tool call, agent handling, and LLM response to that
[1] Langroid https://github.com/langroid/langroid
[2] Handling non-tool LLM responses https://langroid.github.io/langroid/notes/handle-llm-no-tool...
[3] Task Termination in Langroid https://langroid.github.io/langroid/notes/task-termination/
https://langroid.github.io/langroid/reference/agent/tools/or...
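For comparison, here is a framework-agnostic sketch of both mechanisms (this is not Langroid's API, just the shape of the idea): nudge the model when it replies without a tool call, and let a caller-supplied predicate decide deterministic termination.

```python
# (1) nudge the LLM when it answers in plain text instead of calling a tool,
# (2) a caller-supplied predicate decides when the task is done.
NUDGE = "You must respond with a tool call, not plain text."

def run_task(call_model, tools, messages, is_done, max_turns=20):
    for _ in range(max_turns):
        reply = call_model(messages)                 # stand-in for an LLM call
        if "tool" not in reply:                      # issue (1): no tool call
            messages.append({"role": "system", "content": NUDGE})
            continue
        result = tools[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "name": reply["tool"],
                         "content": str(result)})
        if is_done(reply, result):                   # issue (2): deterministic stop
            return result
    raise RuntimeError("task did not terminate within max_turns")

# e.g. terminate immediately after a specific extraction tool has run:
# run_task(..., is_done=lambda reply, _: reply["tool"] == "extract_info")
```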
So I started to actually build something to solve most of my own problems reliably, pushing deterministic outputs with the help of LLMs (e.g., imagine finding the right columns/sheets in massive spreadsheets to create tool execution flows, and fine-tuning the selection of a range of data sources). My idea and solution, which has helped not only me but also quite a few business folks so far to fix and test agents, is to visualize flows, connect and extract data visually, and test and deploy changes in real time, while keeping it very close to static types and predictable output (unlike, e.g., llama flow).
Would love to hear your thoughts about my approach: https://llm-flow-designer.com
What they want is to get things done. The model is simply a means to an end. As long as the task is completed, everything else is secondary.
Tbh I agree with your approach and use it when building stuff for myself. I'm so over yak shaving that topic.
So for now the choice is, "all in one for great prototypes and better hope it has everything you need" or roll your own.
If someone knows of a library that's good for quick prototypes and malleable for power users please share.
Better to step back and just look at what exists for what it is? Strive for a less biased take.
The pieces are rapidly changing, non-standard, fragmented, and exploratory. If true, jumping in involves risks from chaos, churn, rushing, unprovenness, etc.
There is hype around AI and agentic dev. That's not subjective. My opinion is that it's a valid factor to consider when you evaluate tech. We've seen with, e.g., microservices that people followed the hype because it seemed sound, but it was eventually terrible for many. Not everything is for everyone just because it's there.
Can people explore things they care about? Yes. Can I have an opinion about what they do if they make a public post about it. Yes.
I don’t see how this is true. By what metric might “most” lead to a true statement?
The post expresses very little hype, if any. It is predominantly an account of trying to make agents work.
The above comment also seems dismissive. I suggest giving the author some credit. The article is much more valuable than most of what I read on AI from various pundits who seem to repeat the same narrative. Thankfully, this article teaches without preaching or positing some predefined narrative.
Caching is unrelated to memory, it's about how to not do the same work over and over again due to the distributed nature of state. I wrote a post that goes into detail from first principles here with my current thoughts on that topic [1].
> Are these things really that forgetful?
No, not really. They can get side-tracked which is why most agents do a form of reinforcement in-context.
> Why is there a virtual file system?
So that you don't have dead-end tools. If a tool creates or manipulates state (which we represent on a virtual file system), another tool needs to be able to pick up the work.
> Why can't the agent just know where the data is?
And where would that data be?
I’ve found that by running Claude Code within Manus sandbox I can inspect the reasoning traces and edit the Agents/Skills with the Manus agent.
I wonder what this means for the agents that people are deploying into production? Are they tested at all? Or just manual ad-hoc testing?
Sounds risky!
> Sounds risky!
One of my first attempts at building file system tools for my custom agent called `tree` and caught a few node_modules directories. It blew up my context and cost me $5 in 60s. Fortunately I triggered the TPM rate limit and the thing stopped.
I’ve encountered a number of errors dealing with LLMs so would be wary of the results.
I also think there’d be an incentive to enshittify by having vendors pay to get preferential prioritization from the LLM. This could result in worse products being delivered for higher prices.
Where are you getting this impression?
Having used AI quite extensively I have come to realize that to get quality outputs, it needs quality inputs... this is especially true with general models. It seems to me that many developers today just start typing without planning. That may work to some degree for humans, but AI needs more direction. In the early 2000s, Rational Unified Process (RUP) was all the rage. It gave way to Agile approaches, but now, I often wonder if we didn't throw the baby out with the bath water. I would wager that any AI model could produce high-quality code if provided with even a light version of RUP documentation.
But I was actually playing with a few frameworks yesterday and struggling--I want what I want without having to write it. ;). Ended up using the pydantic_ai package; I literally just want tools with pydantic validation, but out of the box it doesn't have good observability (you would have to use their proprietary SaaS), and it comes bundled with Temporal.io (I hate that project). I had to write my own observability, which was annoying, and it sucks.
If anyone has any things they've built, I would love to know, and TypeScript is an option. I want:
- ReAct agent with tools that have schema validation
- built in REALTIME observability w/ WebUI
- customizable playground ChatUI (this is where TypeScript would shine)
- no corporate takeover tentacles
P.P.S.: I know... I purposely try to avoid hard recommendations on HN, to avoid enshittification. "reddit best X" has been gamed. And I generally skip these subtle promotional posts.
The Logfire SDK that Pydantic AI uses is a generic OTel client that defaults to sending data to Pydantic Logfire, our SaaS observability platform: https://pydantic.dev/logfire, but can easily be pointed at any OTel sink: https://ai.pydantic.dev/logfire/#logfire-with-an-alternative...
Temporal is one of multiple durable execution solutions (https://ai.pydantic.dev/durable_execution/overview/) we support and its SDK is indeed included by default in the "fat" `pydantic-ai` package, as are the SDKs for all model providers. There's also a `pydantic-ai-slim` package that doesn't include any optional dependencies: https://ai.pydantic.dev/install/#slim-install
One of the things he talked about was issues with reliable tool calling by the model. I recommend he try the following approach. Have the agent perform a self-calibration exercise that makes it use its tools at the beginning of the context. Make it perform some complex tasks. Do this many times to test for coherence and accuracy while adjusting the system prompt towards more accurate tool calls. Once the agent has performed that calibration process successfully, you "freeze" that calibration context/history by broadening the --keep n to include not just the system prompt in the rolling window but also everything up to the end of the calibration session. Then, no matter how far the context window drifts, the conditioned tokens generated by that calibration session steer the agent towards proper tool use. From then on, your "system prompt" includes those turns. Note that this is probably not possible with cloud-based models, as you don't have direct access to the inference engine. A hacky way around that is to emulate the conversation turns inside the system prompt.
On the note of benchmarks: the calibration test is your benchmark from then on. When introducing new tools to the system prompt or adjusting any important variable, you must always rerun the same test to make sure the new adjustments don't negatively affect system stability.
On context engineering: that is a must, as a bloated history will decohere the whole system. So it's important to devise an automated system that compresses the context down but retains the overall essence of the history. There are about a billion ways you could do this, and you will need to experiment a lot. LLMs are conditioned quite heavily by their own outputs, so having the ability to remove errored tool calls from the context is a big boon, as the model is then less likely to repeat the same mistakes. There are trade-offs, though: like he said, caching is a no-go when going this route, but you gain a lot more control and stability within the whole system if you do it right. It's basically reliability vs. cost here, and I tend to lean towards reliability.

Also, I don't recommend using the model's whole context size. Most LLMs perform very poorly past a certain amount, and I find that using a maximum of 50% of the whole context window works best for cohesion. Meaning that if, say, the max context window is 100k tokens, treat 50k as the max limit and start compressing the history around 35k tokens. A granular, step-wise system can be set up, where the most recent context is most detailed and uncompressed, and the further it gets from the present, the less detailed it becomes. Obviously you want to store the full uncompressed history for a subagent that uses RAG, so the agent can see the whole thing in detail if it finds the need to.
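A sketch of that granular, step-wise compression, under the assumptions that `summarize` is a stand-in for an LLM summarization call and that character counts roughly approximate tokens:

```python
# Keep the most recent turns verbatim; summarize older ones more aggressively
# the further back they are; drop the oldest summaries if still over budget.
def compress_history(history: list[str], summarize,
                     budget_chars: int = 50_000, keep_recent: int = 10) -> list[str]:
    recent = history[-keep_recent:]                    # most recent turns: full detail
    older = history[:-keep_recent]
    compressed = []
    for i, turn in enumerate(older):
        # the further back a turn is (smaller i), the shorter its summary
        target = 100 + int(400 * i / max(len(older), 1))
        compressed.append(summarize(turn, max_chars=target))
    out = compressed + recent
    while sum(len(t) for t in out) > budget_chars and len(out) > len(recent):
        out.pop(0)                                     # never drop the recent turns
    return out
```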
Ah, also on the matter of output: I found great success with making input and output channels for my agent. There are several channels the model is conditioned to use for specific interactions: a <think> channel for CoT and reasoning, a <message_to_user> channel for explicit messages to the user, a <call_agent> channel for calling agents and interacting with them, and <call_tool> for tool use, plus a few other environment and system channels that act as input channels from error scripts and the environment towards the agent. This channel segmentation also allows for better management of internal automated scripts and focuses the model. One more important thing: you need at least two separate output layers, meaning you need to separate your LLM outputs from what is displayed to the user, each with its own rules. That lets you display information in a very human-readable way to the real human, hiding all the noise, while retaining the crucial context that's needed for the model to function appropriately.
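A sketch of routing those channels; the tag names follow the convention above and are just that, a convention (a real setup would also handle malformed or missing tags):

```python
# Parse channel-tagged model output and route each channel differently; only
# <message_to_user> content reaches the human-facing layer.
import re

CHANNEL_RE = re.compile(
    r"<(think|message_to_user|call_agent|call_tool)>(.*?)</\1>", re.S)

def route_output(raw: str) -> dict[str, list[str]]:
    channels: dict[str, list[str]] = {}
    for name, body in CHANNEL_RE.findall(raw):
        channels.setdefault(name, []).append(body.strip())
    return channels

# out = route_output(model_reply)
# print("\n".join(out.get("message_to_user", [])))   # user-facing layer only
# the full `out` (think, tool calls, etc.) stays in the model-facing history
```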
Bah, anyways, I rambled for long enough. Good luck folks, hope this info helps someone.