So the model won’t “understand” that you have a skill and use it. The generation of the text that would trigger the skill usage is made via Reinforcement Learning with human generated examples and usage traces.
So why don’t the model use skills all the time? Because it’s a new thing, there is not enough training samples displaying that behavior.
They also cannot enforce that via RL because skills use human language, which is ambiguous and not formal. Force it to use skills always via RL policy and you’ll make the model dumber.
So, right now, we are generating usage traces that will be used to train the future models to get a better grasp of when to use skills not. Just give it time.
AGENTS.md, on the other hand, is context. Models have been trained to follow context since the dawn of the thing.
The skills frontmatter end up in context as well.
If AGENTS.md outperform skills in a given agent, it is down to specifically how the skills frontmatter is extracted and injected into the context, because that is the only difference between the two approaches.
EDIT: I haven't tried to check this so this is pure speculation, but I suppose there is the possibility that some agents might use a smaller model to selectively decide what skills frontmatter to include in context for a bigger model. E.g. you could imagine Claude passing the prompt + skills frontmatter to Haiku to selectively decide what to include before passing to Sonnet or Opus. In that case, depending on approach, putting it directly in AGENTS.md might simply be a question of what information is prioritised in the ouput passed to the full model. (Again: this is pure speculation of a possible approach; though it is one I'd test if I were to pick up writing my own coding agent again)
But really the overall point is that AGENTS.md vs. skills here still is entirely a question of what ends up in the "raw" context/prompt that gets passed to the full model, so this is just nuance to my original answer with respect to possible ways that raw prompt could be composed.
Hence the submission's conclusion:
> Our working theory [for why this performs better] comes down to three factors.
> No decision point. With AGENTS.md, there's no moment where the agent must decide "should I look this up?" The information is already present.
> Consistent availability. Skills load asynchronously and only when invoked. AGENTS.md content is in the system prompt for every turn.
> No ordering issues. Skills create sequencing decisions (read docs first vs. explore project first). Passive context avoids this entirely.
The point remains: That is still just down to how you compose the context/prompt that actually goes to the model.
Nothing stops an agent from including logic to inline the full set of skills if the context is short enough. The point of skills is to provide a mechanism for managing context to reduce the need for summarization/compaction or explicit management, and so allowing you to e.g. have a lot of them available.
(And this kind of makes the article largely moot - it's slightly neat to know it might be better to just inline the skills if you have few enough that they won't seriously fill up your context, but the main value of skills comes when you have enough of them that this isn't the case)
Conversely, nothing prevents the agent from using lossy processing with a smaller, faster model on AGENTS.md either before passing it to the main model e.g. if context is getting out of hand, or if the developer of a given agent think they have a way of making adherence better by transforming them.
These are all tooling decisions, not features of the models.
They're very useful, but as we all know - they're far from infallible.
We're probably plateauing on the improvement of the core GPT technology. For these models and APIs to improve, it's things like Skills that need to be worked on and improved, to reduce those mistakes that it makes and produce better output.
So it's pretty disappointing to see that the 'Skills' feature set as implemented, as great of a concept as it is, is pretty bogus compared to just front loading the AGENTS.md file. This is not obvious and valuable to know.
The agent passes the Turing test...
2) Bots replying to those posts,
3) Bots asking whether the bots in #2 even read TFA, and finally
4) Bots posting the HN guideline where it says you shouldn’t ask people whether they have read TFA.
…And amid the smouldering ruins of civilization, the last human, dang, will be there, posting links to all the times this particular thing has been posted to HN before.
But seriously, this is my main answer to people telling me AI is not reliable: "guess what, most humans are not either, but at least I can tell AI to correct course and it's ego won't get in the way of fixing the problem".
In fact, while AI is not nearly as a good as a senior dev for non trivial tasks yet, it is definitely more reliable than most junior devs at following instructions.
Whereas a junior might be reluctant at first, but if they are smart they will learn and get better.
So maybe LLM are better than not-so-smart people, but you usually try to avoid hiring those people in the first place.
And even if the models themselves for some reason were to never get better than what we have now, we've only scratched the surface of harnesses to make them better.
We know a lot about how to make groups of people achieve things individual members never could, and most of the same techiques work for LLMs, but it takes extra work to figure out how to most efficiently work around limitations such as lack of integrated long-term memory.
A lot of that work is in its infancy. E.g. I have a project I'm working on now where I'm up to a couple of dozens of agents, and ever day I'm learning more about how to structure them to squeeze the most out of the models.
One learning that feels relevant to the linked article: Instead of giving an agent the whole task across a large dataset that'd overwhelm context, it often helps to have an agent - that can use Haiku, because it's fine if its dumb - comb the data for <information relevant to the specific task>, and generate a list of information, and have the bigger model use that as a guide.
So the progress we're seeing is not just raw model improvements, but work like the one in this article: Figuring out how to squeeze the best results out of any given model, and that work would continue to yield improvements for years even if models somehow stopped improving.
Humans are reliably unreliable. Some are lazy, some sloppy, some obtuse, some all at once. As a tech lead you can learn their strengths and weaknesses. LLMs vacillate wildly while maintaining sycophancy and arrogance.
Human egos make them unlikely to admit error, sometimes, but that fragile ego also gives them shame and a vision of glory. An egotistical programmer won’t deliver flat garbage for fear of being exposed as inferior, and can be cajoled towards reasonable output with reward structures and clear political rails. LLMs fail hilariously and shamelessly in indiscriminate fashion. They don’t care, and will happily argue both sides of anything.
Also that thing that LLMs don’t actually learn. You can threaten to chop their fingers off if they do something again… they don’t have fingers, they don’t recall, and can’t actually tell if they did the thing. “I’m not lying, oops I am, no I’m not, oops I am… lemme delete the home directory and see if that helps…”
If we’re going to make an analogy to a human, LLMs reliably act like absolute psychopaths with constant disassociation. They lie, lie about lying, and lie about following instructions.
I agree LLMs better than your average junior first time following first directives. I’m far less convinced about that story over time, as the dialog develops more effective juniors over time.
E.g. Claude gets "bored" easily (it will even tell you this if you give it too repetitive tasks). The solution is simple: Since we control context and it has no memory outside of that, make it pretend it's not doing repetitive tasks by having the top agent "only" do the task of managing and sub-dividing the task, and farm out each sub-task to a sub-agent who won't get bored because it only sees a small part of the problem.
> Also that thing that LLMs don’t actually learn. You can threaten to chop their fingers off if they do something again… they don’t have fingers, they don’t recall, and can’t actually tell if they did the thing. “I’m not lying, oops I am, no I’m not, oops I am… lemme delete the home directory and see if that helps…”
No, like characters in a "Groundhog Day" scenario they also doesn't remember and change their behaviour while you figure out how to get them to do what you want, so you can test and adjust and find what makes them do what you want and it, and while not perfectly deterministic, you get close.
And unlike humans, sometimes the "not learning" helps us address other parts of the problem. E.g. if they learned, the "sub-agent trick" above wouldn't work, because they'd realise they were carrying out a bunch of tedious tasks instead of remaining oblivious that we're letting them forget in between each.
LLMs in their current form need harnesses, and we can - and are - learning which types of harnesses work well. Incidentally, a lot of them do work on humans too (despite our pesky memory making it harder to slip things past us), and a lot of them are methods we know of from the very long history of figuring out how to make messy, unreliable humans adhere to processes.
E.g. to go back to my top example of getting adherence to a boring, reptitive task: Create checklists, subdivide the task with individual reporting gates, spread it across a team if you can, put in place a review process (with a checklist). All of these are techniques that work both on human teams and LLMs to improve process adherence.
It's barely readable to humans, but directly and efficiently relevant to LLM's (direct reference -> referent, without language verbiage).
This suggests some (compressed) index format that is always loaded into context will replace heuristics around agents.md/claude.md/skills.md.
So I would bet this year we get some normalization of both the indexes and the referenced documentation (esp. matching terms).
Possibly also a side issue: API's could repurpose their test suites as validation to compare LLM performance of code tasks.
LLM's create huge adoption waves. Libraries/API's will have to learn to surf them or be limited to usage by humans.
> "Explore project first, then invoke skill" [produces better results than] "You MUST invoke the skill".
I recently tried to get Antigravity to consistently adhere to my AGENTS.md (Antigravity uses GEMINI.md). The agent consistently ignored instructions in GEMINI.md like:- "You must follow the rules in [..]/AGENTS.md"
- "Always refer to your instructions in [..]/AGENTS.md"
Yet, this works every time: "Check for the presence of AGENTS.md files in the project workspace."
This behavior is mysterious. It's like how, in earlier days, "let's think, step by step" invoked chain-of-thought behavior but analogous prompts did not.
+------------------+------------------------------------------------------+
| Success/Attempts | Instructions |
+------------------+------------------------------------------------------+
| 0/3 | Follow the instructions in AGENTS.md. |
+------------------+------------------------------------------------------+
| 3/3 | I will follow the instructions in AGENTS.md. |
+------------------+------------------------------------------------------+
| 3/3 | I will check for the presence of AGENTS.md files in |
| | the project workspace. I will read AGENTS.md and |
| | adhere to its rules. |
+------------------+------------------------------------------------------+
| 2/3 | Check for the presence of AGENTS.md files in the |
| | project workspace. Read AGENTS.md and adhere to its |
| | rules. |
+------------------+------------------------------------------------------+
In this limited test, seems like the first person makes a difference.I'm realising I need to start setting up an automated test-suite for my prompts...
Obviously directly including context in something like a system prompt will put it in context 100% of the time. You could just as easily take all of an agent's skills, feed it to the agent (in a system prompt, or similar) and it will follow the instructions more reliably.
However, at a certain point you have to use skills, because including it in the context every time is wasteful, or not possible. this is the same reason anthropic is doing advanced tool use ref: https://www.anthropic.com/engineering/advanced-tool-use, because there's not enough context to straight up include everything.
It's all a context / price trade off, obviously if you have the context budget just include what you can directly (in this case, compressing into a AGENTS.md)
How do you suppose skills get announced to the model? It's all in the context in some way. The interesting part here is: Just (relatively naively) compressing stuff in the AGENTS.md seems to work better than however skills are implemented.
They also didn't bother with any more "explanation" beyond "here are paths for docs."
But this straightforward "here are paths for docs" produced better results, and IMO it makes sense since the more extra abstractions you add, the more chance of a given prompt + situational context not connecting with your desired skill.
That seems roughly equivalent to my unenlightened mind!
2. I've actually done these things, seen the difference, and share it with others
Yes there are a lot of unknowns and a lot of people speaking from ignorance, but it is a mistake, perhaps even bigotry by definition, to make such blanket statements and judgemental about people
> If these aren't signs of a bubble, I don't know what is.
This conclusion is incoherent and doesn't follow from any of your premises.
This is very similar to speculative valuations around the web in the late 90s, except this bubble is far larger, more mainstream and personal.
The fact that this is a debate about which Markdown file to put prompt information in is wild. It ultimately all boils down to feeding context to the model, which hasn't fundamentally changed since 2022.
If your agent isn’t being used, it’s not as simple as “agents aren’t getting called”. You have to figure out how to get the agent invoked.
I really only think the game is worth playing when it's against a fixed version of a specific model. The amount of variance we observe between different releases of the same model is enough to require us to update our prompts and re-test. I don't envy anyone who has to try and find some median text that performs okay on every model.
Just maintaining a basic interaction framework, consistent behaviours in chat when starting up, was a daily whack-a-mole where well-tested behaviours shift and alter without rhyme or reason. “Model whispering” is right. Subjectively it felt like I could feel Anthropic/OpenAI engineers twiddling dials on the other side.
Writing code that executes the same every time has some minor benefits.
Having an agent manage its own context ends up being extraordinarily useful, on par with the leap from non-reasoning to reasoning chats. There are still issues with memory and integration, and other LLM weaknesses, but agents are probably going to get extremely useful this year.
And how do you guarantee that said relevant things actually get put into the context?
OP is about the same problem: relevant skills being ignored.
I think Vercel mixes skills and context configuration up. So the whole evaluation is totally misleading because it tests for two completely different use cases.
To sum it up: Vercel should us both files, agents.md is combination with skills. Both functions have two totally different purposes.
1. You absolutely want to force certain context in, no questions or non-determinism asked (index and sparknotes). This can be done conditionally, but still rule based on the files accessed and other "context"
2. You want to keep it clean and only provide useful context as necessary (skills, search, mcp; and really a explore/query/compress mechanism around all of this, ralph wiggum is one example)
Instead it’s a problem when you’re part of a team and you’re using skills for standards like code style or architectural patterns. You can’t ask everyone to constantly update their system prompt.
Claude skill adherence is very low.
Which makes sense.
& some numbers that prove that.
The article also doesn't mention that they don't know how the compressed index output quality. That's always a concern with this kind of compression. Skills are just another, different kind of compression. One with a much higher compression rate and presumably less likely to negatively influence quality. The cost being that it doesn't always get invoked.
In Claude Code you can invoke an agent when you want as a developer and it copies the file content as context in the prompt.
I ran their tool with an otherwise empty CLAUDE.md, and ran `claude /context`, which showed 3.1k tokens used by this approach (1.6% of the opus context window, bit more than the default system prompt. 8.3% is system tools).
Otherwise it's an interesting finding. The nudge seems like the real winner here, but potential further lines of inquiry that would be really illuminating: 1. How do these approaches scale with model size? 2. How are they impacted by multiple such clauses/blocks? Ie maybe 10 `IMPORTANT` rules dilute their efficacy 3. Can we get best of both worlds with specialist agents / how effective are hierarchical routing approaches really? (idk if it'd make sense for vercel specifically to focus on this though)
I expect the benefit is from better Skill design, specifically, minimizing the number of steps and decisions between the AI’s starting state and the correct information. Fewer transitions -> fewer chances for error to compound.
1. Those I force into the system prompt using rules based systems and "context"
2. Those I let the agent lookup or discover
I also limit what gets into message parts, moving some of the larger token consumers to the system prompt so they only show once, most notable read/write_file
Create a folder called .context and symlink anything in there that is relevant to the project. For example READMEs and important docs from dependencies you're using. Then configure your tool to always read .context into context, just like it does for AGENTS.md.
This ensures the LLM has all the information it needs right in context from the get go. Much better performance, cheaper, and less mistakes.
Cheaper because it has the right context from the start instead of faffing about trying to find it, which uses tokens and ironically bloats context.
It doesn't have to be every bit of documentation, but putting the most salient bits in context makes LLMs perform much more efficiently and accurately in my experience. You can also use the trick of asking an LLM to extract the most useful parts from the documentation into a file, which you then re-use across projects.
Yes, and this file becomes: also documentation. I didn’t mean throw entire unabridged docs at it, I should’ve been more clear. All of my docs for agents are written by agents themselves. Either way once the project becomes sufficiently complex it’s just not going to be feasible to add a useful level of detail of every part of it into context by default, the context window will remain fixed as your project grows. You will have to deal with this limit eventually.
I DO include a broad overview of the project in Agents or Claude.md by default, but have supplemental docs I point the agent to when they’re working on a particular aspect of the project.
Sounds like we are working on different types of projects. I avoid complexity at almost all cost and ruthlessly minimise LoC and infrastructure. I realise that's a privilege, and many programmers can't.
What's actually useful is to put the source code of your dependencies in the project.
I have a `_vendor` dir at the root, and inside it I put multiple git subtrees for the major dependencies and download the source code for the tag you're using.
That way the LLM has access to the source code and the tests, which is way more valuable than docs because the LLM can figure out how stuff works exactly by digging into it.
Their approach is still agentic in the sense that the LLM must make a tool cool to load the particular doc in. The most efficient approach would be to know ahead of time which parts of the doc will be needed, and then give the LLM a compressed version of those docs specifically. That doesn't require an agentic tool call.
Of course, it's a tradeoff.
LLM's have always been at any time limited in the amount of tokens it can process at one time. This is increasing, but one problem is chat threads continually increase in size as you send messages back and forth because within any session or thread you are sending the full conversation to the LLM every message (aside from particular optimizations that compact or prune this). This also increases costs which are charged per token. Efficiency of cost and performance/precision/accuracy dictates using the context window judiciously.
You don’t want to be burning tokens and large files will give diminishing returns as is mentioned in the Claude Code blog.
1. Start from the Claude Code extracted instructions, they have many things like this in there. Their knowledge share in docs and blog on this aspect are bar none
2. Use AGENTS.md as a table of contents and sparknotes, put them everywhere, load them automatically
3. Have topical markdown files / skills
4. Make great tools, this is still opaque in my mind to explain, lots of overlap with MCP and skills, conceptually they are the same to me
5. Iterate, experiment, do weird things, and have fun!
I changed read/write_file to put contents in the state and presented in the system prompt, same for the agents.md, now working on evals to show how much better this is, because anecdotally, it kicks ass
> If you think there is even a 1% chance a skill might apply to what you are doing, you ABSOLUTELY MUST invoke the skill. IF A SKILL APPLIES TO YOUR TASK, YOU DO NOT HAVE A CHOICE. YOU MUST USE IT.
While this may result in overzealous activation of skills, I've found that if I have a skill related, I _want_ to use it. It has worked well for me.
works pretty well
I need to evaluate how do different project scaffolding impacts the results of Claude Code/Opencode (either with Anthropic models or third party) for agentic purpose.
But I am unsure on how should I be testing and it's not very clear how did Vercel proceeded here.
It’s really silly to waste big model tokens on throat clearing steps
Basically use a small model up front to efficiently trigger the big model. Sub agents are at best small models deployed by the bigger model (still largely manually triggered in most workflows today)
What a wonderful world that would be.
The first thing that surprising to me is how much the default tuning are leaned toward laudative stances, the user is always absolutely right, what was done is solving everything expected. But actually no, not a single actual check was done, a tone of code was produced but the goal is not at all achieved and of course many regressions now lure in the code base, when it's not straight breaking everything (which is at least less insidious).
The thing that is surprising to me, is that it can easily drop thousands of lines of tests, and then it can be forced to loop over these tests until it succeed. In my experiments it still drop far too much noise code, but at least the burden of checking if it looks like it makes any sense is drastically reduced.
And I have been trying to improve the framework and abstractions/types to reduce the lines of code required for LLMs to create features in my web app.
Did the LLM really needed to spit 1k lines for this feature? Could I create abstractions to make it feasible in under 300 lines?
Of course there's cost and diminishing returns to abstractions so there are tradeoffs.
These things are non-deterministic across multiple axes.
Regularly the skills were not being loaded and thus not utilised. The outputs themselves were fine. This suggested that at some stage through the improvements of the models that baseline AGENTS.md had become redundant.
We've found that even with optimal context loading - whether that's AGENTS.md, skills, or whatever - you still get wild variance in outcomes. Same task, same context, different day, different results. The model's having a bad afternoon. The tool API is slow. Rate limits bite you. Something in the prompt format changed upstream.
The context problem is solvable with engineering. The reliability problem requires treating your agent like a distributed system: canary paths, automatic failover, continuous health checks. Most of the effort in production agents isn't "how do I give it the right info?" It's "how do I handle when things work 85% of the time instead of 95%?"
Why are you doing this? Karma? 8 years old account and first post 3 days ago is a Show HN shilling your "AI agent" SaaS with a boatload of fake comments? [1]
Pinging tomhow
It's not realistic to read the other post to a significant degree, think about it, and then type all of this:
> The prompt injection concerns are valid, but I think there's a more fundamental issue: agents are non-deterministic systems that fail in ways that are hard to predict or debug. Security is one failure mode. But "agent did something subtly wrong that didn't trigger any errors" is another. And unlike a hacked system where you notice something's off, a flaky agent just... occasionally does the wrong thing. Sometimes it works. Sometimes it doesn't. Figuring out which case you're in requires building the same observability infrastructure you'd use for any unreliable distributed system.
> The people running these connected to their email or filesystem aren't just accepting prompt injection risk. They're accepting that their system will randomly succeed or fail at tasks depending on model performance that day, and they may not notice the failures until later.
Within 35 seconds of posting this one. And it just happens to have all LLM hallmarks there are. We both know it, you're on HN, people here aren't fools.
Green accounts probably bc I sent my post to some friends and users directly when I made it. Is that illegal on HN? I legit don't know how things work here. I was excited over my launch post.
Anyways, not a fucking bot, my company is real, the commenters on my post are real and if it's a crime for me to rapid fire post and/or have friends comment on my Show HN, good to know.
If your goal is to always give a permanent knowledge base to your agent that's exactly what AGENTS.md is for...
I have a SKILL.md for marimo notebooks with instructions in the frontmatter to always read it before working with marimo files. But half the time Claude Code still doesn't invoke it even with me mentioning marimo in the first conversation turn.
I've resorted to typing "read marimo skill" manually and that works fine. Technically you can use skills with slash commands but that automatically sends off the message too which just wastes time.
But the actual concept of instructions to load in certain scenarios is very good and has been worth the time to write up the skill.
Skills are still very much relevant on big and diverse projects.
There is a lot of language floating around what effectively groups of text files put together in different configurations, or selected reliably.
Skills are new. Models haven't been trained on them yet. Give it 2 months.
It's a difference of "choose whether or not to make use of a skill that would THEN attempt to find what you need in the docs" vs. "here's a list of everything in the docs that you might need."
But switching over to using coding agents we never did the same. Feels like building an eval set will be an important part of what engg orgs do going forward.
With explicit skills, you can add new capabilities modularly - drop in a new skill file and the agent can use it. With a compressed blob, every extension requires regenerating the entire instruction set, which creates a versioning problem.
The real question is about failure modes. A skill-based system fails gracefully when a skill is missing - the agent knows it can't do X. A compressed system might hallucinate capabilities it doesn't actually have because the boundary between "things I can do" and "things I can't" is implicit in the training rather than explicit in the architecture.
Both approaches optimize for different things. Compressed optimizes for coherent behavior within a narrow scope. Skills optimize for extensibility and explicit capability boundaries. The right choice depends on whether you're building a specialist or a platform.
I have a skill in a project named "determine-feature-directory" with a short description explaining that it is meant to determine the feature directory of a current branch. The initial prompt I provide will tell it to determine the feature directory and do other work. Claude will even state "I need to determine the feature directory..."
Then, about 5-10% of the time, it will not use the skill. It does use the skill most of the time, but the low failure rate is frustrating because it makes it tough to tell whether or not a prompt change actually improved anything. Of course I could be doing something wrong, but it does work most of the time. I miss deterministic bugs.
Recently, I stopped Claude after it skipped using a skill and just said "Aren't you forgetting something?". It then remembered to use the skill. I found that amusing.
I guess you need to make sure your file paths are self-explanatory and fairly unique, otherwise the agent might bring extra documentation into the context trying to find which file had what it needed?
Just create an MCP server that does embedding retrieval or agentic retrieval with a sub agent on your framework docs.
Finally add an instruction to AGENT.md to look up stuff using that MCP.
TFA says they added an index to Agents.md that told the agent where to find all documentation and that was a big improvement.
The part I don't understand is that this is exactly how I thought skills work. The short descriptions are given to the model up-front and then it can request the full documentation as it wants. With skills this is called "Progressive disclosure".
Maybe they used more effective short descriptions in the AGENTS.md than they did in their skills?
Which is why I use a skill that is a command, that routes requests to agents and skills.
*You are the Super Duper Database Master Administrator of the Galaxy*
does not improve the model ability reason about databases?
they used prisma to handle their database interactions. they preached tRPC and screamed TYPE SAFETY!!!
you really think these guys will ever again touch the keyboard to program? they despise programming.