Backpressure != feedback (the more general term). In the agentic world, we use the term 'context' for information supplied to help an LLM make decisions, where the context data is not part of the LLM's training data. Then we have verifiable tasks (what he is really talking about), where RL is used in post-training, inside a harness environment, with feedback signals so the model learns about type systems, programming language syntax/semantics, etc.
Context is also a misnomer; in fact it's just part of the prompt.
Prompt itself is also a misnomer; in fact it's just part of the model input.
Model input is also a misnomer; in fact it's just the first input token plus a prefill that the model continues with more output.
Harness is also a misnomer; it's just scaffolding/tools around the model's input/output.
I'm not trying to discount any attempt to correct people, especially when things get confusing (honestly, I was confused here too), but we could formulate it more nicely IMHO.
Every time you find a runtime bug, ask the LLM if a static lint rule could be turned on to prevent it, or have it write a custom rule for you. Very few of us have time to deep-dive into esoteric custom-rule configuration, but now it's easy. Bonus: the error message for the custom rule can be very specific about how to fix the error, including pointing to documentation that explains entire architectural principles, concurrency rules, etc. Stuff that is very tailored to your codebase and far more precise than a generic compiler/lint error.
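To make that concrete, here's a minimal sketch of such a rule, assuming a TypeScript codebase using ESLint. The layer names ("ui", "db") and the doc path in the message are made up; the point is that the message encodes your architecture, not a generic lint complaint.

```ts
// Hypothetical ESLint rule: forbid UI code from importing the DB layer directly,
// and point the offender at the architecture doc that explains the boundary.
import type { Rule } from "eslint";

const rule: Rule.RuleModule = {
  meta: {
    type: "problem",
    docs: { description: "UI code must go through the service layer, not the DB layer" },
    messages: {
      noDbFromUi:
        "Don't import '{{source}}' from UI code. Call the service layer instead; " +
        "see docs/architecture.md#layering for the approved pattern.",
    },
    schema: [],
  },
  create(context) {
    return {
      ImportDeclaration(node) {
        const source = node.source.value;
        // Assumed project layout: src/ui/** must not reach into src/db/**.
        if (
          typeof source === "string" &&
          /(^|\/)db(\/|$)/.test(source) &&
          context.getFilename().includes("/ui/")
        ) {
          context.report({ node, messageId: "noDbFromUi", data: { source } });
        }
      },
    };
  },
};

export default rule;
```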
E.g. compiler errors, unit tests, MCP, etc.
I've heard of these but haven't tried them yet.
https://github.com/hmans/beans
https://github.com/steveyegge/gastown
Right now I spend a lot of “back pressure” on fitting the scope of the task into something that fits in one context window (i.e. the useful computation, not the raw token count). I suspect we will see a large breakthrough when someone finally figures out a good system for having the LLM do this.
I've found https://github.com/obra/superpowers very helpful for breaking the work up into logical chunks a subagent can handle.
What we often like to do in a PR - look over the code and say "LGTM" - is what I call "vibe testing", and I think it's the really bad pattern to use with AI. You can't commit your eyes to the git repo, and you're probably not doing as good a job as when you have actual test coverage. LGTM is just vibes. Automating tests removes manual work from you too; it doesn't just make the agent more reliable.
But my metaphor for tests is that they are the "skin" of the agent: they let it feel pain. The docs/specs are the "bones": they give it structure. The agent itself is the muscle and cerebellum, and the human in the loop is the PFC.
Most of my feedback that can be automated is done either by this or by fuzzing. Would love to hear about other optimisations y'all have found.
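For the fuzzing side, here's the kind of property-based check I mean, sketched with fast-check in TypeScript; `Config`, `serializeConfig` and `parseConfig` are toy stand-ins for your own code.

```ts
// Fuzz-style round-trip property using fast-check.
// Config, serializeConfig and parseConfig are stand-ins for real project code.
import fc from "fast-check";

type Config = { name: string; retries: number; verbose: boolean };

const serializeConfig = (c: Config): string => JSON.stringify(c);
const parseConfig = (raw: string): Config => JSON.parse(raw);

// Whatever serialize produces, parse must recover exactly; fast-check tries
// many generated inputs and shrinks any counterexample it finds.
fc.assert(
  fc.property(
    fc.record({
      name: fc.string(),
      retries: fc.nat(),
      verbose: fc.boolean(),
    }),
    (cfg) => {
      const roundTripped = parseConfig(serializeConfig(cfg));
      return JSON.stringify(roundTripped) === JSON.stringify(cfg);
    }
  )
);
```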
There are also OpenAPI spec validators to catch spec problems up front.
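For instance (a sketch, assuming a Node/TypeScript setup and a spec at ./openapi.yaml), something like @apidevtools/swagger-parser can fail CI before any client or server code gets generated:

```ts
// Validate the OpenAPI document up front so spec problems surface early.
// The file path and exit-code handling are assumptions for this sketch.
import SwaggerParser from "@apidevtools/swagger-parser";

async function main(): Promise<void> {
  try {
    const api = await SwaggerParser.validate("openapi.yaml");
    console.log(`Spec OK: ${api.info.title} v${api.info.version}`);
  } catch (err) {
    console.error("OpenAPI spec is invalid:", err);
    process.exit(1);
  }
}

main();
```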
And you can use contract testing (e.g. https://docs.pact.io/) to replay your client tests (with a mocked server) against the server (with mocked clients), never having to actually spin up both at the same time.
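Roughly what the consumer side looks like with pact-js (a sketch: the service names, endpoint and fields are invented, a Jest-style runner is assumed for describe/it/expect, and the exact API may differ between pact-js versions):

```ts
// Consumer-side contract test sketch with pact-js (PactV3 API).
// The pact file it writes is later replayed against the real provider.
import { PactV3, MatchersV3 } from "@pact-foundation/pact";

const provider = new PactV3({ consumer: "WebApp", provider: "UserService" });

describe("GET /users/42", () => {
  it("returns the user", () => {
    provider.addInteraction({
      states: [{ description: "user 42 exists" }],
      uponReceiving: "a request for user 42",
      withRequest: { method: "GET", path: "/users/42" },
      willRespondWith: {
        status: 200,
        headers: { "Content-Type": "application/json" },
        body: { id: 42, name: MatchersV3.like("Ada") },
      },
    });

    // The client under test talks to Pact's mock server instead of the real API.
    return provider.executeTest(async (mockServer) => {
      const res = await fetch(`${mockServer.url}/users/42`);
      const user = await res.json();
      expect(user.id).toBe(42);
    });
  });
});
```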
Together this creates a pretty widespread set of correctness checks that generate feedback at multiple points.
It's maybe overkill for the project I'm using it on, but as a set of AI handcuffs I like it quite a bit.
* (I've moved those comments to https://news.ycombinator.com/item?id=46675246. If you want to reply, please do so there so we can hopefully keep the main thread on topic.)
I’ve been wondering why I can’t use it to generate electricity.