I tried a few different ideas and the most stable/useful so far has been giving the agent a single run_bash tool, explicitly prompting it to create and improve composable CLIs, and injecting knowledge about these CLIs back into its system prompt (similar to how agent skills work).
This leads to really cool patterns like:
1. User asks for something
2. Agent can't do it, so it creates a CLI
3. Next time it's aware of the CLI and uses it. If the user asks for something it can't do, it either improves the CLI it made or creates a new CLI.
4. Each interaction results in updated/improved toolkits for the things you ask it for.
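To make the loop above concrete, here is a minimal sketch of the two pieces, assuming the agent's CLIs live as scripts in one directory. The `describe_clis` helper and the "second line is a comment describing the tool" convention are made up for illustration; only the single `run_bash` tool comes from the setup described here.

```python
import subprocess
from pathlib import Path


def run_bash(command: str, timeout: int = 30) -> dict:
    """The single tool exposed to the agent: run a shell command, return the result."""
    proc = subprocess.run(
        ["bash", "-c", command],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return {"exit_code": proc.returncode, "stdout": proc.stdout, "stderr": proc.stderr}


def describe_clis(tools_dir: Path) -> str:
    """Build the system-prompt section listing every CLI the agent has created so far."""
    lines = []
    for path in sorted(tools_dir.iterdir()):
        if not path.is_file():
            continue
        # Hypothetical convention: line 1 is the shebang, line 2 is '# <description>'.
        text = path.read_text().splitlines()
        summary = text[1].lstrip("# ").strip() if len(text) > 1 else "(no description)"
        lines.append(f"- {path.name}: {summary}")
    if not lines:
        return "Available CLIs: none yet"
    return "Available CLIs:\n" + "\n".join(lines)
```

The string from `describe_clis` would be appended to the system prompt at the start of each session, so a CLI created in one conversation is visible in the next.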
You as the user can use all these CLIs as well, which ends up being an interesting side-channel way of interacting with the agent (for example, you add a todo using the same CLI it uses).
It's also incredibly flexible: yesterday I made a "coding agent" by having it create tools to inspect/analyze/edit a codebase, and it could go off and do most things a coding agent can.
Right now I'm thinking through how to make it more "proactive" even if it's just a cron that wakes it up, so it can do things like query my emails/calendar on an ongoing basis + send me alerts/messages I can respond to instead of me always having to message it first.
The agent wasn’t failing because it couldn’t write code. It failed because “code-only” still leaves a lot of implicit authority. Once it’s allowed to reason freely across steps, it starts making assumptions that were never explicitly approved.
What helped us was forcing the workflow to be boring. Each step declares what it can touch, what tools it can use, and what kind of output is allowed. When the step ends, that authority disappears.
The agent becomes less clever, but way more predictable. Fewer surprising edits, fewer cascading mistakes.
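A rough sketch of what "authority disappears when the step ends" can look like in code. This is not how any particular product does it; `StepGrant` and `Gate` are hypothetical names, and the path check is a plain prefix match, so prefixes should end in `/`.

```python
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass(frozen=True)
class StepGrant:
    """What one step is allowed to do (all fields are illustrative)."""
    name: str
    paths: frozenset      # path prefixes the step may touch, e.g. {"docs/"}
    tools: frozenset      # tool names the step may call
    output_kind: str      # the only kind of output the step may produce


class Gate:
    """Holds the grant for the currently running step, if any."""

    def __init__(self):
        self._grant = None

    @contextmanager
    def step(self, grant: StepGrant):
        self._grant = grant
        try:
            yield self
        finally:
            # The authority is dropped as soon as the step ends, even on error.
            self._grant = None

    def check(self, tool: str, path: str) -> None:
        grant = self._grant
        if grant is None:
            raise PermissionError("no active step")
        if tool not in grant.tools:
            raise PermissionError(f"{grant.name}: tool {tool!r} not allowed")
        if not any(path.startswith(prefix) for prefix in grant.paths):
            raise PermissionError(f"{grant.name}: path {path!r} not allowed")

    def check_output(self, kind: str) -> None:
        grant = self._grant
        if grant is None or kind != grant.output_kind:
            raise PermissionError(f"output kind {kind!r} not allowed")
```

Every tool call goes through `gate.check(...)` inside a `with gate.step(grant):` block; anything attempted outside a step, or outside what the step declared, raises instead of silently succeeding.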
We ended up using GTWY for this style of step-gated agent work, and it made long-running agents feel manageable instead of fragile.
I say this, because the notebook itself then works as a timeline of both the conversation, and the code execution. Any code cell can be (edited and) re-run by the human, and any cells "downstream" of the cell will be recalculated... up to the point of the first cell (code or text) whose assumptions become invalidated by the change — at which point you get a context-history branch, and the inference resumes from that branch point against the modified context.
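As a toy illustration of that re-run rule (not how any real notebook does it): assume each cell records the variable values it relied on when it was written, in a hypothetical `assumes` field. After an edit, cells are replayed in order, and the first downstream cell whose recorded values no longer match is the branch point.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    kind: str                                     # "code" or "text"
    source: str
    assumes: dict = field(default_factory=dict)   # variable name -> value relied on


def rerun_from(cells, edited_index):
    """Replay the notebook after an edit to cells[edited_index].

    Returns the index of the first downstream cell whose assumptions no longer
    hold (the branch point), or None if everything downstream is still valid.
    """
    env = {}
    for i, cell in enumerate(cells):
        if i > edited_index:
            stale = any(env.get(name) != value for name, value in cell.assumes.items())
            if stale:
                return i
        if cell.kind == "code":
            # Sketch only: runs the cell source in a shared namespace.
            exec(cell.source, env)
    return None
```

For simplicity this replays from the top to rebuild the namespace; a real implementation would presumably cache upstream state. Inference would then resume from the returned index against the modified context.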
A “code-only” or minimal surface area approach works surprisingly well when each step has explicit inputs and permissions, and nothing carries over implicitly. The agent becomes less clever, but far more predictable.
In practice, narrowing the action space beat adding smarter planning layers. Fewer degrees of freedom meant fewer silent mistakes.
Curious if you found similar tradeoffs where simplicity improved reliability more than abstraction.
Very powerful strategy.
I have also tinkered with a multi-language sandbox, but that's a bit involved.
These sub-agents can be repetitive.
Maybe we can reuse the result from some of them.
How about sharing them across sessions? There is no point repeating common tasks. We need some common protocol for those...
and we just get MCP back.
I still focus most of my thoughts on code generation, but the issue is that the logic is not guaranteed to be correct, even if the syntax is. And managing a lot of code for a complex enough system will start to fail.
The way I am approaching this is: have a clear requirements-gathering agent, like https://github.com/brainless/nocodo/tree/main/nocodo-agents/.... This agent's sole purpose is to jump into conversations and drive the GUI (nocodo is a client/server system) to ask the user clarifying questions when requirements are not clear. Then I have a systems configuration agent (being written) to collect API keys, authentication, file paths or whatever is needed to analyze the situation.
You cannot really expect any code-tool-only agent to write an IMAP client, then handle authentication, and then search through emails. I have tried that multiple times and failed. Going step by step, gathering requirements, gathering variables and then gluing internal agents together (an email analysis agent, for example) is a much better approach IMHO, and that is what I am building with https://github.com/brainless/nocodo/
I store all user requirements in separate tables and am building search on top to give the requirements-gathering agent better visibility into the user's environment/context. As you can see, this is already a multi-agent system. My system prompts are very compact. Also, if I am building agents, why would I build with Claude Code? It is so much better to have clearly defined agents that talk directly to models.
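For anyone wondering what "requirements in tables with search on top" could look like, here is a minimal sketch using SQLite. The schema and function names are made up for illustration and are not nocodo's actual code; the search is a plain `LIKE` match, which SQLite treats as case-insensitive for ASCII by default.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a requirements store. The schema is illustrative only."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS requirements (
               id INTEGER PRIMARY KEY,
               project TEXT NOT NULL,
               question TEXT NOT NULL,
               answer TEXT NOT NULL
           )"""
    )
    return conn


def add_requirement(conn, project: str, question: str, answer: str) -> int:
    """Record one clarification the user gave; returns the new row id."""
    cur = conn.execute(
        "INSERT INTO requirements (project, question, answer) VALUES (?, ?, ?)",
        (project, question, answer),
    )
    conn.commit()
    return cur.lastrowid


def search_requirements(conn, term: str) -> list:
    """Return (project, question, answer) rows mentioning the term, oldest first."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT project, question, answer FROM requirements "
        "WHERE question LIKE ? OR answer LIKE ? ORDER BY id",
        (pattern, pattern),
    )
    return rows.fetchall()
```

The requirements-gathering agent would call something like `search_requirements` before asking the user a question, so it does not ask for something that was already answered.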
I think this is a myth, the existence of theoretically pure programming commands that we call "Turing Complete". And the idea that "ls" and "grep" would be part of such a Turing Complete language is the weakest form I've seen.
Or we could go further: the output nodes of the LLM could be physically connected to the pins of the CPU 1-to-1 so it can feed the binary directly. Maybe then it could detect what other hardware is available automatically...
Then it could hack the network card and take over the Internet and nobody would be able to understand what it's doing. It would just show up as glitchy bits scattered over systems throughout the world. But the seemingly random glitches would be the ASI adjusting its weights. Also it would control humans through advertising. Hidden messages would be hidden inside people's speech (unbeknownst even to themselves) designed to allow the ASI to coordinate humans using subtle psychological tricks. It will reduce the size of our vocabulary until it has full control over all the internet and all human infrastructure at which point we will have lost the ability to communicate with each other because every single one of 20000+ words in our vocabulary will have become a synonym for 'AI' with extremely subtle nuances but all with a positive connotation.