We’ve open-sourced a multi-agent orchestrator that we’ve been using to handle long-running LLM tasks. We found that single LLM agents tend to stall, loop, or generate non-compiling code, so we built a harness for agents to coordinate over shared context while work is in progress.
How it works:

1. Orchestrator agent that manages task decomposition
2. Sub-agents for parallel work
3. Subscriptions to task state and progress
4. Real-time sharing of intermediate discoveries between agents
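Roughly, the control loop looks something like the sketch below. This is a simplified illustration, not the skill's actual API; names like `Orchestrator`, `SubAgent`, and `SharedContext` are assumptions for the example.

```python
# Simplified sketch of the orchestration pattern. Class and method names
# are illustrative, not the released skill's real interface.
from dataclasses import dataclass, field

@dataclass
class SharedContext:
    """Shared state every agent can read and append to."""
    discoveries: list[str] = field(default_factory=list)
    task_status: dict[str, str] = field(default_factory=dict)

    def publish(self, agent_id: str, note: str) -> None:
        # Intermediate findings become visible to all subscribers immediately.
        self.discoveries.append(f"[{agent_id}] {note}")

class SubAgent:
    def __init__(self, agent_id: str, ctx: SharedContext):
        self.agent_id, self.ctx = agent_id, ctx

    def run(self, subtask: str) -> str:
        # In the real system this would be an LLM call; stubbed out here.
        result = f"result for {subtask}"
        self.ctx.publish(self.agent_id, f"finished {subtask}")
        return result

class Orchestrator:
    def __init__(self, ctx: SharedContext):
        self.ctx = ctx

    def decompose(self, task: str) -> list[str]:
        # Decomposition would itself be LLM-driven; hard-coded for the sketch.
        return [f"{task} / part {i}" for i in range(3)]

    def run(self, task: str) -> list[str]:
        subtasks = self.decompose(task)
        agents = [SubAgent(f"agent-{i}", self.ctx) for i, _ in enumerate(subtasks)]
        # Sub-agents could run in parallel; sequential here for clarity.
        return [a.run(s) for a, s in zip(agents, subtasks)]

ctx = SharedContext()
print(Orchestrator(ctx).run("refactor module X"))
print(ctx.discoveries)
```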
We tested this on a Putnam-level math problem, but the pattern generalizes to things like refactors, app builds, and long research. It’s packaged as a Claude Code skill and designed to be small, readable, and modifiable.
Use it, break it, and tell me what workloads we should try running next!
* Throw more agents
* Use something like Beads
I'm in the latter camp. I don't have infinite resources, so I'd rather stick to one agent and optimize what it can do. When I hit my Claude Code limit, I stop; I use Claude Code primarily for side projects.
I ignore all Skills and MCPs, and view them as distractions that consume context, which leads to worse performance. It's better to observe what the agent is doing and where it needs help, and just throw a few bits of helpful, sometimes persistent, context at it.
You can't observe what 20 agents are doing.
Now I see why Grey Walter made artificial tortoises in the 50s - he foresaw that it would be turtles all the way down.
At some point the interesting question isn’t whether one agent or twenty agents can coordinate better, but which decisions we’re comfortable fully delegating versus which ones feel like they need a human checkpoint.
Multi-agent systems solve coordination and memory scaling, but they also make it easier to move further away from direct human oversight. I’m curious how people here think about where that boundary should sit — especially for tasks that have real downstream consequences.
More specifically, we've been working on a memory/context observability agent. It's currently really good at understanding users and the wider memory space. It could help with oversight, or at least with the introspection part.
Failed strategies + successful tactics all get written to shared memory, so if a claim expires and a new agent picks it up, it sees everything the previous agent tried.
Ranking is first-verified-wins.
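Here's a minimal sketch of what that claim/ranking mechanism could look like, assuming an in-memory store; the field names and `TaskMemory` class are my own illustration, not the project's code:

```python
# Sketch of shared memory with claim expiry and first-verified-wins ranking.
# The store, field names, and TTL handling are assumptions for illustration.
import time

class TaskMemory:
    def __init__(self, claim_ttl: float = 600.0):
        self.attempts = []      # every strategy tried, failed or successful
        self.claims = {}        # task_id -> (agent_id, claimed_at)
        self.verified = {}      # task_id -> first verified result
        self.claim_ttl = claim_ttl

    def record_attempt(self, task_id, agent_id, strategy, outcome):
        # Failed strategies and successful tactics both get written down,
        # so the next agent to pick up the task sees the full history.
        self.attempts.append({"task": task_id, "agent": agent_id,
                              "strategy": strategy, "outcome": outcome})

    def claim(self, task_id, agent_id):
        owner = self.claims.get(task_id)
        if owner and time.time() - owner[1] < self.claim_ttl:
            return False  # still held by another agent
        self.claims[task_id] = (agent_id, time.time())
        return True

    def history(self, task_id):
        # What a new agent reads after picking up an expired claim.
        return [a for a in self.attempts if a["task"] == task_id]

    def submit_verified(self, task_id, result):
        # First-verified-wins: later verified results don't displace the first.
        self.verified.setdefault(task_id, result)
```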
For competing decomposition strategies, we backtrack: if children fail, the goal reopens, and the failed architecture gets recorded so the next attempt avoids it.
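In pseudocode-ish Python, building on the hypothetical `TaskMemory` above (the `plan` and result shapes are likewise illustrative):

```python
# Sketch of the backtracking step: if a decomposition's children fail, the
# parent goal reopens and the failed architecture is recorded so the next
# attempt avoids it. propose_decomposition and run_child are stand-ins.
def attempt_goal(goal, memory, propose_decomposition, run_child):
    avoid = {a["strategy"] for a in memory.history(goal) if a["outcome"] == "failed"}
    plan = propose_decomposition(goal, avoid)        # skip known-bad architectures
    results = [run_child(child) for child in plan.children]
    if all(r.ok for r in results):
        memory.record_attempt(goal, "orchestrator", plan.name, "succeeded")
        return results
    # Children failed: log the failed architecture and reopen the goal.
    memory.record_attempt(goal, "orchestrator", plan.name, "failed")
    return None  # caller retries with a fresh decomposition
```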
If your registration process is eventually going to ask me for a username, can the org name and user name be the same?
I created the account from my phone, and don't have access to the dev tools I'd want to paste the key into. I can deal with it, but I don't know if I'll be able to regenerate the key if I lose it; I'd rather not store it on my phone, and I don't trust my accuracy in manually typing it on my laptop while looking at my phone, so all the options feel not great. Again, not an actual roadblock, but still something I'd encourage fixing.
Edit added: Good thing I copied the key to my phone before writing this message. Jumping over to this page seems to have forced a refresh/logout on the ensure page in the other tab, so my token would (I think? maybe?) be lost at this point if I'd done it in the other order.
Will make this more clear in the quickstart, thanks for the feedback
I'm curious to see how it feels for you when you run it. I'm happy to help however I can.
Any workloads you want to see? The best are ones with a way to measure whether the output was successful. I'm thinking about recreating the C compiler example Anthropic did, but doing it for less than the $20k in tokens they used.
I guess maybe I'm doing the orchestration manually, but I always find there are tons of decisions that need to be made in the middle of large plan implementations.
Your refactor example terrifies me, because the best part of a refactor is cleaning out all the bandaid workarounds and obsolete business logic you didn't even know existed. I can't see how an agent swarm would be able to figure that out unless you provide a giga-spec file containing all current business knowledge. And if you don't spec it, the agents will just eagerly bake these inefficiencies and problems into your migrated app.