If you're running 20 instances of a service and Stripe starts returning 500s, each instance discovers that independently. Instance 1 trips its breaker after 5 failures. Instance 14 just got recycled and hasn't seen any yet. Instance 7 is in half-open, probing a service you already know is dead. For some window of time, part of your fleet is protecting itself and part of it is still hammering a dead dependency and timing out, and all you can do is watch.
Libraries can't fix this. Opossum, Resilience4j, Polly are great at the pattern, but they make per-instance decisions with per-instance state. Your circuit breakers don't talk to each other.
Openfuse is a centralized control plane. It aggregates failure metrics from every instance in your fleet and makes the trip decision based on the full picture. When the breaker opens, every instance knows at the same time.
It's a few lines of code:
const result = await openfuse.breaker('stripe').protect(
() => chargeCustomer(payload)
);
The SDK is open source, so anyone can see exactly what runs inside their services.
The other thing I couldn't let go of: when you get paged at 3am, you shouldn't have to find logs across 15 services to figure out what's broken. Openfuse gives you one dashboard showing every breaker state across your fleet: what's healthy, what's degraded, what tripped and when. And you shouldn't need a deploy to act. You can open a breaker from the dashboard and every instance stops calling that dependency immediately. Planned maintenance window at 3am? Open it beforehand. Fix confirmed? Close it instantly. Thresholds need adjusting? Change them in the dashboard and they take effect across your fleet in seconds. No PRs, no CI, no config files.
It has a decent free tier for trying it out, then $99/mo for most teams, $399/mo with higher throughput and some enterprise features. Solo founder, early stage, being upfront.
Would love to hear from people who've fought cascading failures in production. What am I missing?
In some thread, somewhere, there's going to be a roundtrip time between my servers and yours, and once I am at a scale where this sort of thing matters, I'm going to want this on-prem.
What's the difference between this and checking against a local cache before firing the request and marking the service down in said local cache so my other systems can see it?
I'm also concerned about a false positive or a single system throwing an error. If it's a false positive, then the protected asset fails on all of my systems, which doesn't seem great. I'll take some requests working vs none when money is in play.
You also state that "The SDK keeps a local cache of breaker state" -- If I've got 50 servers, where is that local cache living? If it's per process, that's not great, and if it's in a local cache like redis or memcache, I'm better off using my own network for "sub microsecond response" vs the time to go over the wire to talk to your service.
I've fought huge cascading issues in production at very large social media companies. It takes a bit more than breakers to solve these problems. Backpressure is a critical component of this, and often turning things off completely isn't the best approach.
On-prem: You're right, and it's on the roadmap. For teams at the scale you're describing, a hosted control plane doesn't make sense. The architecture is designed to be deployable as a self-hosted service, the SDK doesn't care where the control plane lives, just that it can reach it (you can swap the OpenfuseCloud class with just the Openfuse one, using your own URL).
Roundtrip time: The SDK never sits in the hot path of your actual request. It doesn't check our service before firing each call. It keeps a local cache of the current breaker state and evaluates locally: the decision to allow or block a request is pure local memory, not a network hop. The control plane pushes state updates asynchronously, so your request latency isn't affected. The propagation delay is how quickly a state change reaches all instances, not how long each request waits.
False positives / single system errors: This is exactly why aggregation matters. Openfuse doesn't trip because one instance saw one error. It aggregates failure metrics across the fleet, you set thresholds on the collective signal (e.g., 40% failure rate across all instances in a 30s window). A single server throwing an error doesn't move that needle. The thresholds and evaluation windows are configurable precisely for this reason.
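To make the "collective signal" idea concrete, here is a minimal sketch of a fleet-wide trip decision. It is not Openfuse's actual evaluation code: `InstanceReport`, `shouldTrip`, and the `minCalls` guard are hypothetical names and assumptions added for illustration, and the reports are assumed to already cover a single evaluation window (e.g. the last 30s).

```typescript
// Hypothetical shape of what each instance reports for one evaluation window.
interface InstanceReport {
  instanceId: string;
  successes: number;
  failures: number;
}

// Decide on the summed counts across the fleet, not on any single instance.
// minCalls is an assumed guard so a near-idle fleet cannot trip on a couple of errors.
function shouldTrip(
  reports: InstanceReport[],
  failureRateThreshold: number, // e.g. 0.4 for 40%
  minCalls: number
): boolean {
  let total = 0;
  let failed = 0;
  for (const r of reports) {
    total += r.successes + r.failures;
    failed += r.failures;
  }
  if (total < minCalls) return false;
  return failed / total >= failureRateThreshold;
}
```

With 20 instances, one instance failing all 5 of its calls while the other 19 each succeed 50 times gives 5 failures out of 955 calls, well under a 40% threshold, so the breaker stays closed.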
Local cache location: It's in-process memory, not Redis or Memcache. Each SDK instance holds the last known breaker state in memory. The control plane pushes updates to connected SDKs. So the per-request check is: read a boolean from local memory. The network only comes into play when state changes propagate, not on every call. The cache size for 100 breakers is ~57KB, and for 1000, which is quite extreme, is ~393KB.
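Roughly, the per-request path described here looks like the sketch below. This is an illustration under assumptions, not the SDK's real internals: `LocalBreakerCache`, `applyUpdate`, and `allows` are hypothetical names, and the push channel that would call `applyUpdate` is left out.

```typescript
type PushedState = "closed" | "open";

class LocalBreakerCache {
  private readonly states = new Map<string, PushedState>();

  // Would be called by the (hypothetical) push channel when the control plane reports a change.
  applyUpdate(name: string, state: PushedState): void {
    this.states.set(name, state);
  }

  // Hot-path check: a map lookup in process memory, no network I/O.
  // A breaker that has never been reported is treated as closed.
  allows(name: string): boolean {
    return this.states.get(name) !== "open";
  }

  // Run fn only if the locally cached state allows it.
  async protect<T>(name: string, fn: () => Promise<T>): Promise<T> {
    if (!this.allows(name)) {
      throw new Error(`breaker "${name}" is open`);
    }
    return fn();
  }
}
```

The network cost sits entirely in `applyUpdate`, which runs once per state change rather than once per request.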
Backpressure: 100% agree, breakers alone don't solve cascading failures. They're one layer. Openfuse is specifically tackling the coordination and visibility gap in that layer, not claiming to replace load shedding, rate limiting, retry budgets, or backpressure strategies. Those are complementary. The question I'm trying to answer is narrower: when you do have breakers, why is every instance making that decision independently? Why do you have no control over what's going on? Why do you need to make a code change to temporarily disconnect your server from a dependency? And if you have 20 services, why configure it 20 times (once for each repo)?
Would love to hear more about what you've seen work at scale for the backpressure side. That would be a next step :)
At extremely high scale you start to run into very strange problems. We used to say that all of your "Unix Friends" fail at scale and act differently.
I once had 3000 machines running NTP sync'd cronjobs on the exact same second pounding the upstream server and causing outages (Whoops, add random offsets to cron!)
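The "random offsets" fix is just jitter. A minimal sketch, assuming a job that can tolerate starting up to `maxJitterMs` late; `jitteredDelayMs` is a hypothetical helper name, and the random source is injectable only so the behavior can be checked deterministically:

```typescript
// Pick a delay in [0, maxJitterMs) so synchronized hosts spread their start times.
function jitteredDelayMs(
  maxJitterMs: number,
  random: () => number = Math.random // expected to return a value in [0, 1)
): number {
  if (maxJitterMs < 0) {
    throw new RangeError("maxJitterMs must be non-negative");
  }
  return Math.floor(random() * maxJitterMs);
}

// Usage: instead of running the job on the exact second, wait a random offset first.
// setTimeout(runJob, jitteredDelayMs(60000));
```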
This sort of "dogpile effect" exists when fetching keys as well. A key drops out of cache and 30 machines (or worker threads) try to load the same key at the same time, because the cache is empty.
One of the solutions around this problem was Facebook's DataLoader (https://github.com/graphql/dataloader), which tries to intercept the request pipeline, batch the requests together and coalesce many requests into one.
Essentially DataLoader will coalesce all individual loads which occur within a single frame of execution (a single tick of the event loop) and then call your batch function with all requested keys.
It helps by reducing requests and offering something resembling backpressure by moving the request into one code path.
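For the single-key dogpile case, a narrower cousin of this idea is coalescing concurrent loads of the same key into one in-flight request (often called "single-flight"). The sketch below is not DataLoader and does no batching across keys; `SingleFlight` and `load` are hypothetical names, and it only deduplicates within one process.

```typescript
class SingleFlight<V> {
  private readonly inFlight = new Map<string, Promise<V>>();

  // Concurrent callers asking for the same key share one underlying fetch.
  load(key: string, fetch: (key: string) => Promise<V>): Promise<V> {
    const existing = this.inFlight.get(key);
    if (existing) return existing;

    const pending = fetch(key);
    this.inFlight.set(key, pending);

    // Forget the entry once it settles (success or failure) so later calls refetch.
    const cleanup = (): void => {
      this.inFlight.delete(key);
    };
    pending.then(cleanup, cleanup);

    return pending;
  }
}
```

Thirty workers in one process missing the same cache key at once would trigger a single upstream fetch. Across machines you would still need something shared, such as a lock or a short-lived placeholder value in the cache.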
I would expect that you'd have the same sort of problem at scale with this system given the number of requests on many procs across many machines.
We had a lot of small tricks like this (they add up!). In some cases we'd insert a message queue in between the requestor and the service so that we could increase latency / reduce request rate while systems were degraded. Those "knobs" were generally implemented by "Decider" code which read keys from memcache to figure out what to do.
By "pushes to connected SDKs": I assume you're holding a thread with this connection. How do you reconcile this when you're running something like Node with PM2 where you've got 30-60 processes on a single host? They won't be sharing memory, so that's a lot of updates.
It seems better to have these updates pushed to one local process that other processes can read from via socket or shared memory.
I'd also consider the many failure modes of services. Sometimes services go catatonic upon connect and don't respond, sometimes they time out, sometimes they throw exceptions, etc...
There's a lot to think about here but as I said what you've got is a great start.
But this is a feature, not a bug. You seem to be assuming that people use circuit breakers only on external requests; in that situation your approach seems reasonable.
If you have cbs between every service call, your model doesn't seem like a good idea. Where I work every network call is behind a cb (external services, downstream services, database, redis, s3, ...) and it's pretty common to see failures isolated to a single k8s node. In this situation we want independent cbs that can open independently.
Your take on observability/operation seems interesting but it is pretty close to feature flags. And that is exactly how we handle these scenarios, we have a couple of feature flags we can enable to switch traffic around during outages. Switching to fallback is easy most of the time, but switching back to normal operation is harder to do.
Openfuse is aimed at the other case: shared external dependencies where 15 services all call the same dependency and each one is independently discovering the same outage at different times. Different failure modes, different coordination needs, and you have no way to manually intervene or even just see what's open. Think of your house: every appliance has its own protection system, but that doesn't exempt you from having the distribution board.
You can also put it between your service/monolith and your own other services, e.g. if a recommendations engine or a loyalty system in e-commerce or POS software goes down, all hot-path flows from all other services will just bypass their calls to it. So by "external" I mean another service, whether it's yours or a vendor's.
On the feature flag point: that's interesting because you're essentially describing the pain of building circuit breaker behavior on top of feature flag infrastructure. The "switching back" problem you mention is exactly what half-open state solves: controlled probe requests that test recovery automatically and restore traffic gradually, without someone manually flipping a flag and hoping. That's the gap between "we can turn things off" and "the system recovers on its own." But yeah, we can all call Openfuse just feature flags for resilience, as I said: it's a fusebox for your microservices.
Curious how you handle the recovery side: is it a feature flag provider itself, or have you built something around it that stores state in your own database?
I don't really see what problem this solves. If you have proper timeouts and circuit breakers in your service this shouldn't really matter. This solution will save a few hundred requests, but I don't think this really matters. If this is a pain point, it's easier to adjust the circuit-breaker settings (reduce the error rate, increase the window, ...) than to introduce a whole new level of complexity.
> Curious how you handle the recovery side
We have a feature flag provider built in-house. But it doesn't support this use case, so what we did was create a flag holding the % value we want to bring back and handle the logic inside the service. Example: if we want to bring back 6.25% (1/16) of our users, we switch back every user whose account-id ends in 'a'. For 12.5% (2/16) we switch back users whose account-id ends in either 'a' or 'b'. This is a pretty hacky solution, but it solves our problem when we need to transition from our fallback to our main flow.
Each service discovering the outage on its own is not really the main problem my proposal solves; the thing is that when breakers live locally, we lack observability and there is no way to act on them.
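A minimal sketch of that suffix-based rollout, assuming account IDs end in a roughly uniformly distributed hex digit. `inRollout` and the digit ordering beyond 'a' and 'b' are assumptions for illustration, not the in-house implementation:

```typescript
// Order starts at 'a' to match the example; the rest of the order is an arbitrary assumption.
const ROLLOUT_ORDER = "abcdef0123456789";

// sixteenths = how many of the 16 possible last hex digits are switched back (0..16).
function inRollout(accountId: string, sixteenths: number): boolean {
  if (!Number.isInteger(sixteenths) || sixteenths < 0 || sixteenths > 16) {
    throw new RangeError("sixteenths must be an integer between 0 and 16");
  }
  const last = accountId.slice(-1).toLowerCase();
  if (last.length !== 1) return false; // empty ID: never in the rollout
  const index = ROLLOUT_ORDER.indexOf(last);
  return index !== -1 && index < sixteenths;
}
```

With `sixteenths = 1` only IDs ending in 'a' are switched back (6.25%); with `sixteenths = 2`, IDs ending in 'a' or 'b' (12.5%).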
> what we done is to create flag where we put the % value we want to bring back
Oh I see, well that is indeed a good problem to solve. Openfuse does not do that gradual recovery but it would be possible to add.
Do you think that by having that feature and having the Openfuse solution self-hosted, it would be something you would give a try? Not trying to sell you anything, just gathering feedback so I can learn from the discussion.
By the way, if you don't mind, how often do you have to run that type of recovery?
No, I don't think this is compelling enough to try it at work.
> By the way, if you don't mind, how often do you have to run that type of recovery?
I would say we use this feature once every 3 months.
Doesn't this suffer from the opposite problem though? There is a very brief hiccup for Stripe and instance 7 triggers the circuit breaker. Then all other services stop trying to contact Stripe even though Stripe has recovered in the meantime. Or am I missing something about how your platform works?
So instance 7 seeing a brief hiccup doesn't trip anything, the breaker only opens when the collective signal crosses your threshold (e.g., 40% failure rate across all instances in a 30s window). A momentary blip from one instance doesn't affect the others.
And when it does trip, the half-open state sends controlled probe requests to test recovery, so if Stripe bounces back quickly, the breaker closes again automatically.
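For anyone unfamiliar with half-open, here is a generic sketch of the pattern, not Openfuse's implementation. `HalfOpenBreaker`, the single-probe policy, and the cooldown parameter are assumptions; real breakers often allow several probes and require more than one success before closing.

```typescript
type Phase = "closed" | "open" | "half-open";

class HalfOpenBreaker {
  private phase: Phase = "closed";
  private openedAt = 0;

  constructor(
    private readonly cooldownMs: number,
    private readonly now: () => number = Date.now // injectable clock for testing
  ) {}

  // Open the breaker and start the cooldown.
  trip(): void {
    this.phase = "open";
    this.openedAt = this.now();
  }

  // Returns true if a call may go through right now.
  allowRequest(): boolean {
    if (this.phase === "closed") return true;
    if (this.phase === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.phase = "half-open"; // cooldown elapsed: admit exactly one probe
      return true;
    }
    return false; // still cooling down, or a probe is already in flight
  }

  // Report the probe's outcome: success closes the breaker, failure reopens it.
  recordResult(success: boolean): void {
    if (this.phase !== "half-open") return;
    if (success) {
      this.phase = "closed";
    } else {
      this.trip();
    }
  }

  current(): Phase {
    return this.phase;
  }
}
```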
Making your circuit breaker state global seems like it would just exacerbate the problem. Failures are often partial in the real world.
It's not complex individually, but it takes time, and it's the ongoing maintenance that gets you. Openfuse is a bet that most teams would rather pay $99/mo than maintain that.
That said, a self-hosted option is on the near-term roadmap for teams that need it.
The reason why I only launched the cloud version of it is just so I could have a faster iteration pace in the back-end after having people actually using it reliably.
Now it is pretty solid and self hosting is the next thing to go out.
If you check the SDK code, it is ready for self hosting.
What happens when your service goes down?
"The SDK is fail-open by design. If our service is unreachable, it falls back to the last known breaker state.
If no state has ever been cached (e.g., a cold start with no connectivity), it defaults to closed, meaning your protected calls keep executing normally. Your app is never blocked by Openfuse unavailability."
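In code, the fallback order described in that quote would look something like the sketch below. It illustrates the stated behavior only and is not taken from the SDK source: `FailOpenStateCache`, `refresh`, and the synchronous `fetchState` callback are hypothetical stand-ins for whatever transport the SDK actually uses.

```typescript
type KnownState = "closed" | "open" | "half-open";

class FailOpenStateCache {
  private readonly lastKnown = new Map<string, KnownState>();

  // Try to refresh from the control plane; on any failure keep whatever was cached.
  refresh(name: string, fetchState: () => KnownState): void {
    try {
      this.lastKnown.set(name, fetchState());
    } catch {
      // Control plane unreachable: leave the last known state untouched.
    }
  }

  // Last known state if there is one, otherwise closed (calls keep executing).
  current(name: string): KnownState {
    return this.lastKnown.get(name) ?? "closed";
  }
}
```

One consequence worth noting: if the breaker was open when connectivity was lost, the last known state stays open until the control plane is reachable again.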