I'm doing some field research on unique ways on-call rotations can be unhealthy. It would be great to hear some anecdata from the community about /why/ you feel your on-call rotation sucks - and I figure it would be even better to experience it firsthand :)
I do understand that asking to join/shadow your rotation is probably not practical; however, I am 100% serious and happy to sign whatever.
Cheers & may your pager stay silent
No matter if you get paged or not, you need to be available, and that sucks.
p.s. Knew I recognised the name - loved following development of Planimeter/Grid a while ago!
I stopped hitting snooze as I got older. I either wake up or give up and sleep in.
Inadequate staff is the only reason on-call exists. Sure, people might be mostly sitting around all night being paid and not being terribly busy.
But if a company needs someone at night, they need someone at night. Companies getting away with not paying for that is why on-call sucks.
In other words, on-call sucks because companies don’t pay for solving the problems that require it. There’s no self-correcting feedback.
A tool can’t fix that, and on-call is not inevitable. Good luck.
Though I'd agree it's a staffing issue. 5 people in a cycle is fine. If you had a concert or something that week, just swap places with a colleague. When we reduced it to 2 people, it was not cool to spend half your time on-call.
There are also policies like don't release on Fridays, don't release on a vacation week. If there were a tool for it, it would be flagging these behaviors. Unfortunately, we can't really control when partners go down.
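For what it's worth, a tool that flags those release policies could be very small. This is only a rough sketch, not anything an existing tool does: `release_warnings`, `FROZEN_WEEKDAYS`, and the `vacation_ranges` argument are all made-up names, and it assumes you can hand it the planned release date and a list of vacation date ranges.

```python
from datetime import date

# Hypothetical policy: weekdays on which releases are discouraged.
# date.weekday() counts Monday as 0, so 4 is Friday.
FROZEN_WEEKDAYS = {4}


def release_warnings(day, vacation_ranges=()):
    """Return a list of policy warnings for a planned release date.

    vacation_ranges is an assumed input: an iterable of (start, end)
    date pairs, inclusive on both ends.
    """
    warnings = []
    if day.weekday() in FROZEN_WEEKDAYS:
        warnings.append("release falls on a Friday")
    for start, end in vacation_ranges:
        if start <= day <= end:
            warnings.append(f"release falls in a vacation week ({start} to {end})")
    return warnings


if __name__ == "__main__":
    holidays = [(date(2024, 12, 23), date(2024, 12, 27))]
    for planned in (date(2024, 3, 1), date(2024, 3, 4), date(2024, 12, 24)):
        print(planned, release_warnings(planned, holidays))
```

It only warns rather than blocks, since someone will always have a legitimate reason to ship on a Friday. It does nothing about partners going down, of course.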
When I started, people were paid for any hours they worked on-call. By the end, the company changed the policy so on-call was part of base pay. For those who were on-call during the changeover, their last year of on-call pay was averaged and added to their salary. Everyone who came after that got screwed (that includes me).
Once I changed to the day shift I got called a few times for on-call. Every single time, I documented what I did to fix it, as I did it, and handed it off to the ops team. Or in some cases I automated the fix. I have 0 tolerance for being called in my free time. I don’t care what the boss says my priorities are, if I’m being called at night, stopping that in its tracks is my #1 priority. If I ever get called two times for the same issue, that’s my fault. So far, it’s never happened.
I've yet to hear of any alternative compensation model that actually works. Just pay people in their choice of money or time off in lieu. Sorry to hear you got screwed.
> Every single time, I documented what I did to fix it, as I did it, and handed it off to the ops team. Or in some cases I automated the fix. I have 0 tolerance for being called in my free time. I don’t care what the boss says my priorities are, if I’m being called at night, stopping that in its tracks is my #1 priority.
100% agree, I think people are far too tolerant of being paged. Especially management - the productivity impact of constant interrupts is huge. In a previous job one of my favourite things to do was go out to teams and just disable alerts they said were noisy or unactionable. If there was any pushback/consequence I was happy to accept responsibility (but never had to).
But as long as the expected cost of downtime outweighs the financial cost of keeping someone available to fix it, on-call in some form will be inevitable. (There are a lot of instances where the cost doesn't make sense, and we should just accept the system being broken until 9am.)
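To make that trade-off concrete, here is a back-of-the-envelope sketch. Every name and number in it is an assumption for illustration (`overnight_on_call_pays_off`, the incident rate, the hourly downtime cost), and a real estimate would need actual incident data and would also have to price in the human cost of being paged.

```python
def overnight_on_call_pays_off(
    incidents_per_month,
    hours_until_morning,
    downtime_cost_per_hour,
    on_call_cost_per_month,
):
    """Compare the expected monthly cost of waiting until morning
    against the monthly cost of keeping someone available.

    All four inputs are rough estimates supplied by the caller.
    """
    expected_downtime_cost = (
        incidents_per_month * hours_until_morning * downtime_cost_per_hour
    )
    return expected_downtime_cost > on_call_cost_per_month


if __name__ == "__main__":
    # Made-up numbers: 2 overnight incidents a month, 8 hours until
    # someone is at their desk, 500 per hour of downtime, 3000 per
    # month to pay for on-call coverage.
    print(overnight_on_call_pays_off(2, 8, 500, 3000))
    # A quieter, cheaper system: likely fine to leave broken until 9am.
    print(overnight_on_call_pays_off(0.5, 6, 100, 3000))
```

With the first set of numbers the expected downtime cost is 8000 against 3000 for coverage, so on-call pays for itself. With the second it is 300 against 3000, which is the "just wait until 9am" case.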
I don't think on-call needs to suck though. IMO "staffing issues" (whether it's headcount, time, competing priorities, etc) are resourcing issues and I believe better tooling can absolutely help with that - either by reducing the resources required to fix it or by making the cost of the issues quantifiable. Thanks for the good luck :)
We began with free food delivery over the weekend, and the expectation that you'd take a day off the next week ("unlimited" PTO policy). Eventually they stopped letting us do that and now the "unlimited" in our PTO policy has an invisible limit, so you can't actually do that without it counting towards the invisible limit on your unlimited PTO for the year.
Our monitoring and alerting is unusably noisy. Deviance is fully normalized. All our postmortems typically have a section stating that alerts were issued, but ignored until customers began complaining. Attempts to cut the noise down to a sane level have all been defeated by the ever present pressure to feature factory. TBF this is mostly an engineering self-own and I feel partially responsible for this outcome.
The on-call engineer does a shocking amount of manual labor to paper over bugs in the product and un-stick users who fall through the (many) cracks. It is effectively a T3 tech support rotation. We've taken steps to tone it down to mere triage and channel this into pressure against offending teams' timelines, but there's a huge amount of silent cultural resistance and no one is being held accountable when a feature increases support load. I suspect this issue alone would make most bigtech engineers quit.
For the (many) issues that require manual intervention, the on-call engineer cannot actually do anything unless 2 other engineers sign off on a PR (either to run a SQL query or to deploy some tool or bugfix to resolve the problem).
This is more specific to the product I work on, but the sheer amount of 3rd party services we rely on means that something is constantly acting up and there's not a lot we can do about it. Our API client code for each service we use typically contains _at least_ one service-specific hacky workaround to keep things running in the face of bad behavior.
The frontend team has no on-call rotation despite causing plenty of bugs on their own. Backend engineers are expected to triage what are clearly frontend problems. We stood up a lot of observability tooling for the frontend but it took years for them to even start to use it.
More than anything, it feels like the moment I stop championing the issue, everyone stops paying attention and the on-call experience reverts to the mean. Other on-call engineers just sort of stop boyscouting and let the chaos wash over them while focusing on sprint obligations (can't blame them), and leadership takes their eye off the ball to chase growth (also can't blame them). Hugely fucked lack of accountability and the buck eventually stops at whoever is the poor guy holding the pager that week.
At least I certainly wouldn't be happy to learn that my product was bursting at the seams and nobody was being held accountable. But I'm not an executive leader. (Maybe that's why?)
And there's certainly a calculus to it that changes when you're an executive. To me, craftsmanship, diligence, and engineering excellence are important, not just because I love programming but also because I'm an IC and it affects me directly. To an executive, I am just some weird nerd they have to pay a lot of money to make computers do things. Beautiful code and a serene on-call experience are nice but they don't usually get a company acquired.