[0] https://docs.hetzner.com/robot/dedicated-server/general-info...
It was very non-obvious to debug, since pretty much all emitted metrics, apart from mysterious errors/timeouts to our service, looked reasonable. Even the CPU usage and CPU temperature graphs looked normal, since it was a bogus PROCHOT signal and not actual thermal throttling.
It kept dropping to 400 MHz. I suspected throttling, so we got it cleaned, had the thermal paste replaced, and all that.
Still throttled. We replaced Windows with Linux since that was at least a bit more usable.
At the time I didn't know about PROCHOT. And my googling skills clearly weren't sufficient.
One fine day during lunch at a place on campus, I remembered I'd read about BD_PROCHOT recently. So I wrote a script using msr-tools (or whatever it was) and disabled it. "Extended" the lifespan of the thing.
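The exact script isn't shown above; a minimal sketch of the same trick with msr-tools would look something like this, assuming an Intel CPU where BD_PROCHOT is bit 0 of MSR 0x1FC (verify against your CPU's documentation before writing MSRs):

```
sudo modprobe msr                      # expose /dev/cpu/*/msr
val=$(sudo rdmsr -d 0x1fc)             # read MSR_POWER_CTL as a decimal value
sudo wrmsr -a 0x1fc $(( val & ~1 ))    # clear bit 0 (BD_PROCHOT) on all cores
sudo rdmsr -f 0:0 0x1fc                # confirm the bit now reads 0
```

Note that this only masks the external PROCHOT# signal; if whatever was asserting it had been reporting a genuine overheat, you'd be giving up that protection.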
Very weird behavior; I'd prefer my servers to crash instead of lowering frequency to 400 MHz.
I have alerts on PSUs and CPU frequency for this reason.
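The exact setup isn't shown here, but a minimal sketch of the frequency half might look like this (the 1 GHz floor, the script name, and the use of logger are placeholders; an idle core also reads low, so in practice you'd correlate with load):

```
#!/bin/bash
# freq-watch: flag any core reporting a clock below the floor
floor_khz=1000000   # 1 GHz; a PROCHOT-throttled core typically sits far below this
for f in /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq; do
    khz=$(cat "$f")
    if [ "$khz" -lt "$floor_khz" ]; then
        echo "ALERT: ${f%/cpufreq/*} at ${khz} kHz" | logger -t freq-watch
    fi
done
```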
The servers are so cheap that overcommitting them by double is still significantly cheaper than using cloud hosting, which tends to have the same issue, only monitoring it is harder. Most people using cloud seem happy not to know, though, and it's been a known thing that there's a 5x variation between instances of the same size on AWS: https://www.brendangregg.com/Slides/AWSreInvent2017_performa...
100% agreed. There is nothing worse than a slow server in your fleet. This behavior reeks of "pet" thinking.
It's still the hosting company's responsibility to competently own, maintain, and repair the physical hardware. That includes monitoring. In the old days you had to run a script or install a package to hook into their monitoring....but with IPMI et al being standard they don't need anything from you to do their job.
The only time a hosting company should be hands-off is when they're just providing rack space, power, and data. Anything beyond that is between you and them in a contract/agreement.
Every time I hear Hetzner come up in the last few years it's been a story about them being incompetent. If they're not detecting things like CPU fan failures of their own hardware and they deployed new systems without properly testing them first, then that's just further evidence they're still slipping.
That's one way it can work. There are a great many hosted server options out there from fully managed to fully unmanaged with price points to match. Selling a cheap server under the conditions "call us when it breaks" is a perfectly reasonable offering.
> In the old days you had to run a script or install a package to hook into their monitoring....but with IPMI et al being standard they don't need anything from you to do their job
For dedicated servers, you have to schedule KVM access in advance, so I assume they need to move some hardware and plug it into your server.
This would mean that IPMI is most likely not available or disabled.
If you can't put yourself in the other side's shoes for a second when evaluating a purchase, and you just braindead try to make cost go lower and income go higher, you're ngmi except in shady sales businesses.
Server hardware is incredibly cheap; if you are a somewhat competent programmer you can handle most programs on a single server or even a virtual machine. Just give them a little bit of margin and pay $50/mo instead of $25/mo. It's not even enough to guarantee they won't go broke or make you a valuable customer; you'll still be banking on whales to make the whole thing profitable.
Also, if your business is in the US, find a US host ffs.
This is really good advice and what I'm following for all systems which need to be stable. If there aren't any security issues, I either wait a few months or keep one or two versions behind.
Windows is a well-known example; people used to wait for a service pack or two before upgrading.
I mean evergreen releases make sense imo, as the overhead of maintaining older versions for a long time is huge, but you need to have canary releases, monitoring, and gradual rollout plans; for something like Windows, this should be done with a lot of care. Even a 1% release rate will affect hundreds of thousands if not millions of systems.
Yeah, this is generally a good practice. The silver lining is that our suffering helped uncover the underlying issue faster. :)
This isn’t part of the blog post, but for the future we also considered getting the servers and keeping them idle, without actual customer workload, for about a month. This would be more expensive, but it could help identify potential issues without impacting our users. In our case, the crashes started three weeks after we deployed our first AX162 server, so we'd need at least a month (or maybe even longer) as a buffer period.
On the other hand, dmidecode output in the article shows:
Manufacturer: Dell Inc.
Product Name: 0H3K7P
Did you actually uncover the true root cause? Or did they finally uncap the power consumption without telling you, just as they neither confirmed nor denied having limited it?
I don't believe they simply lifted a power cap (if there was one in the first place). I genuinely think the fix came after the motherboard replacements. We had 2 batches of motherboard replacements and after that, the issue disappeared.
If someone from Hetzner is here, maybe they can give extra information.
[1] https://status.hetzner.com/incident/7fae9cca-b38c-4154-8a27-...
> The inspection system as currently used by the Skunk Works, which has been approved by both the Air Force and the Navy, meets the intent of existing military requirements and should be used on new projects. Push more basic inspection responsibility back to the subcontractors and vendors. Don't duplicate so much inspection.
But this will be the first and last time Ubicloud does not burn in a new model, or even new tranches of purchases (I also work there...and am a founder).
In the wild, for example in a forest, old boars give safety squeaks to send the younglings ahead into a clearing they do not trust. The equivalent of that would be to write a tech-blog entry that hypes up a technology that is not yet production ready.
What are the consequences of power limiting? The article says it can cause hardware to degrade more quickly, why?
Hetzner's lack of response here (and UbiCloud's measurements) seems to suggest they are indeed limiting power, since if they weren't doing it, they'd say so, right?
To check, run `cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor`. It should be `performance`.
If it’s not, set it for every core with `echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor`. If your workload is CPU hungry, this will help. It will revert on reboot, so you can make it stick with some cron/systemd unit or whichever.
Of course if you are the one paying for power or it’s your own hardware, make your own judgement for the scaling governor. But if it’s a rented bare metal server, you do want `performance`.
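To make it stick across reboots, one option (a minimal sketch; the unit name is a placeholder, and a @reboot cron entry works just as well) is a small oneshot systemd unit:

```
sudo tee /etc/systemd/system/cpu-performance-governor.service >/dev/null <<'EOF'
[Unit]
Description=Set CPU frequency scaling governor to performance

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor'

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now cpu-performance-governor.service
```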
https://www.rvo.nl/onderwerpen/energie-besparen-de-industrie...
If I rent a server I want to be able to run it to the maximum capacity, since I'm paying for all of it. It's dishonest to make me pay for X and give me < X. Idle CPU is wasted money.
The flip side is that the provider should be also offering more climate friendly, lower power options. I'll still want to run them to the max, but the total energy consumed would be less than before.
Also not forgetting that code efficiency matters if we want to get the max results for the minimum carbon spend. Another reason why giant web frameworks and bloated OSes depress me a little.
* Data centers in the Netherlands use approx. 2% of nationwide electricity production (4% in the US [2])
* Data center electricity usage is nearly constant, while access patterns aren’t
* Even heavily used servers spend 1/3 of power usage on idle cycles, 99% for the most lightly used servers
* Power-saving modes save approx. 10% of electricity without affecting application performance
* Many respondents do not use power-saving modes because of a lack of knowledge, because they fear the consequences, or because they have been instructed not to by their sysadmin/vendor
* Nonetheless, latency-sensitive applications (e.g. HPC or HFT) are not well-suited to power-saving modes
Given these results, it seems sensible to use power-saving modes by default, unless your workload is extremely latency-sensitive.
In any case, I disagree that potential 10% electricity savings across the worldwide data center industry, without affecting application performance, are ‘environmentalism at any cost’.
[1, Dutch] Harryvan, D. et al. (2020). Analyse LEAP Track 1 “Powermanagement.” Rijksdienst voor Ondernemend Nederland. URL: https://www.rvo.nl/sites/default/files/2021/01/Rapport%20LEA...
[1, English] Harryvan, D. et al. (2021). Analysis LEAP Track 1 “Powermanagement.” Netherlands Enterprise Agency. URL: https://www.rvo.nl/sites/default/files/2021/01/Rapport%20LEA...
[2] Shehabi, A. et al. (2024) United States Data Center Energy Usage Report (page 5). Berkeley Lab. URL: https://eta-publications.lbl.gov/sites/default/files/2024-12...
It can be pretty annoying, because it means that systems can perform better under higher load and that you get drastically different latency depending on whether a request is scheduled on a core that just processed another request (already at high freq) or one that was idle.
And as if the frequency control weren't fun enough, this behavior also exists with CPU idle states. Even at a high frequency, Linux can enter idle states...
I've debugged several cases where this set of issues has caused unintuitive behavior. E.g.
a) switching to a more powerful server drastically increased latency
b) optimized code resulting in higher latency / lower throughput, because the optimization left enough idle cycles for a deeper idle state between requests
c) slightly increased IO latency leading to significantly worse overall performance, because the IO waits got long enough for cores to clock down
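If you want to see whether this is happening on a machine of your own, a couple of quick checks (assuming the cpupower tool from linux-tools is installed and the usual sysfs layout):

```
cpupower idle-info                                          # which C-states the driver exposes
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name     # names of the idle states
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/time     # cumulative residency in microseconds
watch -n1 'grep "cpu MHz" /proc/cpuinfo | sort -k4 -n | tail -5'   # live per-core clocks under load
```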
Actually, thinking this through, even then it doesn't make much sense to me: if you have that many short requests coming in, the CPU would simply never scale back if it's reasonably constant. It would first need to see some gap, and why not scale the CPU back in that gap (at the cost of having the first request of the next batch be a few milliseconds slower)? From there on, every subsequent request is fast again until there's another lull. Keeping the CPU always at a high frequency should only be needed if you have a very tight deadline on that surprise request (high-frequency trading perhaps?), or if your requests are coincidentally always spaced by the same amount of time as CPU scaling measures across. I'm sure these things exist, but "intermittent workload" is 90% of all workloads, and most workloads definitely aren't meaningfully impacted by CPU scaling.
Yea. Most of the cases I was looking at were with postgres, with fully cached simple queries. Each taking << 10ms. The problem is more extreme if the client takes some time to actually process the result or there is network latency, but even without it's rather noticeable.
> Turning it off would, to me, only make sense if you want a server to handle thousands of fast requests per second, but those requests don't come in for periods of, say, 50 ms at a time and so the CPU scales back.
I see regressions at periods well below 50ms, but yea, that's the shape of it.
E.g. a postgres client running 1000 QPS over a single TCP connection from a different server, connected via switched 10Gbit Ethernet (ping RTT 0.030ms), has the following client side visible per-query latencies:
powersave, idle enabled: 0.392 ms
performance, idle enabled: 0.295 ms
performance, idle disabled: 0.163 ms
If I make that same 1 client go full tilt, instead of limiting it to 1000 QPS:
powersave, idle enabled: 0.141 ms
performance, idle enabled: 0.107 ms
performance, idle disabled: 0.090 ms
I'd call that a significant performance change.
> if you have that many short requests coming in, the CPU would simply never scale back if it's reasonably constant.
Indeed, that's what makes the whole issue so pernicious. One of the ways I saw this was when folks moved postgres to more powerful servers and got worse performance due to frequency/idle handling. The reason being that it made it more likely that cores were idle long enough to clock down.
On the same setup as above, if I instead have 800 client connections going full tilt, there's no meaningful difference between powersave/performance and idle enabled/disabled.
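For anyone who wants to reproduce this class of measurement: the exact client used above isn't specified, but pgbench can approximate both the rate-limited single-connection case and the saturated cases (the database name "bench" is a placeholder):

```
createdb bench && pgbench -i -s 10 bench     # small, fully cacheable dataset
pgbench -S -c 1 -R 1000 -T 30 -P 5 bench     # select-only, 1 connection, rate-limited to ~1000 QPS
pgbench -S -c 1 -T 30 -P 5 bench             # same single connection going full tilt
pgbench -S -c 800 -j 8 -T 30 -P 5 bench      # many connections saturating the server
```

The per-transaction latency that pgbench reports is the number to compare across governor and idle-state settings.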
Maybe for completeness, what CPU type is this on?
FWIW, my results corresponded to:
cpupower frequency-set --governor powersave && cpupower idle-set -E
cpupower frequency-set --governor performance && cpupower idle-set -E
cpupower frequency-set --governor performance && cpupower idle-set -D0
It's perhaps worth pointing out that -D0 sometimes hurts performance, by reducing the boost potential of individual cores, due to the higher baseline temp & power usage.
> Maybe for completeness, what CPU type is this on?
This was a 2x Xeon Gold 5215. But I've reproduced this on newer Intel and AMD server CPUs too.
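As an aside (not something mentioned above): instead of -D0 there are gentler options that disable only the deep C-states, or bound the allowed wakeup latency, which avoids the boost/thermal downside just described:

```
cpupower idle-info                 # list states and their exit latencies first
sudo cpupower idle-set -d 3        # disable only state #3 (numbering varies by CPU/driver)
sudo cpupower idle-set -d 2        # ...and state #2, if needed

# Or hold a PM QoS latency bound: the kernel avoids idle states with a higher
# exit latency (in microseconds) for as long as the file descriptor stays open.
sudo sh -c 'exec 3>/dev/cpu_dma_latency; printf 10 >&3; sleep infinity' &
```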
> (though I'm far from running into performance limitations on the old laptop that I use for hosting various projects, it could still be something to tune when I run some big task with lots of queries)
If you run larger queries or queries at a higher frequency (i.e. client on the same host instead of via network, or the client uses pipelining), the problem typically doesn't manifest to a significant degree.
I did four tests on my "server" with an Intel i7 3630QM CPU. Pseudocode:
- Test 1, simple benchmark: `php -r '$starttime=microtime(1); $loops=0; while($starttime+1>microtime(1)){$loops++;} print($loops);'`, running in parallel for each real CPU core (not hyperthread)
- Test 2A, fast queries: time `for ($i=0; $i<1000; $i++){ $db->query('SELECT ' . mt_rand()); }` (localhost, querying into a different container)
- Test 2B: intermittent fast queries: same as above, but on each loop it sleeps for mt_rand(1,10e3) microseconds to perhaps trick the CPU into clocking down
- Test 3, ApacheBenchmark command requesting a webpage that does a handful of database queries: `ab -n 500 https://lucb1e.com/`, executed from a VPS in a nearby country, taking the 95th percentile response time
Governor results:
The governor makes no measurable difference for the benchmark and serial queries (tests 1 and 2A), but in test 3 there's a very clear difference: 86-88 ms for the 95th percentile versus 92-95 ms (ran each test 3 times to see if the result is stable). CPU frequency is not always at max when the performance governor is set (I had expected it would be at max all the time then). For test 2B, I see a 3% difference (powersave being slower) but I'm not sure that's not just random variation.
Idle states results:
Disabling idle states has mixed results, basically as you describe: it makes the CPU report maximum frequency all the time, which, instead of making it faster, seems to make it throttle for thermal reasons: the benchmark suffers and gets ~20 instead of ~27 million loops per core per second, while sensors shoot up from ~50 to ~80 °C. On the other hand, it has the same effect on web requests as setting the governor to performance (but I didn't change the governor), and on test 2B it has an even bigger impact: ~11% faster.
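One more check that helps tell real thermal throttling apart from frequency/idle artifacts (assuming turbostat from linux-tools is available): it prints effective clock, C-state residency, package power, and temperature side by side while a test runs.

```
sudo turbostat --quiet --interval 5    # Bzy_MHz, C-state %, PkgWatt, CoreTmp per sample
```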
---
I'll have to ponder this. My first thought was that my HTTP-based two-way latency measurement utility should trigger the performance governor for more reliable results, but when I test it now here on WiFi (not that VPS with a stable connection as used in the test above), the results are indistinguishable before or after the governor change; the difference must be too small compared to the variability that WiFi adds (also on 5 GHz that can stably max out the throughput). My second thought is that this might give me another stab at exploiting a timing side channel in database index lookups, where the results were just too variable and I couldn't figure out how to make my work laptop set a fixed CPU frequency (the hardware or driver doesn't support it, iirc, as far as I could find). I was also not aware that there are power states besides "everything is running" (+/- frequency changes), "stand by / suspend", and "powered off". This 2012 laptop has 6 idle states already, all with different latency values, and my current laptop has 9! Lots to learn here still, and I'm sure I'll think of more implications later.
I've set things back to idle states enabled and governor powersave, since everything ran great on that for years, and I expect to keep using that virtually all the time. But now that I know this, I'll certainly set it to performance to see if it helps for certain workloads (timing side channels which already work fine may become more reliable if my CPU runs more predictably). Thanks for making me aware :)
We recently retired them because we had worn down everything on these servers, from RAID cards to power regulators. Rebooting a perfectly running server due to a configuration change and losing the RAID card forever, because electromigration had eroded a trace inside the RAID processor, is a sobering experience.
Another surprising name is Huawei. Their servers just don't die.
When the BIOS or the iDRAC is shot, there's no way they're gonna get their support ZIP file. If they want, they can connect that dead I/O board to a spare part and try. :)
Can anyone elaborate on this point? This is counter to my intuition (and in fact, what I saw upon a cursory search), which is that power capping should prolong the useful lifetime of various components.
The only search results I found that claimed otherwise were indicating that if you're running into thermal throttling, then higher operating temperatures can cause components (e.g. capacitors) to degrade faster. But that's expressly not the case in the article, which looked at various temperature sensors.
That said, I'm not an electronics engineer, so my understanding might not be entirely accurate. It’s possible that the degradation was caused by power fluctuations rather than the power cap itself, or perhaps another factor was at play.
[1] https://electronics.stackexchange.com/questions/65837/can-el... [2] https://superuser.com/questions/1202062/what-happens-when-ha...
Volts is as supplied by the utility company.
Amps are monitored per rack and the usual data centre response to going over an amp limit is that a fuse blows or the data centre asks you for more money!
The only way you can decrease power used by a server is by throttling the CPUs.
The normal way of throttling CPUs is via the OS which requires cooperation.
I speculate this is possible via the lights-out baseboard management controller (which doesn't need the OS to be involved), but I'm pretty sure you'd see that in /sys if it was.
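For what it's worth, if a cap were being applied at the CPU or BMC level rather than upstream of the server, two places it could show up from inside the OS (assuming RAPL via the powercap framework, and a BMC that implements DCMI; a hoster's BMC may not be reachable by the tenant at all):

```
# RAPL package power limits exposed through the powercap framework
grep . /sys/class/powercap/intel-rapl*/constraint_*_power_limit_uw 2>/dev/null

# Ask the BMC for its DCMI power reading and configured limit
sudo ipmitool dcmi power reading
sudo ipmitool dcmi power get_limit
```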
I don't know for sure how the limiting is done, but a simple circuit breaker like the ones we have in our houses would be one solution for it. That causes the rack to lose power when the circuit breaks, which is not ideal because you lose the whole rack and affect multiple customers.
Another option would be a current/power limiter [0], which would cause other problems because P = U * I: capping the current makes the voltage (U) drop, and then the whole system is undervolted. Weird glitches happen there, and it's a common way to bypass various security measures in chips. For example, Raspberry Pi ran this challenge [1] to look for this kind of bug and test how well their chips can handle attacks, including voltage attacks.
[0] - https://en.m.wikipedia.org/wiki/Current_limiting [1] - https://www.raspberrypi.com/news/security-through-transparen...
No idea what the article is talking about with the damage. Computers like to run slow when possible. There's basically no downside except they take longer to do things.
https://electronics.stackexchange.com/a/65827
> A mosfet needs a certain voltage at its gate to turn fully on. 8V is a typical value. A simple driver circuit could get this voltage directly from the power that also feeds the motor. When this voltage is too low to turn the mosfet fully on a dangerous situation (from the point of view of the moseft) can arise: when it is half-on, both the current through it and the voltage across it can be substantial, resulting in a dissipation that can kill it. Death by undervoltage.
Motherboard issues around power/signaling are a pain to diagnose; they show up as all sorts of problems apparently related to other components (RAM failing to initialize and random restarts are very common in my experience), and you end up swapping everything before actually replacing the motherboard...
https://docs.hetzner.com/robot/dedicated-server/general-info...
Hetzner is not a normal customer though. As part of their extreme cost optimization they probably buy the cheapest components available and they might even negotiate lower prices in exchange for no warranty. In that case they would have to buy replacement motherboards.
I think the website said they recently raised 16 million euros (or dollars).
Making investments into data centers and hardware could burn through that really quick in addition to needing more engineers.
By using rented servers (and only renting them when a customer signs up) they avoid this problem.
Building and owning an institution that finances, racks, services, networks, and disposes of servers, both takes time and increases the commitment level. Hetzner is month to month, with a fixed overhead for fresh leasing of servers: the set-up fee.
This is a lot to administer when also building a software institution, and a business. It was not certain at the outset, for example, that the GitHub Actions Runner product would be as popular as it became. In its earliest form, it was partially an engineering test for our virtual machines, and we went around asking friendly contacts that we knew would report abnormalities to use it. There's another universe where it only went as far as an engineering test, and our utilization and revenue pattern (that is, utility to other people) is different.
Sometimes that other company isn't actually very good and you can increase value by insourcing their part of your operation. But you can't assume that is always the case. It wouldn't have solved this particular problem - I think we can safely guess that your chance of getting a batch of faulty motherboards is at least as high as Hetzner's chance.
> In the days that followed, the crash frequency increased.
I don't find the article conclusive whether they would still call them reliable.
https://www.reddit.com/r/aws/comments/131v8md/beware_of_brok...
Since they don't do any sort of monitoring on their bare metal servers at all, at least insofar as I can tell having been a customer of theirs for ten years, you don't know there's a problem until there's a problem, or unless you've got your own monitoring solution in place.
Back in 2006 my coworker claimed he was the person responsible for them adding an "exchange my dead HDD" menu item on the support site, because he wrote one of those tickets per week.
When I got a physical server, the HDD died in the first 48h, so I've not exactly forgiven them, even if this was a tragic story over the last 18 or so years...
On the other hand, I've been recommending their cloud vps for a couple of years because unlike with their HW, I've never had problems.
There are also others, but Hetzner is under discussion here.
You don’t get root access, but you do get a preinstalled LAMP stack and a web UI for management.
This was something I hadn't heard before, and a surprise to me.
I think it would be amusing if it turns out they just raised the power limits for the servers no longer showing the problem back up to the base level that was originally advertised.
YOU do the monitoring.
YOU do the troubleshooting.
YOU etc., etc.
If that doesn't appeal to you, or if you don't have the requisite knowledge, which I admit is fairly broad and encompassing, then it's not for you. For those of you that meet those checkboxes, they're a pretty amazing deal.
Where else could I get a 4c/8t CPU with 32 GB of RAM and four (4) 6TB disks for $38 a month? I really don't know of many places with that much hardware for that little cost. And yes, it's an Intel i7-3770, but I don't care. It's still a hell of a lot of hardware for not much price.