[0] https://docs.hetzner.com/robot/dedicated-server/general-info...
It was very non-obvious to debug, since pretty much all emitted metrics, apart from mysterious errors/timeouts to our service, looked reasonable. Even the CPU usage and CPU temperature graphs looked normal, since it was a bogus PROCHOT signal and not actual thermal throttling.
It kept dropping to 400 MHz. I suspected throttling, so we got it cleaned, had the thermal paste replaced, and all that.
Still throttled. We replaced Windows with Linux since that was at least a bit more usable.
At the time I didn't know about PROCHOT. And my googling skills clearly weren't sufficient.
One fine day during lunch at a place on campus, I remembered I'd read about BD_PROCHOT recently. So I wrote a script using msr-tools (or whatever it was) and disabled it. "Extended" the lifespan of the thing.
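The exact script isn't shown above; a minimal sketch of the same trick with msr-tools would look something like this, assuming an Intel CPU where BD_PROCHOT is bit 0 of MSR 0x1FC (verify against your CPU's documentation before writing MSRs):

```
sudo modprobe msr                      # expose /dev/cpu/*/msr
val=$(sudo rdmsr -d 0x1fc)             # read MSR_POWER_CTL as a decimal value
sudo wrmsr -a 0x1fc $(( val & ~1 ))    # clear bit 0 (BD_PROCHOT) on all cores
sudo rdmsr -f 0:0 0x1fc                # confirm the bit now reads 0
```

Note that this only masks the external PROCHOT# signal; if whatever was asserting it had been reporting a genuine overheat, you'd be giving up that protection.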
Very weird behavior; I'd prefer my servers to crash instead of lowering frequency to 400 MHz.
I have alerts on PSUs and CPU frequency for this reason.
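The exact setup isn't shown here, but a minimal sketch of the frequency half might look like this (the 1 GHz floor, the script name, and the use of logger are placeholders; an idle core also reads low, so in practice you'd correlate with load):

```
#!/bin/bash
# freq-watch: flag any core reporting a clock below the floor
floor_khz=1000000   # 1 GHz; a PROCHOT-throttled core typically sits far below this
for f in /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq; do
    khz=$(cat "$f")
    if [ "$khz" -lt "$floor_khz" ]; then
        echo "ALERT: ${f%/cpufreq/*} at ${khz} kHz" | logger -t freq-watch
    fi
done
```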
The servers are so cheap that overcommitting them by double is still significantly cheaper than using cloud hosting, which tends to have the same issue, only monitoring it is harder. Most people using cloud seem happy not to know, though, and it's been a known thing that there's a 5x variation between instances of the same size on AWS: https://www.brendangregg.com/Slides/AWSreInvent2017_performa...
100% agreed. There is nothing worse than a slow server in your fleet. This behavior reeks of "pet" thinking.
It's still the hosting company's responsibility to competently own, maintain, and repair the physical hardware. That includes monitoring. In the old days you had to run a script or install a package to hook into their monitoring....but with IPMI et al being standard they don't need anything from you to do their job.
The only time a hosting company should be hands-off is when they're just providing rack space, power, and data. Anything beyond that is between you and them in a contract/agreement.
Every time I hear Hetzner come up in the last few years it's been a story about them being incompetent. If they're not detecting things like CPU fan failures of their own hardware and they deployed new systems without properly testing them first, then that's just further evidence they're still slipping.
That's one way it can work. There are a great many hosted server options out there from fully managed to fully unmanaged with price points to match. Selling a cheap server under the conditions "call us when it breaks" is a perfectly reasonable offering.
> In the old days you had to run a script or install a package to hook into their monitoring....but with IPMI et al being standard they don't need anything from you to do their job
For dedicated servers, you have to schedule KVM access in advance, so I assume they need to move some hardware and plug it into your server.
This would mean that IPMI is most likely not available or disabled.
If you can't put yourself in the other side's shoes for a second when evaluating a purchase, and you just braindead try to make cost go lower and income go higher, you're ngmi except in shady sales businesses.
Server hardware is incredibly cheap; if you are a somewhat competent programmer you can handle most programs on a single server or even a virtual machine. Just give them a little bit of margin and pay $50/mo instead of $25/mo. It's not even enough to guarantee they won't go broke or make you a valuable customer; you'll still be banking on whales to make the whole thing profitable.
Also, if your business is in the US, find a US host ffs.
This is really good advice and what I'm following for all systems which need to be stable. If there aren't any security issues, I either wait a few months or keep one or two versions behind.
Windows is a well-known example; people used to wait for a service pack or two before upgrading.
I mean evergreen releases make sense imo, as the overhead of maintaining older versions for a long time is huge, but you need to have canary releases, monitoring, and gradual rollout plans; for something like Windows, this should be done with a lot of care. Even a 1% release rate will affect hundreds of thousands if not millions of systems.
Yeah, this is generally a good practice. The silver lining is that our suffering helped uncover the underlying issue faster. :)
This isn’t part of the blog post, but for the future we also considered getting the servers and keeping them idle, without actual customer workload, for about a month. This would be more expensive, but it could help identify potential issues without impacting our users. In our case, the crashes started three weeks after we deployed our first AX162 server, so we'd need at least a month (or maybe even longer) as a buffer period.
On the other hand, dmidecode output in the article shows:
Manufacturer: Dell Inc.
Product Name: 0H3K7P
Did you actually uncover the true root cause? Or did they finally uncap the power consumption without telling you, just as they neither confirmed nor denied having limited it?
I don't believe they simply lifted a power cap (if there was one in the first place). I genuinely think the fix came after the motherboard replacements. We had 2 batches of motherboard replacements and after that, the issue disappeared.
If someone from Hetzner is here, maybe they can give extra information.
[1] https://status.hetzner.com/incident/7fae9cca-b38c-4154-8a27-...
> The inspection system as currently used by the Skunk Works, which has been approved by both the Air Force and the Navy, meets the intent of existing military requirements and should be used on new projects. Push more basic inspection responsibility back to the subcontractors and vendors. Don't duplicate so much inspection.
But this will be the first and last time Ubicloud does not burn in a new model, or even new tranches of purchases (I also work there...and am a founder).
In the wild, for example in a forest, old boars give safety squeaks to send the younglings ahead into a clearing they do not trust. The equivalent of that would be to write a tech-blog entry that hypes up a technology that is not yet production ready.
What are the consequences of power limiting? The article says it can cause hardware to degrade more quickly, why?
Hetzner's lack of response here (and UbiCloud's measurements) seems to suggest they are indeed limiting power, since if they weren't doing it, they'd say so, right?
To check, run `cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor`. It should be `performance`.
If it’s not, set it for every core with `echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor`. If your workload is CPU hungry, this will help. It will revert on reboot, so you can make it stick with some cron/systemd unit or whichever.
Of course if you are the one paying for power or it’s your own hardware, make your own judgement for the scaling governor. But if it’s a rented bare metal server, you do want `performance`.
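To make it stick across reboots, one option (a minimal sketch; the unit name is a placeholder, and a @reboot cron entry works just as well) is a small oneshot systemd unit:

```
sudo tee /etc/systemd/system/cpu-performance-governor.service >/dev/null <<'EOF'
[Unit]
Description=Set CPU frequency scaling governor to performance

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor'

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now cpu-performance-governor.service
```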
https://www.rvo.nl/onderwerpen/energie-besparen-de-industrie...
If I rent a server I want to be able to run it to the maximum capacity, since I'm paying for all of it. It's dishonest to make me pay for X and give me < X. Idle CPU is wasted money.
The flip side is that the provider should be also offering more climate friendly, lower power options. I'll still want to run them to the max, but the total energy consumed would be less than before.
Also not forgetting that code efficiency matters if we want to get the max results for the minimum carbon spend. Another reason why giant web frameworks and bloated OSes depress me a little.
* Data centers in the Netherlands use approx. 2% of nationwide electricity production (4% in the US [2])
* Data center electricity usage is nearly constant, while access patterns aren’t
* Even heavily used servers spend 1/3 of power usage on idle cycles, 99% for the most lightly used servers
* Power-saving modes save approx. 10% of electricity without affecting application performance
* Many respondents do not use power-saving modes because of a lack of knowledge, because they fear the consequences, or because they have been instructed not to by their sysadmin/vendor
* Nonetheless, latency-sensitive applications (e.g. HPC or HFT) are not well-suited to power-saving modes
Given these results, it seems sensible to use power-saving modes by default, unless your workload is extremely latency-sensitive.
In any case, I disagree that potential 10% electricity savings across the worldwide data center industry, without affecting application performance, are ‘environmentalism at any cost’.
[1, Dutch] Harryvan, D. et al. (2020). Analyse LEAP Track 1 “Powermanagement.” Rijksdienst voor Ondernemend Nederland. URL: https://www.rvo.nl/sites/default/files/2021/01/Rapport%20LEA...
[1, English] Harryvan, D. et al. (2021). Analysis LEAP Track 1 “Powermanagement.” Netherlands Enterprise Agency. URL: https://www.rvo.nl/sites/default/files/2021/01/Rapport%20LEA...
[2] Shehabi, A. et al. (2024) United States Data Center Energy Usage Report (page 5). Berkeley Lab. URL: https://eta-publications.lbl.gov/sites/default/files/2024-12...
It can be pretty annoying, because it means that systems can perform better under higher load and that you get drastically different latency depending on whether a request is scheduled on a core that just processed another request (already at high freq) or one that was idle.
And as if the frequency control weren't fun enough, this behavior also exists with CPU idle states. Even at a high frequency, Linux can enter idle states...
I've debugged several cases where this set of issues has caused unintuitive behavior. E.g.
a) switching to a more powerful server drastically increased latency
b) optimized code resulting in higher latency / lower throughput, because the optimization left enough idle cycles for a deeper idle state between requests
c) slightly increased IO latency leading to significantly worse overall performance, because the IO waits got long enough for cores to clock down
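If you want to see whether this is happening on a machine of your own, a couple of quick checks (assuming the cpupower tool from linux-tools is installed and the usual sysfs layout):

```
cpupower idle-info                                          # which C-states the driver exposes
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name     # names of the idle states
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/time     # cumulative residency in microseconds
watch -n1 'grep "cpu MHz" /proc/cpuinfo | sort -k4 -n | tail -5'   # live per-core clocks under load
```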
Actually, thinking this through, even then it doesn't make much sense to me: if you have that many short requests coming in, the CPU would simply never scale back if it's reasonably constant. It would first need to see some gap, and why not scale the CPU back in that gap (at the cost of having the first request of the next batch be a few milliseconds slower)? From there on, every subsequent request is fast again until there's another lull. Keeping the CPU always at a high frequency should only be needed if you have a very tight deadline on that surprise request (high-frequency trading perhaps?), or if your requests are coincidentally always spaced by the same amount of time as CPU scaling measures across. I'm sure these things exist, but "intermittent workload" is 90% of all workloads, and most workloads definitely aren't meaningfully impacted by CPU scaling.
Yea. Most of the cases I was looking at were with postgres, with fully cached simple queries. Each taking << 10ms. The problem is more extreme if the client takes some time to actually process the result or there is network latency, but even without it's rather noticeable.
> Turning it off would, to me, only make sense if you want a server to handle thousands of fast requests per second, but those requests don't come in for periods of, say, 50 ms at a time and so the CPU scales back.
I see regressions at periods well below 50ms, but yea, that's the shape of it.
E.g. a postgres client running 1000 QPS over a single TCP connection from a different server, connected via switched 10Gbit Ethernet (ping RTT 0.030ms), has the following client side visible per-query latencies:
powersave, idle enabled: 0.392 ms
performance, idle enabled: 0.295 ms
performance, idle disabled: 0.163 ms
If I make that same 1 client go full tilt, instead of limiting it to 1000 QPS:
powersave, idle enabled: 0.141 ms
performance, idle enabled: 0.107 ms
performance, idle disabled: 0.090 ms
I'd call that a significant performance change.
> if you have that many short requests coming in, the CPU would simply never scale back if it's reasonably constant.
Indeed, that's what makes the whole issue so pernicious. One of the ways I saw this was when folks moved postgres to more powerful servers and got worse performance due to frequency/idle handling. The reason being that it made it more likely that cores were idle long enough to clock down.
On the same setup as above, if I instead have 800 client connections going full tilt, there's no meaningful difference between powersave/performance and idle enabled/disabled.
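For anyone who wants to reproduce this class of measurement: the exact client used above isn't specified, but pgbench can approximate both the rate-limited single-connection case and the saturated cases (the database name "bench" is a placeholder):

```
createdb bench && pgbench -i -s 10 bench     # small, fully cacheable dataset
pgbench -S -c 1 -R 1000 -T 30 -P 5 bench     # select-only, 1 connection, rate-limited to ~1000 QPS
pgbench -S -c 1 -T 30 -P 5 bench             # same single connection going full tilt
pgbench -S -c 800 -j 8 -T 30 -P 5 bench      # many connections saturating the server
```

The per-transaction latency that pgbench reports is the number to compare across governor and idle-state settings.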
Maybe for completeness, what CPU type is this on?
FWIW, my results corresponded to:
cpupower frequency-set --governor powersave && cpupower idle-set -E
cpupower frequency-set --governor performance && cpupower idle-set -E
cpupower frequency-set --governor performance && cpupower idle-set -D0
It's perhaps worth pointing out that -D0 sometimes hurts performance, by reducing the boost potential of individual cores, due to the higher baseline temp & power usage.
> Maybe for completeness, what CPU type is this on?
This was a 2x Xeon Gold 5215. But I've reproduced this on newer Intel and AMD server CPUs too.
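As an aside (not something mentioned above): instead of -D0 there are gentler options that disable only the deep C-states, or bound the allowed wakeup latency, which avoids the boost/thermal downside just described:

```
cpupower idle-info                 # list states and their exit latencies first
sudo cpupower idle-set -d 3        # disable only state #3 (numbering varies by CPU/driver)
sudo cpupower idle-set -d 2        # ...and state #2, if needed

# Or hold a PM QoS latency bound: the kernel avoids idle states with a higher
# exit latency (in microseconds) for as long as the file descriptor stays open.
sudo sh -c 'exec 3>/dev/cpu_dma_latency; printf 10 >&3; sleep infinity' &
```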
> (though I'm far from running into performance limitations on the old laptop that I use for hosting various projects, it could still be something to tune when I run some big task with lots of queries)
If you run larger queries or queries at a higher frequency (i.e. client on the same host instead of via network, or the client uses pipelining), the problem typically doesn't manifest to a significant degree.
I did four tests on my "server" with an Intel i7 3630QM CPU. Pseudocode:
- Test 1, simple benchmark: `php -r '$starttime=microtime(1); $loops=0; while($starttime+1>microtime(1)){$loops++;} print($loops);'`, running in parallel for each real CPU core (not hyperthread)
- Test 2A, fast queries: time `for ($i=0; $i<1000; $i++){ $db->query('SELECT ' . mt_rand()); }` (localhost, querying into a different container)
- Test 2B: intermittent fast queries: same as above, but on each loop it sleeps for mt_rand(1,10e3) microseconds to perhaps trick the CPU into clocking down
- Test 3, ApacheBenchmark command requesting a webpage that does a handful of database queries: `ab -n 500 https://lucb1e.com/`, executed from a VPS in a nearby country, taking the 95th percentile response time
Governor results:
The governor makes no measurable difference for the benchmark and serial queries (tests 1 and 2A), but in test 3 there's a very clear difference: 86-88 ms for the 95th percentile versus 92-95 ms (ran each test 3 times to see if the result is stable). CPU frequency is not always at max when the performance governor is set (I had expected it would be at max all the time then). For test 2B, I see a 3% difference (powersave being slower) but I'm not sure that's not just random variation.
Idle states results:
Disabling idle states has mixed results, basically as you describe: it makes the CPU report maximum frequency all the time, which, instead of making it faster, seems to make it throttle for thermal reasons: the benchmark suffers and gets ~20 instead of ~27 million loops per core per second, while sensors shoot up from ~50 to ~80 °C. On the other hand, it has the same effect on web requests as setting the governor to performance (but I didn't change the governor), and on test 2B it has an even bigger impact: ~11% faster.
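One more check that helps tell real thermal throttling apart from frequency/idle artifacts (assuming turbostat from linux-tools is available): it prints effective clock, C-state residency, package power, and temperature side by side while a test runs.

```
sudo turbostat --quiet --interval 5    # Bzy_MHz, C-state %, PkgWatt, CoreTmp per sample
```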
---
I'll have to ponder this. My first thought was that my HTTP-based two-way latency measurement utility should trigger the performance governor for more reliable results, but when I test it now here on WiFi (not that VPS with a stable connection as used in the test above), the results are indistinguishable before or after the governor change; the difference must be too small compared to the variability that WiFi adds (also on 5 GHz that can stably max out the throughput). My second thought is that this might give me another stab at exploiting a timing side channel in database index lookups, where the results were just too variable and I couldn't figure out how to make my work laptop set a fixed CPU frequency (the hardware or driver doesn't support it, iirc, as far as I could find). I was also not aware that there are power states besides "everything is running" (+/- frequency changes), "stand by / suspend", and "powered off". This 2012 laptop has 6 idle states already, all with different latency values, and my current laptop has 9! Lots to learn here still, and I'm sure I'll think of more implications later.
I've set things back to idle states enabled and governor powersave, since everything ran great on that for years, and I expect to keep using that virtually all the time. But now that I know this, I'll certainly set it to performance to see if it helps for certain workloads (timing side channels which already work fine may become more reliable if my CPU runs more predictably). Thanks for making me aware :)
We recently retired them because we had worn down everything on these servers, from RAID cards to power regulators. Rebooting a perfectly running server due to a configuration change and losing the RAID card forever, because electromigration had eroded a trace inside the RAID processor, is a sobering experience.
Another surprising name is Huawei. Their servers just don't die.
When the BIOS or the iDRAC is shot, there's no way they're gonna get their support ZIP file. If they want, they can connect that dead I/O board to a spare part and try. :)
Can anyone elaborate on this point? This is counter to my intuition (and in fact, what I saw upon a cursory search), which is that power capping should prolong the useful lifetime of various components.
The only search results I found that claimed otherwise were indicating that if you're running into thermal throttling, then higher operating temperatures can cause components (e.g. capacitors) to degrade faster. But that's expressly not the case in the article, which looked at various temperature sensors.
That said, I'm not an electronics engineer, so my understanding might not be entirely accurate. It’s possible that the degradation was caused by power fluctuations rather than the power cap itself, or perhaps another factor was at play.
[1] https://electronics.stackexchange.com/questions/65837/can-el... [2] https://superuser.com/questions/1202062/what-happens-when-ha...
Volts is as supplied by the utility company.
Amps are monitored per rack and the usual data centre response to going over an amp limit is that a fuse blows or the data centre asks you for more money!
The only way you can decrease power used by a server is by throttling the CPUs.
The normal way of throttling CPUs is via the OS which requires cooperation.
I speculate this is possible via the lights-out baseboard management controller (which doesn't need the OS to be involved), but I'm pretty sure you'd see that in /sys if it was.
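For what it's worth, if a cap were being applied at the CPU or BMC level rather than upstream of the server, two places it could show up from inside the OS (assuming RAPL via the powercap framework, and a BMC that implements DCMI; a hoster's BMC may not be reachable by the tenant at all):

```
# RAPL package power limits exposed through the powercap framework
grep . /sys/class/powercap/intel-rapl*/constraint_*_power_limit_uw 2>/dev/null

# Ask the BMC for its DCMI power reading and configured limit
sudo ipmitool dcmi power reading
sudo ipmitool dcmi power get_limit
```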
I don't know for sure how the limiting is done, but a simple circuit breaker like the ones we have in our houses would be one solution for it. That causes the rack to lose power when the circuit breaks, which is not ideal because you lose the whole rack and affect multiple customers.
Another option would be a current/power limiter [0], which would cause other problems because P = U * I: capping the current makes the voltage (U) drop, and then the whole system is undervolted. Weird glitches happen there, and it's a common way to bypass various security measures in chips. For example, Raspberry Pi ran this challenge [1] to look for this kind of bug and test how well their chips can handle attacks, including voltage attacks.
[0] - https://en.m.wikipedia.org/wiki/Current_limiting [1] - https://www.raspberrypi.com/news/security-through-transparen...
No idea what the article is talking about with the damage. Computers like to run slow when possible. There's basically no downside except they take longer to do things.
https://electronics.stackexchange.com/a/65827
> A mosfet needs a certain voltage at its gate to turn fully on. 8V is a typical value. A simple driver circuit could get this voltage directly from the power that also feeds the motor. When this voltage is too low to turn the mosfet fully on a dangerous situation (from the point of view of the moseft) can arise: when it is half-on, both the current through it and the voltage across it can be substantial, resulting in a dissipation that can kill it. Death by undervoltage.
Motherboard issues around power/signaling are a pain to diagnose; they show up as all sorts of problems apparently related to other components (RAM failing to initialize and random restarts are very common in my experience), and you end up swapping everything before actually replacing the motherboard...
https://docs.hetzner.com/robot/dedicated-server/general-info...
Hetzner is not a normal customer though. As part of their extreme cost optimization they probably buy the cheapest components available and they might even negotiate lower prices in exchange for no warranty. In that case they would have to buy replacement motherboards.
I think the website said they recently raised 16 million euros (or dollars).
Making investments into data centers and hardware could burn through that really quick in addition to needing more engineers.
By using rented servers (and only renting them when a customer signs up) they avoid this problem.
Building and owning an institution that finances, racks, services, networks, and disposes of servers, both takes time and increases the commitment level. Hetzner is month to month, with a fixed overhead for fresh leasing of servers: the set-up fee.
This is a lot to administer when also building a software institution, and a business. It was not certain at the outset, for example, that the GitHub Actions Runner product would be as popular as it became. In its earliest form, it was partially an engineering test for our virtual machines, and we went around asking friendly contacts that we knew would report abnormalities to use it. There's another universe where it only went as far as an engineering test, and our utilization and revenue pattern (that is, utility to other people) is different.
Sometimes that other company isn't actually very good and you can increase value by insourcing their part of your operation. But you can't assume that is always the case. It wouldn't have solved this particular problem - I think we can safely guess that your chance of getting a batch of faulty motherboards is at least as high as Hetzner's chance.
> In the days that followed, the crash frequency increased.
I don't find the article conclusive whether they would still call them reliable.
https://www.reddit.com/r/aws/comments/131v8md/beware_of_brok...
Since they don't do any sort of monitoring on their bare metal servers at all, at least insofar as I can tell having been a customer of theirs for ten years, you don't know there's a problem until there's a problem, or unless you've got your own monitoring solution in place.
Back in 2006 my coworker claimed he was the person responsible for them adding an "exchange my dead HDD" menu item on the support site, because he wrote one of those tickets per week.
When I got a physical server, the HDD died in the first 48h, so I've not exactly forgiven them, even if this was a tragic story over the last 18 or so years...
On the other hand, I've been recommending their cloud vps for a couple of years because unlike with their HW, I've never had problems.
There are also others, but Hetzner is under discussion here.
You don’t get root access, but you do get a preinstalled LAMP stack and a web UI for management.
This was something I hadn't heard before, and a surprise to me.
I think it would be amusing if it turns out they just raised the power limits for the servers no longer showing the problem back up to the base level that was originally advertised.
YOU do the monitoring.
YOU do the troubleshooting.
YOU etc., etc.
If that doesn't appeal to you, or if you don't have the requisite knowledge, which I admit is fairly broad and encompassing, then it's not for you. For those of you that meet those checkboxes, they're a pretty amazing deal.
Where else could I get a 4c/8t CPU with 32 GB of RAM and four (4) 6TB disks for $38 a month? I really don't know of many places with that much hardware for that little cost. And yes, it's an Intel i7-3770, but I don't care. It's still a hell of a lot of hardware for not much price.