Timeline
15:45 UTC on 29 October 2025 – Customer impact began.
16:04 UTC on 29 October 2025 – Investigation commenced following monitoring alerts being triggered.
16:15 UTC on 29 October 2025 – We started to examine configuration changes within AFD.
16:18 UTC on 29 October 2025 – Initial communication posted to our public status page.
16:20 UTC on 29 October 2025 – Targeted communications to impacted customers sent via Azure Service Health.
17:26 UTC on 29 October 2025 – Azure portal was failed away from Azure Front Door.
17:30 UTC on 29 October 2025 – We blocked all new customer configuration changes to prevent further impact.
17:40 UTC on 29 October 2025 – We initiated the deployment of our ‘last known good’ configuration.
18:30 UTC on 29 October 2025 – We started to push the fixed configuration globally.
18:45 UTC on 29 October 2025 – Manual recovery of nodes commenced while gradual routing of traffic to healthy nodes began after the fixed configuration was pushed globally.
23:15 UTC on 29 October 2025 – PowerApps mitigated its dependency, and customers confirmed mitigation.
00:05 UTC on 30 October 2025 – AFD impact confirmed mitigated for customers.
Me: "How do I connect [X] to [Y] using [Z]?"
Copilot: "Please select the AKS cluster you'd like to delete"
Don't forget: it's also extremely insecure. There is a quarterly critical cross-tenant CVE with trivial exploitation for them, and it has been like that for years.
But what we do when things are easy is not who we are. That's a fiction. It's how we show up when we are in the shit that matters. It's discipline that tells you to voluntarily go into all of the multi-tenant mitigations instead of waiting for your boss to notice and move the goalposts you should have moved on your own.
Can't say I've experienced many bugs in there either. It definitely is overpriced but I assume they all are?
> 16:04 UTC on 29 October 2025 – Investigation commenced following monitoring alerts being triggered.
A 19-minute delay in alerting is a joke.
I think if you really wanted to do on call right to avoid gaps you’d want no more than 6 hours on primary per day per shift, and you want six, not four, shifts per day. So you’re only alone for four hours in the middle of your shift and have plenty of time to hand off.
It would be nice though if alert systems made it easy to wire up CD to turn down sensitivity during observed actions. Sort of like how the immune system turns down a bit while you're eating.
The reason is probably because changes to the status page require executive approval, because false positives could lead to bad publicity, and potentially having to reimburse customers for failing to meet SLAs.
We should be lucky MSFT is so consistent!
Hug ops to the Azure team, since management is shredding up talent over there.
Troubleshooting has completed
Troubleshooting was unable to automatically fix all of the issues found. You can find more details below.
>> We initiated the deployment of our ‘last known good’ configuration.
System Restore can help fix problems that might be making your computer run slowly or stop responding.
System Restore does not affect any of your documents, pictures, or other personal data. Recently installed programs and drivers might be uninstalled.
Confirm your restore point
Your computer will be restored to the state it was in before the event in the Description field below.
You don’t want to debug stuff with low blood sugar.
    16:04 Started running around screaming
    16:15 Sat down & looked at logs
Looks like there was no monitoring and no alerts.
Which is kinda weird.
I think it's perhaps a gap in the tools. We apply the same alert criteria at 2 am that we do while someone is actively running deployment or admin tasks and there's a subset that should stay the same, like request failure rate, and others that should be tuned down, like overall error rate and median response times.
And it means one thing if the failure rate for one machine is 90% and something else if the cluster failure rate is 5%, but if you've only got 18 boxes it's hard to discern the difference. And which is the higher priority error may change from one project to another.
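A rough sketch of that idea, assuming a hypothetical evaluate_alerts hook that a CD pipeline toggles into "deployment mode" (all names and thresholds here are made up for illustration):

    # Hypothetical deployment-aware alert evaluation; not tied to any real alerting product.
    from dataclasses import dataclass

    @dataclass
    class Thresholds:
        request_failure_rate: float   # stays strict even during deploys
        overall_error_rate: float     # relaxed during deploys
        median_response_ms: float     # relaxed during deploys

    NORMAL = Thresholds(request_failure_rate=0.01, overall_error_rate=0.02, median_response_ms=300)
    DEPLOYING = Thresholds(request_failure_rate=0.01, overall_error_rate=0.10, median_response_ms=1000)

    def evaluate_alerts(metrics: dict, deployment_in_progress: bool) -> list:
        t = DEPLOYING if deployment_in_progress else NORMAL
        alerts = []
        if metrics["request_failure_rate"] > t.request_failure_rate:
            alerts.append("request failure rate high")
        if metrics["overall_error_rate"] > t.overall_error_rate:
            alerts.append("overall error rate high")
        if metrics["median_response_ms"] > t.median_response_ms:
            alerts.append("median latency high")
        # A single box at 90% failures is a different signal than the whole
        # cluster at 5%, so check the per-host view separately.
        for host, rate in metrics.get("per_host_failure_rate", {}).items():
            if rate > 0.5:
                alerts.append(f"{host}: host-level failure rate {rate:.0%}")
        return alerts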
Very circular way of saying “the validator didn’t do its job”. This is AFAICT a pretty fundamental root cause of the issue.
It’s never good enough to have a validator check the content and hope that finds all the issues. Validators are great and can speed a lot of things up. But because they are independent code paths they will always miss something. For critical services you have to assume the validator will be wrong, and be prepared to contain the damage WHEN it is wrong.
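In other words, treat the validator as a filter rather than a guarantee, and stage the rollout so a config that slips through only hits a small slice of the fleet first. A minimal sketch of that containment idea (all function names hypothetical, and nothing to do with how AFD actually deploys):

    # Hypothetical staged config rollout: validate, then expand in waves,
    # rolling back to the last known good config if health regresses.
    import time

    WAVES = [0.01, 0.05, 0.25, 1.00]   # fraction of the fleet per wave

    def rollout(new_config, validate, apply_to_fraction, fleet_healthy, last_known_good):
        if not validate(new_config):
            raise ValueError("validation failed")      # cheap first line of defense
        for fraction in WAVES:
            apply_to_fraction(new_config, fraction)
            time.sleep(300)                            # bake time before widening the blast radius
            if not fleet_healthy():
                # Assume the validator was wrong and contain the damage.
                apply_to_fraction(last_known_good, 1.00)
                raise RuntimeError(f"health regressed at {fraction:.0%} rollout; rolled back")
        return "rolled out to 100%"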
Azure Portal Access Issues
Starting at approximately 16:00 UTC, we began experiencing Azure Front Door issues resulting in a loss of availability of some services. In addition, customers may experience issues accessing the Azure Portal. Customers can attempt to use programmatic methods (PowerShell, CLI, etc.) to access/utilize resources if they are unable to access the portal directly. We have failed the portal away from Azure Front Door (AFD) to attempt to mitigate the portal access issues and are continuing to assess the situation.
We are actively assessing failover options of internal services from our AFD infrastructure. Our investigation into the contributing factors and additional recovery workstreams continues. More information will be provided within 60 minutes or sooner.
This message was last updated at 16:57 UTC on 29 October 2025
---
Update: 16:35 UTC:
Azure Portal Access Issues
Starting at approximately 16:00 UTC, we began experiencing DNS issues resulting in availability degradation of some services. Customers may experience issues accessing the Azure Portal. We have taken action that is expected to address the portal access issues here shortly. We are actively investigating the underlying issue and additional mitigation actions. More information will be provided within 60 minutes or sooner.
This message was last updated at 16:35 UTC on 29 October 2025
---
Azure Portal Access Issues
We are investigating an issue with the Azure Portal where customers may be experiencing issues accessing the portal. More information will be provided shortly.
This message was last updated at 16:18 UTC on 29 October 2025
---
Message from the Azure Status Page: https://azure.status.microsoft/en-gb/status
Starting at approximately 16:00 UTC, we began experiencing Azure Front Door issues resulting in a loss of availability of some services. We suspect that an inadvertent configuration change as the trigger event for this issue. We are taking two concurrent actions where we are blocking all changes to the AFD services and at the same time rolling back to our last known good state.
We have failed the portal away from Azure Front Door (AFD) to mitigate the portal access issues. Customers should be able to access the Azure management portal directly.
We do not have an ETA for when the rollback will be completed, but we will update this communication within 30 minutes or when we have an update.
This message was last updated at 17:17 UTC on 29 October 2025
"This message was last updated at 18:11 UTC on 29 October 2025"
This message was last updated at 19:57 UTC on 29 October 2025
> In 50%+ of cases they just don't report it anywhere, even if it's for 2h+.
I assume you mean publicly. Are you getting the service health alerts?
But, for future reference:
site:microsoft.com csam
Storytelling is how issues get addressed. Help the CSAM tell the story to the higher ups.
Child Sex-Abuse Material?!? Well, a nice case of acronym collision.
No -- the one referencing crime should NEVER have been turned into an acronym.
Crimes should not be described in euphemistic terms (which is exactly what the acronym is)
actual Managers hate that
I'm simplifying a bit, but I don't think it's likely that Azure has a similar race condition wiping out DNS records on _one_ system that then propagates to all others. The similarity might just end at "it was DNS".
They didn't provide any details on latency. It could have been delayed an hour or a day and no one noticed
Edit: Typo!
• https://www.xbox.com/en-US also doesn't fully paint. Header comes up, but not the rest of the page.
• https://www.minecraft.net/en-us is extremely slow, but eventually came up.
The other day during the AWS outage they "reported" OVH down too.
We already had to do it for large files served from Blob Storage since they would cap out at 2MB/s when not in cache of the nearest PoP. If you’ve ever experienced slow Windows Store or Xbox downloads it’s probably the same problem.
I had a support ticket open for months about this and in the end the agent said “this is to be expected and we don’t plan on doing anything about it”.
We’ve moved to Cloudflare and not only is the performance great, but it costs less.
Only thing I need to move off Front Door is a static website for our docs served from Blob Storage, this incident will make us do it sooner rather than later.
Be aware that if you’re using Azure as your registrar, it’s (probably still) impossible to change your NS records to point to CloudFlare’s DNS server, at least it was for me about 6 months ago.
This also makes it impossible to transfer your domain to them either, as CloudFlare’s domain transfer flow requires you set your NS records to point to them before their interface shows a transfer option.
In our case we had to transfer to a different registrar, we used Namecheap.
However, transferring a domain from Azure was also a nightmare. Their UI doesn’t have any kind of transfer option, I eventually found an obscure document (not on their Learn website) which had an az command which would let you get a transfer code which I could give to Namecheap.
Then I had to wait over a week for the transfer timeout to occur because there is no way on Azure side that I could find to accept the transfer immediately.
I found CloudFlare’s way of building rules quite easy to use, different from Front Door but I’m not doing anything more complex than some redirects and reverse proxying.
I will say that Cloudflare’s UI is super fast, with Front Door I always found it painfully slow when trying to do any kind of configuration.
Cloudflare also doesn’t have the problem that Front Door has where it requires a manual process every 6 months or so to renew the APEX certificate.
They quickly updated the message to REMOVE the link. Comical at this point.
https://news.ycombinator.com/item?id=32031639
https://news.ycombinator.com/item?id=32032235
Edit: wow, I can't believe we hadn't put https://news.ycombinator.com/item?id=32031243 in https://news.ycombinator.com/highlights. Fixed now.
Long before that, the first raid array anyone set up for my (teams’) usage, arrived from Sun with 2 dead drives out of 10. They RMA’d us 2 more drives and one of those was also DOA. That was a couple years after Sun stopped burning in hardware for cost savings, which maybe wasn’t that much of a savings all things considered.
I was an intern but everyone seemed very stressed.
dang saying it's temporary: https://news.ycombinator.com/item?id=32031136
    $ dig news.ycombinator.com
    ; <<>> DiG 9.10.6 <<>> news.ycombinator.com
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54819
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 512
    ;; QUESTION SECTION:
    ;news.ycombinator.com.  IN A
    ;; ANSWER SECTION:
    news.ycombinator.com. 1 IN A 209.216.230.207
    ;; Query time: 79 msec
    ;; SERVER: 100.100.100.100#53(100.100.100.100)
    ;; WHEN: Wed Oct 29 13:59:29 EDT 2025
    ;; MSG SIZE  rcvd: 65
Bunch of on-call peeps over there that definitely know the instant something major goes down
But they won't be.
I've seen it multiple times at various stores; only once did I see them taking cash and writing things down (probably to enter into the system later when it came back up).
Time and time again it's shown that AWS is far more expensive than other solutions, just easier for the Execs to offshore the blame.
they think that they are 'eliminating a single point of failure', but in reality, they end up adding multiple, complicated points of mostly failure.
However netatmo does need to have a server to store data, as you need to consolidate across devices; plus you can query for a year's data, and that won't and can't be held locally.
I always go everywhere adequately prepared for beverages and food. Thanks to your comment, I have a new reason to do so. Take out coffees are actually far from guaranteed. Payment systems could go down, my bank account could be hacked or maybe the coffee shop could be randomly closed. Heck, I might even have an accident crossing the road. Anything could happen. Hence, my humble flask might not have the top beverage in it but at least it works.
We all design systems with redundancy, backups and whatnot, but few of us apply this thinking to our food and drink. Maybe get a kettle for the office and a backup kettle, in case the first one fails?
Here in The Netherlands, almost all trains were first delayed significantly, and then cancelled for a few hours because of this, which had real impact because today is also the day we got to vote for the next parliament (I know some who can't get home in time before the polls close, and they left for work before they opened).
If it’s a multi day event, it’s probably that way for a reason. Partially the same as the solution to above.
The description of voting in the Netherlands is that you can see your ballot physically go into a clear box and stay to see that exact box be opened and all ballots tallied.
Dropping a ballot in a box in your neighborhood helps ensure nothing with regard to the actual ballot count.
> You can stay there and wait for the count at the end of the day if you want to.
And if you watch the election night news, you'll see footage of multiple people counting the votes from the ballot boxes, again with various people observing to check that nothing dodgy is going on.
Having everyone just put their ballots in a postbox seems like a good way to remove public trust from the electoral system, because no one's standing around waiting for the postie to collect the mail, or looking at what happens in the mail truck, or the rest of the mail distribution process.
I'm sure I've seen reports in the US of people burning postboxes around election time. Things like this give more excuses to treat election results as illegitimate, which I believe has been an issue over there.
(Yes, we do also have advanced voting in NZ, but I think they're considered "special votes" and are counted separately .. the elections are largely determined on the day by in-person votes, with the special votes being confirmed some days later)
It is a small but distinct difference between mail/early voting and putting the votes directly into the ballot box.
(AI generated explanation) How the double-envelope system works:
Inner “secrecy” envelope – You mark your ballot, fold it, and slip it into an unmarked inner envelope. No name or identifying info is on this envelope, so your choices stay anonymous.
Outer declaration envelope – The inner envelope goes inside a larger outer envelope that carries:
– A ballot ID/barcode unique to you.
– A signature line that must match the one on file with your election office.
In many states, a detachable privacy flap or perforated strip hides the signature until election officials open the outer envelope, keeping the ballot secret.
There's so much more you have to trust.
If you wish, you can write a phrase on your ballot. The phrases and their corresponding vote are broadcast (on tv, internet, etc). So if you want to validate that your vote was tallied correctly, write a unique phrase. Or you could pick a random 30 digit number, collisions should be zero-probability, right?
I mean, this would be annoying because people would write slurs and advertisements, and the government would have to broadcast them. But, it seems pretty robust.
I’d suggest the state handle the number issuing, but then they could record who they issued which numbers to, and the winning party could go about rounding up their opposition, etc.
Googling around a bit, it sounds like there are systems that let you verify that your ballot made it, but not necessarily that it was counted correctly. (For this reason, I guess?)
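On the "zero-probability" point, a quick birthday-bound check does back up the intuition, assuming everyone picks a uniformly random 30-digit number (toy numbers below):

    # Back-of-envelope birthday bound: P(any collision) ≈ n^2 / (2N)
    n = 10**8    # ~number of voters
    N = 10**30   # size of the 30-digit number space
    print(n * n / (2 * N))   # ≈ 5e-15, i.e. effectively zero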
When I vote in person, I know all the officials there from various parties are just like...looking at the box for the whole day to make sure everything is counted. It's much easier to understand and trust.
Sure you got a notification! That doesn't mean anything. Even with human counted ballots or electronic ballots.
Following the chain of custody from vote to verification, in some way, would be nice.
Here in Latvia the "election day" is usually (always?) on weekend, but the polling stations are open for some (and different!) part of every weekday leading up. Something like couple hours on monday morning, couple hours on tuesday evening, couple around midday wednesday, etc. In my opinion, it's a great system. You have to have a pretty convoluted schedule for at least one window not to line up for you.
Here is the form to register for postal voting in the Republic of Ireland - https://www.dublincity.ie/sites/default/files/2024-01/pv4-wo...
Instructions on how to submit the form / register for mail-in votes is on page 4.
Hope that helps anyone else out who needs it in Ireland.
> You may use this form to apply for a postal vote if, due to the circumstances of your work/service or your full-time study in the State, you cannot go to your polling station on polling day.
Which seems to indicate that's only for people who can't go to the polling station, otherwise you do have to go there.
As someone who spent the first 30 years of my life in Ireland but is now part of that diaspora, it's frustrating but I get it. I don't get to vote, but neither do thousands of plastic paddys who have very little genuine connection to Ireland.
That said, I'm sure they could expand the voting window to a couple of days at least without too much issue.
But I still prefer the paper vote, and I'm usually apathetic about blockchain.
We've been closing a lot of polling places recently:
https://abcnews.go.com/US/protecting-vote-1-5-election-day-p...
Here's the President of the United States on Sunday: https://truthsocial.com/@realDonaldTrump/posts/1154418712892...
"No mail-in or 'Early' Voting, Yes to Voter ID! Watch how totally dishonest the California Prop Vote is! Millions of Ballots being 'shipped.' GET SMART REPUBLICANS, BEFORE IT IS TOO LATE!!!"
Mail in voting is just better all around for a geographically diverse place as the US and I wish would be adopted by all states.
So excited to see how the right-wing pedants here disagree with this.
If so, I see a lot to dislike. The point I was making is that you can't anticipate what might come up. Just because it's worked thus far doesn't mean it's designed for resilience. There are a lot of ways you could miss out in that type of situation. It seems silly to make sure everything else is redundant and fault tolerant in the name of democracy when the democratic process itself isn't doing the same.
That’s just ridiculous in my opinion. Makes me wonder how many well-intentioned would-be voters end up missing out each election, because shit happens and voting is pretty optional.
What is the that group's deviation from the general voting population's preferences?
What are the margins of the votes on those ballot questions?
In most countries, in an election you vote for the member of parliament you want. Presidential elections and city council elections are held separately, but are equally simple. In any one election you cast your vote for one person, and that's it.
With this kind of election, many countries manage to hold elections on paper ballots, count them all by hand, and publish results by midnight.
But on an American ballot, you vote for, for example:
    - US president
    - US senator  
    - US member of congress  
    - state governor  
    - state senator  
    - state member of congress  
    - several votes for several different state judge positions  
    - several other state officer positions  
    - several votes for several local county officers  
    - local sheriff  
    - local school board member  
    - several yes/no votes for several proposed laws, whether they should be passed or not
Here in Indonesia, in a city of 2 million people there are over 7000 voting stations. While we vote on 5 ballots (President, plus Legislative at the National, Province, and City/Regency levels), we still use paper ballots and count them by hand.
There is a ballot tracking system as well, I can see and be notified as my ballot moves through the counting system. It's pretty cool.
I actually just got back from dropping off my local elections ballot 15m ago, quick bike trip maybe a mile or so away and back.
Of course, because it makes it easy for people to vote, the republicans want to do away with it. If you have to stand in line for several hours (which seems to be very normal in most cities) and potentially miss work to do it that's going to all but guarantee that working people and the less motivated will not vote.
So yes in places that only do in person voting, national or state holiday.
https://nltimes.nl/2025/10/29/ns-hit-microsoft-cloud-outage-...
It should be noted that the article isn't complete: while the travel planner and ticket machines were the first to fail, trains were cancelled soon after; it took a few hours before everything restarted.
Based on what the conductors said, I would speculate that the train drivers' digital schedules were not working, so they didn't know where to go next.
This list doesn't have anything that looks relevant: https://www.rijdendetreinen.nl/en/disruptions/archive?date_b...
The day does not appear as an outlier in the monthly statistics: https://www.rijdendetreinen.nl/en/statistics/2025/10
I don't find a detailed statistic on the overall delays, but the per-station statistics for Amsterdam Centraal say 5% of trains were cancelled and 17% were delayed by 5 minutes or more (mostly by 10 minutes): https://www.rijdendetreinen.nl/en/train-archive/2025-10-29/a...
I do need a human to provision a few servers and configure e.g. load balancing and when to spin up additional servers under load. But that is far less of a PITA than having my systems tied to a specific provider or down whenever a cloud precipitates.
The moment you choose to use S3 instead of hosting your own object store, though, you either use AWS because S3 and IAM already have you, or spend more time on the care and feeding of your storage system as opposed to actually doing the thing your customers are paying you to do.
It's not impossible, just complicated and difficult for any moderately complex architecture.
One very important thing is that I can authorise specific web clients (users) to access specific resources from S3, such as a document that one user can download but that others with the link cannot.
Thank you!
Another way you can do it is generating pre-signed URLs in your backend on each request to download something... but the URL that is generated when you do that is only valid for some small time period, so not a stable URL at all.
In my use case, I needed stable URLs, so I went the proxy route.
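For anyone weighing the two approaches, the pre-signed URL route is a one-liner with boto3 (bucket and key names below are made up); the catch is exactly the expiry mentioned above:

    import boto3

    s3 = boto3.client("s3")
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "example-private-bucket", "Key": "docs/report.pdf"},
        ExpiresIn=900,  # valid for 15 minutes, so not a stable URL
    )
    print(url)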
I really do feel the only viable future for clouds is hybrid or agnostic clouds.
Horses were famously tamed in 2007 after AWS released S3 to the public, this is the best of times.
Old trains had paper tickets, the locomotive was its own power source, the conductor had a flashlight, and the conductor could sell tickets for cash.
And if everything else failed, the conductor would just let you ride for free.
Now everything's so interconnected that any one part failing brings everything to a halt.
Personally I am thinking more and more about hetzner; yes, I know it's not an apples-to-apples comparison. But it's honestly so good.
Someone had created a video where they showed the underlying hardware etc. I am wondering if there is something like https://vpspricetracker.com/ but with geek-benchmarks as well.
The video was affiliated with scalahosting, but I still don't think there was too much bias from them, and at around 3:37 they show a graph comparing prices: https://www.youtube.com/watch?v=9dvuBH2Pc1g
Now it shows how contabo has better hardware, but I am pretty sure there might be some other issues, and honestly I feel a sense of trust with hetzner that I am not sure about with others.
Either hetzner, or self hosting stuff personally, or just having a very cheap VPS and moving to hetzner if need be (though hetzner is already pretty cheap), or I might use some free services that I know are good as well.
https://blog.cloudflare.com/rearchitecting-workers-kv-for-re...
Personally I just trust cloudflare more than google, given how their focus is on security whereas google feels googly...
I have heard some good things about Google Cloud Run, and Google's interface feels the best out of AWS, Azure, and GCloud, but I would still just prefer cloudflare/hetzner.
Another question: has there ever been a list of all major cloud outages? I'm interested in how many times Google Cloud and the other providers have gone majorly down. Is there a website/git project that tracks this?
I have never had much confidence in Azure as a cloud provider. The vertical integration of all the things for a Microsoft shop was initially very compelling. I was ready to fight that battle. But, this fantasy was quickly ruined by poor execution on Microsoft's part. They were able to convince me to move back to AWS by simply making it difficult to provision compute resources. Their quota system & availability issues are a nightmare to deal with compared to EC2.
At this point I'd rather use GCP over Azure and I have zero seconds of experience with it. The number of things Microsoft gets right in 2025 can be counted single-handedly. The things they do get right are quite good, but everything else tends to be extremely awful.
I remember I at one point had expanded enough menus that it covered the entirety of the screen.
Never before have I felt so lost in a cloud product.
Yeah, that had some fun ideas but was way more confusing than it needed to be. But also that was quite a few years back now. The Portal ditched that experience relatively quickly. Just long enough to leave a lot of awful first impressions, but not long enough for it to be much more than a distant memory at this point, several redesigns later.
[0] The name "Blades" for that came from the early years of the Xbox 360, maybe not the best UX to emulate for a complex control panel/portal.
Like, AWS, and GCP to a lesser extent, has a principled approach where simple click-ops goals are simple. You can access the richer metadata/IAM object model at any time, but the wizards you see are dumb enough to make easy things easy.
With Azure, those blades allow tremendously complex “you need to build an X Container and a Container Bucket to be able to add an X” flows to coexist on the same page. While this exposes the true complexity, and looks cool/works well for power users, it is exceedingly unintuitive. Inline documentation doesn’t solve this problem.
I sometimes wonder if this is by design: like QuickBooks, there’s an entire economy of consultants who need to be Certified and thus will promote your product for their own benefit! Making the interface friendly to them and daunting to mere mortals is a feature, not a bug.
But in Azure’s case it’s hard to tell how much this is intentional.
I don't want to pay for or lock myself into, "Azure Insights".
I just want to see the logs that I know are available, if only I can remember the right buttons to click.
The worst place to try is "Monitoring > Logs", this is where you get faced up front with a query designer. I've never worked out how to do a simple "list by time" on that query designer, but it doesn't matter, because if you suffer through that UX, you find out that's not actually where the logs are anyway.
You have to go down a different path. Don't be distracted by "Log Stream", that's not it either, it sounds useful but it's not. By default it doesn't log anything. If you do configure it to log, then it still doesn't actually log everything.
What you have to actually do, and I've had to open the portal to check this, is click "Diagnose and Solve Problems" and then look for "Diagnostic tools" and then a small link to "Application Event Logs".
Finally you get to your logs, although it's still a bad way to try to view logs, it's at least marginally better than the real windows event viewer, an application that feels like it hasn't been updated since NT4. ( Although some might suggest that's a good thing. )
By bringing those eyeballs onto your cloud console, you're creating infinitely more opportunities for branded interaction and discovery of your other cloud products - you could even quantify these eyeballs as you would ad inventory! There should have been an arms race for each cloud provider to have the best log-tailing and log-searching and log-aggregation system imaginable. OTel could have been killed before it began, because Honeycomb and its other originators would have been acquired years ago and made specific locked-in value-adds for each cloud.
But nobody had this foresight, and thus comments like yours are absolutely correct. OTel is a blessing and I love the tools coming out. But from a cloud provider's perspective, it's a massive missed opportunity that continues to be missed.
I think that's what Application Insights has always been, Azure's free-to-start, suggest-out-of-the-box Honeycomb. App Insights had a long slow road away from Microsoft-specific log and metrics ingesters that weren't OTel, but it is hard to argue that standard ingestors are a bad idea. App Insights still downplays that it can be "just a Honeycomb" using only OTel sources and still encourages "secret sauce" ingestors in addition to OTel ones. App Insights is a small moat (around a data lake; to mix metaphors). That said, it's also a standards-supporting tool now as well.
It's not been as clear of an arms race because AWS and GCP didn't invest in it in a similar way and it mostly impacted what are often called "dark matter" teams (Microsoft shops doing "boring" stuff that rarely makes HN headlines), but I have worked in teams that absolutely favored Azure over AWS/GCP with one of the reasons being Application Insights was an easy install and powerful first-party supported tool rather than an extra third party vendor relationship like Grafana/Honeycomb/Dynatrace/etc.
Here's a somewhat ancient Stack Overflow screenshot I found: https://i.sstatic.net/yCseI.png
(I think that's from near the transition because it has full "windowing" controls of minimize/maximize/close buttons. I recall a period with only close buttons.)
All that blue space you could keep filling with more "blades" as you clicked on things until the entire page started scrolling horizontally to switch between "blades". Almost everything you could click opened in a new blade rather than in place in the existing blade. (Like having "Open in New Window" as your browser default.)
It was trying to merge the needs of a configurable Dashboard and a "multi-window experience". You could save collections of blades (a bit like Niri workspaces) as named Dashboards. Overall it was somewhere between overkill and underthought.
(Also someone reminded me that many "blades" still somewhat exist in the modern Portal, because, of course, Microsoft backwards compatibility. Some of the pages are just "maximized Blades" and you can accidentally unmaximize them and start horizontally scrolling into new blades.)
depending on the resource you're accessing, you can get 5+ sections each with their own ui/ux on the same page/tab and it can be confusing to understand where you're at in your resources
if you're having trouble visualizing it, imagine an url where each new level is a different application with its own ui/ux and purpose all on the same webpage
I never understood why a clear and consistent UI and improved UX isn't more of a priority for the big three cloud providers. Even though you talk mostly via platform SDKs, I would consider a better UI, especially initially, a good way to bind new customers and get them to pick your platform over others.
I guess with their bottom line they don't need it (or cynically, you don't want to learn and invest in another cloud if you did it once).
For some reason this applies to all AWS, GCP and Azure. Seems like the result of dozens of acquisitions.
Any time something is that unintuitive to get started, I automatically assume that if I encounter a problem that I’ll be unable to solve it. That thought alone leads me to bounce every time.
AWS Is a complete mess. Everything is obscured behind other products, and they're all named in the most confusing way possible.
MSFT : Hold my beer...
TBH, GCP is very good! More people should use it.
https://cloud.google.com/resource-manager/docs/project-suspe...
I'd hope you can create a Google Cloud account under a completely different email address, but I do as little business with Google as I can get away with, so I have no idea.
>TBH, GCP is very good! More people should use it.
These takes couldn't be further apart. Gotta love HN comments.
I feel like compliance is the entire point of using these cloud providers. You get a huge head start. Maintaining something like PCI-DSS when you own the real estate is a much bigger headache than if it's hosted in a provider who is already compliant up through the physical/hardware/networking layers. Getting application-layer checkboxes ticked off is trivial compared to "oops we forgot to hire an armed security team". I just took a look and there are currently 316 certifications and attestations listed under my account.
Microsoft really wants you to use their PaaS offerings, and so things on Azure are priced accordingly. For a Microsoft shop just wanting to lift-and-shift, Azure isn't the best choice unless the org has that "nobody ever got fired for buying Microsoft" attitude.
They think they have the market captured, but I think what their dwindling quality and ethics are really going to drive is adoption of self hosting and distributed computing frameworks. Nerds are the ones who drove adoption of these platforms, and we can eventually end it if we put in the work.
Seriously with container technology, and a bit more work / adoption on distributed compute systems and file storage (IPFS,FileCoin) there is a future where we dont have to use big brothers compute platform. Fuck these guys.
I really hope this pushes the internet back to how it used to be, self hosted, privacy, anonymity. I truly hope that's where we're headed, but the masses seem to just want to stay comfortable as long as their show is on TV
if all companies focused on fixing each and every social issue that exists in the world, how would they make any money?
From 2000-2016 most tech marketing/branding was aimed at some kind of social benefit.
I would link to that article, but that one does seem down ;)
> They're stating they're working with the Azure teams, so I suspect this is related.
Credit card information would be recorded by the POS, synced to a mini-server in the back office (using store-and-forward to handle network issues) and then in a batch process overnight, sent to HQ where the payment was processed.
It wasn't until chip-and-PIN was rolled out that they started supporting "online" (i.e. processed then and there) card transactions, and even then the old method still worked if there was a network issue or power failure (all POSes had their own UPS).
The only real risk at the time was that someone tried to pay with a cancelled credit card - the bank would always honour the payment otherwise. But that was pretty uncommon back then, as you'd have to phone your bank to do it, not just press a button in an app.
Chick-fil-a has this.
One of the tech people there was on HN a few years ago describing their system. Credit card approval slows down the line, so the cards are automatically "approved" at the terminal, and the transaction is added to a queue.
The loss from fraudulent transactions turns out to be less than the loss from customers choosing another restaurant because of the speed of the lines.
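A toy sketch of the store-and-forward / deferred-authorization idea described above (purely illustrative, nothing like either company's actual POS software): the sale is recorded locally no matter what, and forwarding happens whenever the network allows, e.g. in an overnight batch.

    # Toy store-and-forward queue for payment records.
    import json, pathlib

    QUEUE = pathlib.Path("pending_transactions.jsonl")

    def record_sale(txn: dict) -> None:
        with QUEUE.open("a") as f:
            f.write(json.dumps(txn) + "\n")   # durable before the customer walks away

    def forward_batch(send) -> int:
        """Forward queued transactions; send() returns False when the network is down."""
        if not QUEUE.exists():
            return 0
        lines = QUEUE.read_text().splitlines()
        sent = 0
        for line in lines:
            if send(json.loads(line)):
                sent += 1
            else:
                break                          # keep the rest queued and retry later
        remaining = lines[sent:]
        QUEUE.write_text("\n".join(remaining) + ("\n" if remaining else ""))
        return sent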
I go there daily because it's a nice 30min round trip walk and I wfh. I go up there to get a diet coke or something else just to get out of the house. It amazes me when I see a handwritten sign on the door "closed, system is down". I've gotten to know the cashiers so I asked, and it's because the internet connection goes down all the time. That store has to be one of the most poorly run things I've ever seen, yet it stays in business somehow.
Your responses imply that you think people are questioning whether you would lose money on the deal while we are instead saying you’ll get laughed out of the store, or possibly asked never to come back.
It seems like an easy problem to fix and a retail store being closed for a whole weekday because of inet access sounds crazy to me.
1: I doubt they're "with it" enough to put together a backup arrangement for internet.
2: Their internet problems are probably due to a cheapo router, loose wire, etc.
3: The employees probably like the break.
Good luck if you make this work for you, it would be exciting to hear about if you're able to get them to work with you.
EDIT: their last quarterly was 36%. they lost $3.7bn in 24Q4 -- the christmas quarter. sold to PE in Q1.
Why doesn't someone in the store at least have one of those manual kachunk-kachunk carbon copy card readers in the back that they can resuscitate for a few days until the technology is turned back on? Did they throw them all away?
And that was the day Visa had a full on outage. We would walk into one shop, try to buy stuff, get declined, then go into the next and get accepted because they were running in offline mode.
Got a nice big bill from my cellphone carrier for making the call to visa to ask them wtf as well.
How aptly descriptive.
The stores are in the hood or middle of nowhere. The customers don’t have many options.
Last week I couldn't pay for flowers for grandma's grave because the smartphone-sized card terminal refused to work; it was stuck in a charging/booting loop, so I had to get cash. Though my partner thinks the seller actually wanted cash without a receipt, to keep it for herself off the books.
Whereas the smaller, owner-run stores have more leeway; the local tiny grocery "sold" all freezer/refrigerator food for cheap/free during a power failure. The big Walmart closed and threw everything away the next day.
God help me if I hand someone $25 for a $14.75 total. I’m getting small bills back.
I wonder what they teach in Germany.
Its not the we are not capable. Its, is the business willing to assume the risk?
There's a fairly large supermarket near me that has both kinds of outages.
Occasionally it can't take cards because the (fiber? cable?) internet is down, so it's cash only.
Occasionally it can't take cash because the safe has its own cellular connection, and the cell tower is down.
I was at Frank's Pizza in downtown Houston a few weeks ago and they were giving slices of pizza away because the POS terminal died, and nobody knew enough math to take cash. I tried to give them a $10 and told them to keep the change, but "keep the change" is an unknown phrase these days. They simply couldn't wrap their brains around it. But hey, free pizza!
I feel pretty justified in my previous decisions to move away from Azure. Using it feels like building on quicksand…
At this point I dont believe that any one of them is any better or reliable than the others.
I felt this way about AWS last week
And microsoft.com too - that's gotta hurt
- on a US tenant I am unable to access login.microsoftonline.com and the login flow stalls on any SSO authentication attempt.
- on a European tenant, probably germany-west, I am able to login and access the Azure portal.
Luckily, we moved off Azure Front Door about a year ago. We’d had three major incidents tied to Front Door and stopped treating it as a reliable CDN.
They weren’t global outages, more like issues triggered by new deployments. In one case, our homepage suddenly showed a huge Microsoft banner about a “post-quantum encryption algorithm” or something along those lines.
Kinda wild that a company that big can be so shaky on a CDN, which should be rock solid.
Error: visual-studio-code: Download failed on Cask 'visual-studio-code' with message: Download failed: https://update.code.visualstudio.com/1.105.1/darwin-arm64/st...
The root zone and www. do not: https://dnschecker.org/#A/microsoft.com (all resolvers return records)
And querying https://www.microsoft.com/ results in HTTP 200 on the root document, but the page elements return errors (a 504 on the .css/.js documents, a 404 on some fonts, Name Not Resolved on scripts.clarity.ms, Connection Timed Out on wcpstatic.microsoft.com and mem.gfx.ms). That many different kinds of errors is actually kind of impressive.
I'm gonna say this was a networking/routing issue. The CDN stayed up, but everything else non-CDN became unroutable, and different requests traveled through different paths/services, but each eventually hit the bad network path, and that's what created all the different responses. Could also have been a bad deploy or a service stopped running and there's different things trying to access that service in different ways, leading to the weird responses... but that wouldn't explain the failed DNS propagation.
I wonder if this is microsoft "learning" to "prevent" such an issue and instead triggered it...
"One often meets his destiny on the path he takes to avoid it" -- Master Oogway
2028: the year of migrating from a managed provider to the cloud
2029: the year of migrating from the cloud to your own metal in a rack
People keep thinking the solution to their problems is to do something new (that they don't fully understand).
TIL it's called Nirvana Fallacy
We used to call it "The grass is always greener on the other side of the fence."
> There are currently no active events. Use Azure Service Health to view other issues that may be impacting your services.
Links to a page on Azure Portal which is down...
"We are investigating an issue with the Azure Portal where customers may be experiencing issues accessing the portal. More information will be provided shortly."
Moving a website quickly is never fun.
It's only after the fact they are transparent about the impact
It acts as a GSLB controller inside Kubernetes — doing DNS-level health checks, region awareness, and automatic failover between clusters when one goes down.
It integrates with ExternalDNS and supports multiple DNS providers (Infoblox, Route53, Azure DNS, NS1, etc.), so it can handle failover across both on-prem and cloud clusters.
It’s not a silver bullet for every architecture, but it’s one of the few OSS projects that make multi-region failover actually manageable in practice.
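Conceptually (not the project's actual code), that DNS-level failover boils down to a loop like this: probe each cluster's health endpoint and only publish records for the clusters that answer. Endpoint names and IPs below are hypothetical.

    # Toy GSLB-style failover: publish DNS records only for healthy clusters.
    import requests

    CLUSTERS = {
        "eu": {"health": "https://eu.example.internal/healthz", "ip": "203.0.113.10"},
        "us": {"health": "https://us.example.internal/healthz", "ip": "198.51.100.20"},
    }

    def healthy(url: str) -> bool:
        try:
            return requests.get(url, timeout=2).status_code == 200
        except requests.RequestException:
            return False

    def desired_records(hostname: str = "app.example.com"):
        ips = [c["ip"] for c in CLUSTERS.values() if healthy(c["health"])]
        # Hand this off to whatever owns the zone (ExternalDNS, Route53,
        # Azure DNS, ...) and keep TTLs short so failover is quick.
        return [(hostname, ip) for ip in ips]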
How did we get here? Is it because of scale? Going to market in minutes by using someone else's computers instead of building out your own, like co-location or dedicated servers, like back in the day.
Now, they go down a lot less frequently, but when they do, it's more widespread.
I work on a product hosted on Azure. That's not the case. Except for front door, everything else is running fine. (Front door is a reverse proxy for static web sites.)
The product itself (an iot stormwater management system) is running, but our customers just can't access the website. If they need to do something, they can go out to the sites or call us and we can "rub two sticks together" and bypass the website. (We could also bypass front door if someone twisted our arms.)
Most customers only look at the website a few times a year.
---
That being said, our biggest point of failure is a completely different iot vendor who you probably won't hear about on Hacker News when they, or their data networks, have downtime.
Now I will admit I am more of a point-and-click person; but if I had to I could have figured out how to use the command line.
> Big Tech lobbying is riding the EU’s deregulation wave by spending more, hiring more, and pushing more, according to a new report by NGO’s Corporate Europe Observatory and LobbyControl on Wednesday (29 October).
> Based on data from the EU’s transparency register, the NGOs found that tech companies spend the most on lobbying of any sector, spending €151m a year on lobbying — a 33 percent increase from €113m in 2023.
Gee whizz, I really do wonder how they end up having all the power!
I think the response lies in the surrounding ecosystem.
If you have a company it's easier to scale your team if you use AWS (or any other established ecosystem). It's way easier to hire 10 engineers that are competent with AWS tools than it is to hire 10 engineers that are competent with the IBM tools.
And from the individuals perspective it also make sense to bet on larger platforms. If you want to increase your odds of getting a new job, learning the AWS tools gives you a better ROI than learning the IBM tools.
Pick your point on the scale
But the cloud compute market is basically centralized into 2.5 companies at this point. The point of paying companies like Azure here is that they've in theory centralized the knowledge and know-how of running multiple, distributed datacenters, so as to be resilient.
But when we keep seeing outages encompassing more than a single failure domain, it should be fair game for engineers / customers to ask "what am I paying for, again?"
Moreover, this seems to be a classic case of large barriers to entry (the huge capital costs associated with building out a datacenter) barring new entrants into the market, coupled with "nobody ever got fired for buying IBM" level thinking. Are outages like these truly factored into the napkin math that says externalizing this is worth it?
In our highly interconnected world, decentralization paradoxically requires a central authority to enforce decentralization by restricting M&A, cartels, etc.
Stonks
You name them. Other good providers you have experience with?
There is no reason for an expensive cloud. Never has been, but decision makers tried to keep their pants dry.
If Azure goes down and nobody feels it, does Azure really matter?
If Azure goes down, it's mostly affecting internal stuff at big old enterprises. Jane in accounting might notice, but the customers don't. Contrast with AWS which runs most of the world's SaaS products.
People not being able to do their jobs internally for a day tends not to make headlines like "100 popular internet services down for everyone" does.
Even the national digital id service is down.
Can't help but smirk as my country is ramming through "Digital ID" right now
What a time to be alive.
The Microsoft status page mostly referenced the portal outage, but it was more than that.
- Cloudflare for R2 (object storage) and CDN (Fastly + Backblaze also available).
- Two VPS/server providers with a decent reputation and mid-size (use a comparison site like https://serversearcher.com, or look directly into providers like Hetzner or Latitude).
- PlanetScale or Neon for the database if you don't co-locate it (though better to use someone like DigitalOcean, Vultr or Latitude who offer databases too).
But then who do we blame when things are down? If we manage our own infrastructure we have to stay late to fix it when it breaks instead of saying “sorry, Microsoft, nothing we can do” and magically our clients accepting that…
Lol
And more importantly nobody loses any reputation except AWS/Azure/Google.
The real reason is that outages are not your fault. It's the new version of "nobody ever got fired for buying IBM" - later it became MS, and now it's any big cloud provider.
On the merits though, I agree, haven’t had any serious issues with Hetzner.
[1] https://azure.microsoft.com/en-us/products/frontdoor
[2] https://learn.microsoft.com/en-us/azure/frontdoor/front-door...
It's CDN and FrontDoor at least.
And it's very clear from these updates that they're more focused on the portal than the product, their updates haven't even mentioned fixing it yet, just moving off of it, as if it's some third party service that's down.
Unsubstantiated idea: the support contract likely says there is a window between each reporting step, and the status page is the last step, the one in the legal documents, giving them several more hours before the clauses trigger.
Oh, that'll be why Scan & Go was down yesterday evening. I thought it was another instance of an iOS 26 update breaking their crappy code.
[0] https://corporate.asda.com/newsroom/2025/22/09/asda-announce...
1. Mandatory
2. "Voluntary"
3. Voluntary
And I suspect that very little of what the NSA does falls into category 3. As Sen Chuck Schumer put it "you take on the intelligence community, they have six ways from Sunday at getting back at you"
https://microsoft.com/deviceloginus
Seems like they migrated the non-Gov login but not the Gov one. C'mon Microsoft, I've got a deadline in a few days.
Best of luck to the teams responding to this incident.
Azure Portal Access Issues
Starting at approximately 16:00 UTC, we began experiencing DNS issues resulting in availability degradation of some services. Customers may experience issues accessing the Azure Portal. We have taken action that is expected to address the portal access issues here shortly. We are actively investigating the underlying issue and additional mitigation actions. More information will be provided within 60 minutes or sooner.
This message was last updated at 16:35 UTC on 29 October 2025
----
Azure Portal Access Issues
We are investigating an issue with the Azure Portal where customers may be experiencing issues accessing the portal. More information will be provided shortly.
This message was last updated at 16:18 UTC on 29 October 2025
-- From the Azure status page
When you find an honest vendor, cherish them. They are rare, and they work hard to earn and keep your confidence.
So if we look at these companies' bottom lines, all those big wigs are actually doing something right. Sales and lobbying capacity is way more effective than reliability or good engineering (at least in the short term).
You know nobody is migrating off of AWS or Azure because of these.
That's certainly not the right conclusion.
I guess the GCP is next.
There's no way to tell, and after about 30 minutes, the release process on VS Code Marketplace failed with a cryptic message: "Repository signing for extension file failed.". And there's no way to restart/resume it.
I have been having issues with GitHub and the winget tool for updates throughout the day as well. I imagine things are pulling from the same locations on Azure for some of the software I needed to update (NPM dependencies, and some .NET tooling).
Much of Xbox is behind that too.
"We’re investigating an issue impacting Azure Front Door services. Customers may experience intermittent request failures or latency. Updates will be provided shortly."
This mom’s son was asking Tesla’s Grok AI chatbot about soccer. It told him to send nude pics, she says
xAI, the company that developed Grok, responds to CBC: 'Legacy Media Lies'
It'd be interesting to understand the cause here. Pretty big impact on services we use.
(couldn't resist adding it. i acknowledge this comment adds no value to the discussion)
For example when I try to log into our payroll provider Brightpay, it sends me here:
https://bpuk1prod1environment.blob.core.windows.net/host-pro...
Edit: As of 9:19 AM Pacific time, I'm now getting successful A responses but they can take several seconds. The web server at that address is not responding.
Microsoft CDN
There, that's it. You're selling it to (hopefully) technical people
But seriously I thought it would be the console, not a CDN.
This message was last updated at 16:35 UTC on 29 October 2025”
And so is Microsoft: http://www.microsoft.com/
The actual stuff I was working on (App Insights, Function App) that was still open was operational.
Doesn't seem to be too bad of an outage unless you were relying on Azure Front Door.
>What is required to be able to use MyGet? ... MyGet runs its operations from the Microsoft Azure in the West Europe region, near Amsterdam, the Netherlands.
We had to bypass the Frontdoor
There's a lot of outages this month!
Any guess on what's causing it?
In hindsight, I guess the foresight of some organizations to go multi-cloud was correct after all.
It's not easy though.
I'm curious—at what point did you decide the overhead was worth it? Was it after experiencing an outage, or did you architect for it from day one?
As someone launching a product soon (more on the builder/product side than infra-engineer), I keep wrestling with this. The pragmatist in me says "start simple, prove the concept, then layer in resilience." But then you see events like this week and think "what if this happens during launch?"
How did you handle the operational complexity? Did you need dedicated DevOps folks, or are there patterns/tools that made it manageable for a smaller team?
I would recommend focusing on multi-region within a single CSP instead (both for workloads AND your tooling), which covers the vast majority of incidents and lays some of the architectural foundation for multi-cloud down the road. Develop failover plans for each service in your architecture (eg. planned/tested runbooks to migrate to Traffic Manager in the event AFD goes down)
Also choose your provider wisely. We experience 3-5x the number of service-impacting incidents on Azure that we do on AWS. I'm sure others have different experiences, but I would never personally start a company on Azure. AWS has its own issues, of course, but reliability has not been a major one (relatively speaking) over the past 10 years. Last week's incident with DynamoDB in us-east-1 had zero impact on our AWS workloads in other regions.
I also got a weird notification in VS2022 that my license key was upgraded to Enterprise, but we did not purchase anything.
That said, I don't hear about GCP outages all that often. I do think AWS might be leading in outages, but that's a gut feeling, I didn't look up numbers.
This isn't GCP's fault, but the outage ended up taking down Cloudflare too, so in total impact I think that takes the cake.
Few customers....few voices to complain as well.
Institutional knowledge matters. Just has to be the right institution is all.
That is a pass.
To be clear, they should get criticism. They should be held liable for any damage they cause.
But that they remain the biggest cloud offering out there isn't something you'd expect to change from a few outages that, by most all evidence, potential replacements have, as well? More, a lot of the outages potential replacements have are often more global in nature.
I thought one of the major selling points of the big cloud providers was that they were more reliable than running your own stuff (by which I mean anything from a VPS to multiple data centres, depending on your scale). Compared to those alternatives they seem to be less reliable in practice!
The solution is to have a multi-region, or even multi-cloud setup, but then bang goes the "they do all the work for you" argument (which i doubt anyway).
You are further asserting that these outages prove they are not still more reliable than home spun. Is that the case? More than a few people aren't ready for a single hard drive to crash on the stuff they are doing.
if that's true then it's a sign that Azure's control / data plane separation is doing its job! at least for now
This is not the first or second time this has happened; multiple hyperscalers have failed one by one.
FD and CDN are global resources and are experiencing issues. Probably some other global resources as well.
Hate to say it, but DNS is looking like it's still the undisputed champ.
(Coder is currently at the top of the experiment list. Any other suggestions?)
    HTTPSConnectionPool(host='schemas.xmlsoap.org', port=443): Max retries exceeded with url: /soap/encoding/ (Caused by SSLError(CertificateError("hostname 'schemas.xmlsoap.org' doesn't match '*.azureedge.net'")))
160k+ results on GitHub: https://github.com/search?q=http%3A%2F%2Fschemas.xmlsoap.org...
> An inadvertent tenant configuration change within Azure Front Door (AFD) triggered a widespread service disruption affecting both Microsoft services and customer applications dependent on AFD for global content delivery. The change introduced an invalid or inconsistent configuration state that caused a significant number of AFD nodes to fail to load properly, leading to increased latencies, timeouts, and connection errors for downstream services.
> As unhealthy nodes dropped out of the global pool, traffic distribution across healthy nodes became imbalanced, amplifying the impact and causing intermittent availability even for regions that were partially healthy. We immediately blocked all further configuration changes to prevent additional propagation of the faulty state and began deploying a ‘last known good’ configuration across the global fleet. Recovery required reloading configurations across a large number of nodes and rebalancing traffic gradually to avoid overload conditions as nodes returned to service. This deliberate, phased recovery was necessary to stabilize the system while restoring scale and ensuring no recurrence of the issue.
> The trigger was traced to a faulty tenant configuration deployment process. Our protection mechanisms, to validate and block any erroneous deployments, failed due to a software defect which allowed the deployment to bypass safety validations. Safeguards have since been reviewed and additional validation and rollback controls have been immediately implemented to prevent similar issues in the future.
So, so far they're saying it's a combination of bad config + their config-validator had a bug. Would love more details.
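The 'last known good' piece is a pattern worth copying at any scale: only promote a config to LKG after it has actually served traffic healthily, so there is always a proven rollback target. A rough sketch, assuming hypothetical apply/health-check hooks:

    # Hypothetical last-known-good (LKG) config handling: the LKG snapshot only
    # advances after the new config has proven itself in production.
    import copy, time

    class ConfigManager:
        def __init__(self, initial_config, apply, healthy):
            self.lkg = copy.deepcopy(initial_config)
            self.apply = apply        # pushes a config to the fleet
            self.healthy = healthy    # returns True if the fleet looks good

        def deploy(self, new_config, bake_seconds=600):
            self.apply(new_config)
            time.sleep(bake_seconds)                   # let it serve real traffic
            if self.healthy():
                self.lkg = copy.deepcopy(new_config)   # promote to last known good
                return True
            self.apply(self.lkg)                       # otherwise revert
            return False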
edit: it worked once, then died again. So I guess - some resolvers, or FD servers may be working!
Edit: nope looks like there's actually a spike on GCP as well
Funnily enough, AI has been training on its own data as generated by users writing AI conversations back to the internet - there's a feedback loop at play.
QNBQ-5W8
In other words, people reporting outages at AWS are probably having trouble with microsoft-run DNS services or caching proxies. It's not that the issues aren't there, it's that the internet is full of intermingled complexity. Just that amount of organic false-positives can make it look like an unrelated major service is impacted.
I noticed that winget is also down eg.
  winget upgrade fabric
  Failed in attempting to update the source: winget
  An unexpected error occurred while executing the command:
  InternetOpenUrl() failed.
  0x80072ee7 : unknown error
There is no way it’s DNS
It was DNS
I can at least login to Azure. But several MS sites are down.
How can one of the richest companies in the world not offer a better service?
Better service costs money.
Except that it is not!
Interesting times...
What terrible advice.
But what if I don't want AI brought to me?
Although judging by the available transports it will likely be colonized by nazis.
    > “Microsoft is being recognized and rewarded at levels never seen before,” Nadella wrote. “And yet, at the same time, we’ve undergone layoffs. This is the enigma of success in an industry that has no franchise value.”
     
    > Nadella explained the disconnect between thriving financials and layoffs by stating that “progress isn’t linear” and that it is “sometimes dissonant, and always demanding.”
    > These decisions are among the most difficult we have to make. They affect people we’ve worked alongside, learned from, and shared countless moments with—our colleagues, teammates, and friends.
Unless that's a euphemism for "vibe coding", no.
> We have confirmed that an inadvertent configuration change as the trigger event for this issue.
Save the speculation for Reddit. HN is better than that.