GCP’s architecture seems clearly better to me, especially if you are looking to be global.
Every organization I’ve ever witnessed eventually ends up with some kind of struggle with AWS’ insane organizations and accounts nightmare.
GCP’s use of folders makes way more sense.
GCP having global VPCs is also potentially a huge benefit if you want your users to hit servers that are physically close to them. On AWS you have to architect your own solution with Global Accelerator, which becomes even more insane if you need to cross accounts. And you’ll probably have to do that eventually because of the aforementioned insanity of AWS account/organization best practices.
Know how you find all the permissions a single user in GCP has? You have to make 9+ API calls, then filter/merge all the results. They finally added a web tool to try and "discover" the permissions for a user... you sit there and watch it spin while it madly calls backend APIs to try to figure it out. Permissions for a single user can be assigned to users, groups, orgs, projects, folders, resources, (and more I forget), and there's inheritance to make it more complex. It can take all day to track down every single place the permissions could be set for a single user in a single hierarchical organization, or where something is blocking some permission. The complexity increases as you have more GCP projects, folders, orgs. But, of course, if you don't do all this, GCP will fight you every step of the way.
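To make the inheritance problem concrete, here is a toy sketch of what "filter/merge all the results" amounts to. It does not call any real GCP API; the hierarchy, bindings, and group data are all hypothetical, and real IAM has more moving parts (deny policies, conditions, resource-level bindings) that this ignores.

```python
from collections import defaultdict

# Hypothetical resource hierarchy: child -> parent (None marks the root).
PARENT = {
    "org/acme": None,
    "folder/eng": "org/acme",
    "project/web": "folder/eng",
}

# Hypothetical role bindings: resource -> role -> set of members.
BINDINGS = {
    "org/acme": {"roles/viewer": {"group:everyone"}},
    "folder/eng": {"roles/editor": {"group:eng"}},
    "project/web": {"roles/owner": {"user:alice"}},
}

# Hypothetical group membership: user -> groups the user belongs to.
GROUPS = {"user:alice": {"group:eng", "group:everyone"}}


def effective_roles(user, resource):
    """Return {role: [resources where it was granted]} for a user on a resource."""
    identities = {user} | GROUPS.get(user, set())
    found = defaultdict(list)
    node = resource
    while node is not None:  # walk up: project -> folder -> org
        for role, members in BINDINGS.get(node, {}).items():
            if identities & members:  # granted directly or through a group
                found[role].append(node)
        node = PARENT[node]
    return dict(found)
```

Even in this tiny model, answering "what can alice do on project/web?" means visiting every ancestor and every group she is in, which is roughly why the real thing needs so many calls.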
Compare that to AWS, where you just click a user, and you see what's assigned to it. They engineered it specifically so it wouldn't be a pain in the ass.
> Every organization I’ve ever witnessed eventually ends up with some kind of struggle with AWS’ insane organizations and accounts nightmare.
This was an issue in the early days, but it's well solved now with newer integrations/services. Follow their Well Architected Framework (https://docs.aws.amazon.com/wellarchitected/latest/framework...), ask customer support for advice, implement it. I'm not exaggerating when I say this is the best description of the best information systems engineering practice in the world, and it's achievable by startups. It just takes a long time to read. If you want to become an excellent systems engineer/engineering manager/CTO/etc, this is your bible. (Note: you have to read the entire thing, especially the appendixes; you can't skim it like StackOverflow)
What are these struggles? The product I work on uses AWS and we have ~5 accounts (I hear there used to be more, TBF), but nowadays all the infrastructure is on one of them and the others are for some niche stuff (tech support?). I could see how going overboard with many accounts could be an issue, but I don't really see issues with having everything on one account.
Just before they announced that, I was working on creating org accounts specifically to contain S3 buckets and then permitting the primary app to use those accounts just for their bucket allocation.
AWS themselves recommend an account per developer, IIRC.
It's as you say, some policy or limitation might require lots of accounts and lots of accounts can be pretty challenging to manage.
Architecturally I'd go with GCP in a heartbeat. BigQuery was also one of the biggest wins in my previous role. It completely changed our business for almost everyone, vs Redshift, which cost us a lot of money to learn that it sucked.
You could say I'm biased as I work at Google (but not on any of this), but for me it was definitely the other way around: I joined Google in part because of the experience of using GCP and migrating AWS workloads to it.
Undersea cable failures are probably more likely than a google core networking failure.
In AWS a lot of "global" things are actually just hosted in us-east-1.
If the author had a Ko-Fi they would've just earned $50 USD from me.
I've been thinking of making the leap away from JIRA and I concur on RDS, Terraform for IAC, and FaaS whenever possible. Google support is non-existent and I only recommend GC for pure compute. I hear good things about Bigtable, but I've never used it in production.
I disagree on Slack usage aside from the postmortem automation. Slack is just gonna' be messy no matter what policies are put in place.
Other options are email of course, and what, Teams for instant messages?
Organized by topics, must be threaded, and default to asynchronous communications. You can still opt in to notifications, and history is well organized and preserved.
It’s funny how we get an instant messaging platform and derive best practices that try to emulate a previous technology.
Btw, email is pretty instant.
See the other point in the article about discouraging one on one private messages and encouraging public discussion. That is the main reason.
* half a day later or days later if you do true async, but that's fine.
But aren’t mailing lists and distribution groups pretty ubiquitous?
I've been working across time zones via IM and email since ... ICQ.
I'm probably biased by that but I consider email the place for questions lists and long statuses with request for comments, and for info that I want retained somewhere. While IM is a transient medium where you throw a quickie question or statement or whine every couple hours - and check what everyone else is whining about.
But clearly, that's cultural.
If you keep your eyes on the Linux kernel mailing list, you’ll see a lot of (on topic) short and informal messages flying in all directions.
If you keep your eyes on the emails from big tech CEOs that sometimes appear in court documents, you’ll see that the way they use email is the same way that I’d use Slack or an instant messenger.
That's likely because it's the tool they have available. We have IM tools that connect us to the people we need (inside the company), making email the only place for long-form content, which means it's only perceived as being for long-form content.
But when people have to use something federated more often, it does seem like email is actually used this way.
I, too, prefer McDonald's cheeseburgers to ground glass mixed with rusty nails. It's not so much that I love Terraform (spelled OpenTofu) as that it's far and away the least bad tool I've used in the space.
Terragrunt is the only sane way to deploy Terraform/OpenTofu in a professional environment though.
I'm trying to make the decision for where to go with my home lab, and while Pulumi and Cue look neat, cdk8s seems so predictable & has such clear structure & form to it.
That said, the L1/L2/L3 distinction can be a brute to deal with. There's significant hidden complexity there.
Homelab CDKs: https://github.com/shepherdjerred/monorepo/tree/main/package...
Script I wrote to generate types from Helm charts: https://github.com/shepherdjerred/monorepo/tree/main/package...
Infrastructure needs to be consistent, intuitive and reproducible. Imperative languages are too unconstrained. In particular, they allow you to write code whose output is unpredictable (for example, it'd be easy to write code that creates a resource based on the current time of day...).
With infrastructure, you want predictability and reproducibility. You want to focus more on writing _what_ your infra should look like, less _how_ to get there.
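A rough illustration of the time-of-day worry, using plain dicts as a stand-in for real resources. The function names, the resource shape, and the 9-to-17 rule are all made up for the example; no actual IaC library is involved.

```python
import datetime


def desired_state(env, replicas):
    """Pure function: the same inputs always describe the same infrastructure."""
    return [{"name": f"{env}-web-{i}", "size": "small"} for i in range(replicas)]


def time_dependent_state(env, now=None):
    """Anti-pattern: the result depends on the wall clock, so two runs can disagree."""
    now = now or datetime.datetime.now()
    replicas = 4 if 9 <= now.hour < 17 else 1  # hypothetical business-hours rule
    return [{"name": f"{env}-web-{i}", "size": "small"} for i in range(replicas)]
```

Calling `desired_state("prod", 2)` twice gives identical output, so a plan made today matches a plan made tonight. The second function can produce a different plan depending on when the pipeline happens to run, which is exactly the kind of thing a general-purpose language lets you do by accident.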
I have written both TF and then CDKTF extensively (!), and I am absolutely never going back to raw TF. TF vs CDKTF isn't declarative vs imperative, it's "anemic untyped slow feedback mess" vs "strong typesystem, expressive builtins and LSP". You can build things in CDKTF that are humanly intractable in raw TF and it requires far less discipline, not more, to keep it from becoming an unmaintainable mess. Having a typechecker for your providers is a "cannot unsee" experience. As is being able to use for loops and defining functions.
That being said, would I have preferred a CDKTF in Haskell, or a typed Nix dialect? Hell yes. CDKTF was awful, it was just the least bad thing around. Just like TF itself, in a way.
But I have few problems with HCL as a compilation target. Rich ecosystem and the abstractions seem sensible. Maybe that's Stockholm syndrome? Ironically, CDKTF has made me stop hating TF :)
Now that Hashicorp put the kibosh on CDKTF though, the question is: where next...
There are things I think Terraform could do to improve its declarative specs without violating the spirit. Yet, I still prefer it as-is to any imperative alternatives.
Is that an easy mistake to make and a hard one to recover from, in your experience?
The way you have to bend over backwards in Terraform just to instantiate a thing multiple times based on some data really annoys me..
Granted, I'm a programmer, have been for a long time, so using programming tools is a no brainer for me. If someone wants to manage infra but doesn't have programming skills, then learning the Terraform config language is a great idea. Just kidding, it's going to be just as confusing and obnoxious as learning the basic skills you need in python/js to get up and running with Pulumi.
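For what it's worth, this is the shape of "instantiate a thing multiple times based on some data" when you have an ordinary language available. It is only a sketch: `make_bucket` is a hypothetical stand-in for whatever resource constructor your tool provides, and the bucket list is invented.

```python
# Hypothetical input data: one entry per bucket we want to exist.
BUCKETS = [
    {"name": "logs", "versioned": False},
    {"name": "uploads", "versioned": True},
]


def make_bucket(env, spec):
    """Build a plain-dict description of one bucket (stand-in for a real constructor)."""
    return {
        "resource": "bucket",
        "name": f"{env}-{spec['name']}",
        "versioned": spec["versioned"],
        "tags": {"env": env},
    }


def buckets_for(env):
    """One resource per data entry: just a loop over the list."""
    return [make_bucket(env, spec) for spec in BUCKETS]
```

Adding a third bucket is a one-line change to the data, and the loop itself never needs touching.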
For my current startup I ended up not going a direction where I needed ansible. I've now got everything in helm charts and deployable to K8S clusters, and packaged with Dockerfiles. Not really missing ansible, but not exactly in love with K8S either. It works well enough I guess.
You ended up needing Terraform too for the infrastructure though. At that point why not just use Terraform?
That made me laugh. Yes I get that they probably didn't use all of these at the same time.
1: https://kubernetes.io/blog/2026/01/29/ingress-nginx-statemen...
This post was a great read.
Tangent to this, I've always found "best practices" to be a bit of a misnomer. In most cases in software and especially devops I have found it means "pay for this product that constrains the way that you do things so you don't shoot yourself in the foot". It's not really a "practice" if you're using a product that gives you one way to do something. That said my company uses a very similar tech stack and I would choose the same one if I was starting a company tomorrow, despite the fact that, as others have mentioned, it's a ton to keep in your head all at once.
past discussion: https://news.ycombinator.com/item?id=39313623
I've worked with hundreds of customers to integrate IdPs with our application and Google Workspace was by far the worst of the big players (Entra ID, Okta, Ping). It's extremely inflexible for even the most basic SAML configuration. Stay far, far away.
Knative on k8s works well for us. There are some oddities about it, but in general it does the job.
By the same token, it's more efficient to let an LLM operate all these tools (and more) than to force an LLM to keep all of that on its "mind", that is, context.
Just because they can run tools doesn't mean they run them reliably. Running tools is not the be-all and end-all of the problem.
Amdahl's law is still in play when it comes to agents orchestrating entire business processes on their own.
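For anyone who wants the formula: if a fraction p of the work can be sped up by a factor s, the overall speedup is 1 / ((1 - p) + p / s). A small sketch follows. Mapping "the part an agent can automate" onto p is my reading of the comment, not something it spells out.

```python
def amdahl_speedup(improved_fraction, factor):
    """Overall speedup when `improved_fraction` of the work gets `factor` times faster."""
    if not 0.0 <= improved_fraction <= 1.0:
        raise ValueError("improved_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    untouched = 1.0 - improved_fraction
    return 1.0 / (untouched + improved_fraction / factor)


def amdahl_limit(improved_fraction):
    """Best possible speedup, reached as the improved part becomes instantaneous."""
    untouched = 1.0 - improved_fraction
    return float("inf") if untouched == 0 else 1.0 / untouched
```

So if half of a business process still needs a human in the loop, `amdahl_limit(0.5)` says you can never get better than 2x overall, no matter how fast the agent handles the other half.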
I've been working mostly at startups most of my career (for Sydney Australia values of "start up" which mostly means "small and new or new-ish business using technology", not the Silicon Valley VC money powered moonshot crapshoot meaning). Two of those roles (including the one I'm in now) have been longer than a decade.
And it's pretty much true that almost all infrastructure (and architecture) decisions are things that 4-5 years later become regrets. Some standouts from 30 years:
I didn't choose Macromind/Macromedia Director in '94 but that was someone else's decision I regretted 5 years later.
I shouldn't have chosen to run a web business on ISP web hosting and Perl4 in '95 (yay /cgi-bin).
I shouldn't have chosen globally colocated desktop pc linux machines and MySQL in '98/99 (although I got a lot of work trips and airline miles out of that).
I shouldn't have chosen Python2 in 2007, or even worse Angular2 in 2011.
I _probably_ shouldn't have chosen Arch Linux (and a custom/bastardised Pacman repo) for a hardware startup in 2013.
I didn't choose Groovy on Grails in 2014 but I regretted being recruited into being responsible for it by 2018 or so.
I shouldn't have chosen Java/MySQL in 2019 (or at least I should have kept a much tighter leash on the backend team and their enterprise architecture astronaut).
The other perspective on all those decisions, though: each of them allowed a business to do the things they needed to take money off customers (I know I know, that's not the VC startup way...). Although I regretted each of those later, even in retrospect I think I made decent pragmatic choices at the time. And at this stage of my career I've become happy enough knowing that every decision is probably going to have regrets over a 4 or 5 year timeframe, but that most projects never last long enough for you to get there: either the business doesn't pan out and closes the project down, or a major ground-up rewrite happens for reasons often unrelated to 5 year old infrastructure or architecture choices.
I also reached a lot of similar decisions and challenges, even where we differ (ECS vs EKS) I completely understand your conclusions.
Curious to hear more about Renovate vs Dependabot. Is it complicated to debug _why_ it's making a choice to upgrade from A to B? Working on a tool to do app-specific breaking change analysis so winning trust and being transparent about what is happening is top of mind.
When were you using quay.io? In the pre-CoreOS years, CoreOS years (2014-2018), or the Red Hat years?
Hire a DBA ASAP. They also need to rein in the laziness of all the other developers when designing and interacting with the DB. The horrors a dev can create in the DB can take years to undo.
modal.com???
I wish luck to the imo fools chasing the "you may not need it" logic. The vacuum that attitude creates in its wake demands many many many complex & gnarly home-cooked solutions.
Can you? Sure, absolutely! But you are doing that on your own, glueing it all together every step of the way. There's no other glue layer anywhere remotely as integrative, that can universally bind to so much. The value is astronomical, imho.
Everything in the article is an excellent point, but the other big point is that schema changes become extremely difficult because you have unknown applications possibly relying on that schema.
Also, at a certain point the database becomes absolutely massive and you will need teams of DBAs for its care and feeding.
Con: it’s sadly likely that no one on your staff knows a damn thing about how an RDBMS works, and is seemingly incapable of reading documentation, so you’re gonna run into footguns faster. To be fair, this will also happen with isolated DBs, and will then be much more effort to rein in.
Just FYI, the article is two years old.
RDS is a very quick way to expand your bill, followed by EC2, followed by S3. RDS for production is great, but you should avoid the bizarre HN trope of "Postgres for everything" with RDS. It makes your database unnecessarily larger, which expands your bill. Use it strategically and your cost will remain low while also being very stable and easy to manage. You may still end up DIYing backups. Aurora Serverless v2 is another useful way to reduce the bill. If you want to do custom fancy SQL/host/volume things, RDS Custom may enable it.
I'm starting to think Elasticache is a code smell. I see teams adopt it when they literally don't know why they're using it. Similar to the "Postgres for everything" people, they're often wasteful, causing extra cost and introducing more complexity for no benefit. If you decide to use Elasticache, Valkey Serverless is the cheapest option.
Always use ECR in AWS. Even if you have some enterprise artifact manager with container support... run your prod container pulls with ECR. Do not enable container scanning, it just increases your bill, nobody ever looks at the scan results.
I no longer endorse using GitHub Actions except for non-business-critical stuff. I was bullish early on with their Actions ecosystem, but the whole thing is a mess now, from the UX to the docs to the features and stability. I use it for my OSS projects but that's it. Most managed CI/CD sucks. Use Drone.io for free if you're small, use WoodpeckerCI otherwise.
Buying an IP block is a complicated and fraught thing (it may not seem like it, but eventually it is). Buy reserved IPs from AWS, keep them as long as you want, you never have to deal with strange outages from an RIR not getting the correct contact updated in the correct amount of time or some foolishness.
He mentions K8s, and it really is useful, but as a staging and dev environment. For production you run into the risk of insane complexity exploding, and the constant death march of upgrades and compatibility issues from the 12 month EOL; I would not recommend even managed K8s for prod. But for staging/dev, it's fantastic. Give your devs their own namespace (or virtual cluster, ideally) and they can go hog wild deploying infrastructure and testing apps in a protected private environment. You can spin up and down things much easier than typical AWS infra (no need for terraform, just use Helm) with less risk, and with horizontal autoscaling that means it's easier to save money. Compare to the difficulty of least-privilege in AWS IAM to allow experiments; you're constantly risking blowing up real infra.
Helm is a perfectly acceptable way to quickly install K8s components, big libraries of apps out there on https://artifacthub.io/. A big advantage is its atomic rollouts which makes simple deploy/rollback a breeze. But ExternalSecrets is one of the most over-complicated annoying garbage projects I've ever dealt with. It's useful, but I will fight hard to avoid it in future. There are multiple ways to use it with arcane syntax, yet it actually lacks some useful functionality. I spent way too much time trying to get it to do some basic things, and troubleshooting it is difficult. Beware.
I don't see a lot of architectural advice, which is strange. You should start your startup out using all the AWS well-architected framework that could possibly apply to your current startup. That means things like 1) multiple AWS accounts (the more the better) with a management account & security account, 2) identity center SSO, no IAM users for humans, 3) reserved CIDRs for VPCs, 4) transit gateway between accounts, 5) hard-split between stage & prod, 6) openvpn or wireguard proxy on each VPC to get into private networks, 7) tagging and naming standards and everything you build gets the tags, 8) put in management account policies and cloudtrail to enforce limitations on all the accounts, to do things like add default protections and auditing. If you're thinking "well my startup doesn't need that" - only if your startup dies will you not need it, and it will be an absolute nightmare to do it later (ever changed the wheels on a moving bus before?). And if you plan on working for more than one startup in your life, doing it once early on means it's easier the second time. Finally if you think "well that will take too long!", we have AI now, just ask it to do the thing and it'll do it for you.
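Two of the items in that list (reserved CIDRs and tagging standards) are cheap to encode early. Here is a small sketch using only the standard library. The supernet, account names, and required tag keys are assumptions picked for illustration, not AWS recommendations.

```python
import ipaddress

REQUIRED_TAGS = {"owner", "env", "service"}  # hypothetical tagging standard


def allocate_cidrs(supernet, accounts, new_prefix=16):
    """Give each account its own non-overlapping block carved from one supernet."""
    blocks = ipaddress.ip_network(supernet).subnets(new_prefix=new_prefix)
    plan = {}
    for account in accounts:
        try:
            plan[account] = next(blocks)
        except StopIteration:
            raise ValueError(
                f"{supernet} is too small for {len(accounts)} accounts"
            ) from None
    return plan


def missing_tags(tags):
    """Return the required tag keys that a resource's tags do not include."""
    return sorted(REQUIRED_TAGS - set(tags))
```

With `allocate_cidrs("10.0.0.0/8", ["management", "security", "stage", "prod"])` each account gets its own /16, so connecting VPCs across accounts later does not run into overlapping ranges. `missing_tags` is the sort of check you could run in CI before anything gets created.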
God I wish that were true. Unfortunately, ECR scanning is often cheaper and easier to start consuming than buying $giant_enterprise_scanner_du_jour, and plenty of people consider free/OSS scanners insufficient.
Stupid self inflicted problems to be sure, but far from “nobody uses ECR scanning”.
For the same amount of memory they should cost _nearly_ the same. Run the numbers. They're not significantly different services. Aside from this, you do NOT pay for IPv4 when using Lambda but you do on EC2, so Lambda is almost always less expensive.
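If you do want to run the numbers, the arithmetic is roughly this. Every price is a parameter because I am not going to quote rates from memory; plug in the current ones for your region. The flat hourly fee is where something like a public IPv4 charge would go.

```python
def monthly_cost(memory_gb, hours, price_per_gb_hour, fixed_hourly_fee=0.0):
    """Memory-hours times unit price, plus any flat per-hour fee (e.g. an IPv4 charge)."""
    return hours * (memory_gb * price_per_gb_hour + fixed_hourly_fee)


def cheaper_option(costs):
    """Return the name of the lowest-cost entry in a {name: cost} mapping."""
    return min(costs, key=costs.get)
```

One caveat on the inputs: `hours` will usually differ between the two. Lambda bills for time spent executing, while an instance bills for the time it is running, so the comparison depends heavily on how busy the workload actually is.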