I Cannot SSH into My Server Anymore (and That's Fine)

71
40
TheWiggles
4 days ago
soap.coffee

crawshaw
·
2 hours ago
·
[ - ]

The idea that an "observability stack" is going to replace shell access on a server does not resonate with me at all. The metrics I monitor with prometheus and grafana are useful, vital even, but they are always fighting the last war. What I need are tools for when the unknown happens.

The tool that manages all my tools is the shell. It is where I attach a debugger, it is where I install iotop and use it for the first time. It is where I cat out mysterious /proc and /sys values to discover exotic things about cgroups I only learned about 5 minutes prior in obscure system documentation. Take it away and you are left with a server that is resilient against things you have seen before but lacks the tools to deal with the future.

·
14 minutes ago
·
[ - ]

ValdikSS
·
1 hour ago
·
[ - ]

>It is where I attach a debugger, it is where I install iotop and use it for the first time. It is where I cat out mysterious /proc and /sys values to discover exotic things about cgroups I only learned about 5 minutes prior in obscure system documentation.

It is, SSH is indeed the tool for that, but that's because until recently we did not have better tools and interfaces.

Once you try newer tools, you don't want to go back.

Here's the example of my fairly recent debug session:

    - Network is really slow on the home server, no idea why
    - Try to just reboot it, no changes
    - Run kernel perf, check the flame graph
    - Kernel spends A LOT of time in nf_* (netfilter functions, iptables)
    - Check iptables rules
    - sshguard has banned 13000 IP addresses in its table
    - Each network packet travels through all the rules
    - Fix: clean the rules/skip the table for established connections/add timeouts

You don't need debugging facilities for many issues. You need observability and tracing.

Instead of debugging the issue for tens of minutes at least, I just used observability tool which showed me the path in 2 minutes.

IgorPartola
·
51 minutes ago
·
[ - ]

See I would not reboot the server first before figuring out what is happening. You lose a lot of info by doing that and the worst thing that can happen is that the problem goes away for a little bit.

galleywest200
·
3 minutes ago
·
[ - ]

My job as a DevOps engineer is to ensure customer uptime. If rebooting is the fastest, we do that. Figuring out the why is the primary developers’ jobs.

This is also a good reason to log everything all the time in a human readable way. You can get services up and then triage at your own pace after.

My job may be different than other’s as I work at an ITSP and we serve business phone lines. When business phones do not work it is immediately clear to our customers. We have to get them back up not just for their business but for the ability for them to dial 911.

gerdesj
·
31 minutes ago
·
[ - ]

To be fair, turning it off and on again is unreasonably effective.

I recently diagnosed and fixed an issue with Veeam backups that suddenly stopped working part way through the usual window and stopped working from that point on. This particular setup has three sites (prod, my home and DR), and five backup proxies. Anyway, I read logs and Googled somewhat. I rebooted the backup server - no joy, even though it looked like the issue was there. I restarted the proxies and things started working again.

The error was basically: there are no available proxies, even though they were all available (but not working but not giving off "not working" vibes).

I could bother with trying to look for what went wrong but life is too short. This is the first time that pattern has happened to me (I'll note it down mentally and it was logged in our incident log).

So, OK, I'll agree that a reboot should not generally be the first option. Whilst sciencing it or nerding harder is the purist approach, often a cheeky reboot gets the job done. However, do be aware that a Windows box will often decide to install updates if you are not careful 8)

ValdikSS
·
30 minutes ago
·
[ - ]

I've debugged so many issues in my life that sometimes I'd prefer things to just work, and if reboot helps to at least postpone the problem, I'd choose that :D

butvacuum
·
40 minutes ago
·
[ - ]

most failstates arent worth preserving in a SMB environment. In larger environments or ones equipped for it a snapshot can be taken before rebooting- should the issue repeat.

Once is chance, twice is coincidence, three times makes a pattern.

gerdesj
·
48 minutes ago
·
[ - ]

I fail to understand how your approach is different to your parent.

perf is a shell tool. iptables is a shell tool. sshguard is a log reader and ultimately you will use the CLI to take action.

If you are advocating newer tools, look into nft - iptables is sooo last decade 8) I've used the lot: ipfw, ipchains, iptables and nftables. You might also try fail2ban - it is still worthwhile even in the age of the massively distributed botnet, and covers more than just ssh.

I also recommend a VPN and not exposing ssh to the wild.

Finally, 13,000 address in an ipset is nothing particularly special these days. I hope sshguard is making a properly optimised ipset table and that you running appropriate hardware.

My home router is a pfSense jobbie running on a rather elderly APU4 based box and it has over 200,000 IPs in its pfBlocker-NG IP block tables and about 150,000 records in its DNS tables.

ValdikSS
·
39 minutes ago
·
[ - ]

>perf is a shell tool. iptables is a shell tool. sshguard is a log reader and ultimately you will use the CLI to take action.

Well yes, and to be honest in this case I did that all over SSH: run `perf`, generate flame graph, copy the .svg to the PC over SFTP, open it in the file viewer.

What I really wanted is a web interface which will just show me EVERYTHING it knows about the system in a form of charts, graphs, so I can just skim through it and check if everything allright visually, without using the shell and each individual command.

Take a look at Netflix presentation, especially on their web interface screenshots: https://archives.kernel-recipes.org/wp-content/uploads/2025/...

>look into nft - iptables is sooo last decade

It doesn't matter in this context: iptables is using new netfilter (I'm not using iptables-legacy), and this exact scenario is 100% possible with native netfilter nft.

>Finally, 13,000 address in an ipset is nothing particularly special these days

Oh, the other day I had just 70 `iptables -m set --match-set` rules, and did you know how apparently inefficient source/destination address hashing algorithm for the set match is?! It was debugged with perf as well, but I wish I just had it as a dashboard picture from the start.

I'm talking about ~4Gbit/s sudden limitation on a 10Gbit link.

gerdesj
·
17 minutes ago
·
[ - ]

>Oh, the other day I had just 70 `iptables -m set --match-set` rules, and did you know how apparently inefficient source/destination address hashing algorithm for the set match is?! It was debugged with perf as well!

>I'm talking about ~4Gbit/s sudden limitation on a 10Gbit link.

I think you need to look into things if 70 IPs in a table are causing issues, such that a 10Gb link ends up at four Gb/s. I presume that if you remove the ipset, that 10Gb/s is restored?

Testing throughput and latency is also quite a challenge - how do you do it?

crawshaw
·
1 hour ago
·
[ - ]

How did you use tracing to check the current state of a machine’s iptables rules?

ValdikSS
·
31 minutes ago
·
[ - ]

In this case I used `perf` utility, but only because the server does not have a proper observability tool.

Take a look at this Netflix presentation, especially on the screenshots of their web interface tool: https://archives.kernel-recipes.org/wp-content/uploads/2025/...

crawshaw
·
18 minutes ago
·
[ - ]

That is a command line tool run over ssh. If you have invented a new way to run command line tools, that’s great (and very possible, writing a service that can fork+exec and map stdio), but it is the equivalent to using ssh. You cannot run commands using traces.

reactordev
·
1 hour ago
·
[ - ]

Or… you build a container, that runs exactly what you specify. You print your logs, traces, metrics home so you can capture those stack traces and error messages so you can fix it and make another container to deploy.

You’ll never attach a debugger in production. Not going to happen. Shell into what? Your container died when it errored out and was restarted as a fresh state. Any “Sherlock Holmes” work would be met with a clean room. We have 10,000 nodes in the cluster - which one are you going to ssh into to find your container to attach a shell to it to somehow attach a debugger?

toast0
·
1 hour ago
·
[ - ]

> We have 10,000 nodes in the cluster - which one are you going to ssh into to find your container to attach a shell to it to somehow attach a debugger?

You would connect to any of the nodes having the problem.

I've worked both ways; IMHO, it's a lot faster to get to understanding in systems where you can inspect and change the system as it runs than in systems where you have to iterate through adding logs and trying to reproduce somewhere else where you can use interactive tools.

My work environment changed from an Erlang system where you can inspect and change almost everything at runtime to a Rust system in containers where I can't change anything and can hardly inspect the system. It's so much harder.

IgorPartola
·
49 minutes ago
·
[ - ]

Say you are debugging a memory leak in your own code that only shows up in production. How do you propose to do that without direct access to a production container that is exhibiting the problem, especially if you want to start doing things like strace?

ValdikSS
·
1 hour ago
·
[ - ]

>What I need are tools for when the unknown happens.

There are tools which show what happens per process/thread and inside the kernel. Profiling and tracing.

Check Yandex's Perforator, Google Perfetto. Netflix also has one, forgot the name.

gear54rus
·
2 hours ago
·
[ - ]

Agreed, this sounds like some complicated ass-backwards way to do what k8s already does. If it's too big for you, just use k3s or k0s and you will still benefit from the absolutely massive ecosystem.

But instead we go with multiple moving parts all configured independently? CoreOS, Terraform and a dependence on Vultr thing. Lol.

Never in a million years I would think it's a good idea to disable SSH access. Like why? Keys and non-standard port already bring China login attempts to like 0 a year.

stryan
·
3 hours ago
·
[ - ]

Quadlets are a real game changer for this type of small-to-medium scale declarative hosting. I've been pushing for them at work over ugly `docker compose in systemd units` service management and moved my home lab over to using them for everything. The latter is a similar setup to OP except with OpenSUSE MicroOS instead of Fedora CoreOS and I'm not so brave as to destroy and rebuild my VPS's whenever I make a change :) . On the other hand, MicroOS (and I'm assuming FCOS) reboots automatically to apply updates with rollback if needed so combined with podman auto-update you can basically just spin up a box, drop the files on, and let it take care of itself (at least until a container update requires manual intervention).

A few things in the article I think might help the author:

1. Podman 4 and newer (which FCOS should definitely have) uses netavark for networking. A lot of older tutorials and articles were written back when Podman used CNI for it's networking and didn't have DNS enabled unless you specifically installed it. I think the default `podman` network is still setup with DNS disabled by default. Either way, you don't have to use a pod if you don't want to anymore, you can just attach both containers to the same network and it should Just Work.

2. You can run the generator manually with "/usr/lib/systemd/system-generators/podman-system-generator --dry-run" to check Quadlet validity and output. Should be faster than daemon-reload'ing all the time or scanning the logs.

And as a bit of self-promotion: for anyone who wants to use Quadlets like this but doesn't want to rebuild their server whenever they make a change, I'm created a tool called Materia[0] that can install, remove, template, and update Quadlets and other files from a Git repository.

[0] https://github.com/stryan/materia

plagiarist
·
1 hour ago
·
[ - ]

Do you know if it is possible to run a quadlet as an ephemeral systemd-sysuser? That would solve all my current problems.

stryan
·
29 minutes ago
·
[ - ]

Not sure I'm following; you want to create a an emphemeral system account and run a root-less Podman container as it? I don't think that's something supported out of the box but you may be able to jury rig something together by putting the quadlets directly in `/etc/containers/systemd/users/` instead of putting them in a home directory (since I'm assuming this is a systemd-sysuser created account and thus without a home).

skeptic_ai
·
24 minutes ago
·
[ - ]

You can try docker compose with Watch tower. Then you just deploy a new branch: dev, prod. On server side counterparty you fetch updates on git, if anybody change, it will run docker compose, which will build your image and put it live.

Worked well for me a few years.

Problems: when you have issues you need to look into pertainer logs to see why it failed.

That’s one big problem, if prefer something like Jenkins to build it instead.

And if you have more groups of docker compose, you just put another sh script to do this piling on the main infrastructure git repo, which on git change will spawn new git watchers

gucci-on-fleek
·
2 hours ago
·
[ - ]

Fedora IoT [0] is a nice intermediate solution. Despite its name, it's really good for servers, since it's essentially just the Fedora Atomic Desktops (Silverblue/Kinoite) without any of the desktop stuff. It gets you atomic updates, a container-centric workflow, and easy rollbacks; but it's otherwise a regular server, so you can install RPMs, ssh into it, create user accounts, and similar. This is what I do for my personal server, and I'm really happy with it.

[0]: https://fedoraproject.org/iot/

npodbielski
·
58 minutes ago
·
[ - ]

That is looks interesting. An idea to configure server on run via symtemd would probably mean that migrating from machine to machine would be very easy. It always meant for me at least two days of carefull planning, copying od files testing and fixes because I always forgot about some obscure config changes I did somewhere, like adding DNS entry somewhere or disabling default SMTP on debian.

starttoaster
·
1 hour ago
·
[ - ]

So it's AWS Fargate with a different name? That's cool for cloud hosted stuff. But if you're on prem, or manage your own VPS' then you need SSH access.

npodbielski
·
52 minutes ago
·
[ - ]

I bought last year mobo with IPMI so in theory I could have forgot about SSH and just inspect startup logs if it would fail to start.

Though I must say I am not brave enough and my family uses it so I prefer to have jest one broken service instead of enire machine.

But it is possible.

denkmoon
·
18 minutes ago
·
[ - ]

Except you've replaced something good with something worse. IPMI really isn't an improvement over having SSH to the system. It definitely has more security holes.

amluto
·
3 hours ago
·
[ - ]

> I’ve later learned that restarting a container that is part of a pod will have the (to me, unexpected) side-effect to restart all the other containers of that pod.

Anyone know why this is? Or, for that matter, why Kubernetes seems to work like this too?

I have an application for which the natural solution would be to create a pod and then, as needed, create and destroy containers within the pod. (Why? Because I have some network resources that don’t really virtualize, so they can live in one network namespace. No bridges.)

But despite containerd and Podman and Kubernetes kind-of-sort-of supporting this, they don’t seem to actually want to work this way. Why not?

kace91
·
2 hours ago
·
[ - ]

>Anyone know why this is? Or, for that matter, why Kubernetes seems to work like this too?

Pods are specifically not wanted to be treated as vms, but as a single application/deployment units.

Among other things, if a container goes down you don’t know if it corrupted shared state (leaving sockets open or whatever). So you don’t know if the pod is healthy after restart. Also reviving it might not necessarily work, if the original startup process relied on some boot order. So to guarantee a return to healthy you need to restart the whole thing.

gucci-on-fleek
·
2 hours ago
·
[ - ]

> Anyone know why this is?

In Podman, a pod is essentially just a single container; each "container" within a pod is just a separate rootfs. So from that perspective, it makes sense, since you can't really restart half of a container. (But I think that it might be possible to restart individual containers within a pod; but if any container within a pod fails, then I think that the whole pod will automatically restart)

> Why? Because I have some network resources that don’t really virtualize, so they can live in one network namespace.

You can run separate containers in the same network namespace with the "--network" option [0]. You can either start one container with its own automatic netns and then join the other containers to it with "--network=container:<name>", or you can manually create a new netns with "podman network create <name>" and then join all the containers to it with "--network=<name>".

[0]: https://docs.podman.io/en/latest/markdown/podman-run.1.html#...

amluto
·
1 hour ago
·
[ - ]

> You can run separate containers in the same network namespace with the "--network" option [0].

Oh, right, thanks. I think I did notice that last time I dug into this. But:

> or you can manually create a new netns with "podman network create <name>" and then join all the containers to it with "--network=<name>".

I don’t think this has the desired effect at all. And the docs for podman network connect don’t mention pods at all, which is odd. In general, I have not been very impressed by podman.

Incidentally, apptainer seems to have a more or less first class ability to join an existing netns, and it supports CNI. Maybe I should give it a try.

stryan
·
2 hours ago
·
[ - ]

Yeah I was a little confused at this line; as far as I can tell you can restart containers that are a part of a Podman pod without restarting the whole pod just fine. I just verified this on one of my MicroOS boxes running Podman v5.7.1 .

Podman was changing pretty fast for a while so it could be an older version thing, though I'd assume FCOS is on Podman 5 by now.

esseph
·
1 hour ago
·
[ - ]

The general idea is you want a single application per pod, unless you need a sidecar service to live in the same pod of each instance of your app.

You are normally running several instances of your frontend so that it can crash without impacting the user experience, or so it can get deployed to in a rolling manner, etc.

·
1 hour ago
·
[ - ]

dorfsmay
·
1 hour ago
·
[ - ]

Perfect timing for me, I've just been spending my side-project time in the last few weeks on building the smallest possible VMs with different glibc distros exactly for this, running podman containers, and comparing results.

lawrencegripper
·
3 hours ago
·
[ - ]

I’ve been down a similar journey with Fedora Core OS and have loved it.

The predictability and drop in toil is so nice.

https://blog.gripdev.xyz/2024/03/16/in-search-of-a-zero-toil...

yigalirani
·
1 hour ago
·
[ - ]

real programmers can ssh to their servers

libHacker
·
1 hour ago
·
[ - ]

It's true. There's no reason to disable ssh. If you need it, it's there. If not, just don't use it.

andrewmcwatters
·
2 hours ago
·
[ - ]

I concede that this is the state of the art in secure deployments, but I’m from a different age where people remoted into colocated hardware, or at least managed their VPSs without destroying them every update.

As a result, I think developers are forgetting filesystem cleanliness because if you end up destroying an entire instance, well it’s clean isn’t it?

It also results in people not knowing how to do basic sysadmin work, because everything becomes devops.

The bigger problem I have with this, is the logical conclusion is to use “distroless” operating system images with vmlinuz, an init, and the minimal set of binaries and filesystem structure you need for your specific deployment, and rarely do I see anyone actually doing this.

Instead, people are using a hodgepodge of containers with significant management overhead, that actually just sit on like Ubuntu or something. Maybe alpine. Or whatever Amazon distribution is used on ec2 now. Or of course, like in this article, Fedora CoreOS.

One day, I will work with people who have a network issue and don’t know how to look up ports in use. Maybe that’s already the case, and I don’t know it.

bitwize
·
1 hour ago
·
[ - ]

What you describe is from the "pets" era of server deployment, and we are now deep into the "cattle" era. Train yourself on destroying and redeploying, and building observability into the stack from the outset, rather than managing a server through ssh. Every shop you go to professionally is going to work like this. Eventually, Linux desktops will work like this also, especially with all the work going into systemd to support movable home directories, immutable OS images with modular updates, and so forth.

andrewmcwatters
·
50 minutes ago
·
[ - ]

[dead]

irishcoffee
·
1 hour ago
·
[ - ]

> The bigger problem I have with this, is the logical conclusion is to use “distroless” operating system images with vmlinuz, an init, and the minimal set of binaries and filesystem structure you need for your specific deployment, and rarely do I see anyone actually doing this.

In the few jobs I’ve had over 20 years, this is common in the embedded space, usually using yocto. Really powerful, really obnoxious tool chain.