eBPF really is a superpower: it lets you do things that seem incomprehensible if you don't know about it.
I first confirmed that the NVMe drives were healthy according to SMART, then worked up the stack and used BCC tools to look at block I/O latency. Block I/O latency was quite low for the NVMe drives (microseconds) but hundreds of milliseconds for the loopback block devices.
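For context, the core of what those block I/O latency tools measure is small. Here's a rough, untested sketch of the idea behind BCC's biolatency (not the exact tool I ran): timestamp each request at issue, take the delta at completion, and bucket it into a log2 histogram. It assumes a reasonably recent kernel where the block_rq_issue / block_rq_complete raw tracepoints pass the struct request pointer as args[0]; map sizes are illustrative and the userland half that reads and prints the histogram is omitted.

    /* Kernel-side sketch; compile with clang -O2 -target bpf. */
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    char LICENSE[] SEC("license") = "GPL";

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 10240);
        __type(key, u64);    /* struct request pointer, used as an id */
        __type(value, u64);  /* issue timestamp in ns */
    } start SEC(".maps");

    struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 32);
        __type(key, u32);
        __type(value, u64);  /* count per log2(latency in us) bucket */
    } hist SEC(".maps");

    SEC("raw_tp/block_rq_issue")
    int handle_issue(struct bpf_raw_tracepoint_args *ctx)
    {
        u64 rq = ctx->args[0];
        u64 ts = bpf_ktime_get_ns();

        bpf_map_update_elem(&start, &rq, &ts, BPF_ANY);
        return 0;
    }

    SEC("raw_tp/block_rq_complete")
    int handle_complete(struct bpf_raw_tracepoint_args *ctx)
    {
        u64 rq = ctx->args[0];
        u64 *tsp = bpf_map_lookup_elem(&start, &rq);
        if (!tsp)
            return 0;

        u64 delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;
        bpf_map_delete_elem(&start, &rq);

        /* log2 bucketing: microsecond NVMe I/O and hundreds-of-ms
         * loopback I/O land in very different buckets. */
        u32 slot = 0;
        while (delta_us > 1 && slot < 31) {
            delta_us >>= 1;
            slot++;
        }
        u64 *cnt = bpf_map_lookup_elem(&hist, &slot);
        if (cnt)
            __sync_fetch_and_add(cnt, 1);
        return 0;
    }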
This led me to believe that something was wrong with the loopback devices rather than the underlying NVMe drives. I used cachestat/cachetop and found that the page cache miss rate was very high and that we were thrashing the page cache, constantly paging data in and out. From there I inspected the loopback devices with losetup and found that direct I/O was disabled and that the loopback device's sector size did not match the backing filesystem's block size.
I modified the loopback devices to use a sector size equal to the block size of the underlying filesystem and enabled direct I/O. Instantly, the majority of the page cache was freed, iowait went way down, and I/O throughput went way up.
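For reference, the fix itself boils down to two ioctls on the loop device; this is roughly what a modern losetup issues for its --sector-size and --direct-io options (I believe the ioctls exist since Linux 4.14 and 4.10 respectively). The 4096 below assumes the backing filesystem uses 4K blocks, so treat it as a sketch rather than a drop-in tool:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/loop.h>

    int main(int argc, char **argv)
    {
        const char *dev = argc > 1 ? argv[1] : "/dev/loop0";
        int fd = open(dev, O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        /* The loop device's logical sector size must match the backing
         * filesystem's block size, otherwise enabling direct I/O can
         * fail with EINVAL. */
        if (ioctl(fd, LOOP_SET_BLOCK_SIZE, (unsigned long)4096) < 0)
            perror("LOOP_SET_BLOCK_SIZE");

        /* 1 = open the backing file with O_DIRECT, skipping the second
         * trip through the page cache. */
        if (ioctl(fd, LOOP_SET_DIRECT_IO, (unsigned long)1) < 0)
            perror("LOOP_SET_DIRECT_IO");

        close(fd);
        return 0;
    }

In practice you don't need custom code for this; a current losetup exposes both knobs as --sector-size and --direct-io.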
Without BCC tools I would never have been able to figure this out.
Double caching loopback devices is quite the footgun.
Another interesting thing we hit: our version of losetup would happily fail to enable direct I/O but still give you a loopback device. This has since been fixed: https://github.com/util-linux/util-linux/commit/d53346ed082d...
https://github.com/containers/composefs https://github.com/project-machine/puzzlefs
And how did you know that tweaking the sector size to equal the underlying filesystem's block size would prevent double caching? Where can one get this sort of knowledge?
I knew that enabling direct I/O would most likely disable the double caching, because that is literally the point of enabling direct I/O on a loopback device. Initially I just tried enabling direct I/O on the loopback devices, but that failed with a cryptic "invalid argument" error. After some more research I found that, in some cases, direct I/O requires the loop device's sector size to match the filesystem's block size.
With Intel CET there should be a way to capture a shadow stack, which really just contains entry points, but I'm wondering whether that's going to be used...
But the sched_switch tracepoint is the hottest event; without stack sampling it's 200-500 ns per event (on my Xeon 63xx CPUs), depending on what data is collected. I use #ifdefs to compile in only the fields that are actually used (smaller thread_state struct, fewer branches and instructions to decode and cache). Surprisingly, collecting the kernel stack adds more overhead than the user stack (kstack takes an event from, say, 400 ns to 3200 ns, while ustack takes it to around 2800 ns per event).
I have done almost zero optimization (and I figure using libbpf/BTF/CO-RE will help too). But I'm ok with these numbers for most of my workloads of interest, and since eBPF programs are not cast in stone, I can reduce overhead further, for example by sampling stacks in the sched_switch probe only on every 10th occurrence or so (rough sketch below).
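Something like this is what I have in mind. It's an untested sketch: count every switch cheaply in a per-CPU counter and only pay for the stack walk on every Nth event. The map sizes, the 1-in-10 period, and the program name are illustrative, not what xcapture actually does today.

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    char LICENSE[] SEC("license") = "GPL";

    #define SAMPLE_PERIOD 10

    struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, u32);
        __type(value, u64);
    } switch_count SEC(".maps");

    struct {
        __uint(type, BPF_MAP_TYPE_STACK_TRACE);
        __uint(max_entries, 16384);
        __uint(key_size, sizeof(u32));
        __uint(value_size, 127 * sizeof(u64));
    } stacks SEC(".maps");

    SEC("tp/sched/sched_switch")
    int on_switch(struct trace_event_raw_sched_switch *ctx)
    {
        u32 zero = 0;
        u64 *cnt = bpf_map_lookup_elem(&switch_count, &zero);
        if (!cnt)
            return 0;

        /* Cheap path, taken on every context switch: just count. */
        (*cnt)++;
        if (*cnt % SAMPLE_PERIOD)
            return 0;

        /* Expensive path, taken on every 10th switch: collect the
         * kernel stack (add BPF_F_USER_STACK for the user stack). */
        bpf_get_stackid(ctx, &stacks, 0);
        return 0;
    }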
So in the worst case, this full-visibility approach might not be usable as always-on instrumentation for some workloads (like redis/memcached/mysql lookups doing 10M context switches/s on a big server), but even with such workloads, a temporary increase in instrumentation overhead might be acceptable when there are known recurring problems to troubleshoot.
Edit: We actually went into much more specific detail on eBPF/BCC in contrast to DTrace a few weeks after the 20th anniversary podcast.[1]
What I loved about DTrace was that once it was out, even in beta, it was pretty complete and just worked. All the DTrace ports that I've tried, including on Windows (!) a few years ago, were very limited or had some showstopper issues. I guess eBPF was like that too some years ago, but by now it's pretty sweet even for more regular consumers who don't keep track of its development.
Edit: Oh, I wasn't aware of the timeline; I may have some dates (years) wrong in my memory.
For sure. Different systems, different times.
>rather than one eclipsing the other.
It does seem that DTrace has been eclipsed though, at least in Linux (which runs the vast majority of the world's compute). Is there a reason to use DTrace over eBPF for tracing and observability in Linux?
>There are certainly many things that DTrace can do that eBPF/BCC cannot
This may be true, but that gap is closing. There are certainly many things that eBPF can do that DTrace cannot, like Cilium.
Back when I tried to build xcapture with DTrace, I could launch the script and use something like /pid$oracle::func:entry/ but IIRC the probe was attached only to the processes that already existed and not any new ones that were started after loading the DTrace probes. Maybe I should have used some lower level APIs or something - but eBPF on Linux automatically handles both existing and new processes.
Without knowing your particular case, DTrace does too - it'd certainly be tricky to use for debugging software that "instantly crashes on startup" if it couldn't do that. "execname" (not "pid") is where I'd look, or perhaps that part of the predicate is skippable; regardless, it should be possible.
Execname is a variable in DTrace, not a probe (?), so how would it help with automatically attaching to new PIDs? Now that I recall more details, there was no issue with the statically defined kernel "fbt" probes or "profile", but the userspace pid provider was where I hit this limitation.
You're correct, and I may have provided "a solution" to a misunderstanding of your problem - I don't think the "not matching new procs/pids" is inherent in DTrace, so indeed you might have run into an implementation issue (as it was 15 years ago). I misunderstood you as perhaps using a predicate matching a specific pid; my fault.
Unlike a kernel module, an eBPF program can only really read data, not modify kernel data structures, so it's nice for things like tracing kernel events.
The XDP subsystem is specifically designed to let you apply filters to network data before it reaches the network stack, but it still doesn't give you the same level of control or performance as DPDK, since the data still has to go through the kernel.
But libbpf with CO-RE will solve these issues, as I understand it, so as long as the kernel supports what you need, the CO-RE binary will work.
This raises another issue for me, though: it's easier (not easy, but easier) for enterprises to download and run a single Python file plus a single C source file (with <500 lines of code to review) than a compiled CO-RE binary. My long-term plan/hope is that I (we) get the Red Hats and AWSes of this world to just provide the eventual mature release as a standard package.
The whole CO-RE thing is about having a kernel-version-agnostic way of reading fields from kernel data structures.
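Concretely, it's the difference between baking in struct offsets from whatever headers you compiled against and emitting relocations that libbpf resolves against the running kernel's BTF at load time. A tiny illustrative example (hook point and field choice are arbitrary, just for the sketch):

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_core_read.h>

    char LICENSE[] SEC("license") = "GPL";

    SEC("tp/syscalls/sys_enter_openat")
    int show_core_read(void *ctx)
    {
        struct task_struct *task = (struct task_struct *)bpf_get_current_task();

        /* Walks task->real_parent->tgid using offsets filled in at load
         * time from the running kernel's BTF, so the same compiled object
         * keeps working even if struct task_struct's layout differs from
         * the build machine's headers. */
        int ppid = BPF_CORE_READ(task, real_parent, tgid);

        bpf_printk("openat, parent tgid = %d", ppid);
        return 0;
    }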
XDP reads the data in the normal NAPI kernel way, integrating with the IRQ system etc., which might or might not be desirable depending on your use case.
Then if you want to forward it to userland, you still need to write the data to a ring buffer, with your userland process polling it, at which point it's more akin to using io_uring.
It's mostly useful if you can write your entire logic in your eBPF program without going through userland, so it's nice for various tracing applications, filters or security checks, but that's about it as far as I can tell.
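For a sense of scale, the "filter before it reaches the stack" case really is small. A minimal sketch (the port number is made up; compile with clang -O2 -target bpf and attach with a libbpf loader or iproute2):

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/in.h>
    #include <linux/ip.h>
    #include <linux/udp.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    char LICENSE[] SEC("license") = "GPL";

    SEC("xdp")
    int drop_udp_9999(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        /* Every access must be bounds-checked or the verifier rejects it. */
        struct ethhdr *eth = data;
        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS;

        struct iphdr *ip = (void *)(eth + 1);
        if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_UDP)
            return XDP_PASS;
        if (ip->ihl != 5)   /* skip packets with IP options, for simplicity */
            return XDP_PASS;

        struct udphdr *udp = (void *)(ip + 1);
        if ((void *)(udp + 1) > data_end)
            return XDP_PASS;

        /* Dropped here, before the packet ever hits the normal network stack. */
        if (udp->dest == bpf_htons(9999))
            return XDP_DROP;

        return XDP_PASS;
    }

If you do need the payload in userland, that's where the ring buffer plus a polling process, as described above, comes in.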
"bpftrace recipes: 5 real problems solved" - Trent Lloyd (Everything Open 2023) https://www.youtube.com/watch?v=ZDTfcrp9pJI