Interesting, the main things I've read for SMB Direct are from Microsoft, for their Windows Server implementation.

But Apple's recent introduction of RDMA over Thunderbolt got my hopes up that I could use it for storage: not only for moving LLMs, but also for video files, where editing from one Mac to another (or over Ethernet, if that's supported) could be much faster, with lower latency.

Is there some open source product that can leverage this, or does this just assume you have to use Microsoft stuff?
What is the performance impact of soft RDMA over SMB this way, vs the traditional SMB on the IP stack?
I was sure SMB3 was Super Mario Bros 3.
What, um... Are... Are people using samba to sync model weights between cluster nodes...?
Why not? SMB is no slouch. Microsoft has taken network storage performance very seriously for a long time now. Back in the day, Microsoft and others (NetApp, for instance) worked hard to extend and optimize SMB and deliver efficient, high-throughput file servers. I haven't kept up with the state of the art recently, but I know there have been long stretches where SMB consistently led the field in benchmark testing. It also doesn't hurt that Microsoft has a lot of pull with hardware manufacturers to see their native protocols remain tier 1 concerns at all times.
I think a lot of people have a hard time differentiating the underlying systems from what they _see_, and use that to bash MS products.

I heard that it was perhaps recently fixed, but copying many small files was multiple times faster via something like Total Commander than via the built-in File Explorer (large files go equally fast).

People seeing how slow Explorer was to copy would probably presume that it was a lower level Windows issue if they had a predisposed bias against Microsoft/Windows.

My theory about Explorer's sluggishness is that they added visual feedback to the copying process at some point, and for whatever reason that visual feedback is synchronous/slow (perhaps capped at the framerate, thus 60 files a second), whilst TC updates in the background and just renders status periodically, so the copying thread(s) can run at the full speed the OS is capable of under the hood.
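
To make the theory concrete, here's a toy Python model. It is only a sketch of my guess, not how Explorer or Total Commander actually work: the function names are made up, the "copy" is a stand-in that does no I/O, and the 60 fps cap is the assumption from above.

```python
import threading
import time

FRAME_INTERVAL_S = 1 / 60  # assumed redraw cap (the 60 fps guess)


def copy_one_file(_index):
    """Stand-in for copying one small file; a real copy would do I/O here."""
    sum(range(100))


def copy_sync_feedback(n_files):
    """Redraw after every file, blocking the copy loop for one frame each time."""
    renders = 0
    for i in range(n_files):
        copy_one_file(i)
        time.sleep(FRAME_INTERVAL_S)  # simulated wait for the synchronous redraw
        renders += 1
    return renders


def copy_background_feedback(n_files):
    """Copy at full speed in a worker thread; the UI only polls once per frame."""
    done = 0
    finished = threading.Event()

    def worker():
        nonlocal done
        for i in range(n_files):
            copy_one_file(i)
            done = i + 1
        finished.set()

    thread = threading.Thread(target=worker)
    thread.start()
    renders = 0
    # wait() returns False on timeout, so each pass is one periodic redraw.
    while not finished.wait(FRAME_INTERVAL_S):
        renders += 1  # a real UI would draw "done of n_files" here
    thread.join()
    return renders, done
```

In the first version the total time can never drop below `n_files / 60` seconds no matter how fast the disk or network is; in the second, the number of redraws depends on how long the copy takes, not on how many files there are.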

Plenty of other workloads benefit from high-performance file access. With network speeds and disk speeds getting higher whilst single-core perf has more or less plateaued in comparison, it's more and more important to support data paths where kernel switching won't become a bottleneck.
Dunno, but I have used Samba to load model weights from my NAS.