This is very interesting, but many of the motivations listed are far better served by alternative approaches.

For "remote" model training there is NCCL + Deepspeed/FSDP/etc. For remote inferencing there are solutions like Triton Inference Server[0] that can do very high-performance hosting of any model for inference. For LLMs specifically there are nearly countless implementations.

That said, the ability to use this for testing is interesting, but I wonder about GPU contention, and as others have noted, the performance of such a solution will be terrible even with a relatively high-speed interconnect (100/400 Gb Ethernet, etc.).

NCCL has been optimized to support DMA directly between network interfaces and GPUs, which is of course considerably faster than solutions like this. Triton can also make use of shared memory, mmap, NCCL, MPI, etc., which is one of the many tricks it uses for very performant inference - even across multiple chassis over another network layer.
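For reference, this is roughly how most libraries end up driving NCCL via torch.distributed - just a sketch, assuming it's launched with torchrun so the rank/world-size environment variables are set:

    # Sketch of the usual NCCL path through torch.distributed; run with
    # `torchrun --nproc_per_node=<gpus> this_script.py` so RANK/WORLD_SIZE are set.
    import os
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")   # NCCL handles the GPU<->NIC transfers
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor; NCCL all-reduces it across GPUs/nodes,
    # using GPUDirect RDMA when the NICs and fabric support it.
    t = torch.ones(1024, device="cuda") * local_rank
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()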

[0] - https://github.com/triton-inference-server/server

I don't think NCCL + Deepspeed/FSDP are really an alternative to Scuda, as they all require the models in question to be designed for distributed training. They also require a lot of support in the libraries being used.

This has been a struggle for data scientists for a while now. I haven't seen a good solution that lets a data scientist work locally but utilize GPUs remotely, without basically just developing remotely (through a VM or Jupyter) or submitting remote jobs (through SLURM or a library-specific Kubernetes integration). Scuda is an interesting step towards a better solution for utilizing remote GPUs easily across a wide range of libraries, not just PyTorch and TensorFlow.

> I don't think NCCL + Deepspeed/FSDP are really an alternative to Scuda

I put "remote" in quotes because they're not direct equivalents but from a practical standpoint it's the alternate current approach.

> they all require the models in question to be designed for distributed training. They also require a lot of support in the libraries being used.

IME this has changed quite a bit. Between improved support for torch FSDP, Deepspeed, and especially HF Accelerate wrapping both for transformer models, it's been a while since I've had to put much (if any) work in.

That said, if you're running random training scripts it likely won't "just work", but with larger models becoming more common I see a lot more torchrun, accelerate, deepspeed, etc. in READMEs and code.
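For a concrete sense of it, the Accelerate pattern is roughly this (the model/optimizer/dataloader here are throwaway stand-ins); the same loop then runs under DDP, FSDP, or DeepSpeed depending on how you launch it:

    # Sketch of the Accelerate pattern: run with `accelerate launch train.py`
    # after `accelerate config`; the same loop covers single-GPU, DDP, FSDP,
    # or DeepSpeed depending on that config. Model/data here are stand-ins.
    import torch
    from accelerate import Accelerator

    accelerator = Accelerator()

    model = torch.nn.Linear(128, 2)                  # stand-in model
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    dataloader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(
            torch.randn(256, 128), torch.randint(0, 2, (256,))
        ),
        batch_size=32,
    )

    # prepare() wraps everything for whatever backend the launcher configured.
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    for x, y in dataloader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        accelerator.backward(loss)                   # replaces loss.backward()
        optimizer.step()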

> This has been a struggle for data scientists for a while now. I haven't seen a good solution to allow a data scientist to work locally, but utilize GPUs remotely, without basically just developing remotely (through a VM or Jupyter)

Remotely, as in over the internet? 400 Gb Ethernet is already too slow compared to PCIe 5.0 x16 (forget SXM). A 10 Gb internet connection is 40x slower still (plus latency impacts).
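Rough peak numbers behind that, ignoring protocol overhead and latency:

    # Back-of-the-envelope bandwidth comparison (peak figures, one direction,
    # ignoring encoding/protocol overhead and latency).
    pcie5_x16_gbps = 64 * 8      # PCIe 5.0 x16 ~ 64 GB/s  -> ~512 Gb/s
    eth_400_gbps   = 400         # 400 Gb Ethernet
    eth_10_gbps    = 10          # 10 Gb internet link

    print(pcie5_x16_gbps / eth_400_gbps)  # ~1.3x - 400GbE already trails PCIe5 x16
    print(eth_400_gbps / eth_10_gbps)     # 40x   - 10Gb is 40x slower than 400GbE
    print(pcie5_x16_gbps / eth_10_gbps)   # ~51x  - and ~50x slower than PCIe5 x16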

Remote development over the internet with scuda would be impossibly, uselessly slow.

Why is working locally important?
Working locally still matters, and this is coming from someone who normally works in tmux/nvim. For vision and 3D ML work, being able to quickly open a visualizer window is imperative to understanding what's going on. For Gaussian Splatting, point cloud work, SLAM, etc., you have to have access to a desktop environment to see visualizations; they very rarely work well remotely (even if they have some Jupyter support).

Working remotely when you have to use a desktop environment is painful, no matter the technology. The best I've come up with is tmux/vim plus Sunshine/Moonlight, but even then I'd rather just have access to everything locally.

This appears to only support CUDA on nvidia. I'm curious why they didn't just expose /dev/nvidia-uvm as a socket and forward that over the network instead of hooking hundreds of functions (maybe it's not that simple and I just don't know).
You can't mmap a socket, and mmap is core to how /dev/nvidia-uvm works.
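You can see that directly; on Linux, mmap on a socket fd just fails (with ENODEV, if I remember right):

    # Quick demonstration: a plain socket fd has no mmap handler, so mmap()
    # fails (ENODEV on Linux) - which is why a naive "forward the device file
    # over a socket" approach breaks down for /dev/nvidia-uvm.
    import mmap
    import socket

    a, b = socket.socketpair()
    try:
        mmap.mmap(a.fileno(), 4096)
    except OSError as e:
        print("mmap on a socket fd failed:", e)
    finally:
        a.close()
        b.close()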
Well, it's not impossible. It's just software after all. You can mmap a remote device file, but you need OS support to do the magical paging for you, probably some sort of page ownership tracking protocol like in HMM [1], but outside a coherence domain.

I was once working on CXL [2] and memory ownership tracking in the Linux kernel and wanted to play with Nvidia GPUs, but I hit a wall when I realised that a lot of the functionality runs on the GSP or in the firmware blob, with very little to no documentation. I ended up generally not liking Nvidia's system software stack and gave up the project. The UVM subsystem in the open kernel driver is a bit of an exception, but a lot of the control path is still handled and controlled from closed-source CUDA libraries in userspace.

tl;dr: it's very hard to do systems hacking with Nvidia GPUs.

[1] https://www.kernel.org/doc/html/v5.0/vm/hmm.html
[2] https://en.wikipedia.org/wiki/Compute_Express_Link

Yeah, the Nvidia stuff isn't really made to be hacked on.

I'd check out the AMD side, since you can at least have a fully open-source GPU stack to play with, and they make a modicum of effort to document their GPUs.

This is the first time I've heard about /dev/nvidia-uvm. Is there any documentation on how the Nvidia API works? In particular, how strong is the multi-tenancy story? Can two users use one GPU and expect reasonable security?

Last time I checked, the GPU did offer some kind of memory isolation, but only for their datacenter cards, not consumer cards.

There aren't a lot of docs on how it works. It used to live entirely in the closed-source driver; now it's mainly a thin bridge to the closed-source firmware blob.

But yes, for more than a decade now, even with consumer cards, separate user processes have had separate hardware-enforced contexts. This is as true for consumer cards as it is for datacenter cards. It's core to how something like WebGL works without exposing everything else being rendered on your desktop to the public internet. There have been bugs, but per-process hardware isolation with a GPU-local MMU has been table stakes for a modern GPU for nearly twenty years.

What datacenter GPUs expose in addition to that is multiple virtual GPUs, sort of like SR-IOV, where a single GPU can be exposed to multiple guest kernels running in virtual machines.

Which seems weird to me: if we're going to have device files, it's super annoying that they actually don't really act like files.

We really should just have enough RDMA in the kernel to let that work.

At its core, this device file is responsible for managing a GPU-local address space and sharing memory securely with that address space, so there's a place to write command buffers and data that the GPU can see. It doesn't really make sense without a heavy memory-mapping component.

A Plan 9-like model, where it's mostly just a standard file, would massively cut into GPU performance.

I agree with you that making RDMA a more accessible commodity technology is very important for "the future of compute". Properly configuring something like RoCEv2 or InfiniBand is expensive and difficult. These technologies need to be made more robust in order to run on commodity networks.
Granted, it requires additional support from your NICs/switches, but it's probably straightforward to remote nvidia-uvm with an RDMA server.
What you can do is mmap a file that's in a FUSE filesystem and relay reads/writes over the network to a server that holds that mmap.
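Something like this minimal fusepy sketch (pip install fusepy) - the "remote" side is stubbed with a local buffer where a real implementation would forward reads/writes over the network:

    # Minimal fusepy sketch of the "relay reads/writes" idea. The remote side
    # is stubbed with an in-memory buffer; a real version would forward read()
    # and write() (and so the page faults behind a client's mmap) over the
    # network to the machine that actually holds the mapping.
    import errno
    import os
    from fuse import FUSE, Operations, FuseOSError

    class RemoteRelay(Operations):
        def __init__(self):
            self.remote = bytearray(1 << 20)   # stand-in for the remote mapping

        def getattr(self, path, fh=None):
            if path == "/":
                return {"st_mode": 0o040755, "st_nlink": 2}
            if path == "/uvm":
                return {"st_mode": 0o100666, "st_size": len(self.remote),
                        "st_nlink": 1}
            raise FuseOSError(errno.ENOENT)

        def readdir(self, path, fh):
            return [".", "..", "uvm"]

        def read(self, path, size, offset, fh):
            # In a real relay this would be an RPC to the remote GPU host.
            return bytes(self.remote[offset:offset + size])

        def write(self, path, data, offset, fh):
            self.remote[offset:offset + len(data)] = data
            return len(data)

    if __name__ == "__main__":
        os.makedirs("/tmp/relay", exist_ok=True)
        FUSE(RemoteRelay(), "/tmp/relay", foreground=True)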
More like "virtual cuda only gpu" over IP.
Well scuda has cuda in the name
You might have a problem using CUDA as part of the name, since Nvidia has it trademarked. Maybe you can switch to Scuba if they give you trouble, sounds like a good name for the tool.
Buda may be a better name.
We need to do for CUDA what was done for Jell-o and Kleenex.
Would this let an Nvidia card be accessible on Apple Silicon over TB4 for training on an eGPU caddy? I'd happily relegate my desktop to HTPC/gaming duties.
As this mentions some prior art but not rCUDA (https://en.m.wikipedia.org/wiki/RCUDA), I'm a bit confused about what makes scuda different.
I've updated the README! rCUDA is indeed inspiration, in fact it inspired scuda's name too :)
This looks more like CUDA over IP or am I missing something?
Reminds me of this, from a couple months ago.

https://news.ycombinator.com/item?id=41203475

I was going to post a reference to the same thing! I tested it, and I'm not sure whether it was just being hugged to death when I used it, but the network performance was incredibly poor.

As a user, I find having something you can self-host really neat, but what I really want is something more like

https://github.com/city96/ComfyUI_NetDist + OP's project mashed together.

Say I'm almost able to execute a workflow that would normally require ~16 GB of VRAM. I have an Nvidia 3060 12 GB running headless with PRIME, executing the workflow via the CLI.

Right now, I'd probably just have to run the workflow in a Paperspace (or any other cloud compute) container, or borrow the power of a local Apple M1 when using the second repository I mentioned.

I wish I had something that could lend me extra resources and temporarily act as either the host GPU or a secondary GPU depending on the memory needed, only when I needed it (if that makes sense).

I have a laptop with a serviceable GPU but only 16 GB of RAM, and another with a low-tier GPU but 32 GB of RAM. I'm wondering: will it be too slow to use the latter as the control plane and delegate inference to the former using something like ComfyUI to run text-to-image models?
I referenced this already, but definitely check out https://github.com/city96/ComfyUI_NetDist?tab=readme-ov-file...

I guess that depends on what you mean by "too slow". What card is the low-tier GPU? An Nvidia Tesla? I've always been under the assumption that when running two cards in parallel the faster card will almost always slow down to the speed of the card with the most memory, though the only reference I have is from using Nvidia SLI with two 8800s almost a decade ago.

I could also be completely and utterly wrong, would love to hear from anyone in the field of GPU architecture or around it for some clarification though :)

Should have given more info, indeed. One notebook has a 3070 Ti, which has 8 GB of VRAM, and the other has an MX150, which I guess has 2 GB of dedicated VRAM.
Everyone hates nvidia but treats ATI as an afterthought. Another completely useless tool to throw on the pile.
> Everyone hates nvidia but treats ATI as an afterthought.

Hehe, do you mean AMD?

What year is it?
ATI? afterthought, indeed
Curious if this could be simplified to provide NVENC over IP?
It would be nice to have a description added.
I've heard NVSwitch is used for GPU-to-GPU interconnect over a network.

How is it different?

Orders of magnitude slower.
Isn't this GPU-to-CPU? And really slow. And only CUDA. And over IP. And implemented in software. I think it's really very different.
nice