For "remote" model training there is NCCL + Deepspeed/FSDP/etc. For remote inferencing there are solutions like Triton Inference Server[0] that can do very high-performance hosting of any model for inference. For LLMs specifically there are nearly countless implementations.
That said, the ability to use this for testing is interesting, but I wonder about GPU contention, and as others have noted, the performance of such a solution will be terrible even with a relatively high-speed interconnect (100/400 Gb Ethernet, etc.).
NCCL has been optimized to support DMA directly between network interfaces and GPUs, which is of course considerably faster than solutions like this. Triton can also make use of shared memory, mmap, NCCL, MPI, etc., which is one of the many tricks it uses for very performant inference, even across multiple chassis over another network layer.
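For reference, the NCCL path usually shows up through torch.distributed. A minimal sketch, assuming a node with at least two GPUs and a torchrun launch:

    # run with: torchrun --nproc_per_node=2 allreduce_demo.py
    import os
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")      # NCCL handles the GPU<->GPU / GPU<->NIC transport
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    x = torch.ones(1024, device="cuda") * dist.get_rank()
    dist.all_reduce(x, op=dist.ReduceOp.SUM)     # data moves GPU-to-GPU without bouncing through host Python
    print(f"rank {dist.get_rank()}: sum = {x[0].item()}")

    dist.destroy_process_group()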
This has been a struggle for data scientists for a while now. I haven't seen a good solution that lets a data scientist work locally but utilize GPUs remotely, without basically just developing remotely (through a VM or Jupyter) or submitting remote jobs (through SLURM or a library-specific Kubernetes integration). Scuda is an interesting step towards a better solution for utilizing remote GPUs easily across a wide range of libraries, not just PyTorch and TensorFlow.
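As a rough sketch of the "submit remote jobs through SLURM" workflow from a local Python session, using submitit; the partition name, GPU count, and the train function are assumptions for illustration:

    import submitit

    def train(lr: float) -> str:
        import torch
        return f"lr={lr}, cuda available: {torch.cuda.is_available()}"

    executor = submitit.AutoExecutor(folder="slurm_logs")
    executor.update_parameters(
        slurm_partition="gpu",   # assumed partition name
        gpus_per_node=1,
        timeout_min=60,
    )

    job = executor.submit(train, 3e-4)   # runs on the cluster, not locally
    print(job.result())                  # blocks until the remote job finishes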
I put "remote" in quotes because they're not direct equivalents but from a practical standpoint it's the alternate current approach.
> they all require the models in question to be designed for distributed training. They also require a lot of support in the libraries being used.
IME this has changed quite a bit. Between improved support for torch FSDP, DeepSpeed, and especially HF Accelerate wrapping each of them for transformer models, it's been a while since I've had to put much (if any) work in.
That said, if you're running random training scripts it likely won't "just work", but with larger models becoming more common I see a lot more torchrun, accelerate, deepspeed, etc. in READMEs and code.
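As a rough illustration of what that wrapping looks like in a training script (the model and data here are trivial stand-ins; launch with accelerate launch or torchrun so it picks up the FSDP/DeepSpeed config):

    import torch
    from accelerate import Accelerator

    accelerator = Accelerator()

    model = torch.nn.Linear(512, 512)   # stand-in for a real transformer
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    dataloader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(torch.randn(64, 512), torch.randn(64, 512)),
        batch_size=8,
    )

    # prepare() wraps/shards these according to the configured backend (DDP, FSDP, DeepSpeed, ...)
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    for inputs, targets in dataloader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        accelerator.backward(loss)      # instead of loss.backward()
        optimizer.step()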
> This has been a struggle for data scientists for a while now. I haven't seen a good solution to allow a data scientist to work locally, but utilize GPUs remotely, without basically just developing remotely (through a VM or Jupyter)
Remotely, as in over the internet? 400 Gb Ethernet is already too slow vs PCIe 5.0 x16 (forget SXM). A 10 Gb internet connection is 40x slower than that (plus latency impacts).
Remote development over the internet with scuda would be impossibly, uselessly slow.
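Back-of-the-envelope numbers behind that (theoretical peaks, not measured throughput, and latency makes the real picture worse):

    GiB = 2**30

    links_gbytes_per_s = {
        "PCIe 5.0 x16":   64.0,   # ~64 GB/s per direction
        "400 GbE":        50.0,   # 400 Gb/s / 8
        "100 GbE":        12.5,
        "10 Gb internet":  1.25,
    }

    payload_gib = 16  # e.g. shipping ~16 GiB of weights/activations
    for name, gbps in links_gbytes_per_s.items():
        seconds = payload_gib * GiB / (gbps * 1e9)
        print(f"{name:>15}: {seconds:5.1f} s for {payload_gib} GiB")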
Working remotely when you have to use a desktop environment is painful, no matter the technology. The best I've come up with is using tmux/vim and Sunshine/Moonlight, but even still I'd rather just have access to everything locally.
I was once working on CXL [2] and memory ownership tracking in the Linux kernel and wanted to play with Nvidia GPUs, but then I hit a wall when I realised that a lot of the functionality runs on the GSP or in the firmware blob, with very little to no documentation. I ended up generally not liking Nvidia's system software stack and gave up on the project. The UVM subsystem in the open kernel driver is a bit of an exception, but a lot of the control path is still handled and controlled from closed-source CUDA libraries in userspace.
tldr; it's very hard to do systems hacking with Nvidia GPUs.
[1] https://www.kernel.org/doc/html/v5.0/vm/hmm.html
[2] https://en.wikipedia.org/wiki/Compute_Express_Link
I'd check out the AMD side, since you can at least have a fully open-source GPU stack to play with, and they make a modicum of effort to document their GPUs.
Last time I checked, the GPUs did offer some kind of memory isolation, but only for their datacenter cards, not consumer ones.
But yes, for more than a decade now, even with consumer cards, separate user processes have had separate hardware-enforced contexts. This is as true for consumer cards as it is for datacenter cards. It's core to how something like WebGL works without exposing everything else being rendered on your desktop to the public Internet. There have been bugs, but per-process hardware isolation with a GPU-local MMU has been table stakes for a modern GPU for nearly twenty years.
What datacenter GPUs expose in addition to that is multiple virtual GPUs, sort of like SR-IOV, where a single GPU can be exposed to multiple guest kernels running in virtual machines.
We really should just have enough RDMA support in the kernel to let that work.
A Plan 9-like model, where the GPU is mostly just exposed as a standard file, would massively cut into GPU performance.
As a user I find having something you can self-host really neat, but what I really want is something more like
https://github.com/city96/ComfyUI_NetDist + OP's project mashed together.
Say I'm almost able to execute a workflow that would normally require ~16GB of VRAM. I have an Nvidia 3060 12GB running headless with PRIME, executing the workflow via the CLI.
Right now, I'd probably just have to run the workflow in a Paperspace (or any other cloud compute) container, or borrow the power of a local Apple M1 when using the second repository I mentioned.
I wish I had something that could lend me extra resources and temporarily act as either the host GPU or a secondary one, depending on the memory needed, and only when I need it (if that makes sense).
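Something like the sketch below, where run_remotely() is the hypothetical piece that scuda / ComfyUI_NetDist-style tooling would have to provide: check local free VRAM first and only borrow a remote GPU when the workflow won't fit.

    import torch

    def run_workflow_locally():
        ...  # execute the ComfyUI workflow on the local 12GB card

    def run_remotely():
        ...  # hypothetical: ship the workflow to a borrowed/rented GPU

    def run(workflow_vram_gib: float):
        free_bytes, _total = torch.cuda.mem_get_info()   # free/total VRAM on the current device
        if free_bytes / 2**30 >= workflow_vram_gib:
            run_workflow_locally()
        else:
            run_remotely()

    run(workflow_vram_gib=16)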
I guess that depends on what you mean by "too slow". What card is the lower-tier GPU? An Nvidia Tesla? I've always been under the assumption that when running two cards in parallel, the faster card will almost always slow down to the speed of the slower card, though the only reference I have is from using Nvidia SLI with two 8800s almost a decade ago.
I could also be completely and utterly wrong, would love to hear from anyone in the field of GPU architecture or around it for some clarification though :)
How is it different?