1. Programs built against MLX -> Can take advantage of CUDA-enabled chips
but not:
2. CUDA programs -> Can now run on Apple Silicon.
Because #2 would be a copyright violation (specifically with respect to NVidia's famous moat).
Is this correct?
It means that a developer can use their relatively low-powered Apple device (with UMA) to develop for deployment on Nvidia's relatively high-powered systems.
That's nice to have for a range of reasons.
Apple should do a similar thing for AMD.
https://www.theverge.com/2021/4/5/22367851/google-oracle-sup...
https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_....
I appreciate that English is your second language after your Hungarian mother tongue. My comment reflects on the low- and high-powered compute of the Apple vs. Nvidia hardware.
Unfortunately when that case went to the Supreme Court they basically just said "yeah for this case it's fair use, but we're not going to comment on whether APIs in general are copyrightable"...
CUDA is an ecosystem of programming languages, libraries and developer tools.
Composed of compilers for C, C++, Fortran, and Python JIT DSLs provided by NVidia, plus several others targeting either PTX or NVVM IR.
The libraries, which you correctly point out.
And then the IDE integrations, the GPU debugger that is on par with Visual Studio-style debugging, the profiler, ...
Hence why everyone that focuses on copying only CUDA C or CUDA C++, without everything else that makes CUDA relevant, keeps failing.
As far as portability, people who care about that already have the option of using higher-level APIs that have CUDA backend among several others. The main reason why you'd want to do CUDA directly is to squeeze that last bit of performance out of the hardware, but that is also precisely the area where deviation in small details starts to matter a lot.
However, companies may still be hoping to get their own solutions in place instead of CUDA. If they do implement CUDA, that cements its position forever. That ship has probably already sailed, of course.
A lot of people talk about 'tooling' quality and no one hears them. I just spent a couple of weeks porting a fairly small library to some fairly common personal hardware and hit all the same problems you see everywhere. Bugs aren't handled gracefully. Instead of returning "you messed up here", the hardware locks up, and power cycling is the only solution. Not a problem when you're writing hello world, but trawling through tens of thousands of lines of GPU kernel code to find the error is going to burn engineer time without anything to show for it. Then, when it's running, spending weeks in an open feedback loop trying to figure out why the GPU utilization metrics are reporting 50% utilization (if you're lucky enough to even have them) and the kernel is running at 1/4 the expected performance is again going to burn weeks. All because there isn't a functional profiler.
And the vendors can't even get this stuff working. People rant about the ROCm support list not supporting, well, the hardware people actually have. And it is such a mess that in some cases it actually works but AMD says it doesn't. And of course, the only reason you hear people complaining about AMD is because they are literally the only company with a hardware ecosystem that, in theory, spans the same breadth of devices NVIDIA's does, from small embedded systems to giant data-center-grade products. Everyone else wants a slice of the market, but take Apple here: they have nothing in the embedded/edge space that isn't a fixed-function device (e.g. a watch, or Apple TV), and their GPUs, while interesting, are nowhere near the level of the datacenter-grade stuff, much less even top-of-the-line AIC boards for gamers.
And it's all gotten to be such an industry-wide pile of trash that people can't even keep track of basic feature capabilities. Like, a huge pile of hardware actually 'supports' OpenCL, but it's buried to the point where actual engineers working on, say, ROCm are unaware it's actually part of the ROCm stack (imagine my surprise!). And it's been the same for Nvidia: they have at times supported OpenCL, but the support is like a .dll they install with the GPU driver stack and don't even bother to document is there. Or TensorFlow, which seems to have succumbed to the immense gravitational black hole it had become, where just building it on something that wasn't the blessed platform could take days.
Also, I do wonder what the difference between an API and a set of libraries is; couldn't an API be exposed from that set of libraries and then used? It's a little confusing, I guess.
And now you've entered that copyright violation territory.
A clean-room reimplementation of CUDA would avoid any copyright claims, but would not necessarily avoid patent infringement.
https://en.wikipedia.org/wiki/Clean-room_design:
“Clean-room design is useful as a defense against copyright infringement because it relies on independent creation. However, because independent invention is not a defense against patents, clean-room designs typically cannot be used to circumvent patent restrictions.”
That assumes APIs are either not copyrightable or that API reimplementation is always fair use of the API, neither of which there is sufficient precedent to justify as a conclusion; Oracle v. Google ended with “well, it would be fair use in the exact factual circumstances in this case, so we don't have to reach the thornier general questions”.
You wouldn't believe me if you didn't try it and see for yourself, so try it.
NVidia's CUDA moat is no more.
You can get 90% of the way there with a small team of compiler devs. The remaining 10% would take hundreds of people working for ten years. The cost of this is suspiciously close to the billions in financial incentive you mentioned; funny how efficient markets work.
Can one really speak of efficient markets when there are multiple near-monopolies at various steps in the production chain, with massive integration and infinite amounts of state spending in the process?
When a monopoly uses its status in an attempt to gain another monopoly, that's a problem, and governments eventually strike this behavior down.
Sometimes it takes time, because you'd rather not go on an ideological power trip and break something that's useful to the country/world.
> Yes, free markets and monopolies are not incompatible.
How did you get from "efficient markets" to "free markets"? The first could be accepted as inherently valuable, while the latter is clearly not, if this kind of freedom degrades to: "Sure, you can start your business, it's a free country. For certain you will fail, though, because there are monopolies already in place who have all the power in the market."
Also, monopolies are regularly used to squeeze exorbitant shares of the added value from the other market participants, see e.g. Apple's App Store cut. Accepting that as "efficient" would be a really unusual usage of the term in regard to markets.
(The terminology is especially unfortunate because people tend to view it as praise for free markets, and since that's an ideological claim people respond with opposing ideological claims, and now the conversation is about ideology instead of about understanding a specific phenomenon in economics.)
This is fully compatible with Apple's App Store revenue share existing and not creating value (i.e., being rent). What the efficient markets principle tells us is that, if it were possible for someone else to start their own app store with a smaller revenue share and steal Apple's customers that way, then their revenue share would already be much lower, to account for that. Since this isn't the case, we can conclude that there's some reason why starting your own competing app store wouldn't work. Of course, we already separately know what that reason is: an app store needs to be on people's existing devices to succeed, and your competing one wouldn't be.
Similarly, if it were possible to spend $10 million to create an API-compatible clone of CUDA, and then save more than $10 million by not having to pay huge margins to Nvidia, then someone would have already done it. So we can conclude that either it can't be done for $10 million, or it wouldn't create $10 million of value. In this case, the first seems more likely, and the comment above hypothesizes why: because an incomplete clone wouldn't produce $10 million of value, and a complete one would cost much more than $10 million. Alternatively, if Nvidia could enforce intellectual property rights against someone creating such a clone, that would also explain it.
(Technically it's possible that this could instead be explained by a free-rider problem; i.e., such a clone would create more value than it would cost, but no company wants to sponsor it because they're all waiting for some other company to do it and then save the $10 million it would cost to do it themselves. But this seems unlikely; big tech companies often spend more than $10 million on open source projects of strategic significance, which a CUDA clone would have.)
https://www.techpowerup.com/319016/amd-develops-rocm-based-s...
Then they had to stop working on some parts of the source code and rewrite a lot of things again; they are still not as close as they were before AMD's lawyer shenanigans.
Thanks
You can see similar things if you buy datacenter-grade CPUs from AMD or Intel and compare their per-model optimized BLAS builds and compilers to using OpenBLAS, or swap them around. The difference is not world-ending, but you can see maybe 50% in some cases.
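An easy way to see this for yourself is to time the same large GEMM with NumPy linked against OpenBLAS and then against the vendor BLAS (MKL, AOCL, etc.). A rough sketch, with the matrix size and iteration count picked arbitrarily:

    # Time a dense matmul; run once per BLAS backend and compare GFLOP/s.
    import time
    import numpy as np

    n = 4096
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)

    a @ b  # warm-up so threads and pages are initialized

    iters = 10
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    dt = (time.perf_counter() - t0) / iters

    gflops = 2 * n ** 3 / dt / 1e9  # dense GEMM is roughly 2*n^3 floating-point ops
    print(f"{dt * 1e3:.1f} ms per matmul, ~{gflops:.0f} GFLOP/s")
    np.show_config()  # shows which BLAS this NumPy build is linked against

Same script, different BLAS underneath; on large core-count server parts the gap is usually visible immediately.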
But a lot of the most useful libraries are closed source and available on NVIDIA hardware only.
You could probably get most open source CUDA to run on other vendors’ hardware without crazy work. But you’d spend a ton more work getting to parity on the ecosystem, plus lawyer fees when NVIDIA comes at you.
Haven’t really explored MLX so can’t speak about it.
(Software patents can, though.)
This way, you’re more incentivized to write MLX and have it run everywhere. It’s a situation where everyone wins, especially Apple, because they can optimize it further for their platforms.
I would think that bringing that to all UMA APUs (of any vendor) would be interesting, but discrete GPUs would definitely need a different approach?
edit: reading the PR comments, it appears that CUDA supports a UMA API directly, and will transparently copy as needed.
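For reference, that's CUDA managed ("unified") memory: one allocation addressable from both host and device, with the driver migrating or copying pages on demand. A minimal sketch of using it from Python, assuming a CuPy build where cupy.cuda.malloc_managed is available:

    import cupy as cp

    # Route CuPy allocations through cudaMallocManaged so the same buffer is
    # addressable from host and device; the driver migrates pages as needed.
    cp.cuda.set_allocator(cp.cuda.malloc_managed)

    x = cp.arange(1 << 20, dtype=cp.float64)  # backed by managed memory
    y = (x * 2.0).sum()                       # computed on the GPU
    print(float(y))                           # read on the host, no explicit copy

On a discrete card this is still demand-paged migration over PCIe rather than true shared memory, which is presumably why discrete GPUs would call for a different approach than a UMA APU.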
Does this mean you can use MLX on Linux now?
Edit:
Just tested it and it's working, but only the Python 3.12 version is available on PyPI right now: https://pypi.org/project/mlx-cuda/#files
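For anyone curious what that looks like, a minimal sketch (exactly what default_device() reports on the CUDA build is an assumption here):

    # After `pip install mlx-cuda` on Linux with an Nvidia GPU (Python 3.12 only
    # for now, per the PyPI page above), ordinary MLX code runs unchanged.
    import mlx.core as mx

    print(mx.default_device())  # expected to report the GPU on the CUDA backend

    a = mx.random.normal((1024, 1024))
    b = mx.random.normal((1024, 1024))
    c = a @ b      # lazily recorded
    mx.eval(c)     # forces execution on the default device
    print(c.shape, c.dtype)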
Academia and companies continue to write proprietary code. It's as if we continued to write code for Adobe Flash or Microsoft Silverlight in the year 2025.
Honestly, I don't mind as Nvidia shareholder.
I'm not even sure what the equivalent would be for CUDA tbh.
Nvidia won because they don't deal with this level of asinine infighting. If Khronos could bring back a level of mutual respect to their consortium, they could present a serious threat. Apple is the only business still on their high horse; AMD, Intel and Qualcomm all recognize that they need to cooperate.
I guess there might be a way to develop apps for iOS or even PlayStation in Java, but my knees hurt just thinking about how many hoops one needs to jump through.
I think you mean:
"write once, test everywhere"
Normally I write something snide about not seeing where the puck was headed. But Apple did skate to the puck here, they just did nothing with it.
Idly wondering, is Apple bankrolling this but wants to keep it on the DL? There were also rumours the team was looking to move at one point?
Pretty sure Claude Sonnet is actually doing most of the work.
looks like it allows MLX code to compile and run on x86 + GeForce hardware, not the other way around.
Edit: looks like it's "write once, use everywhere". Write MLX, run it on Linux CUDA, and Apple Silicon/Metal.
I’ll note Apple hasn’t shipped an Nvidia card in a very, very long time. Even on the Mac Pros before Apple Silicon they only ever sold AMD cards.
My understanding from rumors is that they had a falling out over the problems with the dual GPU MacBook Pros and the quality of drivers.
I have no idea if sticking one in on the PCI bus lets you use it for AI stuff, though.
I imagined the convo between Steve Jobs and Jensen Huang went like this:
S: your GPU is shit
J: your thermal design is shit
S: f u
J: f u too
Apple is the kind of company that holds a grudge for a very long time; their relationships with suppliers are very one-way, their way or the highway.
GPU failures due to this also happened on Dell/HP/Sony laptops, some desktop models, as well as early models of the PS3.
Some reading: https://www.badcaps.net/forum/troubleshooting-hardware-devic...
[0] https://developer.arm.com/documentation/102376/0200/Device-m...
[1] Raspberry Pi 4's PCIe has the same problem AFAIK
So my MLX workloads can soon be offloaded to the cloud!?
I’ve been hoping Apple Silicon becomes a serious contender for Nvidia chips; I wonder if the CUDA support is just Embrace, extend, and extinguish (EEE).
I know standard GPUs don’t.
The patch suggested one of the reasons for it was to make it easy to develop on a Mac and run on a supercomputer. So hardware with unified memory might be in that class.
[1] https://www.nvidia.com/en-us/autonomous-machines/embedded-sy... [2] https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s...
Yes we can dream!
It would be great if Apple continues pushing M processors to the next level, in part to go vertical into the cloud.
Or if they start supporting nVidia.
The latter seems less Apple-y. But they must be considering the value of a cloud level Apple-friendly AI computing solution, so something is likely (?) to happen.
I haven't done much local inference on it, but various YouTubers are starting to call the DGX Spark overkill / overpriced next to Strix Halo. The catch of course is that ROCm isn't there yet (they seem serious now, though; matter of time).
Flawless CUDA on Apple gear would make it really tempting in a way that isn't true with Strix so cheap and good.
The memory bandwidth on that thing is 200GB/s. That's great compared to most other consumer-level x86 platforms, but quite far off of an Nvidia GPU (a 5090 has 1792GB/s, dunno about the pro level cards) or even Apple's best (M3 Ultra has 800GB/s).
It certainly seems like a great value. But for memory bandwidth intensive applications like LLMs, it is just barely entering the realm of "good enough".
At 200GB/s, that upper limit is not very high at all. So it doesn't really matter if the compute is there or not.
https://www.anandtech.com/show/17024/apple-m1-max-performanc...
But my overarching point still stands: LLM inference needs memory bandwidth, and 200GB/s is not very much (especially for the higher ram variants).
If the M1 Max is actually 90GB/s, that just means it's a poor choice for LLM inference.
Each token requires a forward pass through all transformer layers, involving large matrix multiplications at every step, followed by a final projection to the vocabulary.
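To make that concrete: in single-stream decoding each token has to stream roughly all of the weights through the memory bus once, so bandwidth divided by model size gives a hard ceiling on tokens per second. A rough sketch using the bandwidth figures quoted in this thread and an illustrative 70B 4-bit model:

    # Back-of-the-envelope decode ceiling; numbers are illustrative, not measurements.
    def max_tokens_per_sec(params_billion, bytes_per_param, bandwidth_gb_s):
        weight_bytes = params_billion * 1e9 * bytes_per_param
        return bandwidth_gb_s * 1e9 / weight_bytes

    for bw in (200, 800, 1792):  # GB/s figures quoted in this thread
        # a 70B model at ~4 bits/param (~0.5 bytes) is ~35 GB of weights
        print(f"{bw:>5} GB/s -> ~{max_tokens_per_sec(70, 0.5, bw):.1f} tok/s ceiling")

KV-cache reads, batching, and speculative decoding shift the details, but the ceiling still scales with bandwidth.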
This time-tested Apple strategy is now undermining their AI strategy and potential competitiveness.
tl;dr they could have done 1600GB/s
Whatever story you want to create, if customers are happy year after year then Apple is serving them well.
Maybe not with the same feature-dimension balance you want, or other artificial/wishful balances you might make up for them.
(When Apple drops the ball it is usually painful, painfully obvious and most often a result of a deliberate and transparent priority tradeoff. No secret switcherooos or sneaky downgrading. See: Mac Pro for years…)
Competitive AMD GPU neural compute has been any day now for at least 10 years.
That one stands out to me as a mac user.
I wouldn’t be surprised if within the next few years we see a return of Nvidia hardware to the Mac, probably starting with low-volume products like the Mac Pro, strictly for professional/high-end use cases.
Do you have some links for this?
tl;dr - they used Google TPUs
https://www.investors.com/news/technology/apple-stock-apple-...
There seems to be a pervading assumption that Apple is still making a VolksComputer in 2025, blithely supporting a freer status quo for computing. They laid out their priorities completely with Apple Silicon: you're either on Apple's side or falling behind. Just the way they want it.
But I guess we have a VR device nobody wants.
M1 was launched 9 years after Jobs died. You're saying they had everything ready to go back then and just sat on their asses for a decade?
Didn't Jobs himself essentially die of delusion?
However, this might make mlx into a much stronger competitor for Pytorch.
The gist is that the API specification in itself is copyrightable, so it would be copyright infringement then.
Supreme Court ruled that by applying the Four Factors of Fair Use, Google stayed within Fair Use.
An API specification ends up being a system of organizing things, like the Dewey Decimal System (and thus not really something that can be copyrighted), which in the end tips the first factor toward Google. Because Google limited the Android version of the API to just things that were useful for smartphones, it won on the second factor too. Because only 0.4% of the code was reused, and most of it was rewritten, Google won on the third factor. And on the market factor, if the Court had held for Oracle, it would harm the public because then "Oracle alone would hold the key. The result could well prove highly profitable to Oracle (or other firms holding a copyright in computer interfaces) ... [but] the lock would interfere with, not further, copyright's basic creativity objectives." So the fourth factor also pointed in Google's favor.
Whether "java" won or lost is a question of what is "java"? Android can continue to use the Java API- so it is going to see much more activity. But Oracle didn't get to demand license fees, so they are sad.
I always thought it was resolved as infringement and they had to license the Java APIs or something ...
Wow.
This is part of why patents and copyrights can't be the moat for your company. 11 years, with lots of uncertainty and back-and-forth, to get a final decision.
backend
Also apparently this is not a re-implementation of CUDA.
MLX is a PyTorch-like framework.
Though I imagine that if Apple is doing this themselves, they likely know what they’re doing, whatever it is.