1. Programs built against MLX -> Can take advantage of CUDA-enabled chips
but not:
2. CUDA programs -> Can now run on Apple Silicon.
Because #2 would be a copyright violation (specifically with respect to NVidia's famous moat).
Is this correct?
It means that a developer can use their relatively low-powered Apple device (with UMA) to develop for deployment on Nvidia's relatively high-powered systems.
That's nice to have for a range of reasons.
Apple should do a similar thing for AMD.
https://www.theverge.com/2021/4/5/22367851/google-oracle-sup...
https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_....
I appreciate that English is your second language after your Hungarian mother tongue. My comment reflects on the low- and high-powered compute of the Apple vs. Nvidia hardware.
Unfortunately when that case went to the Supreme Court they basically just said "yeah for this case it's fair use, but we're not going to comment on whether APIs in general are copyrightable"...
CUDA is an ecosystem of programming languages, libraries and developer tools.
Composed of compilers for C, C++, Fortran, and Python JIT DSLs provided by NVidia, plus several others targeting either PTX or NVVM IR.
The libraries, which you correctly point out.
And then the IDE integrations, the GPU debugger that is on par with Visual Studio-style debugging, the profiler, ...
Hence why everyone that focuses on copying only CUDA C or CUDA C++, without everything else that makes CUDA relevant, keeps failing.
As far as portability, people who care about that already have the option of using higher-level APIs that have CUDA backend among several others. The main reason why you'd want to do CUDA directly is to squeeze that last bit of performance out of the hardware, but that is also precisely the area where deviation in small details starts to matter a lot.
However, companies may still be hoping to get their own solutions in place instead of CUDA. If they do implement CUDA, that cements its position forever. That ship has probably already sailed, of course.
A lot of people talk about 'tooling' quality and no one hears them. I just spent a couple of weeks porting a fairly small library to some fairly common personal hardware and hit all the same problems you see everywhere. Bugs aren't handled gracefully. Instead of returning "you messed up here", the hardware locks up, and power cycling is the only solution. Not a problem when you're writing hello world, but trawling through tens of thousands of lines of GPU kernel code to find the error is going to burn engineer time without anything to show for it. Then, when it's running, spending weeks in an open feedback loop trying to figure out why the GPU utilization metrics are reporting 50% utilization (if you're lucky enough to even have them) and the kernel is running at 1/4 the expected performance is again going to burn weeks. All because there isn't a functional profiler.
And the vendors can't even get this stuff working. People rant about the ROCm support list not supporting, well, the hardware people actually have. And it is such a mess that in some cases it actually works but AMD says it doesn't. And of course, the only reason you hear people complaining about AMD is because they are literally the only company with a hardware ecosystem that, in theory, spans the same breadth of devices NVIDIA's does, from small embedded systems to giant data-center-grade products. Everyone else wants a slice of the market, but take Apple here: they have nothing in the embedded/edge space that isn't a fixed-function device (e.g. a watch, or Apple TV), and their GPUs, while interesting, are nowhere near the level of the datacenter-grade stuff, much less even top-of-the-line AIC boards for gamers.
And it's all gotten to be such an industry-wide pile of trash that people can't even keep track of basic feature capabilities. Like, a huge pile of hardware actually 'supports' OpenCL, but it's buried to the point where actual engineers working on, say, ROCm are unaware it's actually part of the ROCm stack (imagine my surprise!). And it's been the same for Nvidia: they have at times supported OpenCL, but the support is like a .dll they install with the GPU driver stack and don't even bother to document is there. Or TensorFlow, which seems to have succumbed to the immense gravitational black hole it had become, where just building it on something that wasn't the blessed platform could take days.
Also, I do wonder what the difference between an API and a set of libraries is; couldn't an API be exposed from that set of libraries and then used? It's a little confusing, I guess.
And now you've entered that copyright violation territory.
A clean-room reimplementation of CUDA would avoid any copyright claims, but would not necessarily avoid patent infringement.
https://en.wikipedia.org/wiki/Clean-room_design:
“Clean-room design is useful as a defense against copyright infringement because it relies on independent creation. However, because independent invention is not a defense against patents, clean-room designs typically cannot be used to circumvent patent restrictions.”
That assumes APIs are either not copyrightable or that API reimplementation is always fair use of the API, neither of which there is sufficient precedent to justify as a conclusion; Oracle v. Google ended with “well, it would be fair use in the exact factual circumstances in this case, so we don't have to reach the thornier general questions”.
You wouldn't believe me if you didn't try it and see for yourself, so try it.
NVidia's CUDA moat is no more.
You can get 90% of the way there with a small team of compiler devs. The remaining 10% would take hundreds of people working for ten years. The cost of this is suspiciously close to the billions in financial incentive you mentioned; funny how efficient markets work.
Can one really speak of efficient markets when there are multiple near-monopolies at various steps in the production chain, with massive integration and infinite amounts of state spending in the process?
When a monopoly uses its status in an attempt to gain another monopoly, that's a problem, and governments eventually strike this behavior down.
Sometimes it takes time, because you'd rather not go on an ideological power trip and break something that's useful to the country/world.
> Yes, free markets and monopolies are not incompatible.
How did you get from "efficient markets" to "free markets"? The first could be accepted as inherently valuable, while the latter is clearly not, if this kind of freedom degrades to: "Sure, you can start your business, it's a free country. For certain you will fail, though, because there are monopolies already in place who have all the power in the market."
Also, monopolies are regularly used to squeeze exorbitant shares of the added value from the other market participants, see e.g. Apple's App Store cut. Accepting that as "efficient" would be a really unusual usage of the term in regard to markets.
(The terminology is especially unfortunate because people tend to view it as praise for free markets, and since that's an ideological claim people respond with opposing ideological claims, and now the conversation is about ideology instead of about understanding a specific phenomenon in economics.)
This is fully compatible with Apple's App Store revenue share existing and not creating value (i.e., being rent). What the efficient markets principle tells us is that, if it were possible for someone else to start their own app store with a smaller revenue share and steal Apple's customers that way, then their revenue share would already be much lower, to account for that. Since this isn't the case, we can conclude that there's some reason why starting your own competing app store wouldn't work. Of course, we already separately know what that reason is: an app store needs to be on people's existing devices to succeed, and your competing one wouldn't be.
Similarly, if it were possible to spend $10 million to create an API-compatible clone of CUDA, and then save more than $10 million by not having to pay huge margins to Nvidia, then someone would have already done it. So we can conclude that either it can't be done for $10 million, or it wouldn't create $10 million of value. In this case, the first seems more likely, and the comment above hypothesizes why: because an incomplete clone wouldn't produce $10 million of value, and a complete one would cost much more than $10 million. Alternatively, if Nvidia could enforce intellectual property rights against someone creating such a clone, that would also explain it.
(Technically it's possible that this could instead be explained by a free-rider problem; i.e., such a clone would create more value than it would cost, but no company wants to sponsor it because they're all waiting for some other company to do it and then save the $10 million it would cost to do it themselves. But this seems unlikely; big tech companies often spend more than $10 million on open source projects of strategic significance, which a CUDA clone would have.)
https://www.techpowerup.com/319016/amd-develops-rocm-based-s...
Then they had to stop working on some parts of the source code and rewrite a lot of things again; they are still not as close as they were before AMD's lawyer shenanigans.
Thanks
You can see similar things if you buy datacenter-grade CPUs from AMD or Intel and compare their per-model optimized BLAS builds and compilers to using OpenBLAS, or swap them around. The difference is not world-ending, but you can see maybe 50% in some cases.
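An easy way to see this for yourself is to time the same large GEMM with NumPy linked against OpenBLAS and then against the vendor BLAS (MKL, AOCL, etc.). A rough sketch, with the matrix size and iteration count picked arbitrarily:

    # Time a dense matmul; run once per BLAS backend and compare GFLOP/s.
    import time
    import numpy as np

    n = 4096
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)

    a @ b  # warm-up so threads and pages are initialized

    iters = 10
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    dt = (time.perf_counter() - t0) / iters

    gflops = 2 * n ** 3 / dt / 1e9  # dense GEMM is roughly 2*n^3 floating-point ops
    print(f"{dt * 1e3:.1f} ms per matmul, ~{gflops:.0f} GFLOP/s")
    np.show_config()  # shows which BLAS this NumPy build is linked against

Same script, different BLAS underneath; on large core-count server parts the gap is usually visible immediately.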
But a lot of the most useful libraries are closed source and available on NVIDIA hardware only.
You could probably get most open source CUDA to run on other vendors’ hardware without crazy work. But you’d spend a ton more work getting to parity on the ecosystem, plus lawyer fees when NVIDIA comes at you.
Haven’t really explored MLX so can’t speak about it.
(Software patents can, though.)
This way, you’re more incentivized to write MLX and have it run everywhere. It’s a situation where everyone wins, especially Apple, because they can optimize it further for their platforms.
I would think that bringing that to all UMA APUs (of any vendor) would be interesting, but discrete GPUs would definitely need a different approach?
edit: reading the PR comments, it appears that CUDA supports a UMA API directly, and will transparently copy as needed.
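For reference, that's CUDA managed ("unified") memory: one allocation addressable from both host and device, with the driver migrating or copying pages on demand. A minimal sketch of using it from Python, assuming a CuPy build where cupy.cuda.malloc_managed is available:

    import cupy as cp

    # Route CuPy allocations through cudaMallocManaged so the same buffer is
    # addressable from host and device; the driver migrates pages as needed.
    cp.cuda.set_allocator(cp.cuda.malloc_managed)

    x = cp.arange(1 << 20, dtype=cp.float64)  # backed by managed memory
    y = (x * 2.0).sum()                       # computed on the GPU
    print(float(y))                           # read on the host, no explicit copy

On a discrete card this is still demand-paged migration over PCIe rather than true shared memory, which is presumably why discrete GPUs would call for a different approach than a UMA APU.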
Does this mean you can use MLX on Linux now?
Edit:
Just tested it and it's working, but only the Python 3.12 version is available on PyPI right now: https://pypi.org/project/mlx-cuda/#files
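For anyone curious what that looks like, a minimal sketch (exactly what default_device() reports on the CUDA build is an assumption here):

    # After `pip install mlx-cuda` on Linux with an Nvidia GPU (Python 3.12 only
    # for now, per the PyPI page above), ordinary MLX code runs unchanged.
    import mlx.core as mx

    print(mx.default_device())  # expected to report the GPU on the CUDA backend

    a = mx.random.normal((1024, 1024))
    b = mx.random.normal((1024, 1024))
    c = a @ b      # lazily recorded
    mx.eval(c)     # forces execution on the default device
    print(c.shape, c.dtype)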
Academia and companies continue to write proprietary code. It's as if we continued to write code for Adobe Flash or Microsoft Silverlight in the year 2025.
Honestly, I don't mind as Nvidia shareholder.
I'm not even sure what the equivalent would be for CUDA tbh.
Nvidia won because they don't deal with this level of asinine infighting. If Khronos could bring back a level of mutual respect to their consortium, they could present a serious threat. Apple is the only business still on their high horse; AMD, Intel and Qualcomm all recognize that they need to cooperate.
I guess there might be a way to develop apps for iOS or even PlayStation in Java, but my knees hurt just thinking about how many hoops one needs to jump through.
I think you mean:
"write once, test everywhere"
Normally I write something snide about not seeing where the puck was headed. But Apple did skate to the puck here, they just did nothing with it.
Idly wondering, is Apple bankrolling this but wants to keep it on the DL? There were also rumours the team was looking to move at one point?
Pretty sure Claude Sonnet is actually doing most of the work.
looks like it allows MLX code to compile and run on x86 + GeForce hardware, not the other way around.
Edit: looks like it's "write once, use everywhere". Write MLX, run it on Linux CUDA, and Apple Silicon/Metal.
I’ll note Apple hasn’t shipped an Nvidia card in a very, very long time. Even on the Mac Pros before Apple Silicon they only ever sold AMD cards.
My understanding from rumors is that they had a falling out over the problems with the dual GPU MacBook Pros and the quality of drivers.
I have no idea if sticking one in on the PCI bus lets you use it for AI stuff, though.
I imagined the convo between Steve Jobs and Jensen Huang went like this:
S: your GPU is shit
J: your thermal design is shit
S: f u
J: f u too
Apple is the kind of company that holds a grudge for a very long time; their relationships with suppliers are very one-way, their way or the highway.
GPU failures due to this also happened on Dell/HP/Sony laptops, some desktop models, as well as early models of the PS3.
Some reading: https://www.badcaps.net/forum/troubleshooting-hardware-devic...
[0] https://developer.arm.com/documentation/102376/0200/Device-m...
[1] Raspberry Pi 4's PCIe has the same problem AFAIK
So my MLX workloads can soon be offloaded to the cloud!?
I’ve been hoping Apple Silicon becomes a serious contender for Nvidia chips; I wonder if the CUDA support is just Embrace, extend, and extinguish (EEE).
I know standard GPUs don’t.
The patch suggested one of the reasons for it was to make it easy to develop on a Mac and run on a supercomputer. So hardware with unified memory might be in that class.
[1] https://www.nvidia.com/en-us/autonomous-machines/embedded-sy... [2] https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s...
Yes we can dream!
It would be great if Apple continues pushing M processors to the next level, in part to go vertical into the cloud.
Or if they start supporting nVidia.
The latter seems less Apple-y. But they must be considering the value of a cloud level Apple-friendly AI computing solution, so something is likely (?) to happen.
I haven't done much local inference on it, but various YouTubers are starting to call the DGX Spark overkill / overpriced next to Strix Halo. The catch of course is that ROCm isn't there yet (they seem serious now, though; matter of time).
Flawless CUDA on Apple gear would make it really tempting in a way that isn't true with Strix so cheap and good.
The memory bandwidth on that thing is 200GB/s. That's great compared to most other consumer-level x86 platforms, but quite far off of an Nvidia GPU (a 5090 has 1792GB/s, dunno about the pro level cards) or even Apple's best (M3 Ultra has 800GB/s).
It certainly seems like a great value. But for memory bandwidth intensive applications like LLMs, it is just barely entering the realm of "good enough".
At 200GB/s, that upper limit is not very high at all. So it doesn't really matter if the compute is there or not.
https://www.anandtech.com/show/17024/apple-m1-max-performanc...
But my overarching point still stands: LLM inference needs memory bandwidth, and 200GB/s is not very much (especially for the higher ram variants).
If the M1 Max is actually 90GB/s, that just means it's a poor choice for LLM inference.
Each token requires a forward pass through all transformer layers, involving large matrix multiplications at every step, followed by a final projection to the vocabulary.
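To make that concrete: in single-stream decoding each token has to stream roughly all of the weights through the memory bus once, so bandwidth divided by model size gives a hard ceiling on tokens per second. A rough sketch using the bandwidth figures quoted in this thread and an illustrative 70B 4-bit model:

    # Back-of-the-envelope decode ceiling; numbers are illustrative, not measurements.
    def max_tokens_per_sec(params_billion, bytes_per_param, bandwidth_gb_s):
        weight_bytes = params_billion * 1e9 * bytes_per_param
        return bandwidth_gb_s * 1e9 / weight_bytes

    for bw in (200, 800, 1792):  # GB/s figures quoted in this thread
        # a 70B model at ~4 bits/param (~0.5 bytes) is ~35 GB of weights
        print(f"{bw:>5} GB/s -> ~{max_tokens_per_sec(70, 0.5, bw):.1f} tok/s ceiling")

KV-cache reads, batching, and speculative decoding shift the details, but the ceiling still scales with bandwidth.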
This time-tested Apple strategy is now undermining their AI strategy and potential competitiveness.
tl;dr they could have done 1600GB/s
Whatever story you want to create, if customers are happy year after year then Apple is serving them well.
Maybe not with the same feature-dimension balance you want, or other artificial/wishful balances you might make up for them.
(When Apple drops the ball it is usually painful, painfully obvious and most often a result of a deliberate and transparent priority tradeoff. No secret switcherooos or sneaky downgrading. See: Mac Pro for years…)
Competitive AMD GPU neural compute has been any day now for at least 10 years.
That one stands out to me as a mac user.
I wouldn’t be surprised if within the next few years we see a return of Nvidia hardware to the Mac, probably starting with low-volume products like the Mac Pro, strictly for professional/high-end use cases.
Do you have some links for this?
tl;dr - they used Google TPUs
https://www.investors.com/news/technology/apple-stock-apple-...
There seems to be a pervading assumption that Apple is still making a VolksComputer in 2025, blithely supporting a freer status quo for computing. They laid out their priorities completely with Apple Silicon: you're either on Apple's side or falling behind. Just the way they want it.
But I guess we have a VR device nobody wants.
M1 was launched 9 years after Jobs died. You're saying they had everything ready to go back then and just sat on their asses for a decade?
Didn't Jobs himself essentially die of delusion?
However, this might make mlx into a much stronger competitor for Pytorch.
The gist is that the API specification in itself is copyrightable, so it would be copyright infringement then.
Supreme Court ruled that by applying the Four Factors of Fair Use, Google stayed within Fair Use.
An API specification ends up being a system of organizing things, like the Dewey Decimal System (and thus not really something that can be copyrighted), which in the end tips the first factor toward Google. Because Google limited the Android version of the API to just things that were useful for smartphones, it won on the second factor too. Because only 0.4% of the code was reused, and most of it was rewritten, Google won on the third factor. And on the market factor, if the Court had held for Oracle, it would harm the public because then "Oracle alone would hold the key. The result could well prove highly profitable to Oracle (or other firms holding a copyright in computer interfaces) ... [but] the lock would interfere with, not further, copyright's basic creativity objectives." So the fourth factor also pointed in Google's favor.
Whether "java" won or lost is a question of what is "java"? Android can continue to use the Java API- so it is going to see much more activity. But Oracle didn't get to demand license fees, so they are sad.
I always thought it was resolved as infringement and they had to license the Java APIs or something ...
Wow.
This is part of why patents and copyrights can't be the moat for your company. 11 years, with lots of uncertainty and back-and-forth, to get a final decision.
backend
Also apparently this is not a re-implementation of CUDA.
MLX is a PyTorch-like framework.
Though I imagine that if Apple is doing this themselves, they likely know what they’re doing, whatever it is.