Also, people often assume the reason for an NPU is "speed". That's not correct. The whole point of the NPU is low power consumption. To focus on speed you'd need to get rid of the memory bottleneck, and then you end up designing your own ASIC with its own memory. The NPUs we see in most devices are part of the SoC around the CPU, there to offload AI computations. It would be interesting to run this benchmark in an infinite loop on the three devices (CPU, NPU, GPU) and measure power consumption. I'd expect the NPU to be lowest and also best in terms of "ops/watt".
I have a sneaking suspicion that the real real reason for an NPU is marketing. "Oh look, NVDA is worth $3.3T - let's make sure we stick some AI stuff in our products too."
Most of these small NPUs are actually made for CNNs and other models where "stream data through weights" applies. They have a huge speedup there. When you stream weights across data (any LLM or other large model), you are almost certain to be bound by memory bandwidth.
* On CPU: SIMD NEON
* On CPU: custom matrix multiply accelerator, separate from SIMD unit
* On CPU package: NPU
* GPU
Then they go and hide it all in proprietary undocumented features and force you to use their framework to access it :c

The M processors changed the game. My teams support 250k users. I went from 50 MacBooks in 2020 to over 10,000 today. I added zero staff - we manage them like iPhones.
Coffee shops, trains and airports in Europe? Nope, rare animal on tables.
European schools? In most countries parents buy their kids a computer, and most often it is a desktop used by the whole family, or a laptop of some kind running Windows, unless we are talking about countries where buying Apple isn't a strain on the monthly expenses.
Popular? In Germany, the few times they get displayed in shopping mall stores, they get routinely discounted, or bundled with something else, until the stores finally get rid of them.
Valve is heavily dependent on game studios producing Windows games.
The M processor really did completely eliminate all sense of “lag” for basic computing (web browsing, restarting your computer, etc). Everything happens nearly instantly, even on the first generation M1 processor. The experience of “waiting for something to load” went away.
Not to mention these machines easily last 5-10 years.
As someone who used Windows laptops, I was amazed when I saw someone sitting next to me on a subway editing images in Photoshop on her MacBook Pro with just her trackpad. The standard for Windows laptops used to be that low (about ten or twelve years ago?) that seeing a MacBook trackpad just work is part of my permanent memory.
In companies where I worked where the IT team rolled out "security" software to the Mac-based developers, their computers were not noticeably faster than Windows PCs at all, especially given the majority of containers are still linux/amd64, reflecting the actual deployment environment. Meanwhile Windows also runs on ARM anyway, so it's not really something useful to generalize about.
[0] https://support.microsoft.com/en-us/topic/how-to-add-a-file-...
There are also insurance, compliance, and other constraints that IT folks have that make them unwilling to turn off scanning for you.
To be fair, the average employee doesn’t have much more than idiot-level knowledge when it comes to security.
The majority of employees would rather turn off automatic OS updates simply because it's a hassle to restart your computer - god forbid you lose those 250 Chrome tabs waiting for you to never get around to revisiting them!
So the pro-part of the tip does not apply.
On my own machines, anti-virus is one of the very first things to be removed. Most of the time I'd turn off the swap file entirely as well, except Windows doesn't overcommit and certain applications are notorious for allocating memory without ever using it.
Even modern macs can be brought to their knees by something that rhymes with FrowdStrike Calcon and interrupts all IO.
One can bury the machine and lose very little basic interactivity. That part users really like.
Frankly the only downside of the MacBook Air is the tiny storage. The 8GB RAM is actually enough most of the time, but general system storage of only 1/4 TB is consistently cramped.
Been thinking about sending the machine out to one of those upgrade shops...
I could 3D print a couple of brackets and probably lodge a bigger SSD or the smaller form factor eMMC in there, I think, and pack it all into a little package one just plugs in. The port extender is currently shaped such that it fits right under the Air, tilting it nicely for general use.
The Air only has external USB... still, I don't need to boot from it. The internal one can continue to do that. Storage is storage for most tasks.
So yeah, great deal. And I really wanted to run the new CPU.
Frankly, I can do more, and generally faster, than I would expect running on those limited resources. It has been quite a nice surprise.
For a lot of what I do, the RAM and storage are enough.
I think part of the reason is that we manage Mac pretty strictly now but we're getting there with Linux too.
We also tried to get them to use WSL 1 and 2 but they just laugh at it :) And point at its terrible disk performance and other dealbreakers. Can't blame them.
As a company, if customers are willing to pay a premium for a NPU, or if they are unwilling to buy a product without one, it is not your place to say “hey we don’t really believe in the AI hype so we’re going to sell products people don’t want to prove a point”
If they shove it in every single product and that’s all anyone advertises, whether consumers know it will help them or not, you don’t get a lot of choice.
If you want the latest chip, you’re getting AI stuff. That’s all there is to it.
"How many models to we have without logos?"
"Huh? Why would we do that?"
To some degree I understand it, because as we've all noticed computers have pretty much plateaued for the average person. They last much longer. You don't need to replace them every two years anymore because the software isn't outstripping them so fast.
AI is the first thing to come along in quite a while that not only needs significant power but it’s just something different. It’s something they can say your old computer doesn’t have that the new one does. Other than being 5% faster or whatever.
So even if people don’t need it, and even if they notice they don’t need it, it’s something to market on.
The stuff up thread about it being the hotness that Wall Street loves is absolutely a thing too.
Microsoft is built around the broken Intel tick/tock model of incremental improvement — they are stuck with OEM shitware that will take years to flush out of the channel. That means for AI, they are stuck with cloud based OpenAI, where NVIDIA has them by the balls and the hyperscalers are all fighting for GPU.
Apple will deliver local AI features as software (the hardware is “free”) at a much higher margin - while Office 365 AI is like $400+ a year per user.
You’ll have people getting iPhones to get AI assisted emails or whatever Apple does that is useful.
The stuff they've been trying to sell AI to the public with is increasingly looking as absurd as every 1978 "you'll store your recipes on the home computer" argument.
AI text became a Human Centipede story: Start with a coherent 10-word sentence, let AI balloon it into five pages of flowery nonsense, send it to someone else, who has their AI smash it back down to 10 meaningful words.
Coding assistance, even as spicy autocorrect, is often a net negative as you have to plow through hallucinations and weird guesses as to what you want but lack the tools to explain to it.
Image generation is already heading rapidly into cringe territory, in part due to some very public social media operations. I can imagine your kids' kids in 2040 finding out they generated AI images in the 2020s and looking at them with the same embarrassment you'd see if they dug out your high-school emo fursona.
There might well be some more "closed-loop" AI applications that make sense. But are they going to be running on every desktop in the world? Or are they going to be mostly used in datacentres and purpose-built embedded devices?
I also wonder how well some of the models and techniques scale down. I know Microsoft pushed a minimum spec to promote a machine as Copilot-ready, but that seems like it's going to be "Vista Basic Ready" redux as people try to run tools designed for datacentres full of Quadro cards, or at least high-end GPUs, on their $299 HP laptop.
"Bela Lugosi's Dead" came out in 1979, and Peter Murphy was onto his next band by 1984.
By 2000 Goth was fully a distant dot in the rear view mirror for the OGs.
In 2002, Murphy released *Dust* with Turkish-Canadian composer and producer Mercan Dede, which utilizes traditional Turkish instrumentation and songwriting, abandoning Murphy's previous pop and rock incarnations, and juxtaposing elements from progressive rock, trance, classical music, and Middle Eastern music, coupled with Dede's trademark atmospheric electronics.
https://www.youtube.com/watch?v=Yy9h2q_dr9k

People recalling Goths in that period should beware of thinking that was a source and not an echo.
In 2006 Noel Fielding's Richmond Felicity Avenal was a basement dwelling leftover from many years past.
#Ostrogoth #TwueGoth
People are clearly finding LLM tech useful, and we’re barely scratching the surface.
And I'm pretty sure this is only introductory pricing. As people get used to it and use it more, it won't cover the cost. I think they rely on the gym model currently: many people not using the AI features much. But eventually that will change. Also, many companies have figured that out and pull the Copilot license from users that don't use it enough.
Local inference does have privacy benefits. I think at the moment it might make sense to send most queries to a beefy cloud model, and sensitive queries to a smaller local one.
We’re seeing these things in traditional PCs now because Microsoft has demanded it so that Microsoft can use it in Windows 11.
Any use by third party software is a lower priority
I can’t find TDP for Apple’s Neural Engine (https://en.wikipedia.org/wiki/Neural_Engine), but the first version shipped in the iPhone 8, which has a 7 Wh battery, so these are targeting different markets.
It's also often about offload. Depending on the use case, the CPU and GPU may be busy with other tasks, so the NPU is free bandwidth that can be used without stealing from the others. Consider AI-powered photo filters: the GPU is probably busy rendering the preview, and the CPU is busy drawing UI and handling user inputs.
Without those, wouldn't it be better to spend the NPU's silicon budget on more CPU?
In this environment it makes some sense to use more efficient RISC cores, to spread the die out a bit with dedicated blocks that either aren't going to get used all the time or will be used at lower power draws, and to combine cores with better on-die memory (extreme L2/L3 caches) and other features. Apple even leaves some silicon in the power section as empty space for thermal reasons.
Emily (formerly Anthony) on LTT had a piece on the Apple CPUs that pointed out some of the inherent advantages of the big-chip ARM SOC versus the x86 motherboard-daughterboard arrangement as we start to hit Moore's Wall. https://www.youtube.com/watch?v=LFQ3LkVF5sM
NPUs focus on one specific type of computation, matrix multiplication, usually with low precision integers, because that's all a neural net needs. That vast reduction in flexibility means you can take lots of shortcuts in your design, allowing you to cram more compute into a smaller footprint.
If you look at the M1 chip[1], you can see the entire 16-core Neural Engine has a footprint about the size of 4 performance cores (excluding their caches). It's not a perfect comparison without numbers on what the performance cores can achieve in ops/second vs the Neural Engine, but it seems reasonable to bet that the Neural Engine can handily outperform the performance core complex when doing matmul operations.
[1] https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...
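To make the "low precision integer matmul" point concrete, here's a toy sketch (numpy; the sizes are made up): int8 inputs accumulated in int32, which is roughly the arithmetic an NPU hard-wires.

import numpy as np

# Toy int8 matmul with int32 accumulation - the kind of op NPUs specialize in
A = np.random.randint(-128, 128, size=(64, 256), dtype=np.int8)
B = np.random.randint(-128, 128, size=(256, 64), dtype=np.int8)
C = A.astype(np.int32) @ B.astype(np.int32)   # accumulate in int32 to avoid overflow
print(C.dtype, C.shape)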
This is Nvidia's moat. Everything has optimized kernels for CUDA, and maybe Apple Accelerate (which is the only way to touch the CPU matrix unit before M4, and the NPU at all). If you want to use anything else, either prepare to upstream patches in your ML framework of choice or prepare to write your own training and inference code.
AMD is not taking ML applications seriously, outside of their marketing hype.
Because functional compatibility is hardly useful if the performance is not up to par, and cuDNN will run specific kernels that are particularly tuned to not only a specific model of GPU, but also to the specific inputs that the user is submitting. NVidia is doing a ton of work behind the scenes to both develop high-performance kernels for their exact architecture, but also to know which ones are best for a particular application.
This is probably the main reason why I was hesitant to join AMD a few years ago and to this day it seems like it was a good decision.
(In the end python just calls C, but it's pretty interesting how much performance is lost)
You cannot compare python with an ONNX executor.
I don't know what you used in Python, but if it's pytorch or similar, those are built with flexibility in mind; for optimal performance you want to export those to ONNX and use whatever executor is optimized for your env. onnxruntime is one of them, but definitely not the only one, and given it's from Microsoft, some prefer to avoid it and choose among the many free alternatives.
Why?
If you're naively doing `time.time()` then what happens is this
start = time.time() # cpu records time
pred = model(input.cuda()) # push data (and model, if not already there) to GPU memory and start computation. This is asynchronous
end = time.time() # cpu records time, regardless of whether pred has actually finished computing
You probably aren't expecting that if you don't know systems and hardware. But python (and really any language) is designed to be smart and turn what you wrote into something more optimized than the literal sequence of steps. There's no lock, so we're not going to block GPU operations for CPU tasks. You might ask why do this? Well, no one knows what you actually want to do. And do you want the timer library checking for accelerators (i.e. GPUs) every time it records a time? That's going to mess up your timer! (At best you'd have to use a constructor to say "enable locking for this accelerator".) So you gotta do something a bit more nuanced. If you want to actually time GPU tasks, you should look at CUDA event timers (in pytorch this is `torch.cuda.Event(enable_timing=True)`; I have another comment with boilerplate).
Edit:
There are also complicated issues like memory size and shape. They definitely are not being nice to the NPU here on either of those. They (and GPUs!!!) want channels last. They did [1,6,1500,1500] but you'd want [1,1500,1500,6]. There's also the issue of how memory is allocated (and they noted IO being an issue). 1500 is a weird number (as is 6), so they aren't doing the NPU any favors, and I wouldn't be surprised if this is a surprisingly big hit, considering how new these things are.
And here's my longer comment with more details: https://news.ycombinator.com/item?id=41864828
For ONNX, the runtimes I know of are synchronous: we don't do each operation individually but run whole models at once, so there is no need for async and the timings should be correct.
I'm less concerned about the CPU baseline and more concerned about the NPU timing. Especially given the other issues
You know which chip has the lowest power consumption ? The one which is turned off. /s
https://www.techpowerup.com/325035/amd-strix-point-silicon-p...
Apple originally added their NPUs before the current LLM wave to support things like indexing your photo library so that objects and people are searchable. These features are still very popular. I don't think these NPUs are fast enough for GenAI anyway.
The camera is real good though.
“Recent” seems to mean everything; I’ve got 6k+ photos, I think since the last fresh install, which is many devices ago.
Sounds like the view you’re looking for and will stick as the default once you find it, but you do have to bat away some BS at first.
I'd be willing to bet that the amount of money they are missing out on is miniscule and is by far offset by people's money who care about other stuff. Like you know, performance and battery life, just to stick to your examples.
Why would anyone care about die size? And if you do why not get one of the many low power laptops with Atoms etc that do have small die size?
The realities of mass manufacturing and supply chains and whatnot mean it's cheaper to get a laptop with a webcam I don't use, a fingerprint reader I don't use, and an SD card reader I don't use. It's cheaper to get a CPU with integrated graphics I don't use, a trusted execution environment I don't use, remote management features I don't use. It's cheaper to get a discrete GPU with RGB LEDs I don't use, directx support I don't use, four outputs when I only need one. It's cheaper to get a motherboard with integrated wifi than one without.
Fwiw there should be no power downside to having an unused unit. It’ll just not be powered.
In practice, everyone is paying a premium for NPUs that only a minority desires, and only a fraction of that minority essentially does "something" with it.
This thread really helps to show that the use-cases are few, non-essential, and that the general application landscape hasn't adopted NPUs and has very little incentive to do so (because of the alien programming model, because of hardware compat across vendors, because of the ecosystem being a moving target with little stability in sight, and because of the high-effort/low-reward in general).
I do want to be wrong, of course. Tech generally is exciting because it offers new tools to crack old problems, opening new venues and opportunities in the process. Here it looks like we have a solution in search for a problem that was set by marketing departments.
The more they say the future will be better the more that it looks like the status quo.
Instead of an NPU, they could have used those transistors and die space for any number of things. But they wouldn't have put additional high performance CPU cores there - that would increase the power density too much and cause thermal issues that can only be solved with permanent throttling.
Every successive semiconductor node uses less power than the previous per transistor at the same clock speed. It's just that we then immediately use this headroom to pack more transistors closer and run them faster, so every chip keeps running into power limits, even if they continually do more with said power.
But I wasn’t on the design team and have no basis for second-guessing them. I’m just saying that cramming more performance CPU cores onto this die isn’t a realistic option.
This has been my thinking. Today you have to go out of your way to buy a system with an NPU, so I don't have any. But tomorrow, will they just be included by default? That seems like a waste for those of us who aren't going to be running models. I wonder what other uses they could be put to?
They're useful for more things than just LLMs.
As to why, I think it's along the lines of this: the CPU does 100 things, one of those is AI acceleration. Let's take the AI acceleration and give it its own space instead so we can keep the power down a bit, add some specialization, and leave the CPU to do other stuff.
Again, I'm coming at this from a high-level as if explaining it to my ageing parents.
So, I'm not sure that you're wasting much with the NPU. But I'm not an expert.
That's already the way things are going due to Microsoft decreeing that Copilot+ is the future of Windows, so AMD and Intel are both putting NPUs which meet the Copilot+ performance standard into every consumer part they make going forwards to secure OEM sales.
Actually, could they be used to make better AI in games? That'd be neat. A shooter character with some kind of organic tactics, or a Civilisation/Stellaris AI that doesn't suck.
Presumably you have a GPU? If so there is nothing an NPU can do that a discrete GPU can’t (and it would be much slower than a recent GPU).
The real benefits are power efficiency and cost since they are built into the SoC which are not necessarily that useful on a desktop PC.
It's not unlike why Apple puts so many video engines in their SoCs - they don't actually have much else to do with the transistor budget they can afford. Making single thread performance better isn't limited by transistor count anymore and software is bad at multithreading.
It seems like the NPUs are for very optimized models that do small tasks, like eye contact, background blur, autocorrect models, transcription, and OCR. In particular, on Windows, I assumed they were running the full screen OCR (and maybe embeddings for search) for the rewind feature.
AMD are doing some fantastic work at the moment, they just don't seem to be shouting about it. This one is particularly interesting https://lore.kernel.org/lkml/DM6PR12MB3993D5ECA50B27682AEBE1...
edit: not an FPGA. TIL. :'(
The teeny-tiny "NPU," which is actually an FPGA, is 10 TOPS.
Edit: I've been corrected, not an FPGA, just an IP block from Xilinx.
A̵n̵d̵ ̵i̵t̵'̵s̵ ̵a̵n̵ ̵F̵P̵G̵A̵ It's not an FPGA
nope it's not.
I don't know the details of your use case, but I work with low level hardware driven by GPIOs, and after a bit of investigation I concluded that having direct GPIO access in a modern PC was not necessary or desirable compared to the alternatives.
It just seems like this would be better in terms of firmware/security/bootloading because you would be more able to fix it if an exploit gets discovered, and it would be leaner because different operating systems can implement their own stuff (for example linux might not want pluton in-chip security, windows might not want coreboot or linux-based boot, bare metal applications can have much simpler boot).
You can see this in action when evaluating a CoreML model on a macOS machine. The ANE takes half as long as the GPU which takes half as long as the CPU (actual factors being model dependent)
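For anyone who wants to try this, a minimal coremltools sketch (the model path, input name, and shape are placeholders; actual speedups are model dependent):

import coremltools as ct
import numpy as np

x = {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}   # placeholder input name/shape

# Load the same model restricted to different compute units, then time predict() for each
for units in (ct.ComputeUnit.CPU_ONLY, ct.ComputeUnit.CPU_AND_GPU, ct.ComputeUnit.CPU_AND_NE):
    model = ct.models.MLModel("model.mlpackage", compute_units=units)
    model.predict(x)   # warm up once, then time repeated calls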
If you and I have the same calculator but I'm working on a set of problems and you're not, and we're both asked to do some math, it may take me longer to return it, even though the instantaneous performance of the math is the same.
Wouldn't it be odd for OP to present examples that are the opposite of their claim, just to get us thinking about "well the CPU is busy?"
Curious for their input.
The CPU is zero latency to get started, but it isn't specialized at any one task and isn't massively parallel, so it ends up taking even longer overall.
The NPU often has a simpler bytecode to do more complex things like matrix multiplication implemented in hardware, rather than having to instantiate a generic compute kernel on the GPU.
The Nvidia killer would be chips and memory that are affordable enough to run a good enough model on a personal device, like a smartphone.
I think the future of this tech, if the general populace buys into LLMs being useful enough to pay a small premium for the device, is personal models that by their nature provide privacy. The amount of personal information folks unload on ChatGPT and the like is astounding. AI virtual girlfriend apps frequently get fed the darkest kinks, vulnerable admissions, and maybe even incriminating conversations, according to Redditors that are addicted to these things. This is all given away to no-name companies that stand up apps on the app store.
Google even states that if you turn Gemini history on then they will be able to review anything you talk about.
For complex token prediction that requires a bigger model, the personal model could switch to consulting a cloud LLM, but privacy really needs to be ensured for consumers.
I don't believe we need cutting edge reasoning, or party trick LLMs for day to day personal assistance, chat, or information discovery.
I'm a bit suspicious of the article's specific conclusion, because it is Qualcomm's ONNX stack, and it may be out of date. Also, the Android folks loved talking shit about Qualcomm software engineering.
That being said, it's directionally correct, insofar as consumer hardware AI acceleration claims are near-universally BS unless you're A) writing 1P software or B) someone in the 1P really wants you to take advantage of it.
> but to be able to run small models with very little power usage
Yes.

But first, I should also say you probably don't want to be programming these things with python. I doubt you'll get good performance there, especially as the newness means optimizations haven't been ported well (even using things like TensorRT is not going to be as fast as writing it from scratch, and Nvidia is throwing a lot of manpower at that -- for good reason! But it sure as hell will get close and save you a lot of time writing).
They are, like you say, generally optimized for doing repeated similar tasks. That's also where I suspect some of the info gathered here is inaccurate.
(I have not used these NPU chips so what follows is more educated guesses, but I'll explain. Please correct me if I've made an error)
Second, I don't trust the timing here. I'm certain the CUDA timing (at the end) is incorrect, as the code as written wouldn't properly time it. Timing is surprisingly not easy. I suspect the advertised operations only count operations directly on the NPU, while OP would have included CPU operations in their NPU and GPU timings[0]. But the docs have benchmarking tools, so I suspect they're doing something similar. I'd be interested to know the variance and how this holds up after doing warmups. They do identify the IO as an issue, so I think this is evidence of it being a problem.

Third, their data is improperly formatted.
MATRIX_COUNT, MATRIX_A, MATRIX_B, MATRIX_K = (6, 1500, 1500, 256)
INPUT0_SHAPE = [1, MATRIX_COUNT, MATRIX_A, MATRIX_K]
INPUT1_SHAPE = [1, MATRIX_COUNT, MATRIX_K, MATRIX_B]
OUTPUT_SHAPE = [1, MATRIX_COUNT, MATRIX_A, MATRIX_B]
You want "channels last" here. I suspected this (do this in pytorch too!) and the docs they link confirm.1500 is also an odd choice and this could be cause for extra misses. I wonder how things would change with 1536, 2048, or even 256. Might (probably) even want to look smaller, since this might be a common preprocessing step. Your models are not processing full res images and if you're going to optimize architecture for models, you're going to use that shape information. Shape optimization is actually pretty important in ML[1]. I suspect this will be quite a large miss.
Fourth, a quick look at the docs and I think the setup is improper. Under "Model Workflow" they mention that they want data in 8 or 16 bit *float*. I'm not going to look too deep, but note that there are different types of floats (e.g. pytorch's bfloat is not the same as torch.half or torch.float16). Mixed precision is still a confusing subject and if you're hitting issues like these it is worth looking at. I very much suggest not just running a standard quantization procedure and calling it a day (start there! But don't end there unless it's "good enough", which doesn't seem too meaningful here.)
FWIW, I still do think these results are useful, but I think they need to be improved upon. This type of stuff is surprisingly complex, but a large amount of that is due to things being new and much of the details still being worked out. Remember that when you're comparing to things like CPU or GPU (especially CUDA), these have had hundreds of thousands of man hours put into them, and at least tens of thousands into the high level language libraries (i.e. python) that wrap them. I don't think these devices are ready for the average user where you can just work with them from your favorite language's abstraction level, but they're pretty useful if you're willing to work close to the metal.
[0] I don't know what the timing is for this, but I do this in pytorch a lot so here's the boilerplate
times = torch.empty(rounds)
# Dummy data is fine here; any input of the right shape works
input_data = torch.randn((batch_size, *data_shape), device="cuda")
# Do some warmups first. There are background actions (IO, allocation) we don't want to measure.
# You can remove the warmup and look at the distribution of times if you want to see this.
# Make sure you save the output to a variable (a write) or else this won't do anything.
for _ in range(warmup):
    data = model(input_data)
for i in range(rounds):
    starter = torch.cuda.Event(enable_timing=True)
    ender = torch.cuda.Event(enable_timing=True)
    starter.record()
    data = model(input_data)
    ender.record()
    torch.cuda.synchronize()  # wait for the GPU to finish before reading the timer
    times[i] = starter.elapsed_time(ender) / 1000  # elapsed_time is in ms
total_time = times.sum()
The reason we do it this way is that if we just wrap the model call with a timer, we're looking at CPU time, but the GPU operations are asynchronous, so you could get deceptively fast (or slow) times.

[1] https://www.thonking.ai/p/what-shapes-do-matrix-multiplicati...
I just gave you a use case, mine in particular uses it for background blur and eye contact filters with the webcam and uses essentially no power to do it. If I do the same filters with nvidia broadcast, the power usage is dramatically higher.
Eye contact filters seem like a horrible thing, autocorrect won't work better than a dictionary with a tiny model and I doubt these things can come even close to running whisper for decent voice transcription. Background blur alright, but that's kind of stretching it. I always figured Zoom/Teams do these things serverside anyway.
And alright, if it's not MS making them do it, then they're just chasing the fad themselves while also shipping subpar hardware. Not sure if that makes it better.
https://github.com/ggerganov/whisper.cpp/pull/566
"The performance gain is more than x3 compared to 8-thread CPU"
And this is on the 3 year old M1 Pro
Whisper runs almost realtime on a single core of my very old CPU. I'd be very surprised if it can't fit in an NPU.
Reason it doesn't seem that way is that the CPU is so fast we often bottleneck on I/O first. However, for compute-workloads like inference, it really does matter.
gcc -O0 vs -O2 is a HUGE performance gain. We don't really have anything to auto-magically do this for models yet. Compilers are intimately familiar with x86.
Having cache friendly memory access patterns is perhaps the biggest one. Though automatic vectorization is also still not quite there, so in cases where there's a severe bottleneck, doing that manually may still considerably improve performance, if the workload is vectorizable.
When running int8 matmul using ONNX, performance is ~0.6 TOPS.
While it might be possible, it would not surprise me if a number of possible optimizations had not made it into ONNX. It appears that Qualcomm does not give direct access to the NPU, and users are expected to use frameworks to convert models over to it; in my experience conversion tools generally suck and leave a lot of optimizations on the table. It could be less that NPUs suck and more that the conversion tools suck. I'll wait until I get direct access - I don't trust conversion tools.
My view of NPUs is that they're great for tiny ML models and very fast function approximations which is my intended use case. While LLMs are the new hotness there are huge number of specialized tasks that small models are really useful for.
Can you give some examples? Preferably examples that will run continuously enough for even a small model to stay in cache, and are valuable enough to a significant number of users to justify that cache footprint?
I am not saying there aren't any, but I also honestly don't know what they are and would like to.
NNs can be used as general function approximators, so any function which can be approximated is a candidate for using a NN in its place. I have a very complex trig function that produces a high dimensional smooth manifold, which I know will only be used within a narrow range of inputs, and I can sacrifice some accuracy for speed. My inner loops have inner loops which have inner loops with inner loops. So when you're 4+ inner loops deep, speed becomes essential. I can sweep the entire input domain to make sure the error always stays within limits.
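A minimal sketch of that idea (the trig expression, network size, and input range here are all made up, not my actual function):

import torch
import torch.nn as nn

def f(x):                                   # stand-in for the complex trig function
    return torch.sin(3 * x) * torch.cos(x) + 0.1 * torch.sin(7 * x)

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(),
                    nn.Linear(32, 32), nn.Tanh(),
                    nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

x = torch.linspace(0.0, 1.0, 4096).unsqueeze(1)   # the narrow input range
y = f(x)
for _ in range(5000):
    opt.zero_grad()
    loss = ((net(x) - y) ** 2).mean()
    loss.backward()
    opt.step()

with torch.no_grad():                              # sweep the domain, check worst-case error
    xs = torch.linspace(0.0, 1.0, 100_000).unsqueeze(1)
    print("max abs error:", (net(xs) - f(xs)).abs().max().item())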
If you're doing things such as counting instructions, intrinics, inline assembly, bit-twiddling, fast math, polynomial approximations, LUTs, fixed point math, etc. you could probably add NNs to your toolkit.
Stockfish uses a 'small' 82K parameter neural net of 3 dense integer only layers (https://news.ycombinator.com/item?id=27734517). I think Stockfish performance would be a really good candidate for testing NPUs as there is a time / accuracy tradeoff.
Suggestions, predictive text, smart image search, automatic image classification, text selection in images, image processing. These don't run continuously, but I think they are valuable to a lot of users. The predictive text is quite good, and it's very nice to be able to search for vague terms like "license plate" and get images in my camera roll. Plus, selecting text and copying it from images is great.
For desktop usecases, I'm not sure.
I would hope the NPU on Elite X is easier to get to considering the whole copilot+ thing, but I bring this up mainly to make the point that I doubt it’s just as easy as “run general purpose model, expect it to magically teleport onto the NPU”.
E.g. Llama 3.1 8B which is one of the smaller models has matrix multiplications like `(batch, 14336, 4096) x (batch, 4096, 14336)`.
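For scale, one of those matmuls at batch size 1 is already a big chunk of work (rough sketch; note the tensors alone are over a gigabyte):

import torch

a = torch.randn(1, 14336, 4096)
b = torch.randn(1, 4096, 14336)
c = torch.bmm(a, b)                     # output is [1, 14336, 14336], ~0.8 GB in fp32
flops = 2 * 14336 * 4096 * 14336        # counting each multiply-add as 2 ops
print(c.shape, f"~{flops / 1e12:.1f} TFLOPs per call")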
I just don't think this benchmark is realistic enough.
The workload is relatively small, which results in underutilization of the hardware capacity due to the overhead associated with input/output quantization-dequantization and NCHW-NHWC mapping. Padding the weights and inputs to be a multiple of 64 would also help performance.
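A rough sketch of the padding idea (hypothetical helper; zero-pad the trailing dims up to the next multiple of 64):

import numpy as np

def pad_to_multiple(a, m=64):
    # zero-pad the last two dims up to the next multiple of m
    pad = [(0, 0)] * (a.ndim - 2) + [(0, (-a.shape[-2]) % m), (0, (-a.shape[-1]) % m)]
    return np.pad(a, pad)

x = np.zeros((1, 6, 1500, 256), dtype=np.float32)
print(pad_to_multiple(x).shape)   # (1, 6, 1536, 256)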
Edit: Link to the profiling graph https://imgur.com/a/2OKR93e
Estimated HVX compute capability: 4 * 2 * 1.43 * 1024 / 8 = 1.46 TOPS in int8, in which
4 is the number of vector cores
2 is the number of operations per cycle
1.43 GHz is the HVX frequency
1024 bit is the vector register width
8 bit is the precision
So ROCm already sucks whereas QNN sucks even harder!
The conclusion here is that NVIDIA knows how to make software that just works. AMD makes software that might work. Qualcomm, however, doesn't know shit about making useful software.
The dev experience is just another level of disaster with Qualcomm. Their tools and APIs return absolutely zero useful information about what error you are getting, just an error code that you can grep from the include headers in their SDK. To debug an error code, you need strace to get the internal error string on the device. Their profiler merely gives you a trace that cannot be associated back to the original computing logic, with very high stddev on the runtime. Their docs website is not indexed by the MF search engine, not to say LLMs, so if you have any question, good luck!
So if you don't have a reason to use QNN, just don't use it (and any other NPU you name it).
Back to the benchmark script. There are a lot of flaws, as far as I can see.
1. The session is not warmed up and the iteration count is too small.
2. The onnx graph is too small; I suspect the onnxruntime overhead cannot be ignored in this case. Try stacking more gemms in the graph instead of naively increasing the iteration count.
3. The "htp_performance_mode": "sustained_high_performance" might give lower perf compared to "burst" mode.
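For reference, here's roughly what a warmed-up, burst-mode run could look like with onnxruntime's QNN EP (the provider option names reflect my reading of the ORT docs, and the input names/dtypes are placeholders, so treat them as assumptions):

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    "model.onnx",
    providers=[("QNNExecutionProvider", {
        "backend_path": "QnnHtp.dll",
        "htp_performance_mode": "burst",   # instead of sustained_high_performance
    })],
)
feeds = {
    "input0": np.zeros((1, 6, 1500, 256), dtype=np.float32),
    "input1": np.zeros((1, 6, 256, 1500), dtype=np.float32),
}
for _ in range(20):                        # warm the session up before timing anything
    sess.run(None, feeds)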
A more reliable way to benchmark might be to just dump the context binary[1] and context inputs[2] and run them with qnn-net-run, to get rid of the onnxruntime overhead.
[1]: https://github.com/cloudhan/misc-nn-test-driver/blob/main/qn... [2]: https://github.com/cloudhan/misc-nn-test-driver/blob/main/qn...
> it's not enough time to get new silicon designs specifically for <blahblah>
Where blahblah stands for a model architecture that has caused a paradigm shift.
When you need a new silicon for a new model, you are already losing.
Because this isn't about NPUs. It's about a specific NPU, on a specific benchmark, with a specific set of libraries and frameworks. So basically, this proves nothing.
This all depends on model architecture, conversion and tuning. Apple provides good tooling in Xcode for benchmarking models, down to the execution time of single operators and where each operator got executed (CPU, GPU, NPU) in case it couldn't be executed on the NPU and had to fall back to CPU/GPU. Sometimes the model has to be tweaked to use a slightly different operator if one isn't available on the NPU. On top of that, ML frameworks/runtimes such as ONNX/PyTorch/TensorFlowLite sometimes don't implement all operators in CoreML or MPS.
Given that, you should take his NPU results with a truckload of salt.
To me that suggests that the test is wrong.
I could see intel massaging results, but that far off seems incredibly improbable
For example, Apple's M3 neural engine is a mere 18 TOPS, but that's FP16.
So Windows has the bigger number, but it's not an apples-to-apples comparison.
Did the author test int8 performance?
This benchmark is horribly flawed in many ways, and was so evidently useless that I'm surprised that they still decided to "publish" this. When your test gets 1% of the published performance, it's a good indication that things aren't being done correctly.
I know Apple's Neural Engine is used to power Face ID and the facial recognition stuff in Photos, among other things.
Print Screen takes images on demand, Recall does so effectively at random. This means Recall could inadvertently screenshot and store information you didn't intend to keep a record of (To give an extreme example: Imagine an abuser uses Recall to discover their spouse browsing online domestic violence resources).
Also, it's a security risk which has already been exploited. Sure, MS fixed it, but can you be certain it won't be exploited again some time in the future?
Sure, just post the source code and I'll point out where it does so, I somehow misplaced my copy. /s
The core problem here is trust, and over the last several years Microsoft has burned a hell of a lot of theirs with power-users of Windows. Even their most strident public promises of Recall being "opt-in" and "on-device only" will--paradoxically--only be kept as long as enough people remain suspicious.
Glance away and MS goes back to their old games, pushing a mandatory "security update" which resets or entirely removes your privacy settings and adds new "telemetry" streams which you cannot inspect.
> We've seen similar performance results to those shown here using the Qualcomm QNN SDK directly.
Why not include those results?
It just gets so hard to take this industry seriously.
So, OK, yeah, I concede that the NPU may have even worse access to memory than the CPU, but the bottom line is that neither one of them has anything close to what it needs to actually deliver anything like the marketing headline performance number on any realistic workload.
I bet a lot of people have bought those things after seeing "45 TOPS", thinking that they'd be able to usefully run transformers the size of main memory, and that's not happening on CPU or NPU.
So Microsoft takes some of the criticisms on Twitter and gets them in before shipping. Free appsec, nice.
Now, Microsoft doesn't care about your benchmarks, dude. Grandma isn't gonna notice these workloads finishing faster on a different compiled program utilizing different chips. Her last PC was EOL'd 10 years ago; it certainly can't keep up with this new AI laptop.
Either way, these are some of the first personal computers to have NPUs. They will improve. CPUs have had 20 years of optimization; this is literally the first try for some of these companies.
So what this means is that if NPUs are anywhere close to CPUs in the benchmarks, NPUs are going to blow past CPUs very soon, because CPUs don't have much more weight to shed whereas NPUs are just getting started.
"Avoid", "Nothing works", "Worthless for any AI use"
In retrospect, the fact that Intel and AMD's stock prices both closed slightly up when Microsoft announced the Snapdragon X on Windows 11 was a dead giveaway that the major players knew behind the scenes that it was being released seriously under baked.
Such a spec should ideally be accompanied by code demonstrating or approximating the claimed performance. I can't imagine a sports car advertising a 0-100km/h spec of 2.0 seconds where a user is unable to get below 5 seconds.