There are 65,536 registers per SM, not per thread block. You can indirectly control that by making your block consume the whole SM, but this presents its own problems.
NVIDIA hardware limits a thread block to 1024 threads (with 2048 threads per SM) and to 48 KB of shared memory (with 64 KB per SM on many parts). So if one thread block consumes all of that, or close to it, you end up with one thread block per SM. You usually don't want that, because it lowers your occupancy.

Additionally, if the kernel you're running is not compute-bound and does not need all the registers or shared memory allocated to it, having fewer blocks on the SM can leave compute resources idle. GPUs are designed to thrive on parallelism, and limiting the number of active blocks can underutilize the SM's cores, leading to poor performance. Finally, if each thread block occupies an entire SM, you limit the scalability of your kernel to the number of SMs on the GPU. For example, if your GPU has 60 SMs and each block uses one SM, you can only run 60 blocks in parallel, even if the problem you're solving could benefit from more parallelism. This can reduce the efficiency of the GPU for very large problem sizes.
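If you want to see where a given configuration actually lands, the occupancy API will tell you directly. A minimal sketch, assuming a hypothetical kernel called myKernel:

    #include <cstdio>

    __global__ void myKernel(float *out) {   // hypothetical kernel, just for illustration
        out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
    }

    int main() {
        int blockSize = 1024;                 // a "max out the SM" configuration
        int numBlocksPerSm = 0;
        // Ask the runtime how many blocks of this size can be resident per SM,
        // given the kernel's register and shared-memory usage.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, myKernel, blockSize, 0);
        printf("Blocks of %d threads resident per SM: %d\n", blockSize, numBlocksPerSm);

        // Or let the runtime suggest a block size that maximizes occupancy.
        int minGridSize = 0, suggestedBlockSize = 0;
        cudaOccupancyMaxPotentialBlockSize(&minGridSize, &suggestedBlockSize, myKernel, 0, 0);
        printf("Suggested block size for best occupancy: %d\n", suggestedBlockSize);
        return 0;
    }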
The claim at the end of your post, suggesting that >1 block per SM is always better than 1 block per SM, isn’t strictly true either. In the example you gave, you’re limited to 60 blocks because the thread count of each block is too high. You could, for example, cut the blocks in half to yield 120 blocks. But each block has half as many threads in it, so you don’t automatically get any occupancy benefit by doing so.
When planning out the geometry of a CUDA thread grid, there are inherent tradeoffs between SM thread and/or warp scheduler limits, shared memory usage, register usage, and overall SM count, and those tradeoffs can be counterintuitive if you follow (admittedly, NVIDIA’s official) guidance that maximizing the thread count leads to optimal performance.
Another thing to maybe mention, re: "if your GPU has 60 SMs, and each block uses one SM, you can only run 60 blocks in parallel"… CUDA tends to want at least 3 or 4 blocks per SM so it can round-robin them as soon as one stalls on a memory load or a sync or something else. You might only make forward progress on 60 separate blocks in any given cycle, but it's quite important that you have, say, 240 blocks resident "in parallel" so you can benefit from latency hiding. This is where a lot of additional performance comes from: doing work on one block while another is momentarily stuck.
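A common way to get that kind of oversubscription (a sketch, not anyone's actual code here) is to size the grid as a small multiple of the SM count and use a grid-stride loop, so the grid size is decoupled from the problem size:

    __global__ void scale(float *x, int n, float a) {
        // Grid-stride loop: each thread handles multiple elements, so the
        // grid size can be chosen for occupancy rather than problem size.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
            x[i] *= a;
    }

    void launch(float *d_x, int n) {
        int device = 0, numSMs = 0;
        cudaGetDevice(&device);
        cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
        int blocksPerSM = 4;   // several resident blocks per SM to hide stalls
        scale<<<blocksPerSM * numSMs, 256>>>(d_x, n, 2.0f);
    }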
There might be such a thing as cooperative kernels that communicate through smem, but you’d definitely need syncs for that. I don’t know if pre-populating smem is a thing that exists, but if it does then you’ll need kernel level or device level sync, and furthermore you’d be limited to 1 thread per CUDA core. I’m not sure either of those things actually exist, I’m just hedging, but if so they sound complicated and rare. Anyway, the point is that I think if we’re talking about shared memory, it’s safe to assume there must be some synchronizing.
I also assumed by "desynced" you meant threads would be doing scattered random-access memory reads, since the alternative offered was homogeneous workloads. That's why I assumed memory perf might be low or limiting due to low cache hit rates and/or low coalescing. In the case of shared memory, even if you have syncs, random-access reads can lead to heavy bank conflicts. If what you meant instead is a very ordered access pattern that simply doesn't need any synchronization, then there's no problem and perf can be quite good. In any case, it's a good idea to minimize memory access and strive to be compute bound rather than memory bound. Memory tends to be the bottleneck most of the time. I've only seen truly optimized, compute-bound kernels a small handful of times.
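For the shared-memory case, here's roughly what the synchronization and bank-conflict avoidance look like. A minimal sketch of a tiled transpose, assuming a square matrix whose side is a multiple of the tile size (the +1 padding is the standard trick to avoid bank conflicts):

    #define TILE 32

    // Launch with dim3 block(TILE, TILE); assumes a width x width matrix,
    // width a multiple of TILE.
    __global__ void transposeTile(const float *in, float *out, int width) {
        // Pad the inner dimension by 1 so column reads don't all hit the
        // same shared-memory bank (32 banks of 4-byte words).
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

        __syncthreads();   // everyone's element must be in smem before we read across rows

        int tx = blockIdx.y * TILE + threadIdx.x;
        int ty = blockIdx.x * TILE + threadIdx.y;
        out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];
    }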
https://www.nvidia.com/content/gtc-2010/pdfs/2238_gtc2010.pd...
I would also think in terms of latency hiding rather than just work parallelism (though latency hiding on GPUs is largely achieved through parallelism). This is the reason GPUs have massive register files: unlike modern multi-core CPUs, we omit latency-reducing hardware (e.g., speculative execution, large caches, out-of-order execution/register renaming, etc.), and in order to fill pipelines we need to have many instructions outstanding, which means the operands for those pending instructions need to stay live for a lot longer, hence the massive register file.
The first thing to consider is register pressure. Increasing the number of registers per thread to optimize for ILP can lead to register spilling when the register file is exhausted, which drastically reduces performance. This becomes more pronounced as problem sizes scale up (the talk's examples avoid that problem). Many real-world applications, especially compute-bound kernels, need high occupancy to fully utilize the GPU's resources. Focusing too much on minimizing thread counts can lead to underutilization of the SM's parallel execution units. A standard example is inference engines.
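If you do want to trade occupancy against ILP deliberately, the usual levers are __launch_bounds__ or nvcc's -maxrregcount. A sketch with a hypothetical kernel:

    // Promise the compiler a max block size of 256 threads and ask for at
    // least 4 resident blocks per SM; it will then cap registers per thread
    // (potentially spilling to local memory if it can't fit).
    __global__ void __launch_bounds__(256, 4)
    myBoundedKernel(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * in[i];
    }
    // The same tradeoff can be made globally with nvcc's -maxrregcount=N flag,
    // and inspected with -Xptxas -v, which prints registers and spill bytes per kernel.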
Also, while low-occupancy optimizations can be effective for specific workloads (e.g., memory-bound kernels), designing code that depends on such strategies as a general practice can result in solutions that are less adaptable and robust across a wide variety of applications.
I believe there is a balance to strike here. Low occupancy can work for specific cases, while higher occupancy often provides better scalability and overall performance for more general use cases. But you have to test for that while you are optimizing your code; there is no general rule of thumb to follow here.
Kernels should almost never use local memory (except in arcane cases where you are using recursion and thus a call stack that will spill where an alternative non-recursive formulation would not really work).
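As a toy illustration (hypothetical functions, not from any real kernel): the recursive form needs a per-thread call stack, which lives in local memory, while the iterative rewrite keeps its state in registers:

    // Recursive form: each call frame typically lands on the per-thread
    // stack, which is backed by local memory (i.e., spilled to DRAM/L2).
    __device__ int sumToRecursive(int n) {
        return (n <= 0) ? 0 : n + sumToRecursive(n - 1);
    }

    // Iterative form: same result, loop state stays in registers.
    __device__ int sumToIterative(int n) {
        int acc = 0;
        for (int i = 1; i <= n; ++i) acc += i;
        return acc;
    }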
> Many real-world applications, especially compute-bound kernels, need high occupancy to fully utilize the GPU’s resources
> while low-occupancy optimizations can be effective for specific workloads (e.g, memory-bound kernels)
I think this is almost exactly backwards: performant, high-compute-intensity kernels (on a flop-per-byte-of-memory-traffic basis) tend to uniformly have low occupancy; look at an ncu trace of many kernels in cuBLAS or cuDNN, for instance. You need a large working set of operands in registers or in smem to feed the scalar arithmetic or especially the MMA units quickly enough, since gmem/L2 bandwidth alone is not sufficient to achieve peak performance in many cases. The only thing you need to ensure is that you are using all SMs (and thus all available scalar arithmetic or MMA units), which does not by itself imply high occupancy (e.g., a kernel with 1 CTA per SM).
The simplest way to write a memory-bound kernel is to spawn a bunch of threads and perform loads/stores from them, and it isn't too hard to get close to peak this way. But even then, depending on the warp scheduler to rotate other warps in to issue more loads/stores is inferior to unrolling loops, and with such unrolling you can also get close to peak memory bandwidth while using not that many SMs, so even these kernels need not have high occupancy.
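For what that unrolling looks like, a sketch of a copy kernel, assuming 16-byte-aligned pointers and a length given in float4 elements:

    __global__ void copyUnrolled(const float4 *__restrict__ in,
                                 float4 *__restrict__ out, int n4) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;
        // Manually unrolled by 4: each thread issues four independent 16-byte
        // loads before any of the stores, keeping more memory transactions in
        // flight per warp (more ILP, so fewer resident warps are needed).
        for (; i + 3 * stride < n4; i += 4 * stride) {
            float4 a = in[i];
            float4 b = in[i + stride];
            float4 c = in[i + 2 * stride];
            float4 d = in[i + 3 * stride];
            out[i] = a;
            out[i + stride] = b;
            out[i + 2 * stride] = c;
            out[i + 3 * stride] = d;
        }
        for (; i < n4; i += stride)   // remainder
            out[i] = in[i];
    }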
(I've been Nvidia GPU programming for around 11 years and wrote the original pytorch GPU backend/tensor library, the Faiss GPU library, and contributed some stuff to cuDNN in its early days such as FFT convolution.)
I can't help but feel that with CUDA we're getting new constraints (32 threads in a warp, what?), which are begging to be unified at some point.
SIMD/SIMT probably isn’t ever going away, and vector computers have been around since before segmented memory; the 32 threads in a CUDA warp is the source of its performance superpower, and the reason we can even fit all the transistors for 20k simultaneous adds & multiplies, among other things, on the die. This is conceptually different from your analogy too, the segmented memory was a constraint designed to get around pointer size limits, but 32 threads/warp isn’t getting us around any limits, it’s just a design that provides high performance if you can organize your threads to all do the same thing at the same time.
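That lockstep design is also what makes warp-level primitives cheap. A minimal sketch of a 32-lane sum using shuffles, assuming a full warp is active:

    __device__ float warpSum(float v) {
        // Tree reduction across the 32 lanes of a warp, register to register:
        // no shared memory and no __syncthreads() needed, because the warp
        // executes in lockstep. After 5 steps lane 0 holds the sum of all 32.
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);
        return v;   // complete sum in lane 0; other lanes hold partials
    }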
There were also attempts like OpenCL, which the industry rejected early on because they thought they'd never need a CUDA alternative. Nvidia's success is mostly built on the ignorance of their competition - if Nvidia had been allowed to buy ARM, they could have guaranteed the two specs never overlap.
Apple's M4 as described in its iPad debut has about the same chip area for CPU and GPU/vector functions. It's as much a vector machine as not.
https://www.apple.com/newsroom/2024/05/apple-introduces-m4-c...
We're talking about vector operations. CUDA is not a GPU but a library of hardware-accelerated functions, not necessarily different from OpenCL or even NEON for ARM. You can reimplement everything CUDA does on a CPU, and if you're using a modern CPU you can vectorize it too. x86 handles this well, because it's still got dedicated logic that keeps pace with the SIMD throughput an integrated GPU might offer. ARM leaves out the operations entirely (which is smart for efficiency), and therefore either relies on someone porting CUDA code to an ARM GPU shader (fat chance) or offloading to a remote GPU. It's why ARM is excellent for sustained simple ops but collapses when you benchmark it bruteforcing AI or translating AVX to NEON. SIMD is too much for a base-spec ARM core.
> Maybe we could assume a platonic ideal merged chip, a CPU that acts like a GPU, but there's more differences between those two things than an instruction set for vector ops.
Xeon Phi or Itanium flavored?
I certainly don't grok it yet, so I might be wrong, but it still seems crystal clear there's a little motte/bailey going on with "blame ARM for CUDA" vs. "ARM is shitty at SIMD vs. x86".
That aside, I'm building something that relies on llama.cpp for inference on every platform.
In this scenario, Android is de facto "ARM" to me.
The Vulkan backend either doesn't support Android, or it does and the 1-2 people who got it running see absurdly worse performance (something something shaders, as far as I understand it).
iOS is de facto "not ARM" to me because it runs on the GPU.
I think llama.cpp isn't a great scenario for me to learn at the level you understand it, since it's tied to running a very particular kind of thing.
That aside, it was remarkable to me that my 13th gen Intel i5 Framework laptop gets 2 tokens/sec on both iGPU and CPU. And IIUC, your comment explains that, in that "x86...[has] dedicated logic that keeps pace with SIMD...on [an integrated GPU]"
That aside, my Pixel Fold (read: 2022 mid-range Android CPU, should certainly be slower than 2023 Intel mid-upper range) kicks it around the block. 7 tkns/sec on CPU. 14 tkns/sec with NEON-layout.
Now, that aside, SVE was shown to double that again, indicating there's significant headroom on NEON. (https://github.com/ggerganov/llama.cpp/pull/9290) (I have ~0 idea what this is other than 'moar SIMD for ARM', for all I know, it's Amazon Graviton specific)
Our phone camera pipelines are doing lots of numerical compute already.
It's just a hard problem. You can code ignorantly with high level libraries but you're leaving 2x to 10x performance on the table.
> My mental model [for GPU threads] is that you’ve got a bunch of container ships that can travel at 10% of the speed of light. You’re using them to ship goods around the world. They’re very fast so most of the work is in setting up your harbors so that you can load and unload these container-ships in fractions of a second so that it can sail to do the next thing. It’s not easy to feed these beasts, but if you do it right you can do huge chunks of work in almost no time.
For example, LLM AI inference on a desktop (where you don't have a dozen concurrent sessions from multiple users) is guaranteed to be memory bound, fetching gigabytes of model tensors for each generated token. For use cases like that, specialized tensor cores deliver about the same performance as well-written compute shaders running on general-purpose GPU cores.
However, AVX512 is way slower than GPUs, because modern GPUs have memory with very high bandwidth. In my desktop computer, the system memory is dual-channel DDR5 delivering 75 GB/s, while the VRAM on the discrete GPU delivers 670 GB/s.
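To put rough numbers on that (assuming, hypothetically, ~14 GB of weights that have to be streamed once per generated token): 670 GB/s of VRAM bandwidth caps generation at roughly 670 / 14 ≈ 48 tokens/s, while 75 GB/s of dual-channel DDR5 caps it at roughly 75 / 14 ≈ 5 tokens/s, no matter how much compute either side has.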
Also, the 9950X has 32 threads, but those are SMT threads, so it only has 16 physical cores; the correct scaling factor is 16 cores * 16 SIMD lanes. Anyway, the final number is 8.678 FP32 TFLOPS.
The RTX 4090 has 82.58 FP32 TFLOPS according to Nvidia, but it also costs far more than the 9950X ($1,600 vs. $650), so I find this comparison rather odd.
So it costs 2.46x as much and delivers 9.5x the perf.
If you normalize for cost, the perf advantage is about 3.8x, which is roughly the same number Intel reported years ago when they debunked the whole "GPUs are 100x better" nonsense.
Anyway, I really hate the CUDA terminology where they refer to SIMD lanes as "threads".
There are also a lot of other things to consider where either the CPU or GPU has an advantage, such as:
GPU advantages:
Hardware sin/cos support (with Nvidia at least; see the sketch after this list)
abs/saturate are often just modifiers
scaling by small powers of 2 is often free
16-bit floats are fully supported
CPU advantages:
doubles are full speed, and you can interleave them with floats if you only need them for a few calculations
access to a wide variety of integer sizes and bit-manipulation functions; GPUs have some of this but nothing nearly as broad
lower level programming model
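As promised above, a small sketch of the GPU-side conveniences (fast sin/cos via the special-function units, abs/saturate as modifiers, first-class fp16); exact codegen of course depends on the architecture:

    __global__ void shade(const float *angle, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float s, c;
        __sincosf(angle[i], &s, &c);      // fast hardware sin/cos approximation (SFU)
        float v = fabsf(s) * 0.25f + c;   // abs and *0.25 typically fold into modifiers
        out[i] = __saturatef(v);          // clamp to [0,1] as an instruction modifier
        // half precision is first-class too (with <cuda_fp16.h>), e.g.:
        // __half h = __float2half(v);
    }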
> lower level programming model
Do you mean how SASS (and the AMD equivalent) is not properly documented and is tool-less, as opposed to the assembly languages of different CPU architectures? Because otherwise, remember that one can write PTX code, and that is pretty low-level.
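For reference, inline PTX in CUDA C++ looks like this; a toy add, just to show the mechanism:

    __device__ int addPtx(int a, int b) {
        int r;
        // Inline PTX: the virtual ISA nvcc emits before ptxas lowers it to SASS.
        asm("add.s32 %0, %1, %2;" : "=r"(r) : "r"(a), "r"(b));
        return r;
    }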
It's interesting from the perspective of maintenance too. You can bet most constants like warp sizes will change, so you get into things like having profiles, autotuners, or not sweating the small stuff.
We went more extreme, and nowadays focus several layers up: by accepting the (high!) constant overheads of tools like RAPIDS cuDF, we get in exchange the ability to easily crank out code with good saturation on the newest GPUs that any data scientist can edit and extend. Likewise, they just need to understand basics like data movement and columnar analytics data reps to make GPU pipelines. We have ~1 CUDA kernel left and many years of higher-level code.
As an example, this is one of the core methods of our new graph query language GFQL (think cypher on pandas/spark, w optional GPU runtime), and it gets Graph500 level performance on cheapo GPUs just by being data parallel with high saturation per step: https://github.com/graphistry/pygraphistry/blob/master/graph... . Despite ping-ponging a ton because cudf doesn't (yet) coalesce GPU kernel calls, V1 competes surprisingly high, and is easy to maintain & extend.
(The CPU algos for doing this work great on CPUs but are woeful on GPUs.)
For the ‘easy’ way we do in gfql, which is basically vector ops on bulk wavefronts, we can do massive cypher traversals like you're asking about, e.g., 100M edges touched in a single result substep, and on a tiny GPU. There are other such bulk patterns we want to add, such as Pregel style, which open up other algorithms here. In practice we can often just call cudf/cugraph as building blocks, so we haven't had the pressure to do so yet.
The weak spot I find is more like small OLTP lookups. Ex: imagine a taxi-routing traffic service pinging for one car to do a couple of hops out, where you just want a KV store in cheap RAM. But if you are batching those queries, like in a busy city, and going deeper on them, it gets more interesting.
none of what was mentioned in this blog post is news if you've ever had more than 2 hours of a CUDA course...
compare a 500W GPU to all the cores of a 500W CPU, please. I'm not expecting the CPU (say, a 192-core AMD that does fast AVX512) to beat the GPU on all data-parallel workloads, but it won't be the silly sort of graphs shown in this blog.
or compare one SM to one CPU core - that has merit as well.
best yet, we're finally getting some CPUs (well, APUs...) with in-package RAM. that makes the comparison more interesting as well.
Comparing SMs to CPU cores 1:1 also makes no sense. They don’t do the same things.
For FP64 computations, the reverse happens: a 9950X buries a 4090, despite the latter having a 3-times higher price and a 2.5-times higher power consumption.
For FP64 operations, the 4090 and 9950X can do a similar number of operations per clock cycle (288 vs. 256), but the 9950X can do them at roughly double the clock frequency, and it is easier to reach a high fraction of the theoretical maximum throughput on a 9950X than on a 4090.
FP8 is a more recent quantization format and AFAIK no CPU implements it.
I do not know what the throughput of these instructions is on Zen 5. It must be higher than for older CPUs, but it must be lower than for the Intel Xeon models that support AMX (which are much more expensive, so despite having higher absolute performance for inference, they might have lower performance per dollar), and obviously lower than for the tensor cores of a big NVIDIA GPU.
Nevertheless, for models that do not fit inside the memory of a GPU, inference on a Zen 5 CPU may become competitive.