I like this writeup, as it summarizes my journey optimizing some CUDA code I wrote for an LHC experiment trigger. But I have a few comments on some details.

There are 65,536 registers per SM, not per thread block. You can indirectly control that by making your block occupy the whole SM, but this presents its own problems.

NVIDIA hardware limits the maximum thread count to 1024 per block (2048 per SM) and shared memory to 48 KB per block (64 KB per SM). So if you consume all of that, or close to the maximum, in one thread block, you are running one thread block per SM. You don't usually want to do that, because it lowers your occupancy. Additionally, if the kernel you're running is not compute-bound and does not need all the registers or shared memory allocated to it, having fewer blocks on the SM can leave compute resources idle. GPUs are designed to thrive on parallelism, and limiting the number of active blocks can cause underutilization of the SM's cores, leading to poor performance. Finally, if each thread block occupies an entire SM, you limit the scalability of your kernel to the number of SMs on the GPU. For example, if your GPU has 60 SMs, and each block uses one SM, you can only run 60 blocks in parallel, even if the problem you're solving could benefit from more parallelism. This can reduce the efficiency of the GPU for very large problem sizes.
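A quick way to see where those limits bite is to ask the runtime directly; here's a minimal sketch (the kernel and the 256-thread block size are made up) that queries the per-SM resources and how many blocks of that shape can actually be resident on one SM:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void myKernel(float *out) {   // hypothetical kernel
        out[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
    }

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("regs/SM: %d, smem/SM: %zu bytes, max threads/SM: %d\n",
               prop.regsPerMultiprocessor, prop.sharedMemPerMultiprocessor,
               prop.maxThreadsPerMultiProcessor);

        int blockSize = 256;
        int blocksPerSM = 0;
        // How many 256-thread blocks (with 0 bytes of dynamic smem) fit on
        // one SM, given this kernel's register/smem usage?
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                      blockSize, 0);
        printf("resident blocks/SM: %d -> occupancy: %.0f%%\n", blocksPerSM,
               100.0 * blocksPerSM * blockSize / prop.maxThreadsPerMultiProcessor);
        return 0;
    }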

For devices with compute capability of 7.0 or greater (anything from the Volta series on), a single thread block can address up to the entire shared memory size of the SM; the 48 kB limit that older hardware had is no more. Most contemporary applications are going to be running on hardware that doesn’t have the shared memory limit you mentioned.
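One wrinkle: anything above the legacy 48 KB has to be requested explicitly for dynamic shared memory. A rough sketch of the opt-in (the kernel and the 96 KB figure are placeholders; the exact per-block ceiling is architecture-dependent):

    #include <cuda_runtime.h>

    __global__ void bigSmemKernel(float *out) {    // hypothetical kernel
        extern __shared__ float tile[];            // dynamic shared memory
        tile[threadIdx.x] = threadIdx.x;
        __syncthreads();
        out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
    }

    int main() {
        float *out = nullptr;
        cudaMalloc(&out, 60 * 256 * sizeof(float));

        int smemBytes = 96 * 1024;  // above 48 KB: needs the opt-in below on CC 7.0+
        cudaFuncSetAttribute(bigSmemKernel,
                             cudaFuncAttributeMaxDynamicSharedMemorySize,
                             smemBytes);
        bigSmemKernel<<<60, 256, smemBytes>>>(out);
        cudaDeviceSynchronize();
        cudaFree(out);
        return 0;
    }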

The claim at the end of your post, suggesting that >1 block per SM is always better than 1 block per SM, isn’t strictly true either. In the example you gave, you’re limited to 60 blocks because the thread count of each block is too high. You could, for example, cut the blocks in half to yield 120 blocks. But each block has half as many threads in it, so you don’t automatically get any occupancy benefit by doing so.

When planning out the geometry of a CUDA thread grid, there are inherent tradeoffs between SM thread and/or warp scheduler limits, shared memory usage, register usage, and overall SM count, and those tradeoffs can be counterintuitive if you follow (admittedly, NVIDIA’s official) guidance that maximizing the thread count leads to optimal performance.
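When starting from scratch, you can also just ask the runtime for a first guess at that geometry and hand-tune from there; a minimal sketch (the kernel is hypothetical):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scaleKernel(float *out, int n) {   // hypothetical kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] *= 2.0f;
    }

    int main() {
        int minGridSize = 0, blockSize = 0;
        // Block size the occupancy heuristic considers best for this kernel,
        // and the minimum grid size needed to keep every SM busy.
        cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                           scaleKernel, 0, 0);
        printf("suggested block size: %d, min grid size: %d\n",
               blockSize, minGridSize);
        return 0;
    }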

Good points, though I agree with sibling that higher occupancy is not the goal; higher performance is the goal. Since registers are such a precious resource, you often want to set your block size and occupancy to whatever is best for keeping active state in registers. If you push the occupancy higher, the compiler might be forced to spill registers to VRAM, and that will just slow everything down even though the occupancy goes up.
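One knob for exactly that tradeoff, sketched below (the kernel and the numbers are made up): __launch_bounds__ tells the compiler the block size you intend and how many resident blocks per SM it should budget registers for, and compiling with -Xptxas -v reports whether anything spilled.

    // Compile with: nvcc -Xptxas -v kernel.cu
    // The ptxas report shows registers per thread and any spill stores/loads.

    // Telling the compiler we launch 256-thread blocks and only want 2
    // resident blocks per SM lets it keep more per-thread state in registers;
    // asking for more resident blocks may force spills to local memory.
    __global__ void __launch_bounds__(256, 2)
    tunedKernel(const float *in, float *out, int n) {   // hypothetical kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float acc = in[i];
            // ...lots of per-thread state lives comfortably in registers...
            out[i] = acc * acc;
        }
    }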

Another thing to maybe mention, re: “if your GPU has 60 SMs, and each block uses one SM, you can only run 60 blocks in parallel”… CUDA tends to want to have at least 3 or 4 blocks per SM so it can round-robin them as soon as one stalls on a memory load or sync or something else. You might only make forward progress on 60 separate blocks in any given cycle, but it’s quite important that you have like, for example, 240 blocks running in “parallel”, so you can benefit from latency hiding. This is where a lot of additional performance comes from, doing work on one block while another is momentarily stuck.
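In practice that usually just means sizing the grid off the SM count, something like this sketch (the kernel and the factor of 4 are made up):

    #include <cuda_runtime.h>

    __global__ void work(float *data, int n) {   // hypothetical kernel
        // Grid-stride loop: correctness doesn't depend on how many blocks we launch.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            data[i] += 1.0f;
    }

    int main() {
        int n = 1 << 24;
        float *data;
        cudaMalloc(&data, n * sizeof(float));

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        // Oversubscribe: ~4 resident blocks per SM so the scheduler can swap
        // a stalled block for one that's ready, hiding memory/sync latency.
        int numBlocks = prop.multiProcessorCount * 4;
        work<<<numBlocks, 256>>>(data, n);
        cudaDeviceSynchronize();
        cudaFree(data);
        return 0;
    }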

Is this really true in general? I'd expect it to be true for highly homogenous blocks, but I'd also expect that kernels where the warps are "desynced" in memory operations to do just fine without having 3-4 blocks per SM.
Oh I think so, but I’m certainly not the most expert of CUDA users there is. ;) Still, you will often see CUDA try to alloc local and smem space for at least 3 blocks per SM when you configure a kernel. That can’t possibly always be true, but is for kernels that are using modest amounts of smem, lmem, and registers. In general I’d say desynced mem ops are harder to make performant than highly homogeneous workloads, since those are more likely to be uncoalesced as well as cache misses. Think about it this way: a kernel can stall for many many reasons (which Nsight Compute can show you), especially memory IO, but even for compute bound work, the math pipes can fill, the instruction cache can miss, some instructions have higher latency than others, etc. etc. Even a cache hit load can take dozens of cycles to actually fill. Because stalls are everywhere, these machines are specifically designed to juggle multiple blocks and always look for ways to make forward progress on something without having to sit idle, that is how to get higher throughput and hide latency.
Well, yes, but "desynced" warps don't use shared memory - because writes to it require some synchronization for other warps to be able to read the information.
Why would that be true? Certainly there are algorithms (or portions of them) in which warps can just read whichever values exist in shared mem at the time, no need to sync. And I think we were mostly talking about global memory?
I don’t think it’s possible to use shared memory without syncing, and I don’t think there are any algorithms for that. I think shared memory generally doesn’t have values that exist before the warps in a block get there. If you want to use it, you usually (always?) have to write to smem during the same kernel you read from smem, and use synchronization primitives to ensure correct order.

There might be such a thing as cooperative kernels that communicate through smem, but you’d definitely need syncs for that. I don’t know if pre-populating smem is a thing that exists, but if it does then you’ll need kernel level or device level sync, and furthermore you’d be limited to 1 thread per CUDA core. I’m not sure either of those things actually exist, I’m just hedging, but if so they sound complicated and rare. Anyway, the point is that I think if we’re talking about shared memory, it’s safe to assume there must be some synchronizing.

I also assumed by “desynced” you meant threads would be doing scattered random access memory reads, since the alternative offered was homogeneous workloads. That’s why I assumed memory perf might be low or limiting due to low cache hit rates and/or low coalescing. In the case of shared memory, even if you have syncs, random access reads might lead to heavy bank conflicts. If what you meant instead is a workload with a very ordered access pattern that just doesn’t need any synchronization, then there’s no problem and perf can be quite good. In any case, it’s a good idea to minimize memory access and strive to be compute bound instead of memory bound. Memory tends to be the bottleneck most of the time. I’ve only seen truly optimized and compute bound kernels a small handful of times.

There is no guarantee of the order in which actions take effect. I.e., warp 1 writes to some shared memory address; warp 2 reads from that address. How can you guarantee the write happens before the read?
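The standard answer is a block-wide barrier between the write and the read, which is exactly the synchronization being discussed; a minimal made-up sketch (assumes 256-thread blocks):

    __global__ void reverseTile(const float *in, float *out) {  // hypothetical example
        __shared__ float tile[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = in[i];          // each warp writes its own slots...
        __syncthreads();                    // ...and nobody reads until ALL warps
                                            // in the block have finished writing
        out[i] = tile[255 - threadIdx.x];   // now safe to read a slot another warp wrote
    }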
jhj · 2 months ago
Aiming for higher occupancy is not always the right solution; what frequently matters more is avoiding global memory latencies by retaining more data in registers and/or shared memory. This was first noted in 2010 and is still true today:

https://www.nvidia.com/content/gtc-2010/pdfs/2238_gtc2010.pd...

I would also think in terms of latency hiding rather than just work parallelism (though latency hiding on GPUs comes largely from parallelism). This is the reason GPUs have massive register files: unlike modern multi-core CPUs, they omit latency-reducing hardware (e.g., speculative execution, large caches, out-of-order execution/register renaming, etc.), and in order to fill the pipelines we need to have many instructions outstanding, which means the operands for those pending instructions need to stay around for a lot longer, hence the massive register file.

I agree that optimizing for lower occupancy can yield significant performance gains in specific cases, especially when memory latencies are the primary bottleneck. Leveraging ILP and storing more data in registers can indeed help reduce the need for higher occupancy and lead to more efficient kernels. The examples in the GTC2010 talks highlighted that quite well. However, I would argue that occupancy still plays an important role, especially for scalability and general-purpose optimization. Over-relying on low occupancy and fewer threads, while beneficial in certain contexts, has its limits.

The first thing to consider is register pressure. Increasing the number of registers per thread to optimize for ILP can lead to register spilling once the register file is exhausted, which drastically reduces performance. This becomes more pronounced as problem sizes scale up (the talk's examples avoid that problem). Many real-world applications, especially compute-bound kernels, need high occupancy to fully utilize the GPU’s resources. Focusing too much on minimizing thread counts can lead to underutilization of the SM’s parallel execution units. A standard example is inference engines.

Also, while low-occupancy optimizations can be effective for specific workloads (e.g., memory-bound kernels), designing code that depends on such strategies as a general practice can result in less adaptable and less robust solutions across a wide variety of applications.

I believe there is a balance to strike here. Low occupancy can work for specific cases, while higher occupancy often provides better scalability and overall performance for more general use cases. But you have to test for that while you are optimizing your code; there is no general rule of thumb to follow here.

jhj · 2 months ago
> The first thing to consider is the register pressure. Increasing the number of registers per thread to optimize for ILP can lead to register spilling when the register file is exhausted

Kernels should almost never use local memory (except in arcane cases where you are using recursion and thus a call stack that will spill where an alternative non-recursive formulation would not really work).

> Many real-world applications, especially compute-bound kernels, need high occupancy to fully utilize the GPU’s resources

> while low-occupancy optimizations can be effective for specific workloads (e.g, memory-bound kernels)

I think this is almost exactly backwards; performant, high compute-intensity kernels (on a (fl)op/byte of memory traffic basis) tend to uniformly have low occupancy. Look at an ncu trace of many kernels in cuBLAS or cuDNN, for instance. You need a large working set of arguments in registers or in smem to feed scalar arithmetic or especially MMA units quickly enough, as gmem/L2 bandwidth alone is not sufficient to achieve peak performance in many cases. The only thing you need to do is ensure that you are using all SMs (and thus all available scalar arithmetic or MMA units), which does not by itself imply high occupancy (e.g., a kernel that has 1 CTA per SM).

The simplest way to write a memory-bound kernel is to simply spawn a bunch of threads and perform loads/stores from them, and it isn't too hard to get close to peak this way. But even then, depending on the warp scheduler to rotate other warps in to issue more loads/stores is inferior to unrolling loops, and you can also get close to peak memory bandwidth using not that many SMs through such unrolling, so even these kernels need not have high occupancy.
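To illustrate the unrolling point, a rough sketch (names and widths made up): each thread issues several independent loads before it needs any of the results, so one warp keeps multiple memory transactions in flight without relying on high occupancy.

    __global__ void copyUnrolled(const float4 *in, float4 *out, int n4) {
        // Grid-stride loop, 4 float4 elements per thread per iteration.
        int stride = gridDim.x * blockDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i + 3 * stride < n4;
             i += 4 * stride) {
            // Four independent loads issued back to back: the warp has
            // several transactions outstanding before it needs any result.
            float4 a = in[i];
            float4 b = in[i + stride];
            float4 c = in[i + 2 * stride];
            float4 d = in[i + 3 * stride];
            out[i]              = a;
            out[i + stride]     = b;
            out[i + 2 * stride] = c;
            out[i + 3 * stride] = d;
        }
        // (A real version would also handle the tail where n4 isn't a multiple
        // of 4 * stride; omitted to keep the sketch short.)
    }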

(I've been Nvidia GPU programming for around 11 years and wrote the original pytorch GPU backend/tensor library, the Faiss GPU library, and contributed some stuff to cuDNN in its early days such as FFT convolution.)

In the 90s we had segmented memory programming with near and far pointers, and you had to be very careful about when you used what type of pointer and how you'd organize your memory accesses. Then we got processors like the 286 that finally relieved us from this constrained way of programming.

I can't help but feel that with CUDA we're having new constraints (32 threads in a warp, what?), which are begging to be unified at some point.

While reading I thought you were going to suggest unified memory between RAM and VRAM, since that’s somewhat analogous, though that does exist with various caveats depending on how it’s setup & used.

SIMD/SIMT probably isn’t ever going away, and vector computers have been around since before segmented memory; the 32 threads in a CUDA warp is the source of its performance superpower, and the reason we can even fit all the transistors for 20k simultaneous adds & multiplies, among other things, on the die. This is conceptually different from your analogy too, the segmented memory was a constraint designed to get around pointer size limits, but 32 threads/warp isn’t getting us around any limits, it’s just a design that provides high performance if you can organize your threads to all do the same thing at the same time.
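That 32-wide design also gets exposed directly as a feature; for example, a warp can reduce a value across its 32 lanes without touching memory at all. A minimal sketch (made-up kernel; assumes the block size is a multiple of 32 and that *out starts at zero):

    __global__ void warpSum(const float *in, float *out) {  // hypothetical example
        float v = in[blockIdx.x * blockDim.x + threadIdx.x];
        // Shift reduction across the 32 lanes of one warp: each step, every
        // lane adds the value held by the lane 'offset' positions above it.
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffff, v, offset);
        if ((threadIdx.x & 31) == 0)       // lane 0 of each warp holds the warp's sum
            atomicAdd(out, v);
    }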

You can blame ARM for the popularity of CUDA. At least x86 had a few passable vector ISA ops like SSE and AVX - the ARM spec only supports the piss-slow NEON in its stead. Since you're not going to unify vectors and mobile hardware anytime soon, the majority of people are overjoyed to pay for CUDA hardware where GPGPU compute is taken seriously.

There were also attempts like OpenCL, that the industry rejected early-on because they thought they'd never need a CUDA alternative. Nvidia's success is mostly built on the ignorance of their competition - if Nvidia was allowed to buy ARM then they could guarantee the two specs never overlap.

> Since you're not going to unify vectors and mobile hardware anytime soon

Apple's M4 as described in its iPad debut has about the same chip area for CPU and GPU/vector functions. It's as much a vector machine as not.

https://www.apple.com/newsroom/2024/05/apple-introduces-m4-c...

oivey · 2 months ago
CUDA clobbered x86, not ARM. Maybe if x86’s vector ops were better and more usable ARM would have been motivated to do better.
Whole concept sounds like groping in the dark for a Take to me: GPUs (CUDA) are orthogonal to consumer processors (ARM / X86). Maybe we could assume a platonic ideal merged chip, a CPU that acts like a GPU, but there's more differences between those two things than an instruction set for vector ops.
> GPUs (CUDA) are orthogonal to consumer processors (ARM / X86).

We're talking about vector operations. CUDA is not a GPU but a library of hardware-accelerated functions, not necessarily different from OpenCL or even NEON for ARM. You can reimplement everything CUDA does on a CPU, and if you're using a modern CPU you can vectorize it too. x86 handles this well, because it's still got dedicated logic that keeps pace with the SIMD throughput an integrated GPU might offer. ARM leaves out the operations entirely (which is smart for efficiency), and therefore either relies on someone porting CUDA code to an ARM GPU shader (fat chance) or offloading to a remote GPU. It's why ARM is excellent for sustained simple ops but collapses when you benchmark it bruteforcing AI or translating AVX to NEON. SIMD is too much for a base-spec ARM core.

> Maybe we could assume a platonic ideal merged chip, a CPU that acts like a GPU, but there's more differences between those two things than an instruction set for vector ops.

Xeon Phi or Itanium flavored?

I've read this 10x and get more out of it each time.

I certainly don't grok it yet, so I might be wrong when I say it's still crystal clear there's a little motte/bailey going on with "blame ARM for CUDA" vs. "ARM is shitty at SIMD vs. x86"

That aside, I'm building something that relies on llama.cpp for inference on every platform.

In this scenario, Android is de facto "ARM" to me.

The Vulkan backend doesn't support Android (or it does, but the 1-2 people who got it running see absurdly worse performance; something something shaders, as far as I understand it).

iOS is de facto "not ARM" to me because it runs on the GPU.

I think llama.cpp isn't a great scenario for me to learn at the level you understand it, since it's tied to running a very particular kind of thing.

That aside, it was remarkable to me that my 13th gen Intel i5 Framework laptop gets 2 tokens/sec on iGPU and CPU. And IIUC, your comment explains that, in that "x86...[has] dedicated logic that keeps pace with SIMD...on [an integrated GPU]"

That aside, my Pixel Fold (read: 2022 mid-range Android CPU, should certainly be slower than 2023 Intel mid-upper range) kicks it around the block. 7 tkns/sec on CPU. 14 tkns/sec with NEON-layout.

Now, that aside, SVE was shown to double that again, indicating there's significant headroom on NEON. (https://github.com/ggerganov/llama.cpp/pull/9290) (I have ~0 idea what this is other than 'moar SIMD for ARM', for all I know, it's Amazon Graviton specific)

oivey · 2 months ago
Yeah, that’s true. CUDA is in large part for big HPC servers, where ARM historically wasn’t a player and still isn’t dominant. x86 got clobbered for HPC by CUDA.
ARM has SVE these days. This comment makes no sense, anyway: people don’t do numerical computing on phones.
I bet the majority of AI inference FLOPS will be executed on phones before long.

Our phone camera pipelines are doing lots of numerical compute already.

I'll believe it when autovectorization is actually useful in day to day high performance coding work.

It's just a hard problem. You can code ignorantly with high level libraries but you're leaving 2x to 10x performance on the table.

386?
In the conclusion, I like the image:

> My mental model [for GPU threads] is that you’ve got a bunch of container ships that can travel at 10% of the speed of light. You’re using them to ship goods around the world. They’re very fast so most of the work is in setting up your harbors so that you can load and unload these container-ships in fractions of a second so that it can sail to do the next thing. It’s not easy to feed these beasts, but if you do it right you can do huge chunks of work in almost no time.

Note that not all problems are compute bound. Many practical problems bottleneck on memory bandwidth.

For example, LLM AI inference on a desktop (where you don’t have a dozen of concurrent sessions from multiple users) is guaranteed to be memory bound, fetching these gigabytes of model’s tensors for each generated token. For use cases like that, specialized tensor cores deliver about the same performance as well-written compute shaders running on general purpose GPU cores.

However, AVX512 is way slower than GPUs, because modern GPUs have memory with very high bandwidth. In my desktop computer the system memory is dual-channel DDR5, which delivers 75 GB/s; the VRAM in the discrete GPU delivers 670 GB/s.
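Back-of-the-envelope version of that, with a made-up model size just to show the shape of the bound: for single-user decoding, tokens/sec can't exceed bandwidth divided by the bytes streamed per token.

    #include <cstdio>

    int main() {
        // Assumed numbers: an ~8 GB quantized model (my assumption), plus the
        // bandwidth figures from above. Every generated token streams
        // essentially all of the weights once, so bandwidth / model size is
        // an upper bound on tokens/sec.
        double model_bytes = 8e9;
        double cpu_bw = 75e9;       // dual-channel DDR5
        double gpu_bw = 670e9;      // discrete GPU VRAM
        printf("CPU ceiling: ~%.0f tok/s, GPU ceiling: ~%.0f tok/s\n",
               cpu_bw / model_bytes, gpu_bw / model_bytes);
        return 0;
    }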

The CPU numbers are off: an FMA is counted as 2 operations, and Zen 5 can do 2 of them per cycle in addition to two adds, so it would be 6 operations per cycle, not 4 (GPU numbers are always quoted this way, so it is only fair to do the same for the CPU).

Also, the 9950X has 32 threads but is hyperthreaded, so it only has 16 actual cores, and the correct scaling factor is 16 cores * 16 SIMD lanes. Anyway, the final number is 8.678 32-bit float TFLOPS (16 cores * 16 lanes * 6 ops/cycle at the boost clock).

The RTX 4090 has 82.58 32 bit TFLOPS according to Nvidia, but it also costs far more than the 9950x($1,600 vs $650), so I find this comparison rather odd.

So it costs 2.46x as much and delivers 9.5x the perf.

If you normalize for cost the perf advantage is about 3.8x, which is roughly the same numbers Intel reported years ago when they debunked the whole GPU is 100x better nonsense.

Anyway, I really hate the Cuda terminology where they refer to SIMD lanes as "threads".

There are also a lot of other things to consider where either the CPU or GPU has an advantage, such as:

GPU advantages:

Hardware sin/cos support (with Nvidia at least)

abs/saturate are often just modifiers

scaling by small powers of 2 is often free

16bit floats are fully supported

CPU advantages:

doubles are full speed, and you can interleave them with floats if you only need double precision for a few calculations

access to a wide variety of integer sizes and bit-manipulation functions; the GPU has some of this but not nearly as broad

lower level programming model

Decent points regarding relative strengths and weaknesses, but:

> lower level programming model

Do you mean how SASS (and the AMD equivalent) is not properly documented and is tool-less, as opposed to the assembly languages of different CPU architectures? Because otherwise, remember that one can write PTX code, and that is pretty low-level.

Nice!

It's interesting from the perspective of maintenance too. You can bet most constants like warp sizes will change, so you get into things like having profiles, autotuners, or not sweating the small stuff.
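For the constants point, the defensive version is just to read them from the device at runtime instead of baking in 32; a tiny sketch:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        // warpSize has been 32 on every NVIDIA GPU so far, but reading it from
        // the device properties keeps the assumption in one tunable place.
        printf("warp size: %d, SMs: %d\n", prop.warpSize, prop.multiProcessorCount);
        return 0;
    }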

We went more extreme, and nowadays focus several layers up: by accepting the (high!) constant overheads of tools like RAPIDS cuDF, we get in exchange the ability to easily crank out code with good saturation on the newest GPUs that any data scientist can edit and extend. Likewise, they just need to understand basics like data movement and columnar analytics data reps to make GPU pipelines. We have ~1 CUDA kernel left and many years of higher-level code.

As an example, this is one of the core methods of our new graph query language GFQL (think cypher on pandas/spark, w optional GPU runtime), and it gets Graph500 level performance on cheapo GPUs just by being data parallel with high saturation per step: https://github.com/graphistry/pygraphistry/blob/master/graph... . Despite ping-ponging a ton because cudf doesn't (yet) coalesce GPU kernel calls, V1 competes surprisingly high, and is easy to maintain & extend.

Had any exposure to r=2 hypergraph implementations on the GPU? Ideally with an efficient way to determine if the graph is acyclic?

(The CPU algos for doing this work great on CPUs but are woeful on GPUs.)

Pretty good - r=2 is a regular graph afaict, and basically anything that maps to a frontier-based pattern works well. Ex: level synchronous bfs during topological sort.

For the 'easy' way we do in gfql, which is basically vector ops on bulk wavefronts, we can do massive cypher traversals like you're asking, like 100M edges touched in a result substep, and on a tiny GPU. There are other such bulk patterns we want to add such as Pregel style, which open other algorithms here. In practice we can often just call cudf/cugraph as building blocks so haven't had the pressure to do so yet.

The weak spot I find is more like small OLTP lookups. Ex: Imagine a taxi routing traffic service pinging for one car to do a couple hops out, where you just want a KV store in cheap RAM. But if you are batching those queries, like in a heavy city, and going deeper on them, maybe more interesting.

Definitely not an expert, but trying to use AVX instructions explicitly in a C++ program can also produce suboptimal performance vs. just letting the optimizer decide, much like this article points out with not shaping your memory and compute to fit the GPU model.
What are some actually good resources to learn this stuff?
Programming Massively Parallel Processors: A Hands-on Approach by David Kirk and Wen-mei Hwu.
there are a thousand excellent CUDA programming courses/sites.

none of what was mentioned in this blog post is news if you've ever had more than 2 hours of a CUDA course...

little annoying to see the one-core-compared-to-whole-GPU comparisons - we're now decades past the point when this was an innocent mistake.

compare a 500W GPU to all the cores of a 500W CPU, please. I'm not expecting the CPU (say, a 192-core AMD that does fast AVX512) to beat the GPU on all data-parallel workloads, but it won't be the silly sort of graphs shown in this blog.

or compare one SM to one CPU core - that has merit as well.

best yet, we're finally getting some CPUs (well, APUs...) with in-package RAM. that makes the comparison more interesting as well.

oivey · 2 months ago
The first example plot is a 9950X that includes all threads with AVX512 vs a 4090. The 9950X has a 170W TDP, which doesn’t include any other components like the RAM or motherboard. The 4090’s total max power is ~450W. The chart shows the 4090 burying the 9950X by far more than 450/170.

Comparing SMs to CPU cores 1:1 also makes no sense. They don’t do the same things.

It should be kept in mind that a 4090 only buries a 9950X for FP32 computations.

For FP64 computations, the reverse happens, a 9950X buries a 4090, despite the latter having a 3-times higher price and a 2.5-times higher power consumption.

For FP64 operations, 4090 and 9950X are able to do a similar number of operations per clock cycle (288 vs. 256), but 9950X can do them at a double clock frequency and it is easier to reach a high fraction of the maximum theoretical throughput on a 9950X than on a 4090.

What about FP8? It is a target that is very popular for LLM inference.
AMD Zen 5 has the so-called "Vector Neural Network Instructions", which can be used for inference with INT8 quantization and also instructions for computing inference with BF16 quantization.

FP8 is a more recent quantization format and AFAIK no CPU implements it.

I do not know what the throughput of these instructions is on Zen 5. It must be higher than for older CPUs, but it must be lower than for the Intel Xeon models that support AMX (which are much more expensive, so despite having a higher absolute performance for inference, they might have lower performance per dollar), and obviously it must be lower than for the tensor cores of a big NVIDIA GPU.

Nevertheless, for models that do not fit inside the memory of a GPU, inference on a Zen 5 CPU may become competitive.