The problem is that not all programming languages expose SIMD, and even when they do, it's only a portable subset. Additionally, the skills required to use SIMD properly aren't something everyone is comfortable with.
I certainly am not; I managed to get by with MMX and early SSE, I can manage shading languages, and that is about it.
Not if the data is small and in cache.
> The performant route with AVX-512 would probably include the instruction vpconflictd, but I couldn’t really find any elegant way to use it.
I think the best way to do this is to duplicate sum_r and count 16 times, so each lane has a separate accumulation bucket and there can't be any conflicts. After the loop, you quickly do a sum reduction across the 16 buckets.
Use C as a common-denominator platform without crazy optimizations (like tcc). If you need performance, specialize: C gives you the tools to call assembly (or use compiler intrinsics, or even inline assembly).
A complex compiler doing crazy optimizations is, in my opinion, not worth it.
See also https://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teard...
Would be interesting to see if auto vec performs better with that addition.
auto aligned_p = std::assume_aligned<16>(p);
No reason for the compiler to balk at vectorizing unaligned data these days.
Getting a fault instead of half the performance is actually a really good reason to prefer aligned load/store. To be fair, you're talking about a compiler here, but I never understood why people use the unaligned intrinsics...
A great example would be a convolution-kernel style code - with AVX512 you are using 64 bytes at a time (a whole cacheline), and sampling a +/- N neighborhood around a pixel. By definition most of those reads will be unaligned!
A lot of other great use cases for SIMD don't let you dictate the buffer alignment. If the code is relatively bandwidth-limited, I have found it worth doing a head/body/tail setup: one misaligned iteration first, then the bulk of the work in alignment. Honestly, though, for that to pay off you have to be working almost completely out of L1 cache, which is rare; otherwise you're slowed down to L2 or memory speed, at which point the half-rate penalty doesn't really matter.
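The head/body/tail shape mentioned above looks roughly like this. A hedged scalar sketch: the "body" loop stands in for aligned SIMD iterations (8 doubles = one 64-byte line per step), and the function name is made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Head: scalar elements until the pointer reaches a 64-byte boundary.
// Body: full cache-line-sized chunks, now guaranteed aligned.
// Tail: whatever is left over.
double sum_aligned(const double* p, std::size_t n) {
    double s = 0.0;
    std::size_t i = 0;
    // head: advance until p + i is 64-byte aligned (or data runs out)
    while (i < n && (reinterpret_cast<std::uintptr_t>(p + i) % 64) != 0)
        s += p[i++];
    // body: aligned chunks of 8 doubles; in real code, aligned SIMD loads
    for (; i + 8 <= n; i += 8)
        for (std::size_t j = 0; j < 8; ++j)
            s += p[i + j];
    // tail: leftovers
    for (; i < n; ++i)
        s += p[i];
    return s;
}
```

Note the head can be up to 7 scalar iterations here, which is exactly the overhead that only pays for itself when the body is hot in L1.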
The early SSE-style instructions often favored making two aligned reads and then extracting your sliding window from that, but there's just no point doing that on modern hardware - it will be slower.