PSHUFB wins when access patterns are unpredictable, though I don't remember by how much it typically wins.
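To make the PSHUFB point concrete, here is a minimal sketch of an in-register table lookup, assuming GCC or Clang on x86-64 with SSSE3. The function name `low_nibbles_to_hex` and the hex-digit table are just illustrations, not something from the thread. The idea is that 16 data-dependent lookups happen in one instruction with no loads and no branches, so there is nothing for the cache or branch predictor to get wrong.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 loads/stores and logic ops */
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 (PSHUFB) */

/* Hypothetical helper: writes the hex digit of the low nibble of each of
 * 16 input bytes. The target attribute (GCC/Clang) enables SSSE3 for this
 * function only, so no special compiler flag is assumed. */
__attribute__((target("ssse3")))
void low_nibbles_to_hex(const uint8_t in[16], char out[16])
{
    /* The 16-entry lookup table lives in a register. */
    const __m128i table = _mm_setr_epi8('0', '1', '2', '3', '4', '5', '6', '7',
                                        '8', '9', 'a', 'b', 'c', 'd', 'e', 'f');
    __m128i v = _mm_loadu_si128((const __m128i *)in);

    /* Keep only the low 4 bits. This also clears bit 7, which PSHUFB
     * would otherwise treat as "write zero to this lane". */
    __m128i idx = _mm_and_si128(v, _mm_set1_epi8(0x0F));

    /* 16 independent table lookups in one instruction. */
    __m128i hex = _mm_shuffle_epi8(table, idx);
    _mm_storeu_si128((__m128i *)out, hex);
}
```

Calling it on 16 arbitrary bytes fills `out` with the characters a scalar `"0123456789abcdef"[in[i] & 0x0F]` loop would produce.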
PMOVMSKB can replace several conditionals (up to 16 in SSE2 for byte operands) with only one, winning in terms of branch prediction.
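A rough sketch of what that looks like, assuming GCC or Clang on x86-64 (SSE2 is baseline there). The function name `find_byte16` is made up for illustration. Sixteen byte comparisons collapse into one integer mask, and the only branch left is the "found anything at all?" test.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: _mm_cmpeq_epi8, _mm_movemask_epi8 (PMOVMSKB) */

/* Hypothetical helper: returns the index (0..15) of the first byte equal to
 * `needle` in a 16-byte block, or -1 if there is none. */
int find_byte16(const uint8_t block[16], uint8_t needle)
{
    __m128i v  = _mm_loadu_si128((const __m128i *)block);

    /* Each lane becomes 0xFF where the byte matches, 0x00 otherwise. */
    __m128i eq = _mm_cmpeq_epi8(v, _mm_set1_epi8((char)needle));

    /* PMOVMSKB: gather the top bit of every byte into a 16-bit mask. */
    int mask = _mm_movemask_epi8(eq);

    if (mask == 0)
        return -1;                       /* the single remaining branch */
    return __builtin_ctz((unsigned)mask); /* lowest set bit = first match */
}
```

A scalar version would need up to 16 compare-and-branch steps for the same block.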
PMADDWD is in SSE2, and does 8 16-bit multiplies, not 4 (adjacent products are summed into four 32-bit results). SSE4.1 has FP rounding that doesn't require changing the rounding mode, etc. There are also the weird string functions in SSE4.2, and non-temporal moves and prefetching help in some cases.
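For reference, PMADDWD is exposed as `_mm_madd_epi16`: it multiplies eight pairs of signed 16-bit values and adds adjacent products into four 32-bit lanes. Below is a small sketch of an 8-element dot product built on it, assuming GCC or Clang on x86-64. The name `dot8_i16` is hypothetical, and the horizontal sum at the end is just one common way to reduce the four lanes.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: _mm_madd_epi16 (PMADDWD) */

/* Hypothetical helper: dot product of two vectors of eight int16 values. */
int32_t dot8_i16(const int16_t a[8], const int16_t b[8])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* 8 multiplies; adjacent products are added, giving 4 int32 lanes:
     * p[k] = a[2k]*b[2k] + a[2k+1]*b[2k+1] */
    __m128i p = _mm_madd_epi16(va, vb);

    /* Horizontal sum of the 4 lanes: swap 64-bit halves and add,
     * then swap neighbouring 32-bit lanes and add. */
    p = _mm_add_epi32(p, _mm_shuffle_epi32(p, _MM_SHUFFLE(1, 0, 3, 2)));
    p = _mm_add_epi32(p, _mm_shuffle_epi32(p, _MM_SHUFFLE(2, 3, 0, 1)));

    return _mm_cvtsi128_si32(p);  /* lane 0 now holds the full sum */
}
```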
The cool thing with SIMD is that it puts a lot less stress on the CPU's memory access prediction and branch prediction, not only on the ALUs. So when you optimize with it, unrelated parts of your code can go faster too.
Compared to the weird, lumpy lego set of avx1/2, avx512 is quite enjoyable to write with, and still has some fun instructions that deliver more than just twice the width.
Personal example: the double-width byte shuffle (_mm512_permutex2var_epi8), which takes 128 bytes of input in two registers. I had a critical inner loop that uses a 256-byte lookup table; running an upper and a lower double-shuffle and blending them essentially pops out 64 answers a cycle from the lookup table on Zen 5 (which has two shuffle units), which is pretty incredible, and on its own produced a global 4x speedup for the kernel as a whole.
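A sketch of how that upper/lower double-shuffle plus blend could look, assuming GCC or Clang on x86-64. This is a guess at the shape of the technique, not the poster's actual kernel: the function names are hypothetical, and the scalar fallback with `__builtin_cpu_supports` is only there so the example also runs on machines without AVX-512 VBMI.

```c
#include <stdint.h>
#include <immintrin.h>

/* AVX-512 path: out[i] = table[idx[i]] for 64 byte indices at once. */
__attribute__((target("avx512f,avx512bw,avx512vbmi")))
static void lut256_x64_vbmi(const uint8_t table[256],
                            const uint8_t idx[64], uint8_t out[64])
{
    /* The 256-byte table fits in four 64-byte registers. */
    __m512i t0 = _mm512_loadu_si512(table);
    __m512i t1 = _mm512_loadu_si512(table + 64);
    __m512i t2 = _mm512_loadu_si512(table + 128);
    __m512i t3 = _mm512_loadu_si512(table + 192);
    __m512i i  = _mm512_loadu_si512(idx);

    /* Each shuffle uses index bits 0..6 to pick from a 128-byte pair
     * (bit 6 selects the second register); bit 7 is ignored here. */
    __m512i lo = _mm512_permutex2var_epi8(t0, i, t1); /* table[0..127]   */
    __m512i hi = _mm512_permutex2var_epi8(t2, i, t3); /* table[128..255] */

    /* Bit 7 of each index decides which half is the real answer. */
    __mmask64 top = _mm512_movepi8_mask(i);
    _mm512_storeu_si512(out, _mm512_mask_blend_epi8(top, lo, hi));
}

/* Scalar reference, used when the CPU lacks the needed extensions. */
static void lut256_x64_scalar(const uint8_t table[256],
                              const uint8_t idx[64], uint8_t out[64])
{
    for (int k = 0; k < 64; k++)
        out[k] = table[idx[k]];
}

/* Hypothetical entry point with runtime dispatch. */
void lut256_x64(const uint8_t table[256], const uint8_t idx[64], uint8_t out[64])
{
    if (__builtin_cpu_supports("avx512vbmi") && __builtin_cpu_supports("avx512bw"))
        lut256_x64_vbmi(table, idx, out);
    else
        lut256_x64_scalar(table, idx, out);
}
```

In a real inner loop the four table registers would be loaded once outside the loop, so each iteration is two shuffles, one mask extraction, and one blend.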
If you want to make 4 at a time though, you have to keep the thing fed. You need your ingredients in the cache, or you are just going to waste time finding them.
They were notable for several reasons, although they are no longer included in modern silicon.
Why do we even need SIMD instructions? - https://news.ycombinator.com/item?id=44850991 - Aug 2025 (8 comments)
Of course, memory bandwidth should increase proportionally; otherwise the cores might have no data to process.