Splitting into 4x4 blocks is typically very nice, though. Maybe it doesn’t matter so much to practical runtime.
I am aware that hardware counters can be used for profiling, but the more fine-grained exposure of what's happening at a lower level seems to be hidden... Or I haven't found the right reference resource yet.
For example, swapping columns and rows before applying a product so that accuracy is best preserved (then swapping back at the end of the operation). It seems to me that Strassen-like algorithms arbitrarily pick some elements of the matrices over others when forming their temporary sum-product variables, so we could exploit this asymmetry (not sure if that's the correct word here) to decide which elements get picked.
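Here is a minimal NumPy sketch of just the permute-then-unpermute bookkeeping (the permutations below are random placeholders; a real heuristic would pick them based on element magnitudes):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
B = rng.normal(size=(4, 4))

# Hypothetical orderings; a real heuristic would choose them to keep
# the partial sums inside the algorithm well scaled.
pr = rng.permutation(4)        # row order for A
pc = rng.permutation(4)        # column order for B

C_perm = A[pr, :] @ B[:, pc]   # multiply in the reordered basis
C = np.empty_like(C_perm)
C[np.ix_(pr, pc)] = C_perm     # scatter the result back to the original order

assert np.allclose(C, A @ B)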
All of these papers/algos are for the ML hype-train. ML algos are approximate anyway so no one cares about absolute accuracy, only the precision of the overall pipeline (class labels shouldn't change, at least not too much). Consider that very many papers/techniques quantize down to 8 or even 4 bits (yes sometimes even during training) for the purposes of perf.
This whole research area should just be renamed to something like approximate computing so that people don't confuse it (and the goals) with classical numerical analysis.
E.g., a 7B model at FP32 will perform worse than a 14B model at BF16, all else being equal.
FP32 -> BF16 is really good, because modern GPUs are actually far faster at BF16 multiplications than at multiplications in FP32, not to mention the decreased memory and memory throughput requirements. 4-bit quants are much more of a mixed bag. You get the memory savings, which often means the difference between a cheap consumer GPU and a much more expensive datacenter GPU, or a single GPU versus a node with multiple ones and all the interconnect that entails. Your raw speed won't improve as dramatically, though, as you still need to convert the 4-bit values to BF16 and back to do the actual computation. BF16 would probably make you memory-bound anyway, and 4-bit effectively gives you more throughput, so you get some savings, but the difference won't be as dramatic.
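To make the "convert and compute" point concrete, here is a rough sketch of a 4-bit weight path (the function names and the simple per-tensor scaling are assumptions for illustration; real kernels pack two values per byte and fuse the dequantization into the matmul):

import numpy as np

def quantize_4bit(w):
    # Symmetric per-tensor quantization into the signed 4-bit range [-8, 7].
    scale = np.max(np.abs(w)) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def matmul_4bit(x, q, scale):
    # Storage is 4-bit, but the arithmetic still happens at higher precision:
    # dequantize first, then run an ordinary matmul.
    return x @ (q.astype(np.float32) * scale)

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)
x = rng.normal(size=(1, 256)).astype(np.float32)

q, s = quantize_4bit(W)
print(np.max(np.abs(x @ W - matmul_4bit(x, q, s))))  # small but nonzero error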
>This and other optimizations trade multiplications for more additions, but from the little I know, with floating point, additions and subtractions are risky precision-wise, while multiplications are harmless.
Certainly not true. In general, operands with very different orders of magnitude are the problem. With multiplication and division these problems are more severe, as they lead to greater changes in magnitude, although even that is a generalization.
They're a problem with fixed point: might you have been thinking of that?
So the point being made is that if you apply more multiplications before the inevitable addition, you are likely increasing the danger of that addition.
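A quick way to see the magnitude issue (plain float64 behaviour, nothing specific to these algorithms):

# Adding numbers of very different magnitude silently drops the small one.
print((1e16 + 1.0) - 1e16)        # 0.0, the 1.0 is absorbed
print((1e16 - 1e16) + 1.0)        # 1.0, same terms, different order

# Multiplication is benign on its own (one rounding, relative error ~1 ulp),
# but it can widen the spread of magnitudes a later addition has to absorb.
a, b = 3.0e7, 4.0e-9
print(a * a + b * b)              # b*b vanishes next to a*a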
In the end, all data structures are functions. The academic applications of that may make your eyes glaze over, but the practical application is that if you want a reversed array/tree/etc., you just need the reading/iteration "function" for the array to return the data as if it were reversed. A matrix can be "transposed" just by reading its transpose. (Caching or other consequences may result in you choosing to manifest it in memory anyhow, but if it was stored cleverly in the first place, or you're only going to read it once so that using it is as expensive as reading it once to reconstruct it, then there's no reason to.) An array can be randomly permuted simply by wrapping the read function with a permutation on the index; it need not be manifested in memory. Etc.
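A tiny sketch of that idea, with made-up names (a reversed "view" that never copies, and a transpose done purely at read time):

class ReversedView:
    # Reads an existing sequence back to front without copying it.
    def __init__(self, data):
        self._data = data
    def __len__(self):
        return len(self._data)
    def __getitem__(self, i):
        if not 0 <= i < len(self._data):
            raise IndexError(i)
        return self._data[len(self._data) - 1 - i]

def transposed_read(read, i, j):
    # "Transpose" a matrix by swapping the indices at read time.
    return read(j, i)

a = [1, 2, 3, 4]
print(list(ReversedView(a)))   # [4, 3, 2, 1]; a itself is untouched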
This only works if you can easily find the end during iteration. Which you cannot in e.g. the case where you reverse a part of a larger array and want to keep the result contiguous.
I also don't understand why people seem to think that all you can do is maybe write one line of code or something. You can write as much code as it takes. Presumably, you know, not thousands and thousands of lines just to read the array backwards or in some funky order, but you're not limited to calling the standard-library-provided "reverse" function and giving up if it doesn't exist.
These protests that "I can't do that because I'm just so helpless" make no sense to me. Read the array in whatever order you want, whenever you want. It's not that hard. It's just about the easiest thing there is. (Too much vibe coding?)
tmpfn = tree.left
tree.left = tree.right
tree.right = tmpfn
Now when tree.left(node) is applied, it returns the right child, and vice versa. All traversals use the accessors, QED.

That's an array, as in, things already in memory. Not a stream. An array. A collection of fixed-size-in-RAM elements, contiguously in order.
That is not "an environment which doesn't provide a one-liner to do it". C, for instance, may not provide a backwards iterator, but then, it doesn't exactly provide a forwards iterator either. It is trivial to iterate backwards on an array in C. To the extent that it's a pain to provide a function that does it either way, that's C's problem, not the fact it can't be done. Everything's a pain in C.
Cache behaviour is not an issue here either, because I specified it as being read once, and I already called out that if you're going to read it multiple times, materializing it may be worth it. Reading it once to do whatever it is you are doing is faster than reading it once, writing it in a new order, then reading it in that new order, at least on anything remotely resembling real, existing hardware.
There is almost never a good reason to invert a full 4x4/3x3 transform, yet I see it all the time.
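One common case: a rigid transform (rotation plus translation) has a closed-form inverse, so a general-purpose 4x4 inversion routine is unnecessary. A sketch, assuming the upper-left 3x3 block is orthonormal:

import numpy as np

def invert_rigid(T):
    # T = [[R, t], [0, 1]] with R orthonormal, so T^-1 = [[R.T, -R.T @ t], [0, 1]].
    R, t = T[:3, :3], T[:3, 3]
    Tinv = np.eye(4)
    Tinv[:3, :3] = R.T
    Tinv[:3, 3] = -R.T @ t
    return Tinv

# Example: 90-degree rotation about z plus a translation.
T = np.eye(4)
T[:3, :3] = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
T[:3, 3] = [1, 2, 3]
assert np.allclose(invert_rigid(T) @ T, np.eye(4))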
But in large-scale applications you may not want to store XX’ at all; instead you are interested in computing products of the form XX’v on the fly.
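In other words, apply the two factors one after the other instead of ever forming the n-by-n Gram matrix. Minimal sketch:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 64))   # tall-and-skinny data matrix
v = rng.normal(size=10_000)

y = X @ (X.T @ v)   # never materializes the 10,000 x 10,000 matrix XX'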
XX^T is the matrix of all pairwise dot products of the vectors in X, and as others have pointed out there are legitimate applications. Another one is e.g. transforming an underdetermined linear system into a least squares problem.
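For the underdetermined case, XX^T shows up in the minimum-norm solution. A small sketch (in practice you would use a factorization or lstsq rather than the explicit solve):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))   # 5 equations, 20 unknowns: underdetermined
b = rng.normal(size=5)

# Minimum-norm solution of X w = b:  w = X^T (X X^T)^{-1} b
w = X.T @ np.linalg.solve(X @ X.T, b)
assert np.allclose(X @ w, b)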
[ a b ] [ a c ]   =   [ (a b) . (a b)   (a b) . (c d) ]   =   [ |(a b)|^2       (a b) . (c d) ]
[ c d ] [ b d ]       [ (c d) . (a b)   (c d) . (c d) ]       [ (c d) . (a b)   |(c d)|^2     ]
Down the diagonal we have the squared norms of (a b) and (c d), which is interesting and could somehow be used. We also have repeated values between the upper and lower triangles, because (a b) . (c d) is the same as (c d) . (a b): the dot product commutes.
Whereas just squaring the matrix:
[ a b ] [ a b ]   =   [ (a b) . (a c)   (a b) . (b d) ]
[ c d ] [ c d ]       [ (c d) . (a c)   (c d) . (b d) ]
all four dot products are unique. So right off the bat, we can find interesting structure in X X^T without going deep, and an obvious shortcut.
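A quick numeric check of those two observations (symmetry, and squared row norms on the diagonal):

import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])           # rows are (a b) and (c d)

G = X @ X.T
print(G)                             # [[5, 11], [11, 25]]
print(np.allclose(G, G.T))           # True: the off-diagonal entries repeat
print(np.diag(G), (X ** 2).sum(axis=1))  # both [5, 25]: squared row norms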
Btw, it's worth noting that if you know the result will be symmetric (as is the case for X * X^T), you can make things faster. For example, in cuBLAS, cublas*syrk (the variant specialized for symmetric results) in my experience isn't actually faster than gemm, so what you can do instead is issue smaller multiplications that fill in one of the two triangles piece by piece, and then copy that triangle to the other one.
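A rough sketch of that triangle-by-blocks idea in NumPy terms (the blocking and the final mirror are the point; on a GPU each block product would be its own gemm call):

import numpy as np

def gram_lower_blocked(X, block=256):
    # Computes G = X @ X.T by filling only the lower triangle block by block,
    # then mirroring it into the upper triangle.
    n = X.shape[0]
    G = np.zeros((n, n), dtype=X.dtype)
    for i in range(0, n, block):
        for j in range(0, i + 1, block):
            bi = slice(i, min(i + block, n))
            bj = slice(j, min(j + block, n))
            G[bi, bj] = X[bi] @ X[bj].T   # roughly half the block products of a full gemm
    iu = np.triu_indices(n, k=1)
    G[iu] = G.T[iu]                       # copy the lower triangle into the upper one
    return G

X = np.random.default_rng(0).normal(size=(1000, 64))
assert np.allclose(gram_lower_blocked(X), X @ X.T)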
Aka a galactic algorithm:
Whereas many algorithms are theoretically fine for real data but don't make good use of cache, or instruction sets...
If they're wrong to speculate, well, there's a whole paper you can just go skim to find the bit that rebuts them.