RISC-V has no carry flag, so this whole thing becomes awkward.
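For what it's worth, the workaround is mechanical: the carry out of the low limb is recovered with an unsigned comparison, which RISC-V compilers lower to `sltu`. A minimal sketch:

    #include <cstdint>

    // 128-bit add from 64-bit limbs without a carry flag. The comparison
    // r_lo < a_lo detects unsigned wraparound of the low addition and
    // compiles to a single sltu instruction on RISC-V.
    void add128(uint64_t a_lo, uint64_t a_hi, uint64_t b_lo, uint64_t b_hi,
                uint64_t& r_lo, uint64_t& r_hi) {
        r_lo = a_lo + b_lo;
        uint64_t carry = r_lo < a_lo;
        r_hi = a_hi + b_hi + carry;
    }

It works, but every carry costs an extra compare-and-add instead of falling out of the flags register for free.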
I am under the impression that boost::multiprecision has specialized templates for 128- and 256-bit math, but maybe I'm wrong. In practice, when I've wanted extended precision, I've just used GMP or a language with bignums.
I would expect the best x86 machine code for many 128-bit operations would use XMM instructions, no? But I haven't investigated.
    int64_t a, b, c, r;
    r = (a * b) / c; /* multiplication step could overflow, so use 128 bits */

They're typically implemented with arrays of 64-bit or 32-bit unsigned integers, but if 128 bits were available in hardware, we could get a performance boost. Any arbitrary-precision integer library would benefit from 128-bit hardware integers.
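Returning to the `(a * b) / c` example above, a hedged sketch of the widening intermediate, assuming a compiler with GCC/Clang's `__int128` extension:

    #include <cstdint>

    // (a * b) / c without overflowing the multiplication: widen to 128 bits
    // for the product, divide, then narrow. The caller must still ensure
    // c != 0 and that the final quotient fits in 64 bits.
    int64_t muldiv(int64_t a, int64_t b, int64_t c) {
        __int128 wide = static_cast<__int128>(a) * b;
        return static_cast<int64_t>(wide / c);
    }

On MSVC, which has no `__int128`, this is exactly where a hand-rolled i128 or an intrinsic like `_mul128` comes in.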
If we were performing 128-bit arithmetic in parallel over many values, a SIMD implementation might help, but without a SIMD equivalent of `addcarry`, there's a limit to how much it can gain.
Something like this could potentially be added to AVX-512 for example by utilizing the `k` mask registers for the carries.
The best we currently have is `adcx` and `adox`, which let us run two interleaved addcarry chains: one uses the carry flag and the other the overflow flag, which improves ILP. These instructions are quite niche, but bigint libraries use them to improve performance.
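A hedged sketch of the shape such code takes, using the ADX intrinsic from `<immintrin.h>` (whether a given compiler actually emits interleaved `adcx`/`adox` for this is not guaranteed; the function name here is illustrative):

    #include <immintrin.h>

    // Two independent 256-bit additions (4 x 64-bit limbs each). Because
    // the two carry chains never feed into each other, an ADX-aware
    // compiler can interleave them as adcx (carry flag) and adox
    // (overflow flag) and keep both chains in flight at once.
    void add256_pair(unsigned long long d1[4], const unsigned long long a1[4],
                     const unsigned long long b1[4],
                     unsigned long long d2[4], const unsigned long long a2[4],
                     const unsigned long long b2[4]) {
        unsigned char c1 = 0, c2 = 0;  // one carry per chain
        for (int i = 0; i < 4; ++i) {
            c1 = _addcarryx_u64(c1, a1[i], b1[i], &d1[i]);
            c2 = _addcarryx_u64(c2, a2[i], b2[i], &d2[i]);
        }
    }

The classic use case is the inner loop of a schoolbook or Montgomery multiply, where the product accumulation naturally produces two independent carry streams.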
I keep meaning to see if work will let me open source it.
i32 didn't cover enough of a time span, and f64 has edge cases from the nature of floats. This was for Windows (MSVC, not GCC), so I had to roll my own i128.
If you like your CAD accurate, you have to operate in integer space.
> using u64 = unsigned long long;
? Although in practice this is _usually_ an unsigned 64-bit integer, the C++ Standard does not technically guarantee it; all it says is that the type needs to be _at least_ 64 bits. [0]
I would use std::uint64_t, which guarantees a type of exactly that size, provided it is supported. [1]
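For instance (a minimal sketch, reusing the article's alias name):

    #include <cstdint>
    using u64 = std::uint64_t; // exactly 64 bits, or a compile error if the
                               // platform doesn't provide such a type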
Re: Multiplication: regrouping our u64 digits
I am aware that more advanced and faster algorithms exist, but I wonder if something simple like Karatsuba's algorithm [2], which uses 3 multiplications instead of 4, could be a quick performance win over the naive method used in the article. Though since it was mentioned that the compiler-provided unsigned 128-bit integers more closely resemble the ones created in the article, I suppose there must be a reason for that method to be used instead, or something I missed that makes Karatsuba unsuitable here.
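For reference, a minimal sketch of where the "3 multiplications instead of 4" comes from, scaled down to 16-bit halves so every intermediate fits comfortably in plain 64-bit arithmetic (a 64-bit-digit version has the same shape, but the middle term needs carry handling):

    #include <cstdint>

    // Karatsuba on 16-bit halves: x = x1*2^16 + x0, y = y1*2^16 + y0.
    // The middle term x1*y0 + x0*y1 is recovered from one multiplication
    // instead of two.
    uint64_t mul32_karatsuba(uint32_t x, uint32_t y) {
        uint64_t x0 = x & 0xFFFF, x1 = x >> 16;
        uint64_t y0 = y & 0xFFFF, y1 = y >> 16;
        uint64_t z0 = x0 * y0;                          // low product
        uint64_t z2 = x1 * y1;                          // high product
        uint64_t z1 = (x0 + x1) * (y0 + y1) - z0 - z2;  // middle, one mul
        return z0 + (z1 << 16) + (z2 << 32);
    }

For a single 128-bit multiply the trade is murkier than at bignum sizes: the saved multiplication is bought with extra additions and carry handling, while a 64x64->128 hardware multiply is cheap.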
Speaking of which, I would be interested to see how all these operations fare against compiler-specific implementations (as well as comparisons between different compilers) [3]. The article only briefly mentioned that their multiplication method is similar to the builtin `__uint128_t`'s [4], but did not go into detail or mention similarities/differences for the other arithmetic operations.
[0] https://en.cppreference.com/w/cpp/language/types.html The official standard needs to be purchased, which is why I did not reference it, but this should be under [basic.fundamental].
[1] https://en.cppreference.com/w/cpp/types/integer.html
[2] https://en.wikipedia.org/wiki/Karatsuba_algorithm
[3] I suppose I could see for myself using godbolt, but I would like to see some commentary/discussion on this.
[4] And did not state for which compiler, though by context, I suppose it would be MSVC?
The comment on the typedef points out that the signature of the intrinsics uses `unsigned long long`, though he incorrectly states that `uint64_t` is `unsigned long` - which isn't guaranteed, as `long` is only required to be at least 32 bits and at least as large as `int`. Under ILP32 and LLP64, for example, `long` is only 32 bits.
I don't think this really matters anyway. `long long` is 64 bits on pretty much everything that matters, and he is using architecture-specific intrinsics in the code, so it is not going to be portable anyway.
If some future arch had 128-bit hardware integers and a data model where `long long` is 128 bits, we wouldn't need this code at all, as we would just use the hardware support directly.
But I agree that `uint64_t` is the correct type to use for the definition of `u128`, if we wanted to guarantee it occupies the same storage. The width-specific intrinsics should also use this type.
> I would be interested to see how all these operations fare against compiler-specific implementations
There's a godbolt link at the top of the article which has the comparison. The resulting assembly is basically equivalent to the built-in support.
(We wanted something UUID-like but deterministic that we could easily decompose and do RBAC with; this was prior to the invention of JWTs, OAuth, and scopes. It worked at the time.)
Wait, what? I'm fairly certain that you can do a 128-bit by 128-bit division using x64's 128-bit by 64-bit division instruction, which gives you only a 64-bit quotient and remainder. The trick is to pre-multiply both dividend and divisor by a large enough power of 2 so that the "partial" quotients and remainders the hardware instruction needs to produce will fit into 64 bits. On the whole, IIRC you need either 1 or 2 division instructions, depending on how large the divisor is (if it's too small, you need two divisions).
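For the simpler 128-by-64 half of that trick, here is a minimal sketch using MSVC's `_udiv128` intrinsic (which maps straight to the x64 `div` instruction; `div` faults if the quotient doesn't fit in 64 bits, which is exactly what dividing the high limb first prevents):

    #include <intrin.h>
    #include <cstdint>

    // (hi:lo) / d -> 128-bit quotient (q_hi:q_lo) and 64-bit remainder.
    // Dividing hi first leaves a partial remainder r < d, so the second
    // division's dividend (r:lo) is guaranteed to produce a quotient
    // that fits in 64 bits.
    void udiv128_by_64(uint64_t hi, uint64_t lo, uint64_t d,
                       uint64_t& q_hi, uint64_t& q_lo, uint64_t& rem) {
        q_hi = hi / d;
        uint64_t r = hi % d;
        q_lo = _udiv128(r, lo, d, &rem);
    }

The normalization described above (shifting so the divisor's top bit lands in a known position) is what extends this building block to a full 128-by-128 divide.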
Why 564 bits? That’s 70.5 bytes.