What I’ve come to believe is this: you should work at a level of abstraction you’re comfortable with, but you should also understand the layer beneath it.
If you’re a C programmer, you should have some idea of how the C runtime works, and how it interacts with the operating system. You don’t need every detail, but you need enough to know what’s going on when something breaks. Because one day printf won’t work, and if the layer below is a total mystery, you won’t even know where to start looking.
So: know one layer well, have working knowledge of the layer under it, and, most importantly, be aware of the shape of the layer below that.
https://corecursive.com/godbolt-rule-matt-godbolt/

Also, this article in acmqueue by Matt is not new at all, but it's a super great introduction to these types of optimizations.
And also, it’s just fun to understand the lower layers.
What are some articles/books/videos you would recommend to go from beginner to expert in your domain?
Many thanks to him for that.
Between that and Compiler Explorer, it is fair to say he made the world a better place for many of us developers.
-O2 is basically all you usually need. As you update your compiler, it'll keep tweaking exactly what that general optimization level does, based on what the compiler authors know today.
Because that's the thing about these flags: you'll generally set them once at the beginning of a project. Compiler authors will reevaluate them far more often than you will.
Also, a trap I've observed is setting flags based on bad benchmarks. This applies more to the JVM than to a C++ compiler, but nevertheless, a system's current state is somewhat random. Fluctuations of 1-2% in performance, even for the same app, are normal. A lot of people won't realize that and end up adding flags based on those fluctuations (see the sketch below).
But further, how code is currently laid out can affect performance. You may see a speed boost not because you tweaked the loop-unrolling variable, but because your tweak relocated a hot path to be slightly more cache friendly. A change in the code structure can eliminate that benefit.
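To make the fluctuation point concrete, here's a minimal sketch (the work() function, run count, and thresholds are all made up, not a real harness): time the identical binary several times and look at the spread before crediting any flag with a 1-2% "win".

#include <chrono>
#include <cstdio>

// Stand-in for whatever you are actually measuring.
static long work() {
    long s = 0;
    for (long i = 0; i < 50'000'000; ++i)
        s += i % 7;
    return s;
}

int main() {
    using clock = std::chrono::steady_clock;
    double best = 1e300, worst = 0.0;
    for (int run = 0; run < 10; ++run) {
        auto t0 = clock::now();
        volatile long r = work();   // volatile so the work isn't discarded
        auto t1 = clock::now();
        (void)r;
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (ms < best)  best = ms;
        if (ms > worst) worst = ms;
    }
    // If best and worst already differ by a couple of percent on identical
    // code, a "win" of that size from a new flag is probably just noise.
    std::printf("best %.1f ms  worst %.1f ms\n", best, worst);
    return 0;
}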
To support multiple architecture levels in the same binary, I think you still need to do the manual work of annotating the specific functions for which several versions should be generated and dispatched at runtime.
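Right, though once you've annotated a function, GCC and Clang will at least generate and dispatch the clones for you via target_clones (on x86; as far as I know it relies on glibc's ifunc mechanism). A rough sketch, with a made-up function name and ISA list:

#include <cstddef>

// One clone is emitted per listed target, plus a resolver that picks the
// best match for the running CPU when the function is first called.
__attribute__((target_clones("avx2", "sse4.2", "default")))
void scale(float *x, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i)
        x[i] *= k;
}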
If you know the architecture and the oldest CPU model, we're better served by adding a bunch more flags, no?
I wish I could compile my server code to target CPU released on/after a particular date like:
-O2 -cpu-newer-than=2019

-O2 in gcc has vectorization flags set which will use AVX if the target CPU supports it. It is less aggressive about vectorization than -O3.
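The closest real thing I know of is the x86-64 micro-architecture levels (gcc 11+ / clang 12+, if I remember right): -march=x86-64-v2 is roughly the SSE4.2/Nehalem era, and -march=x86-64-v3 is roughly Haswell (2013) and newer, with AVX2 and FMA. A sketch with the compile lines as comments (file and function names are made up; __restrict is a GCC/Clang extension):

// Something like:
//   g++ -O2 -march=x86-64-v3 -c saxpy.cpp   // allow AVX2/FMA, roughly 2013+ CPUs
//   g++ -O2 -march=x86-64-v2 -c saxpy.cpp   // SSE4.2 era instead
// I believe gcc 12+ auto-vectorizes a loop like this at plain -O2; older
// versions need -O3 or -ftree-vectorize for the same effect.
#include <cstddef>

void saxpy(float *__restrict y, const float *__restrict x, float a, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];   // unit stride, no aliasing: easy to vectorize
}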
Flags from -O3 often flow down into -O2 as they are proven generally beneficial.
That said, I don't think -O3 has the problems it once did.
Of course, even with a solid grasp of the language(s), it's still by no means easy to write correct C or C++ code, but if your plan is to go with "this seems to work", you're setting yourself up for trouble.
For cases where -O2 is too slow to compile, dropping a single nasty TU down to -O1 often helps. -O0 is usually not useful: while it's faster for tiny TUs, -O1 is still pretty fast for those, and for anything larger, the binary bloat of -O0 is likely to kill your link time compared to -O1's slimness.
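If editing the build system per-TU is awkward, GCC can also take the override from inside the source. A rough sketch (GCC-specific extensions; the function name is made up):

// At the top of the one pathologically slow TU, overriding the global -O2:
#pragma GCC optimize ("O1")

// ...or narrower, on just the offending (e.g. machine-generated) function:
__attribute__((optimize("O1")))
void huge_generated_function() {
    // thousands of lines of generated code that choke the optimizer at -O2
}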
Also, debuggability matters. GCC's `-O2` is quite debuggable once you learn how to work around the occasional <optimized out> (going up a frame or dereferencing a cast register is often all you need); this is unlike Clang, which, every time I check, still gives up entirely.
The real argument is -O1 vs -O2 (since -O1 is a major improvement over -O0 and -O3 is a negligible improvement over -O2) ... I suppose originally I defaulted to -O2 because that's what's generally used by distributions, which compile rarely but run the code often. This differs from development ... but does mean you're staying on the best-tested path (hitting an ICE is pretty common as it is); also, defaulting to -O2 means you know when one of your TUs hits the nasty slowness.
While mostly obsolete now, I have also heard of cases where 32-bit x86 inline asm has difficulty fulfilling constraints under register pressure at low optimization levels.
In my experience a team of 200 developers will see 1 compiler bug affect them every 10 years. This isn't scientific, but it is a good rule of thumb and may put the above in perspective.
In the case of open source compilers the bug was generally fixed upstream and we just needed to get on a newer release.
You generally avoid O3 because it's slower. Slower to compile, and slower to run. Aggressively unrolling loops and larger inlining windows bloat code size to the degree it impacts icache.
The optimization levels aren't "how fast do you want the code to go", they're "how aggressive do you want the optimizer to be." The most aggressive optimizations are largely unproven and stay in O3 until they prove generally useful, at which point they move to O2.
For example, people showed me
extern void g(int x);
int f(int a, int b)
{
    g(b ? 42 : 43);
    return a / b;
}
as an example of how compilers exploit "time-travelling" UB to optimize code, but it is just a compiler bug that got fixed once I reported it: https://developercommunity.visualstudio.com/t/Invalid-optimi...
Other compilers have similar issues.
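For anyone who hasn't seen it spelled out: the transformation people were showing off treats the later division as licence to assume b != 0, so the earlier branch folds away. The bug report's point is that this is invalid, because when b == 0 the call to g(43) must still happen before any UB is reached, and g may never return. Roughly:

extern void g(int x);

// What the "time-travelling UB" examples claim f() may be compiled as:
int f(int a, int b)
{
    g(42);          // invalid when b == 0: g(43) must be called before any UB
    return a / b;
}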
GCC version: 11.3, target: Cortex-A9, Qt version: 5.15.
I think we tested single core and quad core, also possibly a newer GCC version, but I'm not sure. Just wanted to add my two cents.
And programs full of pointer-chasing are quite pessimized; highly-OO code is a common example, which includes almost all GUIs, even in C++.
In any case, even with whole-program optimization, I would expect effectively devirtualizing a heavily object-oriented application to be very hard.
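Agreed; the main lever I know of is giving the optimizer proof that the hierarchy is closed, e.g. marking classes or virtual functions final and building with -flto so it can see all the overriders. A sketch (class names made up):

struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

// 'final' promises there are no further overrides, so a call through a
// Circle* or Circle& needs no vtable lookup and can be inlined.
struct Circle final : Shape {
    double r;
    explicit Circle(double r) : r(r) {}
    double area() const override { return 3.141592653589793 * r * r; }
};

double area_of(const Circle &c) {
    return c.area();    // devirtualized: the static type is a final class
}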
Where is the problem to be solved?