Reorganizing the binary is an interesting way to minimize the cost, but I think any performance-oriented developer should keep in mind that most projects rarely depend on a single hot loop. They depend on many systems working together and competing for space in the cache(s).
For that reason I generally use -Os instead of -O2 or -O3 in my projects, and I try to keep code bloat to a minimum.
https://vondra.me/posts/playing-with-bolt-and-postgres/
"results are unexpectedly good, in some cases up to 40%"
If memory serves, this was with MPW C or maybe CodeWarrior.
You could see the jump (jmp) instructions use short jumps rather than long ones.
I worked on the Profiler and I seem to remember that Microsoft was one of the developers that put a bunch of effort into using this to optimize the Office suite on Mac. I remember the release of Word that used it was snappier.
You have far and near pointer modifiers.
Reading the link, there are several that sound like they match what BOLT is applying (Basic Block Optimization, Function Layout, Conditional Branch Optimization, and Dead Code Separation).
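For what it's worth, a couple of those ideas can be roughly approximated by hand at the source level. Here is a minimal sketch, assuming GCC or Clang (the `cold` attribute and `__builtin_expect` are compiler extensions, and the function and macro names are made up for illustration). It only gives the compiler layout hints; BOLT makes these decisions from profile data on the finished binary, so this is not what BOLT itself does.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper macros; the attribute and builtin are GCC/Clang extensions. */
#if defined(__GNUC__) || defined(__clang__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#define COLD_FN __attribute__((cold, noinline))
#else
#define UNLIKELY(x) (x)
#define COLD_FN
#endif

/* Rarely executed error path. Marking it cold hints that the compiler
 * may optimize it for size and keep it away from frequently run code. */
COLD_FN static long report_bad_input(size_t index)
{
    fprintf(stderr, "negative value at index %zu\n", index);
    return -1;
}

/* Hot loop: sums the values and bails out on the first negative one.
 * UNLIKELY hints that the error branch should not sit on the
 * fall-through path. */
long sum_non_negative(const int *values, size_t count)
{
    long total = 0;
    for (size_t i = 0; i < count; i++) {
        if (UNLIKELY(values[i] < 0))
            return report_bad_input(i);
        total += values[i];
    }
    return total;
}
```

Calling `sum_non_negative` on `{1, 2, 3, 4}` returns 10, and any negative element takes the cold path and returns -1. Whether the hints change the generated layout at all depends on the compiler and optimization level.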