> As an example, in my desktop, it takes ~25min to compile a Debug version of LLVM, but ~23min to compile a Release version.

Oh, I think I know what might cause this: TableGen. The `llvm-tblgen` run time accounts for a good chunk of LLVM build time. In a debug build, `llvm-tblgen` is also unoptimized, hence the long run time generating those .inc / .def files. You can enable the CMake variable `LLVM_OPTIMIZED_TABLEGEN` to build `llvm-tblgen` in release mode while leaving the rest of the project in debug (or whatever `CMAKE_BUILD_TYPE` you choose).

TIL that LLVM_OPTIMIZED_TABLEGEN exists, thanks! Unfortunately, it didn't make the build any faster on my machine.
The biggest factor in improving LLVM compile/link times is sheer RAM capacity. LLVM compilation is extremely voracious with memory, and the more you have the faster it is; otherwise, much time is spent swapping to and from disk.

I've been compiling it with MSVC on Windows and the swap file after an LLVM compile balloons to ~60 GB. I inserted a couple more sticks to bring my 32 GB to 64 GB and my compile times shortened significantly (I don't have exact stopwatch times right now).

Use clang-cl instead - way better RAM usage profile and way faster too - I'm talking like 50% reduction.
I'd say it's mostly due to linking. The debug info generated is absolutely enormous.
LLVM_USE_SPLIT_DWARF may help with this, some recent measurements: https://www.tweag.io/blog/2023-11-23-debug-fission/#an-examp...
> good chunk of LLVM build time

Nah no way it's more than 5%. 5% is definitely a chunk but not big enough that if you made it completely vanish I'd notice. I rebuild LLVM with no cache many times a day (debug, release, all targets, no targets, etc). Debug build takes longer because of all the symbols - LLVM has some enormous symbols due to templating hijinks.

Protips for speeding up building LLVM:

1. LLVM_TARGETS_TO_BUILD=host

2. LLVM_ENABLE_PROJECTS only what you're interested in (do you really care about lld or flang whatever?)

3. Use clang/lld instead of gcc - way better RAM usage

4. Use ccache

5. Disable tests
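
Putting those together, a configure line (run from the monorepo root) might look roughly like this - a sketch, not a canonical recipe; swap in your own target and project list:

```sh
cmake -G Ninja -S llvm -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_TARGETS_TO_BUILD=X86 \
  -DLLVM_ENABLE_PROJECTS=clang \
  -DCMAKE_C_COMPILER=clang \
  -DCMAKE_CXX_COMPILER=clang++ \
  -DLLVM_USE_LINKER=lld \
  -DLLVM_CCACHE_BUILD=ON \
  -DLLVM_INCLUDE_TESTS=OFF
```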

> The compiler optimizes for data locality

> So, we have a single array in which every entry has the key and the value paired together. But, during lookups, we only care about the keys. The way the data is structured, we keep loading the values into the cache, wasting cache space and cycles. One way to improve that is to split this into two arrays: one for the keys and one for the values.
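
Here's a minimal sketch of the split the quote describes (my own illustration with placeholder types, not DenseMap's actual layout):

```cpp
#include <cstddef>
#include <vector>

// Structure-of-arrays map: keys and values live in separate arrays, so a
// lookup scan only pulls keys into the cache; the value is touched once,
// at the very end.
template <typename Key, typename Value>
struct SoAMap {
    std::vector<Key>   keys;
    std::vector<Value> values;   // values[i] belongs to keys[i]

    const Value* find(const Key& k) const {
        for (std::size_t i = 0; i < keys.size(); ++i)   // only keys are loaded
            if (keys[i] == k)
                return &values[i];
        return nullptr;
    }
};
```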

Recently someone proposed this on LLVM: https://discourse.llvm.org/t/rfc-add-a-new-structure-layout-...

Also, I think what you meant by data locality here is really optimizing data layout, which, as you also mentioned, is a hard problem. But if it's just about optimizing (cache) locality, I think the classic loop interchange also qualifies, though it's not enabled by default in LLVM despite having been there for quite a while.
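
For reference, the textbook loop-interchange example (nothing LLVM-specific, just to show the kind of transformation we're talking about):

```cpp
constexpr int N = 1024;
double a[N][N];

// Column-first traversal of a row-major array: consecutive accesses are
// N * sizeof(double) bytes apart, so almost every load misses the cache.
double sum_columns_first() {
    double sum = 0.0;
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            sum += a[i][j];
    return sum;
}

// After interchanging the two loops, accesses are sequential and cache-friendly.
double sum_rows_first() {
    double sum = 0.0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            sum += a[i][j];
    return sum;
}
```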

> Recently someone proposed this on LLVM

Well, to be honest, it's not any extraordinary idea and it's not even mine. Mike Acton used this exact example in his CppCon video (https://youtu.be/rX0ItVEVjHc?t=1558 at 25:58). I just happened to know that LLVM's DenseMap has this problem.

On the other comment: data locality is a property of a program (well... it's not really a binary thing (i.e., an execution has it or it doesn't) but rather a spectrum; but ok, it will do). For example, a simple definition which I like is in the paper "Improving Data Locality with Loop Transformations" (https://dl.acm.org/doi/pdf/10.1145/233561.233564):

> Data locality is the property that references to the same memory location or adjacent locations are reused within a short period of time.

Changing the layout is just a transformation to achieve this property. So, the two things are different in the sense that one is a property while the other is a transformation, but they're not different in the sense that data layout transformations are not relevant to data locality.

Regarding cache locality, that is also a property, which we could define, say based on this (https://doi.org/10.1109/12.752657), as: "The property that data brought into cache should be reused as much as possible before it is replaced". Data locality depends on the hardware on which the program is executed. Since we use caches in our hardware, to have data locality you usually need to have cache locality, so the two coincide.

Finally, yes, you're absolutely right (well, last time I checked...). LLVM has a bunch of loop transformations implemented---some relevant to data locality---but they're not turned on by default. One reason for that is that LLVM loops are too unstructured, while these transformations usually need nice for-loop nests. They're unstructured on the one hand because they come from, e.g., C, and OTOH because LLVM doesn't have a loop structure, something Chris Lattner has said he regretted.

Tangent: cases like these make me think we're often working on too low level of abstraction, reimplementing relational databases poorly without realizing it.

Like, if you're thinking, "Hey, I need a key-value map so I can efficiently look up Bars given some Foos", then this isn't equivalent to "I need std::map<Foo,Bar>" - it's much closer to "I need a table with Foos and Bars, where Foo is a key, and I need an index on it". Now that may translate to more than one object - e.g. two arrays, one with Foos, the other with pointers to Bars. That achieves the intended data locality.

Now, imagine a new requirement: "I also want to efficiently look up Foos corresponding to a given Bar". Should you add another array to support reverse mapping? Well, in database terms, you've said, "I also need an index on Bar". So yes, add one. And add the standard boilerplate to keep all the arrays in sync.

It feels tedious and arbitrary, and people may call it "premature optimization", but the problem really is that we don't have those basic database concepts built into languages. "I need to map Foos to Bars, primarily for efficient lookup, but also for efficient back-referencing" is a lot of work, so we're tempted to skip the efficiency bits to keep the code simple; however, all that just translates to something like `auto m_bars = Table<Foo, Bar> index(Foo) index(Bar)`. If we could express that directly, we'd skip the boilerplate, compilers could give us very efficient implementations, and we couldn't accidentally let pieces of them fall out of sync.
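
Something like this hand-rolled sketch (hypothetical names; Foo = string, Bar = int; deletion and updates elided) is what we end up writing by hand today:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// A toy "table with two indexes": rows stored once, plus one index per lookup
// direction. The boilerplate is keeping the three containers in sync.
struct FooBarTable {
    std::vector<std::pair<std::string, int>> rows;        // (Foo, Bar)
    std::unordered_map<std::string, std::size_t> by_foo;  // index on Foo
    std::unordered_map<int, std::size_t>         by_bar;  // index on Bar

    void insert(const std::string& foo, int bar) {
        std::size_t row = rows.size();
        rows.emplace_back(foo, bar);
        by_foo[foo] = row;   // every index must be updated together...
        by_bar[bar] = row;   // ...or lookups silently go stale
    }
};
```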

I encourage everyone to try and look out for this pattern. I find it shows up frequently, especially in more complex cases, possibly across object boundaries. I first realized this when I was making a roguelike game with an Entity Component System and realized I needed fast lookup of entities based on X/Y coordinates in the game world. I started adding some data structures to support that, and then it hit me - I'm just adding a spatial index to the entity list. This isn't a complex idea, it's an atomic concept - if you think in database terms instead of the usual programming terms.

Boost.MultiIndex is close to what you want. You define your table schema (as a struct definition), then ask for indices (of various kinds) on specific subsets of your schema. IIRC it doesn't support anything other than row-major layout, but an extension to it is not too far fetched.
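
From memory (double-check against the Boost docs), declaring "a table with an index on each column" looks roughly like this:

```cpp
#include <string>
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/hashed_index.hpp>
#include <boost/multi_index/member.hpp>

namespace bmi = boost::multi_index;

struct Row { std::string foo; int bar; };

// One container, two hashed indexes; Boost keeps them in sync on insert/erase.
using FooBarTable = bmi::multi_index_container<
    Row,
    bmi::indexed_by<
        bmi::hashed_unique<bmi::member<Row, std::string, &Row::foo>>,
        bmi::hashed_non_unique<bmi::member<Row, int, &Row::bar>>>>;

int main() {
    FooBarTable t;
    t.insert(Row{"answer", 42});
    auto& by_bar = t.get<1>();      // the index on Row::bar
    auto it = by_bar.find(42);      // reverse lookup: Bar -> row
    return it != by_bar.end() ? 0 : 1;
}
```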

I dream of a language with first class support for relational tables as the primary data structure.

PL/SQL, half-joking. I will see myself out. :)
This is such an underrated comment haha
Regarding a fast string lookup: you really need a short string optimization here, if not static perfect hashes. With SSO, short strings come directly from the word that's already in the cache.
About C/C++ compilers, mostly.

C++ has terrible template and include compilation problems, for historical reasons.

Yes, but... First, it is what it is. Second, it's not that C++ people made bad decisions in the 80s and the decisions were great afterwards. Take C++ metaprogramming in general, of which templates are a part, and take if constexpr (or static if) as an example, which C++ took straight from Dlang. It is useful for introspection, which is almost indispensable for metaprogramming.

Walter Bright and Andrei Alexandrescu introduced static if in Dlang early and, around 2012, co-authored a static if proposal for C++ (https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n33...). Bjarne Stroustrup, in his infinite wisdom, co-authored a rebuttal that rejected it, only to realize, long overdue, that it's an incredibly useful feature, which eventually arrived in C++17 as if constexpr.
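
For instance, here's a sketch of the kind of compile-time branching `if constexpr` enables (my own toy example, not something from the proposal):

```cpp
#include <string>
#include <type_traits>

// One template handles both cases; the branch not taken for a given T is
// discarded at compile time, so it may contain code that would be ill-formed
// for that T.
template <typename T>
std::string stringify(const T& value) {
    if constexpr (std::is_arithmetic_v<T>) {
        return std::to_string(value);   // only valid for numeric T
    } else {
        return std::string(value);      // assume T converts to std::string
    }
}

// stringify(42) and stringify("hello") both compile from the same template.
```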

Kudos for pointing out D compilers' capabilities; I feel they're quite often left out of similar postings.

Using import std in VC++ gives me some hope this will improve; however, given the current mess of C++ compilers catching up to recent ISO C++ revisions, we are several years away from this mattering in portable code.

If constexpr and static if have different semantics; specifically, if constexpr can't inject names outside its scope, which is what Bjarne was worried about.
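
A small illustration of the restriction (my own example):

```cpp
template <typename T>
void configure() {
    if constexpr (sizeof(T) > 4) {
        using Wide = long long;   // visible only inside this branch
        Wide buffer = 0;
        (void)buffer;
    }
    // 'Wide' and 'buffer' do not exist here. D's static if introduces no new
    // scope, so declarations inside it leak into the enclosing scope - the
    // kind of name injection the C++ committee didn't want.
}
```
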
While true, if reflection actually lands, it will bring symbol injection with it.
I think the injection will still properly respect scoping.
Might be; so far I've only watched the few talks that touched upon reflection ideas and the upcoming roadmap. Some stuff might even only land in C++29, assuming the reflection MVP does indeed land in C++26.
How about reproducibility?

I always took for granted most compilers will generally output consistent binary code given identical inputs. Then the other day I saw a post here about optimization timeouts and such.

To me, this 'problem' of reproducibility from time-outs has been 'solved' in many static analysis tools or solvers with an actual 'steps' counter or some form of tunable, constant-across-runs metric... one way to do it, anyway.
All of the major C/C++ compilers are deterministic functions of their inputs. This is a pretty major property, and any regression is rapidly detected and fixed.

Optimization "timeouts" aren't done in wall time, but in something that can be deterministically counted: number of instructions, number of recurrences up a call stack or expression tree, etc.

> Knowing which paths are hot in any Javascript bytecode is not enough to produce optimized code; you also need to know the types. And you only know the types at runtime. This is the main reason why JIT compilers compile code at runtime.

You can often make good guesses statically. Especially if your JavaScript was produced from something like a TypeScript source.

Well, the truth is there's a lot more to JITing than what I wrote in the article, just because JITing didn't fit the vibe of the rest of the article that well.

Anyway, yes, theoretically you can. Recently, the folks creating Codon (https://github.com/exaloop/codon) did it for Python using a cool form of type inference.

That said, I don't think any mainstream JS VMs do it. Most recent work I know of in AOT compilation for JS seems to use some profile-guided optimization and not static type inference. For example, look at the architecture at 6:38 and also the discussion about types here: https://youtu.be/pVhcyKf3efM?t=469. Something similar seems to be true for the project described here: https://www.youtube.com/watch?v=_T3s6-C38JI

The story seems to be a bit different for hopc: https://www.youtube.com/watch?v=iY1EXHQ6IeQ The video starts by giving good examples of why it is very hard to have good type inference for JS (which agrees with my intuition) but goes on to talk about using type inference anyway. However, the benchmarks in this talk and in their paper are not super convincing.

All that said, JS is not my expertise so I would be interested to know more about what mainstream VMs do in practice and any latest research.

Thanks!

I'm friends with someone working on the 'Faster CPython' project. There's lots of interesting stuff happening there. Both in what makes it into CPython, and even more that they are experimenting with.

Oh, that's cool! I think these are the folks that made the Specializing Adaptive Interpreter which does type specialization at runtime. If these folks have any thoughts on ahead-of-time type inference and/or Codon, please let us know! :)
I wonder if anyone has made a compiler that goes from TypeScript sources to binaries directly, taking advantage of types to create something much faster than Node could?

Also, would it be possible to keep types in JS bytecode so that an JIT can take advantage of that (similar to JVM bytecode)?

For the first question, check this: https://youtu.be/pVhcyKf3efM?t=447 at 7:27. I don't know any details about Porffor, but you may want to look into that.
> You can often make good guesses statically. Especially if your JavaScript was produced from something like a TypeScript source.

No, you really can't. TypeScript types don't give you the sort of data you want for a compiler to make optimization decisions.

I thought the same. For example `for (let i=0; i <1000; i++)`
Great example!

Should you represent the `i` as an int or should you represent it as a double?

Depends on what you do with it!

Say that the body of the loop is just `foo(i)` and you don't know what `foo` is at compile time. Whether it's better for `i` to be an int or a double depends on what `foo` does:

- If `foo` uses `i` as an array index, then you absolutely want `i` to be an int.

- If `foo` uses `i` for float math (say, multiplies it by 1.5), then you absolutely want `i` to be a double.

- If `foo` does a mix of those two things, then who knows. It depends on what happens more.

So, even in your simple example, it's not obvious what types to use when compiling JavaScript.

Why does optimal substructure work for code size? Couldn't you take a compiled non-minimal function and break each instruction into a function call (with some assumptions about the ISA, maybe), then, assuming each single-instruction function is minimal by itself, infer the entirety is minimal - a contradiction?
Ok, I see a nice discussion going here. I may have disproved myself here and saved us some effort. First, just to be sure what we're talking about: the claim is that if you split the function in two and compile the two parts to their minimum sizes, combining them gives you the global minimum. So, we're not starting from assembly programs.

Now, you can compile a loop to a single instruction: https://godbolt.org/z/6E38dx8jr

And, if you split the loop and compile parts of it, you will most certainly not get this single instruction. So, I'm sorry! It seems I have proved myself wrong. I'll try to find some time to revisit this. I hope this helps!
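
Roughly the kind of thing I mean (an illustrative example; details depend on flags and targets): with `clang -O2 -mpopcnt`, LLVM's loop-idiom recognition collapses the whole Kernighan bit-count loop below into (essentially) a single `popcnt` instruction, and compiling fragments of the loop separately could never produce that.

```cpp
// Kernighan's bit-count loop; recognized as a popcount idiom by LLVM.
int count_bits(unsigned x) {
    int count = 0;
    while (x) {
        x &= x - 1;   // clear the lowest set bit
        ++count;
    }
    return count;
}
```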

The definition of optimal substructure in the article isn’t quite right.

A problem can have optimal substructure even if merging two optimal solutions to two sub-problems that compose to form the complete problem does not provide an optimal solution to the complete problem. This happens when there is more than one way of splitting the complete problem into two sub-problems.

For example, the shortest path problem on a DAG has optimal substructure, but if you pick two subproblems by splitting the path at an intermediate node that does not lie on the shortest path, then you will not find an optimal solution by combining these.

When this happens, one needs to optimise over all the possible ways of splitting into sub-problems. This is the basis of dynamic programming.

Thanks, it looks like I had forgotten how dynamic programming works, e.g. constructing from possibly all subproblem solutions rather than just some way of breaking into some disjoint union. In this case I guess for code optimization a "subproblem" needs to be defined. I'm not sure if just breaking the code into chunks works as I was imagining. Maybe optimal substructure applies to some graph-like model of computation but not the naive assumption I made on how this would work.
Or graphically, this sounds like the spring paradox: https://www.youtube.com/watch?v=Cg73j3QYRJc
Sure, but the article doesn't define optimal substructure, nor does it give a general statement about it. It just talks about a specific implication that _would be_ true if code size had optimal substructure as I thought (but based on my comment above, it seems I proved myself wrong).
> Now, let's say you split the function in two parts and you find their minimums independently. Optimal substructure tells us that if you just merge these two, you'll get the minimum version of the whole function.

Optimal substructure doesn't tell us this. I don't know whether or not this problem has optimal substructure, but even if it did, it still wouldn't tell us this.

Optimal substructure means it is possible to construct an optimal solution from optimal solutions to its subproblems; it doesn't mean that optimal solutions to arbitrary pair of subproblems (that compose to form the complete problem) can be used to form an optimal solution, which is what I understood the quoted section of the article to be saying.

Yes, I was surprised at that sentence because I think (treating the theoretical properties of code size as the same as instruction count) that the main reason compiler optimization is non-trivial in the first place is that these kinds of global optimizations are possible, like your loop example.

Also I am really enjoying your article, still reading it with wikipedia open on the side haha.

Nice to hear that! I hope nothing else is questionable. And thanks for pointing it out! I guess the moral of the story is don't do "proofs" in your head after 3am.
> break each instruction into a function call

With values falling into just the right registers magically?

> each single-instruction function

So the RET is also implicit?

Well, in this case taking a non-minimal function and replacing its body with function calls to single-instruction functions would indeed reduce (or leave the same) the code size because some instructions are probably repeated, and now this duplication will be removed. What's the contradiction?

In the article he says "Now, let's say you split the function in two parts and you find their minimums independently", so I am trying to think of what that means. I was thinking something like splitting "func() { ins1; ins2; }" into "ins1func() { ins1; } ins2func() { ins2; } func() { ins1func(); ins2func() }", but I agree this is an unwieldy/unrealistic example, all the register context would need to be passed in, stack memory addressing etc., and this changes instruction counts non-trivially.

So I guess I don't need to think too much of the functionness of splitting up a code-block, I suppose this example contradiction would also apply and more cleanly if just thinking of the compiler considering blocks of code individually e.g. in a greedy algorithm.

(Actually reading the intro again I think I overcomplicated this and he actually does mean just splitting into code blocks. My mistake)

I just realised in the above comment I was making a mistake in an analogy to shortest path optimal substructure. In shortest path, starting with an optimal solution, "subproblems" have optimal solutions. This is not true starting from a non-optimal path, although length-1 subpaths are optimal. But still not sure of a particular way to phrase optimal substructure for code size. Sorry for the confusion!
Nice write-up and overall very informative blog. Kind of off topic, but are you Stefanos, doing your PhD at UIUC? That's my alma mater, I hope you find the academic* environment there stimulating!

*Let's not talk about the social environment of UC.

Indeed, I'm a 4th-year CS PhD student at UIUC. Yes, the environment is great and it's surprising how non-antagonistic it is for a top school. About the social environment, it's exactly as you put it: let's not talk about it haha.
> -O3 produces much faster code than -O2

Also "-Ox gives similar results all the time". Any optimisation is a gamble, some of them quite uncertain gambles. Different cache sizes and management strategies between processing units mean that what is a small benefit for one could be a detriment when the code runs on another, and even if you specifically tune the optimisations for a given set of chips, you are at the mercy of the incoming data, which could invalidate any educated guesses about branch likelihood and so forth.

If you have both an interpreter and compiler, then boostrapping the (self hosted) language is a lot simpler. The interpreter is written in some widely available language like C. The interpreter runs the compiler and standard library code of your language in order to compile that compiler and standard library. After that, those parts execute as native machine code or byte code (with or without JIT).

Without the interpreter, you have to ship the compiled versions of those parts, even to someone building your language from scratch. Or perhaps ship an entire prebuilt package of the language.

An interpreter creates a third opinion on what code should do, resulting in a documentation-interpreter-compiler triangle: in a situation where it is unclear what the right behavior is, weight can be given to whichever behavior two out of those three pieces agree on.

Optimized versus unoptimized isn't quite the same kind of check, because there is a dependency: if the unoptimized translation has a certain behavior and optimization doesn't wreck it (the prevalent case), then of course the two are going to agree. Their agreement doesn't carry weight against documentation. An interpreter and a compiler are almost totally independent, except to the extent that they share run-time support and library functions.

Re: Why does link-time optimization (LTO) happen at link-time?

I think maybe LLVM's ThinLTO is what you're looking for, where whole-program optimization (more or less) happens in the middle end.

I don't know that much about ThinLTO, but ThinLTO just works on summaries instead of the full code, right? It still seems the linker is the one that reads these summaries (doc: https://clang.llvm.org/docs/ThinLTO.html).
As far as I remember from some LLVM talks, apparently only ThinLTO is kind of usable; full LTO doesn't seem to be widely used.

Just my perception.

Good observations, but a more accurate title would be "Common misconceptions about C/C++ compilers". I point this out because I think there is a lot of room for compiler innovation that will come from developing new compilers for new languages, and I don't want us to continue being anchored in the mistakes of the past.
The article has valid points that are specifically not about C++:

- Javascript interpreters JIT at runtime because they don't know which paths are hot ahead of time

- If you have a compiler, you don't need an interpreter

And a bunch of other points that are also valid for non-C++ compilers.

So, I think the title is accurate as is.

> For example, minijava-cpp is a MiniJava compiler written in C++ which has some pretty heavy code in it. It compiles in ~0.7s in my laptop, while linking all these separate files takes much longer.

I can't reproduce this result at all. Using compile.sh it takes my 7950x 0.9s to compile the whole project. A quick meson/ninja setup took 0.8s from a clean build directory, and just 0.043s to link using gnu-ld. Using faster linkers yields even better results: 0.025s for gold and 0.013s for mold. I'm curious how you got your results?

> SQLite, too, compiles everything into a single .c program.

It takes my computer 25 seconds to compile sqlite when making incremental changes. My work's similarly sized C++(!) repo links 3 executables in 0.07s using mold.

I'm having a really hard time seeing how unity builds could possibly be faster. Maybe with an exceedingly terrible linker or spinning rust?

Hey folks, author here. I'm really happy the article has gotten such attention. I hope it's good attention and that you folks learned something! Unfortunately, though, it's _very_ hard to keep track of all the comments in HackerNews. So, if you have a comment you'd like me to reply to and I haven't, please either use the Reddit post: https://www.reddit.com/r/Compilers/comments/1hapetl/common_m...

or head over to my website: https://sbaziotis.com/ for more contact info.

Best, Stefanos

If the author is here: the link from the table of contents to '-O0 gives you fast compilation' is broken.
Thanks, I fixed it!
>Separate compilation is always worth it

>Separate compilation is the idea that you compile a different object file for each source file.22 The reasoning of why this is beneficial is that if you change one source file, you only have to re-compile that one file and "just" link it with the other object files; you don't have to re-compile the whole project.

That reflects my feeling that the linker is a bullshit step.

How would you implement separate compilation then?
I wouldn't use the separate compilation approach at all. I don't like it.
So you would recompile a, say, 10 million lines application from scratch every time you make a small change?
If you put everything in one file you can live your dream! :)