oh I think I know what might cause this: TableGen. The `llvm-tblgen` run time accounts for a good chunk of LLVM build time. In a debug build `llvm-tblgen` is also unoptimized, hence the long run time generating those .inc / .def files. You can enable cmake variable `LLVM_OPTIMIZED_TABLEGEN` to build `llvm-tblgen` in release mode while leaving the rest of the project in debug build (or whatever `CMAKE_BUILD_TYPE` you choose).
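For the record, the incantation looks something like this (generator and paths are just examples; adjust to taste):

```shell
# Debug build overall, but build llvm-tblgen itself in Release mode,
# so generating the .inc/.def files doesn't take forever.
cmake -G Ninja -S llvm -B build \
  -DCMAKE_BUILD_TYPE=Debug \
  -DLLVM_OPTIMIZED_TABLEGEN=ON
```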
I've been compiling it with MSVC on Windows and the swap file after an LLVM compile balloons to ~60 GB. I inserted a couple more sticks to bring my 32 GB to 64 GB and my compile times shortened significantly (I don't have exact stopwatch times right now).
Nah, no way it's more than 5%. 5% is definitely a chunk, but not big enough that I'd notice if you made it completely vanish. I rebuild LLVM with no cache many times a day (debug, release, all targets, no targets, etc.). Debug builds take longer because of all the symbols - LLVM has some enormous symbols due to templating hijinks.
Protips for speeding up building LLVM:
1. LLVM_TARGETS_TO_BUILD=host
2. Set LLVM_ENABLE_PROJECTS to only what you're interested in (do you really care about lld or flang or whatever?)
3. Use clang/lld instead of gcc - way better RAM usage
4. Use ccache
5. Disable tests
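For reference, a configure line putting those tips together might look something like this (a sketch; the project list and compiler paths are just examples):

```shell
# host target only, clang as the only extra project, clang/lld as the
# toolchain, ccache for re-builds, and no test targets.
cmake -G Ninja -S llvm -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_TARGETS_TO_BUILD=host \
  -DLLVM_ENABLE_PROJECTS="clang" \
  -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ \
  -DLLVM_USE_LINKER=lld \
  -DLLVM_CCACHE_BUILD=ON \
  -DLLVM_INCLUDE_TESTS=OFF
```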
> So, we have a single array in which every entry has the key and the value paired together. But, during lookups, we only care about the keys. The way the data is structured, we keep loading the values into the cache, wasting cache space and cycles. One way to improve that is to split this into two arrays: one for the keys and one for the values.
Recently someone proposed this on LLVM: https://discourse.llvm.org/t/rfc-add-a-new-structure-layout-...
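A tiny sketch of the idea from the quote (this has nothing to do with DenseMap's actual implementation - it's just the arrays-of-structs vs. struct-of-arrays split, with made-up names):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Keys and values live in two parallel arrays. A lookup scans only the
// keys array, so cache lines are filled with keys we actually compare
// against, not with values we don't care about until we get a hit.
struct SplitMap {
    std::vector<uint64_t> keys;
    std::vector<uint64_t> values;  // values[i] belongs to keys[i]

    bool lookup(uint64_t k, uint64_t &out) const {
        for (std::size_t i = 0; i < keys.size(); ++i) {
            if (keys[i] == k) {  // only now do we touch the values array
                out = values[i];
                return true;
            }
        }
        return false;
    }
};
```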
Also, I think what you meant by data locality here is really optimizing data layout, which, as you also mentioned, is a hard problem. But if it's just optimizing (cache's) locality, I think the classic loop interchange also qualifies. Though it's not enabled by default in LLVM, despite being there for quite a while.
Well, to be honest, it's not any extraordinary idea and it's not even mine. Mike Acton used this exact example in his CppCon video (https://youtu.be/rX0ItVEVjHc?t=1558 at 25:58). I just happened to know that LLVM's DenseMap has this problem.
On the other comment, data locality is a property of a program (well... it's not really a binary thing, i.e., that an execution either has it or doesn't, but more of a spectrum; but OK, "property" will do). For example, a simple definition which I like is in the paper "Improving Data Locality with Loop Transformations" (https://dl.acm.org/doi/pdf/10.1145/233561.233564):
> Data locality is the property that references to the same memory location or adjacent locations are reused within a short period of time.
Changing the layout is just a transformation to achieve this property. So, the two are different in the sense that one is a property while the other is a transformation, but data layout transformations are absolutely relevant to data locality.
Regarding cache locality, that is also a property, which we could define, based on this (https://doi.org/10.1109/12.752657), say as: "The property that data brought into cache should be reused as much as possible before it is replaced". Data locality depends on the hardware on which the program is executed. Since we use caches in our hardware, to have data locality you usually need to have cache locality, so the two coincide.
Finally, yes, you're absolutely right (well, last time I checked...). LLVM has a bunch of loop transformations implemented - some relevant to data locality - but they're not turned on by default. One reason is that LLVM loops are too unstructured, while these transformations usually need nice for-loop nests. They're unstructured on the one hand because they come from, e.g., C, and OTOH because LLVM doesn't have a loop structure in the IR, something Chris Lattner has said he regretted.
Like, if you're thinking, "Hey, I need a key-value map so I can efficiently look up Bars given some Foos", then this isn't equivalent to "I need std::map<Foo,Bar>" - it's much closer to "I need a table with Foos and Bars, where Foo is a key, and I need an index on it". Now that may translate to more than one object - e.g. two arrays, one with Foos, the other with pointers to Bars. That achieves the intended data locality.
Now, imagine a new requirement, "I also want to efficiently look up Foos corresponding to a given Bar". Should you add another array to support reverse mapping? Well, in database terms, you've said, "I also need an index on Bar". So yes, add one. And add the standard boilerplate to keep all the arrays in sync.
It feels tedious and arbitrary, and people may call it "premature optimization", but the real problem is that we don't have those basic database concepts built into our languages. "I need to map Foos to Bars, primarily for efficient lookup, but also for efficient back-referencing" is a lot of work, so we're tempted to skip the efficiency bits to keep the code simple. However, all of that just translates to something like `auto m_bars = Table<Foo, Bar> index(Foo) index(Bar)`; if we could express that directly, we'd skip the boilerplate, compilers could give us very efficient implementations, and we couldn't accidentally break pieces of them out of sync.
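To make it concrete, here's a hand-rolled sketch of that hypothetical `Table<Foo, Bar>` (all names made up); notice how much of it is exactly the keep-in-sync boilerplate a language-level table could generate for us:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Rows live in one vector; each "index" maps a column value to row
// positions. Every insert has to update all of them by hand.
struct FooBarTable {
    std::vector<std::pair<std::string, int>> rows;       // (Foo, Bar)
    std::unordered_map<std::string, std::size_t> byFoo;  // index on Foo (unique)
    std::unordered_map<int, std::vector<std::size_t>> byBar;  // index on Bar

    void insert(const std::string &foo, int bar) {
        std::size_t i = rows.size();
        rows.emplace_back(foo, bar);
        byFoo[foo] = i;            // boilerplate: keep both indexes in sync
        byBar[bar].push_back(i);
    }

    // "look up Bars given some Foos"
    const int *lookupBar(const std::string &foo) const {
        auto it = byFoo.find(foo);
        return it == byFoo.end() ? nullptr : &rows[it->second].second;
    }

    // the new requirement: "look up Foos corresponding to a given Bar"
    std::vector<std::string> lookupFoos(int bar) const {
        std::vector<std::string> out;
        auto it = byBar.find(bar);
        if (it != byBar.end())
            for (std::size_t i : it->second)
                out.push_back(rows[i].first);
        return out;
    }
};
```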
I encourage everyone to try and look out for this pattern. I find it shows up frequently, especially in more complex cases, possibly across object boundaries. I first realized this when I was making a roguelike game with an Entity Component System, and realized I needed fast lookup of entities based on X/Y coordinates in the game world. I started adding some data structures to support that, and then it hit me - I'm just adding a spatial index to the entity list; this isn't a complex idea, it's an atomic concept - if you think in database terms, instead of the usual programming terms.
I dream of a language with first class support for relational tables as the primary data structure.
C++ has terrible template and include compilation problems, for historical reasons.
Walter Bright and Andrei Alexandrescu introduced `static if` in Dlang early on, and around 2012 they co-authored a static if proposal for C++ (https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n33...). Bjarne Stroustrup, in his infinite wisdom, co-authored a rebuttal that rejected it, only for the committee to realize, long overdue, that it's an incredibly useful feature and introduce it as `if constexpr` in C++17.
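For anyone who hasn't run into it, the C++17 incarnation looks like this (a toy example):

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// With `if constexpr` the not-taken branch is discarded at compile time,
// so it doesn't even have to be valid for this particular T - e.g. the
// std::to_string call below is simply dropped for non-integral types.
template <typename T>
std::string describe(T v) {
    if constexpr (std::is_integral_v<T>)
        return "int-like: " + std::to_string(v);
    else
        return "other";
}
```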
Using `import std` in VC++ gives me some hope this will improve; however, given the current mess of C++ compilers catching up to recent ISO C++ revisions, we are several years away from this mattering in portable code.
I always took it for granted that most compilers will generally output consistent binary code given identical inputs. Then the other day I saw a post here about optimization timeouts and such.
Optimization "timeouts" aren't done in wall time, but in something that can be deterministically counted: number of instructions, number of recurrences up a call stack or expression tree, etc.
You can often make good guesses statically. Especially if your JavaScript was produced from something like a TypeScript source.
Anyway, yes theoretically you can. Recently folks creating Codon (https://github.com/exaloop/codon) did it for Python using a cool form of type inference.
That said, I don't think any mainstream JS VMs do it. The most recent work I know of in AOT compilation for JS seems to use profile-guided optimization rather than static type inference. For example, look at the architecture at 6:38 and also the discussion about types here: https://youtu.be/pVhcyKf3efM?t=469. Something similar seems to be true for the project described here: https://www.youtube.com/watch?v=_T3s6-C38JI
The story seems to be a bit different for hopc: https://www.youtube.com/watch?v=iY1EXHQ6IeQ The video starts by giving good examples of why it is very hard to have good type inference for JS (which agrees with my intuition) but then goes on to talk about using type inference. However, the benchmarks in this talk and in their paper are not super convincing.
All that said, JS is not my expertise so I would be interested to know more about what mainstream VMs do in practice and any latest research.
I'm friends with someone working on the 'Faster CPython' project. There's lots of interesting stuff happening there. Both in what makes it into CPython, and even more that they are experimenting with.
Also, would it be possible to keep types in JS bytecode so that a JIT can take advantage of them (similar to JVM bytecode)?
No, you really can't. TypeScript types don't give you the sort of data you want for a compiler to make optimization decisions.
Should you represent `i` as an int, or should you represent it as a double?
Depends on what you do with it!
Say that the body of the loop is just `foo(i)` and you don't know what `foo` is at compile time. Whether it's better for `i` to be an int or a double depends on what `foo` does:
- If `foo` uses `i` as an array index, then you absolutely want `i` to be an int.
- If `foo` uses `i` for float math (say, multiplies it by 1.5), then you absolutely want `i` to be a double.
- If `foo` does a mix of those two things, then who knows. It depends on what happens more.
So, even in your simple example, it's not obvious what types to use when compiling JavaScript.
Now, you can compile a loop to a single instruction: https://godbolt.org/z/6E38dx8jr
And, if you split the loop and compile its parts separately, you will most certainly not get this single instruction. So, I'm sorry! It seems I have proved myself wrong. I'll try to find some time to revisit this. I hope this helps!
A problem can have optimal substructure even if merging two optimal solutions to two sub-problems that compose to form the complete problem does not provide an optimal solution to the complete problem. This happens when there is more than one way of splitting the complete problem into two sub-problems.
For example, the shortest path problem on a DAG has optimal substructure, but if you pick two subproblems by splitting the path at an intermediate node that does not lie on the shortest path, then you will not find an optimal solution by combining these.
When this happens, one needs to optimise over all the possible ways of splitting into sub-problems. This is the basis of dynamic programming.
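A sketch of that for the DAG example (assuming node ids 0..n-1 already form a topological order - the DP step takes the min over *all* ways of splitting "path to v" into "path to a predecessor u" plus the edge u->v):

```cpp
#include <cassert>
#include <algorithm>
#include <climits>
#include <vector>

struct Edge { int from, to, w; };

// Shortest path distances from src to every node of a DAG.
std::vector<long long> dagShortestPaths(int n, const std::vector<Edge> &edges,
                                        int src) {
    std::vector<std::vector<std::pair<int, int>>> adj(n);  // (to, weight)
    for (const Edge &e : edges)
        adj[e.from].push_back({e.to, e.w});

    std::vector<long long> dist(n, LLONG_MAX);
    dist[src] = 0;
    // Process nodes in topological order; for each node, optimise over
    // every possible "last edge" - i.e., every way of splitting the problem.
    for (int u = 0; u < n; ++u)
        if (dist[u] != LLONG_MAX)
            for (auto [v, w] : adj[u])
                dist[v] = std::min(dist[v], dist[u] + w);
    return dist;
}
```

Picking just one arbitrary split (say, forcing the path through node 2 below) would miss the optimum; minimising over all of them recovers it.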
Optimal substructure doesn't tell us this. I don't know whether or not this problem has optimal substructure, but even if it did, it still wouldn't tell us this.
Optimal substructure means it is possible to construct an optimal solution from optimal solutions to its subproblems; it doesn't mean that optimal solutions to an arbitrary pair of subproblems (that compose to form the complete problem) can be used to form an optimal solution, which is what I understood the quoted section of the article to be saying.
Also I am really enjoying your article, still reading it with wikipedia open on the side haha.
With values falling into just the right registers magically?
> each single-instruction function
So the RET is also implicit?
Well, in this case taking a non-minimal function and replacing its body with function calls to single-instruction functions would indeed reduce (or leave the same) the code size because some instructions are probably repeated, and now this duplication will be removed. What's the contradiction?
So I guess I don't need to think too hard about the function-ness of splitting up a code block; this example contradiction would also apply, and more cleanly, if we just think of the compiler considering blocks of code individually, e.g. in a greedy algorithm.
(Actually reading the intro again I think I overcomplicated this and he actually does mean just splitting into code blocks. My mistake)
*Let's not talk about the social environment of UC.
Also, “-Ox gives similar results all the time”. Any optimisation is a gamble, some of them quite uncertain gambles. Different cache sizes and management strategies between processing units mean that what might be a small benefit on one could be a detriment when the code runs on another. And even if you specifically tune the optimisations for a given set of chips, you are at the mercy of the incoming data, which could invalidate any educated guesses made about branch likelihood and so forth.
Without the interpreter, you have to ship the compiled versions of those parts, even to someone building your language from scratch. Or perhaps ship an entire prebuilt package of the language.
An interpreter creates a third opinion on what code should do, resulting in a documentation-interpreter-compiler triangle, where in a situation in which it is unclear what the right behavior is, weight can be given to the situation that two out of those three pieces agree.
Optimized versus unoptimized isn't exactly the same, because there is a dependency. If unoptimized translation has a certain behavior, and optimization doesn't wreck it (prevalent case), then of course the two are going to agree. Their agreement doesn't carry weight against documentation. An interpreter and compiler are almost totally independent, except for where they get the code to share run-time support and library functions.
I think LLVM's ThinLTO may be what you're looking for, where whole-program optimization (more or less) happens in the middle end.
Just my perception.
- JavaScript interpreters JIT at runtime because they don't know which paths are hot ahead of time
- If you have a compiler, you don't need an interpreter
And a bunch of other points that are also valid for non-C++ compilers.
So, I think the title is accurate as is.
I can't reproduce this result at all. Using compile.sh it takes my 7950x 0.9s to compile the whole project. A quick meson/ninja setup took 0.8s from a clean build directory, and just 0.043s to link using gnu-ld. Using faster linkers yields even better results: 0.025s for gold and 0.013s for mold. I'm curious how you got your results?
> SQLite, too, compiles everything into a single .c program.
It takes my computer 25 seconds to compile sqlite when making incremental changes. My work's similarly sized C++(!) repo links 3 executables in 0.07s using mold.
I'm having a really hard time seeing how unity builds could possibly be faster. Maybe with an exceedingly terrible linker or spinning rust?
or head over to my website: https://sbaziotis.com/ for more contact info.
Best, Stefanos
> Separate compilation is the idea that you compile a different object file for each source file. The reasoning for why this is beneficial is that if you change one source file, you only have to re-compile that one file and "just" link it with the other object files; you don't have to re-compile the whole project.
That reflects my feeling that the linker is a bullshit step.