i wish i could slap people in the face over standard tcp/ip for clickbait. it was ONE package and some gains were not realized by recompilation.
i have to give it to him, i have preloaded jemalloc into one program to swap the malloc implementation, and the results have been very pleasant. not in terms of performance (did not measure), but in stabilizing said application's memory usage. it actually fixed a problem that appeared to be a memory leak, but probably wasn't the fault of the app itself (likely memory fragmentation with the standard malloc)
One easy solution is setting the "magic" environment variable MALLOC_ARENA_MAX=2, which limits the number of malloc arenas.
Another solution is having the application call malloc_trim() regularly, which releases unused memory back to the OS. But this requires application source changes.
https://www.joyfulbikeshedding.com/blog/2019-03-14-what-caus...
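For reference, both tricks are launch-time one-liners, no recompile needed. A quick sketch (the jemalloc package and library path below are what current Ubuntu/Debian ship; ./my-service is just a placeholder for whatever you run):

# swap the allocator for a single process via preload
sudo apt-get install libjemalloc2
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./my-service

# or keep glibc malloc but cap the number of arenas
MALLOC_ARENA_MAX=2 ./my-service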
in case someone is interested: https://github.com/Icinga/icinga2/issues/8737
(basically using jemalloc was the big fix)
https://icinga.com/docs/icinga-2/latest/doc/15-troubleshooti...
-O3 isn't buggy on either GCC or Clang. Are you thinking of -Ofast and/or -ffast-math that disregard standards compliance? Those aren't part of -O3.
EDIT: first one is -funswitch-loops, though
Micro-benchmarks would be testing e.g. a single library function or syscall rather than the whole application. This is the whole application, just not one you might care that much for the performance of.
Other applications will of course see different results, but stuff like enabling LTO, tuning THP and picking a suitable allocator are good, universal recommendations.
In his case, even a gain of ~20% was significant. It translated into enough extra encoding capacity for a few thousand more video files per year.
Reminds me of back in the day, when I was messing around with Blender's CMake config files quite a bit. I noticed the Fedora package was using the wrong flag -- some sort of debug-only flag intended for developers instead of whatever they thought it was. I mentioned this to the package maintainer, it was confirmed by a sub-maintainer (or whoever), and the maintainer absolutely refused to change it, because the spelling of the two flags was close enough that they could just say "go away, contributing Blender dev, you have no idea what you're talking about." I wouldn't doubt the Fedora package still has the same mistaken flag to this day, and all of this occurred something like 15 years ago.
So, yeah, don't release debug builds if you're a distro package maintainer.
Also, what about reallocation strategy? Some programs preallocate and never touch malloc again, others constantly release and acquire. How well do they handle fragmentation? What is the uptime (10 seconds or 10 years)? Sometimes the choice of allocators is the difference between long term stability vs short term speed.
I experimented with different allocators while developing a video editor that caches 4K video frames. At 32 MB per frame and 60 fps, that's almost 2 GB per second per track. You quickly hit allocator limitations, and realise that at least the vanilla glibc allocator offers the best long-term stability. But for short-running benchmarks it's the slowest.
As already pointed out, engineering is a compromise.
1. Fragmentation: MIMalloc and the newest TCMalloc definitely handle this better than glibc. This is well established in many many many benchmarks.
2. In terms of process lifetime, MIMalloc (Microsoft Cloud) and TCMalloc (Google Cloud) are designed to be run for massive long-lived services that continually allocate/deallocate over long periods of time. Indeed, they have much better system behavior in that allocating a bunch of objects & then freeing them actually ends up eventually releasing the memory back to the OS (something glibc does not do).
> However, there is an app or two where glibc stability doesn't trigger a pathological use case, and you have no choice.
I’m going to challenge you to please produce an example with MIMalloc or the latest TCMalloc (or heck - even any real data point from some other popular allocator, vs vague anecdotes). This simply is not something these allocators suffer from, and these would be major bugs the projects would solve.
Say I have three libraries, biz, bar, and bif, and each has bugs. You've used biz. When the system was unstable, you debugged until it worked (either by finding the real bug, or by an innocuous change).
If you switch libraries, the bugs which go away are the ones which aren't affecting you, since your software works. On the other hand, you have a whole new set of bugs.
This comes up when upgrading libraries too. More bugs are usually fixed than introduced, but there's often a debug cycle.
A huge chunk of a modern GPU driver is part of the calling process, loaded like a regular library. Just spot-checking Chrome's GPU process, there are dozens of threads created by a single 80+ MB NVIDIA DLL. And this isn't unusual: every GPU driver has massive libraries loaded into the app using the GPU - often including entire copies of LLVM for things like shader compilers.
https://www.dropbox.com/scl/fi/evnn6yoornh9p6l7nq1t9/Irrepro...
I’d be careful extrapolating from just one benchmark but generally if I had to choose I’d pick the new tcmalloc if I could. It seems to be a higher quality codebase.
I do note that rpmalloc (old), Hermes (not public? doesn't compare with mimalloc) and also snmalloc (also Microsoft) have benchmarks of their own showing themselves to be best in some circumstances.
https://github.com/mjansson/rpmalloc-benchmark
https://arxiv.org/abs/2109.02922
https://github.com/SchrodingerZhu/bench_suite/blob/master/ou...
Not OP, but the following logic shows why this claim is bogus. In short: If two non-garbage-collecting memory allocators do anything differently -- other than behave as perfect "mirror images" of each other, so that whenever one allocates byte i, the other allocates byte totalMem-i -- then there exists a program that crashes on one but not the other, and vice versa.
In detail:
If 2 allocators do not allocate exactly the same blocks of memory to the same underlying sequence of malloc() or free() calls, then there exists a program which, if built twice, once using each allocator, and then each executable is run with the same input, will after some time produce different patterns of memory fragmentation.
The first time this difference appears -- let's say, after the first n calls to either malloc() or free() -- the two executables will have the same total number of bytes allocated, but the specific ranges of allocated bytes will be different. The nth such call must be a malloc() call (since if it were a free() call, and allocated ranges were identical after the first n-1 such calls, they would still be identical after the first n, contradicting our assumption that they are different). Then for each executable, this nth malloc() call either allocates a block at or some distance past the end, or it subdivides some existing free block. We can remove the latter possibility (and simplify the proof) by assuming that there is no more memory available past the end of the highest byte thus far allocated (this is allowed, since a computer with that amount of memory could exist).
Now have both programs call free() on every allocated block except the one allocated in operation n. Let the resulting free range at the start of memory (before the sole remaining allocated block) have total length s1 in executable 1 and s2 in executable 2, and let the resulting free range at the end of memory (after that sole remaining allocated block) have length e1 in executable 1 and e2 in executable 2. By assumption, s1≠s2 and e1≠e2. Now have both executables call malloc() twice, namely, on s1 and e1 in descending order. Then, unless s1=e2, executable 1 can satisfy both malloc()s, but executable 2 can satisfy only the first. Similarly, calling malloc() on s2 and e2 in decreasing order will succeed in executable 2 but not executable 1, again unless s1=e2 holds.
What if s1=e2 does hold, though? This occurs when, say, one executable allocates the block 100 bytes from the start of memory, while the other allocates it 100 bytes from the end. In this case, all we need is to keep some second, symmetry-breaking block around at the end in addition to the block allocated by operation n -- that is, a block for which it does not hold that one allocator allocates the mirror-image memory range of the other. (If no such block exists, then the two allocators are perfect mirror images of each other.)
Also, nothing you’ve said actually says that the other allocator will be the worse one. Indeed, glibc is known to hold onto memory longer and have more fragmentation than allocators like mimalloc and tcmalloc so I’m still at a loss to understand how even if what you wrote is correct (which I don’t believe it is) that it follows that glibc is the one that won’t crash. If you’re confident in your proof by construction, please post a repro that we can all take a look at.
Swap space is finite too.
> Also, nothing you’ve said actually says that the other allocator will be the worse one.
I'm not claiming that either is worse. I'm showing mathematically that for any two allocators that behave differently at all (with the one tiny exception of a pair of allocators that are perfect mirror images of each other), it's possible to craft a program that succeeds on one but fails on the other.
I didn't say so explicitly as I thought it was obvious, but the upshot is: it's never completely safe to just change the allocator. Even if 99% of the time one works better than the other, there's provably a corner case where it will fail but the other does not.
1. If malloc() is called when there exists a contiguous block of free memory with size >= the argument to malloc(), the call will succeed. I think you'll agree that this is reasonable.
2. Bookkeeping (needed at least for tracking the free list, plus any indexes on top) uses the same amount of malloc()able memory in each allocator. I.e., if malloc(x) at some point in time reduces the number of bytes that are available to future malloc() calls by y >= x bytes under allocator 1, it must reduce the number of bytes available to future malloc() calls by y under allocator 2 if called at the same point in time as well. This may not hold exactly in practice, but it's a very good approximation -- it's possible to store the free list "for free" by using the first few bytes as next and prev pointers in a doubly linked list.
To head another possible objection off at the pass: If the OS allocates lazily (i.e., it doesn't commit backing store at malloc() time, instead waiting till there is an actual access to the page, like Linux does), this doesn't change anything: Address space (even 64-bit address space) is still finite, and that is still being allocated eagerly. In practice, you could craft the differentially crashing program to crash much faster if you call memset() immediately on every freshly malloc()ed block to render this lazy commit ineffective -- then you would only need to exhaust the physical RAM + swap, rather than the complete 64-bit virtual address space.
This allocation pattern is unlikely to show up in any real application except at the absolute limit where you're exhausting RAM and the OOM killer gets involved. Even then, I think you're not going to see the allocator be much of a differentiating factor.
I also work with large (8K) video frames [1]. If you're talking about the frames themselves, 60 allocations per second is nothing. In the case of glibc, it's slow for just one reason: each allocation exceeds DEFAULT_MMAP_THRESHOLD_MAX (= 32 MiB on 64-bit platforms), so (as documented in the mallopt manpage), you can not convince glibc to cache it. It directly requests the memory from the kernel with mmap and returns it with munmap each time. Those system calls are a little slow, and faulting in each page of memory on first touch is in my case slow enough that it's impossible to meet my performance goals.
The solution is really simple: use your own freelist (on top of the general-purpose allocator or mmap, whatever) for just the video frames. It's a really steady number of allocations that are exactly the same size, so this works fine.
[1] in UYVY format, this is slightly under 64 MiB; in I420 format, this is slightly under 48 MiB.
1: for a certain point of view
Do you have a minimal reproducing example?
There's one other element I didn't mention in my previous comment, which is a thread handoff. It may be significant because it trashes any thread-specific arena and/or because it introduces a little bit of variability over a single malloc at a time.
For whatever reason the absolute rate on my test machine is much higher than in my actual program (my actual program does other things with a more complex threading setup, has multiple video streams, etc.) but you can see the same effect of hitting the mmap, munmap, and page fault paths that really need not ever be exercised after program start.
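If anyone wants to see this on their own workload, counting the syscalls and faults for a run makes the pattern obvious. A sketch (./your_player clip.mov is a placeholder command):

strace -f -c -e trace=mmap,munmap ./your_player clip.mov
perf stat -e page-faults,minor-faults ./your_player clip.mov

With a size-specific freelist in front of the general-purpose allocator, both counts should collapse to roughly a fixed startup cost.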
In my actual (Rust-based) program, adding like 20 lines of code for the pooling was a totally satisfactory solution and took me less time than switching general-purpose allocator, so I didn't try others. (Also, my program supports aarch64 and iirc the vendored jemalloc in the tikv-jemallocator crate doesn't compile cleanly there.)
But then when I read this top comment, it makes me concerned I've completely misunderstood the article. From the tone of this comment, I assume that I shouldn't ever do what's talked about in this gist and it's a terrible suggestion that overlooks all these complexities that you understand and have referenced with rhetorical-looking questions.
Any chance you could help me understand if the original gist is good, makes any legitimate points, or has any value at all? Because I thought it did until I saw this was the top comment, and it made me realise I'm not smart enough to be able to tell. You sound like you're smart enough to tell, and you're telling me only bad things.
`jq` is a command-line program that fires up to do one job, and then dies. For such a program, the only property we really want to optimise is execution speed. We don't care about memory leaks, or how much memory the process uses (within reason). `jq` could probably avoid freeing memory completely, and that would be fine. So using a super-stupid allocator is a big win for `jq`. You could probably write your own and make it run even faster.
But for a program with different runtime characteristics, the results might be radically different. A long-lived server program might need to avoid memory bloat more than it needs to run fast. Or it might need to ensure stability more than speed or size. Or maybe speed does matter, but it's throughput speed rather than latency. Each of those cases need to be measured differently, and may respond better to different optimisation strategies.
The comment that confused you is just trying to speak a word of caution about applying the article's recipe in a simplistic way. In the real world, optimisation can be quite an involved job.
I think that's what confused and irritated me. There's a lot of value and learning in the gist - I've used JQ in my previous jobs regularly, this is the real world, and valuable to many. But the top comment (at the time I responded) is largely rhetorically trashing the submission based on purely the title.
I get that the gist won't make _everything_ faster: but I struggle to believe that any HN reader would genuinely believe that's either true, or a point that the author is trying to make. The literal first sentence of the submission clarifies the discussion is purely about JQ.
Anyone can read a submission, ignore any legitimate value in it, pick some cases the submission wasn't trying to address, and then use those cases to rhetorically talk it down. I'm struggling to understand why/how that's bubbling to the top in a place of intellectual curiosity like HN.
Edit: I should practice what I preach. Conversation and feedback which is purely cautionary or negative isn't a world that anyone really wants! Thanks for the response, I really appreciated it:) It was helpful in confirming my understanding that this submission does genuinely improve JQ on Ubuntu. Cautionary is often beneficial and necessary, and I think the original comment I responded to could make a better world with a single sentence confirming that this gist is actually valuable in the context it defines.
Haphazard multithreading is not a sane default.
I understand a million decisions have been made so that we can’t go flip that switch back off, but we’ve got to learn these lessons for the future.
Even glibc claims to be multithreading-safe, even if it tends not to return or reuse all freed memory.
Write in a language that makes sense for the project. Then people tell you that you should have used this other language, for reasons.
Use a compression algo that makes sense for your data. Then people will tell you why you are stupid and should have used this other algo.
My most recent memory of this was needing to compress specific long json strings to fit in Dynamo. I exhaustively tested every popular algo, and Brotli came out far ahead. But that didn't stop every passerby from telling me that zlib is better.
It's rather exhausting at times...
After the initial setup, it's pretty simple and easy to use. I remember making a ton of friends in the Gentoo Linux channel on Matrix; those were fun times.
Fun fact: the initial ChromeOS was basically just a custom Gentoo Linux install. I'm not sure if they still use Gentoo Linux internally.
That's true but worth noting that "optimize" here doesn't necessarily refer to performance.
I've been using Gentoo for 20 years and performance was never the reason. Gentoo is great if you know how you want things to work. Gentoo helps you get there.
Slackware was very manual, and some bits drowned in its low-level, long command chains. Gentoo felt easy but highlighted dependencies, with a hard cost attached in compilation time.
Being a newb back then, I enjoyed the user-friendliness combined with access to the machinery beneath. The satisfaction of a 1 s boot-time speedup, the result of 48+ hours of compilation, was unparalleled, too ;)
A new hobby
> The goal of Gentoo is to have an operating system that builds all programs from source, instead of having pre-built binary packages. While this does allow for advanced speed and customizability, it means that even the most basic components such as the kernel must be compiled from source. It is known throughout the Linux community as being a very complex operating system because of its daunting install process. The default Gentoo install boots straight to a command prompt, from which the user must manually partition the disk, download a package known as a "Stage 3 tarball", extract it, and build the system up by manually installing packages. New or inexperienced users will often not know what to do when they boot into the installer to find there is no graphical display. Members of /g/ will often exaggerate the values of Gentoo, trying to trick new users into attempting to install it.
Afterwards I moved to ArchLinux, and that has been mostly fine for me.
If you are using a fairly standard processor, then Gentoo shouldn't give you that much of an advantage?
Yes, Gentoo is great. I'm just saying that for me it was too much of a temptation.
These are the Arch packages built for x86-64-v2, x86-64-v3 and x86-64-v4, which are basically names for different sets of x86-64 extensions. Selecting the highest level supported by your processor should get you most of the way to -march=native, without the hassle of compiling it yourself.
It also enables -O3 and LTO for all packages.
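If you're not sure which level your CPU supports, recent glibc will tell you directly, and GCC can too. A sketch (the loader path is the usual x86-64 one and may differ slightly per distro):

/lib64/ld-linux-x86-64.so.2 --help | grep supported
# prints lines like: x86-64-v3 (supported, searched)

gcc -march=native -Q --help=target | grep march=
# shows what -march=native resolves to on this machine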
LTO is great, but I have my doubts about -O3 (vs the more conservative -O2).
UPDATE: bah, ALHP repos don't support the nvidia drivers. And I don't want to muck around with setting everything up again.
Another update: I moved to nvidia-open, so now I can try the suggested repos.
Doing this should allow you to use as many optimized packages as possible while still being able to install packages not supported by ALHP.
Gentoo is more stable than Arch by default, though. It's not actually a bleeding edge distro, but you can choose to run it that way if you wish. Gentoo is about choice.
I can believe that.
> Gentoo is more stable than Arch by default, though. It's not actually a bleeding edge distro, but you can choose to run it that way if you wish. Gentoo is about choice.
I actually had way more trouble with stuff breaking with Ubuntu. That's because every six months, when I did the distro upgrade, lots of stuff broke at once and it was hard to do root cause analysis.
With a rolling distribution, it's usually only one thing breaking at a time.
I found Arch Linux to be more stable than Gentoo, but that is just my own experience.
https://en.m.wikipedia.org/wiki/Microsoft_POSIX_subsystem
The Zircon kernel does not support signals, so basic C is not going to work well.
"It is heavily inspired by Unix kernels, but differs greatly. For example, it does not support Unix-like signals, but incorporates event-driven programming and the observer pattern."
https://en.m.wikipedia.org/wiki/Fuchsia_(operating_system)#K...
Kidding... honestly that was a pretty fun distribution to play around with ~20 years ago. The documentation was really good and it was a great way to learn how a lot of the pieces of a Linux distribution fit together.
I was never convinced that the performance difference was really noticeable, though.
https://www.shlomifish.org/humour/by-others/funroll-loops/Ge...
> * SECURITY UPDATE: Fix multiple invalid pointer dereference, out-of-bounds write memory corruption and stack buffer overflow.
(that one was for CVE-2017-9224, CVE-2017-9226, CVE-2017-9227, CVE-2017-9228 and CVE-2017-9229)
Can you explain a little more? Search has failed me on this one.
Also I think in earlier days the argument to build was so you can optimize the application for the specific capabilities of your system like the supported SIMD instruction set or similar. I think nowadays that is much less of a factor. Instead it would probably be better to do things like that on a package or distribution level (i.e. have one binary distribution package prebuilt by the distribution for different CPU capabilities).
Or do you see something deeper that ensures that the distro libonig is actually the one that gets used?
That is, I don't want to devalue the CVE system; but it is also undeniable that there are major differences in impact between findings?
You could, though. It's 99.9% stuff like this!
https://www.youtube.com/watch?v=gG4BJ23BFBE is a presentation that best represents my view on the kind of mindset that's long overdue to become the new norm in our industry.
Assertions in release builds are a bad idea since they can be fairly expensive. It is a better idea to have a different variety of assertion like the verify statements that OpenZFS uses, which are assertions that run even in release builds. They are used in situations where it is extremely important for an assertion to be done at runtime, without the performance overhead of the less important assertions that are in performance critical paths.
I think it's a philosophical difference of opinions and it's one of the things that drive Rust, Go, C# etc. ahead - not merely language ergonomics (I hope Zig ends up as the language that replaces C). The society at large is mostly happy to take a 1-3% perf hit to get rid of buffer overflows and other UB-inducing errors.
But I agree with you on not having "expensive" asserts in releases.
I don't need to outrun the bear. I just need to outrun you.
"ASLR obscures the memory layout. That is security by obscurity by definition. People thought this was okay if the entropy was high enough, but then the ASLR⊕Cache attack was published and now its usefulness is questionable."
Usually when a comment is removed, it's pretty obvious why, but in this case I'm really not seeing it at all. I read up (briefly) on the mentioned attack and can confirm that the claims made in the above comment are at the very least plausible sounding. I checked other comments from that user and don't see any other recent ones that were removed, so it doesn't seem to be a user-specific thing.
I realize this is completely off-topic, but I'd really like to understand why it was removed. Perhaps it was removed by mistake?
If this is indeed what happened, it seems like a bad thing that it's even possible. Since many, perhaps most people probably don't have showdead enabled, it means that the 'flag' option is effectively a mega-downvote.
ASLR/KASLR intends to make attackers' lives harder by making the offsets of known data structures inconsistent. It's not obscuring a security flaw; instead it reduces the effectiveness of any single attack run.
The ASLR attack that I believe is being referenced is specific to abuse within the browser, running in a single process. This single attack vector does not mean that KASLR is globally ineffective.
Your quote has some choice words, but it's contextually poor.
Most times in C, each fork() (rather than thread) has a differential address space, so it's actually less severe than you think.
I am not sure why anyone would attempt what you described, for the exact reason you stated. It certainly is not what I had in mind.
> The correct definition is simply "using obscurity as a security defense mechanism", nothing more.
This is just restating the term in more words without defining the core concept in context (“obscurity”).
It's useful to ask what the point being conveyed by the phrase is. Typically (at least as I've encountered it) it's that you are relying on secrecy of your internal processes. The implication is usually that your processes are not actually secure - that as soon as an attacker learns how you do things the house of cards will immediately collapse.
Also IIUC to perform AnC you need to already have arbitrary code execution. That's a pretty big caveat for an attacker.
I think the middle ground is to call the effectiveness of ASLR questionable. It is no longer the gold standard of mitigations that it was 10 years ago.
Think of it this way - if I guess the ASLR address once, a restart of the process renders that knowledge irrelevant implicitly. If I get your IPv6 address once, you’re going to have to redo your network topology to rotate your secret IP. That’s the distinction from ASLR.
I think the key feature of the IPv6 address example is that you need to expose the address in order to communicate. The entire security model relies on the attacker not having observed legitimate communications. As soon as an attacker witnesses your system operating as intended the entire thing falls apart.
Another way to phrase it is that the security depends on the secrecy of the implementation, as opposed to the secrecy of one or more inputs.
The problem with that line of reasoning is that it implies that data handling practices can determine whether or not a given scheme is security through obscurity. But that doesn't fit the prototypical example where someone uses a super secret and utterly broken home rolled "encryption" algorithm. Nor does it fit the example of someone being careless with the key material for a well established algorithm.
The key defining characteristic of that example is that the security hinges on the secrecy of the blueprints themselves.
I think a case can also be made for a slightly more literal interpretation of the term where security depends on part of the design being different from the mainstream. For example running a niche OS making your systems less statistically likely to be targeted in the first place. In that case the secrecy of the blueprints no longer matters - it's the societal scale analogue of the former example.
I think the IPv6 example hinges on the semantic question of whether a network address is considered part of the blueprint or part of the input. In the ASLR analogue, the corresponding question is whether a function pointer is part of the blueprint or part of the input.
Necessary but not sufficient condition. For example, if I’m transmitting secrets across the wire in plain text that’s clearly security through obscurity even if you’re relying on an otherwise secure algorithm. Security is a holistic practice and you can’t ignore secrets management separate from the algorithm blueprint (which itself is also a necessary but not sufficient condition).
I think the semantics are being confused due to an issue of recursively larger boundaries.
Consider the system as designed versus the full system as used in a particular instance, including all participants. The latter can also be "the system as designed" if you zoom out by a level and examine the usage of the original system somewhere in the wild.
In the latter case, poor secrets management being codified in the design could in some cases be security through obscurity. For example, transmitting in plaintext somewhere the attacker can observe. At that point it's part of the blueprint and the definition I referred to holds. But that blueprint is for the larger system, not the smaller one, and has its own threat model. In the example, it's important that the attacker is expected to be capable of observing the transmission channel.
In the former case, secrets management (ie managing user input) is beyond the scope of the system design.
If you're building the small system and you intend to keep the encryption algorithm secret, we can safely say that in all possible cases you will be engaging in security through obscurity. The threat model is that the attacker has gained access to the ciphertext; obscuring the algorithm only inflicts additional cost on them the first time they attack a message secured by this particular system.
It's not obvious to me that the same can be said of the IPv6 address example. Flippantly, we can say that the physical security of the network is beyond the scope of our address randomization scheme. Less flippantly, we can observe that there are many realistic threat models where the attacker is not expected to be able to snoop any of the network hops. Then as long as addresses aren't permanent it's not a one time up front cost to learn a fixed procedure.
> Function pointer addresses are not meant to be shared
Actually I'm pretty sure that's their entire purpose.
> they hold 0 semantic meaning or utility outside a process boundary (modulo kernel).
Sure, but ASLR is meant to defend against an attacker acting within the process boundary so I don't see the relevance.
How the system built by the programmer functions in the face of an adversary is what's relevant (at least it seems to me). Why should the intent of the manufacturer necessarily have a bearing on how I use the tool? I cannot accept that as a determining factor of whether something qualifies as security by obscurity.
If the expectation is that an attacker is unable to snoop any of the relevant network hops then why does it matter that the address is embedded in plaintext in the packets? I don't think it's enough to say "it was meant to be public". The traffic on (for example) my wired LAN is certainly not public. If I'm not designing a system to defend against adversaries on my LAN then why should plaintext on my LAN be relevant to the analysis of the thing I produced?
Conversely, if I'm designing a system to defend against an adversary that has physical access to the memory bus on my motherboard then it matters not at all whether the manufacturer of the board intended for someone to attach probes to the traces.
> The correct definition is simply "using obscurity as a security defense mechanism", nothing more.
Also stated as "security happens in layers", and often obscurity is a very good layer for keeping most of the script kiddies away and keeping the logs clean. My personal favorite example is using a non-default SSH port. Even if you keep it under 1024, so it's still on a root-controlled port, you'll cut down the attacks by an order of magnitude or two. It's not going to keep the NSA or MSS out, but it's still effective in pushing away the common script kiddies. You could even get creative and play with port knocking - that keeps even the under-1024 port's logs clean.
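A minimal sketch of the non-default-port idea, assuming OpenSSH on a Debian-style system (the port number is arbitrary, just keep it below 1024):

# /etc/ssh/sshd_config
Port 922

sudo systemctl restart ssh   # the unit is named sshd on some distros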
Security by obscurity is about the bad practice of thinking that obscuring your mechanisms and implementations of security increases your security. It's about people that think that by using their nephew's own super secret unpublished encryption they will be more secure than by using hardened standard encryption libraries.
Security through obscurity is when you run your sshd server on port 1337 instead of 22 without actually securing the server settings down, because you don’t think the hackers know how to portscan that high. Everyone runs on 22, but you obscurely run it elsewhere. “Nobody will think to look.”
ASLR is nothing like that. It’s not that nobody thinks to look, it’s that they have no stable gadgets to jump to. The only way to get around that is to leak the mapping or work with the handful of gadgets that are stable. It’s analogous to shuffling a deck of cards before and after every hand to protect against card counters. Entire cities in barren deserts have been built on the real mathematical win that comes from that. It’s real.
Any shuffling of a deck of cards by Alice is pointless if Bob can inspect the deck after she shuffles them. It makes ASLR not very different from changing your sshd port. In both cases, this describes the security:
https://web.archive.org/web/20240123122515if_/https://www.sy...
Words have meaning, god damn it! ASLR is not security through obscurity.
Edit: I was operating under the assumption that “AnC” was some new hotness, but no, this is the same stuff that’s always been around, timing attacks on the caches. And there’s still the same solution as there was back then: you wipe the caches out so your adversaries have no opportunity to measure the latencies. It’s what they always should have done on consumer devices running untrusted code.
I used to think this, but hearing about the AnC attack changed my mind. I have never heard of anyone claiming to mitigate it.
https://web.archive.org/web/20240123122515if_/https://www.sy...
I haven't used a non-libc malloc before but I suspect the same applies.
If you as an individual avoid being at all different, then you are in the most company and will likely have the most success in the short term.
But it's also true that if we all do that then that leads to monoculture and monoculture is fragile and bad.
It's only because of people building code in different contexts (different platforms, compilers, options, libraries, etc...) that code ever becomes at all robust.
A bug that you mostly don't trigger because your platform or build flags just happens to walk just a hair left of the hole in the ground, was still a bug and the code is still better for discovering and fixing it.
We as individuals all benefit from code being generally robust instead of generally fragile.
I also suspect that any application using floats is more likely to have rough edges?
It also used to happen that just changing processors was likely to find some problems in the code. I have no doubt that still happens, but I'd also expect it has reduced.
Some of this has to be a convergence on far fewer compilers than we used to encounter. I know there are still many c compilers. Seems there are only two common ones, though. Embedded, of course, is a whole other bag of worms.
I do think I saw improvements. But I never got numbers, so I'm assuming most of my feel was wishful thinking. Reality is a modern computer is hella fast for something like emacs.
I did see compilation mode improve when I trimmed down the regexes it watches for to only the ones I knew were relevant for me. That said, I think I've stopped doing that; so maybe that is a lot better?
From an end-developer perspective: I have no particular familiarity with mimalloc, but I know jemalloc has pretty extensive debugging functionality (not API-compatible with glibc malloc, of course).
cargo install --locked jaq
(you might also be able to add RUSTFLAGS="-C target-cpu=native" to enable optimizations for your specific CPU family)
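Putting the two together (assuming a working Rust toolchain; note the resulting binary is then tied to your CPU's feature level):

RUSTFLAGS="-C target-cpu=native" cargo install --locked jaq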
"cargo install" is an underrated feature of Rust for exactly the kind of use case described in the article. Because it builds the tools from source, you can opt into platform-specific features/instructions that often aren't included in binaries built for compatibility with older CPUs. And no need to clone the repo or figure out how to build it; you get that for free.
jaq[1] and yq[2] are my go-to options anytime I'm using jq and need a quick and easy performance boost.
Every once in a while I test jaq against jq and gojq with my jq solution to AoC 2022 day 13 https://gist.github.com/oguz-ismail/8d0957dfeecc4f816ffee79d...
It's still behind both as of today
apt-get source jq
Then go into the package and recompile to your heart's content. You can even repackage it for distribution or archiving. You'll get a result much closer to the upstream Ubuntu package, as opposed to getting lots of weird errors and misalignments.
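Roughly, the whole loop looks like this. A sketch, assuming deb-src lines are enabled and the package uses debhelper/dpkg-buildflags so the DEB_*_APPEND variables are honoured (flags are illustrative):

sudo apt-get build-dep jq
apt-get source jq
cd jq-*/
DEB_CFLAGS_APPEND="-O3 -march=native" DEB_LDFLAGS_APPEND="-flto" \
  dpkg-buildpackage -b -us -uc
sudo apt install ../jq_*.deb   # plus ../libjq1_*.deb if it complains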
It's actually a little bit interesting, if you are interested in how we use language. You could argue that now you now get 90% more work done in the same amount of time, and that would align with other 'speed' units that we commonly use (miles per hour, words per minute, bits per second). However, the convention in computer performance is to measure time for a fixed amount of work. I would guess that this is because generally we have a fixed amount of work and what might vary is how long we wait for it (and that is absolutely true in the case of this blog post) so we put time in the numerator.
It's a very interesting post and very well done, but it's not 90% faster.
Plus, if you were using units of time, you wouldn’t use the word “faster.” “Takes 45% less time” and “45% faster” are very different assertions, but they both have meaning, both in programming and outside it.
I think, generally, we fix on the earlier value when talking about the change over time of a characteristic. "This stock went up 100%, this stock went down 50%": in both cases it's the earlier measurement that is taken as the unit. That makes this a 45% reduction in the time to do the work, and that's actually what they measured.
When talking about comparisons between two things that aren't time dependent it depends on if we talk in multiples or percents I think. A twin bed is half as big as a king bed. A king bed is twice as big as a twin bed. Both are idiomatic. A king bed is 100% bigger than a twin bed. Yes, you could talk like this. A twin bed is 100% smaller than a king bed. Right away you say wait, a twin bed isn't 0 size! Because we don't talk in terms of the smaller thing when talking about decreasing percents, only increasing. A twin bed is 50% smaller than a king bed (iffy). A twin bed is 50% as big as a king bed. There, that's idiomatic again.
I would rephrase your comment as "the package takes 45% less time to process a given data set".
The article doesn't seem to even consider that option and I don't see any comment here mentioning this either. Am I missing something?
I wonder if the glibc allocator is standard there.
In all seriousness though, are you sure some of this isn’t those blocks being loaded into some kind of file system cache the second and third times?
How about if you rebooted and then ran the mimalloc version?
See this: https://en.wikipedia.org/wiki/Tacit_programming#jq
$ jq --null-input '[1, 2] | add'
3
This is much more intuitive now.
Should probably work for Debian and RedHat too. For this particular package.
Edit: based just on the title, I initially thought this was an article about turning Ubuntu into Linux From Scratch.
ch -q "WITH arrayJoin(features) AS f SELECT f.properties.SitusCity WHERE f.properties.TotalNetValue < 193000 FROM 'data.json'"
Reference: https://jsonbench.com/
Try this (the placement of FROM was incorrect):
ch "WITH arrayJoin(features) AS f SELECT f.properties.SitusCity FROM 'a.json' WHERE f.properties.TotalNetValue < 193000"
And speaking of tradeoffs: the default number of arenas in glibc malloc is 8 times the number of CPUs, which is a terrible tradeoff - on many workloads it causes heap fragmentation and memory usage (RSS) many times higher than the allocated memory size. That's why it is common to find advice to set MALLOC_ARENA_MAX to 1 or 2. But probably such a high number of arenas allows glibc to look less bad on synthetic benchmarks.
Jemalloc, tcmalloc and mimalloc were all created with a focus on multi-threaded applications from the beginning, and while they don't work better than glibc malloc for single-threaded applications, they don't work worse for that use case either. Probably the main disadvantage of using je/tc/mi malloc for a single-threaded app is the larger code size.
we're using it in feldera and it's been giving us (slightly) better performance than jemalloc, which is what we used before mimalloc
I realize, of course, that this could just be overall optimal. Feels more likely that the allocation patterns of the application using it will impact which is the better allocator?
I recall trying with march=native and seeing some improvement, but not enough to care at a system level.
By rebuilding the binary with different compiler options, but not changing malloc, they got a 20% speedup.
If we naively multiply these speedups, we get 1.78: 78% faster.
How it gets to 1.9 is that when you speed up only the program but leave malloc the same, malloc matters to the overall performance a lot more.
When the faster malloc is applied to a program that is compiled better, it will make a better contribution than the 44% seen when the allocator was preloaded into the slower program.
To do the math right, we would have to look into how much time was saved with just the one change, and how much time was saved with the other. If we add those times, and subtract them from the original, slow time, we should get a time that is close to the 1.9 speedup.
Original time: 4.631
Better compiler options alone: 3.853 (-0.778)
Better allocator alone (preload): 3.209 (-1.422)
Add time saved from both: 2.200
Projected time: 4.631 - 2.200 = 2.431
Projected speedup from both: 4.631/2.431 = 1.905
Bang on!
When an AST-interpreted language gets a VM, then gets native code, each step reveals the GC to be slow.
You might go from 1% of time spent in the GC to 15% to 60% (numbers out of thin air).
Upstream is stripped actually. My final build here is 43x larger than the distro binary, partly due to having plentiful debug info.
I'd love to know why they don't enable more optimizations.
I'm assuming this isn't the case for more packages? Feels too good.
Another thing the article adds is LTO, which in my experience also makes a huge difference: it makes your software a lot faster in certain cases, but also makes build times a lot worse. Spending a bit more time in the build process should be an easy call, but might be harder at the scale of a distro.
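For a hand-built autotools-style project the incantation is small; a sketch (some projects also want -ffat-lto-objects or gcc-ar/gcc-ranlib for static archives to cooperate):

./configure CFLAGS="-O2 -flto" LDFLAGS="-flto"
make -j"$(nproc)"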
Not sure where I got that idea, though. :(. Will have to look into that later.
And I understand a little about -O3 possibly increasing code size. I had thought that was more of a concern for tight environments than for most systems? Of course, I'd have assumed that -march=native would be more impactful, but the post indicates otherwise.
I said this in a top-level comment, but it seems the allocator makes the biggest impact for this application? Would be interesting to see which applications should use different allocators. Would be amazing to see a system where the most likely optimal allocator was the default for different applications, based on their typical allocation patterns.
It used to be the case that the presence of debug symbols would affect GCC code generation. Nowadays that should be fixed. I think it still affects the speed of compilation so if you're building the whole system from source you might want to avoid it.
Increasing code size too much can result in hot functions not fitting in the icache, and that ultimately can make your program slower.
What impact does that have?
Debug symbols have zero runtime penalty, just storage. They're just another section of the binary, referenced by debuggers and skipped by the loader.
In any case, all distros break out the symbols into separate files so that they can have their (storage) cake and (debug) eat it too.
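The usual split-debug dance looks roughly like this (a sketch of what the distro tooling automates; myprog is a placeholder):

gcc -g -O2 -o myprog myprog.c
objcopy --only-keep-debug myprog myprog.debug
strip --strip-debug --strip-unneeded myprog
objcopy --add-gnu-debuglink=myprog.debug myprog

Debuggers then locate myprog.debug via the debuglink (or build-id) and load the symbols from there, while the shipped binary stays small.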
To travel 10 miles at 60 MPH takes 10 minutes. Make it 100% faster, at 120 MPH, and that time becomes 5 minutes. Travel just as far in 50% of the time; or travel just as far, 100% faster. The 90% speedup matches the reduction of the time it takes to nearly half (a 90% (projected) speedup, or about a 45% time reduction, as mathed out by kazinator: `Projected speedup from both: 4.631/2.431 = 1.905`). Your claim that it's closer to 50% is correct from a total-time-taken perspective, just coming at it from the other direction.
Enables fun & safe math optimizations!
Why wouldn't that be identical?
Read more here: https://chimera-linux.org/about/
And for a long time, using jemalloc was the only way to keep memory usage constant with multi-threaded Ruby programs like Puma and Sidekiq. This was achieved either by compiling Ruby with jemalloc or by modifying LD_LIBRARY_PATH.
Some developers also reported a 5% or so reduction in response times with jemalloc, iirc.
The problem with this approach, though, is that when a package has a lot of dependencies - like ImageMagick, which relies on jpeg, png, ghost and a lot of other libraries - you have to take a trial-and-error approach until the build succeeds. Fortunately, fixing the dependency errors is the easy part; sometimes building Python from source throws errors from headers which are impossible to understand. If you find a Stackoverflow solution then you are good, or you have to go down the rabbit hole, coming out either successful or empty-handed based on your level of expertise.
It's surprising how quick this kind of processing can be in Go.
[1] https://data.acgov.org/datasets/2b026350b5dd40b18ed7a321fdcd...
The program:
package main

import (
    "encoding/json"
    "fmt"
    "os"
)

// Only the fields we care about are declared; encoding/json ignores the rest.
type FeatureCollection struct {
    Features []Feature `json:"features"`
}

type Feature struct {
    Properties Properties `json:"properties"`
}

type Properties struct {
    TotalNetValue int    `json:"TotalNetValue"`
    SitusCity     string `json:"SitusCity"`
}

func main() {
    filename := "Parcels.geojson"

    // Read the whole GeoJSON file into memory.
    data, err := os.ReadFile(filename)
    if err != nil {
        fmt.Println("Error reading file:", err)
        return
    }

    var featureCollection FeatureCollection
    err = json.Unmarshal(data, &featureCollection)
    if err != nil {
        fmt.Println("Error unmarshaling JSON:", err)
        return
    }

    // Same filter as the jq/ch queries above: print the city for every parcel under the threshold.
    for _, feature := range featureCollection.Features {
        if feature.Properties.TotalNetValue < 193000 {
            fmt.Println(feature.Properties.SitusCity)
        }
    }
}
Which is why I'm in support of always building your own software with -O3.
But in general I have often found great benefit in many cases in obtaining packages from source and compiling them, or downloading official binaries, even if the repository does have the latest version in theory.
(shameless plug incoming): My only obstacle with such "Manually Installed [and/or] Source-Compiled" (MISC) packages was checking for updates. So I made this: https://sr.ht/~tpapastylianou/misc-updater/
Works great, and I've been slowly improving it to add 'upgrader' commands for some of the packages too.
It's not a full replacement for a package manager, nor is it meant to be. But it's made my ability to use MISC packages instead of repository ones a lot more streamlined and easier.
1) Why don't distros replace the glibc allocator with one of the better ones?
2) Why don't distros allow for making server-specific builds for only a few packages? You don't have to pay the compilation cost for everything, just a list of 2 or 3 packages.
Some of the other things disabled might be useful for distros in general, like NDEBUG assertions or compiling with debug symbols.
You could write your own custom package definitions, extending the default to change up compile flags and allocators, but then you need to do this for every single package (and maintain them all). I'm not sure Guix gives you much here, though maybe that's fine for one or two packages.
The most pain-free option I can think of is the --tune flag (which is similar to applying -march=native), but packages have to be defined as tunable for it to work (and not many are).
Is there another option?
guix build --with-configure-flag="jq=CFLAGS=-O3" jq
If you want it to be permanent, then you can use a guix home profile (that's a declarative configuration of your home directory) with a patch function in the package list there:

(define llama-tune
  (options->transformation `((tune . "znver3")))) ; Zen 3

(home-environment
  (packages (list (llama-tune (specification->package "llama")))))

or:

(define jq-patch
  (options->transformation `((with-configure-flag . "jq=CFLAGS=-O3"))))

[...] (jq-patch (specification->package "jq"))
[...]

You can also write a 10 line guile script to automatically do it for all dependencies (I sometimes do--for example for emacs). That would cause a massive rebuild, though.
> The most pain-free option I can think of is the --tune flag (which is similar to applying -march=native), but packages have to be defined as tunable for it to work (and not many are).
We did it that way on purpose--from prior experience: otherwise you would get a combinatorial explosion of different package combinations.
If it does help for some package X, please email us a 2 line patch adding (tunable? . #t) to that one package.
If you do use --tune, it will tune everything that is tunable in the dependency graph. But at least all dependents (not dependencies) will just be grafted--not rebuilt.
It's the reason I so far use a Mac at work, which has its own issues, and a lot of them.
Maybe it's time to rebuild those with mimalloc, too.
jaq runs 5x faster on my machine in some cases
And in some/most cases -Os is faster than -O3
Boehm GC is usually slower than glibc with lots of allocs. MPS would be better, but needs lots of rewriting. https://github.com/Ravenbrook/mps
-Os is much slower, as I would have expected. Nothing in the perf data suggested high L1i miss rate. The .text of the "size optimized" program is, for some reason, 16x larger.
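For anyone who wants to reproduce that kind of comparison, size and perf cover it; a sketch (the binary names and input file are placeholders):

size ./jq-O3 ./jq-Os        # per-section sizes, including .text
perf stat -e instructions,L1-icache-load-misses ./jq-Os '.' big.json > /dev/null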
Converter: https://geoparquet.org/convert/
It'll scan way less data and be 90% faster than your 90% faster jq.
Can’t one recompile the same exact Ubuntu packages you already have on your system with optimal flags for your specific hardware?
Plus, the configuration could be automated - just push a button to (lazy, in background) recompile.
With priority for the ones profiled to be performance-critical.
Well, in principle yes. But you'd also need to figure out what you mean by 'optimal' flags? The flags might differ between different packages, and not all packages will be compatible with all the flags.
Even worse, in the example in the article they got most of their speedup out of moving to a different allocator, and that's not necessarily compatible with all packages (nor an improvement for all packages).
However, if you still want to do all these things, then Gentoo Linux is the distribution for you. It supports this tinkering directly and you'll have a much happier time than trying to bend Ubuntu to your will.
Distros that went down this path are going the way of the dinosaurs, together with their confused user base.
The undefined behaviour is already in the C standard.
But you are right that enabling -O3 is generally going to make your compiler more aggressive about exploiting the undefined behaviour in the pursuit of speed.
In general, if you want to avoid being the guy to run into new and exciting bugs, you want to run with options that are as close as possible to what everyone else is using.
Even Design by Contract in Eiffel turns off assertions in production.
Rule #1 of programming: If it can go wrong, it will.
I remember when it was my main driver back in early '00s, running on a horribly underpowered PII at 266MHz. It would often be compiling 24/7 to keep up with the updates.
My first install was in 2002 and it took me a good 24 hours to get X to boot.
It did catch a faulty memory DIMM later on, because the linking kept failing intermittently.
I know most of us have accounts here, but when we're seeing something we want to read, it's not conducive to have to deal with roadblocks before we can see what is being talked about. Same goes for paywalls.
Not everything in life is an optimization problem. Fun little projects like this is what makes us human (and, arguably, are both educational and entertaining).
If we're talking about microseconds of difference, the trade-off doesn't seem worth it. Even on a mass scale where this is somehow adopted, nobody is going to notice the difference. Maybe if this were in something like eCommerce or web browsing, where the lag translates to profit lost? Or perhaps game engines?
IDK, I just consider human time more precious than a slow package (that already runs so blazingly fast on a modern CPU that it barely matters).
Also a lot of simple optimizations could save seconds or minutes in the lives of millions of people, and across multiple devices/programs that adds up. Microsoft once sped up the booting process of their consoles by 5 seconds, by simply making the boot animation shorter by 5 seconds. Those consoles were sold to millions of people and it took MS 8 years or so to make the fix. That’s many lifetimes wasted once you add it all up.
As a broader comment though, this is how we learn and discover things. Just because the specific outcome here is "trivial" doesn't mean the approach or the lessons learnt aren't valuable.
I found it an interesting read despite not having any stake in the outcome.