https://github.com/git/git/commit/bb74c0abbc31da35be52999569...
I don’t write exotic algorithms, but it’s always astounding how small n needs to be to become observably problematic.
https://bsky.app/profile/randomascii.bsky.social/post/3lk4c6...
If the cost of doing something goes above quadratic, you shouldn't do it at all. Because essentially every customer interaction costs you more than the one before. You will never be able to come up with ways to cover that cost faster than it ramps. You are digging a hole, filling it with cash and lighting it on fire.
If you can't do something well you should consider not doing it at all. If you can only do it badly with no hope of ever correcting it, you should outsource it.
Software operates in a crazy number of different domains with wildly different constraints.
n is only rarely related to "customers". As long as n doesn't grow, the asymptotic complexity doesn't actually matter.
For n of 10^9, where lgn takes 0.03 us and n takes 1 s, nlgn takes 29.9 s and n^2 takes 31.7 years.
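A quick back-of-the-envelope check of those numbers, assuming ~1 ns per elementary operation (so n = 10^9 operations take about a second); just a sketch, not a benchmark:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

func main() {
	const n = 1e9
	const perOp = float64(time.Nanosecond) // assume ~1 ns per operation

	lg := math.Log2(n)
	fmt.Printf("lg n:   %v\n", time.Duration(lg*perOp))    // ~30 ns
	fmt.Printf("n:      %v\n", time.Duration(n*perOp))     // ~1 s
	fmt.Printf("n lg n: %v\n", time.Duration(n*lg*perOp))  // ~30 s
	fmt.Printf("n^2:    %.1f years\n", n*n*1e-9/(3600*24*365)) // ~31.7 years
}
```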
Also, remember the wise statement that 'memory is fairly cheap compared to CPU for scaling'. It's insane how often folks would rather open and scan a 'static-on-deploy' 20-100MB JSON file on every request instead of parsing it into structures in memory once (where, in most cases, the in-memory representation is a fraction of the size of the JSON itself) and caching the parsed structure for the lifetime of the application.
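A minimal sketch of that parse-once-and-cache pattern; the file path, package name, and Config fields here are made up:

```go
package appconfig

import (
	"encoding/json"
	"os"
	"sync"
)

type Config struct {
	Entries map[string]string `json:"entries"`
}

var (
	once    sync.Once
	cached  *Config
	loadErr error
)

// Get parses the static-on-deploy file exactly once; every later caller gets
// the already-parsed in-memory structure instead of re-reading the JSON.
func Get() (*Config, error) {
	once.Do(func() {
		data, err := os.ReadFile("/etc/myapp/static.json") // hypothetical path
		if err != nil {
			loadErr = err
			return
		}
		cached = &Config{}
		loadErr = json.Unmarshal(data, cached)
	})
	return cached, loadErr
}
```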
Less brittleness is worth paying a few percent. Especially if it unmuddies the waters enough for someone to spot other accidental (time) complexity.
But I also don't dabble in this area nearly enough to know whether there's years of tears and toil finding out repeatedly that O(n) is ~impossible to implement and verify :)
| n   | n ln n   |
|-----|----------|
| 5   | 8.0472   |
| 10  | 23.0259  |
| 25  | 80.4719  |
| 50  | 195.6012 |
| 100 | 460.5170 |
I've seen it several times before, and it's exactly what happened here.
I spent a lot of time fixing n^2 in blink, but there were some fun surprises:
https://source.chromium.org/chromium/chromium/src/+/main:thi...
For large N without a cache, :nth-child matching would be very slow, doing O(n^2) scans of the siblings to compute the index. On the other hand, for small sibling counts it turned out the cache overhead was noticeably worse than just doing the n^2 scan. (See the end of the linked comment.)
This turns out to be true in a lot of surprising places, both where linear search beats constant time maps, and where n^2 is better than fancy algorithms to compensate.
Memory latency and instruction timing are the gotchas of many fun algorithms in the real world.
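A sketch of that trade-off, illustrative only and not Blink's actual code (the threshold value and types are made up):

```go
package sketch

// Below a small threshold, a plain linear scan wins; past it, paying once to
// build an index beats repeated O(n) scans.
const indexThreshold = 32

type siblings struct {
	items []string
	pos   map[string]int // built lazily, only when the sibling count is large
}

func (s *siblings) indexOf(name string) int {
	if len(s.items) < indexThreshold {
		// Small n: the cache/map overhead is noticeably worse than scanning.
		for i, it := range s.items {
			if it == name {
				return i
			}
		}
		return -1
	}
	if s.pos == nil {
		// Large n: build the index once, then every lookup is O(1).
		s.pos = make(map[string]int, len(s.items))
		for i, it := range s.items {
			s.pos[it] = i
		}
	}
	if i, ok := s.pos[name]; ok {
		return i
	}
	return -1
}
```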
> where linear search beats constant time maps
Can you give an example? You said lots of good things in your post, but I'm struggling to believe this one. Also, it would help to see some wall clock times or real world impact.

Some example workloads include:
1. Tokenization (checking if a word is a keyword; see the sketch at the end of this comment)
2. Interpretation (mapping an instruction name to its action)
3. ML (encoding low-cardinality string features in something like catboost)
4. JSON parsers (usually key count is low, so parse into a linear-scan hashmap rather than a general-purpose hashmap)
Details vary in the exact workload, the hardware you're using, what other instructions you're mixing in, etc. It's a well-known phenomenon though, and when you're doing a microoptimization pass it's definitely something to consider. 2x speedups are common. 10x or more happen from time to time.
It's similar to (but not _quite_ the same as) the reason real-world binary search uses linear scans for small element counts.
When you go to really optimize the system, you'll also find that the linear scan solution is often more amenable to performance improvements from batching.
As to how much it matters for your composite program? Even at a microoptimization level I think it's much more important to pay attention to memory access patterns. When we wrote our protobuf parser that's all we really had to care about to improve performance (33% less execution time for the entire workload, proto parsing being much better than that). You're much more likely to be able to write sane code that way (contrasted with focusing on instructions and whatnot first), and it's easier to layer CPU improvements on top of a good memory access pattern than to go the other way around.
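To make example 1 concrete, here's a toy Go sketch (the keyword list is made up). For a dozen-ish keywords the plain scan often wins on real hardware, but measure on your own workload:

```go
package sketch

var keywords = []string{"if", "else", "for", "while", "return", "break", "func"}

func isKeywordScan(s string) bool {
	for _, kw := range keywords { // contiguous memory, branch-predictor friendly
		if kw == s {
			return true
		}
	}
	return false
}

var keywordSet = map[string]struct{}{
	"if": {}, "else": {}, "for": {}, "while": {},
	"return": {}, "break": {}, "func": {},
}

func isKeywordMap(s string) bool {
	_, ok := keywordSet[s] // hashes s on every call
	return ok
}
```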
> That is not the case in C though, as it is much easier to use arrays and nested loops instead of hash maps.
I am confused. There are plenty of open source, fast hash map impls in C.

That's the problem. A lot of these quadratic time algorithms don't set limits.
Even 'n!' is fine for small 'n'. Real production use cases don't have small 'n'.
Real production use cases absolutely have small n. You don't hear about them, because it's very hard for them to cause issues. Unless the use case changes and now the n is not small anymore and nobody noticed the trap.
It's been fine because "n" is "number of aircraft flying in this flight simulator" - and the simulator's engine starts to fail above around 2000 anyway. So even in the worst case it's still going to run within milliseconds.
ChatGPT has ruined bullet points for the rest of us…
No offense, but writing this blog post couldn't have taken more than a few minutes; why spoil it with an LLM? Shoot, use one to check grammar and recommend edits, even.
1. https://github.com/git/git/commit/bb74c0abbc31da35be52999569...
Why aren't they just snapshotting and archiving the full git repo? Does `git bundle` add something over frequent ZFS backups?
https://git-scm.com/docs/gitfaq#_transfers
It doesn't tell you how to make a backup safely though.
On a personal scale, Syncthing and Btrfs snapshots work plenty good enough. It's as fast as the storage/network too.
Because it's at a personal scale, the only time I can corrupt a git repo is if I work on the same repo (and its workdir) from more than one device within the time it takes for Syncthing to replicate the changes.
But even then it's not a big deal because git fsck is quick. And I have my snapshots, and the Syncthing versioning, and git defaults to two weeks before pruning. And because of how git works, using hashes to identify contents, files are not easily overwritten either.
In 10 years I've only had one git corruption (I ran a command on the same repo on a different machine via ssh, yielding a Syncthing conflict). Syncthing kept copies of the conflicting file. One commit disappeared from the history but not from the database. It was easy to rebase the changes. I think I used git fsck to delete the Syncthing-versioned files.
That said, there's another less-known feature that bundles help out with when used with `git clone --bundle-uri`: the client can specify a location for a bundle, or the server can send the client a bundle location in the clone results; the client fetches the bundle, unpacks it, and then updates the delta via the git server. That's a lot lighter weight on the server for cloning large repos, and a ton faster for the client on initial clones.
EDIT: you could also rsync from a .zfs snapshot directory if you have them enabled.
... is this really the way people "back up" git repos? I mean, it is git, so isn't there some way to mirror changes to the repo in another repo and just use ZFS / snapshots / backup software / etc to do that? It's a distributed version control system. Just make sure the version control information is ... distributed?
This is the approach I've taken with SQLite in production environments. Turn on WAL and the problem gets even easier to solve. Customer configures the VM for snapshots every X minutes. Git presumably doesn't have something approximating a WAL, so I understand the hesitation with this path. But, I still think the overall strategy is much more viable and robust to weird edges within git.
Bingo. One of the worst problems is helping a client piece back together a corrupted repo when they are using snapshots. Check my profile to see how I know. :)
It's usually an OMG down scenario, and then you are adding in the "oh no, now the restore is corrupted."
It's fixable but it's definitely annoying.
A few months back a better solution was provided by SQLite: sqlite3_rsync
There are system requirements that a customer would be expected to adhere to if they wanted a valid enterprise support contract with one of these vendors.
File systems can be weird. Sometimes the OS can be weird and fsync type calls may not do what you expect. At least at one point MacOS fsync didn't behave the same way as Linux (i.e. Linux should ensure the write is truly done and not just in cache so long as the drive isn't lying). [0]
> There are system requirements that a customer would be expected to adhere to if they wanted a valid enterprise support contract with one of these vendors.
Gitlab has a community edition. Not handling data well would be bad for their public image.
If the backup times were O(n^2), are they now O(n^2 / 2^n)? I would guess not.
If you can’t agree with this, then you shouldn’t be speaking or writing, IMO.
Those who argue that words that mean different things are actually equivalent have no business dealing with language.
I.e. you are saying an f(n) speedup means T(n)/f(n), but others would say it means f(T(n)) or some variation of that.
Also because I've often heard that the quantum Fourier transform is an exponential speedup over the discrete Fourier transform, and there the scaling goes n^2 -> n log n.
I don't think this is meaningless or non-constructive pedantry - we're a technical community and those are technical words.
Here they are actually using it to refer to growth functions (which is rare for this error) and being honest (which is also rare, IMO), but it's still wrong. They should have written "quadratic", or "quadratic vs. linear".
Regardless, sloppy language leads to sloppy thought.
It’s a very specific subset of the one you’re describing.
“Tell me you don’t know what you’re talking about without telling me you don’t know what you’re talking about.”
In this case Git already had a string set, but it's still not standard so there's a good chance the original author just didn't know about it.
The original commit was made in January 2009 (https://github.com/git/git/commit/b2a6d1c6868b6d5e7d2b4fa912...); strmap was added in November 2020 (https://github.com/git/git/commit/ae20bf1ad98bdc716879a8da99...; strset was added a few days later: https://github.com/git/git/commit/1201eb628ac753af5751258466...). It was first proposed in 2018 (https://lore.kernel.org/git/20180906191203.GA26184@sigill.in...; the proposal specifically mentions it fixing possibly quadratic sites).
As noted in the comment, git did have a sorted string list with bisection search, and that's from 2008 (it actually dates back to 2006 as the "path list" API, before it was renamed following the realisation that it was a generalised string list). Though as the hashmap proposal notes, it's a bit tricky because there's a single type with functions for sorted and functions for unsorted operations, so you need to know whether your list is sorted or not independently of its type.
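For illustration, the general shape of the change (not a transcription of the actual C commit, which uses git's strset API) looks like this in Go:

```go
package sketch

// Before: an O(n^2) "have I seen this already?" scan per element.
func dedupQuadratic(items []string) []string {
	var out []string
	for _, s := range items {
		seen := false
		for _, o := range out { // linear scan per item -> O(n^2) overall
			if o == s {
				seen = true
				break
			}
		}
		if !seen {
			out = append(out, s)
		}
	}
	return out
}

// After: one set-membership test per item -> O(n) overall.
func dedupWithSet(items []string) []string {
	seen := make(map[string]struct{}, len(items))
	out := make([]string, 0, len(items))
	for _, s := range items {
		if _, ok := seen[s]; ok {
			continue
		}
		seen[s] = struct{}{}
		out = append(out, s)
	}
	return out
}
```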
The most troublesome cases of inefficiency in production are almost always O(n^2).
Quadratic complexity sits in an awkward sweet spot: fast enough for medium-sized n to pass initial QA, but doomed to fail eventually as n grows.
Also moved away from Gitlab because it's so damn slow.
I know their backend git proxy is written in golang, their runner agent is written in golang, it spawns CI jobs using containerd (written in golang), and they use PostgreSQL and a Redis-esque KV -- although for that part I do believe they're still using Sidekiq (Ruby) for job dispatch, so that one could very easily lolol back into not actioning tasks efficiently
GitHub Actions sure does enjoy just sitting there twiddling its thumbs when I push "run job," and is also both Ruby and their own crazypants ajax-y web framework, so I don't think it's exactly the shining star of performance itself
Realtalk: They should rewrite this post's headline to be in a positive tense instead of leading with a negative word. I'm glad I read the post, because it is a cool and good fix, but I saw “Decreasing […] repo backup” and my first thought was that it was an announcement of some service downgrade like some sort of cost-cutting measure.
It will spin up a localhost server after the trace ends, the profiler uses the localhost server and nothing is shared with Firefox servers unless you explicitly choose to upload the data and create a permalink.
You record performance data with `perf`, then use the scripts there to turn it into a SVG.
https://github.com/google/pprof
go install github.com/google/pprof@latest
pprof -http=: prof.out
I normally collect the profiles with gperftools (https://github.com/gperftools/gperftools) and then just LD_PRELOAD=/usr/lib/libtcmalloc_and_profiler.so CPUPROFILE=prof.out <your command>
I've been meaning to try Samply though. Not sure if it works with pprof.

Uh, no you didn't. Not possible. At most a polynomial reduction is possible, or else complexity theory needs a rewrite.
(OK, yes, k could be doing some heavy lifting here, but I doubt it.)
If you are going to quote a maths formula then please don't use "exponentially" to mean "lots".
I stopped reading there: I don't want to have to read each word and wonder if they actually meant it, or it's just bad hype.
If my barista says that his new coffee is exponentially better, it’s ok to use it colloquially.
If my git hosting provider writes about an impactful performance improvement, it’s not.
(I don't think that anyone should use "exponentially" that way: it is a term of art with a specific and particular meaning, so find another word if you mean something else! Like misusing specific legal or sporting terms...)
Instead of saying "set". The code itself uses the type "strset".
https://addons.thunderbird.net/en-US/thunderbird/addon/remov...
I used a hash to begin with, applying a simplistic digest function to the message headers I was comparing, getting a 4-byte hash key.
That worked, but was kind of slow.
Finally, the idea came to me to not apply _any_ digesting, and use the combined concatenated headers of the messages as hash keys. About 2k bytes per hash key!
The result: About 20x perf improvement if memory serves.
How is that possible? The reason is that the code all runs in a JavaScript machine, and applying the digest was not a built-in function: it was looping over the headers and doing the arithmetic, thousands upon thousands of JS abstract-machine steps. The use of the large hash key may be inefficient, but it's just one JS object / dictionary operation, and one of the most heavily-optimized in any implementation.
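A Go-flavored sketch of the same idea (the original is JavaScript inside Thunderbird, and the Message fields here are invented): skip the hand-rolled digest loop and key the map on the concatenated headers themselves, letting the runtime's optimized string hashing do the work.

```go
package sketch

type Message struct {
	From, Subject, MessageID string
}

func groupDuplicates(msgs []Message) map[string][]Message {
	groups := make(map[string][]Message)
	for _, m := range msgs {
		// ~2 KB keys are fine; one optimized hash beats thousands of
		// interpreted digest steps.
		key := m.From + "\x00" + m.Subject + "\x00" + m.MessageID
		groups[key] = append(groups[key], m)
	}
	return groups
}
```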
While implementing low-level sorts yourself is perhaps silly, knowing which data structures to use and when, and what underlying big-O performance they have, is critical, as demonstrated here.
Wild that a pattern like this would be part of git core, but I guess we all overlook stuff on a regular basis.
Yeah but you should want your thoughts on a single post to tie together.
> Many years ago I had a user with thousands of karma points. I used to get really annoyed with other users downvoting my valid and thoughtful comments because it affected my karma. Despite attempts to rally the community around getting rid of downvoting, that never happened.
Sorry you had that reaction. While I get annoyed by downvotes sometimes, I've never cared about losing some points from the mostly useless pile.
Why are you so afraid to let your ideas and views collect under a single account? Are they that controversial, are you wary of your own thoughts and don't want to see them again, or are you afraid to own your views as yours?
We're talking (mostly) tech here, and nobody is forced to comment.
While I don't go as far as the poster you are replying to, I do see his way as superior. It's just too much hassle for me.
The Nazi Germany & Jews issue is different. There's an aspect of coercion there, which is unethical and wrong on so many levels, and I'll just leave the subject here.
OTOH, from my perspective if you're afraid that you're writing a sensitive comment, you can create a throwaway. That's justifiable IMHO, but creating three accounts to discuss maps vs. loops, now that's different.
If we're talking about being ridiculous, bringing up the Nazi Germany vs. Jews issue in a technical discussion is more ridiculous than the alleged ridiculousness of me asking the OP about their fears. To close, my questions were not meant to belittle or shame the OP; they were genuine. I'm not the kind of person who jabs for giggles.
It’s fine to be fearless. But don’t persecute someone for trying to protect themselves.