Anyone have similar stories? Curious about cases where knowing your domain beat throwing compute at the problem.
Every email address was an S3 object, so every new email sent to that address was saved as a new object version.
Presenting that address as a mailbox was just a matter of reading all the versions of the object.
It worked!
I used this contraption as a domain-level catch-all inbox for a while until Cloudflare started supporting email forwarding.
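Reading a mailbox this way might look roughly like the following sketch (bucket and key layout are assumptions, not the commenter's actual setup): list all versions of the address's key, oldest first, and fetch each body.

```python
# Sketch: each email address is a versioned S3 key; every delivered
# message is a new version of that key. The "mailbox" is just the
# version history. Bucket/key names here are hypothetical.

def read_mailbox(s3, bucket, address):
    """Return message bodies stored under `address`, oldest first."""
    resp = s3.list_object_versions(Bucket=bucket, Prefix=address)
    versions = [v for v in resp.get("Versions", []) if v["Key"] == address]
    versions.sort(key=lambda v: v["LastModified"])  # oldest first
    return [
        s3.get_object(Bucket=bucket, Key=address,
                      VersionId=v["VersionId"])["Body"].read().decode()
        for v in versions
    ]

# With a real client:
# read_mailbox(boto3.client("s3"), "mail-bucket", "me@example.com")
```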
Given there were about a billion IG profiles total at the time, I just replaced the entire setup with a single Go script that iterated from 1 to a billion and tried to scrape every ID in between. That gave us 10k requests per second on a single machine, which was more than enough.
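The shape of that brute-force scan is simple: a counter plus a worker pool. Here's a minimal Python sketch of the same idea (the `fetch_profile` stub stands in for the real HTTP request; the divisible-by-3 rule is purely for illustration):

```python
# Hedged sketch of a brute-force ID scan: iterate over numeric ids and
# try each one with a pool of workers. `fetch_profile` is a stub; a
# real version would issue an HTTP request and return None on 404.
from concurrent.futures import ThreadPoolExecutor

def fetch_profile(profile_id):
    # Placeholder: pretend ids divisible by 3 don't exist.
    return None if profile_id % 3 == 0 else {"id": profile_id}

def scan(start, stop, workers=64):
    """Try every id in [start, stop) and keep the hits."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(fetch_profile, range(start, stop))
    return [r for r in results if r is not None]
```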
I really, really, really wish this sequence of words did not exist in modern society.
/my unsubstantiated reddit-tier comment which I'm only posting because I'm sure someone will piggyback off of it with something related and actually insightful.
There are "smarter" solutions like radix tries, hash tables, or even skip lists, but for any design choice, you also have to examine the tradeoffs. A goal of my project is to make the code simpler to understand and less of a black box, so a simpler data structure made sense, especially since other design choices would not have been all that much faster or use that much less memory for this application.
I guess the moral of the story is to just examine all your options during the design stage. Machine learning solutions are just that, another tool in the toolbox. If another simpler and often cheaper solution gets the job done without all of that fuss, you should consider using it, especially if it ends up being more reliable.
> There are "smarter" solutions like... hash tables.... A goal of my project is to make the code simpler to understand and less of a black box, so a simpler data structure made sense, especially since other design choices would not have been all that much faster or use that much less memory for this application.
Strangely, my own software-related answer is the opposite for the same reason.
I was implementing something for which I wanted to approximate a https://en.wikipedia.org/wiki/Shortest_common_supersequence , and my research at the time led me to a trie-based approach. But I was working in Python, and didn't want to actually define a node class and all the logic to build the trie, so I bodged it together with a dict (i.e., a hash table).
To figure that out, I remember searching for articles on how to implement inverted indices. Once I had a list of candidate strategies and data structures, I used Wikipedia supplemented by some textbooks like Skiena's [2] and occasionally some (somewhat outdated) information from NIST [3]. I found Wikipedia quite detailed for all of the data structures for this problem, so it was pretty easy to compare the tradeoffs between different design choices here. I originally wanted to implement the inverted index as a hash table but decided to use a trie because it makes wildcard search easier to implement.
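A trie-backed inverted index of the sort described might look roughly like this (a sketch built from nested dicts, not the project's actual code): the term dictionary is a trie, so a trailing wildcard like `cat*` becomes a simple subtree walk.

```python
# Sketch: inverted index whose term dictionary is a trie of nested
# dicts. Postings live at a sentinel key, so prefix wildcards are
# just "walk to the prefix, then collect the subtree".
END = object()  # end-of-term marker; its value is the posting set

def add(trie, term, doc_id):
    node = trie
    for ch in term:
        node = node.setdefault(ch, {})
    node.setdefault(END, set()).add(doc_id)

def prefix_search(trie, prefix):
    """Doc ids for every indexed term starting with `prefix`."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return set()
        node = node[ch]
    docs, stack = set(), [node]
    while stack:  # collect postings from the whole subtree
        for key, child in stack.pop().items():
            if key is END:
                docs |= child
            else:
                stack.append(child)
    return docs
```

An exact-term lookup is the same walk without the subtree collection, which is why the hash-table version would have been simpler but wildcard-hostile.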
After I developed most of the backend, I looked for books on "information retrieval" in general. I found a history book (Bourne and Hahn 2003) on the development of these kind of search systems [4]. I read some portions of this book, and that helped confirm many of the design choices that I made. I actually was just doing what people traditionally did when they first built these systems in the 1960s and 1970s, albeit with more modern tools and much more information on hand.
The harder part of this project for me was writing the interpreter. I actually found YouTube videos on how to write recursive descent parsers to be the most helpful there, particularly this one [5]. Textbooks were too theoretical and not concrete enough, though Crafting Interpreters was sometimes helpful [6].
[1] https://en.wikipedia.org/wiki/Inverted_index
[2] https://doi.org/10.1007/978-3-030-54256-6
[3] https://xlinux.nist.gov/dads/
[4] https://doi.org/10.7551/mitpress/3543.001.0001
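The recursive-descent approach those tutorials cover can be sketched minimally like this (a toy arithmetic evaluator, far simpler than a real query interpreter): one function per grammar rule, each consuming tokens left to right.

```python
# Minimal recursive-descent parser/evaluator for + - * / and parens.
# Each method mirrors one grammar rule; precedence falls out of the
# call structure (expr calls term, term calls atom).
import re

def tokenize(src):
    return re.findall(r"\d+|[()+*/-]", src)

class Parser:
    def __init__(self, tokens):
        self.toks, self.pos = tokens, 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def eat(self):
        tok = self.toks[self.pos]
        self.pos += 1
        return tok

    def expr(self):  # expr := term (('+'|'-') term)*
        val = self.term()
        while self.peek() in ("+", "-"):
            op, rhs = self.eat(), None
            rhs = self.term()
            val = val + rhs if op == "+" else val - rhs
        return val

    def term(self):  # term := atom (('*'|'/') atom)*
        val = self.atom()
        while self.peek() in ("*", "/"):
            op = self.eat()
            rhs = self.atom()
            val = val * rhs if op == "*" else val / rhs
        return val

    def atom(self):  # atom := NUMBER | '(' expr ')'
        if self.peek() == "(":
            self.eat()
            val = self.expr()
            self.eat()  # closing ')'
            return val
        return int(self.eat())
```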
And there it goes: ChatGPT comes back with the appropriate command and runs it.
It underperformed banning the word "password" from a Google Form.
So that's what they went with.
I needed to test pumping water through a special tube, but didn’t have access to a pump. I spent days searching how to rig a pump to this thing.
Then I remembered I could just hang a bucket of water up high to generate enough head pressure. Free instant solution!
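As a rough sanity check on the bucket trick (the 2 m height is an assumed figure): static head pressure is just p = ρgh.

```python
# Back-of-the-envelope head pressure from an elevated bucket.
rho = 1000.0  # density of water, kg/m^3
g = 9.81      # gravitational acceleration, m/s^2
h = 2.0       # assumed height of the bucket above the tube, m

p = rho * g * h  # pascals
print(f"{p / 1000:.1f} kPa")  # ~19.6 kPa, about a fifth of an atmosphere
```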
I suspect that locating the referenced comment would require a semantic search system that incorporates "fancy models with complex decision boundaries". A human applying simple heuristics could use that system to find the comment.
In the "Dictionary of Heuristic" chapter, Polya's "How to Solve it" says this: *The feeling that harmonious simple order cannot be deceitful guides the discover in both in mathematical and in other sciences, and is expressed by the Latin saying simplex sigillum veri (simplicity is the seal of truth).*
This generalises to a few situations where going faster just doesn't matter. For example, for many CLI tools it matters whether they finish in 1s or 10s. But once you get down to 10ms vs 100ms, you can ask "is anyone ever likely to run this in a loop on a massive amount of data?" And if the answer is yes, "should they write their own optimised version then?"
Firstly, my approach ("set discovery") was simply to take relatively dumb samples of nodes from the leaves towards roots and ask the other party if they knew these nodes, and then iteratively refine with more roundtrips. In practice, this by far beat the previous sophisticated approach ("tree discovery") which tries to use the structure of the DAG to cleverly select "highly informative" nodes.
Secondly, I had a symmetric setup where the client sent samples to the server, and the server responded with information about those samples, and samples of its own. It worked great, sometimes saving hundreds of network roundtrips. However, computing the samples is relatively expensive. Another contributor suggested that it would work almost as well if the server were kept dumb and just responded, for each sample node, whether it knew it or not. This massively reduced server load and kept the protocol much simpler.
https://repo.mercurial-scm.org/hg/file/tip/mercurial/setdisc... https://repo.mercurial-scm.org/hg/rev/cb98fed52495
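A toy version of the dumb-server sampling loop might look like this (purely illustrative, not Mercurial's code; the real protocol also exploits the fact that knowing a node implies knowing its ancestors, which this sketch ignores):

```python
# Toy "set discovery": batch up samples of undecided nodes and ask the
# server a plain yes/no for each, until every node is classified.
# `server_knows` stands in for the remote call.
import random

def discover_common(local_nodes, server_knows, sample_size=3, seed=0):
    """Return (nodes the server knows, number of roundtrips)."""
    rng = random.Random(seed)
    undecided = set(local_nodes)
    common = set()
    roundtrips = 0
    while undecided:
        k = min(sample_size, len(undecided))
        sample = rng.sample(sorted(undecided), k)
        roundtrips += 1  # one trip per batch, however dumb the answers
        common |= {n for n in sample if server_knows(n)}
        undecided -= set(sample)
    return common, roundtrips
```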
The story goes some company/university/whatever in the early days of computing wanted a batch scheduler[1] to run jobs at specific times on their big IBM mainframe. They spoke to IBM who quoted them an eye-watering amount for it and said it would take months to implement. The main system operator told them to just hold fire and he’d see what he could come up with. The next day they had a working batch scheduler for zero dollars. He had set up the jobs so they would run on a keypress on a particular keyboard, then taken some of his kids’ lego and made a long finger on a hinge. He wrapped some string around the winder of a wind-up alarm clock then attached it to the lego and set the alarm clock to go off at the time they wanted to run the job. This had the effect of unwinding the string, lowering the finger that then pressed the key on the keyboard, running the job.
Not only that, but the jobs had a problem if you tried to run them twice, so he made it so the lego brick snapped off when pressing the key, making the job idempotent.
[1] Think “cron”, but for a mainframe
My “dumb” solution is a little Ansible job that just runs a git pull on the server. It gets the new code and I’m done. The job also has an option to set everything up, so if the server is wiped out for some reason I can be back up and running in a couple minutes by running the job with a different flag.
----
For storage, people often overcomplicate things. Maybe you do need RAID 5 in a NAS, etc. Maybe what you need is a simple server with a single disk and an offsite backup that rsyncs every night. That RAID 5 doesn't stop 'rm -rf' from destroying everything.
For databases, people often shove a database into an app or product much too early. The rule of thumb that I use is that you should switch to a database (from flat files) when you would have to implement foreign keys, or when data won't fit in memory anymore and memory-mapped files aren't sufficient. Using a database before that just complicates your data model, and introducing an ORM too early seriously complicates your code.
For algorithms, there are an awful lot of O(n log n) solutions deployed for problems with small n. An O(n) solution is often faster to write, and still solves the problem. O(n) is often actually faster when things fit in L1 or L2 cache.
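Membership testing in a small collection is the classic case: a linear scan over a plain list is trivial to write and, for small n, typically as fast as (or faster than) the asymptotically smarter structure, because the whole list sits in one cache line or two. A sketch of both:

```python
# Small-n lookup: linear scan vs. keeping the data sorted for bisect.
# For a handful of elements the scan is both shorter and cache-friendly.
import bisect

small = [9, 3, 7, 1, 5]  # small n: just scan it

def contains_scan(xs, x):
    return x in xs  # O(n), one pass over contiguous memory

sorted_xs = sorted(small)  # the "smart" alternative needs sorted data

def contains_bisect(xs, x):  # O(log n), but branchy and needs upkeep
    i = bisect.bisect_left(xs, x)
    return i < len(xs) and xs[i] == x
```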
For software architecture, we often forget that the client has CPU and storage (and network) that we can use. Even if you don't trust the client, you can sign a cache entry to be saved on the client, and let the client forward it later. Greatly reduces the need for consistency on the backend. If you don't trust the client to compute, you can have the server compute a spot check at lower resolution, a subset, etc.
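The signed-cache-entry idea can be sketched like this (names and payload shape are made up): the server signs the entry with a secret only it holds, hands it to the client, and later verifies the signature instead of trusting client storage.

```python
# Sketch: HMAC-signed cache entries stored client-side. The server
# keeps the key; any tampering by the client breaks the MAC.
import hashlib
import hmac
import json

SECRET = b"server-only-secret"  # hypothetical; never leaves the server

def sign_entry(payload: dict) -> dict:
    blob = json.dumps(payload, sort_keys=True).encode()
    mac = hmac.new(SECRET, blob, hashlib.sha256).hexdigest()
    return {"payload": payload, "mac": mac}

def verify_entry(entry: dict) -> bool:
    blob = json.dumps(entry["payload"], sort_keys=True).encode()
    expected = hmac.new(SECRET, blob, hashlib.sha256).hexdigest()
    return hmac.compare_digest(entry["mac"], expected)
```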
I needed cache eviction logic as there was only 1 MB of RAM available to the indexer, and most of that was used by the library that parsed the input format. The initial version of that logic emptied the entire cache when it hit a certain number of entries, just as a placeholder. When I got around to adding some LRU eviction logic, it became faster on our desktop simulator, but far slower on the embedded device (slower than with no cache at all). I tried several different "smart" eviction strategies. All of them were faster on the desktop and slower on the device. The disconnect came down to CPU cache (not word cache) size / strategy differences between the desktop and mobile CPUs — that was fun to diagnose!
We ended up shipping the "dumb" eviction logic because it was so much faster. The eviction function was only two lines of code, plus a large comment explaining all this and saying something to the effect of "yes, this looks dumb, but test speed on the target device when making it smarter."
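In sketch form, the "dumb" eviction is just this (the threshold is a made-up figure): when the cache hits its limit, throw everything away and start over, with no bookkeeping to thrash the CPU's own cache.

```python
# "Dumb" eviction: no LRU lists, no timestamps. When full, clear it.
# Two lines where it counts. MAX_ENTRIES is hypothetical.
MAX_ENTRIES = 256

cache = {}

def cache_put(key, value):
    if len(cache) >= MAX_ENTRIES:  # full?
        cache.clear()              # yes, this looks dumb; it benchmarked fastest
    cache[key] = value
```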
My group (and some others) had to design a device to transport an egg from one side of a very simple "obstacle course" to the other, with the aid of beacons (to indicate the egg location and target, each along opposite ends) and light sensors. There was basically a single obstacle, a barrier running most of the way across the middle. The field was fairly small, I think 4 metres across by 3 metres wide.
The other teams followed tutorials, created beacons that emitted high-frequency light pulses and circuitry to filter out 60Hz ambient light and detect the pulse; various robots (I think at least one repurposed a remote-control car) and feedback control to steer them toward the beacons, etc. There were a few different microcontrollers on offer to us for this task, and groups generally had three people: someone responsible for the mechanical parts, someone doing circuitry, and someone doing assembly programming.
My group was just the two of us.
I designed extenders for the central barrier, a carriage to straddle the barrier, and a see-saw the length of the field. The machine would find the egg, scoop it into one end, tilt the see-saw (the other person's innovation: by releasing a stop allowing the counterweighted far side to fall), find the target and release the scoop on the other end. Our light sensors were pointed directly at the ceiling (the source of the "noise"), and put through a simple RC circuit to see that light as more or less constant. Our "beacons" were pieces of construction paper used to block the light physically. All controlled by a 3-bit finite state machine implemented directly in TTL/CMOS (I forget which).
And it worked in testing (praise for my partner; I would never have gotten the mechanics robust enough), but on presentation day the real barrier (made sloppily out of wood) was noticeably wider than specified and the carriage didn't fit on it.
As I recall, in later years the obstacle course was made considerably more complex, ruling out solutions like mine entirely. (There were other projects to choose from, for my year and later years, that as far as I know didn't require modification.)
When we are learning difficult techniques we want to show them (who doesn't like to show others that he can execute a perfect "Kick of the crescent Dragon from the West"?). But the old master knows that moving aside and sticking out a foot is enough to defeat that rival. More so, maybe that master knows that not fighting is the best solution for solving that problem.
As I'm getting old I want to be more like this.
This gave a great impression of an intelligent adversary with very minimal code and low CPU overhead.
Unless enemies have entirely non-functional pathing. Then it's just funny.
- Buying a bigger server is almost always better than building a distributed system.
- A few lines of bash can often wipe out hundreds of lines of Python.
What I had overlooked was that journeys on that particular website were fairly constrained by design, i.e., if you landed on the home page, did a bunch of stuff, put product X in the cart - there was pretty much one sequence of pages (or in the worst case, a small handful) that you'd traverse for the journey. Which means the bag-of-words (BoW) representation was more or less as expressive as the sequence model; certain pages showing up in the BoW vector corresponded to a single sequence (mostly). But the DT could learn faster with less data.
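A bag-of-words representation of a journey is nothing more than counts with the order thrown away, e.g. (page names invented for illustration):

```python
# Bag-of-words over a page journey: order discarded, counts kept.
# On a site with constrained journeys, the set of pages present
# (almost always) pins down the one sequence that produced it.
from collections import Counter

journey = ["home", "category", "product_x", "cart", "checkout"]
bow = Counter(journey)
```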
I think a lot of people would have used a database at this point, but the site didn't need to be updated once built, so serving a load of static files via S3 keeps ongoing maintenance very low.
I also feel a slight sense of superiority when I see colleagues write a load of pandas scripts to generate some basic summary stats vs my usual throwaway approach based around awk.
But more on topic, I would say calling ffmpeg as an external binary to handle some multimedia processing is one of those cases where simple is better.
Generally, I would say that implementing your own solution instead of an external one (a library, service, or product) will always fall under this umbrella, mostly because you can implement only what you need, add things that might be missing, and adjust things that don't work exactly as you need them to, avoiding patches, translation layers, or external pipelines.
For example, right now I am implementing my own layout library, because the Clay bindings for Go were not working, and the manual translations were missing features or were otherwise incomplete or non-functional. So I learnt how Clay works and the principles Nic Barker built it on, and wrote my own version in Go. It has a little over 2k lines of code right now and will take me about two weeks to finish, with all or most of the features I wanted. Now I have a fully native Go layout library that does what I need and that I can use and modify ad infinitum.
So I would say that I equate a "dumb solution" with "my own" solution.
PS: Looking back at when I worked at an advertising/marketing/web agency, we used to build websites in CMSs (I did Drupal, a colleague did WordPress). Before I left the job, I came to the conclusion that if we had been using static site generators, we could have saved an unimaginable number of work hours and delivered the same products: 99% of clients never managed their websites themselves, since by the nature of the job we were building presentational websites, not complicated functional ones. And when they did, they only needed such tiny changes that it would have made more sense to do them for free upon request. For example, imagine you charge someone 5000€ for a website that takes two months to ship, because you need to design it, build the functionality, fit the visual style, and tweak whatever is needed. With a static site generator the work would take two weeks: a week for the design and a week for coding the site itself. Now you've saved yourself six weeks of work while getting paid the same amount. Unfortunately, I never had the chance to try this out and push the company in a new direction before I moved on.
The (deliberately) very limited analytics software I wrote for my personal website[0] could have used a database, but I didn't want to add a dependency to what was a very simple project, so I hacked up an in-memory data structure that periodically dumps itself to disk as a JSON file. This gives persistence across reboots, and at a pinch I can just edit the file with a text editor.
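The same shape of hack in miniature (file name and structure are invented, not the commenter's actual code): a plain dict, plus a timer that dumps it to JSON.

```python
# In-memory stats dict, periodically dumped to a human-editable JSON
# file. Persistence across restarts comes from re-reading the file.
import json
import threading

STATS_FILE = "stats.json"
stats = {}  # e.g. {"/some/page": hit_count}

def record_hit(path):
    stats[path] = stats.get(path, 0) + 1

def dump():
    with open(STATS_FILE, "w") as f:
        json.dump(stats, f, indent=2)
    threading.Timer(60.0, dump).start()  # dump again in a minute
```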
Game design is filled with "stupid" ideas that work well. I wrote a text-based game[1] that includes Trek-style starship combat. I played around with a bunch of different ideas for enemy AI before just reverting to a simple action drawn off the top of a small deck. It's a very easy system to balance and expand, and just as fun for the player.
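The deck idea in miniature (action names invented): shuffle a small deck of behaviors, draw off the top each turn, reshuffle when it runs out. Balancing is just changing card counts, and the player can't be starved of (or flooded with) any one behavior.

```python
# Deck-of-actions enemy AI: shuffled draws guarantee variety without
# any decision logic. Tune difficulty by editing the card mix.
import random

DECK = ["fire", "fire", "evade", "repair", "close_in"]

class EnemyAI:
    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.deck = []

    def next_action(self):
        if not self.deck:            # deck exhausted: reshuffle a fresh copy
            self.deck = DECK[:]
            self.rng.shuffle(self.deck)
        return self.deck.pop()
```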
This doesn't mean an LLM can't build such things, however.
You can't be "agile" with them; you need to design your data storage up front. Like a system design interview :).
Just use Postgres (or friends) until you are webscale. Unless you really have a problem amenable to key/value storage.
Interesting to hear now that the opinion is the opposite.
On another note, I recently wrote a large single-page app that is just a collection of functions organized by page section, following a nearly flat TypeScript interface. It's stupid simple to follow in the code and loads in as little as an eighth of a second. Of course that didn't stop HN users from crying like children over my avoiding use of their favorite framework.