My favorite hill to die on (externality) is user time. Most software houses spend so much time focusing on how expensive engineering time is that they neglect user time. Software houses optimize for feature delivery, not user interaction time. Yet if I spend one hour making my app one second faster for my million users, I save 277 user hours per year. But since user hours are an externality, such optimization never gets done.
Externalities lead to users downloading extra gigabytes of data (wasted time) and waiting for software, all of which is waste that the developer isn't responsible for and doesn't care about.
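A rough back-of-the-envelope for that claim, as a minimal sketch assuming each of the million users hits the one-second delay once per year:

```python
# Back-of-the-envelope for the claim above: one second saved for a million
# users, assuming each user hits the delay once per year.
users = 1_000_000
seconds_saved_per_use = 1.0
uses_per_year = 1                     # assumption; scale up for hot paths

total_hours = users * seconds_saved_per_use * uses_per_year / 3600
print(f"user-hours saved per year: {total_hours:.1f}")   # ~277.8
```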
I don’t know what you mean by software houses, but every consumer-facing software product I’ve worked on has tracked things like startup time and latency for common operations as a key metric.
This has been common wisdom for decades. I don’t know how many times I’ve heard the repeated quote about how Amazon loses $X million for every Y milliseconds of page loading time, as an example.
> Helldivers 2 devs slash install size from 154GB to 23GB
https://news.ycombinator.com/item?id=46134178
Section of the top comment says,
> It seems bizarre to me that they'd have accepted such a high cost (150GB+ installation size!) without entirely verifying that it was necessary!
and the reply to it has,
> They’re not the ones bearing the cost. Customers are.
And Skylines rendering teeth on models miles away https://www.reddit.com/r/CitiesSkylines/comments/17gfq13/the...
Sometimes performance really is ignored.
It cost several human lifetimes, if I remember correctly. Still not as bad as Windows Update, which, if you multiply the time by wages, has burned the GDP of a small nation every year.
We've only got a couple of game dev shops in my city, so not sure how common that is.
A senior joining when time is tight makes sense: they don’t want anyone to rock the boat, just someone to plug the holes.
Gamedev is very different from other domains, being in the 90th percentile for complexity and codebase size, and the 99th percentile for structural instability. It's a foregone conclusion that you will rewrite huge chunks of your massive codebase many, many times within a single year to accommodate changing design choices, or if you're lucky, to improve an abstraction. Not every team gets so lucky on every project. Launch deadlines are hit when there's a huge backlog of additional stuff to do, sitting atop a mountain of cut features.
The inverse, however, is bizarre: that they spent potentially quite a bit of engineering effort implementing the (extremely non-optimal) system that duplicates all the assets half a dozen times to potentially save precious seconds on spinning rust, all without validating that it was worth implementing in the first place.
The first was on PS3 and PS4 where they had to deal with spinning disks and that system would absolutely be necessary.
Also if the game ever targeted the PS4 during development, even though it wasn’t released there, again that system would be NEEDED.
They talk about it being an optimization. They also talk about the bottleneck being level generation, which happens at the same time as loading from disk.
Tell me you don't work on game engines without telling me..
----
Modern engines are the cumulative result of hundreds of thousands of engine-programmer hours. You're not rewriting Unreal in several years, let alone multiple times in one year. Get a grip dude.
I thought this post was weird for a few reasons, so I read through your resume, which perfectly explained everything. You're the author of that voxel engine I keep seeing get posted, but you have absolutely no experience in the industry, not even bush-league itch.io stuff. I wouldn't normally call someone out for any of this, but who are you to tell anybody to get a grip? You don't even know what you're talking about, which is why your reply is simultaneously naive and a non-sequitur. Chill out. You have a cute toy engine, but it's nothing more than portfolio padding. Until you've actually shipped a real game from start to finish, you've got no room to act like this. A little bit more self-awareness, please.
I’d be careful before telling people to “get a grip”.
Though in this case GitHub wasn't bearing the cost, it was gaining a profit...
I think this is uncharitably erasing the context here.
AFAICT, the reason that Helldivers 2 was larger on disk is because they were following the standard industry practice of deliberately duplicating data in such a way as to improve locality and thereby reduce load times. In other words, this seems to have been a deliberate attempt to improve player experience, not something done out of sheer developer laziness. The fact that this attempt at optimization is obsolete these days just didn't filter down to whatever particular decision-maker was at the reins on the day this decision was made.
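For illustration, a minimal sketch of the trade-off being described, with entirely made-up level and asset names and sizes: duplicating shared assets into per-level packs inflates the install, but lets a level load as one contiguous read instead of scattered seeks.

```python
# Hypothetical illustration of the locality trade-off (all numbers made up):
# duplicating shared assets into each level's pack inflates install size, but
# a level then loads as one contiguous read instead of many scattered seeks.
levels = {
    "level_01": ["terrain_a", "gun_01", "gun_02", "shared_ui"],
    "level_02": ["terrain_b", "gun_01", "shared_ui"],
    "level_03": ["terrain_c", "gun_02", "shared_ui"],
}
asset_size_mb = {"terrain_a": 900, "terrain_b": 900, "terrain_c": 900,
                 "gun_01": 150, "gun_02": 150, "shared_ui": 50}

deduplicated = sum(asset_size_mb.values())   # every asset stored exactly once
per_level_packs = sum(asset_size_mb[a] for assets in levels.values() for a in assets)

print(f"deduplicated install:  {deduplicated} MB")    # 3050 MB, random seeks at load
print(f"per-level duplication: {per_level_packs} MB") # 3450 MB, sequential loads
```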
Are you sure that you’re not the driving force behind those metrics; or that you’re not self-selecting for like-minded individuals?
I find it really difficult to convince myself that even large players (Discord) are measuring startup time. Every time I start the thing I’m greeted by a 25s wait and a `RAND()%9` number of updates that each take about 5-10s.
A large part of my friend group uses Discord as the primary method of communication, even in an in-person context (I was at a festival a few months ago with a friend, and we would send texts over Discord if we got split up), so maybe that's not a common use case.
I made a jab at Word taking like 10 seconds to start, and some people came back saying it only takes 2, as if that still isn't 2 seconds too long.
Then again, look at how Microsoft is handling slow File Explorer speeds...
Took me almost a year to get a separate laptop for office work and development. Their Enhanced Security prevented me from testing administrative code features and broke Visual Studio's bug submission system, which Microsoft requires you to use for posting software bugs.
By the way, I can break Windows simply by running their PowerShell utilities to configure NICs. Windows is not the stable product people think it is.
It’s closed unless I get a DM on my phone and then I suffer the 2-3 minute startup/failed update process and quit it again. Not a fan of leaving their broken, resource hogging app running at all times.
It would not fail to update if installed as a user installed flatpak.
Many apps are this way now.
I'll admit that the Discord service is really good from a UX point of view.
>>what you mean by software houses
How about Microsoft? Start menu is a slow electron app.
The decline of Windows as a user-facing product is amazing, especially as they are really good at developing things they care about. The “back of house” guts of Windows has improved a lot, for example. They should just have a cartoon Bill Gates pop up like Clippy and flip you the bird at this point.
It has both indexing failures and multi-day performance issues for mere kilobytes of text!
The falsehoods people believe and spread about a particular topic are an excellent way to tell what public opinion on it is.
Consider spreading a falsehood about Boeing QA getting bonuses based on number of passed planes vs the same falsehood about Airbus. If the Boeing one spreads like wildfire, it tells you that Boeing has a terrible track record of safety and that it’s completely believable.
Back to the start menu. It should be a complete embarrassment to MSFT SWEs that people even think the start menu performance is so bad that it could be implemented in electron.
In summary: what lies spread easily is an amazing signal on public perception. The SMBC comic is dumb.
If your users are trapped due to a lack of competition then this can definitely happen.
This is true for sites that are trying to make sales. You can quantify how much a delay affects closing a sale.
For other apps, it’s less clear. During its high-growth years, MS Office had an abysmally long startup time.
Maybe this was due to MS having a locked-in base of enterprise users. But given that OpenOffice and LibreOffice effectively duplicated long startup times, I don’t think it’s just that.
You also see the Adobe suite (and also tools like GIMP) with some excruciatingly long startup times.
I think it’s very likely that startup times of office apps have very little impact on whether users will buy the software.
They often have a noticeable delay on user input compared to local software.
Are they evaluating the shape of that line with the same goal as the stonk score? Time spent by users is an "engagement" metric, right?
Then respectfully, uh, why is basically all proprietary software slow as ass?
Commons would be if it's owned by nobody and everyone benefits from its existence.
This isn’t what “commons” means in the term ‘tragedy of the commons’, and the obvious end result of your suggestion to take as much as you can is to cause the loss of access.
Anything that is free to use is a commons, regardless of ownership, and when some people use too much, everyone loses access.
Finite digital resources like bandwidth and database sizes within companies are even listed as examples in the Wikipedia article on Tragedy of the Commons. https://en.wikipedia.org/wiki/Tragedy_of_the_commons
The behavior that you warn against is that of a free rider that makes use of a positive externality of GitHub’s offering.
“Commons can also be defined as a social practice of governing a resource not by state or market but by a community of users that self-governs the resource through institutions that it creates.”
https://en.wikipedia.org/wiki/Commons
The actual mechanism by which ownership resolves tragedy of the commons scenarios is by making the resource non-free, by either charging, regulating, or limiting access. The effect still occurs when something is owned but free, and its name is still ‘tragedy of the commons’, even when the resource in question is owned by private interests.
Edit: oh, I do see what you mean, and yes I misunderstood the quote I pulled from WP - it’s talking about non-ownership. I could pick a better example, but I think that’s distracting from the fact that ‘tragedy of the commons’ is a term that today doesn’t depend on the definition of the word ‘commons’. It’s my mistake to have gotten into any debate about what “commons” means, I’m only saying today’s usage and meaning of the phrase doesn’t depend on that definition, it’s a broader economic concept.
If Github realizes that the free tier is too generous, they can cut it anytime without it being in any way a "tragedy" for anybody involved - having to pay for stuff or services you want to consume is not the "T" in ToC! The T is that there are no incentives to pay (or use less) without increasing the incentives for everyone else to just increase their relative use! You not using the Github free tier doesn't increase the usage of Github for anybody else - if it has any effect at all, it might actually decrease the usage of Github, because you might not publish something that might in turn attract other users to interact.
The ‘tragedy’ that the top comment referred to is losing unlimited access to some of GitHub’s features, as described in the article (shallow clones, CPU limits, API rate limits, etc.). The finiteness, or natural limit, does exist in the form of bandwidth, storage capacity, server CPU capacity, etc.. The Wikipedia article goes through that, so I’m left with the impression you didn’t understand it.
The Wikimedia organization does not actually own Wikipedia. They do not control editorial policy, nor do they own the copyright of any of the contents. They do not pay any of the editors.
But let's stick with Github. On which of the following statements can we agree?
Z1) A "Commons" is a system of interacting market participants, governed by shared interests and incentives (and sometimes shared ownership). Github, a multi billion subsidiary of the multi trillion dollar company Microsoft, and I, their customer, are not members of the same commons; we don't share many interests, we have vastly different incentives, and we certainly do not share any ownership. We have a legally binding contract that each side can cancel within the boundaries of said contract under the applicable law.
Z2) A tragedy in the sense of the Tragedy of the Commons is that something bad happens even though everyone can have the best intentions, because the system lacks a mechanism that would allow it to a) coordinate interests and incentives across time, and b) reward sustainable behavior instead of punishing it.
A) Github giving away stuff for free while covering the cost does not constitute a common good from... 1. a legal perspective 2. an ethical perspective 3. an economic perspective
B) If a free tier is successful, a profit maximizing company with a market penetration far from saturation will increase the resources provided in total, while there is no such mechanism or incentive for any participant in a market involving a common good, e.g. no one will provide additional pasture for free if an Allmende's existing pasture is already being destroyed through overgrazing.
C) If a free tier is unsuccessful because it costs more than it enables in new revenue, a company can simply shut it down – no tragedy involved. No server has been depreciated, no software destroyed, no user lost their share of a commonly owned good.
D) More users of a free tier reduce net loss / increase net earnings per free user for the provider, while more cattle grazing on a pasture decrease net earnings / increase net loss per cow.
E) If I use less of Github, you don't have any incentive to use more of it. This is the opposite of a commons, where one participant taking less of it puts out an incentive to everybody else to take their place and take more of it.
F) A service that you pay for with your data, your attention, your personal or company brand and reach (e.g. with public repositories), is not really free.
G) The tiny product samples that you can get for free in perfume shops do not constitute a common good, even though they are limited, "free" for the user, and presumably beneficial even for people not involved in the transaction. If you think they are a common good, what about Nestlé offering Cheerios with +25% more for free? Are those 20% a common good just because they are free? Where do you draw the line? Paying with data, attention, and brand + reach is fine, but paying for only 80% of the produce is not fine?
H) The concepts of "moral hazard" and "free riders" apply to all your examples, both Github and Wikipedia. The concept of a Commons (capital C) is neither necessary nor helpful in describing the problems that you want to describe with respect to free services provided by either Github or Wikipedia.
Certainly private property is involved in tragedy of the commons. In the classic shared cattle ranching example, the individual cattle are private property, only the field is held in common.
I generally think that tragedy of the commons requires the commons to, well, be held in common. If someone owns the thing that is the commons, it's not a commons but just a bad product. (With of course some nitpicking about how things can be de jure private property while being de facto common property.)
In the Microsoft example, Windows becoming shitty software is not a tragedy of the commons, it's just MS making a business decision, because Windows is not a commons. On the other hand, computing in general becoming shitty, because each individual app does attention-grabbing dark patterns, as it helps the individual app's bottom line while hurting the ecosystem as a whole, would be a tragedy of the commons, as user attention is something all apps hold in common and none of them own.
The Microsoft example in this subthread is GitHub, not Windows. Windows is not a digital commons, because it’s neither free nor finite. Github is (or was) both. Those are the criteria that Wikipedia uses to apply the descriptor ‘commons’: something that is both freely available to the public and comes in limited supply, e.g. bandwidth, storage, databases, compute, etc.
Wikipedia’s article seems to be careful not to discuss ownership or define the tragedy of the commons in terms of ownership, presumably because the phrase describes something that can still happen when privately owned things are made freely available. I skimmed Investopedia’s article on the Tragedy as well, and it similarly doesn’t seem to explicitly discuss ownership, and it even brings up the complicated issue of the lack of international commons. That’s an interesting point: whatever we call commons locally may not be a commons globally. That suggests that even the original classic notion of tragedy of the commons often involves a type of private ownership, i.e. an overfished “public” lake is owned by a specific country, a “public” pasture overused by cattle is land owned by a specific country, and these resources might not be truly common when considered globally.
Your oft-repeated customer vs product platitude doesn’t seem to apply to GitHub, at least not to its founding and core product offering. You are the customer, and GitHub doesn’t advertise. It’s a freemium model; the free access is just a sort of loss leader to entice paid upgrades by you, the customer.
The mere fact that we're discussing this here is advertisement for Github's services. Q.e.d.
The idea of the tragedy of the commons relies on this feedback loop of having these unsustainably growing herds (growing because they can exploit the zero-cost-to-them resources of the commons). Feedback loops are notoriously sensitive to small parameter changes. MS could presumably impose some damping if they wanted.
Not linearity but continuity, which I think is a well-founded assumption, given that it's our categorization that simplifies the world by drawing sharp boundaries where no such bounds exist in nature.
> The idea of the tragedy of the commons relies on this feedback loop of having these unsustainably growing herds (growing because they can exploit the zero-cost-to-them resources of the commons)
AIUI, zero-cost is not a necessary condition, a positive return is enough. Fishermen still need to buy fuel and nets and pay off loans for the boats, but as long as their expected profit is greater than that, they'll still overfish and deplete the pond, unless stronger external feedback is introduced.
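A toy simulation of that dynamic, with every parameter invented, just to show how individually rational fishing only stops once profit hits zero, well below a sustainable stock:

```python
# Toy model of the dynamic above (every parameter is invented): each boat keeps
# fishing while a trip is individually profitable, but the cost of depletion is
# shared, so the stock is driven well below a sustainable level.
stock = 1000.0          # fish in the pond
boats = 10
price_per_fish = 2.0
cost_per_trip = 50.0

for year in range(1, 21):
    catch_per_boat = 0.05 * stock                    # easier fishing when stock is high
    profit_per_boat = price_per_fish * catch_per_boat - cost_per_trip
    if profit_per_boat <= 0:                         # the only feedback without regulation
        break
    stock -= boats * catch_per_boat                  # everyone's catch hits the shared pool
    stock *= 1.10                                    # regeneration can't keep up
    print(f"year {year:2d}: stock ~ {stock:6.1f}, profit/boat ~ {profit_per_boat:5.1f}")
```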
Given that the solution to tragedy of the commons is having the commons owned by someone who can boss the users around, GitHub being owned by MS makes it more of a commons in practice, not less.
You’re fundamentally misunderstanding what tragedy of the commons is. It’s not that it’s “zero-cost” for the participants. All it requires is a positive return that has a negative externality that eventually leads to the collapse of the system.
Overfishing and CO2 emissions are very clearly a tragedy of the commons.
GitHub right now is not. People putting all sorts of crap on there is not hurting github. GitHub is not going to collapse if people keep using it unbounded.
Not surprisingly, this is because it’s not a commons and Microsoft oversees it, placing appropriate rate limits and whatnot to make sure it keeps making sense as a business.
But I would make the following clarifications:
1. A private entity is still the steward of the resource and therefore the resource figures into the aims, goals, and constraints of the private entity.
2. The common good is itself under the stewardship of the state, whose function is to be guardian of the common good.
3. The common good is the default (by natural law) and prior to the private good. The latter is instituted in positive law for the sake of the former by, e.g., reducing conflict over goods.
I think it's both simpler and deeper than that.
Governments and corporations don't exist in nature. Those are just human constructs, mutually-recursive shared beliefs that emulate agents following some rules, as long as you don't think too hard about this.
"Tragedy of the commons" is a general coordination problem. The name itself might've been coined with some specific scenarios in mind, but for the phenomenon itself, it doesn't matter what kind of entities exploit the "commons"; the "private" vs. "public" distinction itself is neither a sharp divide, nor does it exist in nature. All that matters is that there's some resource used by several independent parties, and each of them finds it more beneficial to defect than to cooperate.
In a way, it's basically a 3+-player prisoner's dilemma. The solution is the same, too: introducing a party that forces all other parties to cooperate. That can be a private or public or any other kind of org taking ownership of the commons and enforcing quotas, or in case of prisoners, a mob boss ready to shoot anyone who defects.
But it appears we cannot avoid getting into the weeds a bit…
> Governments and corporations don't exist in nature.
This is not as simple as you seem to think.
The claim “don’t exist in nature” is vague, because the word “nature” in common speech is vague. What is “natural”? Is a beehive “natural”? Is a house “natural”? Is synthetic water “natural”? (I claim that the concept of “nature” concerns what it means to be some kind of thing. Perhaps polystyrene never existed before human beings synthesized it, but it has a nature, that is, it means something to be polystyrene. And it is in the nature of human beings to make materials and artifacts, i.e., to produce technology ordered toward the human good.)
So, what is government? Well, it is an authority whose central purpose is to function as the guardian and steward of the common good. I claim that parenthood is the primordial form of human government and the family as the primordial form of the state. We are intrinsically social and political animals; legitimate societies exist only when joined by a common good. This is real and part of human nature. The capacity to deviate from human nature does not disprove the norm inherent to it.
Now, procedurally we could institute various particular and concrete arrangements through which government is actualized. We could institute a republican form of government or a monarchy, for example. These are historically conditioned. But in all cases, there is a government. Government qua government is not some arbitrary “construct”, but something proper to all forms and levels of human society.
> "Tragedy of the commons" is a general coordination problem.
We can talk about coordination once we establish the ends for which such coordination is needed, but there is something more fundamental that must be said about the framing of the problem of the “tragedy”. The framing does not presume a notion of human beings as moral agents and political and social creatures. In other words, it begins with a highly individualist, homo economicus view of human nature as rationally egoist and oriented toward maximizing utility, full stop. But I claim that is not in accord with human nature and thus the human good, even if people can fall into such pathological patterns of behavior (especially in a culture that routinely reinforces that norm).
As I wrote, human beings are inherently social animals. We cannot flourish outside of societies. A commons that suffers this sort of unhinged extraction is an example of a moral and a political failure. Why? Because it is unjust, intemperate, and a lack of solidarity to maximize resource extraction in that manner. So the tragedy is a matter of a) the moral failure of the users of that resource, and b) the failure of an authority to regulate its use. The typical solution that’s proposed is either privatization or centralization, but both solutions presuppose the false anthropology of homo economicus. (I am not claiming that privatization does not have a place, only that the dichotomy is false.)
Now, I did say that the case with something like github is analogical, because functionally, it is like a common resource, just like how social media functions like a public square in some respects. But analogy is not univocity. Github is not strictly speaking a common good, nor is social media strictly a public square, because in both cases, a private company manages them. And typically, private goods are managed for private benefit, even if they are morally bound not to harm the common good.
That intent, that purpose, is central to determining whether something is public or private, because something public has the common benefit as its aim, while something private has private benefit as its aim.
The "tragedy", if you absolutely need to find one, is only for unrestricted, free-for-all commons, which is obviously a bad idea.
But that doesn't mean the tragedy of the commons can't happen in other scenarios. If we define commons a bit more generously it does happen very frequently on the internet. It's also not difficult to find cases of it happening in larger cities, or in environments where cutthroat behavior has been normalized
That works while the size of the community is ~100-200 people, when everyone knows everyone else personally. It breaks down rapidly after that. We compensate for that with hierarchies of governance, which give rise to written laws and bureaucracy.
New tribes break off old tribes, form alliances, which form larger alliances, and eventually you end up with countries and counties and voivodeships and cities and districts and villages, in hierarchies that gain a level per ~100x population increase.
This is the sociopolitical history of the world in a nutshell.
You say it like this is a law set in stone, because this is what happened in history, but I would argue it happened under different conditions.
Mainly, the advantage of an empire over small villages/tribes is not at all that it has more power than the villages combined, but that it can concentrate its power where it is needed. One village did not stand a chance against the empire - and the villages were not coordinated enough.
But today we would have the internet for better communication and coordination, enabling the small entities to coordinate a defense.
Well, in theory of course. Because we do not really have autonomous small states, but are dominated by the big players. And the small states mostly have the choice of which bloc to align with, or get crushed. But the trend might go towards small again.
(See also cheap drones destroying expensive tanks, battleships etc.)
NETWORK effect is a real thing
Canadians need an anti-imperial, Radio-Canada-run alternative. We aren't gonna be able to coordinate against the empire when the empire has the main control over the internet.
When the Americans come a-knocking, we're gonna wish we had Chinese radios.
Yet we regularly observe that working with millions of people: we take care of our young, we organize, and when we see that some action hurts our environment we tend to limit its use.
It's not obvious why some societies break down early and some go on working.
That's more like human universals. These behaviors generally manifest to smaller or larger degree, depending on how secure people feel. But those are extremely local behaviors. And in fact, one of them is exactly the thing I'm talking about:
> we organize
We organize. We organize for many reasons; "general living" is the main one, but we're mostly born into it today (few get the chance to be among the founding people of a new village, city or country). But the same patterns show up in every other organization people create, from companies to charities, from political interest groups to rural housewives' circles -- groups that grow past ~100 people split up. Sometimes into independent groups, sometimes into levels of hierarchies. Observe how companies have regional HQs and departments and areas and teams; religious groups have circuits and congregations, etc. Independent organizations end up creating joint ventures and partnerships, or merge together (and immediately split into a more complex internal structure).
The key factor here is, IMO, for everyone in a given group to be in regular contact with everyone else. Humans are well evolved for living in such small groups - we come with built-in hardware and software to navigate complex interpersonal situations. Alignment around shared goals and implicit rules is natural at this scale. There's no space for cheaters and free-loaders to thrive, because everyone knows everyone else - including the cheater and their victims. However, once the group crosses this "we're all a big family, in it together" size, coordinating everyone becomes hard, and free-loaders proliferate. That's where explicit laws come into play.
This pattern repeats daily, in organizations people create even today.
But if a significant fraction of the population is barely scraping by then they're not willing to be "good" if it means not making ends meet, and when other people see widespread defection, they start to feel like they're the only one holding up their end of the deal and then the whole thing collapses.
This is why the tendency for people to propose rent-seeking middlemen as a "solution" to the tragedy of the commons is such a diabolical scourge. It extracts the surplus that would allow things to work more efficiently in their absence.
It’s easier to explain in those terms than assumptions about how things work in a tribe.
Commons can fail, but the whole point of Hardin calling commons a "tragedy" is to suggest it necessarily fails.
Compare it to, say, driving. It can fail too, but you wouldn't call it "the tragedy of driving".
We'd be much better off if people didn't throw around this zombie term decades after it's been shown to be unfounded.
No, it does not. This sentiment, which many people have, is based on a fictional and idealistic notion of what small communities are like, held by people who have never lived in such communities.
Empirically, even in high-trust small villages and hamlets where everyone knows everyone, the same incentives exist and the same outcomes happen. Every single time. I lived in several and I can't think of a counter-example. People are highly adaptive to these situations and their basic nature doesn't change because of them.
Humans are humans everywhere and at every scale.
Nonetheless, the concept is still alive, and anthropogenic global warming is here to remind you of this.
Communal management of a resource is still government, though. It just isn’t central government.
The thesis of the tragedy of the commons is that an uncontrolled resource will be abused. The answer is governance at some level, whether individual, collective, or government ownership.
> The "tragedy", if you absolutely need to find one, is only for unrestricted, free-for-all commons, which is obviously a bad idea.
Right. And that’s what people are usually talking about when they say “tragedy of the commons”.
that seems like an unreasonable bar, and less useful than "does this system make ToC less frequent than that system"
This is of course a false dichotomy because governance can be done at any level.
Let's Encrypt is a solid example of something you could reasonably model as "tragedy of the commons" (who is going to maintain all this certificate verification and issuance infrastructure?) but then it turns out the value of having it is a million times more than the cost of operating it, so it's quite sustainable given a modicum of donations.
Free software licenses are another example in this category. Software frequently has a much higher value than development cost, and incremental improvements decentralize well, so a license that lets you use software for free but requires you to contribute back improvements tends to work well: people see something that would work for them except for this one thing, and it's cheaper to add that themselves, or pay someone to, than to pay someone to develop the whole thing from scratch.
The jerks get their free things for a while, then it goes away for everyone.
And out of curiosity, aside from costing more for some people, what’s worse exactly? I’m not a heavy GitHub user, but I haven’t really noticed anything in the core functionality that would justify calling it enshittified.
Probably the worst thing MS did was kill GitHub’s nascent CI project and replace it with Azure DevOps. Though to be fair the fundamental flaws with that approach didn’t really become apparent for a few years. And GitHub’s feature development pace was far too slow compared to its competitors at the time. Of course GitHub used to be a lot more reliable…
Now they’re cramming in half baked AI stuff everywhere but that’s hardly a MS specific sin.
MS GitHub has been worse about DMCA and sanctioned country related takedowns than I remember pre acquisition GitHub being.
Did I miss anything?
As for how the site has become worse, plenty of others have already done a better job than I could there. Other people haven't noticed or don't care and that's ok too I guess.
Remember how GTA5 took 10 minutes to start and nobody cared? Lots of software is like this.
Some Blizzard games download a 137 MB file every time you run them and take a few minutes to start (and no, this is not due to my computer).
The number of companies that have this much respect for the user is vanishingly small.
I think companies shifted to online apps because, #1, it solved the copy protection problem. FOSS apps are not in any hurry to become centralized because they don't care about that issue.
Local apps and data are a huge benefit of FOSS and I think every app website should at least mention that.
"Local app. No ads. You own your data."
Native software being an optimum is mostly an engineer fantasy that comes from imagining what you can build.
In reality that means having to install software like Meta’s WhatsApp, Zoom, and other crap I’d rather run in a browser tab.
I want very little software running natively on my machine.
Yes, there are many cases when condoms are indicative of respect between parties. But a great many people would disagree that the best, most respectful relationships involve condoms.
> Meta
Does not sell or operate respectful software. I will agree with you that it's best to run it in a browser (or similar sandbox).
I think this is sad.
In contrast as long as you have a native binary, one way or another you can make the thing run and nobody can stop you.
I know the browser is convenient, but frankly, it's been a horror show of resource usage, vulnerabilities, and pathetic performance.
The idea that somehow those companies would respect your privacy were they running a native app is extremely naive.
We can already see this problem on video games, where copy protection became resource-heavy enough to cause performance issues.
> "The Macintosh boots too slowly. You've got to make it faster!"
This is what people mean about speed being a feature. But "user time" depends on more than the program's performance. UI design is also very important.
Google and Amazon are famous for optimizing this. It's not an externality to them, though; even tens of ms can equal an extra sale.
That said, I don't think it's fair to add time up like that. Saving 1 second for 600 people is not the same as saving 10 minutes for 1 person. Time in small increments does not have the same value as time in large increments.
2. Monopolies and situations with the principal/agent dilemma are less sensitive to such concerns.
An externality is usually a cost you don't pay (or pay only a negligible amount of). I don't see how pricing it helps justify optimizing it.
The first argument would be: take at least two zeros off your estimate. Most applications will have maybe thousands of users; successful ones will maybe run with tens of thousands. You might get lucky and work on an application that has hundreds of thousands or millions of users, but then you work at a FAANG, not a typical "software house".
The second argument is: most users use 10-20 apps in a typical workday, so your application is most likely irrelevant.
The third argument is: most users would save much more time learning how to properly use the applications (or the computer) they use on a daily basis than from someone optimizing some function from 2s to 1s. But of course that's hard, because they have 10-20 apps daily plus god knows how many others they use less often. Still, I see people doing super silly stuff in tools like Excel, or even not knowing copy-paste - so not even anything like command-line magic.
Wait times don’t accumulate. Depending on the software, to each individual user, that one second will probably make very little difference. Developers often overestimate the effect of performance optimization on user experience because it’s the aspect of user experience optimization their expertise most readily addresses. The company, generally, will have a much better ROI implementing well-designed features and having you squash bugs.
Perhaps not everyone cares, but I've played enough Age of Empires 2 to know that there are plenty of people who have felt the value of shaving seconds off this and that to get compound gains over time. It's a concept plenty of folks will be familiar with.
Perhaps 120fps might result in a better approximation of motion blur.
I have to pay less attention to a thing that updates less frequently. Idle games are the best in that respect because you can check into the game on your own time rather than the game forcing you to pay attention on its time.
What is the probability of it being used? About 0%, right? Because git is proven and GitHub is free. Engineering aspects are less important.
So how do I start using it if I, for example, want to use it like a decentralized `syncthing`? Can I? If not, what can I use it for?
I am not a mathematician. Most people landing on your repo are not mathematicians either.
We the techies _hate_ marketing with a passion but I as another programmer find myself intrigued by your idea... with zero idea how to even use it and apply it.
I have never been convinced by this argument. The aggregate number sounds fantastic but I don't believe that any meaningful work can be done by each user saving 1 second. That 1 second (and more) can simply be taken by me trying to stretch my body out.
OTOH, if the argument is to make software smaller, I can get behind that since it will simply lead to more efficient usage of existing resources and thus reduce the environmental impact.
But we live in a capitalist world and there needs to be external pressure for change to occur. The current RAM shortage, if it lasts, might be one of them. Otherwise, we're only day dreaming for a utopia.
I’d see this differently from a user perspective. If the average operation takes one second less, I’d spend a lot less time waiting for my computer. I’d also have fewer idle moments where my mind wanders while waiting for some operation to complete.
Not all of those externalizing companies abuse your time, but whatever they abuse can be expressed in a $ amount, and $ can be converted to a median person's time via the median wage. Hell, free time is more valuable than whatever you produce during work.
Say all that boils down to companies collectively stealing 20 minutes of your time each day. 140 minutes each week. 7280 (!) minutes each year, which is 5.05 days, which makes it almost a year over the course of 70 years.
So yeah, don't do what you do and sweet-talk the fact that companies externalize costs (privatize the profits, socialize the losses). They're sucking your blood.
A high usage one, absolutely improve the time of it.
Loading the profile page? Isn't done often so not really worth it unless it's a known and vocal issue.
https://xkcd.com/1205/ gives a good estimate.
Even if all you do with it is just stretching, there's a chance it will prevent you pulling a muscle. Or lower your stress and prevent a stroke. Or any number of other beneficial outcomes.
In 24 years of my career I've met a grand total of _two_ such. Both got fired not even 6 months after I joined the company, too.
Who's naive here?
So now you and I both have come across such a manager. Why would you make the claim that most engineers don’t come across such people?
The article mentions that most of these projects did use GitHub as a central repo out of convenience so there’s that but they could also have used self-hosted repos.
I've looked into self-hosting a git repo with horizontal scalability, and it is indeed very difficult. I don't have the time to detail it in a comment here, but for anyone who is curious it's very informative to look at how GitLab handled this with Gitaly. I've also seen some clever attempts to use object storage, though I haven't seen any of those solutions put heavily to the test.
I'd love to hear from others about ideas and approaches they've heard about or tried
From compute POV you can serve that with one server or virtual machine.
Bandwidth-wise, given a 100 MB repo size, that would make it 3.4 GB/s - also easy terrain for a single server.
The git transport protocol is "smart" in a way that is, in some ways, arguably rather dumb. It's certainly expensive on the server side. All of the smartness of it is aimed at reducing the amount of transfer and number of connections. But to do that, it shifts a considerable amount of work onto the server in choosing which objects to provide you.
If you benchmark the resource loads of this, you probably won't be saying a single server is such an easy win :)
Using the slowest clone method, they measured 8s for a 750 MB repo and 0.45s for a 40 MB repo. It appears to be linear, so 1.1s for 100 MB should be a valid interpolation.
So doing 30 of those per second only takes 33 cores. Servers have hundreds of cores now (eg 384 cores: https://www.phoronix.com/review/amd-epyc-9965-linux-619).
And remember we're using worst case assumptions in places (using the slowest clone method, and numbers from old hardware). In practice I'd bet a fastish laptop would suffice.
edit: actually, on closer look at the GitHub-reported numbers the interpolation isn't straightforward: on the bigger 750 MB repo the partial clone is actually said to be slower than the base full clone. However, this doesn't change the big picture that it'll easily fit on one server.
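Spelling out the same back-of-the-envelope (all figures assumed from the comments above: the two measured clone times, linear scaling in between, and 30 clones/s of a 100 MB repo):

```python
# Rough capacity estimate using the figures above (all assumed/interpolated):
# clone CPU time scales roughly linearly between the two measured points.
size_a, time_a = 40.0, 0.45        # MB, seconds (slowest clone method)
size_b, time_b = 750.0, 8.0

slope = (time_b - time_a) / (size_b - size_a)
time_100mb = time_a + slope * (100.0 - size_a)       # ~1.1 s of server CPU per clone

clones_per_second = 30
cores_needed = clones_per_second * time_100mb        # each clone occupies a core for ~1.1 s
print(f"~{time_100mb:.2f} s per clone -> ~{cores_needed:.0f} cores at {clones_per_second} clones/s")
```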
It's a very hacky feeling addon that RKE2 has a distributed internal registry if you enable it and use it in a very specific way.
For the rate at which people love just shipping a Helm chart, it's actually absurdly hard to ship a self contained installation without just trying to hit internet resources.
Explain to me how you self-host a git repo without spending any money and having no budget, which is accessed millions of times a day from CI jobs pulling packages.
Oh no no no. Consumer-facing companies will burn 30% of your internal team complexity budget on shipping the first "frame" of your app/website. Many people treat Next as synonymous with React, and Next's big deal was helping you do just this.
The answer is in TFA:
> The underlying issue is that git inherits filesystem limitations, and filesystems make terrible databases.
because it's bad at this job, and sqlite is also free
this isn't about "externalities"
Anyone working in government, banking, or healthcare is still out of luck since the likes of Claude and GPT are (should be) off limits.
Also, in case you're not aware, accusing people of shilling or astroturfing is against the Hacker News guidelines.
This is perfectly sensible behavior when the developers are working for free, or when the developers are working on a project that earns their employer no revenue. This is the case for several of the projects at issue here: Nix, Homebrew, Cargo. It makes perfect sense to waste the user's time, as the user pays with nothing else, or to waste Github's bandwidth, since it's willing to give bandwidth away for free.
Where users pay for software with money, they may be more picky and not purchase software that indiscriminately wastes their time.
Windows 11 should not be more sluggish than Windows 7.
> The problem was that go get needed to fetch each dependency’s source code just to read its go.mod file and resolve transitive dependencies.
This article is mixing two separate issues. One is using git as the master database storing the index of packages and their versions. The other is fetching the code of each package through git. They are orthogonal; you can have a package index using git but the packages being zip/tar/etc archives, you can have a package index not using git but each package is cloned from a git repository, you can have both the index and the packages being git repositories, you can have neither using git, you can even not have a package index at all (AFAIK that's the case for Go).
It then digresses into implementation details of Github's backend implementation (how is 20k forks relevant?), then complains about default settings of the "standard" git implementation. You don't need to checkout a git working tree to have efficient key value lookups. Without a git working tree you don't need to worry about filesystem directory limits, case sensitivity and path length limits.
I was surprised the author believes the git-equivalent of a database migration is a git history rewrite.
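To make the orthogonality point above concrete, here are two hypothetical index entries (names and URLs invented); the place the index lives and the way each package is fetched are independent choices.

```python
# Hypothetical index entries (names and URLs invented) showing the two concerns
# are independent: where the *index* lives vs. how each *package* is fetched.
index_entry_tarball = {
    "name": "libfoo",
    "versions": {
        "1.2.0": {"source": "https://example.org/libfoo-1.2.0.tar.gz",
                  "sha256": "<checksum>"},
    },
}
index_entry_git = {
    "name": "libbar",
    "versions": {
        "0.9.1": {"source": "https://example.org/libbar.git",
                  "rev": "<commit hash>"},
    },
}
# Either entry could be stored in a git repo, served as static files over HTTP,
# or kept in a database -- choosing one doesn't constrain the other choice.
```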
What do you want me to do, invent my own database? Run postgres on a $5 VPS and have everybody accept it as single-point-of-failure?
Unfortunately, when you’re starting out, the idea of running a registry is a really tough sell. Now, on top of the very hard engineering problem of writing the code and making a world class tool, plus the social one of getting it adopted, I need to worry about funding and maintaining something that serves potentially a world of traffic? The git solution is intoxicating through this lens.
Fundamentally, the issue is the sparse checkouts mentioned by the author. You’d really like to use git to version package manifests, so that anyone with any package version can get the EXACT package they built with.
But this doesn’t work, because you need arbitrary commits. You either need a full checkout, or you need to somehow track the commit a package version is in without knowing what hash git will generate before you do it. You have to push the package update and then push a second commit recording that. Obviously infeasible, obviously a nightmare.
Conan’s solution is I think just about the only way. It trades the perfect reproduction for conditional logic in the manifest. Instead of 3.12 pointing to a commit, every 3.x points to the same manifest, and there’s just a little logic to set that specific config field added in 3.12. If the logic gets too much, they let you map version ranges to manifests for a package. So if 3.13 rewrites the entire manifest, just remap it.
I have not found another package manager that uses git as a backend that isn’t a terrible and slow tool. Conan may not be as rigorous as Nix because of this decision but it is quite pragmatic and useful. The real solution is to use a database, of course, but unless someone wants to wire me ten thousand dollars plus server costs in perpetuity, what’s a guy supposed to do?
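A sketch of the Conan-style approach described above, purely illustrative and not Conan's actual file format: one manifest covers a version range, a small conditional handles a field added mid-range, and an explicit range-to-manifest map is the escape hatch.

```python
# Sketch of the approach described above (illustrative only; not Conan's real
# format): one manifest covers a whole version range, a conditional handles
# fields added later, and a range -> manifest map remaps rewritten versions.
RANGE_TO_MANIFEST = {
    (3, 0): "manifests/sqlite-3.x.py",    # 3.0 .. 3.12 share one manifest
    (3, 13): "manifests/sqlite-3.13.py",  # 3.13 rewrote it, so remap from here on
}

def manifest_for(version: tuple[int, int]) -> str:
    eligible = [start for start in RANGE_TO_MANIFEST if start <= version]
    return RANGE_TO_MANIFEST[max(eligible)]

def configure(version: tuple[int, int]) -> dict:
    config = {"shared": False}
    if version >= (3, 12):                # option that only exists from 3.12 onward
        config["new_backend"] = True
    return config

print(manifest_for((3, 12)), configure((3, 12)))
# manifests/sqlite-3.x.py {'shared': False, 'new_backend': True}
```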
Every package has its own git repository which for binary packages contains mostly only the manifest. Sources and assets, if in git, are usually in separate repos.
This seems to not have the issues in the examples given so far, which come from using "monorepos" or colocating. It also avoids the "nightmare" you mention since any references would be in separate repos.
The problematic examples either have their assets and manifests colocated, or use a monorepo approach (colocating manifests and the global index).
There's no concept of installing sqlite 3.0 on a system where sqlite 3.5 is available.
For a language package manager, it's exactly the opposite. I could make a project with every version of sqlite the package manager has ever known about. They all must be resolvable.
If you want to do that resolution quickly (which manifest do I use for sqlite 3.0?), repo-per-package doesn't work without a bunch of machinery that makes it, IMO, not worth it.
Pacman is the best, you'd have to pry Arch from my cold, dead hands. Just different constraints.
The thing that scales is dumb HTTP that can be backed by something like S3.
You don't have to use a cloud, just go with a big single server. And if you become popular, find a sponsor and move to cloud.
If money and sponsor independence is a huge concern the alternative would be: peer-to-peer.
I haven't seen many package managers do it, but it feels like a huge missed opportunity. You don't need that many volunteers to peer in order to have a lot of bandwidth available.
Granted, the real problem that'll drive up hosting cost is CI. Or rather careless CI without caching. Unless you require a user login, or limit downloads for IPs without a login, caching is hard to enforce.
For popular package repositories you'll likely see extremely degenerate CI systems eating bandwidth as if it was free.
Disclaimer: opinions are my own.
That can be moved elsewhere / mirrored later if needed, of course. And the underlying data is still in git, just not actively used for the API calls.
It might also be interesting to look at what Linux distros do, like Debian (salsa), Fedora (Pagure), and openSUSE (OBS). They're good for this because their historic model is free mirrors hosted by unpaid people, so they don't have the compute resources.
However a lot of the "data in git repositories" projects I see don't have any such need, and then ...
> Why not just have a post-commit hook render the current HEAD to static files, into something like GitHub Pages?
... is a good plan. Usually they make a nice static website with the data that's easy for humans to read though.
So you need a decentralized database? Those exist (or you can make your own, if you're feeling ambitious), probably ones that scale in different ways than git does.
It’s really important that someone is able to search for the manifest one of their dependencies uses for when stuff doesn’t work out of the box. That should be as simple as possible.
I’m all ears, though! Would love to find something as simple and good as a git registry but decentralized
You could just make a registry hosted as plain HTTP, with everything signed. And a special file that contains a list of mirrors.
Clients request the mirror list and the signed hash of the last entry in the Merkle tree. Then they go talk to a random mirror.
Maybe your central service requires user sign-in for publishing and reading, while mirrors can't publish but don't require sign-in.
Obviously, you'd have to validate that mirrors are up and populated. But that's it.
You can start by self hosting a mirror.
One could go with signing schemes inspired by: https://theupdateframework.io/
Or one could omit signing altogether, so long as you have a Merkle tree with hashes for all publishing events, and the latest hash entry is always fetched from your server along with the mirror list.
Having all publishing go through a single service is probably desirable. You'll eventually need to do moderation, etc. And hosting your service or a mirror becomes a legal nightmare if there is no moderation.
Disclaimer: opinions are my own.
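A rough sketch of that flow, with all endpoints, file formats, and field names invented: the central service publishes only a signed head (latest Merkle/log root, per-package digests, and the mirror list), while the actual bytes come from any mirror and get verified against that head.

```python
# Rough sketch of the proposed flow (endpoints, formats, and field names are
# all invented): mirrors need no trust of their own, because everything they
# serve is checked against the centrally published, signed head.
import hashlib, json, random, urllib.request

CENTRAL = "https://registry.example.org"      # hypothetical central service

def resolve(package: str) -> bytes:
    with urllib.request.urlopen(f"{CENTRAL}/head.json") as resp:
        head = json.load(resp)
    # (Verifying the signature on `head` would go here, e.g. via TUF-style keys.)
    mirror = random.choice(head["mirrors"])
    data = urllib.request.urlopen(f"{mirror}/packages/{package}").read()
    if hashlib.sha256(data).hexdigest() != head["digests"][package]:
        raise ValueError(f"{mirror} served stale or tampered data for {package}")
    return data
```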
0: https://github.com/mesonbuild/wrapdb/tree/master/subprojects
Interesting! Do you mind sharing a link to the project at this point?
But, that being said, here's the repo! I added a very basic README for you. It's one command to bootstrap to a self hosting build, so give it a shot if you're interested. My contact is in my profile.
Julia does the same thing, and from the Rust numbers in the article, Julia has about 1/7th the number of packages that Rust does[1] (95k/13k = 7.3).
It works fine, Julia has some heuristics to not re-download it too often.
But more importantly, there's a simple path to improve. The top Registry.toml [1] has a path to each package, and once downloading everything proves unsustainable you can just download that one file and use it to download the rest as needed. I don't think this is a difficult problem.
[1] https://github.com/JuliaRegistries/General/blob/master/Regis...
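A sketch of that incremental path (the registry layout follows the linked Registry.toml; the raw-file URLs and the lack of error handling are simplifying assumptions):

```python
# Sketch of the incremental approach described above (layout per the linked
# Registry.toml; the raw-file URLs and TOML details are simplified/assumed).
import tomllib, urllib.request

RAW = "https://raw.githubusercontent.com/JuliaRegistries/General/master"

def fetch_toml(path: str) -> dict:
    with urllib.request.urlopen(f"{RAW}/{path}") as resp:
        return tomllib.loads(resp.read().decode())

# Download only the top-level index instead of the whole registry...
registry = fetch_toml("Registry.toml")

# ...then resolve a single package lazily via its `path` entry when needed.
def versions_for(name: str) -> dict:
    for entry in registry["packages"].values():
        if entry["name"] == name:
            return fetch_toml(f"{entry['path']}/Versions.toml")
    raise KeyError(name)
```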
Another way to phrase this mindset is "fuck around and find out" in gen-Z speak. It's usually practical to an extent but I'm personally not a fan
Building on the same thing people use for code doesn't seem stupid to me, at least initially. You might have to migrate later if you're successful enough, but that's not a sign of bad engineering. It's just building for where you are, not where you expect to be in some distant future
When you fuck around optimizing prematurely, you find out that you're too late and nobody cares.
Oh, well, optimization is always fun, so there's that.
Software engineers always make the excuse that what they're making now is unimportant, so who cares? But then everything gets built on top of that unimportant thing, and one day the world crashes down. Worse, "fixing the problem" becomes near impossible, because now everything depends on it.
But really the reason not to do it, is there's no need to. There are plenty of other solutions than using Git that work as well or better without all the pitfalls. The lazy engineer picks bad solutions not because it's necessarily easier than the alternatives, but because it's the path of least resistance for themselves.
Not only is this not better, it's often actively worse. But this is excused by the same culture that gave us "move fast and break things". All you have to do is use any modern software to see how that worked out. Slow bug-riddled garbage that we're all now addicted to.
Most software gets to take it to more of an extreme than many engineering fields, since there isn't physical danger. It's telling that the counter-examples always use potentially dangerous problems like medicine or nuclear engineering. The software in those fields is more stringent.
As opposed to something like using a flock of free blogger.com blogs to host media for an offsite project.
Contrary to the snap conclusion you drew from the article, there are design trade-offs involved when it comes to package managers using Git. The article's favored solution advocates for databases, which in practice, makes the package repository a centralized black box that compromises package reproducibility. It may solve some problems, but still sucks harder in some ways.
The article is also flat-out wrong regarding Nixpkgs. The primary distribution method for Nixpkgs has always been tarballs, not Git. Although the article has attempted to backpedal [1], it hasn't entirely done so. It's now effectively criticizing collaboration over Git while vaguely suggesting that maybe it’s a GitHub problem. And you think what, that collaboration over Git is "unethical"???
On one side, there are open-source maintainers contributing their time and effort as volunteers. On the other, there are people like you attacking them, labeling them "lazy" and bemoaning that you're "forced" to rely on the results of their free labor, which you deride as "slow, bug-riddled garbage" without any real understanding. I know whose side I'm on.
[1]: https://github.com/andrew/nesbitt.io/commit/8e1c21d96f4e7b3c...
You realize, there are people who think differently? Some people would argue that if you keep working on problems you don't have but might have, you end up never finishing anything.
It's a matter of striking a balance, and I think you're way on one end of the spectrum. The vast majority of people using Julia aren't building nuclear plants.
Refusing to fix a problem that hasn't appeared yet, but has been/can be foreseen - that's different. I personally wouldn't call it unethical, but I'd consider it a negative.
Literally anybody could foresee that, _if_ something scales to millions of users, there will be issues. Some of the people who foresee that could even fix it. But they might spend their time optimizing for something that will never hit 1000 users.
Also, the problems discussed here are not that things don't work, it's that they get slow and consume too many resources.
So there is certainly an optimal time to fix such problems, which is, yes, OK, _before_ things get _too_ slow and consume _too_ many resources, but is most assuredly _after_ you have a couple of thousand users.
It cannot be the case that software engineers are labelled lazy for not building the at-scale solution to start with, but at the same time everyone wants to use their work, and there are next to no resources for said engineer to actually build the at scale solution.
> the path of least resistance for themselves.
Yeah because they're investing their own personal time and money, so of course they're going to take the path that is of least resistance for them. If society feels that's "unethical", maybe pony up the cash because you all still want to rely on their work product they are giving out for free.
I like OSS and everything.
Having said that, ethically, should society be paying for these? Maybe that is what should happen. In some places, we have programs to help artists. Should we have the same for software?
... Should it be concerning that someone was apparently able to engineer an ID like that?
Right now I don't see the problem because the only criterion for IDs is that they are unique.
Apparently it is the former, and most developers independently generate random IDs because it's easy and is extremely unlikely to result in collisions. But it seems the dev at the top of the list had a sense of vanity instead.
https://en.wikipedia.org/wiki/Universally_unique_identifier
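For what it's worth, "generate a random ID" in practice usually means a version 4 UUID: 122 random bits with the version and variant fields forced. A minimal sketch in Go, not any particular library's implementation:

    package main

    import (
        "crypto/rand"
        "fmt"
    )

    func main() {
        var b [16]byte
        if _, err := rand.Read(b[:]); err != nil {
            panic(err)
        }
        b[6] = (b[6] & 0x0f) | 0x40 // version nibble = 4 (random UUID)
        b[8] = (b[8] & 0x3f) | 0x80 // RFC 4122 variant bits
        fmt.Printf("%x-%x-%x-%x-%x\n", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16])
    }

With 122 random bits, a collision is vanishingly unlikely, which is exactly why "everyone just generates their own" works in practice.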
> 00000000-1111-2222-3333-444444444444
This would technically be version 2, the DCE Security version, which is supposed to be derived from the date-time and MAC address.
But overall, if you allow any yahoo to pick a UUID, it's not really a UUID, it's just some random string that looks like one.
universally unique identifier (UUID)
> 00000000-1111-2222-3333-444444444444
It's unique.
Anyway we're talking about a package that doesn't matter. It's abandoned. Furthermore it's also broken, because it uses REPL without importing it. You can't even precompile it.
https://github.com/pfitzseb/REPLTreeViews.jl/blob/969f04ce64...
https://devblogs.microsoft.com/oldnewthing/20120523-00/?p=75...
This is too naive. Fixing the problem costs a different amount depending on when you do it. The later you leave it the more expensive it becomes. Very often to the point where it is prohibitively expensive and you just put up with it being a bit broken.
This article even has an example of that - see the vcpkg entry.
Mostly to avoid downloading the whole repo and resolving deltas from the history for the few packages most applications tend to depend on. Especially in today's CI/CD world.
It relies on a git repo branch for stable. There are YAML definitions of the packages, including URLs to their repos, dependencies, etc. Preflight scripts. Post-install checks. And the big one, the signatures for verification. No binaries, rpms, debs, ar, or zip files.
What’s actually installed lives in a small SQLite database, and searching for software does a vector search on each package's YAML description.
Semver included.
This was inspired by brew/portage/dpkg for my hobby os.
Sure, eventually you run into scaling issues, but that's a first world problem.
As it is, this comment is just letting out your emotion, not engaging in dialogue.
The point being, if you're not sure whether your project will ever need to scale, then it may not make sense to reinvent the wheel when git is right there (and then invent the solution for hosting that git repo, when Github is right there), letting you spend time instead on other, more immediate problems.
I am sure there's value having a vision for what your scaling path might be in the future, so this discussion is a good one. But it doesn't automatically mean that git is a bad place to start.
Let's also keep in mind that the use case mentioned in the OP is specifically about the index, which is just the data structure that informs the version resolver how to resolve versions. When it came time to replace the git-based index, Cargo didn't replace it with a specialized database; it replaced it with HTTP endpoints (which are probably just backed by an off-the-shelf database). It's not clear what sort of specialized database would be useful to abstract this for other package managers.
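For the curious, the sparse index Cargo moved to really is that plain: one HTTP-fetchable file per crate, newline-delimited JSON with one line per published version, laid out with the same directory scheme the old git index used. A hedged sketch of a lookup; the path scheme and host reflect my reading of how crates.io does it, so treat the details as approximate:

    package main

    import (
        "bufio"
        "fmt"
        "net/http"
        "strings"
    )

    // indexPath mirrors the old git index layout: 1/<name>, 2/<name>,
    // 3/<first char>/<name>, then <first two>/<next two>/<name>.
    func indexPath(name string) string {
        name = strings.ToLower(name)
        switch {
        case len(name) <= 2:
            return fmt.Sprintf("%d/%s", len(name), name)
        case len(name) == 3:
            return fmt.Sprintf("3/%s/%s", name[:1], name)
        default:
            return fmt.Sprintf("%s/%s/%s", name[:2], name[2:4], name)
        }
    }

    func main() {
        resp, err := http.Get("https://index.crates.io/" + indexPath("serde"))
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        sc := bufio.NewScanner(resp.Body)
        sc.Buffer(make([]byte, 0, 1<<20), 1<<20) // version lines can be long
        for sc.Scan() {
            line := sc.Text() // one JSON object per published version
            if len(line) > 80 {
                line = line[:80] + "..."
            }
            fmt.Println(line)
        }
    }

The nice part is that any dumb HTTP server or CDN can host this; there is no query engine on the server side at all.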
The issues are only fundamental with that architecture. Using a separate repo for each package, like the Arch User Repos, does not have the same problems.
Nixpkgs certainly could be architected like that and submodules would be a graceful migration path. I'm not aware of discussion of this but guess that what's preventing it might be that github.com tooling makes it very painful to manage thousands of repos for a single project.
So I think the lesson isn't that using git as a database is bad, but that using github.com as a database is. PRs as database transactions are clunky, and GitHub Actions isn't really ACID.
The index could be split from the build and the package build defs could live in independent repos (like go or aur).
It would probably take some change to nix itself to make that work and some nontrivial work on tooling to make the devex decent.
But I don't think the friction with nixpkgs should be seen as damning for backing a package registry with git in general.
Much better to start with an API. Then you can have the server abstract the store and the operations - use git or whatever - but you can change the store later without disrupting your clients.
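Concretely, that just means clients talk to an interface rather than to the storage. A minimal sketch in Go; the names are illustrative, not any real registry's API:

    package registry

    import "context"

    // PackageIndex is the only thing clients ever see; the backing store
    // (a git repo, a SQLite file, object storage, ...) stays a server-side
    // implementation detail that can be swapped later.
    type PackageIndex interface {
        // Versions lists the published versions of a package.
        Versions(ctx context.Context, name string) ([]string, error)
        // Manifest returns the metadata needed to resolve one version
        // (dependencies, checksums, yank status, ...).
        Manifest(ctx context.Context, name, version string) ([]byte, error)
    }

    // A gitIndex and a sqliteIndex would both satisfy PackageIndex;
    // migrating between them never breaks clients.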
Turns out Go modules will not accept a package hosted on my Forgejo instance because it asks for a certificate. There are ways to make go get use ssh, but even with that approach the repository needs to be accessible over https. In the end, I cloned the repository and used it in my project using a replace directive. It's really annoying.
No, that's false. You don't need anything to be accessible over HTTP.
But even if it did, and you had to use mTLS, there's a whole bunch of ways to solve this. How do you solve this for any other software that doesn't present client certs? You use a local proxy.
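If it helps, the usual recipe for a private Forgejo/Gitea host (the host name below is a placeholder) is to take the module out of the public proxy/checksum path and let git rewrite HTTPS to SSH:

    go env -w GOPRIVATE=forgejo.example.com
    git config --global url."ssh://git@forgejo.example.com/".insteadOf "https://forgejo.example.com/"

And if I remember right, putting the VCS qualifier in the module path itself (forgejo.example.com/user/repo.git) makes the go tool use git directly and skip the HTTPS import-path discovery request entirely.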
This has happened to me a few times now. The last one was a fantastic article about how PG Notify locks the whole database.
In this particular case it just doesn’t make a ton of sense to change course. I’m a solo dev building a thing that may never take off, so using git for plug-in distribution is just a no-brainer right now. That said, I’ll hold on to this article in case I’m lucky enough to be in a position where scale becomes an issue for me.
I don't know if you rely on github.com but IMO vendor lock-in there might be a bigger issue which you can avoid.
Regardless of the semantics, git is not ideal for serving files. This has become more apparent in the AI world, where extensions such as Git LFS allow for larger file sizes.
But as seen elsewhere, network effects trump any design issues. We could always introduce an "LFS" for better shallow fetching (cached? compressed?) and that would resolve the majority of the OP's grievances.
If it didn't work we would not have these massive ecosystems upsetting GitHub's freemium model, but anything at scale is naturally going to have consequences and features that aren't so compatible with the use case.
Personally my view is that the main problem when they do this is that it gets much harder for non-technical people to contribute. At least that doesn't apply to package managers, where it's all technical people contributing.
There are a few other small problems - but it's interesting to see that so many other projects do this.
I ended up working on an open source software library to help in these cases: https://www.datatig.com/
Here's a write up of an introduction talk about it: https://www.datatig.com/2024/12/24/talk.html I'll add the scale point to future versions of this talk with a link to this post.
Homebrew uses OCI as its backend now, and I think every package manager should. It has the right primitives you expect from a registry to scale.
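For anyone who hasn't looked at the OCI distribution spec: the whole read path is basically two endpoints, a tag-to-manifest lookup and content-addressed blob fetches, which is exactly what a package client needs. A hedged sketch with a placeholder registry, package name, and digest (real registries like ghcr.io also want a bearer token):

    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    func main() {
        base := "https://registry.example.com/v2/myorg/mypkg"

        // 1. Resolve a tag to a manifest, which lists blob digests.
        req, _ := http.NewRequest("GET", base+"/manifests/1.2.3", nil)
        req.Header.Set("Accept", "application/vnd.oci.image.manifest.v1+json")
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        manifest, _ := io.ReadAll(resp.Body)
        fmt.Println(string(manifest))

        // 2. Fetch the package payload by the content digest taken from
        //    the manifest (placeholder digest here).
        blob, err := http.Get(base + "/blobs/sha256:0123abcd...")
        if err != nil {
            panic(err)
        }
        defer blob.Body.Close()
        fmt.Println(blob.Status)
    }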
SQLite data is paged, so you can get away with fetching only the pages you need to resolve your query.
https://phiresky.github.io/blog/2021/hosting-sqlite-database...
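The underlying trick (not phiresky's actual library, just a sketch of the idea): a SQLite file is a sequence of fixed-size pages, so a read-only client can pull just the pages a query touches with HTTP Range requests. The URL and page size below are placeholders:

    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    // fetchPage grabs a single database page (pages are 1-indexed in the
    // SQLite file format) from a statically hosted database file.
    func fetchPage(url string, page, pageSize int64) ([]byte, error) {
        req, err := http.NewRequest("GET", url, nil)
        if err != nil {
            return nil, err
        }
        start := (page - 1) * pageSize
        req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", start, start+pageSize-1))
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusPartialContent {
            return nil, fmt.Errorf("server did not honor Range: %s", resp.Status)
        }
        return io.ReadAll(resp.Body)
    }

    func main() {
        // A real VFS walks the B-tree and requests only the pages it needs;
        // here we just grab page 1 (the header and schema root).
        p, err := fetchPage("https://example.com/packages.sqlite", 1, 4096)
        if err != nil {
            panic(err)
        }
        if len(p) >= 16 {
            fmt.Printf("fetched %d bytes, header: %q\n", len(p), p[:16])
        }
    }

A query like "give me the versions of package X" then costs a handful of page fetches instead of the whole database.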
But that's different from how you collect the data in a git repository in the first place - or are you suggesting just putting a Sqlite file in a git repository? If so I can think of one big reason against that.
But if you are I wouldn't recommend it.
PRs won't be able to show diffs. Worse, as soon as multiple people send a PR at once you'll have a really painful merge to resolve, and GitHub's tools won't help you at all. And you can't edit the files in GitHub's web UI.
I recommend one file per record, JSON, YAML, whatever non-binary format you want. But then you get:
* PRs with diffs that show you what's being changed
* Files that technical people can edit directly in GitHub's web editor
* If 2 people make PRs on different records at once, it's an easy merge with no conflicts
* If 2 people make PRs on the same record at once ... ok, you might now have a merge conflict to resolve, but it's in an easy text file and the GitHub UI will let you see what it is.
You can of course then compile these data files into a SQLite file that can be served in a static website nicely - in fact if you see my other comments on this post I have a tool that does this. And on that note, sorry, I've done a few projects in this space so I have views :-)
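A rough sketch of that compile step, assuming one JSON file per package under packages/ and the mattn/go-sqlite3 driver; the schema and field names are made up for illustration:

    package main

    import (
        "database/sql"
        "encoding/json"
        "os"
        "path/filepath"

        _ "github.com/mattn/go-sqlite3"
    )

    type record struct {
        Name        string `json:"name"`
        Description string `json:"description"`
    }

    func main() {
        db, err := sql.Open("sqlite3", "site/packages.sqlite")
        if err != nil {
            panic(err)
        }
        defer db.Close()

        if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS packages (
            name TEXT PRIMARY KEY, description TEXT)`); err != nil {
            panic(err)
        }

        files, _ := filepath.Glob("packages/*.json")
        for _, f := range files {
            raw, err := os.ReadFile(f)
            if err != nil {
                panic(err)
            }
            var r record
            if err := json.Unmarshal(raw, &r); err != nil {
                panic(err)
            }
            if _, err := db.Exec(`INSERT OR REPLACE INTO packages VALUES (?, ?)`,
                r.Name, r.Description); err != nil {
                panic(err)
            }
        }
    }

The per-record text files stay the source of truth for review and merges; the SQLite file is a build artifact you can publish as a static asset.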
Could even follow your record model, and use that as data to populate the db.
Cut out the middle man, directly serve the query response to the package manager client.
(I do immediately see issues stemming from the fact that you can't leverage features like edge caching this way, but I'm not really asking if it's a good solution, I'm more asking if it's possible at all)
Anything where you are opening a TCP connection to a hosted SQL server is a non-starter. You could hypothetically have so many read replicas that no one could blow anyone else up, but this would get to be very expensive at scale.
Something involving SQLite is probably the most viable option.
Also Stackoverflow exposes a SQL interface so it isn't totally impossible.
https://play.clickhouse.com/
clickhouse-client --host play.clickhouse.com --user play --secure
ssh play.clickhouse.com
All of the complexity lives on the client. That makes a lot of sense for a package manager because it's something lots of people want to run, but no one really wants to host.
I don't get what is so bad about shallow clones either. Why should they be so performance sensitive?
If 83GB (4MB/fork) is "too big" then responsibility for that rests solely on the elective centralization encouraged by Github. I suspect if you could go and total up the cumulative storage used by the nixpkgs source tree distributed on computers spread throughout the world, that is many orders of magnitude larger.
The solution is simple: using a shallow clone means that the use case doesn’t care about the history at all, so download a tarball of the repo for the initial download and then later rsync the repo. Git can remain the source of truth for all history, but that history doesn’t have to be exposed.
Scaling that data model beyond projects the size of the Linux kernel was not critical for the original implementation. I do wonder if there are fundamental limits to scaling the model for use cases beyond “source code management for modest-sized, long-lived projects”.
Consider vcpkg. It’s entirely reasonable to download a tree named by its hash to represent a locked package. Git knows how to store exactly this, but git does not know how to transfer it efficiently.
Naïvely, I’d expect shallow clones to be this, so I was quite surprised by a mention of GitHub asking people not to use them. Perhaps Git tries too hard to make a good packfile?..
Meanwhile, what Nixpkgs does (and why “release tarballs” were mentioned as a potential culprit in the discussion linked from TFA) is request a gzipped tarball of a particular commit’s files from a GitHub-specific endpoint over HTTP rather than use the Git protocol. So that’s already more or less what you want, except even the tarball is 46 MB at this point :( Either way, I don’t think the current problems with Nixpkgs actually support TFA’s thesis.
So when the article says "Package managers keep falling for this. And it keeps not working out", I feel that's untrue.
The biggest issue I have with this is really the "flakes" integration, where the whole recipe folder is copied into the store (which doesn't happen with non-flakes commands), but that's a tooling problem, not an intrinsic problem of using git.
O(1) beats O(n) as n gets large.
This entire blog is just a waste of time for anyone reading it.
Well that’s an extremely rude thing to say.
Personally I thought it was really interesting to read about a bunch of different projects all running into the same wall with Git.
I also didn’t realize that Git had issues with sparse checkouts. Or maybe the author meant shallow? I forget.
That's completely unrelated.
The --allow-dirty flag is to bypass a local safety check which prevents you from accidentally publishing a crate with changes which haven't been committed to your local git repository. It has no relation at all to the use of git for the index of packages.
> Crates.io should not know about or care about my project's Git usage or lack thereof.
There are good reasons to know or care. The first one, is to provide a link from the crates.io page to your canonical version control repository. The second one, is to add a file containing the original commit identifier (commit hash in case of git) which was used to generate the package, to simplify auditing that the contents of the package match what's on the version control repository (to help defend against supply chain attacks). Both are optional.
> The problem was that go get needed to fetch each dependency’s source code just to read its go.mod file and resolve transitive dependencies. Cloning entire repositories to get a single file.
I have also had inconsistent performance with go get. Never enough to look closely at it. I wonder if I was running into the same issue?
Python used to have this problem as well (technically still does, but a large majority of things are available as a wheel and PyPI generally publishes a separate .metadata file for those wheels), but at least it was only a question of downloading and unpacking an archive file, not cloning an entire repo. Sheesh.
Why would Go need to do that, though? Isn't the go.mod file in a specific place relative to the package root in the repo?
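(It is in a fixed place, but with plain git there was no reliably cheap way to fetch just that one file from an arbitrary host, which is why the module proxy protocol now serves it directly: GET <proxy>/<module>/@v/<version>.mod returns only the go.mod. A minimal sketch against the public proxy:)

    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    func main() {
        // Module proxy protocol: the go.mod for a specific version has its
        // own URL, so dependency resolution never needs a repo checkout.
        url := "https://proxy.golang.org/golang.org/x/text/@v/v0.14.0.mod"
        resp, err := http.Get(url)
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        body, _ := io.ReadAll(resp.Body)
        fmt.Print(string(body))
    }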
One thing that still seems absent is awareness of the complete takeover of "gadgets" in schools. Schools these days, as early as primary school, shove screens in front of children. They're expected to look at them, and "use" them for various activities, including practicing handwriting. I wish I was joking [1].
I see two problems with this.
First is that these devices are engineered to be addictive by way of constant notifications/distractions, and learning is something that requires long sustained focus. There's a lot of data showing that under certain common circumstances, you do worse learning from a screen than from paper.
Second is implicitly it trains children to expect that anything has to be done through a screen connected to a closed point-and-click platform. (Uninformed) people will say "people who work with computers make money, so I want my child to have an ipad". But interacting with a closed platform like an ipad is removing the possibilities and putting the interaction "on rails". You don't learn to think, explore and learn from mistakes, instead you learn to use the app that's put in front of you. This in turn reinforces the "computer says no" [2] approach to understanding the world.
I think this is a matter of civil rights and freedom, but sadly I don't often see "civil rights" organizations talk about this. I think I heard Stallman say something along these lines once, but other than that I don't see campaigns anywhere.
Like, yes, you should host your own database. This doesn't seem like an argument against that database being git.
The other stuff mentioned in the article seems to be valid criticisms.
It looks like that doc https://docs.gitlab.com/development/wikis/ was outdated - since fixed to no longer mention Gollum.
For example, we use Hugo to provide independent Go package URLs even though the code is hosted on GitHub. That makes migrating away from GitHub trivial if we ever choose to do so (Repo: https://github.com/foundata/hugo-theme-govanity; Example: https://golang.foundata.com/hugo-theme-dev/). Usage works as expected:
go get golang.foundata.com/hugo-theme-dev
Edit: Formatting
I quite like the hackage index, which is an append-only tar file. Incremental updates are trivial using HTTP range requests, making hosting it trivial as well.
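The incremental update really is as simple as it sounds: if you already have the first N bytes of an append-only file, you only ask for what came after them. A sketch; the file name and URL are placeholders for a Hackage-style index:

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "os"
    )

    func main() {
        have := int64(0)
        if fi, err := os.Stat("01-index.tar"); err == nil {
            have = fi.Size()
        }

        req, _ := http.NewRequest("GET", "https://example.org/01-index.tar", nil)
        // Open-ended range: everything from our current offset to EOF.
        req.Header.Set("Range", fmt.Sprintf("bytes=%d-", have))
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        switch resp.StatusCode {
        case http.StatusPartialContent:
            f, err := os.OpenFile("01-index.tar", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
            if err != nil {
                panic(err)
            }
            defer f.Close()
            n, _ := io.Copy(f, resp.Body)
            fmt.Println("appended", n, "bytes")
        case http.StatusRequestedRangeNotSatisfiable:
            fmt.Println("already up to date")
        default:
            fmt.Println("unexpected status:", resp.Status) // e.g. a plain 200 on a first full fetch
        }
    }

Hosting this is just serving a static file from any server that supports range requests.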
For every project that managed to outgrow ext4/git, there were a hundred that were well served and never needed to over-invest in something else.
Its being suboptimal bandwidth-wise is frankly just a technical hurdle to get over, and the benefits are well worth the drawback.
What the... why would you run an autoupdate every 5 minutes?
The most pain probably just comes from the hugeness of Nixpkgs, but I remain an advocate for the huge monorepo of build recipes.
I think the snix¹ folks are working on something like this for the binary caches— the greater granularity of the content-addressing offers morally the same kind of optimization as delta RPMs: you can download less of what you don't need to re-download.
But I'm not aware of any current efforts to let people download the Nixpkgs tree itself more efficiently. Somehow caching Git deltas would be cool. But I'd expect that kind of optimization to come from a company that runs a Git forge, if it's generally viable, and to benefit many projects other than Nix and Nixpkgs.
--
The ideal for nix would be “I have all content at commit X and need the deltas for content at commit Y”, and I suspect nix would be fairly unique in being able to benefit from that. To the point that it might actually make sense to just implement fast git repo syncs and have a local client serve those tarballs to the nix daemon.
"Use a database" isn't actionable advice because it's not specific enough
Seems possible if every git client is also a torrent client.
> The hosting problems are symptoms. The underlying issue is that git inherits filesystem limitations, and filesystems make terrible databases.
Does this mean mbox is inherently superior to maildir? I really like the idea of maildir because there is nothing to compact but if we assume we never delete emails (on the local machine anyways), does that mean mbox or similar is preferable over maildir?
I feel sometimes like package management is a relatively second-class topic in computer science (or at least among many working programmers). But a package manager's behavior can be the difference between a grotesque, repulsive experience and a delightful, beautiful one. And there aren't quite yet any package managers that do well everything that we collectively have learned how to do well, which makes it an interesting space imo.
Re: Nixpkgs, interestingly, pre-flakes Nix distributes all of the needed Nix expressions as tarballs, which does play nice with CDNs. It also distributes an index of the tree as a SQLite database to obviate some of the "too many files/directories" problem with enumerating files. (In the meantime, Nixpkgs has also started bucketing package directories by name prefix, too.) So maybe there was a lesson learned here that would be useful to re-learn.
On the other hand, IIRC if you use the GitHub fetcher rather than the Git one, including for fetching flakes, Nix will download tarballs from GitHub instead of doing clones. Regardless, downloading and unpacking Nixpkgs has become kinda slow. :-\
The index is used for all lookups; it can also be generated or incrementally updated client-side to accommodate local changes.
This has worked fine for literally decades, starting back when bandwidth and CPU power was far more limited.
The problem isn’t using SCM, and the solutions have been known for a very long time.
> With release 2.45, Git has gained support for the “reftable” backend to read and write references in a Git repository. While this was a significant milestone for Git, it wasn’t the end of GitLab’s journey to improve scalability in repositories with many references. In this talk you will learn what the reftable backend is, what work we did to improve it even further and why you should care.
https://www.youtube.com/watch?v=0UkonBcLeAo
Also see Scalar, which Microsoft used to scale their 300GiB Windows repository, https://github.com/microsoft/scalar.
Or does fossil itself still have the same issues?
That is such an insane default, I'm at a loss for words.
Every ...king time I read something like "RFC 2789 introduced a sparse HTTP protocol", my brain short-circuits. BTW: RFC 2789 is a "Mail Monitoring MIB". (The article presumably means Rust RFC 2789, the Cargo sparse-index RFC, not an IETF RFC.)
Not so smart, when we realize that one aspect of a secure and reliable system is the elimination of ambiguities.