None of this was ever the point of copyright. The best part about all this is that Disney initially took off by… making use of public domain works. Copyright used to last 14 years. You’d be able to create derivative works of most of the art in your life at some point. Disney is ironically the proof of how constructive a system that regularly turns works over to the public domain can be. But thanks to lobbying by Disney, now you’re never allowed to create a derivative work of the art in your life.
Copyright is only possible because we the public fund the infrastructure necessary to maintain it. “IP” isn’t self manifesting like physical items. Me having a cup necessarily means you don’t have it. That’s not how ideas and pictures work. You can infinitely perfectly duplicate them. Thus we set up laws and courts and police to create a complicated simulation of physical properties for IP. Your tax dollars pay for that. The original deal was that in exchange, those works would enter the public domain to give back to society. We’ve gotten so far from that that people now argue about OpenAI “stealing” from authors, when the authors most of the time don’t even own the works — their employers do! What a sad comedy where we’ve forgotten we have a stake in this too and instead argue over which corporation should “own” the exclusive ability to cheaply and blazingly fast create future works while everyone else has to do it the hard way.
All property rights depends on public funding the infrastructure to enforce them. If I believed movies derived from applying generative AI techniques to other movies was the endgame of human creativity, I'd find your endgame of it being the fiefdom of corporations who sold enough Windows licenses to own billions of dollars worth of computer hardware even more dystopian than it being invested in the corporations who originally paid for the movies...
1. You are assuming that "greatest computing power" is a requirement. I think we're actually seeing a trend in the opposite direction with recent generative art models: It turns out consumer grade hardware is "enough" in basically all cases, and renting the compute you might otherwise be missing is cheap. I don't buy this as the barrier.
2. Given #1, I think you are framing the conversation in a very duplicitous manner by pitching this as "either Microsoft or Disney - pick your oppressor". I'd suggest that breaking the current fuckery in copyright, and restoring something more sane (like the original 14 + 14 year timespans) would benefit individuals who want to make stories and art far more than it would benefit corporations. Disney is literally THE reason for half of the current extensions in timespan. They don't want reduced copyright - they want to curtail expression in favor of profit. This case just happens to have a convenient opponent for public sentiment.
---
Further - "All property rights depends on public funding the infrastructure to enforce them" Is false. This is only the case for intellectual property rights, where nothing need be removed from one person for the other to be "in violation".
Given #1, I think the OP is framing the conversation in a far more duplicitous manner by assuming that in a lawsuit against AI which doesn't even involve Disney, the only beneficiary of OpenAI not winning will be Disney. Disney extending copyright laws in past decades has nothing to do with a 10-year-old internet company objecting to OpenAI stripping all the copyright information off its recent articles before feeding them into its generative model.
> Further - "All property rights depends on public funding the infrastructure to enforce them" Is false. This is only the case for intellectual property rights, where nothing need be removed from one person for the other to be "in violation".
People who don't respect physical property are just as capable of removing it as people who don't respect intellectual property are capable of copying it. In both cases the thing that prevents them doing so is a legal system and taxpayer funded enforcement against people that don't play by the rules.
Still true, because people generally depend on the legal system and police departments to enforce physical property rights (both are publicly funded entities).
"Copyrights exist, and people can copy others' works if they have enough computing power to multiplex it with other works and demultiplex to get it back" is not a reasonable position.
I'm all for limiting it to 15 or 20 years, and requiring registration. If you want to completely end them, I'd be ok with that too (but I think it's suboptimal). But "end them to rich people" isn't acceptable.
That's not how copyright works; it's not a binary thing. Also, it's similar but not the same in every jurisdiction. You can make partial copies, you can make full copies as a personal backup, and you can make copies to transform copyrighted material (for example, to create art and parodies).
These cases are going to decide whether Google Books was a fluke or whether there is indeed a limit to the power of the big copyright holders (not the artists/creators: those keep on starving, except for a few lucky ones).
Like most simple binaries, this is a false dichotomy, and not only do more options exist in possibility, but neither of those matches the overt state of the law (where copyrights exist, but so do a range of caveats and exceptions, so people can copy and otherwise make use of works by others without permission under certain circumstances, but not at will, satisfying neither of the two options you present as exhaustive of all possibilities.)
That said, I agree with putting more emphasis on individual creators, even if they have sold the copyright to corporations. I was appalled by the Google settlement with the Authors Guild: Why does a guild decide who owns what and who gets compensation?
Both Disney and ClosedAI are in the wrong here. I'm the opposite of a Marxist, but Marx's analysis was frequently right. He used the term "alienation from one's work" in the context of factory workers. Now people are being alienated from their intellectual work, which is stolen, laundered and then sold back to them.
The "Marxist" name is either about believing in the parts that aren't true or about the political philosophy (which, honestly, can't stand on its own without the wrong facts). The parts that fit reality only make one a "realist".
And yeah, it looks to be shaping up as exactly that.
As far as I can tell the only copyright term extension that might have been influenced by Disney lobbying in the US is the Copyright Term Extension Act of 1998, which extended the term from life+50 to life+70 (or from 75 to 95 years for works of corporate authorship).
The switch from fixed terms to life+50 came with the Copyright Act of 1976, which had nothing to do with Disney. They were probably for it, but so was nearly everybody because it laid the groundwork for the US joining the Berne Convention and making its copyright system much more compatible with that of most other countries.
As far as copyright law outside the US goes, most countries were on life+50 or longer before Disney even existed.
This is a stupid argument, no matter how often it comes up.
If I hire Alice to come to my sandwich shop and make sandwiches for customers all week and then on payday I say, "Welp, no need to pay you—the sandwiches are already made!" then Alice is definitely out something, and I am categorically a piece of shit for trotting out this line of reasoning to try to justify not paying her.
If I do the same thing except I commission Alice to do a drawing for a friend's birthday, then I am no less a piece of shit if I make my own copy once she's shown it to me and try to get out of paying since I'm not using "her" copy.
(Notice that in neither case was the thing produced ever something that Alice was going to have for herself—she was never going to take home 400 sandwiches, nor was she ever interested in a portrait of your friend and his pet rabbit.)
If Alice senses that I'd be interested in the drawing but might not be totally swayed until I see it for myself, so she proactively decides to make the drawing upfront before approaching me, then it doesn't fundamentally change the balance from the previous scenario: she's out no less in that case than if I approached her first and then refused to pay after the fact. (If she was wrong and it turns out I didn't actually want it because she misjudged and will not be able to recoup her investment, fair. But that's not the same as if she didn't misjudge and I come to her with this bankrupt argument of, "You already made the drawing, and what's done is done, and since it's infinitely reproducible, why should I owe you anything?")
Copyright duration is too long. But the fundamental difference between rivalrous possession of physical artifacts and infinitely reproducible ideas really needs to stay the hell out of these debates. It's a tired, empty talking point that doesn't actually address the substance of what IP laws are really about.
Wrong. It's that (not honoring an agreement negotiated beforehand) and an argument against treating past-action-thing as inherently zero-cost and/or zero-value; the fact that a prior agreement is an element in the offered scenarios doesn't negate or neutralize the rest of it (just like the fact that a sandwich shop is an element in one of the scenarios doesn't negate or neutralize the broader reality for non-sandwich-involving scenarios).
And that's before we mention: there _is_ such a prior agreement in the case of modern IP. You can't not contend with the fact that if Alice is operating in the United States, which has existing legislation granting her a "temporary monopoly" on her creative output, and then she generates the output on the basis that she'll be protected by the law of the land, and then you decide that you just don't agree with the idea of IP, then Alice is getting screwed over by someone not holding up their end of the bargain.
A material difference between fraud and copyright violations as categories is the presence of lost profit. With fraud one has lost the time value of their work, but with media piracy there is some research (funded by the EU of all things) that it doesn't trade off with sales and may even help some sales.
> I'm sorry
Are you? I think you mixed up the words "insincere" and "sorry".
Do you think it would be reasonable for Mallory to sell burgers and then demand that if you share some of them with your friend you need to seek her permission? And of course since the burger becomes part of your body then perhaps Mallory should have a say in what you can do with that too and can extract some fee for you existing after eating her burgers. That's how copyright is usually (mis)used - to extract rent in perpetuity for work that was done long ago. This kind of business model just doesn't exist outside of IP. It's entirely artificial.
> But the fundamental difference between rivalrous possession of physical artifacts and infinitely reproducible ideas really needs to stay the hell out of these debates. It's a tired, empty talking point that doesn't actually address the substance of what IP laws are really about.
On the contrary, it is a very important point. We don't have burgers just sitting around to feed everyone for their entire lives. We do have all kinds of art and entertainment as well as productivity tools that have essentially infinite free copies. We don't really NEED to artificially encourage more creation for a lot of these, whereas if people stopped producing food everyone would be in big trouble.
> Copyright duration is too long. But the fundamental difference between rivalrous possession of physical artifacts and infinitely reproducible ideas really needs to stay the hell out of these debates. It's a tired, empty talking point that doesn't actually address the substance of what IP laws are really about.
I would argue that it is perhaps the opposite of tired: ironically less relevant in the past and more relevant as technology advances and mere thought experiments become practical reality. I think many of these issues weren't dealt with in the past because these edge cases existed as mere hypotheticals. Kind of like a mathematician saying that copyright doesn't make any sense because he could write a program that iterates through all books. Lawyers just roll their eyes not because they have a counter-argument, but because they don't think that scenario exists as something they'd ever have to deal with. I think the idea of a computer that reads all the text in the world and learns from it is definitely tied to the questions of unresolved issues with the nature of data, but would have until very recently been considered an annoying hypothetical in a serious discussion about copyright, allowing us to actually dismiss it and continue not addressing it.
We all agree that copyright is too long. And I also think this would just become a non-issue if we had a reasonable duration for copyrights. Even if you philosophically disagreed with it, it wouldn't be worth arguing over vs. just waiting it out.
> This is a stupid argument, no matter how often it comes up.
I knew bringing this up would rekindle these arguments from 20 years ago, but it was necessary for a later point, so I was hoping I was making it value-neutral enough that it wouldn't trigger this, but I guess I was wrong.
To be clear, I am not making the same argument you have seen several times before. I am making a strictly weaker argument. The only goal of this distinction is to demonstrate that these properties are "different", and that the law aims to make "intellectual property" behave like physical property. Notice for example that I didn't then assert that IP thus doesn't exist. I didn't even argue whether this goal of matching the behavior was good or bad. I am simply stating that it doesn't by default behave the way we seem to want it to, and, people don't seem to intuitively ascribe the same morality to it either. My only intention is to make the point that this goal thus requires work, and (as I'll explain in more detail below), more work than in the physical case. So far I don't think there is anything necessarily unreasonable about this as a set of premise conditions for establishing the terms under which the public at large agrees to take on the costs of maintaining said system.
> A bunch of stuff about Alice making sandwiches and drawing pictures
Disclaimer: I don't think we're really in disagreement about the important points, and I don't think this section is relevant to the important points, which I return to below. However, I find it intellectually interesting to talk about, so I have a retort here, which I believe is just an unrelated digression.
This analysis of Alice making sandwiches and drawings (IMO) misses the actual meaningful differences in these scenarios since it (IMO) focuses on the uncontroversial, but also irrelevant, breach-of-contract issues. In both these scenarios, the issue is not really the "property," it is the refusal to comply with a previously agreed arrangement. You can see this if we add a third scenario where I pay Alice to do jumping jacks for a week, she does them, and then I refuse to pay at the end of the week. No need to pay you, you already did the jumping jacks. No one "got" anything here, other than I guess "satisfaction" or "exercise". We can make the example even more abstract by having me pay Alice to do nothing all week, and she once again does a great job by sitting quietly in her room all week, and then I once again don't pay her. The sandwiches and drawings are just props in the original examples -- they're not actually necessary since this is a contract question, not a theft question.
The actual interesting aspects around the sandwiches and drawings are 1) what happens long after this transaction, and 2) what happens with third parties. With the sandwiches, "what happens after" is straightforward. I either eat the sandwiches, resell them immediately, or they go bad. There's not much interesting there. No one needs to think hard about the "ramifications" of the sale of the sandwiches. Compare this to the drawing. What if, after I have paid you, just like we agreed, I proceed to make my infinite copies? You might think that's not fair; you thought you'd have a repeat customer. I assumed I was free to do as I please with the drawing. In fact, ironically enough, if I treat the drawing like physical property, where the expectation is I can do as I please with it, that is exactly what creates this conundrum, because "putting the paper in the photocopier" is in the set of "do as I please". But let's go one step further: what if I make all those copies and then sell them?
I'm sure you'll now respond that the royalties or usage rights were all implied in your original story. Great! But that's my point. Those were required. You needed a supremely complex web of laws and binding contracts (and litigation if they aren't followed) as a necessary component of that transaction due to the existence of degrees of freedom that simply don't exist for the sandwiches. You can write up a contract around the resale of a sandwich, but most sandwich shops don't because me eating sandwiches for the rest of my life by copying the original sandwich isn't a realistic scenario (so no need to price that into the original cost of the sandwich), and me out-sandwiching you by carbon-cloning the sandwich isn't feasible, and even if it was it would still have material ingredient costs that would bound its effect on my shop, and even "figuring out the recipe" isn't that much of a worry since you still need to like buy ingredients and make sandwiches as opposed to hitting paste over and over. These scenarios are dramatically different, and that's why sandwich shops usually don't employ lawyers but design shops do. And again, we didn't even go into third parties. What if someone manages to somehow make a copy of your image just as you're handing it to the client? Now both of you are in compliance with your deal, neither of you is angry at each other, but there's this weird situation where you were never expecting to get money from me, but I have a copy of the picture now, and it's really hard to reason about what that means in terms of "gain" and "loss" if I never do anything other than hang it up in my room. This is simply not possible with the sandwich: no one could quickly "copy the sandwich" in transit and potentially introduce an entirely new threat to your business.
Again, my only point here is that it seems very strange to insist that physical property is identical to intellectual property, and that it isn't fairly complicated to make intellectual property approximate the relationships we have with physical property. And to be clear, nothing even derogatory has been said about this goal yet. You could take everything I've written in this comment so far and use it as part of an argument for copyright. However, it is important precisely because of the explosion of complexity in possibilities that simply don't exist for the vast majority of physical items.
But either way, I think the point that immediately follows that is even more important, right? The fact that the nature of the ownership, even in a "successful" transaction, is incredibly more complicated with the drawing. How we don't even properly understand how much "ownership" you have of the drawing without a contract. How that transaction potentially puts you in direct competition with Alice in the future. Etc. etc. Again, the entirety of my position is the fairly narrow statements that: 1) intellectual property is fundamentally different from physical property, 2) you thus cannot simply model intellectual property transactions by merely pretending you're dealing with physical objects (since there's fundamentally more dimensionality and ambiguity without explicitly outlining and agreeing to way more terms and details), and 3) intellectual property thus naturally requires significant infrastructure in order to create an environment that gets anywhere close to simulating the same "physical-like" properties for intellectual property. I don't think that's controversial.
Persuasive arguments should focus on what's good for the world today.
I will however say that I think my comment was not just an appeal to authority. Again, I think the fact that using public domain works was critical to Disney's early success is a fairly important data point, especially considering some of those works would not have been allowed to be used under the current copyright terms (e.g. Pinocchio's copyright would have lasted until 1960, 20 years after the film premiered).
But again, the most important thing I want taken away from this is that we the "consumers" of the content should not consider ourselves bystanders, but understand that we do have an active stake here as well. Your first sentence is perhaps more important than you realize: making up the rules wasn't some one-off incidental property of being first to the table. We could choose to make up the rules too, so we should act like it, as opposed to trying to "deduce" the ownership of a sentence. This is unique; we don't have that ability with physical property. We can't simply declare that everyone gets a Ferrari tomorrow and then have them magically appear in everyone's garage. But we could declare that everyone can have the rights to Superman tomorrow, and they would "magically just have them".
There's no real baseline here. We should just weigh the pros and cons. The fashion industry operates more or less copyright-free. The infrastructure to enforce copyright has real costs. Not to mention there is all the collateral damage from the abuse of copyright takedowns this system brings along with it. And any sort of appeal to authorship is also highly suspicious given that authors rarely end up owning these rights. Every time one of these Marvel movies comes out there's a mini outcry when people see the guy whose comic the movie is based on is just some dude who gets nothing from the making of the movie. On the flip side we take for granted that every public domain character was of course at one point created. Robin Hood, Zorro, Dracula, Sherlock Holmes. Are we unhappy with the diversity of adaptations we've gotten from these? Would it be that earth-shattering if Harry Potter joined that list? As things stand right now no one on this website will likely ever get to legally publish "their take on Harry Potter". The clock doesn't start ticking until after JK Rowling dies. It would have entered the public domain in 2011 under the original rules. In case you're curious, her net worth in 2011 was $500M, if you want to factor that into whether you think that would have been "fair" (and it's not like she stops making money at that point, it's just that other people start to be able to do stuff with the first book). I think it is worthwhile to imagine a different approach to this.
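For anyone who wants to check the dates being thrown around in this thread, here is a rough sketch of the arithmetic. The function names are made up for illustration, and two inputs are my own assumptions rather than anything stated above: 1997 as the publication year of the first Harry Potter book, and 1890 as the year Collodi (Pinocchio's author) died. It also ignores renewals and the convention that terms run to the end of a calendar year, so treat it as back-of-the-envelope only.

```python
def expiry_fixed_term(publication_year: int, term_years: int = 14) -> int:
    """Year a flat term counted from publication runs out."""
    return publication_year + term_years


def expiry_life_plus(author_death_year: int, extra_years: int = 70) -> int:
    """Year a life-plus term runs out; unknowable while the author is alive."""
    return author_death_year + extra_years


# First Harry Potter book: published 1997 (assumed input, not stated in the thread).
print(expiry_fixed_term(1997))   # 2011

# Pinocchio: Collodi died in 1890 (assumed input); the film premiered in 1940.
pinocchio_expiry = expiry_life_plus(1890)
print(pinocchio_expiry)          # 1960
print(pinocchio_expiry - 1940)   # 20
```

The life-plus function is the interesting one: for a living author there is simply no input to give it, which is the "clock doesn't start ticking" problem in a nutshell.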
Sure, various AI assistants will make more aspects of your life automated. In that sense it'll buy people more time in their private lives. It won't get most people a meaningful increase in wealth, which is the ultimate liberator of time. That is, financial independence.
And you can already see the ratio of people that are highly engaged with utilizing the latest LLMs, paying for them, versus either rarely or never using them (either not caring/interested in utilizing, or not understanding how to do so effectively). It's heavily bifurcated between the elites and everybody else, just as most tech advances have been so far. A decade ago a typical lower / lower middle class person could have gone to the library and learned JavaScript and over the course of years could have dramatically increased their earning potential (a process that takes time to be clear); for the same reason that rarely happens by volition, they also will not utilize LLMs to advance their lives despite the wide availability of them. AI will end up doing trivial automation tasks for the bottom 50%. For the top ~1/4 it will produce enormous further wealth from equity holdings and business process productivity gains (boosting wealth from business ownership, which the bottom 50% lacks universally).
The reliance on media saturation and marketing creates a perception that certain works are inherently more valuable than others, despite new creative works constantly being developed. While I agree that companies should have the right to profit from their investments, such as a $500 million movie, there should be reasonable limits. Once they recoup their costs, including a reasonable profit multiplier, the copyright could be considered fulfilled and should expire.
Holding onto copyrights indefinitely or for excessively long periods serves primarily to sustain a system that benefits lawyers and enforcement agencies, rather than providing meaningful value to society. For instance, enforcing a copyright from the 1940s for a multinational corporation that already generates billions makes little sense.
There should be a balanced framework. If I invest significant time and effort—say 100 hours—into creating a work, I should be entitled to earn a reasonable return, perhaps 10 times the effort I put in. However, after that point, the copyright should no longer apply. Current laws have spiraled out of control, failing to strike a balance between protecting creators and fostering innovation. Reform is long overdue.
If I were more optimistic I could imagine a UBI funded by lawsuits against AGI, some combination of lost wages and intellectual property infringement. Can't figure out exactly how much more impact an article on The Intercept had on shifting weights than your Hacker News comments? Might as well just pay everyone equally, since we're all equally screwed.
There are reasons for that (they need a license to show it on the platform) but usually these agreements are overly broad because everyone except the user is covering their ass too much.
Those licenses will now be used to sell that content/data for purposes that nobody thought about when you started your account.
There should be a way to check for OpenAI. But my guess is, if Google does it, OpenAI and others must be using the same/similar resource pool.
My website has some 56K tokens and I have no clue what that was, but something is there https://www.dropbox.com/scl/fi/2tq4mg16jup2qyk3os6ox/brajesh...
> It is unclear if the Intercept ruling will embolden other publications to consider DMCA litigation; few publications have followed in their footsteps so far. As time goes on, there is concern that new suits against OpenAI would be vulnerable to statute of limitations restrictions, particularly if news publishers want to cite the training data sets underlying ChatGPT. But the ruling is one signal that Loevy & Loevy is narrowing in on a specific DMCA claim that can actually stand up in court.
> Like The Intercept, Raw Story and AlterNet are asking for $2,500 in damages for each instance that OpenAI allegedly removed DMCA-protected information in its training data sets. If damages are calculated based on each individual article allegedly used to train ChatGPT, it could quickly balloon to tens of thousands of violations.
Tens of thousands of violations at $2500 each would amount to tens of millions of dollars in damages. I am not familiar with this field, does anyone have a sense of whether the total cost of retraining (without these alleged DMCA violations) might compare to these damages?
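A quick sketch of that multiplication, in case it helps. The $2,500 figure is the one quoted from the article; the violation counts are purely illustrative picks from the "tens of thousands" range, not numbers from any filing, and `total_damages` is just a made-up helper name.

```python
PER_VIOLATION_USD = 2_500  # per-instance figure quoted from the article


def total_damages(violations: int, per_violation: int = PER_VIOLATION_USD) -> int:
    """Total damages if every counted violation is awarded the same amount."""
    return violations * per_violation


# Illustrative counts only; the real number of violations is not known here.
for n in (10_000, 50_000, 90_000):
    print(f"{n:>6,} violations -> ${total_damages(n):,}")
```

So "tens of thousands" of violations lands anywhere from $25 million up to the low hundreds of millions, depending on where in that range the count falls.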
The apparent loophole is between copyrighted work and copyrighted work that is also registered. But registration can occur at any time, meaning there is little practical difference. Unless you have perfect licenses for all your training data, which nobody does, you have to accept the risk of copyright suits.
You have to license content you want to use; you can't just use it for free because it's on the internet.
Netflix doesn't just start hosting shows and hope they don't get a copyright suit...
I guess I should have used the phrase "common sense stealing in any other context" to be more precise?
Clearly not common sense stealing. The Intercept was not deprived of their content. If OpenAI had sneaked into their office and server farm and taken all the hard drives and paper copies with the content, that would be "common sense stealing".
Copyright means you're not allowed to copy something without permission.
It's that simple. There is no "Yes but you still have your book" argument, because copyright is a claim on commercial value, not a claim on instantiation.
There's some minimal wiggle room for fair use, but clearly making an electronic copy and creating a condensed electronic version of the content - no matter how abstracted - and using it for profit is not fair use.
But is training an AI copying? And if so, why isn't someone learning from said work considered copying in their brain?
If the AI produces chunks of training set nearly verbatim when prompted, it looks like copying.
> And if so, why isn't someone learning from said work considered copying in their brain?
Well, their brain, while learning, is not someone's published work product, for one thing. This should be obvious.
But their brain can violate copyright by producing work as the output of that learning, and be guilty of plagiarism, etc. If I memorise a passage of your copyrighted book when I am a child, and then write it in my book when I am an adult, I've infringed.
The fact that most jurisdictions don't consider the work of an AI to be copyrightable does not mean it cannot ever be infringing.
That does not make the model a copyright violation itself.
But an LLM doesn't just enable direct duplication, it (well, its model) contains it.
If software had a meaningful distribution cost or per-unit sale cost, a blank tape tax would be very appropriate for LLM sales.
But instead OpenAI is operating a for-pay duplication service where authors don't get a share of the proceeds -- it is doing the very thing that copyright laws were designed to dissuade by giving authors a time-limited right to control the profits from reproducing copies of their work.
If you transcode a CD to mp3 and build a business around selling these files without the author's permission you'd be in big legal problems.
Tech products that "accidentally" reproduce materials without the owners' permission (e.g. someone uploading La La Land to YouTube) have processes to remove them by simply filling out a form. Can you do that with ChatGPT?
It's legal for you to possess a single joint. It's not legal for you to possess a warehouse of 400 tons of weed.
The line between legal and not legal is sometimes based on scale; being able to ingest a single book and learn from it is not the same scale as ingesting the entire published works of mankind and learning from it.
I am stating what is, right now.
I thought the weed example made that clear.
Let me clarify: the state of things, as they stand, is that the entire justice system, legislation and courts included, takes scale into account when looking at the line dividing "legal" from "illegal".
There is literally no defense of "If it is legal at qty x1, it is legal at any qty".
Excellent. Then the next question is where (in which jurisdiction) are you describing the law? And what are your sources? Not about the weed, I don’t care about that. I mean particularly the claim that “being able to ingest a single book and learn from it is not the same scale as ingesting the entire published works of mankind and learning from it”.
The reason why I’m asking is because you are drawing a parallel between criminal law and (I guess?) copyright infringement. The drug possession limits in many jurisdictions are explicitly written into the law. These are not some grand principle of law but the result of explicit legislative intent. The people writing the law wanted to punish drug peddlers without punishing end users. (Or they wanted to punish them less severely or differently.) Are the copyright limits you are thinking about similarly written down? Do you have case references one can read?
I did not make the point that there is a written law specifically for copyright violations at scale (although many jurisdictions do have exemptions at small scale written into law).
I will try to clarify once again: there is no defence in law that because something is allowed at qty X1, it must be allowed at any qty.
This is the defence that was originally posted that I replied to, it is the one that is not valid because courts regularly consider the scale of an activity when determining the line between allowed and not allowed.
Another factor that helps an old model plus new public-facing data stay complete is that other forms of media, like storytelling and music, have already converged onto certain prevailing patterns. For stories we expect a certain style of plot development and complain when it's missing or not as we expect. For music, most of what is being listened to is lyrics no one is deeply reading into, put over the same old chord progressions we’ve always had. For art there are just too few of us who are actually going out of our way to get familiar with novel art vs the vast bulk of the world's present day artistic effort, which goes towards product advertisement, which once again follows certain patterns people have been publishing in psychological journals for decades now.
In a sense we’ve already put out enough data and made enough of our world formulaic to the point where I believe we’ve set up for a perfect singularity already in terms of what can be generated for the average person who looks at a screen today. And because of that I think even a lack of any new training on such content wouldn’t hurt openai at all.
I'm not a lawyer, but I know enough to be pretty confident that that wouldn't work. The law is about intent. Coming up with "one weird trick" to work-around a potential court ruling is unlikely to impress a judge.
[0]: https://www.reuters.com/article/technology/google-book-scann...
But for inference in general, I'm not sure it makes too much sense. Training data is not just about learning facts, but also (mainly?) about how language works, how people talk, etc. Which is kind of too fundamental to be attributed to, IMO. (Attribution: Humanity)
But who knows. Maybe it can be done for more fact-like stuff.
Perhaps old texts in physical form, then? It'll cost a lot to digitize that, wouldn't it? And it wouldn't really be accessible to AI hobbyists. Unless the digitization is publicly funded or something.
(A big part of this is also how insanely long copyright lasts (nearly a hundred years!) that keeps most of the Internet's material from being public domain in the first place, but I won't belabour that point here.)
Edit:
Fair enough, I can see your point. "Surely it is cheaper to digitize old texts or buy a license to Google Books than to potentially lose a court case? Either OpenAI really likes risking it to save a bit of money, or they really wanted facts not contained in old texts."
And yeah, I guess that's true. I could say "but facts aren't copyrightable" (which was supported by the judge's decision from the TFA), but then that's a different debate about whether or not people should be able to own facts. Which does have some inroads (e.g. a right against being summarized because it removes the reason to read original news articles).
All of that and more, all at the same time.
Attribution at inference level is bound to work more or less the same way as humans attribute things during conversations: "As ${attribution} said, ${some quote}", or "I remember reading about it in ${attribution-1} - ${some statements}; ... or maybe it was in ${attribution-2}?...". Such attributions are often wrong, as people hallucinate^Wmisremember where they saw or heard something.
RAG obviously can work for this, as well as other solutions involving retrieving, finding or confirming sources. That's just like when a human actually looks up the source when citing something - and has similar caveats and costs.
Saying "well, someone snuck in some DMCA content" when sharing family photos doesn't suddenly make it legal to share that DMCA-protected content with your photos...
But for now your washing machine cannot own other things, and you owning a washing machine isn't considered slavery.
It's not copyright "extremism" to expect a level playing field. As long as humans have to adhere to copyright, so should AI companies. If you want to abolish copyright, by all means do, but don't give AI a special exemption.
Your ability to regurgitate a remembered article that is copyrighted does not make your brain a derivative work, because removing that specific article from the training set is below the noise floor of impact.
However reproducing the copyrighted material based on that is a violation because the created reproduction does critically depend on that copyrighted material.
(Gross simplification) Similar to how you can watch & read a lot of Star Wars and then even ape Ralph McQuarrie's style in your own drawings, but unless the result is unmistakably related to Star Wars there's no copyright infringement - but there is if someone looks at the result and goes "that's Star Wars, isn't it?"
AI's approach to copyright is very much "rules for thee but not for me".
we are discussing an emergent cause that has social & ecological consequences. servers are power hungry stuff that may or may not run on a sustainable grid (that also has a bazinga of problems, like heavy chemicals leaking during solar panel production, hydro-electric plants destroying their surroundings etc.) & there is the current state of producing hardware, be it sweatshops or conflict minerals. let's forget creators' copyright violation, which is written in the law code of almost every existing country, and no artist is making billions out of the abuse of their creation rights (often they are pretty chill about getting their stuff mentioned, remixed and whatever)
Influencers are parasites that have been made possible by broken, user-hostile platforms.
You are advocating for a deranged, dangerous world, where demagogues rule over large masses of idiots that can't tell the difference between AI junk and reality.
Edit: People commenting need to understand that $150B is the discounted value of future revenues. So... yes they can pay out... yes they will be worth less... and yes that's fair to the people who created the information.
I can't believe there are so many apologists on HN for what amounts to vacuuming up peoples data for financial gain.
https://finance.yahoo.com/news/report-reveals-openais-44-bil...
https://www.cnbc.com/2024/09/27/openai-sees-5-billion-loss-t...
There is plenty of public domain text that could have taught a LLM English.
As time goes on, I imagine that it'll increasingly be the case that these LLMs will displace people out of their jobs / careers. I don't know whether the harm done will be greater than the benefit to society. I'm sure the answer will depend on who it is that you ask.
> That's a good thing! If a company cannot raise to fame unless they violate laws, it should not have been there.
Obviously given what I wrote above, I'd consider it a bad thing if LLM tech severely regressed due to copyright law. Laws are not inherently good or bad. I think you can make a good argument that this tech will be a net negative for society, but I don't think it's valid to do so just on the basis that it is breaking the law as it is today.
Good thing whether or not something is a copyright violation doesn't depend on if you can make more money with someone else's work than they can.
I also think that someone making money off LLMs is a separate question from whether or not the original creator has been harmed. I think many creators are going to benefit from better tools, and we'll likely see new forms of creation become viable.
We already recognize that certain uses of intellectual property should be permitted for society's benefit. We have fair use doctrine, patent compulsory licensing for public health, research exemptions, and public libraries. Transformative use is also permitted, and LLMs are inherently transformative. Look at the volume of data that they ingest compared to the final size of a trained model, and how fundamentally different the output format is from the input data.
Human progress has always built upon existing knowledge. Consider how both Darwin and Wallace independently developed evolution theory at roughly the same time -- not from isolation, but from building on the intellectual foundation of their era. Everything in human culture builds on what came before.
That all being said, I'm also sure that this tech is going to negatively impact people too. Like I said in the other reply, whether or not this tech is good or bad will depend on who you ask. I just think that we should weigh these costs against the potential benefits to society as a whole rather than simply preserving existing systems, or blindly following the law as if the law is inherently just or good. Copyright law was made before this tech was even imagined, and it seems fair to now evaluate whether the current copyright regime makes sense if it turns out that it'd keep us in some local maximum.
*unless they violate country laws.
Which means OpenAI or an alternative could survive in China but not in the US. The question is whether we are fine with that.
dmca does not cover scraping.
I’m sure they could use a chunk of that to buy competitive I.P. for both companies to use for training. They can also pay experts to create it. They could even sell that to others for use in smaller models to finance creating or buying even more I.P. for their models.
If they were, say, a charity doing this for the good of mankind, I’d have more sympathy. Shame they never were.
The best part about all this is that Disney initially took off by… making use of public domain works. Copyright used to last 14 years. You’d be able to create derivative works of most of the art in your life at some point. Now you’re never allowed to. And more often than not, the monopoly is granted not to the “author” but to the corporation that hired them. The correct analysis shouldn’t be OpenAI vs. Intercept or Disney or whomever. You’re just choosing kings at that point.
People do get sued for making songs that are too similar to previously made songs. One defence available is that they've never heard it themselves before.
If you want to treat AI like humans then if AI output is similar enough to copyrighted material it should get sued. Then you try to prove that it didn't ingest the original version somehow.
A legal argument would be needed to argue the other way. This argument would imply granting LLMs some degree of human rights, which the very industry profiting from these copyright violations will never let happen for obvious reasons.
Also, it is only a matter of time until one of those employees (thanks to free will and agency) will whistleblow, it doesn’t scale, etc.
Frankly, the fact that such a big segment of HN crowd unthinkingly buys big tech’s double standard (LLMs are human when copyright is concerned, but not human in every other sense) makes me ashamed of the industry.
You would be except for the fact that publishing stuff on the web gives people an implicit license to download it for the purposes of viewing it.
OpenAI (and Google and everyone else) is creating a publicly-accessible system that produces output that could be derived from copyrighted material.
That’s confidently and completely wrong.
I agree, but the original author might get butthurt if you distribute it. Realistically copyright law in the US is a mess when it comes to weird pieces of art.
It is disingenuous to imply the scale of someone buying books and reading them (for which the publisher and author are compensated) or borrowing them from the library and reading them (again, for which the publisher and author are compensated) is the same as the wholesale copying without permission or payment of anything not behind a pay wall on the Internet.
Take for instance what has happened with news because of the internet. Not exactly the same, but similar forces at work. It turned into a race to the bottom with everyone trying to generate content as cheaply as possible to get maximum engagement with tech companies siphoning revenue. Expensive, investigative pieces from educated journalists disappeared in favor of stuff that looks like spam. Pre-Internet news was higher quality.
Imagine that same effect happening for all content: art, writing, academic pieces. It's a real risk that OpenAI has peaked in quality.
There are already multiple lifetimes of quality content out there. It's difficult to get worked up about the potential future losses.
But investigative journalism has not disappeared. If anything, it has grown.
The budgets at newspapers used to be much larger and fund more investigative journalism with a clearer motive.
When we find that sustainable framework for AI, China or <insert-boogeyman-here> will just end up imitating it. Idk what harms you're imagining might come from that ("get ahead" is too vague to mean anything), but I just want to point out that that isn't how you become a leader in anything. Even worse, if they are the ones who find that formula first while we take shortcuts to "get ahead", then we will be the ones doing the imitation in the end.
Nobody did that.
> It's perfectly fine and accepted for a human to read and learn from content online without paying anything to the author when that content has been made available online for free, it's absurd to assert that it somehow becomes a human rights violation when the learning is done by a non-biological brain instead.
It makes sense. There is always scale to consider in these things.
No, their mention of "slave labor" is not a comparison to how LLMs work, nor an assertion of moral equivalence.
Instead it is just one example to demonstrate that chasing economic/geopolitical competitiveness is not a carte blanche to adopt practices that might be immoral or unjust.
Not just turn a blind eye when it's the right people doing it. They don't even have a legal exemption passed by Congress - they're just straight-up breaking the law and getting away with it. Which is how America works, I suppose.
Should I be paying a proportion of my salary to all the copyright holders of the books, songs, TV shows and movies I consumed during my life?
If a Hollywood writer says she "learnt a lot about writing by watching the Simpsons" will Fox have an additional claim on her earnings?
you already are.
a proportion of what you pay for books, music, tv shows, movies goes to rights holders already.
any subscription to spotify/apple music/netflix/hbo; any book/LP/CD/DVD/VHS; any purchased digital download … a portion of those sales is paid back to rights holders.
so… i’m not entirely sure what your comment is trying to argue for.
are you arguing that you should get paid a rebate for your salary that’s already been spent on copyright payments to rights holders?
> If a Hollywood writer says she "learnt a lot about writing by watching the Simpsons" will Fox have an additional claim on her earnings?
no. that’s not how copyright functions.
the actual episodes of the simpsons are the copyrighted work.
broadcasting/allowing purchases of those episodes incurs the copyright as it involves COPYING the material itself.
COPYright is about the rights of the rights holder when their work is COPIED, where a “work” is the material which the copyright applies to.
merely mentioning the existence of a tv show involves zero copying of a registered work.
being inspired by another TV show to go off and write your own tv show involves zero copying of the work.
a hollywood writer rebroadcasting a simpsons episode during a TV interview would be a different matter. same with the hollywood writer just taking scenes from a simpsons episode and putting them into their film. that’s COPYing the material.
—-
when it comes to open AI, obviously this is a legal gray area until courts start ruling.
but the accusations are that OpenAi COPIED the intercept’s works by downloading them.
openAi transferred the work to openAi servers. they made a copy. and now openAi are profiting from that copy of the work that they took, without any permission or remuneration for the rights holder of the copyrighted work.
essentially, openAI did what you’re claiming is the status quo for you… but it’s not the status quo for you.
so yeah, your comment confuses me. hopefully you’re being sarcastic and it’s just gone completely over my head.
As well as the "copying" of content, some are also claiming that the output of a LLM should result in paying royalties back to the owners of the material used in training.
So if an AI produces a sitcom script then the copyright holders of those tv shows it ingested should get paid royalties. In addition to the money paid to copy files around.
Which leads to the precedent that if a writer creates a sitcom then the copyright holders of sitcoms she watched should get paid for "training" her.
why not deal with it the same way as humans have been dealt with in the past?
If you copied an art piece using photoshop, you would've violated copyright. Photoshop (and adobe) itself never committed copyright violations.
Somehow, if you swap photoshop with openAI and chatGPT, then people claim that the actual application itself is a copyright violation.
> If you copied an art piece using photoshop, you would've violated copyright. Photoshop (and adobe) itself never committed copyright violations.
the COPYing is happening on your local machine with non-cloud versions of Photoshop.
you are making a copy, using a tool, and then distributing that copy.
in music royalty terms, making the copy is the Mechanical right, while distributing the copy is the Performing right.
and you are liable in this case.
> Somehow, if you swap photoshop with openAI and chatGPT, then people claim that the actual application itself is a copyright violation
OpenAI make a copy of the original works to create training data.
when the original works are reproduced verbatim (memorisation in LLMs is a thing), then that is the copyrighted work being distributed.
mechanical and performing rights, again.
but the twist is that ChatGPT does the copying on their servers and delivers it to your device.
they are creating a new copy and distributing that copy.
which makes them liable.
—
you are right that “ChatGPT” is just a tool.
however, the interesting legal grey area with this is — are ChatGPT model weights an encoded copy of the copyrighted works?
that’s where the conversation about the tool itself being a copyright violation comes in.
photoshop provides no mechanism to recite The Art Of War out of the box. an LLM could be trained to do so (like, it’s a hypothetical example but hopefully you get the point).
if a user is allowed to download said copy to view on their browser, why isn't that same right given to openAI to download a copy to view for them? What openAI chooses to do with the viewed information is up to them - such as distilling summary statistics, or whatever.
> are ChatGPT model weights an encoded copy of the copyrighted works?
that is indeed the most interesting legal gray area. I personally believe that they are not. The information distilled from those works does not constitute any copyrightable information, as it is not literary, but informational.
It's irrelevant that you could recover the original works from these weights - you could recover the same original works from the digits of pi!
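For anyone who wants to poke at the pi comparison, here is a minimal sketch, not a claim about how any model works. It computes decimals of pi with Machin's formula in plain integer arithmetic and reports where short digit strings first show up, so the size of the offset can be set next to the size of the string it locates. The function names, the 4000-decimal cutoff and the sample targets are all made up for illustration.

```python
def arctan_inv(x, unity):
    """Return arctan(1/x) scaled by `unity`, using the Taylor series on integers."""
    term = unity // x
    total = term
    x_squared = x * x
    divisor, sign = 3, -1
    while term:
        term //= x_squared
        total += sign * (term // divisor)
        divisor, sign = divisor + 2, -sign
    return total


def pi_decimals(count, guard=10):
    """First `count` decimals of pi as a string (Machin's formula).

    `guard` extra digits absorb the rounding error of the integer divisions.
    """
    unity = 10 ** (count + guard)
    pi_scaled = 4 * (4 * arctan_inv(5, unity) - arctan_inv(239, unity))
    return str(pi_scaled // 10 ** guard)[1:]  # drop the leading "3"


def first_offset(digits, target):
    """0-based offset of the first occurrence of `target`, or -1 if absent."""
    return digits.find(target)


if __name__ == "__main__":
    # 4000 decimals is an arbitrary cutoff, chosen to stay under CPython's
    # default 4300-digit limit on int-to-str conversion.
    digits = pi_decimals(4000)
    for target in ["7", "42", "365", "2024"]:  # hypothetical sample "works"
        offset = first_offset(digits, target)
        if offset < 0:
            print(f"{target!r}: not in the first {len(digits)} decimals")
        else:
            print(f"{target!r}: offset {offset} "
                  f"({len(str(offset))} digits to write down, "
                  f"target is {len(target)} digits)")
```

Whether the offset ends up shorter or longer than the string it points at is the part worth looking at when comparing this to model weights.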
—
> if a user is allowed to download said copy to view on their browser, why isn't that same right given to openAI to download a copy to view for them?
whether you can download a copy from your browser doesn’t matter. whether the work is registered as copyrighted does (and following on from that, who is distributing the work - aka allowing you to download the copy - and for what purposes).
from the article (on phone cba to grab a quote) it makes clear that the Intercept’s works were not registered as copyrighted works with whatever the name of the US copyright office was.
ergo, those works are not copyrighted and, yes, they essentially are public domain and no remuneration is required …
(they cannot remove DMCA attribution information when distributing copies of the works though, which is what the case is now about.)
but for all the other registered works that OpenAI has downloaded, creating their copy, used in training data, which the model then reproduces as a memorised copy — that is copyright infringement.
like, in case it’s not clear, i’ve been responding to what people are saying about copyright specifically. not this specific case.
> The information distilled from those works does not constitute any copyrightable information, as it is not literary, but informational.
that’s one argument.
my argument would be it is a form of compression/decompression when the model weights result in memorised (read: overfitted) training data being regurgitated verbatim.
put the specific prompt in, you get the decompressed copy out the other end.
it’s like a zip file you download with a new album of music. except, in this case, instead of double clicking on the file you have to type in a prompt to get the decompressed audio files (or text in LLM case)
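a toy sketch of that idea, with big caveats: this is a character-level n-gram table, not an LLM, and `train`, `generate` and the sample sentence are all made up for illustration. it only shows the mechanism i mean by overfitting: when every context in the training text has exactly one continuation, the "model" is effectively a compressed copy, and the right prompt walks it back out verbatim.

```python
from collections import Counter, defaultdict


def train(text, order):
    """Count which character follows each `order`-character context in `text`."""
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        context = text[i:i + order]
        model[context][text[i + order]] += 1
    return model


def generate(model, prompt, order, max_chars):
    """Greedy decoding: always append the most frequent follower of the last context."""
    out = prompt
    for _ in range(max_chars):
        followers = model.get(out[-order:])
        if not followers:  # unseen context, nothing memorised to continue with
            break
        out += followers.most_common(1)[0][0]
    return out


if __name__ == "__main__":
    # stand-in for a copyrighted work (hypothetical sample text)
    work = "the quick brown fox jumps over the lazy dog and then trots home"
    order = 8
    model = train(work, order)

    # the "specific prompt" is just the first few characters of the work
    recited = generate(model, work[:order], order, len(work))
    print(recited == work)

    # a prompt the table has never seen produces nothing extra
    print(generate(model, "zzzzzzzz", order, 20))
```

real models are nowhere near this simple, and most training data does not come back out like this. the sketch is only about the memorised cases.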
> It's irrelevant that you could recover the original works from these weights - you could recover the same original works from the digits of pi!
actually, that’s the whole point of courts ruling on this.
the boundaries of what is considered reproduction is at question. it is up to the courts to decide on the red lines (probably blurry gray areas for a while).
if i specifically ask a model to reproduce an exact song… is that different to the model doing it accidentally?
i don’t think so. but a court might see it differently.
as someone who worked in music copyright, is a musician, and sees the effects of people stealing musicians’ efforts all the time, i hope the little guys come out of this on top.
sadly, they usually don’t.
edit: i am so sorry about the wall of text.
> some are also claiming that the output of a LLM should result in paying royalties back to the owners of the material used in training.
> So if an AI produces a sitcom script then the copyright holders of those tv shows it ingested should get paid royalties. In addition to the money paid to copy files around.
what you’re talking about here is the concept of “derivative works” made from other, source works.
this is subtly different to reproduction of a work.
see the last half of this comment for my thoughts on what the interesting thing courts need to work out regarding verbatim reproduction https://news.ycombinator.com/item?id=42282003
in the derivative works case, it’s slightly different.
sampling in music is the best example i’ve got for this.
if i take four popular songs, cut 10 seconds of each, and then join each of the bits together to create a new track — that is a new, derivative work.
but i have not sufficiently modified the source works. they are clearly recognisable. i am just using copyrighted material in a really obvious way. the core of my “new” work is actually just four reproductions of the work of other people.
in that case — that derivative work, under music copyright law, requires the original copyright rights holders to be paid for all usage and copying of their works.
basically, a royalty split gets agreed, or there’s a court case. and then there’s a royalty split anyway (probably some damages too).
in my case, when i make music with samples, i make sure i mangle and process those samples until the source work is no longer recognisable. i’ve legit made it part of my workflow.
it’s no longer the original copyrighted work. it’s something completely new and fully unrecognisable.
the issue with LLMs, not just ChatGpt, is that they will reproduce both verbatim and recognisably similar output to original source works.
the original source copyrighted work is clearly recognisable, even if not an exact verbatim copy.
and that’s what you’ve probably seen folks talking about, at least it sounds like it to me.
> Which leads to the precedent that if a writer creates a sitcom then the copyright holders of sitcoms she watched should get paid for "training" her.
robin thicke “blurred lines” —
* https://en.m.wikipedia.org/wiki/Pharrell_Williams_v._Bridgep...
* https://en.m.wikipedia.org/wiki/Blurred_Lines (scroll down)
yes, there is already some very limited precedent, at least for a narrow specific case involving sheet music in the US.
the TL;DR IANAL version of the question at hand in the case was “did the defendants write the song with the intention of replicating a hook from the plaintiff’s work”.
the jury decided, yes they did.
this is different to your example in that they specifically went out to replicate that specific musical component of a song.
in your example, you’re talking about someone having “watched” a thing one time and then having to pay royalties to those people as a result.
that’s more akin to “being inspired” by, and is protected under US law i think IANAL. it came up in blurred lines, but, well, yeah. https://en.m.wikipedia.org/wiki/Idea%E2%80%93expression_dist...
again, the red line of infringement / not infringement is ultimately up to the courts to rule on.
—
anyway, this is very different to what openAi/chatGpt is doing.
openAi takes the works. chatgpt edits them according to user requests (feed forward through the model). then the output is distributed to the user. and that output could be considered to be a derivative work (see massive amount of text i wrote above, i’m sorry).
LLMs aren’t sitting there going “i feel like recreating a marvin gaye song”. it takes data, encodes/decodes it, then produces an output. it is a mechanical process, not a creative one. there’s no ideas here. no inspiration or expression.
an LLM is not a human being. it is a tool, which creates outputs that are often strikingly similar to source copyrighted works.
their users might be specifically asking to replicate songs though. in which case, openAi could be facilitating copyright infringement (whether through derivative works or not).
and that’s an interesting legal question by itself. are they facilitating the production of derivative works through the copying of copyrighted source works?
i would say they are. and, in some cases, the derivative works are obviously derived.
When I borrow a book from a friend, how do the original authors get paid for that?
borrowing a book is not creating a COPY of the book. you are not taking the pages, reproducing all of the text on those pages, and then giving that reproduction to your friend.
that is what a COPY is. borrowing the book is not a COPY. you’re just giving them the thing you already bought. it is a transfer of ownership, albeit temporarily, not a copy.
if you were copying the files from a digitally downloaded album of music and giving those new copies to your friend (music royalties were my specialty) then technically you would be in breach of copyright. you have copied the works.
but because it’s such a small scale (an individual with another individual) it’s not going to be financially worth it to take the case to court.
so copyright holders just cut their losses with one friend sharing it with another friend, and focus on other infringements instead.
which is where the whole torrenting thing comes in. if i can track 7000 people who have all downloaded the same torrented album, now i can just send a letter / court date to those 7000 people.
the costs of enforcement are reduced because of scale. 7000 people, all found the same thing, in a way that can be tracked.
and the ultimate: one person/company has downloaded the works and is making them available to others to download, without paying for the rights to make copies when distributing.
that’s the ultimate goldmine for copyright infringement lawsuits. and it sounds suspiciously like openAi’s business model.
That's not what's happening with training AI models either though.
https://news.ycombinator.com/item?id=42282443
OpenAI are taking copies of people’s data. some of that is copyrighted data.
that’s copyright infringement.
an LLM is a tool to create derivative works from the data OpenAI has copied without permission (when considering only copyrighted works and nothing public domain).
derivative works can also be considered copyright infringement in some cases.
how the tool functions is irrelevant for the most part. how copyright infringement occurs doesn’t matter. only that it does.
This extracted useful information cannot and should not be copyrightable.
no i'm not - i'm arguing that its weights are not copyrightable. It doesn't have to be free or not - that is a separate (and uninteresting) argument.
So far this has not been determined and there are plenty of reasonable arguments that they are not breaking copyright law.
Is this sarcasm?
I'm sure China gets competitive advantages from their use of indentured and slave-like labor forces, and mass reeducation programs in camps. Should the US allow these things to happen? What about if a private business starts?
But remember, they're just trying to compete with China on a fair playing field, so everything is permitted right?
https://apnews.com/article/prison-to-plate-inmate-labor-inve...
But don't worry, it's not considered "slave labor" because there's a nominal wage of a few pennies involved and it's not technically "forced." You just might be tortured with solitary confinement if you don't do it.
We need to point fewer fingers and clean up the problems here.
Am I even more concerned about the state having control over the future corpus of knowledge via this doomed-in-any-case vector of "intellectual property"? Yes.
I think it will be easier to overcome the influence of billionaires when we drop the pretext that the state is a more primal force than the internet.
It seems particularly unfair to equate any questioning of the wisdom of copyright laws (even when applied in situations where we might not care for the defendant, as with this case) with fascism.
FWIW, you're talking to a professional musician. Ostensibly, the IP complex is designed to protect me. I cannot fathom how you can regard it as the "people's stake in how their lives are run". Eliminating copyright will almost certainly give people more control over their digital lives, not less.
> It's not Godwin's law when it's correct.
Just to be clear, you are doubling down on the claim that sunsetting copyright laws is tantamount to nazism?
China also supposedly has abusive labor practices. So, should other countries start relaxing their labor laws to avoid falling behind?
I.e., it's some legalese trick, but "everyone knows" what's really at stake.
Ultimately, this is probably going to require congress to create new laws to codify this.
The legal argument is that copying or creating what would otherwise be derivative works solely within a human brain is exempt because the human brain is not a medium wherein a configuration of information constitutes either a copy or a new work until it is set in another medium or performed publicly, whereas the storage of an artificial computer is absolutely such a medium (both of which are well-established law), so that the “learning” metaphor is not legally valid even if it is arguably a decent metaphor for some other purpose. Furthermore, learning and then creating something new is often illegal if the “something new” has sufficient proximity to the source material (that's the prohibition on unlicensed derivative works). GenAI systems often do that, and are (so the argument goes) sufficiently frequently used, and known to the service and model providers to be used, intentionally to do that, so that even were the training itself not a violation, the standards for contributory infringement are met in the provision of certain models and/or services.
If so, you could argue that your local library returns perfect copies of copyrighted works too. IMO it's somehow different from a business turning the results of their scraping into a profit machinery.
The internet archive also scrapes the web for content, does not pay authors, the difference being that it spits out literal copies of the content it scraped, whereas an LLM fundamentally attempts to derive a new thing from the knowledge it obtains.
I just can't figure out how to plug this into copyright law. It feels like a new thing.
> does not pay authors
Check.
> it spits out literal copies of the content it scraped
Check.
> attempts to derive a new thing from the knowledge it obtains.
Check.
* Is interactive: Check.
* Can output text that sounds syntactically and grammatically correct, but a human can instantly say "that doesn't look right": Check.
* Changing one word in a sentence affects words in a completely different sentence, because that changed the context: Check.
"Give and take", "equal exchange", however people want to put it. I don't mind if someone uses publicly-accessible content and ignores its copyright to make another thing, as long as their result is publicly-accessible and they're prepared to have their copyright ignored in return. If you not only use the result of someone else, but also their process, then be prepared to have your process publicly-accessible too, with its copyright ignored. And so on.
That's why I don't mind "unofficial" translations or subtitles (both copyright violations as soon as they are distributed) appearing on multiple sites. That's why I respect open-source licenses of projects that respect them. That's why I pay for some open-source software even if I don't have to. That's why I give credit to artists even when I use an image that I didn't make myself as profile picture (either from the internet or because I paid for it).
That's also why I don't mind anyone ignoring my copyright as long as it's on "equal" terms ("if you vendor my code and pass it off as yours, that's tacit approval for someone else doing the same thing to you" kind of thing ("someone else" because, at least for code, it won't be me)).
I only gave very specific examples, but I hope I was able to explain what I mean.
The thing that I don't like, is the highly asymmetrical situation we're in with generative AI: because the result (the trained model) is not publicly accessible like a significant part of the content it was trained on; they only release a very limited interface to it.
Does intent matter for the purposes of interpreting the laws here? I'm not criticizing your point, I'm genuinely curious if that matters (outside the context of fair use). I can certainly think of valid use cases that would not be considered fair use.
> The thing that I don't like, is the highly asymmetrical situation we're in with generative AI: because the result (the trained model) is not publicly accessible like a significant part of the content it was trained on; they only release a very limited interface to it.
I'm not sure that I agree with this one, given that most serious LLMs are free or very low cost to use, and in llama and phi-3's case pretty much just given away. Not a small gesture given the substantial expenses required to provide free access to some of these models.
If, during the trial, the judge thinks that OpenAI is going to be found to be in violation, he can order all of OpenAI's computer equipment be impounded. If OpenAI is found to be in violation, he can then order permanent destruction of the models and OpenAI would have to start over from scratch in a manner that doesn't violate the law.
Whether you call that "core" or not, OpenAI cannot afford to lose these parts that are left of this lawsuit.
That is exactly why I suggested companies train some models on public domain and licensed data. That risk disappears or is very minimal. They could also be used for code and synthetic data generation without legal issues on the outputs.
We must start with moral and legal behavior. Within that, we look at what opportunities we have. Then, we pick the best ones. Those we can’t have are a side effect of the tradeoffs we’ve made (or tolerated) in our system.
Arrrrr matey, this is going to be fun.
GitHub's ignored me so far, and Stack Exchange explicitly said no (then I sent them an even broader legal request under GDPR)
And what would connect those two things together?
Also was this code open source? Your stack exchange contributions were open source, so they don't need any ToS-based permission in the first place. They have access under CC BY-SA.
It's not clear that Stack Exchange always followed the CC license, and if they violated it even once, it was terminated. The checkbox you have to click now to access the data dumps might be a violation. The data dumps don't come with copies of the licenses, so that's a violation.
He spends his time amassing power and is well positioned to plow over a speed bump like that.
It seems to me that it shouldn't really affect model quality all that much, should it?
Also, in the amended complaint:
> not to notify ChatGPT users when the responses they received were protected by journalists’ copyrights
Wasn't it already quite clear that as long as the articles weren't replicated, it wasn't protected? Or is that still being fought in this case?
In the decision:
> I agree with Defendants. Plaintiffs allege that ChatGPT has been trained on "a scrape of most of the internet," Compl. ¶ 29, which includes massive amounts of information from innumerable sources on almost any given subject. Plaintiffs have nowhere alleged that the information in their articles is copyrighted, nor could they do so. When a user inputs a question into ChatGPT, ChatGPT synthesizes the relevant information in its repository into an answer. Given the quantity of information contained in the repository, the likelihood that ChatGPT would output plagiarized content from one of Plaintiffs' articles seems remote. And while Plaintiffs provide third-party statistics indicating that an earlier version of ChatGPT generated responses containing significant amounts of plagiarized content, Compl. ¶ 5, Plaintiffs have not plausibly alleged that there is a "substantial risk" that the current version of ChatGPT will generate a response plagiarizing one of Plaintiffs' articles.
Have you read 1202? It's all about hiding your infringement.
If we presume it's illegal to train on copyrighted works, but a website like Wikipedia summarizing the article is perfectly legal, then what would happen if we got LLM A to summarize the article and used that summary to train LLM B?
LLM A could be trained on public domain works.
And no, having 5000 different summarizing LLMs doesn't help here.
It's sort of like taking a photograph of a photograph.
Note: do you mean the model is transformative, or the summaries are transformative? I think your comment holds up either way but I think it's better to be clear which one you mean.
let real physically tangible assets keep the exclusivity problem
let's not undo the advantages unlocked by the digital internet; let us prevent a few from locking down this grand boon of digital abundance such that the problem becomes saturation of data
let us say no to digital scarcity
> The belief that information-sharing is a powerful positive good, and that it is an ethical duty of hackers to share their expertise by writing open-source code and facilitating access to information and to computing resources wherever possible.
> Most hackers subscribe to the hacker ethic in sense 1, and many act on it by writing and giving away open-source software. A few go further and assert that all information should be free and any proprietary control of it is bad; this is the philosophy behind the GNU project.
http://www.catb.org/jargon/html/H/hacker-ethic.html
Perhaps if the Internet didn't kill copyright, AI will. (Hyperbole)
(Personally my belief is more nuanced than this; I'm fine with very limited copyright, but my belief is closer to yours than the current system we have.)
"Scrapping" (scraping) copyrighted materials is not the wrong thing to do.
Making it proprietary is.
It is important to be clear about what is wrong, so you don't accidentally end up fighting for copyright expansion, or fighting against open models.
If I share my texts/sounds/images for free, harvesting and regurgitating them omits the requested attribution. Even the most permissive CC license (excluding CC0 public domain) still requires an attribution.
In this view, the ideal world is one where copyright is abolished (but not moral rights). So piracy is good, and datasets are also good.
Asking creators to license their work freely is simply a compromise due to copyright unfortunately still existing. (Note that even if creators don't license their work freely, this view still permits you to pirate or mod it against their wishes.)
(My view is not this extreme, but my point is that this view was, and hopefully is, still common amongst hackers.)
I will ignore the moralizing words (eg "ruthless", "harvested" to mean "copied"). It's not productive to the conversation.
Ignoring moral rights of creators is the issue.
So I don't think you actually mean moral rights, since it's not being ignored here.
But the first sentence of your comment still stands regardless of what you meant by moral rights. To that, well... we're still commenting here, are we not? Despite it with almost 100% certainty being used to train AI. We're still here.
And yes, funding is a thing, which I agree needs copyright for the most part unfortunately. But does training AI on, for example, a book really reduce the need to buy the book, if it is not reproduced?
Remember, training is not just about facts, but about learning how humans talk, how languages work, how books work, etc. Learning that won't reduce the book's economic value.
And yes, summaries may reduce the value. But summaries already exist. Wikipedia, Cliff's Notes. I think the main defense is that you can't copyright facts.
?!?! Comparing and equating commenting to creative works. ?!?!
These comments are NOT equivalent to the 17 full time months it took me to write a nonfiction book.
Or an 8 year art project.
When I give away my work I decide to whom and how.
You might want to take a look at https://www.gnu.org/philosophy/shouldbefree.en.html
Look, either actually read the link and refute the points within, or don't. But there's no use discussing anything if you're unwilling to even understand and seriously refute a single point being made here, other than repeating "mine, mine, mine".
In the process, [OpenAI] trained ChatGPT not to acknowledge or respect copyright, not to notify ChatGPT users when the responses they received were protected by journalists’ copyrights, and not to provide attribution when using the works of human journalists
How could an ethical hacker side with OpenAI, when OpenAI is using its technological expertise to exploit creators without?
So basically the reasoning is this:
- NYT vs OpenAI, neither is disenfranchised
- OpenAI vs individual creators, creators are disenfranchised
- NYT vs individual model trainers, model trainers are disenfranchised
- Individual model trainers vs individual creators, neither are disenfranchised
And if only one can win, and since the view is that information should be free, it biases the argument towards the model trainers.
Your argument is that it's okay to scrape content when you are an individual. It doesn't change the fact those individuals are people with technical expertise using it to exploit people without.
If they wrote a bot to annoy people but published how many people got angry about it, would you say it's okay because that is information?
You need to draw the line somewhere.
> If they wrote a bot to annoy people but published how many people got angry about it, would you say it's okay because that is information?
Kind of? It's not okay, but not because it is usage of information without consent (this is the "information should be free" part), but because it is intentionally and unnecessarily annoying and angering people (this is the "don't use the information for evil" part which I think is your position).
"See? Similarly, even in your view, model trainers aren't bad because they're using data. They're bad in general because they're exploiting creatives."
But why is it exploitative?
"They're putting the creatives out of a job." But this applies to automation in general.
"They're putting creatives out of a job, using data they created." This is the strongest argument for me. It does intuitively feel exploitative. However, there are several issues:
1. Not all models or datasets do that. For instance, no one is visibly getting paid to write comments on HN, or to write fanfics on the non-commercial fanfic site AO3. Since the data creators are not doing it as a job in the first place, it does not make sense to talk about them losing their job because of the very same data.
2. Not all models or datasets do that. For example, spam filters, AI classifiers. All of this can be trained from the entire Internet and not be exploitative because there is no job replacement involved here.
3. Some models already do that, and are already well and morally accepted. For example, Google Translate.
4. This may be resolved by going the other way and making more models open source (or even leaks), so more creatives can use it freely, so they can make use of the productive power.
"Because they're using creatives' information without consent." But as mentioned, it's not about the information or consent. It's about what you do with the information.
Finally, because this is a legal case, it's also important to talk about the morality of using the state to restrict people from using information freely, even if their use of the information is morally wrong.
If you believe in free culture as in free speech, then it is wrong to restrict such a use using the law, even though we might agree it is morally wrong. But this really depends if you believe in free culture as in free speech in the first place, which is a debate much larger than this.
OpenAI is not sharing their data (they're keeping it private to profit off of), so how could it be anywhere near the "hacker ethos" to believe that everyone else needs to hand over their data to OpenAI for free?
Luckily, most people seem to ignore OpenAI's hypocritical TOS against sharing their output weights for training. I would go one step further and say that they should share the weights completely, but I understand there are practical issues with that.
Luckily, we can kind of "exfiltrate" the weights by training on their output. Or wait for someone to leak it, like NovelAI did.
which has indeed turned into "i-am-rich-cuz-i-own-tech-stock"news
When you strip people’s names from their words, as the specific count here charges; and you strip out any reason or even way for people to reward good work when they appreciate it; and you put the disembodied words in the mouth of a monolithic, anthropomorphized statistical model tuned to mimic a conversation partner… what type of thought is it that becomes abundant in this world you propose, of “data abundance”?
In that world, the only people who still have incentive to create are the ones whose content has negative value, who make things people otherwise wouldn’t want to see: advertisers, spammers, propagandists, trolls… where’s the upside of a world saturated with that?
I think people simply like it when data is liberated from corporations, but hate it when data is liberated from them. (Though this case is a corporation too so idk. Maybe just "AI bad"?)
Obviously the armed forces are much less despised than the police. But given that private gun ownership is at an all-time high (with women and people of color - historically marginalized groups with regard to arms equality - making up the lion's share of the recent increase), I'm not sure that people are feeling particularly vulnerable to invasion either.
Is the state really that popular in your circle? How do people express their esteem? Am I just missing it?
Artists are already using AI to photobash images, and writers are using AI to outline and create rough drafts. The point of having a human in the loop is to tell the AI what is worth creating, then recognize where the AI output can be improved. If we have algorithms telling the AI what to make and content mill hacks smearing shit on the output to make it look more human, that would be the worst of both worlds.
Humans are allowed to synthesize a bunch of inputs together and produce a new, novel copyright.
An algorithm, if it mixes a bunch of copyrighted things together by itself, plausibly is incapable of producing a novel copyright, and instead inherits the old copyright.
Just like Clean Room Design (https://en.wikipedia.org/wiki/Clean-room_design) can be used to re-create the same software free of the original copyright, I think the parent is arguing that a mechanical turk process could allow AI to produce the same output free of the original copyright.
In fact, since I use ChatGPT a lot, I get more gain if it is.
> Andrew Deck is a generative AI staff writer at Nieman Lab...
*No copyright.*
https://insights.manageengine.com/artificial-intelligence/th...
“Source A on date 1 said XYX”
“Source B …”
“Synthesizing these, it seems that the majority opinion is X but Y is also a commonly held opinion.”
Instead of what it does now, which is make extremely confident, unsourced statements.
It looks like the copyright lawsuits are rent-seeking as much as anything else; another reason I hate copyright in its current form.
One of the results the LLM has available to itself is a confidence value. It should, at the very least, provide this along with its answer. Perhaps if it did people would stop calling it 'AI'.
An analogy. Take a former commuter friend of mine, Mr Skol (named after his favourite breakfast drink). Seen on a minibus I had to get to work years ago, we shared many interesting conversations. Now he was a confident expert on everything. If asked to rate his confidence in a subject it would be a good 95% at least. However he spoke absolute garbage because his brain was rotten away from drinking Skol for breakfast, and the odd crack chaser. I suspect his model was still better than GPT-4o. But an average person could determine the veracity of his arguments.
Thus confidence should be externally rated as an entity with knowledge cannot necessarily rate itself for it has bias. Which then brings in the question of how do you do that. Well you'd have to do the research you were going to do anyway and compare. So now you've used the AI and done the research which you would have had to do if the AI didn't exist. So the AI at this point becomes a cost over benefit if you need something with any level of confidence and accuracy.
Thus the value is zero unless you need crap information, which is at least here, never, unless I'm generating a picture of a goat driving a train or something. And I'm not sure that has any commercial value. But it's fun at least.
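For what the "confidence value" mentioned above could look like in practice, here is a minimal sketch. It assumes the model exposes a log-probability for each generated token (not every service does), and it uses the geometric mean of the token probabilities as the score; the function name `sequence_confidence` is made up for illustration. Per the Mr Skol point, this only measures how sure the model is of its own wording, not whether the answer is true.

```python
import math
from typing import Sequence


def sequence_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric mean of per-token probabilities, in the range (0, 1].

    token_logprobs holds natural-log probabilities, one per generated
    token (hypothetical input; how you obtain them depends on the model).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # exp(mean(log p)) is the geometric mean of the probabilities p.
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_logprob)


if __name__ == "__main__":
    # Three tokens the model was fairly sure of, one it was not.
    example = [math.log(0.9), math.log(0.8), math.log(0.95), math.log(0.2)]
    print(f"self-reported confidence: {sequence_confidence(example):.2f}")
```

A high score here is exactly the 95% Mr Skol would give himself, which is why it would need an external check to mean anything.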
If an entity charges fees for "AI", then is it "rent-seeking"?
(Assume that the entity is not the author of the training data used)
I can see where you're coming from with wanting the government to be more proactive in clamping down on illegal practices. But it's pretty standard, from what I understand, that violations of civil law only have consequences if and when an aggrieved party goes to court.
Sora's presentation [0]: on multiple text input examples "... the video does not contain any text or additional objects."
are you gonna say it's a prompt for no text on signs or banners or is it a way to get rid of subtitles & watermark logos?
Which is why nothing is going to happen, can't have people starting with the latter.
That would make the USCO a de facto clearinghouse for news.
(ETA: This paragraph below is diametrically wrong. Sorry.)
AFAIK in the USA, registered copyright is necessary if you want to bring a lawsuit and get more than statutory damages, which are capped low enough that corporations do pre-register work.
Not the case in all Berne countries; you don't need this in the UK for example, but then the payouts are typically a lot lower in the UK. Statutory copyright payouts in the USA can be enough to make a difference to an individual author/artist.
As I understand it, OpenAI could still be on the hook for up to $150K per article if it can be demonstrated it is wilful copyright violation. It's hard to see how they can argue with a straight face that it is accidental. But then OpenAI is, like several other tech unicorns, a bad faith manufacturing device.
But to ensure his patent, he mailed himself a sealed copy of the plans. He claims the postage date stamp will hold up in court if he ever needs it.
Is that a thing? Or is it just more tinfoil business? It's hard to tell with him.
Buy your family member a copy of:
https://www.goodreads.com/book/show/58734571-patent-it-yours...
I know it's tangential to this thread but could you link to further reading?
This is patent, not copyright.
It certainly used to be a legal device people used.
Essentially it is low-budget notarisation. If your family member believes they have something which is timely and valuable, it might be better to seek out proper legal notarisation, though -- you'd consult a Notary Public:
I guess the modern version would be to sha256 the plans and shove it into a bitcoin transaction
good luck explaining that to a judge
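For anyone curious, the hashing half of that idea is only a few lines. This is a rough sketch using Python's standard `hashlib`; the helper names and the file path are made up for illustration. Getting the digest into a timestamped public record (a bitcoin transaction or anything else) is a separate step not shown here, and whether a court would give it any weight is a different question entirely.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of some bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def sha256_file(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large plans need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Hypothetical usage: print(sha256_file("plans.pdf"))
    print(sha256_hex(b"abc"))
```

The digest proves nothing about who made the plans; it only lets you show later that a file you hold matches whatever was recorded at the time.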
> Infringement suits require that relevant works were first registered with the U.S. Copyright Office (USCO).
I have it upside down/diametrically wrong, however you see fit. Right that structures exist, exactly wrong on how they apply.
It is registration that guarantees access to statutory damages:
https://www.justia.com/intellectual-property/copyright/infri...
Without registration you still have your natural copyright, but you would have to try to recover the profits made by the infringer.
Which does sound like more of an uphill struggle for The Intercept, because OpenAI could maybe just say "anything we earn from this is de minimis considering how much errr similar material is errrr in the training set"
Oh man it's going to take a long time for me to get my brain to accept this truth over what I'd always understood.
You can go register after an infringement and still sue, but you then won't be able to get statutory damages or attorney's fees.
Statutory damages are a big deal in general but especially here where proving how much of OpenAI's revenue is due to your specific articles is probably impossible. Which is why they're suing under this DMCA provision: it's not an infringement suit so the registration requirement doesn't apply, and there's a separate statutory damages provision for it.
It's another instance of "move fast, break things" (i.e. "keep your eyes shut while breaking the law at scale")
The whole of journalism is taking the acts of others and repeating them. Why does a journalist claim to have the rights to someone else's actions when someone simply looks at something the journalist did and repeats it?
If no one else ever did anything, the journalist would have nothing to report, it's inherently about replicating the work and acts of others.
Hilarious (and depressing) that this is what people think journalists do.
They are "content creators" now.