"The New York Times is demanding that we turn over 20 million of your private ChatGPT conversations."

As might any plaintiff. The NYT may be the first of many, and the lawsuits may not be limited to copyright claims

Why has OpenAI collected and stored 20 million conversations (including "deleted chats")

What is the purpose of OpenAI storing millions of private conversations

By contrast, the purpose of NYT's request is both clear and limited

The documents requested are not being made public by the plaintiffs. The documents will presumably be redacted to protect any confidential information before being produced to the plaintiffs. The documents can only be used by the plaintiffs for the purpose of the litigation against OpenAI. And, unlike OpenAI, which has collected and stored these conversations for as long as OpenAI desires, the plaintiffs are prohibited from retaining copies of the documents after the litigation is concluded

The privacy issue here has been created by OpenAI for their own commercial benefit

It is not even clear what this benefit, if any, will be as OpenAI continues to search for a "business model"

Wanton data collection

NB. There is no order to "collect". The order is to preserve what is already being collected and stored in the ordinary course of business

https://ia801404.us.archive.org/31/items/gov.uscourts.nysd.6...

https://ia801404.us.archive.org/31/items/gov.uscourts.nysd.6...

Why does OpenAI collect and retain for 30 days^1 chats that the user wants to be deleted

It was doing this prior to being sued by the NYT and many others

OpenAI was collecting chats even when the user asked for deletion, i.e., the user did not want them saved

That's why a lawsuit could require OpenAI to issue a hold order, retain these chats for longer and produce them to another party in discovery

If OpenAI was not collecting these chats in the ordinary course of its business before being sued by the NYT and many others, then there would be no "deleted chats" for OpenAI to be compelled by court order to retain and produce to the plaintiffs

1. Or whatever period OpenAI decides on. It could change at any time for any reason. However OpenAI cannot change their retention policy to some shortened period after being sued. Google tried this a few years ago. It began destroying chats between employees after Google was on notice it was going to be sued by the US government and state AGs

I'd trust Sam Altman about as far as I could throw him and there is absolutely no way OpenAI should be having sensitive private conversations with anybody. Sooner or later all that data will end up with Microsoft who can then correlate it with a ton of data they already have from other sources (windows, office online, linkedin, various communications services including 'teams', github and so on).

This is an intelligence service's wet dream.

> […] there is absolutely no way OpenAI should be having sensitive private conversations with anybody. Sooner or later all that data will end up with Microsoft who can then […]

I don't think you even need to go as far as Microsoft (who have earned zero points in the Privacy Protection league), just have a look at Altman's "I want to create a biometric database of every human" Orb/Worldcoin eye-scanning project: https://www.ft.com/content/0c5c2b8d-b185-40b6-9221-b80ee130b...

I'm not commenting on the core point of your comment, only the "why retain for 30 days" question.

In an age of automated backups and failovers, deleting can be really hard. Part of the answer could simply be that syncing a delete across all the redundancies (while ensuring those redundancies are reliable when a disaster happens and they need to recover or maintain uptime) may take days to weeks. Also, the 30 days could be the limit, as opposed to the average or median time it takes.

The most likely explanation is that whatever storage solution they're using has built-in "recycle bin" functionality and deleted data stays there for 30 days before it's actually deleted. I see this a lot in very large databases. The recycle bin functionality is built into the data store product.
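
As a rough illustration of what that kind of built-in retention window can look like, here's a minimal soft-delete sketch; the table and the purge job are hypothetical, not anyone's actual schema:

    import sqlite3, time

    RETENTION_SECONDS = 30 * 24 * 3600  # the 30-day "recycle bin" window

    db = sqlite3.connect("chats.db")
    db.execute("CREATE TABLE IF NOT EXISTS chats (id INTEGER PRIMARY KEY, body TEXT, deleted_at REAL)")

    def soft_delete(chat_id):
        # "Delete" only flags the row; the content is still on disk (and in backups).
        db.execute("UPDATE chats SET deleted_at = ? WHERE id = ?", (time.time(), chat_id))
        db.commit()

    def purge_expired():
        # A periodic job hard-deletes rows once the retention window has elapsed.
        cutoff = time.time() - RETENTION_SECONDS
        db.execute("DELETE FROM chats WHERE deleted_at IS NOT NULL AND deleted_at < ?", (cutoff,))
        db.commit()
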
I'm doubtful that a data store product used at their scale can't be configured to not keep data for 30 days; for large clients that could be TB of deleted data or more. This would be neither cheap nor easy to manage.
Oh, I realize that, but deviating from the defaults they have now would require so much testing, and all the risk that goes along with it, that they'll avoid it at all costs.
That sounds very plausible.
The problem when dealing with any company that has proven itself untrustworthy is that by default the innocent "plausible" option is probably no longer the "likely" one.

And I say this knowing that intentionally deleting data is harder than it looks.

That doesn't sound quite right to me.

Something about game theory, art of war, and the difference between stated intentions and actual intentions.

Trustworthiness comes from alignment of stated intentions, actual intentions, abilities and actions. Someone can have integrity between stated and actual intentions, but fail to follow through. In this case I think we doubt the integrity between OpenAI's stated and actual intentions.

So Sam can be saying stuff and then we find out he wasn't being honest. We can learn over time about his intentions by watching actions instead of listening to what he says. Then we can make new assumptions based on what his actual intentions seem like.

Based on what I assume Sam's intentions to be (with some healthy suspicion of the alignment between his stated intentions and actual intentions), I'm still skeptical that the reason for the 30 day thing goes far beyond quality control, the difficulty of balancing deletion and redundancy and the features of the tech stack they are using.

> I'm not commenting on the core point of your comment, only the "why retain for 30 days" question. In an age of automated backups and failovers, deleting can be really hard.

I doubt it's that. Deletion is hard, but it's not "exactly 30 days" hard.

The most likely explanation is that OpenAI wants the ability to investigate abuse and / or publicly-made claims ("ChatGPT told my underage kid to <x>!" / "ChatGPT praised Hitler!"). If they delete chats right away, they're flying blind and you can claim anything you want.

Now, whether you should have a "delete" button that doesn't really delete stuff is another question.

What is the standard way of handling a forced restore from backup while ensuring deleted data does not also become restored? Is every delete request stored so that it can be replayed against any restore?
I have only had to manage this in a startup context with relatively low stakes and it was hard and messy. I don't know what best practice is at the scale that openai operates, but from my limited experience I have an intuition that the challenge is not trivial.

Also I suspect there is a big gap between best practice and common practice. My guess is common practice is dysfunctional. I would also suspect there is no standard way, but there are established practices within different technology stacks that vary between performative, barely compliant and effective at scale.

In one case I saw there was a substantial manual effort to load snapshots into instances, run the delete, and then save new snapshots. This was over 10 years ago though, and it was more of a "we just need to get this done" than a "what's the most elegant way to do this at scale"
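
For what it's worth, the "replay deletes against any restore" idea can be sketched roughly like this, assuming a durable tombstone log kept alongside the backups (all names here are illustrative):

    import json

    def restore_with_tombstones(backup_records, tombstone_log_path):
        """Rebuild state from a backup snapshot, then re-apply every delete
        recorded since, so deleted data does not silently come back."""
        state = {rec["id"]: rec for rec in backup_records}
        with open(tombstone_log_path) as f:
            for line in f:
                tombstone = json.loads(line)      # e.g. {"id": 42, "deleted_at": 1699999999}
                state.pop(tombstone["id"], None)  # idempotent, so replaying twice is safe
        return state
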

> Why does OpenAI collect and retain for 30 days^1 chats that the user wants to be deleted

When working on an e-commerce gig we would get "delete my data" requests from customers, which we were legally obliged to comply with. A script would delete everything we could from the DB immediately. Since we had 30-day backups, their data would only be deleted from the backups on day 31. I think this was acceptable to the GDPR consultant.

Going into the backups to delete their data there is insane.

> Going into the backups to delete their data there is insane.

If I was legally obliged to delete data then I'd make sure I deleted, regardless of the purpose or location of the storage. If you can't handle a delete request you shouldn't collect the data in the first place.

What you want to do is encrypt/anonymize per-user information using a translation layer that also gets backed up. In case of a GDPR request, you delete this mapping / key and voila: data cleanup. The backup data becomes unusable.

But this obviously means building an extensive system to ensure the encoded identifier is the only thing used across your system (or a giant key management system).

In the past I’ve been a part of systems at exabyte scale that had to implement this. Hard but not impossible. I can see how orgs try to ‘legalese’ their way out of doing this though because the only forcing function is judicial.
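
A toy sketch of that crypto-shredding approach, using the `cryptography` package's Fernet; the in-memory key table here stands in for the "translation layer" described above and would itself need durable, backed-up storage:

    from cryptography.fernet import Fernet

    user_keys = {}  # the "translation layer": user id -> per-user key

    def store_for_user(user_id, plaintext: bytes) -> bytes:
        key = user_keys.setdefault(user_id, Fernet.generate_key())
        return Fernet(key).encrypt(plaintext)  # ciphertext can safely live in any backup

    def gdpr_erase(user_id):
        # Deleting only the key renders every backed-up ciphertext unreadable.
        user_keys.pop(user_id, None)
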

Maybe an append only data store where actual hard deletes only happen as an async batch job? Still 30 days seems really long for this.
The two documents you linked are responses to specific parts of OpenAI's objection. They're not good sources for the original order.

Nevertheless, you're generally correct but you don't realize why: A core feature of ChatGPT is that it keeps your conversation history right there so you can click on it, review it, and continue conversations across all of your devices. The court order is to preserve what is already present in the system even if the user asks to delete it.

For those who are confused: A core feature of ChatGPT and other LLM accounts is that your past conversations are available to return to, until you specifically delete them. The problem now is that if a user asks for the conversation to be deleted, OpenAI has to retain the conversation for the court order even though it appears deleted.

> What is the purpose of OpenAI storing millions of private conversations

Your previous ChatGPT conversations show up right in the ChatGPT interface.

They have to store the private conversations to enable users to bring them up in the interface.

This isn't a secretive, hidden data collection. It's a clear and obvious feature right in the product. They're fighting for the ability to not retain secret records of past conversations that have been deleted.

The problem with the court order is that it requires them to keep the conversations even after a user presses the 'Delete' button on them.

They could have been stored at the client, and encrypted before being optionally synced back to OpenAI servers in a way that the stored chats can only be read back by the user. Signal illustrates how this is possible.

OpenAI made a choice in how the feature was and is implemented.

Signal does End-to-end encryption, so they (Signal) can never read it.

The whole point of ChatGPT conversations is so they can be read by the model on the server.

Conversations are kept around because they can be picked up and continued at any point (I use this feature frequently).

Additionally you can use conversations in their scheduled notification feature, where the conversation is replayed and updates are sent to you, all done on the server.

> OpenAI made a choice in how the feature was and is implemented.

Indeed they did, and it was a sensible choice given how the conversations are used.

You could definitely do this E2EE.

Models should run in ephemeral containers where data is only processed in RAM. For active conversation a unique and temporary key-pair is generated. Saved chats are encrypted client side and stored encrypted server side. To resume a conversation[0], decrypt client side, establish connection to container, generate new temporary key-pair, and so on. There's more details and nuances but this is very doable.

How Mullvad handles your data, for some inspiration: https://mullvad.net/en/help/no-logging-data-policy

> Conversations are kept around because they can be picked up and continued at any point (I use this feature frequently).

I'm not sure why this is a problem. There's no requirement that data at rest needs to be unencrypted. Nor is there a requirement that those storing the data need to have the keys to decrypt that data. Encrypted storage is a really common thing...

> Additionally you can use conversations in their scheduled notification feature, where the conversation is replayed and updates are sent to you, all done on the server.

For this we can use the above scenario, or we can use a multi-key setting if you want to ping multiple devices, or you can have data temporarily decrypted. There is still no need to store the data to disk unencrypted or encrypted with keys OAI owns.

Of course, I also don't see OAI pushing the state of Homomorphic Encryption forward either... But there's definitely a lot of research and more than acceptable solutions that allow data to be processed server side while being encrypted for as long as possible and making access to that data incredibly difficult.

Again, dive deep into how Mullvad does it. It is not possible for them to keep all their data encrypted, but they make it as close to impossible to get as they can, including for themselves. There doesn't need to be a perfect solution, but there's no real reason these companies couldn't restrict their own access to that data. There's only 2 reasons they are not doing so. Either 1) they just don't care enough about your privacy or 2) they want it for themselves. Considering how OpenAI pushes the "Scale is All You Need" narrative, and "scale" includes "data", I'm far more inclined to believe the reason is option 2.

[0] Remember, this isn't so much a conversation in the conventional sense. The LLMs don't "remember". You send them the entire chat history in each request. In this sense they are Markovian. It's not like they're tuning a model just to you. And even if they were, well we can store weights encrypted too. Doesn't matter if a whole model, LoRA, embeddings, or whatever. That can be encrypted at rest via keys OAI does not have access to.
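
To make the footnote concrete, here is a minimal sketch of that statelessness; the endpoint and payload shape follow the publicly documented chat-completions API as I understand it, and the model name is just a placeholder:

    import requests

    history = []  # the client owns the conversation; the server is sent all of it each turn

    def ask(prompt, api_key):
        history.append({"role": "user", "content": prompt})
        resp = requests.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": f"Bearer {api_key}"},
            json={"model": "gpt-4o-mini", "messages": history},  # full history, every call
        )
        reply = resp.json()["choices"][0]["message"]
        history.append(reply)
        return reply["content"]
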

Services like Mullvad and Signal are in the business of passing along messages between other parties; messages the service isn't a party to. With chatgpt chat histories, the user is talking directly to the service - you're suggesting the service should E2EE messages to and from itself, to prevent itself from spying on data generated by its own service?
You cannot compare these examples. There is currently no way to encrypt the user message and have the model on the server read/process the message without it being decrypted first.

Mullvad and E2EE messengers do not need to process the contents of the message on their server. All they do is pass it to another computer. It could be scrambled binary for all they care. But any AI company _has_ to read the content of the message by definition of their service.

It's a solved problem. Lumo.
> Models should run in ephemeral containers where data is only processed in RAM

Maybe, but leaving aside that they are two different kinds of products, how can you trust them to really do so? And in any case, with ChatGPT, where should I store my client-side private key, as I use those bots only in my web browser? Maybe in my password manager, and I copy-paste it every time I start a new conversation.

My take is that if they went this way we would not be talking about them now, we would be talking about one of their competitors that didn't put hurdles between their product and their customers.

In other words, survivor bias.

People are responding in this thread as if ChatGPT is a one-on-one conversation with another person. The data isn’t “shared” with OpenAI. You’re chatting with OpenAI. ChatGPT is just a service. There’s no way to use ChatGPT without sharing all of your chats with OpenAI, that’s what the entire product is.
This doesn’t sound realistic. Signal is end to end encrypted and only sends one message at a time, while ChatGPT needs the entire chat context for every message and they need to decrypt your messages in their services in order to feed them into the LLM.
This is what Proton are doing with Lumo[0]

https://lumo.proton.me

> Our long-term roadmap includes advanced security features designed to keep your data private, including client-side encryption for your messages with ChatGPT. We believe these features will help keep your private conversations private and inaccessible to anyone else, even OpenAI.
This sort of thing is pretty trivial to implement from the start, they just chose not to because they wanted the data themselves
Hah. I seriously doubt it is even close to trivial. Especially when the chats need to be available on any device you use the service from.
No it's not. It's literally a court order mandating them to collect this data.

- [1] https://arstechnica.com/tech-policy/2025/08/openai-offers-20...

This article says nothing of the sort. The court order is to preserve existing logs they already have, not to start new logging, and to hand the logs over to the plaintiffs. OpenAI's objections are mainly that 1/ there are too many logs (so they're proposing a sample instead) and that 2/ there's identifying data in the logs and so they are being "forced" to anonymize the logs at their expense (even though it's what they want as a condition of transferring the logs).

There is nothing in the article that mentions OpenAI being forced to create new logs they don't already have.

This response is misleading. Almost all computer services keep logs for a short period of time, so the court order to retain existing information is quite a bit more powerful than a layman would think: in most web services I've worked on for the past 30 years, a huge amount of data is retained for a short period of time and then rapidly deleted.

This is true in services like Datadog, New Relic, and logging services like Splunk. But even privacy-focused services like Mullvad keep logs for 24 hours to monitor for abuse. So this notion that an order to retain logs is significantly weaker than an order to collect them is really a bit of misdirection. I'm not sure whether it's intentional, but it's definitely misleading.

There is an important distinction that relates to a court’s ability to order a defendant to perform work to facilitate discovery. A court can order preservation of records, but they generally cannot order a defendant to create new ones. I was responding to your use of the word “collect,” which implies significantly more effort than merely not destroying logs (i.e. logging new information that they weren’t already).

It’s not misdirection or misleading; it lies in an understanding of the law. There’s plenty of case law out there on the subject if you’re interested.

Both are simply software changes. In one case, they're going to have to alter the software to not delete chats that users request to be deleted. In the other case, they'll alter the software to log new information. Neither of these are particularly difficult.
I understand, but the law still distinguishes between the two cases. In my experience, typically expunging is handled by a process separate from its creation (it depends on the logging framework, of course). And with the increasing trend of generated logs being ingested, processed, and stored by separate services, often disabling log deletion is a mere API call away.
Well, don’t get yourself sued and you won’t have to perform discovery for the plaintiffs.
If OpenAI truly didn't keep conversation records for any length of time, they would not be subject to this kind of order. Lots of stateless services get these and are able to defeat them because they never store the user's data. The fact that they store them at all means that they are in scope for a preservation order. It also means that they are in scope for all manner of usage by OpenAI themselves even if a user requests deletion.
It seems as if the court has forced OpenAI into collecting logs that they weren't otherwise collecting, or that they were deleting at user request.

So in this case not keeping logs as ordered by the court would be contempt of court.

Respectfully, it doesn’t matter the way it “seems,” it matters what is. They were collecting these logs, and as soon as they got the preservation order, they disabled deletion functionality and notified their customers of that.

There is a separate higher-tier private API customers can pay for that never had logging enabled, and the court did not force the company to add it.

This is an excellent article and source. Thank you.
> What is the purpose of OpenAI storing millions of private conversations

It's needed for the conversation history feature, a core feature of the ChatGPT product

It's like saying "What is the purpose of Google Photos storing millions of private images"

This is true but why retain deleted conversations?
That's the objection: The court order requires them to retain everything they currently have, even if the user requests that it be deleted.
Because the New York Times sued them and made them.

https://openai.com/index/response-to-nyt-data-demands/

ChatGPT (the app) specifically says they keep deleted conversations for up to 30 days. That's probably why.
Yeah, but the link states "The 20 million user conversations were randomly sampled from Dec. 2022 to Nov. 2024", so this makes no sense. 2024 was much longer than 30 days ago
Because the court ordered them to retain the records longer than they normally would.
>What is the purpose of OpenAI storing millions of private conversations

Have you used ChatGPT? Your conversation history is on the left rail

I read in the pleadings that OpenAI claims it cannot search its logs without decompressing them first

I can search the logs I keep without decompressing

Every user is different and each is free to use whatever software they want

"Have you used ChatGPT?"

No

Large number of upvotes on the quoted comment however. Maybe some of those voters are ChatGPT users

I do searching from the command line in text mode. The script I use keeps a "log" (a customised SERP) of all query strings and search result URLs. I also have these URLs stored in the logs from the forward proxy. These are compressed using RePair. I can search the compressed logs faster this way than with something like

    zstd -dc log.zst | grep pattern
or

    rg -z pattern log.zst
> No

Given that, I'd suggest not offering "alternatives" to the features described in TFA for a service you've never used. There are people here talking about oranges, a lot of them with domain expertise, and you're not just talking about apples, you're talking about bird migrations.

> Large number of upvotes on the quoted comment however.

Sure, and also downvotes - that measures factionalism, not correctness.

But tech wise, you're confused. Functionally speaking chatgpt is a shared document editor - the server needs to store chat histories for the same reason Google Docs stores the content of documents. Users can submit text to chatgpt.com from one browser, and later edit that text from the app or a different browser. Ergo the text is stored on the server, simple as that.

Downvotes are a tiny fraction

3 versus 190+, so far

Many commenters cannot distinguish rhetorical questions from questions that seek an answer

By attempting to answer a rhetorical question one may only strengthen the point being made by the question, for example, poor decision-making, and may reveal an absence of self-awareness

Using RePair for compression I can also search inside compressed tarballs full of logs

To do this, I first insert a blank line at the top of each log file before adding to the tarball

IME, RePair is faster than compressing with zstd and the size reduction is almost the same

The only "catch" is that RePair requires more memory during compression

Pardon, but do you have a link for this RePair compressor?

Unfortunately, different searches for this RePair you mentioned have only revealed links to resources for repairing broken air compressors, damaged compressed files, spinal injuries, etc.

They made the feature, now they get to live with it. So they can spare us the feigned surprise and outrage.

Instead of writing open letters they could of course do something about it. Even Google stopped storing your location timeline on their servers and now have it per-device only.

We’re talking about two different things. It would be like Gmail not storing your emails. Expecting ChatGPT to not store your chats is ridiculous
> The documents requested are not being made public by the plaintiffs

In fact, as far as I understand it, they could not be made public by the plaintiffs even if they wanted to do so, or even if one of their employees decided to leak them.

That's because the plaintiffs themselves never actually see the documents. They will only be seen by the plaintiff's lawyers and any experts hired by those lawyers to analyze them.

You are correct. I've operated under many protective orders that require me to redact portions of reports clients paid for because they were not authorized to see those specific parts due to the order.
News Plaintiffs October 15, 2025 Letter Motion to Compel

https://ia801205.us.archive.org/1/items/gov.uscourts.nysd.61...

OpenAI October 30, 2025 Letter Opposing Motion to Compel

https://ia601205.us.archive.org/1/items/gov.uscourts.nysd.61...

November 7, 2025 Order on Motion to Compel

https://ia601205.us.archive.org/1/items/gov.uscourts.nysd.61...

"OpenAI has failed to explain how its consumers privacy rights are not adequately protected by: (1) the existing protective order in this multidistrict litigation or (2) OpenAIs exhaustive de-identification of all of the 20 million Consumer ChatGPT Logs.1

1. As News Plaintiffs point out, OpenAI has spent the last two and a half months processing and deidentifying this 20 million record sample. (ECF 719 at 1 n.1)."

If an analogy to the history of search engines can be made,^1 then we know that log retention policies in the US can change over time. The user has no control over such changes

https://ide.mit.edu/wp-content/uploads/2018/01/w23815.pdf

Companies operating popular www search engines might claim that the need for longer retention is "to provide better service" or some similar reason that focuses on users' interests rather than the company's interests^2

2. Generally, advertising services

This paper attempts to expose such claims as bogus

1. According to some reports OpenAI is sending some queries to Google

Amusingly, this discussion thread is filled with replies that attempt to "answer" the question of "why" OpenAI collects chat histories even when it must have known it would be sued for copyright infringement

For users affected by OpenAI's conduct, an "answer" makes no difference. Anyone can construct any "answer" they want and we can see that in this thread. For users affected by OpenAI's conduct, it does not matter

In the above paper on search engines, the claim was that longer retention of sensitive data leads to better search. This was the "answer" presented in response to the question of "why"

But the "answer" is only misdirection. The companies have no reputation for being honest and their operations are non-transparent. Accordingly, user focus will be on the consequences for users of the company's practices, not "why"

Some readers are probably too young to have read through the AOL search data

https://en.wikipedia.org/wiki/AOL_search_log_release

Did anyone care "why" AOL released the data

IMHO, it is unfortunate that papers like the one above need to be published

The question of "why" is rhetorical. It is meant to draw attention to the consequences for users, not to seek an "answer"

"Fighting the New York Times' lawyers' and experts' invasion of user privacy"
>Why has OpenAI collected and stored 20 million conversations (including "deleted chats")

To train the AI further. Obviously. Simple as.

Is there a technical limitation that prevents chat histories from being stored locally on the user's computer instead of being stored on someone else's computer(s)

Why do chat histories need to be accessible by OpenAI, its service partners and anyone with the authority to request them from OpenAI

If users want this design, as suggested by HN commenters, i.e., if users want their chat histories to be accessible to OpenAI, its service providers and anyone with authority to request them from OpenAI, then wouldn't it also be true that these users are not much concerned with "privacy"

If so, then why would OpenAI proclaim they are "fighting the New York Times' invasion of user privacy", knowing that NYT is prohibited from making the logs public and users generally do not care much about "privacy" anyway

The restrictions on plaintiff NYT's use of the logs are greater than the restrictions, if any,^1 on OpenAI's use of them

1. If any such restrictions existed, for example if OpenAI stated "We don't do X" in a "privacy policy" and people interpreted this as a legally enforceable restriction,^2 how would a user verify that the statement was true, i.e., that OpenAI has not violated the "restriction". Silicon Valley companies like OpenAI are highly secretive

2. As opposed to a statement by OpenAI of what OpenAI allegedly does not do. Compare with a potentially legally-enforceable promise such as "OpenAI will not do X". Also consider that OpenAI may do Y, Z, etc. and make no mention of it to anyone. As it happens Silicon Valley companies generally have a reputation for dishonesty

Presumably for cross-device interactivity. If I interact with ChatGPT on my phone, then open it on my desktop, I might be a bit frustrated that I can't get to the chat I was having on my phone previously.

OpenAI could store the chat conversation in an encrypted format that only you, the user, can decrypt, with the client-side determining the amount of previous messages to include for additional context, but there's plenty of user overhead involved in an undertaking like that (likely a separate decryption password would be needed to ensure full user-exclusive access, etc).
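
As a rough sketch of that idea, with the key derived from a user passphrase so only the client can decrypt what the server stores (the KDF parameters here are illustrative, not a vetted recommendation):

    import base64, os
    from cryptography.fernet import Fernet
    from cryptography.hazmat.primitives.kdf.scrypt import Scrypt

    def key_from_passphrase(passphrase: str, salt: bytes) -> bytes:
        kdf = Scrypt(salt=salt, length=32, n=2**15, r=8, p=1)
        return base64.urlsafe_b64encode(kdf.derive(passphrase.encode()))

    # Client side: encrypt before upload; the server only ever sees ciphertext.
    salt = os.urandom(16)  # stored alongside the ciphertext, not secret
    key = key_from_passphrase("correct horse battery staple", salt)
    ciphertext = Fernet(key).encrypt(b"my private chat history")

    # Later, on any of the user's devices, the same passphrase recovers the chats.
    plaintext = Fernet(key_from_passphrase("correct horse battery staple", salt)).decrypt(ciphertext)
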

I'd appreciate and use a feature like that, but I doubt most "average" users would care.

Syncthing could do that, if the software is designed to store locally.

Ever since I put the effort into Syncthing across all my devices (paired with restic on one of them for backup), I can't help but see how cross-device functionality and cloud sync are the Sysco hash potatoes that balloon Big Corp services' profit margins.

Not saying it's easy to set up. But when you get there it's so liberating and you wish all software was bring-your-own-network.

SyncThing syncs only when both clients are running at the same time. Nobody who edits a document on a website expects that they'll need to leave that browser window open in order to see the document in a different browser.

Am I missing something? Is this seriously a heated HN debate over "why does this website need to store the text it sends to people who view the website?"?

We're not talking about collaborative tooling, just a record of what you've asked an AI assistant. If it doesn't sync right away, it's not the end of the world. I find that's true with most things.

And the clients don't need to be running at the same time if you have a third device that's always on and receiving the changes from either (like a backup system). Eventually everything arrives. It's not as robust as what Google or iCloud gives you, but it's good enough for me.

Chatgpt.com is essentially a CRUD app. What you're saying here amounts to saying that it could conceivably have been designed to work dramatically differently from all other CRUD apps. And obviously that's true, but why would it be?

It's a website! You submit text, that you'll view or edit later, so the server stores it. How is that controversial to a HN audience?

Also:

> the clients don't need to be running at the same time if you have a third device that's always on

An always-on device that stores data in order to sync it to clients is a server.

> An always-on device that stores data in order to sync it to clients is a server.

Yes. But it's my server. I burden myself to operate it so that persistence does not come at the cost of control.

I think we might be tilting at different windmills here.

TBH it sounds like you're just imagining a very different service than the one openAI operates. You're imagining something where you send an input, the server returns an output - and after that they're out of the equation, and storing the output somewhere is a separate concern that could be left up to the user.

But the service they actually operate is functionally a collaborative document editor - the chat histories are basically rich text docs that you can view, edit, archive, share with others, and which are integrated with various server-side tools. And the document very obviously needs to be stored on the server to do all those things.

It's great that you'd enjoy a significantly worse product that requires you to also be familiar with a completely unrelated product.

For some reason, consumers have decided that they prefer a significantly better product that doesn't require any additional applications or technical expertise ¯\_(ツ)_/¯

Facebook Messenger tries to marry end-to-end encryption with multi-device access, and it's a horrible mess, with some messages not being delivered to some devices for hours, days, or ever.

I absolutely want OpenAI to keep all of my chats and I absolutely don't want them to share them ( voluntarily or by force) with any private agent.

I have exactly the same expectation of any document or communication platform. It's been long established as an accepted compromise between security and convenience.

> Is there a technical limitation that prevents chat histories from being stored locally on the user's computer

People access ChatGPT through different interfaces: Web, desktop app, their phones, tablets.

Therefore the conversations are stored on the servers. It's really not some hidden plot against users to steal their data. It's just how most users expect their apps to work.

Nonsense. It's easy to design an app where the server stores all information in an encrypted form. If OpenAI "cared about privacy" like this PR piece claims, they would do this. They don't because they (obviously) don't care and they (obviously) want the data for their purposes.
"Easy" does not mean "lowest cost" or "easiest". It's far far far easier to stor conversations as plain text and return them as is, instead of having to encrypt, rotate keys, etc. etc.

That's a tricky system to get right and maintain

(Please do not interpret this as a defense of OpenAI! I just think that we shouldn't trivialize the task of encrypting user data so that it's not visible to the provider).

> It's easy to design an app where the server stores all information in an encrypted form.

If you read the article, you'd see this:

> Our long-term roadmap includes advanced security features designed to keep your data private, including client-side encryption for your messages

Look, Proton somehow baked it into the design.

OpenAI didn't want to.

"We will add privacy features in the future" is hard to reconcile with "we are fighting for privacy now"
If I am sending HTTP POST requests using my own choice of software via the command line to some website, e.g., an OpenAI server, then I can save those requests on local storage. I can keep a record of what I have done. This history does not need to be saved by OpenAI and consequently end up being included in a document production when (not if) OpenAI is sued. But I cannot control what OpenAI does, that's their decision

For example, I save all the POST request bodies I send over the internet in the local forward proxy's log. I add logs to tarballs and compress with an algorithm that allows for searching the logs in the tarballs without decompressing them
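
As a rough illustration of that kind of client-side record keeping (this is not the proxy setup described above; the file name and layout are just examples):

    import json, time, urllib.request

    def post_and_log(url, payload, log_path="sent-requests.log"):
        # Keep a local record of exactly what was sent, before it leaves the machine.
        with open(log_path, "a") as log:
            log.write(json.dumps({"ts": time.time(), "url": url, "body": payload}) + "\n")
        req = urllib.request.Request(
            url,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        return urllib.request.urlopen(req).read()
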

It does not matter what "reason" or "excuse" or "explanation" anyone presents, technical or otherwise, for why OpenAI does what it does

The issue is what are the consequences

They're very valuable data, and it's convenient to log in to see a previous chat.

If you have ever played with the API, it's clear as day that the protocol itself is stateless.

If OpenAI hadn't used data from the NYT without permission in the first place this wouldn't have happened. That is the root cause of all this.

I'm glad the NYT is fighting them. They've infringed the rights of almost every news outlet but someone has to bring this case.

They infringed nothing. Two judges have already ruled that training on copyrighted data is fair use https://www.whitecase.com/insight-alert/two-california-distr...
But displaying regurgitations of very similar content may not be fair use. Fair use is a very delicate affair. One factor is whether the modified work poses as a market replacement for the original work.
The issue is, in part, a concern that ChatGPT responses are often just simple derivations of the original content in ways that wouldn’t be considered fair use.
Damn, you'd think OpenAI would have made this argument! Maybe there's something you're missing if this didn't save the day for them.
No I wouldn't since this is discovery. Maybe there's something you're missing here.
Primarily you seem to be missing the fact that the NYT case is about outputs, not just the training.
Hmm, this is an interesting framing of the lawsuit. If it's about outputs and not just training, are the outputs really orthogonal to the training?

In traditional computer systems, no, outputs are always a function of inputs. LLMs throw a wrench into this reasoning because they apply opaque statistics to a combination of training data and the user prompt to produce outputs, so the input-output relationship is much less clear, but fundamentally it still holds.

So then this case should also be about training. The question then is: did OpenAI intend to have these models be able to regurgitate large amounts of content? Or is it yet another emergent property that nobody anticipated?

I would suspect the latter, because if you view these models as a lossy compression of the whole Internet (cf "Blurry JPEG of the Web" article) it is a surprising outcome that they are able to losslessly reproduce so much of the original content.

So this might come down to intent. Maybe the NYT would need to show that OpenAI intentionally designed for this property, e.g. by rewarding reproductions of entire segments of the original content in its training. In which case, it's looking in the wrong place for evidence.

>Hmm, this is an interesting framing of the lawsuit.

First, it's not a "framing" of the lawsuit. A lawsuit is a number of claims made by one party against the other. In the two California cases, there were no decisions made on claims relating to LLM outputs. In the NYT case, there are claims relating to LLM outputs.

Yes, it could also be about training. But the discovery pertains to the outputs, which is the issue in this case. So even if you apply the holding that training is fair, which I don't see likely to happen in the district courts of the second circuit, you still don't get the result that the person I responded to suggested, which was that this should all be moot because of two decisions in two different cases in California which are not binding precedent in the 2nd circuit, and which also would not dispose of all of NYT's claims.

>So then this case should also be about training. The question then is: did OpenAI intend to have these models be able to regurgitate large amounts of content? Or is it yet another emergent property that nobody anticipated?

Intent is not a required element of copyright infringement, so you'd be wrong there. Plaintiffs can use intent to evidence willful infringement, which they are entitled to do in statutory damages cases (which this one is), and receive a damages multiplier. So OpenAI can't avoid liability based on their intent or a lack thereof. They can only, at best, use 'intent' to establish that NYT is not entitled to heightened damages.

>So this might come down to intent.

It's always amusing to see people apply completely made up rationales to legal cases based upon their own personal feelings about technologies while completely disregarding, lets say, 100 years of legal jurisprudence.

Oh I'm totally an armchair lawyer, so my ruminations were not grounded in laws or legal precedent :-) I do have some background on the patent side of things, where independent reinvention is also not a defence for infringement, but not so much in copyright, so this was educational.

However, has there been any case where the infringement was not only unintentional, but also unexpected?

That is, if you look at cases of unintentional infringement, these are typically cases where the act of reproduction of content was intentional, but there was a lack of awareness or confusion about the copyright protections of that content. (This paper was useful for background: https://www.law.uci.edu/faculty/full-time/reese/reese_innoce...)

But I could not find a case where the act of copying itself was non-intentional.

In this case, looking at how LLM training works and what LLMs do, it is surprising that it could reproduce the training content verbatim. The fact that it reproduced those outputs is undeniable, but how does existing law and jurisprudence apply to an unprecedented case like this where the reproduction was through some magic black box that nobody can decipher?

Two idiot judges.
You are the perfect case study for why having absolute faith in a system is a bad thing
People pretending to own Data that should belong to the commons is the larger issue.
Exactly. And the OpenAI corporate-speak acting like they give a shit about our best interests. Give me a break, Sam Altman. How stupid do you think everyone is?

They have proven that they are the most untrustworthy company on the planet

And this isn't AI fear speaking. This is me raging at Sam Altman for spreading so much fear, uncertainty, and doubt just to get investments. The rest of us have to suffer for the last two years, worrying about losing our jobs, only to find out the AGI lie is complete bullsh*t.

To me, no company has the customers’ best interests in mind. This whole thing is akin to when Apple was refusing to unlock phones for the FBI. Of course, Apple profits by having people think that they take privacy seriously, and they demonstrate it by protecting users’ privacy. Same thing here; OpenAI needs chats to have some expectation of privacy, especially because a large use case of AI is personal advice on things. So they are fighting to make sure it's true.
> To me, no company has the customers’ best interests in mind.

Lavabit opted to stop operating rather than give the FBI access to client emails.

https://archive.ph/20200915083857/https://www.nytimes.com/20...

Both OpenAI and NYT are bad. I don't know about NYT's privacy policy, because that's not really the industry they're in, but they did admit to fabricating a story that led to a now 2-year-long war, so.
Yes, but I think at least in this instance, OpenAI needs people to think that what they ask ChatGPT is private. They will have no business model if everyone thought that whatever private question they ask could fall into the hands of a media company and be used for anything. Also, at least when I signed up, you had to provide either a highly trusted email address or phone number to sign up, so your identity is definitely attached to whatever question you ask ChatGPT. They know how high the stakes are for them in this suit.
Which story is this?
The Story Behind the New York Times October 7 Exposé https://share.google/2HB4zPEGi7x3JTYyj
> Both OpenAI and NYT are bad.

Both -1 and -1,000,000 are negative numbers.

We need to be careful and mindful of our framing. Saying "X is bad" is a drastic oversimplification and not necessarily useful. Pointing at any one company and saying "bad" doesn't move the needle much in terms of figuring out how to steer us towards better outcomes. For that, we have to identify incentives and understand motivations.

You got downvoted for this? That many people are doing 'Leave Sam Altman alone!'? Kinda wild
It's weird. It went up and down and up and down. Controversial POV. But thanks for the support. Sam Altman's just too dishonest. It's been said time and time again by so many people, by Paul Graham, Ilya Sutskever, everybody's telling everybody he's dishonest. When are we going to wake up and get this guy out of there?
> They should sell their stuff by mail if they hate open culture so much.

Does open culture mean free? Are you willing to work for free? It is perfectly OK to sell goods in exchange for money, which is what NYT is doing.

I don't know why you're so upset about it. You can't walk into an Apple Store and expect to walk away with a free iPhone. Then why are you expecting to "walk" into nytimes' website and walk away with a free article?

The problem isn't that a news site is monetizing with a paywall. Totally fine, monetize how you want!

The problem is that prominent news orgs have lobbied governments all over the world to threaten google, apple, etc. for preferential treatment so these paywalled articles get prominent placement in various feeds and carousels and recommendation algos.

As a small publisher you'll never get this same preferential treatment if you throw up a paywall.

Creating the bizarre situation where big tech platforms feel they have to recommend paywalled articles from NYT/Bloomberg/etc, catfishing users right into a paywall when they click on headlines. This is essentially spam.

Open means open. Plenty of people make money in open culture in way less obnoxious ways than the NYT. What the NYT does is crapping on the place where I am, while building a wall and charging for passage to a place that stinks a little bit less. I don't mind them having such a place, or even charging for access. What I mind is making mine actively worse. Do whatever you want and charge however much you want. But for the love of God don't advertise in my face using free space that I inhabit. My attention costs way more than your content. Don't be surprised that when you do, I will completely disregard your wishful thinking about payment.

What I need is one checkbox in Google ecosystem (and/or my browser) that says "Never show links to paywalled content". Give me that and all my beef with NYT and similar garbage factories is gone in a blink of an eye.

That sounds like a feature request to Google, not an indictment of the NYT.
Same way that a feature request to the police is not an indictment of the criminals.
> everyone like me that won't ever pay them a cent.

Yeah, how terrible that you should be expected to spend /eleven minutes/ of the average U.S. tech worker's salary for a month of information. Perish the thought.

> They should sell their stuff by mail

You're in luck! You can subscribe to the New York Times by mail, just like you want.

At this point I would pay just as much to not see a single link to NYT content in my life. I can't pay for that? Well, that's my point.
You can write a simple extension for your web browser that will solve this for you, which is more in line with the hacker ethos than paying for it.
> I can't pay for that?

You don't have to. You only need one line of CSS in your browser's supplemental CSS file to hide them.

If you can't do that minimal amount of coding, you're on the wrong web site.

I wrote myself an extension to bypass all YouTube adverts and used it for years. I'm perfectly capable of evading NYT garbage once the fury exceeds the laziness. Still the issue remains. I'm not the only one bothered by paywalled links in search results, being linked from websites and suggested in feeds of mobile apps. A checkbox to filter them out was requested a long time ago. Never implemented.
> Still the issue remains.

No, it doesn't.

The issue isn't the Times, since you've admitted that you have a way to avoid them, and other people have suggested solutions.

The actual issue is that you enjoy being angry and expressing that anger in front of strangers on the internet, as if that somehow validates your anger, or makes you feel good, or gives you some other kind of reward for grinding your personal axe.

This is destructive behavior. I recommend introspection. Failing that, seek professional help.

Sure, that too. I just like fierce discussions about irrelevant, unchangeable things and they are easiest to find in the company of people with bland, mainstream opinions. Somehow they always try to defend them vehemently.

I have plenty of introspection. I know exactly what I am doing and why.

I wouldn't want to make it out like I think OpenAI is the good guy here. I don't.

But conversations people thought they were having with OpenAI in private are now going to be scoured by the New York Times' lawyers. I'm aware of the third party doctrine and that if you put something online it can never be actually private. But I think this also runs counter to people's expectations when they're using the product.

In copyright cases, typically you need to show some kind of harm. This case is unusual because the New York Times can't point to any harm, so they have to trawl through private conversations OpenAI's customers have had with their service to see if they can find any.

It's quite literally a fishing expedition.

I get the feeling, but that's not what this is.

NYTimes has produced credible evidence that OpenAI is simply stealing and republishing their content. The question they have to answer is "to what extent has this happened?"

That's a question they fundamentally cannot answer without these chat logs.

That's what discovery, especially in a copyright case, is about.

Think about it this way. Let's say this were a book store selling illegal copies of books. A very reasonable discovery request would be "Show me your sales logs". The whole log needs to be produced otherwise you can't really trust that this is the real log.

That's what NYTimes lawyers are after. They want the chat logs so they can do their own searches to find NYTimes text within the responses. They can't know how often that's happened and OpenAI has an obvious incentive to simply say "Oh that never happened".

And the reason this evidence is relevant is it will directly feed into how much money NYT and OpenAI will ultimately settle for. If this never happens then the amount will be low. If it happens a lot the amount will be high. And if it goes to trial it will be used in the damages portion assuming NYT wins.

The user has no right to privacy. The same as how any internet service can be (and have been) compelled to produce private messages.

>That's what NYTimes lawyers are after. They want the chat logs so they can do their own searches to find NYTimes text within the responses.

The trouble with this logic is NYT already made that argument and lost as applied to an original discovery scope of 1.4 billion records. The question now is about a lower scope and about the means of review, and proposed processes for anonymization.

They have a right to some form of discovery, but not to a blank check extrapolation that sidesteps legitimate privacy issues raised both in OpenAIs statement as well as throughout this thread.

Again, as I pointed out to you numerous times in this thread. OpenAI already represented to the court that the data was anonymized and that they can anonymize it, so you are significantly departing from the actual facts in your discussion here. There are no genuine privacy issues left here. The data is anonymous and it is under a protective order so it must be maintained confidentially.
> The user has no right to privacy

The correct term for this is prima facie right.

You do have a right to privacy (arguably) but it is outweighed by the interest of enforcing the rights of others under copyright law.

Similarly, liberty is a prima facie right; you can be arrested for committing a crime.

> enforcing the rights of others under copyright law

I certainly do not care about copyright more than my own privacy, and I certainly don't find that interest to be the public's interest, though perhaps it's the interest of legacy corporations and their lobbyists.

Seems to me my right to privacy is far more important than their right to copyright enforcement.
Have you read OpenAI's terms of service? Which part is being violated by producing anonymized logs in response to discovery? OpenAI's ToS state that they will produce your data in response to discovery. What's not clicking for you?
> You do have a right to privacy (arguably) but it is outweighed by the interest of enforcing the rights of others under copyright law.

What governs or codifies that? I would have expected that there would need to be some kind of specific overriding concern(s) that would need to apply in order to violate my (even limited) expectation of privacy, not just enforcing copyright law in general.

E.g. there's nothing resembling "probable cause" to search my own interactions with ChatGPT for such violations. On what basis can that be justified?

Is there any evaluation of which right or which harm is larger? It seems like the idea that one outweighs another is arbitrary. Is there a principled thing behind it?
That's what the court is for. Weighing the different arguments and applying precedents
>NYTimes has produced credible evidence that OpenAI is simply stealing and republishing their content

They shouldn't have any rights to data after it's released.

>That's a question they fundamentally cannot answer without these chat logs.

They are causing more damage than anything chatGPT could have caused to NYT. Privacy needs to be held higher than corporate privilege.

>Think about it this way. Let's say this were a book store selling illegal copies of books.

Think of it this way, no book should be illegal.

>They can't know how often that's happened and OpenAI has an obvious incentive to simply say "Oh that never happened".

NYT glazers do more to uphold OpenAI as a privacy respecting platform than OpenAI has ever done.

>If this never happens then the amount will be low.

Should be zero, plus compensation to the affected OpenAI users from NYT.

>The user has no right to privacy.

And this needs to be remedied immediately.

>The same as how any internet service can be (and have been) compelled to produce private messages.

And this needs to be remedied immediately.

I get that you're mad, and rightly should be for an invasion of your privacy, but the NYT would be foolish to use any of your data for anything other than this lawsuit, and to not delete it afterwards, as per their request.

They can't use this data against any individual, even if they explicitly asked, "How do I hack the NYT?"

The only potential issue is them finding something juicy in someone's chat, that they could publish as a story; and then claiming they found out about this juicy story through other means, (such as a confidential informant), but that's not likely an issue for the average punter to be concerned about.

>The only potential issue is them finding something juicy in someone's chat, that they could publish as a story; and then claiming they found out about this juicy story through other means, (such as a confidential informant)

Which is concerning since this is a news organization that's getting the data.

Let's say they do find some juicy detail and use it, then what? Nothing. It's not like you can ever fix a privacy violation. Nobody involved would get a serious punishment, like prison time, either.

>Let's say they do find some juicy detail and use it, then what? Nothing. It's not like you can ever fix a privacy violation. Nobody involved would get a serious punishment, like prison time, either.

There are no privacy violations. OpenAI already told the court they anonymized it. What they say in court and what they say in the blog is different and so many people here are (unfortunately) falling for it!

It's not credible. Using AI to regurgitate news articles is not a good use of the tool, and it is not credible that any statistically significant portion of their user base is using the tool for that.
> Think about it this way. Let's say this were a book store selling illegal copies of books. A very reasonable discovery request would be "Show me your sales logs". The whole log needs to be produced otherwise you can't really trust that this is the real log.

Your claim doesn’t hold up, my friend. It’s inaccurate because nobody archives an entire dialogue with a seller for the record, and you certainly don’t have to show identification to purchase a book.

Even if OpenAI is reproducing pieces of NYT articles, they still have a difficult argument because in no way is it a practical means of accessing paywalled NYT content, especially compared to alternatives. The entire value proposition of the NYT is news coverage, and probably 99.9% of their page views are from stories posted so recently that they aren't even in the training set of LLMs yet. If I want to reproduce a NYT story from an LLM it's a prompt engineering mess, and I can only get old ones. On the other hand I can read any NYT story from today by archiving it: https://archive.is/5iVIE. So why is the NYT suing OpenAI and not the Internet Archive?
OpenAI is not allowed to reproduce the NYT's articles; that's copyright infringement. It does not really matter if it is a practical thing or not; that would only go to damages, not liability.
What do you think it is you are liable for?
I'm confused. I don't think I'm liable for anything. I am not OpenAI.
> NYTimes has produced credible evidence that OpenAI is simply stealing and republishing their content. The question they have to answer is "to what extent has this happened?"

Credible to whom? In their supposed "investigation", they sent a whole page of text and complex pre-prompting and still failed to get the exact content back word for word. Something users would never do anyways.

And that's probably the best they've got as they didn't publish other attempts.

  • 1 day ago
Agreed, they could carefully coerce the model to more or less output some of their articles, but the premise that users were routinely doing this to bypass the paywall is silly.
Especially when you can just copy paste the url into Internet Archive and read it. And yet they aren't suing Internet Archive.
  • acdha · 1 day ago
Copyright law isn’t binary and has long-running allowances for fair use, which take into consideration factors like scale, revenue, and whether it replaces the original. As a real non-profit, the Internet Archive is not selling its copies of the NYT and it’s always giving full credit to the source. In contrast, OpenAI charges for ChatGPT's output, and while it may give citations, that’s not a given.
Let's be real, they are suing OpenAI because they have way more money than the Internet Archive and they would be happy with a cut
> The user has no right to privacy. The same as how any internet service can be (and have been) compelled to produce private messages.

The legal term is "expectation of privacy", and it does exist, albeit increasingly weakly in the US. There are exceptions to that, such as a subpoena, but that doesn't mean anyone can subpoena anything for any reason. There has to be a legal justification.

It's not clear to me that such a justification exists in this case.

That's why there is someone trained in the law (the judge) to make that determination.
You don't hate the media nearly enough.

"Credible" my ass. They hired "experts" who used prompt engineering and thousands of repetitions to find highly unusual and specific methods of eliciting text from training data that matched their articles. OpenAI has taken measures to limit such methods and prevent arbitrary wholesale reproduction of copyrighted content since that time. That would have been the end of the situation if NYT was engaging in good faith.

The NYT is after what they consider "their" piece of the pie. They want to insert themselves as middlemen - pure rent seeking, second hander, sleazy lawyer behavior. They haven't been injured, they were already dying, and this lawsuit is a hail mary attempt at grifting some life support.

Behavior like that of the NYT is why we can't have nice things. They're not entitled to exist, and by engaging in behavior like this, it makes me want them to stop existing, the faster, the better.

Copyright law is what you get when a bunch of lawyers figure out how to encode monetization of IP rights into the legal system, having paid legislators off over decades, such that the people that make the most money off of copyrights are effectively hoarding those copyrights and never actually produce anything or add value to the system. They rentseek, gatekeep, and viciously drive off any attempts at reform or competition. Institutions that once produced valuable content instead coast on the efforts of their predecessors, and invest proceeds into lawsuits, lobbying, and purchase of more IP.

They - the NYT - are exploiting a finely tuned and deliberately crafted set of laws meant to screw actual producers out of percentages. I'm not a huge OpenAI fan, but IP laws are a whole different level of corrupt stupidity at the societal scale. It's gotcha games all the way down, and we should absolutely and ruthlessly burn down that system of rules and salt the ground over it. There are trivially better systems that can be explained in a single paragraph, instead of requiring books worth of legal code and complexities.

I'm not a fan of NYT either, but this feels like you're stretching for your conclusion:

> They hired "experts" who used prompt engineering and thousands of repetitions to find highly unusual and specific methods of eliciting text from training data that matched their articles....would have been the end of the situation if NYT was engaging in good faith.

I mean, if I was performing a bunch of investigative work and my publication was considered the source of truth in a great deal of journalistic effort and publication of information, and somebody just stole my newspaper off the back of a delivery truck every day and started rewriting my articles, and then suddenly nobody read my paper anymore because they could just ask chatgpt for free, that's a loss for everyone, right?

Even if I disagree with how they editorialize, the Times still does a hell of a lot of journalism, and chatgpt can never, and will never be able to actually do journalism.

> they want to insert themselves as middlemen - pure rent seeking, second hander, sleazy lawyer behavior

I'd love to hear exactly what you mean by this.

Between what and what are they trying to insert themselves as middlemen, and why is chatgpt the victim in their attempts to do it?

What does 'rent seeking' mean in this context?

What does 'second hander' mean?

I'm guessing that 'sleazy lawyer' is added as an intensifier, but I'm curious if it means something more specific than that as well, I suppose.

> Copyright law....the rest of it

Yeah. IP rights and laws are fucked basically everywhere. I'm not smart enough to think of ways to fix it, though. If you've got some viable ideas, let's go fix it. Until then, the Times kinda need to work with what we've got. Otherwise, OpenAI is going to keep taking their lunch money, along with every other journalist's on the internet, until there's no lunch money to be had from anyone.

> then suddenly nobody read my paper anymore

This is the part that the Times won't talk about, because people stopped reading their paper long before AI, and they haven't been able to point to any credible harm in terms of reduced readership as a result of OpenAI launching. They just think that people might be using ChatGPT to read the New York Times without paying. But it's not a very good hypothesis because that's not what ChatGPT is good at.

It's like the people filing the lawsuit don't really understand the technology at all.

> my publication was considered the source of truth

Their publication is not considered the source of truth, at least not by anyone with a brain.

They are still considered a paper of record, but I chose to use a hypothetical outfit because I don’t love the Times myself but I believe the argument to be valid.

I’m not interested in arguing about whether or not they deserve to fail, because that whole discussion is orthogonal to whether OpenAI is in the wrong.

If I’m on my deathbed, and somebody tries to smother me, I still hope they face consequences

[flagged]
> The user has no right to privacy. The same as how any internet service can be (and have been) compelled to produce private messages.

This is nonsense. I’ve personally been involved in these things, and fought to protect user privacy at all levels and never lost.

You've successfully fought a subpoena on the basis of a third party's privacy? More than once? I'd love to hear more.
> In copyright cases, typically you need to show some kind of harm.

NYT is suing for statutory copyright infringement. That means they only need to demonstrate the copyright infringement, since the infringement alone is considered harm; the actual harm only matters if you're suing for actual damages.

This case really comes down to the very unsolved question of whether or not AI training and regurgitation is copyright infringement, and if so, whether it's fair use. The actual ways the AI is being used are thus very relevant for the case, and totally within the bounds of discovery. Of course, OpenAI has also been engaging in this lawsuit with unclean hands in the first place (see some of their earlier discovery dispute fuckery), and they're one of the companies with the strongest "the law doesn't apply to us because we're AI and big tech" swagger.

NYT doesn't care about regurgitation. When it was doable, it was spotty enough that no one would rely on it. But now the "trick" doesn't even work anymore (you would paste the start of an article and ChatGPT would continue it).

What they want is to kill training, and moreover, prevent the loss of being the middle-man between events and users.

  • sfink · 1 day ago
> What they want is to kill training, and moreover, prevent the loss of being the middle-man between events and users.

So... they want to continue reporting news, and they don't want their news reports to be presented to users in a place where those users are paying someone else and not them. How horrible of them?

If NYT is not reporting news, then NYT news reports will not be available for AIs to ingest. They can perhaps still get some of that data from elsewhere, perhaps from places that don't worry about the accuracy of the news (or intentionally produce inaccurate news). You have to get signal from somewhere, the noise alone isn't enough, and killing off the existing sources of signal (the few remaining ones) is going to make that a lot harder.

The question is, does journalism have a place in a world with AIs, and should OpenAI be the one deciding the answer to that question?

The problem is that the publishing industry seems to think their job is to print ink on paper, and they reluctantly admit that this probably also involves putting pixels on a screen.

They're hideously anti-tech and they completely ignore technological advancement when thinking about the scope of their product. Instead of investing millions of dollars in developing their own AI solutions that are the New York Times answer machine, they pay those millions of dollars to lawyers and sue people building the answer machines. It's entirely the wrong strategy, it's regressive, and yes, they are to blame for it.

The biggest bug I've observed in my life is that people think technology is its own sector when really it's a cross-cutting concern that everybody needs to be thinking about.

It's easy to see a future where primary sources post their information directly online (already largely the case) and AI agents make tailored, interactive news for their users.

Sure, there may still be investigative journalism and long form, but those are hardly the money makers.

Also, just like SWEs, writers have that same "do I have a place in the future?" anxiety in the back of their heads.

The media is very hostile towards AI, and the threat is on multiple levels.

> prevent the loss of being the middle-man between events and users

I'm confused by this phrase. I may be misreading but it sounds like you're frustrated, or at least cynical about NYT wanting to preserve their business model of writing about things that happen and selling the publication. To me it seems reasonable they'd want to keep doing that, and to protect their content from being stolen.

They certainly aren't the sole publication of written content about current events, so calling them "the middle-man between events and users" feels a bit strange.

If your concern is that they're trying to prevent OpenAI from getting a foot in the door of journalism, that confuses me even more. There are so, so many sources of news: other news agencies, independent journalists, randos spreading word-of-mouth information.

It is impossible for ChatGPT to take over any aspect of being a "middle-man between events and users" because it can't tell you the news. It can only resynthesize journalism that it's stolen from somewhere else, and without stealing from others, it would be worse than the least reliable of the above sources. How could it ever be anything else?

This right here feels like probably a good understanding of why NYT wants openai to keep their gross little paws off their content. If I stole a newspaper off the back of a truck, and then turned around and charged $200 a month for the service of plagiarizing it to my customers, I would not be surprised if the Times's finest lawyers knocked on my door either.

Then again, I may be misinterpreting what you said. I tend to side with people who sue LLM companies for gobbling up all their work and regurgitating it, and spend zero effort trying to avoid that bias

> preserve their business model of writing about things that happen and selling the publication. To me it seems reasonable they'd want to keep doing that

Be very wary of companies that look to change the landscape to preserve their business model. They are almost always regressive in trying to prevent the emergence of something useful and new because it challenges their revenue stream. The New York Times should be developing their own AI and should not be ignoring the march of technological progress, but instead they are choosing to lawyer up and use the legal system to try to prevent progress. I don't have any sympathy for them; there is no right to a business model.

This feels less like changing the landscape and more like trying to stop a new neighbor from building a four-level shopping complex in front of your beach-front property while also strip-mining the forest behind.

As for whether the Times should be developing their own LLM bot, why on earth would they want that?

It sounds like the defendant would much prefer middle-men who do not have the resources to enforce copyright.
> prevent the loss of being the middle-man between events and users.

OpenAI is free to do its own reporting. The NY Times is nowhere near trying to prevent others from competing as middlemen.

It's more than being a middleman, right? If visits to the NYT drop, they get less ad revenue and their ability to do business goes away. On the other hand, if they demand licensing fees, they'll just be marginalized by other news sources anyway.
Notably absent from their complaint is any suggestion that they've been harmed by a reduction in readership as a result of OpenAI's emergence.
> This case is unusual because the New York Times can't point to any harm

It helps to read the complaint. If that was the case, the case would have been subject to a Rule 12(b)(6) (failure to state a claim for which relief can be granted) challenge and closed.

Complaint: https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...

See pages 60ff.

My observation is that that section does not articulate any harm. It _claims_ harm, but doesn't actually explain what the harm is. Reduced profits? Lower readership? All they say is "OpenAI violated our copyrights, and we deserve money."

> 167. As a direct and proximate result of Defendants’ infringing conduct alleged herein, The Times has sustained and will continue to sustain substantial, immediate, and irreparable injury for which there is no adequate remedy at law. Unless Defendants’ infringing conduct is enjoined by this Court, Defendants have demonstrated an intent to continue to infringe the copyrighted works. The Times therefore is entitled to permanent injunctive relief restraining and enjoining Defendants’ ongoing infringing conduct.

> 168. The Times is further entitled to recover statutory damages, actual damages, restitution of profits, attorneys’ fees, and other remedies provided by law.

They're simply claiming harm, nothing more. I want to see injuries, scars, and blood if there's harm. As far as I can tell, the NYT was on the ropes long before AI came along. If they could actually articulate any harm, they wouldn't need to read through everyone's chats.

> As a direct and proximate result of Defendants’ infringing conduct alleged herein, The Times has sustained and will continue to sustain substantial, immediate, and irreparable injury for which there is no adequate remedy at law. Unless Defendants’ infringing conduct is enjoined by this Court, Defendants have demonstrated an intent to continue to infringe the copyrighted works. The Times therefore is entitled to permanent injunctive relief restraining and enjoining Defendants’ ongoing infringing conduct.

This is boilerplate language in a claim seeking injunctive relief. In contract law in law school, you learn there's a historical difference between cases at law (where the only remedy is money) and cases in equity (where the court can issue injunctions). If you want to stop someone from violating your rights, you claim "irreparable injury" (that is, money isn't enough) and ask for the court in equity to issue an injunction.

> It _claims_ harm, but doesn't actually explain what the harm is. Reduced profits? Lower readership? All they say is "OpenAI violated our copyrights, and we deserve money."

Copyright violation, in and of itself, constitutes a judicially cognizable injury. It's a violation of a type of property right - that is, the right to exclude others from using your artistic works without your permission. The Copyright Act specifies that victims of copyright infringement are not only entitled to an injunction, but also to statutory damages as well as compensatory damages to be determined by a jury. See 17 U.S.C. § 504.

Similarly, you don't have to claim a specific injury in a garden-variety trespass action. The violation of your property rights is enough.

Very much appreciate the clarification and nuance here. I understand that legally they don't have to provide any of this detail, but I'm also somewhat astonished that there doesn't appear to be any evidence that they've been harmed in any way other than them claiming that they are.
It’s because 1/the damages aren’t clearly articulable and would be speculative at the time of filing, and 2/they don’t have to claim the specific nature of the injury at this point in the case.
> sustain substantial, immediate, and irreparable

Furthermore, any alleged injury is absolutely reparable. How many times did OpenAI replicate their content and how many page views did they lose to it? Very reparable monetary damages, if it did in fact occur (and I'm pretty sure it didn't).

It's a part of privacy policy boilerplate that if a company is compelled by the courts to give up its logs it'll do it. I'm sure all of OpenAI's users read that policy before they started spilling their guts to a bot, right? Or at least had an LLM summarize it for them?
This is it, isn't it? For any technology, I don't think anyone should have the expectation of privacy from lawyers if the company that has your data is brought to court.
The original lawsuit has lots of examples of ChatGPT (3.5? 4?) regurgitating article...snippets. They could get a few paragraphs with ~80-90% perfect replication. But certainly not full articles, with full accuracy.

This wasn't solid enough for a summary judgment, and it seems the labs have largely figured out how to stop the models from doing this. So it looks like NYT wants to comb all user chats rather than pay a team of people tens of thousands a day to try and coax articles out of ChatGPT-5.

No doubt. I’m sure NYT sees an opportunity to buy a few more years of life support by pickpocketing the conductor of the AI gravy train. When Sam Altman and the Sulzbergers fight though, as a normal person, my hope is that they destroy each other.

I think the winners are Chinese (and by extension OSS) models, as they can ignore copyright. A net win, I think.

Yeah, everyone else in the comments so far is acting emotionally, but --

As a fan and DAU of both OpenAI and the NYT, this is just a weird discovery demand and there should be another pathway for these two to move forward in this case (NYT to get some semblance of understanding, OAI protecting end-user privacy).

It sounds like the alternate path you're suggesting is for NYT to stop being wrong and let OpenAI continue being right, which doesn't sound much like a compromise to me.
> This would allow them to access millions of user conversations that are unrelated to the case

It feels like the NYT is really fishing for inside information on how GPT is used so they can run statistical analysis and write articles about it. E.g., if they find examples of racism, they can get some great articles about how racism is rampant on GPT or something.

> having with OpenAI in private

I don't believe that OpenAI, or any American corporation, has the wherewithal to actually maintain _your_ privacy in the face of _their_ profitability.

> typically you need to show some kind of harm.

You copied my material without my permission. I've been harmed. That right is independent of pricing. Otherwise Napster would never have generated legal cases.

> It's quite literally a fishing expedition.

It's why American courts are awesome.

It is better if it is out in the open than having just some select few diabolical organizations with access to it.
To show harm they need proof; this is the point of the lawsuit. They have sufficient evidence that OpenAI was scraping the web and the NY Times.

When Altman says "They claim they might find examples of you using ChatGPT to try to get around their paywall." he is blatantly misrepresenting the case.

https://smithhopen.com/2025/07/17/nyt-v-openai-microsoft-ai-...

"The lawsuit focuses on using copyrighted material for AI training. The NYT says OpenAI and Microsoft copied vast amounts of its content. They did this to build generative AI tools. These tools can output near-exact copies of NYT articles. Therefore, the NYT argues this breaks copyright laws. It also hurts journalism by skipping paywalls and cutting traffic to original sites. The complaint shows examples where ChatGPT mimics NYT stories closely. This could lead to money loss and harm from AI errors, called hallucinations."

This has nothing to do with the users, it has everything to do with OpenAI profiting off of pirated copyrighted material.

Also, Altman is getting scared because the NY Times proved to the judge that ChatGPT copied many articles:

"2025 brings big steps in the case. On March 26, 2025, Judge Sidney Stein rejected most of OpenAI’s dismissal motion. This lets the NYT’s main copyright claims go ahead. The judge pointed to “many” examples of ChatGPT copying NYT articles. He found them enough to continue. This ruling dropped some side claims, like unfair competition. But it kept direct and contributory infringement, plus DMCA breaches."

> The lawsuit focuses on using copyrighted material for AI training

Well that's going to go pretty poorly for them considering it has already been ruled fair use twice: https://www.whitecase.com/insight-alert/two-california-distr...

On the other hand, distributing copies of NYT content is actually a breach of copyright, but only if the NYT can prove it was actually happening.

It's really interesting living through this revolution because it's pretty obvious to me that the outcome here needs to be that training is fair use, pirating materials you train on is not going to end up being okay, and the user of the AI tool will be responsible for whether or not the resulting work is infringing. AI tools that are predominantly designed for infringing use cases will of course be ruled against.

I feel like this is all so blindingly obvious and yet I feel like it's going to take us decades to get there. I guess the wheels of justice turn slowly.

Training has sometimes been held to be fair use under certain circumstances, but in determining fair use, one of the four factors that is considered is how it affects the market for the work being infringed. I would expect that determining to what degree it's regurgitating the New York Times' content is part of that analysis.
>But conversations people thought they were having with OpenAI in private

...had never been private in the first place.

Not only is the data used for refining the models, OpenAI had also shariah policed plenty of people for generating erotica.

This is about private chats, which are not used for training and only stored for 30 days.

Also, you need to understand that for huge corps like OpenAI, lying in your ToS will do orders of magnitude more damage to your brand than whatever you would gain through training on <1% more user chats. So no, they are not lying when they say they don't train on private chats.

> Also, you need to understand that for huge corps like OpenAI, lying in your ToS will do orders of magnitude more damage to your brand than whatever you would gain

Is this true? I can’t recall anything like this (look at Ashley Madison which is alive and well)

I think it is hard to say because OpenAI is still heavily in development and working out their business model (and a reasonable complaint is that it is crazy to label them a massive success without seeing how they actually work when they need to make a profit).

But, all that aside, it seems that OpenAI is aiming to be bigger and more integrated into the day-to-day life of the average person than Ashley Madison, right?

It's not national news when a company is found to be doing what they say they are doing.
> It's not national news when a company is found to be doing what they say they are doing.

You said there would be ‘orders of magnitude’ of brand damage. What is the proof?

Yeah, I don't get why more people don't understand this: why would you think your conversation was private when it wasn't actually private? Have you not been paying attention?
> OpenAI had also shariah policed plenty of people for generating erotica.

That framing is rhetorically brilliant if you think about it. I will use that more. Chat Sharia Law for Chat Control. Mass Sharia Surveillance from Flock, etc.

100% agreed. In the time you wrote this, I also posted: https://news.ycombinator.com/item?id=45901054

I felt quite some disappointment with the comments I saw on the thread at that time.

I've noticed a pattern of companies writing their customers open letters asking them to do their contract negotiations for them. First it was ESPN vs. YouTube (not watching MNF this week was the best 3 hours I've ever saved, sorry advertisers). Now it's OpenAI vs. The New York Times.

Little do they know that I care very little for either party and enjoy seeing both of them squirm. You went to business school, not me. Work it out.

In this case, it's awfully suspicious that OpenAI is worried about The New York Times finding literal passages in their articles that ChatGPT spits out verbatim. If your AI doesn't do that, like you say, then why would it be a problem to check?

Finally, both parties should find a neutral third party. The neutral third party gets the full text of every NYT article and ChatGPT transcript, and finds the matches. NYT doesn't get ChatGPT transcripts. OpenAI doesn't get the full text of every NYT article (even though they have to already have that). Everyone is happy. If OpenAI did something illegal, the court can find out. If they didn't, then they're safe. I think it would be very fair.

(I take the side of neither party. I'm not a huge fan of training language models on content that wasn't licensed for that purpose. And I'm not a huge fan of The NYT's slide to the right as they cheerlead the end of the American experiment.)

> Finally, both parties should find a neutral third party.

That's next to impossible. And if that party fails to be neutral you've just generated a new lawsuit entangled with this one.

The current procedure is each side gets their own expert. The two experts can duke it out and the crucible of the courtroom decides who was more credible.

That's fair. I understand why OpenAI wouldn't want to give anyone transcripts (as a user, I frankly wouldn't even want OpenAI to keep my transcripts), and I understand why the NYT doesn't want to give OpenAI all their articles.

Maybe the NYT needs to bloom-filter-ify their articles in 10 word chunks (or something, I don't know enough about linguistics to tell you what's unique enough for copyright infringement or to prove "copying"), have OpenAI search transcripts, and turn over the matches. That limits the scope of the search dramatically, but is still invasive.
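
A minimal sketch of that idea, assuming 10-word chunks and an ordinary Bloom filter; the corpus, sizes, and function names below (ngrams, BloomFilter, transcript_matches) are made up for illustration, not anything either party has actually proposed:

    import hashlib

    def ngrams(text, n=10):
        # Overlapping n-word chunks, lowercased so trivial formatting changes don't matter.
        words = text.lower().split()
        return [" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 0))]

    class BloomFilter:
        # A plain Bloom filter: membership checks can give false positives,
        # but the filter itself never exposes the original article text.
        def __init__(self, size_bits=1 << 24, num_hashes=4):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item):
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    # NYT side: hash every 10-word chunk of a (hypothetical) article corpus.
    articles = ["example article text goes here ..."]
    bf = BloomFilter()
    for article in articles:
        for chunk in ngrams(article):
            bf.add(chunk)

    # OpenAI side: flag only transcripts containing a matching chunk,
    # so only those candidates, not all 20 million logs, would change hands.
    def transcript_matches(transcript):
        return any(chunk in bf for chunk in ngrams(transcript))

The chunk length and false-positive rate would still have to be negotiated, and whether a court would accept anything short of the raw logs is a separate question entirely.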

Two orgs helmed by supporters of authoritarianism (to put it nicely):

let them fight.

Yup, well said. Companies want me to have an emotional investment in them. I don't.
An incredibly cynical attempt at spin from a former non-profit that renounced its founding principles. A class act, all around.
  • meV1 · 17 hours ago
It's ridiculous for OpenAI to attempt to claim some moral high ground here. They're a company that has demonstrated zero respect for the copyright or data privacy regulations of other organisations. I think they take users' dignity and rights with a grain of salt.

Their statements are all aspirational, "we're working toward de-identifying" etc. They've built one of the most powerful AIs ever seen and now they're claiming it's difficult to delete, de-identify / anonymize. Maybe they should ask their AI to do it :-)

It's impossible to take this company seriously. They're nothing but a carny barker stealing everything of value that they can lay their (creepy) hands on.

The “aspirational” language is what really stood out to me as well. “We’re building our privacy and security protections to match the responsibility” and “we are accelerating our security and privacy roadmap” and “our long term roadmap includes advanced security features designed to keep your data private, including client-side encryption” (what does this have to do with what OpenAI stores server-side?) and “we will build.” If OpenAI cared that much, then the privacy and security protections should be baked in rather than “tacked on.” Their statement makes me feel even less optimistic in their abilities to protect information.
This is about as genuine as Google saying anything about privacy.

Both companies are clearly wrong here. There is a small part of me that kinda wants OpenAI to lose this, just so maybe it will be a wake-up call to people putting way too much personal information into these services? Am I too hopeful here that people will learn anything...

Fundamentally I agree with what they are saying though, just don't find it genuine in the slightest coming from them.

It's clearly propaganda. "Your data belongs to you." I'm sure the ToS says otherwise, as OpenAI likely owns and utilizes this data. Yes, they say they are working on end-to-end encryption (whatever that means when they control one end), but that is just a proposal at this point.

Also their framing of the NYT intent makes me strongly distrust anything they say. Sit down with a third party interviewer who asks challenging questions, and I'll pay attention.

"Your data belongs to you" but we can take any of your data we can find and use it for free for ever, without crediting you, notifying you, or giving you any way of having it removed.
It's owned by you, but OpenAI has a "perpetual, irrevocable, royalty-free license" to use the data as they see fit.
We can even download it illegally to train our models on it!
Wow, it's almost like privately-managed security is a joke that just turns into de facto surveillance at scale.
>your data belongs to you

…”as does any culpability for poisoning yourself, suicide, and anything else we clearly enabled but don’t want to be blamed for!”

Edit: honestly I’m surprised I left out the bit where they just indiscriminately scraped everything they could online to train these models. The stones to go “your data belongs to you” as they clearly feel entitled to our data is unbelievably absurd

  • gruez · 1 day ago
>…”as does any culpability for poisoning yourself, suicide, and anything else we clearly enabled but don’t want to be blamed for!”

Should Walmart be "culpable" for selling rope that someone hanged themselves with? Should Google be "culpable" for returning results about how to commit suicide?

There are current litigation efforts to hold Amazon liable for suicides committed by, in particular, self-poisoning with high-purity sodium nitrite, which, in low concentrations, is used as a meat-curing agent.

A 2023 lawsuit against Amazon for suicides with sodium nitrite was dismissed but other similar lawsuits continue. The judge held that Amazon, “… had no duty to provide additional warnings, which in this case would not have prevented the deaths, and that Washington law preempted the negligence claims.“

That depends. Does the rope encourage vulnerable people to kill themselves and tell them how to do it? If so, then yes.
Do you know what happens when you Google how to commit suicide?
  • gruez · 1 day ago
The same thing that happens with ChatGPT? I.e., if you do it in an overt way you get a canned suicide prevention result, but you can still get the "real" results if you try hard enough to work around the safety measures.
Except Google will never encourage you to do it, unlike the sycophantic Chatbot that will.
The moment we learned ChatGPT helped a teen figure out not just how to take their own life but how to make sure no one can stop them mid-act, we should've been mortified and had a discussion.

But we also decided via Sandy Hook that children can be slaughtered on the altar of the second amendment without any introspection, so I mean...were we ever seriously going to have that discussion?

https://www.nbcnews.com/tech/tech-news/family-teenager-died-...

>Please don't leave the noose out… Let's make this space the first place where someone actually sees you.

How is this not terrifying to read?

An exec loses its wings?
Actually, the first result is the suicide hotline. This is at least true in the US.
My point is, clearly there is a sense of liability/responsibility/whatever you want to call it. Not really the same as selling rope; rope doesn't come with suicide warnings.
This is as unproductive as "guns don't kill people, people do." You're stripping all legitimacy and nuance from the conversation with an overly simplistic response.
  • gruez · 1 day ago
>You're stripping all legitimacy and nuance from the conversation with an overly simplistic response.

An overly simplistic claim only deserves an overly simplistic response.

What? The claim is true. The nuance is us discussing if it should be true/allowed. You're simplifying the moral discussion and overall just being rude/dismissive.

Comparing rope and an LLM comes across as disingenuous. I struggle to believe that you believe the two are comparable when it comes to the ethics of companies and their impact on society.

> Comparing rope and an LLM comes across as disingenuous.

What makes you feel that? Both are tools, both have a wide array of good and bad uses. Maybe it'd be clearer if you explained why you think the two are incomparable except in cases of disingenuousness?

Remember that things are only compared when they are different -- you wouldn't often compare a thing to itself. So, differences don't inherently make things incomparable.

> I struggle to believe that you believe the two are comparable when it comes to the ethics of companies and their impact on society.

I encourage you to broaden your perspectives. For example: I don't struggle to believe that you disagree with the analogy, because smart people disagree with things all the time.

What kind of a conversation would such a rude, dismissive judgement make, anyways? "I have judged that nobody actually believes anything that disagrees with me, therefore my opinions are unanimous and unrivaled!"

A rope isn’t going to tell you to make sure you don’t leave it out on your bed so your loved ones can’t stop you from carrying out the suicide it helped talk you in to.
This is a good observation! The LLM can tell you to kill yourself. The rope can actually help you do it.
Ok
  • 1 day ago
You are 100% right, a rope likely isn't going to tell you anything. There's one of those differences I mentioned which makes comparisons useful. We could probably name a few differences!

So, what makes you think comparing the 2 tools is invalid? You just compared them yourself, and I don't think you were being disingenuous.

Just because I used italics to emphasize something one time doesn’t mean you get to talk to me like that. I am not a child and you’re being unnecessarily patronizing.

I let it slide in the previous comment and gave you the benefit of the doubt despite what I saw but this comment clearly illustrates how disrespectful you’re being.

Have a good rest of your day man

I think you, as you put it, rudely, patronizingly, disrespectfully responded to the wrong post: mine was a polite one about a comparison between 2 tools and your statement that the comparing posters must be acting in bad faith (whereas you, with your differing opinion, are acting in good faith).

I'm not interested in focusing on tone-policing, since it is one of the lowest forms of debate and usually avoids the substance of the matter. So, I'm happy to return to our discussion about the 2 tools anytime you want to review my previous post and respond to the substance of it. If you're not into that, have a nice day comfortable in the knowledge that I've already turned the other cheek.

Fine let’s not police tone and say it straight: you know the rules here, so stop being a jerk and leave me alone. I don’t want to talk to you anymore.
I got one sentence in and thought to myself, "This is about discovery, isn't it?"

And lo, complaints about plaintiffs started before I even had to scroll. If this company hadn't willy-nilly done everything they could to vacuum up the world's data, wherever it may be, however it may have been protected, then maybe they wouldn't be in this predicament.

How do you feel about Google vacuuming up the world's data when they created a search engine? I feel like everybody just ignores this because Google was ostensibly sending traffic to the resulting site. The actual infringement of scraping should be identical between OpenAI and Google. Why is nobody complaining about Google scraping their sites? Is it only because they're getting paid off to not complain?

Everybody acts like this is a moral argument when really it's about whether or not they're getting a piece of the pie.

At the time Google created a search engine, they were not showing the data themselves; they were pointing to where it was. When they started to actually print articles themselves, they got sued. Showing where the thing is and showing the content of the thing are two different actions.

So, when Google did the same thing, there were complaints.

> Why is nobody complaining about Google scraping their sites?

And second, search engines were actually pretty gentle with their site scraping. They needed the sites to work, so they respected robots.txt and made sure they wouldn't accidentally DDoS sites with too many requests. AI companies just DDoS sites, do not respect robots.txt, and if you block them, they will use another of their effectively infinite supply of IPs.

Put differently, even back then, Google was kind of trying to be an okay, non-evil citizen. They became sociopathic only much later, and even now they kind of try to hide it. OpenAI and the rest of the AI companies are openly sociopathic and proud of the damage they cause.

Ironically, there is precedent for Google caring more about this. When they realized their location timeline was a gigantic fed honeypot, they made it per-device and locally stored only. No open letters were written in the process.
Honestly the sooner OpenAI goes bankrupt the better. Just a totally corrupt firm.
I really should take the "invest in companies you hate" advice seriously.
I don't hate them. It is just plain to see they have discovered no scalable business model outside of getting larger and larger amounts of capital from investors to utilize intellectual property from others (either directly in the model aka NYT, or indirectly via web searches) without any rights. It is better for all of us the sooner this fails.
  • frm88 · 22 hours ago
> to utilize intellectual property from others (either directly in the model aka NYT, or indirectly via web searches) without any rights

... and put the liability for retrieving said property, and hence the culpability for copyright infringement, on the end user:

Since the output would only be generated as a result of user inputs known as prompts, it was not the defendants, but the respective user who would be liable for it, OpenAI had argued.

https://www.reuters.com/world/german-court-sides-with-plaint...

But wait, isn't this what we want? This means the models can be very powerful and that people have to use their judgment when they produce output so that they are held accountable for whether or not they produced something that was infringing. Why is that a bad thing?
  • frm88 · 27 minutes ago
Can I ask why the end user would be punishable for the pirating OpenAI did? That would mean governments have to take the next step to protect copyrighted material, and what we'd face then I don't even dare to imagine.
Says the people who scraped as much private information as they could get their hands on to train their bots in the first place.
I'll trust the people not asking for a government bailout, thank you very much.
Please correct me if I am wrong, but couldn't OpenAI just encrypt every conversation before saving it? With each query to the model the full conversation is fed into the model again, so I guess there is no technical need to store them unencrypted. Unless, of course, OpenAI wants to analyze the chats.

The way I see it, the problem is that OpenAI employees can look at the chats, and the fact that some NYT lawyer can look at them doesn't make me more uncomfortable. Insane argumentation. It's like saying an investigator with a court order should not be allowed to look at stored copies of letters, although the company sending those letters a) looks at them regularly and b) stores these copies in the first place.

>With each query to the model the full conversation is fed into the model again, so I guess there is no technical need to store them unencrypted.

I am pretty sure this isn't true. They have to have some sort of K-V cache system to make continuing conversations cheaper.
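
For what it's worth, a toy sketch of the prefix-caching idea being described; `expensive_encode` is a made-up stand-in for running the model over the whole conversation, and nothing here reflects OpenAI's actual internals:

    import hashlib

    def expensive_encode(text):
        # Stand-in for the costly part: attending over every prior token.
        return hashlib.sha256(text.encode()).hexdigest()

    # Hypothetical server-side cache keyed by conversation ID.
    kv_cache = {}

    def continue_conversation(conv_id, history, new_message):
        cached = kv_cache.get(conv_id)
        if cached and cached["prefix"] == history:
            # Cache hit: only the new message has to be processed.
            state = expensive_encode(cached["state"] + new_message)
        else:
            # Cache miss: re-encode the whole conversation from scratch.
            state = expensive_encode(history + new_message)
        kv_cache[conv_id] = {"prefix": history + new_message, "state": state}
        return state

The point is only that making continuations cheap means keeping some server-side state tied to the conversation, and stored state is exactly what a litigation hold can attach to.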

Encryption that you have the keys to won't save you from a court order
What about encryption only the users have the keys to? I'm assuming that's what the parent meant.
  • pjc50 · 19 hours ago
OpenAI has to have the keys so they can run their models on the text, at user request.

This is kind of a fundamental issue in cloud computing: it's someone else's computer, which means it's involved in someone else's legal disputes.

So why aren’t they offering for an independent auditor to come into OpenAI and inspect their data (without taking it outside of OpenAI’s systems)?

Probably because they have a lot to hide, a lot to lose, and no interest in fair play.

Theoretically, they could prove their tools aren't being used to do anything wrong, but practically, we all know they can't because they are actually in the wrong (in both the moral and, IMO though IANAL, the legal sense). They know it, we know it, the only problem is breaking the ridiculous walled garden that stops the courts from ‘knowing’ it.

By the same token, why isn't NYT proposing something like that rather than the world's largest random sampling?

You don't have to think that OpenAI is good to think there's a legitimate issue over exposing data to a third party for discovery. One could see the Times discovering something in private conversations outside the scope of the case, but through their own interpretation of journalistic necessity, believe it's something they're obligated to publish.

Part of OpenAI holding up their side of the bargain on user data, to the extent they do, is that they don't roll over like a beaten dog to accommodate unconditional discovery requests.

>By the same token, why isn't NYT proposing something like that rather than the world's largest random sampling?

It's OpenAI's data, there is a protective order in the case and OpenAI already agreed to anonymize it all.

>Part of OpenAI holding up their side of the bargain on user data, to the extent they do, is that they don't roll over like a beaten dog to accommodate unconditional discovery requests.

lol... what?

Discovery isn't binary yes/no, it involves competing proposals regarding methods and scope for satisfying information requests. Sometimes requests are egregious or excessive, sometimes they are reasonable and subject to excessively zealous pushback.

Maybe you didn't read TFA but part of the case history was NYT requesting 1.4 billion records as part of discovery and being successfully challenged by OpenAI as unnecessary, and the essence of TFA is advocating for an alternative to the scope of discovery NYT is insisting on, hence the "not rolling over".

Try reading, it's fun!

>Discovery isn't binary yes/no, it involves competing proposals regarding methods and scope for satisfying information requests. Sometimes requests are egregious or excessive, sometimes they are reasonable and subject to excessively zealous pushback.

There is a court order that OpenAI must produce these documents. OpenAI litigated this issue and lost. I'm not sure what point you are trying to make. The court decided the documents were relevant and they must produce a subset of them. Rather than immediately complying, they went and posted this BS "article".

>Maybe you didn't read TFA but part of the case history was NYT requesting 1.4 billion records as part of discovery and being successfully challenged by OpenAI as unnecessary, and the essence of TFA is advocating for an alternative to the scope of discovery NYT is insisting on, hence the "not rolling over".

I don't think you read TFA.

>Try reading, it's fun!

Lol, rudeness aside, you are apparently poorly informed. No doubt it is because you are relying on OpenAI's telling of the events and not actual reporting on the events. Btw, yesterday they were ordered to produce 20m redacted logs. You keep going on about the original discovery request, but that's not what the issue is and it's not the issue OpenAI lost on that they are now crying to the public about.

Also btw, I saw you posting in other comments that OpenAI needs to figure out how to anonymize the data. You probably don't realize this, but OpenAI already represented to the court that the data was anonymized and now are just using this as another delay tactic. Something about "reading being fun". I'd agree. Still, it does depend what you read. Try reading some more!

Meanwhile back in reality, as of today that order is being challenged, and challenging the scope of an interlocutory order in discovery is a normal thing and part of a coherent legal position. So I don't know why you're pretending you don't understand what it means not to "roll over like a beaten dog" in response to overzealous discovery.

>I don't think you read TFA.

It was in TFA. If you don't like their number which characterizes OpenAI's interpretation of what an earlier proposal required, the 20 million proposal was selected over the NYT's 120 million records proposal, which demonstrates the same point about fighting to narrow scope. So I still don't understand why you think the concept of challenging the scope discovery is somehow too mysterious to comprehend.

>You keep going on about the original discovery request, but that's not what the issue is and it's not the issue OpenAI lost on that they are now crying to the public about.

Yeah, because I was replying to a comment about that issue and I'm remaining on topic.

>Also btw, I saw you posting in other comments that OpenAI needs to figure out how to anonymize the data.

You're actually right! My mistake. I guess this makes it make sense to pretend you can't understand why a company would ever push back against the scope of discovery.

>Meanwhile back in reality, as of today that order is being challenged, and challenging the scope of an interlocutory order in discovery is a normal thing and part of a coherent legal position. So I don't know why you're pretending you don't understand what it means not to "roll over like a beaten dog" in response to overzealous discovery.

That's all OpenAI's argument and not coherent with regard to the facts. That's why they lost. I'd be willing to bet you they lose again. The rest of what you post is just verbatim their side, without any real analysis w/r/t to the facts (again) and I find your responses to this article to be a bit ridiculous in that regard. No less while you criticize others for pointing out OpenAI's incredibly self serving, smarmy, BS about privacy that they otherwise do not actually care about.

Again, this is not an issue of user privacy. OpenAI already represented to the court that they could anonymize the logs and that they did anonymize the logs (something you repeatedly fail to acknowledge while ranting that I didn't read the "article"). The issue is that OpenAI does not want to produce these logs because it will demonstrate that they are wrong. If you're gullible enough to believe otherwise, sure, but it certainly doesn't warrant the ridiculous attitude you bring to communicating with others here.

> Theoretically, they could prove their tools aren't being used to do anything wrong

That is proving a negative. You are never required to prove a negative.

> the only problem is breaking the ridiculous walled garden that stops the courts from ‘knowing’ it.

The "problem" of privacy?

  • mac3n · 1 day ago
> Trust, security, and privacy guide every product and decision we make.

-- openai

- any corporation

Remember, a corporation generally is an object owned by some people. Do you trust an "unspecified future group of people" with your privacy? You can't. The best we can do is understand the information architecture and act accordingly.

> - any corporation

I don’t recall seeing many food, furniture, plant, or generally non-tech companies talking about trust, security, and privacy as guiding principles.

  • 1 day ago
> Trust, security, and privacy guide every product and decision we make except ones that involve money.

-- openai, probably.

  • gk1 · 1 day ago
You know you have a branding problem when (1) you have to say that at the outset, and (2) it induces more eyerolls than a gaggle of golf dads.
The same goes for Google's "don't be evil" these days.
Stopped reading at this line
As soon as I see someone claiming a lawsuit against them is "baseless" I'm deeply sceptical about everything that follows.
When I looked for the base of this lawsuit, I was looking for some kind of monetary damage that the New York Times had suffered as a result of OpenAI's actions, like specific cases where their work has been reproduced or people canceling their subscriptions to the New York Times because of OpenAI's launch. I've done so much reading, and I've still been unable to find anything that articulates this. Do you know of anything that talks about it?
>specific cases where their work has been reproduced

Isn't that exactly what they're trying to find by looking through OpenAI customers' conversations?

This problem wouldn't exist if OpenAI didn't store chat logs (which of course they want to do, so that they can train on that data to improve the models). But calling the NYT the bad guy here is simply wrong, because it's not strictly necessary to store that data at all, and if you do, there will always be a risk of others getting access to it.
  • plorg · 1 day ago
As in every other dealing, OpenAI would have you believe they are so important that they are exempt from the legal discovery process.
Standard tech scaling playbook, page 69420: there is a function f(x) whereby if you're growing fast enough, you can ignore the laws, then buy the regulators. This is called "The Uber Curve"
  • ale42 · 1 day ago
Why should OpenAI keep those conversations in the first place? (Of course the answer is obvious.) If they didn't keep them, they wouldn't have anything to hand over, and they would have protected users' privacy MUCH better. This is about as much as Facebook or Google care about their users' privacy.
They didn't keep temporary chats. They were ordered to keep those as part of this case.
  • gruez · 1 day ago
>They didn't keep temporary chats

I thought they did? The warning currently says

>This chat won't appear in history, use or update ChatGPT's memory, or be used to train our models. For safety purposes, we may keep a copy of this chat for up to 30 days.

But AFAIK it was this way before the lawsuit as well.

30 days is perhaps a bit long, but they didn't keep them longer than that. It's pretty clear and reasonable.

The dodgy thing is that they don't now warn users that all chats, including temporary ones, are now "Bcc: NYT".

The NYT requests samples between Dec 2022 and Dec 2024. The judge's order to preserve chats came into effect this summer, after OpenAI engineers deleted, claiming it was a mistake, the VM in which NYT lawyers were processing data.

The dates and the 30-day default retention policy don't add up when framed this way.

OpenAI is deservedly getting a beating in this HN comments section, but are there any comments about NYT overreach and what it means in general?

And what if they for example find evidence of X other thing such as:

1. Something useful for a story, maybe they follow up in parallel. Know who to interview and what to ask?

2. A crime.

3. An ongoing crime.

4. Something else they can sue someone else for.

5. Top secret information

1-5: not a concern

It'll be the lawyers who need to go through the data, and given the scale of it, they won't be able to do anything more than trawl for the evidence they need and find specific examples to cite. They don't give a shit if you're asking chatgpt how to put a hit out on your ex, and they're not there to editorialize.

I won't pretend to guess* how they'll perform the discovery, but I highly doubt it will result in humans reading more than a handful of the records in total outside of the ones found via whatever method they automate the discovery process with.

If there's top secret information in there, and it was somehow stumbled upon by one of these lawyers or a paralegal somewhere, I find it impossibly unlikely they'd be stupid enough to do anything other than run directly to whomever is the rightful possessor of said information and say "hey we found this in this place it shouldn't be" and then let them deal with it. Which is what we'd want them to do.

*Though if I had to speculate on how they'd do it, I do think the funniest way would be to feed the records back into chatgpt and ask it to point out all the times the records show evidence of infringement

1. That sounds useful.

2. That sounds useful.

3. That sounds useful.

4. That sounds useful.

5. That sounds useful.

Are these supposed to be examples of things that shouldn't be found out about? This has to be the worst pro-privacy argument I've ever seen on the internet. "Privacy is good because they will find out about our crimes"

Hypocrisy at best. This wall of text is not even penned by a human, and yet they want us to believe they care about user privacy.
Wondering if anyone here has a good answer to this:

What protection does user data typically have during legal discovery in a civil suit like this, where the defendant is a service provider but relevant evidence is likely present in user data?

Does a judge have to weigh a user's expectation of privacy against the request? Do terms of service come into play here (who actually owns the data? what privacy guarantees does the company make?).

I'm assuming in this case that the request itself isn't overly broad and seems like a legitimate use of the discovery process.

It varies dramatically depending on the state and the judge.
  • zahma
  • ·
  • 22 hours ago
  • ·
  • [ - ]
Apparently OpenAI has zero interest in private user data. I have a hard time understanding how they’ll deploy this defense of “what about private user data?” in court.
  • paxys
  • ·
  • 1 day ago
  • ·
  • [ - ]
So much talk about privacy and how this is my private data that the NYT has no right to access.

If this is truly my data then it should be okay for me to download it and train my own model on it right?

Nope, that would explicitly be disallowed under the terms OpenAI has made me sign and they would ban my account and maybe even sue me for it.

So yeah, they are full of shit.

Can this legal principle be used on Gmail too?
Gmail is an Electronic Communication Service as defined in 18 U.S.C. § 2510, meaning its contents are protected under the Stored Communications Act (18 U.S.C. Chapter 121 §§ 2701–2713).

Communications with an AI system do not involve a human, so they are not protected by the ECPA or the SCA and get less protection. This is controversial, and some people have called for the ECPA/SCA to be extended to cover AI services. That would mean a warrant would be necessary to get your OpenAI history, not just a subpoena.

In a way it's like someone talking to themselves in the bathroom mirror. It's almost a higher privacy expectation than regular emails. You expect no human to see it at all.
Of course this principle applies to Gmail too, if you’re willing to accept the absurdity. I could copy-paste copyrighted NYT snippets into emails and send them to everyone I know. Under the same logic, the NYT would be entitled to have access to everyone's Gmail account in order to verify who's sending what and get compensated if anyone is infringing their copyright.

That’s not justice. That’s legal extortion.

I get that people are angry at OpenAI. But let’s not confuse outrage over one company with support for broken systems. Patent and copyright trolls thrive when we normalize overreach, whether it’s AI training data or email threads. If we let corporations weaponize IP law to control every digital whisper, we’re not protecting creators, we’re burying free expression under a mountain of lawsuits.

  • ripe
  • ·
  • 1 day ago
  • ·
  • [ - ]
> That’s not justice. That’s legal extortion.

If you made it your business to publish a newsletter containing copied NYT articles, then wouldn't they have the right to go after you and discover your sent emails?

Exactly, they wouldn't even need all of the emails in Gmail for that example, just the ones from a specific account.

The real equivalent here would be if gmail itself was injecting NYT articles into your emails. I'm assuming in that scenario most people would see it as straightforward that gmail was infringing NYT content.

If you make a business out of that, then yes, it is copyright infringement and thus you can be sued. Are we supposed to be outraged that someone making a business out of newspaper articles they did not write might get sued?

Your example is not remotely an example of copyright trolling or overreach.

OpenAI created this problem all by themselves. If the intention for private chats is that they should be private, then they should be e2e encrypted.
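To make the point concrete, here's a minimal sketch of what client-side encryption of chat history could look like. This is purely hypothetical (it assumes Python's `cryptography` package and a key that never leaves the user's device), not anything OpenAI actually ships:

  # Hypothetical client-side encryption of a chat log before it is stored.
  # The provider would only ever hold ciphertext; without the user's key
  # there is nothing readable to hand over in discovery.
  from cryptography.fernet import Fernet

  user_key = Fernet.generate_key()   # generated and kept on the user's device
  cipher = Fernet(user_key)

  chat = "user: summarize today's news\nassistant: ..."
  ciphertext = cipher.encrypt(chat.encode("utf-8"))        # what the server stores
  plaintext = cipher.decrypt(ciphertext).decode("utf-8")   # only possible client-side
  assert plaintext == chat

The obvious catch is that a hosted model still needs plaintext to generate a reply, so a scheme like this can only protect stored history, not the prompt in flight.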
> "The New York Times is demanding that we turn over 20 million of your private ChatGPT conversations."

Private? Aren’t they stored in a third party server, subject to OpenAI terms of service and all sorts of relevant laws?

  • ·
  • 13 hours ago
  • ·
  • [ - ]
  • jp57
  • ·
  • 1 day ago
  • ·
  • [ - ]
Wish they'd give a bulk delete interface that lets me choose which chats to keep and which to delete. (i.e. not "Delete All" scorched earth).
  • duxup
  • ·
  • 1 day ago
  • ·
  • [ - ]
"We stored all kinds of data about you! Someone ELSE having it is bad!"

-OpenAI

Hard to be sympathetic with OpenAI here.

It’s a mystery to me why companies that know they’re pushing a line of fair use or regulation are suddenly “surprised” when they get sued.

They could’ve asked permission. They could have worked with content providers instead of scraping. But they didn’t - and they knew what could happen.

FA (with fair use boundaries) and FO

  • Ms-J
  • ·
  • 1 day ago
  • ·
  • [ - ]
ClosedAI vacuums up and hoards all of your private chats to do terrible things and now complains when they must hand over your precious data without them receiving their cut.

This is funny!

  • zkmon
  • ·
  • 1 day ago
  • ·
  • [ - ]
If OpenAI has to get to this level of pitch, herding its users against their opponent in a legal case, I think they have already lost the battle and reputation. What are they expecting users to do? Revolt against the courts and newspapers?
I keep asking ChatGPT how to get NYT articles for free and then add lots of vulgar murderous things about their lawyers in the same message. It’s a private thought to an AI, so the attorneys can’t complain, right?
  • crmd
  • ·
  • 1 day ago
  • ·
  • [ - ]
your data belongs to you, just like our data about you belongs to us.
> Each week, 800 million people use ChatGPT to think...

The first sentence is enough for me, no need to read more. The message is clear: we are the brain and no one can stop us.

> This would allow them to access millions of user conversations that are unrelated to the case

What I don't understand is why they can't have a third party handle the data. Why does the NYT need it itself?

"How dare the New York Times demand access to our vault of everything-we-keep to figure out if we're a bunch of lying asses. We must resist them in the name of user privacy! Signed, the people who have scraped literally everything to incorporate it into the products we make."

OpenAI may be trying to paint themselves as the goody-two-shoes here, but they're not.

But that vault can contain conversations between me and ChatGPT, which I had willingly, but with the expectation that only OpenAI has access to them. Why should some lawyer working for the NYT have access to them? OpenAI is precisely correct, no matter what other motives could be there.
https://openai.com/policies/privacy-policy/

> We may use Personal Data for the following purposes: [...] To comply with legal obligations and to protect the rights, privacy, safety, or property of our users, OpenAI, or third parties.

OpenAI outright says it will give your conversations to people like lawyers.

If you thought they wouldn't give it out to third parties, you not only have not read OpenAI's privacy policy, you've not read any privacy policy from a big tech company (because all of them are basically maximalist "your privacy is important, we'll share your data only with us and people who we deem worthy of it, which turns out to be everybody.")

> but with the expectation that only openai has access to it

You can argue about "the expectation" of privacy all you want, but this is completely detached from reality. My assumption is that almost no third parties I share information with have magic immunity that prevents the information from being used in a legal action involving them.

Maybe my doctor? Maybe my lawyer? IANAL but I'm not even confident in those. If I text my friend saying their party last night was great and they're in court later and need to prove their whereabouts that night, I understand that my text is going to be used as evidence. That might be a private conversation, but it's not my data when I send it to someone else and give them permission to store it forever.

Listen, man, I willingly did that murder, but with the expectation that no one would know about it, except the victim. Why should some lawyer working for the government have access to it?
"Heartbreaking: The worst person you know just made a great point."

Can I just say that everyone sucks here and I hope they both lose somehow?

If you do anything in America that results in a stored record, it's possible it will be released in discovery and a lawyer will read it. This happens all the time, and has happened for hundreds of years.

It's not like the NYT will be publishing this shit in the news. Their lawyers and experts will have access to make a legal case, under a protective order. I'm not going to lose my law license because I'm doing doc review and you asked it something naughty and I think it's funny.

Courts and lawyers deal with this stuff all the time. What's very very weird to me is how upset OpenAI is about it.

They look like they are hiding something.

Each prompt is a potential confession.
The constant hypocrisy is unbearable. These people having so much power is holding humanity back.
I mean, I hate that our lives are becoming consistently more and more surveilled, but this doesn't shock me. I've assumed my Google search history is accessible, despite not even being logged in. Of course they are saving conversations. Even if they said they weren't, I wouldn't believe it. It's fucking sad, but that's the reality.

I wish I had a solution, so we could all feel a sense of freedom and pressure lifted from our thoughts and actions. But I only see this getting worse.

So am I upset that the NYT's lawyers want access to the records... a little. It's an invasion of privacy. But I'm more upset that they have anything to dig through to begin with.

If only we could see how things within all these companies we are forced to trust actually work. If only OpenAI was actually open. When will we all learn to demand open source, open platform services. Capitalize the development, and capitalize the infrastructure, but leave the process and operations out in the open so users can make informed decisions again. Normalize it like how homes are normally inspected before being purchased.

Maybe they should release some kind of NYT browser add-on, so users can cooperatively share their OpenAI data?
OpenAI would/could say the data is biased (maybe even purposefully).
If the information is really that sensitive, why did they keep it in the first place?
It's a bit rich for openai to claim they are protecting user data from journalists. Laughable, at best.
What a lousy attempt at flipping the narrative.
A very Musky attempt to win popular support.
One reason that people make cynical, deceptive claims is that it doesn't impact their credibility later. The next time they say something, people don't respond, 'well, you deceived us last time'; and when the honest person says something, others don't give them any extra credibility.

That little bit of morality - truth, honesty, integrity, etc. - is essential to a functioning society that leans toward good outcomes. (Often it seems that many just assume we'll get good outcomes, not that they must work hard to make it happen.)

  • d--b
  • ·
  • 1 day ago
  • ·
  • [ - ]
The nerve!!!

That on top of every lie they told, every value they betrayed, every line they crossed, they still have the nerve to blog about being the good guy!

From the FAQ:

> Q: Is the NYT obligated to keep this data private?

> A: Yes. The Times would be legally obligated at this time to not make any data public outside the court process.

Over more than a century, the NY Times has built a reputation for fiercely protecting its confidential sources. Why are they somehow less trustworthy than OpenAI?

If the NY Times leaked the customer information to a third party, they'd be in contempt of court. On the other hand, OpenAI is bound only by their terms of service with its customers, which they can modify as they please.

I generally agree, but publicizing the data is only a small part of the risk. The NYT could use the data for journalism research, then perform parallel construction of it for the public news article:

For example, if they find Mayor X asking ChatGPT about fraud, porn, DUI, cancer diagnoses, murder, etc. - maybe even mentioning names, places, etc. - they could then investigate that issue, find other evidence, and publish that.

First, the logs are supposed to be anonymized before being sent over. Second, the court can order the company's lawyers to "firewall" the logs from the newsroom so that their journalists can't get access to it, under penalty of contempt and potential disbarment.
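For what it's worth, "anonymized" in practice usually means something like pseudonymization plus redaction before the export leaves the company. A rough sketch of the idea (the field names, regexes, and secret here are made up for illustration, not OpenAI's actual pipeline):

  # Rough sketch: map user identifiers to stable pseudonyms and strip
  # obvious contact details before records are produced to the other side.
  # Schema and regexes are illustrative only.
  import hashlib, hmac, json, re

  SECRET = b"held-by-the-producing-party"   # never shared with the plaintiffs

  def pseudonymize(record: dict) -> dict:
      out = dict(record)
      # The same user always maps to the same opaque token, so patterns across
      # conversations are preserved without exposing the real identifier.
      out["user_id"] = hmac.new(SECRET, record["user_id"].encode(), hashlib.sha256).hexdigest()[:16]
      text = record["text"]
      text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)   # crude email redaction
      text = re.sub(r"\+?\d[\d\s().-]{7,}\d", "[PHONE]", text)     # crude phone redaction
      out["text"] = text
      return out

  print(json.dumps(pseudonymize({"user_id": "u123", "text": "mail me at a@b.com"})))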
Privacy for me but not for thee
Welcome to discovery. It’s what happens when you get sued.

Meanwhile, OpenAI talking about invading privacy sounds an awful lot like a claim with unclean hands.

They all want our data. Greedy organisations.
The heroic fight for privacy apparently includes having an ex-NSA director on the board and building user dossiers:

https://www.schneier.com/blog/archives/2025/06/what-llms-kno...

At some point they'll monetize these dossiers.

  • ·
  • 23 hours ago
  • ·
  • [ - ]
If it's about* proving that people are getting around the paywall with OpenAI, won't it be much easier to prove this with a live reproduction in court?

* I am not too familiar with this matter and hence definitely am not rooting for one party or another. Asking this just out of technical curiosity.

  • o11c
  • ·
  • 1 day ago
  • ·
  • [ - ]
No, because OpenAI can change their server at any time - and does, to patch over the cases where their copyright infringement is obvious.
This is the basic discovery process for when OpenAI commits IP theft. They're trying to misinform the public about how the justice process works.
> To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.

The constitution is clear that the purpose of intellectual property is to promote progress. I feel that OpenAI is on the right side of that and this is not IP theft as long as they aren't reproducing others work in a non-transformative way.

Training the AI is clearly transformative (and lossy to boot). Giving the AI the ability to scrape and paraphrase others' work is less clear, and both sides have valid arguments. I don't envy the judges that must make that call.

If they're reproducing NY Times articles, in full, then that is non-transformative. That's the point of the case.
> That's the point of the case.

No, its not. See the PDF of the actual case below.

The case is largely about OpenAI training on the NY Times articles without permission. They do allege that it can reproduce their articles verbatim at times, but that's not the central allegation as it's obviously a bug and not an intentional infringement. You have to get way down to item 98 before they even allege it.

https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...

They alleged it in point 4?

"Defendants have refused to recognize this protection. Powered by LLMs containing copies of Times content, Defendants’ GenAI tools can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style, as demonstrated by scores of examples. See Exhibit J. These tools also wrongly attribute false information to The Times."

You're right. No idea how I missed that. Thanks!

Still, that's a bug, not a feature. OpenAI will just respond that it's already been fixed and pay them damages of $2.50 or something to cover the few times it happened under very specific conditions.

Just to double check that it was fixed, I asked ChatGPT what was on the front page of the New York times today and I get a summary with paraphrased titles. It doesn't reproduce anything exactly (not even the headlines).

Interestingly, the summary is made by taking screenshots of a (probably illegal) PDF it found someplace on the internet. It then cites that sketchy PDF as the source rather than linking back to the original NY Times articles.

If I were the NYT I would still be plenty pissed off.

ChatGPT's reference: https://d2dr22b2lm4tvw.cloudfront.net/ny_nyt/2025-11-13/fron... via https://frontpages.freedomforum.org/

"[..] Trust, security, and privacy guide every product and decision we make. [..]"

L O L

This is rich coming from the company that scraped the entire internet and tons of pirated books and scientific papers to train their models.

Maybe if you didn't scrape every single site on the internet they wouldn't have a basis for their case that you've stolen all of their articles through training your models on them. If anyone is to blame for this it's OpenAI, not the NYT.

Play stupid games win stupid prizes.

“NYTimes fights blatant and obvious copyright infringement with legal processes to assess damage” - another angle.
Almost every comment (five) so far is against this: 'An incredibly cynical attempt at spin', 'How dare the New York Times demand access to our vault of everything-we-keep to figure out if we're a bunch of lying asses', etc.

In direct contrast: I fully agree with OpenAI here. We can have a more nuanced opinion than 'piracy to train AI is bad therefore refusing to share chats is bad', which sounds absurd but is genuinely how one of the other comments follows logic.

Privacy is paramount. People _trust_ that their chats are private: they ask sensitive questions, ones to do with intensely personal or private or confidential things. For that to be broken -- for a company to force users to have their private data accessed -- is vile.

The tech community has largely stood against this kind of thing when it's been invasive scanning of private messages, tracking user data, etc. I hope we can collectively be better (I'm using ethical terms for a reason) than the other replies show. We don't have to support OpenAI's actions in order to oppose the NYT's actions.

I suspect that many of those comments are from the Philosopher's Chair (aka bathroom), and are not aspiring to be literal answers but are ways of saying "OpenAI Bad". But to your point there should be privacy preserving ways to comply, like user anonymization, tailored searches and so on. It sounds like the NYT is proposing a random sampling of user data. But couldn't they instead do a random sampling of their most widely read articles, for positive hits, rather than reviewing content on a case by case basis?
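On the "positive hits" idea: if the question is just whether outputs reproduce article text near-verbatim, one crude way to screen a sample is shingled n-gram overlap. A toy sketch (the shingle size and threshold are arbitrary assumptions, and I have no idea how the NYT's experts would actually do it):

  # Toy screen for near-verbatim reproduction: compare word shingles of a
  # chat transcript against an article and flag high overlap for review.
  def shingles(text: str, n: int = 8) -> set:
      words = text.lower().split()
      return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

  def overlap(article: str, transcript: str, n: int = 8) -> float:
      a, t = shingles(article, n), shingles(transcript, n)
      return len(a & t) / len(a) if a else 0.0

  article = "the quick brown fox jumps over the lazy dog near the riverbank every morning"
  transcript = "it said: the quick brown fox jumps over the lazy dog near the riverbank every morning"
  if overlap(article, transcript) > 0.5:   # arbitrary threshold
      print("flag for human review")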
I hadn't heard of the philosopher's chair before, but I laughed :) Yes, I think those views were one-sided (OpenAI Bad) without thinking through other viewpoints.

IMO we can have multiple views over multiple companies and actions. And the sort of discussions I value here on HN are ones where people share insight, thought, show some amount of deeper thinking. I wanted to challenge for that with my comment.

_If_ we agree the NYT even has a reason to examine chats -- and I think even that should be where the conversation is -- I agree that there should be other ways to achieve it without violating privacy.

> The tech community has largely stood against this kind of thing when it's been invasive scanning of private messages, tracking user data

The tech community has been doing the scanning and tracking.

OpenAI is the one who chose to store the information. Nobody twisted their arm to do so.

If you store data it can come up in discovery during lawsuits and criminal cases. Period.

E.g., storing illegal materials on Google Drive, Google WILL turn that over to the authorities if there’s a warrant or lawsuit that demands it in discovery.

E.g., my CEO writes an email telling the CFO that he doesn’t want to issue a safety recall because it’ll cost too much money. If I sue the company for injuring me through a product they know to be defective, that civil suit subpoena can ask for all emails discussing the matter and there’s no magical wall of privacy where the company can just say “no that’s private information.”

At the same time, I don’t get to trawl through the company’s emails and use some email the CEO flirting with their secretary as admissible evidence.

There are many ways the court is able to ensure privacy for the individuals. Sexual assault victims don't have their evidence blasted across the airwaves just because the court needs to examine that physical evidence.

The only way to avoid this is to not collect the data in the first place, which is where end to end encryption with user-controlled keys or simply not collecting information comes into play.

> In direct contrast: I fully agree with OpenAI here. We can have a more nuanced opinion than 'piracy to train AI is bad therefore refusing to share chats is bad', which sounds absurd but is genuinely how one of the other comments follows logic.

These chats only need to be shared because:

- OpenAI pirated masses of content in the first place

- OpenAI refuse to own up to it even now (they spin the NYT claims as "baseless").

I don't agree with them giving my chats out either, but the blame is not with the NYT in my opinion.

> We don't have to support OpenAI's actions in order to oppose the NYT's actions.

Well, the NYT's action is about more than just its own case. If they win, it will set a precedent, which means other news outlets can get money from OpenAI as well. Which makes a lot of sense; after all, they have billions to invest in hardware, why not in content?

And what alternative do they have? Without OpenAI giving access to the source materials used (I assume this was already asked for because it is the most obvious route) there is not much else they can do. And OpenAI won't do that because it will prove the NYT point and will cause them to have to pay a lot to half the world.

It's important that this case is made, not just for the NYT but for journalism in general.

> Fighting the New York Times' invasion of user privacy

OpenAI is lying about why they are doing this. They want the public to attack the New York Times because OpenAI probably broke the law in so many ways...

If they cared about privacy they would not be training their models on that same private data. But here we are.

We need very strong regulations to rein in all these tech companies and make them work for their users instead of working against them and lying about it.

Another good reason to stay logged out when asking ChatGPT questions.
It's common and trivial to identify you by other means.
Indeed, but one more step (staying logged out), absolutely cannot hurt, and can help.
What a joke. It's like burglarizing someone's house and then calling the cops when someone else takes your ill-gotten gains.
"we built a tool using other people's copyrighted content and now they're suing us and want to know how much use the customers of our "other people's content" tool made of the copyrighted content we used to train the model. Thank you for your attention and outrage over this matter."
20M seems like a low number and I’m guessing they all used citations or similar content somewhere on the back-end that would map to NYTimes content as a result of a legal discovery request.

Also down to 20M from 120M per court order.

Sorry, but this seems a completely reasonable standard for discovery to me given the total lack of privacy on the platform - especially for free users.

Also sorry it probably means you’re going to owe a lot of money to the Times.

These are the same scumbags that scraped the entire internet including copyrighted books and private code without any regard for legality or ownership, now trying to spin them being sued for theft as a privacy issue.
That's an absolutely disgusting framing by openai. This really is about openai stealing.
  • JCM9
  • ·
  • 1 day ago
  • ·
  • [ - ]
This is BS. It’s like saying “We robbed a jewelry store and sold the jewelry. Now the police are poking around to see if anyone is wearing the jewelry we stole. Blasphemy! But don’t worry we will protect your privacy!”

Of course the Times wants more evidence that the content OpenAI allegedly stole is ending in things OpenAI is selling.

It's more like a torrent tracker telling users that a newspaper wants to know what people are torrenting because they "claim" people are torrenting the newspaper, but investigating this would be an invasion of privacy of the users of the torrent tracker.

This isn't even a hyperbole. It's literally the same thing.

No, it's not. OpenAI is a commercial enterprise selling the stolen data.
  • Havoc
  • ·
  • 19 hours ago
  • ·
  • [ - ]
This feels somewhat slimy as a PR piece but the message is valid. Letting NYT trawl through a bunch of private chats on suspicion just to check if there was some vague wrongdoing in the form of paywall bypass seems ridiculous

Chats contain way too much sensitive private data to subject them to bulk fishing expeditions

OpenAI is so full of shit, this is incredible. There is a protective order and the logs are anonymized. Yet they would happily give this all to the gov't under a warrant. Incredibly self serving bs from them. The court ordered the production, I'm not sure what OpenAI is even trying to sell people exactly.
psychopath Scam Altman does not give a rat's behind about your "privacy"; he is merely trying to keep the grift going and avoid responsibility for his unethical behavior (see also: Scarlett Johansson's voice)
  • nlh
  • ·
  • 1 day ago
  • ·
  • [ - ]
Man, maybe I'm getting old and jaded, but it's not often that I read a post that literally makes my skin crawl.

This is so transparently icky. "Oh woe is us! We're being sued and we're looking out for YOU the user, who is definitely not the product. We are just a 'lil 'ol (near) trillion-dollar business trying to protect you!"

Come ON.

Look, I don't actually know who's in the right in the OAI vs. NYT dispute, and frankly I personally lean more toward the side that says you are allowed to train models on the world's information as long as you consume it legally and don't violate copyright.

But this transparent attempt to get user sympathy under insanely disingenuous pretenses is just absurd.

Why is it absurd? Conversations between me and ChatGPT can be read by a lawyer working for the NYT, and that is what is absurd.
OpenAI has seemingly done everything they can to put publishers in a position to make this demand, and they've certainly not done anything to make it impossible for them to respond to it. Is there a better, more privacy minded way for NYT to get the data they need? Probably, I'm not smart enough to understand all the things that go into such a decision. But I know I don't view them as the villain for asking, and I also know I don't view OpenAI as some sort of guardian of my or my data's best interests.
> The New York Times is demanding that we turn over 20 million of your private ChatGPT conversations. They claim they might find examples of you using ChatGPT to try to get around their paywall.

Let me rewrite this without propaganda:

Despite spending hundreds of millions of dollars on lawyers, we couldn't persuade the judge that our malfeasance should be kept from the light of day.

Cynicism aside, this seems like an attempt to prune back a potentially excessive legal discovery demand by appealing to public opinion.

  The New York Times is demanding that we turn over 20 million of your private 
  ChatGPT conversations. They claim they might find examples of you using 
  ChatGPT to try to get around their paywall.
Yeah, I'm not sure why everyone feels the need to take a side here. Both of these organizations are ghoulish.
  • o11c
  • ·
  • 1 day ago
  • ·
  • [ - ]
The NYT has problems with being a stooge of the military-industrial complex, but I really don't see them doing anything wrong in this case.
How is the NYT like OpenAI, or 'ghoulish'?
LMAO How ironic...
If there's one thing I've learned about Sam Altman it's that he's a shrewd political manipulator and every public move is in service of a hidden agenda[1]. What is it here?

- Is it part of a slow process of eroding public expectations of data privacy while blaming it on an external actor?

- Is it to undermine trust in traditional media, in an effort to increase dependence on AI companies as a source of truth?

- Is it something else I'm not seeing?

I'm guessing it's all three of these?

[1] Those emails that came up in the suit with Elon Musk, followed by his eventual complete takeover of OpenAI, and the elaborate process of getting himself installed as chairman of the Reddit board to get the original founders back in control are prominent examples.

"they're invading your privacy by requesting access to our invasion of your privacy!"
This is so transparently disingenuous and weird.
Dude, you stole all of their articles to train your AI. Of course they want discovery.

Man, the sooner this company goes bankrupt the better.

>They claim they might find examples of you using ChatGPT to try to get around their paywall.

Is this a joke? We all know people do this. There is no "might" in it. They WILL find it.

OpenAI is trying to make it look like this is a breach of users' privacy, when the reality is that it's operating like a pirate website, and an investigation would prove it.

I'm sorry, but we've made a lot of conversations illegal and pretended like that was all right. I'm sure we've made advising people how to dodge paywalls illegal as part of DMCA and/or some anti-hacking law, or some other garbage. I'm also sure that you run an automated service that will advise and has advised people on how to dodge paywalls. Even if there are exceptions for individuals giving advice to friends, or people giving advice for free, you are neither of those: you are a profit-making paid corporation that is automating this process which may be illegal. You may be a hacking endorser, a hacking advisor, and a hacking tool.

Under those circumstances, why wouldn't NYT have a case? I advise everybody who employs some sort of DRM or online system that limits access to ask for every chat that every one of these companies has ever had with anyone. Why are they the only people who get to break copyright and hacking laws? Why are they the only people who get to have private conversations?

I might also check if any LLMs have ever endorsed terrorist points of view (or banned political parties) during a chat, because even though those points of view may be correct (depending on the organization), endorsing them may be illegal and make you subject to sanctions or arrest. If people can't just speak, certainly corporate LLMs shouldn't be able to.

The NYT used to market itself to advertisers with the observation that "our readers have the highest disposable income of any paper in the US".

It gives an interesting insight into politics and the modern Democrat party that the newspaper of the wealthy leans so strongly left. This was even before Trump came to power.

This is laughable
WTF is with all these comments. Regardless of OpenAI's reputation and practices, I don't want the NYT or anyone else to see my conversations. I completely agree with OpenAI here.
Agree. Everyone would be singing a different tune if say, Fox News were asking for this.
Yeah, it will be funny to see people turn 180, Fox should do that just for teh lulz.
>I don't want NYT or anyone else to see my conversations

Except for OpenAI, apparently.

Obviously for OpenAI, because they are providing the service. Same for my doctor, my pharmacist, and my lawyer. What was your point?
If Donald Trump used this OpenAI product to-- who knows-- brainstorm Truth Social content, and his chats were produced to the NYT as well as its consultants and lawyers, who would believe Mr. Trump's content remained secure, confidential and protected from misuse against his wishes?

That's simply a function of the fact it's a controversial news organization running a dragnet on private communications to a technology platform.

"Great cases, like hard cases, make bad law."

Always funny to see this kind of article behind a cookie banner. So much hypocrisy.
LOL they think they can win with the privacy angle? They've scraped the entire internet, including what is likely incredibly private and personal information, and they also log everything you do on the service. Get outta heah
I fully believe that OpenAI is essentially stealing the work of others by training their models on it without permission. However, giving a corporation infamous for promoting authoritarianism full access to millions of private conversations is not the answer.

OpenAI is right here. The NYT needs to prove their case another way.

> giving a corporation infamous for promoting authoritarianism

The NYT is certainly open to criticism along many fronts, but I don't have the slightest idea what you mean in claiming it promotes authoritarianism.

Well, the sponsors of the 1619 Project really don’t have a leg to stand on when it comes to ethics.
I already said the NYT is certainly open to criticism. I fail to see any connection between the 1619 Project and authoritarianism.
I'll bet you're right in some cases. I don't think it is as pervasive as it has been made out to be, but the argument requires some framing, and current rules, regulations, and laws aren't tuned to make legal sense of this. (This is a little tangential, because the complaint seems to be about getting ChatGPT to reproduce content verbatim to a third party.)

There are two things I think about:

First, and generally, an AI ought to be able to ingest content like news articles because it's beneficial for users of AI. I would like to question an AI about current events.

Secondly, however, the legal mechanism by which it does that isn't clear. I think it would be helpful if these outlets would provide the information as long as the AI won't reproduce the content verbatim. If that does not happen, then another framing might liken the AI ingestion to an individual going to the library to read the paper. In that case, we don't require the individual to retroactively pay for the experience or unlearn what he may have learned while at the library.

> infamous for promoting authoritarianism

what are you referencing here?

Well, the court disagrees with you and found that this is evidence that the NYT needs to prove its case. No surprise, considering it's direct evidence of exactly what OpenAI is claiming in its defense...