I guess the question to leadership is this: two of the three pillars, namely security and quality, are at odds with the third pillar, AI innovation. Which side do you pick?
(I know you mean well and I love you, Scott Hanselman, but please don't answer this yourself. Please pass this on to the leadership.)
Microsoft was unique among the companies I worked for in that they gave you some guidelines and then let you blog without having to go through some approval or editing process. It made blogging much more personal and organic IMO; company-curated blog posts read like marketing.
I didn’t see the original post but it looks like somebody made a bad judgment call on what to put in a company blog post (and maybe what constitutes ethical activity) and that it was taken down as soon as someone noticed.
I care much less about whether the person exercised good judgment in posting, and don’t care (and am happy) that there was not some process that would have caught it pre-publication.
I care much more if the person works in a team that believes that copyright infringement for AI training is a justifiable behavior in a corporate environment.
And now we know that is a thing, and I suspect that there will be some hard questions asked by lawyers inside the company, and perhaps by lawyers outside the company.
It feels out of character for a company like Microsoft to have such a policy, but I agree that it's insanely cool that some very cool folks get to post pretty freely. Raymond Chen could NEVER run his blog like that at FAANG.
Bruce Dawson was publishing debugging stories (including bugs in Google products that he debugged as part of his job) for the entire time he was working at Google: https://randomascii.wordpress.com/
I was/am a nobody, I have no idea how that happened and it was mind blowing that MS was interacting with me.
If you or anyone else who sees this wants to see the original post, it's still available in the Wayback Machine: https://web.archive.org/web/20260105115129/https://devblogs....
Copyright aside, it looks like an interesting blog post.
Why do you assume that reviewing docs is a lower bar than reviewing code, and that if docs aren't being reviewed it's somehow less likely that code is being reviewed?
There's a formal process for reviewing code because bugs can break things in massive ways. There may not be the same degree of rigor for reviewing documentation, because bad documentation is not going to stop the software from working.
But one doesn't necessarily say anything about the other.
What they say is that low quality in the documentation does not mean low quality in the code. Nothing says that they are related.
> I don't know if you are just playing devil's advocate
Indeed, that is playing Devil's Advocate but one should remember that such Advocacy is performed to make sure that arguments against the Devil are as strong as they can be. It's not straightforward to see how simply repeating an assertion helps to argue for the veracity of it.
I realize BSOD is no longer nearly as common as it once was, but let's not forget that Windows used to be very fragile indeed.
Anecdotally, installing the wrong drivers (in my case, drivers for talking to an STM32 over a COM port) could make BSODs as common as twice a day on Win11. Meanwhile, my Windows Server 2008 machine is still doing just great, with no BSOD in its lifetime.
I agree that for a typical user a BSOD is now less likely to happen, but I wonder whether that has less to do with the Windows core and more to do with Windows Defender's aggressive default settings.
It was more robust 5 years ago than it is today.
Or at least that's been my impression. I can't back that up with hard data.
I have never even heard of a software company that acts otherwise (except IBM, and much of the world of Silicon Valley software engineering is reactionary to IBM's glacial pace).
I'm not saying docs == code for importance is a bad way to be, just that if you can name firms that treat them that way other than IBM (or aerospace), I'd be interested to learn more.
What I'm saying is, you have to review code to get it out the door with a certain degree of quality. That's your core product. That's the minimum standard you have to pass, the lowest bar.
In contrast, reviewing documentation is usually less core. You do that after the code gets reviewed. If there's time. If it doesn't get done, that's not necessarily saying anything about code quality.
Even if it's easier to review documentation, that doesn't mean it's getting prioritized. So it's not a lower bar in the sense that lower bars get climbed first.
Organizations are large, so much so that different levels of rigor apply across different parts of the organization. Furthermore, more rigorous controls would be applied to code than to documentation (you would assume).
I wasn't mad, just disappointed.
Uber is a rebadged taxi service with seedier people than before.
AirBnB is a less disguised but still rebadged B&B service with seedier people than before.
Charlie Munger said it best. Cryptocurrency is like seeing a bunch of people trading turds and saying to yourself "well.. I don't want to miss out!" The seediest of all people.
AI doesn't even really exist by any common definition. They have supremely weak and power hungry language models trained on terabytes of stolen data and reddit conversations.
Hell, watching a guy hammer himself in his own nuts on youtube is an innovation, and I think I'm going to go do /that/ now instead of being depressed. Watching "ow my balls" and baitin'. What's left?
https://www.kaggle.com/datasets/shubhammaindola/harry-potter...
More than just using the data, it seems that linking to a copy that claims the dataset is public domain would be problematic copyright-wise.
Also interesting, this blog post has been up since November of 2024, very surprising to me that Microsoft hasn't taken it down yet.
When I try to fill in the questionnaire, my request is rejected with this message:
We understand that you are not legally authorized to file a copyright complaint on behalf of the copyright owner.
In accordance with applicable copyright laws, we only accept copyright complaints from copyright owners or their authorized representatives. If you have legal questions about copyright law, please consult your own legal counsel.
We are sorry we cannot assist you further.
Hysterical. What a farce. That data set is pure theft. (E.g. see YouTube, where this is (used to be?) poorly enforced; it's a mess.)
Would it? Sounds to me like the blame lies on the person uploading the dataset under that license, unless there is some reasonable person standard applied here like 'everyone knows Harry Potter, and thus they should know it is obviously not CC0'
Yes there's an expectation that you put in some minimum amount of effort. The license issue here is not subtle, the Kaggle page says they just downloaded the eBooks and converted them to txt. The author is clearly familiar enough with HP to know that it's not old enough to be public domain, and the Kaggle page makes it pretty clear that they didn't get some kind of special permission.
If you want to get more specific on the legal side then copyright infringement does not require that you _knew_ you were infringing on the copyright, it's still infringement either way and you can be made to pay damages. It's entirely on you to verify the license.
Why wouldn't that apply?
How many people think they can rip off Disney characters even if they don't know how much Disney lobbied to extend their ownership? People can observe that no one but Disney gets to use them and understand, even if not consciously, that those are Disney's to use.
^ Probably poorly written without time to proof cause time constraint.
https://en.wikipedia.org/wiki/List_of_copyright_duration_by_...
So in short, I kept my mouth shut. I assumed I would lose my job if my public comment reached the right people.
Media file: https://pdst.fm/e/clrtpod.com/m/pscrb.fm/rss/p/arttrk.com/p/...
https://github.com/Azure-Samples/azure-sql-db-vector-search/...
https://devblogs.microsoft.com/azure-sql/?p=4796
"Build a RAG App in 5 Minutes
Ever tried setting up an AI-powered project on Azure and felt overwhelmed? As a student or first-time user to cloud computing, I've been there too. The idea of creating a chatbot or search app using GPT sounds exciting, but the process of setting up everything right from the vector database, provisioning OpenAI models, to integrating them, it can f..."
I'm disappointed people continue to use it.
If anything, Kaggle would be on the hook for listing the data as CC0. Or perhaps Shubham Maindola, for uploading it. In fact the "provenance" listed would give me chills. Crazy how this got a 10.0 score. "I downloaded the ebooks of Harry Potter. Then converted them to txt files."
Microsoft has a market cap of almost $3 trillion. I think they can afford to pay for the texts they use in their AI research.
I hate the current copyright environment as much as anyone, but I do not abide double-standards, with a two-tier justice system wherein a corporation gets to freely enforce the draconian copyright regime against individuals while also getting to abuse individuals' creative works in ways much more egregious.
I'm surprised that JKR's people haven't come down like a tonne of bricks on Kaggle / Microsoft.
Does anyone know whether there is some special reason why this has lasted so long without being taken down?
[1] https://news.ycombinator.com/item?id=47057829, "Microsoft morged my diagram". It was in a discussion there that someone pointed out this article linking to full downloads of the Harry Potter novels, which I thought deserved more visibility.
But this is just a lie.
Approximately nobody is prosecuted for copyright infringement.
We’re moving the goalposts from the government systematically targeting normal people “if caught”, to only a handful of civil cases.
I think most would agree that cases like that act as a deterrent?
> I think most would agree that cases like that act as a deterrent?
I think we could hardly get any further from “the rest of us are prosecuted as harshly as possible if caught”.
[1]https://en.wiktionary.org/wiki/throw_the_book_at
Now, as for "the rest of us are prosecuted as harshly as possible if caught". You are correct in your pedantry that this statement was not expressed as rigorously as it possibly could have been. There are different classes of copyright infringement; "receiving" and "perpetuating" being two of them [to avoid further pedantry, I am not asserting this is precise legal terminology but rather a lay distinction for the purposes of discussion]. It is the latter case which is tried as harshly as possible when caught, and there are many such examples other than Swartz, and I think my intent was clear when I said it despite the fact that I did not write about the distinction at length.
That is not to say the situation around the former type of copyright infringement is so kind, either. While in some countries it is mostly overlooked, which I believe to be the case in the US, in other countries it is more strictly enforced, such being the case in my own country. While "as harshly as possible" isn't accurate to prosecution against infringement of this nature, you can still be disproportionately punished relative to the damage caused when downloading pirated material for personal viewing, if caught (and ISPs/rightsholders do monitor for it to the best of their abilities).
There is also a third class of copyright infringement to consider which is highly disfavourable to individuals: derivative works. Strictly speaking, even something as simple as drawing fanart of a character or remixing a song is illegal, even if the activity is completely non-commercial in nature. This is, of course, absolutely ridiculous. Rightsholders know that copyright law reformation would gain tremendous popular support if they were draconian about enforcing their rights against derivative works, and that allowing fan communities to bloom is actually beneficial to their own IP, so enforcement is highly selective. However, that arbitrary, selective nature of enforcement is itself dangerous to individuals, and is sometimes used to punish specific individuals as harshly as possible at the whims of the IP holder. It is true that not everyone is actually subjected to this, but the threat of it happening looms over everyone who expresses their creativity through derivative works.
None of this sits right with me, especially as corporations are hoovering up every piece of copyrighted material they possibly can and creating commercial derivative-work-machines that mass-produce sloppified derivative works, and are getting a completely free pass by the legal system to do so while individuals are still treated like felons for 'crimes' that are at most marginally harmful, or in the case of the creative production of derivative works, not only not harmful but actually beneficial to society.
Shit, I don't even think the people who screamed "Snape killed Dumbledore" at lines for book 6 based on leaked copies that hit before the street date got in any trouble.
How could anyone possibly get in trouble for something that isn't a crime?
(done, contacted her lawyers too)
But ignoring that: I do not think that these txt files being online do any economic harm. No one will go and say "hey, I'm going to read these un-formatted text files instead of buying the 30 year old books for little money or pirating proper epubs which are trivial to find". If anything, the Kaggle dataset is free publicity. So as the author I would leave them online.
I mean, books3 contained hundreds of thousands of copyrighted books, and people released it under their own name.
Archived copy: https://web.archive.org/web/20260105115129/https://devblogs....
It is very worrying that people with no ethics work for these trillion dollar companies who are supposed to be shaping the technology of tomorrow.
Disrespecting the copyright on a multi-billion dollar franchise hardly comes close to the major unethical behavior the trillion dollar companies are committing.
[1] actual indian
So far, the only thing I've found AI to be consistently good at is entertainment of the humorous kind.
Everything new is AI slop, and there seems to be no coming back from it.
Very low code. Infinite scale. Name a better AI startup to invest in.
The implicit motto of this class of hyper-wealthy people is: "it's not yours if you cannot keep it". Well, game on.
(There are 56.5e6 millionaires, and 3e3 billionaires -- making them 0.7% of the global population. They are outnumbered 141.6 to 1. And they seem to reside and physically congregate in a handful of places around the world. They probably wouldn't even notice that their property is being stolen, and even if they did, a simple cycle of theft and recovery would probably drive them into debt).
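For anyone who wants to check the arithmetic in that parenthetical, here is a rough sketch. The comment does not say what world population it used; the 8 billion figure below is my assumption, chosen because it roughly reproduces the quoted numbers.

```python
# Figures quoted in the comment above.
millionaires = 56.5e6
billionaires = 3e3

# ASSUMPTION (not stated in the comment): world population of about 8 billion.
world_population = 8.0e9

wealthy = millionaires + billionaires

# Share of the global population, as a percentage.
share_percent = 100 * wealthy / world_population

# Total people per wealthy person (the wealthy person is counted in the total).
people_per_wealthy = world_population / wealthy

# Everyone else per wealthy person (the stricter reading of "outnumbered").
others_per_wealthy = (world_population - wealthy) / wealthy

print(f"share of population: {share_percent:.2f}%")
print(f"people per wealthy person: {people_per_wealthy:.1f}")
print(f"others per wealthy person: {others_per_wealthy:.1f}")
```

Under that assumed population the share comes out at about 0.7% and the total-to-wealthy ratio at about 141.6, so the quoted "141.6 to 1" looks like total population divided by the wealthy group; counting only everyone else gives a ratio one lower.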
https://web.archive.org/web/20260105115129/https://devblogs....
Nevertheless pretty egregious oversight (incompetence?) and something that shouldn't have been published.
If it comes from a site claiming it was under a licence when it was not, the misdeed is done by the person who provided the version carrying the licence.
Even if MS could claim that they were acting in good faith, there really isn't much legal wiggle room for that. But it doesn't even come to that, because I don't think anyone would buy that they really thought that the Harry Potter books were under CC0.
Same thing applies here.
Up to 80% of all works that are in copyright terms are accidentally in the public domain. A well known example is Night of the Living Dead. It is not your job to check that the copyright on a work you use is the correct one.
And it is your job to check that you have the rights to use other people's work. Ignorance is not a defence.
Which ones? As far as I was aware, it's a crime to redistribute copyrighted works, not receive.
(a) the defendant was not aware, and had no reasonable grounds for suspecting, that copyright subsisted in the work or other subject - matter to which the action relates;
(b) where the articles converted or detained were infringing copies--the defendant believed, and had reasonable grounds for believing, that they were not infringing copies; or
(c) where an article converted or detained was a device used or intended to be used for making articles--the defendant believed, and had reasonable grounds for believing, that the articles so made or intended to be made were not or would not be, as the case may be, infringing copies.
Does this not mean the opposite of your claim? It sounds to me that if you unwittingly bought a dodgy copy of something, the law thinks the copyright owner can get you to pay for a legit copy, but not punish you for your mistake.
In the specific case of the Harry Potter works, the fame might meet the threshold of reasonable grounds for believing, but noosphr's argument that "Up to 80% of all works that are in copyright terms are accidentally in the public domain" could grant reasonable grounds for believing it is not.
This is one of those things that causes interesting court cases, because having reasonable grounds for believing X is not the same thing as lacking reasonable grounds for believing not-X. Reasonable grounds for suspicion probably carries more weight here than reasonable grounds for the absence of suspicion, but cases have hung on things like this before, like the presence or absence of an Oxford comma.
Although this, it seems, is not reciprocal. Rule for thee, but not for me.
If I write an article on training an LLM on the leaked Windows XP source code, blithely mark the source code repo as in 'the public domain', but use Azure resources for the how-to steps, would that make it OK, Microsoft? You know, your Azure division might get some money...
Seriously, this is just so...blatant. It's like we've all collectively decided that copyright just doesn't matter anymore. Just reading this article, I feel like I'm taking crazy pills.
Why exactly?
A search index might also contain copyrighted material. As long as it's used for search queries as opposed to regurgitation there's no problem. Search indexes and LLMs are both clearly very beneficial tools to have access to.
Since we're talking about an electronic system the search index example is the more directly relevant one. Anyone who wants to object to LLMs is going to need to take care to ensure consistency with his views on Google's search index.
Also can you point out how copyright law changes because we're using an "electronic system" as opposed to an "analog system?"
I never claimed any change in copyright law. Only that one analogy was more direct than the other for the purpose of the current discussion.
You didn't answer my question. What point were you trying to make with your earlier reply?
My playing copyrighted music on my synths at home, or singing along to the lyrics, is different from being a professional musician who benefits financially from playing someone else's music in public.
Producing a product = market rules apply
Just living as a human = totally different thing
the merge commits in those repositories are all digitally signed with GitHub's key, so the previous history is fully authenticated and non-repudiable
so any copies now can be trivially proven to be genuine output by Microslop
hoisted by your own petard
signed merge commit is: 987eee6af61788647ae0cab82ae8a5d9402a5bd0
PGP signature (using GitHub's key: B5690EEEBB952194) is:
for posterity:
-----BEGIN PGP SIGNATURE-----
wsFcBAABCAAQBQJnPIphCRC1aQ7uu5UhlAAAUgMQACyp7apkh0e413K7ipGd7Z+K
JCMq93GoJm4OSgzzZzCp1DbeEq2u1mX1ZAXLq5XKqM0cL6cTg13IF4oumq8QmTzQ
bFykqKfrkCDSTIa2v5CucJedmIoJl976jX96bnV8YXgoKx8/43044galo23bjoJ8
9tUcVnC10FYj7NTI9/uCN9C3f2Up3t9xUaJzJv3OdgjJ9B3cNwYBfF6sDCj3QnUu
AWRNdGIyqyO1WKnj2XL2Qo9jMWNX3uHSBYYGqIvZqu2bjpYS89Dt3X086JlLdQG9
Pef2PHX6VeZ6j8J4NPqi28mB2n9Dn7V6q0SQIF1z4hsa9fLC0kljyrrO3T/RT6Ut
D8r3Y7vjGUHPNkVXSo1oNCiNMV9LjDQwiJc/AuF6smupxivIFCKe8nDPBlCvi6gr
uPz5KK5MfpmG5rO2+NA0LcrUPAk6F3nxDI46+Lsu2nCvO+pOauQQ+oUvxJNCnI3Y
5PAReulGOZHXbiCj/9j6+H7rUBCGk2phVtXOsXxitCorigNXAeAJ8hP2cgjXZH25
NGGtjyp75VVBydzSCz9yY+VypITovsDmEC1CxfbJRS7SaTdU7bGCLN08JcmfOzNb
u/3iPkKMXXWMNYO6J1bUeAqVpueGkqsAqnhY32NylIni07Oz/he8nEsQCXC+4ueG
uYgSpEu8IaERBIQLVntK
=yDvq
-----END PGP SIGNATURE-----
The biggest irony would be if the page itself was generated by an LLM.
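As a sanity check on the key ID claimed for the signature block quoted above, here is a minimal sketch that reads the issuer key ID out of the first armored line. It assumes the standard OpenPGP version 4 signature packet layout from RFC 4880 and only handles the cases this particular signature appears to use (new-format packet header, two-octet packet length, one-octet subpacket lengths). The function name is my own, and this only inspects metadata; it does not verify the signature. Actual verification would need the public key and the commit object, for example via `git verify-commit`.

```python
import base64
import struct

# First line of the armored signature quoted above (48 bytes once decoded).
FIRST_LINE = "wsFcBAABCAAQBQJnPIphCRC1aQ7uu5UhlAAAUgMQACyp7apkh0e413K7ipGd7Z+K"


def read_signature_header(first_line: str) -> dict:
    """Parse the start of an OpenPGP v4 signature packet (hypothetical helper)."""
    data = base64.b64decode(first_line)

    # Packet header: new-format tag byte for a signature packet (tag 2),
    # followed here by a two-octet length whose first octet is in 192..223.
    tag = data[0]
    if tag & 0xC0 != 0xC0 or tag & 0x3F != 2:
        raise ValueError("not a new-format signature packet")
    if not 192 <= data[1] <= 223:
        raise ValueError("sketch only handles two-octet packet lengths")
    body = data[3:]

    # Fixed fields: version, signature type, public-key algorithm,
    # hash algorithm, and the byte length of the hashed subpacket area.
    version, _sig_type, _pk_algo, _hash_algo, hashed_len = struct.unpack(
        ">BBBBH", body[:6]
    )
    if version != 4:
        raise ValueError("sketch only handles version 4 signatures")

    info = {"version": version, "created": None, "issuer": None}
    pos, end = 6, 6 + hashed_len
    while pos < end:
        sub_len = body[pos]  # length covers the type octet plus the payload
        if sub_len >= 192:
            raise ValueError("sketch only handles one-octet subpacket lengths")
        sub_type = body[pos + 1]
        payload = body[pos + 2 : pos + 1 + sub_len]
        if sub_type == 2:  # signature creation time (Unix seconds)
            info["created"] = int.from_bytes(payload, "big")
        elif sub_type == 16:  # issuer key ID (8 bytes)
            info["issuer"] = payload.hex().upper()
        pos += 1 + sub_len
    return info


info = read_signature_header(FIRST_LINE)
print(info["issuer"], info["created"])
```

If the sketch is right, the issuer field should match the key ID given in the comment, and the creation time is a Unix timestamp that can be compared against the commit date.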
https://news.microsoft.com/source/2004/02/12/statement-from-...
In case the new anti-copyright Microslop memory-holes that link:
https://web.archive.org/web/20260215220230/https://news.micr...
The tutorial could have used that leaked source code for "educational purposes", as many here claim.
I'm sure the scripts of Star Wars would be similarly ignored if they were used.
ftfy
Something like Harry Potter might be shared every day. And I mean as pirate work distributed as new copy. Staying on top of that will be very hard work.
This however is a very, VERY poor situation when you end up placing your employer at risk because you think copyright doesn't matter and everything on the internet is fair game.
This is probably the most polite way I would describe this to most, UG. For the rest, just stop acting like cheating your way through a situation to get a step up is the norm; it's just dirty behaviour.
But come on … these guides really are for learning purposes. Doesn’t seem like a big deal to me at all. They aren’t even hosting it, just pointing to kaggle who is hosting it.
On principle copyright law should allow this kind of learning use case anyway.
Rowling is known for actively protecting her rights as an author, they couldn't have picked a worse author to slop up
Everyone should torrent and rip off those books, anyway.
In fact if you do this as a nonprofit or at an educational institution in a teaching context it’s explicitly allowed by fair use already.
If you do it individually, idk I’m not a lawyer. But it should be allowed on principle.
But if you then go take your trained AI and deploy it for commercial purposes that’s a different story and should have protections for the original rights holders.