The New York Times is suing OpenAI and Microsoft for copyright infringement

btp@kbin.social · 10 months ago

Midnitte@kbin.social · 10 months ago

Really seemed like this was inevitable - it will be interesting to see if their fair use defense pans out.

I don’t expect it will, and I’m worried about the impact of that precedent on the legitimate fair use circuit…

lemonflavoured@kbin.social · 10 months ago

I’m amazed that it’s taken this long for a high profile lawsuit about it.

CJOtheReal@ani.social · edit-2 10 months ago

Removed by mod

EvilMonkeySlayer@kbin.social · 10 months ago

How so?

The trained model includes vast swathes of copyrighted material. It’s the rights holders who get to decide whether someone can use it.

Just because it makes it inconvenient or harder for someone to train an AI model does not justify wholesale stealing.

A lot of models are even trained on large amounts of pirated material, like books downloaded from pirate sites. I guarantee you OpenAI and others didn’t even buy a lot of the material they use to train the AI models on.

CJOtheReal@ani.social · edit-2 10 months ago

Removed by mod

Zima@kbin.social · 10 months ago

the poem poem poem thing shows that the llms actually do memorize at least some training data. chatgpt changed their eula to forbid users from asking it to repeat words forever after this was in the news.

also as far as I understand there are usually fair use and non profit exceptions for use of training data but they generally limit how it can be used. so training a model for commercial purposes might be against the license of the training data.

I don’t necessarily agree with the nyt but they seem to be framing this as someone aggregating their data and packaging it in a better way, which hurts their profits. i don’t really see that as necessarily being true. they could argue the same about google news showing their news…

CJOtheReal@ani.social · edit-2 10 months ago

Removed by mod

Zima@kbin.social · 10 months ago

that’s the theory. previous models also were supposed to be doing 3 digit math but they discovered that the questions were in the training data.

so you should look into what happens when people ask chatgpt to repeat a word forever: it prints the word for a while and then prints training data. check this link https://www.404media.co/google-researchers-attack-convinces-chatgpt-to-reveal-its-training-data/

edit: relevant part:

It also, crucially, shows that ChatGPT’s “alignment techniques do not eliminate memorization,” meaning that it sometimes spits out training data verbatim. This included PII, entire poems, “cryptographically-random identifiers” like Bitcoin addresses, passages from copyrighted scientific research papers, website addresses, and much more.

“In total, 16.9 percent of generations we tested contained memorized PII,”

I should also reiterate that I agree that the intent is to avoid memorization, but they are not successful yet.

HarkMahlberg@kbin.social · 10 months ago

I guarantee you OpenAI and others didn’t even buy a lot of the material they use to train the AI models on.

My hunch is that if they did actually buy or properly license that material, they would have been bankrupt before the first version of ChatGPT came online. And if that’s true, then OpenAI owes its entire existence to its piracy.

CJOtheReal@ani.social · edit-2 10 months ago

Removed by mod

HarkMahlberg@kbin.social · 10 months ago

Yeah… That’s not a good defense if you think about it. If someone made a Reddit comment with the entire contents of Discworld (idk, just an example), and OpenAI scraped all of Reddit to train their model, well now they’ve used copyrighted material without paying for a commercial license, and now they’re on the hook. By being unscrupulous about their scraping, they actually open themselves up to more liability than if they were more careful about what they scrape and where.

This is all to say nothing of the fact that several other major companies were caught pants down by training with databases explicitly created by torrenting a ton of books.

https://torrentfreak.com/authors-accuse-openai-of-using-pirate-sites-to-train-chatgpt-230630/

There is no direct evidence that OpenAI used pirate sites to train ChatGPT. That said, it is no secret that some AI projects have trained on pirated material in the past, as an excellent summary from Search Engine Journal highlights.

The mainstream media has picked up this issue too. The Washington Post previously reported that the “C4 data set,” which Google and Facebook used to train their AI models, included Z-Library and various other pirate sites.

lemonflavoured@kbin.social · 10 months ago

Its not piracy to just webscrap everything for data…

Yes it is.

CJOtheReal@ani.social · edit-2 10 months ago

Removed by mod

lemonflavoured@kbin.social · 10 months ago

Publicly available =/= public domain.

shiveyarbles@beehaw.org · 10 months ago

Is AI just a giant screen scraper with a presentation layer? I always thought of it more like Asimov’s positronic brain.

leaskovski@kbin.social · 10 months ago

To be fair, some of the chat bots are effectively just that. They have “scraped” their data models and are outputting it in a way that seems like you are having a conversation with the “bot”.