The New York Times is suing OpenAI and Microsoft for copyright infringement

btp@kbin.social · 9 months ago

Zima@kbin.social · 9 months ago

the "poem poem poem" thing shows that LLMs actually do memorize at least some training data. ChatGPT changed its EULA to forbid users from asking it to repeat words forever after this was in the news.

also, as far as I understand, there are usually fair use and non-profit exceptions for using material as training data, but they generally limit how it can be used. so training a model for commercial purposes might be against the license of the training data.

I don’t necessarily agree with the NYT, but they seem to be framing this as someone aggregating their data and packaging it in a better way, which hurts their profits. I don’t really see that as necessarily being true. they could argue the same about Google News showing their news…

CJOtheReal@ani.social · edit-2 9 months ago

Removed by mod

Zima@kbin.social · 9 months ago

that’s the theory. previous models were also supposed to be doing 3-digit math, but it was discovered that the questions were in the training data.

so you should look into what happens when people ask ChatGPT to repeat a word forever: it prints the word for a while and then starts printing training data. check this link: https://www.404media.co/google-researchers-attack-convinces-chatgpt-to-reveal-its-training-data/

edit: relevant part:

It also, crucially, shows that ChatGPT’s “alignment techniques do not eliminate memorization,” meaning that it sometimes spits out training data verbatim. This included PII, entire poems, “cryptographically-random identifiers” like Bitcoin addresses, passages from copyrighted scientific research papers, website addresses, and much more.

“In total, 16.9 percent of generations we tested contained memorized PII,”

I should also reiterate that I agree that the intent is to avoid memorization, but they are not successful yet.
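
for anyone wondering what "spits out training data verbatim" means in practice, here is a rough sketch of how you could flag it yourself: look for word windows in the model output that also appear word for word in a reference text. this is only an illustration and not the researchers' actual method. the function names, the 5-word window and the toy strings are all my own assumptions, and a real check would need a far larger corpus and a longer window.

```python
def ngrams(words, n):
    """Set of all n-word windows in a list of words."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def memorized_spans(generation, corpus_docs, n=5):
    """Return the n-word windows of `generation` that appear verbatim
    in any document of `corpus_docs` (hypothetical helper, toy scale)."""
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc.split(), n)

    gen_words = generation.split()
    hits = []
    for i in range(len(gen_words) - n + 1):
        window = tuple(gen_words[i:i + n])
        if window in corpus_grams:
            hits.append(" ".join(window))
    return hits


if __name__ == "__main__":
    # Toy stand-ins: one "training" document and one model output.
    corpus = ["the quick brown fox jumps over the lazy dog"]
    output = "poem poem poem the quick brown fox jumps over"
    for span in memorized_spans(output, corpus):
        print(span)
```

a short window like 5 words will also flag common phrases, so a hit only becomes meaningful when the window is long enough that a chance match is unlikely.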