As annoying as this is, it's there to prevent LLMs from training on Reddit content, and that's probably the greater of the two evils.
That’s all well and good, but how many LLMs do you think actually respect robots.txt?
from my limited experience, about half? i finally had to set up a robots.txt last month after Anthropic decided it was OK to crawl my Wikipedia mirror from about a dozen IP addresses simultaneously, non-stop, with no rate limiting, and bring it to its knees. fuck them for it, but at least the crawling stopped once i added the robots.txt.
Facebook, Amazon, and a few others, on the other hand, are ignoring that robots.txt. at least they have the decency to do it slowly enough that i'd never have noticed without checking the logs.
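for anyone wanting to do the same, mine looks roughly like this. the user-agent tokens below are the ones each company documents for its crawler, but treat the exact strings as assumptions — they get renamed and new bots show up all the time:

```
# AI crawlers that, in my experience, actually honor robots.txt
User-agent: ClaudeBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Amazon and Meta ignore this in practice (see above), but listing them can't hurt
User-agent: Amazonbot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: meta-externalagent
Disallow: /
```

for the ones that ignore it, the only real option is blocking at the web server: match the user agent (or their published IP ranges) and return a 403.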
I thought major LLMs ignored robots.txt
It’s to profit from training LLMs: https://arstechnica.com/information-technology/2024/02/your-reddit-posts-may-train-ai-models-following-new-60-million-agreement/
FTFY