How feasible is it to configure my server to essentially perform a reverse slow loris attack on these LLM bots?
If they won’t play nice, then we need to reflect their behavior back onto themselves.
Or perhaps serve a 404, 304 or some other legitimate-looking static response that minimizes load on my server whilst giving them the least amount of data to train on.
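FWIW, a tarpit along those lines is quite feasible: endlessh already does exactly this for SSH. Here is a minimal asyncio sketch; the port, the drip rate, and the assumption that your reverse proxy routes suspected bot traffic to this listener are all illustrative choices of mine, not anything from the thread:

```python
# tarpit.py - a "reverse slow loris": answer instantly, then drip the body
# one byte at a time so the bot keeps a connection tied up, at almost no
# cost on our side. Port and delay are arbitrary illustrative values.
import asyncio

async def tarpit(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    # A plausible-looking response header, so the client starts reading.
    writer.write(b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n")
    try:
        while True:
            writer.write(b".")        # one meaningless byte
            await writer.drain()
            await asyncio.sleep(10)   # keep the bot waiting
    except (ConnectionResetError, BrokenPipeError):
        pass                          # the client finally gave up
    finally:
        writer.close()

async def main() -> None:
    # Assumes the reverse proxy sends suspected bots to this port.
    server = await asyncio.start_server(tarpit, "0.0.0.0", 8081)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```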
They also don’t give a single flying fuck about robots.txt …
If you try to rate-limit them, they’ll just switch to other IPs all the time. If you try to block them by User Agent string, they’ll just switch to a non-bot UA string (no, really). This is literally a DDoS on the entire internet.
The only simple possible ways to keep them out are robots.txt, rate limiting by IP, and blocking by User-Agent string.
From the article, they try to bypass all of them: they ignore robots.txt, rotate through fresh IPs, and spoof regular browser User-Agent strings.
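For illustration, here is what those "simple" defenses look like in code, and why each one fails; the bot UA substrings and the 60-requests-per-minute budget are made-up values, not figures from the article:

```python
# naive_defenses.py - the two "simple" server-side defenses in one place.
# BOT_UA_SUBSTRINGS and the rate budget are illustrative values only.
import time
from collections import defaultdict, deque

BOT_UA_SUBSTRINGS = ("GPTBot", "CCBot", "Amazonbot", "Bytespider")
WINDOW_SECONDS, MAX_REQUESTS = 60.0, 60
recent: dict[str, deque[float]] = defaultdict(deque)

def should_block(ip: str, user_agent: str) -> bool:
    # Defense 1: block by User-Agent string.
    # Defeated the moment the bot sends a browser-like UA instead.
    if any(token in user_agent for token in BOT_UA_SUBSTRINGS):
        return True
    # Defense 2: sliding-window rate limit per IP.
    # Defeated by rotating IPs, each one staying under the budget.
    now = time.monotonic()
    window = recent[ip]
    window.append(now)
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_REQUESTS
```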
It then becomes a game of whack-a-mole with big tech 😓
The most infuriating part for me is that it's done by the big names, not some random startup.
Edit: Now that I think about it, this doesn't prove it is done by Google or Amazon: it could be someone using random popular user agents.
I do believe there are blocklists for their IPs out there; that should mitigate things a little.
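If you do try the blocklist route, the per-request check is at least cheap. A minimal sketch using the stdlib ipaddress module, where bot_ranges.txt (one CIDR per line) is a placeholder for whichever published list you pull:

```python
# blocklist.py - deny requests whose source IP falls in a known crawler range.
# "bot_ranges.txt" is a placeholder filename, one CIDR per line.
import ipaddress

with open("bot_ranges.txt") as f:
    BLOCKED = [ipaddress.ip_network(line.strip()) for line in f if line.strip()]

def is_blocked(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    # A linear scan is fine for a few hundred ranges; use a radix tree
    # if the list gets large.
    return any(addr in net for net in BLOCKED)
```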
A possibility to game these kinds of bots is to add a hidden link to a randomly generated page, which itself contains a link to another random page, and so on. The bots will still consume resources but will be stuck parsing random garbage indefinitely.
I know there is a website that does that, but I forget its name.
Edit: This is not the one I had in mind, but I find https://www.fleiner.com/bots/ to be a good honeypot.
Maybe you mean this incident: https://news.ycombinator.com/item?id=40001971
This is it, thanks: https://www.web.sp.am/
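For reference, the hidden-link maze described above only takes a few lines. A sketch using the stdlib http.server; seeding the junk from each page's own path keeps every page stable on revisits, and the port and word list are arbitrary choices:

```python
# maze.py - every page is deterministic junk text plus links to five more
# junk pages, so a crawler that follows them never runs out of "content".
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur"]

class MazeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        rng = random.Random(self.path)  # same path -> same garbage
        text = " ".join(rng.choices(WORDS, k=200))
        links = "".join(
            f'<a href="/page/{rng.getrandbits(32):08x}">more</a> ' for _ in range(5)
        )
        body = f"<html><body><p>{text}</p>{links}</body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8082), MazeHandler).serve_forever()
```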