• Aatube@kbin.social
    arrow-up
    11
    ·
    10 months ago

    robots.txt is purely textual; you can’t run JavaScript or log anything. Plus, one who doesn’t intend to follow robots.txt wouldn’t query it.

    • BrianTheeBiscuiteer@lemmy.world
      English
      arrow-up
      55
      ·
      10 months ago

      If it doesn’t get queried that’s the fault of the webscraper. You don’t need JS built into the robots.txt file either. Just add some line like:

      Disallow: /here-there-be-dragons.html
      

      Any client that hits that page (and maybe doesn’t pass a captcha check) gets banned. Or even better, they get a long stream of nonsense.
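
A minimal sketch of the trap described above, assuming robots.txt lists the trap page under a `Disallow` rule and bans are kept in memory. The `HoneypotGuard` class, the status codes, and the example IP are hypothetical, not something from the thread. The captcha check and the "stream of nonsense" response are left out.

```python
TRAP_PATH = "/here-there-be-dragons.html"

# robots.txt content asking well-behaved crawlers to stay away from the trap.
ROBOTS_TXT = f"User-agent: *\nDisallow: {TRAP_PATH}\n"


class HoneypotGuard:
    """Bans any client IP that requests the trap path (hypothetical helper)."""

    def __init__(self, trap_path=TRAP_PATH):
        self.trap_path = trap_path
        self.banned = set()

    def handle(self, ip, path):
        """Return an HTTP status code for a request from `ip` for `path`."""
        if ip in self.banned:
            return 403  # already banned
        if path == self.trap_path:
            self.banned.add(ip)  # ignored robots.txt: ban from now on
            return 403
        return 200


guard = HoneypotGuard()
print(guard.handle("203.0.113.7", "/index.html"))  # normal request
print(guard.handle("203.0.113.7", TRAP_PATH))      # hits the trap
print(guard.handle("203.0.113.7", "/index.html"))  # refused after the ban
```

In a real deployment the same check would sit in the web server or a middleware layer, and the ban list would likely need an expiry so shared IPs are not blocked forever.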

    • ShitpostCentral@lemmy.world
      English
      arrow-up
      16
      ·
      10 months ago

      Your second point is a good one, but you absolutely can log the IP that requested robots.txt. That’s just a standard part of any HTTP server ever, no JavaScript needed.

      • GenderNeutralBro@lemmy.sdf.org
        English
        arrow-up
        11
        ·
        10 months ago

        You’d probably have to go out of your way to avoid logging this. I’ve always seen such logs enabled by default when setting up web servers.

    • ricecake@sh.itjust.works
      English
      arrow-up
      12
      ·
      10 months ago

      People not intending to follow it is the real reason not to bother, but it’s trivial to track who downloaded the file and then hit something they were asked not to.

      Like, 10 minutes’ work to do right. You don’t need JS to do it at all.
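
A rough sketch of that tracking, assuming the server writes access logs in Common Log Format. The function name `find_violators`, the `/private/` prefix, and the sample log lines are made up for illustration. It flags IPs that fetched robots.txt and later requested a disallowed path.

```python
import re

# Matches the start of a Common Log Format line: client IP ... "METHOD path ..."
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:GET|HEAD|POST) (\S+)')


def find_violators(log_lines, disallowed_prefixes):
    """Return IPs that fetched /robots.txt and later requested a disallowed path."""
    saw_robots = set()
    violators = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        ip, path = match.groups()
        if path == "/robots.txt":
            saw_robots.add(ip)
        elif ip in saw_robots and path.startswith(tuple(disallowed_prefixes)):
            violators.add(ip)
    return violators


# Made-up sample entries using documentation-reserved IP ranges.
sample_log = [
    '203.0.113.7 - - [01/Jan/2024:00:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 64',
    '203.0.113.7 - - [01/Jan/2024:00:00:02 +0000] "GET /private/data.html HTTP/1.1" 200 512',
    '198.51.100.9 - - [01/Jan/2024:00:00:03 +0000] "GET /robots.txt HTTP/1.1" 200 64',
    '198.51.100.9 - - [01/Jan/2024:00:00:04 +0000] "GET /index.html HTTP/1.1" 200 1024',
]
print(find_violators(sample_log, ["/private/"]))
```

This only catches clients that read the file first. A scraper that never requests robots.txt would need the trap-page approach instead.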