American nonprofit OCLC is known globally for its leading database of bibliographic records, WorldCat. A few months ago, many of these records were posted publicly by the shadow library search engine, Anna’s Archive. OCLC believes that this is the result of a year-long hack and, with a lawsuit filed at an Ohio federal court, it demands damages.

WorldCat Sues Anna’s Archive

It is no secret that publishers fiercely oppose the search engine’s stated goals. The same also applies to OCLC, which has now elevated its concerns into a full-blown lawsuit, filed this month at a federal court in Ohio.

The complaint accuses Washington citizen Maria Dolores Anasztasia Matienzo and several “John Does” of operating the search engine and scraping WorldCat data. OCLC equates the scraping to a cyberattack, which it says started around the time Anna’s Archive launched.

“Beginning in the fall of 2022, OCLC began experiencing cyberattacks on WorldCat.org and OCLC’s servers that significantly affected the speed and operations of WorldCat.org, other OCLC products and services, and OCLC’s servers and network infrastructure,” OCLC’s complaint notes.

“These attacks continued throughout the following year, forcing OCLC to devote significant time and resources toward non-routine network infrastructure enhancements, maintenance, and troubleshooting.”

The non-profit says that it spent roughly $68 million over the past two years developing and enhancing WorldCat records, which are an essential part of its operation. Having a copy of the data publicly available through Anna’s Archive is a direct threat to its business.

OCLC claims that Anna’s Archive unmasked itself as the “perpetrator of the attacks on WorldCat.org” when it publicly announced its scraping effort. This includes a detailed blog post the operators published on the matter, encouraging the public to use the scraped data.

In addition to harvesting data from WorldCat.org, the defendants are also accused of obtaining and using credentials of a member library to access WorldCat Discovery Services. This opened the door to yet more detailed records that are not available on WorldCat.org.

OCLC says that it spent significant time and resources to address the ‘attacks’ on its systems.

“These hacking attacks materially affected OCLC’s production systems and servers, requiring around-the-clock efforts from November 2022 to March 2023 to attempt to limit service outages and maintain the production systems’ performance for customers.

“To respond to these ongoing attacks, OCLC spent over 1.4 million dollars on its systems’ infrastructure and devoted nearly 10,000 employee hours to the same,” the complaint adds.

  • MotoAsh@lemmy.world
    9 months ago

    I mean… it’ll all come down to how they accessed the data. If they had a public portal and no EULA, they can push rocks. If the data wasn’t public or the ‘thieves’ had to use non-standard channels, or otherwise violated an EULA, they’re likely screwed. Especially if they had to go through abnormal channels.

    I know their data can be accessed publicly, but I’m pretty sure it’s under license. You cannot just use any old thing found in public… That’s one of the biggest reasons the AI models are technically theft: they weren’t licensed to commercially profit off of 99.99% of the things their LLMs are trained on, but the law and politicians are WAY behind the times. Commercial data they’d normally have to pay for is suddenly magically OK when laundered through an LLM…

    • Snot Flickerman
      9 months ago

      https://annas-blog.org/worldcat-scrape.html

      WorldCat

      That is when we set our sights on the largest book database in the world: WorldCat. This is a proprietary database by the non-profit OCLC, which aggregates metadata records from libraries all over the world, in exchange for giving those libraries access to the full dataset, and having them show up in end-users’ search results.

      Even though OCLC is a non-profit, their business model requires protecting their database. Well, we’re sorry to say, friends at OCLC, we’re giving it all away. :-)

      Over the past year, we’ve meticulously scraped all WorldCat records. At first, we hit a lucky break. WorldCat was just rolling out their complete website redesign (in Aug 2022). This included a substantial overhaul of their backend systems, introducing many security flaws. We immediately seized the opportunity, and were able to scrape hundreds of millions (!) of records in mere days.

      After that, security flaws were slowly fixed one by one, until the final one we found was patched about a month ago. By that time we had pretty much all records, and were only going for slightly higher quality records. So we felt it is time to release!

      • MotoAsh@lemmy.world
        9 months ago

        Yea OK they’re fucked. I really really doubt they’ll be able to claim the data is solely comprised of the open works saved within that database. The only way they’d be able to get away with it is if they’ve meticulously harvested the data such that they only ever retrieved the open works or public domain works.

        Anything not in that list or otherwise made available solely via their nonprofit efforts is going to be ammo in the lawsuit. Ammo that will hit its target.

    • Dkarma@lemmy.world
      9 months ago

      “AI models are technically theft: they weren’t licensed to commercially profit off of 99.99%”

      This is simply a lie. There is no license like what you describe. You never need a license to view or learn from something given away completely free on the internet. You guys keep pretending there’s a law that says otherwise. There is not, or you’d post it.

      Copyright does not cover viewing or experiencing a piece.

      • MotoAsh@lemmy.world
        9 months ago

        Notice how I said “commercially profit” too. Read all the words next time.

        Also LLMs do not “learn” anything, you idiot. That’s the entire point. They mathematically blender things. They DO NOT learn and create.

    • BearOfaTime@lemm.ee
      9 months ago

      Honest question: if you connect to say an FTP server, and there’s no dialog claiming a EULA, would you be bound by one?

      I don’t know how they got the data, but the whole EULA thing would rely on there being proof Anna agreed to one, right? That seems a bit tricky. As for “unauthorized access”, if a path is available, and Anna used it, again with no warnings, where’s the legal line?

      Having been in civil court a few times, I’ve seen judges ask people “do you have a document proving there was an agreement?” over any circumstance that could be misconstrued or that rests on a verbal claim.

      No doc, verbal claim is dismissed unless other party admits to the verbal claim in court, to the judge.

      Just seems to me EULAs are terribly hard to enforce.

      Again, I’m more thinking out loud. I have no idea how these cases tend to proceed.

      • FigMcLargeHuge@sh.itjust.works
        9 months ago

        That is going to depend on what type of access the ftp server allows. If it’s anonymous then I would argue that no, you cannot be bound by a EULA if no dialog is presented. But the article mentions “In addition to harvesting data from WorldCat.org, the defendants are also accused of obtaining and using credentials of a member library to access WorldCat Discovery Services.” Now it’s just my speculation, but if they used someone else’s id to scrape the data, then WorldCat can just produce any documents that id agreed to, and it will apply here. Sounds like they done goofed.

      • Snot Flickerman
        9 months ago

        You are generally required to put up unauthorized access warnings.

        Similar to how you have to post “no trespassing” signs if you don’t want people trespassing on your property.

        • WarmApplePieShrek@lemmy.dbzer0.com
          9 months ago

          That’s not true. Trespass works like that because big corporations don’t get trespassed much, but they lobbied for copyright to be automatic.

      • MotoAsh@lemmy.world
        9 months ago

        I think that would depend on how intentional the open port was.

        If it’s something there and advertised, even if mentioned in one place in some archaic document, they’d probably be fine just for accessing it.

        Though that would only absolve them of acquisition issues. If they’re using someone else’s work for profit, there is almost certainly enough room for the lawsuit.

        Only a select few licenses even allow for open and unrestricted commercial use. Especially if the data itself is the licensed thing, since valuable data is far easier to convert than something like source code.