Avram Piltch is the editor in chief of Tom’s Hardware, and he’s written a thoroughly researched article breaking down the promises and failures of LLM AIs.

  • lily33@lemm.ee · 30 upvotes · edited · 1 year ago

    They have the right to ingest data, not because they’re “just learning like a human would”, but because I - a human - have a right to grab all data that’s available on the public internet, and process it however I want, including by training statistical models. The only thing I don’t have a right to do is distribute it (or works that resemble it too closely).

    If you actually show me people who are extracting books from LLMs and reading them that way, then I’d agree that would be piracy. But that’d be such a terrible experience, if it ever works, that I can’t see it actually happening.

      • lily33@lemm.ee · 22 upvotes · 1 year ago

        I’m sick and tired of this “parrots the works of others” narrative. Here’s a challenge for you: go to https://huggingface.co/chat/, input some prompt (for example, “Write a three paragraphs scene about Jason and Carol playing hide and seek with some other kids. Jason gets injured, and Carol has to help him.”). And when you get the response, try to find the author that it “parroted”. You won’t be able to - because it wouldn’t just reproduce someone else’s already made scene. It’ll mesh maaany things from all over the training data in such a way that none of them will be even remotely recognizable.

          • keegomatic@kbin.social · 13 upvotes · edited · 1 year ago

            So is your comment. And mine. What do you think our brains do? Magic?

            edit: This may sound inflammatory but I mean no offense

            • RickRussell_CA@beehaw.org (OP) · 3 upvotes · 1 year ago

              No, I get it. I’m not really arguing that what separates humans from machines is “libertarian free will” or some such.

              But we can properly argue that LLM output is derivative because we know it’s derivative, because we designed it. As humans, we have the privilege of recognizing transformative human creativity in our laws as a separate entity from derivative algorithmic output.

          • conciselyverbose@kbin.social · 7 upvotes · edited · 1 year ago

            So is literally every human work in the last 1000 years in every context.

            Nothing is “original”. It’s all derivative. Feeding copyrighted work into an algorithm does not in any way violate any copyright law, and anyone telling you otherwise is a liar and a piece of shit. There is no valid interpretation anywhere close.

            • zygo_histo_morpheus@programming.dev · 6 upvotes · edited · 1 year ago

              Every human work isn’t mechanically derivative. The entire point of the article is that the way LLMs learn and create derivative text isn’t equivalent to the way humans do the same thing.

              • conciselyverbose@kbin.social · 3 upvotes · 1 year ago

                It’s complete and utter nonsense and they’re bad people for writing it. The complexity of the AI does not matter and if it did, they’re setting themselves up to lose again in the very near future when companies make shit arbitrarily complex to meet their unhinged fake definitions.

                But none of it matters because literally no part of this in any way violates copyright law. Processing data is not and does not in any way resemble copyright infringement.

                • RickRussell_CA@beehaw.org (OP) · 3 upvotes · 1 year ago

                  This issue is easily resolved. Create the AI that produces useful output without using copyrighted works, and we don’t have a problem.

                  If you take the copyrighted work out of the input training set, and the algorithm can no longer produce the output, then I’m confident saying that the output was derived from the inputs.

                  • conciselyverbose@kbin.social · 2 upvotes · 1 year ago

                    There is literally not one single piece of art that is not derived from prior art in the past thousand years. There is no theoretical possibility for any human exposed to human culture to make a work that is not derived from prior work. It can’t be done.

                    Derivative work is not copyright infringement. Straight up copying someone else’s work directly and distributing that is.

          • lily33@lemm.ee · 4 upvotes · edited · 1 year ago

            From Wikipedia, “a derivative work is an expressive creation that includes major copyrightable elements of a first, previously created original work”.

            You can probably call the output of an LLM ‘derived’, in the same way that if I counted the number of ‘Q’s in Harry Potter, the result would be derived from Rowling’s work.

            But it’s not ‘derivative’.

            Technically it’s possible for an LLM to output a derivative work if you prompt it to do so. But most of its outputs aren’t.

            • RickRussell_CA@beehaw.org (OP) · 4 upvotes · 1 year ago

              a derivative work is an expressive creation that includes major copyrightable elements of a first, previously created original work

              What was fed into the algorithm? A human decided which major copyrighted elements of previously created original work would seed the algorithm. That’s how we know it’s derivative.

              If I take somebody’s copyrighted artwork, and apply Photoshop filters that change the color of every single pixel, have I made an expressive creation that does not include copyrightable elements of a previously created original work? The courts have said “no”, and I think the burden is on AI proponents to show how they fed copyrighted work into a mechanical algorithm, and produced a new expressive creation free of copyrightable elements.

              • lily33@lemm.ee · 4 upvotes · edited · 1 year ago

                I think the test for “free of copyrightable elements” is pretty simple - can you look at the new creation and recognize any copyrightable elements in it? The process by which it was created doesn’t matter. Maybe I made this post entirely by copy-pasting phrases from other people, who knows (well, I didn’t, only because it would be too much work), but it does not infringe either way…
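One crude, mechanical stand-in for "can you recognize anything in it" is to check whether the new text shares any long word sequences with a known work. The sketch below is only an illustration under that assumption: the function names, the 5-word window, and the sample strings are all made up for the example, and verbatim overlap is not a legal test for copyrightable elements.

```python
def ngrams(text, n):
    """Return the set of all n-word sequences in text (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_ngrams(candidate, source, n=5):
    """Word n-grams that appear verbatim in both texts."""
    return ngrams(candidate, n) & ngrams(source, n)


# Hypothetical sample strings, chosen only to show both outcomes.
source = "it was the best of times it was the worst of times"
copied = "he wrote that it was the best of times it was the end"
fresh = "the kids counted to ten and ran off to hide behind the shed"

print(len(shared_ngrams(copied, source)))  # some overlap is expected here
print(len(shared_ngrams(fresh, source)))   # no overlap is expected here
```

A check like this looks only at the output, not at how it was produced, which is the point being argued above. It would miss paraphrase and close imitation entirely.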

        • state_electrician@discuss.tchncs.de · 0 upvotes · 1 year ago

          Well, I think that these models learn in a way similar to humans, in that it’s basically impossible to tell where parts of the model came from. And as such the copyright claims are ridiculous. We need less copyright, not more. But, on the other hand, LLMs are not humans, they are tools created by and owned by corporations and I hate to see them profiting off of other people’s work without proper compensation.

          I am fine with public domain models being trained on anything and being used for noncommercial purposes without being taken down by copyright claims.

          • RickRussell_CA@beehaw.org (OP) · 1 upvote · 1 year ago

            it’s basically impossible to tell where parts of the model came from

            AIs are deterministic.

            1. Train the AI on data without the copyrighted work.

            2. Train the same AI on data with the copyrighted work.

            3. Ask the two instances the same question.

            4. The difference is the contribution of the copyrighted work.

            There may be larger questions of precisely how an AI produces one answer when trained with a copyrighted work, and another answer when not trained with the copyrighted work. But we know why the answers are different, and we can show precisely what contribution the copyrighted work makes to the response to any prompt, just by running the AI twice.
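The four steps above can be sketched with a toy model. This is a minimal illustration, not how an LLM is actually trained: the "AI" here is a word-bigram counter with greedy decoding, picked because it is fully deterministic, and the corpus, `train`, and `generate` are hypothetical names invented for the example. A real LLM would need fixed random seeds and identical training settings before two runs could be compared this way.

```python
from collections import Counter, defaultdict


def train(corpus):
    """Count word bigrams; a deterministic stand-in for 'training'."""
    model = defaultdict(Counter)
    for doc in corpus:
        words = doc.lower().split()
        for prev, nxt in zip(words, words[1:]):
            model[prev][nxt] += 1
    return model


def generate(model, start, length=6):
    """Greedy decoding: take the most frequent next word, ties broken alphabetically."""
    out = [start]
    for _ in range(length - 1):
        followers = model.get(out[-1])
        if not followers:
            break
        out.append(min(followers, key=lambda w: (-followers[w], w)))
    return " ".join(out)


# Hypothetical training data; the last string plays the copyrighted work.
base_corpus = ["the cat sat on the mat", "the dog sat on the rug"]
copyrighted_work = "the cat chased the red ball"

without_work = train(base_corpus)                    # step 1
with_work = train(base_corpus + [copyrighted_work])  # step 2

prompt = "cat"                                       # step 3
print(generate(without_work, prompt))
print(generate(with_work, prompt))                   # step 4: compare the two
```

In this toy case the two outputs differ only because of the added text. Whether such a difference can be measured this cleanly at LLM scale is a separate question.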

      • RandoCalrandian@kbin.social · 4 upvotes · 1 year ago

        Is there a meaningful difference between reproducing the work and giving a summary? Because I’ll absolutely be using AI to filter all the editorial garbage out of news, set up and trained myself to surface what is meaningful to me, stripped of all advertising, sponsorships, and detectable bias.

        • Tarte@kbin.social · 6 upvotes · edited · 1 year ago

          I have yet to find an LLM that can summarize a text without errors. I already mentioned this in another post a few days back, but Google’s new search preview is driving me mad with all the hidden factual errors. They make me click only to realize that the LLM told me what I wanted to find, not what is there (wrong names, wrong dates, etc.).

          I greatly prefer the old excerpt summaries over the new imaginary ones (they’re currently A/B testing).

    • donuts@kbin.social · 20 upvotes · edited · 1 year ago

      You’re making two big, incorrect assumptions:

      1. Simply seeing something on the internet does not give you any legal or moral rights to use that thing in any way other than things which are, or have previously been, deemed to be “fair use” by a court of law. Individuals have personal rights over their likeness and persona, and copyright holders have rights over their works, whether they are on the internet or not. In other words, there is a big difference between “visible in public” and “public domain”.
      2. More importantly, something that might be considered “fair use” for a human being to do is not necessarily “fair use” when a computer or “AI” does it. Judgements of what is and is not fair use are made on a case by case basis as a legal defense against copyright infringement claims, and multiple factors (purpose of use, nature of original work, amount and substantiality of use, market effect, etc.) are often taken into consideration. At the very least, AI use has serious implications for substantiality and markets, especially compared to examples of human use.

      I know these are really tough pills for AI fans to swallow, but you know what they say… “If it seems too good to be true, it probably is.”

      • lily33@lemm.ee · 10 upvotes · edited · 1 year ago

        On the contrary - the reason copyright is called that is because it started as the right to make copies. Since then it’s been expanded to include more than just copies, such as distributing derivative works.

        But the act of distribution is key. If I wanted to, I could write whatever derivative works in my personal diary.

        I also have the right to count the number of occurrences of the letter ‘Q’ in Harry Potter without Rowling’s permission. I can also post my count online for other lovers of ‘Q’, because it’s not derivative (it is ‘derived’, but ‘derivative’ is different - according to Wikipedia it means ‘includes major copyrightable elements’).

        Or do more complex statistical analysis.
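For what it's worth, the counting described above is only a few lines. A minimal sketch, using a made-up sample sentence in place of any real book; `letter_count` and `top_words` are hypothetical helper names, and the word-frequency table stands in for "more complex statistical analysis":

```python
from collections import Counter


def letter_count(text, letter):
    """Case-insensitive count of one letter in text."""
    return text.lower().count(letter.lower())


def top_words(text, k=3):
    """The k most frequent words, ignoring case and surrounding punctuation."""
    words = [w.strip(".,!?;:\"'()").lower() for w in text.split()]
    return Counter(w for w in words if w).most_common(k)


# Placeholder text, not taken from any copyrighted work.
sample = "The quick queen quietly quit. The queen did not quibble."

print(letter_count(sample, "q"))
print(top_words(sample))
```

Neither result contains any of the original sentences. Both are numbers and short word lists computed from the text.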

    • raccoona_nongrata@beehaw.org · 15 upvotes · edited · 1 year ago

      I think this opinion is going to be looked at a lot like the anti-privacy arguments when Facebook and Google were first revealed to be massively invading people’s privacy.

      We look at those platforms with disdain now, but at the time all you ever heard people saying over and over was “If you have nothing to hide who cares about privacy?” and “Anything you put on the Internet is fair game.” and “your privacy is already gone, nothing we can or should do now.”

      And then that careless attitude led to things that those people hadn’t foreseen, like the Cambridge Analytica scandal, massive troll farm campaigns and Trump’s election.

      Looking back we’re going to see this argument about data scraping to fuel LLMs in the same way.