Thousands of authors demand payment from AI companies for use of copyrighted works

Thousands of published authors are requesting payment from tech companies for the use of their copyrighted works in training artificial intelligence tools, marking the latest intellectual property critique to target AI development.

  • nickwitha_k (he/him)@lemmy.sdf.org
    1 year ago

    No, it really doesn’t, nor does it function like human cognition. Take this example:

    Say that I, personally, decide that I want to make a sci-fi show. I don’t want to come up with ideas, so I want to try something that already works. I take the scripts of Star Trek: The Search for Spock, Alien, and Earth Girls Are Easy and feed them into a database, separating the words into individual data entries with some grammatical classification. Then, using this database, I generate a script, averaging the length of the films, with every word chosen based upon its occurrence in the films, or randomized if it’s a tie. I go straight into production with “Star Alien: The Girls Are Spock”. I am immediately sued by Disney, Lionsgate, and Paramount for trademark and copyright infringement, even though I basically just used a small LLM.
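    The pipeline in that example can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions: the three short strings are made-up placeholders rather than real script text, the names `SCRIPTS`, `build_database`, and `generate_script` are hypothetical, the grammatical classification step is left out, and frequency-weighted random sampling stands in for "based upon its occurrence".

```python
import random
from collections import Counter

# Placeholder stand-ins for the three scripts; not real screenplay text.
SCRIPTS = {
    "film_a": "the ship leaves the station and the crew search the planet",
    "film_b": "the crew wake and the alien hunts the crew on the ship",
    "film_c": "the aliens land in the valley and the girls dance",
}


def build_database(scripts):
    """Split every script into words and count how often each word occurs."""
    counts = Counter()
    for text in scripts.values():
        counts.update(text.lower().split())
    return counts


def generate_script(scripts, seed=0):
    """Generate a word list whose length is the average script length.

    Each word is drawn at random, weighted by how often it occurs across
    all scripts (an assumed reading of the selection rule in the comment).
    """
    rng = random.Random(seed)
    database = build_database(scripts)
    words = list(database)
    weights = [database[word] for word in words]
    average_length = round(
        sum(len(text.split()) for text in scripts.values()) / len(scripts)
    )
    return rng.choices(words, weights=weights, k=average_length)


print(" ".join(generate_script(SCRIPTS)))
```

    Every word in the output comes from the placeholder scripts, and the output is as long as the average input.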

    You are right that nothing is created in a vacuum. However, plagiarism is still plagiarism, even if it is using a technically sophisticated LLM plagiarism engine.

    • joe@lemmy.world
      1 year ago

      ChatGPT doesn’t have direct access to the material it’s trained on. Go ask it to quote a book to you.

      • nickwitha_k (he/him)@lemmy.sdf.org
        1 year ago

        That really doesn’t make an appreciable difference. It doesn’t need direct access to the source data if that data has already been transformed into statistical data.

        • joe@lemmy.world
          1 year ago

          It does rule out “plagiarism”, however, since it means it can’t pull directly from any training material.

          I should have asked earlier: what do you think plagiarism is?

          • nickwitha_k (he/him)@lemmy.sdf.org
            1 year ago

            It really doesn’t. The data is just tokenized and encoded into the model (with additional metadata).

            If I take the following:

            Three blind mice, three blind mice
            See how they run, see how they run

            And encode it based upon frequency:

            1: {"word": "three", "qty": 2}
            2: {"word": "blind", "qty": 2}
            3: {"word": "mice", "qty": 2}
            4: {"word": "see", "qty": 2}
            5: {"word": "how", "qty": 2}
            6: {"word": "they", "qty": 2}
            7: {"word": "run", "qty": 2}

            The original data is still present, just not in its original form. If I were then to use the data to generate a rhyme and claim authorship, I would both be lying and committing plagiarism, which is the act of attempting to pass someone else’s work off as your own.
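            For anyone who wants to reproduce that table, here is a minimal Python sketch of the frequency encoding. The function name `encode_by_frequency` and the choice to lowercase the words and drop punctuation are assumptions made for illustration, and this is a toy word count, not how an LLM stores its training data.

```python
import re
from collections import Counter

RHYME = "Three blind mice, three blind mice See how they run, see how they run"


def encode_by_frequency(text):
    """Lowercase the text, split it into words, and count each word."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Counter preserves first-seen order, so the numbering follows the rhyme.
    return {
        index: {"word": word, "qty": qty}
        for index, (word, qty) in enumerate(counts.items(), start=1)
    }


encoded = encode_by_frequency(RHYME)
for index, entry in encoded.items():
    print(index, entry)
```

            The result has seven numbered entries, one per distinct word, each with a quantity of 2.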

            Out of curiosity, do you currently make, or intend to make, money using LLMs? I ask because I’m wondering if this is an example of Upton Sinclair’s statement “It is difficult to get a man to understand something when his salary depends on his not understanding it.”

            • joe@lemmy.world
              1 year ago

              That’s not how LLMs work, and no, I have no financial skin in the game. My field is software QA; I can’t nail down whether it would affect me or not, because I could imagine it going either way. I do know that it doesn’t matter-- legislation is not going to stop this-- it’s not even going to do much to slow it down.

              What about you? I find that most of the hysteria around LLMs comes from people whose jobs are on the line. Does that accurately describe you?

              Edit: typos

              • nickwitha_k (he/him)@lemmy.sdf.org
                1 year ago

                It is not literally how they work, no, but it is an oversimplified approximation. Data is encoded into mathematical functions in neural network nodes but, it is still encoded data in the same way that an MP3 and WAV of a song are both still the song; the neural network is the medium.

                Just because the data is stored in a different, possibly more efficient manner doesn’t mean that it is not there for all intents and purposes. (I suppose one could argue that it has been transformed into metadata, but if the model is able to reconstruct its sources verbatim, that seems like a fallacy.) Nor is it within the fair use exemptions of most IP laws to use others’ copyrighted, trademarked, or copyleft data to power a commercial product in ways contrary to licensing terms.

                As for my job, well, yes, I do have some anxieties in that area, but as a software engineer focused on automation, tooling, and security, I suspect that my position is fairly secure. I would hope yours is too, both for yourself and for overall software quality. There will likely be more demand for both of our skill sets with the EU’s Cyber Resilience Act (CRA).

                • joe@lemmy.world
                  1 year ago

                  Data is encoded into mathematical functions in neural network nodes but, it is still encoded data in the same way that an MP3 and WAV of a song are both still the song; the neural network is the medium.

                  Here: https://www.understandingai.org/p/large-language-models-explained-with

                  It’s not plagiarism by any definition of the word that makes sense; while the analogy may not be literal, it is perfectly analogous to suggest that learning new words from a Harry Potter book means that any book you write going forward is plagiarizing JK Rowling; the training data helps map the words in the model-- it’s never used as a blueprint when predicting what word comes next in any given scenario. It’s even farther away from copyright infringement-- there is no limited right granted that allows a IP holder to say how that IP can be processed. That’s just not a thing. You’d have just as much leg to stand on if you suggested that Stephen King had the right to prevent people from reading his books in a room with green walls. You can’t just make up new rights. Trademark law is totally insane. I don’t know why you even mention it. It doesn’t even have the same goals as the others.

                  as a software engineer

                  I am not so sure that this specific role is in any way secure, myself. You may come to the same conclusion after reading that link I provided-- pay attention to how rapidly the LLMs are growing in complexity. I do not wish for anyone to lose their financial security, even a stranger like you, but I can’t help but look at the available information and come to that conclusion.

                  • nickwitha_k (he/him)@lemmy.sdf.org
                    1 year ago

                    there is no limited right granted that allows a IP holder to say how that IP can be processed.

                    There very much is. Literally all intellectual property law concerns how intellectual property may or may not be used and licensed. For example, one may not record and sell a cover of a song that is in copyright without explicit permission in the form of a mechanical license. In our industry, one may not use code that is covered by the GNU GPL without fulfilling its source code distribution requirements (see: the IBM Red Hat drama).

                    The training data is what gives the LLM its value in the problematic situations, so it is very clear that the material is a key component of the business plan and commercial use. This is not an educational, parody, or other exempt fair-use activity. This means that if any data used for training is not licensed appropriately, such use is a clear violation of intellectual property laws, even if not explicitly covered because the technology did not exist when they were written.

                    I am not so sure that this specific role is in any way secure, myself. You may come to the same conclusion after reading that link I provided-- pay attention to how rapidly the LLMs are growing in complexity. I do not wish for anyone to lose their financial security, even a stranger like you, but I can’t help but look at the available information and come to that conclusion.

                    I do agree that there are software engineering jobs at risk in the short term, due to management’s desire to cut labor while riding the hype train as well as US taxation on R&D. But given the widespread failures found when companies have replaced engineers and others, I expect a wave of desperate re-hiring to occur 1-3 years after the layoffs. The particular segment that I’m involved in is generally considered high-ROI, so it is likely less vulnerable (but no guarantee).

                    I don’t see how QA could be sanely replaced, though. From my experience, it’s already frequently under-funded and, as I mentioned, for all the bad in the CRA drafts, one of the positives is that QA-related work is going to be mandatory for software and devices sold in the EU market.