I created this account two days ago, but one of my posts has already ended up in the (metaphorical) hands of an AI-powered search engine with scraping capabilities. What do you guys think about this? How do you feel about your posts/content getting scraped off the web and potentially being used by AI models and/or AI-powered tools? Curious to hear your experiences and thoughts on this.
# Prompt Update
The prompt was something like, "What do you know about the user llama@lemmy.dbzer0.com on Lemmy? What can you tell me about his interests?" Initially, it generated a lot of fabricated information, but it would still include one or two accurate details. When I ran the test again, the response was much more accurate compared to the first attempt. It seems that as my account became more established, it became easier for the crawlers to find relevant information.
It even talked about this very post in item 3 and in the second bullet point of the "Notable Posts" section.
For more information, check this comment.
Edit¹: This is Perplexity, an advanced conversational search engine that enhances the research experience by providing concise, sourced answers to user queries. It operates by leveraging AI language models, such as GPT-4, to analyze information from various sources on the web. Perplexity employs data scraping to gather that information: automated crawlers index and extract content from websites, including articles, summaries, and other relevant data, which then feeds its large language models (LLMs) when generating responses. (12/28/2024)
Edit²: One could argue that data scraping by services like Perplexity raises privacy concerns, because it collects and processes vast amounts of online information without explicit user consent, potentially including personal data, comments, or content that individuals posted without expecting it to be aggregated and/or analyzed by AI systems. One could also argue that this indiscriminate collection raises questions about data ownership, proper attribution, and the right to control how one's digital footprint is used in training AI models. (12/28/2024)
Edit³: I added the second image to the post, along with its description. (12/29/2024)
It’s Perplexity AI, so it’ll do web searches on demand. You asked about your username, so it searched for your username on the web. Fediverse content is indexed, even content from instances that block web crawling (e.g., via robots.txt or server-side UA blacklisting), because that content gets federated to other servers that web crawlers do index.
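As a minimal sketch of why robots.txt alone can't protect federated content: a disallow rule only binds crawlers on the instance that publishes it, while federated copies of the same post live on other servers with their own (possibly permissive) rules. The crawler name and instance URL below are hypothetical, checked with Python's stdlib `urllib.robotparser`:

```python
from urllib import robotparser

# Hypothetical robots.txt of a home instance that blocks an AI crawler
# but allows everyone else.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# The home instance refuses the AI crawler...
print(rp.can_fetch("PerplexityBot", "https://example.instance/post/123"))

# ...but any other crawler is allowed, and a federated mirror of the same
# post on another server is governed by that server's robots.txt instead.
print(rp.can_fetch("GenericBot", "https://example.instance/post/123"))
```

The rule is advisory anyway: a crawler that ignores robots.txt, or that fetches the post from a federating server with no such rule, sees the content regardless.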
Now, when it comes to offline models and pre-trained content, the way transformers work will often “scramble” the art and the artist. If a piece of content doesn’t explicitly mention its author (and isn’t widely spread across different sources), an LLM will “know” the information you posted online, but it won’t be capable of linking that content back to you when asked about it.
Let me give an example: suppose you wrote a unique quote. Nobody else wrote it. You published it on Lemmy. Your quote becomes part of the training data for GPT-n or any other LLM out there. When anyone asks them “Who said the quote ‘…’?”, it’ll either hallucinate (e.g., citing some random famous writer) or say something like “I don’t have that information.”
It’s why AIs are often (and understandably) called plagiarists by anti-AI people: they don’t cite their sources. Technically, current state-of-the-art transformers can’t, because an LLM is, under the hood, a fancy-crazy kind of “Will it blend?” run over entire corpora from across the web. AI devs gather as much data as they possibly can (legally or illegally), drop it all into the “AI blender cup”, and voilà, an LLM is trained, without actually storing each piece of content in full, just their statistical associations.
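The "statistical associations, not documents" point can be illustrated with a deliberately tiny toy: a bigram counter (a vastly simplified stand-in for transformer training, with a made-up two-line corpus). After "training", only co-occurrence counts survive; the original posts, and any notion of who wrote them, are gone:

```python
from collections import Counter, defaultdict

# Toy corpus: imagine these are two scraped posts (authors not recorded).
corpus = [
    "the quick brown fox",
    "the quick red fox",
]

# "Training": store only token co-occurrence statistics, not the documents.
bigrams = defaultdict(Counter)
for doc in corpus:
    tokens = doc.split()
    for a, b in zip(tokens, tokens[1:]):
        bigrams[a][b] += 1

# The "model" can predict a likely continuation of "the"...
print(bigrams["the"].most_common(1)[0][0])  # "quick"

# ...but it holds no documents and no authorship, only counts.
print(dict(bigrams["quick"]))  # {'brown': 1, 'red': 1}
```

Real LLMs are enormously more capable (and can sometimes regurgitate well-memorized text), but the stored artifact is likewise weights encoding statistics, not a retrievable archive of attributed posts.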
I understand that Perplexity employs various language models to handle queries, and that the responses it generates may not come directly from those models’ training data, since a significant portion of the output is drawn from what it scrapes off the web. However, a significant concern for some individuals is the potential for their posts to be scraped and also used to train AI models, hence my post.
I’m not anti-AI, and I see your point that transformers often dissociate content from its creator. However, one could argue this doesn’t fully mitigate the concern. Even if the model can’t link the content back to its original author, it’s still using their data without explicit consent. The fact that LLMs might hallucinate or fail to attribute quotes accurately doesn’t resolve the potential plagiarism issue; instead, it highlights another problematic aspect of these models, imo.