cross-posted from: https://infosec.pub/post/8775123

Reddit said in a filing to the Securities and Exchange Commission that its users’ posts are “a valuable source of conversation data and knowledge” that has been and will continue to be an important mechanism for training AI and large language models. The filing also states that the company believes “we are in the early stages of monetizing our user base,” and proceeds to say that it will continue to sell users’ content to companies that want to train LLMs and that it will also begin “increased use of artificial intelligence in our advertising solutions.”

The long-awaited S-1 filing reveals much of what Reddit users knew and feared: That many of the changes the company has made over the last year in the leadup to an IPO are focused on exerting control over the site, sanitizing parts of the platform, and monetizing user data.

Posting here because of the privacy implications of all this, but I wonder if at some point there should be an “Enshittification” community :-)

  • umbraroze@kbin.social
    8 months ago

    Reddit has a user data checkout feature (IIRC, check the user settings or maybe the Reddit help pages to find it).

    It’s a bit crap though.

    It takes a long time to process, especially if you happened to post in the era when the Reddit data infrastructure was horribly terrible instead of merely ordinarily terrible, and apparently, in the worst cases, it involves some manual work by the staff.

    Some data may be missing or truncated. It doesn’t give you data from privated/banned subreddits (which was a fun thing to discover because last time I tried to do this the blackouts were on), and even for legit stuff, long comments/posts may be truncated. Even so, I’m pretty sure that the dumps just straight up didn’t have all of my posts from several years ago, even if those were on public subreddits. So you need to make sure the checked out data is sensible.

    In conjunction with the official dumps, I recommend a few other tools, especially since the dumps aren’t really magnificently usable on their own. One tool that I found personally invaluable is reddit-user-to-sqlite, which lets you import Reddit data dumps and whatever live user data is available (I think it does this by scraping or something; I’m sure it worked despite the API being shut down) into an SQLite database, and Datasette is a nice frontend for browsing the posts.
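    To give a rough idea of what the import step amounts to, here is a minimal sketch using only Python's standard library. This is not how reddit-user-to-sqlite works internally. It assumes the dump contains CSV files with a header row; the file name comments.csv, the table name, and the function name load_csv_into_sqlite are placeholders I made up for illustration.

```python
import csv
import sqlite3


def load_csv_into_sqlite(csv_path, db_path, table):
    """Copy one CSV file (with a header row) into a SQLite table.

    Returns the number of rows in the table afterwards.
    The function name and arguments are hypothetical, for illustration only.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)

        # Quote identifiers so unusual column or table names cannot break the SQL.
        cols = ", ".join('"' + c.replace('"', '""') + '"' for c in header)
        marks = ", ".join("?" for _ in header)
        quoted_table = '"' + table.replace('"', '""') + '"'

        conn = sqlite3.connect(db_path)
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(f"CREATE TABLE IF NOT EXISTS {quoted_table} ({cols})")
                # Rows whose field count does not match the header are skipped.
                conn.executemany(
                    f"INSERT INTO {quoted_table} VALUES ({marks})",
                    (row for row in reader if len(row) == len(header)),
                )
            count = conn.execute(
                f"SELECT COUNT(*) FROM {quoted_table}"
            ).fetchone()[0]
        finally:
            conn.close()
    return count


if __name__ == "__main__":
    # Assumed file name; adjust to whatever the dump actually contains.
    print(load_csv_into_sqlite("comments.csv", "reddit.db", "comments"))
```

    Comparing the returned row count against what you remember posting is a cheap first check for the missing-data problem mentioned above. The resulting file can then be opened with Datasette (for example `datasette reddit.db`) to browse it.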

    As for scrubbing, there are tools for that which are supposed to work. I think.