- cross-posted to:
- technology@lemmy.zip
However, because he knew that the dataset was “being fed by essentially unguided crawling” of the web, including “a significant amount of explicit material,” he also didn’t rule out the possibility that image generators could also be directly referencing CSAM included in the LAION-5B dataset.
I wonder what the minimum dataset size would be to produce a useful LLM. Dealing with a massive, uncurated dataset seems like a bad idea if you're concerned about harmful outputs.
I assumed the images were used for an image generator, not an LLM.