Comment on A Project to Poison LLM Crawlers
Taldan@lemmy.world 4 days ago
Let’s say I believe you. If that’s the case, why are AI companies still scraping everything?
FaceDeer@fedia.io 4 days ago
Raw materials to inform the LLMs constructing the synthetic data, most likely. If you want it to be up to date on the news, you need to give it that news.
The point is not that the scraping doesn't happen; it's that the data is already heavily processed and filtered before it gets to the LLM training step. There's a ton of "poison" in that data naturally already. Early LLMs like GPT-3 just swallowed the poison and muddled on, but researchers have since learned how much better LLMs can be when trained on cleaner data, so they already take steps to clean it up.
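To give a rough idea of what "cleaning it up" can look like, here's a toy sketch of a heuristic filter. This is purely illustrative and not any company's actual pipeline: the function names (`looks_clean`, `filter_corpus`) and every threshold are made up for the example, and real pipelines are presumably far more elaborate.

```python
import hashlib


def looks_clean(text, min_words=5, max_repeat_ratio=0.5, min_alpha_ratio=0.6):
    """Cheap heuristics; all thresholds are arbitrary illustrative values."""
    words = text.split()
    if len(words) < min_words:
        return False  # too short to be useful
    # Share of words that merely repeat an earlier word (case-insensitive).
    repeat_ratio = 1 - len({w.lower() for w in words}) / len(words)
    if repeat_ratio > max_repeat_ratio:
        return False  # looks like keyword spam or generated filler
    # Share of characters that are letters or whitespace.
    alpha_ratio = sum(c.isalpha() or c.isspace() for c in text) / len(text)
    return alpha_ratio >= min_alpha_ratio


def filter_corpus(docs):
    """Drop exact duplicates (ignoring case and outer whitespace), then junk."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # already saw this document
        seen.add(digest)
        if looks_clean(doc):
            kept.append(doc)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog near the river.",
    "the quick brown fox jumps over the lazy dog near the river.",
    "buy buy buy buy buy buy buy buy buy buy",
    "$$$ ### @@@ !!! %%% ^^^ &&& ***",
    "too short",
]
print(filter_corpus(docs))
```

Only the first document survives: the second is a duplicate, the third is repetitive spam, the fourth is mostly symbols, and the last is too short. Crude rules like these already catch a lot of low-effort garbage before anything more sophisticated runs.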