Comment on 'Meta Torrented over 81 TB of Data Through Anna's Archive, Despite Few Seeders' * TorrentFreak
Grimy@lemmy.world 1 week ago
All LLMs and Gen AI use data they don’t own. The Pile, which served as a starting point for most LLMs, is all scraped or pirated info. Image gen is all scraped from the web. Speech to text and video gen mainly use YouTube data.
So either you put a price tag on that data, which means only a handful of companies (including Meta) can afford to build these tools, or you accept that piracy is the only way for most to acquire this data. And since the use is highly transformative, it isn’t breaching copyright or directly stealing from the owners the way piracy “normally” does.
I’m being pragmatic.