Comment on Congress Wants Tech Companies to Pay Up for AI Training Data

<- View Parent
Mechanize@feddit.it ⁨10⁩ ⁨months⁩ ago

Any foundation model is trained on a subset of common crawl.

All the data in there is, arguably, copyrighted by one individual or another. There is no equivalent open - or closed - source dataset to it.

Each single post, page, blog, site, has a copyright holder. In the last year big companies have started to change their TOS to make that they are able to use, relicense and generally sell your data hosted in their services as their own for the intent of AI training, so potentially some small parts of common crawl will be licensable in bulk - or directly obtained from the source.

This does still leave out the majority of the data directly or indirectly used today, even if you were willing to pay, because it is unfeasable to search and contract every single rights holder.

On the other side of it there have been work to use less but more heavily curated data, which could potentially generate good small, domain specific, models. But still they will not be like the ones we currently have, and the open source community will not be able to have access to the same amount and quality of data.

It’s an interesting problem that I’m personally really interested to see where it leads.

source
Sort:hotnewtop