Comment on Google Researchers’ Attack Prompts ChatGPT to Reveal Its Training Data

ignirtoq@kbin.social 10 months ago

Whether or not data was openly accessible doesn’t really matter [...] ChatGPT also isn’t just reading the data at its source, it’s copying it into its training dataset, and that copying is unlicensed.

Actually, the act of copying a work covered by copyright is not itself illegal. If I check out a book from a library and copy a passage (or the whole book!) for rereading myself, or for some other use limited strictly to myself, that's legal. If I turn around and share that passage with a friend in a way that's not covered under fair use, that's illegal. It's the act of distributing the copy that's illegal.

That's why whether the AI model is publicly accessible does matter. A company is considered a "person" under copyright law, so OpenAI can scrape all the copyrighted works off the internet it wants, as long as it doesn't break laws to gain access to them. In other words, articles freely available on CNN's website are free to be copied (but not distributed), but if you circumvent the New York Times' paywall to get articles you didn't pay for, that's not legal access. OpenAI then encodes those copyrighted works in its models' weights. If it provides open access to those models, and people execute these attacks to recover pristine copies of copyrighted works, that's illegal distribution. If it keeps access only for employees, and they execute attacks that recover pristine copies of copyrighted works, that stays within the use of the "person" (the company), so it is not illegal. If it lets its employees take the copyrighted works home for non-work use (or use the AI model for non-work use and recover the pristine copies), that's illegal distribution.
