Comment on [AI] Niwatari Kutaka

Even_Adder@lemmy.dbzer0.com 1 week ago

I’m not telling you to ponder this from a legal perspective; look at what those laws protect from an ethical perspective. And I urge you again to actually read the material. It goes in depth, explaining how all this works and the ways in which it’s all related. A quick excerpt:

Break down the steps of training a model and it quickly becomes apparent why it’s technically wrong to call this a copyright infringement. First, the act of making transient copies of works – even billions of works – is unequivocally fair use. Unless you think search engines and the Internet Archive shouldn’t exist, then you should support scraping at scale:

pluralistic.net/…/how-to-think-about-scraping/

And unless you think that Facebook should be allowed to use the law to block projects like Ad Observer, which gathers samples of paid political disinformation, then you should support scraping at scale, even when the site being scraped objects (at least sometimes):

pluralistic.net/2021/…/get-you-coming-and-going/#…
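To make the “transient copies” point concrete, here is a minimal sketch (my own illustration, not from the quoted articles; the function name and the word-tokenizing regex are assumptions): the fetched copy of a work exists only inside the function, and only derived statistics survive it.

```python
import re
from collections import Counter
from urllib.request import urlopen

def transient_word_counts(url):
    """Fetch a document, tally word frequencies, and discard the copy.

    The fetched text is a transient copy: it is garbage-collected when
    this function returns. Only the quantitative observations (word
    counts) are retained -- not the work itself.
    """
    with urlopen(url) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)
```

A crawler doing this at scale keeps only such derived statistics per page, which is the shape of the analysis step the excerpt goes on to describe.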

After making transient copies of lots of works, the next step in AI training is to subject them to mathematical analysis. Again, this isn’t a copyright violation.

Making quantitative observations about works is a longstanding, respected and important tool for criticism, analysis, archiving and new acts of creation. Measuring the steady contraction of the vocabulary in successive Agatha Christie novels turns out to offer a fascinating window into her dementia:

theguardian.com/…/agatha-christie-alzheimers-rese…
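The kind of measurement behind the Christie finding can be sketched in a few lines (a toy illustration under my own assumptions; the actual study used controlled sample sizes and more careful tokenization): compute each text’s vocabulary richness and compare it across successive novels.

```python
import re

def vocabulary_richness(text):
    """Type-token ratio: distinct words divided by total words.

    A steady fall in this ratio across an author's successive books is
    the kind of quantitative observation the Christie research drew on.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0
```

Note that raw type-token ratios shrink as texts get longer, so a real comparison would measure equal-sized samples from each novel.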

Programmatic analysis of scraped online speech is also critical to the burgeoning formal analyses of the language spoken by minorities, producing a vibrant account of the rigorous grammar of dialects that have long been dismissed as “slang”:

researchgate.net/…/373950278_Lexicogrammatical_An…

Since 1988, the UCL Survey of English Usage has maintained its “International Corpus of English,” and scholars have plumbed its depths to draw important conclusions about the wide variety of Englishes spoken around the world, especially in postcolonial English-speaking countries:

www.ucl.ac.uk/english-usage/projects/ice.htm

The final step in training a model is publishing the conclusions of the quantitative analysis of the temporarily copied documents as software code. Code itself is a form of expressive speech – and that expressivity is key to the fight for privacy, because the fact that code is speech limits how governments can censor software:

eff.org/…/remembering-case-established-code-speec…

If you’re not willing to do that, there isn’t much I can do, since all of your questions are answered there.
