Comment on AI industry horrified to face largest copyright class action ever certified
FauxLiving@lemmy.world 2 days ago
People cheering for this have no idea of the consequences of their copyright-maximalist position.
If using images, text, etc. to train a model is copyright infringement, then there will be NO open models, because open-source model creators could not possibly obtain the licensing for every piece of written or visual media in the Common Crawl dataset, which is what most of these things are trained on.
As it stands now, corporations don’t have a monopoly on AI specifically because copyright doesn’t apply to AI training. Everyone has access to Common Crawl and the other large, public datasets made from crawling the public Internet, so anyone can train a model on their own without worrying about obtaining billions of different licenses from every single individual who has ever written a word or drawn a picture.
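To illustrate how open this access is, here is a minimal Python sketch that lists one crawl’s archive files over plain HTTPS, with no account, API key, or license. The paths follow Common Crawl’s published layout, and the crawl ID is an example; check commoncrawl.org for current ones.

```python
# Sketch: list the WARC files of one Common Crawl crawl over HTTPS.
# No authentication or licensing step is involved anywhere.
import gzip

import requests

BASE = "https://data.commoncrawl.org"
CRAWL = "CC-MAIN-2024-10"  # example crawl ID; see commoncrawl.org for the live list

# Each crawl publishes a gzipped newline-delimited list of its WARC files.
resp = requests.get(f"{BASE}/crawl-data/{CRAWL}/warc.paths.gz", timeout=60)
resp.raise_for_status()

warc_files = gzip.decompress(resp.content).decode().splitlines()
print(f"{len(warc_files)} WARC files in {CRAWL}")
print("first file:", BASE + "/" + warc_files[0])
```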
If there is a ruling that training violates copyright, then the only entities that could possibly afford to train LLMs or diffusion models are companies that own a large amount of copyrighted material. Sure, one company will lose a lot of money and/or be destroyed, but the legal precedent would be set so that it is impossible for anyone who doesn’t have billions of dollars to train AI.
People are shortsightedly seeing this as a victory for artists or some other nonsense. It’s not. This is a fight where large copyright holders (Disney and other large publishing companies) want to completely own the ability to train AI because they own most of the large stores of copyrighted material.
If the copyright holders win this, then open training material like Common Crawl would be completely unusable for training models in the US/the West, because any person who has ever posted anything to the Internet in the last 25 years could simply sue for copyright infringement.
JustARaccoon@lemmy.world 2 days ago
In theory sure, but in practice who has the resources to do large-scale model training on huge datasets other than large corporations?
FauxLiving@lemmy.world 2 days ago
Distributed computing projects, large non-profits, people in the near future with much more powerful and cheaper hardware, governments which are interested in providing public services to their citizens, etc.
Look at other large technology projects. The Human Genome Project spent $3 billion to sequence the first genome, but now you can have it done for around $500. This cost reduction is due to the massive, combined effort of tens of thousands of independent scientists working on the same problem. It isn’t something that would have happened if Purdue Pharma owned the sequencing process and required every scientist to purchase a license from them in order to do research.
LLM and diffusion models are trained on the works of everyone who’s ever been online (which are stored in the Common Crawl datasets). We should not be cheering for a world where it is illegal to use this dataset and we are instead forced to license massive datasets from publishing companies.
Progress on these types of models would immediately stop, and there would be 3-4 corporations who could afford the licenses. They would have a de facto monopoly on LLMs and could enshittify them without worrying about competition.
JustARaccoon@lemmy.world 2 days ago
The world you’re envisioning would only have paid licenses, but who’s to say we can’t have a “free for non-commercial purposes” license style for it all?
sunbytes@lemmy.world 2 days ago
Or it just happens overseas, where these laws don’t apply (or can’t be enforced).
But I don’t think it will happen. Too many countries are desperate to be “the AI country” that they’ll risk burning whole industries to the ground to get it.
LustyArgonianMana@lemmy.world 2 days ago
Copyright is a leftover mechanism from slavery and it will be interesting to see how it gets challenged here, given that the wealthy view AI as an extension of themselves and not as a normal employee. Genuinely think the copyright cases from AI will be huge.
FauxLiving@lemmy.world 2 days ago
My last comment was wrong; I’ve read through the filings in the case.
The judge already ruled that training the LLMs on the books was so obviously fair use that the claim was resolved at summary judgment:
To summarize the analysis that now follows, the use of the books at issue to train Claude and its precursors was exceedingly transformative and was a fair use under Section 107 of the Copyright Act. The digitization of the books purchased in print form by Anthropic was also a fair use, but not for the same reason as applies to the training copies. Instead, it was a fair use because all Anthropic did was replace the print copies it had purchased for its central library with more convenient, space-saving, and searchable digital copies without adding new copies, creating new works, or redistributing existing copies. However, Anthropic had no entitlement to use pirated copies for its central library, and creating a permanent, general-purpose library was not itself a fair use excusing Anthropic’s piracy.
The only issue remaining in this case is that they downloaded copyrighted material with BitTorrent, the kind of lawsuit that has been going on since Napster. They’ll probably be required to pay for all 196,640 books that they pirated, plus some other damages.
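For a sense of scale, here is a rough back-of-the-envelope using the general US statutory damages bounds for copyright infringement (17 U.S.C. § 504(c), $750 to $30,000 per work, up to $150,000 per work if willful). This is only arithmetic over the statute’s ranges, not a prediction of the actual award:

```python
# Back-of-the-envelope: statutory damages ranges applied to the book count.
BOOKS = 196_640

for label, per_work in [("minimum", 750), ("maximum", 30_000), ("willful max", 150_000)]:
    print(f"{label:>12}: ${BOOKS * per_work:,}")

# Output:
#      minimum: $147,480,000
#      maximum: $5,899,200,000
#  willful max: $29,496,000,000
```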
barryamelton@lemmy.world 2 days ago
Anybody can use copyrighted works under fair use for research, even more so if your LLM is open source. You are wrong.
We don’t need to break the copyright protections that shield us from corporations in this case, and which also, incidentally, protect open source and libre software.