Nvidia accused of trying to cut a deal with Anna’s Archive for high‑speed access to the massive pirated book haul — allegedly chased stolen data to fuel its LLMs

Submitted ⁨⁨3⁩ ⁨months⁩ ago⁩ by ⁨schizoidman@lemmy.zip⁩ to ⁨technology@lemmy.world⁩

https://www.tomshardware.com/tech-industry/artificial-intelligence/nvidia-accused-of-trying-to-cut-a-deal-with-annas-archive-for-high-speed-access-to-the-massive-pirated-book-haul-allegedly-chased-stolen-data-to-fuel-its-llms


Comments


theunknownmuncher@lemmy.world ⁨3⁩ ⁨months⁩ ago
Allegedly most valuable company on the planet in all of history (can’t afford books). Allegedly not a bubble or fraud.

- MrScottyTay@sh.itjust.works ⁨3⁩ ⁨months⁩ ago
  Sadly I think it’s more that there isn’t really a standard way to buy books and other media in bulk at the scale that AI training usually requires. So the companies realise they can save both time and money by just pirating after calculating the fine risk. It’s just a bonus that they usually get away with it and that the fines would likely be cheaper than a legit transaction. But I do think it’s the bulk data packaging that makes piracy look more attractive to them at the get-go.
  
  Heck, even video game publishers often source their ROMs for their official re-releases from pirated copies because pirates are better at preserving data and keeping it in a nice friendly format. It’s easier to search for it on the web and download it than it is to go into their own archives and rip it themselves, if they even still have original copies, cause they sure as hell didn’t keep their source code.
  
  - amzd@lemmy.world ⁨3⁩ ⁨months⁩ ago
    There is also no standard way of buying a DRM free epub for personal use so I’m fine downloading them from Anna too :)
    
  - theunknownmuncher@lemmy.world ⁨3⁩ ⁨months⁩ ago
    Yeah, no, this genuinely doesn’t make sense, as there are legitimate repositories for these books and companies can do business-to-business negotiations for access to them. Even libraries have access to ebooks at bulk scale.
    
  - Waphles@lemmy.world ⁨3⁩ ⁨months⁩ ago
    Well, I suppose they could buy access to Amazon’s kindle servers
    
- UnspecificGravity@piefed.social ⁨3⁩ ⁨months⁩ ago
  Are you suggesting that there is a use case for piracy that has less to do with saving money than it does with convenience and easy access to media in one place?
  
rafoix@lemmy.zip ⁨3⁩ ⁨months⁩ ago
Will they be sued per book?

- UnspecificGravity@piefed.social ⁨3⁩ ⁨months⁩ ago
  It’s not stealing when corpos do it.
  
  Meta torrented their training data from the pirate bay. Hell, Spotify initially built their catalog from pirated music. They all do this shit. Corporations are built to steal our shit and sell it back to us. This isn’t any different from pumping oil out of pubic lands and selling it back to us.
  
  - demonsword@lemmy.world ⁨3⁩ ⁨months⁩ ago
    
    pumping oil out of pubic lands
    
    this sounds really painful lol
    
  - ICastFist@programming.dev ⁨3⁩ ⁨months⁩ ago
    wish meta had torrented all the viruses, too, would be fun to read the news of “facebook and instagram are offline as meta suffers from cyberattack”
    
- Goodlucksil@lemmy.dbzer0.com ⁨3⁩ ⁨months⁩ ago
  No, because the lawyer cohort will destroy them.
  
- jim3692@discuss.online ⁨3⁩ ⁨months⁩ ago
  No, it’s fair use
  
  - Filetternavn@lemmy.blahaj.zone ⁨3⁩ ⁨months⁩ ago
    Pirating books is not fair use. Using copyrighted works to train an AI model is not fair use. People seem to grossly misunderstand what fair use is, and how limited its scope is. Don’t believe me? Here’s the legal precedent
    
  - rafoix@lemmy.zip ⁨3⁩ ⁨months⁩ ago
    A business is not fair use. They’re taking someone’s intellectual property and using it to make their product useful.
    
FaceDeer@fedia.io ⁨3⁩ ⁨months⁩ ago
Seems strange. Anna's Archive makes their collection available for bulk download as torrent files, they shouldn't need to "cut a deal" for access to that. Just download the torrent and now you've got the whole collection available locally.

- nialv7@lemmy.world ⁨3⁩ ⁨months⁩ ago
  They do provide direct access to their books for businesses that are willing to pay.
  
  - dukemirage@lemmy.world ⁨3⁩ ⁨months⁩ ago
    chaotic neutral
    
  - FaceDeer@fedia.io ⁨3⁩ ⁨months⁩ ago
    Which, as I said, seems strange. Why don't those businesses just download the torrents?
    
scytale@piefed.zip ⁨3⁩ ⁨months⁩ ago
Holy shit the greed knows no bounds.

null@piefed.nullspace.lol ⁨3⁩ ⁨months⁩ ago
Wait, so piracy is theft?

- 0x0@lemmy.zip ⁨3⁩ ⁨months⁩ ago
  Not if it’s the rich guys doing it.
  
Appoxo@lemmy.dbzer0.com ⁨3⁩ ⁨months⁩ ago
But…why?
Just torrent it?

- Knock_Knock_Lemmy_In@lemmy.world ⁨3⁩ ⁨months⁩ ago
  Just like Meta did torrentfreak.com/meta-torrented-over-81-tb-of-dat…
  
  - Appoxo@lemmy.dbzer0.com ⁨3⁩ ⁨months⁩ ago
    Exactly
    
- Dadifer@lemmy.world ⁨3⁩ ⁨months⁩ ago
  Not fast enough
  
  - Appoxo@lemmy.dbzer0.com ⁨3⁩ ⁨months⁩ ago
    But my gbit seedbox is trying really hard ;-;
    
sureshot0@discuss.online ⁨3⁩ ⁨months⁩ ago
It would be so funny if this ended with Nvidia getting robbed.

brokenwing@discuss.tchncs.de ⁨3⁩ ⁨months⁩ ago
AA might be digging their own grave. Over time the knowledge gets accumulated in the hands of a select few and then they’re gonna block people from accessing pirated sites like AA or, even worse, AA gets shut down due to lack of traffic.

- Dadifer@lemmy.world ⁨3⁩ ⁨months⁩ ago
  It has torrent backup. How would it do either of those things?
  
- Cherry@piefed.social ⁨3⁩ ⁨months⁩ ago
  It’s a really good thought. IMO what they will be producing with AI won’t be knowledge, it will be slop.
  
  There is always gonna be an indie writer, a local at the pub singing. They can’t stop people creating. Download or buy analog copies of the stuff you like and store it. We don’t have to be a slave to the mainstream dream… I will say though, it’s hard changing habits… but for me, it starts with me.
  
flowers_galore2@lemmynsfw.com ⁨3⁩ ⁨months⁩ ago
Hmm, so Nvidia is training LLMs as well. Are they going to compete with their customers now too? Like Anthropic and Cursor?

Good. Can’t wait for the bubble to pop.

DandomRude@lemmy.world ⁨3⁩ ⁨months⁩ ago
So we can assume that in the future, only slop written by LLMs will be available. I mean, who would be willing to spend hundreds of hours writing a book when even huge corporations that earn billions from it won’t pay the author a single dime?

- Cherry@piefed.social ⁨3⁩ ⁨months⁩ ago
  The trick is not to pay a dime to read it. Even producing AI slop has a cost. If no one pays for that it must leave a negative.
  
  Stop buying. Simples.
  
- dukemirage@lemmy.world ⁨3⁩ ⁨months⁩ ago
  Why should this development stop at books? There are already generated books available, mostly children’s books (no one’s thinking about them now).
  
  - DandomRude@lemmy.world ⁨3⁩ ⁨months⁩ ago
    This development will certainly not end with books - countless other creative and intellectual achievements have long been affected. That is precisely the problem with generative models, whether they involve text, code, video, images, or whatever else. All of this boils down to the fact that the already precarious situation for everyone who creates value by themselves is continuing to deteriorate. Professional work in all these areas will undoubtedly become even more precarious in the future, with artists, designers, and writers, who were already in a difficult position, now being joined by industries such as software development and administrative work.
    
    Please don’t get me wrong: I am anything but a technology pessimist, but the business model of the so-called AI companies is so exploitative and their owners so unscrupulous that, given the status quo (cloud models), I can hardly imagine that this will lead to even halfway fair working conditions or remuneration models for people who create value in the form of intellectual achievements. I mean, this post is a vivid example.
    
PierceTheBubble@lemmy.ml ⁨3⁩ ⁨months⁩ ago
So the amended complaint alleges that Nvidia used/stored/copied/obtained/distributed copyrighted works (including the plaintiffs’), both through databases available on Hugging Face (‘Books3’, featured in both ‘The Pile’ and ‘SlimPajama’) and by pirating from shadow libraries (like Anna’s Archive), to train multiple LLMs (primarily their ‘NeMo Megatron’ series), and that it distributed the copyrighted data through the ‘NeMo Megatron Framework’; data which was ultimately sourced from shadow libraries.

It’s quite an interesting read actually, especially the link to this Anna’s Archive blog post, which it grossly pulls out of context, as the plaintiffs clearly despise the shadow libraries too, since these have ultimately provided access to their copyrighted material.

Especially the part: “Most (but not all!) US-based companies reconsidered once they realized the illegal nature of our work. By contrast, Chinese firms have enthusiastically embraced our collection, apparently untroubled by its legality.” makes me wonder if that’s the reason why models like Deepseek blew Western models out of the water.

- Knock_Knock_Lemmy_In@lemmy.world ⁨3⁩ ⁨months⁩ ago
  You can ask Deepseek detailed questions about Harry Potter books and it responds intelligently with (almost) quotes from the book.
  
  Ask ChatGPT and it will respond to questions but denies it has read any book.
  
  - Corkyskog@sh.itjust.works ⁨3⁩ ⁨months⁩ ago
    Interesting, I was using Deepseek for book recommendations and it was exceptionally good at recommending books that are similar to one I just read compared to other models.
    
SabinStargem@lemmy.today ⁨3⁩ ⁨months⁩ ago
I support the destruction of copyright. Humanity should have free access to media, be it for enhancing their commercial products or for individuals to develop their personhood.

- neuromorph@lemmy.world ⁨3⁩ ⁨months⁩ ago
  We need to remove any copyright from whatever is developed by the AI companies.
  
  If the AI can use copyrighted material without compensating the owners, then it should be free for everyone to use/own the content AI creates
  
SparrowHawk@feddit.it ⁨3⁩ ⁨months⁩ ago
I don’t know why but this is all so funny and ridiculous to me.

Infuriating too, but so ridiculous. Like, capitalism is proving how much it sucks by needing to go against its own rules. Like, it always did this, but now it is so pathetically clear.

random_character_a@lemmy.world ⁨3⁩ ⁨months⁩ ago
Allegedly, but holy shit if true. Hard to explain yourself out of that one.
