Comment on Nvidia accused of trying to cut a deal with Anna’s Archive for high‑speed access to the massive pirated book haul — allegedly chased stolen data to fuel its LLMs

PierceTheBubble@lemmy.ml 13 hours ago

So the amendment alleges that Nvidia used, stored, copied, obtained, and distributed copyrighted works (including the plaintiffs’) to train multiple LLMs (primarily its ‘NeMo Megatron’ series), both through datasets available on Hugging Face (‘Books3’, featured in both ‘The Pile’ and ‘SlimPajama’) and by pirating from shadow libraries (like Anna’s Archive). It also alleges that Nvidia distributed the copyrighted data through the ‘NeMo Megatron Framework’, data which was ultimately sourced from shadow libraries.

It’s quite an interesting read actually, especially the link to this Anna’s Archive blog post, which it grossly pulls out of context. The plaintiffs clearly despise the shadow libraries too, as these have ultimately provided access to their copyrighted material.

The part “Most (but not all!) US-based companies reconsidered once they realized the illegal nature of our work. By contrast, Chinese firms have enthusiastically embraced our collection, apparently untroubled by its legality.” especially makes me wonder whether that’s the reason models like DeepSeek blew Western models out of the water.

Knock_Knock_Lemmy_In@lemmy.world 5 hours ago
You can ask DeepSeek detailed questions about the Harry Potter books and it responds intelligently, with (almost) verbatim quotes from the books.

Ask ChatGPT and it will answer the questions but denies it has read any book.

- Corkyskog@sh.itjust.works 1 hour ago
  Interesting. I was using DeepSeek for book recommendations, and compared to other models it was exceptionally good at recommending books similar to one I had just read.
  