Comment on AI and Copyright: Expanding Copyright Hurts Everyone—Here’s What to Do Instead.

tal@lemmy.today 2 days ago

So, I agree with the EFF that we should not introduce some kind of new legal right to prohibit training on a work just because it’s copyrighted. Nothing keeps a human from training themselves on copyrighted content, so an AI shouldn’t be prohibited from it either.

However.

It is possible for a human to make a work that infringes existing copyright, by producing a derivative work. Not every work inspired by something else will meet the legal bar for being derivative, but some do. And just as a human can do that, so too can AIs.

I have no problem with, say, an AI being able to emulate a style. But it’s possible for AIs today to produce works that do meet the bar for being derivative works. As things stand, I believe that’d make the user of the AI liable. And yet there’s not really a good way for them to avoid that. That’s a legitimate point of complaint, I think, because it leads to people unknowingly creating derivative works.

Existing generative AI systems don’t have a good way of hinting to the user whether a generated work is derivative.

However, I think what we could do is operate something like a federal registry of images. For published, copyrighted works, we already have mandatory deposit with the Library of Congress.

If something akin to TinEye were funded by the government, it would be possible to maintain an archive of registered, copyrighted works. It would then be practical for someone who had just generated an image to check whether a similar, pre-existing image was already registered.
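
As a very rough sketch of the lookup side, assuming some fingerprint() function that reduces an image to a compact, comparable value (one possibility is sketched after the next paragraph), and with all names here being hypothetical, not any real system’s API:

```python
# Hypothetical registry structure, not any real system's API. Assumes a
# fingerprint() function (one possible implementation is sketched below)
# and a distance function for comparing fingerprints.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Registration:
    work_id: str          # e.g. a Copyright Office registration number
    rights_holder: str
    fingerprint: Any      # whatever fingerprint() returns

class Registry:
    """Toy in-memory archive; a real one would need an index supporting
    fast nearest-neighbor search over millions of works."""

    def __init__(self) -> None:
        self.entries: list[Registration] = []

    def register(self, reg: Registration) -> None:
        self.entries.append(reg)

    def find_similar(self, fp: Any,
                     distance: Callable[[Any, Any], int],
                     threshold: int) -> list[Registration]:
        # Linear scan for clarity only.
        return [r for r in self.entries
                if distance(fp, r.fingerprint) <= threshold]
```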

I don’t know how TinEye works today, but for this to work, we’d probably need a way to recognize an image under a bunch of transformations: scale, rotation, color, etc. I’d assume some kind of feature recognition – maybe it does line-detection, vectorizes the result, breaks the image up into a bunch of chunks, performs some operation to canonicalize each chunk’s rotation based on its content, and then computes some kind of fuzzy hash on the lines.
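
To make that concrete with off-the-shelf parts (and this is my guess at one workable fingerprint, not a claim about what TinEye does): the Python imagehash library’s pHash is a DCT-based perceptual hash that survives rescaling, recompression, and mild color shifts, though not arbitrary rotation, so a real registry would need the richer pipeline above.

```python
# One concrete fingerprint() for the sketch above, using the "imagehash"
# library's DCT-based perceptual hash. Tolerates rescaling and mild
# color changes; does NOT handle rotation, so it is only a stand-in for
# the fuller feature pipeline guessed at in the text.
from PIL import Image
import imagehash

def fingerprint(image_path: str) -> imagehash.ImageHash:
    return imagehash.phash(Image.open(image_path))

def hamming(a: imagehash.ImageHash, b: imagehash.ImageHash) -> int:
    # imagehash overloads "-" to give the Hamming distance in bits
    return a - b
```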

Then one could place an expectation that anyone who wants to distribute an LLM-generated work first feed it into such a system, with the presumption that if the work is distributed without being verified and it turns out to be derivative of a registered work, the infringement was intentional (which IIRC entitles a rights holder to treble damages under US law). We don’t have a mathematical model today to determine whether one work is “derivative” of another, but we could make one, or at least produce an approximation and a warning.
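
Tying the hypothetical pieces above together, the check-before-distribute step might look something like this, with a fixed distance threshold standing in (crudely) for that missing model of “derivative”:

```python
# Sketch of the proposed check-before-distribute step, using the
# hypothetical Registry and fingerprint()/hamming() above. The fixed
# threshold is a crude stand-in for a real "is this derivative?" model.
def verify_for_distribution(registry: Registry, image_path: str,
                            threshold: int = 8) -> bool:
    fp = fingerprint(image_path)
    matches = registry.find_similar(fp, distance=hamming,
                                    threshold=threshold)
    for m in matches:
        print(f"Warning: close to registered work {m.work_id} "
              f"({m.rights_holder}); distributing unverified could be "
              f"presumed willful infringement.")
    # A real system might issue a signed receipt here, so the user can
    # later prove the check was run before distribution.
    return not matches
```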

I think that’s practical in most cases, both for holders of copyrighted images and for LLM users. It permits people to use LLMs to generate images for non-distributed use. It doesn’t create a legal minefield for an LLM user. It places no restrictions on model creators. It’s doable using something like existing technology. And it permits a viewer of a generated image to verify that the image is not derivative.
