Hackworth@piefed.ca 13 hours ago
As I understand it, CLIP (and other text encoders in diffusion models) aren’t trained like LLMs, exactly. They’re trained on image/text pairs, which you get from the metadata creators upload with their photos to Adobe Stock. That said, Adobe hasn’t published their entire architecture.
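For anyone curious what "trained on image/text pairs" means in practice, here's a minimal sketch of CLIP-style contrastive training in PyTorch. This is just an illustration of the general technique, not Adobe's actual pipeline (which, as noted, isn't public); the encoder outputs and dimensions are placeholders.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize embeddings so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity of every image to every caption in the batch.
    logits = image_emb @ text_emb.t() / temperature

    # Matching image/caption pairs sit on the diagonal;
    # every other pairing in the batch acts as a negative.
    targets = torch.arange(logits.size(0))
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Dummy batch: 8 image embeddings and 8 caption embeddings, 512-dim,
# standing in for the outputs of the image and text encoders.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt))

The point is that the text encoder never does next-token prediction like an LLM; it just learns to put a caption's embedding close to its image's embedding and far from the other images in the batch.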