Hackworth@piefed.ca 13 hours ago
As I understand it, CLIP (and other text encoders in diffusion models) aren’t trained like LLMs, exactly. They’re trained on image/text pairs, which you get from the metadata creators upload with their photos to Adobe Stock. That said, Adobe hasn’t published their entire architecture.
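For anyone curious what "trained on image/text pairs" means in practice, here's a minimal sketch of CLIP-style contrastive training in PyTorch. This is just an illustration of the general technique, not Adobe's actual pipeline (which, as noted, isn't public); the encoder outputs and dimensions are placeholders.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize embeddings so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity of every image to every caption in the batch.
    logits = image_emb @ text_emb.t() / temperature

    # Matching image/caption pairs sit on the diagonal;
    # every other pairing in the batch acts as a negative.
    targets = torch.arange(logits.size(0))
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Dummy batch: 8 image embeddings and 8 caption embeddings, 512-dim,
# standing in for the outputs of the image and text encoders.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt))

The point is that the text encoder never does next-token prediction like an LLM; it just learns to put a caption's embedding close to its image's embedding and far from the other images in the batch.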