Making up answers is kinda their entire purpose. LLMs are fundamentally just text generation algorithms: they are designed to produce text that looks like it could have been written by a human. And they are amazing at it, especially once you take into account how many paragraphs of instructions you can give them and how reliably they tend to follow them.
The one thing they can't do is verify whether what they are saying is true. If they could, they would stop being LLMs and start being AGIs.
Knock_Knock_Lemmy_In@lemmy.world 2 weeks ago
A well-trained model should consider both types of lime. Failure is more likely down to temperature and other sampling settings than anything else; it's not a measure of intelligence. Rough sketch of what temperature actually does below.
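To illustrate the temperature point: a minimal sketch of temperature-scaled softmax sampling over next-token logits. The function name `sample_with_temperature` and the two "lime" logit values are made up for illustration; real inference stacks also layer on top-k / top-p filtering and other settings.

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0):
    # Scale raw logits by temperature: lower T sharpens the distribution
    # (more deterministic), higher T flattens it (more varied output,
    # more chance of surfacing a less-probable continuation).
    scaled = [l / temperature for l in logits]
    # Softmax with max-subtraction for numerical stability
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

# Hypothetical logits for two candidate senses ("quicklime" vs "citrus lime")
logits = [2.0, 1.5]
print(sample_with_temperature(logits, temperature=0.2))  # almost always picks index 0
print(sample_with_temperature(logits, temperature=2.0))  # index 1 comes up far more often
```

Same model, same logits: whether the less-likely sense ever shows up is largely a sampling-settings question, which is the point above.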