Comment on [deleted]
sbv@sh.itjust.works 1 week ago
However, according to Opper.ai, only 11/53 cloud-based AI passed the test (~20%). Worrying, about the same error rate as humans
lololol
SuspciousCarrot78@lemmy.world 1 week ago
Sorry; brain fart. That could have been clearer.
On a single call, only 11 out of 53 LLMs got it right (~20%). Humans: about 71.5% (so almost 1 in 3 gave the incorrect answer).
Of the ~20% of LLMs that got it right, 5 got it right every time across multiple tests: Claude Opus 4.6, Gemini 2.0 Flash Lite, Gemini 3 Flash, Gemini 3 Pro, Grok-4.
sbv@sh.itjust.works 1 week ago
Phew. I’m glad humans did better than bots.
SuspciousCarrot78@lemmy.world 1 week ago
Still…1 in 3. Woof.
A “charitable” read might be
I think it’s fair: if we’re willing to do that for people, we should extend it to the clankers. At least a bit. Like I said, I think there’s some interesting stuff going on under the hood.