Comment on [deleted]

sbv@sh.itjust.works 2 months ago

However, according to Opper.ai, only 11/53 cloud-based AI passed the test (~20%). Worrying, about the same error rate as humans

lololol

SuspciousCarrot78@lemmy.world 2 months ago
Sorry; brain fart. That could have been clearer.

On a single call, only 11 out of 53 LLMs got it right (~20%). Humans: about 71.5% got it right (so almost 1 in 3 gave the incorrect answer).

Of the ~20% of LLMs that got it right, 5 got it right every time across multiple tests: Claude Opus 4.6, Gemini 2.0 Flash Lite, Gemini 3 Flash, Gemini 3 Pro, Grok-4.
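
For anyone who wants to check the arithmetic, here's a quick sketch using only the figures quoted above (11 of 53, 71.5%, and the 5 consistent models). The variable names are mine, and I'm assuming the 5 consistent models are counted against the same 53.

```python
# Figures as quoted in this comment; nothing here comes from the raw data.
llm_passed, llm_total = 11, 53
human_pass_rate = 0.715
always_right = 5  # assumption: counted out of the same 53 models

llm_pass_rate = llm_passed / llm_total
human_error_rate = 1 - human_pass_rate
always_right_rate = always_right / llm_total

print(f"LLM single-call pass rate: {llm_pass_rate:.1%}")
print(f"Human error rate: {human_error_rate:.1%} (about 1 in {1 / human_error_rate:.1f})")
print(f"LLMs right every time: {always_right_rate:.1%}")
```

So the humans' miss rate is 28.5%, a bit under 1 in 3, and the LLM pass rate is closer to 21% than 20%.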

- sbv@sh.itjust.works 2 months ago
  Phew. I’m glad humans did better than bots.
  
  - SuspciousCarrot78@lemmy.world 2 months ago
    Still…1 in 3. Woof.
    
    A “charitable” read might be that they:
    
    Misunderstood the question
    
    Assumed priors (e.g., people come to wash your car from a nearby gas station)
    
    Had a [schizoid embolism] (…substack.com/…/movie-neurobabble-total-recalls-s…)
    
    I think it’s only fair that if we’re willing to do that for people, we extend it to the clankers. At least a bit. Like I said, I think there’s some interesting stuff going on under the hood.
    