Everyone Is Judging AI by These Tests. But Experts Say They’re Close to Meaningless.

Submitted 7 months ago by ModerateImprovement@sh.itjust.works to technology@lemmy.world

https://themarkup.org/artificial-intelligence/2024/07/17/everyone-is-judging-ai-by-these-tests-but-experts-say-theyre-close-to-meaningless

Comments

superminerJG@lemmy.world 7 months ago
Goodhart’s law:

When a measure becomes a target, it ceases to be a good measure.

- bionicjoey@lemmy.ca 7 months ago
  The Turing Test (as some people believe it to be): if you can have a conversation with a computer and not tell if it’s a computer, then it must be intelligent.
  
  AI companies: writes ML model that is specifically designed to convincingly play one side of a conversation, even though it has no ability to understand the things it talks about.
  
  - technocrit@lemmy.dbzer0.com 7 months ago
    It’s worth emphasizing that the “Turing Test” is not a good test since it’s not at all scientific.
    
    It’s just another thought experiment that grifters have taken to the bank.
    
exu@feditown.com 7 months ago
There’s a reason the Open LLM Leaderboard was changed a while ago.
Basically, scores had stopped improving much, and many of the tests had ended up in the training data.

See this blog post for more info.

huggingface.co/spaces/open-llm-leaderboard/blog

MajorHavoc@programming.dev 7 months ago
“close to meaningless” sums up my expert opinion on the whole current AI hype machine sales pitch.

Highly tuned models for incredibly specific, not-dangerous use cases are the next pragmatic step. There’s a lot to be excited about in that very narrow band.

Anyone selling more than that is part of a con, or in very rare cases, doing genuine “fuck off and ask me again in a decade” kinds of research.

Buffalox@lemmy.world 7 months ago
IQ tests for humans are flawed in much the same way. Figuring out number series or relations in a graphic representation only tells you how good you are at those specific tasks; it doesn’t provide a reliable picture of “general” intelligence.

sunbeam60@lemmy.one 7 months ago
The article makes the valid argument that LLMs simply predict the next token based on training and query.

But is that actually true of the latest models from OpenAI, Anthropic (Claude), etc.?

And even if it is true, what solid proof do we have that humans aren’t doing the same? I’ve met countless people who could waffle for hours without seeming to do any reasoning.

- rottingleaf@lemmy.world 7 months ago
  Information theory, entropy in Markovian processes. Read up on these buzzwords to see why.
  
  - sunbeam60@lemmy.one 7 months ago
    I think I know enough about these concepts to know that there isn’t any conclusive proof, observed in output or system state, to establish consensus that human speech output is generated differently to how LLMs generate output. If you have links to any papers that claim otherwise, I’ll be happy to read them.
    
- technocrit@lemmy.dbzer0.com 7 months ago
  
  “what solid proof do we have that humans aren’t doing the same?”
  
  Humans are not computers. Brains are not LLMs…
  
  Given a totally reasonable hypothesis (humans =/= computers) and a completely outlandish hypothesis (humans = computers), I would need much more ‘proof’ for the latter.
  
  - sunbeam60@lemmy.one 7 months ago
    Well, brains are a network of neurons (we can verify this empirically) trained on … eyes, ears, sense of touch, taste, smell and balance. LLMs are a network of artificial neurons trained on text and images.
    
    It’s not given that this results in the same way of dealing with language, given the wider set of input data for a human, but it’s not given that it doesn’t either.
    
water@lemmy.world 7 months ago
This is the way:

chat.lmsys.org/?arena

A_A@lemmy.world 7 months ago
Looks quite satisfying to me. Otherwise, we can still create new tests:

The tests cover an astounding range of knowledge, such as eighth-grade math, world history, and pop culture. Many are multiple choice, others take free-form answers. Some purport to measure knowledge of advanced fields like law, medicine and science. Others are more abstract, asking AI systems to choose the next logical step in a sequence of events, or to review “moral scenarios” and decide what actions would be considered acceptable behavior in society today.
