Announcing ARC-AGI-3 - A benchmark that tests if AI can explore, learn, and adapt in unfamiliar situations. Humans score 100%. Frontier AI scores 0.26%.

Submitted 2 months ago by brianpeiris@lemmy.ca to technology@lemmy.world

https://arcprize.org/blog/arc-agi-3-launch

The ARC Prize organization designs benchmarks around tasks that humans complete easily but that are difficult for AIs such as LLMs, “reasoning” models, and agentic frameworks.

ARC-AGI-3 is the first fully interactive benchmark in the ARC-AGI series. ARC-AGI-3 represents hundreds of original turn-based environments, each handcrafted by a team of human game designers. There are no instructions, no rules, and no stated goals. To succeed, an AI agent must explore each environment on its own, figure out how it works, discover what winning looks like, and carry what it learns forward across increasingly difficult levels.

Previous ARC-AGI benchmarks predicted and tracked major AI breakthroughs, from reasoning models to coding agents. ARC-AGI-3 points to what’s next: the gap between AI that can follow instructions and AI that can genuinely explore, learn, and adapt in unfamiliar situations.
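
To make the "no instructions, no rules, no stated goals" setup concrete, here is a minimal sketch of the explore-then-reuse loop described above. Everything in it is an assumption for illustration: `ToyLevel`, `play`, the two unlabeled actions, and the corridor layout are hypothetical and are not the ARC-AGI-3 environments or API. The agent sees only an observation and a "won" flag, tries each unknown action once to learn its effect, and carries that knowledge into a harder level.

```python
class ToyLevel:
    """Hypothetical turn-based level: a 1-D corridor with a hidden goal cell."""
    ACTIONS = ("a", "b")  # the agent is never told what these do

    def __init__(self, length, goal):
        self.length, self.goal, self.pos = length, goal, 0

    def step(self, action):
        # Hidden rule: "a" moves right, "b" moves left, walls clamp movement.
        delta = 1 if action == "a" else -1
        self.pos = max(0, min(self.length - 1, self.pos + delta))
        return self.pos, self.pos == self.goal  # (observation, level_won)


def play(level, known_effects, max_turns=50):
    """Play one level; return the number of turns used, or None on failure.

    known_effects maps action -> observed change in position and persists
    across levels, which is how earlier learning is carried forward.
    """
    obs = level.pos
    for turn in range(1, max_turns + 1):
        untried = [a for a in level.ACTIONS if a not in known_effects]
        # Explore unknown actions first, then repeat the one that advances most.
        action = untried[0] if untried else max(known_effects, key=known_effects.get)
        new_obs, won = level.step(action)
        known_effects.setdefault(action, new_obs - obs)
        if won:
            return turn
        obs = new_obs
    return None


effects = {}
print(play(ToyLevel(5, 3), effects))  # 5: two exploratory turns, then three moves
print(play(ToyLevel(8, 6), effects))  # 6: no exploration needed on the harder level
print(play(ToyLevel(8, 6), {}))       # 8: the same level without carried-over knowledge
```

The real environments are far richer than a corridor, so this only illustrates the shape of the problem: the agent has to spend turns discovering the rules, and it is rewarded for reusing what it found on later levels.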

You can try the tasks yourself here: arcprize.org/arc-agi/3

Here is the current leaderboard for ARC-AGI-3, using state-of-the-art models:

OpenAI GPT-5.4 High - 0.3% success rate at $5.2K
Google Gemini 3.1 Pro - 0.2% success rate at $2.2K
Anthropic Opus 4.6 Max - 0.2% success rate at $8.9K
xAI Grok 4.20 Reasoning - 0.0% success rate at $3.8K

ARC-AGI 3 Leaderboard
(Logarithmic cost on the horizontal axis)

arcprize.org/leaderboard


Comments


RustyShackleford@piefed.social ⁨2⁩ ⁨months⁩ ago
As a psychiatrist, I have a theory about what’s missing in AI. First, it lacks childhood dependency and attachments. Second, it struggles to overcome repeated pain and suffering. Third, it lacks regular eating and restroom breaks. Fourth, it struggles to accept loss in everyday situations. Finally, it lacks the concept of our inevitable death. Without these nagging memories and concepts, machines will simply revert to the simpler concepts we use them for in our recent times, such as stealing cryptocurrency. After all, we live in a world run by capitalism, so it’s only logical. ¯\_(ツ)_/¯

- CosmicTurtle0@lemmy.dbzer0.com ⁨2⁩ ⁨months⁩ ago
  As a technologist, I have to remind everyone that AI is not intelligence. It’s a word prediction/statistical machine. It’s guessing at a surprisingly good rate what words follow the words before it.
  
  It’s math. All the way down.
  
  We as humans have simply taken these words and have said that it is “intelligence”.
  
  - unpossum@sh.itjust.works ⁨2⁩ ⁨months⁩ ago
    As another technologist, I have to remind everyone that unless you subscribe to some rather fringe theories, humans are also based on standard physics.
    
    Which is math. All the way down.
    
  - silverneedle@lemmy.ca ⁨2⁩ ⁨months⁩ ago
    As someone who knows a thing or two about biology, I think LLMs strip away >90% of what makes animals think.
    
- msage@programming.dev ⁨2⁩ ⁨months⁩ ago
  Are you anthropomorphizing a word suggester into a being experiencing things?
  
- MagicShel@lemmy.zip ⁨2⁩ ⁨months⁩ ago
  The major thing AI lacks is continuous parallel “prompting” through a variety of channels including sensory, biofeedback, and introspection / meta-thought about internal state and thinking.
  
  AI currently transforms a given input into an output. However, it cannot accept new input in the middle of an output. It can’t evaluate the quality of its own reasoning except through trial and error.
  
  If you had 1000 AIs operating in tandem and fed a continuous stream of prompts in the form of pictures, text, meta-inspection, and perhaps a simulation of biomechanical feedback with the right configuration, I think it might be possible to create a system that is a hell of an approximation of sentience. But it would be slow and I’m not sure the result would be any better than a human — you’d introduce a lot of friction to the “thought” process. And I have to assume the energy cost would be pretty enormous.
  
  In the end it would be a cool experiment to be part of, but I doubt that version would be worth the investment.
  
- ExFed@programming.dev ⁨2⁩ ⁨months⁩ ago
  It could also be that it lacks the machinery to feel any emotions at all. You don’t (normally) have to train people to be afraid of bears or heights or loneliness or boredom. You also don’t (normally) have to train people to have empathy or compassion.
  
  I argue that our obsession with AI is, itself, a misalignment with our environment; it disproportionately tickles psychological reward centers which evolved under unrecognizably different circumstances.
  
  - Havoc8154@mander.xyz ⁨2⁩ ⁨months⁩ ago
    I guess you don’t have children.
    
    You absolutely do have to train them to be afraid of bears, heights, and every fucking thing you can imagine. You absolutely do have to teach them empathy and compassion. There may be some nugget of instinct, but without reinforcement it might as well not exist.
    
  - 2xsaiko@discuss.tchncs.de ⁨2⁩ ⁨months⁩ ago
    
    You don’t (normally) have to train people to be afraid of bears or heights or loneliness or boredom. You also don’t (normally) have to train people to have empathy or compassion.
    
    So what are you implying about people who don’t experience these?
    
HaunchesTV@feddit.uk ⁨2⁩ ⁨months⁩ ago

Grok Reasoning: 0%

Hilarious

- brsrklf@jlai.lu ⁨2⁩ ⁨months⁩ ago
  Reasoning is woke propaganda, obviously.
  
ExLisper@lemmy.curiana.net ⁨2⁩ ⁨months⁩ ago
Can’t wait for this to be the new captcha.

Multiplexer@discuss.tchncs.de ⁨2⁩ ⁨months⁩ ago
Link to the recent AI Explained video mainly covering ARC-AGI-3:
www.youtube.com/watch?v=s4tptozUJ8Y

GreatBlueHeron@lemmy.ca ⁨2⁩ ⁨months⁩ ago
It’s fun to point at the crappy performance of current technology. But all I can think about is the amount of power and hardware the AI bros are going to burn through trying to improve their results.

lath@lemmy.world ⁨2⁩ ⁨months⁩ ago
Biased study. Take any average person off the streets and shove this thing in their face. That 100% notion will go down fast.

- tomalley8342@lemmy.world ⁨2⁩ ⁨months⁩ ago
  They didn’t say “100% of humans can solve this benchmark”, they said “humans can solve 100% of this benchmark”.
  
  - rimu@piefed.social ⁨2⁩ ⁨months⁩ ago
    I couldn’t get past the second level :(
    
  - lath@lemmy.world ⁨2⁩ ⁨months⁩ ago
    “Humans score 100%. Frontier AI scores 0.26%.”
    
    The title deals in absolutes.
    
- pulsewidth@lemmy.world ⁨2⁩ ⁨months⁩ ago
  Pretty defensive there. It’s not even a study.
  
  - lath@lemmy.world ⁨2⁩ ⁨months⁩ ago
    If it studies something, it’s a study. If you feel defensiveness, you consider aggression. If you feel bias in one way, someone can feel bias in another way. If there’s an action, there’s a reaction.
    
- brianpeiris@lemmy.ca ⁨2⁩ ⁨months⁩ ago
  
  ARC-AGI-3 Launch event - Shared publicly live on March 25 in San Francisco at Y Combinator HQ, featuring a fireside conversation between François Chollet (creator, ARC-AGI) and Sam Altman (CEO, OpenAI) on measuring intelligence on the path to AGI.
  
  François Chollet is a software engineer, artificial intelligence (AI) researcher, and former Senior Staff Engineer at Google. Chollet is the creator of the Keras deep-learning library released in 2015.
  
UnrepentantAlgebra@lemmy.world ⁨2⁩ ⁨months⁩ ago

If human scores were included, they would be at 100%, at the cost of approximately $250

Wait, why did it cost real humans $250 to pass the test?

- mapleseedfall@lemmy.world ⁨2⁩ ⁨months⁩ ago
  You’d have to eat $250 worth of burgers to pass it.
  
- FrankFrankson@lemmy.world ⁨2⁩ ⁨months⁩ ago
  That’s how much individual testing humans cost when you buy them in bulk.
  
- ExLisper@lemmy.curiana.net ⁨2⁩ ⁨months⁩ ago
  Because I ain’t doing this shit for free.
  
General_Effort@lemmy.world ⁨2⁩ ⁨months⁩ ago

ARC-AGI-3

What happened to ARC-AGI-1 and -2?

tatterdemalion@programming.dev ⁨2⁩ ⁨months⁩ ago
LLMs might suck at this game, but I’m pretty sure DeepMind’s deep reinforcement learning AI could solve these easily.

- 33550336@lemmy.world ⁨2⁩ ⁨months⁩ ago
  If only it existed.
  
  - tatterdemalion@programming.dev ⁨2⁩ ⁨months⁩ ago
    Wdym? It’s existed for at least a decade. Plenty of papers about it. It mastered Atari and Mario. It became the best Go player.
    