AI agents wrong ~70% of time: Carnegie Mellon study

985 likes

Submitted 4 months ago by eli001@lemmy.world to technology@lemmy.world

https://www.theregister.com/2025/06/29/ai_agents_fail_a_lot/?td=rt-4a


Comments


brown567@sh.itjust.works 4 months ago
70% seems pretty optimistic based on my experience…

iopq@lemmy.world 4 months ago
Now I’m curious, what’s the average score for humans?

SocialMediaRefugee@lemmy.world 4 months ago
I use it for very specific tasks and give as much information as possible. I usually have to give it more feedback to get to the desired goal. For instance, I will ask it how to resolve an error message. I’ve even asked it for some short Python code. I almost always get good feedback when doing that. Asking it about basic facts, like science questions, works too.

One thing I have had problems with: if the error is sort of an oddball, it will give me suggestions that don’t work with my OS/app version. Then I give it feedback and eventually it loops back to its original suggestions, so it can’t come up with an answer.

burgerpocalyse@lemmy.world 4 months ago
I don’t know why, but I am reminded of this clip about an eggless omelette: youtu.be/9Ah4tW-k8Ao

Melvin_Ferd@lemmy.world 4 months ago
How often do tech journalists get things wrong?

lmagitem@lemmy.zip 4 months ago
Color me surprised

MagicShel@lemmy.zip 4 months ago
I need to know the success rate of human agents in Mumbai (or some other outsourcing capital) for comparison.

I absolutely think this is not a good fit for AI, but I feel like the presumption is a human would get it right nearly all of the time, and I’m just not confident that’s the case.

dylanmorgan@slrpnk.net 4 months ago
Claude, why did you make me an appointment with a gynecologist? I need an appointment with my neurologist. I’m a man and I have Parkinson’s.

- TimewornTraveler@lemmy.dbzer0.com 4 months ago
  Got it, changing your gender to female
  
lemmy_outta_here@lemmy.world 4 months ago
Rookie numbers! Let’s pump them up!

To match their tech bro hypers, they should be wrong at least 90% of the time.

sircac@lemmy.world 4 months ago
Why would they be right beyond word sequence frequencies?

dan69@lemmy.world 4 months ago
And it won’t be until humans can agree on what’s a fact and true vs not… there is always someone or some group spreading mis/dis-information

Ileftreddit@lemmy.world 4 months ago
Hey I went there

esc27@lemmy.world 4 months ago
30% might be high. I’ve worked with two different agent creation platforms. Both require a huge amount of manual correction to work anywhere near accurately. I’m really not sure what the LLM actually provides other than some natural language processing.

In my experience these sorts of agents are right 20% of the time, wrong 30%, and fail entirely 50%. A human has to sit behind the curtain and manually review conversations and program custom interactions for every failure.

In theory, once it is fully set up and all the edge cases are fixed, it will provide 24/7 support in a convenient chat format. But that takes a lot more man hours than the hype suggests…

Weirdly, ChatGPT does a better job than a purpose-built, purchased agent.

atticus88th@lemmy.world 4 months ago
This study was written with the assistance of an AI agent.