Comment on AI agents wrong ~70% of time: Carnegie Mellon study

jsomae@lemmy.ml ⁨5⁩ ⁨months⁩ ago

I meant the latter, not “it can do 30% of tasks correctly 100% of the time.”

outhouseperilous@lemmy.dbzer0.com ⁨5⁩ ⁨months⁩ ago
You get how that’s fucking useless, generally?

- jsomae@lemmy.ml ⁨5⁩ ⁨months⁩ ago
  Yes, that’s generally useless. It should not be shoved down people’s throats. 30% accuracy still has its uses, especially if the result can be programmatically verified.
  
  - Knock_Knock_Lemmy_In@lemmy.world ⁨5⁩ ⁨months⁩ ago
    Run something with a 70% failure rate 10x and you get to a cumulative pass rate of about 97%. LLMs don’t get tired and they can be run in parallel.
    
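The retry arithmetic in the comment above can be checked directly. This is a minimal sketch that assumes every run is independent and succeeds with probability 0.3; the function name `cumulative_pass_rate` is made up for illustration.

```python
def cumulative_pass_rate(p_success: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent runs succeeds."""
    # All runs fail with probability (1 - p_success) ** attempts,
    # so at least one success is the complement of that.
    return 1.0 - (1.0 - p_success) ** attempts


# Assumed figure from the thread: 30% success (70% failure) per run.
for n in (1, 5, 10, 20):
    print(n, round(cumulative_pass_rate(0.3, n), 4))
```

For 10 attempts this gives 1 - 0.7^10, which is about 0.9718, but only under the independence assumption.
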
    MangoCats@feddit.it ⁨5⁩ ⁨months⁩ ago
    I have actually been doing this lately: iteratively prompting AI to write software and fix its errors until something useful comes out. It’s a lot like machine translation. I speak fluent C++ but not Rust, yet I can hammer away on the AI (with English-language prompts) until it produces passable Rust for something I could write for myself in C++ in half the time and effort.
    
    I also don’t speak Finnish, but Google Translate can take what I say in English and put it into at least somewhat comprehensible Finnish without egregious translation errors most of the time.
    
    Is this useful? When C++ is getting banned for “security concerns” and Rust is the required language, it’s at least a little helpful.
    
    jsomae@lemmy.ml ⁨5⁩ ⁨months⁩ ago
    The problem is they are not i.i.d., so this doesn’t really work. It works a bit, which is in my opinion why chain-of-thought is effective (it gives the LLM a chance to posit a couple answers first). However, we’re already looking at “agents,” so they’re probably already doing chain-of-thought.
    
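One way to see why the i.i.d. point matters is to compare two setups with the same 30% average success rate. The sketch below is a toy model with made-up numbers, not anything from the study: in the second case half the tasks succeed 60% of the time per attempt and the other half never succeed, so retries on the same task are correlated.

```python
def pass_at_n(p_success: float, attempts: int) -> float:
    """Chance of at least one success in `attempts` tries on a single task."""
    return 1.0 - (1.0 - p_success) ** attempts


def mixture_pass_at_n(groups, attempts: int) -> float:
    """`groups` is a list of (fraction_of_tasks, per_attempt_success) pairs."""
    return sum(weight * pass_at_n(p, attempts) for weight, p in groups)


# Hypothetical task mixes; both average 30% success on a single attempt.
iid_like = mixture_pass_at_n([(1.0, 0.3)], 10)
correlated = mixture_pass_at_n([(0.5, 0.6), (0.5, 0.0)], 10)

print(round(iid_like, 4), round(correlated, 4))
```

In this toy model ten retries reach roughly 97% when every task behaves the same, but stall just under 50% when half the tasks are simply out of reach, because no number of retries rescues those.
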
    davidagain@lemmy.world ⁨5⁩ ⁨months⁩ ago
    What’s 0.7^10?
    
  - outhouseperilous@lemmy.dbzer0.com ⁨5⁩ ⁨months⁩ ago
    Less broadly useful than 20 tons of mixed texture human shit.
    
    jsomae@lemmy.ml ⁨5⁩ ⁨months⁩ ago
    Are you just trolling, or do you seriously not understand how something that can do a task correctly with 30% reliability can be made useful if the result can be automatically verified?
    
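The "unreliable generator plus automatic verifier" idea can be sketched as a retry loop. Everything here is a hypothetical stand-in: `flaky_factor` imitates an agent that is right about 30% of the time when asked to factor 391, and `is_factorization` plays the role of the programmatic check.

```python
import random
from typing import Callable, Optional, Tuple


def retry_until_verified(
    generate: Callable[[], Tuple[int, int]],
    verify: Callable[[Tuple[int, int]], bool],
    max_attempts: int,
) -> Optional[Tuple[Tuple[int, int], int]]:
    """Return (candidate, attempt_number) for the first verified result, else None."""
    for attempt in range(1, max_attempts + 1):
        candidate = generate()
        if verify(candidate):
            return candidate, attempt
    return None


rng = random.Random(0)  # fixed seed so runs are repeatable


def flaky_factor() -> Tuple[int, int]:
    """Hypothetical agent: correct about 30% of the time, otherwise a random guess."""
    if rng.random() < 0.3:
        return (17, 23)
    return (rng.randint(2, 50), rng.randint(2, 50))


def is_factorization(pair: Tuple[int, int]) -> bool:
    """Cheap, exact check that the pair is a nontrivial factorization of 391."""
    a, b = pair
    return a > 1 and b > 1 and a * b == 391


result = retry_until_verified(flaky_factor, is_factorization, max_attempts=20)
print(result)
```

The loop never returns an unverified answer, so the generator's error rate only costs extra attempts. That holds only as far as the verifier itself is trustworthy and cheap to run.
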
- MangoCats@feddit.it ⁨5⁩ ⁨months⁩ ago
  As useless as a cubicle farm full of unsupervised workers.
  
  - outhouseperilous@lemmy.dbzer0.com ⁨5⁩ ⁨months⁩ ago
    Those are people who could be living their lives, pursuing their ambitions, whatever. That could get some shit done. Comparison not valid.
    
    Honytawk@feddit.nl ⁨5⁩ ⁨months⁩ ago
    The comparison is about the correctness of their work.
    
    Their lives have nothing to do with it.
    