Comment on Meta chief technology officer (CTO) explains why the smart glasses demo failed at Meta Connect, and it wasn’t the Wi-Fi

PhilipTheBucket@piefed.social ⁨2⁩ ⁨months⁩ ago

Yeah it’s a bunch of shit. I’m not an expert obviously, just talking out of my ass, but:

Routing inference for all the devices in the building to “our dev server” would not have maintained a usable response time for any of them, unless he meant to say “the dev cluster” or something and his home wifi glitched right at that moment and made it sound different.
LLMs don’t degrade under load by giving wrong answers, they degrade by no longer producing tokens.
Meta already has shown itself to be okay with lying
GUYS JUST USE FUCKING CANNED ANSWERS WITH THE RIGHT SOUNDING VOICE, THIS ISN’T ROCKET SCIENCE, THAT’S HOW YOU DO DEMOS WHEN YOUR SHIT’S NOT DONE YET
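
To put rough numbers on the first point, here is a toy sketch. It is not Meta's setup: `queue_latencies`, the 2-second service time, and the request counts are all made-up assumptions, and real serving stacks batch requests. It only shows that piling simultaneous requests onto too few workers makes the wait grow, while nothing about the load changes what each answer says.

```python
def queue_latencies(num_requests: int, service_time_s: float, workers: int = 1) -> list[float]:
    """Completion time of each request when all arrive at t=0 and are served FIFO.

    Hypothetical model: every request takes the same fixed time on one worker.
    """
    finish_times = [0.0] * workers  # when each worker is next free
    latencies = []
    for _ in range(num_requests):
        # Hand the request to whichever worker frees up first.
        idx = min(range(workers), key=lambda i: finish_times[i])
        finish_times[idx] += service_time_s
        latencies.append(finish_times[idx])
    return latencies


def count_timeouts(latencies: list[float], timeout_s: float) -> int:
    """Number of requests whose wait exceeds what a client would tolerate."""
    return sum(1 for t in latencies if t > timeout_s)


if __name__ == "__main__":
    one_server = queue_latencies(num_requests=100, service_time_s=2.0)
    print("slowest request on one worker:", one_server[-1], "s")
    print("requests over a 10 s timeout:", count_timeouts(one_server, 10.0))
```

In this model the failure shows up as requests that take too long or time out, which is the "stops producing tokens" kind of degradation, not a differently worded answer.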

Sasha@lemmy.blahaj.zone ⁨2⁩ ⁨months⁩ ago
LLMs can degrade by giving “wrong” answers, but not because of network congestion ofc.

That paper is fucking hilarious, but the tl;dr is that when asked to manage a vending machine business for an extended period of time, they eventually go completely insane. Some have an existential crisis, some call the whole thing a conspiracy and call the FBI, etc. It’s amazing how trash they are.

- PhilipTheBucket@piefed.social ⁨2⁩ ⁨months⁩ ago
  Initial thought: Well… but this is a transparently absurd way to set up an ML system to manage a vending machine. I mean it is a useful data point I guess, but to me it leads to the conclusion “Even though LLMs sound to humans like they know what they’re doing, they do not. Don’t just stick the whole situation into the LLM input and expect good decisions and strategies to come out of the output; you have to embed it into a more capable and structured system for any good to come of it.”
  
  Updated thought, after reading a little bit of the paper: Holy Christ on a pancake. Is this architecture what people have been meaning by “AI agents” this whole time I’ve been hearing about them? Yeah this isn’t going to work. What the fuck, of course it goes insane over time. I stand corrected, I guess, this is valid research pointing out the stupidity of basically putting the LLM in the driver’s seat of something even more complicated than the stuff it’s already been shown to fuck up, and hoping that goes okay.
  
  - Sasha@lemmy.blahaj.zone ⁨2⁩ ⁨months⁩ ago
    I’m pretty sure they touch on those points in the paper; they knew they were overloading it and were looking at how it handled that in particular. My understanding is that they’re testing failure modes to try and probe the inner workings to some degree. They discuss the impact of filling up the context in the abstract, mention it’s designed to stress test, and are particularly interested in memory limits, so I’m pretty sure they’ve deliberately chosen not to cater to an LLM’s ideal conditions. It’s not really a real world use case of LLMs running a business (even if that’s the framing given initially), it’s an experiment meant to break them in a simulated environment. The last line of the abstract kind of highlights this: they’re hoping to find flaws to improve the models generally.
    
    Either way, I just meant to point out that they can absolutely just output junk as a failure mode.
    
    PhilipTheBucket@piefed.social ⁨2⁩ ⁨months⁩ ago
    Yeah, I get it. I don’t think it is necessarily bad research or anything. I just feel like maybe it would have been good to go into it as two papers:
    
    Look at the funny LLM and how far off the rails it goes if you don’t keep it stable and let it kind of “build on itself” over time iteratively and don’t put the right boundaries on
    
    How should we actually wrap up an LLM into a sensible model so that it can pursue an “agent” type of task, what leads it off the rails and what doesn’t, what are some various ideas to keep it grounded and which ones work and don’t work
    
    And yeah, obviously they can get confused or output counterfactuals or nonsense as a failure mode; what I meant to say was just that they don’t really do that as a response to an overload / “DDoS” situation specifically. They might do it as a result of too much context or a badly set up framework around them, sure.
    