Yeah it’s a bunch of shit. I’m not an expert obviously, just talking out of my ass, but:
- Running inference for all the devices in the building to “our dev server” would not have maintained a usable level of response time for any of them, unless he meant to say “the dev cluster” or something and his home wifi glitched right at that moment and made it sound different
- LLMs don’t degrade by giving wrong answers, they degrade by stopping producing tokens
- Meta already has shown itself to be okay with lying
- GUYS JUST USE FUCKING CANNED ANSWERS WITH THE RIGHT SOUNDING VOICE, THIS ISN’T ROCKET SCIENCE, THAT’S HOW YOU DO DEMOS WHEN YOUR SHIT’S NOT DONE YET
Sasha@lemmy.blahaj.zone 1 day ago
LLMs can degrade by giving “wrong” answers, but not because of network congestion ofc.
That paper is fucking hilarious, but the tl;dr is that when asked to manage a vending machine business for an extended period of time, they eventually go completely insane. Some have an existential crisis, some call the whole thing a conspiracy and call the FBI, etc. It’s amazing how trash they are.
PhilipTheBucket@piefed.social 1 day ago
Initial thought: Well… but this is a transparently absurd way to set up an ML system to manage a vending machine. I mean it is a useful data point I guess, but to me it leads to the conclusion “Even though LLMs sound to humans like they know what they’re doing, they do not. Don’t just stick the whole situation into the LLM input and expect good decisions and strategies to come out of the output; you have to embed it into a more capable and structured system for any good to come of it.”
Updated thought, after reading a little bit of the paper: Holy Christ on a pancake. Is this architecture what people have been meaning by “AI agents” this whole time I’ve been hearing about them? Yeah this isn’t going to work. What the fuck, of course it goes insane over time. I stand corrected, I guess, this is valid research pointing out the stupidity of basically putting the LLM in the driver’s seat of something even more complicated than the stuff it’s already been shown to fuck up, and hoping that goes okay.
Sasha@lemmy.blahaj.zone 19 hours ago
I’m pretty sure they touch on those points in the paper, they knew they were overloading it and were looking at how it handled that in particular. My understanding is that they’re testing failure modes to try and probe the inner workings to some degree; they discuss the impact of filling up the context in the abstract, mention it’s designed to stress test and are particularly interested in memory limits, so I’m pretty sure they’ve deliberately chosen to not cater to an LLM’s ideal conditions. It’s not really a real world use case of LLMs running a business (even if that’s the framing given initially), it’s an experiment meant to break them in a simulated environment. The last line of the abstract kind of highlights this, they’re hoping to find flaws to improve the models generally.
Either way, I just meant to point out that they can absolutely just output junk as a failure mode.
PhilipTheBucket@piefed.social 19 hours ago
Yeah, I get it. I don’t think it is necessarily bad research or anything. I just feel like maybe it would have been good to go into it as two papers:
And yeah, obviously they can get confused or output counterfactuals or nonsense as a failure mode, what I meant to say was just that they don’t really do that as a response to an overload / “DDOS” situation specifically. They might do it as a result of too much context or a badly set up framework around them, sure.