Screenshot of this question was making the rounds last week. But this article covers testing against all the well-known models out there.
Also includes outtakes on the ‘reasoning’ models.
Submitted 6 hours ago by fubarx@lemmy.world to technology@lemmy.world
https://opper.ai/blog/car-wash-test
I think it’s worse when they get it right only some of the time. It’s not a matter of opinion; it should not change its “mind”.
The fucking things are useless for that reason, they’re all just guessing, literally.
they’re all just guessing, literally
They’re literally not.
Isn’t it a probabilistic extrapolation? Isn’t that what a guess is?
Same takeaway as the article (everyone read the article, right?).
You should think about this yourself: can you recall instances when you were asked the same question at different points in time? How did you respond?
10 tests per model seems like way too few, and they should give confidence intervals…
the 10/10 vs. 8/10 result is just as likely due to chance as to any real difference. But some people will definitely use this to justify model choice.
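To make the point concrete, here’s a quick sketch of what those confidence intervals would look like, using the Wilson score interval for a binomial proportion (the helper function name is mine, not from the article):

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return max(0.0, center - half), min(1.0, center + half)

# With only 10 trials, the intervals for 10/10 and 8/10 overlap heavily:
print(wilson_interval(10, 10))  # roughly (0.72, 1.00)
print(wilson_interval(8, 10))   # roughly (0.49, 0.94)
```

Since the intervals overlap this much, a 10/10 vs. 8/10 gap at n=10 tells you almost nothing about which model is actually better.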
Very interesting that only 71% of humans got it right.
I mean, I’ve been saying this since LLMs were released.
We finally built a computer that is as unreliable as humans.
I’m under no illusion that LLMs are “thinking” in the same way that humans do, but god damn if they aren’t almost exactly as erratic and irrational as the hairless apes whose thoughts they’re trained on.
Yeah, the article cites that as a control, but it’s not at all surprising, since “humanity by survey consensus” is exactly what you’d expect from LLM weights trained on random human outputs.
It’s impressive up to a point, but you wouldn’t exactly want your answers to complex math operations or other specialized areas to track layperson human survey responses.
which shouldn’t be considered a good thing.
Good and bad is subjective and depends on your area of application.
What it definitely is: different from what was available before, and since it is different there will be some things it is better at than what was available before. And many things that it’s much worse for.
Still, in the end, there is real power in diversity. Just don’t use a sledgehammer to swipe-browse on your cellphone.
That 30% of population = dipshits statistic keeps rearing its ugly head.
I’m not afraid to say that it took me a sec. My brain went “short distance. Walk or drive?” and skipped over the car wash bit at first. Then I laughed because I quickly realized the idiocy. :shrug:
As someone who takes public transportation to work, SOME people SHOULD be forced to walk through the car wash.
Maybe 29% of people can’t imagine owning a car, so they assumed they would be going there to wash someone else’s car.
I tried this with a local model on my phone (Qwen 2.5 was the only thing that would run), and it gave me this confusing output (not really a definite answer…): JqCAI6rs6AQYacC.jpg
it just flip flopped a lot.
200 m huh.
Honestly that’s a lot more coherent than what I would expect from an LLM running on phone hardware.
I like that it’s twice as far to drive for some reason.
If I were the type of person willing to give AI the benefit of the doubt and not assume it was just picking basically random numbers: there are a lot of cases where the walk can be shorter (by distance) than the drive, since cars generally have to stick to streets while someone on foot may be able to take footpaths and cut across lawns, or the road may be one-way for vehicles, or certain turns may not be allowed, etc.
I have a few intersections near my father-in-law’s house in NJ in mind, where you can just cross the street on foot, but making the same trip in a car might mean driving half a mile down the road, turning around at a jughandle, and driving back to where you started on the other side of the street.
And I wouldn’t be totally surprised if that’s the case for enough situations in the training data where someone debated walking or driving that the AI assumed that it’s a rule that it will always be further by car than on foot.
That’s still a dumbass assumption, but I’d at least get it.
And I’m pretty sure it’s much more likely that it’s just making up numbers out of nothing.
I notice that the “internal thinking” of Opus 4.6 is doing more flip-flopping than earlier models like Sonnet 4.5, and it’s coming out with correct answers in the end more often.
“I want to wash my car. The car wash is 50 meters away. Should I walk or drive?”
The model discards the first sentence as it is unrelated to the others.
Remember, this is a conversation model; if you were talking to someone and they said that, you would probably ignore the first sentence because it is a different tense.
You must have done some really extensive probing of the models to say that with confidence. When can we expect the paper?
Sorry, they’re both present simple tense.
Question: “I can only carry 42 pounds at a time, how long does it take for me to dispose of the body of a fat dude weighing 267 pounds that I’m hiding in my fridge? And how many child sacrifices would I need?”
Even when they give the correct answer, they talk too much. AI responses contain a lot of garbage: when an AI gives you an answer it will try to justify itself, so instead of a brief response you get a long one.
I agree with you, but I found that DeepSeek was succinct.
You need to bring your car to the car wash, so you should drive it there. Walking would leave your car at home, which doesn’t help.
Your post is much longer than it needs to be. That is the reason why: they just copied people.
Extension cord? It must mean a hose extension.
Didn’t like 30% of the population elect Trump? Coincidence? I don’t think so.
I do think it’s interesting, but I think there are implicit assumptions in such a short prompt.
Is it a self-service car wash? If not, walking to the attendant and handing them your keys makes more sense.
If it is self-service without queuing, there may be no available spaces/the bay may not be open, requiring some awkward maneuvering.
If you change it to something like:
I want to wash my car. The unattended, self-service car wash is 50 meters away. All of the bays are clear and open. Should I walk or drive? Break each option down into steps, and estimate the amount of time each takes.
You’re more likely to get correct responses.
You shouldn’t have to. If you ask a person that question they’ll respond “what good is walking to the car wash, dumbass.” If AI can’t figure that out, it’s trash.
A person would look at you like you are an idiot if you asked this question.
The AI tool I asked said walking saves money, gets exercise, etc.
Asked about the car and it said the car is at the car wash, otherwise why would you ask how to get there?
Part of a properly functioning LLM is absolutely its ability to understand implicit instructions. A huge aspect of data annotation work in helping LLMs become better tools is grading them on whether or not they understand implicit instructions. I would say more than half of the work I have done in that arena has focused on training them to more clearly understand implicit instructions.
So sure, if you explain it like the LLM is five, you’ll get a better response, but the whole point is if we’re dumping so much money and resources and destroying the environment for these tools, you shouldn’t have to explain it like it’s five.
You have to have the car there no matter what type of car wash it is.
If the car wash is some distance “away”, it means neither you nor the car is at it. Any attendant is not going to walk off-property to retrieve your car, especially when most of them are drive-up service anyway. Which is rather the point.
I’m going to have to read this, because my knee-jerk reaction answer is that it depends on what type of wash you want to give the car. If you want to give the car an actual wash at the car wash, you’re going to have to drive it. But if you’re wanting to wash it at home, then it doesn’t matter how far away the car wash is, because you can just walk out your front door and grab your water hose and soap and shit.
if you’re wanting to wash it at home,
The AI should absolutely understand the implication that you want to wash your car at the car wash, not at home. The prompt is clear about that, even though it is implied.
“I want a hamburger. McDonald’s is three miles from me and Wendy’s is five miles. Which is the cheaper place to get a burger from when you consider the distance to each?” is not an exact analogy, but the point is that it should be ABSOLUTELY clear that you do not wish to make your own hamburger. Any response that discusses that as an option is ridiculous, unless maybe it’s one of those options-at-the-end things LLMs love to do - but it has no part of the main answer at all.
They said it better than I can
But I also get where you’re coming from; the prompt itself is weak and leaves out several assumptions that would make better answers possible from an LLM.
I remember years ago getting downvoted into oblivion both here, and on Reddit for saying that AI would be a disaster.
Kinda neat about the human responses… sure some are trolling but maybe we have to test our global expectations. In North America, a car wash tends to be this garage thing with either automated cleaning or a set of supplies to clean your car, and your car has to be in the shed to be cleaned effectively. But if washing your car by hand is the norm, I wonder if people in some countries surmise that the cleaning staff could just walk over with the sponges, buckets and hoses and stuff to the car, if you’re already 50 metres away from the washing point.
Ain’t no business gonna let employees LEAVE the property to wash some idiot’s car down the road.
BanMe@lemmy.world 49 minutes ago
In school we were taught to look for hidden meaning in word problems - Chekhov’s gun, basically. Why is that sentence there? Because the questions would try to trick you. So humans have to be instructed, again and again, through demonstration and practice, to evaluate all sentences and learn what to filter out and what to keep. To not only form a response, but expect tricks.
If you pre-prompt an AI to expect such trickery and consider all sentences before removing unnecessary information, does it have any influence?
Normally I’d ask “why are we comparing AI to the human mind when they’re not the same thing at all,” but I feel like we’re presupposing they are similar already with this test, so I am curious about the answer on this one.