Screenshot of this question was making the rounds last week. But this article covers testing against all the well-known models out there.
Also includes outtakes on the ‘reasoning’ models.
Submitted 2 weeks ago by fubarx@lemmy.world to technology@lemmy.world
https://opper.ai/blog/car-wash-test
10 tests per model seems like far too few, and they should give confidence intervals…
The 10/10 vs. 8/10 difference is just as likely due to chance as to any real difference between models. But some people will definitely use this to justify model choice.
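To make the 10/10 vs. 8/10 point concrete, here is a small sketch (not from the article; the 10-trial sample size is from the article, the interval method is my choice) computing Wilson score confidence intervals for both results:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score ~95% confidence interval for a binomial proportion."""
    phat = successes / n
    denom = 1 + z**2 / n
    center = (phat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

lo10, hi10 = wilson_interval(10, 10)  # model that scored 10/10
lo8, hi8 = wilson_interval(8, 10)     # model that scored 8/10
print(f"10/10 -> [{lo10:.2f}, {hi10:.2f}]")  # roughly [0.72, 1.00]
print(f" 8/10 -> [{lo8:.2f}, {hi8:.2f}]")    # roughly [0.49, 0.94]
# The two intervals overlap heavily, so at n = 10 the data cannot
# distinguish a 10/10 model from an 8/10 model.
```

The overlap is the whole point: with only 10 trials, the uncertainty on each score is far larger than the 2-question gap between them.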
It should get it wrong 0% of the time because it is a computer that should have predictable results about basic things like requiring a car to be present to be washed.
I’m not talking about the quality of LLMs (they suck, in so many different ways…).
I’m criticizing the experimental setup; it is not statistically sound. With 52 different models and only 10 tests each, some model is likely to score 10/10 by pure chance even though its true accuracy is well below 100% (at a true per-question accuracy of 80%, the chance that at least one of 52 models goes 10-for-10 is over 99%). Doing 100 tests each might yield very different results, with none of the models answering correctly 100% of the time. Put another way, the p-values of the tests performed are high, not <0.05, so the results don’t really say what they purport to say.
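The multiple-comparisons effect described above is easy to check numerically. A minimal sketch, using the 52 models and 10 questions from the comment; the per-question accuracies (0.5 and 0.8) are illustrative assumptions, and the models are treated as independent:

```python
def p_at_least_one_perfect(p_correct, n_questions=10, n_models=52):
    """Chance that at least one of n_models answers all n_questions
    correctly, assuming each model independently answers each question
    correctly with probability p_correct."""
    p_perfect = p_correct ** n_questions
    return 1 - (1 - p_perfect) ** n_models

print(p_at_least_one_perfect(0.5))  # ~0.05: coin-flip models rarely produce a perfect run
print(p_at_least_one_perfect(0.8))  # ~0.997: decent-but-imperfect models almost surely do
```

So a leaderboard of 52 models at 10 questions each will likely crown a "perfect" model even when no model is actually perfect.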
I’m going to have to read this, because my knee-jerk reaction is that it depends on what type of wash you want to give the car. If you want to give the car an actual wash at the car wash, you’re going to have to drive it. But if you’re wanting to wash it at home, then it doesn’t matter how far away the car wash is, because you can just walk out your front door and grab your water hose and soap and shit.
if you’re wanting to wash it at home,
The AI should absolutely understand the implication that you want to wash your car at the car wash, not at home. The prompt makes that clear, even if only by implication.
“I want a hamburger. McDonald’s is three miles from me and Wendy’s is five miles. Which is the cheaper place to get a burger from when you consider the distance to each?” is not an exact analogy, but the point is that it should be ABSOLUTELY clear that you do not wish to make your own hamburger. Any response that discusses that as an option is ridiculous - unless maybe it’s one of those options-at-the-end things LLMs love to do, but it has no part in the main answer at all.
Pretty sure if you asked this on Stack Overflow you would get a bunch of responses telling you to make it at home, and then someone would lock your question
They said it better than I can
But I also get where you’re coming from; the prompt itself is weak and leaves out several assumptions that would make better answers possible from an LLM.
They didn’t take into account the “thinking mode” - most models pass when thinking is activated
Sure they did. They even had a notation on the results table that Grok passed except when reasoning mode was off.
Kinda neat about the human responses… sure, some are trolling, but maybe we have to question our global expectations. In North America, a car wash tends to be a garage thing with either automated cleaning or a set of supplies to clean your car, and the car has to be in the shed to be cleaned effectively. But where washing your car by hand is the norm, I wonder if people in some countries surmise that the cleaning staff could just walk over to the car with the sponges, buckets, hoses, and so on, if you’re already 50 metres away from the washing point.
Ain’t no business gonna let employees LEAVE the property to wash some idiot’s car down the road
Remember that LLMs don’t understand very well what a car wash is, since it can be both a place and an action. Can you define a car wash? There are many types… I can see future LLMs starting to ask useful follow-up/clarifying questions before giving their answers, which could help those who rely on them so much to understand how their questions can be misconstrued.
I remember years ago getting downvoted into oblivion both here, and on Reddit for saying that AI would be a disaster.
I watched this in a YouTube Shorts format a week ago, where they ask a few models about walking or driving to the car wash.
Qwen3 feels left out. All the 30B models I have failed the test.
Qwen3-4B HIVEMIND (abliterated) got it in 2, though, and it scores a lot higher on the PIQA, HellaSwag, and Winogrande benchmarks than normal Qwen3-30B. I think the new abliteration methods actually strengthen real-world understanding.
I wonder if an abliterated VL model would do even better? They tend to score better on the real-world-model benchmarks.
I’d like to think a lot of these gotcha prompts rely on verbal misunderstanding rather than a failure in world models, but I can’t say that for certain. It was pretty funny, though: I saw ChatGPT recommend “yeah, lift the car and carry it on your back. Make sure to bend your knees” (though I’m guessing someone edited that for the lulz)
Did this say whether the reasoning models get this right more often than the others? I was curious about that but missed it if it was mentioned.
Mistral (the free version) seems to get it right. Maybe they fixed it specifically?
Drive. Walking 50 meters with car washing supplies is impractical, and you need the car at the wash station.
Ha ha! Nice one - the paid version is worse
melsaskca@lemmy.ca 2 weeks ago
I don’t use AI but read a lot about it. I now want to google how it attacks the trolley problem.