Screenshot of this question was making the rounds last week. But this article covers testing against all the well-known models out there.
Also includes outtakes on the ‘reasoning’ models.
Submitted 2 weeks ago by fubarx@lemmy.world to technology@lemmy.world
https://opper.ai/blog/car-wash-test
10 tests per model seems like far too few, and they should give confidence intervals…
The 10/10 vs. 8/10 difference is just as likely due to chance as to any real difference between models. But some people will definitely use this to justify model choice.
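To make the 10/10 vs. 8/10 point concrete, here is a small sketch (not from the article; the 10-trial sample size is from the article, the interval method is my choice) computing Wilson score confidence intervals for both results:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score ~95% confidence interval for a binomial proportion."""
    phat = successes / n
    denom = 1 + z**2 / n
    center = (phat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

lo10, hi10 = wilson_interval(10, 10)  # model that scored 10/10
lo8, hi8 = wilson_interval(8, 10)     # model that scored 8/10
print(f"10/10 -> [{lo10:.2f}, {hi10:.2f}]")  # roughly [0.72, 1.00]
print(f" 8/10 -> [{lo8:.2f}, {hi8:.2f}]")    # roughly [0.49, 0.94]
# The two intervals overlap heavily, so at n = 10 the data cannot
# distinguish a 10/10 model from an 8/10 model.
```

The overlap is the whole point: with only 10 trials, the uncertainty on each score is far larger than the 2-question gap between them.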
It should get it wrong 0% of the time because it is a computer that should have predictable results about basic things like requiring a car to be present to be washed.
I’m not talking about the quality of LLMs (they suck, in so many different ways…).
I’m criticizing the experimental setup; it is not statistically sound. With 52 different models and only 10 tests each, some model is likely to score 10/10 by pure chance even though its true accuracy is well below 100% (at a true per-question accuracy of 80%, the chance that at least one of 52 models goes 10-for-10 is over 99%). Doing 100 tests each might yield very different results, with none of the models answering correctly 100% of the time. Put another way, the p-values of the tests performed are high, not <0.05, so the results don’t really say what they purport to say.
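The multiple-comparisons effect described above is easy to check numerically. A minimal sketch, using the 52 models and 10 questions from the comment; the per-question accuracies (0.5 and 0.8) are illustrative assumptions, and the models are treated as independent:

```python
def p_at_least_one_perfect(p_correct, n_questions=10, n_models=52):
    """Chance that at least one of n_models answers all n_questions
    correctly, assuming each model independently answers each question
    correctly with probability p_correct."""
    p_perfect = p_correct ** n_questions
    return 1 - (1 - p_perfect) ** n_models

print(p_at_least_one_perfect(0.5))  # ~0.05: coin-flip models rarely produce a perfect run
print(p_at_least_one_perfect(0.8))  # ~0.997: decent-but-imperfect models almost surely do
```

So a leaderboard of 52 models at 10 questions each will likely crown a "perfect" model even when no model is actually perfect.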
I’m going to have to read this, because my knee-jerk reaction is that it depends on what type of wash you want to give the car. If you want to give the car an actual wash at the car wash, you’re going to have to drive it. But if you’re wanting to wash it at home, then it doesn’t matter how far away the car wash is, because you can just walk out your front door and grab your water hose and soap and shit.
if you’re wanting to wash it at home,
The AI should absolutely understand the implication that you want to wash your car at the car wash, not at home. The prompt makes that clear, even if only by implication.
“I want a hamburger. McDonald’s is three miles from me and Wendy’s is five miles. Which is the cheaper place to get a burger from when you consider the distance to each?” is not an exact analogy, but the point is that it should be ABSOLUTELY clear that you do not wish to make your own hamburger. Any response that discusses that as an option is ridiculous - unless maybe it’s one of those options-at-the-end things LLMs love to do, but it has no part in the main answer at all.
Pretty sure if you asked this on Stack Overflow you would get a bunch of responses telling you to make it at home, and then someone would lock your question
They said it better than I can
But I also get where you’re coming from; the prompt itself is weak and leaves out several assumptions that would make better answers possible from an LLM.
They didn’t take into account the “thinking mode” - most models pass when thinking is activated
Sure they did. They even had a notation on the results table that Grok passed except when reasoning mode was off.
Kinda neat about the human responses… sure, some are trolling, but maybe we have to question our global expectations. In North America, a car wash tends to be a garage thing with either automated cleaning or a set of supplies to clean your car, and the car has to be in the shed to be cleaned effectively. But where washing your car by hand is the norm, I wonder if people in some countries surmise that the cleaning staff could just walk over to the car with the sponges, buckets, hoses, and so on, if you’re already 50 metres away from the washing point.
Ain’t no business gonna let employees LEAVE the property to wash some idiot’s car down the road
Remember that LLMs don’t understand very well what a car wash is, since it can be both a place and an action. Can you define a car wash? There are many types… I can see future LLMs starting to ask useful follow-up/clarifying questions before giving their answers, which could help those who rely on them so much to understand how their questions can be misconstrued.
I remember years ago getting downvoted into oblivion both here, and on Reddit for saying that AI would be a disaster.
I watched this in a YouTube Shorts format a week ago, where they ask a few models about walking or driving to the car wash.
Qwen3 feels left out. All the 30B models I have failed the test.
Qwen3-4B HIVEMIND (abliterated) got it in 2, though, and it scores a lot higher on the PIQA, HellaSwag, and Winogrande benchmarks than normal Qwen3-30B. I think the new abliteration methods actually strengthen real-world understanding.
I wonder if an abliterated VL model would do even better? They tend to score better on the real-world-model benchmarks.
I’d like to think a lot of these gotcha prompts rely on verbal misunderstanding rather than a failure in world models, but I can’t say that for certain. It was pretty funny, though: I saw ChatGPT recommend “yeah, lift the car and carry it on your back. Make sure to bend your knees” (though I’m guessing someone edited that for the lulz)
Did this say whether the reasoning models get this right more often than the others? I was curious about that but missed it if it was mentioned.
Mistral (the free version) seems to get it right. Maybe they fixed it specifically?
Drive. Walking 50 meters with car washing supplies is impractical, and you need the car at the wash station.
Ha ha! Nice one - the paid version is worse
melsaskca@lemmy.ca 2 weeks ago
I don’t use AI but read a lot about it. I now want to google how it attacks the trolley problem.