SuspciousCarrot78@lemmy.world 2 weeks ago
Qwen3-4B HIVEMIND (abliterated) got it in 2, and it scores a lot higher on the PIQA, HellaSwag, and Winogrande benchmarks than the normal Qwen3-30B. I think the new abliteration methods actually strengthen real-world understanding.
imgur.com/a/7YZme4i
imgur.com/a/25ApzDN
I wonder if an abliterated VL model would do even better? They tend to score higher on the real-world-model benchmarks.
I’d like to think a lot of these gotcha prompts rely on verbal misunderstanding rather than a failure in world models, but I can’t say that for certain. It was pretty funny, though: I saw ChatGPT recommend “yeah, lift the car and carry it on your back. Make sure to bend your knees” (though I’m guessing someone edited that for the lulz).