jaykrown@lemmy.world 2 days ago
Interesting, I tried it with DeepSeek and got an incorrect response from the direct model without thinking, but then got the correct response with thinking. There's a reason for the shift towards "thinking" models: it forces the model to build its own context before giving a concrete answer.
[Image: Without DeepThink]
[Image: With DeepThink]
rockSlayer@lemmy.blahaj.zone 2 days ago
It's interesting to see it build the context necessary to answer the question, but this seems like a lot of text just to come up with a simple answer.
Schadrach@lemmy.sdf.org 2 days ago
The whole premise of DeepThink and similar features in other models is to come up with an answer, then have the model ask itself whether the answer is right and how it could be wrong, repeating until the result is stable.
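In rough Python, the loop looks something like this (a minimal sketch; `generate()` is a toy stub standing in for a real model call, and the prompts are made up for illustration):

```python
# Toy stand-in for a real model call: answers wrong on the first pass,
# then "corrects itself" and stabilizes, just so the loop is runnable.
_calls = 0

def generate(prompt: str) -> str:
    global _calls
    _calls += 1
    return "yes, it's in Unicode" if _calls == 1 else "no, there is no seahorse emoji"

def stable_answer(question: str, max_rounds: int = 5) -> str:
    # First pass: propose an answer.
    answer = generate(f"Q: {question}\nA:")
    for _ in range(max_rounds):
        # Ask the model to critique its own answer and revise it.
        revised = generate(
            f"Q: {question}\nProposed answer: {answer}\n"
            "Is this right? How could it be wrong? Give a revised answer."
        )
        if revised == answer:
            break  # the answer stopped changing, so treat it as stable
        answer = revised
    return answer

print(stable_answer("Is there a seahorse emoji?"))
# -> "no, there is no seahorse emoji"
```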
The seahorse emoji question is one that trips up a lot of models; it's a Mandela effect thing, where the emoji doesn't exist but lots of people remember it and are consequently firm that it's real. I asked GLM 4.7 about it with deep think on, and it wrote about two dozen paragraphs trying to think of everywhere a seahorse emoji could be hiding: whether it was in a previous or upcoming standard, whether another emoji might be mistaken for a seahorse, and so on. It eventually decided that it didn't exist, double-checked that it wasn't missing anything, and gave an answer.
It was startlingly like the stream of consciousness of someone experiencing the Mandela effect, desperately trying to find evidence they were right, except it eventually gave up and realized the truth.
Buffy@libretechni.ca 2 days ago
They're showing the thinking the model did; the actual response is the sentence at the end.