Comment on Car Wash Test on 53 leading AI models: "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"

SuspciousCarrot78@lemmy.world 1 week ago

Not sure how we’re quantifying intelligence here. Benchmarks?

Qwen3-4B 2507 Instruct (4B) outperforms GPT-4.1 nano (7B) on all stated benchmarks. It outperforms GPT-4.1 mini (~27B according to scuttlebutt) on mathematical and logical reasoning benchmarks, but loses (barely) on instruction-following and knowledge benchmarks. It outperforms GPT-4o on a few specific domains (math, creative writing), but loses overall (because of course it would).

So, in that instance, a 4B beats a 7B (globally), a 27B (significantly) and a 500B(?) (situationally).

It's sort of wild to think that 2024 SOTA is roughly a ‘strong’ 4-12B model these days.

I think we’re getting to the point where the next step forward is going to be “densification” and/or an architecture shift (maybe M$ can finally pull their finger out and release the promised 1.58-bit next-step architectures).

ICBW / IANAE
