I don’t know why you assume there has to be a gotcha. Maybe it’s the competitive background… Anyway, it’s visual because you look at it to see it. And it’s not best human performance vs. best LLM performance; it’s best performance under controlled conditions, because the testing is limited to a set of parameters.
floquant@lemmy.dbzer0.com 2 months ago
That’s what games are? I really don’t see why you think it’s an unfair comparison. How would you change it?
lath@lemmy.world 2 months ago
Stress test it. Low, average, high, impairment conditions, safeguards off, order, chaos and everything in between.
gnufuu@infosec.pub 2 months ago
I haven’t read all of their Benchmark introduction and Technical Documentation. I assume you have and didn’t find any of the tests you’re asking for?