Comment on AI chatbots unable to accurately summarise news, BBC finds
MoonlightFox@lemmy.world 1 week ago
I have been pretty impressed by Gemini 2.0 Flash.
It's slightly worse than the very best on the benchmarks I have seen, but it is pretty much instant and incredibly cheap. Maybe a loss leader?
Anyways, which of the commercial models do you consider to be good?
brucethemoose@lemmy.world 1 week ago
Benchmarks are so gamed, even Chatbot Arena is kinda iffy. TBH you have to test them with your prompts yourself.
Honestly I am getting incredible, creative responses from DeepSeek R1; the hype is real. Tencent's API is a bit underrated. If Llama 3.3 70B is smart enough for you, the Cerebras API is super fast.
MiniMax is ok for long context, but I still tend to lean on Gemini for this.
Knock_Knock_Lemmy_In@lemmy.world 1 week ago
What are the local use cases? I'm running on a 3060 Ti, but the output is always inferior to the free tier of the various providers.
Can I justify an upgrade to a 4090 (or more)?
MoonlightFox@lemmy.world 1 week ago
So there aren't any trustworthy benchmarks I can currently use to evaluate them? That, in combination with my personal anecdotes, is how I have been evaluating them.
I was pretty impressed with DeepSeek R1. I used their app, but not for anything sensitive.
I don't like that OpenAI defaults to a model I can't choose. I have to select the one I want each time; even when I use a special URL, it switches back after the first request.
I am having a hard time deciding which models to use beyond a random mix of o3-mini-high, o1, Sonnet 3.5, and Gemini 2.0 Flash.
brucethemoose@lemmy.world 1 week ago
Heh, only obscure ones that they can't game, and only if they fit your use case. One example is the set on EQ-Bench: eqbench.com
…And again, the best mix of models depends on your use case.
I can suggest using something like Open WebUI with APIs instead of the native apps. It gives you a lot more control, more powerful tooling to work with, and the ability to easily select and switch between models.
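For anyone curious what that looks like under the hood, here's a rough sketch of calling an OpenAI-compatible chat endpoint directly with the Python openai client, which is the same kind of connection you'd register in Open WebUI. The base URL, API key, and model name below are placeholders, not any specific provider's real values; swap in whichever provider you actually use (DeepSeek, Cerebras, etc.).

```python
# Minimal sketch: query an OpenAI-compatible API directly.
# base_url, api_key, and model are placeholders you must replace.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-provider.com/v1",  # provider's OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="example-model-name",  # whatever model the provider exposes
    messages=[
        {"role": "user", "content": "Summarise this article in three sentences: ..."},
    ],
)

print(response.choices[0].message.content)
```

Because most providers expose this same interface, switching models is mostly a matter of changing the base URL and model name, which is exactly what a frontend like Open WebUI makes painless.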