Appreciate all the info! I did find this calculator the other day, and it’s pretty clear the RTX 4060 in my server isn’t going to do much, though its NVMe may help.
apxml.com/tools/vram-calculator
I’m also not sure anything under 10 tokens per second will be usable, though I’ve never really tried it.
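I don’t know exactly what that calculator does internally, but the back-of-envelope math looks something like this sketch (Llama-3-8B-ish numbers, illustrative rather than exact):

```python
# Rough sketch of the arithmetic a VRAM calculator does;
# the model numbers below are illustrative, not exact.

def weight_vram_gb(n_params: float, bits_per_weight: float) -> float:
    """Quantized weights cost params * bits / 8 bytes."""
    return n_params * bits_per_weight / 8 / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """K and V, per layer per token, typically fp16 (2 bytes)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

weights = weight_vram_gb(8e9, 4.5)   # ~4.5 GB at roughly Q4
kv = kv_cache_gb(32, 8, 128, 8192)   # ~1.1 GB at 8k context
print(f"~{weights + kv:.1f} GB before runtime overhead")
```

So an 8B model at Q4 just about fits in the 4060’s 8 GB, but anything bigger (or a longer context) spills into system RAM or NVMe, and that’s where the speed falls off a cliff.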
I’d be hesitant to buy something just for AI that doesn’t also have RT cores because I do a lot of Blender rendering. RDNA 5 is supposed to have more competitive RT cores along with NPU cores, so I guess my ideal would be an SoC with a ton of RAM. Maybe by the time RDNA 5 releases, the RAM situation will have blown over and we’ll have much better options.
WhyJiffie@sh.itjust.works 1 day ago
how do you have the time to figure all this out and stay up to date? do you do this at work?
brucethemoose@lemmy.world 23 hours ago
As a hobby mostly, but it’s useful for work.
Reading my own quote, I was being a bit dramatic. But at the very least it is super important to grasp some basic concepts (like MoE offloading and quantization), and watch for new releases in LocalLlama or whatever. You kinda do have to follow things, yes.
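For example, here’s a minimal back-of-envelope sketch of why MoE offloading pays off, assuming Mixtral-8x7B-style numbers (approximate) and ignoring runtime details:

```python
# Why MoE offloading works: only a few experts fire per token,
# so streaming experts from system RAM costs a fraction of the
# model's total size. Mixtral-8x7B-style numbers, approximate.

total_params  = 46.7e9  # all experts plus shared weights
active_params = 12.9e9  # shared weights + 2-of-8 experts per token
bits = 4.5              # roughly Q4 quantization

total_gb  = total_params  * bits / 8 / 1e9   # what has to fit in RAM
active_gb = active_params * bits / 8 / 1e9   # what's read per token
print(f"in RAM: ~{total_gb:.0f} GB, read per token: ~{active_gb:.1f} GB")
# At ~50 GB/s of dual-channel DDR5 bandwidth, ~7 GB per token
# works out to roughly 7 tokens/s, which is why CPU offload
# tends to hover near that 10 tok/s line mentioned above.
```

The point is that a dense 47B model at Q4 would be hopeless on CPU, but an MoE only touches a quarter of its weights per token, so RAM bandwidth rather than GPU VRAM becomes the ceiling.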