Comment

Comment on Consumer hardware is no longer a priority for manufacturers

melfie@lemy.lol ⁨3⁩ ⁨months⁩ ago

I’ve been looking into self-hosting LLMs, and it seems a $10k GPU is kind of a requirement to run a decently-sized model and get reasonable tokens / s rate. There’s CPU and SSD offloading, but I’d imagine it would be frustratingly slow to use. I find cloud-based AI like GH Copilot to be rather annoyingly slow. Even so, GH Copilot is like $20 a month per user, and I’d be curious what the actual costs are per user considering the hardware and electricity cost.

What we have now is clearly an experimental first generation of the tech, but the industry is building out data centers as though it’s always going to require massive GPUs / NPUs with wicked quantities of VRAM to run these things. If it really will require huge data centers full of expensive hardware where each user prompt requires minutes of compute time on a $10k GPU, then it can’t possibly be profitable to charge a nominal monthly fee to use this tech, but maybe there are optimizations I’m unaware of.

source

Sort:hotnew top

brucethemoose@lemmy.world ⁨3⁩ ⁨months⁩ ago
This is not true. I have a single 3090 + 128GB CPU RAM (which wasn’t so expensive that long ago), and I can run GLM 4.6 350B at 6 tokens/sec. I can run sparser models like Stepfun 3.5, GLM Air or Minimax 2.1 much faster, and these are all better than the cheapest API models.

source
- melfie@lemy.lol ⁨3⁩ ⁨months⁩ ago
  Appreciate all the info! I did find this calculator the other day, and it’s pretty clear the RTX 4060 in my server isn’t going to do much though its NVMe may help.
  
  apxml.com/tools/vram-calculator
  
  I’m also not sure under 10 tokens per second will be usable, though I’ve never really tried it.
  
  I’d be hesitant to buy something just for AI that doesn’t also have RTX cores because I do a lot of Blender rendering. RDNA 5 is supposed to have more competitive RTX cores along with NPU cores, so I guess my ideal would be a SoC with a ton of RAM. Maybe when RDNA 5 releases, the RAM situation will have have blown over and we will have much better options.
  
  source
  - brucethemoose@lemmy.world ⁨3⁩ ⁨months⁩ ago
    
    I did find this calculator the other day
    
    That calculator is total nonsense. Don’t trust anything like that; at best, its obsolete the week after its posted.
    
    I’d be hesitant to buy something just for AI that doesn’t also have RTX cores because I do a lot of Blender rendering. RDNA 5 is supposed to have more competitive RTX cores
    
    Yeah, that’s a huge caveat. AMD Blender might be better than you think though, and you can use your RTX 4060 on a Strix Halo motherboard just fine.
    
    along with NPU cores, so I guess my ideal would be a SoC with a ton of RAM
    
    So far, NPUs have been useless. Don’t buy any of that marketing.
    
    I’m also not sure under 10 tokens per second will be usable, though I’ve never really tried it.
    
    That’s still 5 words/second. That’s not a bad reading speed.
    
    Whether its enough. GLM 350B without thinking is smarter than most models with thinking, so I end up with better answers faster.
    
    But anyway, I’m looking at more like 20-30 tokens a second into models that aren’t squeezed into my rig within an inch of their life. If you buy an HEDT/Server CPU with more RAM channels, it’s even faster.
    
    source
    melfie@lemy.lol ⁨3⁩ ⁨months⁩ ago
    Ah, a lot of good info! Thanks, I’ll look into all of that!
    
    source
- WhyJiffie@sh.itjust.works ⁨3⁩ ⁨months⁩ ago
  
  You can’t just do “ollama run” and expect good performance, as the local LLM scene is finicky and highly experimental. You have to compile forks and PRs, learn about sampling and chat formatting, perplexity and KL divergence, about quantization and MoEs and benchmarking. Everything is moving too fast, and is too performance sensitive, to make it that easy, unfortunately.
  
  how do you have the time to figure all these out and keep being up to date? do you do this at work?
  
  source
  - brucethemoose@lemmy.world ⁨3⁩ ⁨months⁩ ago
    As a hobby mostly, but its useful for work.
    
    Reading my own quote, I was being a bit dramatic. But at the very least it is super important to grasp some basic concepts (like MoE offloading and quantization), and watch for new releasing in LocalLlama or whatever. You kinda do have to follow things, yes.
    
    source
Xenny@lemmy.world ⁨3⁩ ⁨months⁩ ago
Ai failed and now they are doing this to capture the compute market to then make their profit back through unscrupulous means.

source
hector@lemmy.today ⁨3⁩ ⁨months⁩ ago
As I am told, there is no way these llm’s ever make their investments back. It’s like Tesla at this point. Whomever is paying the actual money to build this stuff is going to get hosed if they can’t offload it onto some other sucker. That ultimate sucker probably being the US taxpayer.

source
Analog@lemmy.ml ⁨3⁩ ⁨months⁩ ago
Can run decent size models with one of these: …minisforum.com/…/minisforum-ms-s1-max-mini-pc

For $1k more you can have the same thing from nvidia in their dgx spark. You can use high speed fabric to connect two of ‘em and run 405b parameter models, or so they claim.

Point being that’s some pretty big models in the 3-4k range, and massive models for less than 10k. The nvidia one supports comfyui so I assume it supports cuda.

It ain’t cheap and AI has soooo many negatives, but… it does have some positives and local LLMs mitigate some of the minuses, so I hope this helps!

source
- melfie@lemy.lol ⁨3⁩ ⁨months⁩ ago
  Nice, though $3k is still getting pretty pricey. I see mini PCs with a AMD RYZEN AI MAX+ 395 and 96GB of RAM can be had for $2k, or even $1k with less RAM: gmktec.com/…/amd-ryzen™-ai-max-395-evo-x2-ai-mini…
  
  I’m looking for something that also does path tracing well if I’m going to drop that kind of coin. It sounds like this chip can be on par with a 4070 for rasterization, but it only gets a benchmark score of 495 for Blender rendering compared to 3110 for even a RTX 4060. RDNA 5 with true RTX cores should drastically change the situation of chips like this, though.
  
  source
  - brucethemoose@lemmy.world ⁨3⁩ ⁨months⁩ ago
    FYI you can buy this this: frame.work/…/framework-desktop-mainboard-amd-ryze…
    
    And stick a regular Nvidia GPU on it. Or an AMD one.
    
    That’d give you the option to batch renders across the integrated and discrete GPUs too, if that fits your workflow.
    
    source
Clam_Cathedral@lemmy.ml ⁨3⁩ ⁨months⁩ ago
Honestly just jump in with whatever hardware you have available and a small 1.5b/7b model. You’ll figure out all the difficult uncertainties as you go and try to improve things.

I’m hosting a few lighter models that are somewhat useful and fun without even using a dedicated GPU- just a lot of ram and fast NVMe so the models don’t take forever to spin up.

Of course I’ve got an upgrade path in mind for the hardware and to add a GPU but there are other places I’d rather put the money atm and I do appreciate that it all currently runs on a 250w PSU.

source