They can ALL be run on RAM, theoretically. I bought 128GB so I can run GLM 4.5 with the experts offloaded to CPU, with a custom trellis/K-quant mix; but this is a 'personal use' tinkerer setup.
Qwen Next is good at that because it has a very low active parameter count.
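Concretely, pure-CPU inference of a GGUF quant only takes a few lines; here's a minimal sketch with the llama-cpp-python bindings (the model path and settings are placeholders, not a recommendation):

```python
# Minimal sketch, assuming llama-cpp-python and a local GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="models/some-model-Q4_K_M.gguf",  # hypothetical file name
    n_gpu_layers=0,   # 0 = keep everything in system RAM, pure CPU inference
    n_ctx=8192,       # context window; bigger needs more RAM
    n_threads=8,      # roughly match your physical core count
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```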
…But they aren’t actually deployed that way. They’re basically always deployed on cloud GPU boxes that serve dozens/hundreds of people at once, in parallel.
AFAIK the only major model actually developed for CPU inference is one of the esoteric Gemma releases, aimed at mobile.
Passerby6497@lemmy.world 3 days ago
I for one would enjoy triggering your unskippable cutscenes on setting up local CPU-based AI, if it can work on Linux with an older AMD card.
Don’t have funds for anything fancy, but would be interested in playing around with it. Been wanting to get something like that set up for Home Assistant.
brucethemoose@lemmy.world 3 days ago
Plenty of folks do AMD. A popular ‘homelab’ setup is 32GB AMD MI50s. Even Intel is fine these days!
But what’s your setup, precisely? CPU, RAM, and GPU.
afk_strats@lemmy.world 3 days ago
I have an MI50/7900 XTX gaming/AI setup at home which I use for learning and to test out different models. Happy to answer questions.
Passerby6497@lemmy.world 2 days ago
Looks like I’m running an AMD Ryzen 5 2600 CPU, AMD Radeon RX 570 GPU, and 32GB RAM
brucethemoose@lemmy.world 2 days ago
Mmmmm… I would wait a few days and try a GGUF quantization of Kimi Linear once it's better supported: huggingface.co/…/Kimi-Linear-48B-A3B-Instruct
Otherwise you can mess with Qwen 3 VL now, in the native llama.cpp UI: huggingface.co/…/Qwen3-VL-30B-A3B-Instruct-UD-Q4_…
If you’re interested, I can work out an optimal launch command. But to be blunt, with that setup, you’re kinda better off using free LLM APIs with a local chat UI.
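For reference, the client side looks much the same either way: llama.cpp's llama-server exposes an OpenAI-compatible endpoint (default port 8080), and hosted APIs speak the same protocol, so a chat UI or quick script just needs a base URL and key. Rough sketch, with placeholder URL, key, and model name:

```python
# Minimal sketch using the openai client against an OpenAI-compatible endpoint.
# Works the same whether base_url points at a local llama-server or a hosted free API.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # llama-server default; swap in a hosted API URL
    api_key="not-needed-locally",         # local servers typically ignore the key
)

resp = client.chat.completions.create(
    model="local-model",  # local servers accept any name; hosted APIs need the real one
    messages=[{"role": "user", "content": "Summarize what a GGUF quant is."}],
)
print(resp.choices[0].message.content)
```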
ag10n@lemmy.world 3 days ago
You can use Vulkan fairly easily as long as you have 8GB of VRAM
blog.linux-ng.de/…/running-llms-with-llama-cpp-us…
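That post covers the llama.cpp binaries; if you'd rather drive it from Python, llama-cpp-python can (as far as I know) be built against the same Vulkan backend, e.g. with CMAKE_ARGS="-DGGML_VULKAN=on" at install time, and then offloading is the usual call. Rough sketch, model path and sizes are placeholders:

```python
# Minimal sketch, assuming a llama-cpp-python build with the Vulkan backend enabled.
from llama_cpp import Llama

llm = Llama(
    model_path="models/small-model-Q4_K_M.gguf",  # placeholder; pick a quant that fits your VRAM
    n_gpu_layers=-1,   # -1 = offload every layer to the GPU (Vulkan here)
    n_ctx=4096,
)
print(llm("Q: What is Vulkan?\nA:", max_tokens=64)["choices"][0]["text"])
```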
Passerby6497@lemmy.world 1 day ago
Only got 4GB of VRAM, unfortunately
brucethemoose@lemmy.world 3 days ago
The key is which one, and how, though.
For the really sparse models, you might be better off trying ik_llama.cpp, especially if you are targeting a ‘small’ quant.
SabinStargem@lemmy.today 3 days ago
If you just want an easy way to set up AI on Windows or Linux, KoboldCPP is my recommendation for your backend. It supports the GGUF format, which lets you use RAM and VRAM simultaneously. It won’t be the fastest thing, but it is easy enough to set up, with a bundled GUI for prep and actual usage. Through the IP address it gives you, you can hook the backend into a frontend of your choice.
KoboldCPP
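For the "hook it into a frontend" part, here's a quick sketch of what a frontend (or any script) sends to the address KoboldCPP prints; this assumes the default port 5001 and its KoboldAI-style generate endpoint, so adjust to whatever your console actually shows:

```python
# Minimal sketch: hitting a running KoboldCPP backend from a script.
# Assumes http://localhost:5001 and the /api/v1/generate endpoint;
# field names follow the KoboldAI-style API as I recall it.
import requests

payload = {
    "prompt": "### Instruction:\nSay hello.\n### Response:\n",  # placeholder prompt format
    "max_length": 80,
    "temperature": 0.7,
}
r = requests.post("http://localhost:5001/api/v1/generate", json=payload, timeout=120)
print(r.json()["results"][0]["text"])
```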