They can ALL be run on RAM, theoretically. I bought 128GB so I can run GLM 4.5 with the experts offloaded to CPU, with a custom trellis/K-quant mix; but this is a 'personal use' tinkerer setup.
Qwen Next is good at that because it has a very low active parameter count.
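Concretely, pure-CPU inference of a GGUF quant only takes a few lines; here's a minimal sketch with the llama-cpp-python bindings (the model path and settings are placeholders, not a recommendation):

```python
# Minimal sketch, assuming llama-cpp-python and a local GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="models/some-model-Q4_K_M.gguf",  # hypothetical file name
    n_gpu_layers=0,   # 0 = keep everything in system RAM, pure CPU inference
    n_ctx=8192,       # context window; bigger needs more RAM
    n_threads=8,      # roughly match your physical core count
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```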
…But they aren’t actually deployed that way. They’re basically always deployed on cloud GPU boxes that serve dozens/hundreds of people at once, in parallel.
AFAIK the only major model actually developed for CPU inference is one of the esoteric Gemma releases, aimed at mobile.
Passerby6497@lemmy.world 3 days ago
I for one would enjoy triggering your unskippable cutscenes on setting up local CPU-based AI, if it can work on Linux with an older AMD card.
Don’t have funds for anything fancy, but would be interested in playing around with it. Been wanting to get something like that set up for Home Assistant.
brucethemoose@lemmy.world 3 days ago
Plenty of folks do AMD. A popular ‘homelab’ setup is 32GB AMD MI50s. Even Intel is fine these days!
But what’s your setup, precisely? CPU, RAM, and GPU.
afk_strats@lemmy.world 3 days ago
I have an MI50/7900 XTX gaming/AI setup at home which I use for learning and to test out different models. Happy to answer questions.
Passerby6497@lemmy.world 2 days ago
Looks like I’m running an AMD Ryzen 5 2600 CPU, AMD Radeon RX 570 GPU, and 32GB RAM
brucethemoose@lemmy.world 2 days ago
Mmmmm… I would wait a few days and try a GGUF quantization of Kimi Linear once it's better supported: huggingface.co/…/Kimi-Linear-48B-A3B-Instruct
Otherwise you can mess with Qwen 3 VL now, in the native llama.cpp UI: huggingface.co/…/Qwen3-VL-30B-A3B-Instruct-UD-Q4_…
If you’re interested, I can work out an optimal launch command. But to be blunt, with that setup, you’re kinda better off using free LLM APIs with a local chat UI.
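For reference, the client side looks much the same either way: llama.cpp's llama-server exposes an OpenAI-compatible endpoint (default port 8080), and hosted APIs speak the same protocol, so a chat UI or quick script just needs a base URL and key. Rough sketch, with placeholder URL, key, and model name:

```python
# Minimal sketch using the openai client against an OpenAI-compatible endpoint.
# Works the same whether base_url points at a local llama-server or a hosted free API.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # llama-server default; swap in a hosted API URL
    api_key="not-needed-locally",         # local servers typically ignore the key
)

resp = client.chat.completions.create(
    model="local-model",  # local servers accept any name; hosted APIs need the real one
    messages=[{"role": "user", "content": "Summarize what a GGUF quant is."}],
)
print(resp.choices[0].message.content)
```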
ag10n@lemmy.world 3 days ago
You can use Vulkan fairly easily as long as you have 8GB of VRAM
blog.linux-ng.de/…/running-llms-with-llama-cpp-us…
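That post covers the llama.cpp binaries; if you'd rather drive it from Python, llama-cpp-python can (as far as I know) be built against the same Vulkan backend, e.g. with CMAKE_ARGS="-DGGML_VULKAN=on" at install time, and then offloading is the usual call. Rough sketch, model path and sizes are placeholders:

```python
# Minimal sketch, assuming a llama-cpp-python build with the Vulkan backend enabled.
from llama_cpp import Llama

llm = Llama(
    model_path="models/small-model-Q4_K_M.gguf",  # placeholder; pick a quant that fits your VRAM
    n_gpu_layers=-1,   # -1 = offload every layer to the GPU (Vulkan here)
    n_ctx=4096,
)
print(llm("Q: What is Vulkan?\nA:", max_tokens=64)["choices"][0]["text"])
```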
Passerby6497@lemmy.world 1 day ago
Only got 4GB of VRAM, unfortunately
brucethemoose@lemmy.world 3 days ago
The key is which one, and how, though.
For the really sparse models, you might be better off trying ik_llama.cpp, especially if you are targeting a ‘small’ quant.
SabinStargem@lemmy.today 3 days ago
If you just want an easy way to set up AI on Windows or Linux, KoboldCPP is my recommendation for your backend. It supports the GGUF format, which lets you use RAM and VRAM simultaneously. It won’t be the fastest thing, but it is easy enough to set up, with a bundled GUI for prep and actual usage. Through the IP address it gives you, you can hook the backend into a frontend of your choice.
KoboldCPP
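For the "hook it into a frontend" part, here's a quick sketch of what a frontend (or any script) sends to the address KoboldCPP prints; this assumes the default port 5001 and its KoboldAI-style generate endpoint, so adjust to whatever your console actually shows:

```python
# Minimal sketch: hitting a running KoboldCPP backend from a script.
# Assumes http://localhost:5001 and the /api/v1/generate endpoint;
# field names follow the KoboldAI-style API as I recall it.
import requests

payload = {
    "prompt": "### Instruction:\nSay hello.\n### Response:\n",  # placeholder prompt format
    "max_length": 80,
    "temperature": 0.7,
}
r = requests.post("http://localhost:5001/api/v1/generate", json=payload, timeout=120)
print(r.json()["results"][0]["text"])
```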