Comment on Bewildered enthusiasts decry memory price increases of 100% or more — the AI RAM squeeze is finally starting to hit PC builders where it hurts

brucethemoose@lemmy.world ⁨4⁩ ⁨days⁩ ago

They can ALL be run from RAM, theoretically. I bought 128GB so I can run GLM 4.5 with the experts offloaded to CPU, using a custom trellis/K-quant mix; but this is a ‘personal’ tinkerer setup.
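For anyone curious what "experts offloaded to CPU" looks like in practice, a hedged sketch using llama.cpp's tensor-override flag (the model filename is a placeholder, and the exact flag spelling and tensor-name regex can differ between builds, so check your version):

```shell
# Sketch: keep dense/attention weights on the GPU, but match the MoE expert
# FFN tensors by name and pin them to CPU-side system RAM instead.
# Flag names follow recent llama.cpp conventions; verify against your build.
./llama-server \
  -m GLM-4.5-Q4_K_M.gguf \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU" \
  -c 16384
```

The idea is that the small, always-hot dense layers stay on the GPU while the big, sparsely-activated expert weights sit in cheap RAM.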

Qwen Next is good at that because it has a very low active parameter count.

…But they aren’t actually deployed that way. They’re basically always deployed on cloud GPU boxes that serve dozens/hundreds of people at once, in parallel.

AFAIK the only major model actually developed for CPU inference is one of the esoteric Gemma releases, aimed at mobile.
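The low-active-parameter point can be put in rough numbers. A quick back-of-envelope sketch; all figures here are illustrative assumptions (≈80B total / ≈3B active MoE, ~4.5-bit quant, optimistic dual-channel DDR5 bandwidth), not official specs:

```python
# Why low active-parameter MoE models suit CPU/RAM inference: you must
# *hold* all the weights, but only *read* the active ones per token.

def weights_gb(params_billions: float, bytes_per_weight: float) -> float:
    """Memory in GB for a parameter count at a given quantization width."""
    return params_billions * bytes_per_weight

BYTES_PER_WEIGHT = 0.56      # ~4.5 bits/weight, assumed quant
total_gb = weights_gb(80, BYTES_PER_WEIGHT)   # must fit in RAM
active_gb = weights_gb(3, BYTES_PER_WEIGHT)   # read per generated token

# Token rate is roughly bounded by bandwidth / bytes-read-per-token.
DDR5_BANDWIDTH_GBS = 80      # optimistic dual-channel desktop figure
tok_per_s = DDR5_BANDWIDTH_GBS / active_gb

print(f"resident weights ~ {total_gb:.0f} GB")   # ~45 GB
print(f"read per token  ~ {active_gb:.1f} GB")   # ~1.7 GB
print(f"upper bound     ~ {tok_per_s:.0f} tok/s")
```

Run the same arithmetic on a dense 80B model (all 80B active) and the bound drops to about 1 tok/s on the same RAM, which is why sparse-activation models are the ones worth running this way.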
