Comment on: Frustratingly bad at self hosting. Can someone help me access LLMs on my rig from my phone

How much system RAM, and what kind? DDR5?
BlackSnack@lemmy.zip 1 week ago
Bet. Looking into that now. Thanks!
I believe I have 11 GB of VRAM, so I should be good to run decent models, from what I’ve been told by the other AIs.
brucethemoose@lemmy.world 1 week ago
In case I miss your reply: assuming a 3080 + 64 GB of RAM, you want the IQ4_KSS version of this (or IQ3_KS, which leaves more RAM free for browser tabs and such):
huggingface.co/ubergarm/GLM-4.5-Air-GGUF
Part of it will run on your GPU and part will live in system RAM, but ik_llama.cpp handles that split in a particularly efficient way for these kinds of ‘MoE’ models.
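(ik_llama.cpp itself is a command-line program, but just to sketch what that split looks like, here’s the same idea via the mainline llama-cpp-python bindings. Treat it as an illustration only, not the ik_llama.cpp invocation: the file name is a placeholder, mainline won’t load ik-specific quants like IQ4_KSS, and n_gpu_layers is something you tune until your VRAM is nearly full.)

```python
# Sketch: partial GPU offload with the mainline llama-cpp-python bindings.
# Placeholder file name; ik-specific quants (IQ4_KSS) won't load in mainline,
# so substitute a quant your backend actually supports.
from llama_cpp import Llama

llm = Llama(
    model_path="GLM-4.5-Air-IQ4_KSS.gguf",  # placeholder local path
    n_gpu_layers=20,  # layers kept in VRAM; the rest stay in system RAM
    n_ctx=8192,       # context window; longer contexts use more memory
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```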
If you ‘only’ have 32 GB of RAM or less, that’s trickier, and the next question is what kind of speeds you want. But it’s probably best to wait a few days and see how Qwen3 80B looks when it comes out. Or just go with the IQ4_K version of this: huggingface.co/…/Qwen3-30B-A3B-Thinking-2507-GGUF
And you don’t really need the hyper optimization of ik_llama.cpp for Qwen3 30B.
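And on the “from my phone” part of your original post: whichever model you land on, llama.cpp’s llama-server (and the llama-cpp-python server module) can expose an OpenAI-compatible HTTP API on your LAN, which any phone chat client can point at. A minimal client sketch, where the IP, port, and model name are placeholders for whatever your rig actually serves:

```python
# Sketch: talk to a llama.cpp-style OpenAI-compatible server on the LAN.
# The address, port, and model name below are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://192.168.1.50:8080/v1",  # your rig's LAN address
    api_key="not-needed-locally",            # local servers usually ignore this
)

resp = client.chat.completions.create(
    model="local-model",  # most local servers ignore or loosely match this
    messages=[{"role": "user", "content": "Hello from my phone!"}],
)
print(resp.choices[0].message.content)
```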
Alternatively, you could try to squeeze Gemma 27B into that 11GB VRAM, but it would be tight.
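Back-of-envelope math on why it’s tight: a dense 27B model needs roughly params × bits-per-weight ÷ 8 bytes just for the weights, before KV cache and other overhead. The bits-per-weight figures below are approximate averages for common llama.cpp quant levels:

```python
# Rough weight-size estimate for a dense 27B model at common quant levels.
# Bits-per-weight values are approximate, not exact.
params = 27e9
for name, bpw in [("Q4_K_M", 4.8), ("IQ3_XS", 3.3), ("IQ2_M", 2.7)]:
    gib = params * bpw / 8 / 2**30
    print(f"{name}: ~{gib:.1f} GiB of weights, plus KV cache and overhead")
```

That works out to roughly 15 GiB at ~4.8 bits (won’t fit in 11 GB) and about 10 GiB around 3.3 bits, so a ~3-bit quant is where 27B barely squeezes in, with little room left for context.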