Comment on Selfhosted & AI
scrubbles@poptalk.scrubbles.tech 1 day ago
Yeah it’s heresy on Lemmy, but I do find it genuinely useful. My only regret is that I have to use Claude/Anthropic more than I’d like, which is why I have a vested interest in selfhosting. I’d rather figure out how to run the larger models myself and cut them off completely, but you even begin to mention that here and you’ll get downvoted to hell.
brucethemoose@lemmy.world 1 day ago
You don’t even need Claude anymore. GLM 5.2 API is good enough for 95% of the same things and vastly cheaper.
MiMo 2.5 Pro and Kimi are also very good. And then there’s Cerebras API if you just want simple things done quick.
scrubbles@poptalk.scrubbles.tech 1 day ago
That’s where I am: I’m okay on hardware, but I can’t seem to fit the models on my 3090. I have dreams of something like an A100 someday, but not until a ton of used ones hit the market. What do you use for your hardware?
brucethemoose@lemmy.world 1 day ago
I have a single 3090!
And I have 128GB RAM. So the best model I can run is MiMo 2.5 (a 300B model) at around 10 tokens/sec, using hybrid CPU inference.
…But that’s the worst-case scenario, for speed. It’s an IQ3_KT quant (which is a high-quality quantization type but very slow on CPU), with a model that barely fits in my RAM+VRAM combined, with no DFlash or any kind of speculative decoding turned on. I could tune it to be much faster, but I mostly just want “max quality, fast enough.”
For speed, or prompts with lots of thinking or context, I just run Qwen 3.6 27B now. That would fit in your 3090 no matter how much CPU RAM you have, but you just have to be smart about the backend and quantization you pick. If you just use Ollama, it’s gonna tell you it won’t fit, or use some horrible default that spits out garbage.
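A rough way to sanity-check the "will it fit" question is parameter count times bits per weight. Here is a minimal back-of-envelope sketch. The bits-per-weight figures and the `overhead_gib` allowance for KV cache and buffers are illustrative assumptions, not the exact numbers for IQ3_KT or any particular backend, and the function names are made up for this example.

```python
GIB = 1024 ** 3  # bytes per GiB


def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB


def fits(params_billion: float, bits_per_weight: float,
         budget_gib: float, overhead_gib: float = 2.0) -> bool:
    """True if weights plus an assumed overhead fit in the memory budget.

    overhead_gib is a placeholder for KV cache and runtime buffers; real
    usage grows with context length and depends on the backend.
    """
    return weights_gib(params_billion, bits_per_weight) + overhead_gib <= budget_gib


if __name__ == "__main__":
    # A 27B model against a 24 GB card, at a few assumed bit widths.
    for bpw in (16, 8, 4.5):
        size = weights_gib(27, bpw)
        print(f"27B @ {bpw} bpw: {size:.1f} GiB, fits in 24: {fits(27, bpw, 24)}")

    # A 300B model against 128 GB RAM + 24 GB VRAM combined (treated as 152 GiB).
    for bpw in (4.5, 3.25):
        size = weights_gib(300, bpw)
        print(f"300B @ {bpw} bpw: {size:.1f} GiB, fits in 152: {fits(300, bpw, 152)}")
```

This only counts weights plus a flat allowance, so treat a "fits" result as a starting point; long contexts can eat several more GB on top.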