Comment on Selfhosted & AI
scrubbles@poptalk.scrubbles.tech 1 day ago
That’s where I am okay with hardware, but can’t seem to fit the models on my 3090. I have dreams of something like an A100 someday, but not until there’s a ton of used ones that hit the market. What do you use for your hardware?
brucethemoose@lemmy.world 1 day ago
I have a single 3090!
And I have 128GB RAM. So the best model I can run is MiMo 2.5 (a 300B model) at around 10 tokens/sec, using hybrid CPU inference.
…But that’s the worst-case scenario for speed. It’s an IQ3_KT quant (which is a high-quality quantization type, but very slow on CPU), with a model that barely fits in my RAM+VRAM combined, and with no DFlash or any kind of speculative decoding turned on. I could tune it to be much faster, but I mostly just want “max quality, fast enough.”
For speed, or prompts with lots of thinking or context, I just run Qwen 3.6 27B now. That would fit in your 3090 no matter how much CPU RAM you have, but you just have to be smart about the backend and quantization you pick. If you just use Ollama, it’s gonna tell you it won’t fit, or use some horrible default that spits out garbage.
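If you want to sanity-check the "will it fit" question yourself, here is a rough back-of-envelope sketch in Python. It only estimates the size of the weights from parameter count and bits per weight. The bits-per-weight values in the loop and the 4 GiB headroom for KV cache and runtime buffers are my own illustrative guesses, not numbers from any particular quant or backend, so treat the output as a ballpark.

```python
GIB = 2**30  # bytes per GiB


def weight_size_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB


def fits(params_billion: float, bits_per_weight: float,
         memory_gib: float, headroom_gib: float = 4.0) -> bool:
    """True if the weights plus a guessed headroom fit in memory_gib.

    headroom_gib is an assumption standing in for KV cache, activations
    and runtime overhead; real usage depends on context length and backend.
    """
    needed = weight_size_gib(params_billion, bits_per_weight) + headroom_gib
    return needed <= memory_gib


if __name__ == "__main__":
    # Illustrative bits-per-weight values only; check what your quant reports.
    for bpw in (16, 8, 4.5, 3.5):
        size = weight_size_gib(27, bpw)
        print(f"27B @ {bpw} bpw: {size:.1f} GiB, "
              f"fits in 24 GiB: {fits(27, bpw, 24)}")
```

Under those assumptions, a 27B model is far too big for 24 GB at 16 or 8 bits per weight, but lands well inside it somewhere around 4 to 5 bits per weight. The same function works for the hybrid case: pass your combined RAM+VRAM as `memory_gib` and the bits per weight of whatever quant you are looking at.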