Comment - FBXL Lotide

scrubbles@poptalk.scrubbles.tech 3 weeks ago

That’s where I am: okay on hardware, but I can’t seem to fit the models on my 3090. I have dreams of something like an A100 someday, but not until a ton of used ones hit the market. What do you use for your hardware?


brucethemoose@lemmy.world 3 weeks ago
I have a single 3090!

And I have 128GB RAM. So the best model I can run is MiMo 2.5 (a 300B model) at around 10 tokens/sec, using hybrid CPU inference.

…But that’s the worst-case scenario for speed. It’s an IQ3_KT quant (which is a high-quality quantization type but very slow on CPU), on a model that barely fits in my RAM+VRAM combined, with no DFlash or any kind of speculative decoding turned on. I could tune it to be much faster, but I mostly just want “max quality, fast enough.”

For speed, or prompts with lots of thinking or context, I just run Qwen 3.6 27B now. That would fit in your 3090 no matter how much CPU RAM you have, but you just have to be smart about the backend and quantization you pick. If you just use Ollama, it’s gonna tell you it won’t fit, or use some horrible default that spits out garbage.
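
As a rough sanity check on those sizes, here is a minimal sketch of the weights-only arithmetic. The bits-per-weight figures are assumptions for illustration (about 3.125 bpw for an IQ3_KT-class quant, about 4.5 bpw for a typical 4-bit quant); real quants mix types per tensor, and the KV cache, context buffers, and OS overhead are not counted.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


VRAM_GB = 24   # RTX 3090
RAM_GB = 128   # system RAM from the post above

# Assumed average bits per weight, for illustration only.
big = weights_gb(300, 3.125)   # 300B model, IQ3_KT-class quant
small = weights_gb(27, 4.5)    # 27B model, typical 4-bit-class quant

print(f"300B @ 3.125 bpw: {big:.1f} GB vs {VRAM_GB + RAM_GB} GB RAM+VRAM")
print(f"27B  @ 4.5 bpw:   {small:.1f} GB vs {VRAM_GB} GB VRAM")
```

Under those assumptions the 300B weights come to roughly 117 GB, which leaves little headroom in 152 GB once context and the OS take their share, while the 27B weights come to roughly 15 GB and leave room in 24 GB of VRAM for context.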

- scrubbles@poptalk.scrubbles.tech 3 weeks ago
  I’ll have to play around with mine then, because I’ve had not-great luck with it, or at least very disappointing results. The CPU offloading is fairly slow, but maybe I should try tweaking it more.
  
  - brucethemoose@lemmy.world 3 weeks ago
    Be sure to try the ik_llama.cpp fork. Basically, it specializes in MoE CPU offloading on Nvidia cards, and it has more efficient quantization types than mainline llama.cpp:
    
    github.com/ikawrakow/ik_llama.cpp/
    
    And see this repo for specific 3090 configs: github.com/noonghunna/club-3090
    
    Honestly I should just write up my general setup in this community too.
    