Comment on Very large amounts of gaming gpus vs AI gpus

brucethemoose@lemmy.world 3 days ago

Ah, here we go:

huggingface.co/ubergarm/Qwen3-235B-A22B-GGUF

Ubergarm is great. See this part in particular: huggingface.co/ubergarm/Qwen3-235B-A22B-GGUF#quic…

You will need to modify the syntax a bit for 2x GPUs. I’d recommend starting with an f16/f16 K/V cache at 32K context (to see if that’s acceptable), and try not to go lower than q8_0/q5_1 (the V cache is more amenable to quantization than the K cache, hence the asymmetric pair).
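For reference, a rough sketch of what that could look like with stock llama.cpp flags. This is from memory, not from ubergarm’s card, so treat his quick-start command as authoritative; the model filename is a placeholder, and his quants may expect ik_llama.cpp rather than mainline llama.cpp.

```bash
# Sketch of a 2-GPU llama-server launch (placeholder filename):
#   -c 32768         32K context to start with
#   -fa              flash attention, needed before the V cache can be quantized
#   -ctk / -ctv      K/V cache types; begin at f16/f16, drop to e.g. q8_0/q5_1 if VRAM is tight
#   -ngl 99          offload all layers to GPU
#   -ts 1,1          split tensors evenly across the two GPUs
./llama-server \
  -m Qwen3-235B-A22B-IQ4_XS.gguf \
  -c 32768 -fa \
  -ctk f16 -ctv f16 \
  -ngl 99 -ts 1,1
```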
