Comment on “Grok praises Hitler, gives credit to Musk for removing ‘woke filters’”
brucethemoose@lemmy.world 4 days ago
A lot, but less than you’d think! Basically an RTX 3090/Threadripper system with a lot of RAM (192GB?).
With this framework, specifically: github.com/ikawrakow/ik_llama.cpp?tab=readme-ov-f…
The “dense” part of the model can stay on the GPU while the experts are offloaded to the CPU, and the whole thing can be quantized to ~3 bits per weight instead of the full model’s 8 bits.
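A minimal sketch of what that looks like, assuming ik_llama.cpp’s llama.cpp-style flags (`-ngl` for GPU layers, `-ot`/`--override-tensor` for pinning tensors by name pattern; the model filename and exact regex here are illustrative, check the README):

```
# Offload everything to the GPU (-ngl 99), then override the
# MoE expert tensors (ffn_*_exps) back to CPU RAM so only the
# dense attention/shared weights occupy VRAM.
./llama-server -m model-IQ3_K.gguf -ngl 99 -ot "ffn_.*_exps=CPU"
```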
That’s just for personal use, though. The intended way to run it is on a couple of H100 boxes, serving many, many users at once. LLMs run more efficiently when they serve requests in parallel; e.g., generating tokens for 4 users isn’t much slower than generating them for 2.
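The back-of-envelope reason (assuming token generation is memory-bandwidth-bound, which it usually is at small batch sizes): each decode step streams the active weights from memory once, and that single read is shared by every request in the batch, so

$$t_{\text{step}} \approx \frac{W_{\text{active}}}{B_{\text{mem}}}, \qquad \text{throughput} \approx \frac{b}{t_{\text{step}}}$$

where $W_{\text{active}}$ is the bytes of weights read per step, $B_{\text{mem}}$ is memory bandwidth, and $b$ is the batch size. Throughput grows roughly linearly with $b$ until compute or KV-cache traffic takes over (and, for an MoE model, until larger batches start activating more experts per step).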