Comment on “Grok praises Hitler, gives credit to Musk for removing ‘woke filters’”
brucethemoose@lemmy.world 4 days ago
A lot, but less than you’d think! Basically an RTX 3090/Threadripper system with a lot of RAM (192GB?).
With this framework, specifically: github.com/ikawrakow/ik_llama.cpp?tab=readme-ov-f…
The “dense” part of the model can stay on the GPU while the experts are offloaded to the CPU, and the whole thing can be quantized to ~3 bits per weight instead of the full model’s 8 bits.
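A minimal sketch of what that looks like, assuming ik_llama.cpp’s llama.cpp-style flags (`-ngl` for GPU layers, `-ot`/`--override-tensor` for pinning tensors by name pattern; the model filename and exact regex here are illustrative, check the README):

```
# Offload everything to the GPU (-ngl 99), then override the
# MoE expert tensors (ffn_*_exps) back to CPU RAM so only the
# dense attention/shared weights occupy VRAM.
./llama-server -m model-IQ3_K.gguf -ngl 99 -ot "ffn_.*_exps=CPU"
```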
That’s just for personal use, though. The intended way to run it is on a couple of H100 boxes, serving many, many users at once. LLMs run more efficiently when they serve requests in parallel; e.g., generating tokens for 4 users isn’t much slower than generating them for 2.
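The back-of-envelope reason (assuming token generation is memory-bandwidth-bound, which it usually is at small batch sizes): each decode step streams the active weights from memory once, and that single read is shared by every request in the batch, so

$$t_{\text{step}} \approx \frac{W_{\text{active}}}{B_{\text{mem}}}, \qquad \text{throughput} \approx \frac{b}{t_{\text{step}}}$$

where $W_{\text{active}}$ is the bytes of weights read per step, $B_{\text{mem}}$ is memory bandwidth, and $b$ is the batch size. Throughput grows roughly linearly with $b$ until compute or KV-cache traffic takes over (and, for an MoE model, until larger batches start activating more experts per step).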