Comment on Meet the AI workers who tell their friends and family to stay away from AI
Waphles@lemmy.world 11 hours ago
Could you elaborate a little on your setup? Sounds interesting.
Bloefz has a great setup.
An RTX 3090 + a cheap HEDT/Server CPU is another popular one. Newer models run reasonably quickly on them, with the attention/dense layers on the GPU and sparse parts on the CPU.
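As a rough sketch of what that split looks like in practice with llama.cpp's server (the model file and tensor regex below are placeholders, not something from this thread, and the exact tensor names depend on the model):

```shell
# Offload everything to the GPU by default, then force the sparse MoE
# expert tensors back to system RAM so the CPU handles them, keeping
# the attention/dense layers on the 3090.
llama-server \
  --model ./models/some-moe-model-q4_k_m.gguf \
  --n-gpu-layers 99 \
  --override-tensor ".ffn_.*_exps.=CPU" \
  --ctx-size 8192
```

The `--override-tensor` (`-ot`) regex is what makes hybrid inference work: everything that matches stays on CPU, everything else goes to VRAM.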
Bloefz@lemmy.world 10 hours ago
I have one server with a cheap MI50 Instinct. Those go for really cheap on eBay, and the HBM2 gives them really good memory bandwidth. They worked fine with ollama until recently, when it dropped support for some weird reason, but a lot of other software still works fine. Older models also still work on old ollama versions.
The other one runs an RTX 3060 12GB. I use this for models that only work on Nvidia, like Whisper speech recognition.
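For reference, a minimal sketch of that kind of Whisper workload, assuming OpenAI's reference CLI (`pip install openai-whisper`) on a CUDA box; the file name and model size are placeholders:

```shell
# Transcribe a recording; whisper picks up the CUDA device automatically
# when PyTorch is installed with GPU support.
whisper recording.wav --model medium --language en --output_format txt
```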
I tend to use the same models for everything so I don't get the delay of loading a model, mainly uncensored ones so it doesn't choke when someone says something slightly sexual. I'm in some very open communities, so standard models are pretty useless with all their prudishness.
For a frontend I use Open WebUI, and I also run scripts directly against the models.
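Scripting against a local model usually just means hitting the OpenAI-compatible endpoint most of these servers expose. A hedged sketch (URL, port, and model name are placeholders for whatever you serve):

```shell
# One-shot chat completion against a local OpenAI-compatible server.
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my-local-model",
        "messages": [{"role": "user", "content": "Summarize this text: ..."}]
      }'
```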
brucethemoose@lemmy.world 8 hours ago
This is the way.
…Except for ollama. It’s starting to enshittify and I would not recommend it.
Bloefz@lemmy.world 3 hours ago
Agreed. The way they just dumped support for my card in some update, with only a vague reason, also irked me ("we need a newer ROCm," they said, but my card works fine with all current ROCm versions).
Also, now that they're trying to sell cloud AI, their original local service is in competition with the product they sell.
I’m looking to use something new but I don’t know what yet.
brucethemoose@lemmy.world 2 hours ago
I’ll save you the searching!
For max speed, especially when making parallel calls, vLLM: hub.docker.com/r/btbtyler09/vllm-rocm-gcn5
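A sketch of running that image, assuming its entrypoint launches vLLM's OpenAI server (the device passthroughs are the usual ROCm container flags; the model name is a placeholder):

```shell
# Pass the ROCm devices through to the container and expose the API port.
docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  -p 8000:8000 \
  btbtyler09/vllm-rocm-gcn5 \
  --model mistralai/Mistral-7B-Instruct-v0.3
```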
Generally, the built-in llama.cpp server is the best for GGUF models! It has a great built-in web UI as well.
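Getting that server and its web UI up is a one-liner (model path and port are placeholders):

```shell
# Serve a GGUF model; the built-in web UI is then reachable at
# http://localhost:8080 in a browser.
llama-server -m ./models/model-q4_k_m.gguf --port 8080
```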
For a more one-click RP focused UI, and API server, kobold.cpp rocm is sublime: github.com/YellowRoseCx/koboldcpp-rocm/
If you are running big MoE models that need some CPU offloading, check out ik_llama.cpp. It’s optimized for MoE hybrid inference, but the caveat is that its vulkan backend isn’t well tested. They will fix issues if you find any, though: github.com/ikawrakow/ik_llama.cpp/
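A hedged sketch of that hybrid setup with ik_llama.cpp's server; the model path and tensor regex are placeholders, and `-fmoe`/`-rtr` are ik_llama.cpp-specific MoE optimizations as of recent builds, so check `--help` on your build:

```shell
# Experts on CPU, everything else on GPU, with ik_llama.cpp's fused-MoE
# and run-time-repack optimizations enabled.
./llama-server \
  --model ./models/big-moe-q4.gguf \
  --n-gpu-layers 99 \
  --override-tensor ".ffn_.*_exps.=CPU" \
  -fmoe -rtr
```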
mlc-llm also has a Vulkan runtime, but it’s one of the more… exotic LLM backends out there. I’d try the other ones first.