brucethemoose@lemmy.world 3 hours ago

I’ll save you the searching!

For max speed, especially when making parallel calls, vLLM: hub.docker.com/r/btbtyler09/vllm-rocm-gcn5
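
To give a feel for it, here's a rough Python sketch of those parallel calls against vLLM's OpenAI-compatible API. It assumes the container is already serving on localhost:8000, and "your-model" is just a placeholder for whatever model you launched it with:

```python
# Minimal sketch, not a tested recipe: fire several requests at once at a running
# vLLM server. vLLM batches concurrent requests, which is where its speed shows up.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")  # local server, key unused

def ask(prompt: str) -> str:
    out = client.chat.completions.create(
        model="your-model",  # placeholder: use the model name your server actually reports
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
    )
    return out.choices[0].message.content

prompts = [f"Give me one fun fact about the number {i}." for i in range(8)]
with ThreadPoolExecutor(max_workers=8) as pool:
    for answer in pool.map(ask, prompts):
        print(answer)
```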

Generally, the built-in llama.cpp server is the best for GGUF models! It has a great built-in web UI as well.
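
For example, once llama-server is up (something like `llama-server -m your-model.gguf`), the same default port that serves the web UI in your browser also answers API calls. A quick Python sketch, with the port, prompt, and model path as assumptions on my part:

```python
# Minimal sketch, assuming llama-server is already running on its default port (8080).
# This hits the server's native /completion endpoint; open http://localhost:8080 in a
# browser for the built-in web UI instead.
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": "Explain GGUF in one sentence:", "n_predict": 64},
    timeout=120,
)
print(resp.json()["content"])
```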

For a more one-click, RP-focused UI and API server, koboldcpp-rocm is sublime: github.com/YellowRoseCx/koboldcpp-rocm/

If you are running big MoE models that need some CPU offloading, check out ik_llama.cpp. It's optimized for MoE hybrid inference; the caveat is that its Vulkan backend isn't well tested. They will fix issues if you find any, though: github.com/ikawrakow/ik_llama.cpp/
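
As a rough sketch of what the hybrid setup looks like: keep the dense layers on the GPU and push the MoE expert tensors back to CPU RAM. The binary path, model path, and the tensor-override regex below are from memory, so double-check the exact flags against the repo's docs before copying:

```python
# Minimal sketch, assuming an ik_llama.cpp build of llama-server and a local GGUF.
# -ngl 99 offloads all layers to the GPU, then -ot overrides the MoE expert tensors
# back to CPU, which is the usual hybrid-inference split for big MoE models.
import subprocess

subprocess.run([
    "./build/bin/llama-server",
    "-m", "models/big-moe-model.gguf",   # hypothetical model path
    "-ngl", "99",                        # put everything on the GPU by default...
    "-ot", r"\.ffn_.*_exps\.=CPU",       # ...except the expert tensors, which stay in system RAM
    "--port", "8080",
])
```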

mlc-llm also has a Vulkan runtime, but it’s one of the more… exotic LLM backends out there. I’d try the other ones first.
