Comment on Llama 3.1 AI Models Have Officially Released
admin@lemmy.my-box.dev 3 months ago
Oof - not on my 12GB 3060 it doesn't :/ Even at 48k context and the Q4_K quantization, ollama is doing a lot of offloading to the CPU. What kind of hardware are you running it on?
brucethemoose@lemmy.world 3 months ago
A 3090.
But it should be fine on a 3060.
Dump ollama for long context. Grab a 6bpw exl2 quantization and load it with Q4 or Q6 cache, depending on how much context you want. I personally use EXUI, but text-generation-webui and TabbyAPI (with some other frontend) will also load them.
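For a rough sense of why that should fit in 12GB, here is a back-of-the-envelope sketch. It assumes the model in question is Llama 3.1 8B (the thread doesn't say which size) with about 8.03B parameters, 32 layers, 8 KV heads and a head dimension of 128. It also treats Q4/Q6 cache as exactly 4 and 6 bits per element, ignoring quantization scales, activations and framework overhead, so real usage will be somewhat higher.

```python
# Rough VRAM estimate for a quantized model plus a quantized KV cache.
# All model figures below are assumptions for Llama 3.1 8B, not taken from the thread.
N_PARAMS = 8.03e9   # approximate parameter count (assumed)
N_LAYERS = 32       # transformer layers (assumed)
N_KV_HEADS = 8      # grouped-query attention KV heads (assumed)
HEAD_DIM = 128      # dimension per head (assumed)

GIB = 1024 ** 3


def weights_gib(bits_per_weight: float, n_params: float = N_PARAMS) -> float:
    """Size of the quantized weights in GiB, ignoring any format overhead."""
    return n_params * bits_per_weight / 8 / GIB


def kv_cache_gib(context_tokens: int, cache_bits: float) -> float:
    """Size of the KV cache in GiB: keys and values for every layer and KV head."""
    elements_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM
    return elements_per_token * context_tokens * cache_bits / 8 / GIB


if __name__ == "__main__":
    weights = weights_gib(6.0)  # 6bpw exl2
    for label, bits in (("FP16", 16), ("Q6", 6), ("Q4", 4)):
        for ctx in (48 * 1024, 128 * 1024):
            cache = kv_cache_gib(ctx, bits)
            print(f"{label:>4} cache @ {ctx // 1024}k: "
                  f"weights {weights:.1f} GiB + cache {cache:.1f} GiB "
                  f"= {weights + cache:.1f} GiB")
```

Under those assumptions, the 6bpw weights come to roughly 5.6 GiB and a Q4 cache at the full 128k context adds about 4 GiB, which leaves some headroom on a 12GB card. An FP16 cache at the same context would need about 16 GiB on its own.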