Comment on Consumer GPUs to run LLMs

liliumstar@lemmy.dbzer0.com 2 weeks ago

I know you said consumer GPU, but I run a used Tesla P40. It has 24 GB of VRAM. The price has gone up since I got it a couple of years ago, so there might be better options in the same price range now. Still, it's going to be cheaper than a modern full-fat consumer GPU, with a reasonable performance hit.
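
If you do go that route, something like this (a minimal sketch, assuming a CUDA-enabled PyTorch install; the P40 shows up like any other CUDA device) will confirm the card and its VRAM are visible before you load anything:

```python
# Quick sanity check: list the first CUDA device and its total VRAM.
# Assumes PyTorch was installed with CUDA support.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA device found")
```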

My use case is text generation and chat, that kind of thing. In most cases the inference is more than fast enough, but it can get slow when it has to swap in and reprocess a large context.

Mostly I run quantized 8-20B models, with the sweet spot being around 12B. For specialized use cases outside of general language tasks, you can get away with more compact models. The general output quality is quite good, and I never would have thought it was possible 10 years ago.
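
For reference, here's a minimal sketch of loading a quantized model in that range with llama-cpp-python and offloading everything to the GPU. The GGUF path, context size, and prompt are placeholders, not a specific recommendation:

```python
# Minimal sketch using llama-cpp-python (pip install llama-cpp-python, built with CUDA).
# Model path and parameters below are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="models/some-12b-model.Q4_K_M.gguf",  # any ~12B GGUF quant that fits in 24 GB
    n_gpu_layers=-1,   # offload all layers to the GPU
    n_ctx=8192,        # larger contexts work, but prompt processing gets slower
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a short haiku about old server GPUs."}]
)
print(out["choices"][0]["message"]["content"])
```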
