DeepSeek's distilled new R1 AI model can run on a single GPU | TechCrunch
vhstape@lemmy.sdf.org 1 week ago
the Chinese AI lab also released a smaller, “distilled” version of its new R1, DeepSeek-R1-0528-Qwen3-8B, that DeepSeek claims beats comparably sized models on certain benchmarks
Most models come in 1B, 7-8B, 12-14B, and 27+B parameter variants. According to the docs, they benchmarked the 8B model on an NVIDIA H20 (96 GB VRAM) and got between 144 and 1,198 tokens/sec. Most consumer GPUs probably aren’t going to be able to keep up with that.
Depends on the quantization.
7B is small enough to run in FP8 or a Marlin quant with SGLang/vLLM/TensorRT, so you can probably get very close to the H20 on a 3090 or 4090.
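For anyone who wants to try it, here’s a minimal sketch using vLLM’s offline Python API. The Hugging Face repo id and the context cap are assumptions; on an Ampere card like the 3090 (no native FP8), vLLM should fall back to weight-only FP8 through its Marlin kernels.

```python
# Minimal sketch: serving the distilled 8B model with vLLM.
# Assumes the repo id "deepseek-ai/DeepSeek-R1-0528-Qwen3-8B" and a 24 GB card.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-0528-Qwen3-8B",
    quantization="fp8",   # or point `model` at a pre-quantized GPTQ/AWQ
                          # checkpoint, which vLLM runs via Marlin kernels
    max_model_len=8192,   # cap context so the KV cache fits in 24 GB
)

params = SamplingParams(temperature=0.6, max_tokens=2048)
outputs = llm.generate(["Prove that sqrt(2) is irrational."], params)
print(outputs[0].outputs[0].text)
```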
avidamoeba@lemmy.ca 1 week ago
It proved sqrt(2) irrational at 40 tok/sec on a 3090 here. The 32B R1 did it at 32 tok/sec, but it thought a lot longer.
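For context, that prompt is the classic contradiction proof; a sketch of the argument both models would be reproducing:

```latex
% Suppose sqrt(2) were rational with p, q coprime integers.
\sqrt{2} = \frac{p}{q}, \quad \gcd(p, q) = 1
% Squaring gives p^2 = 2q^2, so p^2 is even, hence p = 2k.
p^2 = 2q^2 \implies p = 2k \implies 4k^2 = 2q^2 \implies q^2 = 2k^2
% So q is even as well, contradicting gcd(p, q) = 1.
```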
vhstape@lemmy.sdf.org 1 week ago
On my Mac mini running LM Studio, it managed 1702 tokens at 17.19 tok/sec and thought for 1 minute.
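(1702 tokens at 17.19 tok/sec works out to ~99 s total, so the one minute was presumably just the thinking phase.) If you want to reproduce that kind of measurement programmatically, LM Studio exposes an OpenAI-compatible local server (default http://localhost:1234/v1); a rough timing sketch, with the local model identifier being an assumption:

```python
# Rough end-to-end throughput check against LM Studio's local server.
# The base URL is LM Studio's default; the model name below is hypothetical --
# use whatever identifier your loaded model shows in LM Studio.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="deepseek-r1-0528-qwen3-8b",  # hypothetical local identifier
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)
elapsed = time.perf_counter() - start

tokens = resp.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.1f}s = {tokens / elapsed:.2f} tok/sec")
```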