Comment on Very large amounts of gaming gpus vs AI gpus

TheMightyCat@ani.social ⁨3⁩ ⁨weeks⁩ ago

I know the more bandwidth the better, but I wonder how it scales. I can only test my own setup, which is less than optimal for this purpose (PCIe 4.0 x16 and no P2P), but it goes as follows: a single 4090 gets 40.9 t/s while two get 58.5 t/s using tensor parallelism, tested on Qwen/Qwen3-8B-FP8 with vLLM. I am really curious how this scales over more than two PCIe 5.0 cards with P2P, which all the cards listed here support except the 5090.
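For reference, the two throughput numbers above already give a scaling figure for this one setup. A minimal sketch (the function name `scaling` is made up for illustration) that turns them into a speedup and a parallel efficiency:

```python
def scaling(single_tps: float, multi_tps: float, n_gpus: int) -> tuple[float, float]:
    """Return (speedup, parallel efficiency) of an n-GPU run versus one GPU."""
    speedup = multi_tps / single_tps
    efficiency = speedup / n_gpus  # 1.0 would be perfect linear scaling
    return speedup, efficiency


# Measured values quoted above: Qwen/Qwen3-8B-FP8 on vLLM, PCIe 4.0 x16, no P2P.
speedup, efficiency = scaling(40.9, 58.5, n_gpus=2)
print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
```

That works out to roughly 1.43x from the second card, or about 72% efficiency. It is a single data point on one model and one interconnect, so it says nothing yet about how 8 or 58 cards would behave.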
The theory goes that while the H200 has a very impressive bandwidth of 4.89 TB/s, for the same price you can get 37 TB/s spread across 58 RX 9070s. Whether this actually works in practice, I don’t know.
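To make the arithmetic explicit, here is a small sketch using only the figures quoted above, plus one outside assumption: that a PCIe 5.0 x16 link moves roughly 64 GB/s per direction (an approximate theoretical figure, not a measurement).

```python
H200_BW_TBPS = 4.89       # memory bandwidth quoted above
CLUSTER_BW_TBPS = 37.0    # summed memory bandwidth quoted above
N_CARDS = 58

# Assumption (not from the thread): approximate PCIe 5.0 x16 rate per direction.
PCIE5_X16_GBPS = 64.0

per_card_gbps = CLUSTER_BW_TBPS * 1000 / N_CARDS   # GB/s of local memory bandwidth per card
aggregate_ratio = CLUSTER_BW_TBPS / H200_BW_TBPS   # summed bandwidth relative to one H200
link_gap = per_card_gbps / PCIE5_X16_GBPS          # local memory vs. link to other cards

print(f"per card: {per_card_gbps:.0f} GB/s")
print(f"aggregate vs H200: {aggregate_ratio:.1f}x")
print(f"memory vs PCIe link: {link_gap:.1f}x")
```

The 37 TB/s is a sum of each card's local memory bandwidth. Anything that has to cross between cards goes over the much slower link, so the aggregate number is only reachable if the workload keeps most traffic local to each card.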
I don’t need to build a datacenter, I’m fine with building a rack myself in my garage. And I don’t think that requires higher volumes than just purchasing at different retailers.
I intend to run at FP8, so I wanted to show that instead of FP16, but it’s surprisingly difficult to find the numbers for that. Only the H200 datasheet clearly displays FP16 Tensor Core performance. The RTX PRO 6000 datasheet keeps it vague, mentioning only AI TOPS, which they define as Effective FP4 TOPS with sparsity, and they didn’t even bother writing a datasheet for the 5090, only saying 3352 AI TOPS, which I suppose is FP4 then. The AMD datasheets only list FP16 and INT8 matrix; whether INT8 matrix is equal to FP8, I don’t know. So FP16 was the common denominator for all the cards I could find without comparing apples with oranges.

non_burglar@lemmy.world ⁨3⁩ ⁨weeks⁩ ago

I don’t need to build a datacenter, I’m fine with building a rack myself in my garage.

During the last GPU mining craze, I helped build a 3-rack mining operation. GPUs are unregulated pieces of power-sucking shit from a power management perspective. You cannot meet the power requirements for this on residential power, even with 300-amp service.

Think of a microwave’s behaviour: yes, a 1000 W microwave pulls between 700 and 900 W while cooking, but the startup load is massive, almost 1800 W sometimes, depending on how cheap the thing is.

GPUs also behave like this, but not at startup. They spin up load predictively, which means the hardware demands more power to get the job done; it doesn’t scale down the job to save power. Multiply by 58 RX 9070s. Now add cooling.

You cannot do this.

- MTK@lemmy.world ⁨3⁩ ⁨weeks⁩ ago
  If you have three-phase power you could reasonably do this. It is not very common, but some people have it, in which case running about 50 RX 9070s plus a strong AC should be possible, I think.
  
  - non_burglar@lemmy.world ⁨3⁩ ⁨weeks⁩ ago
    I guess. I don’t know why a person would do this, though… Especially just for an LLM.
    
    MTK@lemmy.world ⁨3⁩ ⁨weeks⁩ ago
    🤷‍♂️
    
- TheMightyCat@ani.social ⁨3⁩ ⁨weeks⁩ ago
  Thanks. While I would still like to know the performance scaling of a cheap cluster, this does answer the question: either pay way more for high-end cards like the H200 for greater efficiency, or pay less and deal with these issues.
  
enumerator4829@sh.itjust.works ⁨3⁩ ⁨weeks⁩ ago
the H200 has a very impressive bandwidth of 4.89 TB/s, but for the same price you can get 37 TB/s spread across 58 RX 9070s, but if this actually works in practice I don’t know.

Your math checks out, but only for some workloads. Other workloads scale out like shit, and then you want all your bandwidth concentrated. At some point you’ll also want to consider power draw:

One H200 is like 1500 W when including support infrastructure like networking, motherboard, CPUs, storage, etc.

58 consumer cards will be like 8 servers loaded with GPUs, at like 5 kW each, so say 40 kW in total.

Now include power and cooling over a few years and do the same calculations.
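As a rough sketch of that calculation: the 1.5 kW and 40 kW figures are the estimates above, while the electricity price, the three-year period, the 24/7 full load, and the `energy_cost` helper are all placeholder assumptions to swap for your own numbers.

```python
HOURS_PER_YEAR = 24 * 365


def energy_cost(power_kw: float, years: float, price_per_kwh: float,
                cooling_overhead: float = 0.0) -> float:
    """Electricity cost of running a constant load around the clock.

    cooling_overhead is extra power for cooling as a fraction of the IT load
    (0.3 means 30% on top); it defaults to 0 because the thread gives no figure.
    """
    kwh = power_kw * (1 + cooling_overhead) * HOURS_PER_YEAR * years
    return kwh * price_per_kwh


PRICE_PER_KWH = 0.15  # hypothetical price, in your currency per kWh
YEARS = 3             # hypothetical operating period

h200_cost = energy_cost(1.5, YEARS, PRICE_PER_KWH)      # one H200 system, ~1.5 kW
cluster_cost = energy_cost(40.0, YEARS, PRICE_PER_KWH)  # 58-card cluster, ~40 kW

print(f"H200 system: {h200_cost:,.0f}")
print(f"58-card cluster: {cluster_cost:,.0f}")
```

With these placeholder inputs the cluster's electricity bill comes out about 27 times higher than the single H200 system's, before any cooling overhead is added. Whether that wipes out the purchase-price advantage depends entirely on the real price, utilisation, and cooling figures.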

As for apples and oranges, this is why you can’t go by the marketing numbers; you need to benchmark your workload yourself.