brucethemoose@lemmy.world 2 weeks ago
In my case it’s performance and sheer RAM usage.
GLM 4.5 needs around 112GB of RAM plus every megabyte of VRAM on the GPU, so it simply cannot afford any overhead. I think containers may slow down CPU<->GPU transfers slightly, but don’t quote me on that.
kiol@lemmy.world 2 weeks ago
Can anyone confirm whether containers would actually impact CPU-to-GPU transfers?
brucethemoose@lemmy.world 2 weeks ago
To be clear, VMs absolutely have overhead; the open question is Docker/Podman, where it might be negligible.
And this is a particularly weird scenario (since prompt processing literally has to shuffle 112GB over the PCIe bus for each batch). Most GPGPU apps aren’t so sensitive to that.
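For a rough sense of scale, here is a back-of-the-envelope sketch of how long one 112GB pass over the bus would take. The bandwidth figures are my assumption (theoretical one-direction peaks for x16 links, not measured numbers from this setup), and `transfer_seconds` is just an illustrative helper; real sustained throughput will be lower.

```python
def transfer_seconds(size_gb: float, bandwidth_gb_s: float) -> float:
    """Seconds needed to move size_gb gigabytes at bandwidth_gb_s GB/s."""
    if bandwidth_gb_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gb_s


MODEL_GB = 112.0  # figure quoted in the thread

# ASSUMPTION: theoretical one-direction peak bandwidth of an x16 link.
# The thread does not say which PCIe generation is in use.
PCIE_X16_GB_S = {
    "PCIe 3.0": 15.75,
    "PCIe 4.0": 31.5,
    "PCIe 5.0": 63.0,
}

for gen, bandwidth in PCIE_X16_GB_S.items():
    seconds = transfer_seconds(MODEL_GB, bandwidth)
    print(f"{gen} x16: about {seconds:.1f} s per batch at best")
```

At several seconds per batch even in the best case, a small percentage of extra transfer overhead would be visible here, whereas a workload that keeps its data resident on the GPU would hardly notice it.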