EPYC 9355 CPU-only sweep-bench

#3
by sousekd - opened

People on Reddit sometimes ask about EPYC CPU-only performance; my GPUs are currently out of order, so here are CPU-only results from a single Turin EPYC 9355 (12× DDR5-6400) running GLM-4.5-Air HQ4_K with ik_llama.cpp:
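Token generation on CPU is usually memory-bandwidth bound, so a quick ceiling estimate is useful context for the numbers below. A back-of-envelope sketch for 12× DDR5-6400; the ~7 GB "weights read per token" figure is an assumption for illustration (roughly a ~12B-active-parameter MoE at 4-bit), not a measured value:

```python
# Back-of-envelope memory-bandwidth ceiling for this setup.
# Assumptions (not from the post): standard 64-bit DDR5 channels,
# and a hypothetical ~7 GB of weights touched per generated token.
CHANNELS = 12
MT_PER_S = 6400e6          # DDR5-6400 transfer rate
BYTES_PER_TRANSFER = 8     # 64-bit channel width

peak_gbps = CHANNELS * MT_PER_S * BYTES_PER_TRANSFER / 1e9
print(f"theoretical peak: {peak_gbps:.1f} GB/s")  # 614.4 GB/s

bytes_per_token_gb = 7.0   # assumed active weights read per token
print(f"TG ceiling: {peak_gbps / bytes_per_token_gb:.0f} t/s")
```

Real runs land well under the theoretical peak, but the exercise shows why DIMM speed (and throttling) dominates TG throughput here.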

```
./llama-sweep-bench \
    --model "$MODEL_PATH" \
    --no-mmap \
    -fa on -amb 512 \
    -b 2048 -ub 1024 \
    -ctk f16 -ctv f16 -c 131072 \
    --threads 20 \
    --threads-batch 30 \
    --warmup-batch \
    -n 128
```
| PP | TG | N_KV | T_PP (s) | S_PP (t/s) | T_TG (s) | S_TG (t/s) |
|---:|---:|-----:|---------:|-----------:|---------:|-----------:|
| 1024 | 128 | 0 | 3.544 | 288.95 | 4.404 | 29.07 |
| 1024 | 128 | 1024 | 4.037 | 253.65 | 5.061 | 25.29 |
| 1024 | 128 | 2048 | 4.521 | 226.49 | 5.113 | 25.04 |
| ... | ... | ... | ... | ... | ... | ... |
| 1024 | 128 | 46080 | 35.672 | 28.71 | 22.420 | 5.71 |
| 1024 | 128 | 47104 | 36.476 | 28.07 | 28.743 | 4.45 |
| 1024 | 128 | 48128 | 37.586 | 27.24 | 22.298 | 5.74 |
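For anyone wanting to plot their own runs, sweep-bench rows are easy to scrape from stdout. A minimal parser sketch, assuming the column order from the header above (PP, TG, N_KV, T_PP, S_PP, T_TG, S_TG); `parse_rows` is a hypothetical helper, not part of ik_llama.cpp:

```python
# Minimal parser for llama-sweep-bench table rows (a sketch; column
# order taken from the header: PP TG N_KV T_PP S_PP T_TG S_TG).
def parse_rows(text):
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 7 and parts[0].isdigit():
            pp, tg, n_kv = (int(x) for x in parts[:3])
            t_pp, s_pp, t_tg, s_tg = (float(x) for x in parts[3:])
            rows.append({"n_kv": n_kv, "s_pp": s_pp, "s_tg": s_tg})
    return rows

sample = """\
1024 128 0 3.544 288.95 4.404 29.07
1024 128 48128 37.586 27.24 22.298 5.74
"""
rows = parse_rows(sample)
print(rows[0]["s_tg"] / rows[-1]["s_tg"])  # ~5x TG slowdown at 48k context
```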

[graphs: S_PP and S_TG vs. N_KV from the sweep above]

@sousekd nice graphs, thank you!

The tokens-per-second graph has distinct pits, and then it craters to that level past 40k tokens.
When my graphs looked like that, I had cooling issues caused by DRAM overheating and self-throttling. I put some fans on my RAM, and the graphs became much smoother. So you may have another 30-35% of performance hidden in there to unlock.
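One way to spot this kind of self-throttling programmatically is to flag sweep rows whose TG speed dips well below a local baseline. A crude sketch; `find_dips` is a hypothetical helper (not part of any benchmark tooling), and the 15% tolerance is an arbitrary choice:

```python
# Flag sweep rows whose token-generation speed drops more than `tol`
# below the median of their neighbours - a crude throttling detector.
from statistics import median

def find_dips(s_tg, window=2, tol=0.15):
    dips = []
    for i, v in enumerate(s_tg):
        lo, hi = max(0, i - window), min(len(s_tg), i + window + 1)
        base = median(s_tg[lo:hi])   # local baseline around row i
        if v < (1 - tol) * base:
            dips.append(i)
    return dips

# The 47104-token row above (4.45 t/s between 5.71 and 5.74):
print(find_dips([5.71, 4.45, 5.74]))  # [1]
```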

@anikifoss Thank you for your insight. I think you might be right about thermal throttling. I do have some fairly aggressive BIOS-driven DRAM frequency throttling set up (on top of the DIMMs’ own internal protections). That said, I re-ran it on a cold machine and watched the memory group temps — they never went above 61 °C, which seems pretty reasonable to me. But who knows what the internal DRAM/junction temps were, or how the chips themselves react to that, irrespective of my config.

```
./llama-sweep-bench \
    --model "$MODEL_PATH" \
    --no-mmap \
    -fa on -amb 512 \
    -b 2048 -ub 1024 \
    -ctk f16 -ctv f16 -c 65536 \
    --threads 24 \
    --threads-batch 32 \
    --warmup-batch \
    -n 128
```

[graph: cold-machine re-run of the sweep]

Interestingly, I ran another test after this one which looks perfectly fine: https://huggingface.co/ubergarm/GLM-4.7-Flash-GGUF/discussions/6
But it is quite possible the RAM modules were already too hot, so that run was throttled from start to finish... or the smaller model simply did not cause enough stress for the issue to be noticeable.

Anyway, it is not exactly a typical use case for me - I need to find a way to make the machine run with the GPUs again 😀

GLM-4.7-Flash is pretty awesome too, but now we have Kimi-K2.5 with vision, and MiniMax-M2.5 just came out!

What happened to your GPU?

> GLM-4.7-Flash is pretty awesome too, but now we have Kimi-K2.5 with vision, and MiniMax-M2.5 just came out!

I know! I can’t wait to be able to run them :).

> What happened to your GPU?

I’m not sure yet. The PSU is cutting power with no log entries. The first time, it happened suddenly in the middle of work. After that, with the GPUs connected, I couldn’t keep the server running for more than two minutes. Without the GPUs, everything worked fine for a day, even under heavy load - which is why I ran the CPU-only benchmarks: to test it.

So I tried reconnecting the GPUs one by one. Eventually it failed with any of them, and finally it failed even without GPUs.

My primary suspect is overheating on the AUX PCIe 12V connectors. The motherboard design is terrible: the two connectors sit right next to very hot RAM, basically touching the heat spreader. My second suspect is the PSU, obviously… and then the motherboard. But who knows - it could even be a short on one of the 24 DRAM sticks, say a pin contacting a heat sink after thermal paste degradation.

It could be anything. The issue is easier to reproduce with GPUs connected, and it seems to work a bit better when everything is cold.
So for now I’m focusing on the connectors. We’ll see how it goes.

Cannot wait to be up and running again :).

> Cannot wait to be up and running again :).

Yeah, it sucks dealing with hardware failures. Hope you'll figure it out soon!
