Why does my virtual server with NVIDIA B300 GPUs show degraded performance or run some GPUs at unusually high clock speeds?
In some instances, the clock speed of one or more NVIDIA B300 GPUs becomes locked, which can cause degraded performance or higher power usage on the affected GPUs.
NVIDIA HGX B300 accelerated virtual server profiles are available for select customers. Create a support case if you are interested in purchasing and using this offering.
When you run the nvidia-smi command to check GPU status, you see that some GPUs are running at significantly higher clock speeds and power usage than others, even when no workload is running.
Run the following command to check GPU status:
nvidia-smi --query-gpu=index,pstate,power.draw,clocks.current.graphics,clocks.current.sm,temperature.gpu --format=csv
The following example shows the output without a workload running. GPU 6 and GPU 7 are running at higher speeds (2017 MHz) and higher power usage (187-188 W) compared to the other GPUs (120 MHz and 133-138 W):
index, pstate, power.draw [W], clocks.current.graphics [MHz], clocks.current.sm [MHz], temperature.gpu
0, P0, 134.45 W, 120 MHz, 120 MHz, 37
1, P0, 136.28 W, 120 MHz, 120 MHz, 37
2, P0, 135.65 W, 120 MHz, 120 MHz, 31
3, P0, 133.02 W, 120 MHz, 120 MHz, 31
4, P0, 136.44 W, 120 MHz, 120 MHz, 39
5, P0, 138.55 W, 120 MHz, 120 MHz, 39
6, P0, 187.59 W, 2017 MHz, 2017 MHz, 34
7, P0, 188.10 W, 2017 MHz, 2017 MHz, 34
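On a server with many GPUs, it can help to flag the affected GPUs automatically instead of reading the table by eye. The following is a minimal sketch, not an official tool: `find_locked_gpus` is a hypothetical helper name, and the 1000 MHz default threshold is an assumption chosen only because it sits between the idle (120 MHz) and locked (2017 MHz) values in the example output. It reads the CSV output of the query command on standard input and prints the index of each GPU whose graphics clock exceeds the threshold.

```shell
# Hypothetical helper: reads nvidia-smi CSV output (from the query above) on stdin
# and prints the index of every GPU whose graphics clock exceeds a threshold.
# The 1000 MHz default is an illustrative assumption, not an NVIDIA-documented value.
find_locked_gpus() {
    threshold_mhz="${1:-1000}"
    awk -F', *' -v limit="$threshold_mhz" '
        NR == 1 && $1 == "index" { next }   # skip the CSV header row
        {
            clock = $4                      # 4th column: clocks.current.graphics
            sub(/ MHz$/, "", clock)         # strip the unit so the value compares numerically
            if (clock + 0 > limit) print $1
        }
    '
}
```

With no workload running, pipe the query output into the helper, for example `nvidia-smi --query-gpu=index,pstate,power.draw,clocks.current.graphics,clocks.current.sm,temperature.gpu --format=csv | find_locked_gpus`. The check is only meaningful on an idle system, because GPUs under load legitimately run at high clock speeds.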
This issue is caused by an NVIDIA bug that affects B300 GPUs. It can occur when a GPU enters an idle state and its clock speed becomes locked.
To unlock the GPU clocks and restore normal performance, reset the GPUs by using the nvidia-smi command.
- Reset the GPUs by running the following command:
nvidia-smi -r
- Verify that all GPUs are now running at similar low power and low clock speeds by running the following command:
nvidia-smi --query-gpu=index,pstate,power.draw,clocks.current.graphics,clocks.current.sm,temperature.gpu --format=csv
index, pstate, power.draw [W], clocks.current.graphics [MHz], clocks.current.sm [MHz], temperature.gpu
After the reset, all GPUs should show similar clock speeds and power usage levels.
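The verification step can also be scripted. The sketch below is an illustration under stated assumptions: `clocks_are_uniform` is a hypothetical helper name, and the 100 MHz default tolerance is an arbitrary choice for "similar" clock speeds, not a documented value. It reads the same CSV output on standard input and succeeds (exit status 0) only if the spread between the highest and lowest graphics clock is within the tolerance.

```shell
# Hypothetical helper: succeeds when all GPUs in the nvidia-smi CSV output on stdin
# report graphics clocks within a tolerance of each other.
# The 100 MHz default tolerance is an illustrative assumption.
clocks_are_uniform() {
    tolerance_mhz="${1:-100}"
    awk -F', *' -v tol="$tolerance_mhz" '
        NR == 1 && $1 == "index" { next }   # skip the CSV header row
        {
            clock = $4 + 0                  # awk keeps the leading number and ignores " MHz"
            if (!seen || clock < min) min = clock
            if (!seen || clock > max) max = clock
            seen = 1
        }
        # Fail on empty input; otherwise pass only if the spread is within tolerance.
        END { exit ((seen && (max - min) <= tol) ? 0 : 1) }
    '
}
```

After running `nvidia-smi -r`, a check might look like `nvidia-smi --query-gpu=index,pstate,power.draw,clocks.current.graphics,clocks.current.sm,temperature.gpu --format=csv | clocks_are_uniform && echo "GPU clocks look uniform"`. As with the manual check, run it while no workload is active.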