How to check GPU utilization: Difference between revisions
Latest revision as of 19:15, 13 March 2026
Introduction
Compute accelerators (GPU cards) are a powerful tool for scientific research in areas such as machine learning training, scientific simulations, and large-scale data processing. These resources are expensive and limited in our compute clusters. If these shared resources are not properly utilized, significant compute capacity and time can be wasted.
A typical GPU workflow involves three main stages in each iteration: transferring data from the CPU to the GPU, executing GPU kernels, and transferring results back from the GPU to the CPU. GPU utilization drops when the GPU spends time waiting for data or when the kernels do not fully exploit the available parallelism. By monitoring utilization metrics, users can identify when the GPU is idle or underused and take steps to address performance issues.
Monitoring GPU utilization also promotes responsible use of shared computing infrastructure. Requesting more GPUs than needed increases queue times and reduces availability for other users. By understanding and monitoring usage patterns, users can make optimal resource requests and improve overall cluster efficiency.
For Running Jobs
Using SLURM
If you have a job running on a GPU node that is expected to use a GPU on that node, you can check your code's GPU usage by running the following command on ARC's login node:
$ srun -s --jobid 12345678 --pty nvidia-smi
The number here is the job ID of the running job. The -s (--oversubscribe) flag allows this monitoring step to share the resources already allocated to the job, so it can run alongside your code.
The output should look similar to the following:
Mon Aug 22 09:27:38 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:3B:00.0 Off | 0 |
| N/A 33C P0 36W / 250W | 848MiB / 16160MiB | 30% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2232533 C .../Programs/OpenDBA/openDBA 338MiB |
+-----------------------------------------------------------------------------+
In this case, one GPU was allocated and its utilization was 30%.
The openDBA process was also using 338 MiB of GPU memory.
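The snapshot above shows a single instant; utilization often swings between 0% and 100% during a run, so a view over time is more informative. nvidia-smi's query mode can sample utilization at a fixed interval and write CSV output that is easy to summarize afterwards. A sketch, assuming the same srun approach as above (the job ID and file name are illustrative; the --query-gpu field names are standard nvidia-smi options):

```shell
# Sample utilization and memory every 5 s into a CSV log
# (run against the running job as above; 12345678 is an illustrative job ID):
#   srun -s --jobid 12345678 --pty \
#     nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used \
#                --format=csv,noheader -l 5 > gpu_usage.csv

# Summarize the average sampled utilization from such a log afterwards
avg_util() {
  # Field 2 looks like "30 %"; strip the unit and average over all samples
  awk -F', ' '{ gsub(/ %/, "", $2); sum += $2; n++ }
              END { if (n) printf "%.1f\n", sum / n }' "$1"
}
```

For a live view instead of a log, running `watch -n 2 nvidia-smi` inside the srun step works as well.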
For Past Jobs
Using Sampled Metrics
On the OOD portal at https://ood-arc.rcs.ucalgary.ca, log in using your UofC credentials (you may need to use your full UofC email address instead of just your user name).
Once you log in,
- Select Help --> View my job metrics,
- this will open an interface to your past jobs that are available in the database.
- Find the job you want to check; its page may include useful graphs of GPU usage.
Common tips for working with GPUs on ARC
- Confirm your software is GPU-enabled and actually using CUDA; many tools silently fall back to the CPU without warning.
- Avoid long salloc sessions that hold idle GPUs. For interactive exploration, request short walltimes on older-generation GPUs (e.g., V100).
- Verify one GPU is well-utilized before scaling to multiple.
- Increase batch size (within memory limits) to amortize overhead.
- Use multi-threaded data loaders.
- Avoid small, frequent I/O; prefer fewer, larger reads/writes.
- Write active job output to high-performance scratch storage; avoid writing to home directories during training.
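Several of these tips can be combined in the job script itself: log GPU utilization in the background while the workload runs, so the data for a post-mortem review is already on disk when the job ends. A minimal sbatch sketch; the job name, walltime, logger interval, and training command are illustrative placeholders, not ARC-specific values:

```shell
#!/bin/bash
#SBATCH --job-name=gpu-train     # illustrative job name
#SBATCH --gres=gpu:1             # start with one GPU; scale up only once it is well used
#SBATCH --time=02:00:00

# Background logger: one CSV sample per minute for the life of the job
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used \
           --format=csv,noheader -l 60 > gpu_usage.csv &
logger_pid=$!

python train.py                  # illustrative workload

kill "$logger_pid"
```

Reviewing gpu_usage.csv after the run makes it easy to see whether the GPU sat idle (e.g., during data loading) and whether a larger batch size or more data-loader threads would help.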