How to request GPUs for batch jobs on ARC
GPUs on ARC
Currently, there are three partitions on ARC that have usable CUDA GPUs: gpu-v100, gpu-a100, and bigmem. ARC's configuration changes over time as new hardware is added and old hardware is retired, so you have to check the available hardware on ARC before starting a new set of computations.
- First, check the available partitions with GPUs on ARC following this article:
- Then, check the resource limits for the partitions you intend to use, as explained here:
The main partitions that contain GPUs on ARC at the time of writing are gpu-v100 and gpu-a100.
These partitions contain V100 and A100 NVIDIA GPUs, respectively.
The maximum time limit for jobs on both partitions is 24 hours.
Please note that GPU partitions should only be used to run jobs that require GPUs.
Any CPU-only computations have to be directed to CPU-only partitions.
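As a quick check from a login node, standard Slurm commands can also list the GPU partitions and their limits. This is a minimal sketch; the exact set of partitions and the output columns depend on ARC's current Slurm configuration:

# List partitions that advertise GPUs (GRES), with their time limits and node counts.
sinfo -o "%P %G %l %D" | grep -i gpu

# Show the full configuration and limits of a specific GPU partition, e.g. gpu-v100.
scontrol show partition gpu-v100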
Examples
Example 1
Request 1 V100 GPU, 4 CPUs and 16 GB of RAM for 1 hour.
V100 GPUs are available in the gpu-v100 partition. This is the gpu-1.slurm job script that requests these resources:
#! /bin/bash
# ====================================
#SBATCH --job-name=GPU_test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16GB
#SBATCH --time=0-01:00:00
#SBATCH --gres=gpu:1
#SBATCH --partition=gpu-v100
# ====================================
# Check the GPUs with the nvidia-smi command.
nvidia-smi
The script only runs the nvidia-smi command to make sure that the GPU has been provided by SLURM.
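To run the example, the script is submitted with sbatch in the usual way. This is a minimal sketch assuming the default Slurm output file name, slurm-<jobid>.out (replace <jobid> with the actual job ID):

# Submit the job script to the scheduler.
sbatch gpu-1.slurm

# Check the state of your jobs in the queue.
squeue -u $USER

# Once the job has finished, the nvidia-smi output is in the job's output file.
cat slurm-<jobid>.out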
Example 2
Request 2 A100 GPUs, 8 CPUs and 64 GB of RAM for 2 hours.
A100 GPUs are available in the gpu-a100 partition. This is the gpu-2.slurm job script that requests these resources:
#! /bin/bash
# ====================================
#SBATCH --job-name=2GPU_test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64GB
#SBATCH --time=0-02:00:00
#SBATCH --gres=gpu:2
#SBATCH --partition=gpu-a100
# ====================================
# Check the GPUs with the nvidia-smi command.
nvidia-smi
The script only runs the nvidia-smi command to make sure that both GPUs have been provided by SLURM.
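Inside the job, it is also possible to confirm that Slurm exposed both GPUs to the job. This is a minimal sketch that could be added to the end of the script; on typical configurations, Slurm's GPU GRES plugin sets CUDA_VISIBLE_DEVICES for the job:

# GPU indices made visible to the job by Slurm (typically "0,1" for two GPUs).
echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"

# Count and list the GPUs that are actually visible to the job.
nvidia-smi -L | wc -l
nvidia-smi --query-gpu=name,memory.total --format=csv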
Example 3
Request 4 V100 GPUs for one job for 2 hours.
The V100 GPUs are in the gpu-v100 partition.
As arc.hardware shows, each node in this partition has 2 GPUs, not the 4 we want.
Therefore, to get 4 GPUs we have to use 2 compute nodes from this partition.
Note that before proceeding with such a job, you have to understand that for this request to make sense, the code that is going to run in the job must support distributed execution, where the computation spreads across several nodes.
Even if we believe that the code can run on multiple nodes and can use multiple GPUs for a performance benefit, this has to be tested and the performance benefit has to be measured. We have to confirm that the computation scales well with an increasing number of nodes as well as an increasing number of GPUs.
We still have to request the CPU part of the resources; in this example, we request 8 CPUs and 64 GB of RAM on each node.
This is the gpu-4.slurm job script that requests these resources:
#! /bin/bash
# ====================================
#SBATCH --job-name=4GPU_test
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=8
#SBATCH --mem=64GB
#SBATCH --time=0-02:00:00
#SBATCH --gres=gpu:2
#SBATCH --partition=gpu-v100
# ====================================
# Report the allocated nodes.
echo "Allocated nodes: $SLURM_JOB_NODELIST"

# Check the GPUs on each of the allocated nodes with the nvidia-smi command.
for NODE in $(scontrol show hostnames $SLURM_JOB_NODELIST); do
    echo $NODE; ssh $NODE "nvidia-smi -L"
done
The script only runs the nvidia-smi -L command on each allocated node to make sure that all 4 GPUs have been provided by SLURM.
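As an alternative to the ssh loop, srun can launch the same check once per allocated node from within the job; it uses the existing allocation and does not require ssh between nodes. This is a minimal sketch; depending on ARC's Slurm version, the job step may also need an explicit --gres=gpu:2 option to see the GPUs:

# Run the GPU check once on each of the 2 allocated nodes via srun.
srun --ntasks=2 --ntasks-per-node=1 bash -c 'echo $(hostname); nvidia-smi -L'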
Checking GPU utilization
Before running all your production jobs, you have to run several test jobs to make sure that the requested resources are properly used. In particular, you have to make sure that the GPU(s) your job requests are actually used.
- See this HowTo on how to check the utilization: How to check GPU utilization.
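As a quick manual check while a test job is running (the linked HowTo covers this in more detail), you can log in to the node the job was assigned to and watch the GPUs directly. This is a minimal sketch; replace NODE_NAME with the node name reported by squeue:

# Find the node(s) where your job is running (%i = job ID, %N = node list).
squeue -u $USER -o "%i %N"

# Log in to the reported node and check GPU utilization and the processes using the GPUs.
ssh NODE_NAME
nvidia-smi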
Software that is often used with GPUs