How to request GPUs for batch jobs on ARC

From RCSWiki


Currently, three partitions on ARC have usable CUDA GPUs: gpu-v100, gpu-a100, and bigmem. ARC's configuration changes as new hardware is added and old hardware is retired, so check which hardware is available on ARC before starting a new set of computations.

  • First, check the available partitions with GPUs on ARC following this article: How to find available partitions on ARC.
  • Then, check the resource limits for the partitions you intend to use, as explained here: How to find current limits on ARC.
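As a quick command-line check, the sketch below lists only the partitions whose generic resources (GRES) include a GPU, together with their time limits. It assumes the standard Slurm sinfo command is available on the login node; the helper name filter_gpu_rows is hypothetical, and the chosen columns are just one reasonable selection.

```shell
# Keep the header line plus every row whose GRES column (2nd field) mentions a GPU.
# filter_gpu_rows is a hypothetical helper name used only in this sketch.
filter_gpu_rows() {
    awk 'NR == 1 || $2 ~ /gpu/'
}

if command -v sinfo >/dev/null 2>&1; then
    # %P = partition, %G = generic resources, %l = time limit, %D = node count
    sinfo -o "%P %G %l %D" | filter_gpu_rows
else
    echo "sinfo not found: run this on an ARC login node" >&2
fi
```

The linked articles remain the authoritative description of partitions and limits; this is only a shortcut for a first look.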

The main partitions that contain GPUs on ARC at the time of writing are gpu-v100 and gpu-a100. These partitions contain NVIDIA V100 and A100 GPUs, respectively. The maximum time limit for jobs on both partitions is 24 hours.

Please note that GPU partitions should only be used to run jobs that require GPUs. Any CPU-only computations must be directed to CPU-only partitions.


Example 1

Request 1 V100 GPU, 4 CPUs and 16 GB of RAM for 1 hour.

V100 GPUs are available in the gpu-v100 partition. This is the gpu-1.slurm job script that requests these resources:

#! /bin/bash
# ====================================
#SBATCH --job-name=GPU_test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16GB
#SBATCH --time=0-01:00:00
#SBATCH --gres=gpu:1
#SBATCH --partition=gpu-v100
# ====================================

# Check the GPUs with the nvidia-smi command.
nvidia-smi

The script only runs the nvidia-smi command to make sure that the GPU has been provided by SLURM.
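Submitting the script and locating its output could look like the sketch below. It assumes gpu-1.slurm is in the current directory and that the job uses Slurm's default output file name, slurm-<jobid>.out; the helper parse_job_id is a hypothetical name introduced here.

```shell
# sbatch --parsable prints "jobid" or "jobid;cluster"; keep only the job ID.
# parse_job_id is a hypothetical helper name used only in this sketch.
parse_job_id() {
    printf '%s\n' "${1%%;*}"
}

if command -v sbatch >/dev/null 2>&1; then
    job_id=$(parse_job_id "$(sbatch --parsable gpu-1.slurm)")
    # Show the state of the submitted job in the queue.
    squeue -j "$job_id"
    echo "When the job finishes, check slurm-${job_id}.out for the nvidia-smi table."
else
    echo "sbatch not found: run this on an ARC login node" >&2
fi
```

If the nvidia-smi table in the output file lists one V100, the GPU request worked.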

Example 2

Request 2 A100 GPUs, 8 CPUs and 64 GB of RAM for 2 hours.

A100 GPUs are available in the gpu-a100 partition. This is the gpu-2.slurm job script that requests these resources:

#! /bin/bash
# ====================================
#SBATCH --job-name=2GPU_test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64GB
#SBATCH --time=0-02:00:00
#SBATCH --gres=gpu:2
#SBATCH --partition=gpu-a100
# ====================================

# Check the GPUs with the nvidia-smi command.
nvidia-smi

The script only runs the nvidia-smi command to make sure that both GPUs have been provided by SLURM.
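For multi-GPU requests it can help to check the GPU count automatically rather than reading the nvidia-smi table by eye. The lines below are a sketch that could be added to the body of a job script; count_gpus and expected_gpus are hypothetical names, and the assumption that SLURM sets CUDA_VISIBLE_DEVICES for the job depends on how the cluster is configured.

```shell
# Count the GPUs in `nvidia-smi -L` style output read from stdin.
# Each GPU is listed on one line that starts with "GPU <index>:".
count_gpus() {
    awk '/^GPU [0-9]+:/ { n++ } END { print n + 0 }'
}

expected_gpus=2   # hypothetical variable; should match --gres=gpu:2

if command -v nvidia-smi >/dev/null 2>&1; then
    found_gpus=$(nvidia-smi -L | count_gpus)
    if [ "$found_gpus" -ne "$expected_gpus" ]; then
        echo "Warning: expected $expected_gpus GPUs but found $found_gpus" >&2
    else
        echo "Found $found_gpus GPUs, as requested."
    fi
    # Typically set by SLURM for GPU jobs; this is an assumption about the cluster setup.
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
fi
```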

Checking GPU utilization

Before running your production jobs, run several test jobs to make sure that the requested resources are properly used. In particular, make sure that the GPU(s) your job requests are actually used.
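One way to do this from inside a test job is to sample GPU utilization in the background while the workload runs, then summarize the log. The sketch below assumes nvidia-smi is available on the compute node; the log file name, the 5-second sampling interval, the sleep placeholder standing in for the real workload, and the helper average_gpu_utilization are all hypothetical choices, not part of ARC's documented setup.

```shell
# Average the GPU utilization column (3rd field) of a CSV log written by
# nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used
#            --format=csv,noheader,nounits
# average_gpu_utilization is a hypothetical helper name used only in this sketch.
average_gpu_utilization() {
    awk -F', *' '{ sum += $3; n++ }
                 END { if (n > 0) printf "%.1f\n", sum / n; else print "no samples" }' "$1"
}

if command -v nvidia-smi >/dev/null 2>&1; then
    log="gpu_usage_${SLURM_JOB_ID:-test}.csv"

    # Sample every 5 seconds in the background while the workload runs.
    nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used \
               --format=csv,noheader,nounits -l 5 > "$log" &
    monitor_pid=$!

    # Placeholder: replace this line with the real GPU workload.
    sleep 15

    kill "$monitor_pid"
    echo "Average GPU utilization (%): $(average_gpu_utilization "$log")"
fi
```

An average close to zero suggests that the program is not using the GPU at all, in which case the job belongs on a CPU-only partition.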

Software that is often used with GPUs

