How to find current limits on ARC
Resource allocation limits
Checking the current limits on ARC is recommended before planning a new series of computations, especially if GPUs are required for the computations.
The current limits can be displayed with the arc.limits command:
$ arc.limits
    PartitionName  Flags        MaxTRES       MaxWall     MaxTRESPU        MaxSubmitPU  MinTRES
 0  normal                                    7-00:00:00                   4000
 1  cpu2019                     cpu=240       7-00:00:00  cpu=240          4000
 2  gpu-v100       DenyOnLimit  cpu=80,gpu=4  1-00:00:00  cpu=160,gpu=8    4000         gpu=1
 3  single                      cpu=200       7-00:00:00  cpu=200,node=30  4000
 4  razi                                      7-00:00:00                   4000
 5  apophis                                   7-00:00:00                   4000
 6  razi-bf                     cpu=546       05:00:00    cpu=546          4000
 7  apophis-bf                  cpu=280       05:00:00    cpu=280          4000
 8  lattice                     cpu=408       7-00:00:00  cpu=408          4000
 9  parallel                    cpu=624       7-00:00:00  cpu=624          4000
10  bigmem                      cpu=80        1-00:00:00  cpu=80,gpu=1     4000
11  cpu2013                                   7-00:00:00                   4000
12  pawson                                    7-00:00:00                   4000
13  pawson-bf                   cpu=480       05:00:00    cpu=480          4000
14  theia                                     7-00:00:00                   4000
15  theia-bf                    cpu=280       05:00:00                     4000
16  demo                                      7-00:00:00                   4000
17  synergy                                   7-00:00:00                   4000
18  synergy-bf                  cpu=448       05:00:00    cpu=448          4000
19  backfill05                  cpu=1000      05:00:00    cpu=1000         4000
20  cpu2021                     cpu=576       7-00:00:00  cpu=576          4000
21  backfill24                  cpu=208       1-00:00:00  cpu=208          4000
22  sherlock                                  7-00:00:00                   4000
23  wdf-zach                                  7-00:00:00                   4000
24  wdf-think                                 7-00:00:00                   4000
25  mtst                                      7-00:00:00
26  cpu2022                     cpu=520       7-00:00:00  cpu=520          4000
27  gpu-a100       DenyOnLimit  cpu=80,gpu=4  1-00:00:00  cpu=160,gpu=8    4000         gpu=1

TRES=Trackable RESources  PU=Per User
The table lists the partitions and the limits set for each of them.
- Flags -- settings that determine whether a job is accepted or denied when its resource request is over the limit.
- MaxTRES -- maximum trackable resources: the maximum amount of resources allowed per job on this partition.
- MaxWall -- maximum wall time: the longest time a job is allowed to run on this partition.
- MaxTRESPU -- maximum trackable resources per user: the total amount of resources a single user may use on the partition at one time. Once this limit is reached, further resource requests wait in the queue until currently running jobs free some resources.
- MaxSubmitPU -- maximum number of jobs a single user may have submitted to this partition. SLURM rejects any jobs above this limit. Note that there is also a global limit of 4000 jobs per user for the entire cluster.
- MinTRES -- minimum amount of resources a job must request in order to be accepted. This is relevant on GPU partitions, where a job is expected to request at least one GPU to qualify to run.
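When only one partition is of interest, the full table can be narrowed down to its header and a single row. The following is a minimal sketch, not part of ARC's tooling: the function name partition_limits is hypothetical, and it assumes the layout shown above, where the header is the first line of output and the partition name is the second whitespace-separated field (after the row index).

```shell
# Hypothetical helper: keep the header line and the row for one partition.
# Reads arc.limits-style text on standard input.
# Assumes the partition name is the second whitespace-separated field.
partition_limits() {
    awk -v part="$1" 'NR == 1 || $2 == part'
}

# Intended use on the cluster:
#   arc.limits | partition_limits gpu-v100
```

Because empty cells are simply blank in the output, the fields of the selected row do not line up with the header by position; the row is best read against the column descriptions above.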
Backfill partitions
Some of the partitions on ARC are not for general use. They may be dedicated to a specific task or may belong to a specific group.
However, some of the nodes from such partitions may also be assigned to additional auxiliary partitions, whose names end with the -bfXX suffix, such as cpu2019-bf05, cpu2021-bf24, etc., as shown in the output of the arc.hardware command.
Currently, the limits for such partitions are set via two auxiliary partitions, backfill05 and backfill24, instead.
To check the limits for any *-bf05 partition in the output of the arc.limits command, look at the limits of backfill05 instead.
Correspondingly, for any *-bf24 partition, use the backfill24 limits.
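The name mapping described above can be written down as a small shell function. This is only an illustrative sketch: the function name limits_row_for is hypothetical, and it encodes nothing beyond the two suffix rules stated in this section. Any other partition name is returned unchanged.

```shell
# Hypothetical helper: print the name of the arc.limits row that holds
# the limits for a given partition.
limits_row_for() {
    case "$1" in
        *-bf05) echo "backfill05" ;;   # any *-bf05 partition uses backfill05 limits
        *-bf24) echo "backfill24" ;;   # any *-bf24 partition uses backfill24 limits
        *)      echo "$1" ;;           # all other partitions have their own row
    esac
}

# Intended use on the cluster:
#   arc.limits | grep -w "$(limits_row_for cpu2019-bf05)"
```

Partitions such as razi-bf or theia-bf, whose names end in -bf without a number, fall through to the last case, which matches the table above where they have their own rows.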
Examples
gpu-v100 partition limits
For example, the gpu-v100 partition is limited to 4000 jobs per user, and the maximum run time is limited to 24 hours.
The job must request at least 1 GPU, but not more than 4 GPUs.
The total number of GPUs working for a single user's jobs is limited to 8.
If the resource request of a submitted job is over the limit, the job will be rejected.
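A job script that stays within these limits might look like the sketch below. It is an illustration only, not an ARC-recommended template: the CPU count and the final command are placeholder values, and the script assumes GPUs are requested through the standard Slurm --gres option.

```shell
#!/bin/bash
# Illustrative job script for the gpu-v100 partition (values are examples).
#SBATCH --partition=gpu-v100
# Request 1 GPU: the minimum accepted, with at most 4 allowed per job.
#SBATCH --gres=gpu:1
# Hypothetical CPU count, well under the per-job limit of 80 CPUs.
#SBATCH --cpus-per-task=4
# 24 hours, the maximum run time on this partition.
#SBATCH --time=1-00:00:00

# Slurm normally sets CUDA_VISIBLE_DEVICES for jobs that were allocated GPUs;
# outside of such a job the variable is unset and "none" is printed.
gpu_list="${CUDA_VISIBLE_DEVICES:-none}"
echo "GPUs visible to this job: ${gpu_list}"
```

Requesting --gres=gpu:5 or a run time longer than 1-00:00:00 would put the request over the per-job limits listed for gpu-v100.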