How to find current limits on ARC

From RCSWiki
Revision as of 17:19, 21 September 2023 by Lleung (talk | contribs) (introduction)

ARC has many partitions, each with different resource limits. It is important to understand these limits when planning a new computation and before submitting a job. For scarcer resources, such as GPU and bigmem nodes, understanding the limits and making your jobs as efficient as possible helps increase job throughput on the cluster.

Resource allocation limits

The current resource limits can be shown by the arc.limits command:

$ arc.limits

   PartitionName        Flags       MaxTRES     MaxWall        MaxTRESPU MaxSubmitPU MinTRES
0         normal                             7-00:00:00                         4000        
1        cpu2019                    cpu=240  7-00:00:00          cpu=240        4000        
2       gpu-v100  DenyOnLimit  cpu=80,gpu=4  1-00:00:00    cpu=160,gpu=8        4000   gpu=1
3         single                    cpu=200  7-00:00:00  cpu=200,node=30        4000        
4           razi                             7-00:00:00                         4000        
5        apophis                             7-00:00:00                         4000        
6        razi-bf                    cpu=546    05:00:00          cpu=546        4000        
7     apophis-bf                    cpu=280    05:00:00          cpu=280        4000        
8        lattice                    cpu=408  7-00:00:00          cpu=408        4000        
9       parallel                    cpu=624  7-00:00:00          cpu=624        4000        
10        bigmem                     cpu=80  1-00:00:00     cpu=80,gpu=1        4000        
11       cpu2013                             7-00:00:00                         4000        
12        pawson                             7-00:00:00                         4000        
13     pawson-bf                    cpu=480    05:00:00          cpu=480        4000        
14         theia                             7-00:00:00                         4000        
15      theia-bf                    cpu=280    05:00:00                         4000        
16          demo                             7-00:00:00                         4000        
17       synergy                             7-00:00:00                         4000        
18    synergy-bf                    cpu=448    05:00:00          cpu=448        4000        
19    backfill05                   cpu=1000    05:00:00         cpu=1000        4000        
20       cpu2021                    cpu=576  7-00:00:00          cpu=576        4000        
21    backfill24                    cpu=208  1-00:00:00          cpu=208        4000        
22      sherlock                             7-00:00:00                         4000        
23      wdf-zach                             7-00:00:00                         4000        
24     wdf-think                             7-00:00:00                         4000        
25          mtst                             7-00:00:00                                     
26       cpu2022                    cpu=520  7-00:00:00          cpu=520        4000        
27      gpu-a100  DenyOnLimit  cpu=80,gpu=4  1-00:00:00    cpu=160,gpu=8        4000   gpu=1

TRES=Trackable RESources
  PU=Per User

The table lists the partitions and the limits set for each of them.

  • Flags -- settings that determine whether a job is accepted or rejected when its resource request is over the limit.
  • MaxTRES -- maximum trackable resources: the largest amount of resources allowed per job on this partition.
  • MaxWall -- maximum wall time: the longest time a job is allowed to run on this partition.
  • MaxTRESPU -- maximum trackable resources per user: the total amount of resources a single user may hold on the partition at once. Once the limit is reached, further resource requests wait in the queue until currently running jobs free some resources.
  • MaxSubmitPU -- maximum number of jobs a single user may have submitted to this partition. SLURM rejects any jobs above this limit. Note that there is also a global limit of 4000 jobs per user for the entire cluster.
  • MinTRES -- minimum amount of resources a job must request in order to be accepted. This is relevant on GPU partitions, where a job is expected to request at least one GPU to qualify to run.
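
When planning jobs from a script, it can help to read MaxWall for a partition from arc.limits output and convert it to seconds. The following is a minimal sketch, not part of ARC's tooling: the function names max_wall and wall_to_seconds are hypothetical, and the code assumes the table layout shown above (row index first, partition name second) and wall times in [D-]HH:MM:SS form with two-digit hour, minute and second fields.

```shell
# Print the MaxWall value for partition $1 from arc.limits-style text on stdin.
# Empty columns shift field positions, so the wall time is located by its
# [D-]HH:MM:SS shape instead of by column number.
max_wall() {
    awk -v p="$1" '
        $2 == p {
            for (i = 3; i <= NF; i++)
                if ($i ~ /^([0-9]+-)?[0-9]+:[0-9]+:[0-9]+$/) { print $i; exit }
        }'
}

# Convert [D-]HH:MM:SS to seconds. The function body is a subshell, so the
# IFS change and the helper variables do not leak into the caller.
wall_to_seconds() (
    t=$1
    days=0
    case $t in
        *-*) days=${t%%-*}; t=${t#*-} ;;
    esac
    IFS=:
    set -- $t
    # Drop one leading zero so that "08" is not rejected as invalid octal.
    h=${1#0}; m=${2#0}; s=${3#0}
    echo $(( days * 86400 + h * 3600 + m * 60 + s ))
)
```

On ARC this could be used as, for example, arc.limits | max_wall gpu-v100, assuming arc.limits writes the table to standard output as shown above.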

Backfill partitions

Some of the partitions on ARC are not for general use. They may be dedicated to a specific task or may belong to a specific group. However, some of the nodes from such partitions may also be assigned to additional auxiliary partitions whose names end with the -bfXX suffix, such as cpu2019-bf05, cpu2021-bf24, etc., as shown in the output of the arc.hardware command.

Currently, the limits for such partitions are set via two auxiliary partitions, backfill05 and backfill24, instead.

To find the current limits for any xxxxxxx-bf05 partition in the output of the arc.limits command, check the limits of backfill05.

Correspondingly, for any xxxxxxx-bf24 partition, use the backfill24 limits.
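
The name mapping described above can be sketched as a small shell function. This is an illustration only: limits_partition is a hypothetical helper name, and it assumes that -bf05 and -bf24 are the only numbered backfill suffixes, as in the text above. Names without a numbered suffix, such as razi-bf, are returned unchanged because they have their own rows in the arc.limits output.

```shell
# Map a partition name to the row of arc.limits that holds its limits.
limits_partition() {
    case $1 in
        *-bf05) echo backfill05 ;;   # any xxxxxxx-bf05 uses the backfill05 limits
        *-bf24) echo backfill24 ;;   # any xxxxxxx-bf24 uses the backfill24 limits
        *)      echo "$1" ;;         # every other partition is listed under its own name
    esac
}
```

For example, limits_partition cpu2019-bf05 prints backfill05, which is the row to look up in the arc.limits table.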


gpu-v100 partition limits

For example, the gpu-v100 partition allows up to 4000 submitted jobs per user, and the maximum run time is 24 hours. A job must request at least 1 GPU and no more than 4 GPUs. The total number of GPUs in use by a single user's jobs is limited to 8. Because the partition has the DenyOnLimit flag set, a submitted job whose resource request is over the limit is rejected.
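
Since an oversized request on this partition is rejected at submission, a quick check before submitting can save a round trip. The sketch below hard-codes the per-job values from the gpu-v100 row of the table above (MinTRES gpu=1, MaxTRES cpu=80,gpu=4, MaxWall 1-00:00:00). The function name fits_gpu_v100 is hypothetical, and the numbers are a snapshot that may differ from the current arc.limits output.

```shell
# Succeed (exit status 0) if a single job request fits the gpu-v100 per-job limits.
# $1 = number of GPUs, $2 = number of CPU cores, $3 = wall time in seconds
fits_gpu_v100() {
    [ "$1" -ge 1 ] &&        # MinTRES: at least 1 GPU
    [ "$1" -le 4 ] &&        # MaxTRES: at most 4 GPUs per job
    [ "$2" -le 80 ] &&       # MaxTRES: at most 80 CPUs per job
    [ "$3" -le 86400 ]       # MaxWall: 1-00:00:00 is 86400 seconds
}
```

A request for 2 GPUs, 16 CPUs and 12 hours (43200 seconds) passes this check, while a request for 5 GPUs or for 25 hours does not. The per-user totals (MaxTRESPU, MaxSubmitPU) depend on the user's other jobs and are not covered by this check.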