How to find current limits on ARC

From RCSWiki
Revision as of 16:58, 23 March 2023

Resource allocation limits

Checking the current limits on ARC is recommended before planning a new series of computations, especially if they require GPUs.

The current limits can be shown by the arc.limits command:

$ arc.limits

   PartitionName        Flags       MaxTRES     MaxWall        MaxTRESPU MaxSubmitPU MinTRES
0         normal                             7-00:00:00                         4000        
1        cpu2019                    cpu=240  7-00:00:00          cpu=240        4000        
2       gpu-v100  DenyOnLimit  cpu=80,gpu=4  1-00:00:00    cpu=160,gpu=8        4000   gpu=1
3         single                    cpu=200  7-00:00:00  cpu=200,node=30        4000        
4           razi                             7-00:00:00                         4000        
5        apophis                             7-00:00:00                         4000        
6        razi-bf                    cpu=546    05:00:00          cpu=546        4000        
7     apophis-bf                    cpu=280    05:00:00          cpu=280        4000        
8        lattice                    cpu=408  7-00:00:00          cpu=408        4000        
9       parallel                    cpu=624  7-00:00:00          cpu=624        4000        
10        bigmem                     cpu=80  1-00:00:00     cpu=80,gpu=1        4000        
11       cpu2013                             7-00:00:00                         4000        
12        pawson                             7-00:00:00                         4000        
13     pawson-bf                    cpu=480    05:00:00          cpu=480        4000        
14         theia                             7-00:00:00                         4000        
15      theia-bf                    cpu=280    05:00:00                         4000        
16          demo                             7-00:00:00                         4000        
17       synergy                             7-00:00:00                         4000        
18    synergy-bf                    cpu=448    05:00:00          cpu=448        4000        
19    backfill05                   cpu=1000    05:00:00         cpu=1000        4000        
20       cpu2021                    cpu=576  7-00:00:00          cpu=576        4000        
21    backfill24                    cpu=208  1-00:00:00          cpu=208        4000        
22      sherlock                             7-00:00:00                         4000        
23      wdf-zach                             7-00:00:00                         4000        
24     wdf-think                             7-00:00:00                         4000        
25          mtst                             7-00:00:00                                     
26       cpu2022                    cpu=520  7-00:00:00          cpu=520        4000        
27      gpu-a100  DenyOnLimit  cpu=80,gpu=4  1-00:00:00    cpu=160,gpu=8        4000   gpu=1

TRES=Trackable RESources
  PU=Per User

The table lists the partitions and the limits set for each of them.

  • Flags -- settings that control what happens when a job's resource request exceeds a limit: with the DenyOnLimit flag the job is rejected at submission, otherwise it is accepted and left waiting in the queue.
  • MaxTRES -- maximum trackable resources: the maximum amount of resources allowed per job on this partition.
  • MaxWall -- maximum wall time: the longest time a job is allowed to run on this partition.
  • MaxTRESPU -- maximum trackable resources per user: the total amount of resources that all jobs of a single user may use on the partition at the same time. Once the limit is reached, further resource requests have to wait in the queue until some resources are freed by currently running jobs.
  • MaxSubmitPU -- the maximum number of jobs a single user may have submitted to this partition. SLURM will reject any jobs above this limit. Please note that there is also a global limit of 4000 jobs per user for the entire cluster.
  • MinTRES -- the minimum amount of resources a job must request to be accepted. Relevant on GPU partitions, as a job is expected to request at least one GPU to qualify to run there.
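A TRES specification such as cpu=80,gpu=4 in the table is simply a comma-separated list of resource=count pairs. As a minimal sketch, one field can be extracted from such a string with standard tools; the helper name below is made up for illustration and is not an ARC command:

```shell
# Hypothetical helper: extract one resource count from a TRES string
# like "cpu=80,gpu=4" (format as shown in the arc.limits table).
tres_field() {
  # $1 = TRES string, $2 = resource name (e.g. cpu, gpu, node)
  echo "$1" | tr ',' '\n' | awk -F= -v k="$2" '$1 == k { print $2 }'
}

max_tres="cpu=80,gpu=4"     # the gpu-v100 per-job limit from the table
tres_field "$max_tres" gpu  # prints 4
tres_field "$max_tres" cpu  # prints 80
```

This can be handy when scripting checks of a planned job request against a partition's per-job limits.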

Backfill partitions

Some of the partitions on ARC are not for general use: they may be dedicated to a specific task or may belong to a specific group. However, some of the nodes from such partitions may also be assigned to additional auxiliary partitions, whose names end with the -bfXX suffix, such as cpu2019-bf05 and cpu2021-bf24, as they are shown in the output of the arc.hardware command.
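These auxiliary partitions can be spotted in any partition list by filtering on the -bfXX suffix. A minimal sketch, using names taken from the arc.limits table above (the sinfo alternative mentioned in the comment is generic SLURM, not something this page documents):

```shell
# Sample partition names from the arc.limits table above; on a live SLURM
# cluster the list could instead come from `sinfo -h -o %P`.
partitions="normal cpu2019 gpu-v100 razi razi-bf apophis apophis-bf pawson pawson-bf cpu2021"

# Keep only the backfill partitions, i.e. names ending in -bf or -bfNN.
backfill=$(printf '%s\n' $partitions | grep -E -- '-bf[0-9]*$')
echo "$backfill"
```

Note the shorter MaxWall on these partitions in the table above: backfill jobs are limited to a few hours so that the owning group's work is not delayed for long.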

Examples

gpu-v100 partition limits

For example, on the gpu-v100 partition a user may have at most 4000 jobs submitted at a time, and the maximum run time of a job is limited to 24 hours. Each job must request at least 1 GPU, but no more than 4 GPUs and 80 CPUs. The total number of GPUs working for a single user's jobs is limited to 8. Because of the DenyOnLimit flag, a job whose resource request is over the per-job limit will be rejected at submission.
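A job request that fits within these gpu-v100 limits might look like the following batch script. This is a hypothetical sketch: the program name and memory value are assumptions, and only the partition name, GPU, CPU, and time limits come from the table above.

```shell
#!/bin/bash
# Hypothetical gpu-v100 job script staying within the limits shown above:
# at least 1 GPU (MinTRES), at most cpu=80,gpu=4 per job, 24 h MaxWall.
#SBATCH --partition=gpu-v100
#SBATCH --gres=gpu:1          # MinTRES requires gpu>=1; MaxTRES allows up to 4
#SBATCH --cpus-per-task=20    # well under the cpu=80 per-job cap
#SBATCH --time=12:00:00       # under the 1-00:00:00 MaxWall
#SBATCH --mem=16G             # memory value is an assumption, not from the table

./my_gpu_program              # placeholder for the actual workload
```

Requesting more than 4 GPUs or more than 24 hours in such a script would cause the submission to be rejected, per the DenyOnLimit flag on this partition.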