How to find current limits on ARC
Latest revision as of 17:19, 21 September 2023
ARC has many partitions, each with different resource limits. It is important to understand these limits when planning a new computation and before submitting a job. For scarcer resources, such as GPU and bigmem nodes, understanding the resource limits and making your jobs as efficient as possible will help increase the throughput of jobs on the cluster.
Resource allocation limits
The current resource limits can be shown by the arc.limits command:
$ arc.limits

    PartitionName  Flags        MaxTRES        MaxWall     MaxTRESPU        MaxSubmitPU  MinTRES
 0  normal                                     7-00:00:00                   4000
 1  cpu2019                     cpu=240        7-00:00:00  cpu=240          4000
 2  gpu-v100       DenyOnLimit  cpu=80,gpu=4   1-00:00:00  cpu=160,gpu=8    4000         gpu=1
 3  single                      cpu=200        7-00:00:00  cpu=200,node=30  4000
 4  razi                                       7-00:00:00                   4000
 5  apophis                                    7-00:00:00                   4000
 6  razi-bf                     cpu=546        05:00:00    cpu=546          4000
 7  apophis-bf                  cpu=280        05:00:00    cpu=280          4000
 8  lattice                     cpu=408        7-00:00:00  cpu=408          4000
 9  parallel                    cpu=624        7-00:00:00  cpu=624          4000
10  bigmem                      cpu=80         1-00:00:00  cpu=80,gpu=1     4000
11  cpu2013                                    7-00:00:00                   4000
12  pawson                                     7-00:00:00                   4000
13  pawson-bf                   cpu=480        05:00:00    cpu=480          4000
14  theia                                      7-00:00:00                   4000
15  theia-bf                    cpu=280        05:00:00                     4000
16  demo                                       7-00:00:00                   4000
17  synergy                                    7-00:00:00                   4000
18  synergy-bf                  cpu=448        05:00:00    cpu=448          4000
19  backfill05                  cpu=1000       05:00:00    cpu=1000         4000
20  cpu2021                     cpu=576        7-00:00:00  cpu=576          4000
21  backfill24                  cpu=208        1-00:00:00  cpu=208          4000
22  sherlock                                   7-00:00:00                   4000
23  wdf-zach                                   7-00:00:00                   4000
24  wdf-think                                  7-00:00:00                   4000
25  mtst                                       7-00:00:00
26  cpu2022                     cpu=520        7-00:00:00  cpu=520          4000
27  gpu-a100       DenyOnLimit  cpu=80,gpu=4   1-00:00:00  cpu=160,gpu=8    4000         gpu=1

TRES=Trackable RESources
PU=Per User
The table lists the partitions and the limits set for each of them.
- Flags -- settings that determine whether a job is accepted or denied when its resource request is over the limit.
- MaxTRES -- maximum trackable resources: the maximum amount of resources allowed per job on this partition.
- MaxWall -- maximum wall time: the longest time a job is allowed to run on this partition.
- MaxTRESPU -- maximum trackable resources per user: the maximum total amount of resources a single user may use on the partition. Once the limit is reached, further resource requests wait in the queue until currently running jobs free some resources.
- MaxSubmitPU -- maximum number of jobs a user may have submitted to this partition. SLURM rejects any jobs above this limit. Note that there is also a global limit of 4000 jobs per user for the entire cluster.
- MinTRES -- minimum amount of resources that a job's resource request must include for the job to be accepted. Relevant on GPU partitions, where jobs are expected to request at least one GPU to qualify to run.
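When comparing a planned job against these columns, it can help to turn the table values into numbers. The following is a minimal Python sketch, not part of ARC's tooling: the function names parse_tres and walltime_to_seconds are hypothetical, and the code assumes only the two formats visible in the table above (comma-separated name=value pairs for TRES columns, and D-HH:MM:SS or HH:MM:SS for MaxWall).

```python
def parse_tres(spec: str) -> dict:
    """Turn a TRES string such as 'cpu=80,gpu=4' into {'cpu': 80, 'gpu': 4}.

    Hypothetical helper; an empty string (blank table cell) yields {}.
    """
    limits = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # blank cell: no limit listed in this column
        name, _, value = item.partition("=")
        limits[name] = int(value)
    return limits


def walltime_to_seconds(spec: str) -> int:
    """Convert a MaxWall value ('7-00:00:00' or '05:00:00') to seconds.

    Assumes only the D-HH:MM:SS and HH:MM:SS forms shown in the table.
    """
    days, _, clock = spec.rpartition("-")  # the days part is optional
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    total_hours = (int(days) if days else 0) * 24 + hours
    return total_hours * 3600 + minutes * 60 + seconds


if __name__ == "__main__":
    print(parse_tres("cpu=160,gpu=8"))
    print(walltime_to_seconds("1-00:00:00") // 3600, "hours")
```

Values are typed in by hand here rather than read from the command, because the exact column layout of the arc.limits output may change between versions.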
Backfill partitions
Some of the partitions on ARC are not for general use. They may be dedicated to a specific task or may belong to a specific group. However, some of the nodes from such partitions may also be assigned to additional auxiliary partitions, whose names end with the -bfXX suffix, such as cpu2019-bf05, cpu2021-bf24, etc., as shown in the output of the arc.hardware command.
Currently, the limits for such partitions are set via two auxiliary partitions, backfill05 and backfill24, instead.
When examining the output of the arc.limits command, to find the current limits for any xxxxxxx-bf05 partition, check the limits of backfill05 instead. Correspondingly, for any xxxxxxx-bf24 partition, use the backfill24 limits.
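The lookup rule for backfill partitions can be sketched in Python as follows. This is an illustration only: the function name limits_partition is hypothetical, and the mapping covers just the two suffixes named above (-bf05 and -bf24). Any other partition name, including names ending in plain -bf such as razi-bf, is returned unchanged because it has its own row in the arc.limits output.

```python
import re

# Suffix digits -> partition whose limits apply (only the two cases described above).
BACKFILL_LIMITS = {"05": "backfill05", "24": "backfill24"}


def limits_partition(partition: str) -> str:
    """Return the partition name to look up in the arc.limits output."""
    match = re.search(r"-bf(\d{2})$", partition)
    if match and match.group(1) in BACKFILL_LIMITS:
        return BACKFILL_LIMITS[match.group(1)]
    return partition  # regular partitions are looked up under their own name


if __name__ == "__main__":
    for name in ("cpu2019-bf05", "cpu2021-bf24", "razi-bf", "gpu-v100"):
        print(name, "->", limits_partition(name))
```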
Examples
gpu-v100 partition limits
For example, the gpu-v100 partition is limited to 4000 jobs per user, and the maximum run time is limited to 24 hours. A job must request at least 1 GPU, but not more than 4 GPUs. The total number of GPUs in use by a single user's jobs is limited to 8. If the resource request of a submitted job is over the limit, the job will be rejected.
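As a rough illustration of these per-job limits, the Python sketch below checks a planned request before submission. It is not an ARC tool: the function name check_gpu_v100_request and the dictionary GPU_V100_LIMITS are hypothetical, and the numbers are copied by hand from the gpu-v100 row of the table (including the cpu=80 per-job limit), so they should be re-checked against the current arc.limits output. The per-user limits (8 GPUs in total, 4000 submitted jobs) are not checked, since they depend on the user's other jobs.

```python
# Per-job limits copied by hand from the gpu-v100 row; verify with arc.limits.
GPU_V100_LIMITS = {
    "min_gpus": 1,     # MinTRES gpu=1
    "max_gpus": 4,     # MaxTRES gpu=4
    "max_cpus": 80,    # MaxTRES cpu=80
    "max_hours": 24,   # MaxWall 1-00:00:00
}


def check_gpu_v100_request(gpus: int, cpus: int, hours: float) -> list:
    """Return a list of problems with a planned request (empty if none found)."""
    lim = GPU_V100_LIMITS
    problems = []
    if gpus < lim["min_gpus"]:
        problems.append(f"request at least {lim['min_gpus']} GPU")
    if gpus > lim["max_gpus"]:
        problems.append(f"request at most {lim['max_gpus']} GPUs")
    if cpus > lim["max_cpus"]:
        problems.append(f"request at most {lim['max_cpus']} CPUs")
    if hours > lim["max_hours"]:
        problems.append(f"run time must not exceed {lim['max_hours']} hours")
    return problems


if __name__ == "__main__":
    print(check_gpu_v100_request(gpus=2, cpus=16, hours=12))   # within limits
    print(check_gpu_v100_request(gpus=5, cpus=100, hours=48))  # over every maximum
```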