General Cluster Guidelines and Policies: Difference between revisions


Latest revision as of 20:41, 21 September 2023

General Rules

Never run anything related to your research on the Login Node.

You have to make sure that the resources you request for the job are used by the job.
When a job script requests resources from SLURM, those resources are reserved for the job and are not available to other users. Jobs that do not make full use of their allocated resources reduce overall cluster efficiency. It is essential that users request resources that match the requirements of their jobs so that resources are properly used.

Users are expected to monitor their UofC e-mail addresses associated with their accounts, so they can receive communication from system administrators and analysts about the state of the systems and their accounts.
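The rule above about using the resources you request can be checked with a simple ratio: CPU time actually consumed divided by CPU time reserved (cores multiplied by wall-clock time). The sketch below is only an illustration; the function name `cpu_efficiency` and the example numbers are hypothetical and do not come from ARC accounting data.

```python
def cpu_efficiency(cpu_seconds_used, cores_allocated, wall_seconds):
    """Fraction of the reserved CPU time that the job actually used."""
    if cores_allocated <= 0 or wall_seconds <= 0:
        raise ValueError("cores_allocated and wall_seconds must be positive")
    reserved_cpu_seconds = cores_allocated * wall_seconds
    return cpu_seconds_used / reserved_cpu_seconds


# Hypothetical job: 8 cores reserved for 1 hour, but only one core was busy.
serial_on_8_cores = cpu_efficiency(
    cpu_seconds_used=3600, cores_allocated=8, wall_seconds=3600
)
print(f"{serial_on_8_cores:.1%}")  # 12.5%
```

A value well below 100% suggests that the job requested more cores than it could use, and that the request should be reduced.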

Guidelines

Please review the guidelines set out below when using our cluster.

Login Node

The login node should be used only for:

Data management, that is file management, compression / decompression, and, possibly, data transfer.
Job management: job script creation / submission / monitoring.
Software development: Source editing / compilation.
Short data analysis computations that take 100% of 1 CPU for up to 15 minutes.

Everything else should be run on compute nodes either via the sbatch command or in an interactive job via the salloc command. These restrictions are in place to ensure that the login node remains available for other users and is not unnecessarily overburdened.

Short jobs

Jobs on ARC should generally be at least 15 minutes long. Scheduling a node for a new job takes time, and if jobs are too short, the scheduling time becomes similar to or longer than the actual run time of the job, which is very inefficient.

If you expect to run a large number of very short jobs, from seconds to several minutes long, please pack several of those computations into longer jobs, 2-3 hours long. For example, if you have 200000 jobs that run for 30 seconds each, consider running 1000 of these computations inside a single job, which will result in 200 medium-length jobs instead of 200000 very short ones.
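The packing arithmetic can be sketched as follows. The helper names `plan_packing` and `tasks_for_target` are hypothetical, and the sketch assumes the computations inside a packed job run one after another. Note that 1000 computations of 30 seconds each add up to roughly 8.3 hours when run sequentially, so a 2-3 hour target corresponds to about 300 such computations per job.

```python
import math


def plan_packing(n_tasks, seconds_per_task, tasks_per_job):
    """Return (number of packed jobs, estimated hours per packed job)."""
    n_jobs = math.ceil(n_tasks / tasks_per_job)
    hours_per_job = tasks_per_job * seconds_per_task / 3600
    return n_jobs, hours_per_job


def tasks_for_target(seconds_per_task, target_hours):
    """Number of sequential tasks that fit into a job of about target_hours."""
    return max(1, int(target_hours * 3600 // seconds_per_task))


# The example from the text: 200000 tasks of 30 s, 1000 tasks per job.
n_jobs, hours = plan_packing(200_000, 30, 1000)
print(n_jobs, round(hours, 1))  # 200 8.3

# Sizing packed jobs for a 2.5 hour target instead.
per_job = tasks_for_target(30, 2.5)
print(per_job)  # 300
print(plan_packing(200_000, 30, per_job))  # (667, 2.5)
```

Either choice keeps every job well above the 15 minute minimum; the value of `tasks_per_job` only trades the number of jobs against the run time of each one.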

Data Transfer Node

If the cluster has a Data Transfer Node (DTN), please use it rather than the login node to transfer files to/from the cluster.

Interactive Jobs

Interactive jobs can be started using the salloc command and are limited to a maximum of 5 hours.

The reasons for the time restriction on interactive jobs are:

If an interactive job asks for more than 5 hours of run time, it is hardly interactive. Who can stare at the screen for more than 5 hours straight?
Interactive jobs tend to waste resources, as the job does not finish when the computation is done but keeps running until it times out.
The partition setup allows for much quicker resource allocation for jobs that are 5 hours or less, so it is significantly easier to get resources in the default partitions for shorter jobs.

Bigmem Partition

The bigmem partition can be used for general shorter jobs, but it is intended for computations that need lots of memory.

Please avoid running low memory computations on the bigmem partition.

GPU Partitions

The GPU partitions, such as gpu-v100 and gpu-a100, are strictly for computations that utilize GPUs.

Please do not run CPU-only computations on the GPU partitions.

@@ Line 1: / Line 1: @@
 == General Rules ==
 * '''Never''' run anything related to your research on the '''Login Node'''.
 * You have to make sure that the '''resources you request''' for the job are '''used''' by the job. <br />When resources are requested from '''SLURM''' by the job script, the resources are reserved for the job and will not be allocated for other users. Jobs that do not make complete use of the allocated resource reduces the overall cluster efficiency.  It is essential that the user makes resource requests that fit with the requirements of their jobs so that resources are properly used.
+* Users are expected to '''monitor their UofC e-mail addresses''' associated with their accounts, so they can receive communication from system administrators and analysts about the state of the systems and their accounts.
 == Guidelines ==
@@ Line 8: / Line 11: @@
 === Login Node ===
 The login node should be used only for:
-* Data management, that is file management, compression / decompression, and, possibly, data transfer.
+* '''Data management''', that is file management, compression / decompression, and, possibly, data transfer.
-* Job management: job script creation / submission / monitoring.
+* '''Job management''': job script creation / submission / monitoring.
-* Software development: Source editing / compilation.
+* Software '''development''': Source editing / compilation.
-* Short data analysis computations that take 100% of 1 CPU for up to 15 minutes.
+* Short data analysis computations that take 100% of 1 CPU for up to '''15 minutes'''.
 Everything else should be run on compute nodes either via the '''[[Running_jobs#Use_sbatch_to_submit_jobs|sbatch command]]''' or in an interactive job via the '''[[Running_jobs#Interactive_jobs|salloc command]]'''. These restrictions are in place to ensure that the login node remains available for other users and is not unnecessarily overburdened.
@@ Line 19: / Line 22: @@
 Scheduling a node for a new job takes time and if jobs are too short, then scheduling time becomes similar or longer than the actual run time of the job, which is very '''inefficient'''.
-If you are expecting to run a large number of very short jobs, that are from seconds to several minutes long, please '''pack several''' of those computations into one longer jobs, that is 2-3 hours long. For example, if you have 200000 jobs that ran for 30 seconds, consider run 1000 of these computations inside one job, with will result in 200 medium long jobs.
+If you are expecting to run a large number of very short jobs, that are from seconds to several minutes long, please '''pack several''' of those computations into longer jobs, 2-3 hours long. For example, if you have 200000 jobs that ran for 30 seconds, consider running 1000 of these computations inside a single job, which will result in 200 medium long jobs instead of 200000 very short ones.
 === Data Transfer Node ===
-If the cluster has a Data Transfer Node (DTN), please use it rather than the login node to transfer files to/from the cluster.
+If the cluster has a '''Data Transfer Node''' (DTN), please use it rather than the login node to transfer files to/from the cluster.
 === Interactive Jobs ===
-Interactive jobs can be started using the '''[[Running_jobs#Interactive_jobs|salloc command]]''' and are limited to a maximum of 5 hours.
+Interactive jobs can be started using the '''[[Running_jobs#Interactive_jobs|salloc command]]''' and are limited to a '''maximum of 5 hours'''.
 The reason for the time restriction on interactive jobs are:
@@ Line 37: / Line 40: @@
 '''Please avoid running low memory computations on the bigmem partition.'''
-=== gpu-v100 Partition ===
+=== GPU Partitions ===
-The gpu-v100 partition is strictly for computations that utilize GPUs.
+The GPU partitions, such as '''gpu-v100''' and '''gpu-a100''', are strictly for computations that utilize GPUs.
-'''Please do not run CPU-only computations on the gpu-v100 partition.'''
+'''Please do not run CPU-only computations on the GPU partitions.'''
-__NOTOC__
+[[Category:Administration]]
+{{Navbox Administration}}
