This page is intended for users who are looking for guidance on working effectively with the [[wikipedia:Slurm_Workload_Manager|SLURM]] job scheduler on the compute clusters operated by Research Computing Services (RCS). The content on this page assumes that you are already familiar with the basic concepts of job scheduling and job scripts. If you have not worked on a large shared computer cluster before, please read [[What is a scheduler?]] first. To familiarize yourself with how we use computer architecture terminology in relation to SLURM, you may also want to read about the [[Types of Computational Resources]], as well as the cluster guide for the cluster that you are working on. Users familiar with other job schedulers, including PBS/Torque, SGE, LSF, or LoadLeveler, may find [https://slurm.schedmd.com/rosetta.pdf this table of corresponding commands] useful. Finally, if you have never done any kind of parallel computation before, you may find some of the content on [[Parallel Models]] useful as an introduction.


Here is the link to the SLURM manual: https://slurm.schedmd.com/


== Basic HPC Workflow ==


The most basic HPC computation workflow (excluding software setup and data transfer) involves five steps, illustrated in the accompanying figure.


[[File:JobWorkflowv2.svg|thumb|400px|The standard process for executing and optimizing a job script]]


This process begins with the creation of a working preliminary job script that includes an approximate resource request and all of the commands required to complete a computational task of interest. This job script is used as the basis of a job submission to SLURM. The submitted job is then monitored for issues and analyzed for resource utilization. Depending on how well the job utilized resources, modifications to the job script can be made to improve computational efficiency and the accuracy of the resource request. In this way, the job script can be iteratively improved until a reliable job script is developed. As your computational goals and methods change, your job scripts will need to evolve to reflect these changes and this workflow can be used again to make appropriate modifications.  


== Job Design and Resource Estimation ==


The goal of this first step is the development of a SLURM job script for the calculations that you wish to run, one that reflects real resource usage in a job typical of your work. To review some example job scripts for different models of parallel computation, please consult the page [[Sample Job Scripts]]. We will look at the specific example of using the NumPy library in Python to multiply two matrices that are read in from two files. We assume that the files have been transferred to a subdirectory of the user's home directory, <code>~/project/matmul</code>, and that a custom Python installation has been set up under <code>~/anaconda3</code>.


<source lang="console">
$ ls -lh ~/project/matmul
total 4.7G
-rw-rw-r-- 1 username username 2.4G Feb  3 10:13 A.csv
-rw-rw-r-- 1 username username 2.4G Feb  3 10:16 B.csv
$ ls ~/anaconda3/bin/python
/home/username/anaconda3/bin/python
</source> 


The computational plan for this project is to load the data from the two files, use the multithreaded version of the NumPy <code>dot</code> function to perform the matrix multiplication, and write the result back out to a file. To test our proposed steps and software environment, we will request a small ''interactive job'' with the <code>salloc</code> command. The <code>salloc</code> command accepts all of the standard resource request parameters for SLURM (the key system resources that can be requested are reviewed on the [[Types of Computational Resources]] page). In our case, we will need to specify a partition, a number of cores, a memory allocation, and a small time interval in which we will do our testing work (on the order of 1-4 hours). We begin by checking which partitions are currently not busy using the <code>arc.nodes</code> command:


<source lang="console">
$ arc.nodes


          Partitions: 16 (apophis, apophis-bf, bigmem, cpu2013, cpu2017, cpu2019, gpu-v100, lattice, parallel, pawson, pawson-bf, razi, razi-bf, single, theia, theia-bf)


      ====================================================================================
                     | Total  Allocated  Down  Drained  Draining  Idle  Maint  Mixed
      ------------------------------------------------------------------------------------
             apophis |    21          1     1        0         0     7      1     11
          apophis-bf |    21          1     1        0         0     7      1     11
              bigmem |     2          0     0        0         0     1      0      1
             cpu2013 |    14          1     0        0         0     0      0     13
             cpu2017 |    16          0     0        0         0     0     16      0
             cpu2019 |    40          3     1        2         1     0      4     29
            gpu-v100 |    13          2     0        1         0     5      0      5
             lattice |   279         34     3       72         0    47      0    123
            parallel |   576        244    26       64         1    20      3    218
              pawson |    13          1     1        2         0     9      0      0
           pawson-bf |    13          1     1        2         0     9      0      0
                razi |    41         26     4        1         0     8      0      2
             razi-bf |    41         26     4        1         0     8      0      2
              single |   168          0    14       31         0    43      0     80
               theia |    20          0     0        0         1     0      0     19
            theia-bf |    20          0     0        0         1     0      0     19
      ------------------------------------------------------------------------------------
       logical total |  1298        340    56      176         4   164     25    533
                     |
      physical total |  1203        312    50      173         3   140     24    501


</source>


The [[ARC Cluster Guide]] tells us that the <code>single</code> partition consists of older nodes with 12 GB of memory and 8 cores each; since our data set is only about 5 GB and matrix multiplication is usually fairly memory efficient, it is probably a good fit. Based on the <code>arc.nodes</code> output, we can choose the <code>single</code> partition (which had 43 idle nodes at the time this was run) for our test. The preliminary resource request that we will use for the interactive job is:
<pre>
partition=single
time=2:0:0
nodes=1
ntasks=1
cpus-per-task=8
mem=0
</pre>


We are requesting 2 hours, 0 minutes, and 0 seconds, a whole node (i.e. all 8 cores), and <code>mem=0</code>, which means that all of the memory on the node is being requested. <u>Please note that <code>mem=0</code> can only be used in a memory request when you are already requesting all of the CPUs on a node.</u> On the <code>single</code> partition, 8 CPUs is the full number of CPUs per node; please check this for the partition that you plan to run on using <code>arc.hardware</code> before using <code>mem=0</code>. From the login node, we request this job using <code>salloc</code>:


<source lang="console">
[username@arc ~]$ salloc --partition=single --time=2:0:0 --nodes=1 --ntasks=1 --cpus-per-task=8 --mem=8gb
salloc: Pending job allocation 8430460
salloc: job 8430460 queued and waiting for resources
salloc: job 8430460 has been allocated resources
salloc: Granted job allocation 8430460
salloc: Waiting for resource configuration
salloc: Nodes cn056 are ready for job
[username@cn056 ~]$ export PATH=~/anaconda3/bin:$PATH
[username@cn056 ~]$ which python
~/anaconda3/bin/python
[username@cn056 ~]$ export OMP_NUM_THREADS=4
[username@cn056 ~]$ export OPENBLAS_NUM_THREADS=4
[username@cn056 ~]$ export MKL_NUM_THREADS=4
[username@cn056 ~]$ export VECLIB_MAXIMUM_THREADS=4
[username@cn056 ~]$ export NUMEXPR_NUM_THREADS=4
[username@cn056 ~]$ python
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
</source>
The steps taken at the command line included:
<ol>
  <li>submitting an interactive job request and waiting for the scheduler to allocate the node (cn056) and move the terminal to that node</li>
  <li>setting up our software environment using environment variables (prepending <code>~/anaconda3/bin</code> to <code>PATH</code> and limiting the number of threads used by NumPy's backend libraries)</li>
  <li>starting a Python interpreter</li>
</ol>


At this point, we are ready to test our simple Python code at the interpreter's command line.


<source lang="console">
>>> import numpy as np
>>> A=np.loadtxt("/home/username/project/matmul/A.csv",delimiter=",")
>>> B=np.loadtxt("/home/username/project/matmul/B.csv",delimiter=",")
>>> C=np.dot(A,B)
>>> np.savetxt("/home/username/project/matmul/C.csv",C,delimiter=",")
>>> quit()
[username@cn056 ~]$ exit
exit
salloc: Relinquishing job allocation 8430460
</source> 


In a typical testing/debugging/prototyping session like this, some problems would be identified and fixed along the way. Here we have only shown the successful steps of a simple job; normally this would be an opportunity to identify syntax errors, data type mismatches, out-of-memory exceptions, and other issues in an interactive environment. The data used for a test should be a manageable size. In this sample, we are working with randomly generated 10000x10000 matrices and discussing them as though they were the actual data for the job that will run. Often, however, the real problem is too large for an interactive job and it is better to test the commands interactively using a small data subset. For example, randomly generated 10000x10000 matrices might be a good starting point for planning a real job that will take the product of two 1000000x1000000 matrices.
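
As an aside, test data of this kind is easy to produce. The short script below is a minimal sketch (the file names and the <code>/home/username/project/matmul</code> location follow the example above) of one way the random test matrices could be generated; written as CSV, each 10000x10000 matrix comes to roughly the 2.4 GB seen in the directory listing earlier.

<source lang="python">
# make_test_data.py -- illustrative sketch for generating the random test matrices.
import numpy as np

n = 10000                     # matrix dimension used in this example
A = np.random.rand(n, n)      # uniform random values in [0, 1)
B = np.random.rand(n, n)

# Write as comma-separated text, the format read back by np.loadtxt above.
np.savetxt("/home/username/project/matmul/A.csv", A, delimiter=",")
np.savetxt("/home/username/project/matmul/B.csv", B, delimiter=",")
</source>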


While this interactive job was running, we used a separate terminal on ARC to check the resource utilization of the interactive job by running the command <code>arc.job-info 8430460</code>, and found that the CPU utilization hovered around 50% and memory utilization around 30%. We will come back to this point in later steps of the workflow, but this is a warning sign that the resource request is a bit high, especially in CPUs. It is worth noting that if you have run your code before on a different computer, you can base your initial SLURM script on that past experience, or on the resources recommended in the software documentation, instead of testing it out explicitly with <code>salloc</code>. However, as the hardware and software installation are likely different, you may run into new challenges that need to be resolved. In the next step, we will write a SLURM script that includes all of the commands we ran manually in the above example.


== Job Submission ==


In order to use the full resources of a computing cluster, it is important to do your large-scale work with ''batch jobs''. This amounts to replacing the sequentially typed commands (or successively run notebook cells) with a shell script that automates the entire process, and using the <code>sbatch</code> command to submit that shell script as a job to the job scheduler. Please note: '''you must not run this script directly on the login node'''; it must be submitted to the job scheduler.


Simply putting all of our commands in a single shell script file (with proper <code>#SBATCH</code> directives for the resource request) yields the following unsatisfactory result:
<source lang="bash">
#!/bin/bash
#SBATCH --partition=single
#SBATCH --time=2:0:0
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=8gb


export PATH=~/anaconda3/bin:$PATH
echo $(which python)
export OMP_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=4
export MKL_NUM_THREADS=4
export VECLIB_MAXIMUM_THREADS=4
export NUMEXPR_NUM_THREADS=4


python matmul_test.py
</source>


where '''matmul_test.py''':
<source lang="python">
import numpy as np


A=np.loadtxt("/home/username/project/matmul/A.csv",delimiter=",")
B=np.loadtxt("/home/username/project/matmul/B.csv",delimiter=",")
C=np.dot(A,B)
np.savetxt("/home/username/project/matmul/C.csv",C,delimiter=",")
</source>  


There are a number of things to improve here when moving to a batch job. First, we don't need to submit the job exclusively to the <code>single</code> partition; we can expand our partition list a bit and the job should be able to start sooner. However, we do then need to be correspondingly more precise with the rest of our request. We should also consider generalizing our script a bit to improve reusability between jobs, instead of hard-coding paths. This yields:


'''matmul_test02032021.slurm:'''
<source lang="bash">
#!/bin/bash
#SBATCH --partition=single,lattice,parallel,pawson-bf
#SBATCH --time=2:0:0  
#SBATCH --nodes=1  
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=10000M


export PATH=~/anaconda3/bin:$PATH
echo $(which python)
export OMP_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=4
export MKL_NUM_THREADS=4
export VECLIB_MAXIMUM_THREADS=4
export NUMEXPR_NUM_THREADS=4


AMAT="/home/username/project/matmul/A.csv"
BMAT="/home/username/project/matmul/B.csv"
OUT="/home/username/project/matmul/C.csv"


python matmul_test.py $AMAT $BMAT $OUT
</source>


where '''matmul_test.py''':
<source lang="python">
import numpy as np
import sys


# parse arguments: three file paths are required (A, B, output)
if len(sys.argv) < 4:
    raise Exception("usage: matmul_test.py A_PATH B_PATH OUT_PATH")


lmatrixpath=sys.argv[1]
rmatrixpath=sys.argv[2]
outpath=sys.argv[3]


A=np.loadtxt(lmatrixpath,delimiter=",")
B=np.loadtxt(rmatrixpath,delimiter=",")
C=np.dot(A,B)
np.savetxt(outpath,C,delimiter=",")
</source>


We have expanded the partition list to include other partitions (the specific consequences of this choice will be discussed later) and made the CPU and memory requests correspondingly more precise. Finally, we have generalized the way that data sources and output locations are specified in the job; in the future, this will help us reuse this script without a proliferation of Python scripts. Similar considerations of reusability apply to any language and any job script. In general, due to the complexity of using them, HPC systems are not ideal for running a couple of small calculations; they are at their best when you have to solve many variations on a problem at a large scale. As such, it is valuable to always be thinking about how to make your workflows simpler and easier to modify.
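
As a further, purely illustrative refinement, the path variables could be made overridable at submission time. If the assignments in the script were changed to shell parameter defaults, for example <code>AMAT="${AMAT:-/home/username/project/matmul/A.csv}"</code>, the same script could be pointed at a different matrix pair without editing it:

<source lang="bash">
# Hypothetical reuse of the job script for another (made-up) data location.
# --export=ALL keeps the usual environment and additionally exports the listed
# variables into the job, where the modified script picks them up as defaults.
sbatch --export=ALL,AMAT=$HOME/data/run2/A.csv,BMAT=$HOME/data/run2/B.csv,OUT=$HOME/data/run2/C.csv \
       matmul_test02032021.slurm
</source>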


Now that we have a job script, we can simply submit it with <code>sbatch</code>:


<source lang="console">
[username@arc matmul]$ sbatch matmul_test02032021.slurm
Submitted batch job 8436364
</source>


The number reported after the sbatch job submission is the JobID and will be used in all further analysis and monitoring. With this number, we can begin checking how our job progresses through the scheduling system.
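
For example, a quick way to confirm that the job has been accepted and to see its current state is to pass the JobID to <code>squeue</code> (a generic SLURM command; the output below is illustrative, not captured from a real run):

<source lang="console">
$ squeue -j 8436364
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           8436364    single matmul_t username PD       0:00      1 (Priority)
</source>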


== Job Monitoring ==


The third step in this workflow is checking the progress of the job. We will emphasize three tools (example notification directives are shown after this list):
<ol>
  <li><code>--mail-user</code> and <code>--mail-type</code>
    <ol>
      <li>sbatch options for sending email about job progress</li>
      <li>no performance information</li>
    </ol>
  </li>
  <li><code>squeue</code>
    <ol>
      <li>SLURM tool for monitoring job start, running, and end</li>
      <li>no performance information</li>
    </ol>
  </li>
  <li><code>arc.job-info</code>
    <ol>
      <li>RCS tool for monitoring job performance once it is running</li>
      <li>provides a detailed snapshot of key performance information</li>
    </ol>
  </li>
</ol>
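
For the first of these, a couple of directives added to the job script are enough to have SLURM send email as the job progresses. A minimal sketch (the address is a placeholder; <code>--mail-type=ALL</code> may be used instead of a list):

<source lang="bash">
# Hypothetical notification directives; replace the address with your own.
#SBATCH --mail-user=username@ucalgary.ca
#SBATCH --mail-type=BEGIN,END,FAIL
</source>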


The motives for using the different monitoring tools are very different, but they all help you track the progress of your job from start to finish. The details of these tools applied to the above example are discussed in the article [[Job Monitoring]]. If a job is determined to be incapable of running, or is found to have a serious error in it, you can cancel it at any time using the command <code>scancel JobID</code>, for example <code>scancel 8442969</code>.


Further analysis of aggregate and maximum resource utilization is needed to determine changes that need to be made to the job script and the resource request. This is the focus of the Job Performance Analysis step that is the subject of the next section.


== Job Performance Analysis ==


There are a tremendous number of performance analysis tools, from complex instruction-level analysis tools like gprof, valgrind, and nvprof to extremely simple tools that only report average utilization. The minimal tools required for characterizing a job are:


<ol>
  <li><code>\time</code> for estimating CPU utilization on a single node</li>
  <li><code>sacct -j JobID -o JobID,MaxRSS,Elapsed</code> for analyzing maximum memory utilization and wall time</li>
  <li><code>seff JobID </code> for analyzing general resource utilization</li>
</ol>
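
For example, once the job above has completed, a quick check of what it actually used might look like the following (the numbers shown here are illustrative, not measured output). Inside the job script itself, wrapping the main command as <code>\time -v python matmul_test.py ...</code> would similarly report the CPU time and peak memory of a single run.

<source lang="console">
$ sacct -j 8436364 -o JobID,MaxRSS,Elapsed
       JobID     MaxRSS    Elapsed
------------ ---------- ----------
8436364                   00:21:14
8436364.bat+   4503124K   00:21:14
$ seff 8436364
Job ID: 8436364
Cluster: arc
Cores: 4
CPU Utilized: 00:44:20
CPU Efficiency: 52.19% of 01:24:56 core-walltime
Job Wall-clock time: 00:21:14
Memory Utilized: 4.29 GB
Memory Efficiency: 43.95% of 9.77 GB
</source>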


When utilizing these tools, it is important to consider the different sources of variability in job performance, and specifically to perform a detailed analysis of


<ol>  
  <li>Variability Among Datasets</li>
  <li>Variability Across Hardware</li>
  <li>Request Parameter Estimation from Samples Sets - Scaling Analysis</li>
</ol>


and how these can be addressed through thoughtful job design. This is the subject of the article [[Job Performance Analysis and Resource Planning]].


== Revisiting the Resource Request ==


Having completed a careful analysis of performance for our problem and the hardware that we plan to run on, we are now in a position to revise the resource request and submit more jobs. If you have 10000 such jobs to run, it is important to think about how busy the resources you plan to use are likely to be. It is also important to remember that every job has some finite scheduling overhead (time to set up the environment to run in, time to figure out where the job fits optimally, and so on). To make this scheduling overhead worthwhile, it is generally important for a job to run for more than 10 minutes, so we may want to pack batches of 10-20 matrix pairs into one job just to make it a sensible amount of work. This also helps make job duration more predictable when there is variability due to data sets. New hardware also tends to be busier than old hardware, and cpu2019 will often have 24-hour wait times when the cluster is busy. However, for jobs that can stay under 5 hours, the backfill partitions are often not busy, so the best approach might be to bundle 20 tasks together for a duration of about 100 minutes and then have the job request be:


<source lang="bash">
#!/bin/bash
#SBATCH --partition=pawson-bf,razi-bf,apophis-bf,cpu2019
#SBATCH --time=1:40:0
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=5000M
#SBATCH --array=1-500


export PATH=~/anaconda3/bin:$PATH
echo $(which python)
export OMP_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=4
export MKL_NUM_THREADS=4
export VECLIB_MAXIMUM_THREADS=4
export NUMEXPR_NUM_THREADS=4


export TOP_DIR="/home/username/project/matmul/pairs"$SLURM_ARRAY_TASK_ID


DIR_LIST=$(find $TOP_DIR -mindepth 1 -type d)
for DIR in $DIR_LIST
do
python matmul_test.py $DIR/A.csv $DIR/B.csv $DIR/C.csv;
done
   
</source>


Here the directory <code>TOP_DIR</code> will be one of 500 directories, each named <code>pairsN</code> where N is an integer from 1 to 500, and each holding 20 subdirectories with a pair of A and B matrix files in each. The script iterates through the 20 subdirectories and runs the Python script once for each pair of matrices. This is one approach to organizing work into bundles that can then be submitted as a job array. The choice depends on a lot of assumptions that we have made, but it is a fairly good plan. The result would be 500 jobs, each roughly 2 hours long, running mostly on the backfill partitions on 1 core apiece. If there is usually at least 1 node free, that will be about 38 jobs running at the same time, for a total of roughly 13 sequences that are 2 hours long, or 26 hours. In this way, all of the work could be done in about a day. Typically, we would iteratively improve the script until it was something that could scale up; this is why the workflow is a loop.
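
How the <code>pairsN</code> bundles get created is outside the scope of the job script itself, but as a sketch (all paths here are hypothetical), a flat collection of pair directories could be grouped into bundles of 20 with a few lines of shell:

<source lang="bash">
#!/bin/bash
# Illustrative helper: group a flat set of pair directories into bundles of 20,
# producing the pairs1 ... pairs500 layout assumed by the array job above.
SRC="/home/username/project/matmul/all_pairs"    # hypothetical flat directory of pair subdirectories
DEST="/home/username/project/matmul"

i=0
for PAIR in "$SRC"/*/ ; do
    BUNDLE=$(( i / 20 + 1 ))                     # 20 pairs per bundle
    mkdir -p "$DEST/pairs$BUNDLE"
    mv "$PAIR" "$DEST/pairs$BUNDLE/"
    i=$(( i + 1 ))
done
</source>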


== Tips and tricks ==


=== Attaching to a running job ===


It is possible to connect to the node running a job and execute new processes there. You might want to do this for troubleshooting or to monitor the progress of a job.


Suppose you want to run the utility <code>nvidia-smi</code> to monitor GPU usage on a node where you have a job running. The following command runs <code>watch</code> on the node assigned to the given job, which in turn runs <code>nvidia-smi</code> every 30 seconds, displaying the output on your terminal.
  $ srun --jobid 123456 --pty watch -n 30 nvidia-smi


It is possible to launch multiple monitoring commands using [https://en.wikipedia.org/wiki/Tmux <code>tmux</code>]. The following command launches <code>htop</code> and <code>nvidia-smi</code> in separate panes to monitor the activity on a node assigned to the given job.
  $ srun --jobid 123456 --pty tmux new-session -d 'htop -u $USER' \; split-window -h 'watch nvidia-smi' \; attach


'''Note:''' The <code>srun</code> commands shown above work only to monitor a job submitted with <code>sbatch</code>. To monitor an interactive job, create multiple panes with <code>tmux</code> and start each process in its own pane.


It is possible to '''attach a second terminal''' to a running interactive job with:
  $ srun -s --pty --jobid=12345678 /bin/bash


The '''-s''' option specifies that the resources can be '''oversubscribed'''. It is needed because all of the resources have already been given to the first shell; to run the second shell we either have to wait until the first shell is over, or we have to oversubscribe.


== Further reading ==
* Here is a rather minimal text tutorial from [http://www.brightcomputing.com/blog/bid/174099/slurm-101-basic-slurm-usage-for-linux-clusters Bright Computing]


[[Category:Slurm]]

Revision as of 17:49, 19 January 2023

This page is intended for users who are looking for guidance on working effectively with the SLURM job scheduler on the compute clusters that are operated by Research Computing Services (RCS). The content on this page assumes that you already familiar with the basic concepts of job scheduling and job scripts. If you have not worked on a large shared computer cluster before, please read What is a scheduler? first. To familiarize yourself with how we use computer architecture terminology in relation to SLURM, you may also want to read about the Types of Computational Resources, as well as the cluster guide for the cluster that you are working on. Users familiar with other job schedulers including PBS/Torque, SGE, LSF, or LoadLeveler, may find this table of corresponding commands useful. Finally, if you have never done any kind of parallel computation before, you may find some of the content on Parallel Models useful as an introduction.


Here is the link to the SLURM manual: https://slurm.schedmd.com/

Basic HPC Workflow

The most basic HPC computation workflow (excluding software setup and data transfer) involves 5 steps.

The standard process for executing and optimizing a job script


This process begins with the creation of a working preliminary job script that includes an approximate resource request and all of the commands required to complete a computational task of interest. This job script is used as the basis of a job submission to SLURM. The submitted job is then monitored for issues and analyzed for resource utilization. Depending on how well the job utilized resources, modifications to the job script can be made to improve computational efficiency and the accuracy of the resource request. In this way, the job script can be iteratively improved until a reliable job script is developed. As your computational goals and methods change, your job scripts will need to evolve to reflect these changes and this workflow can be used again to make appropriate modifications.

Job Design and Resource Estimation

The goal of this first step is the development of a slurm job script for the calculations that you wish to run that reflects real resource usage in a job typical of your work. To review some example jobs scripts for different models of parallel computation, please consult the page Sample Job Scripts. We will look at the specific example of using the NumPy library in python to multiply two matrices that are read in from two files. We assume that the files have been transferred to a subdirectory of the users home directory ~/projects/matmul and that a custom python installation has been setup under ~/anaconda3.

$ ls -lh ~/project/matmul
total 4.7G
-rw-rw-r-- 1 username username 2.4G Feb  3 10:13 A.csv
-rw-rw-r-- 1 username username 2.4G Feb  3 10:16 B.csv
$ ls ~/anaconda3/bin/python
/home/username/anaconda3/bin/python

The computational plan for this project is to load the data from files and then use the multithreaded version of the NumPy dot function to perform the matrix multiplication and write it back out to a file. To test out our proposed steps and software environment, we will request a small Interactive Job. This is done with the salloc command. The salloc command accepts all of the standard resource request parameters for SLURM. (The key system resources that can be requested can be reviewed on the Types of Computational Resources page) In our case, we will need to specify a partition, a number of cores, a memory allocation, and a small time interval in which we will do our testing work (on the order of 1-4 hours). We begin by checking which partitions are currently not busy using the arc.nodes command

$ arc.nodes

          Partitions: 16 (apophis, apophis-bf, bigmem, cpu2013, cpu2017, cpu2019, gpu-v100, lattice, parallel, pawson, pawson-bf, razi, razi-bf, single, theia, theia-bf)


      ====================================================================================
                     | Total  Allocated  Down  Drained  Draining  Idle  Maint  Mixed
      ------------------------------------------------------------------------------------
             apophis |    21          1     1        0         0     7      1     11 
          apophis-bf |    21          1     1        0         0     7      1     11 
              bigmem |     2          0     0        0         0     1      0      1 
             cpu2013 |    14          1     0        0         0     0      0     13 
             cpu2017 |    16          0     0        0         0     0     16      0 
             cpu2019 |    40          3     1        2         1     0      4     29 
            gpu-v100 |    13          2     0        1         0     5      0      5 
             lattice |   279         34     3       72         0    47      0    123 
            parallel |   576        244    26       64         1    20      3    218 
              pawson |    13          1     1        2         0     9      0      0 
           pawson-bf |    13          1     1        2         0     9      0      0 
                razi |    41         26     4        1         0     8      0      2 
             razi-bf |    41         26     4        1         0     8      0      2 
              single |   168          0    14       31         0    43      0     80 
               theia |    20          0     0        0         1     0      0     19 
            theia-bf |    20          0     0        0         1     0      0     19 
      ------------------------------------------------------------------------------------
       logical total |  1298        340    56      176         4   164     25    533 
                     |
      physical total |  1203        312    50      173         3   140     24    501

The ARC Cluster Guide tells us that the single partition is an old partition with 12 GB of memory (our data set is only about 5 GB and matrix multiplication is usually fairly memory efficient) and 8 cores per node, so it is probably a good fit. Based on the arc.nodes output, we can choose the single partition (which had 43 idle nodes at the time this was run) for our test. The preliminary resource request that we will use for the interactive job is:

partition=single
time=2:0:0 
nodes=1 
ntasks=1 
cpus-per-task=8 
mem=0

We are requesting 2 hours, 0 minutes, and 0 seconds, a single whole node (i.e. all 8 cores), and mem=0, which means that all of the memory on the node is being requested. Please note that mem=0 can only be used in a memory request when you are already requesting all of the CPUs on a node. On the single partition, 8 CPUs is the full number of CPUs per node; please check this for the partition that you plan to run on (for example, with arc.hardware) before using mem=0. From the login node, we request this job using salloc:

[username@arc ~]$ salloc --partition=single --time=2:0:0 --nodes=1 --ntasks=1 --cpus-per-task=8 --mem=0
salloc: Pending job allocation 8430460
salloc: job 8430460 queued and waiting for resources
salloc: job 8430460 has been allocated resources
salloc: Granted job allocation 8430460
salloc: Waiting for resource configuration
salloc: Nodes cn056 are ready for job
[username@cn056 ~]$ export PATH=~/anaconda3/bin:$PATH
[username@cn056 ~]$ which python
~/anaconda3/bin/python
[username@cn056 ~]$ export OMP_NUM_THREADS=4
[username@cn056 ~]$ export OPENBLAS_NUM_THREADS=4
[username@cn056 ~]$ export MKL_NUM_THREADS=4
[username@cn056 ~]$ export VECLIB_MAXIMUM_THREADS=4
[username@cn056 ~]$ export NUMEXPR_NUM_THREADS=4
[username@cn056 ~]$ python
Python 3.7.4 (default, Aug 13 2019, 20:35:49) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

The steps taken at the command line included:

  1. submitting an interactive job request and waiting for the scheduler to allocate the node (cn056) and move the terminal to that node
  2. setting up our software environment using Environment Variables
  3. starting a python interpreter

At this point, we are ready to test our simple python code at the python interpreter command line.

>>> import numpy as np
>>> A=np.loadtxt("/home/username/project/matmul/A.csv",delimiter=",")
>>> B=np.loadtxt("/home/username/project/matmul/B.csv",delimiter=",")
>>> C=np.dot(A,B)
>>> np.savetxt("/home/username/project/matmul/C.csv",C,delimiter=",")
>>> quit()
[username@cn056 ~]$ exit
exit
salloc: Relinquishing job allocation 8430460

In a typical testing/debugging/prototyping session like this, some problems would be identified and fixed along the way. Here we have only shown the successful steps of a simple job; normally this would be an opportunity to identify syntax errors, data type mismatches, out-of-memory exceptions, and other issues in an interactive environment. The data used for a test should be of a manageable size. In the sample data, I am working with randomly generated 10000 x 10000 matrices and discussing them as though they were the actual job that would run. However, often the real problem is too large for an interactive job and it is better to test the commands interactively using a small data subset. For example, randomly generated 10000 x 10000 matrices might be a good starting point for planning a real job that was going to take the product of two 1 million x 1 million matrices.
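To make the idea of testing on a subset concrete, here is a minimal sketch (not part of the original session; it relies on the standard max_rows argument of numpy.loadtxt) that loads only a 1000 x 1000 block of each matrix during an interactive test:

# interactive smoke test on a small slice of the full data set
import numpy as np

# load only the first 1000 rows of each file, then keep the first 1000 columns
A_small = np.loadtxt("/home/username/project/matmul/A.csv", delimiter=",", max_rows=1000)[:, :1000]
B_small = np.loadtxt("/home/username/project/matmul/B.csv", delimiter=",", max_rows=1000)[:, :1000]

C_small = np.dot(A_small, B_small)
print(C_small.shape)   # expect (1000, 1000)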

While this interactive job was running, I used a separate terminal on ARC to check the resource utilization of the interactive job by running the command arc.job-info 8430460, and found that the CPU utilization hovered around 50% and the memory utilization around 30%. We will come back to this point in later steps of the workflow, but this would be a warning sign that the resource request is a bit high, especially in CPUs. It is worth noting that if you have run your code before on a different computer, you can base your initial SLURM script on that past experience, or on the resources recommended in the software documentation, instead of testing it out explicitly with salloc. However, as the hardware and software installation are likely different, you may run into new challenges that need to be resolved. In the next step, we will write a SLURM script that includes all of the commands we ran manually in the above example.

Job Submission

In order to use the full resources of a computing cluster, it is important to do your large-scale work with Batch Jobs. This amounts to replacing sequentially typed commands (or successively run cells in a notebook) with a shell script that automates the entire process, and using the sbatch command to submit the shell script as a job to the job scheduler. Please note that you must not run this script directly on the login node; it must be submitted to the job scheduler.

Simply putting all of our commands in a single shell script file (with proper #SBATCH directives for the resource request) yields the following unsatisfactory result:

#!/bin/bash
#SBATCH --partition=single 
#SBATCH --time=2:0:0 
#SBATCH --nodes=1 
#SBATCH --ntasks=1 
#SBATCH --cpus-per-task=8 
#SBATCH --mem=8gb

export PATH=~/anaconda3/bin:$PATH
echo $(which python)
export OMP_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=4
export MKL_NUM_THREADS=4
export VECLIB_MAXIMUM_THREADS=4
export NUMEXPR_NUM_THREADS=4

python matmul_test.py

where matmul_test.py:

import numpy as np

A=np.loadtxt("/home/username/project/matmul/A.csv",delimiter=",")
B=np.loadtxt("/home/username/project/matmul/B.csv",delimiter=",")
C=np.dot(A,B)
np.savetxt("/home/username/project/matmul/C.csv",C,delimiter=",")

There are a number of things to improve here when moving to a batch job. First, we don't need to submit the job exclusively to the single partition. We can expand our partition list a bit and the job should be able to start sooner. However, we do need to be correspondingly more precise with the rest of our request. We should also consider generalizing our script a bit to improve reusability between jobs, instead of hard coding paths. This yields:

matmul_test02032021.slurm:

#!/bin/bash
#SBATCH --partition=single,lattice,parallel,pawson-bf 
#SBATCH --time=2:0:0 
#SBATCH --nodes=1 
#SBATCH --ntasks=1 
#SBATCH --cpus-per-task=4 
#SBATCH --mem=10000M

export PATH=~/anaconda3/bin:$PATH
echo $(which python)
export OMP_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=4
export MKL_NUM_THREADS=4
export VECLIB_MAXIMUM_THREADS=4
export NUMEXPR_NUM_THREADS=4

AMAT="/home/username/project/matmul/A.csv"
BMAT="/home/username/project/matmul/B.csv"
OUT="/home/username/project/matmul/C.csv"

python matmul_test.py $AMAT $BMAT $OUT

where matmul_test.py:

import numpy as np
import sys

#parse arguments (script name plus three paths expected)
if len(sys.argv) < 4:
    raise Exception("usage: matmul_test.py <A_path> <B_path> <OUT_path>")

lmatrixpath=sys.argv[1]
rmatrixpath=sys.argv[2]
outpath=sys.argv[3]

A=np.loadtxt(lmatrixpath,delimiter=",")
B=np.loadtxt(rmatrixpath,delimiter=",")
C=np.dot(A,B)
np.savetxt(outpath,C,delimiter=",")

We have expanded the partition list to include other partitions (the specific consequences of this choice will be discussed later), and we have made the CPU and memory requests correspondingly more precise. Finally, we have generalized the way that data sources and output locations are specified in the job. In the future, this will help us reuse this script without a proliferation of Python scripts. Similar considerations of reusability apply to any language and any job script. In general, due to the complexity of using them, HPC systems are not ideal for running a couple of small calculations; they are at their best when you have to solve many variations of a problem at a large scale. As such, it is valuable to always be thinking about how to make your workflows simpler and easier to modify.
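One optional way to push this generalization a little further (a sketch, not part of the original script) is to give the paths default values that can be overridden at submission time with sbatch's --export option, so the same job script can be reused for a different matrix pair without editing it. Replacing the three hard-coded path assignments and the final python line of the job script above with:

# use caller-supplied paths if they were exported, otherwise fall back to the defaults
AMAT="${AMAT:-/home/username/project/matmul/A.csv}"
BMAT="${BMAT:-/home/username/project/matmul/B.csv}"
OUT="${OUT:-/home/username/project/matmul/C.csv}"

python matmul_test.py "$AMAT" "$BMAT" "$OUT"

would allow a submission such as sbatch --export=ALL,AMAT=$HOME/project/matmul/A2.csv,BMAT=$HOME/project/matmul/B2.csv,OUT=$HOME/project/matmul/C2.csv matmul_test02032021.slurm (the A2/B2/C2 file names here are purely illustrative).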

Now that we have a job script, we can simply submit it to sbatch:

[username@arc matmul]$ sbatch matmul_test02032021.slurm
Submitted batch job 8436364

The number reported after the sbatch job submission is the JobID and will be used in all further analyses and monitoring. With this number, we can begin the process of checking on how our job progresses through the scheduling system.

Job Monitoring

The third step in this workflow is checking the progress of the job. We will emphasize three tools:

  1. --mail-user, --mail-type
    1. sbatch options for sending email about job progress
    2. no performance information
  2. squeue
    1. slurm tool for monitoring job start, running, and end
    2. no performance information
  3. arc.job-info
    1. RCS tool for monitoring job performance once it is running
    2. provides detailed snapshot of key performance information


The motives for using the different monitoring tools are quite different, but they all help you track the progress of your job from start to finish. The details of these tools applied to the above example are discussed in the article Job Monitoring. If a job is determined to be incapable of running or to have a serious error in it, you can cancel it at any time using the command scancel JobID, for example scancel 8442969.
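As a brief hedged sketch of how these three tools might be used for the job submitted above (the email address is a placeholder; the --mail-type values and squeue options are standard SLURM):

# 1. Email notifications: add these directives to the job script before submitting
#SBATCH --mail-user=username@ucalgary.ca
#SBATCH --mail-type=BEGIN,END,FAIL

# 2. Check the queue state of this job, or of all of your jobs, from a login node
squeue -j 8436364
squeue -u $USER

# 3. Snapshot of CPU and memory utilization once the job is running (RCS tool)
arc.job-info 8436364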

Further analysis of aggregate and maximum resource utilization is needed to determine changes that need to be made to the job script and the resource request. This is the focus of the Job Performance Analysis step that is the subject of the next section.

Job Performance Analysis

There are a tremendous number of performance analysis tools, from complex instruction-level analysis tools like gprof, valgrind, and nvprof to extremely simple tools that only provide average utilization. The minimal tools required for characterizing a job are listed below, with example invocations after the list:

  1. \time for estimating CPU utilization on a single node
  2. sacct -j JobID -o JobID,MaxRSS,Elapsed for analyzing maximum memory utilization and wall time
  3. seff JobID for analyzing general resource utilization
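As a hedged sketch, assuming the batch job 8436364 from the previous section and the variables defined in its job script, these tools might be invoked as follows:

# average CPU utilization of a single command (the backslash bypasses the shell's built-in time)
\time python matmul_test.py $AMAT $BMAT $OUT

# maximum resident memory (MaxRSS) and elapsed wall time, available after the job has finished
sacct -j 8436364 -o JobID,MaxRSS,Elapsed

# general summary of CPU and memory efficiency for a completed job
seff 8436364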

When utilizing these tools, it is important to consider the different sources of variability in job performance. Specifically, it is important to perform a detailed analysis of

  1. Variability Among Datasets
  2. Variability Across Hardware
  3. Request Parameter Estimation from Sample Sets - Scaling Analysis

and how these can be addressed through thoughtful job design. This is the subject of the article Job Performance Analysis and Resource Planning.

Revisiting the Resource Request

Having completed a careful analysis of performance for our problem and the hardware that we plan to run on, we are now in a position to revise the resource request and submit more jobs. If you have 10000 such jobs, then it is important to think about how busy the resources that you are planning to use are likely to be. It is also important to remember that every job has some finite scheduling overhead (time to set up the environment to run in, time to figure out where to place the job so that it fits optimally, etc.). In order to make this scheduling overhead worthwhile, it is generally important to have jobs run for more than 10 minutes. So, we may want to pack batches of 10-20 matrix pairs into one job just to make it a sensible amount of work. This also has some benefit in making job duration more predictable when there is variability due to data sets. New hardware also tends to be busier than old hardware, and cpu2019 will often have 24-hour wait times when the cluster is busy. However, for jobs that can stay under 5 hours, the backfill partitions are often not busy, so the best approach might be to bundle 20 tasks together for a duration of about 100 minutes and then have the job request be:

#!/bin/bash
#SBATCH --partition=pawson-bf,razi-bf,apophis-bf,cpu2019 
#SBATCH --time=1:40:0 
#SBATCH --nodes=1 
#SBATCH --ntasks=1 
#SBATCH --cpus-per-task=1 
#SBATCH --mem=5000M
#SBATCH --array=1-500

export PATH=~/anaconda3/bin:$PATH
echo $(which python)
# match the thread counts to --cpus-per-task=1 so the job does not oversubscribe its single core
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export NUMEXPR_NUM_THREADS=1

export TOP_DIR="/home/username/project/matmul/pairs"$SLURM_ARRAY_TASK_ID

DIR_LIST=$(find $TOP_DIR -mindepth 1 -type d)
for DIR in $DIR_LIST
do 
python matmul_test.py $DIR/A.csv $DIR/B.csv $DIR/C.csv; 
done

Here the directory TOP_DIR will be one of 500 directories, each named pairsN, where N is an integer from 1 to 500, and each holding 20 subdirectories with a pair of A and B matrix files in each. The script will iterate through the 20 subdirectories and run the Python script once for each pair of matrices. This is one approach to organizing work into bundles that can then be submitted as a job array. The choice depends on a lot of assumptions that we have made, but it is a fairly good plan. The result would be 500 jobs, each about 2 hours long, running mostly on the backfill partitions on 1 core apiece. If there is usually at least 1 node free, that will be about 38 jobs running at the same time, for a total of roughly 13 sequences that are 2 hours long, or 26 hours. In this way, all of the work could be done in about a day. Typically, we would iteratively improve the script until it was something that could scale up. This is why the workflow is a loop.
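As a hedged sketch of the assumed layout (the setM subdirectory names are purely illustrative; any naming scheme works as long as each subdirectory holds an A.csv and a B.csv pair), the directory skeleton could be prepared with something like:

# create the 500 x 20 directory skeleton described above
for N in $(seq 1 500); do
    for M in $(seq 1 20); do
        mkdir -p /home/username/project/matmul/pairs${N}/set${M}
        # copy or generate the A.csv and B.csv pair into pairs${N}/set${M} here
    done
done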

Tips and tricks

Attaching to a running job

It is possible to connect to the node running a job and execute new processes there. You might want to do this for troubleshooting or to monitor the progress of a job.

Suppose you want to run the utility nvidia-smi to monitor GPU usage on a node where you have a job running. The following command runs watch on the node assigned to the given job, which in turn runs nvidia-smi every 30 seconds, displaying the output on your terminal.

$ srun --jobid 123456 --pty watch -n 30 nvidia-smi

It is possible to launch multiple monitoring commands using tmux. The following command launches htop and nvidia-smi in separate panes to monitor the activity on a node assigned to the given job.

$ srun --jobid 123456 --pty tmux new-session -d 'htop -u $USER' \; split-window -h 'watch nvidia-smi' \; attach

Processes launched with srun share the resources with the job specified. You should therefore be careful not to launch processes that would use a significant portion of the resources allocated for the job. Using too much memory, for example, might result in the job being killed; using too many CPU cycles will slow down the job.

Note: The srun commands shown above work only to monitor a job submitted with sbatch. To monitor an interactive job, create multiple panes with tmux and start each process in its own pane.

It is possible to attach a second terminal to a running interactive job with the following command:

$ srun -s --pty --jobid=12345678 /bin/bash

The -s option specifies that the resources can be oversubscribed. It is needed because all of the resources have already been given to the first shell; to run a second shell, we either have to wait until the first shell is finished, or we have to oversubscribe.

Further reading

  • Comprehensive documentation is maintained by SchedMD, as well as some tutorials.
  • There is also a "Rosetta stone" mapping commands and directives from PBS/Torque, SGE, LSF, and LoadLeveler, to SLURM.
  • Here is a text tutorial from CÉCI, Belgium.
  • Here is a rather minimal text tutorial from Bright Computing.