Running jobs


This page is intended for users who are looking for guidance on working effectively with the SLURM job scheduler on the compute clusters that are operated by Research Computing Services (RCS). The content on this page assumes that you are already familiar with the basic concepts of job scheduling and job scripts. If you have not worked on a large shared computer cluster before, please read What is a scheduler? first. To familiarize yourself with how we use computer architecture terminology in relation to SLURM, you may also want to read about the Types of Computational Resources, as well as the cluster guide for the cluster that you are working on. Users familiar with other job schedulers, including PBS/Torque, SGE, LSF, or LoadLeveler, may find this table of corresponding commands useful.


Here is the link to the SLURM manual: https://slurm.schedmd.com/

Basic HPC Workflow

The most basic HPC computation workflow (excluding software setup and data transfer) involves 5 steps.

The standard process for executing and optimizing a job script

This process begins with the creation of a working preliminary job script that includes an approximate resource request and all of the commands required to complete a computational task of interest. This job script is used as the basis of a job submission to SLURM. The submitted job is then monitored for issues and analyzed for resource utilization. Depending on how well the job utilized resources, modifications to the job script can be made to improve computational efficiency and the accuracy of the resource request. In this way, the job script can be iteratively improved until a reliable job script is developed. As your computational goals and methods change, your job scripts will need to evolve to reflect these changes and this workflow can be used again to make appropriate modifications.

Job Design and Resource Estimation

The goal of this first step is the development of a SLURM job script for the calculations that you wish to run, one that reflects real resource usage in a job typical of your work. To review some example job scripts for different models of parallel computation, please consult the page Sample Job Scripts. We will look at the specific example of using the NumPy library in Python to multiply two matrices that are read in from two files. We assume that the files have been transferred to a subdirectory of the user's home directory, ~/project/matmul, and that a custom Python installation has been set up under ~/anaconda3.

$ ls -lh ~/project/matmul
total 4.7G
-rw-rw-r-- 1 username username 2.4G Feb  3 10:13 A.csv
-rw-rw-r-- 1 username username 2.4G Feb  3 10:16 B.csv
$ ls ~/anaconda3/bin/python
/home/username/anaconda3/bin/python

The computational plan for this project is to load the data from the files, use the multithreaded version of the NumPy dot function to perform the matrix multiplication, and write the result back out to a file. To test out our proposed steps and software environment, we will request a small Interactive Job. This is done with the salloc command. The salloc command accepts all of the standard resource request parameters for SLURM. (The key system resources that can be requested are reviewed on the Types of Computational Resources page.) In our case, we will need to specify a partition, a number of cores, a memory allocation, and a small time interval in which we will do our testing work (on the order of 1-4 hours). We begin by checking which partitions are currently not busy using the arc.nodes command:

$ arc.nodes

          Partitions: 16 (apophis, apophis-bf, bigmem, cpu2013, cpu2017, cpu2019, gpu-v100, lattice, parallel, pawson, pawson-bf, razi, razi-bf, single, theia, theia-bf)


      ====================================================================================
                     | Total  Allocated  Down  Drained  Draining  Idle  Maint  Mixed
      ------------------------------------------------------------------------------------
             apophis |    21          1     1        0         0     7      1     11 
          apophis-bf |    21          1     1        0         0     7      1     11 
              bigmem |     2          0     0        0         0     1      0      1 
             cpu2013 |    14          1     0        0         0     0      0     13 
             cpu2017 |    16          0     0        0         0     0     16      0 
             cpu2019 |    40          3     1        2         1     0      4     29 
            gpu-v100 |    13          2     0        1         0     5      0      5 
             lattice |   279         34     3       72         0    47      0    123 
            parallel |   576        244    26       64         1    20      3    218 
              pawson |    13          1     1        2         0     9      0      0 
           pawson-bf |    13          1     1        2         0     9      0      0 
                razi |    41         26     4        1         0     8      0      2 
             razi-bf |    41         26     4        1         0     8      0      2 
              single |   168          0    14       31         0    43      0     80 
               theia |    20          0     0        0         1     0      0     19 
            theia-bf |    20          0     0        0         1     0      0     19 
      ------------------------------------------------------------------------------------
       logical total |  1298        340    56      176         4   164     25    533 
                     |
      physical total |  1203        312    50      173         3   140     24    501

The ARC Cluster Guide tells us that single is an older partition whose nodes have 12 GB of memory and 8 cores; our data set is only about 5 GB and matrix multiplication is usually fairly memory efficient, so it is probably a good fit. Based on the arc.nodes output, we can choose the single partition (which, at the time this ran, had 43 idle nodes) for our test. The preliminary resource request that we will use for the interactive job is:

partition=single
time=2:0:0 
nodes=1 
ntasks=1 
cpus-per-task=8 
mem=0

We are requesting 2 hours (and 0 minutes and 0 seconds) of walltime and a whole node (i.e. all 8 cores); mem=0 means that all of the memory on the node is being requested. From the login node, we request this job using salloc:

[username@arc ~]$ salloc --partition=single --time=2:0:0 --nodes=1 --ntasks=1 --cpus-per-task=8 --mem=0
salloc: Pending job allocation 8430460
salloc: job 8430460 queued and waiting for resources
salloc: job 8430460 has been allocated resources
salloc: Granted job allocation 8430460
salloc: Waiting for resource configuration
salloc: Nodes cn056 are ready for job
[username@cn056 ~]$ export PATH=~/anaconda3/bin:$PATH
[username@cn056 ~]$ which python
~/anaconda3/bin/python
[username@cn056 ~]$ export OMP_NUM_THREADS=4
[username@cn056 ~]$ export OPENBLAS_NUM_THREADS=4
[username@cn056 ~]$ export MKL_NUM_THREADS=4
[username@cn056 ~]$ export VECLIB_MAXIMUM_THREADS=4
[username@cn056 ~]$ export NUMEXPR_NUM_THREADS=4
[username@cn056 ~]$ python
Python 3.7.4 (default, Aug 13 2019, 20:35:49) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

The steps that were taken at the command line included:

  1. submitting an interactive job request and waiting for the scheduler to allocate the node (cn056) and move the terminal to that node
  2. setting up our software environment using Environment Variables
  3. starting a python interpreter

At this point, we are ready to test our simple python code at the python interpreter command line.

>>> import numpy as np
>>> A=np.loadtxt("/home/username/project/matmul/A.csv",delimiter=",")
>>> B=np.loadtxt("/home/username/project/matmul/B.csv",delimiter=",")
>>> C=np.dot(A,B)
>>> np.savetxt("/home/username/project/matmul/C.csv",C,delimiter=",")
>>> quit()
[username@cn056 ~]$ exit
exit
salloc: Relinquishing job allocation 8430460

In a typical testing/debugging/prototyping session like this, some problems would be identified and fixed along the way. Here we have only shown the successful steps of a simple job. Normally, however, this would be an opportunity to identify syntax errors, data type mismatches, out-of-memory exceptions, and other issues in an interactive environment. The data used for a test should be a manageable size. In this example, I am working with randomly generated 10000x10000 matrices and discussing them as though they were the actual data for the job that will run. Often, however, the real problem is too large for an interactive job and it is better to test the commands in an interactive mode using a small data subset. For example, randomly generated 10000x10000 matrices might be a good starting point for planning a real job that was going to take the product of 1 million x 1 million matrices.
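If you need such test data, it can be generated ahead of time. The following is a minimal sketch, not part of the session above: it assumes the same ~/anaconda3 Python installation and the /home/username/project/matmul paths used throughout this example, and writes two random 10000x10000 matrices to CSV files (each a few GB on disk).

~/anaconda3/bin/python - <<'EOF'
# Sketch: create two random 10000x10000 test matrices as CSV files.
import numpy as np
A = np.random.rand(10000, 10000)
B = np.random.rand(10000, 10000)
np.savetxt("/home/username/project/matmul/A.csv", A, delimiter=",")
np.savetxt("/home/username/project/matmul/B.csv", B, delimiter=",")
EOF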

While this interactive job was running, I used a separate terminal on ARC to check the resource utilization of the interactive job by running the command arc.job-info 8430460, and found that the CPU utilization hovered around 50% and memory utilization around 30% (see the short sketch below for what this check looks like). We will come back to this point in later steps of the workflow, but this would be a warning sign that the resource request is a bit high, especially in CPUs. It is worth noting that if you have run your code before on a different computer, you can base your initial SLURM script on that past experience, or on the resources recommended in the software documentation, instead of testing it out explicitly with salloc. However, as the hardware and software installation are likely different, you may run into new challenges that need to be resolved. In the next step, we will write a SLURM script that includes all of the commands we ran manually in the above example.
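For reference, the second-terminal check described above looks roughly like the following (a sketch; the JobID is the one reported by salloc, and arc.job-info is the RCS monitoring utility discussed under Job Monitoring below):

[username@arc ~]$ squeue -u username          # confirm the job is running and which node it is on
[username@arc ~]$ arc.job-info 8430460        # snapshot of CPU and memory utilization for the job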

Job Submission

In order to use the full resources of a computing cluster, it is important to do your large-scale work with Batch Jobs. This amounts to replacing sequentially typed commands (or successively run cells in a notebook) with a shell script that automates the entire process and using the sbatch command to submit the shell script as a job to the job scheduler. Please note, you must not run this script directly on the login node. It must be submitted to the job scheduler.

Simply putting all of our commands in a single shell script file (with proper #SBATCH directives for the resource request) yields the following:

#!/bin/bash
#SBATCH --partition=single 
#SBATCH --time=2:0:0 
#SBATCH --nodes=1 
#SBATCH --ntasks=1 
#SBATCH --cpus-per-task=8 
#SBATCH --mem=0

export PATH=~/anaconda3/bin:$PATH
echo $(which python)
export OMP_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=4
export MKL_NUM_THREADS=4
export VECLIB_MAXIMUM_THREADS=4
export NUMEXPR_NUM_THREADS=4

python matmul_test.py

where matmul_test.py contains:

import numpy as np

A=np.loadtxt("/home/username/project/matmul/A.csv",delimiter=",")
B=np.loadtxt("/home/username/project/matmul/B.csv",delimiter=",")
C=np.dot(A,B)
np.savetxt("/home/username/project/matmul/C.csv",C,delimiter=",")

There are a number of things to improve here when moving to a batch job. First, we don't need to submit the job exclusively to the single partition. We can expand our partition list a bit and the job should be able to start sooner. However, we do need to be correspondingly more precise with the rest of our request. We should also consider generalizing our script a bit to improve reusability between jobs, instead of hard coding paths. This yields:

matmul_test02032021.slurm:

#!/bin/bash
#SBATCH --partition=single,lattice,parallel,pawson-bf 
#SBATCH --time=2:0:0 
#SBATCH --nodes=1 
#SBATCH --ntasks=1 
#SBATCH --cpus-per-task=4 
#SBATCH --mem=10000M

export PATH=~/anaconda3/bin:$PATH
echo $(which python)
export OMP_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=4
export MKL_NUM_THREADS=4
export VECLIB_MAXIMUM_THREADS=4
export NUMEXPR_NUM_THREADS=4

AMAT="/home/username/project/matmul/A.csv"
BMAT="/home/username/project/matmul/B.csv"
OUT="/home/username/project/matmul/C.csv"

python matmul_test.py $AMAT $BMAT $OUT

where matmul_test.py has been updated to accept command-line arguments:

import numpy as np
import sys

#parse arguments (the script name plus three paths are required)
if len(sys.argv)<4:
    raise Exception("usage: matmul_test.py A_PATH B_PATH OUT_PATH")

lmatrixpath=sys.argv[1]
rmatrixpath=sys.argv[2]
outpath=sys.argv[3]

A=np.loadtxt(lmatrixpath,delimiter=",")
B=np.loadtxt(rmatrixpath,delimiter=",")
C=np.dot(A,B)
np.savetxt(outpath,C,delimiter=",")

We have expanded the partition list to include other partitions (the specific consequences of this choice will be discussed later) and have made the CPU and memory requests correspondingly more precise. Finally, we have generalized the way that data sources and output locations are specified in the job. In the future, this will help us reuse this script without proliferating Python scripts. Similar considerations of reusability apply to any language and any job script. In general, due to the complexity of using them, HPC systems are not ideal for running a couple of small calculations. They are at their best when you have to solve many variations of a problem at a large scale. As such, it is valuable to always be thinking about how to make your workflows simpler and easier to modify.
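Along the same lines, it is worth knowing that sbatch passes any arguments that follow the script name through to the batch script itself, so the paths could be supplied at submission time instead of being edited into the file. The following is a sketch of such a variant (the file name matmul_generic.slurm is hypothetical, not part of the example above):

#!/bin/bash
#SBATCH --partition=single,lattice,parallel,pawson-bf
#SBATCH --time=2:0:0
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=10000M

# $1, $2, and $3 are the A matrix, B matrix, and output paths passed to sbatch
export PATH=~/anaconda3/bin:$PATH
export OMP_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=4
export MKL_NUM_THREADS=4
export VECLIB_MAXIMUM_THREADS=4
export NUMEXPR_NUM_THREADS=4

python matmul_test.py "$1" "$2" "$3"

which could then be submitted as, for example, sbatch matmul_generic.slurm ~/project/matmul/A.csv ~/project/matmul/B.csv ~/project/matmul/C.csv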

Returning to matmul_test02032021.slurm: now that we have a job script, we can simply submit it to sbatch:

[username@arc matmul]$ sbatch matmul_test02032021.slurm
Submitted batch job 8436364

The number reported after the sbatch job submission is the JobID and will be used in all further analyses and monitoring. With this number, we can begin the process of checking on how our job progresses through the scheduling system.

Job Monitoring

The third step in this workflow is checking the progress of the job. We will emphasize three tools:

  1. --mail-user , --mail-type
    1. sbatch options for sending email about job progress
    2. no performance information
  2. squeue
    1. slurm tool for monitoring job start, running, and end
    2. no performance information
  3. arc.job-info
    1. RCS tool for monitoring job performance once it is running
    2. provides detailed snapshot of key performance information


The motives for using the different monitoring tools are very different, but they all help you track the progress of your job from start to finish. The details of these tools applied to the above example are discussed in the article Job Monitoring. If a job is determined to be incapable of running or to have a serious error in it, you can cancel it at any time using the command scancel JobID, for example scancel 8442969.
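Applied to the batch job submitted above, a typical sequence might look like the following sketch (the email address is a placeholder, and the notification directives would need to be added to the job script before submission):

#SBATCH --mail-user=username@ucalgary.ca      # placeholder address
#SBATCH --mail-type=BEGIN,END,FAIL            # send email when the job starts, ends, or fails

[username@arc matmul]$ squeue -j 8436364      # is the job pending (PD) or running (R), and where?
[username@arc matmul]$ arc.job-info 8436364   # once running: snapshot of CPU and memory utilization
[username@arc matmul]$ scancel 8436364        # cancel the job if something is clearly wrong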

Further analysis of aggregate and maximum resource utilization is needed to determine changes that need to be made to the job script and the resource request. This is the focus of the Job Performance Analysis step that is the subject of the next section.

Job Performance Analysis

There are a tremendous number of performance analysis tools, from complex instruction-analysis tools like gprof, valgrind, and nvprof to extremely simple tools that only provide average utilization. The minimal tools that are required for characterizing a job are

  1. \time for estimating CPU utilization on a single node
  2. sacct -j JobID -o JobID,MaxRSS,Elapsed for analyzing maximum memory utilization and wall time
  3. seff JobID for analyzing general resource utilization
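As a quick reference, these might be invoked as follows (a sketch: the JobID is the batch job from above, and the \time line assumes the python command from the job script):

# 1. Inside the job script, prefix the main command with \time (the backslash
#    selects /usr/bin/time rather than the shell keyword) to report CPU and
#    elapsed time when the command finishes:
\time python matmul_test.py $AMAT $BMAT $OUT

# 2. After the job has finished, query the accounting database for the memory
#    high-water mark and the wall time:
[username@arc matmul]$ sacct -j 8436364 -o JobID,MaxRSS,Elapsed

# 3. Summarize overall CPU and memory efficiency for the completed job:
[username@arc matmul]$ seff 8436364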

When utilizing these tools, it is important to consider the different sources of variability in job performance. Specifically, it is important to perform a detailed analysis of

  1. Variability Among Datasets
  2. Variability Across Hardware
  3. Request Parameter Estimation from Sample Sets - Scaling Analysis

and how these can be addressed through thoughtful job design. This is the subject of the article Job Performance Analysis and Resource Planning.

Revisiting the Resource Request

Having completed a careful analysis of performance for our problem and of the hardware that we plan to run on, we are now in a position to revise the resource request and submit more jobs. If you have 10000 such jobs, it is important to think about how busy the resources that you are planning to use are likely to be. It is also important to remember that every job has some finite scheduling overhead (time to set up the environment to run in, time to figure out where to run the job so that it fits optimally, etc.). To make this scheduling overhead worthwhile, it is generally important for jobs to run for more than 10 minutes. So, we may want to pack batches of 10-20 matrix pairs into one job just to make it a sensible amount of work. This also has some benefit in making job duration more predictable when there is variability due to data sets. New hardware also tends to be busier than old hardware, and cpu2019 will often have 24 hour wait times when the cluster is busy. However, for jobs that can stay under 5 hours, the backfill partitions are often not busy, so the best approach might be to bundle 20 tasks together for a duration of about 100 minutes and then have the job request be:

#!/bin/bash
#SBATCH --partition=pawson-bf,razi-bf,apophis-bf,cpu2019 
#SBATCH --time=1:40:0 
#SBATCH --nodes=1 
#SBATCH --ntasks=1 
#SBATCH --cpus-per-task=1 
#SBATCH --mem=5000M
#SBATCH --array=1-500

export PATH=~/anaconda3/bin:$PATH
echo $(which python)
# limit the math libraries to the single core allocated by --cpus-per-task=1
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export NUMEXPR_NUM_THREADS=1

export TOP_DIR="/home/username/project/matmul/pairs"$SLURM_ARRAY_TASK_ID

DIR_LIST=$(find $TOP_DIR -mindepth 1 -type d)
for DIR in $DIR_LIST
do
python matmul_test.py $DIR/A.csv $DIR/B.csv $DIR/C.csv
done

Here the directory TOP_DIR will be one of 500 directories, each named pairsN where N is an integer from 1 to 500, and each of these holds 20 subdirectories with a pair of A and B matrix files in each. The script iterates through the 20 subdirectories and runs the python script once for each pair of matrices. This is one approach to organizing work into bundles; the --array=1-500 directive then submits the bundles as a job array. The choice depends on a lot of assumptions that we have made, but it is a fairly good plan. The result would be 500 jobs, each about 2 hours long, running mostly on the backfill partitions on 1 core apiece. If there is usually at least 1 node free, that will be about 38 jobs running at the same time, for a total of roughly 13 sequences that are 2 hours long, or 26 hours. In this way, all of the work could be done in about a day. Typically, we would iteratively improve the script until it was something that could scale up. This is why the workflow is a loop.
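Before submitting an array like this, it can be worth sanity-checking that the directory layout matches what the script expects. A minimal sketch (assuming the pairsN directories described above already exist under ~/project/matmul, and using matmul_array.slurm as a hypothetical name for the script above):

# Every line should report 20 pair directories if the layout is as described.
for N in $(seq 1 500)
do
    COUNT=$(find ~/project/matmul/pairs$N -mindepth 1 -type d | wc -l)
    echo "pairs$N: $COUNT pair directories"
done

# Submit the array and check on its tasks; array tasks appear in squeue
# as JobID_1, JobID_2, ... once they are scheduled.
[username@arc matmul]$ sbatch matmul_array.slurm
[username@arc matmul]$ squeue -u username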

Further reading

  • Comprehensive documentation is maintained by SchedMD, as well as some tutorials.
  • There is also a "Rosetta stone" mapping commands and directives from PBS/Torque, SGE, LSF, and LoadLeveler, to SLURM.
  • Here is a text tutorial from CÉCI, Belgium
  • Here is a rather minimal text tutorial from Bright Computing