Job Performance Analysis and Resource Planning

There is a tremendous number of performance analysis tools, ranging from complex instruction-level profilers like gprof, valgrind, and nvprof to extremely simple utilities that only report average utilization. We will focus on two simple tools for characterizing a job

  1. \time for estimating CPU utilization on a single node
  2. sacct -j JobID -o JobID,MaxRSS,Elapsed for analyzing maximum memory utilization and wall time

and try to sort out some of the different sources of variability in job performance. Specifically, we will discuss

  1. Variability Among Datasets
  2. Variability Across Hardware
  3. Request Parameter Estimation from Sample Sets

and how these can be addressed through thoughtful job design.

sacct

We will begin this discussion of post-job analysis by talking about the role of the slurm accounting database. slurm tracks a great deal of information about every job and records it in a historical database of all jobs that have run on the cluster. sacct is the user interface to that database, and it allows you to look up information ranging from the job completion code to the memory utilization. A full list of the fields that can be accessed by changing the format argument in sacct --format=field1,field2,... can be found on the sacct manual page or by running the command sacct --helpformat. Note that CPU utilization and GPU utilization are not recorded (the CPUTime field in the accounting database is just the allocated CPUs multiplied by the elapsed time, not a measure of actual usage, so it is not helpful here). Other methods for accessing this data will be discussed later.

The most basic usage of sacct is to check on the status of a completed job by using the -j option.

[username@arc matmul]$ sacct -j 8436364
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
8436364      matmul_te+  pawson-bf        all          4  COMPLETED      0:0 
8436364.bat+      batch                   all          4  COMPLETED      0:0 
8436364.ext+     extern                   all          4  COMPLETED      0:0

Here we can see that job 8436364 completed with exit code 0, which means that it ended without throwing any errors. If the ExitCode field is not 0, then something went wrong in the job script and you should look up the value in a list of UNIX exit codes to see what it indicates. Generally, the ExitCode relates to the failure of your code and not to anything done by the scheduler.

The State field also frequently provides useful information about a job, specifically relating to its interactions with the scheduler. The list of JOB STATE CODES can be found on the sacct manual page. The most common values are:

  • COMPLETED = successful job
  • CANCELLED = user cancelled the job with scancel
  • FAILED = job script threw an error, ExitCode should be non-zero
  • OUT_OF_MEMORY = job exceeded memory request
  • TIMEOUT = job exceeded time request

When a job ends with a State other than COMPLETED, it is often helpful to examine the slurm-JobID.out output file to see more detailed error messages generated by the software. This will be discussed in the next section.

Another common use of the sacct command is to check on the maximum memory used by your code during the job and the wall time for which the job ultimately ran. We can look up this information using the MaxRSS and Elapsed format fields. Returning to our matrix multiplication example,

[username@arc matmul]$ sacct -j 8436364 --format=JobID,MaxRSS,Elapsed
       JobID     MaxRSS    Elapsed 
------------ ---------- ---------- 
8436364                   00:03:55 
8436364.bat+   4652076K   00:03:55 
8436364.ext+          0   00:03:55

We can see that the job only used about 4.6GB of memory and only ran for 4 minutes. If these inputs are actually of typical size, we now know that we can revise our time request and our memory request down considerably and still have plenty of space for the work that was done. Usually, during revision of the job design, it suffices to take the MaxRSS value for a representative job and tack on a 20% error margin as the basis for your memory request. This strategy doesn't cover all collections of jobs but it is a good starting point. In this case, 6GB would be more than sufficient.

The Elapsed field corresponds to the job wall time that slurm monitors and limits based on the --time field of the request. Modifying the request to ask for 10 minutes instead of 2 hours would be entirely reasonable in this case. We will discuss request revision in more detail in the Revisiting the Resource Request section.

Analyzing Standard Output and Standard Error

Most software produces some output to the terminal apart from any files that it creates. The output generated usually depends on the verbosity of the mode it was run in (e.g. rsync -a versus rsync -avvv) and whether there were any errors. Standard Output is the text output generated from the normal operation of a piece of software. Standard Error is the error logging output generated when software does not operate as intended. Typically, when working at the terminal you will see both of these printed to the screen in the order that they were generated. For example, if we tried to run the original matmul shell script from an Interactive Job (i.e. not submitting it to the scheduler), and the referenced files did not exist, we would see

[username@cn056 matmul]$ ./matmul_test02032021.sh
/home/username/anaconda3/bin/python
Traceback (most recent call last):
  File "matmul_test.py", line 12, in <module>
    A=np.loadtxt(lmatrixpath,delimiter=",")
  File "/home/username/anaconda3/lib/python3.7/site-packages/numpy/lib/npyio.py", line 968, in loadtxt
    fh = np.lib._datasource.open(fname, 'rt', encoding=encoding)
  File "/home/username/anaconda3/lib/python3.7/site-packages/numpy/lib/_datasource.py", line 269, in open
    return ds.open(path, mode, encoding=encoding, newline=newline)
  File "/home/username/anaconda3/lib/python3.7/site-packages/numpy/lib/_datasource.py", line 623, in open
    raise IOError("%s not found." % path)
OSError: /home/username/project/matmul/A.csv not found.

This is all very useful information for debugging. Where does this information go when the shell script runs through slurm as a Batch Job? During the job, slurm creates an output file in the directory that was the working directory of the terminal session when the job was submitted. This output file holds all of the text that was sent to standard output and standard error. The default job-name is the file name of the slurm job script submitted using sbatch, and the default name of the output file is slurm-%j.out, where %j is the JobID. There are sbatch options for customizing this naming convention:

#SBATCH --job-name=test
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err

that make it possible to split the output into separate output and error files. In the SBATCH directives, the symbol %j can be used to specify a name format that includes the jobID and %x to specify a name format that includes the job-name. Keeping the names of your output and error files unique is important because it prevents overwriting the results of previous jobs with new outputs.

We can examine the output and errors generated by the software by looking in these files with a command like less slurm-8436364.out. This will frequently help in identifying why a given job failed or why it took so long. A common and useful practice when debugging code in batch scheduled jobs is to insert statements into your script that print messages at specific points in the execution of your job so that you can narrow down what might have gone wrong (see the sketch following the example output below). Similarly, when dealing with software that you didn't write, there is often an option to run in a verbose or debugging mode where far more information than usual gets printed to standard output and standard error. So, for a slurm script with the SBATCH directives above, we would have

[username@arc matmul]$ sbatch matmul_test02032021.slurm
Submitted batch job 8436364
[username@arc matmul]$ ls
matmul_test02032021.slurm matmul_test.py test-8436364.out test-8436364.err
[username@arc matmul]$ cat test-8436364.out
/home/username/anaconda3/bin/python
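
As an illustration of the progress-message idea mentioned above, the sketch below shows what matmul_test.py might look like with a few print calls added (the structure of the script is an assumption based on how it is invoked by the slurm script; the messages and their placement are only examples). Passing flush=True makes each message appear in the output file immediately rather than sitting in an output buffer until the job ends.

matmul_test.py (sketch with progress messages):

import sys
import numpy as np

# Progress messages go to standard output; flush=True writes them to the
# slurm output file right away instead of waiting for the buffer to fill.
print("Loading input matrices...", flush=True)
A = np.loadtxt(sys.argv[1], delimiter=",")
B = np.loadtxt(sys.argv[2], delimiter=",")

print("Multiplying...", flush=True)
C = A @ B

print("Writing result...", flush=True)
np.savetxt(sys.argv[3], C, delimiter=",")
print("Done.", flush=True)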

Assuming that the job completed successfully and we plan on using sacct to analyze the time and memory utilization, we still need information about the cpu utilization. This is the subject of the next section.

\time

The simplest tool that can be used to directly report CPU utilization from a job running on a single node is the GNU version of the time command. This can be added to a slurm script by simply prefixing the command that you want to time with \time (the backslash ensures that the external GNU time program is used rather than the shell's built-in time keyword):

matmul_test02032021.slurm:

#!/bin/bash
#SBATCH --partition=single,lattice,parallel,pawson-bf 
#SBATCH --time=2:0:0 
#SBATCH --nodes=1 
#SBATCH --ntasks=1 
#SBATCH --cpus-per-task=4 
#SBATCH --mem=10000M

export PATH=~/anaconda3/bin:$PATH
echo $(which python)
export OMP_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=4
export MKL_NUM_THREADS=4
export VECLIB_MAXIMUM_THREADS=4
export NUMEXPR_NUM_THREADS=4

AMAT="/home/username/project/matmul/A.csv"
BMAT="/home/username/project/matmul/B.csv"
OUT="/home/username/project/matmul/C.csv"

\time python matmul_test.py $AMAT $BMAT $OUT

then we submit the job as usual and, when it is done, the output file will hold the additional timing and CPU utilization information

[username@arc matmul]$ sbatch matmul_test02032021.slurm
Submitted batch job 8443249
[username@arc matmul]$ cat slurm-8443249.out
/home/username/anaconda3/bin/python
930.58user 13.08system 9:36.57elapsed 163%CPU (0avgtext+0avgdata 5582968maxresident)k
9897560inputs+4882880outputs (98major+2609216minor)pagefaults 0swaps
[username@arc matmul]$ sacct -j 8443249
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
8443249      matmul_te+     single        all          4  COMPLETED      0:0 
8443249.bat+      batch                   all          4  COMPLETED      0:0 
8443249.ext+     extern                   all          4  COMPLETED      0:0 
[username@arc matmul]$ sacct -j 8443249 -o JobID,MaxRSS,Elapsed
       JobID     MaxRSS    Elapsed 
------------ ---------- ---------- 
8443249                   00:09:38 
8443249.bat+   4779532K   00:09:38 
8443249.ext+          0   00:09:38

The output of \time includes %CPU, which can be read as the average utilization over the full lifetime of the job, expressed as a percentage of a single CPU. Thus, 163% means that our 4 threads running on 4 cores were, on average, only fully utilizing about 1.6 cores. This is higher utilization than the peak of 1.1 CPUs we saw when we ran arc.job-info to take snapshots of the run on pawson-bf. However, in spite of this, the current run (on the single partition hardware) also took more than twice as long to complete in terms of actual wall time. We can see that, even though they are simple tools, \time and sacct together give a fairly illuminating picture of the resource tradeoffs that occur when running the same code on different hardware.

Variability Across Hardware

The performance of a script will still vary across different hardware. Let's examine the resource request again with an eye to the partitions being targeted:

#!/bin/bash
#SBATCH --partition=single,lattice,parallel,pawson-bf 
#SBATCH --time=2:0:0 
#SBATCH --nodes=1 
#SBATCH --ntasks=1 
#SBATCH --cpus-per-task=4 
#SBATCH --mem=10000M

Strictly speaking, all of the partitions in the partition list can accept this job request on a single node. However, whereas single, parallel, and lattice have fairly similar hardware, pawson-bf is about 8 years newer and has 4-5 times the number of cores and 10 times the memory. Ideally, a request should be tailored (as much as possible) to the hardware that it will run on. To see what this means, we can examine the implications of the performance measurements on the different systems.

When the code ran on single, 4 threads only ended up doing about 1.6 cores' worth of work. When the code ran on pawson-bf, 4 threads did about 1.1 cores' worth of work. In the case of jobs running on single, we may still want to request 2 cores, while on pawson-bf we would probably only use 1 core. Furthermore, the job on pawson-bf ran in less than half the time. Depending on whether we expected a job to run on single or pawson-bf, we would probably request different numbers of cores and different amounts of time. Therefore, our partition list is not really appropriate. If we restricted the list to similar hardware, we would have an easier time optimizing our resource request. Either of the following would be a good request for this job (apart from the jobs probably being too small overall, which we return to below):

#!/bin/bash
#SBATCH --partition=single,lattice,parallel
#SBATCH --time=0:15:0 
#SBATCH --nodes=1 
#SBATCH --ntasks=1 
#SBATCH --cpus-per-task=2 
#SBATCH --mem=5000M

or

#!/bin/bash
#SBATCH --partition=pawson-bf,razi-bf,apophis-bf,cpu2019 
#SBATCH --time=0:5:0 
#SBATCH --nodes=1 
#SBATCH --ntasks=1 
#SBATCH --cpus-per-task=1 
#SBATCH --mem=5000M

Variability across hardware is inevitable. Sometimes, the same computational strategy isn't even appropriate when you move to new hardware. For example, a communication-intensive 80-rank MPI job would be inappropriate for the single partition but would fit on two cpu2019 nodes. Since instruction sets vary between partitions, the same code may even need to be recompiled to run on different hardware. The problem of variability across hardware can be addressed by using partition lists that only include similar (and appropriate) hardware and by testing job performance on each type of hardware that you plan to run on.

Variability Among Datasets

Variability across different data inputs to a calculation is often harder to resolve than variability across hardware. Suppose we randomly generate 10 pairs of input matrices for our matrix multiplication calculation and produce 10 jobs based on these different inputs. We may see ten different values for memory utilized and wall time (depending on the details of the implementation). We may even see a different benefit from using more cores and threads, as the problem scale may change. This kind of variability is a continuous challenge when analyzing experimental data sets using adaptive algorithms (e.g. MRI image reconstruction).

The question of what the optimal resource request is for a collection of jobs with variable input data can be thought of as a sampling and decision theory problem. If we have 100000 pairs of matrices to multiply, we don't want to systematically over-request resources by too much. However, the price of under-requesting resources is a failed job. The first step in this analysis is to run a limited random sample of jobs and try to approximate the distribution of resources required (say, by plotting a histogram). Once you have this, the problem reduces to choosing a resource level to request that will minimize total queue wait time, that is, a level that minimizes the combined effect of the scheduling delay of each initial job (due to its size) and of the jobs that need to be rerun with a higher resource request.

In the simplest case, all of the jobs have almost the same resource requirements and the optimal request is simply the average required resources plus 10%. In a more complex case, the distribution has two peaks: a large peak with lower requirements and a much smaller peak with higher requirements. Here, you can simply choose your initial level to be just above the large peak, knowing that you will need to rerun a handful of jobs. The more the distribution takes a multimodal form with peaks of similar size, the harder the decision becomes. Please contact support@hpc.ucalgary.ca to discuss this kind of analysis if you believe that your data will include highly variable resource requirements for different jobs. We can support you in designing your sampling and resource optimization.
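
As a minimal sketch of this sampling approach (the file name, the way the sample is collected, and the decision rule are all assumptions rather than a fixed recipe), suppose the MaxRSS values for a random sample of test jobs have been collected into a text file with one value per line, in kilobytes (for example, from sacct -j JobID -o MaxRSS -P -n with the trailing K stripped). The script below plots the distribution and suggests a request level based on a high percentile plus a safety margin.

estimate_mem_request.py (sketch):

import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this works in a batch job
import matplotlib.pyplot as plt

# Hypothetical input: one MaxRSS value per line, in kilobytes, collected
# from a random sample of test jobs.
maxrss_kb = np.loadtxt("sample_maxrss.txt")
maxrss_gb = maxrss_kb / 1024**2

# Plot the distribution to see whether it is single- or multi-peaked.
plt.hist(maxrss_gb, bins=30)
plt.xlabel("MaxRSS (GB)")
plt.ylabel("Number of sample jobs")
plt.savefig("maxrss_histogram.png")

# One possible decision rule: cover 95% of the sampled jobs and add a 20%
# margin, accepting that the remaining jobs may have to be rerun with a
# larger request.
suggested_gb = 1.2 * np.percentile(maxrss_gb, 95)
print(f"Suggested request: --mem={int(np.ceil(suggested_gb * 1024))}M")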

Scaling Analysis

Another aspect of uncertainty in modelling resource requirements comes from dealing with large data sets, long-running problems, and potentially high levels of parallelization. Often, analyzing resource requirements by direct experimentation is impractical for very large jobs, and crude estimates are undesirable. In this case, we can use a strategy of modelling the scaling of the problem and extrapolating from small examples. Scaling is the manner in which resource requirements change as the problem size and the number of parallel tasks grow. There are many variations on this problem, and accurate modelling generally relies on a detailed understanding of the underlying algorithm. We will illustrate the idea with a simple matrix example.

Suppose we want to perform a matrix multiplication using two 1 million by 1 million matrices. Is this possible on any single node on the cluster? Across how many nodes would we have to distribute the memory in order to complete the task? To answer this, we can measure the memory required for the calculation using 1000 x 1000 matrices, then 10000 x 10000, and so on. The problem sizes should be spaced roughly on a logarithmic scale, as this makes it possible to use a linear least-squares regression of the logarithm of memory against the logarithm of problem size to obtain the power of the dominant polynomial growth from the slope.

In most cases, you should be able to furnish a reasonable guess from a knowledge of algorithms. For example, the memory required for a dense matrix grows roughly in proportion to the number of entries (i.e. the square of the dimension of the matrix), which is a good starting point for your analysis. Once you have a fit to the growth, you can apply it to extrapolate the resources required for your larger example. In our case, if quadratic growth holds and mem = k*N^2, we can use 4.7GB = k*(10^4)^2 to get k = 47 bytes from a single data point, and from that we could attempt to extrapolate the memory requirement for a 10^6 x 10^6 matrix as k*(10^6)^2 = 47TB. This could not fit on any single node, even the bigmem nodes, and the matrices would have to be distributed over many nodes. In reality, the problem is more complicated than that, as we would need to account for how data locality is reconciled with representing the matrix operations in a block decomposition, so this simple fit is unlikely to be a faithful approximation. However, the example illustrates the extrapolation step of a scaling analysis.
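
As a sketch of this fitting and extrapolation step (the measured values below are invented for illustration; only the procedure matters), a linear least-squares fit of log(memory) against log(N) gives the growth exponent as the slope, which can then be used to extrapolate to the target problem size.

scaling_fit.py (sketch):

import numpy as np

# Hypothetical measurements: matrix dimension N and measured MaxRSS in GB
# from a few small test jobs (illustrative values only).
N = np.array([1000, 2000, 5000, 10000])
mem_gb = np.array([0.06, 0.2, 1.2, 4.7])

# Fit log(mem) = p*log(N) + log(k); the slope p is the growth exponent.
p, logk = np.polyfit(np.log(N), np.log(mem_gb), 1)
print(f"Fitted growth exponent: {p:.2f}")  # roughly 2 for dense matrices

# Extrapolate to the target problem size.
N_target = 10**6
mem_target_gb = np.exp(logk) * N_target**p
print(f"Extrapolated memory for N={N_target}: {mem_target_gb / 1024:.1f} TB")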

Revisiting the Resource Request

Having completed a careful analysis of performance for our problem and the hardware that we plan to run on, we are now in a position to revise the resource request and submit more jobs. We have suggested two possible resource requests that could work for this job assuming that the example was at the intended scale and variability among data was minimal:

#!/bin/bash
#SBATCH --partition=single,lattice,parallel
#SBATCH --time=0:15:0 
#SBATCH --nodes=1 
#SBATCH --ntasks=1 
#SBATCH --cpus-per-task=2 
#SBATCH --mem=5000M

or

#!/bin/bash
#SBATCH --partition=pawson-bf,razi-bf,apophis-bf,cpu2019 
#SBATCH --time=0:5:0 
#SBATCH --nodes=1 
#SBATCH --ntasks=1 
#SBATCH --cpus-per-task=1 
#SBATCH --mem=5000M

One of these runs on old hardware and the other runs on new hardware (but with fewer cores and for less time). How do we choose between the two requests? Do we submit some fraction of our jobs to each? This depends on how many jobs we have to submit. If you only have a dozen or so jobs to submit, then the difference in wait times doesn't matter too much. If you have 10000 such jobs, then it might be important to think about how busy the respective partitions are likely to be. A further factor is that every job has some finite scheduling overhead (time to set up the environment to run in, time to figure out where the job fits optimally, etc.). In order to make this scheduling overhead worthwhile, it is generally not advisable to have jobs run for less than 5 minutes. So, we may want to pack batches of 10-20 matrix pairs into one job just to make it a sensible amount of work. This also has the benefit of making job duration more predictable when there is variability due to data sets: the total duration will generally stabilize around the average duration times the number of tasks as the number of tasks becomes large.

New hardware also tends to be busier than old hardware, and cpu2019 will often have 24 hour wait times when the cluster is busy. However, for jobs that can stay under 5 hours, the backfill partitions are often not busy, so the best approach might be to bundle 20 tasks together for a duration of about 100 minutes and then have the job request be

#!/bin/bash
#SBATCH --partition=pawson-bf,razi-bf,apophis-bf,cpu2019 
#SBATCH --time=1:40:0 
#SBATCH --nodes=1 
#SBATCH --ntasks=1 
#SBATCH --cpus-per-task=1 
#SBATCH --mem=5000M

export PATH=~/anaconda3/bin:$PATH
echo $(which python)
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export NUMEXPR_NUM_THREADS=1

AMAT="/home/username/project/matmul/A.csv"
BMAT="/home/username/project/matmul/B.csv"
OUT="/home/username/project/matmul/C.csv"

\time python matmul_test.py $AMAT $BMAT $OUT

This choice depends on a lot of assumptions that we have made, but it is a fairly good plan. The result would be 500 jobs, each about 2 hours long, running mostly on the backfill partitions on 1 core apiece. If there is usually at least 1 node free, that will be about 38 jobs running at the same time, for a total of roughly 13 sequences that are 2 hours long, or 26 hours. In this way, all of the work could be done in about a day. Typically, we would do more than one test job to model the performance of the script and the required resources, and we would iteratively improve the script until it was something that could scale up. This is why the workflow is a loop. Finally, it is worth noting that if you are going to submit 500 jobs using 40 files each, you should make use of some automation (a bash script, for example) to produce the required lists of file paths (and possibly slurm scripts), as well as a loop for submitting jobs or a job array; sitting around submitting even 500 jobs by hand is impractical and error prone. One possible way to automate this is sketched below.
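
As one possible way to automate this (the text above suggests a bash script; a Python equivalent is sketched here, and the file names pairs.txt and matmul_batch.slurm, the path format, and the chunk size are all assumptions), the script below splits a master list of matrix-pair paths into batches of 20, writes one file list per batch, and submits each batch with sbatch. The hypothetical job script matmul_batch.slurm would read the file list passed to it as its first argument and loop over its lines.

submit_batches.py (sketch):

import subprocess

# Hypothetical input: pairs.txt lists one "A-path,B-path,C-path" triple per
# line for every matrix multiplication to be performed.
with open("pairs.txt") as f:
    pairs = [line.strip() for line in f if line.strip()]

chunk_size = 20  # tasks bundled per job, as discussed above

for i in range(0, len(pairs), chunk_size):
    listfile = f"batch_{i // chunk_size:04d}.txt"
    with open(listfile, "w") as out:
        out.write("\n".join(pairs[i:i + chunk_size]) + "\n")
    # matmul_batch.slurm is a hypothetical job script that reads the file
    # list passed as $1 and runs matmul_test.py once per line.
    subprocess.run(["sbatch", "matmul_batch.slurm", listfile], check=True)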