Job Monitoring

From RCSWiki
Jump to navigation Jump to search

The third step in this work flow is checking the progress of the job. We will emphasize three tools:

  1. --mail-user , --mail-type
    1. sbatch options for sending email about job progress
    2. no performance information
  2. squeue
    1. slurm tool for monitoring job start, running, and end
    2. no performance information
  3. arc.job-info
    1. RCS tool for monitoring job performance once it is running
    2. provides detailed snapshot of key performance information


The motives for using the different monitoring tools are very different but they all help you track the progress of your job from start to finish. arc.job-info also provides snapshots of performance that can be used in the Job Performance Analysis stage, so we will talk about this last.

--mail-user

Mail-user is a slurm option that can be included as an sbatch directive. If we were to modify our script to include it, it would look something like

matmul_test02032021.slurm:

#!/bin/bash
#SBATCH --mail-user=username@ucalgary.ca
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --partition=single,lattice,parallel,pawson-bf 
#SBATCH --time=2:0:0 
#SBATCH --nodes=1 
#SBATCH --ntasks=1 
#SBATCH --cpus-per-task=4 
#SBATCH --mem=10000M

export PATH=~/anaconda3/bin:$PATH
echo $(which python)
export OMP_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=4
export MKL_NUM_THREADS=4
export VECLIB_MAXIMUM_THREADS=4
export NUMEXPR_NUM_THREADS=4

AMAT="/home/username/project/matmul/A.csv"
BMAT="/home/username/project/matmul/B.csv"
OUT="/home/username/project/matmul/C.csv"

python matmul_test.py $AMAT $BMAT $OUT

Different choices of mail-type provide information about that different stages of progress for the job. For a detailed list of mail-type options, see the sbatch manual page. The BEGIN and END options used in the above example would produce two emails (with empty body) for a jobID 8442969. One with the title

Slurm Job_id=8442969 Name=matmul_test02032021.slurm Began, Queued time 00:00:05

and another

Slurm Job_id=8442969 Name=matmul_test02032021.slurm Ended, Run time 00:08:02, COMPLETED, ExitCode 0

These emails can overwhelm your inbox when used for every job in a large collection. However, they can be useful (especially when used in conjunction with job dependencies) in tracking the progress of collections of long running jobs that may not start or end for many days. This option must be included in the original slurm script submitted to the scheduler and can't be added later. (so it is, in some sense, the first step in job monitoring) The information provided by this method is of the coarsest character, but it also is the most passive tracking option and will function without any further user actions being taken.

squeue

squeue provides information about any job that has been submitted to the scheduler. This means that as soon as a job has been submitted, you can use this command to get some information on the status of the job. After a job completes, it will disappear from squeue. This will happen whether the job was successful or not. We will discuss how to analyze the job after completion in the next section.

You can see the status of all jobs that you have personally submitted to the job scheduler using the squeue -u $USER option.

[username@arc matmul]$ squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
           8436360 single matmul_t username  R       1:09      1 cn056
           8436364 single,la matmul_t username PD       0:00      1 (Priority)

This provides a valuable overview of the jobs that we are waiting on. If you are interested in the progress of a particular job, you can filter on JobID using the squeue -j JobID option.

[username@arc matmul]$ squeue -j 8436364
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
           8436364 single,la matmul_t username PD       0:00      1 (Priority) 
...time passes... and the job starts running
[username@arc matmul]$ squeue -j 8436364
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
           8436364 pawson-bf matmul_t username  R       0:09      1 fc105
...time passes... and the job completes
[username@arc matmul]$ squeue -j 8436364
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

The field ST tells us the current state of the job. When the job is waiting in the queue, the state will display the value PD for Pending. When the job is Pending, the NODELIST(REASON) field will list the reason that the job is waiting to start. The most common reason is Priority, meaning that you are not at the front of the queue yet. When a partition list is used in the request, this field is not always useful. Once the jobs has started, the state will display the value R for Running and the REASON will be replaced by a list of specific nodes that the job is currently running on. Once the job completes, it will no longer appear when squeue runs.

arc.job-info

arc.job-info provides a snapshot of resource usage sampled in a small window at the time that it is run. It will only return results while the job is running and will show an error after it completes. However, the information is very detailed. If a job is taking a very long time, this can provide a sense of whether any work is being done. There are reasons why a job will have low CPU utilization at some times and high utilization at others. However, if your job has started running and is taking much longer than expected, arc.job-info can provide warning signs that you need to introduce debugging lines into your code (or run in a debugging mode) to get a handle on where the code is stalling. arc.job-info provides some of the most fine-grained information available on the resources utilized by your job (without using a more technically complex performance analysis tool). However, it is also one of the most active methods of interrogation in that you need to run the command each time that you want a snapshot.

If you run arc.job-info repeatedly over the life of the job, you can develop an informative resource utilization time series that gives you a good sense of what is happening in the job. In the sequence of calls to arc.job-info below, I annotate them with some narrative lines that indicate how to interpret them relative to the events in the job script.

1) Once the job begins and data is being loaded from storage into memory, our example job will show something like...

[username@arc matmul]$ arc.job-info 8436364
# ====================================================================================================
Job 8436364 'matmul_test02032021.slurm' by 'username' is 'RUNNING'.

  Work dir: /home/username/project/matmul
Job script: /home/username/project/matmul/matmul_test02032021.slurm

Partition: pawson-bf
    Nodes: 1
    Tasks: 1
     CPUs: 4

  Memmory: 10000.0 MB (per node)
         : 0.0 MB (per cpus)

Allocated node list:
    fc105
Batch host: fc105

Submitted on: 2021-02-03 12:49:23
  Started on: 2021-02-03 12:49:29

Time Limit: 2:00:00
 Wait Time: 0:00:06
  Run Time: 0:00:23

Collecting process data: fc105Warning: Permanently added the RSA host key for IP address '172.19.6.105' to the list of known hosts.


# ------------------------------------------------------------------------------------------
      Node                                                     Job  |  Node total       
            Procs  Threads     CPU%   Mem%      RSS Mb     VMem Mb  |    CPU %  Mem %
# ------------------------------------------------------------------------------------------
     fc105      2        2     90.2    0.6     1255.39     1489.91  |    364.0    0.6
# ------------------------------------------------------------------------------------------
     Total      2        2     90.2
Efficiency                     22.6   12.6
# ------------------------------------------------------------------------------------------

#Processes (#threads) on all nodes:

    1 (    1): /bin/bash
    1 (    1): python
    1 (    5): slurmstepd:

Total time: 2 sec.

2) ...time passes... and the matrix multiplication starts running

[username@arc matmul]$ arc.job-info 8436364
# ====================================================================================================
Job 8436364 'matmul_test02032021.slurm' by 'username' is 'RUNNING'.

  Work dir: /home/username/project/matmul
Job script: /home/username/project/matmul/matmul_test02032021.slurm

Partition: pawson-bf
    Nodes: 1
    Tasks: 1
     CPUs: 4

  Memmory: 10000.0 MB (per node)
         : 0.0 MB (per cpus)

Allocated node list:
    fc105
Batch host: fc105

Submitted on: 2021-02-03 12:49:23
  Started on: 2021-02-03 12:49:29

Time Limit: 2:00:00
 Wait Time: 0:00:06
  Run Time: 0:02:48

Collecting process data: fc105

# ------------------------------------------------------------------------------------------
      Node                                                     Job  |  Node total       
            Procs  Threads     CPU%   Mem%      RSS Mb     VMem Mb  |    CPU %  Mem %
# ------------------------------------------------------------------------------------------
     fc105      2        5    110.0    1.2     2476.28     2923.49  |    114.5    1.2
# ------------------------------------------------------------------------------------------
     Total      2        5    110.0
Efficiency                     27.5   24.8
# ------------------------------------------------------------------------------------------

#Processes (#threads) on all nodes:

    1 (    1): /bin/bash
    1 (    4): python
    1 (    5): slurmstepd:

Total time: 1 sec.

3) ...time passes... and the matrix multiplication completes and data is written from memory back to storage

...time passes... and the job completes

[username@arc matmul]$ arc.job-info 8436364
# ====================================================================================================
Something is wrong about the Job ID 8436364. Just cannot do it. Aborting.

The first execution of the job-info command happens early in the job when data is being loaded to variables from files. This is reflected in the low memory usage on the node. The second execution happens at peak memory usage after the data is loaded into memory. The last execution happens after the job has completed and no information is provided. Although more timepoints would be needed to analyze your code to identify problems, the snapshots that we have tell us something substantive about job progress and potential issues. Having seen this sequence of snapshots, it is clear that the job is not optimally using memory and cpus, even though it is creating 4 threads in support of the matrix multiplication. Further analysis of aggregate and maximum resource utilization is needed to determine changes that need to be made to the job script and the resource request. This is the focus of the Job Performance Analysis step that is the subject of the next section.