Job Monitoring


There are three recommended methods for monitoring the progress and status of your jobs:

  1. Use the --mail-user and --mail-type options with sbatch to have the scheduler email you when the status of your jobs changes.
  2. Use the squeue utility to monitor the status of your jobs.
  3. Use the arc.job-info utility. This utility reports on your job's performance, which can be used as part of the Job Performance Analysis stage, where you optimise the efficiency of your jobs.

We cover each of these options in more detail below.

sbatch --mail options

The mail options to sbatch can be used to send emails about changes in job status, but these emails do not contain any performance information.

You can set the mail options in your sbatch script along with the other parameters defined for your job. An example sbatch script with the mail options specified is given below:

#!/bin/bash
# Email notifications: send a message to this address when the job
# begins and when it ends.
#SBATCH --mail-user=username@ucalgary.ca
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END

# Resource request: any one of these partitions, 2 hours of walltime,
# and 1 task with 4 CPUs and 10000 MB of memory on a single node.
#SBATCH --partition=single,lattice,parallel,pawson-bf
#SBATCH --time=2:0:0
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=10000M

# Use the Python from the personal Anaconda install and record which
# interpreter is being run in the job's output file.
export PATH=~/anaconda3/bin:$PATH
which python

# Limit the common numerical libraries to the 4 CPUs requested above.
export OMP_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=4
export MKL_NUM_THREADS=4
export VECLIB_MAXIMUM_THREADS=4
export NUMEXPR_NUM_THREADS=4

AMAT="/home/username/project/matmul/A.csv"
BMAT="/home/username/project/matmul/B.csv"
OUT="/home/username/project/matmul/C.csv"

python matmul_test.py "$AMAT" "$BMAT" "$OUT"

Different choices of mail-type produce notifications at different stages of the job's progress. For the full list of mail-type options, see the sbatch manual page. The BEGIN and END options used in the example above would produce two emails (with empty bodies) for a job with ID 8442969: one with the subject

Slurm Job_id=8442969 Name=matmul_test02032021.slurm Began, Queued time 00:00:05

and another

Slurm Job_id=8442969 Name=matmul_test02032021.slurm Ended, Run time 00:08:02, COMPLETED, ExitCode 0

These emails can overwhelm your inbox if they are enabled for every job in a large collection. However, they can be useful for tracking the progress of collections of long-running jobs that may not start or end for many days, especially when combined with job dependencies, as sketched below. The mail options must be included in the Slurm script when it is first submitted to the scheduler and cannot be added later (so they are, in some sense, the first step in job monitoring). The information provided by this method is the coarsest of the three, but it is also the most passive tracking option: once set, it works without any further action on your part.
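For example, the following sketch chains two jobs with a dependency so that the second starts only if the first finishes successfully, with an email marking the end of each stage. The script names stage1.slurm and stage2.slurm are hypothetical stand-ins; each would carry its own --mail-user and --mail-type directives.

# Hypothetical two-stage pipeline; each script sets its own mail options.
jobid=$(sbatch --parsable stage1.slurm)             # --parsable prints only the job ID
sbatch --dependency=afterok:"$jobid" stage2.slurm   # runs only if stage 1 succeeds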

squeue

squeue provides information about any job that has been submitted to the scheduler, so you can use it to check a job's status as soon as the job has been submitted. Once a job completes, whether it was successful or not, it disappears from squeue. We will discuss how to analyse a job after completion in the next section.

You can see the status of all jobs that you have personally submitted to the job scheduler using the squeue -u $USER option.

[username@arc matmul]$ squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
           8436360 single matmul_t username  R       1:09      1 cn056
           8436364 single,la matmul_t username PD       0:00      1 (Priority)

This provides a valuable overview of the jobs you are waiting on. If you are interested in the progress of a particular job, you can filter on job ID using the squeue -j JobID option.

[username@arc matmul]$ squeue -j 8436364
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
           8436364 single,la matmul_t username PD       0:00      1 (Priority) 
...time passes... and the job starts running
[username@arc matmul]$ squeue -j 8436364
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
           8436364 pawson-bf matmul_t username  R       0:09      1 fc105
...time passes... and the job completes
[username@arc matmul]$ squeue -j 8436364
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

The ST field reports the current state of the job. While the job waits in the queue, the state is PD (Pending), and the NODELIST(REASON) field lists the reason the job has not yet started. The most common reason is Priority, meaning that your job is not yet at the front of the queue; note that when a list of partitions is requested, this field is not always informative. Once the job has started, the state changes to R (Running) and the reason is replaced by the list of nodes the job is running on. Once the job completes, it no longer appears in squeue output.
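squeue also supports filtering and custom output formats that can make monitoring many jobs easier. A few standard options (see the squeue manual page for the full list):

squeue -u $USER -t PD                           # show only your pending jobs
squeue -j 8436364 --start                       # estimated start time for a pending job
squeue -u $USER -o "%.10i %.9P %.8T %.10M %R"   # custom columns: job ID, partition, state, elapsed time, reason/nodes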

arc.job-info

arc.job-info provides a snapshot of resource usage, sampled over a small window at the time it is run. It only returns results while the job is running and reports an error after the job completes, but the information it provides is very detailed. If a job is taking a very long time, arc.job-info can give you a sense of whether any work is actually being done. There are legitimate reasons for a job to show low CPU utilisation at some times and high utilisation at others; still, if your job has started running and is taking much longer than expected, arc.job-info can provide early warning that you need to add debugging output to your code (or run it in a debugging mode) to find out where it is stalling. arc.job-info offers some of the most fine-grained information available on the resources your job is using, short of a more technically complex performance analysis tool. It is also one of the most active monitoring methods: you must run the command each time you want a snapshot.

If you run arc.job-info repeatedly over the life of a job, you can build an informative time series of resource utilisation that gives a good sense of what is happening inside the job; a small polling sketch is given below. The sequence of calls to arc.job-info that follows it is annotated with narrative lines indicating how to interpret each snapshot relative to the events in the job script.
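The sketch below automates the repeated sampling, polling arc.job-info once per minute for as long as squeue still lists the job. The one-minute interval and the use of squeue as a liveness check are choices made for this illustration, not features of arc.job-info itself.

#!/bin/bash
# Poll arc.job-info for a job until it leaves the queue.
JOBID="$1"
while squeue -j "$JOBID" --noheader 2>/dev/null | grep -q . ; do
    arc.job-info "$JOBID"
    sleep 60
done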

1) Once the job begins and data is being loaded from storage into memory, our example job will show something like...

[username@arc matmul]$ arc.job-info 8436364
# ====================================================================================================
Job 8436364 'matmul_test02032021.slurm' by 'username' is 'RUNNING'.

  Work dir: /home/username/project/matmul
Job script: /home/username/project/matmul/matmul_test02032021.slurm

Partition: pawson-bf
    Nodes: 1
    Tasks: 1
     CPUs: 4

  Memmory: 10000.0 MB (per node)
         : 0.0 MB (per cpus)

Allocated node list:
    fc105
Batch host: fc105

Submitted on: 2021-02-03 12:49:23
  Started on: 2021-02-03 12:49:29

Time Limit: 2:00:00
 Wait Time: 0:00:06
  Run Time: 0:00:23

Collecting process data: fc105Warning: Permanently added the RSA host key for IP address '172.19.6.105' to the list of known hosts.


# ------------------------------------------------------------------------------------------
      Node                                                     Job  |  Node total       
            Procs  Threads     CPU%   Mem%      RSS Mb     VMem Mb  |    CPU %  Mem %
# ------------------------------------------------------------------------------------------
     fc105      2        2     90.2    0.6     1255.39     1489.91  |    364.0    0.6
# ------------------------------------------------------------------------------------------
     Total      2        2     90.2
Efficiency                     22.6   12.6
# ------------------------------------------------------------------------------------------

#Processes (#threads) on all nodes:

    1 (    1): /bin/bash
    1 (    1): python
    1 (    5): slurmstepd:

Total time: 2 sec.

2) ...time passes... and the matrix multiplication starts running

[username@arc matmul]$ arc.job-info 8436364
# ====================================================================================================
Job 8436364 'matmul_test02032021.slurm' by 'username' is 'RUNNING'.

  Work dir: /home/username/project/matmul
Job script: /home/username/project/matmul/matmul_test02032021.slurm

Partition: pawson-bf
    Nodes: 1
    Tasks: 1
     CPUs: 4

  Memmory: 10000.0 MB (per node)
         : 0.0 MB (per cpus)

Allocated node list:
    fc105
Batch host: fc105

Submitted on: 2021-02-03 12:49:23
  Started on: 2021-02-03 12:49:29

Time Limit: 2:00:00
 Wait Time: 0:00:06
  Run Time: 0:02:48

Collecting process data: fc105

# ------------------------------------------------------------------------------------------
      Node                                                     Job  |  Node total       
            Procs  Threads     CPU%   Mem%      RSS Mb     VMem Mb  |    CPU %  Mem %
# ------------------------------------------------------------------------------------------
     fc105      2        5    110.0    1.2     2476.28     2923.49  |    114.5    1.2
# ------------------------------------------------------------------------------------------
     Total      2        5    110.0
Efficiency                     27.5   24.8
# ------------------------------------------------------------------------------------------

#Processes (#threads) on all nodes:

    1 (    1): /bin/bash
    1 (    4): python
    1 (    5): slurmstepd:

Total time: 1 sec.

3) ...time passes... and the matrix multiplication completes and data is written from memory back to storage

...time passes... and the job completes

[username@arc matmul]$ arc.job-info 8436364
# ====================================================================================================
Something is wrong about the Job ID 8436364. Just cannot do it. Aborting.

The first execution of arc.job-info happens early in the job, while data is being loaded from files into memory; this is reflected in the low memory usage on the node. The second execution happens at peak memory usage, after the data has been loaded. The last execution happens after the job has completed, when no information is available. Although more timepoints would be needed to identify problems in the code, even these snapshots tell us something substantive about job progress and potential issues: the job is clearly not making optimal use of its memory and CPUs, even though it creates 4 threads in support of the matrix multiplication. Further analysis of aggregate and maximum resource utilisation is needed to determine what changes to make to the job script and the resource request. This is the focus of the Job Performance Analysis step, the subject of the next section.
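As a small preview of that step, aggregate and peak usage for a completed job can be queried from Slurm's accounting records with the standard sacct utility (assuming job accounting is enabled on the cluster):

sacct -j 8436364 --format=JobID,Elapsed,TotalCPU,MaxRSS,State   # elapsed time, total CPU time, peak resident memory, and final state per job step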