Job Monitoring
There are three methods to monitor the progress and status of your jobs. We recommend one of the following three options:

1. Use the --mail-user and --mail-type options with sbatch to email you about any changes to the status of your jobs.
2. Use the squeue utility to monitor the status of your jobs.
3. Use the arc.job-info utility. This utility provides information on your job performance, which can be used as part of the Job Performance Analysis stage where you need to optimise the efficiency of your jobs.
We will cover each of these options below in further detail.
sbatch --mail option
The mail options to sbatch can be used to send emails about job statuses but will not contain any performance information.

You can pass the mail options to the sbatch script along with the other parameters that are defined for your job. An example sbatch script with the mail options specified is given below:
#!/bin/bash
#SBATCH --mail-user=username@ucalgary.ca
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --partition=single,lattice,parallel,pawson-bf
#SBATCH --time=2:0:0
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=10000M
# use the Python installation from the user's Anaconda directory
export PATH=~/anaconda3/bin:$PATH
echo $(which python)

# cap the thread counts of the common numerical libraries
# to match the 4 CPUs requested with --cpus-per-task
export OMP_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=4
export MKL_NUM_THREADS=4
export VECLIB_MAXIMUM_THREADS=4
export NUMEXPR_NUM_THREADS=4

# input matrices and output file for the matrix multiplication test
AMAT="/home/username/project/matmul/A.csv"
BMAT="/home/username/project/matmul/B.csv"
OUT="/home/username/project/matmul/C.csv"

python matmul_test.py $AMAT $BMAT $OUT
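The script above is submitted to the scheduler with sbatch in the usual way. The file name below is just an illustration; use whatever you named your own job script. On success, sbatch replies with the JobID assigned to the job:

[username@arc matmul]$ sbatch matmul_test02032021.slurm
Submitted batch job 8442969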
Different choices of mail-type provide information about the different stages of progress for the job. For a detailed list of mail-type options, see the sbatch manual page. The BEGIN and END options used in the above example would produce two emails (with empty bodies) for a job with JobID 8442969: one with the title
Slurm Job_id=8442969 Name=matmul_test02032021.slurm Began, Queued time 00:00:05
and another
Slurm Job_id=8442969 Name=matmul_test02032021.slurm Ended, Run time 00:08:02, COMPLETED, ExitCode 0
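Other mail-type values cover different events. As an illustration (the exact set of supported values is listed on the sbatch manual page), the following directives would send mail only if the job fails or once it has used 80% of its time limit:

#SBATCH --mail-user=username@ucalgary.ca
#SBATCH --mail-type=FAIL
#SBATCH --mail-type=TIME_LIMIT_80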
These notification emails can overwhelm your inbox when used for every job in a large collection. However, they can be useful (especially when used in conjunction with job dependencies) in tracking the progress of collections of long-running jobs that may not start or end for many days. This option must be included in the original Slurm script submitted to the scheduler and can't be added later, so it is, in some sense, the first step in job monitoring. The information provided by this method is the coarsest of the three, but it is also the most passive tracking option and will function without any further action on your part.
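As an illustration of combining the mail options with job dependencies, the sketch below submits a hypothetical two-step pipeline; the second script (step2.slurm, a placeholder name) contains the mail directives, so a single email announces that the whole chain has finished.

# submit the first job; the --parsable flag makes sbatch print only the JobID
jid1=$(sbatch --parsable step1.slurm)
# the second job starts only if the first completed successfully;
# step2.slurm carries the --mail-user/--mail-type directives
sbatch --dependency=afterok:${jid1} step2.slurm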
squeue
squeue
provides information about any job that has been submitted to the scheduler. This means that as soon as a job has been submitted, you can use this command to get some information on the status of the job. After a job completes, it will disappear from squeue. This will happen whether the job was successful or not. We will discuss how to analyse the job after completion in the next section.

You can see the status of all jobs that you have personally submitted to the job scheduler using the squeue -u $USER option.
[username@arc matmul]$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8436360 single matmul_t username R 1:09 1 cn056
8436364 single,la matmul_t username PD 0:00 1 (Priority)
This provides a valuable overview of the jobs that we are waiting on. If you are interested in the progress of a particular job, you can filter on JobID using the squeue -j JobID option.
[username@arc matmul]$ squeue -j 8436364
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8436364 single,la matmul_t username PD 0:00 1 (Priority)
...time passes... and the job starts running
[username@arc matmul]$ squeue -j 8436364
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8436364 pawson-bf matmul_t username R 0:09 1 fc105
...time passes... and the job completes
[username@arc matmul]$ squeue -j 8436364
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
The field ST tells us the current state of the job. When the job is waiting in the queue, the state will display the value PD for Pending. While the job is Pending, the NODELIST(REASON) field will list the reason that the job is waiting to start. The most common reason is Priority, meaning that you are not at the front of the queue yet. When a partition list is used in the request, this field is not always useful. Once the job has started, the state will display the value R for Running and the REASON will be replaced by a list of the specific nodes that the job is currently running on. Once the job completes, it will no longer appear when squeue runs.
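Two other squeue options can be handy at this stage. A state filter limits the listing to jobs that are still waiting, and the --start flag asks the scheduler for its current estimate of when a pending job might begin (an estimate that can change as the queue evolves). A short sketch, reusing the JobID from the example above:

# list only your jobs that are still pending in the queue
squeue -u $USER -t PENDING
# show the scheduler's estimated start time for a specific pending job
squeue --start -j 8436364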
arc.job-info
arc.job-info
provides a snapshot of resource usage sampled in a small window at the time that it is run. It will only return results while the job is running and will show an error after it completes. However, the information is very detailed. If a job is taking a very long time, this can provide a sense of whether any work is being done. There are reasons why a job will have low CPU utilisation at some times and high utilisation at others. However, if your job has started running and is taking much longer than expected, arc.job-info can provide warning signs that you need to introduce debugging lines into your code (or run in a debugging mode) to get a handle on where the code is stalling. arc.job-info provides some of the most fine-grained information available on the resources utilised by your job (without using a more technically complex performance analysis tool). However, it is also one of the most active methods of interrogation in that you need to run the command each time that you want a snapshot.
If you run arc.job-info repeatedly over the life of the job, you can develop an informative resource utilisation time series that gives you a good sense of what is happening in the job. In the sequence of calls to arc.job-info below, we add narrative lines that indicate how to interpret each snapshot relative to the events in the job script.
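One way to build such a time series without sitting at the terminal is to call arc.job-info in a loop and append each snapshot to a log file. The sketch below is only illustrative: the five-minute interval and the log file name are arbitrary choices, and it assumes that arc.job-info returns a non-zero exit status once the job has finished (if it does not, simply stop the loop with Ctrl-C).

#!/bin/bash
# record periodic arc.job-info snapshots for one job (illustrative helper)
JOBID=8436364
LOG="job_${JOBID}_usage.log"
while true; do
    date >> "$LOG"
    # assumption: arc.job-info fails once the job is no longer running
    arc.job-info "$JOBID" >> "$LOG" 2>&1 || break
    sleep 300   # wait five minutes between snapshots
done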
1) Once the job begins and data is being loaded from storage into memory, our example job will show something like...
[username@arc matmul]$ arc.job-info 8436364
# ====================================================================================================
Job 8436364 'matmul_test02032021.slurm' by 'username' is 'RUNNING'.
Work dir: /home/username/project/matmul
Job script: /home/username/project/matmul/matmul_test02032021.slurm
Partition: pawson-bf
Nodes: 1
Tasks: 1
CPUs: 4
Memmory: 10000.0 MB (per node)
: 0.0 MB (per cpus)
Allocated node list:
fc105
Batch host: fc105
Submitted on: 2021-02-03 12:49:23
Started on: 2021-02-03 12:49:29
Time Limit: 2:00:00
Wait Time: 0:00:06
Run Time: 0:00:23
Collecting process data: fc105Warning: Permanently added the RSA host key for IP address '172.19.6.105' to the list of known hosts.
# ------------------------------------------------------------------------------------------
Node Job | Node total
Procs Threads CPU% Mem% RSS Mb VMem Mb | CPU % Mem %
# ------------------------------------------------------------------------------------------
fc105 2 2 90.2 0.6 1255.39 1489.91 | 364.0 0.6
# ------------------------------------------------------------------------------------------
Total 2 2 90.2
Efficiency 22.6 12.6
# ------------------------------------------------------------------------------------------
#Processes (#threads) on all nodes:
1 ( 1): /bin/bash
1 ( 1): python
1 ( 5): slurmstepd:
Total time: 2 sec.
2) ...time passes... and the matrix multiplication starts running
[username@arc matmul]$ arc.job-info 8436364
# ====================================================================================================
Job 8436364 'matmul_test02032021.slurm' by 'username' is 'RUNNING'.
Work dir: /home/username/project/matmul
Job script: /home/username/project/matmul/matmul_test02032021.slurm
Partition: pawson-bf
Nodes: 1
Tasks: 1
CPUs: 4
Memmory: 10000.0 MB (per node)
: 0.0 MB (per cpus)
Allocated node list:
fc105
Batch host: fc105
Submitted on: 2021-02-03 12:49:23
Started on: 2021-02-03 12:49:29
Time Limit: 2:00:00
Wait Time: 0:00:06
Run Time: 0:02:48
Collecting process data: fc105
# ------------------------------------------------------------------------------------------
Node Job | Node total
Procs Threads CPU% Mem% RSS Mb VMem Mb | CPU % Mem %
# ------------------------------------------------------------------------------------------
fc105 2 5 110.0 1.2 2476.28 2923.49 | 114.5 1.2
# ------------------------------------------------------------------------------------------
Total 2 5 110.0
Efficiency 27.5 24.8
# ------------------------------------------------------------------------------------------
#Processes (#threads) on all nodes:
1 ( 1): /bin/bash
1 ( 4): python
1 ( 5): slurmstepd:
Total time: 1 sec.
3) ...time passes... and the matrix multiplication completes and data is written from memory back to storage
...time passes... and the job completes
[username@arc matmul]$ arc.job-info 8436364
# ====================================================================================================
Something is wrong about the Job ID 8436364. Just cannot do it. Aborting.
The first execution of arc.job-info happens early in the job, when data is being loaded into variables from files. This is reflected in the low memory usage on the node. The second execution happens at peak memory usage, after the data is loaded into memory. The last execution happens after the job has completed, and no information is provided. Although more timepoints would be needed to analyse your code and identify problems, the snapshots that we have tell us something substantive about job progress and potential issues. Having seen this sequence of snapshots, it is clear that the job is not using memory and CPUs optimally, even though it is creating 4 threads in support of the matrix multiplication. Further analysis of aggregate and maximum resource utilisation is needed to determine the changes that should be made to the job script and the resource request. This is the focus of the Job Performance Analysis step that is the subject of the next section.