https://rcs.ucalgary.ca/index.php?title=Job_Monitoring&feed=atom&action=history
Job Monitoring - Revision history
2024-03-29T15:55:19Z
Revision history for this page on the wiki
MediaWiki 1.39.6
https://rcs.ucalgary.ca/index.php?title=Job_Monitoring&diff=2799&oldid=prev
Lleung: Added guides category
2023-09-21T20:54:19Z
<p>Added guides category</p>
<table style="background-color: #fff; color: #202122;" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="en">
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 20:54, 21 September 2023</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l189">Line 189:</td>
<td colspan="2" class="diff-lineno">Line 189:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The first execution of the job-info command happens early in the job when data is being loaded to variables from files. This is reflected in the low memory usage on the node. The second execution happens at peak memory usage after the data is loaded into memory. The last execution happens after the job has completed and no information is provided. Although more timepoints would be needed to analyse your code to identify problems, the snapshots that we have tell us something substantive about job progress and potential issues. Having seen this sequence of snapshots, it is clear that the job is not optimally using memory and CPUs, even though it is creating 4 threads in support of the matrix multiplication. Further analysis of aggregate and maximum resource utilisation is needed to determine changes that need to be made to the job script and the resource request. This is the focus of the Job Performance Analysis step that is the subject of the next section.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The first execution of the job-info command happens early in the job when data is being loaded to variables from files. This is reflected in the low memory usage on the node. The second execution happens at peak memory usage after the data is loaded into memory. The last execution happens after the job has completed and no information is provided. 
Although more timepoints would be needed to analyse your code to identify problems, the snapshots that we have tell us something substantive about job progress and potential issues. Having seen this sequence of snapshots, it is clear that the job is not optimally using memory and CPUs, even though it is creating 4 threads in support of the matrix multiplication. Further analysis of aggregate and maximum resource utilisation is needed to determine changes that need to be made to the job script and the resource request. This is the focus of the Job Performance Analysis step that is the subject of the next section.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br/></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br/></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">[[Category:ARC]]</ins></div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">[[Category:Guides]]</ins></div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>{{Navbox ARC}}</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>{{Navbox ARC}}</div></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;">[[Category:ARC]]</del></div></td><td colspan="2" class="diff-side-added"></td></tr>
</table>
Lleung
https://rcs.ucalgary.ca/index.php?title=Job_Monitoring&diff=2673&oldid=prev
Lleung: Added navbox
2023-09-20T22:22:01Z
<p>Added navbox</p>
<table style="background-color: #fff; color: #202122;" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="en">
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 22:22, 20 September 2023</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l189">Line 189:</td>
<td colspan="2" class="diff-lineno">Line 189:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The first execution of the job-info command happens early in the job when data is being loaded to variables from files. This is reflected in the low memory usage on the node. The second execution happens at peak memory usage after the data is loaded into memory. The last execution happens after the job has completed and no information is provided. Although more timepoints would be needed to analyse your code to identify problems, the snapshots that we have tell us something substantive about job progress and potential issues. Having seen this sequence of snapshots, it is clear that the job is not optimally using memory and CPUs, even though it is creating 4 threads in support of the matrix multiplication. Further analysis of aggregate and maximum resource utilisation is needed to determine changes that need to be made to the job script and the resource request. This is the focus of the Job Performance Analysis step that is the subject of the next section.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The first execution of the job-info command happens early in the job when data is being loaded to variables from files. This is reflected in the low memory usage on the node. The second execution happens at peak memory usage after the data is loaded into memory. The last execution happens after the job has completed and no information is provided. 
Although more timepoints would be needed to analyse your code to identify problems, the snapshots that we have tell us something substantive about job progress and potential issues. Having seen this sequence of snapshots, it is clear that the job is not optimally using memory and CPUs, even though it is creating 4 threads in support of the matrix multiplication. Further analysis of aggregate and maximum resource utilisation is needed to determine changes that need to be made to the job script and the resource request. This is the focus of the Job Performance Analysis step that is the subject of the next section.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br/></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br/></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">{{Navbox ARC}}</ins></div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>[[Category:ARC]]</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>[[Category:ARC]]</div></td></tr>
</table>
Lleung
https://rcs.ucalgary.ca/index.php?title=Job_Monitoring&diff=2631&oldid=prev
Lleung: proof read and formatting
2023-09-13T19:55:17Z
<p>proof read and formatting</p>
<table style="background-color: #fff; color: #202122;" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="en">
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 19:55, 13 September 2023</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l1">Line 1:</td>
<td colspan="2" class="diff-lineno">Line 1:</td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;">The third step in this work flow is checking </del>the progress of <del style="font-weight: bold; text-decoration: none;">the job</del>. We <del style="font-weight: bold; text-decoration: none;">will emphasize </del>three <del style="font-weight: bold; text-decoration: none;">tools</del>: </div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">There are three methods to monitor </ins>the progress <ins style="font-weight: bold; text-decoration: none;">and status </ins>of <ins style="font-weight: bold; text-decoration: none;">your jobs</ins>. We <ins style="font-weight: bold; text-decoration: none;">recommend one of the </ins>three <ins style="font-weight: bold; text-decoration: none;">options</ins>: </div></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;"><ol> </del></div></td><td colspan="2" class="diff-side-added"></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;"> <li><code>--mail-user</code> , <code>--mail-type</code> </del></div></td><td colspan="2" class="diff-side-added"></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;"> <ol></del></div></td><td colspan="2" class="diff-side-added"></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;"> <li>sbatch options for sending email about job progress</li></del></div></td><td colspan="2" class="diff-side-added"></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;"> <li>no performance information</li></del></div></td><td colspan="2" class="diff-side-added"></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;"> </ol></del></div></td><td colspan="2" class="diff-side-added"></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;"> </li></del></div></td><td colspan="2" class="diff-side-added"></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;"> <li><code>squeue</code></del></div></td><td colspan="2" class="diff-side-added"></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;"> <ol></del></div></td><td colspan="2" class="diff-side-added"></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;"> <li>slurm tool for monitoring job start, running, and end</li></del></div></td><td colspan="2" class="diff-side-added"></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;"> <li>no performance information</li></del></div></td><td colspan="2" class="diff-side-added"></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;"> </ol></del></div></td><td colspan="2" class="diff-side-added"></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;"> </li> </del></div></td><td colspan="2" class="diff-side-added"></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;"> <li><code>arc.job-info</code></del></div></td><td colspan="2" class="diff-side-added"></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;"> <ol></del></div></td><td colspan="2" class="diff-side-added"></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;"> <li>RCS tool for monitoring job performance once it is running</li></del></div></td><td colspan="2" class="diff-side-added"></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;"> <li>provides detailed snapshot of key performance information</li></del></div></td><td colspan="2" class="diff-side-added"></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;"> </ol></del></div></td><td colspan="2" class="diff-side-added"></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;"> </li></del></div></td><td colspan="2" class="diff-side-added"></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;"></ol></del></div></td><td colspan="2" class="diff-side-added"></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br/></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br/></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"># Use the <code>--mail-user</code> and <code>--mail-type</code> options with <code>sbatch</code> to receive email about any changes to the status of your jobs.</ins></div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"># Use the <code>squeue</code> utility to monitor the status of your jobs.</ins></div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"># Use the <code>arc.job-info</code> utility. This utility provides information on your job performance which can be used as part of the Job Performance Analysis stage where you need to optimise the efficiency of your jobs.</ins></div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br/></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br/></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;">The motives for using the different monitoring tools are very different but they all help you track the progress of your job from start to finish. <code>arc.job-info</code> also provides snapshots </del>of <del style="font-weight: bold; text-decoration: none;">performance that can be used </del>in <del style="font-weight: bold; text-decoration: none;">the Job Performance Analysis stage, so we will talk about this last.</del></div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">We will cover each </ins>of <ins style="font-weight: bold; text-decoration: none;">these options below </ins>in <ins style="font-weight: bold; text-decoration: none;">further detail</ins>.</div></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;">===--mail-user===</del></div></td><td colspan="2" class="diff-side-added"></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;">Mail-user is a slurm option that can be included as an sbatch directive</del>. <del style="font-weight: bold; text-decoration: none;">If we were to modify our script to include it, it would look something like </del></div></td><td colspan="2" class="diff-side-added"></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br/></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br/></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;">'''matmul_test02032021</del>.<del style="font-weight: bold; text-decoration: none;">slurm</del>:<del style="font-weight: bold; text-decoration: none;">'''</del></div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">= sbatch --mail option =</ins></div></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><source lang="bash"></div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">The mail options to <code>sbatch</code> can be used to send emails about job status changes, but these emails will not contain any performance information</ins>. </div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div> </div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">You can specify the mail options in your <code>sbatch</code> script alongside the other parameters defined for your job. An example <code>sbatch</code> script with the mail options specified is given below</ins>: <source lang="bash"></div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>#!/bin/bash</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>#!/bin/bash</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>#SBATCH --mail-user=username@ucalgary.ca</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>#SBATCH --mail-user=username@ucalgary.ca</div></td></tr>
<tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l62">Line 62:</td>
<td colspan="2" class="diff-lineno">Line 45:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Slurm Job_id=8442969 Name=matmul_test02032021.slurm Ended, Run time 00:08:02, COMPLETED, ExitCode 0</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>Slurm Job_id=8442969 Name=matmul_test02032021.slurm Ended, Run time 00:08:02, COMPLETED, ExitCode 0</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div></pre></div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div></pre></div></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>These emails can overwhelm your inbox when used for every job in a large collection. However, they can be useful (especially when used in conjunction with job dependencies) in tracking the progress of collections of long-running jobs that may not start or end for many days. This option must be included in the original <del style="font-weight: bold; text-decoration: none;">slurm </del>script submitted to the scheduler and can't be added later, so it is, in some sense, the first step in job monitoring. The information provided by this method is the coarsest of the three options, but it is also the most passive tracking option and will function without any further user action. </div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>These emails can overwhelm your inbox when used for every job in a large collection. However, they can be useful (especially when used in conjunction with job dependencies) in tracking the progress of collections of long-running jobs that may not start or end for many days. This option must be included in the original <ins style="font-weight: bold; text-decoration: none;">Slurm </ins>script submitted to the scheduler and can't be added later, so it is, in some sense, the first step in job monitoring. The information provided by this method is the coarsest of the three options, but it is also the most passive tracking option and will function without any further user action. </div></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;"> </del></div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div> </div></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;">==</del>=squeue<del style="font-weight: bold; text-decoration: none;">==</del>=</div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>= squeue =</div></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><code>squeue</code> provides information about any job that has been submitted to the scheduler. This means that as soon as a job has been submitted, you can use this command to get some information on the status of the job. After a job completes, it will disappear from squeue. This will happen whether the job was successful or not. We will discuss how to <del style="font-weight: bold; text-decoration: none;">analyze </del>the job after completion in the next section. </div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><code>squeue</code> provides information about any job that has been submitted to the scheduler. This means that as soon as a job has been submitted, you can use this command to get some information on the status of the job. After a job completes, it will disappear from <ins style="font-weight: bold; text-decoration: none;"><code></ins>squeue<ins style="font-weight: bold; text-decoration: none;"></code></ins>. This will happen whether the job was successful or not. We will discuss how to <ins style="font-weight: bold; text-decoration: none;">analyse </ins>the job after completion in the next section. </div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br/></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br/></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>You can see the status of all jobs that you have personally submitted to the job scheduler using the <code>squeue -u $USER</code> option.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>You can see the status of all jobs that you have personally submitted to the job scheduler using the <code>squeue -u $USER</code> option.</div></td></tr>
<tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l90">Line 90:</td>
<td colspan="2" class="diff-lineno">Line 73:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div></source></div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div></source></div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br/></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br/></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The field <code>ST</code> tells us the current state of the job. When the job is waiting in the queue, the state will display the value PD for Pending. When the job is Pending, the <code>NODELIST(REASON)</code> field will list the reason that the job is waiting to start. The most common reason is Priority, meaning that you are not at the front of the queue yet. When a partition list is used in the request, this field is not always useful. Once the jobs has started, the state will display the value R for Running and the REASON will be replaced by a list of specific nodes that the job is currently running on. Once the job completes, it will no longer appear when squeue runs. </div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The field <code>ST</code> tells us the current state of the job. When the job is waiting in the queue, the state will display the value PD for Pending. When the job is Pending, the <code>NODELIST(REASON)</code> field will list the reason that the job is waiting to start. The most common reason is Priority, meaning that you are not at the front of the queue yet. When a partition list is used in the request, this field is not always useful. Once the jobs has started, the state will display the value R for Running and the REASON will be replaced by a list of specific nodes that the job is currently running on. Once the job completes, it will no longer appear when <ins style="font-weight: bold; text-decoration: none;"><code></ins>squeue<ins style="font-weight: bold; text-decoration: none;"></code> </ins>runs. 
</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br/></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br/></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;">==</del>=arc.job-info<del style="font-weight: bold; text-decoration: none;">==</del>=</div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>= arc.job-info =</div></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><code>arc.job-info</code> provides a snapshot of resource usage sampled in a small window at the time that it is run. It will only return results while the job is running and will show an error after it completes. However, the information is very detailed. If a job is taking a very long time, this can provide a sense of whether any work is being done. There are reasons why a job will have low CPU <del style="font-weight: bold; text-decoration: none;">utilization </del>at some times and high <del style="font-weight: bold; text-decoration: none;">utilization </del>at others. However, if your job has started running and is taking much longer than expected, arc.job-info can provide warning signs that you need to introduce debugging lines into your code (or run in a debugging mode) to get a handle on where the code is stalling. arc.job-info provides some of the most fine-grained information available on the resources <del style="font-weight: bold; text-decoration: none;">utilized </del>by your job (without using a more technically complex performance analysis tool). However, it is also one of the most active methods of interrogation in that you need to run the command each time that you want a snapshot. </div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><code>arc.job-info</code> provides a snapshot of resource usage sampled in a small window at the time that it is run. It will only return results while the job is running and will show an error after it completes. However, the information is very detailed. 
If a job is taking a very long time, this can provide a sense of whether any work is being done. There are reasons why a job will have low CPU <ins style="font-weight: bold; text-decoration: none;">utilisation </ins>at some times and high <ins style="font-weight: bold; text-decoration: none;">utilisation </ins>at others. However, if your job has started running and is taking much longer than expected, arc.job-info can provide warning signs that you need to introduce debugging lines into your code (or run in a debugging mode) to get a handle on where the code is stalling. arc.job-info provides some of the most fine-grained information available on the resources <ins style="font-weight: bold; text-decoration: none;">utilised </ins>by your job (without using a more technically complex performance analysis tool). However, it is also one of the most active methods of interrogation in that you need to run the command each time that you want a snapshot. </div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br/></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br/></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>If you run arc.job-info repeatedly over the life of the job, you can develop an informative resource <del style="font-weight: bold; text-decoration: none;">utilization </del>time series that gives you a good sense of what is happening in the job. In the sequence of calls to arc.job-info below, I annotate them with some narrative lines that indicate how to interpret them relative to the events in the job script. </div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>If you run arc.job-info repeatedly over the life of the job, you can develop an informative resource <ins style="font-weight: bold; text-decoration: none;">utilisation </ins>time series that gives you a good sense of what is happening in the job. In the sequence of calls to arc.job-info below, I annotate them with some narrative lines that indicate how to interpret them relative to the events in the job script. </div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br/></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br/></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>1) Once the job begins and data is being loaded from storage into memory, our example job will show something like...</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>1) Once the job begins and data is being loaded from storage into memory, our example job will show something like...</div></td></tr>
<tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l204">Line 204:</td>
<td colspan="2" class="diff-lineno">Line 187:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div></source></div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div></source></div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br/></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br/></td></tr>
<tr><td class="diff-marker" data-marker="−"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The first execution of the job-info command happens early in the job when data is being loaded to variables from files. This is reflected in the low memory usage on the node. The second execution happens at peak memory usage after the data is loaded into memory. The last execution happens after the job has completed and no information is provided. Although more timepoints would be needed to <del style="font-weight: bold; text-decoration: none;">analyze </del>your code to identify problems, the snapshots that we have tell us something substantive about job progress and potential issues. Having seen this sequence of snapshots, it is clear that the job is not optimally using memory and <del style="font-weight: bold; text-decoration: none;">cpus</del>, even though it is creating 4 threads in support of the matrix multiplication. Further analysis of aggregate and maximum resource <del style="font-weight: bold; text-decoration: none;">utilization </del>is needed to determine changes that need to be made to the job script and the resource request. This is the focus of the Job Performance Analysis step that is the subject of the next section.</div></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The first execution of the job-info command happens early in the job when data is being loaded to variables from files. This is reflected in the low memory usage on the node. The second execution happens at peak memory usage after the data is loaded into memory. The last execution happens after the job has completed and no information is provided. 
Although more timepoints would be needed to <ins style="font-weight: bold; text-decoration: none;">analyse </ins>your code to identify problems, the snapshots that we have tell us something substantive about job progress and potential issues. Having seen this sequence of snapshots, it is clear that the job is not optimally using memory and <ins style="font-weight: bold; text-decoration: none;">CPUs</ins>, even though it is creating 4 threads in support of the matrix multiplication. Further analysis of aggregate and maximum resource <ins style="font-weight: bold; text-decoration: none;">utilisation </ins>is needed to determine changes that need to be made to the job script and the resource request. This is the focus of the Job Performance Analysis step that is the subject of the next section.</div></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br/></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br/></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>[[Category:ARC]]</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>[[Category:ARC]]</div></td></tr>
</table>Lleunghttps://rcs.ucalgary.ca/index.php?title=Job_Monitoring&diff=2630&oldid=prevLleung: Added ARC category2023-09-13T16:56:01Z<p>Added ARC category</p>
<table style="background-color: #fff; color: #202122;" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="en">
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 16:56, 13 September 2023</td>
</tr><tr><td colspan="2" class="diff-lineno" id="mw-diff-left-l205">Line 205:</td>
<td colspan="2" class="diff-lineno">Line 205:</td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br/></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br/></td></tr>
<tr><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The first execution of the job-info command happens early in the job when data is being loaded to variables from files. This is reflected in the low memory usage on the node. The second execution happens at peak memory usage after the data is loaded into memory. The last execution happens after the job has completed and no information is provided. Although more timepoints would be needed to analyze your code to identify problems, the snapshots that we have tell us something substantive about job progress and potential issues. Having seen this sequence of snapshots, it is clear that the job is not optimally using memory and cpus, even though it is creating 4 threads in support of the matrix multiplication. Further analysis of aggregate and maximum resource utilization is needed to determine changes that need to be made to the job script and the resource request. This is the focus of the Job Performance Analysis step that is the subject of the next section.</div></td><td class="diff-marker"></td><td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The first execution of the job-info command happens early in the job when data is being loaded to variables from files. This is reflected in the low memory usage on the node. The second execution happens at peak memory usage after the data is loaded into memory. The last execution happens after the job has completed and no information is provided. 
Although more timepoints would be needed to analyze your code to identify problems, the snapshots that we have tell us something substantive about job progress and potential issues. Having seen this sequence of snapshots, it is clear that the job is not optimally using memory and cpus, even though it is creating 4 threads in support of the matrix multiplication. Further analysis of aggregate and maximum resource utilization is needed to determine changes that need to be made to the job script and the resource request. This is the focus of the Job Performance Analysis step that is the subject of the next section.</div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td colspan="2" class="diff-side-deleted"></td><td class="diff-marker" data-marker="+"></td><td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">[[Category:ARC]]</ins></div></td></tr>
</table>Lleunghttps://rcs.ucalgary.ca/index.php?title=Job_Monitoring&diff=1134&oldid=prevIan.percel: first Draft2021-02-09T22:57:09Z<p>first Draft</p>
<p><b>New page</b></p><div>The third step in this workflow is checking the progress of the job. We will emphasize three tools: <br />
<ol> <br />
<li><code>--mail-user</code> , <code>--mail-type</code> <br />
<ol><br />
<li>sbatch options for sending email about job progress</li><br />
<li>no performance information</li><br />
</ol><br />
</li><br />
<li><code>squeue</code><br />
<ol><br />
<li>Slurm tool for monitoring when a job starts, runs, and ends</li><br />
<li>no performance information</li><br />
</ol><br />
</li> <br />
<li><code>arc.job-info</code><br />
<ol><br />
<li>RCS tool for monitoring job performance once it is running</li><br />
<li>provides detailed snapshot of key performance information</li><br />
</ol><br />
</li><br />
</ol><br />
<br />
<br />
These tools serve different purposes, but they all help you track the progress of your job from start to finish. <code>arc.job-info</code> also provides snapshots of performance that can be used in the Job Performance Analysis stage, so we will talk about it last.<br />
===--mail-user===<br />
<code>--mail-user</code> is a Slurm option that can be included as an sbatch directive. If we modify our script to include it, the script looks something like this: <br />
<br />
'''matmul_test02032021.slurm:'''<br />
<source lang="bash"><br />
#!/bin/bash<br />
#SBATCH --mail-user=username@ucalgary.ca<br />
#SBATCH --mail-type=BEGIN<br />
#SBATCH --mail-type=END<br />
#SBATCH --partition=single,lattice,parallel,pawson-bf <br />
#SBATCH --time=2:0:0 <br />
#SBATCH --nodes=1 <br />
#SBATCH --ntasks=1 <br />
#SBATCH --cpus-per-task=4 <br />
#SBATCH --mem=10000M<br />
<br />
export PATH=~/anaconda3/bin:$PATH<br />
which python<br />
export OMP_NUM_THREADS=4<br />
export OPENBLAS_NUM_THREADS=4<br />
export MKL_NUM_THREADS=4<br />
export VECLIB_MAXIMUM_THREADS=4<br />
export NUMEXPR_NUM_THREADS=4<br />
<br />
AMAT="/home/username/project/matmul/A.csv"<br />
BMAT="/home/username/project/matmul/B.csv"<br />
OUT="/home/username/project/matmul/C.csv"<br />
<br />
python matmul_test.py "$AMAT" "$BMAT" "$OUT" <br />
</source><br />
<br />
Different choices of <code>mail-type</code> provide information about different stages of the job's progress. For a detailed list of <code>mail-type</code> options, see the sbatch manual page. The BEGIN and END options used in the above example would produce two emails (each with an empty body) for a job with JobID 8442969: one with the title <br />
<pre><br />
Slurm Job_id=8442969 Name=matmul_test02032021.slurm Began, Queued time 00:00:05<br />
</pre><br />
and another<br />
<pre><br />
Slurm Job_id=8442969 Name=matmul_test02032021.slurm Ended, Run time 00:08:02, COMPLETED, ExitCode 0<br />
</pre><br />
These emails can overwhelm your inbox when used for every job in a large collection. However, they can be useful (especially when used in conjunction with job dependencies) in tracking the progress of collections of long-running jobs that may not start or end for many days. This option must be included in the original Slurm script submitted to the scheduler and can't be added later, so it is, in some sense, the first step in job monitoring. The information provided by this method is the coarsest of the three tools, but it is also the most passive tracking option: it works without any further action from the user. <br />
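<br />
The two <code>--mail-type</code> lines in the script can also be written as a single directive, since the sbatch manual page describes the value as a comma-separated list of event types. A minimal sketch, assuming you also want an email if the job fails (<code>FAIL</code> is one of the documented types, and the address is the same placeholder used above): <br />
<source lang="bash">
#SBATCH --mail-user=username@ucalgary.ca
#SBATCH --mail-type=BEGIN,END,FAIL
</source><br />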
<br />
===squeue===<br />
<code>squeue</code> provides information about any job that has been submitted to the scheduler. As soon as a job has been submitted, you can use this command to get some information on its status. After a job completes, it disappears from the <code>squeue</code> output, whether the job was successful or not. We will discuss how to analyze the job after completion in the next section. <br />
<br />
You can see the status of all jobs that you have personally submitted to the job scheduler using the <code>squeue -u $USER</code> option.<br />
<source lang="console"><br />
[username@arc matmul]$ squeue -u $USER<br />
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) <br />
8436360 single matmul_t username R 1:09 1 cn056<br />
8436364 single,la matmul_t username PD 0:00 1 (Priority) <br />
</source><br />
<br />
This provides a valuable overview of the jobs that we are waiting on. If you are interested in the progress of a particular job, you can filter on JobID using the <code>squeue -j JobID</code> option. <br />
<br />
<source lang="console"><br />
[username@arc matmul]$ squeue -j 8436364<br />
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) <br />
8436364 single,la matmul_t username PD 0:00 1 (Priority) <br />
...time passes... and the job starts running<br />
[username@arc matmul]$ squeue -j 8436364<br />
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) <br />
8436364 pawson-bf matmul_t username R 0:09 1 fc105<br />
...time passes... and the job completes<br />
[username@arc matmul]$ squeue -j 8436364<br />
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) <br />
</source><br />
<br />
The <code>ST</code> field tells us the current state of the job. While the job is waiting in the queue, the state is <code>PD</code> (Pending), and the <code>NODELIST(REASON)</code> field lists the reason that the job is waiting to start. The most common reason is Priority, meaning that you are not at the front of the queue yet. When a partition list is used in the request, this field is not always useful. Once the job has started, the state changes to <code>R</code> (Running) and the reason is replaced by the list of nodes that the job is running on. Once the job completes, it no longer appears in the <code>squeue</code> output. <br />
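<br />
When many jobs are queued, the state column can be summarised rather than read line by line. The following is a minimal sketch, not an RCS-provided tool: <code>summarise_states</code> is a hypothetical helper name, while <code>-h</code> (no header) and <code>-o "%t"</code> (print only the compact state code) are standard <code>squeue</code> options. <br />
<source lang="bash">
# Count your jobs by state code (PD, R, ...).
# summarise_states is a hypothetical helper name, not a Slurm or RCS command.
summarise_states() {
    # -h drops the header line; -o "%t" prints only the compact state code
    squeue -h -u "$USER" -o "%t" | sort | uniq -c
}
</source><br />
Calling <code>summarise_states</code> prints one line per state code, each preceded by the number of your jobs currently in that state. <br />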
<br />
===arc.job-info===<br />
<code>arc.job-info</code> provides a snapshot of resource usage sampled over a small window at the time it is run. It only returns results while the job is running and shows an error after the job completes, but the information it returns is very detailed. If a job is taking a very long time, this can give a sense of whether any work is being done. A job can legitimately have low CPU utilization at some times and high utilization at others. However, if your job has started running and is taking much longer than expected, <code>arc.job-info</code> can provide warning signs that you need to introduce debugging lines into your code (or run in a debugging mode) to find where the code is stalling. <code>arc.job-info</code> provides some of the most fine-grained information available on the resources used by your job without resorting to a more technically complex performance analysis tool. It is also one of the most active methods of interrogation, in that you need to run the command each time you want a snapshot. <br />
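<br />
Because each snapshot needs its own call, the calls can be scripted. The following is a minimal sketch under stated assumptions, not an RCS-provided tool: <code>monitor_job</code>, its arguments, the 300-second default interval, and the log file name are all hypothetical, and <code>arc.job-info</code> is assumed to take the job ID as its only argument. The loop relies on <code>squeue -h -j JobID -t RUNNING</code> printing nothing once the job is no longer running. <br />
<source lang="bash">
# Append a timestamped arc.job-info snapshot to a log file at a fixed interval.
# monitor_job and its arguments are hypothetical names for this sketch.
monitor_job() {
    local jobid="$1"
    local interval="${2:-300}"                  # seconds between snapshots
    local logfile="${3:-jobinfo_${jobid}.log}"
    # squeue -h prints no header, so empty output means the job is not running
    while [ -n "$(squeue -h -j "$jobid" -t RUNNING 2>/dev/null)" ]; do
        { date; arc.job-info "$jobid"; } >> "$logfile" 2>&1
        sleep "$interval"
    done
}
</source><br />
For example, <code>monitor_job 8436364 600 &</code> would record a snapshot every ten minutes until the job leaves the running state. A generous interval is probably sensible, since every call collects process data from the compute node. <br />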
<br />
If you run arc.job-info repeatedly over the life of the job, you can develop an informative resource utilization time series that gives you a good sense of what is happening in the job. In the sequence of calls to arc.job-info below, I annotate them with some narrative lines that indicate how to interpret them relative to the events in the job script. <br />
<br />
1) Once the job begins and data is being loaded from storage into memory, our example job will show something like...<br />
<source lang="console"><br />
[username@arc matmul]$ arc.job-info 8436364<br />
# ====================================================================================================<br />
Job 8436364 'matmul_test02032021.slurm' by 'username' is 'RUNNING'.<br />
<br />
Work dir: /home/username/project/matmul<br />
Job script: /home/username/project/matmul/matmul_test02032021.slurm<br />
<br />
Partition: pawson-bf<br />
Nodes: 1<br />
Tasks: 1<br />
CPUs: 4<br />
<br />
Memmory: 10000.0 MB (per node)<br />
: 0.0 MB (per cpus)<br />
<br />
Allocated node list:<br />
fc105<br />
Batch host: fc105<br />
<br />
Submitted on: 2021-02-03 12:49:23<br />
Started on: 2021-02-03 12:49:29<br />
<br />
Time Limit: 2:00:00<br />
Wait Time: 0:00:06<br />
Run Time: 0:00:23<br />
<br />
Collecting process data: fc105Warning: Permanently added the RSA host key for IP address '172.19.6.105' to the list of known hosts.<br />
<br />
<br />
# ------------------------------------------------------------------------------------------<br />
Node Job | Node total <br />
Procs Threads CPU% Mem% RSS Mb VMem Mb | CPU % Mem %<br />
# ------------------------------------------------------------------------------------------<br />
fc105 2 2 90.2 0.6 1255.39 1489.91 | 364.0 0.6<br />
# ------------------------------------------------------------------------------------------<br />
Total 2 2 90.2<br />
Efficiency 22.6 12.6<br />
# ------------------------------------------------------------------------------------------<br />
<br />
#Processes (#threads) on all nodes:<br />
<br />
1 ( 1): /bin/bash<br />
1 ( 1): python<br />
1 ( 5): slurmstepd:<br />
<br />
Total time: 2 sec.<br />
</source><br />
2) ...time passes... and the matrix multiplication starts running<br />
<source lang="console"><br />
[username@arc matmul]$ arc.job-info 8436364<br />
# ====================================================================================================<br />
Job 8436364 'matmul_test02032021.slurm' by 'username' is 'RUNNING'.<br />
<br />
Work dir: /home/username/project/matmul<br />
Job script: /home/username/project/matmul/matmul_test02032021.slurm<br />
<br />
Partition: pawson-bf<br />
Nodes: 1<br />
Tasks: 1<br />
CPUs: 4<br />
<br />
Memmory: 10000.0 MB (per node)<br />
: 0.0 MB (per cpus)<br />
<br />
Allocated node list:<br />
fc105<br />
Batch host: fc105<br />
<br />
Submitted on: 2021-02-03 12:49:23<br />
Started on: 2021-02-03 12:49:29<br />
<br />
Time Limit: 2:00:00<br />
Wait Time: 0:00:06<br />
Run Time: 0:02:48<br />
<br />
Collecting process data: fc105<br />
<br />
# ------------------------------------------------------------------------------------------<br />
Node Job | Node total <br />
Procs Threads CPU% Mem% RSS Mb VMem Mb | CPU % Mem %<br />
# ------------------------------------------------------------------------------------------<br />
fc105 2 5 110.0 1.2 2476.28 2923.49 | 114.5 1.2<br />
# ------------------------------------------------------------------------------------------<br />
Total 2 5 110.0<br />
Efficiency 27.5 24.8<br />
# ------------------------------------------------------------------------------------------<br />
<br />
#Processes (#threads) on all nodes:<br />
<br />
1 ( 1): /bin/bash<br />
1 ( 4): python<br />
1 ( 5): slurmstepd:<br />
<br />
Total time: 1 sec.<br />
<br />
</source><br />
3) ...time passes... and the matrix multiplication completes and data is written from memory back to storage<br />
<br />
...time passes... and the job completes<br />
<source lang="console"><br />
[username@arc matmul]$ arc.job-info 8436364<br />
# ====================================================================================================<br />
Something is wrong about the Job ID 8436364. Just cannot do it. Aborting.<br />
</source><br />
<br />
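The <code>Efficiency</code> row in the two snapshots above is consistent with simple ratios: the job's CPU% divided by the number of allocated CPUs, and its RSS divided by the requested memory. This reading is inferred from the numbers shown rather than taken from the tool's documentation, and the helper name <code>efficiency</code> is illustrative. A short sketch that reproduces the arithmetic:<br />

```shell
# efficiency JOB_CPU_PERCENT ALLOCATED_CPUS RSS_MB REQUESTED_MB
# Assumed formulas (inferred from the snapshots, not from documentation):
#   CPU efficiency    = job CPU% / allocated CPUs
#   memory efficiency = 100 * RSS / requested memory
efficiency() {
    awk -v cpu="$1" -v ncpu="$2" -v rss="$3" -v req="$4" 'BEGIN {
        printf "CPU efficiency: %.1f%%\n", cpu / ncpu
        printf "Memory efficiency: %.1f%%\n", 100 * rss / req
    }'
}

# Values copied from the first and second snapshots.
efficiency 90.2 4 1255.39 10000.0
efficiency 110.0 4 2476.28 10000.0
```

With the values from the snapshots, this gives 22.6% and 12.6% for the first call and 27.5% and 24.8% for the second, matching the <code>Efficiency</code> rows.<br />
<br />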
The first execution of arc.job-info happens early in the job, while data is being loaded from files into variables. This is reflected in the low memory usage on the node (1255.39 MB RSS, 0.6% of node memory). The second execution happens at peak memory usage, after the data has been loaded into memory (2476.28 MB RSS). The last execution happens after the job has completed, so arc.job-info returns only an error. Although more timepoints would be needed to analyze your code and identify problems, the snapshots that we have tell us something substantive about job progress and potential issues. From this sequence of snapshots, it is clear that the job is not making good use of the memory and CPUs it requested, even though it creates 4 threads for the matrix multiplication: the reported CPU efficiency is only 22.6% and 27.5% of the 4 allocated CPUs, and memory efficiency reaches only 24.8% of the 10000 MB requested. Further analysis of aggregate and maximum resource utilization is needed to determine what changes should be made to the job script and the resource request. This is the focus of the Job Performance Analysis step, which is the subject of the next section.</div>Ian.percel