<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://rcs.ucalgary.ca/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Dschulz</id>
	<title>RCSWiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://rcs.ucalgary.ca/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Dschulz"/>
	<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/Special:Contributions/Dschulz"/>
	<updated>2026-04-05T09:46:52Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.43.3</generator>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=MARC_Cluster_Status&amp;diff=4041</id>
		<title>MARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=MARC_Cluster_Status&amp;diff=4041"/>
		<updated>2026-02-23T22:18:20Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Cluster Status&lt;br /&gt;
| cluster = MARC&lt;br /&gt;
| status = green&lt;br /&gt;
| title = Cluster operational&lt;br /&gt;
| message = See the [[MARC Cluster Status]] page for system notices. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 23, 2023, the MARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The MARC login node will reboot on the morning of January 23. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 27.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following has been changed:&lt;br /&gt;
* OS Updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on MARC Login Node&lt;br /&gt;
| date = 2023/06/23&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may experience an error when&lt;br /&gt;
running on the MARC login node. If apptainer complains that a system&lt;br /&gt;
administrator needs to enable user namespaces, simply run your&lt;br /&gt;
containers inside a job.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to a security vulnerability that will be&lt;br /&gt;
patched soon.&lt;br /&gt;
}}&lt;br /&gt;
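The workaround in the notice above ("run your containers inside a job") amounts to submitting the apptainer command through Slurm. A minimal sketch of such a batch script follows; the job name, resource requests, and image filename are illustrative assumptions, not values from the notice:

```shell
#!/bin/bash
# Hypothetical Slurm batch script: runs an Apptainer container on a
# compute node, where user namespaces are available, instead of the
# login node, where they are temporarily disabled.
#SBATCH --job-name=apptainer-demo   # illustrative name
#SBATCH --time=00:10:00             # illustrative time limit
#SBATCH --mem=4G
#SBATCH --cpus-per-task=1

# mycontainer.sif is a placeholder image path.
apptainer exec mycontainer.sif echo "container ran inside a job"
```

Submitted with `sbatch` (e.g. `sbatch run_container.sh`), the same command that fails on the login node runs normally inside the allocated job.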
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC/ARC cluster on &lt;br /&gt;
November 16 and 17, 2023. To facilitate this, we will be throttling &lt;br /&gt;
down the number of jobs on both clusters while the upgrades are &lt;br /&gt;
performed.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = OS Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2024/09/11&lt;br /&gt;
| message =&lt;br /&gt;
MARC will be going down for OS upgrades on 2024/Sep/16. The cluster &lt;br /&gt;
will be unavailable temporarily to complete this work. Please contact&lt;br /&gt;
support@hpc.ucalgary.ca if you have any questions or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The MARC login node will be rebooted on Tuesday December 17 for scheduled maintenance. It will be down for a few minutes and return shortly. Job scheduling and jobs running on the cluster will not be affected. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The MARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Please be informed that our support email address (support@hpc.ucalgary.ca) for RCS is currently not working. We are working to restore it as soon as possible. Please keep an eye on this space for updates. The clusters are operating normally, but support will not receive your messages at this time. We will begin responding as soon as the address is restored. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired and RCS can be contacted there. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = MARC Scheduled File System Maintenance&lt;br /&gt;
| date = 2025/06/09&lt;br /&gt;
| message = Please be advised MARC will be going down for a period of approximately 2 hours starting at 10 AM June 17, 2025. Logins will not be available and no jobs will be running during this window. &lt;br /&gt;
&lt;br /&gt;
Please send any questions or concerns to support@hpc.ucalgary.ca. Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = MARC Maintenance Complete&lt;br /&gt;
| date = 2025/06/17&lt;br /&gt;
| message = Filesystem maintenance complete.&lt;br /&gt;
&lt;br /&gt;
Please send any questions or concerns to support@hpc.ucalgary.ca.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Planned outage in Jan 2026&lt;br /&gt;
| date = 2025/12/26&lt;br /&gt;
| message = ⚠️ MARC Update Week of January 12&lt;br /&gt;
&lt;br /&gt;
MARC will be going down periodically during the week of 2026/Jan/12 to allow for system maintenance. All nodes and filesystems will be unavailable for use during the outage, and filesystem performance may be variable following the upgrade for a time until the changes finish. &lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Outage Feb 23, 2026 Starting at 7AM&lt;br /&gt;
| date = 2026/02/17&lt;br /&gt;
| message = ⚠️ Filesystem Outage Affecting Arc, Marc, Talc, Cloudstack&lt;br /&gt;
&lt;br /&gt;
Filesystems will be unavailable Feb 23 starting at 7AM MST.  This outage is expected to end prior to 8PM.  No logins or file access will be possible until the outage is complete.&lt;br /&gt;
&lt;br /&gt;
Any queued jobs must request time limits less than the remaining time until Feb 23, 7AM, or they will wait until the outage is complete.&lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
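The time-limit rule above is simple arithmetic: a queued job can start before the outage only if its requested time limit fits in the window remaining before Feb 23, 7AM. A small shell sketch of that calculation (the helper name and the example epoch values are invented for illustration):

```shell
# remaining_minutes: given the outage start and the current time, both as
# Unix epoch seconds, print the largest whole number of minutes a queued
# job could request and still finish before the outage begins.
remaining_minutes() {
  local outage_epoch=$1 now_epoch=$2
  echo $(( (outage_epoch - now_epoch) / 60 ))
}

# Example: outage at 07:00 (25200s after midnight), now 05:30 (19800s):
remaining_minutes 25200 19800   # prints 90
```

A job submitted with a Slurm time limit no larger than that value (e.g. `--time=90`, i.e. 90 minutes) can be scheduled before the window; anything longer waits until the outage completes.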
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Outage Complete&lt;br /&gt;
| date = 2026/02/23&lt;br /&gt;
| message = ✅ Filesystem Outage Complete&lt;br /&gt;
&lt;br /&gt;
All clusters are back in production.&lt;br /&gt;
&lt;br /&gt;
As of Friday, Feb 20, 2026: due to network configuration issues, Fluent is not available on ARC at this time.&lt;br /&gt;
We are working to fix the issues, and the software will return as soon as possible. Please try resubmitting any affected jobs later.&lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=TALC_Cluster_Status&amp;diff=4040</id>
		<title>TALC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=TALC_Cluster_Status&amp;diff=4040"/>
		<updated>2026-02-23T22:17:57Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{TALC Cluster Status}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ May System Updates&lt;br /&gt;
| date = 2023/02/02&lt;br /&gt;
| message =&lt;br /&gt;
Beginning May 1, 2023, the TALC cluster will undergo operating system updates. The upgrade will happen after the end of term to minimize any disruption. Any existing jobs may be &lt;br /&gt;
temporarily held from scheduling. The upgrade is planned to be fully complete by May 5.&lt;br /&gt;
&lt;br /&gt;
The TALC login node will reboot on the morning of May 1.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = May System Updates Completed&lt;br /&gt;
| date = 2023/05/04&lt;br /&gt;
| message =&lt;br /&gt;
TALC upgrades have been completed. If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = TALC experienced a brief power outage around 11AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting. Most jobs running at the time&lt;br /&gt;
were lost. Administrators are actively working on restarting compute&lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/06/26&lt;br /&gt;
| message = &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Please be informed that our support email address (support@hpc.ucalgary.ca) for RCS is currently not working. We are working to restore it as soon as possible. Please keep an eye on this space for updates. The clusters are operating normally, but support will not receive your messages at this time. We will begin responding as soon as the address is restored. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired and RCS can be contacted there. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Trend Micro Installation&lt;br /&gt;
| date = 2025/04/29&lt;br /&gt;
| message = To increase the security posture of the Arc cluster, administrators will be installing Trend Micro on cluster login nodes over the week starting Apr 30. Please report any issues to support@hpc.ucalgary.ca&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Planned outage in Jan 2026&lt;br /&gt;
| date = 2025/12/26&lt;br /&gt;
| message = ⚠️ TALC Update Week of January 12&lt;br /&gt;
&lt;br /&gt;
TALC will be going down periodically during the week of 2026/Jan/12 to allow for system maintenance. All nodes and filesystems will be unavailable for use during the outage, and filesystem performance may be variable following the upgrade for a time until the changes finish. &lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Talc Filesystem Returned to service, scheduling outages continue&lt;br /&gt;
| date = 2026/01/13&lt;br /&gt;
| message = ⚠️ Data access now possible ⚠️&lt;br /&gt;
&lt;br /&gt;
The first stage of the maintenance is complete and the filesystems are once again available.  &lt;br /&gt;
&lt;br /&gt;
There is still significant work to do on the cluster, so jobs will not start until this maintenance is complete, but clients will be able to access data via the login node.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Talc Returned to service&lt;br /&gt;
| date = 2026/01/13&lt;br /&gt;
| message = Maintenance Complete&lt;br /&gt;
&lt;br /&gt;
Talc maintenance complete. &lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Outage Feb 23, 2026 Starting at 7AM&lt;br /&gt;
| date = 2026/02/17&lt;br /&gt;
| message = ⚠️ Filesystem Outage Affecting Arc, Marc, Talc, Cloudstack&lt;br /&gt;
&lt;br /&gt;
Filesystems will be unavailable Feb 23 starting at 7AM MST.  This outage is expected to end prior to 8PM.  No logins or file access will be possible until the outage is complete.&lt;br /&gt;
&lt;br /&gt;
Any queued jobs must request time limits less than the remaining time until Feb 23, 7AM, or they will wait until the outage is complete.&lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Outage Complete&lt;br /&gt;
| date = 2026/02/23&lt;br /&gt;
| message = ✅ Filesystem Outage Complete&lt;br /&gt;
&lt;br /&gt;
All clusters are back in production.&lt;br /&gt;
&lt;br /&gt;
As of Friday, Feb 20, 2026: due to network configuration issues, Fluent is not available on ARC at this time.&lt;br /&gt;
We are working to fix the issues, and the software will return as soon as possible. Please try resubmitting any affected jobs later.&lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:TALC]]&lt;br /&gt;
{{Navbox TALC}}&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=4039</id>
		<title>ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=4039"/>
		<updated>2026-02-23T22:14:53Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{ARC Cluster Status}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 16, 2023, the ARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The ARC login node will reboot on the morning of January 16. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 20.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following has been changed:&lt;br /&gt;
* OS Updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/02/28&lt;br /&gt;
| message =&lt;br /&gt;
We are currently investigating a filesystem issue that is causing filesystem slowdowns across ARC.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/03/01&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing filesystem slowdowns across ARC. Some jobs on ARC have been paused to help us find the root cause of the slowdowns.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ARC Login node reboot&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
The ARC login node will be rebooted this afternoon for emergency maintenance. This downtime is needed to help mitigate the filesystem slowdowns experienced on the login node. Jobs will continue running and scheduling during this time.&lt;br /&gt;
&lt;br /&gt;
All logins to the ARC login node will be terminated at 3:00PM and will remain unavailable until 4:00PM.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Filesystem Issues&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing filesystem slowdowns on specific nodes in our MSRDC location.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues Resolved&lt;br /&gt;
| date = 2023/03/10&lt;br /&gt;
| message =&lt;br /&gt;
We have upgraded the filesystem routers in our MSRDC location to address the performance issues.&lt;br /&gt;
&lt;br /&gt;
Please let us know if you experience any issues with the filesystem performance.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/05/01&lt;br /&gt;
| message =&lt;br /&gt;
On May 1, 2023, the ARC Open OnDemand node will be rebooted between 5PM and 6PM. Expected downtime will be approximately 15 minutes.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on ARC Login Node&lt;br /&gt;
| date = 2023/06/22&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may experience an error when&lt;br /&gt;
running on the Arc login node. If apptainer complains that a system&lt;br /&gt;
administrator needs to enable user namespaces, simply run your&lt;br /&gt;
containers inside a job.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to a security vulnerability that will be&lt;br /&gt;
patched soon.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Lattice, Single, cpu2013 partition changes&lt;br /&gt;
| date = 2023/07/13&lt;br /&gt;
| message =&lt;br /&gt;
The Lattice, Single, and cpu2013 partitions have all been decommissioned. The Single&lt;br /&gt;
partition will be replaced by the nodes formerly in the cpu2013 partition, but&lt;br /&gt;
will be called single.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/10/17&lt;br /&gt;
| message =&lt;br /&gt;
Open OnDemand will be rebooted on October 17, 2023 for an update. It will be down for up to 30 minutes.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC/ARC cluster on &lt;br /&gt;
November 16 and 17, 2023. To facilitate this, we will be throttling &lt;br /&gt;
down the number of jobs on both clusters while the upgrades are &lt;br /&gt;
performed.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/05/03&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = Arc experienced a brief power outage around 11AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting. Most jobs running at the time&lt;br /&gt;
were lost. Arc administrators are actively working on restarting compute&lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation&lt;br /&gt;
| date = 2024/06/03&lt;br /&gt;
| message = Job submissions targeted to the GPU a100 partition will be&lt;br /&gt;
affected by a temporary reservation on the nodes to accommodate the RCS&lt;br /&gt;
summer school class taking place on 2024/Jun/10. The reservation will end&lt;br /&gt;
shortly after. Please submit your jobs normally and the scheduler will&lt;br /&gt;
start them as soon as the nodes are available. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation Removed&lt;br /&gt;
| date = 2024/06/11&lt;br /&gt;
| message = GPU a100 Nodes in ARC have been returned to normal scheduling. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/23&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable&lt;br /&gt;
between Sept 23 and Sept 27 inclusive (subject to change). All compute nodes&lt;br /&gt;
in cpu2019, cpu2021/2, and gpu-v100, plus most nodes from bigmem and gpu-a100, will be&lt;br /&gt;
affected. These nodes will return to service as soon as the work is complete.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to a hardware issue that is blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage will occur starting next Tuesday for the same nodes.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling has resumed. &lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The ARC login node will be rebooted on Tuesday December 17 for scheduled maintenance. It will be down for a few minutes and return shortly. Job scheduling and jobs running on the cluster will not be affected. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades starting 9AM Monday, January 20, 2025 through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported Infiniband on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to use a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10AM, Arc experienced an electrical power brownout. Some percentage of the nodes (how many is unknown at this time) lost electrical power during this time, causing the loss of a number of running jobs.&lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since Arc is shutting down for maintenance on Monday Jan 20, replacement jobs will likely not start unless they request a time limit less than the time remaining until 8AM Monday.&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
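The partition advice in the notice above (steer multi-node MPI work to interconnect-backed partitions) translates to a single submission directive. A hedged sketch, where the node counts, task counts, and program name are placeholders, not values from the notice:

```shell
#!/bin/bash
# Hypothetical multi-node MPI job script. Per the notice, MPI jobs are
# latency-sensitive, so this targets cpu2019 (one of the partitions the
# notice recommends) rather than the Ethernet-only partitions.
#SBATCH --partition=cpu2019
#SBATCH --nodes=2                # placeholder node count
#SBATCH --ntasks-per-node=4      # placeholder task count
#SBATCH --time=01:00:00

srun ./my_mpi_program            # placeholder executable
```

The same script submitted to one of the Ethernet-converted partitions would still run, but with the increased inter-node latency the notice warns about.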
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC cluster upgrade is complete&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported Infiniband on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to use a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal was upgraded.&lt;br /&gt;
&lt;br /&gt;
6. The Parallel partition was renamed to Legacy to reflect the lack of an interconnect for parallel MPI work, and was restricted to a maximum of 4-node jobs.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
Jan 23, 9:08 AM&lt;br /&gt;
&lt;br /&gt;
Remount complete.  Arc is back in full service.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgrade&lt;br /&gt;
| date = 2025/02/03&lt;br /&gt;
| message = Upgrade of the module command&lt;br /&gt;
&lt;br /&gt;
On Tuesday, February 11, 2025, the module command will be upgraded to a new version on Arc. This should result in new capabilities and a slightly different visual experience when using the module command.  Loading modules is not expected to change.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgraded&lt;br /&gt;
| date = 2025/02/12&lt;br /&gt;
| message = The module command was upgraded&lt;br /&gt;
&lt;br /&gt;
On February 12, 2025, the module command was upgraded to a new version on Arc. This should result in new capabilities and a slightly different visual experience when using the module command.  Loading modules should not have changed.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Please be advised that our RCS support email address (support@hpc.ucalgary.ca) is currently not working. We are working to restore it as soon as possible; please keep an eye on this space for updates. The clusters are working normally, but support will not receive your messages at this time. We will begin responding as soon as service is restored. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired and RCS can be contacted there. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit will be Enforced&lt;br /&gt;
| date = 2025/04/11&lt;br /&gt;
| message = In order to improve the scheduling and job throughput efficiency of ARC, interactive jobs will be limited to a maximum of 5 hours of runtime.  Interactive jobs that are submitted with a timelimit over 5 hours will be rejected at submission time.  This change will be made on Monday, April 28, 2025.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit Is Now Enforced&lt;br /&gt;
| date = 2025/04/28&lt;br /&gt;
| message = In order to improve the scheduling and job throughput efficiency of ARC, interactive jobs are now limited to a maximum of 5 hours of runtime.  Interactive jobs that are submitted with a timelimit over 5 hours will be rejected at submission time.&lt;br /&gt;
&lt;br /&gt;
Apr 29, 2025&lt;br /&gt;
To increase the security posture of the Arc cluster, administrators will be installing Trend Micro on the cluster login nodes during the week starting Apr 30.  Please report any issues to support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Bulk Filesystem Emergency Maintenance&lt;br /&gt;
| date = 2025/07/29&lt;br /&gt;
| message = The filer that provides the /bulk filesystem will be down for emergency repairs at 12 noon on Thursday, July 31.  No access to files on /bulk will be possible for the duration of the multi-hour outage.  Jobs that access /bulk will pause when access is attempted and should continue once service is restored.  Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Bulk Filesystem Maintenance Complete&lt;br /&gt;
| date = 2025/08/01&lt;br /&gt;
| message = Maintenance on /bulk was completed successfully and all filesystems are back in service on Arc.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU partition changes on ARC&lt;br /&gt;
| date = 2025/08/15&lt;br /&gt;
| message = New gpu-h100 and gpu-l40 partitions, with new GPU hardware, are available for general scheduling. gpu-v100 also has 6 fewer nodes. You can view more details about the node specs by running the arc.hardware script on the login node. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Update and Bulk outage on ARC&lt;br /&gt;
| date = 2025/09/15&lt;br /&gt;
| message = ⚠️ Arc Update September 15-19&lt;br /&gt;
&lt;br /&gt;
An update of the Arc cluster will begin on Sep 15.  Fewer resources will be available while the compute nodes are restarted.  The login node and scheduler will be restarted on Sep 17.&lt;br /&gt;
&lt;br /&gt;
⚠️ Bulk Filesystem Maintenance September 17&lt;br /&gt;
The filer that provides the /bulk filesystem will be down for maintenance at 9 AM on Wednesday, Sep 17. No access to files on /bulk will be possible for the duration of the multi-hour outage. Jobs that access /bulk will pause when access is attempted and should continue once service is restored.  We apologize for any inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Update and Bulk outage on ARC&lt;br /&gt;
| date = 2025/09/18&lt;br /&gt;
| message = Arc update complete&lt;br /&gt;
&lt;br /&gt;
The update to Arc is complete.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU partition changes on ARC&lt;br /&gt;
| date = 2025/09/23&lt;br /&gt;
| message = gpu-v100 has 6 more nodes. You can view more details about the node specs by running the arc.hardware script on the login node. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2025/11/04&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Legacy compute nodes are being retired&lt;br /&gt;
| date = 2025/11/20&lt;br /&gt;
| message = ARC nodes cn[0513-1096] are being removed from the Arc cluster. They will be withdrawn from scheduling and then removed from the cluster over the coming period. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Planned outage in Jan 2026&lt;br /&gt;
| date = 2025/12/26&lt;br /&gt;
| message = ⚠️ Arc Update Week of January 12&lt;br /&gt;
&lt;br /&gt;
ARC will be going down periodically during the week of 2026/Jan/12 for system maintenance. All nodes and filesystems will be unavailable during the outage, and filesystem performance may be variable for a time after the upgrade until the changes finish. &lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Arc Filesystem Returned to service, scheduling outages continue&lt;br /&gt;
| date = 2026/01/13&lt;br /&gt;
| message = ⚠️ Data access now possible ⚠️&lt;br /&gt;
&lt;br /&gt;
The first stage of the maintenance is complete and the filesystems are once again available.  &lt;br /&gt;
&lt;br /&gt;
There is still significant work to do on the cluster, so jobs will not start until this maintenance is complete, but clients will be able to access data via the login node and DTN.&lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Arc Returned to service, scheduling resumed&lt;br /&gt;
| date = 2026/01/16&lt;br /&gt;
| message = Maintenance Complete&lt;br /&gt;
&lt;br /&gt;
Arc has been returned to service.&lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = New Compute nodes added to Arc&lt;br /&gt;
| date = 2026/01/22&lt;br /&gt;
| message = The cpu2025 partition is now available.&lt;br /&gt;
&lt;br /&gt;
New nodes have been installed to replace the aging parallel nodes. Each has 1 TiB of memory and 128 CPUs.&lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Outage Feb 23, 2026 Starting at 7AM&lt;br /&gt;
| date = 2026/02/17&lt;br /&gt;
| message = ⚠️ Filesystem Outage Affecting Arc, Marc, Talc, Cloudstack&lt;br /&gt;
&lt;br /&gt;
Filesystems will be unavailable on Feb 23 starting at 7 AM MST.  This outage is expected to end prior to 8 PM.  No logins or file access will be possible until the outage is complete.&lt;br /&gt;
&lt;br /&gt;
Queued jobs must request timelimits less than the time remaining until 7 AM on Feb 23, or they will wait until the outage is complete.&lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
As of Friday, Feb 20, 2026 - Due to configuration issues on the network, Fluent is not available on ARC at this time. &lt;br /&gt;
We are working to fix the issues, and the software will return as soon as possible. Please try resubmitting any affected jobs later.&lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Outage Complete&lt;br /&gt;
| date = 2026/02/23&lt;br /&gt;
| message = ✅ Filesystem Outage Complete&lt;br /&gt;
&lt;br /&gt;
All clusters are back in production.&lt;br /&gt;
&lt;br /&gt;
As of Friday, Feb 20, 2026 - Due to configuration issues on the network, Fluent is not available on ARC at this time. &lt;br /&gt;
We are working to fix the issues, and the software will return as soon as possible. Please try resubmitting any affected jobs later.&lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Navbox ARC}}&lt;br /&gt;
[[Category:ARC]]&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=4033</id>
		<title>ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=4033"/>
		<updated>2026-02-17T20:35:00Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{ARC Cluster Status}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 16, 2023, the ARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The ARC login node will reboot on the morning of January 16. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 20.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following has been changed:&lt;br /&gt;
* OS Updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/02/28&lt;br /&gt;
| message =&lt;br /&gt;
We are currently investigating a filesystem issue that is causing slowdowns across ARC.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/03/01&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing slowdowns across ARC. Some jobs on ARC have been paused to help us find the root cause of the slowdowns.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ARC Login node reboot&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
The ARC login node will be rebooted this afternoon for emergency maintenance. This downtime is needed to help mitigate the filesystem slowdowns experienced on the login node.  Jobs will continue running and scheduling during this time.&lt;br /&gt;
&lt;br /&gt;
All sessions on the ARC login node will be terminated at 3:00 PM, and the node will remain unavailable until 4:00 PM.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Filesystem Issues&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing slowdowns on specific nodes in our MSRDC location.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues Resolved&lt;br /&gt;
| date = 2023/03/10&lt;br /&gt;
| message =&lt;br /&gt;
We have upgraded the filesystem routers in our MSRDC location to address the performance issues.&lt;br /&gt;
&lt;br /&gt;
Please let us know if you experience any issues with the filesystem performance.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/05/01&lt;br /&gt;
| message =&lt;br /&gt;
On May 1, 2023, the ARC Open OnDemand node will be rebooted between 5PM and 6PM. Expected downtime will be approximately 15 minutes.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on ARC Login Node&lt;br /&gt;
| date = 2023/06/22&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may experience an error when&lt;br /&gt;
running on the Arc login node. If apptainer complains that a system&lt;br /&gt;
administrator needs to enable user namespaces, simply run your&lt;br /&gt;
containers inside a job.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to a security vulnerability that will be&lt;br /&gt;
patched soon.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Lattice, Single, cpu2013 partition changes&lt;br /&gt;
| date = 2023/07/13&lt;br /&gt;
| message =&lt;br /&gt;
The Lattice, Single, and cpu2013 partitions have all been decommissioned.&lt;br /&gt;
The Single partition will be replaced by the nodes formerly in the cpu2013&lt;br /&gt;
partition, and will be called single.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/10/17&lt;br /&gt;
| message =&lt;br /&gt;
Open OnDemand will be rebooted on October 17, 2023 for an update. It will be down for up to 30 minutes.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC/ARC clusters on &lt;br /&gt;
November 16 and 17, 2023. To facilitate this, we will be throttling &lt;br /&gt;
down the number of jobs on both clusters while the upgrades are &lt;br /&gt;
performed.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/05/03&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = Arc experienced a brief power outage around 11 AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting.  Most jobs running at the &lt;br /&gt;
time were lost. Arc administrators are actively working on restarting compute &lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation&lt;br /&gt;
| date = 2024/06/03&lt;br /&gt;
| message = Job submissions targeted to the GPU a100 partition will be &lt;br /&gt;
affected by a temporary reservation on the nodes to accommodate the RCS&lt;br /&gt;
summer school class taking place on 2024/Jun/10. The reservation will end &lt;br /&gt;
shortly after. Please submit your jobs normally and the scheduler will &lt;br /&gt;
start them as soon as the nodes are available. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation Removed&lt;br /&gt;
| date = 2024/06/11&lt;br /&gt;
| message = GPU a100 Nodes in ARC have been returned to normal scheduling. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/23&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable &lt;br /&gt;
from Sept 23 to Sept 27 inclusive (subject to change). All compute nodes &lt;br /&gt;
in cpu2019, cpu2021, cpu2022, and gpu-v100, and most nodes from bigmem and &lt;br /&gt;
gpu-a100, will be affected. These nodes will return to service as soon as &lt;br /&gt;
the work is complete.  &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage for the same nodes will begin next Tuesday.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling Has Resumed&lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The ARC login node will be rebooted on Tuesday December 17 for scheduled maintenance. It will be down for a few minutes and return shortly. Job scheduling and jobs running on the cluster will not be affected. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades starting 9AM Monday, January 20, 2025 through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported InfiniBand on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10 AM, Arc experienced an electrical power brownout.  An as-yet-unknown number of nodes lost electrical power during this time, causing the loss of a number of running jobs.  &lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since Arc is shutting down for maintenance on Monday, Jan 20, replacement jobs will likely not start unless they request a timelimit shorter than the time remaining until 8 AM Monday.  &lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC cluster upgrade is complete.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported InfiniBand on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal was upgraded.&lt;br /&gt;
&lt;br /&gt;
6. The Parallel partition was renamed to Legacy to reflect the lack of a high-speed interconnect for parallel MPI work, and was restricted to a maximum of 4 nodes per job.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
Jan 23, 9:08 AM&lt;br /&gt;
&lt;br /&gt;
Remount complete.  Arc is back in full service.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgrade&lt;br /&gt;
| date = 2025/02/03&lt;br /&gt;
| message = Upgrade of the module command&lt;br /&gt;
&lt;br /&gt;
On Tuesday, February 11, 2025, the module command will be upgraded to a new version on Arc. This should result in new capabilities and a slightly different visual experience when using the module command.  Loading modules is not expected to change.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgraded&lt;br /&gt;
| date = 2025/02/12&lt;br /&gt;
| message = The module command was upgraded&lt;br /&gt;
&lt;br /&gt;
On February 12, 2025, the module command was upgraded to a new version on Arc. This should result in new capabilities and a slightly different visual experience when using the module command.  Loading modules should not have changed.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Please be advised that our RCS support email address (support@hpc.ucalgary.ca) is currently not working. We are working to restore it as soon as possible; please keep an eye on this space for updates. The clusters are working normally, but support will not receive your messages at this time. We will begin responding as soon as service is restored. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired and RCS can be contacted there. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit will be Enforced&lt;br /&gt;
| date = 2025/04/11&lt;br /&gt;
| message = In order to improve the scheduling and job throughput efficiency of ARC, interactive jobs will be limited to a maximum of 5 hours of runtime.  Interactive jobs that are submitted with a timelimit over 5 hours will be rejected at submission time.  This change will be made on Monday, April 28, 2025.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit Is Now Enforced&lt;br /&gt;
| date = 2025/04/28&lt;br /&gt;
| message = In order to improve the scheduling and job throughput efficiency of ARC, interactive jobs are now limited to a maximum of 5 hours of runtime.  Interactive jobs that are submitted with a timelimit over 5 hours will be rejected at submission time.&lt;br /&gt;
&lt;br /&gt;
Apr 29, 2025&lt;br /&gt;
To increase the security posture of the Arc cluster, administrators will be installing Trend Micro on the cluster login nodes during the week starting Apr 30.  Please report any issues to support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Bulk Filesystem Emergency Maintenance&lt;br /&gt;
| date = 2025/07/29&lt;br /&gt;
| message = The filer that provides the /bulk filesystem will be down for emergency repairs at 12 noon on Thursday, July 31.  No access to files on /bulk will be possible for the duration of the multi-hour outage.  Jobs that access /bulk will pause when access is attempted and should continue once service is restored.  Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Bulk Filesystem Maintenance Complete&lt;br /&gt;
| date = 2025/08/01&lt;br /&gt;
| message = Maintenance on /bulk was completed successfully and all filesystems are back in service on Arc.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU partition changes on ARC&lt;br /&gt;
| date = 2025/08/15&lt;br /&gt;
| message = New gpu-h100 and gpu-l40 partitions, with new GPU hardware, are available for general scheduling. gpu-v100 also has 6 fewer nodes. You can view more details about the node specs by running the arc.hardware script on the login node. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Update and Bulk outage on ARC&lt;br /&gt;
| date = 2025/09/15&lt;br /&gt;
| message = ⚠️ Arc Update September 15-19&lt;br /&gt;
&lt;br /&gt;
An update of the Arc cluster will begin on Sep 15.  Fewer resources will be available while the compute nodes are restarted.  The login node and scheduler will be restarted on Sep 17.&lt;br /&gt;
&lt;br /&gt;
⚠️ Bulk Filesystem Maintenance September 17&lt;br /&gt;
The filer that provides the /bulk filesystem will be down for maintenance at 9 AM on Wednesday, Sep 17. No access to files on /bulk will be possible for the duration of the multi-hour outage. Jobs that access /bulk will pause when access is attempted and should continue once service is restored.  We apologize for any inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Update and Bulk outage on ARC&lt;br /&gt;
| date = 2025/09/18&lt;br /&gt;
| message = Arc update complete&lt;br /&gt;
&lt;br /&gt;
The update to Arc is complete.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU partition changes on ARC&lt;br /&gt;
| date = 2025/09/23&lt;br /&gt;
| message = The gpu-v100 partition now has 6 more nodes. You can view more details about the node specs by running the arc.hardware script on the login node. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2025/11/04&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Legacy compute nodes are being retired&lt;br /&gt;
| date = 2025/11/20&lt;br /&gt;
| message = ARC nodes cn[0513-1096] are being retired from the ARC cluster. They will be withdrawn from scheduling and then removed from the cluster gradually. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Planned outage in Jan 2026&lt;br /&gt;
| date = 2025/12/26&lt;br /&gt;
| message = ⚠️ Arc Update Week of January 12&lt;br /&gt;
&lt;br /&gt;
ARC will be going down periodically during the week of 2026/Jan/12 for system maintenance. All nodes and filesystems will be unavailable during the outage, and filesystem performance may be variable for a time after the upgrade until the changes are finished. &lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Arc Filesystem Returned to service, scheduling outages continue&lt;br /&gt;
| date = 2026/01/13&lt;br /&gt;
| message = ⚠️ Data access now possible ⚠️&lt;br /&gt;
&lt;br /&gt;
The first stage of the maintenance is complete and the filesystems are once again available.  &lt;br /&gt;
&lt;br /&gt;
There is still significant work to do on the cluster, so jobs will not start until this maintenance is complete, but clients will be able to access data via the login node and DTN.&lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Arc Returned to service, scheduling resumed&lt;br /&gt;
| date = 2026/01/16&lt;br /&gt;
| message = Maintenance Complete&lt;br /&gt;
&lt;br /&gt;
Arc has been returned to service.&lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = New Compute nodes added to Arc&lt;br /&gt;
| date = 2026/01/22&lt;br /&gt;
| message = cpu2025 partition is now available.&lt;br /&gt;
&lt;br /&gt;
New nodes have been installed to replace the aging parallel nodes. They have 1 TiB of memory and 128 CPUs.&lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Outage Feb 23, 2026 Starting at 7AM&lt;br /&gt;
| date = 2026/02/17&lt;br /&gt;
| message = ⚠️ Filesystem Outage Affecting Arc, Marc, Talc, Cloudstack&lt;br /&gt;
&lt;br /&gt;
Filesystems will be unavailable Feb 23 starting at 7AM MST.  This outage is expected to end prior to 8PM.  No logins or file access will be possible until the outage is complete.&lt;br /&gt;
&lt;br /&gt;
Any queued jobs must request time limits less than the remaining time until Feb 23, 7AM, or they will wait until the outage is complete.&lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Navbox ARC}}&lt;br /&gt;
[[Category:ARC]]&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=MARC_Cluster_Status&amp;diff=4032</id>
		<title>MARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=MARC_Cluster_Status&amp;diff=4032"/>
		<updated>2026-02-17T20:33:29Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Cluster Status&lt;br /&gt;
| cluster = MARC&lt;br /&gt;
| status = green&lt;br /&gt;
| title = Cluster operational&lt;br /&gt;
| message = See the [[MARC Cluster Status]] page for system notices. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 23, 2023, the MARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The MARC login node will reboot on the morning of January 23. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 27.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following has been changed:&lt;br /&gt;
* OS Updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on MARC Login Node&lt;br /&gt;
| date = 2023/06/23&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may experience an error when&lt;br /&gt;
running on the MARC login node. If apptainer complains that a system&lt;br /&gt;
administrator needs to enable user namespaces, simply run your&lt;br /&gt;
containers inside a job.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to a security vulnerability that will be&lt;br /&gt;
patched soon.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC/ARC cluster on &lt;br /&gt;
November 16 and 17, 2023. To facilitate this, we will be throttling &lt;br /&gt;
down the number of jobs on both clusters while the upgrades are &lt;br /&gt;
performed.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = OS Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2024/09/11&lt;br /&gt;
| message =&lt;br /&gt;
MARC will be going down for OS upgrades on 2024/Sep/16. The cluster &lt;br /&gt;
will be unavailable temporarily to complete this work. Please contact&lt;br /&gt;
support@hpc.ucalgary.ca if you have any questions or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The MARC login node will be rebooted on Tuesday December 17 for scheduled maintenance. It will be down for a few minutes and return shortly. Job scheduling and jobs running on the cluster will not be affected. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The MARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Please be advised that our support email address (support@hpc.ucalgary.ca) for RCS is currently not working. We are working to restore it as soon as possible. Please keep an eye on this space for updates. The clusters are working normally, but support will not receive your messages at this time. We will begin responding as soon as the address is restored. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired and RCS can be contacted there. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = MARC Scheduled File System Maintenance&lt;br /&gt;
| date = 2025/06/09&lt;br /&gt;
| message = Please be advised MARC will be going down for a period of approximately 2 hours starting at 10 AM June 17, 2025. Logins will not be available and no jobs will be running during this window. &lt;br /&gt;
&lt;br /&gt;
Please send any questions or concerns to support@hpc.ucalgary.ca. Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = MARC Maintenance Complete&lt;br /&gt;
| date = 2025/06/17&lt;br /&gt;
| message = Filesystem maintenance complete.&lt;br /&gt;
&lt;br /&gt;
Please send any questions or concerns to support@hpc.ucalgary.ca.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Planned outage in Jan 2026&lt;br /&gt;
| date = 2025/12/26&lt;br /&gt;
| message = ⚠️ MARC Update Week of January 12&lt;br /&gt;
&lt;br /&gt;
MARC will be going down periodically during the week of 2026/Jan/12 for system maintenance. All nodes and filesystems will be unavailable during the outage, and filesystem performance may be variable for a time after the upgrade until the changes are finished. &lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Outage Feb 23, 2026 Starting at 7AM&lt;br /&gt;
| date = 2026/02/17&lt;br /&gt;
| message = ⚠️ Filesystem Outage Affecting Arc, Marc, Talc, Cloudstack&lt;br /&gt;
&lt;br /&gt;
Filesystems will be unavailable Feb 23 starting at 7AM MST.  This outage is expected to end prior to 8PM.  No logins or file access will be possible until the outage is complete.&lt;br /&gt;
&lt;br /&gt;
Any queued jobs must request time limits less than the remaining time until Feb 23, 7AM, or they will wait until the outage is complete.&lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=TALC_Cluster_Status&amp;diff=4031</id>
		<title>TALC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=TALC_Cluster_Status&amp;diff=4031"/>
		<updated>2026-02-17T20:32:59Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{TALC Cluster Status}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ May System Updates&lt;br /&gt;
| date = 2023/02/02&lt;br /&gt;
| message =&lt;br /&gt;
Beginning May 1, 2023, the TALC cluster will undergo operating system updates. The upgrade will happen after the end of term to minimize any disruption. Any existing jobs may be &lt;br /&gt;
temporarily held from scheduling. The upgrade is planned to be fully complete by May 5.&lt;br /&gt;
&lt;br /&gt;
The TALC login node will reboot on the morning of May 1.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = May System Updates Completed&lt;br /&gt;
| date = 2023/05/04&lt;br /&gt;
| message =&lt;br /&gt;
TALC upgrades have been completed. If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = TALC experienced a brief power outage around 11AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting. Most jobs running at the time &lt;br /&gt;
were lost. Administrators are actively working on restarting compute &lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/06/26&lt;br /&gt;
| message = &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Please be advised that our support email address (support@hpc.ucalgary.ca) for RCS is currently not working. We are working to restore it as soon as possible. Please keep an eye on this space for updates. The clusters are working normally, but support will not receive your messages at this time. We will begin responding as soon as the address is restored. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired and RCS can be contacted there. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Trend Micro Installation&lt;br /&gt;
| date = 2025/04/29&lt;br /&gt;
| message = To increase the security posture of the Arc cluster, administrators will be installing Trend Micro on cluster login nodes during the week starting Apr 30. Please report any issues to support@hpc.ucalgary.ca&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Planned outage in Jan 2026&lt;br /&gt;
| date = 2025/12/26&lt;br /&gt;
| message = ⚠️ TALC Update Week of January 12&lt;br /&gt;
&lt;br /&gt;
TALC will be going down periodically during the week of 2026/Jan/12 for system maintenance. All nodes and filesystems will be unavailable during the outage, and filesystem performance may be variable for a time after the upgrade until the changes are finished. &lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Talc Filesystem Returned to service, scheduling outages continue&lt;br /&gt;
| date = 2026/01/13&lt;br /&gt;
| message = ⚠️ Data access now possible ⚠️&lt;br /&gt;
&lt;br /&gt;
The first stage of the maintenance is complete and the filesystems are once again available.  &lt;br /&gt;
&lt;br /&gt;
There is still significant work to do on the cluster, so jobs will not start until this maintenance is complete, but clients will be able to access data via the login node.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Talc Returned to service&lt;br /&gt;
| date = 2026/01/13&lt;br /&gt;
| message = Maintenance Complete&lt;br /&gt;
&lt;br /&gt;
Talc maintenance is complete. &lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Outage Feb 23, 2026 Starting at 7AM&lt;br /&gt;
| date = 2026/02/17&lt;br /&gt;
| message = ⚠️ Filesystem Outage Affecting Arc, Marc, Talc, Cloudstack&lt;br /&gt;
&lt;br /&gt;
Filesystems will be unavailable Feb 23 starting at 7AM MST.  This outage is expected to end prior to 8PM.  No logins or file access will be possible until the outage is complete.&lt;br /&gt;
&lt;br /&gt;
Any queued jobs must request time limits less than the remaining time until Feb 23, 7AM, or they will wait until the outage is complete.&lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
[[Category:TALC]]&lt;br /&gt;
{{Navbox TALC}}&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=4030</id>
		<title>ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=4030"/>
		<updated>2026-02-17T20:29:47Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{ARC Cluster Status}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 16, 2023, the ARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The ARC login node will reboot on the morning of January 16. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 20.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following has been changed:&lt;br /&gt;
* OS Updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/02/28&lt;br /&gt;
| message =&lt;br /&gt;
We are currently investigating a filesystem issue that is causing filesystem slowdowns across ARC.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/03/01&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing filesystem slowdowns across ARC. Some jobs on ARC have been paused to help us find the root cause of the slowdowns.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ARC Login node reboot&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
The ARC login node will be rebooted this afternoon for emergency maintenance. This downtime is needed to help mitigate the filesystem slowdowns experienced on the login node. Jobs will continue running and scheduling during this time.&lt;br /&gt;
&lt;br /&gt;
All logins to the ARC login node will be terminated at 3:00PM and will remain unavailable until 4:00PM.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Filesystem Issues&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing filesystem slowdowns on specific nodes in our MSRDC location.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues Resolved&lt;br /&gt;
| date = 2023/03/10&lt;br /&gt;
| message =&lt;br /&gt;
We have upgraded the filesystem routers in our MSRDC location to address the performance issues.&lt;br /&gt;
&lt;br /&gt;
Please let us know if you experience any issues with the filesystem performance.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/05/01&lt;br /&gt;
| message =&lt;br /&gt;
On May 1, 2023, the ARC Open OnDemand node will be rebooted between 5PM and 6PM. Expected downtime will be approximately 15 minutes.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on ARC Login Node&lt;br /&gt;
| date = 2023/06/22&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may experience an error when&lt;br /&gt;
running on the Arc login node. If apptainer complains that a system&lt;br /&gt;
administrator needs to enable user namespaces, simply run your&lt;br /&gt;
containers inside a job.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to a security vulnerability that will be&lt;br /&gt;
patched soon.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Lattice, Single, cpu2013 partition changes&lt;br /&gt;
| date = 2023/07/13&lt;br /&gt;
| message =&lt;br /&gt;
The Lattice, Single, and cpu2013 partitions have all been decommissioned. The Single&lt;br /&gt;
partition will be replaced by the nodes formerly in the cpu2013 partition but&lt;br /&gt;
will still be called single.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/10/17&lt;br /&gt;
| message =&lt;br /&gt;
Open OnDemand will be rebooted on October 17, 2023 for an update. It will be down for up to 30 minutes.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC/ARC cluster on &lt;br /&gt;
November 16 and 17, 2023. To facilitate this, we will be throttling &lt;br /&gt;
down the number of jobs on both clusters while the upgrades are &lt;br /&gt;
performed.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/05/03&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = Arc experienced a brief power outage around 11AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting. Most jobs running at the time &lt;br /&gt;
were lost. Arc administrators are actively working on restarting compute &lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation&lt;br /&gt;
| date = 2024/06/03&lt;br /&gt;
| message = Job submissions targeted to the GPU a100 partition will be &lt;br /&gt;
affected by a temporary reservation on the nodes to accommodate the RCS&lt;br /&gt;
summer school class taking place on 2024/Jun/10. The reservation will end &lt;br /&gt;
shortly after. Please submit your jobs normally and the scheduler will &lt;br /&gt;
start them as soon as the nodes are available. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation Removed&lt;br /&gt;
| date = 2024/06/11&lt;br /&gt;
| message = GPU a100 Nodes in ARC have been returned to normal scheduling. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/23&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable &lt;br /&gt;
between Sept 23 and Sept 27 inclusive (subject to change). All compute nodes &lt;br /&gt;
in cpu2019, cpu2021/2, and gpu-v100, and most nodes from bigmem and gpu-a100, will be &lt;br /&gt;
affected. These nodes will return to service as soon as the work is complete.  &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage will occur starting next Tuesday for the same nodes.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling has resumed. &lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The ARC login node will be rebooted on Tuesday December 17 for scheduled maintenance. It will be down for a few minutes and return shortly. Job scheduling and jobs running on the cluster will not be affected. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades starting 9AM Monday, January 20, 2025 through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but may be intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported InfiniBand on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10AM, ARC experienced an electrical power brownout. An as-yet-unknown number of nodes lost power during this time, causing the loss of a number of running jobs.&lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since ARC is shutting down for maintenance on Monday, January 20, replacement jobs will likely not start unless they request a timelimit shorter than the time remaining until 8AM Monday.&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC cluster upgrade is complete.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported InfiniBand interconnect on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will see increased latency going forward. If you run multi-node jobs, make sure to use a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal was upgraded.&lt;br /&gt;
&lt;br /&gt;
6. The Parallel partition was renamed to Legacy to reflect the lack of a high-speed interconnect for parallel MPI work, and it is now restricted to jobs of at most 4 nodes.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
Jan 23, 9:08AM&lt;br /&gt;
&lt;br /&gt;
Remount complete. ARC is back in full service.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgrade&lt;br /&gt;
| date = 2025/02/03&lt;br /&gt;
| message = Upgrade of the module command&lt;br /&gt;
&lt;br /&gt;
On Tuesday, February 11, 2025, the module command will be upgraded to a new version on ARC. This should result in new capabilities and a slightly different visual experience when using the module command. Loading modules is not expected to change.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgraded&lt;br /&gt;
| date = 2025/02/12&lt;br /&gt;
| message = The module command was upgraded&lt;br /&gt;
&lt;br /&gt;
On February 12, 2025, the module command was upgraded to a new version on ARC. This should result in new capabilities and a slightly different visual experience when using the module command. Loading modules should not have changed.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Please be informed that our support email address (support@hpc.ucalgary.ca) for RCS is currently not working. We are working to restore it as soon as possible; please watch this space for updates. The clusters are operating normally, but support will not receive your messages at this time. We will begin responding as soon as service is restored.&lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired and RCS can be contacted there. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email.&lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit will be Enforced&lt;br /&gt;
| date = 2025/04/11&lt;br /&gt;
| message = To improve scheduling and job throughput efficiency on ARC, interactive jobs will be limited to a maximum of 5 hours of runtime. Interactive jobs submitted with a timelimit over 5 hours will be rejected at submission time. This change will take effect on Monday, April 28, 2025.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit Is Now Enforced&lt;br /&gt;
| date = 2025/04/28&lt;br /&gt;
| message = To improve scheduling and job throughput efficiency on ARC, interactive jobs are now limited to a maximum of 5 hours of runtime. Interactive jobs submitted with a timelimit over 5 hours will be rejected at submission time.&lt;br /&gt;
&lt;br /&gt;
Apr 29, 2025&lt;br /&gt;
To strengthen the security posture of the ARC cluster, administrators will be installing Trend Micro on the cluster login nodes during the week starting April 30. Please report any problems to support@hpc.ucalgary.ca.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Bulk Filesystem Emergency Maintenance&lt;br /&gt;
| date = 2025/07/29&lt;br /&gt;
| message = The filer that provides the /bulk filesystem will be down for emergency repairs at 12 noon on Thursday, July 31. No access to files on /bulk will be possible for the duration of the multi-hour outage. Jobs that access /bulk will pause when access is attempted and should continue once service is restored. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Bulk Filesystem Maintenance Complete&lt;br /&gt;
| date = 2025/08/01&lt;br /&gt;
| message = Maintenance on /bulk was completed successfully and all filesystems are back in service on Arc.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU partition changes on ARC&lt;br /&gt;
| date = 2025/08/15&lt;br /&gt;
| message = New gpu-h100 and gpu-l40 partitions, backed by new GPU hardware, are available for general scheduling. gpu-v100 also has 6 fewer nodes. You can view more details about the node specs by running the arc.hardware script on the login node.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Update and Bulk outage on ARC&lt;br /&gt;
| date = 2025/09/15&lt;br /&gt;
| message = ⚠️ Arc Update September 15-19&lt;br /&gt;
&lt;br /&gt;
An update of the ARC cluster will begin on September 15. Fewer resources will be available while the compute nodes are restarted. The login node and scheduler will be restarted on September 17.&lt;br /&gt;
&lt;br /&gt;
⚠️ Bulk Filesystem Maintenance September 17&lt;br /&gt;
The filer that provides the /bulk filesystem will be down for repairs at 9AM on Wednesday, September 17. No access to files on /bulk will be possible for the duration of the multi-hour outage. Jobs that access /bulk will pause when access is attempted and should continue once service is restored. We apologize for any inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Update and Bulk outage on ARC&lt;br /&gt;
| date = 2025/09/18&lt;br /&gt;
| message = Arc update complete&lt;br /&gt;
&lt;br /&gt;
The update to ARC is complete.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU partition changes on ARC&lt;br /&gt;
| date = 2025/09/23&lt;br /&gt;
| message = gpu-v100 has 6 more nodes. You can view more details about the node specs by running the arc.hardware script on the login node. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2025/11/04&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Legacy compute nodes are being retired&lt;br /&gt;
| date = 2025/11/20&lt;br /&gt;
| message = ARC nodes cn[0513-1096] are being retired from the ARC cluster. They will be withdrawn from scheduling and then removed from the cluster over the coming period.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Planned outage in Jan 2026&lt;br /&gt;
| date = 2025/12/26&lt;br /&gt;
| message = ⚠️ Arc Update Week of January 12&lt;br /&gt;
&lt;br /&gt;
ARC will be going down periodically during the week of 2026/Jan/12 to allow for system maintenance. All nodes and filesystems will be unavailable during the outage, and filesystem performance may be variable for a time after the upgrade until the changes finish.&lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Arc Filesystem Returned to service, scheduling outages continue&lt;br /&gt;
| date = 2026/01/13&lt;br /&gt;
| message = ⚠️ Data access now possible ⚠️&lt;br /&gt;
&lt;br /&gt;
The first stage of the maintenance is complete and the filesystems are once again available.  &lt;br /&gt;
&lt;br /&gt;
There is still significant work to do on the cluster, so jobs will not start until this maintenance is complete; however, clients will be able to access data via the login node and DTN.&lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Arc Returned to service, scheduling resumed&lt;br /&gt;
| date = 2026/01/16&lt;br /&gt;
| message = Maintenance Complete&lt;br /&gt;
&lt;br /&gt;
Arc has been returned to service.&lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = New Compute nodes added to Arc&lt;br /&gt;
| date = 2026/01/22&lt;br /&gt;
| message = cpu2025 partition is now available.&lt;br /&gt;
&lt;br /&gt;
New nodes have been installed to replace the aging parallel nodes. Each has 1 TiB of memory and 128 CPUs.&lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Outage Feb 23, 2026 Starting at 7AM&lt;br /&gt;
| date = 2026/01/22&lt;br /&gt;
| message = ⚠️ Filesystem Outage Affecting Arc, Marc, Talc, Cloudstack&lt;br /&gt;
&lt;br /&gt;
Filesystems will be unavailable Feb 23 starting at 7AM MST.  This outage is expected to end prior to 8PM.  No logins or file access will be possible until the outage is complete.&lt;br /&gt;
&lt;br /&gt;
Any queued jobs must request timelimits shorter than the time remaining until 7AM on February 23, or they will wait until the outage is complete.&lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Navbox ARC}}&lt;br /&gt;
[[Category:ARC]]&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=4029</id>
		<title>ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=4029"/>
		<updated>2026-02-17T20:27:53Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{ARC Cluster Status}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 16, 2023, the ARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The ARC login node will reboot on the morning of January 16. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 20.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following has been changed:&lt;br /&gt;
* OS Updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/02/28&lt;br /&gt;
| message =&lt;br /&gt;
We are currently investigating a filesystem issue that is causing filesystem slowdowns across ARC.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/03/01&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing filesystem slowdowns across ARC. Some jobs on ARC have been paused to help us find the root cause of the slowdowns.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ARC Login node reboot&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
The ARC login node will be rebooted this afternoon for an emergency maintenance. This downtime is needed to help mitigate the filesystem slowdowns experienced on the login node.  Jobs will continue running and scheduling during this time.&lt;br /&gt;
&lt;br /&gt;
All logins to the ARC login node will be terminated at 3:00PM and will remain unavailable until 4:00PM.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Filesystem Issues&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing filesystem slowdowns on specific nodes in our MSRDC location.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues Resolved&lt;br /&gt;
| date = 2023/03/10&lt;br /&gt;
| message =&lt;br /&gt;
We have upgraded the filesystem routers in our MSRDC location to address the performance issues.&lt;br /&gt;
&lt;br /&gt;
Please let us know if you experience any issues with the filesystem performance.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/05/01&lt;br /&gt;
| message =&lt;br /&gt;
On May 1, 2023, the ARC Open OnDemand node will be rebooted between 5PM and 6PM. Expected downtime will be approximately 15 minutes.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on ARC Login Node&lt;br /&gt;
| date = 2023/06/22&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may experience an error when&lt;br /&gt;
running on the Arc login node. If apptainer complains that a system&lt;br /&gt;
administrator needs to enable user namespaces, simply run your&lt;br /&gt;
containers inside a job.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to a security vulnerability that will be&lt;br /&gt;
patched soon.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Lattice, Single, cpu2013 partition changes&lt;br /&gt;
| date = 2023/07/13&lt;br /&gt;
| message =&lt;br /&gt;
The Lattice, Single, and cpu2013 partitions have all been decommissioned. The&lt;br /&gt;
Single partition will be replaced by the nodes formerly in the cpu2013&lt;br /&gt;
partition, but it will still be called single.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/10/17&lt;br /&gt;
| message =&lt;br /&gt;
Open OnDemand will be rebooted on October 17, 2023 for an update. It will be down for up to 30 minutes.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC/ARC cluster on &lt;br /&gt;
November 16 and 17, 2023. To facilitate this, we will be throttling &lt;br /&gt;
down the number of jobs on both clusters while the upgrades are &lt;br /&gt;
performed.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/05/03&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = ARC experienced a brief power outage around 11AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting. Most jobs running at the&lt;br /&gt;
time were lost. ARC administrators are actively working on restarting compute&lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation&lt;br /&gt;
| date = 2024/06/03&lt;br /&gt;
| message = Job submissions targeted to the  GPU a100 partition will be &lt;br /&gt;
affected by a temporary reservation on the nodes to accommodate the RCS&lt;br /&gt;
summer school class taking place on 2024/Jun/10. The reservation will end&lt;br /&gt;
shortly afterward. Please submit your jobs normally and the scheduler will&lt;br /&gt;
start them as soon as the nodes are available. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation Removed&lt;br /&gt;
| date = 2024/06/11&lt;br /&gt;
| message = GPU a100 Nodes in ARC have been returned to normal scheduling. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/23&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable &lt;br /&gt;
between Sept 23 to Sept 27 inclusive (subject to change). All compute nodes &lt;br /&gt;
in cpu2019, cpu2021/2, and gpu-v100, as well as most nodes from bigmem and gpu-a100, will be&lt;br /&gt;
affected. These nodes will return to service as soon as the work is complete.  &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage will occur starting next Tuesday for the same nodes.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling has resumed. &lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The ARC login node will be rebooted on Tuesday December 17 for scheduled maintenance. It will be down for a few minutes and return shortly. Job scheduling and jobs running on the cluster will not be affected. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades starting 9AM Monday, January 20, 2025 through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported InfiniBand interconnect on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will see increased latency going forward. If you run multi-node jobs, make sure to use a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10AM, ARC experienced an electrical power brownout. An as-yet-unknown number of nodes lost power during this time, causing the loss of a number of running jobs.&lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since ARC is shutting down for maintenance on Monday, January 20, replacement jobs will likely not start unless they request a timelimit shorter than the time remaining until 8AM Monday.&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC cluster upgrade is complete.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported InfiniBand interconnect on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will see increased latency going forward. If you run multi-node jobs, make sure to use a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal was upgraded.&lt;br /&gt;
&lt;br /&gt;
6. The Parallel partition was renamed to Legacy to reflect the lack of a high-speed interconnect for parallel MPI work, and it is now restricted to jobs of at most 4 nodes.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
Jan 23, 9:08AM&lt;br /&gt;
&lt;br /&gt;
Remount complete. ARC is back in full service.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgrade&lt;br /&gt;
| date = 2025/02/03&lt;br /&gt;
| message = Upgrade of the module command&lt;br /&gt;
&lt;br /&gt;
On Tuesday, February 11, 2025, the module command will be upgraded to a new version on ARC. This should result in new capabilities and a slightly different visual experience when using the module command. Loading modules is not expected to change.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgraded&lt;br /&gt;
| date = 2025/02/12&lt;br /&gt;
| message = The module command was upgraded&lt;br /&gt;
&lt;br /&gt;
On February 12, 2025, the module command was upgraded to a new version on ARC. This should result in new capabilities and a slightly different visual experience when using the module command. Loading modules should not have changed.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Please be informed that our support email address (support@hpc.ucalgary.ca) for RCS is currently not working. We are working to restore it as soon as possible; please watch this space for updates. The clusters are operating normally, but support will not receive your messages at this time. We will begin responding as soon as service is restored.&lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired and RCS can be contacted there. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email.&lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit will be Enforced&lt;br /&gt;
| date = 2025/04/11&lt;br /&gt;
| message = To improve scheduling and job throughput efficiency on ARC, interactive jobs will be limited to a maximum of 5 hours of runtime. Interactive jobs submitted with a timelimit over 5 hours will be rejected at submission time. This change will take effect on Monday, April 28, 2025.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit Is Now Enforced&lt;br /&gt;
| date = 2025/04/28&lt;br /&gt;
| message = To improve scheduling and job throughput efficiency on ARC, interactive jobs are now limited to a maximum of 5 hours of runtime. Interactive jobs submitted with a timelimit over 5 hours will be rejected at submission time.&lt;br /&gt;
&lt;br /&gt;
Apr 29, 2025&lt;br /&gt;
To strengthen the security posture of the ARC cluster, administrators will be installing Trend Micro on the cluster login nodes during the week starting April 30. Please report any problems to support@hpc.ucalgary.ca.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Bulk Filesystem Emergency Maintenance&lt;br /&gt;
| date = 2025/07/29&lt;br /&gt;
| message = The filer that provides the /bulk filesystem will be down for emergency repairs at 12 noon on Thursday, July 31. No access to files on /bulk will be possible for the duration of the multi-hour outage. Jobs that access /bulk will pause when access is attempted and should continue once service is restored. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Bulk Filesystem Maintenance Complete&lt;br /&gt;
| date = 2025/08/01&lt;br /&gt;
| message = Maintenance on /bulk was completed successfully and all filesystems are back in service on Arc.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU partition changes on ARC&lt;br /&gt;
| date = 2025/08/15&lt;br /&gt;
| message = New gpu-h100 and gpu-l40 partitions are available for general scheduling with new GPU hardware. The gpu-v100 partition also has 6 fewer nodes. You can view more details about the node specs by running the arc.hardware script on the login node. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Update and Bulk outage on ARC&lt;br /&gt;
| date = 2025/09/15&lt;br /&gt;
| message = ⚠️ Arc Update September 15-19&lt;br /&gt;
&lt;br /&gt;
An update of the Arc cluster will begin on Sep 15.  This will result in fewer available resources while the compute nodes are restarted.  The login node and scheduler will be restarted on Sep 17.&lt;br /&gt;
&lt;br /&gt;
⚠️ Bulk Filesystem Maintenance September 17&lt;br /&gt;
The filer that provides the /bulk filesystem will be down for emergency repairs at 9 AM on Wednesday Sep 17th. No access to files on /bulk will be possible for the duration of the multi-hour outage. Any jobs running that access /bulk will start and then pause when access to /bulk is attempted. Jobs should continue once service is restored.  We apologize for any inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Update and Bulk outage on ARC&lt;br /&gt;
| date = 2025/09/18&lt;br /&gt;
| message = Arc update complete&lt;br /&gt;
&lt;br /&gt;
The update to arc is complete.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU partition changes on ARC&lt;br /&gt;
| date = 2025/09/23&lt;br /&gt;
| message = gpu-v100 has 6 more nodes. You can view more details about the node specs by running the arc.hardware script on the login node. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2025/11/4&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Legacy compute nodes are being retired&lt;br /&gt;
| date = 2025/11/20&lt;br /&gt;
| message = ARC nodes cn[0513-1096] are being retired. They will be removed from scheduling and then from the cluster over the coming weeks. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Planned outage in Jan 2026&lt;br /&gt;
| date = 2025/12/26&lt;br /&gt;
| message = ⚠️ Arc Update Week of January 12&lt;br /&gt;
&lt;br /&gt;
ARC will be going down periodically during the week of 2026/Jan/12 to allow for system maintenance. All nodes and filesystems will be unavailable during the outage, and filesystem performance may be variable for a time after the upgrade until the changes finish. &lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Arc Filesystem Returned to service, scheduling outages continue&lt;br /&gt;
| date = 2026/01/13&lt;br /&gt;
| message = ⚠️ Data access now possible ⚠️&lt;br /&gt;
&lt;br /&gt;
The first stage of the maintenance is complete and the filesystems are once again available.  &lt;br /&gt;
&lt;br /&gt;
There is still significant work to do on the cluster, so jobs will not start until this maintenance is complete, but clients will be able to access data via the login node and DTN.&lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Arc Returned to service, scheduling resumed&lt;br /&gt;
| date = 2026/01/16&lt;br /&gt;
| message = Maintenance Complete&lt;br /&gt;
&lt;br /&gt;
Arc has been returned to service.&lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = New Compute nodes added to Arc&lt;br /&gt;
| date = 2026/01/22&lt;br /&gt;
| message = cpu2025 partition is now available.&lt;br /&gt;
&lt;br /&gt;
New nodes have been installed to replace the aging parallel nodes. Each has 1 TiB of memory and 128 CPUs.&lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Outage Feb 23, 2026 Starting at 7AM&lt;br /&gt;
| date = 2026/01/22&lt;br /&gt;
| message = ⚠️ Filesystem Outage Affecting Arc, Marc, Talc, Cloudstack&lt;br /&gt;
&lt;br /&gt;
Filesystems will be unavailable Feb 23 starting at 7AM MST.  This outage is expected to end prior to 8PM.  No logins or file access will be possible until the outage is complete.&lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Navbox ARC}}&lt;br /&gt;
[[Category:ARC]]&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=4020</id>
		<title>ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=4020"/>
		<updated>2026-01-22T20:25:44Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{ARC Cluster Status}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 16, 2023, the ARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The ARC login node will reboot on the morning of January 16. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 20.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following has been changed:&lt;br /&gt;
* OS Updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/02/28&lt;br /&gt;
| message =&lt;br /&gt;
We are currently investigating a filesystem issue that is causing filesystem slowdowns across ARC.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/03/1&lt;br /&gt;
| message =&lt;br /&gt;
We are still currently investigating a filesystem issue that is causing filesystem slowdowns across ARC. Some jobs on ARC have been paused to help us find the root cause of the slowdowns.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ARC Login node reboot&lt;br /&gt;
| date = 2023/03/2&lt;br /&gt;
| message =&lt;br /&gt;
The ARC login node will be rebooted this afternoon for emergency maintenance. This downtime is needed to help mitigate the filesystem slowdowns experienced on the login node.  Jobs will continue running and scheduling during this time.&lt;br /&gt;
&lt;br /&gt;
All logins to the ARC login node will be terminated at 3:00PM and will remain unavailable until 4:00PM.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Filesystem Issues&lt;br /&gt;
| date = 2023/03/2&lt;br /&gt;
| message =&lt;br /&gt;
We are still currently investigating a filesystem issue that is causing filesystem slowdowns on specific nodes in our MSRDC location.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues Resolved&lt;br /&gt;
| date = 2023/03/10&lt;br /&gt;
| message =&lt;br /&gt;
We have upgraded the filesystem routers in our MSRDC location to address the performance issues.&lt;br /&gt;
&lt;br /&gt;
Please let us know if you experience any issues with the filesystem performance.&lt;br /&gt;
&lt;br /&gt;
Thank-you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/05/01&lt;br /&gt;
| message =&lt;br /&gt;
On May 1, 2023, the ARC Open OnDemand node will be rebooted between 5PM and 6PM. Expected downtime will be approximately 15 minutes.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on ARC Login Node&lt;br /&gt;
| date = 2023/06/22&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may experience an error when&lt;br /&gt;
running on the Arc login node. If apptainer complains that a system&lt;br /&gt;
administrator needs to enable user namespaces, simply run your&lt;br /&gt;
containers inside a job.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to security vulnerability that will be&lt;br /&gt;
patched soon.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Lattice, Single, cpu2013 partition changes&lt;br /&gt;
| date = 2023/07/13&lt;br /&gt;
| message =&lt;br /&gt;
The Lattice, Single, and cpu2013 partitions have all been decommissioned. The Single&lt;br /&gt;
partition will be replaced by the nodes formerly in the cpu2013 partition but&lt;br /&gt;
will keep the name single.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/10/17&lt;br /&gt;
| message =&lt;br /&gt;
Open OnDemand will be rebooted on October 17, 2023 for an update. It will be down for up to 30 minutes.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC/ARC cluster on &lt;br /&gt;
November 16 and 17, 2023. To facilitate this, we will be throttling &lt;br /&gt;
down the number of jobs on both clusters while the upgrades are &lt;br /&gt;
performed.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/05/3&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = Arc experienced a brief power outage around 11AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting.  Most jobs running at this time &lt;br /&gt;
were lost. Arc administrators are actively working on restarting compute &lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation&lt;br /&gt;
| date = 2024/06/03&lt;br /&gt;
| message = Job submissions targeted to the GPU a100 partition will be &lt;br /&gt;
affected by a temporary reservation on the nodes to accommodate the RCS&lt;br /&gt;
summer school class taking place on 2024/Jun/10. The reservation will end &lt;br /&gt;
shortly after. Please submit your jobs normally and the scheduler will &lt;br /&gt;
start them as soon as the nodes are available. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation Removed&lt;br /&gt;
| date = 2024/06/11&lt;br /&gt;
| message = GPU a100 Nodes in ARC have been returned to normal scheduling. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/23&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable &lt;br /&gt;
between Sept 23 and Sept 27 inclusive (subject to change). All compute nodes &lt;br /&gt;
in cpu2019, cpu2021/2, and gpu-v100, as well as most nodes from bigmem and gpu-a100, will be &lt;br /&gt;
affected. These nodes will return to service as soon as the work is complete.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues that are blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage will occur starting next Tuesday for the same nodes.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024.&lt;br /&gt;
&lt;br /&gt;
We apologise for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling has resumed. &lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The ARC login node will be rebooted on Tuesday December 17 for scheduled maintenance. It will be down for a few minutes and return shortly. Job scheduling and jobs running on the cluster will not be affected. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades starting 9AM Monday, January 20, 2025 through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported Infiniband on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10AM Arc experienced an electrical power brownout.  Some of the nodes (how many is unknown at this time) lost electrical power, causing the loss of a number of running jobs.&lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since Arc is shutting down for maintenance on Monday Jan 20, replacement jobs will likely not start unless they request a timelimit shorter than the time remaining until 8AM Monday.&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC cluster upgrade is complete&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported Infiniband on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal was upgraded.&lt;br /&gt;
&lt;br /&gt;
6. The Parallel partition was renamed to Legacy to reflect the lack of an interconnect for parallel MPI work and was restricted to a maximum of 4-node jobs.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
Jan 23, 9:08AM&lt;br /&gt;
&lt;br /&gt;
Remount complete.  Arc is back in full service.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgrade&lt;br /&gt;
| date = 2025/02/03&lt;br /&gt;
| message = Upgrade of the module command&lt;br /&gt;
&lt;br /&gt;
On Tuesday, February 11, 2025 the module command will be upgraded to a new version on Arc. This should result in new capabilities and a slightly different visual experience when using the module command.  Loading modules is not expected to change.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgraded&lt;br /&gt;
| date = 2025/02/12&lt;br /&gt;
| message = The module command was upgraded&lt;br /&gt;
&lt;br /&gt;
On Tuesday, February 12, 2025 the module command was upgraded to a new version on Arc. This should result in new capabilities and a slightly different visual experience when using the module command.  Loading modules should not have changed.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Please be informed that our support email address (support@hpc.ucalgary.ca) for RCS is currently not working. We are working to bring it back as soon as possible. Please keep an eye on this space for updates. The clusters are working normally, but support will not receive your messages at this time. We will begin responding as soon as we can get it back. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired and RCS can be contacted there. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit will be Enforced&lt;br /&gt;
| date = 2025/04/11&lt;br /&gt;
| message = In order to improve the scheduling and job throughput efficiency of ARC, interactive jobs will be limited to a maximum of 5 hours of runtime.  Interactive jobs that are submitted with a timelimit over 5 hours will be rejected at submission time.  This change will be made on Monday, April 28, 2025.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit Is Now Enforced&lt;br /&gt;
| date = 2025/04/28&lt;br /&gt;
| message = In order to improve the scheduling and job throughput efficiency of ARC, interactive jobs are now limited to a maximum of 5 hours of runtime.  Interactive jobs that are submitted with a timelimit over 5 hours will be rejected at submission time.&lt;br /&gt;
&lt;br /&gt;
Apr 29, 2025&lt;br /&gt;
To increase the security posture of the Arc cluster, administrators will be installing Trend Micro on cluster login nodes during the week starting Apr 30.  Please report any inconsistencies to support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Bulk Filesystem Emergency Maintenance&lt;br /&gt;
| date = 2025/07/29&lt;br /&gt;
| message = The filer that provides the /bulk filesystem will be down for emergency repairs at 12 Noon on Thursday July 31.  No access to files on /bulk will be possible for the duration of the multi-hour outage.  Any jobs running that access /bulk will start and then pause when access to /bulk is attempted.  Jobs should continue once service is restored.  Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Bulk Filesystem Maintenance Complete&lt;br /&gt;
| date = 2025/08/01&lt;br /&gt;
| message = Maintenance on /bulk was completed successfully and all filesystems are back in service on Arc.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU partition changes on ARC&lt;br /&gt;
| date = 2025/08/15&lt;br /&gt;
| message = New gpu-h100 and gpu-l40 partitions are available for general scheduling with new GPU hardware. The gpu-v100 partition also has 6 fewer nodes. You can view more details about the node specs by running the arc.hardware script on the login node. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Update and Bulk outage on ARC&lt;br /&gt;
| date = 2025/09/15&lt;br /&gt;
| message = ⚠️ Arc Update September 15-19&lt;br /&gt;
&lt;br /&gt;
An update of the Arc cluster will begin on Sep 15.  This will result in fewer available resources while the compute nodes are restarted.  The login node and scheduler will be restarted on Sep 17.&lt;br /&gt;
&lt;br /&gt;
⚠️ Bulk Filesystem Maintenance September 17&lt;br /&gt;
The filer that provides the /bulk filesystem will be down for emergency repairs at 9 AM on Wednesday Sep 17th. No access to files on /bulk will be possible for the duration of the multi-hour outage. Any jobs running that access /bulk will start and then pause when access to /bulk is attempted. Jobs should continue once service is restored.  We apologize for any inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Update and Bulk outage on ARC&lt;br /&gt;
| date = 2025/09/18&lt;br /&gt;
| message = Arc update complete&lt;br /&gt;
&lt;br /&gt;
The update to arc is complete.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU partition changes on ARC&lt;br /&gt;
| date = 2025/09/23&lt;br /&gt;
| message = gpu-v100 has 6 more nodes. You can view more details about the node specs by running the arc.hardware script on the login node. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2025/11/4&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Legacy compute nodes are being retired&lt;br /&gt;
| date = 2025/11/20&lt;br /&gt;
| message = ARC nodes cn[0513-1096] are being retired. They will be removed from scheduling and then from the cluster over the coming weeks. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Planned outage in Jan 2026&lt;br /&gt;
| date = 2025/12/26&lt;br /&gt;
| message = ⚠️ Arc Update Week of January 12&lt;br /&gt;
&lt;br /&gt;
ARC will be going down periodically during the week of 2026/Jan/12 to allow for system maintenance. All nodes and filesystems will be unavailable during the outage, and filesystem performance may be variable for a time after the upgrade until the changes finish. &lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Arc Filesystem Returned to service, scheduling outages continue&lt;br /&gt;
| date = 2026/01/13&lt;br /&gt;
| message = ⚠️ Data access now possible ⚠️&lt;br /&gt;
&lt;br /&gt;
The first stage of the maintenance is complete and the filesystems are once again available.  &lt;br /&gt;
&lt;br /&gt;
There is still significant work to do on the cluster, so jobs will not start until this maintenance is complete, but clients will be able to access data via the login node and DTN.&lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Arc Returned to service, scheduling resumed&lt;br /&gt;
| date = 2026/01/16&lt;br /&gt;
| message = Maintenance Complete&lt;br /&gt;
&lt;br /&gt;
Arc has been returned to service.&lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = New Compute nodes added to Arc&lt;br /&gt;
| date = 2026/01/22&lt;br /&gt;
| message = cpu2025 partition is now available.&lt;br /&gt;
&lt;br /&gt;
New nodes have been installed to replace the aging parallel nodes. Each has 1 TiB of memory and 128 CPUs.&lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Navbox ARC}}&lt;br /&gt;
[[Category:ARC]]&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=4017</id>
		<title>ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=4017"/>
		<updated>2026-01-16T19:49:32Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{ARC Cluster Status}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 16, 2023, the ARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The ARC login node will reboot on the morning of January 16. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 20.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following has been changed:&lt;br /&gt;
* OS Updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/02/28&lt;br /&gt;
| message =&lt;br /&gt;
We are currently investigating a filesystem issue that is causing filesystem slowdowns across ARC.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/03/01&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing filesystem slowdowns across ARC. Some jobs on ARC have been paused to help us find the root cause of the slowdowns.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ARC Login node reboot&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
The ARC login node will be rebooted this afternoon for emergency maintenance. This downtime is needed to help mitigate the filesystem slowdowns experienced on the login node. Jobs will continue running and scheduling during this time.&lt;br /&gt;
&lt;br /&gt;
All logins to the ARC login node will be terminated at 3:00PM and will remain unavailable until 4:00PM.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Filesystem Issues&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing filesystem slowdowns on specific nodes in our MSRDC location.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues Resolved&lt;br /&gt;
| date = 2023/03/10&lt;br /&gt;
| message =&lt;br /&gt;
We have upgraded the filesystem routers in our MSRDC location to address the performance issues.&lt;br /&gt;
&lt;br /&gt;
Please let us know if you experience any issues with the filesystem performance.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/05/01&lt;br /&gt;
| message =&lt;br /&gt;
On May 1, 2023, the ARC Open OnDemand node will be rebooted between 5PM and 6PM. Expected downtime will be approximately 15 minutes.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on ARC Login Node&lt;br /&gt;
| date = 2023/06/22&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may experience an error when&lt;br /&gt;
running on the ARC login node. If Apptainer complains that a system&lt;br /&gt;
administrator needs to enable user namespaces, simply run your&lt;br /&gt;
containers inside a job.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to a security vulnerability that will be&lt;br /&gt;
patched soon.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Lattice, Single, cpu2013 partition changes&lt;br /&gt;
| date = 2023/07/13&lt;br /&gt;
| message =&lt;br /&gt;
The Lattice, Single, and cpu2013 partitions have all been decommissioned.&lt;br /&gt;
The Single partition will be replaced by the nodes formerly in the cpu2013&lt;br /&gt;
partition, but will still be called single.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/10/17&lt;br /&gt;
| message =&lt;br /&gt;
Open OnDemand will be rebooted on October 17, 2023 for an update. It will be down for up to 30 minutes.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC/ARC cluster on &lt;br /&gt;
November 16 and 17, 2023. To facilitate this, we will be throttling &lt;br /&gt;
down the number of jobs on both clusters while the upgrades are &lt;br /&gt;
performed.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/05/03&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = ARC experienced a brief power outage around 11 AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting. Most jobs running at the time &lt;br /&gt;
were lost. ARC administrators are actively working on restarting compute &lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation&lt;br /&gt;
| date = 2024/06/03&lt;br /&gt;
| message = Job submissions targeted to the GPU a100 partition will be &lt;br /&gt;
affected by a temporary reservation on the nodes to accommodate the RCS&lt;br /&gt;
summer school class taking place on 2024/Jun/10. The reservation will end &lt;br /&gt;
shortly afterwards. Please submit your jobs normally and the scheduler will &lt;br /&gt;
start them as soon as the nodes are available. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation Removed&lt;br /&gt;
| date = 2024/06/11&lt;br /&gt;
| message = GPU a100 Nodes in ARC have been returned to normal scheduling. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/23&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable &lt;br /&gt;
from Sept 23 to Sept 27 inclusive (subject to change). All compute nodes &lt;br /&gt;
in cpu2019, cpu2021/2, and gpu-v100, and most nodes from bigmem and gpu-a100, will be &lt;br /&gt;
affected. These nodes will return to service as soon as the work is complete.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage will occur starting next Tuesday for the same nodes.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling has resumed&lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The ARC login node will be rebooted on Tuesday December 17 for scheduled maintenance. It will be down for a few minutes and return shortly. Job scheduling and jobs running on the cluster will not be affected. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades starting 9AM Monday, January 20, 2025 through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported InfiniBand on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10 AM, ARC experienced an electrical power brownout. Some percentage of the nodes (how many is unknown at this time) lost electrical power, causing the loss of a number of running jobs.&lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since ARC is shutting down for maintenance on Monday, Jan 20, replacement jobs will likely not start unless they request a time limit shorter than the time remaining until 8 AM Monday.&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC cluster upgrade is complete.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported InfiniBand on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal was upgraded.&lt;br /&gt;
&lt;br /&gt;
6. The Parallel partition was renamed to Legacy, to reflect the lack of an interconnect for parallel MPI work, and was restricted to a maximum of 4 nodes per job.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
Jan 23, 9:08 AM&lt;br /&gt;
&lt;br /&gt;
Remount complete. ARC is back in full service.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgrade&lt;br /&gt;
| date = 2025/02/03&lt;br /&gt;
| message = Upgrade of the module command&lt;br /&gt;
&lt;br /&gt;
On Tuesday, February 11, 2025, the module command will be upgraded to a new version on ARC. This should result in new capabilities and a slightly different visual experience when using the module command. Loading modules is not expected to change.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgraded&lt;br /&gt;
| date = 2025/02/12&lt;br /&gt;
| message = The module command was upgraded&lt;br /&gt;
&lt;br /&gt;
On February 12, 2025, the module command was upgraded to a new version on ARC. This should result in new capabilities and a slightly different visual experience when using the module command. Loading modules should not have changed.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Please be advised that our RCS support email address (support@hpc.ucalgary.ca) is currently not working. We are working to restore it as soon as possible; please keep an eye on this space for updates. The clusters are operating normally, but support will not receive your messages at this time. We will begin responding as soon as the address is restored.&lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired and RCS can be contacted there. If you had reached out for assistance in recent days without response please follow up as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit will be Enforced&lt;br /&gt;
| date = 2025/04/11&lt;br /&gt;
| message = In order to improve the scheduling and job throughput efficiency of ARC, interactive jobs will be limited to a maximum of 5 hours of runtime. Interactive jobs that are submitted with a time limit over 5 hours will be rejected at submission time. This change will be made on Monday, April 28, 2025.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit Is Now Enforced&lt;br /&gt;
| date = 2025/04/28&lt;br /&gt;
| message = In order to improve the scheduling and job throughput efficiency of ARC, interactive jobs are now limited to a maximum of 5 hours of runtime. Interactive jobs that are submitted with a time limit over 5 hours will be rejected at submission time.&lt;br /&gt;
&lt;br /&gt;
Apr 29, 2025&lt;br /&gt;
To increase the security posture of the ARC cluster, administrators will be installing Trend Micro on the cluster login nodes over the week starting Apr 30. Please report any inconsistencies to support@hpc.ucalgary.ca.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Bulk Filesystem Emergency Maintenance&lt;br /&gt;
| date = 2025/07/29&lt;br /&gt;
| message = The filer that provides the /bulk filesystem will be down for emergency repairs at 12 noon on Thursday, July 31. No access to files on /bulk will be possible for the duration of the multi-hour outage. Any jobs that access /bulk will start and then pause when access to /bulk is attempted. Jobs should continue once service is restored. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Bulk Filesystem Maintenance Complete&lt;br /&gt;
| date = 2025/08/01&lt;br /&gt;
| message = Maintenance on /bulk was completed successfully and all filesystems are back in service on Arc.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU partition changes on ARC&lt;br /&gt;
| date = 2025/08/15&lt;br /&gt;
| message = New gpu-h100 and gpu-l40 partitions are available for general scheduling with new GPU hardware. gpu-v100 also has 6 fewer nodes. You can view more details about the node specs by running the arc.hardware script on the login node.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Update and Bulk outage on ARC&lt;br /&gt;
| date = 2025/09/15&lt;br /&gt;
| message = ⚠️ Arc Update September 15-19&lt;br /&gt;
&lt;br /&gt;
An update of the ARC cluster will begin on Sep 15. This will result in fewer available resources while the compute nodes are restarted. The login node and scheduler will be restarted on Sep 17.&lt;br /&gt;
&lt;br /&gt;
⚠️ Bulk Filesystem Maintenance September 17&lt;br /&gt;
The filer that provides the /bulk filesystem will be down for emergency repairs at 9 AM on Wednesday Sep 17th. No access to files on /bulk will be possible for the duration of the multi-hour outage. Any jobs running that access /bulk will start and then pause when access to /bulk is attempted. Jobs should continue once service is restored.  We apologize for any inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Update and Bulk outage on ARC&lt;br /&gt;
| date = 2025/09/18&lt;br /&gt;
| message = ARC update complete&lt;br /&gt;
&lt;br /&gt;
The update to ARC is complete.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU partition changes on ARC&lt;br /&gt;
| date = 2025/09/23&lt;br /&gt;
| message = gpu-v100 has 6 more nodes. You can view more details about the node specs by running the arc.hardware script on the login node. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2025/11/04&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Legacy compute nodes are being retired&lt;br /&gt;
| date = 2025/11/20&lt;br /&gt;
| message = ARC nodes cn[0513-1096] are being removed from the ARC cluster. They will be taken out of scheduling and then gradually removed from the cluster.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Planned outage in Jan 2026&lt;br /&gt;
| date = 2025/12/26&lt;br /&gt;
| message = ⚠️ Arc Update Week of January 12&lt;br /&gt;
&lt;br /&gt;
ARC will be going down periodically during the week of 2026/Jan/12 to allow for system maintenance. All nodes and filesystems will be unavailable during the outage, and filesystem performance may be variable for a time after the upgrade until the changes finish.&lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Arc Filesystem Returned to service, scheduling outages continue&lt;br /&gt;
| date = 2026/01/13&lt;br /&gt;
| message = ⚠️ Data access now possible ⚠️&lt;br /&gt;
&lt;br /&gt;
The first stage of the maintenance is complete and the filesystems are once again available.  &lt;br /&gt;
&lt;br /&gt;
There is still significant work to do on the cluster, so jobs will not start until this maintenance is complete, but clients will be able to access data via the login node and DTN.&lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Arc Returned to service, scheduling resumed&lt;br /&gt;
| date = 2026/01/16&lt;br /&gt;
| message = Maintenance Complete&lt;br /&gt;
&lt;br /&gt;
ARC has been returned to service.&lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Navbox ARC}}&lt;br /&gt;
[[Category:ARC]]&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=4016</id>
		<title>ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=4016"/>
		<updated>2026-01-16T19:47:37Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{ARC Cluster Status}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 16, 2023, the ARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The ARC login node will reboot on the morning of January 16. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 20.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following has been changed:&lt;br /&gt;
* OS Updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/02/28&lt;br /&gt;
| message =&lt;br /&gt;
We are currently investigating a filesystem issue that is causing filesystem slowdowns across ARC.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/03/01&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing filesystem slowdowns across ARC. Some jobs on ARC have been paused to help us find the root cause of the slowdowns.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ARC Login node reboot&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
The ARC login node will be rebooted this afternoon for emergency maintenance. This downtime is needed to help mitigate the filesystem slowdowns experienced on the login node. Jobs will continue running and scheduling during this time.&lt;br /&gt;
&lt;br /&gt;
All logins to the ARC login node will be terminated at 3:00PM and will remain unavailable until 4:00PM.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Filesystem Issues&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing filesystem slowdowns on specific nodes in our MSRDC location.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues Resolved&lt;br /&gt;
| date = 2023/03/10&lt;br /&gt;
| message =&lt;br /&gt;
We have upgraded the filesystem routers in our MSRDC location to address the performance issues.&lt;br /&gt;
&lt;br /&gt;
Please let us know if you experience any issues with the filesystem performance.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/05/01&lt;br /&gt;
| message =&lt;br /&gt;
On May 1, 2023, the ARC Open OnDemand node will be rebooted between 5PM and 6PM. Expected downtime will be approximately 15 minutes.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on ARC Login Node&lt;br /&gt;
| date = 2023/06/22&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may experience an error when&lt;br /&gt;
running on the ARC login node. If Apptainer complains that a system&lt;br /&gt;
administrator needs to enable user namespaces, simply run your&lt;br /&gt;
containers inside a job.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to a security vulnerability that will be&lt;br /&gt;
patched soon.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Lattice, Single, cpu2013 partition changes&lt;br /&gt;
| date = 2023/07/13&lt;br /&gt;
| message =&lt;br /&gt;
The Lattice, Single, and cpu2013 partitions have all been decommissioned.&lt;br /&gt;
The Single partition will be replaced by the nodes formerly in the cpu2013&lt;br /&gt;
partition, but will still be called single.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/10/17&lt;br /&gt;
| message =&lt;br /&gt;
Open OnDemand will be rebooted on October 17, 2023 for an update. It will be down for up to 30 minutes.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC/ARC cluster on &lt;br /&gt;
November 16 and 17, 2023. To facilitate this, we will be throttling &lt;br /&gt;
down the number of jobs on both clusters while the upgrades are &lt;br /&gt;
performed.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/05/03&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = ARC experienced a brief power outage around 11 AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting. Most jobs running at the time &lt;br /&gt;
were lost. ARC administrators are actively working on restarting compute &lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation&lt;br /&gt;
| date = 2024/06/03&lt;br /&gt;
| message = Job submissions targeted to the GPU a100 partition will be &lt;br /&gt;
affected by a temporary reservation on the nodes to accommodate the RCS&lt;br /&gt;
summer school class taking place on 2024/Jun/10. The reservation will end &lt;br /&gt;
shortly afterwards. Please submit your jobs normally and the scheduler will &lt;br /&gt;
start them as soon as the nodes are available. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation Removed&lt;br /&gt;
| date = 2024/06/11&lt;br /&gt;
| message = GPU a100 Nodes in ARC have been returned to normal scheduling. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/23&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable &lt;br /&gt;
from Sept 23 to Sept 27 inclusive (subject to change). All compute nodes &lt;br /&gt;
in cpu2019, cpu2021/2, and gpu-v100, and most nodes from bigmem and gpu-a100, will be &lt;br /&gt;
affected. These nodes will return to service as soon as the work is complete.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage will occur starting next Tuesday for the same nodes.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control, the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling Has Resumed&lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The ARC login node will be rebooted on Tuesday, December 17 for scheduled maintenance. It will be down for a few minutes. Job scheduling and jobs running on the cluster will not be affected. Thank you for your understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday, January 20, 2025. Please save your work and log out before the reboot. Scheduling will be paused while the cluster is down, but queued jobs will remain in the queue and scheduling will resume when the cluster is ready. Thank you for your understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades from 9 AM Monday, January 20, 2025 through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported InfiniBand interconnect on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10 AM, Arc experienced an electrical power brownout.  Some percentage of the nodes (how many is unknown at this time) lost electrical power, causing the loss of a number of running jobs.  &lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since Arc is shutting down for maintenance on Monday, Jan 20, replacement jobs will likely not start unless they request a time limit shorter than the time remaining until 8 AM Monday.  &lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC cluster upgrade is complete.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported InfiniBand interconnect on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal was upgraded.&lt;br /&gt;
&lt;br /&gt;
6. The Parallel partition was renamed to Legacy to reflect the lack of a high-speed interconnect for parallel MPI work, and it is now restricted to jobs of at most 4 nodes.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
Jan 23, 9:08AM&lt;br /&gt;
&lt;br /&gt;
Remount complete.  Arc is back in full service.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgrade&lt;br /&gt;
| date = 2025/02/03&lt;br /&gt;
| message = Upgrade of the module command&lt;br /&gt;
&lt;br /&gt;
On Tuesday, February 11, 2025, the module command will be upgraded to a new version on Arc. This should result in new capabilities and a slightly different visual experience when using the module command.  Loading modules is not expected to change.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgraded&lt;br /&gt;
| date = 2025/02/12&lt;br /&gt;
| message = The module command was upgraded&lt;br /&gt;
&lt;br /&gt;
On Tuesday, February 12, 2025, the module command was upgraded to a new version on Arc. This should result in new capabilities and a slightly different visual experience when using the module command.  Loading modules should not have changed.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Please note that our support email address (support@hpc.ucalgary.ca) for RCS is currently not working. We are working to restore it as soon as possible. Please keep an eye on this space for updates. The clusters are operating normally, but support will not receive your messages at this time. We will begin responding as soon as service is restored. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired and RCS can be contacted there. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit will be Enforced&lt;br /&gt;
| date = 2025/04/11&lt;br /&gt;
| message = To improve the scheduling and job throughput efficiency of ARC, interactive jobs will be limited to a maximum of 5 hours of runtime.  Interactive jobs submitted with a time limit over 5 hours will be rejected at submission time.  This change will be made on Monday, April 28, 2025.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit Is Now Enforced&lt;br /&gt;
| date = 2025/04/28&lt;br /&gt;
| message = To improve the scheduling and job throughput efficiency of ARC, interactive jobs are now limited to a maximum of 5 hours of runtime.  Interactive jobs submitted with a time limit over 5 hours will be rejected at submission time.&lt;br /&gt;
&lt;br /&gt;
Apr 29, 2025&lt;br /&gt;
To increase the security posture of the Arc cluster, administrators will be installing Trend Micro on cluster login nodes during the week starting Apr 30.  Please report any issues to support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Bulk Filesystem Emergency Maintenance&lt;br /&gt;
| date = 2025/07/29&lt;br /&gt;
| message = The filer that provides the /bulk filesystem will be down for emergency repairs at 12 Noon on Thursday July 31.  No access to files on /bulk will be possible for the duration of the multi-hour outage.  Any jobs running that access /bulk will start and then pause when access to /bulk is attempted.  Jobs should continue once service is restored.  Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Bulk Filesystem Maintenance Complete&lt;br /&gt;
| date = 2025/08/01&lt;br /&gt;
| message = Maintenance on /bulk was completed successfully and all filesystems are back in service on Arc.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU partition changes on ARC&lt;br /&gt;
| date = 2025/08/15&lt;br /&gt;
| message = New gpu-h100 and gpu-l40 partitions are available for general scheduling with new gpu hardware. gpu-v100 also has 6 fewer nodes. You can view more details about the node specs by running the arc.hardware script on the login node. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Update and Bulk outage on ARC&lt;br /&gt;
| date = 2025/09/15&lt;br /&gt;
| message = ⚠️ Arc Update September 15-19&lt;br /&gt;
&lt;br /&gt;
An update of the Arc cluster will begin on Sep 15.  Available resources will be temporarily reduced while the compute nodes are restarted.  The login node and scheduler will be restarted on Sep 17.&lt;br /&gt;
&lt;br /&gt;
⚠️ Bulk Filesystem Maintenance September 17&lt;br /&gt;
The filer that provides the /bulk filesystem will be down for emergency repairs at 9 AM on Wednesday Sep 17th. No access to files on /bulk will be possible for the duration of the multi-hour outage. Any jobs running that access /bulk will start and then pause when access to /bulk is attempted. Jobs should continue once service is restored.  We apologize for any inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Update and Bulk outage on ARC&lt;br /&gt;
| date = 2025/09/18&lt;br /&gt;
| message = Arc update complete&lt;br /&gt;
&lt;br /&gt;
The update to Arc is complete.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU partition changes on ARC&lt;br /&gt;
| date = 2025/09/23&lt;br /&gt;
| message = gpu-v100 has 6 more nodes. You can view more details about the node specs by running the arc.hardware script on the login node. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2025/11/04&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Legacy compute nodes are being retired&lt;br /&gt;
| date = 2025/11/20&lt;br /&gt;
| message = ARC nodes cn[0513-1096] are being removed from the ARC cluster. They will be withdrawn from scheduling and then removed from the cluster. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Planned outage in Jan 2026&lt;br /&gt;
| date = 2025/12/26&lt;br /&gt;
| message = ⚠️ Arc Update Week of January 12&lt;br /&gt;
&lt;br /&gt;
ARC will be going down periodically during the week of 2026/Jan/12 for system maintenance. All nodes and filesystems will be unavailable during the outage, and filesystem performance may vary for a time after the upgrade until the changes are complete. &lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Arc Filesystem Returned to service, scheduling outages continue&lt;br /&gt;
| date = 2026/01/13&lt;br /&gt;
| message = ⚠️ Data access now possible ⚠️&lt;br /&gt;
&lt;br /&gt;
The first stage of the maintenance is complete and the filesystems are once again available.  &lt;br /&gt;
&lt;br /&gt;
There is still significant work to do on the cluster, so jobs will not start until this maintenance is complete, but clients will be able to access data via the login node and DTN.&lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Arc Returned to service, scheduling resumed&lt;br /&gt;
| date = 2026/01/16&lt;br /&gt;
| message = Maintenance Complete&lt;br /&gt;
&lt;br /&gt;
Arc has been returned to service.&lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Navbox ARC}}&lt;br /&gt;
[[Category:ARC]]&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=TALC_Cluster_Status&amp;diff=4015</id>
		<title>TALC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=TALC_Cluster_Status&amp;diff=4015"/>
		<updated>2026-01-13T20:49:40Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{TALC Cluster Status}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ May System Updates&lt;br /&gt;
| date = 2023/02/02&lt;br /&gt;
| message =&lt;br /&gt;
Beginning May 1, 2023, the TALC cluster will undergo operating system updates. The upgrade will happen after the end of term to minimize any disruption. Any existing jobs may be &lt;br /&gt;
temporarily held from scheduling. The upgrade is planned to be fully complete by May 5.&lt;br /&gt;
&lt;br /&gt;
The TALC login node will reboot on the morning of May 1.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = May System Updates Completed&lt;br /&gt;
| date = 2023/05/04&lt;br /&gt;
| message =&lt;br /&gt;
TALC upgrades have been completed. If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = TALC experienced a brief power outage around 11 AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting.  Most jobs running at this time &lt;br /&gt;
were lost. Administrators are actively working on restarting compute &lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/06/26&lt;br /&gt;
| message = &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Please note that our support email address (support@hpc.ucalgary.ca) for RCS is currently not working. We are working to restore it as soon as possible. Please keep an eye on this space for updates. The clusters are operating normally, but support will not receive your messages at this time. We will begin responding as soon as service is restored. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired and RCS can be contacted there. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Trend Micro Installation&lt;br /&gt;
| date = 2025/04/29&lt;br /&gt;
| message = To increase the security posture of the Arc cluster, administrators will be installing Trend Micro on cluster login nodes during the week starting Apr 30.  Please report any issues to support@hpc.ucalgary.ca&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Planned outage in Jan 2026&lt;br /&gt;
| date = 2025/12/26&lt;br /&gt;
| message = ⚠️ TALC Update Week of January 12&lt;br /&gt;
&lt;br /&gt;
TALC will be going down periodically during the week of 2026/Jan/12 for system maintenance. All nodes and filesystems will be unavailable during the outage, and filesystem performance may vary for a time after the upgrade until the changes are complete. &lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Talc Filesystem Returned to service, scheduling outages continue&lt;br /&gt;
| date = 2026/01/13&lt;br /&gt;
| message = ⚠️ Data access now possible ⚠️&lt;br /&gt;
&lt;br /&gt;
The first stage of the maintenance is complete and the filesystems are once again available.  &lt;br /&gt;
&lt;br /&gt;
There is still significant work to do on the cluster, so jobs will not start until this maintenance is complete, but clients will be able to access data via the login node.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Talc Returned to service&lt;br /&gt;
| date = 2026/01/13&lt;br /&gt;
| message = Maintenance Complete&lt;br /&gt;
&lt;br /&gt;
Talc maintenance complete. &lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:TALC]]&lt;br /&gt;
{{Navbox TALC}}&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=TALC_Cluster_Status&amp;diff=4014</id>
		<title>TALC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=TALC_Cluster_Status&amp;diff=4014"/>
		<updated>2026-01-13T20:04:59Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{TALC Cluster Status}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ May System Updates&lt;br /&gt;
| date = 2023/02/02&lt;br /&gt;
| message =&lt;br /&gt;
Beginning May 1, 2023, the TALC cluster will undergo operating system updates. The upgrade will happen after the end of term to minimize any disruption. Any existing jobs may be &lt;br /&gt;
temporarily held from scheduling. The upgrade is planned to be fully complete by May 5.&lt;br /&gt;
&lt;br /&gt;
The TALC login node will reboot on the morning of May 1.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = May System Updates Completed&lt;br /&gt;
| date = 2023/05/04&lt;br /&gt;
| message =&lt;br /&gt;
TALC upgrades have been completed. If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = TALC experienced a brief power outage around 11 AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting.  Most jobs running at this time &lt;br /&gt;
were lost. Administrators are actively working on restarting compute &lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/06/26&lt;br /&gt;
| message = &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Please note that our support email address (support@hpc.ucalgary.ca) for RCS is currently not working. We are working to restore it as soon as possible. Please keep an eye on this space for updates. The clusters are operating normally, but support will not receive your messages at this time. We will begin responding as soon as service is restored. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired and RCS can be contacted there. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Trend Micro Installation&lt;br /&gt;
| date = 2025/04/29&lt;br /&gt;
| message = To increase the security posture of the Arc cluster, administrators will be installing Trend Micro on cluster login nodes during the week starting Apr 30.  Please report any issues to support@hpc.ucalgary.ca&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Planned outage in Jan 2026&lt;br /&gt;
| date = 2025/12/26&lt;br /&gt;
| message = ⚠️ TALC Update Week of January 12&lt;br /&gt;
&lt;br /&gt;
TALC will be going down periodically during the week of 2026/Jan/12 for system maintenance. All nodes and filesystems will be unavailable during the outage, and filesystem performance may vary for a time after the upgrade until the changes are complete. &lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Talc Filesystem Returned to service, scheduling outages continue&lt;br /&gt;
| date = 2025/01/13&lt;br /&gt;
| message = ⚠️ Data access now possible ⚠️&lt;br /&gt;
&lt;br /&gt;
The first stage of the maintenance is complete and the filesystems are once again available.  &lt;br /&gt;
&lt;br /&gt;
There is still significant work to do on the cluster, so jobs will not start until this maintenance is complete, but clients will be able to access data via the login node.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:TALC]]&lt;br /&gt;
{{Navbox TALC}}&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=4012</id>
		<title>ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=4012"/>
		<updated>2026-01-13T19:54:55Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{ARC Cluster Status}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 16, 2023, the ARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The ARC login node will reboot on the morning of January 16. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 20.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following has been changed:&lt;br /&gt;
* OS Updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/02/28&lt;br /&gt;
| message =&lt;br /&gt;
We are currently investigating a filesystem issue that is causing filesystem slowdowns across ARC.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/03/01&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing filesystem slowdowns across ARC. Some jobs on ARC have been paused to help us find the root cause of the slowdowns.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ARC Login node reboot&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
The ARC login node will be rebooted this afternoon for emergency maintenance. This downtime is needed to help mitigate the filesystem slowdowns experienced on the login node.  Jobs will continue running and scheduling during this time.&lt;br /&gt;
&lt;br /&gt;
All logins to the ARC login node will be terminated at 3:00PM and will remain unavailable until 4:00PM.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Filesystem Issues&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing filesystem slowdowns on specific nodes in our MSRDC location.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues Resolved&lt;br /&gt;
| date = 2023/03/10&lt;br /&gt;
| message =&lt;br /&gt;
We have upgraded the filesystem routers in our MSRDC location to address the performance issues.&lt;br /&gt;
&lt;br /&gt;
Please let us know if you experience any issues with the filesystem performance.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/05/01&lt;br /&gt;
| message =&lt;br /&gt;
On May 1, 2023, the ARC Open OnDemand node will be rebooted between 5PM and 6PM. Expected downtime will be approximately 15 minutes.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on ARC Login Node&lt;br /&gt;
| date = 2023/06/22&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may experience an error when&lt;br /&gt;
running on the Arc login node. If apptainer complains that a system&lt;br /&gt;
administrator needs to enable user namespaces, simply run your&lt;br /&gt;
containers inside a job.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to a security vulnerability that will be&lt;br /&gt;
patched soon.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Lattice, Single, cpu2013 partition changes&lt;br /&gt;
| date = 2023/07/13&lt;br /&gt;
| message =&lt;br /&gt;
The Lattice, Single, and cpu2013 partitions have all been decommissioned.  The Single&lt;br /&gt;
partition will be replaced by the nodes formerly in the cpu2013 partition, but&lt;br /&gt;
will still be called single.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/10/17&lt;br /&gt;
| message =&lt;br /&gt;
Open OnDemand will be rebooted on October 17, 2023 for an update. It will be down for up to 30 minutes.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC/ARC cluster on &lt;br /&gt;
November 16 and 17, 2023. To facilitate this, we will be throttling &lt;br /&gt;
down the number of jobs on both clusters while the upgrades are &lt;br /&gt;
performed&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/05/03&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = Arc experienced a brief power outage around 11 AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting.  Most jobs running at the time &lt;br /&gt;
were lost. Arc administrators are actively working on restarting compute &lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation&lt;br /&gt;
| date = 2024/06/03&lt;br /&gt;
| message = Job submissions targeted to the GPU a100 partition will be &lt;br /&gt;
affected by a temporary reservation on the nodes to accommodate the RCS&lt;br /&gt;
summer school class taking place on 2024/Jun/10. The reservation will end &lt;br /&gt;
shortly afterwards. Please submit your jobs normally and the scheduler will &lt;br /&gt;
start them as soon as the nodes are available. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation Removed&lt;br /&gt;
| date = 2024/06/11&lt;br /&gt;
| message = GPU a100 Nodes in ARC have been returned to normal scheduling. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/23&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable &lt;br /&gt;
between Sept 23 and Sept 27 inclusive (subject to change). All compute nodes &lt;br /&gt;
in cpu2019, cpu2021/2, and gpu-v100, and most nodes from bigmem and gpu-a100 will be &lt;br /&gt;
affected. These nodes will return to service as soon as the work is complete.  &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage will begin next Tuesday for the same nodes.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling has resumed. &lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The ARC login node will be rebooted on Tuesday December 17 for scheduled maintenance. It will be down for a few minutes and return shortly. Job scheduling and jobs running on the cluster will not be affected. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades starting 9AM Monday, January 20, 2025 through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11 year old, unsupported Infiniband on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10 AM Arc experienced an electrical power brownout.  Some of the nodes (how many is unknown at this time) lost electrical power, causing the loss of a number of running jobs.  &lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since Arc is shutting down for maintenance on Monday Jan 20, replacement jobs will likely not start unless they request a timelimit shorter than the time remaining until 8AM Monday.  &lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC cluster upgrade is complete.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11 year old, unsupported Infiniband on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal was upgraded.&lt;br /&gt;
&lt;br /&gt;
6. The Parallel partition was renamed to Legacy to reflect the lack of a high-speed interconnect for parallel MPI work, and was restricted to jobs of at most 4 nodes.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
Jan 23, 9:08AM&lt;br /&gt;
&lt;br /&gt;
Remount complete.  Arc is back in full service.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgrade&lt;br /&gt;
| date = 2025/02/03&lt;br /&gt;
| message = Upgrade of the module command&lt;br /&gt;
&lt;br /&gt;
On Tuesday, February 11, 2025 the module command will be upgraded to a new version on Arc. This should result in new capabilities and a slightly different visual experience when using the module command.  Loading modules is not expected to change.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgraded&lt;br /&gt;
| date = 2025/02/12&lt;br /&gt;
| message = The module command was upgraded&lt;br /&gt;
&lt;br /&gt;
On Tuesday, February 12, 2025 the module command was upgraded to a new version on Arc. This should result in new capabilities and a slightly different visual experience when using the module command.  Loading modules should not have changed.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Please be informed that our support email address (support@hpc.ucalgary.ca) for RCS is currently not working. We are working to bring it back as soon as possible. Please keep an eye on this space for updates. The clusters are working normally, but support will not receive your messages at this time. We will begin responding as soon as we can get it back. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired and RCS can be contacted there. If you had reached out for assistance in recent days without response please follow up as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit will be Enforced&lt;br /&gt;
| date = 2025/04/11&lt;br /&gt;
| message = In order to improve the scheduling and job throughput efficiency of ARC, interactive jobs will be limited to a maximum of 5 hours of runtime.  Interactive jobs that are submitted with a timelimit over 5 hours will be rejected at submission time.  This change will be made on Monday, April 28, 2025.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit Is Now Enforced&lt;br /&gt;
| date = 2025/04/28&lt;br /&gt;
| message = In order to improve the scheduling and job throughput efficiency of ARC, interactive jobs are now limited to a maximum of 5 hours of runtime.  Interactive jobs that are submitted with a timelimit over 5 hours will be rejected at submission time.&lt;br /&gt;
&lt;br /&gt;
Apr 29, 2025&lt;br /&gt;
To increase the security posture of the Arc cluster, administrators will be installing Trend Micro on cluster login nodes during the week starting Apr 30.  Please report any issues to support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Bulk Filesystem Emergency Maintenance&lt;br /&gt;
| date = 2025/07/29&lt;br /&gt;
| message = The filer that provides the /bulk filesystem will be down for emergency repairs at 12 noon on Thursday, July 31.  No access to files on /bulk will be possible for the duration of the multi-hour outage.  Running jobs that access /bulk will pause when they attempt that access and should continue once service is restored.  Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Bulk Filesystem Maintenance Complete&lt;br /&gt;
| date = 2025/08/01&lt;br /&gt;
| message = Maintenance on /bulk was completed successfully and all filesystems are back in service on Arc.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU partition changes on ARC&lt;br /&gt;
| date = 2025/08/15&lt;br /&gt;
| message = New gpu-h100 and gpu-l40 partitions are available for general scheduling with new GPU hardware. The gpu-v100 partition also has 6 fewer nodes. You can view more details about the node specs by running the arc.hardware script on the login node. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Update and Bulk outage on ARC&lt;br /&gt;
| date = 2025/09/15&lt;br /&gt;
| message = ⚠️ Arc Update September 15-19&lt;br /&gt;
&lt;br /&gt;
An update of the Arc cluster will begin on Sep 15.  This will result in fewer resources while the compute nodes are restarted.  The login node and scheduler will be restarted on Sep 17.&lt;br /&gt;
&lt;br /&gt;
⚠️ Bulk Filesystem Maintenance September 17&lt;br /&gt;
The filer that provides the /bulk filesystem will be down for emergency repairs at 9 AM on Wednesday Sep 17th. No access to files on /bulk will be possible for the duration of the multi-hour outage. Any jobs running that access /bulk will start and then pause when access to /bulk is attempted. Jobs should continue once service is restored.  We apologize for any inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Update and Bulk outage on ARC&lt;br /&gt;
| date = 2025/09/18&lt;br /&gt;
| message = Arc update complete&lt;br /&gt;
&lt;br /&gt;
The update to Arc is complete.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU partition changes on ARC&lt;br /&gt;
| date = 2025/09/23&lt;br /&gt;
| message = The gpu-v100 partition has 6 more nodes. You can view more details about the node specs by running the arc.hardware script on the login node. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2025/11/04&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Legacy compute nodes are being retired&lt;br /&gt;
| date = 2025/11/20&lt;br /&gt;
| message = ARC nodes cn[0513-1096] are being retired from the ARC cluster. They will be removed from scheduling and then gradually removed from the cluster. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Planned outage in Jan 2026&lt;br /&gt;
| date = 2025/12/26&lt;br /&gt;
| message = ⚠️ Arc Update Week of January 12&lt;br /&gt;
&lt;br /&gt;
ARC will be going down periodically during the week of 2026/Jan/12 to allow for system maintenance. All nodes and filesystems will be unavailable for use during the outage, and filesystem performance may be variable for a time following the upgrade, until the changes finish. &lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Arc Filesystem Returned to service, scheduling outages continue&lt;br /&gt;
| date = 2026/01/13&lt;br /&gt;
| message = ⚠️ Data access now possible ⚠️&lt;br /&gt;
&lt;br /&gt;
The first stage of the maintenance is complete and the filesystems are once again available.  &lt;br /&gt;
&lt;br /&gt;
There is still significant work to do on the cluster so jobs will not start until this maintenance is complete but clients will be able to access data via the login node and DTN.&lt;br /&gt;
&lt;br /&gt;
If you have questions or concerns, please contact support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Navbox ARC}}&lt;br /&gt;
[[Category:ARC]]&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=Running_alphafold3&amp;diff=3945</id>
		<title>Running alphafold3</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=Running_alphafold3&amp;diff=3945"/>
		<updated>2025-11-12T17:55:55Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: Updated Alphafold3 usage instructions&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Running Alphafold3 on Arc =&lt;br /&gt;
&lt;br /&gt;
Alphafold3 has been compiled into an Apptainer image for use on Arc.&lt;br /&gt;
&lt;br /&gt;
Alphafold3 runs in two separate steps: a data pipeline step that requires CPUs only and reads much of the large public_databases directory, and an inference step that requires a GPU.  The pipeline step takes much longer than the inference step, so it is wasteful to occupy a GPU node for its duration.  To run Alphafold3 efficiently we therefore run it two separate times, in different modes, on compute nodes that have the appropriate resources.&lt;br /&gt;
&lt;br /&gt;
== Prerequisites ==&lt;br /&gt;
&lt;br /&gt;
* Due to licensing reasons, every client who would like to use alphafold3 must register for and download the model parameters.  See https://forms.gle/svvpY4u2jsHEwWYS6 to register.  The model parameters are relatively small, can be stored anywhere on Arc that you have access to, and can be used by multiple jobs at the same time (you do not need a separate copy for every job).&lt;br /&gt;
* The alphafold3 container is stored in /global/software/alphafold3.  There are example job scripts in that directory that are referenced below.&lt;br /&gt;
* Due to the nature of the pipeline stage, we need to split each alphafold run into separate pipeline and inference stages.  If we don&#039;t do this, a valuable GPU node is tied up for a long time, preventing others from using it.  In our testing with an input of 120 proteins, the pipeline stage took a day while the inference stage took only an hour.&lt;br /&gt;
* The public databases that Alphafold uses have been pre-downloaded and are stored in a location for anyone on Arc to use.  This location, /bulk/public/alphafold3/public_databases,  is reflected in the example job scripts.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== The Pipeline stage ==&lt;br /&gt;
&lt;br /&gt;
* This stage is very time consuming.  Much of the time is spent waiting to load a very large amount of data from the filesystem.  The example job script in /global/software/alphafold3 copies /bulk/public/alphafold3/public_databases/mmcif_files to /dev/shm, a filesystem that exists entirely in the compute node&#039;s memory (which makes it very fast).  This staging of data takes 20-30 min, so you&#039;ll want to run a fairly large number of proteins in one job to amortize the cost of copying the mmcif_files to the compute node. &lt;br /&gt;
* Download the model and put it in ./models (This is a legal/license requirement)&lt;br /&gt;
* Generate input data.  Make sure you&#039;ve split your input sequences into 5 roughly equally sized files (by number of input sequences).  In the examples we&#039;ll use filenames starting with xa and ending with .json, i.e. xa*.json.  The example job script runs 5 copies of alphafold at the same time on the compute node, since alphafold3&#039;s pipeline stage can only make effective use of about 4 cpus for a single job.  This way the node runs multiple 4 cpu processes at the same time.&lt;br /&gt;
* Create ./pipelineoutputs for the resulting files&lt;br /&gt;
* Run the pipeline stage on cpus by submitting runpipeline.slurm.  Required Job resources:&lt;br /&gt;
** --mem=500GB # We need to request enough memory for the public_databases/mmcif_files plus whatever the job needs.&lt;br /&gt;
** -c 20 # 20 cpus&lt;br /&gt;
** -N 1  # 1 node&lt;br /&gt;
&lt;br /&gt;
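Putting the pipeline-stage points above together, a submission script could be sketched roughly as below. This is only an illustrative sketch: the apptainer invocation, flag names, and staging details are assumptions, and the actual runpipeline.slurm in /global/software/alphafold3 is authoritative.&lt;br /&gt;

```shell
#!/bin/bash
# Illustrative sketch only -- see /global/software/alphafold3/runpipeline.slurm
# for the real, supported script.
#SBATCH -N 1                 # 1 node
#SBATCH -c 20                # 20 cpus: five parallel 4-cpu pipeline runs
#SBATCH --mem=500GB          # mmcif_files staged in /dev/shm plus job memory

DB=/bulk/public/alphafold3/public_databases

# Stage mmcif_files into node memory; this takes 20-30 minutes.
mkdir -p /dev/shm/mmcif_files
cp -r "$DB"/mmcif_files/. /dev/shm/mmcif_files/

# Run the cpu-only pipeline stage on each of the 5 input chunks in parallel.
# Flag names here are assumptions; check the alphafold3 documentation.
for f in xa*.json; do
  apptainer run /global/software/alphafold3/alphafold3.sif \
    --norun_inference \
    --json_path="$f" \
    --db_dir="$DB" \
    --model_dir=./models \
    --output_dir=./pipelineoutputs &
done
wait
```
&lt;br /&gt;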
== The Inference stage ==&lt;br /&gt;
&lt;br /&gt;
* This is the part of alphafold that requires a GPU.&lt;br /&gt;
* Compared to the pipeline stage, the inference is fairly quick.  &lt;br /&gt;
* copy the runinference.slurm file to a directory you have write access to.&lt;br /&gt;
* Take the outputs from the pipeline stage (which were put in ./pipelineoutputs in the example) and move or copy the files into a directory called something like inference_inputs.  Note that the pipeline stage makes a directory for each input file, but the inference stage expects the files to all be directly in inference_inputs, so you&#039;ll have to copy them all.  You can use find to help:&lt;br /&gt;
 cd pipelineoutputs&lt;br /&gt;
 cp $(find . -type f -name \*.json) ../inference_inputs&lt;br /&gt;
* Submit runinference.slurm to the scheduler. Required job resources:&lt;br /&gt;
** -p gpu-h100&lt;br /&gt;
** --gres=gpu:1 # Alphafold can only use 1 gpu&lt;br /&gt;
** --mem=120G # This might need to be increased depending on the inputs.&lt;br /&gt;
** --ntasks-per-node=4&lt;br /&gt;
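Putting these resource requests together, an inference-stage script could be sketched as below. Again, this is only an illustrative sketch under assumptions (flag names and the container path are not from the real script); the actual runinference.slurm in /global/software/alphafold3 is authoritative.&lt;br /&gt;

```shell
#!/bin/bash
# Illustrative sketch only -- see /global/software/alphafold3/runinference.slurm
# for the real, supported script.
#SBATCH -p gpu-h100          # partition with GPU nodes
#SBATCH --gres=gpu:1         # alphafold can only use 1 gpu
#SBATCH --mem=120G           # may need increasing for larger inputs
#SBATCH --ntasks-per-node=4

# Run only the GPU inference stage over the consolidated pipeline outputs.
# Flag names here are assumptions; check the alphafold3 documentation.
apptainer run --nv /global/software/alphafold3/alphafold3.sif \
  --norun_data_pipeline \
  --input_dir=./inference_inputs \
  --model_dir=./models \
  --output_dir=./inference_outputs
```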
&lt;br /&gt;
== References ==&lt;br /&gt;
* https://github.com/google-deepmind/alphafold3/blob/main/docs/performance.md&lt;br /&gt;
* https://docs.alliancecan.ca/wiki/AlphaFold3&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=Template:ARC_Cluster_Status&amp;diff=3907</id>
		<title>Template:ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=Template:ARC_Cluster_Status&amp;diff=3907"/>
		<updated>2025-09-18T19:55:03Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Cluster Status&lt;br /&gt;
| cluster = ARC&lt;br /&gt;
| status = green&lt;br /&gt;
| title = Cluster maintenance complete. &lt;br /&gt;
| message = System is functioning normally.  No known issues.&lt;br /&gt;
&lt;br /&gt;
See the [[ARC Cluster Status]] page for system notices. &lt;br /&gt;
}}&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=Altis_Login_Node_Status&amp;diff=3906</id>
		<title>Altis Login Node Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=Altis_Login_Node_Status&amp;diff=3906"/>
		<updated>2025-09-18T19:52:57Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{ARC Cluster Status}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/09/03&lt;br /&gt;
| message =&lt;br /&gt;
The ARC Cluster and the Altis login node are operational. No upcoming upgrades are planned.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/27&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable &lt;br /&gt;
between Sept 23 to Sept 27 inclusive (subject to change). Some Altis GPU nodes will be affected during this maintenance window. These nodes will return to service as soon as the work is complete.  &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage will begin next Tuesday for the same nodes.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling has resumed. &lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = wdfgpu[1-12] System Update Reboots &lt;br /&gt;
| date = 2024/12/02&lt;br /&gt;
| message = wdfgpu[1-12] will be rebooted briefly today to install important system updates and will return to service shortly. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades starting 9AM Monday, January 20, 2025 through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11 year old, unsupported Infiniband on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10 AM Altis experienced an electrical power brownout.  Some of the nodes (how many is unknown at this time) lost electrical power, causing the loss &lt;br /&gt;
of a number of running jobs.  &lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since Altis is shutting down for maintenance on Monday Jan 20, replacement jobs will likely not start unless they request a timelimit shorter than the time remaining until 8AM Monday.  &lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC/Altis cluster upgrade is complete.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11 year old, unsupported Infiniband on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Please be informed that our support email address (support@hpc.ucalgary.ca) for RCS is currently not working. We are working to bring it back as soon as possible. Please keep an eye on this space for updates. The clusters are working normally, but support will not receive your messages at this time. We will begin responding as soon as we can get it back. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired and RCS can be contacted there. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Trend Micro Installation&lt;br /&gt;
| date = 2025/04/29&lt;br /&gt;
| message = To increase the security posture of the Arc cluster, administrators will be installing Trend Micro on the cluster login nodes during the week starting Apr 30. Please report any issues to support@hpc.ucalgary.ca&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Update on ARC&lt;br /&gt;
| date = 2025/09/15&lt;br /&gt;
| message = ⚠️ Arc Update September 15-19&lt;br /&gt;
&lt;br /&gt;
An update of the Arc cluster will begin on Sep 15. This will result in fewer available resources while the compute nodes are restarted. The login node and scheduler will be restarted on Sep 17.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Update on Altis&lt;br /&gt;
| date = 2025/09/15&lt;br /&gt;
| message = ⚠️ Altis Update September 15-19&lt;br /&gt;
&lt;br /&gt;
An update of the Altis (Arc) cluster will begin on Sep 15. This will result in fewer available resources while the compute nodes are restarted. The login node and scheduler will be restarted on Sep 17.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Update on Altis&lt;br /&gt;
| date = 2025/09/15&lt;br /&gt;
| message = Update complete&lt;br /&gt;
&lt;br /&gt;
The update of the Altis (Arc) cluster is complete.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:ARC]]&lt;br /&gt;
{{Navbox ARC}}&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3905</id>
		<title>ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3905"/>
		<updated>2025-09-18T19:51:39Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{ARC Cluster Status}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 16, 2023, the ARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The ARC login node will reboot on the morning of January 16. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 20.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following has been changed:&lt;br /&gt;
* OS Updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/02/28&lt;br /&gt;
| message =&lt;br /&gt;
We are currently investigating a filesystem issue that is causing filesystem slowdowns across ARC.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/03/01&lt;br /&gt;
| message =&lt;br /&gt;
We are still currently investigating a filesystem issue that is causing filesystem slowdowns across ARC. Some jobs on ARC have been paused to help us find the root cause of the slowdowns.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ARC Login node reboot&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
The ARC login node will be rebooted this afternoon for an emergency maintenance. This downtime is needed to help mitigate the filesystem slowdowns experienced on the login node.  Jobs will continue running and scheduling during this time.&lt;br /&gt;
&lt;br /&gt;
All logins to the ARC login node will be terminated at 3:00PM and will remain unavailable until 4:00PM.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Filesystem Issues&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
We are still currently investigating a filesystem issue that is causing filesystem slowdowns on specific nodes in our MSRDC location.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues Resolved&lt;br /&gt;
| date = 2023/03/10&lt;br /&gt;
| message =&lt;br /&gt;
We have upgraded the filesystem routers in our MSRDC location to address the performance issues.&lt;br /&gt;
&lt;br /&gt;
Please let us know if you experience any issues with the filesystem performance.&lt;br /&gt;
&lt;br /&gt;
Thank-you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/05/01&lt;br /&gt;
| message =&lt;br /&gt;
On May 1, 2023, the ARC Open OnDemand node will be rebooted between 5PM and 6PM. Expected downtime will be approximately 15 minutes.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on ARC Login Node&lt;br /&gt;
| date = 2023/06/22&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may experience an error when&lt;br /&gt;
running on the Arc login node. If apptainer complains that a system&lt;br /&gt;
administrator needs to enable user namespaces, simply run your&lt;br /&gt;
containers inside a job.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to security vulnerability that will be&lt;br /&gt;
patched soon.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
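Per the workaround above, a container run that fails on the login node can be wrapped in a job, where user namespaces are available. A hedged sketch (the image name and resource requests are placeholders):

```shell
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --mem=4G
#SBATCH --cpus-per-task=1

# Inside a job the user-namespace restriction does not apply, so the
# container starts normally (my_container.sif is a placeholder image).
apptainer run my_container.sif
```

An equivalent interactive route is salloc followed by the same apptainer run command on the allocated node.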
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Lattice, Single, cpu2013 partition changes&lt;br /&gt;
| date = 2023/07/13&lt;br /&gt;
| message =&lt;br /&gt;
The Lattice, Single, and cpu2013 partitions have all been decommissioned. The Single&lt;br /&gt;
partition will be replaced by the nodes formerly in the cpu2013 partition, but&lt;br /&gt;
will be called single.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/10/17&lt;br /&gt;
| message =&lt;br /&gt;
Open OnDemand will be rebooted on October 17, 2023 for an update. It will be down for up to 30 minutes.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC/ARC cluster on &lt;br /&gt;
November 16 and 17, 2023. To facilitate this, we will be throttling &lt;br /&gt;
down the number of jobs on both clusters while the upgrades are &lt;br /&gt;
performed.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/05/03&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = Arc experienced a brief power outage around 11AM May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting. Most jobs running at this time &lt;br /&gt;
were lost. Arc administrators are actively working on restarting compute &lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation&lt;br /&gt;
| date = 2024/06/03&lt;br /&gt;
| message = Job submissions targeted at the GPU a100 partition will be &lt;br /&gt;
affected by a temporary reservation on the nodes to accommodate the RCS&lt;br /&gt;
summer school class taking place on 2024/Jun/10. The reservation will end &lt;br /&gt;
shortly afterward. Please submit your jobs normally and the scheduler will &lt;br /&gt;
start them as soon as the nodes are available. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation Removed&lt;br /&gt;
| date = 2024/06/11&lt;br /&gt;
| message = GPU a100 Nodes in ARC have been returned to normal scheduling. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/23&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable &lt;br /&gt;
between Sept 23 and Sept 27 inclusive (subject to change). All compute nodes &lt;br /&gt;
in cpu2019, cpu2021/2, and gpu-v100, as well as most nodes from bigmem and gpu-a100, will be &lt;br /&gt;
affected. These nodes will return to service as soon as the work is complete.  &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage will occur for the same nodes starting next Tuesday.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling has resumed. &lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The ARC login node will be rebooted on Tuesday December 17 for scheduled maintenance. It will be down for a few minutes and return shortly. Job scheduling and jobs running on the cluster will not be affected. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades starting 9AM Monday, January 20, 2025 through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported InfiniBand interconnect on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10AM Arc experienced an electrical power brownout. An as-yet-undetermined percentage of the nodes lost electrical power, causing the loss of a number of running jobs.  &lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since Arc is shutting down for maintenance on Monday, Jan 20, replacement jobs will likely not start unless they request a time limit shorter than the time remaining until 8AM Monday.  &lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC cluster upgrade is complete&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported InfiniBand interconnect on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal was upgraded.&lt;br /&gt;
&lt;br /&gt;
6. The Parallel partition was renamed to Legacy to reflect the lack of a low-latency interconnect for parallel MPI work, and is now restricted to jobs of at most 4 nodes.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
Jan 23, 9:08AM&lt;br /&gt;
&lt;br /&gt;
Remount complete. Arc is back in full service.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgrade&lt;br /&gt;
| date = 2025/02/03&lt;br /&gt;
| message = Upgrade of the module command&lt;br /&gt;
&lt;br /&gt;
On Tuesday, February 11, 2025, the module command will be upgraded to a new version on Arc. This should result in new capabilities and a slightly different visual experience when using the module command. Loading modules is not expected to change.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgraded&lt;br /&gt;
| date = 2025/02/12&lt;br /&gt;
| message = The module command was upgraded&lt;br /&gt;
&lt;br /&gt;
On February 12, 2025, the module command was upgraded to a new version on Arc. This should result in new capabilities and a slightly different visual experience when using the module command. Loading modules should not have changed.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
The RCS support email address (support@hpc.ucalgary.ca) is currently not working. We are working to restore it as soon as possible; please watch this space for updates. The clusters are operating normally, but support will not receive your messages at this time. We will begin responding as soon as the address is restored. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired and RCS can be contacted there. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit will be Enforced&lt;br /&gt;
| date = 2025/04/11&lt;br /&gt;
| message = In order to improve the scheduling and job-throughput efficiency of ARC, interactive jobs will be limited to a maximum of 5 hours of runtime.  Interactive jobs submitted with a time limit over 5 hours will be rejected at submission time.  This change will be made on Monday, April 28, 2025.&lt;br /&gt;
}}&lt;br /&gt;
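Under the 5-hour cap described above, an interactive request must state a walltime of at most 05:00:00. A small hedged sketch of the rule (the partition name and requested hours below are placeholders; the actual enforcement happens in Slurm at submission time):

```shell
# The enforced cap for interactive jobs, in hours.
max_hours=5

# A compliant interactive request might look like (partition is a placeholder):
#   salloc --time=05:00:00 --partition=single
# whereas a request such as --time=06:00:00 is rejected at submission time.

# Quick arithmetic check that a requested walltime fits under the cap.
requested_hours=4
if [ "$requested_hours" -le "$max_hours" ]; then
  verdict="accepted"
else
  verdict="rejected"
fi
echo "$verdict"   # prints accepted
```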
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit Is Now Enforced&lt;br /&gt;
| date = 2025/04/28&lt;br /&gt;
| message = In order to improve the scheduling and job-throughput efficiency of ARC, interactive jobs are now limited to a maximum of 5 hours of runtime.  Interactive jobs submitted with a time limit over 5 hours will be rejected at submission time.&lt;br /&gt;
&lt;br /&gt;
Apr 29, 2025&lt;br /&gt;
To increase the security posture of the Arc cluster, administrators will be installing Trend Micro on the cluster login nodes during the week starting Apr 30. Please report any issues to support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Bulk Filesystem Emergency Maintenance&lt;br /&gt;
| date = 2025/07/29&lt;br /&gt;
| message = The filer that provides the /bulk filesystem will be down for emergency repairs at 12 noon on Thursday, July 31.  No access to files on /bulk will be possible for the duration of the multi-hour outage.  Jobs that access /bulk will start and then pause when they attempt to reach /bulk; they should continue once service is restored.  Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Bulk Filesystem Maintenance Complete&lt;br /&gt;
| date = 2025/08/01&lt;br /&gt;
| message = Maintenance on /bulk was completed successfully and all filesystems are back in service on Arc.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU partition changes on ARC&lt;br /&gt;
| date = 2025/08/15&lt;br /&gt;
| message = New gpu-h100 and gpu-l40 partitions, with new GPU hardware, are available for general scheduling. gpu-v100 also has 6 fewer nodes. You can view more details about the node specs by running the arc.hardware script on the login node. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Update and Bulk outage on ARC&lt;br /&gt;
| date = 2025/09/15&lt;br /&gt;
| message = ⚠️ Arc Update September 15-19&lt;br /&gt;
&lt;br /&gt;
An update of the Arc cluster will begin on Sep 15. This will result in fewer available resources while the compute nodes are restarted. The login node and scheduler will be restarted on Sep 17.&lt;br /&gt;
&lt;br /&gt;
⚠️ Bulk Filesystem Maintenance September 17&lt;br /&gt;
The filer that provides the /bulk filesystem will be down for emergency repairs at 9AM on Wednesday, Sep 17. No access to files on /bulk will be possible for the duration of the multi-hour outage. Jobs that access /bulk will start and then pause when they attempt to reach /bulk; they should continue once service is restored.  We apologize for any inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Update and Bulk outage on ARC&lt;br /&gt;
| date = 2025/09/18&lt;br /&gt;
| message = Arc update complete&lt;br /&gt;
&lt;br /&gt;
The update to Arc is complete.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Navbox ARC}}&lt;br /&gt;
[[Category:ARC]]&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=Altis_Login_Node_Status&amp;diff=3881</id>
		<title>Altis Login Node Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=Altis_Login_Node_Status&amp;diff=3881"/>
		<updated>2025-09-05T20:41:53Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{ARC Cluster Status}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/09/03&lt;br /&gt;
| message =&lt;br /&gt;
The ARC cluster and the Altis login node are operational. No upcoming upgrades are planned.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/27&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable &lt;br /&gt;
between Sept 23 and Sept 27 inclusive (subject to change). Some Altis GPU nodes will be affected during this maintenance window. These nodes will return to service as soon as the work is complete.  &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage will occur for the same nodes starting next Tuesday.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling has resumed. &lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = wdfgpu[1-12] System Update Reboots &lt;br /&gt;
| date = 2024/12/02&lt;br /&gt;
| message = wdfgpu[1-12] will be rebooted briefly today to install important system updates and will return to service shortly. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades starting 9AM Monday, January 20, 2025 through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported InfiniBand interconnect on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10AM Altis experienced an electrical power brownout. An as-yet-undetermined percentage of the nodes lost electrical power, causing the loss of a number of running jobs.  &lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since Altis is shutting down for maintenance on Monday, Jan 20, replacement jobs will likely not start unless they request a time limit shorter than the time remaining until 8AM Monday.  &lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC/Altis cluster upgrade is complete&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported InfiniBand interconnect on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Our RCS support email address (support@hpc.ucalgary.ca) is currently not working. We are working to restore it as soon as possible; please keep an eye on this space for updates. The clusters are operating normally, but support will not receive your messages at this time. We will begin responding as soon as the address is restored. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired and RCS can be contacted there. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Trend Micro Installation&lt;br /&gt;
| date = 2025/04/29&lt;br /&gt;
| message = To increase the security posture of the Arc cluster, administrators will be installing Trend Micro on cluster login nodes during the week starting Apr 30. Please report any issues to support@hpc.ucalgary.ca.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Update on ARC&lt;br /&gt;
| date = 2025/09/15&lt;br /&gt;
| message = ⚠️ Arc Update September 15-19&lt;br /&gt;
&lt;br /&gt;
An update of the Arc cluster will begin on Sep 15. This will result in fewer available resources while the compute nodes are restarted. The login node and scheduler will be restarted on Sep 17.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Update on Altis&lt;br /&gt;
| date = 2025/09/15&lt;br /&gt;
| message = ⚠️ Altis Update September 15-19&lt;br /&gt;
&lt;br /&gt;
An update of the Altis (Arc) cluster will begin on Sep 15. This will result in fewer available resources while the compute nodes are restarted. The login node and scheduler will be restarted on Sep 17.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
[[Category:ARC]]&lt;br /&gt;
{{Navbox ARC}}&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=Altis_Login_Node_Status&amp;diff=3880</id>
		<title>Altis Login Node Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=Altis_Login_Node_Status&amp;diff=3880"/>
		<updated>2025-09-05T20:40:21Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{ARC Cluster Status}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/09/03&lt;br /&gt;
| message =&lt;br /&gt;
The ARC cluster and the Altis login node are operational. No upcoming upgrades are planned.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/27&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable &lt;br /&gt;
between Sept 23 and Sept 27 inclusive (subject to change). Some Altis GPU nodes will be affected during this maintenance window. These nodes will return to service as soon as the work is complete.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage for the same nodes will begin next Tuesday.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling has resumed. &lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = wdfgpu[1-12] System Update Reboots &lt;br /&gt;
| date = 2024/12/02&lt;br /&gt;
| message = wdfgpu[1-12] will be rebooted briefly today to install important system updates and will return to service shortly. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades from 9AM Monday, January 20, 2025, through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported InfiniBand on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10AM, Altis experienced an electrical power brownout. Some of the nodes (how many is unknown at this time) lost power during this event, causing the loss of a number of running jobs.&lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since Altis is shutting down for maintenance on Monday, Jan 20, replacement jobs will likely not start unless they request a timelimit shorter than the time remaining until 8AM Monday.&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC/Altis cluster upgrade is complete.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported InfiniBand on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Our RCS support email address (support@hpc.ucalgary.ca) is currently not working. We are working to restore it as soon as possible; please keep an eye on this space for updates. The clusters are operating normally, but support will not receive your messages at this time. We will begin responding as soon as the address is restored. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired and RCS can be contacted there. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Trend Micro Installation&lt;br /&gt;
| date = 2025/04/29&lt;br /&gt;
| message = To increase the security posture of the Arc cluster, administrators will be installing Trend Micro on cluster login nodes during the week starting Apr 30. Please report any issues to support@hpc.ucalgary.ca.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Update on ARC&lt;br /&gt;
| date = 2025/09/15&lt;br /&gt;
| message = ⚠️ Arc Update September 15-19&lt;br /&gt;
&lt;br /&gt;
An update of the Arc cluster will begin on Sep 15. This will result in fewer available resources while the compute nodes are restarted. The login node and scheduler will be restarted on Sep 17.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:ARC]]&lt;br /&gt;
{{Navbox ARC}}&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3879</id>
		<title>ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3879"/>
		<updated>2025-09-05T20:34:24Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{ARC Cluster Status}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 16, 2023, the ARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The ARC login node will reboot on the morning of January 16. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 20.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following has been changed:&lt;br /&gt;
* OS Updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/02/28&lt;br /&gt;
| message =&lt;br /&gt;
We are currently investigating a filesystem issue that is causing filesystem slowdowns across ARC.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/03/01&lt;br /&gt;
| message =&lt;br /&gt;
We are still currently investigating a filesystem issue that is causing filesystem slowdowns across ARC. Some jobs on ARC have been paused to help us find the root cause of the slowdowns.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ARC Login node reboot&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
The ARC login node will be rebooted this afternoon for emergency maintenance. This downtime is needed to help mitigate the filesystem slowdowns experienced on the login node. Jobs will continue running and scheduling during this time.&lt;br /&gt;
&lt;br /&gt;
All logins to the ARC login node will be terminated at 3:00PM and will remain unavailable until 4:00PM.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Filesystem Issues&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
We are still currently investigating a filesystem issue that is causing filesystem slowdowns on specific nodes in our MSRDC location.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues Resolved&lt;br /&gt;
| date = 2023/03/10&lt;br /&gt;
| message =&lt;br /&gt;
We have upgraded the filesystem routers in our MSRDC location to address the performance issues.&lt;br /&gt;
&lt;br /&gt;
Please let us know if you experience any issues with the filesystem performance.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/05/01&lt;br /&gt;
| message =&lt;br /&gt;
On May 1, 2023, the ARC Open OnDemand node will be rebooted between 5PM and 6PM. Expected downtime will be approximately 15 minutes.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on ARC Login Node&lt;br /&gt;
| date = 2023/06/22&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may experience an error when running on the Arc login node. If Apptainer complains that a system administrator needs to enable user namespaces, simply run your containers inside a job.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to a security vulnerability that will be patched soon.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Lattice, Single, cpu2013 partition changes&lt;br /&gt;
| date = 2023/07/13&lt;br /&gt;
| message =&lt;br /&gt;
The Lattice, Single, and cpu2013 partitions have all been decommissioned. The Single partition will be replaced by the nodes formerly in the cpu2013 partition, but will be called single.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/10/17&lt;br /&gt;
| message =&lt;br /&gt;
Open OnDemand will be rebooted on October 17, 2023 for an update. It will be down for up to 30 minutes.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC/ARC cluster on November 16 and 17, 2023. To facilitate this, we will be throttling down the number of jobs on both clusters while the upgrades are performed.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/05/03&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = Arc experienced a brief power outage around 11AM on May 7, 2024. Most compute nodes have rebooted or are rebooting. Most jobs running at that time were lost. Arc administrators are actively working on restarting compute nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation&lt;br /&gt;
| date = 2024/06/03&lt;br /&gt;
| message = Job submissions targeted to the GPU a100 partition will be affected by a temporary reservation on the nodes to accommodate the RCS summer school class taking place on 2024/Jun/10. The reservation will end shortly after. Please submit your jobs normally and the scheduler will start them as soon as the nodes are available. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation Removed&lt;br /&gt;
| date = 2024/06/11&lt;br /&gt;
| message = GPU a100 Nodes in ARC have been returned to normal scheduling. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/23&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable between Sept 23 and Sept 27 inclusive (subject to change). All compute nodes in cpu2019, cpu2021, cpu2022, and gpu-v100, and most nodes from bigmem and gpu-a100, will be affected. These nodes will return to service as soon as the work is complete.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage for the same nodes will begin next Tuesday.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling has resumed. &lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The ARC login node will be rebooted on Tuesday December 17 for scheduled maintenance. It will be down for a few minutes and return shortly. Job scheduling and jobs running on the cluster will not be affected. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades from 9AM Monday, January 20, 2025, through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported InfiniBand on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10AM, Arc experienced an electrical power brownout. Some of the nodes (how many is unknown at this time) lost power during this event, causing the loss of a number of running jobs.&lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since Arc is shutting down for maintenance on Monday, Jan 20, replacement jobs will likely not start unless they request a timelimit shorter than the time remaining until 8AM Monday.&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC cluster upgrade is complete.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported InfiniBand on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal was upgraded.&lt;br /&gt;
&lt;br /&gt;
6. The Parallel partition was renamed to Legacy to reflect the lack of an interconnect for parallel MPI work, and was restricted to a maximum of 4-node jobs.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
Jan 23, 9:08AM&lt;br /&gt;
&lt;br /&gt;
Remount complete. Arc is back in full service.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgrade&lt;br /&gt;
| date = 2025/02/03&lt;br /&gt;
| message = Upgrade of the module command&lt;br /&gt;
&lt;br /&gt;
On Tuesday, February 11, 2025, the module command will be upgraded to a new version on Arc. This should result in new capabilities and a slightly different visual experience when using the module command. Loading modules is not expected to change.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgraded&lt;br /&gt;
| date = 2025/02/12&lt;br /&gt;
| message = The module command was upgraded&lt;br /&gt;
&lt;br /&gt;
On February 12, 2025, the module command was upgraded to a new version on Arc. This should result in new capabilities and a slightly different visual experience when using the module command. Loading modules should not have changed.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Our RCS support email address (support@hpc.ucalgary.ca) is currently not working. We are working to restore it as soon as possible; please keep an eye on this space for updates. The clusters are operating normally, but support will not receive your messages at this time. We will begin responding as soon as the address is restored. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired and RCS can be contacted there. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit will be Enforced&lt;br /&gt;
| date = 2025/04/11&lt;br /&gt;
| message = In order to improve the scheduling and job throughput efficiency of ARC, interactive jobs will be limited to a maximum of 5 hours of runtime.  Interactive jobs that are submitted with a timelimit over 5 hours will be rejected at submission time.  This change will be made on Monday, April 28, 2025.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit Is Now Enforced&lt;br /&gt;
| date = 2025/04/28&lt;br /&gt;
| message = In order to improve the scheduling and job throughput efficiency of ARC, interactive jobs are now limited to a maximum of 5 hours of runtime.  Interactive jobs that are submitted with a timelimit over 5 hours will be rejected at submission time.&lt;br /&gt;
&lt;br /&gt;
Apr 29, 2025&lt;br /&gt;
To increase the security posture of the Arc cluster, administrators will be installing Trend Micro on cluster login nodes over the week starting Apr 30. Please report any issues to support@hpc.ucalgary.ca.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Bulk Filesystem Emergency Maintenance&lt;br /&gt;
| date = 2025/07/29&lt;br /&gt;
| message = The filer that provides the /bulk filesystem will be down for emergency repairs at 12 Noon on Thursday July 31.  No access to files on /bulk will be possible for the duration of the multi-hour outage.  Any jobs running that access /bulk will start and then pause when access to /bulk is attempted.  Jobs should continue once service is restored.  Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Bulk Filesystem Maintenance Complete&lt;br /&gt;
| date = 2025/08/01&lt;br /&gt;
| message = Maintenance on /bulk was completed successfully and all filesystems are back in service on Arc.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU partition changes on ARC&lt;br /&gt;
| date = 2025/08/15&lt;br /&gt;
| message = New gpu-h100 and gpu-l40 partitions are available for general scheduling with new GPU hardware. The gpu-v100 partition also has 6 fewer nodes. You can view more details about the node specs by running the arc.hardware script on the login node.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Update and Bulk outage on ARC&lt;br /&gt;
| date = 2025/09/15&lt;br /&gt;
| message = ⚠️ Arc Update September 15-19&lt;br /&gt;
&lt;br /&gt;
An update of the Arc cluster will begin on Sep 15. This will result in fewer available resources while the compute nodes are restarted. The login node and scheduler will be restarted on Sep 17.&lt;br /&gt;
&lt;br /&gt;
⚠️ Bulk Filesystem Maintenance September 17&lt;br /&gt;
The filer that provides the /bulk filesystem will be down for emergency repairs at 9 AM on Wednesday Sep 17th. No access to files on /bulk will be possible for the duration of the multi-hour outage. Any jobs running that access /bulk will start and then pause when access to /bulk is attempted. Jobs should continue once service is restored.  We apologize for any inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Navbox ARC}}&lt;br /&gt;
[[Category:ARC]]&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=Running_alphafold3&amp;diff=3820</id>
		<title>Running alphafold3</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=Running_alphafold3&amp;diff=3820"/>
		<updated>2025-08-03T04:40:12Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Running Alphafold3 on Arc =&lt;br /&gt;
&lt;br /&gt;
Alphafold3 has been compiled into an Apptainer image for use on Arc.&lt;br /&gt;
&lt;br /&gt;
Alphafold3 runs in two separate steps. The first, the data pipeline, requires CPUs only and reads much of the large public_databases directory. The second step runs the inference and requires a GPU. The pipeline step takes much longer than the inference step, so it is wasteful to occupy a GPU node for its duration. To run Alphafold3 efficiently, we have to run it two separate times in different modes, on compute nodes that have the appropriate resources.&lt;br /&gt;
&lt;br /&gt;
== Prerequisites ==&lt;br /&gt;
&lt;br /&gt;
* For licensing reasons, every client who would like to use alphafold3 must register for and download the model parameters. See https://forms.gle/svvpY4u2jsHEwWYS6 to register. The model parameters are relatively small, can be stored anywhere on Arc that you have access to, and can be used by multiple jobs at the same time (you do not need a separate copy for every job).&lt;br /&gt;
* The alphafold3 container is stored in /global/software/alphafold3.  There are example job scripts in that directory that are referenced below.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== The Pipeline stage ==&lt;br /&gt;
&lt;br /&gt;
* Download the model and put it in ./models&lt;br /&gt;
* Generate input data. Make sure you&#039;ve split your input sequences into 5 roughly equally sized files (by number of input sequences). For the examples, we&#039;ll use filenames starting with xa and ending with .json, so xa*.json.&lt;br /&gt;
* Create ./pipelineoutputs for the resulting files&lt;br /&gt;
* Run the pipeline stage on cpus by submitting runpipeline.slurm&lt;br /&gt;
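The setup steps above can be sketched as shell commands (a minimal sketch; the directory and file names are this page&#039;s examples, and runpipeline.slurm is the example script from /global/software/alphafold3):&lt;br /&gt;

```shell
# Minimal sketch of the pipeline-stage setup (directory and file names
# follow this page's examples; xa*.json are the split input files).
mkdir -p models pipelineoutputs   # model parameters and pipeline outputs
# Submit the CPU-only pipeline job to the scheduler:
# sbatch runpipeline.slurm
```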
&lt;br /&gt;
== The Inference stage ==&lt;br /&gt;
&lt;br /&gt;
* Copy the runinference.slurm file to a directory you have write access to.&lt;br /&gt;
* Take the outputs from the pipeline stage (which were put in ./pipelineoutputs in the example) and move or copy the files into a directory called something like inference_inputs. Note that the pipeline stage makes a directory for each of the files, but the inference stage expects them all to be directly in inference_inputs, so you&#039;ll have to copy them all. You can use find to help:&lt;br /&gt;
 cd pipelineoutputs&lt;br /&gt;
 cp $(find . -type f -name \*.json) ../inference_inputs&lt;br /&gt;
* Submit runinference.slurm to the scheduler.&lt;br /&gt;
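The flattening step above can also be written as a single command (a minimal sketch; directory names follow the examples on this page):&lt;br /&gt;

```shell
# Minimal sketch: collect every JSON file the pipeline stage wrote into
# its per-input subdirectories of pipelineoutputs, and copy them into the
# flat inference_inputs directory the inference stage expects.
# (mkdir -p is harmless if the directories already exist.)
mkdir -p pipelineoutputs inference_inputs
find pipelineoutputs -type f -name '*.json' -exec cp {} inference_inputs/ \;
# Then submit the GPU inference job to the scheduler:
# sbatch runinference.slurm
```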
&lt;br /&gt;
== References ==&lt;br /&gt;
* https://github.com/google-deepmind/alphafold3/blob/main/docs/performance.md&lt;br /&gt;
* https://docs.alliancecan.ca/wiki/AlphaFold3&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=Template:ARC_Cluster_Status&amp;diff=3817</id>
		<title>Template:ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=Template:ARC_Cluster_Status&amp;diff=3817"/>
		<updated>2025-08-01T19:40:20Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Cluster Status&lt;br /&gt;
| cluster = ARC&lt;br /&gt;
| status = green&lt;br /&gt;
| title = Cluster operational &lt;br /&gt;
| message = System is fully operational.&lt;br /&gt;
&lt;br /&gt;
See the [[ARC Cluster Status]] page for system notices. &lt;br /&gt;
}}&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=Template:ARC_Cluster_Status&amp;diff=3816</id>
		<title>Template:ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=Template:ARC_Cluster_Status&amp;diff=3816"/>
		<updated>2025-08-01T19:40:06Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Cluster Status&lt;br /&gt;
| cluster = ARC&lt;br /&gt;
| status = green&lt;br /&gt;
| title = Cluster operational &lt;br /&gt;
| message = System is generally operational.&lt;br /&gt;
&lt;br /&gt;
See the [[ARC Cluster Status]] page for system notices. &lt;br /&gt;
}}&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3815</id>
		<title>ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3815"/>
		<updated>2025-08-01T19:39:35Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{ARC Cluster Status}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 16, 2023, the ARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The ARC login node will reboot on the morning of January 16. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 20.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following has been changed:&lt;br /&gt;
* OS updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/02/28&lt;br /&gt;
| message =&lt;br /&gt;
We are currently investigating a filesystem issue that is causing filesystem slowdowns across ARC.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/03/01&lt;br /&gt;
| message =&lt;br /&gt;
We are still currently investigating a filesystem issue that is causing filesystem slowdowns across ARC. Some jobs on ARC have been paused to help us find the root cause of the slowdowns.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ARC Login node reboot&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
The ARC login node will be rebooted this afternoon for emergency maintenance. This downtime is needed to help mitigate the filesystem slowdowns experienced on the login node. Jobs will continue running and scheduling during this time.&lt;br /&gt;
&lt;br /&gt;
All logins to the ARC login node will be terminated at 3:00PM and will remain unavailable until 4:00PM.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Filesystem Issues&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
We are still currently investigating a filesystem issue that is causing filesystem slowdowns on specific nodes in our MSRDC location.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues Resolved&lt;br /&gt;
| date = 2023/03/10&lt;br /&gt;
| message =&lt;br /&gt;
We have upgraded the filesystem routers in our MSRDC location to address the performance issues.&lt;br /&gt;
&lt;br /&gt;
Please let us know if you experience any issues with the filesystem performance.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/05/01&lt;br /&gt;
| message =&lt;br /&gt;
On May 1, 2023, the ARC Open OnDemand node will be rebooted between 5PM and 6PM. Expected downtime will be approximately 15 minutes.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on ARC Login Node&lt;br /&gt;
| date = 2023/06/22&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may experience an error when&lt;br /&gt;
running on the Arc login node. If apptainer complains that a system&lt;br /&gt;
administrator needs to enable user namespaces, simply run your&lt;br /&gt;
containers inside a job.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to a security vulnerability that will be&lt;br /&gt;
patched soon.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Lattice, Single, cpu2013 partition changes&lt;br /&gt;
| date = 2023/07/13&lt;br /&gt;
| message =&lt;br /&gt;
The Lattice, Single, and cpu2013 partitions have all been decommissioned. The Single&lt;br /&gt;
partition will be replaced by the nodes formerly in the cpu2013 partition, but&lt;br /&gt;
will be called single.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/10/17&lt;br /&gt;
| message =&lt;br /&gt;
Open OnDemand will be rebooted on October 17, 2023 for an update. It will be down for up to 30 minutes.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC/ARC cluster on &lt;br /&gt;
November 16 and 17, 2023. To facilitate this, we will be throttling &lt;br /&gt;
down the number of jobs on both clusters while the upgrades are &lt;br /&gt;
performed.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/05/03&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = Arc experienced a brief power outage around 11AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting. Most jobs running at that time&lt;br /&gt;
were lost. Arc administrators are actively working on restarting compute&lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation&lt;br /&gt;
| date = 2024/06/03&lt;br /&gt;
| message = Job submissions targeted to the GPU a100 partition will be&lt;br /&gt;
affected by a temporary reservation on the nodes to accommodate the RCS&lt;br /&gt;
summer school class taking place on 2024/Jun/10. The reservation will end&lt;br /&gt;
shortly after. Please submit your jobs normally and the scheduler will&lt;br /&gt;
start them as soon as the nodes are available. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation Removed&lt;br /&gt;
| date = 2024/06/11&lt;br /&gt;
| message = GPU a100 Nodes in ARC have been returned to normal scheduling. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/23&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable&lt;br /&gt;
from Sept 23 to Sept 27 inclusive (subject to change). All compute nodes&lt;br /&gt;
in cpu2019, cpu2021/2, and gpu-v100, as well as most nodes from bigmem and gpu-a100,&lt;br /&gt;
will be affected. These nodes will return to service as soon as the work is complete.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues that are blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage will occur starting next Tuesday for the same nodes.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling Has Resumed&lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The ARC login node will be rebooted on Tuesday December 17 for scheduled maintenance. It will be down for a few minutes and return shortly. Job scheduling and jobs running on the cluster will not be affected. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades starting 9AM Monday, January 20, 2025 through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported Infiniband on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10AM, Arc experienced an electrical power brownout. An as-yet-unknown fraction of the nodes lost electrical power during this time, causing the loss of a number of running jobs.&lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since Arc is shutting down for maintenance on Monday Jan 20, replacement jobs will likely not start unless they request a timelimit shorter than the time remaining until 8AM Monday.&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC cluster upgrade is complete&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported Infiniband on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal was upgraded.&lt;br /&gt;
&lt;br /&gt;
6. The Parallel partition was renamed to Legacy to reflect the lack of an interconnect for parallel MPI work, and was restricted to a maximum of 4-node jobs.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
Jan 23, 9:08AM&lt;br /&gt;
&lt;br /&gt;
Remount complete. Arc is back in full service.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgrade&lt;br /&gt;
| date = 2025/02/03&lt;br /&gt;
| message = Upgrade of the module command&lt;br /&gt;
&lt;br /&gt;
On Tuesday, February 11, 2025, the module command will be upgraded to a new version on Arc. This should result in new capabilities and a slightly different visual experience when using the module command. Loading modules is not expected to change.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgraded&lt;br /&gt;
| date = 2025/02/12&lt;br /&gt;
| message = The module command was upgraded&lt;br /&gt;
&lt;br /&gt;
On Tuesday, February 12, 2025, the module command was upgraded to a new version on Arc. This should result in new capabilities and a slightly different visual experience when using the module command. Loading modules should not have changed.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Please be informed that our support email address for RCS (support@hpc.ucalgary.ca) is currently not working. We are working to restore it as soon as possible; please keep an eye on this space for updates. The clusters are operating normally, but support will not receive your messages at this time. We will respond as soon as the address is restored.&lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired and RCS can be contacted there. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email.&lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit will be Enforced&lt;br /&gt;
| date = 2025/04/11&lt;br /&gt;
| message = In order to improve the scheduling and job throughput efficiency of ARC, interactive jobs will be limited to a maximum of 5 hours of runtime.  Interactive jobs that are submitted with a timelimit over 5 hours will be rejected at submission time.  This change will be made on Monday, April 28, 2025.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit Is Now Enforced&lt;br /&gt;
| date = 2025/04/28&lt;br /&gt;
| message = In order to improve the scheduling and job throughput efficiency of ARC, interactive jobs are now limited to a maximum of 5 hours of runtime.  Interactive jobs that are submitted with a timelimit over 5 hours will be rejected at submission time.&lt;br /&gt;
&lt;br /&gt;
Apr 29, 2025&lt;br /&gt;
To increase the security posture of the Arc cluster, administrators will be installing Trend Micro on cluster login nodes over the week starting Apr 30. Please report any issues to support@hpc.ucalgary.ca.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Bulk Filesystem Emergency Maintenance&lt;br /&gt;
| date = 2025/07/29&lt;br /&gt;
| message = The filer that provides the /bulk filesystem will be down for emergency repairs at 12 Noon on Thursday July 31.  No access to files on /bulk will be possible for the duration of the multi-hour outage.  Any jobs running that access /bulk will start and then pause when access to /bulk is attempted.  Jobs should continue once service is restored.  Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Bulk Filesystem Maintenance Complete&lt;br /&gt;
| date = 2025/08/01&lt;br /&gt;
| message = Maintenance on /bulk was completed successfully and all filesystems are back in service on Arc.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Navbox ARC}}&lt;br /&gt;
[[Category:ARC]]&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=RCS_Home_Page&amp;diff=3814</id>
		<title>RCS Home Page</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=RCS_Home_Page&amp;diff=3814"/>
		<updated>2025-07-30T19:47:00Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: /* Software pages */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Research Computing Services (RCS) is a group within the wider University of Calgary Information Technologies team that plans, manages, and supports high performance computing (HPC) systems in use by researchers throughout the University of Calgary. Our primary focus is to meet the increasing demand for engineering and scientific computation by offering a wide range of specialized services to help researchers solve highly complex real-world problems or run large-scale, computationally intensive workloads on our high-end HPC resources.&lt;br /&gt;
&lt;br /&gt;
This RCS Wiki contains technical documentation for users of the HPC systems operated by RCS.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
In case cluster status changes:&lt;br /&gt;
    *  set the status to yellow or red &lt;br /&gt;
    *  provide a custom &#039;title&#039; and &#039;message&#039;&lt;br /&gt;
&lt;br /&gt;
{{Cluster Status&lt;br /&gt;
|status=green&lt;br /&gt;
}}&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
=== Contact us for support ===&lt;br /&gt;
&lt;br /&gt;
* For general RCS/HPC inquiries, please email: [mailto:support@hpc.ucalgary.ca support@hpc.ucalgary.ca]&lt;br /&gt;
* For IT related issues (networking, VPN, email), please email: [mailto:it@ucalgary.ca it@ucalgary.ca]&lt;br /&gt;
* For Compute Canada specific questions: [mailto:support@tech.alliancecan.ca support@tech.alliancecan.ca]&lt;br /&gt;
&lt;br /&gt;
{{Clear}}&lt;br /&gt;
&amp;lt;div class=&amp;quot;row&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;div class=&amp;quot;col-md-6&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== General information ==&lt;br /&gt;
* [[General Cluster Guidelines and Policies]]&lt;br /&gt;
* [[How to get an account]]&lt;br /&gt;
* [[Data ownership]]&lt;br /&gt;
* [[Connecting to RCS HPC Systems]]&lt;br /&gt;
* [[External collaborators]]&lt;br /&gt;
&lt;br /&gt;
* [[CloudStack|Cloud/Virtual Machine Infrastructure (CloudStack)]]&lt;br /&gt;
&lt;br /&gt;
* [[On-line resources for new Linux and ARC users]]&lt;br /&gt;
* [[Acknowledging Research Computing Services Group]]&lt;br /&gt;
&lt;br /&gt;
== Cluster Guides ==&lt;br /&gt;
* [[ ARC Cluster Guide]] - ARC is a general purpose cluster for University of Calgary researchers.&lt;br /&gt;
*  [[GLaDOS Cluster Guide]] - GLaDOS is a researcher-owned cluster maintained by Research Computing Services.&lt;br /&gt;
*  [[TALC Cluster Guide]] - Teaching and Learning Cluster (TALC) is a cluster created by Research Computing Services to support academic courses and workshops.&lt;br /&gt;
* [[MARC Cluster Guide]] -- Medical Advanced Research Computing cluster at the University of Calgary created by Research Computing Services in 2020.&lt;br /&gt;
&lt;br /&gt;
== Other services ==&lt;br /&gt;
&lt;br /&gt;
* [[Jupyter Notebooks]]&lt;br /&gt;
* [[Open OnDemand | Open OnDemand portal]]&lt;br /&gt;
&lt;br /&gt;
== Software pages ==&lt;br /&gt;
* [[Managing software on ARC]]&lt;br /&gt;
* [[Gaussian on ARC]] -- How to use Gaussian 16 on ARC.&lt;br /&gt;
* [[Apache Spark on ARC]]&lt;br /&gt;
* [[ARC Software pages]]&lt;br /&gt;
* [[Bioinformatics applications]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;div class=&amp;quot;col-md-6&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Running courses on HPC resources ==&lt;br /&gt;
* [[TALC Cluster|TALC]] - Teaching and Learning Cluster (TALC) is a cluster created by Research Computing Services to support academic courses and workshops.&lt;br /&gt;
* [[TALC Terms of Use]] - Terms of use to which TALC account holders must agree to use the cluster.&lt;br /&gt;
* [[List of courses on TALC]] - A list of current and historical courses taught using TALC.&lt;br /&gt;
&lt;br /&gt;
== Training ==&lt;br /&gt;
* Our [[HPC Systems]]&lt;br /&gt;
* [[HPC Linux topics]] - A list of topics on which RCS technical support staff can provide one-on-one or group training&lt;br /&gt;
* [[Courses]]&lt;br /&gt;
* [[Linux Introduction]]&lt;br /&gt;
* [[What is a scheduler?]]&lt;br /&gt;
* [[Running jobs]]&lt;br /&gt;
* [[Data storage options for UofC researchers]]&lt;br /&gt;
* [[Security and privacy]]&lt;br /&gt;
* [[How to transfer data]]&lt;br /&gt;
&lt;br /&gt;
* [[UofC Services]]&lt;br /&gt;
&lt;br /&gt;
* [[Book online training sessions]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* [[How-Tos | More How-Tos]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
{{Clear}}&lt;br /&gt;
&lt;br /&gt;
__NOTOC__&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3810</id>
		<title>ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3810"/>
		<updated>2025-07-29T20:01:43Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Cluster Status&lt;br /&gt;
| cluster = ARC&lt;br /&gt;
| status = yellow&lt;br /&gt;
| title = Cluster operational - Problems with /bulk&lt;br /&gt;
| message = System is generally operational. Emergency outage planned for July 31 on /bulk.&lt;br /&gt;
&lt;br /&gt;
See the [[ARC Cluster Status]] page for system notices. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 16, 2023, the ARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The ARC login node will reboot on the morning of January 16. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 20.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following has been changed:&lt;br /&gt;
* OS Updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/02/28&lt;br /&gt;
| message =&lt;br /&gt;
We are currently investigating a filesystem issue that is causing filesystem slowdowns across ARC.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/03/01&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing filesystem slowdowns across ARC. Some jobs on ARC have been paused to help us find the root cause of the slowdowns.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ARC Login node reboot&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
The ARC login node will be rebooted this afternoon for emergency maintenance. This downtime is needed to help mitigate the filesystem slowdowns experienced on the login node.  Jobs will continue running and scheduling during this time.&lt;br /&gt;
&lt;br /&gt;
All logins to the ARC login node will be terminated at 3:00PM and will remain unavailable until 4:00PM.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Filesystem Issues&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing filesystem slowdowns on specific nodes in our MSRDC location.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues Resolved&lt;br /&gt;
| date = 2023/03/10&lt;br /&gt;
| message =&lt;br /&gt;
We have upgraded the filesystem routers in our MSRDC location to address the performance issues.&lt;br /&gt;
&lt;br /&gt;
Please let us know if you experience any issues with the filesystem performance.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/05/01&lt;br /&gt;
| message =&lt;br /&gt;
On May 1, 2023, the ARC Open OnDemand node will be rebooted between 5PM and 6PM. Expected downtime will be approximately 15 minutes.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on ARC Login Node&lt;br /&gt;
| date = 2023/06/22&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may encounter an error when&lt;br /&gt;
running on the ARC login node. If Apptainer complains that a system&lt;br /&gt;
administrator needs to enable user namespaces, simply run your&lt;br /&gt;
containers inside a job.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to a security vulnerability that will be&lt;br /&gt;
patched soon.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
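As a sketch of the workaround above (container name and resource values are hypothetical examples, not actual ARC defaults), running an Apptainer container inside a batch job rather than on the login node might look like:&lt;br /&gt;

```shell
#!/bin/bash
# Hypothetical batch script: run an Apptainer container inside a job,
# where user namespaces are available, instead of on the login node.
# The container file and resource requests below are examples only.
#SBATCH --job-name=apptainer-demo
#SBATCH --time=01:00:00
#SBATCH --mem=4G
#SBATCH --cpus-per-task=1

# Execute a command inside the container image on the compute node.
apptainer exec mycontainer.sif echo "container ran inside a job"
```

Submitting this with sbatch avoids the user-namespace error seen on the login node.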
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Lattice, Single, cpu2013 partition changes&lt;br /&gt;
| date = 2023/07/13&lt;br /&gt;
| message =&lt;br /&gt;
The Lattice, Single, and cpu2013 partitions have all been decommissioned.  The Single&lt;br /&gt;
partition will be replaced by the nodes formerly in the cpu2013 partition but&lt;br /&gt;
will be called single.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/10/17&lt;br /&gt;
| message =&lt;br /&gt;
Open OnDemand will be rebooted on October 17, 2023 for an update. It will be down for up to 30 minutes.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC and ARC clusters on &lt;br /&gt;
November 16 and 17, 2023. To facilitate this, we will be throttling &lt;br /&gt;
down the number of jobs on both clusters while the upgrades are &lt;br /&gt;
performed.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/05/03&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = ARC experienced a brief power outage around 11AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting.  Most jobs running at the time &lt;br /&gt;
were lost. ARC administrators are actively working on restarting compute &lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation&lt;br /&gt;
| date = 2024/06/03&lt;br /&gt;
| message = Job submissions targeted to the GPU a100 partition will be &lt;br /&gt;
affected by a temporary reservation on the nodes to accommodate the RCS&lt;br /&gt;
summer school class taking place on 2024/Jun/10. The reservation will end &lt;br /&gt;
shortly after. Please submit your jobs normally and the scheduler will &lt;br /&gt;
start them as soon as the nodes are available. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation Removed&lt;br /&gt;
| date = 2024/06/11&lt;br /&gt;
| message = GPU a100 Nodes in ARC have been returned to normal scheduling. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/23&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable &lt;br /&gt;
from Sept 23 to Sept 27 inclusive (subject to change). All compute nodes &lt;br /&gt;
in cpu2019, cpu2021/2, and gpu-v100, as well as most nodes from bigmem and gpu-a100, will be &lt;br /&gt;
affected. These nodes will return to service as soon as the work is complete.  &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues that are blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage will occur starting next Tuesday for the same nodes.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling has resumed. &lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The ARC login node will be rebooted on Tuesday December 17 for scheduled maintenance. It will be down for a few minutes and return shortly. Job scheduling and jobs running on the cluster will not be affected. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades starting 9AM Monday, January 20, 2025 through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported InfiniBand on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10AM, ARC experienced an electrical power brownout.  An as-yet-unknown percentage of the nodes lost electrical power during this time, causing the loss of a number of running jobs.  &lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since ARC is shutting down for maintenance on Monday, January 20, replacement jobs will likely not start unless they request a time limit shorter than the time remaining until 8AM Monday.  &lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
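Following the partition advice above, a multi-node MPI submission targeting a partition that kept its high-speed interconnect might be sketched as follows (the application name and resource values are hypothetical examples):&lt;br /&gt;

```shell
#!/bin/bash
# Hypothetical multi-node MPI job script: per the notice above, multi-node
# (MPI) work should target a partition such as cpu2019, cpu2021, or cpu2022
# rather than the Ethernet-only partitions. Values below are examples only.
#SBATCH --partition=cpu2019
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=02:00:00

# srun launches the MPI ranks across both nodes; ./my_mpi_app is a placeholder.
srun ./my_mpi_app
```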
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC cluster upgrade is complete.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported InfiniBand on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal was upgraded.&lt;br /&gt;
&lt;br /&gt;
6. The Parallel partition was renamed to Legacy, to reflect the lack of a high-speed interconnect for parallel MPI work, and is now restricted to jobs of at most 4 nodes.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
Jan 23, 9:08AM&lt;br /&gt;
&lt;br /&gt;
Remount complete.  ARC is back in full service.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgrade&lt;br /&gt;
| date = 2025/02/03&lt;br /&gt;
| message = Upgrade of the module command&lt;br /&gt;
&lt;br /&gt;
On Tuesday, February 11, 2025, the module command will be upgraded to a new version on ARC. This should result in new capabilities and a slightly different visual experience when using the module command.  Loading modules is not expected to change.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgraded&lt;br /&gt;
| date = 2025/02/12&lt;br /&gt;
| message = The module command was upgraded&lt;br /&gt;
&lt;br /&gt;
On February 12, 2025, the module command was upgraded to a new version on ARC. This should result in new capabilities and a slightly different visual experience when using the module command.  Loading modules should not have changed.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Please be informed that our RCS support email address (support@hpc.ucalgary.ca) is currently not working. We are working to bring it back as soon as possible; please keep an eye on this space for updates. The clusters are working normally, but support will not receive your messages at this time. We will begin responding as soon as the address is restored. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired and RCS can be contacted there. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit will be Enforced&lt;br /&gt;
| date = 2025/04/11&lt;br /&gt;
| message = In order to improve the scheduling and job throughput efficiency of ARC, interactive jobs will be limited to a maximum of 5 hours of runtime.  Interactive jobs that are submitted with a timelimit over 5 hours will be rejected at submission time.  This change will be made on Monday, April 28, 2025.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit Is Now Enforced&lt;br /&gt;
| date = 2025/04/28&lt;br /&gt;
| message = In order to improve the scheduling and job throughput efficiency of ARC, interactive jobs are now limited to a maximum of 5 hours of runtime.  Interactive jobs that are submitted with a timelimit over 5 hours will be rejected at submission time.&lt;br /&gt;
&lt;br /&gt;
Apr 29, 2025&lt;br /&gt;
To increase the security posture of the ARC cluster, administrators will be installing Trend Micro on the cluster login nodes over the week starting April 30.  Please report any inconsistencies to support@hpc.ucalgary.ca&lt;br /&gt;
}}&lt;br /&gt;
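A compliant interactive request under the new 5-hour cap might look like the following sketch (the memory and CPU values are hypothetical examples, not recommended settings):&lt;br /&gt;

```shell
# Hypothetical interactive job request: the requested time limit must be
# 5 hours or less, otherwise the submission is rejected at submission time.
# Resource values below are examples only.
salloc --time=04:00:00 --mem=8G --cpus-per-task=2
```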
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Bulk Filesystem Emergency Maintenance&lt;br /&gt;
| date = 2025/07/29&lt;br /&gt;
| message = The filer that provides the /bulk filesystem will be down for emergency repairs at 12 noon on Thursday, July 31.  No access to files on /bulk will be possible for the duration of the multi-hour outage.  Any running jobs that access /bulk will pause when access to /bulk is attempted, and should continue once service is restored.  Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Navbox ARC}}&lt;br /&gt;
[[Category:ARC]]&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3809</id>
		<title>ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3809"/>
		<updated>2025-07-29T19:13:06Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Cluster Status&lt;br /&gt;
| cluster = ARC&lt;br /&gt;
| status = yellow&lt;br /&gt;
| title = Cluster operational - Problems with /bulk&lt;br /&gt;
| message = System is generally operational. Emergency outage planned for July 31 on /bulk.&lt;br /&gt;
&lt;br /&gt;
See the [[ARC Cluster Status]] page for system notices. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 16, 2023, the ARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The ARC login node will reboot on the morning of January 16. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 20.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following has been changed:&lt;br /&gt;
* OS Updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/02/28&lt;br /&gt;
| message =&lt;br /&gt;
We are currently investigating a filesystem issue that is causing filesystem slowdowns across ARC.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/03/01&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing filesystem slowdowns across ARC. Some jobs on ARC have been paused to help us find the root cause of the slowdowns.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ARC Login node reboot&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
The ARC login node will be rebooted this afternoon for emergency maintenance. This downtime is needed to help mitigate the filesystem slowdowns experienced on the login node.  Jobs will continue running and scheduling during this time.&lt;br /&gt;
&lt;br /&gt;
All logins to the ARC login node will be terminated at 3:00PM and will remain unavailable until 4:00PM.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Filesystem Issues&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing filesystem slowdowns on specific nodes in our MSRDC location.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues Resolved&lt;br /&gt;
| date = 2023/03/10&lt;br /&gt;
| message =&lt;br /&gt;
We have upgraded the filesystem routers in our MSRDC location to address the performance issues.&lt;br /&gt;
&lt;br /&gt;
Please let us know if you experience any issues with the filesystem performance.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/05/01&lt;br /&gt;
| message =&lt;br /&gt;
On May 1, 2023, the ARC Open OnDemand node will be rebooted between 5PM and 6PM. Expected downtime will be approximately 15 minutes.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on ARC Login Node&lt;br /&gt;
| date = 2023/06/22&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may encounter an error when&lt;br /&gt;
running on the ARC login node. If Apptainer complains that a system&lt;br /&gt;
administrator needs to enable user namespaces, simply run your&lt;br /&gt;
containers inside a job.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to a security vulnerability that will be&lt;br /&gt;
patched soon.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Lattice, Single, cpu2013 partition changes&lt;br /&gt;
| date = 2023/07/13&lt;br /&gt;
| message =&lt;br /&gt;
The Lattice, Single, and cpu2013 partitions have all been decommissioned.  The Single&lt;br /&gt;
partition will be replaced by the nodes formerly in the cpu2013 partition but&lt;br /&gt;
will be called single.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/10/17&lt;br /&gt;
| message =&lt;br /&gt;
Open OnDemand will be rebooted on October 17, 2023 for an update. It will be down for up to 30 minutes.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC and ARC clusters on &lt;br /&gt;
November 16 and 17, 2023. To facilitate this, we will be throttling &lt;br /&gt;
down the number of jobs on both clusters while the upgrades are &lt;br /&gt;
performed.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/05/03&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = ARC experienced a brief power outage around 11AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting.  Most jobs running at the time &lt;br /&gt;
were lost. ARC administrators are actively working on restarting compute &lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation&lt;br /&gt;
| date = 2024/06/03&lt;br /&gt;
| message = Job submissions targeted to the GPU a100 partition will be &lt;br /&gt;
affected by a temporary reservation on the nodes to accommodate the RCS&lt;br /&gt;
summer school class taking place on 2024/Jun/10. The reservation will end &lt;br /&gt;
shortly after. Please submit your jobs normally and the scheduler will &lt;br /&gt;
start them as soon as the nodes are available. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation Removed&lt;br /&gt;
| date = 2024/06/11&lt;br /&gt;
| message = GPU a100 Nodes in ARC have been returned to normal scheduling. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/23&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable &lt;br /&gt;
from Sept 23 to Sept 27 inclusive (subject to change). All compute nodes &lt;br /&gt;
in cpu2019, cpu2021/2, and gpu-v100, as well as most nodes from bigmem and gpu-a100, will be &lt;br /&gt;
affected. These nodes will return to service as soon as the work is complete.  &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues that are blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage will occur starting next Tuesday for the same nodes.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling Has Resumed&lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The ARC login node will be rebooted on Tuesday, December 17 for scheduled maintenance. It will be down for only a few minutes. Job scheduling and jobs running on the cluster will not be affected. Thank you for your understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday, January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused during the reboot; queued jobs will remain in the queue and will begin scheduling once the cluster is ready. Thank you for your understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades starting at 9 AM on Monday, January 20, 2025, through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported InfiniBand interconnect on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will see increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10 AM, ARC experienced an electrical power brownout. An as-yet-unknown percentage of the nodes lost electrical power, causing the loss of a number of running jobs.  &lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since ARC is shutting down for maintenance on Monday, January 20, replacement jobs will likely not start unless they request a time limit shorter than the time remaining until 8 AM Monday.  &lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC cluster upgrade is complete.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported InfiniBand interconnect on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will see increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal was upgraded.&lt;br /&gt;
&lt;br /&gt;
6. The Parallel partition was renamed to Legacy to reflect the lack of a high-speed interconnect for parallel MPI work, and it is now restricted to jobs of at most 4 nodes.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
Jan 23, 9:08AM&lt;br /&gt;
&lt;br /&gt;
Remount complete. ARC is back in full service.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgrade&lt;br /&gt;
| date = 2025/02/03&lt;br /&gt;
| message = Upgrade of the module command&lt;br /&gt;
&lt;br /&gt;
On Tuesday, February 11, 2025, the module command will be upgraded to a new version on ARC. This should result in new capabilities and a slightly different visual experience when using the module command. Loading modules is not expected to change.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgraded&lt;br /&gt;
| date = 2025/02/12&lt;br /&gt;
| message = The module command was upgraded.&lt;br /&gt;
&lt;br /&gt;
On Wednesday, February 12, 2025, the module command was upgraded to a new version on ARC. This should result in new capabilities and a slightly different visual experience when using the module command. Loading modules should not have changed.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Please be advised that our support email address (support@hpc.ucalgary.ca) for RCS is currently not working. We are working to bring it back as soon as possible; please keep an eye on this space for updates. The clusters are operating normally, but support will not receive your messages at this time. We will begin responding as soon as the address is restored. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired and RCS can be contacted there. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit will be Enforced&lt;br /&gt;
| date = 2025/04/11&lt;br /&gt;
| message = To improve the scheduling and job throughput efficiency of ARC, interactive jobs will be limited to a maximum of 5 hours of runtime. Interactive jobs submitted with a time limit over 5 hours will be rejected at submission time. This change will be made on Monday, April 28, 2025.&lt;br /&gt;
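For example, an interactive session within the new limit could be requested like this (the resource values and partition name below are illustrative examples, not a prescribed configuration):&lt;br /&gt;
&lt;br /&gt;
```shell
# Request an interactive session within the 5-hour limit
# (time, task, memory, and partition values are illustrative examples)
salloc --time=04:00:00 --ntasks=1 --mem=4G --partition=cpu2019
```
&lt;br /&gt;
Requests with --time above 05:00:00 would be rejected at submission under this policy.&lt;br /&gt;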
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit Is Now Enforced&lt;br /&gt;
| date = 2025/04/28&lt;br /&gt;
| message = To improve the scheduling and job throughput efficiency of ARC, interactive jobs are now limited to a maximum of 5 hours of runtime. Interactive jobs submitted with a time limit over 5 hours will be rejected at submission time.&lt;br /&gt;
&lt;br /&gt;
Apr 29, 2025&lt;br /&gt;
To improve the security posture of the ARC cluster, administrators will be installing Trend Micro on the cluster login nodes during the week starting April 30. Please report any issues to support@hpc.ucalgary.ca.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Bulk Filesystem Emergency Maintenance&lt;br /&gt;
| date = 2025/07/29&lt;br /&gt;
| message = The filer that provides the /bulk filesystem will be down for emergency repairs on the afternoon of Thursday, July 31. No access to files on /bulk will be possible for the duration of the multi-hour outage. We will update this message when we have a more accurate time of day. Jobs that access /bulk will start and then pause when the access is attempted; they should continue once service is restored. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Navbox ARC}}&lt;br /&gt;
[[Category:ARC]]&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3808</id>
		<title>ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3808"/>
		<updated>2025-07-29T19:07:20Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Cluster Status&lt;br /&gt;
| cluster = ARC&lt;br /&gt;
| status = green&lt;br /&gt;
| title = Cluster operational - Power Bump Jan 18&lt;br /&gt;
| message = System is operational. Updates are planned for Jan 20. Please see the MOTD.&lt;br /&gt;
&lt;br /&gt;
See the [[ARC Cluster Status]] page for system notices. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 16, 2023, the ARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The ARC login node will reboot on the morning of January 16. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 20.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following changes have been made:&lt;br /&gt;
* OS Updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
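One way to see the per-job mounts is to inspect them from inside a running job (an illustrative quick check, not an official procedure):&lt;br /&gt;
&lt;br /&gt;
```shell
# From inside an salloc/sbatch session, each of these should be a
# private, job-scoped mount rather than the node-wide filesystem
df -h /tmp /dev/shm /run/user/$(id -u)
```
&lt;br /&gt;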
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/02/28&lt;br /&gt;
| message =&lt;br /&gt;
We are currently investigating a filesystem issue that is causing filesystem slowdowns across ARC.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/03/01&lt;br /&gt;
| message =&lt;br /&gt;
We are still currently investigating a filesystem issue that is causing filesystem slowdowns across ARC. Some jobs on ARC have been paused to help us find the root cause of the slowdowns.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ARC Login node reboot&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
The ARC login node will be rebooted this afternoon for emergency maintenance. This downtime is needed to help mitigate the filesystem slowdowns experienced on the login node. Jobs will continue running and scheduling during this time.&lt;br /&gt;
&lt;br /&gt;
All logins to the ARC login node will be terminated at 3:00 PM, and the node will remain unavailable until 4:00 PM.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Filesystem Issues&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
We are still currently investigating a filesystem issue that is causing filesystem slowdowns on specific nodes in our MSRDC location.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues Resolved&lt;br /&gt;
| date = 2023/03/10&lt;br /&gt;
| message =&lt;br /&gt;
We have upgraded the filesystem routers in our MSRDC location to address the performance issues.&lt;br /&gt;
&lt;br /&gt;
Please let us know if you experience any issues with the filesystem performance.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/05/01&lt;br /&gt;
| message =&lt;br /&gt;
On May 1, 2023, the ARC Open OnDemand node will be rebooted between 5PM and 6PM. Expected downtime will be approximately 15 minutes.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on ARC Login Node&lt;br /&gt;
| date = 2023/06/22&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may produce an error when&lt;br /&gt;
running on the ARC login node. If Apptainer complains that a system&lt;br /&gt;
administrator needs to enable user namespaces, simply run your&lt;br /&gt;
containers inside a job.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to a security vulnerability that will be&lt;br /&gt;
patched soon.&lt;br /&gt;
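As a sketch, the workaround means launching the container from within a Slurm job rather than on the login node (the image name and resource requests below are illustrative assumptions):&lt;br /&gt;
&lt;br /&gt;
```shell
# Run the container inside a compute job instead of on the login node
# (image path, time limit, and memory request are illustrative examples)
srun --time=01:00:00 --mem=4G apptainer run my_container.sif
```
&lt;br /&gt;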
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Lattice, Single, cpu2013 partition changes&lt;br /&gt;
| date = 2023/07/13&lt;br /&gt;
| message =&lt;br /&gt;
The Lattice, Single, and cpu2013 partitions have all been decommissioned. The Single&lt;br /&gt;
partition will be replaced by the nodes formerly in the cpu2013 partition, but&lt;br /&gt;
will still be called single.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/10/17&lt;br /&gt;
| message =&lt;br /&gt;
Open OnDemand will be rebooted on October 17, 2023 for an update. It will be down for up to 30 minutes.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC/ARC cluster on &lt;br /&gt;
November 16 and 17, 2023. To facilitate this, we will be throttling &lt;br /&gt;
down the number of jobs on both clusters while the upgrades are &lt;br /&gt;
performed.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/05/03&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = ARC experienced a brief power outage around 11 AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting. Most jobs running at that time &lt;br /&gt;
were lost. ARC administrators are actively working on restarting compute &lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation&lt;br /&gt;
| date = 2024/06/03&lt;br /&gt;
| message = Job submissions targeted to the GPU a100 partition will be &lt;br /&gt;
affected by a temporary reservation on the nodes to accommodate the RCS&lt;br /&gt;
summer school class taking place on 2024/Jun/10. The reservation will end &lt;br /&gt;
shortly afterward. Please submit your jobs normally and the scheduler will &lt;br /&gt;
start them as soon as the nodes are available. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation Removed&lt;br /&gt;
| date = 2024/06/11&lt;br /&gt;
| message = GPU a100 Nodes in ARC have been returned to normal scheduling. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/23&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable &lt;br /&gt;
from September 23 to September 27 inclusive (subject to change). All compute nodes &lt;br /&gt;
in cpu2019, cpu2021/2, and gpu-v100, plus most nodes from bigmem and gpu-a100, will be &lt;br /&gt;
affected. These nodes will return to service as soon as the work is complete.  &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage will begin next Tuesday for the same nodes.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control, the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling Has Resumed&lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The ARC login node will be rebooted on Tuesday, December 17 for scheduled maintenance. It will be down for only a few minutes. Job scheduling and jobs running on the cluster will not be affected. Thank you for your understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday, January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused during the reboot; queued jobs will remain in the queue and will begin scheduling once the cluster is ready. Thank you for your understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades starting at 9 AM on Monday, January 20, 2025, through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported InfiniBand interconnect on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will see increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10 AM, ARC experienced an electrical power brownout. An as-yet-unknown percentage of the nodes lost electrical power, causing the loss of a number of running jobs.  &lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since ARC is shutting down for maintenance on Monday, January 20, replacement jobs will likely not start unless they request a time limit shorter than the time remaining until 8 AM Monday.  &lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC cluster upgrade is complete.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported InfiniBand interconnect on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will see increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal was upgraded.&lt;br /&gt;
&lt;br /&gt;
6. The Parallel partition was renamed to Legacy to reflect the lack of a high-speed interconnect for parallel MPI work, and it is now restricted to jobs of at most 4 nodes.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
Jan 23, 9:08AM&lt;br /&gt;
&lt;br /&gt;
Remount complete. ARC is back in full service.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgrade&lt;br /&gt;
| date = 2025/02/03&lt;br /&gt;
| message = Upgrade of the module command&lt;br /&gt;
&lt;br /&gt;
On Tuesday, February 11, 2025, the module command will be upgraded to a new version on ARC. This should result in new capabilities and a slightly different visual experience when using the module command. Loading modules is not expected to change.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgraded&lt;br /&gt;
| date = 2025/02/12&lt;br /&gt;
| message = The module command was upgraded.&lt;br /&gt;
&lt;br /&gt;
On Wednesday, February 12, 2025, the module command was upgraded to a new version on ARC. This should result in new capabilities and a slightly different visual experience when using the module command. Loading modules should not have changed.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Please be advised that our support email address (support@hpc.ucalgary.ca) for RCS is currently not working. We are working to bring it back as soon as possible; please keep an eye on this space for updates. The clusters are operating normally, but support will not receive your messages at this time. We will begin responding as soon as the address is restored. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired and RCS can be contacted there. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit will be Enforced&lt;br /&gt;
| date = 2025/04/11&lt;br /&gt;
| message = To improve the scheduling and job throughput efficiency of ARC, interactive jobs will be limited to a maximum of 5 hours of runtime. Interactive jobs submitted with a time limit over 5 hours will be rejected at submission time. This change will be made on Monday, April 28, 2025.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit Is Now Enforced&lt;br /&gt;
| date = 2025/04/28&lt;br /&gt;
| message = To improve the scheduling and job throughput efficiency of ARC, interactive jobs are now limited to a maximum of 5 hours of runtime. Interactive jobs submitted with a time limit over 5 hours will be rejected at submission time.&lt;br /&gt;
&lt;br /&gt;
Apr 29, 2025&lt;br /&gt;
To improve the security posture of the ARC cluster, administrators will be installing Trend Micro on the cluster login nodes during the week starting April 30. Please report any issues to support@hpc.ucalgary.ca.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Bulk Filesystem Emergency Maintenance&lt;br /&gt;
| date = 2025/07/29&lt;br /&gt;
| message = The filer that provides the /bulk filesystem will be down for emergency repairs on the afternoon of Thursday, July 31. No access to files on /bulk will be possible for the duration of the multi-hour outage. We will update this message when we have a more accurate time of day. Jobs that access /bulk will start and then pause when the access is attempted; they should continue once service is restored. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Navbox ARC}}&lt;br /&gt;
[[Category:ARC]]&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3807</id>
		<title>ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3807"/>
		<updated>2025-07-29T19:06:50Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Cluster Status&lt;br /&gt;
| cluster = ARC&lt;br /&gt;
| status = green&lt;br /&gt;
| title = Cluster operational - Power Bump Jan 18&lt;br /&gt;
| message = System is operational. Updates are planned for Jan 20. Please see the MOTD.&lt;br /&gt;
&lt;br /&gt;
See the [[ARC Cluster Status]] page for system notices. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 16, 2023, the ARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The ARC login node will reboot on the morning of January 16. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 20.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following has been changed:&lt;br /&gt;
* OS Updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/02/28&lt;br /&gt;
| message =&lt;br /&gt;
We are currently investigating a filesystem issue that is causing filesystem slowdowns across ARC.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/03/1&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing slowdowns across ARC. Some jobs on ARC have been paused to help us find the root cause of the slowdowns.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ARC Login node reboot&lt;br /&gt;
| date = 2023/03/2&lt;br /&gt;
| message =&lt;br /&gt;
The ARC login node will be rebooted this afternoon for emergency maintenance. This downtime is needed to help mitigate the filesystem slowdowns experienced on the login node. Jobs will continue running and scheduling during this time.&lt;br /&gt;
&lt;br /&gt;
All logins to the ARC login node will be terminated at 3:00PM and will remain unavailable until 4:00PM.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Filesystem Issues&lt;br /&gt;
| date = 2023/03/2&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing slowdowns on specific nodes in our MSRDC location.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues Resolved&lt;br /&gt;
| date = 2023/03/10&lt;br /&gt;
| message =&lt;br /&gt;
We have upgraded the filesystem routers in our MSRDC location to address the performance issues.&lt;br /&gt;
&lt;br /&gt;
Please let us know if you experience any issues with the filesystem performance.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/05/01&lt;br /&gt;
| message =&lt;br /&gt;
On May 1, 2023, the ARC Open OnDemand node will be rebooted between 5PM and 6PM. Expected downtime will be approximately 15 minutes.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on ARC Login Node&lt;br /&gt;
| date = 2023/06/22&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may experience an error when&lt;br /&gt;
running on the Arc login node. If apptainer complains that a system&lt;br /&gt;
administrator needs to enable user namespaces, simply run your&lt;br /&gt;
containers inside a job.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to a security vulnerability that will be&lt;br /&gt;
patched soon.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Lattice, Single, cpu2013 partition changes&lt;br /&gt;
| date = 2023/07/13&lt;br /&gt;
| message =&lt;br /&gt;
The Lattice, Single, and cpu2013 partitions have all been decommissioned. The&lt;br /&gt;
nodes formerly in the cpu2013 partition will replace the old Single nodes&lt;br /&gt;
under the partition name single.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/10/17&lt;br /&gt;
| message =&lt;br /&gt;
Open OnDemand will be rebooted on October 17, 2023 for an update. It will be down for up to 30 minutes.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC/ARC cluster on &lt;br /&gt;
November 16 and 17, 2023. To facilitate this, we will be throttling &lt;br /&gt;
down the number of jobs on both clusters while the upgrades are &lt;br /&gt;
performed.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/05/3&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = ARC experienced a brief power outage around 11AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting. Most jobs running at that time &lt;br /&gt;
were lost. ARC administrators are actively working on restarting compute &lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation&lt;br /&gt;
| date = 2024/06/03&lt;br /&gt;
| message = Job submissions targeted to the GPU a100 partition will be &lt;br /&gt;
affected by a temporary reservation on the nodes to accommodate the RCS&lt;br /&gt;
summer school class taking place on 2024/Jun/10. The reservation will end &lt;br /&gt;
shortly after. Please submit your jobs normally and the scheduler will &lt;br /&gt;
start them as soon as the nodes are available. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation Removed&lt;br /&gt;
| date = 2024/06/11&lt;br /&gt;
| message = GPU a100 Nodes in ARC have been returned to normal scheduling. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/23&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable &lt;br /&gt;
between Sept 23 and Sept 27 inclusive (subject to change). All compute nodes &lt;br /&gt;
in cpu2019, cpu2021/2, and gpu-v100, and most nodes from bigmem and gpu-a100, will be &lt;br /&gt;
affected. These nodes will return to service as soon as the work is complete.  &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage will begin next Tuesday for the same nodes.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024.&lt;br /&gt;
&lt;br /&gt;
We apologise for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling has resumed. &lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The ARC login node will be rebooted on Tuesday December 17 for scheduled maintenance. It will be down for a few minutes and return shortly. Job scheduling and jobs running on the cluster will not be affected. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused during the update; queued jobs will remain in the queue and will start once the cluster is back. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades starting 9AM Monday, January 20, 2025 through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported Infiniband on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10AM, ARC experienced an electrical power brownout. An as-yet-unknown number of nodes lost electrical power during this time, causing the loss of a number of running jobs.  &lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since ARC is shutting down for maintenance on Monday, Jan 20, replacement jobs will likely not start unless they request a timelimit shorter than the time remaining until 8AM Monday.  &lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC cluster upgrade is complete&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported Infiniband on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal was upgraded.&lt;br /&gt;
&lt;br /&gt;
6. The Parallel partition was renamed to Legacy to reflect the lack of a low-latency interconnect for parallel MPI work, and was restricted to a maximum of 4 nodes per job.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
Jan 23, 9:08AM&lt;br /&gt;
&lt;br /&gt;
Remount complete. ARC is back in full service.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgrade&lt;br /&gt;
| date = 2025/02/03&lt;br /&gt;
| message = Upgrade of the module command&lt;br /&gt;
&lt;br /&gt;
On Tuesday, February 11, 2025, the module command will be upgraded to a new version on ARC. This should result in new capabilities and a slightly different visual experience when using the module command. Loading modules is not expected to change.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgraded&lt;br /&gt;
| date = 2025/02/12&lt;br /&gt;
| message = The module command was upgraded&lt;br /&gt;
&lt;br /&gt;
On February 12, 2025, the module command was upgraded to a new version on ARC. This should result in new capabilities and a slightly different visual experience when using the module command. Loading modules should not have changed.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Please be informed that our support email address for RCS (support@hpc.ucalgary.ca) is currently not working. We are working to bring it back as soon as possible; please keep an eye on this space for updates. The clusters are working normally, but support will not receive your messages at this time. We will begin responding as soon as the address is restored. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired and RCS can be contacted there. If you had reached out for assistance in recent days without response please follow up as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit will be Enforced&lt;br /&gt;
| date = 2025/04/11&lt;br /&gt;
| message = In order to improve the scheduling and job throughput efficiency of ARC, interactive jobs will be limited to a maximum of 5 hours of runtime.  Interactive jobs that are submitted with a timelimit over 5 hours will be rejected at submission time.  This change will be made on Monday, April 28, 2025.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit Is Now Enforced&lt;br /&gt;
| date = 2025/04/28&lt;br /&gt;
| message = In order to improve the scheduling and job throughput efficiency of ARC, interactive jobs are now limited to a maximum of 5 hours of runtime.  Interactive jobs that are submitted with a timelimit over 5 hours will be rejected at submission time.&lt;br /&gt;
&lt;br /&gt;
Apr 29, 2025&lt;br /&gt;
To increase the security posture of the ARC cluster, administrators will be installing Trend Micro on cluster login nodes during the week starting Apr 30.  Please report any issues to support@hpc.ucalgary.ca.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Bulk Filesystem Emergency Maintenance&lt;br /&gt;
| date = 2025/07/29&lt;br /&gt;
| message = The filer that provides the /bulk filesystem will be down for emergency repairs on the afternoon of Thursday, July 31. No access to files on /bulk will be possible for the duration of the multi-hour outage; we will update this message when we have a more accurate time of day. Any jobs that access /bulk will start and then pause when access to /bulk is attempted, and should continue once service is restored. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Navbox ARC}}&lt;br /&gt;
[[Category:ARC]]&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=RCS_Home_Page&amp;diff=3806</id>
		<title>RCS Home Page</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=RCS_Home_Page&amp;diff=3806"/>
		<updated>2025-07-29T18:36:23Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: /* Software pages */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Research Computing Services (RCS) is a group within the wider University of Calgary Information Technologies team that plans, manages, and supports high performance computing (HPC) systems in use by researchers throughout the University of Calgary.  Our primary focus is to meet the increasing demand for engineering and scientific computation by offering a wide range of specialized services to help researchers solve highly complex real-world problems or run large scale computationally intensive workloads on our high-end HPC resources.&lt;br /&gt;
&lt;br /&gt;
This RCS Wiki contains technical documentation for users of HPC systems operated by RCS.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &lt;br /&gt;
In case cluster status changes:&lt;br /&gt;
    *  set the status to yellow or red &lt;br /&gt;
    *  provide a custom &#039;title&#039; and &#039;message&#039;&lt;br /&gt;
&lt;br /&gt;
{{Cluster Status&lt;br /&gt;
|status=green&lt;br /&gt;
}}&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
=== Contact us for support ===&lt;br /&gt;
&lt;br /&gt;
* For general RCS/HPC inquiries, please email: [mailto:support@hpc.ucalgary.ca support@hpc.ucalgary.ca]&lt;br /&gt;
* For IT related issues (networking, VPN, email), please email: [mailto:it@ucalgary.ca it@ucalgary.ca]&lt;br /&gt;
* For Compute Canada specific questions: [mailto:support@tech.alliancecan.ca support@tech.alliancecan.ca]&lt;br /&gt;
&lt;br /&gt;
{{Clear}}&lt;br /&gt;
&amp;lt;div class=&amp;quot;row&amp;quot;&amp;gt;&lt;br /&gt;
&amp;lt;div class=&amp;quot;col-md-6&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== General information ==&lt;br /&gt;
* [[General Cluster Guidelines and Policies]]&lt;br /&gt;
* [[How to get an account]]&lt;br /&gt;
* [[Data ownership]]&lt;br /&gt;
* [[Connecting to RCS HPC Systems]]&lt;br /&gt;
* [[External collaborators]]&lt;br /&gt;
&lt;br /&gt;
* [[CloudStack|Cloud/Virtual Machine Infrastructure (CloudStack)]]&lt;br /&gt;
&lt;br /&gt;
* [[On-line resources for new Linux and ARC users]]&lt;br /&gt;
* [[Acknowledging Research Computing Services Group]]&lt;br /&gt;
&lt;br /&gt;
== Cluster Guides ==&lt;br /&gt;
* [[ ARC Cluster Guide]] - ARC is a general purpose cluster for University of Calgary researchers.&lt;br /&gt;
*  [[GLaDOS Cluster Guide]] - GLaDOS is a researcher-owned cluster maintained by Research Computing Services.&lt;br /&gt;
*  [[TALC Cluster Guide]] - Teaching and Learning Cluster (TALC) is a cluster created by Research Computing Services to support academic courses and workshops.&lt;br /&gt;
* [[MARC Cluster Guide]] -- Medical Advanced Research Computing cluster at the University of Calgary created by Research Computing Services in 2020.&lt;br /&gt;
&lt;br /&gt;
== Other services ==&lt;br /&gt;
&lt;br /&gt;
* [[Jupyter Notebooks]]&lt;br /&gt;
* [[Open OnDemand | Open OnDemand portal]]&lt;br /&gt;
&lt;br /&gt;
== Software pages ==&lt;br /&gt;
* [[Managing software on ARC]]&lt;br /&gt;
* [[Gaussian on ARC]] -- How to use Gaussian 16 on ARC.&lt;br /&gt;
* [[Apache Spark on ARC]]&lt;br /&gt;
* [[ARC Software pages]]&lt;br /&gt;
* [[Bioinformatics applications]]&lt;br /&gt;
* [[Running_alphafold3]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;div class=&amp;quot;col-md-6&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Running courses on HPC resources ==&lt;br /&gt;
* [[TALC Cluster|TALC]] - Teaching and Learning Cluster (TALC) is a cluster created by Research Computing Services to support academic courses and workshops.&lt;br /&gt;
* [[TALC Terms of Use]] - Terms of use to which TALC account holders must agree to use the cluster.&lt;br /&gt;
* [[List of courses on TALC]] - A list of current and historical courses taught using TALC.&lt;br /&gt;
&lt;br /&gt;
== Training ==&lt;br /&gt;
* Our [[HPC Systems]]&lt;br /&gt;
* [[HPC Linux topics]] - A list of topics on which RCS technical support staff can provide one-on-one or group training&lt;br /&gt;
* [[Courses]]&lt;br /&gt;
* [[Linux Introduction]]&lt;br /&gt;
* [[What is a scheduler?]]&lt;br /&gt;
* [[Running jobs]]&lt;br /&gt;
* [[Data storage options for UofC researchers]]&lt;br /&gt;
* [[Security and privacy]]&lt;br /&gt;
* [[How to transfer data]]&lt;br /&gt;
&lt;br /&gt;
* [[UofC Services]]&lt;br /&gt;
&lt;br /&gt;
* [[Book online training sessions]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* [[How-Tos | More How-Tos]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
{{Clear}}&lt;br /&gt;
&lt;br /&gt;
__NOTOC__&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=Running_alphafold3&amp;diff=3805</id>
		<title>Running alphafold3</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=Running_alphafold3&amp;diff=3805"/>
		<updated>2025-07-29T18:35:37Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Running Alphafold3 on Arc =&lt;br /&gt;
&lt;br /&gt;
Alphafold3 has been compiled into an Apptainer image for use on Arc.&lt;br /&gt;
&lt;br /&gt;
Alphafold3 has two separate steps: a data pipeline step that requires CPUs only and reads much of the large public_databases directory, and an inference step that requires a GPU.  The pipeline step takes much longer than the inference step, so it is wasteful to occupy a GPU node for its duration.  To run Alphafold3 efficiently we therefore run it two separate times, in different modes, on compute nodes that have the appropriate resources.&lt;br /&gt;
&lt;br /&gt;
== Prerequisites ==&lt;br /&gt;
&lt;br /&gt;
* Due to licensing reasons, every client who would like to use alphafold3 must register for and download the model parameters.  See https://forms.gle/svvpY4u2jsHEwWYS6 to register.  The model parameters are relatively small, can be stored anywhere on ARC that you have access to, and can be used by multiple jobs at the same time (you do not need a separate copy for every job).&lt;br /&gt;
* The alphafold3 container is stored in /global/software/alphafold3.  There are example job scripts in that directory that are referenced below.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== The Pipeline stage ==&lt;br /&gt;
&lt;br /&gt;
* Download the model and put it in ./models&lt;br /&gt;
* Generate input data.  Split your input sequences into 5 roughly equally sized files (by number of input sequences).  The examples below assume filenames starting with xa and ending with .json, matched by xa*.json&lt;br /&gt;
* Create ./pipelineoutputs for the resulting files&lt;br /&gt;
* Run the pipeline stage on cpus by submitting runpipeline.slurm&lt;br /&gt;
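The steps above could be sketched as a Slurm batch script. This is a hypothetical outline only: the partition, resource numbers, container image name, database path, and program flags are assumptions, and the actual runpipeline.slurm shipped in /global/software/alphafold3 is authoritative.

```shell
#!/bin/bash
# Hypothetical sketch of the CPU-only pipeline stage; consult the example
# runpipeline.slurm in /global/software/alphafold3 for the real settings.
#SBATCH --job-name=af3-pipeline
#SBATCH --partition=cpu2019     # any CPU partition you have access to (assumption)
#SBATCH --cpus-per-task=8       # resource numbers are illustrative
#SBATCH --mem=64G
#SBATCH --time=24:00:00
#SBATCH --array=0-4             # one array task per xa*.json input file

inputs=(xa*.json)               # the 5 input files from the step above
mkdir -p pipelineoutputs

# Run only the data-pipeline step inside the provided container; the image
# name, database path, and flags below are assumptions based on the
# AlphaFold 3 run_alphafold.py interface and may differ on ARC.
apptainer exec /global/software/alphafold3/alphafold3.sif \
    python run_alphafold.py \
        --json_path "${inputs[$SLURM_ARRAY_TASK_ID]}" \
        --output_dir pipelineoutputs \
        --db_dir /global/software/alphafold3/public_databases \
        --run_inference=false
```

The job array lets the 5 input files run as independent CPU jobs instead of serially in one job.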
&lt;br /&gt;
== The Inference stage ==&lt;br /&gt;
&lt;br /&gt;
* copy the runinference.slurm file to a directory you have write access to.&lt;br /&gt;
* Take the outputs from the pipeline stage (which were put in ./pipelineoutputs in the example) and move/copy the files into a directory called something like inferenceinputs.  Note that the pipeline stage makes a directory for each input file, but the inference stage expects the files to all be directly in inferenceinputs, so you will have to copy them all.  You can use find to help:&lt;br /&gt;
 cd pipelineoutputs&lt;br /&gt;
 cp $(find . -type f -name \*.json) ../inferenceinputs&lt;br /&gt;
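As a self-contained illustration of this flattening step (the directories and file names below are throwaway stand-ins, not real pipeline outputs):

```shell
# Simulate pipeline outputs: one subdirectory per input, each holding a JSON.
mkdir -p pipelineoutputs/seq1 pipelineoutputs/seq2 inferenceinputs
touch pipelineoutputs/seq1/seq1_data.json pipelineoutputs/seq2/seq2_data.json

# Flatten: gather every JSON from the per-input subdirectories into one place.
cd pipelineoutputs
cp $(find . -type f -name \*.json) ../inferenceinputs
cd ..

ls inferenceinputs    # both JSON files now sit directly in inferenceinputs
```

Note that the unquoted $(find ...) relies on filenames without spaces; find . -type f -name \*.json -exec cp {} ../inferenceinputs \; avoids that limitation.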
* Submit runinference.slurm to the scheduler.&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=Running_alphafold3&amp;diff=3804</id>
		<title>Running alphafold3</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=Running_alphafold3&amp;diff=3804"/>
		<updated>2025-07-29T17:22:14Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: Created page with &amp;quot;= Running Alphafold3 on Arc =  Alphafold3 has been compiled into an Apptainer image for use on Arc.  Alphafold3 has two separate steps, one that requires cpus only, and reads much of the large public_databases directory.  The second step runs the inference and requires the use of a GPU.  The pipeline step takes much longer than the inference step. It is wasteful to occupy a GPU node for the duration of the pipeline step.  To run Alphafold3 efficiently we have to run it t...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Running Alphafold3 on Arc =&lt;br /&gt;
&lt;br /&gt;
Alphafold3 has been compiled into an Apptainer image for use on Arc.&lt;br /&gt;
&lt;br /&gt;
Alphafold3 has two separate steps: a data pipeline step that requires CPUs only and reads much of the large public_databases directory, and an inference step that requires a GPU.  The pipeline step takes much longer than the inference step, so it is wasteful to occupy a GPU node for its duration.  To run Alphafold3 efficiently we therefore run it two separate times, in different modes, on compute nodes that have the appropriate resources.&lt;br /&gt;
&lt;br /&gt;
== Prerequisites ==&lt;br /&gt;
&lt;br /&gt;
Due to licensing reasons, every client who would like to use alphafold3 must register for and download the model parameters.  See https://forms.gle/svvpY4u2jsHEwWYS6 to register.  The model parameters are relatively small, can be stored anywhere on ARC that you have access to, and can be used by multiple jobs at the same time (you do not need a separate copy for every job).&lt;br /&gt;
&lt;br /&gt;
== The Pipeline stage ==&lt;br /&gt;
&lt;br /&gt;
* Download the model and put it in ./models&lt;br /&gt;
* Generate input data.  Split your input sequences into 5 roughly equally sized files (by number of input sequences).  The examples below assume filenames starting with xa and ending with .json, matched by xa*.json&lt;br /&gt;
* Create ./pipelineoutputs for the resulting files&lt;br /&gt;
* Run the pipeline stage on cpus by submitting runpipeline.slurm&lt;br /&gt;
&lt;br /&gt;
== The Inference stage ==&lt;br /&gt;
&lt;br /&gt;
* copy the runinference.slurm file to a directory you have write access to.&lt;br /&gt;
* Take the outputs from the pipeline stage (which were put in ./pipelineoutputs in the example) and move/copy the files into a directory called something like inferenceinputs.  Note that the pipeline stage makes a directory for each input file, but the inference stage expects the files to all be directly in inferenceinputs, so you will have to copy them all.  You can use find to help:&lt;br /&gt;
 cd pipelineoutputs&lt;br /&gt;
 cp $(find . -type f -name \*.json) ../inferenceinputs&lt;br /&gt;
* Submit runinference.slurm to the scheduler.&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=MARC_Cluster_Status&amp;diff=3797</id>
		<title>MARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=MARC_Cluster_Status&amp;diff=3797"/>
		<updated>2025-06-17T19:38:54Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Cluster Status&lt;br /&gt;
| cluster = MARC&lt;br /&gt;
| status = green&lt;br /&gt;
| title = Cluster operational&lt;br /&gt;
| message = See the [[MARC Cluster Status]] page for system notices. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 23, 2023, the MARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The MARC login node will reboot on the morning of January 23. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 27.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following has been changed:&lt;br /&gt;
* OS Updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on MARC Login Node&lt;br /&gt;
| date = 2023/06/23&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may experience an error when&lt;br /&gt;
running on the MARC login node. If apptainer complains that a system&lt;br /&gt;
administrator needs to enable user namespaces, simply run your&lt;br /&gt;
containers inside a job.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to a security vulnerability that will be&lt;br /&gt;
patched soon.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC/ARC cluster on &lt;br /&gt;
November 16 and 17, 2023. To facilitate this, we will be throttling &lt;br /&gt;
down the number of jobs on both clusters while the upgrades are &lt;br /&gt;
performed.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = OS Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2024/09/11&lt;br /&gt;
| message =&lt;br /&gt;
MARC will be going down for OS upgrades on 2024/Sep/16. The cluster &lt;br /&gt;
will be temporarily unavailable while this work is completed. Please contact&lt;br /&gt;
support@hpc.ucalgary.ca if you have any questions or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The MARC login node will be rebooted on Tuesday, December 17, for scheduled maintenance. It will be down for a few minutes. Job scheduling and jobs running on the cluster will not be affected. Thank you for your understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The MARC cluster will be rebooted for OS updates on Monday, January 20, 2025. Please make sure to save your work and log out before the reboot. Scheduling will be paused until the cluster is back; queued jobs will remain in the queue and will start scheduling when the cluster is ready. Thank you for your understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Please be advised that our support email address for RCS (support@hpc.ucalgary.ca) is currently not working. We are working to restore it as soon as possible; please keep an eye on this space for updates. The clusters are working normally, but support will not receive your messages at this time. We will begin responding as soon as the address is restored. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired, and RCS can be contacted there again. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = MARC Scheduled File System Maintenance&lt;br /&gt;
| date = 2025/06/09&lt;br /&gt;
| message = Please be advised that MARC will be down for approximately 2 hours starting at 10 AM on June 17, 2025. Logins will not be available and no jobs will be running during this window. &lt;br /&gt;
&lt;br /&gt;
Please send any questions or concerns to support@hpc.ucalgary.ca. Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = MARC Maintenance Complete&lt;br /&gt;
| date = 2025/06/17&lt;br /&gt;
| message = Filesystem maintenance complete.&lt;br /&gt;
&lt;br /&gt;
Please send any questions or concerns to support@hpc.ucalgary.ca.&lt;br /&gt;
}}&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=Think_Login_Node_Status&amp;diff=3783</id>
		<title>Think Login Node Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=Think_Login_Node_Status&amp;diff=3783"/>
		<updated>2025-04-29T20:36:40Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{ARC Cluster Status}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/09/03&lt;br /&gt;
| message =&lt;br /&gt;
The ARC Cluster and the Think login node are operational. No upcoming upgrades are planned.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/23&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable &lt;br /&gt;
between Sept 23 and Sept 27 inclusive (subject to change). Some Think GPU nodes will be affected during this maintenance window. These nodes will return to service as soon as the work is complete.  &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage will occur starting next Tuesday for the same nodes.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control, the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling Has Resumed&lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = wdfgpu[1-12] System Update Reboots &lt;br /&gt;
| date = 2024/12/02&lt;br /&gt;
| message = wdfgpu[1-12] will be rebooted briefly today to install important system updates and will return to service shortly. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday, January 20, 2025. Please make sure to save your work and log out before the reboot. Scheduling will be paused until the cluster is back; queued jobs will remain in the queue and will start scheduling when the cluster is ready. Thank you for your understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades starting 9AM Monday, January 20, 2025 through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported InfiniBand on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10 AM, Arc experienced an electrical power brownout. Some of the nodes (how many is unknown at this time) lost electrical power, causing the loss &lt;br /&gt;
of a number of running jobs.  &lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since Arc is shutting down for maintenance on Monday, Jan 20, replacement jobs will likely not start unless they request a time limit shorter than the time remaining until 8 AM Monday.  &lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC/Think cluster upgrade is complete.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported InfiniBand on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Please be advised that our support email address for RCS (support@hpc.ucalgary.ca) is currently not working. We are working to restore it as soon as possible; please keep an eye on this space for updates. The clusters are working normally, but support will not receive your messages at this time. We will begin responding as soon as the address is restored. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired, and RCS can be contacted there again. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Trend Micro Installation&lt;br /&gt;
| date = 2025/04/29&lt;br /&gt;
| message = To increase the security posture of the Arc cluster, administrators will be installing Trend Micro on cluster login nodes over the week starting Apr 30. Please report any issues to support@hpc.ucalgary.ca.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
[[Category:ARC]]&lt;br /&gt;
{{Navbox ARC}}&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=Altis_Login_Node_Status&amp;diff=3782</id>
		<title>Altis Login Node Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=Altis_Login_Node_Status&amp;diff=3782"/>
		<updated>2025-04-29T20:36:04Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{ARC Cluster Status}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/09/03&lt;br /&gt;
| message =&lt;br /&gt;
The ARC Cluster and the Altis login node are operational. No upcoming upgrades are planned.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/27&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable &lt;br /&gt;
between Sept 23 and Sept 27 inclusive (subject to change). Some Altis GPU nodes will be affected during this maintenance window. These nodes will return to service as soon as the work is complete.  &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage will occur starting next Tuesday for the same nodes.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control, the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling Has Resumed&lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = wdfgpu[1-12] System Update Reboots &lt;br /&gt;
| date = 2024/12/02&lt;br /&gt;
| message = wdfgpu[1-12] will be rebooted briefly today to install important system updates and will return to service shortly. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday, January 20, 2025. Please make sure to save your work and log out before the reboot. Scheduling will be paused until the cluster is back; queued jobs will remain in the queue and will start scheduling when the cluster is ready. Thank you for your understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades starting 9AM Monday, January 20, 2025 through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported InfiniBand on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10 AM, Altis experienced an electrical power brownout. Some of the nodes (how many is unknown at this time) lost electrical power, causing the loss &lt;br /&gt;
of a number of running jobs.  &lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since Altis is shutting down for maintenance on Monday, Jan 20, replacement jobs will likely not start unless they request a time limit shorter than the time remaining until 8 AM Monday.  &lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC/Altis cluster upgrade is complete.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported InfiniBand on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Please be advised that our support email address for RCS (support@hpc.ucalgary.ca) is currently not working. We are working to restore it as soon as possible; please keep an eye on this space for updates. The clusters are working normally, but support will not receive your messages at this time. We will begin responding as soon as the address is restored. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired, and RCS can be contacted there again. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Trend Micro Installation&lt;br /&gt;
| date = 2025/04/29&lt;br /&gt;
| message = To increase the security posture of the Arc cluster, administrators will be installing Trend Micro on cluster login nodes over the week starting Apr 30. Please report any issues to support@hpc.ucalgary.ca.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
[[Category:ARC]]&lt;br /&gt;
{{Navbox ARC}}&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=TALC_Cluster_Status&amp;diff=3781</id>
		<title>TALC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=TALC_Cluster_Status&amp;diff=3781"/>
		<updated>2025-04-29T20:35:15Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{TALC Cluster Status}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ May System Updates&lt;br /&gt;
| date = 2023/02/02&lt;br /&gt;
| message =&lt;br /&gt;
Beginning May 1, 2023, the TALC cluster will undergo operating system updates. The upgrade will happen after the end of term to minimize any disruption. Any existing jobs may be &lt;br /&gt;
temporarily held from scheduling. The upgrade is planned to be fully complete by May 5.&lt;br /&gt;
&lt;br /&gt;
The TALC login node will reboot on the morning of May 1.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = May System Updates Completed&lt;br /&gt;
| date = 2023/05/04&lt;br /&gt;
| message =&lt;br /&gt;
TALC upgrades have been completed. If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = TALC experienced a brief power outage around 11 AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting. Most jobs running at the time &lt;br /&gt;
were lost. Administrators are actively working on restarting compute &lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/06/26&lt;br /&gt;
| message = &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Please be advised that our support email address for RCS (support@hpc.ucalgary.ca) is currently not working. We are working to restore it as soon as possible; please keep an eye on this space for updates. The clusters are working normally, but support will not receive your messages at this time. We will begin responding as soon as the address is restored. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired, and RCS can be contacted there again. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Trend Micro Installation&lt;br /&gt;
| date = 2025/04/29&lt;br /&gt;
| message = To increase the security posture of the Arc cluster, administrators will be installing Trend Micro on cluster login nodes over the week starting Apr 30. Please report any issues to support@hpc.ucalgary.ca.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
[[Category:TALC]]&lt;br /&gt;
{{Navbox TALC}}&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=TALC_Cluster_Status&amp;diff=3780</id>
		<title>TALC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=TALC_Cluster_Status&amp;diff=3780"/>
		<updated>2025-04-29T20:09:20Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{TALC Cluster Status}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ May System Updates&lt;br /&gt;
| date = 2023/02/02&lt;br /&gt;
| message =&lt;br /&gt;
Beginning May 1, 2023, the TALC cluster will undergo operating system updates. The upgrade will happen after the end of term to minimize any disruption. Any existing jobs may be &lt;br /&gt;
temporarily held from scheduling. The upgrade is planned to be fully complete by May 5.&lt;br /&gt;
&lt;br /&gt;
The TALC login node will reboot on the morning of May 1.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = May System Updates Completed&lt;br /&gt;
| date = 2023/05/04&lt;br /&gt;
| message =&lt;br /&gt;
TALC upgrades have been completed. If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = TALC experienced a brief power outage around 11 AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting. Most jobs running at the time &lt;br /&gt;
were lost. Administrators are actively working on restarting compute &lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/06/26&lt;br /&gt;
| message = &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Please be advised that our support email address for RCS (support@hpc.ucalgary.ca) is currently not working. We are working to restore it as soon as possible; please keep an eye on this space for updates. The clusters are working normally, but support will not receive your messages at this time. We will begin responding as soon as the address is restored. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired, and RCS can be contacted there again. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Trend Micro Installation&lt;br /&gt;
| date = 2025/04/28&lt;br /&gt;
| message = To increase the security posture of the Arc cluster, administrators will be installing Trend Micro on cluster login nodes over the week starting Apr 30. Please report any issues to support@hpc.ucalgary.ca.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
[[Category:TALC]]&lt;br /&gt;
{{Navbox TALC}}&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3779</id>
		<title>ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3779"/>
		<updated>2025-04-29T20:08:14Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Cluster Status&lt;br /&gt;
| cluster = ARC&lt;br /&gt;
| status = green&lt;br /&gt;
| title = Cluster operational - Power Bump Jan 18&lt;br /&gt;
| message = System is operational. Updates are planned for Jan 20. Please see MOTD&lt;br /&gt;
&lt;br /&gt;
See the [[ARC Cluster Status]] page for system notices. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 16, 2023, the ARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The ARC login node will reboot on the morning of January 16. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 20.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following has been changed:&lt;br /&gt;
* OS Updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/02/28&lt;br /&gt;
| message =&lt;br /&gt;
We are currently investigating a filesystem issue that is causing filesystem slowdowns across ARC.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/03/01&lt;br /&gt;
| message =&lt;br /&gt;
We are still currently investigating a filesystem issue that is causing filesystem slowdowns across ARC. Some jobs on ARC have been paused to help us find the root cause of the slowdowns.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ARC Login node reboot&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
The ARC login node will be rebooted this afternoon for an emergency maintenance. This downtime is needed to help mitigate the filesystem slowdowns experienced on the login node.  Jobs will continue running and scheduling during this time.&lt;br /&gt;
&lt;br /&gt;
All logins to the ARC login node will be terminated at 3:00PM and will remain unavailable until 4:00PM.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Filesystem Issues&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
We are still currently investigating a filesystem issue that is causing filesystem slowdowns on specific nodes in our MSRDC location.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues Resolved&lt;br /&gt;
| date = 2023/03/10&lt;br /&gt;
| message =&lt;br /&gt;
We have upgraded the filesystem routers in our MSRDC location to address the performance issues.&lt;br /&gt;
&lt;br /&gt;
Please let us know if you experience any issues with the filesystem performance.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/05/01&lt;br /&gt;
| message =&lt;br /&gt;
On May 1, 2023, the ARC Open OnDemand node will be rebooted between 5PM and 6PM. Expected downtime will be approximately 15 minutes.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on ARC Login Node&lt;br /&gt;
| date = 2023/06/22&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may produce an error when&lt;br /&gt;
run on the ARC login node. If Apptainer complains that a system&lt;br /&gt;
administrator needs to enable user namespaces, simply run your&lt;br /&gt;
containers inside a job.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to a security vulnerability that will be&lt;br /&gt;
patched soon.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
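As a hedged illustration (not part of the original notice): a container that fails on the login node can be wrapped in a batch job as suggested above. The job name, partition, resources, and image path below are placeholders, not site-specific values.

```shell
#!/bin/bash
# Hypothetical Slurm job script: run an Apptainer container inside a job
# rather than on the login node, where user namespaces are restricted.
# Partition, time limit, memory, and image path are placeholders.
#SBATCH --job-name=apptainer-demo
#SBATCH --partition=single
#SBATCH --time=01:00:00
#SBATCH --mem=4G

# Inside a compute-node job the namespace restriction does not apply.
apptainer run my_container.sif
```

Submitted with `sbatch` in the usual way; this is a job-script fragment that requires a Slurm cluster to run.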
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Lattice, Single, cpu2013 partition changes&lt;br /&gt;
| date = 2023/07/13&lt;br /&gt;
| message =&lt;br /&gt;
The Lattice, Single, and cpu2013 partitions have all been decommissioned. The Single&lt;br /&gt;
partition will be replaced by the nodes formerly in the cpu2013 partition but&lt;br /&gt;
will keep the name single.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/10/17&lt;br /&gt;
| message =&lt;br /&gt;
Open OnDemand will be rebooted on October 17, 2023 for an update. It will be down for up to 30 minutes.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC/ARC clusters on &lt;br /&gt;
November 16 and 17, 2023. To facilitate this, we will be throttling &lt;br /&gt;
down the number of jobs on both clusters while the upgrades are &lt;br /&gt;
performed.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/05/03&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = ARC experienced a brief power outage around 11AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting. Most jobs running at the time &lt;br /&gt;
were lost. ARC administrators are actively working on restarting the compute &lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation&lt;br /&gt;
| date = 2024/06/03&lt;br /&gt;
| message = Job submissions targeted to the GPU a100 partition will be &lt;br /&gt;
affected by a temporary reservation on the nodes to accommodate the RCS&lt;br /&gt;
summer school class taking place on 2024/Jun/10. The reservation will end &lt;br /&gt;
shortly after. Please submit your jobs normally and the scheduler will &lt;br /&gt;
start them as soon as the nodes are available. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation Removed&lt;br /&gt;
| date = 2024/06/11&lt;br /&gt;
| message = GPU a100 Nodes in ARC have been returned to normal scheduling. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/23&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable &lt;br /&gt;
from Sept 23 to Sept 27 inclusive (subject to change). All compute nodes &lt;br /&gt;
in cpu2019, cpu2021/2, and gpu-v100, and most nodes from bigmem and gpu-a100, will be &lt;br /&gt;
affected. These nodes will return to service as soon as the work is complete.  &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to a hardware issue that is blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage will occur starting next Tuesday for the same nodes.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control, the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling has resumed. &lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The ARC login node will be rebooted on Tuesday December 17 for scheduled maintenance. It will be down for a few minutes and return shortly. Job scheduling and jobs running on the cluster will not be affected. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades starting 9AM Monday, January 20, 2025 through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported InfiniBand on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10AM, ARC experienced an electrical power brownout. Some percentage of the nodes (how many is unknown at this time) lost electrical power, causing the loss of a number of running jobs.  &lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since ARC is shutting down for maintenance on Monday, Jan 20, replacement jobs will likely not start unless they request a time limit shorter than the time remaining until 8AM Monday.  &lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC cluster upgrade is complete.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported InfiniBand on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal was upgraded.&lt;br /&gt;
&lt;br /&gt;
6. The Parallel partition was renamed to Legacy to reflect the lack of an interconnect for parallel MPI work, and was restricted to a maximum of 4-node jobs.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
Jan 23, 9:08AM&lt;br /&gt;
&lt;br /&gt;
Remount complete. ARC is back in full service.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgrade&lt;br /&gt;
| date = 2025/02/03&lt;br /&gt;
| message = Upgrade of the module command&lt;br /&gt;
&lt;br /&gt;
On Tuesday, February 11, 2025, the module command will be upgraded to a new version on ARC. This should result in new capabilities and a slightly different visual experience when using the module command. Loading modules is not expected to change.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgraded&lt;br /&gt;
| date = 2025/02/12&lt;br /&gt;
| message = The module command was upgraded&lt;br /&gt;
&lt;br /&gt;
On Tuesday, February 12, 2025, the module command was upgraded to a new version on ARC. This should result in new capabilities and a slightly different visual experience when using the module command. Loading modules should not have changed.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Please be advised that our support email address (support@hpc.ucalgary.ca) for RCS is currently not working. We are working to restore it as soon as possible; please keep an eye on this space for updates. The clusters are working normally, but support will not receive your messages at this time. We will begin responding as soon as the address is restored. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been repaired and RCS can be contacted there. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit will be Enforced&lt;br /&gt;
| date = 2025/04/11&lt;br /&gt;
| message = In order to improve the scheduling and job throughput efficiency of ARC, interactive jobs will be limited to a maximum of 5 hours of runtime. Interactive jobs submitted with a time limit over 5 hours will be rejected at submission time. This change will be made on Monday, April 28, 2025.&lt;br /&gt;
}}&lt;br /&gt;
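For illustration only (commands assumed, not part of the original notice): once the limit is enforced, an interactive request must carry a time limit of 5 hours or less. `salloc` is the standard Slurm command for interactive allocations; the resource values are placeholders.

```shell
# Hypothetical interactive requests under the new 5-hour cap.
# Resource values are placeholders for your own allocation.

# Within the limit: 5 hours or less, accepted by the scheduler
salloc --time=05:00:00 --ntasks=1 --mem=2G

# Over the limit: would be rejected at submission time
salloc --time=08:00:00 --ntasks=1 --mem=2G
```

These are cluster commands and require a Slurm environment to run.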
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit Is Now Enforced&lt;br /&gt;
| date = 2025/04/28&lt;br /&gt;
| message = In order to improve the scheduling and job throughput efficiency of ARC, interactive jobs are now limited to a maximum of 5 hours of runtime. Interactive jobs submitted with a time limit over 5 hours will be rejected at submission time.&lt;br /&gt;
&lt;br /&gt;
Apr 29, 2025&lt;br /&gt;
To increase the security posture of the ARC cluster, administrators will be installing Trend Micro on the cluster login nodes over the week starting April 30. Please report any inconsistencies to support@hpc.ucalgary.ca.&lt;br /&gt;
}}&lt;br /&gt;
{{Navbox ARC}}&lt;br /&gt;
[[Category:ARC]]&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3778</id>
		<title>ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3778"/>
		<updated>2025-04-29T03:22:38Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Cluster Status&lt;br /&gt;
| cluster = ARC&lt;br /&gt;
| status = green&lt;br /&gt;
| title = Cluster operational - Power Bump Jan 18&lt;br /&gt;
| message = System is operational. Updates are planned for Jan 20. Please see MOTD&lt;br /&gt;
&lt;br /&gt;
See the [[ARC Cluster Status]] page for system notices. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 16, 2023, the ARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The ARC login node will reboot on the morning of January 16. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 20.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following has been changed:&lt;br /&gt;
* OS Updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/02/28&lt;br /&gt;
| message =&lt;br /&gt;
We are currently investigating a filesystem issue that is causing filesystem slowdowns across ARC.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/03/01&lt;br /&gt;
| message =&lt;br /&gt;
We are still currently investigating a filesystem issue that is causing filesystem slowdowns across ARC. Some jobs on ARC have been paused to help us find the root cause of the slowdowns.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ARC Login node reboot&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
The ARC login node will be rebooted this afternoon for an emergency maintenance. This downtime is needed to help mitigate the filesystem slowdowns experienced on the login node.  Jobs will continue running and scheduling during this time.&lt;br /&gt;
&lt;br /&gt;
All logins to the ARC login node will be terminated at 3:00PM and will remain unavailable until 4:00PM.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Filesystem Issues&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
We are still currently investigating a filesystem issue that is causing filesystem slowdowns on specific nodes in our MSRDC location.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues Resolved&lt;br /&gt;
| date = 2023/03/10&lt;br /&gt;
| message =&lt;br /&gt;
We have upgraded the filesystem routers in our MSRDC location to address the performance issues.&lt;br /&gt;
&lt;br /&gt;
Please let us know if you experience any issues with the filesystem performance.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/05/01&lt;br /&gt;
| message =&lt;br /&gt;
On May 1, 2023, the ARC Open OnDemand node will be rebooted between 5PM and 6PM. Expected downtime will be approximately 15 minutes.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on ARC Login Node&lt;br /&gt;
| date = 2023/06/22&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may produce an error when&lt;br /&gt;
run on the ARC login node. If Apptainer complains that a system&lt;br /&gt;
administrator needs to enable user namespaces, simply run your&lt;br /&gt;
containers inside a job.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to a security vulnerability that will be&lt;br /&gt;
patched soon.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Lattice, Single, cpu2013 partition changes&lt;br /&gt;
| date = 2023/07/13&lt;br /&gt;
| message =&lt;br /&gt;
The Lattice, Single, and cpu2013 partitions have all been decommissioned. The Single&lt;br /&gt;
partition will be replaced by the nodes formerly in the cpu2013 partition but&lt;br /&gt;
will keep the name single.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/10/17&lt;br /&gt;
| message =&lt;br /&gt;
Open OnDemand will be rebooted on October 17, 2023 for an update. It will be down for up to 30 minutes.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC/ARC clusters on &lt;br /&gt;
November 16 and 17, 2023. To facilitate this, we will be throttling &lt;br /&gt;
down the number of jobs on both clusters while the upgrades are &lt;br /&gt;
performed.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/05/03&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = ARC experienced a brief power outage around 11AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting. Most jobs running at the time &lt;br /&gt;
were lost. ARC administrators are actively working on restarting the compute &lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation&lt;br /&gt;
| date = 2024/06/03&lt;br /&gt;
| message = Job submissions targeted to the GPU a100 partition will be &lt;br /&gt;
affected by a temporary reservation on the nodes to accommodate the RCS&lt;br /&gt;
summer school class taking place on 2024/Jun/10. The reservation will end &lt;br /&gt;
shortly after. Please submit your jobs normally and the scheduler will &lt;br /&gt;
start them as soon as the nodes are available. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation Removed&lt;br /&gt;
| date = 2024/06/11&lt;br /&gt;
| message = GPU a100 Nodes in ARC have been returned to normal scheduling. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/23&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable &lt;br /&gt;
from Sept 23 to Sept 27 inclusive (subject to change). All compute nodes &lt;br /&gt;
in cpu2019, cpu2021/2, and gpu-v100, and most nodes from bigmem and gpu-a100, will be &lt;br /&gt;
affected. These nodes will return to service as soon as the work is complete.  &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to a hardware issue that is blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage will occur starting next Tuesday for the same nodes.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control, the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling has resumed. &lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The ARC login node will be rebooted on Tuesday December 17 for scheduled maintenance. It will be down for a few minutes and return shortly. Job scheduling and jobs running on the cluster will not be affected. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades from 9AM Monday, January 20, 2025 through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported InfiniBand on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10AM, ARC experienced an electrical power brownout. Some of the nodes (how many is unknown at this time) lost electrical power, causing the loss of a number of running jobs.&lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since ARC is shutting down for maintenance on Monday, January 20, replacement jobs will likely not start unless they request a time limit shorter than the time remaining until 8AM Monday.&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC cluster upgrade is complete.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported InfiniBand on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal was upgraded.&lt;br /&gt;
&lt;br /&gt;
6. The Parallel partition was renamed to Legacy to reflect the lack of a high-speed interconnect for parallel MPI work, and is now restricted to jobs of at most 4 nodes.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
Jan 23, 9:08AM&lt;br /&gt;
&lt;br /&gt;
Remount complete. ARC is back in full service.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgrade&lt;br /&gt;
| date = 2025/02/03&lt;br /&gt;
| message = Upgrade of the module command&lt;br /&gt;
&lt;br /&gt;
On Tuesday, February 11, 2025, the module command will be upgraded to a new version on ARC. This should result in new capabilities and a slightly different visual experience when using the module command. Loading modules is not expected to change.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgraded&lt;br /&gt;
| date = 2025/02/12&lt;br /&gt;
| message = The module command was upgraded&lt;br /&gt;
&lt;br /&gt;
On February 12, 2025, the module command was upgraded to a new version on ARC. This should result in new capabilities and a slightly different visual experience when using the module command. Loading modules should not have changed.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Please be advised that the RCS support email address (support@hpc.ucalgary.ca) is currently not working. We are working to restore it as soon as possible; please keep an eye on this space for updates. The clusters are operating normally, but support will not receive your messages at this time. We will begin responding as soon as service is restored. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been restored and RCS can be contacted there. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit will be Enforced&lt;br /&gt;
| date = 2025/04/11&lt;br /&gt;
| message = To improve scheduling and job throughput efficiency on ARC, interactive jobs will be limited to a maximum of 5 hours of runtime. Interactive jobs submitted with a time limit over 5 hours will be rejected at submission time. This change will take effect on Monday, April 28, 2025.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit Is Now Enforced&lt;br /&gt;
| date = 2025/04/28&lt;br /&gt;
| message = To improve scheduling and job throughput efficiency on ARC, interactive jobs are now limited to a maximum of 5 hours of runtime. Interactive jobs submitted with a time limit over 5 hours will be rejected at submission time.&lt;br /&gt;
}}&lt;br /&gt;
{{Navbox ARC}}&lt;br /&gt;
[[Category:ARC]]&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3772</id>
		<title>ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3772"/>
		<updated>2025-04-11T16:57:44Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Cluster Status&lt;br /&gt;
| cluster = ARC&lt;br /&gt;
| status = green&lt;br /&gt;
| title = Cluster operational - Power Bump Jan 18&lt;br /&gt;
| message = System is operational. Updates are planned for Jan 20. Please see MOTD&lt;br /&gt;
&lt;br /&gt;
See the [[ARC Cluster Status]] page for system notices. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 16, 2023, the ARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The ARC login node will reboot on the morning of January 16. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 20.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following has been changed:&lt;br /&gt;
* OS Updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/02/28&lt;br /&gt;
| message =&lt;br /&gt;
We are currently investigating a filesystem issue that is causing filesystem slowdowns across ARC.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/03/1&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing filesystem slowdowns across ARC. Some jobs on ARC have been paused to help us find the root cause of the slowdowns.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ARC Login node reboot&lt;br /&gt;
| date = 2023/03/2&lt;br /&gt;
| message =&lt;br /&gt;
The ARC login node will be rebooted this afternoon for emergency maintenance. This downtime is needed to help mitigate the filesystem slowdowns experienced on the login node. Jobs will continue running and scheduling during this time.&lt;br /&gt;
&lt;br /&gt;
All logins to the ARC login node will be terminated at 3:00PM and will remain unavailable until 4:00PM.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Filesystem Issues&lt;br /&gt;
| date = 2023/03/2&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing filesystem slowdowns on specific nodes in our MSRDC location.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues Resolved&lt;br /&gt;
| date = 2023/03/10&lt;br /&gt;
| message =&lt;br /&gt;
We have upgraded the filesystem routers in our MSRDC location to address the performance issues.&lt;br /&gt;
&lt;br /&gt;
Please let us know if you experience any issues with the filesystem performance.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/05/01&lt;br /&gt;
| message =&lt;br /&gt;
On May 1, 2023, the ARC Open OnDemand node will be rebooted between 5PM and 6PM. Expected downtime will be approximately 15 minutes.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on ARC Login Node&lt;br /&gt;
| date = 2023/06/22&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may experience an error when&lt;br /&gt;
running on the Arc login node. If apptainer complains that a system&lt;br /&gt;
administrator needs to enable user namespaces, simply run your&lt;br /&gt;
containers inside a job.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to a security vulnerability that will be patched soon.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Lattice, Single, cpu2013 partition changes&lt;br /&gt;
| date = 2023/07/13&lt;br /&gt;
| message =&lt;br /&gt;
The Lattice, Single, and cpu2013 partitions have all been decommissioned. The Single partition will be replaced by the nodes formerly in the cpu2013 partition, but will still be called single.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/10/17&lt;br /&gt;
| message =&lt;br /&gt;
Open OnDemand will be rebooted on October 17, 2023 for an update. It will be down for up to 30 minutes.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC/ARC cluster on November 16 and 17, 2023. To facilitate this, we will be throttling down the number of jobs on both clusters while the upgrades are performed.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/05/3&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = ARC experienced a brief power outage around 11AM on May 7, 2024. Most compute nodes have rebooted or are rebooting. Most jobs running at the time were lost. ARC administrators are actively working on restarting compute nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation&lt;br /&gt;
| date = 2024/06/03&lt;br /&gt;
| message = Job submissions targeted to the GPU a100 partition will be affected by a temporary reservation on the nodes to accommodate the RCS summer school class taking place on 2024/Jun/10. The reservation will end shortly afterward. Please submit your jobs normally and the scheduler will start them as soon as the nodes are available. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation Removed&lt;br /&gt;
| date = 2024/06/11&lt;br /&gt;
| message = GPU a100 Nodes in ARC have been returned to normal scheduling. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/23&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable between Sept 23 and Sept 27 inclusive (subject to change). All compute nodes in cpu2019, cpu2021/2, and gpu-v100, plus most nodes from bigmem and gpu-a100, will be affected. These nodes will return to service as soon as the work is complete.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues that are blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage will occur again starting next Tuesday for the same nodes.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling has resumed. &lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The ARC login node will be rebooted on Tuesday December 17 for scheduled maintenance. It will be down for a few minutes and return shortly. Job scheduling and jobs running on the cluster will not be affected. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades from 9AM Monday, January 20, 2025 through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported InfiniBand on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10AM, ARC experienced an electrical power brownout. Some of the nodes (how many is unknown at this time) lost electrical power, causing the loss of a number of running jobs.&lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since ARC is shutting down for maintenance on Monday, January 20, replacement jobs will likely not start unless they request a time limit shorter than the time remaining until 8AM Monday.&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC cluster upgrade is complete.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported InfiniBand on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal was upgraded.&lt;br /&gt;
&lt;br /&gt;
6. The Parallel partition was renamed to Legacy to reflect the lack of a high-speed interconnect for parallel MPI work, and is now restricted to jobs of at most 4 nodes.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
Jan 23, 9:08AM&lt;br /&gt;
&lt;br /&gt;
Remount complete. ARC is back in full service.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgrade&lt;br /&gt;
| date = 2025/02/03&lt;br /&gt;
| message = Upgrade of the module command&lt;br /&gt;
&lt;br /&gt;
On Tuesday, February 11, 2025, the module command will be upgraded to a new version on ARC. This should result in new capabilities and a slightly different visual experience when using the module command. Loading modules is not expected to change.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgraded&lt;br /&gt;
| date = 2025/02/12&lt;br /&gt;
| message = The module command was upgraded&lt;br /&gt;
&lt;br /&gt;
On February 12, 2025, the module command was upgraded to a new version on ARC. This should result in new capabilities and a slightly different visual experience when using the module command. Loading modules should not have changed.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address down&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca Unavailable&lt;br /&gt;
&lt;br /&gt;
Please be advised that the RCS support email address (support@hpc.ucalgary.ca) is currently not working. We are working to restore it as soon as possible; please keep an eye on this space for updates. The clusters are operating normally, but support will not receive your messages at this time. We will begin responding as soon as service is restored. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Support email address functional&lt;br /&gt;
| date = 2025/03/07&lt;br /&gt;
| message = support@hpc.ucalgary.ca is back&lt;br /&gt;
&lt;br /&gt;
support@hpc.ucalgary.ca has been restored and RCS can be contacted there. If you reached out for assistance in recent days without a response, please follow up, as we may not have received your initial email. &lt;br /&gt;
&lt;br /&gt;
Apologies for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Interactive Job Timelimit will be Enforced&lt;br /&gt;
| date = 2025/04/11&lt;br /&gt;
| message = To improve scheduling and job throughput efficiency on ARC, interactive jobs will be limited to a maximum of 5 hours of runtime. Interactive jobs submitted with a time limit over 5 hours will be rejected at submission time. This change will take effect on Monday, April 28, 2025.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Navbox ARC}}&lt;br /&gt;
[[Category:ARC]]&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3728</id>
		<title>ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3728"/>
		<updated>2025-02-12T17:54:13Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Cluster Status&lt;br /&gt;
| cluster = ARC&lt;br /&gt;
| status = green&lt;br /&gt;
| title = Cluster operational - Power Bump Jan 18&lt;br /&gt;
| message = System is operational. Updates are planned for Jan 20. Please see MOTD&lt;br /&gt;
&lt;br /&gt;
See the [[ARC Cluster Status]] page for system notices. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 16, 2023, the ARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The ARC login node will reboot on the morning of January 16. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 20.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following has been changed:&lt;br /&gt;
* OS Updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/02/28&lt;br /&gt;
| message =&lt;br /&gt;
We are currently investigating a filesystem issue that is causing filesystem slowdowns across ARC.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/03/1&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing filesystem slowdowns across ARC. Some jobs on ARC have been paused to help us find the root cause of the slowdowns.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ARC Login node reboot&lt;br /&gt;
| date = 2023/03/2&lt;br /&gt;
| message =&lt;br /&gt;
The ARC login node will be rebooted this afternoon for emergency maintenance. This downtime is needed to help mitigate the filesystem slowdowns experienced on the login node. Jobs will continue running and scheduling during this time.&lt;br /&gt;
&lt;br /&gt;
All logins to the ARC login node will be terminated at 3:00PM and will remain unavailable until 4:00PM.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Filesystem Issues&lt;br /&gt;
| date = 2023/03/2&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing filesystem slowdowns on specific nodes in our MSRDC location.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues Resolved&lt;br /&gt;
| date = 2023/03/10&lt;br /&gt;
| message =&lt;br /&gt;
We have upgraded the filesystem routers in our MSRDC location to address the performance issues.&lt;br /&gt;
&lt;br /&gt;
Please let us know if you experience any issues with the filesystem performance.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/05/01&lt;br /&gt;
| message =&lt;br /&gt;
On May 1, 2023, the ARC Open OnDemand node will be rebooted between 5PM and 6PM. Expected downtime will be approximately 15 minutes.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on ARC Login Node&lt;br /&gt;
| date = 2023/06/22&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may encounter an error when&lt;br /&gt;
running on the ARC login node. If Apptainer complains that a system&lt;br /&gt;
administrator needs to enable user namespaces, simply run your&lt;br /&gt;
containers inside a job.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to a security vulnerability that will be&lt;br /&gt;
patched soon.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
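The workaround above can be sketched as a batch job that wraps the container. This is a minimal illustration only: the partition name, resource requests, and image path `mycontainer.sif` are hypothetical and should be replaced with your own values.

```shell
# Write a minimal Slurm batch script that runs an Apptainer container
# on a compute node, where user namespaces are available, instead of
# on the login node. All resource values below are examples only.
cat > apptainer_job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=single
#SBATCH --time=01:00:00
#SBATCH --mem=4G

# Inside a job, user namespaces work, so the container starts normally.
apptainer exec mycontainer.sif echo "container ran on $(hostname)"
EOF

# Submit from the login node with: sbatch apptainer_job.sh
echo "wrote apptainer_job.sh"
```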
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Lattice, Single, cpu2013 partition changes&lt;br /&gt;
| date = 2023/07/13&lt;br /&gt;
| message =&lt;br /&gt;
The Lattice, Single, and cpu2013 partitions have all been decommissioned.  The Single&lt;br /&gt;
partition will be replaced by the nodes formerly in the cpu2013 partition, but&lt;br /&gt;
will still be called single.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/10/17&lt;br /&gt;
| message =&lt;br /&gt;
Open OnDemand will be rebooted on October 17, 2023 for an update. It will be down for up to 30 minutes.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC and ARC clusters on &lt;br /&gt;
November 16 and 17, 2023. To facilitate this, we will be throttling &lt;br /&gt;
down the number of jobs on both clusters while the upgrades are &lt;br /&gt;
performed.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/05/03&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = ARC experienced a brief power outage around 11AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting.  Most jobs running at the time &lt;br /&gt;
were lost. ARC administrators are actively working on restarting compute &lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation&lt;br /&gt;
| date = 2024/06/03&lt;br /&gt;
| message = Job submissions targeted to the GPU a100 partition will be &lt;br /&gt;
affected by a temporary reservation on the nodes to accommodate the RCS&lt;br /&gt;
summer school class taking place on 2024/Jun/10. The reservation will end &lt;br /&gt;
shortly after. Please submit your jobs normally and the scheduler will &lt;br /&gt;
start them as soon as the nodes are available. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation Removed&lt;br /&gt;
| date = 2024/06/11&lt;br /&gt;
| message = GPU a100 Nodes in ARC have been returned to normal scheduling. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/23&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable &lt;br /&gt;
from Sept 23 to Sept 27 inclusive (subject to change). All compute nodes &lt;br /&gt;
in cpu2019, cpu2021/2, and gpu-v100, plus most nodes from bigmem and gpu-a100, will be &lt;br /&gt;
affected. These nodes will return to service as soon as the work is complete.  &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues that are blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage for the same nodes will begin next Tuesday.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling has resumed. &lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The ARC login node will be rebooted on Tuesday, December 17 for scheduled maintenance. It will be down for only a few minutes. Job scheduling and jobs running on the cluster will not be affected. Thank you for your understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades starting 9AM Monday, January 20, 2025 through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported InfiniBand interconnect on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will see increased latency going forward. If you run multi-node jobs, make sure to use a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10AM, ARC experienced an electrical power brownout.  Some of the nodes (how many is unknown at this time) lost electrical power, causing the loss of a number of running jobs.  &lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since ARC is shutting down for maintenance on Monday, Jan 20, replacement jobs will likely not start unless they request a time limit shorter than the time remaining until 8AM Monday.  &lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
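The time-limit constraint described in the notice above can be expressed directly at submission time. A minimal sketch, assuming a hypothetical job script and illustrative resource values:

```shell
# Before a scheduled shutdown, the scheduler only starts jobs that can
# finish beforehand, so request a wall time shorter than the time
# remaining until the outage. All values below are examples only.
cat > short_job.sh <<'EOF'
#!/bin/bash
#SBATCH --time=06:00:00   # must be less than the time until 8AM Monday
#SBATCH --mem=2G
./my_analysis             # hypothetical workload
EOF

# Submit with: sbatch short_job.sh
# The limit can also be given on the command line, overriding the script:
#   sbatch --time=06:00:00 short_job.sh
echo "wrote short_job.sh"
```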
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC cluster upgrade is complete.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported InfiniBand interconnect on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will see increased latency going forward. If you run multi-node jobs, make sure to use a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal was upgraded.&lt;br /&gt;
&lt;br /&gt;
6. The Parallel partition was renamed to Legacy, to reflect the lack of an interconnect for parallel MPI work, and was restricted to jobs of at most 4 nodes.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
Jan 23, 9:08AM&lt;br /&gt;
&lt;br /&gt;
Remount complete.  ARC is back in full service.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgrade&lt;br /&gt;
| date = 2025/02/03&lt;br /&gt;
| message = Upgrade of the module command&lt;br /&gt;
&lt;br /&gt;
On Tuesday, February 11, 2025, the module command will be upgraded to a new version on ARC. This should result in new capabilities and a slightly different visual experience when using the module command.  Loading modules is not expected to change.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgraded&lt;br /&gt;
| date = 2025/02/12&lt;br /&gt;
| message = The module command was upgraded&lt;br /&gt;
&lt;br /&gt;
On February 12, 2025, the module command was upgraded to a new version on ARC. This should result in new capabilities and a slightly different visual experience when using the module command.  Loading modules should not have changed.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Navbox ARC}}&lt;br /&gt;
[[Category:ARC]]&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3718</id>
		<title>ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3718"/>
		<updated>2025-02-03T20:06:49Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Cluster Status&lt;br /&gt;
| cluster = ARC&lt;br /&gt;
| status = green&lt;br /&gt;
| title = Cluster operational - Power Bump Jan 18&lt;br /&gt;
| message = System is operational. Updates are planned for Jan 20. Please see MOTD&lt;br /&gt;
&lt;br /&gt;
See the [[ARC Cluster Status]] page for system notices. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 16, 2023, the ARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The ARC login node will reboot on the morning of January 16. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 20.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following has been changed:&lt;br /&gt;
* OS Updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/02/28&lt;br /&gt;
| message =&lt;br /&gt;
We are currently investigating a filesystem issue that is causing slowdowns across ARC.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/03/01&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing slowdowns across ARC. Some jobs on ARC have been paused to help us find the root cause of the slowdowns.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ARC Login node reboot&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
The ARC login node will be rebooted this afternoon for an emergency maintenance. This downtime is needed to help mitigate the filesystem slowdowns experienced on the login node.  Jobs will continue running and scheduling during this time.&lt;br /&gt;
&lt;br /&gt;
All logins to the ARC login node will be terminated at 3:00PM and will remain unavailable until 4:00PM.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Filesystem Issues&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing slowdowns on specific nodes in our MSRDC location.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues Resolved&lt;br /&gt;
| date = 2023/03/10&lt;br /&gt;
| message =&lt;br /&gt;
We have upgraded the filesystem routers in our MSRDC location to address the performance issues.&lt;br /&gt;
&lt;br /&gt;
Please let us know if you experience any issues with the filesystem performance.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/05/01&lt;br /&gt;
| message =&lt;br /&gt;
On May 1, 2023, the ARC Open OnDemand node will be rebooted between 5PM and 6PM. Expected downtime will be approximately 15 minutes.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on ARC Login Node&lt;br /&gt;
| date = 2023/06/22&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may encounter an error when&lt;br /&gt;
running on the ARC login node. If Apptainer complains that a system&lt;br /&gt;
administrator needs to enable user namespaces, simply run your&lt;br /&gt;
containers inside a job.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to a security vulnerability that will be&lt;br /&gt;
patched soon.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Lattice, Single, cpu2013 partition changes&lt;br /&gt;
| date = 2023/07/13&lt;br /&gt;
| message =&lt;br /&gt;
The Lattice, Single, and cpu2013 partitions have all been decommissioned.  The Single&lt;br /&gt;
partition will be replaced by the nodes formerly in the cpu2013 partition, but&lt;br /&gt;
will still be called single.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/10/17&lt;br /&gt;
| message =&lt;br /&gt;
Open OnDemand will be rebooted on October 17, 2023 for an update. It will be down for up to 30 minutes.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC and ARC clusters on &lt;br /&gt;
November 16 and 17, 2023. To facilitate this, we will be throttling &lt;br /&gt;
down the number of jobs on both clusters while the upgrades are &lt;br /&gt;
performed.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/05/03&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = ARC experienced a brief power outage around 11AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting.  Most jobs running at the time &lt;br /&gt;
were lost. ARC administrators are actively working on restarting compute &lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation&lt;br /&gt;
| date = 2024/06/03&lt;br /&gt;
| message = Job submissions targeted to the GPU a100 partition will be &lt;br /&gt;
affected by a temporary reservation on the nodes to accommodate the RCS&lt;br /&gt;
summer school class taking place on 2024/Jun/10. The reservation will end &lt;br /&gt;
shortly after. Please submit your jobs normally and the scheduler will &lt;br /&gt;
start them as soon as the nodes are available. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation Removed&lt;br /&gt;
| date = 2024/06/11&lt;br /&gt;
| message = GPU a100 Nodes in ARC have been returned to normal scheduling. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/23&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable &lt;br /&gt;
from Sept 23 to Sept 27 inclusive (subject to change). All compute nodes &lt;br /&gt;
in cpu2019, cpu2021/2, and gpu-v100, plus most nodes from bigmem and gpu-a100, will be &lt;br /&gt;
affected. These nodes will return to service as soon as the work is complete.  &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues that are blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage for the same nodes will begin next Tuesday.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling has resumed. &lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The ARC login node will be rebooted on Tuesday, December 17 for scheduled maintenance. It will be down for only a few minutes. Job scheduling and jobs running on the cluster will not be affected. Thank you for your understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades starting 9AM Monday, January 20, 2025 through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported InfiniBand interconnect on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will see increased latency going forward. If you run multi-node jobs, make sure to use a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10AM, ARC experienced an electrical power brownout.  Some of the nodes (how many is unknown at this time) lost electrical power, causing the loss of a number of running jobs.  &lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since ARC is shutting down for maintenance on Monday, Jan 20, replacement jobs will likely not start unless they request a time limit shorter than the time remaining until 8AM Monday.  &lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC cluster upgrade is complete.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported InfiniBand interconnect on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will see increased latency going forward. If you run multi-node jobs, make sure to use a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal was upgraded.&lt;br /&gt;
&lt;br /&gt;
6. The Parallel partition was renamed to Legacy, to reflect the lack of an interconnect for parallel MPI work, and was restricted to jobs of at most 4 nodes.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
Jan 23, 9:08AM&lt;br /&gt;
&lt;br /&gt;
Remount complete.  ARC is back in full service.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgrade&lt;br /&gt;
| date = 2025/02/03&lt;br /&gt;
| message = Upgrade of the module command&lt;br /&gt;
&lt;br /&gt;
On Tuesday, February 11, 2025, the module command will be upgraded to a new version on ARC. This should result in new capabilities and a slightly different visual experience when using the module command.  Loading modules is not expected to change.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Navbox ARC}}&lt;br /&gt;
[[Category:ARC]]&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3717</id>
		<title>ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3717"/>
		<updated>2025-02-03T20:06:15Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Cluster Status&lt;br /&gt;
| cluster = ARC&lt;br /&gt;
| status = green&lt;br /&gt;
| title = Cluster operational - Power Bump Jan 18&lt;br /&gt;
| message = System is operational. Updates are planned for Jan 20. Please see MOTD&lt;br /&gt;
&lt;br /&gt;
See the [[ARC Cluster Status]] page for system notices. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 16, 2023, the ARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The ARC login node will reboot on the morning of January 16. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 20.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following has been changed:&lt;br /&gt;
* OS Updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/02/28&lt;br /&gt;
| message =&lt;br /&gt;
We are currently investigating a filesystem issue that is causing slowdowns across ARC.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/03/01&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing slowdowns across ARC. Some jobs on ARC have been paused to help us find the root cause of the slowdowns.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ARC Login node reboot&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
The ARC login node will be rebooted this afternoon for emergency maintenance. This downtime is needed to help mitigate the filesystem slowdowns experienced on the login node. Jobs will continue running and scheduling during this time.&lt;br /&gt;
&lt;br /&gt;
All sessions on the ARC login node will be terminated at 3:00PM, and the node will remain unavailable until 4:00PM.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Filesystem Issues&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing slowdowns on specific nodes in our MSRDC location.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues Resolved&lt;br /&gt;
| date = 2023/03/10&lt;br /&gt;
| message =&lt;br /&gt;
We have upgraded the filesystem routers in our MSRDC location to address the performance issues.&lt;br /&gt;
&lt;br /&gt;
Please let us know if you experience any issues with the filesystem performance.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/05/01&lt;br /&gt;
| message =&lt;br /&gt;
On May 1, 2023, the ARC Open OnDemand node will be rebooted between 5PM and 6PM. Expected downtime will be approximately 15 minutes.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on ARC Login Node&lt;br /&gt;
| date = 2023/06/22&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may experience an error when&lt;br /&gt;
running on the ARC login node. If apptainer complains that a system&lt;br /&gt;
administrator needs to enable user namespaces, simply run your&lt;br /&gt;
containers inside a job.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to a security vulnerability that will be&lt;br /&gt;
patched soon.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Lattice, Single, cpu2013 partition changes&lt;br /&gt;
| date = 2023/07/13&lt;br /&gt;
| message =&lt;br /&gt;
The Lattice, Single, and cpu2013 partitions have all been decommissioned. The Single&lt;br /&gt;
partition will be rebuilt from the nodes formerly in the cpu2013 partition and&lt;br /&gt;
will keep the name single.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/10/17&lt;br /&gt;
| message =&lt;br /&gt;
Open OnDemand will be rebooted on October 17, 2023 for an update. It will be down for up to 30 minutes.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC/ARC cluster on &lt;br /&gt;
November 16 and 17, 2023. To facilitate this, we will be throttling &lt;br /&gt;
down the number of jobs on both clusters while the upgrades are &lt;br /&gt;
performed.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/05/03&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = ARC experienced a brief power outage around 11AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting. Most jobs running at the time &lt;br /&gt;
were lost. ARC administrators are actively working on restarting compute &lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation&lt;br /&gt;
| date = 2024/06/03&lt;br /&gt;
| message = Job submissions targeted to the GPU a100 partition will be &lt;br /&gt;
affected by a temporary reservation on the nodes to accommodate the RCS&lt;br /&gt;
summer school class taking place on 2024/Jun/10. The reservation will end &lt;br /&gt;
shortly afterwards. Please submit your jobs normally and the scheduler will &lt;br /&gt;
start them as soon as the nodes are available. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation Removed&lt;br /&gt;
| date = 2024/06/11&lt;br /&gt;
| message = GPU a100 Nodes in ARC have been returned to normal scheduling. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/23&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable &lt;br /&gt;
between Sept 23 and Sept 27 inclusive (subject to change). All compute nodes &lt;br /&gt;
in cpu2019, cpu2021/2, and gpu-v100, and most nodes from bigmem and gpu-a100, &lt;br /&gt;
will be affected. These nodes will return to service as soon as the work is complete.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues that are blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage will occur starting next Tuesday for the same nodes.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling has resumed. &lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The ARC login node will be rebooted on Tuesday December 17 for scheduled maintenance. It will be down for a few minutes and return shortly. Job scheduling and jobs running on the cluster will not be affected. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades starting 9AM Monday, January 20, 2025 through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported InfiniBand on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10AM, ARC experienced an electrical power brownout. An as-yet-unknown number of nodes lost electrical power during this time, causing the loss of a number of running jobs.&lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since ARC is shutting down for maintenance on Monday, Jan 20, replacement jobs will likely not start unless they request a time limit shorter than the time remaining until 8AM Monday.&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC cluster upgrade is complete.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported InfiniBand on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal was upgraded.&lt;br /&gt;
&lt;br /&gt;
6. The Parallel partition was renamed to Legacy to reflect the lack of a high-speed interconnect for parallel MPI work and was restricted to a maximum of 4-node jobs.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
Jan 23, 9:08AM&lt;br /&gt;
&lt;br /&gt;
Remount complete. ARC is back in full service.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Module Command Upgrade&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = Upgrade of the module command&lt;br /&gt;
&lt;br /&gt;
On Tuesday, February 11, 2025, the module command will be upgraded to a new version on ARC. This should result in new capabilities and a slightly different visual experience when using the module command. Loading modules is not expected to change.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Navbox ARC}}&lt;br /&gt;
[[Category:ARC]]&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3707</id>
		<title>ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3707"/>
		<updated>2025-01-23T16:08:41Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Cluster Status&lt;br /&gt;
| cluster = ARC&lt;br /&gt;
| status = green&lt;br /&gt;
| title = Cluster operational - Power Bump Jan 18&lt;br /&gt;
| message = System is operational. Updates are planned for Jan 20. Please see MOTD&lt;br /&gt;
&lt;br /&gt;
See the [[ARC Cluster Status]] page for system notices. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 16, 2023, the ARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The ARC login node will reboot on the morning of January 16. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 20.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following has been changed:&lt;br /&gt;
* OS Updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/02/28&lt;br /&gt;
| message =&lt;br /&gt;
We are currently investigating a filesystem issue that is causing slowdowns across ARC.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/03/01&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing slowdowns across ARC. Some jobs on ARC have been paused to help us find the root cause of the slowdowns.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ARC Login node reboot&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
The ARC login node will be rebooted this afternoon for emergency maintenance. This downtime is needed to help mitigate the filesystem slowdowns experienced on the login node. Jobs will continue running and scheduling during this time.&lt;br /&gt;
&lt;br /&gt;
All sessions on the ARC login node will be terminated at 3:00PM, and the node will remain unavailable until 4:00PM.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Filesystem Issues&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing slowdowns on specific nodes in our MSRDC location.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues Resolved&lt;br /&gt;
| date = 2023/03/10&lt;br /&gt;
| message =&lt;br /&gt;
We have upgraded the filesystem routers in our MSRDC location to address the performance issues.&lt;br /&gt;
&lt;br /&gt;
Please let us know if you experience any issues with the filesystem performance.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/05/01&lt;br /&gt;
| message =&lt;br /&gt;
On May 1, 2023, the ARC Open OnDemand node will be rebooted between 5PM and 6PM. Expected downtime will be approximately 15 minutes.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on ARC Login Node&lt;br /&gt;
| date = 2023/06/22&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may experience an error when&lt;br /&gt;
running on the ARC login node. If apptainer complains that a system&lt;br /&gt;
administrator needs to enable user namespaces, simply run your&lt;br /&gt;
containers inside a job.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to a security vulnerability that will be&lt;br /&gt;
patched soon.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Lattice, Single, cpu2013 partition changes&lt;br /&gt;
| date = 2023/07/13&lt;br /&gt;
| message =&lt;br /&gt;
The Lattice, Single, and cpu2013 partitions have all been decommissioned. The Single&lt;br /&gt;
partition will be rebuilt from the nodes formerly in the cpu2013 partition and&lt;br /&gt;
will keep the name single.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/10/17&lt;br /&gt;
| message =&lt;br /&gt;
Open OnDemand will be rebooted on October 17, 2023 for an update. It will be down for up to 30 minutes.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC/ARC cluster on &lt;br /&gt;
November 16 and 17, 2023. To facilitate this, we will be throttling &lt;br /&gt;
down the number of jobs on both clusters while the upgrades are &lt;br /&gt;
performed.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/05/03&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = ARC experienced a brief power outage around 11AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting. Most jobs running at the time &lt;br /&gt;
were lost. ARC administrators are actively working on restarting compute &lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation&lt;br /&gt;
| date = 2024/06/03&lt;br /&gt;
| message = Job submissions targeted to the GPU a100 partition will be &lt;br /&gt;
affected by a temporary reservation on the nodes to accommodate the RCS&lt;br /&gt;
summer school class taking place on 2024/Jun/10. The reservation will end &lt;br /&gt;
shortly afterwards. Please submit your jobs normally and the scheduler will &lt;br /&gt;
start them as soon as the nodes are available. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation Removed&lt;br /&gt;
| date = 2024/06/11&lt;br /&gt;
| message = GPU a100 Nodes in ARC have been returned to normal scheduling. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/23&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable &lt;br /&gt;
between Sept 23 and Sept 27 inclusive (subject to change). All compute nodes &lt;br /&gt;
in cpu2019, cpu2021/2, and gpu-v100, and most nodes from bigmem and gpu-a100, &lt;br /&gt;
will be affected. These nodes will return to service as soon as the work is complete.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues that are blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage will occur starting next Tuesday for the same nodes.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling has resumed. &lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The ARC login node will be rebooted on Tuesday December 17 for scheduled maintenance. It will be down for a few minutes and return shortly. Job scheduling and jobs running on the cluster will not be affected. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades starting 9AM Monday, January 20, 2025 through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported InfiniBand on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10AM, ARC experienced an electrical power brownout. An as-yet-unknown number of nodes lost electrical power during this time, causing the loss of a number of running jobs.&lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since ARC is shutting down for maintenance on Monday, Jan 20, replacement jobs will likely not start unless they request a time limit shorter than the time remaining until 8AM Monday.&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC cluster upgrade is complete.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported InfiniBand on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal was upgraded.&lt;br /&gt;
&lt;br /&gt;
6. The Parallel partition was renamed to Legacy to reflect the lack of a high-speed interconnect for parallel MPI work and was restricted to a maximum of 4-node jobs.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
Jan 23, 9:08AM&lt;br /&gt;
&lt;br /&gt;
Remount complete. ARC is back in full service.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Navbox ARC}}&lt;br /&gt;
[[Category:ARC]]&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3706</id>
		<title>ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3706"/>
		<updated>2025-01-23T01:45:58Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Cluster Status&lt;br /&gt;
| cluster = ARC&lt;br /&gt;
| status = green&lt;br /&gt;
| title = Cluster operational - Power Bump Jan 18&lt;br /&gt;
| message = System is operational. Updates are planned for Jan 20. Please see MOTD&lt;br /&gt;
&lt;br /&gt;
See the [[ARC Cluster Status]] page for system notices. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 16, 2023, the ARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The ARC login node will reboot on the morning of January 16. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 20.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following has been changed:&lt;br /&gt;
* OS updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/02/28&lt;br /&gt;
| message =&lt;br /&gt;
We are currently investigating a filesystem issue that is causing slowdowns across ARC.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/03/01&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing slowdowns across ARC. Some jobs on ARC have been paused to help us find the root cause of the slowdowns.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ARC Login node reboot&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
The ARC login node will be rebooted this afternoon for an emergency maintenance. This downtime is needed to help mitigate the filesystem slowdowns experienced on the login node.  Jobs will continue running and scheduling during this time.&lt;br /&gt;
&lt;br /&gt;
All logins to the ARC login node will be terminated at 3:00PM, and the node will remain unavailable until 4:00PM.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Filesystem Issues&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing slowdowns on specific nodes in our MSRDC location.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues Resolved&lt;br /&gt;
| date = 2023/03/10&lt;br /&gt;
| message =&lt;br /&gt;
We have upgraded the filesystem routers in our MSRDC location to address the performance issues.&lt;br /&gt;
&lt;br /&gt;
Please let us know if you experience any issues with the filesystem performance.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/05/01&lt;br /&gt;
| message =&lt;br /&gt;
On May 1, 2023, the ARC Open OnDemand node will be rebooted between 5PM and 6PM. Expected downtime will be approximately 15 minutes.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on ARC Login Node&lt;br /&gt;
| date = 2023/06/22&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may encounter an error when&lt;br /&gt;
running on the ARC login node. If Apptainer complains that a system&lt;br /&gt;
administrator needs to enable user namespaces, run your&lt;br /&gt;
containers inside a job instead.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to a security vulnerability that will be&lt;br /&gt;
patched soon.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
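The workaround above (running containers inside a job rather than on the login node) can be sketched as a Slurm batch script. This is a hypothetical example only: the image name, partition, and resource requests are placeholders, not actual ARC settings.

```shell
# Hypothetical sketch of the workaround: run Apptainer inside a Slurm job
# instead of on the login node. The image name, partition, and resource
# requests below are placeholders, not actual ARC settings.
cat > run_container.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=single
#SBATCH --time=01:00:00
#SBATCH --mem=4G
# Inside a compute job the user-namespace restriction does not apply,
# so the container runs normally:
apptainer exec mycontainer.sif python3 my_script.py
EOF
# Submit the script to the scheduler:
# sbatch run_container.sh
```

The same `apptainer exec` command that fails on the login node should succeed once wrapped in a job this way.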
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Lattice, Single, cpu2013 partition changes&lt;br /&gt;
| date = 2023/07/13&lt;br /&gt;
| message =&lt;br /&gt;
The Lattice, Single, and cpu2013 partitions have all been decommissioned. The Single&lt;br /&gt;
partition will be recreated from the nodes formerly in the cpu2013 partition and&lt;br /&gt;
will keep the name single.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/10/17&lt;br /&gt;
| message =&lt;br /&gt;
Open OnDemand will be rebooted on October 17, 2023 for an update. It will be down for up to 30 minutes.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC/ARC cluster on &lt;br /&gt;
November 16 and 17, 2023. To facilitate this, we will be throttling &lt;br /&gt;
down the number of jobs on both clusters while the upgrades are &lt;br /&gt;
performed.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/05/03&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = ARC experienced a brief power outage around 11AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting. Most jobs running at this time&lt;br /&gt;
were lost. ARC administrators are actively working on restarting compute&lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation&lt;br /&gt;
| date = 2024/06/03&lt;br /&gt;
| message = Job submissions targeted to the GPU a100 partition will be&lt;br /&gt;
affected by a temporary reservation on the nodes to accommodate the RCS&lt;br /&gt;
summer school class taking place on 2024/Jun/10. The reservation will end&lt;br /&gt;
shortly afterward. Please submit your jobs normally and the scheduler will&lt;br /&gt;
start them as soon as the nodes are available. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation Removed&lt;br /&gt;
| date = 2024/06/11&lt;br /&gt;
| message = GPU a100 Nodes in ARC have been returned to normal scheduling. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/23&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable&lt;br /&gt;
from Sept 23 to Sept 27 inclusive (subject to change). All compute nodes&lt;br /&gt;
in cpu2019, cpu2021/2, and gpu-v100, plus most nodes from bigmem and gpu-a100, will be&lt;br /&gt;
affected. These nodes will return to service as soon as the work is complete.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage for the same nodes will begin next Tuesday.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024.&lt;br /&gt;
&lt;br /&gt;
We apologise for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control, the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling Has Resumed&lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The ARC login node will be rebooted on Tuesday December 17 for scheduled maintenance. It will be down for a few minutes and return shortly. Job scheduling and jobs running on the cluster will not be affected. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades starting 9AM Monday, January 20, 2025 through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported InfiniBand interconnect on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10AM, ARC experienced an electrical power brownout. An unknown percentage of the nodes lost electrical power during this time, causing the loss of a number of running jobs.&lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since ARC is shutting down for maintenance on Monday, Jan 20, replacement jobs will likely not start unless they request a time limit shorter than the time remaining until 8AM Monday.&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
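The scheduling note above can be illustrated with Slurm's time-limit request. A hedged sketch only: the `--time` value and partition name are placeholders, and the actual hours remaining until the Monday 8AM shutdown would have to be computed by the user.

```shell
# Hypothetical sketch: request a time limit short enough that the job can
# finish before the Monday 8AM maintenance shutdown, so the scheduler is
# able to backfill it. The --time value and partition are placeholders.
cat > short_job.sh <<'EOF'
#!/bin/bash
#SBATCH --time=30:00:00
#SBATCH --partition=cpu2019
./my_program
EOF
# Submit with: sbatch short_job.sh
# squeue --me shows whether the scheduler has started the job.
```

A job requesting more wall time than remains before the maintenance window will simply sit in the queue until the cluster returns.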
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC cluster upgrade is complete.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported InfiniBand interconnect on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal was upgraded.&lt;br /&gt;
&lt;br /&gt;
6. The Parallel partition was renamed to Legacy to reflect its lack of a high-speed interconnect for parallel MPI work, and is now restricted to jobs of at most 4 nodes.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
&lt;br /&gt;
UPDATE Jan 22, 6:45PM&lt;br /&gt;
&lt;br /&gt;
To fix a problem, it has become necessary to remount the filesystems on the login node. This will require everyone to be logged off. This will happen Thursday, Jan 23, 2025 at 9AM.&lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Navbox ARC}}&lt;br /&gt;
[[Category:ARC]]&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3705</id>
		<title>ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3705"/>
		<updated>2025-01-23T01:45:38Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Cluster Status&lt;br /&gt;
| cluster = ARC&lt;br /&gt;
| status = green&lt;br /&gt;
| title = Cluster operational - Power Bump Jan 18&lt;br /&gt;
| message = System is operational. Updates are planned for Jan 20. Please see MOTD&lt;br /&gt;
&lt;br /&gt;
See the [[ARC Cluster Status]] page for system notices. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 16, 2023, the ARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The ARC login node will reboot on the morning of January 16. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 20.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following has been changed:&lt;br /&gt;
* OS updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/02/28&lt;br /&gt;
| message =&lt;br /&gt;
We are currently investigating a filesystem issue that is causing slowdowns across ARC.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/03/01&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing slowdowns across ARC. Some jobs on ARC have been paused to help us find the root cause of the slowdowns.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ARC Login node reboot&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
The ARC login node will be rebooted this afternoon for an emergency maintenance. This downtime is needed to help mitigate the filesystem slowdowns experienced on the login node.  Jobs will continue running and scheduling during this time.&lt;br /&gt;
&lt;br /&gt;
All logins to the ARC login node will be terminated at 3:00PM, and the node will remain unavailable until 4:00PM.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Filesystem Issues&lt;br /&gt;
| date = 2023/03/02&lt;br /&gt;
| message =&lt;br /&gt;
We are still investigating a filesystem issue that is causing slowdowns on specific nodes in our MSRDC location.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues Resolved&lt;br /&gt;
| date = 2023/03/10&lt;br /&gt;
| message =&lt;br /&gt;
We have upgraded the filesystem routers in our MSRDC location to address the performance issues.&lt;br /&gt;
&lt;br /&gt;
Please let us know if you experience any issues with the filesystem performance.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/05/01&lt;br /&gt;
| message =&lt;br /&gt;
On May 1, 2023, the ARC Open OnDemand node will be rebooted between 5PM and 6PM. Expected downtime will be approximately 15 minutes.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on ARC Login Node&lt;br /&gt;
| date = 2023/06/22&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may encounter an error when&lt;br /&gt;
running on the ARC login node. If Apptainer complains that a system&lt;br /&gt;
administrator needs to enable user namespaces, run your&lt;br /&gt;
containers inside a job instead.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to a security vulnerability that will be&lt;br /&gt;
patched soon.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Lattice, Single, cpu2013 partition changes&lt;br /&gt;
| date = 2023/07/13&lt;br /&gt;
| message =&lt;br /&gt;
The Lattice, Single, and cpu2013 partitions have all been decommissioned. The Single&lt;br /&gt;
partition will be recreated from the nodes formerly in the cpu2013 partition and&lt;br /&gt;
will keep the name single.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/10/17&lt;br /&gt;
| message =&lt;br /&gt;
Open OnDemand will be rebooted on October 17, 2023 for an update. It will be down for up to 30 minutes.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC/ARC cluster on &lt;br /&gt;
November 16 and 17, 2023. To facilitate this, we will be throttling &lt;br /&gt;
down the number of jobs on both clusters while the upgrades are &lt;br /&gt;
performed.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/05/03&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = ARC experienced a brief power outage around 11AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting. Most jobs running at this time&lt;br /&gt;
were lost. ARC administrators are actively working on restarting compute&lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation&lt;br /&gt;
| date = 2024/06/03&lt;br /&gt;
| message = Job submissions targeted to the GPU a100 partition will be&lt;br /&gt;
affected by a temporary reservation on the nodes to accommodate the RCS&lt;br /&gt;
summer school class taking place on 2024/Jun/10. The reservation will end&lt;br /&gt;
shortly afterward. Please submit your jobs normally and the scheduler will&lt;br /&gt;
start them as soon as the nodes are available. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation Removed&lt;br /&gt;
| date = 2024/06/11&lt;br /&gt;
| message = GPU a100 Nodes in ARC have been returned to normal scheduling. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/23&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable&lt;br /&gt;
from Sept 23 to Sept 27 inclusive (subject to change). All compute nodes&lt;br /&gt;
in cpu2019, cpu2021/2, and gpu-v100, plus most nodes from bigmem and gpu-a100, will be&lt;br /&gt;
affected. These nodes will return to service as soon as the work is complete.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage for the same nodes will begin next Tuesday.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024.&lt;br /&gt;
&lt;br /&gt;
We apologise for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control, the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling Has Resumed&lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The ARC login node will be rebooted on Tuesday December 17 for scheduled maintenance. It will be down for a few minutes and return shortly. Job scheduling and jobs running on the cluster will not be affected. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades starting 9AM Monday, January 20, 2025 through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported InfiniBand interconnect on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10AM, ARC experienced an electrical power brownout. An unknown percentage of the nodes lost electrical power during this time, causing the loss of a number of running jobs.&lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since ARC is shutting down for maintenance on Monday, Jan 20, replacement jobs will likely not start unless they request a time limit shorter than the time remaining until 8AM Monday.&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC cluster upgrade is complete.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported InfiniBand interconnect on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal was upgraded.&lt;br /&gt;
&lt;br /&gt;
6. The Parallel partition was renamed to Legacy to reflect its lack of a high-speed interconnect for parallel MPI work, and is now restricted to jobs of at most 4 nodes.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
UPDATE Jan 22, 6:45PM&lt;br /&gt;
To fix a problem, it has become necessary to remount the filesystems on the login node. This requires all users to be logged off. The remount will take place on Thursday, Jan 23, 2025 at 9AM.&lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Navbox ARC}}&lt;br /&gt;
[[Category:ARC]]&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3704</id>
		<title>ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3704"/>
		<updated>2025-01-22T20:26:30Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Cluster Status&lt;br /&gt;
| cluster = ARC&lt;br /&gt;
| status = green&lt;br /&gt;
| title = Cluster operational - Power Bump Jan 18&lt;br /&gt;
| message = System is operational. Updates are planned for Jan 20. Please see the MOTD.&lt;br /&gt;
&lt;br /&gt;
See the [[ARC Cluster Status]] page for system notices. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 16, 2023, the ARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The ARC login node will reboot on the morning of January 16. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 20.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following has been changed:&lt;br /&gt;
* OS Updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/02/28&lt;br /&gt;
| message =&lt;br /&gt;
We are currently investigating a filesystem issue that is causing filesystem slowdowns across ARC.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/03/1&lt;br /&gt;
| message =&lt;br /&gt;
We are still currently investigating a filesystem issue that is causing filesystem slowdowns across ARC. Some jobs on ARC have been paused to help us find the root cause of the slowdowns.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ARC Login node reboot&lt;br /&gt;
| date = 2023/03/2&lt;br /&gt;
| message =&lt;br /&gt;
The ARC login node will be rebooted this afternoon for an emergency maintenance. This downtime is needed to help mitigate the filesystem slowdowns experienced on the login node.  Jobs will continue running and scheduling during this time.&lt;br /&gt;
&lt;br /&gt;
All logins to the ARC login node will be terminated at 3:00PM and will remain unavailable until 4:00PM.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Filesystem Issues&lt;br /&gt;
| date = 2023/03/2&lt;br /&gt;
| message =&lt;br /&gt;
We are still currently investigating a filesystem issue that is causing filesystem slowdowns on specific nodes in our MSRDC location.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues Resolved&lt;br /&gt;
| date = 2023/03/10&lt;br /&gt;
| message =&lt;br /&gt;
We have upgraded the filesystem routers in our MSRDC location to address the performance issues.&lt;br /&gt;
&lt;br /&gt;
Please let us know if you experience any issues with the filesystem performance.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/05/01&lt;br /&gt;
| message =&lt;br /&gt;
On May 1, 2023, the ARC Open OnDemand node will be rebooted between 5PM and 6PM. Expected downtime will be approximately 15 minutes.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on ARC Login Node&lt;br /&gt;
| date = 2023/06/22&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may experience an error when&lt;br /&gt;
running on the Arc login node. If apptainer complains that a system&lt;br /&gt;
administrator needs to enable user namespaces, simply run your&lt;br /&gt;
containers inside a job.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to a security vulnerability that will be patched soon.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Lattice, Single, cpu2013 partition changes&lt;br /&gt;
| date = 2023/07/13&lt;br /&gt;
| message =&lt;br /&gt;
The Lattice, Single, and cpu2013 partitions have all been decommissioned. The&lt;br /&gt;
Single partition will be replaced by the nodes formerly in the cpu2013 partition,&lt;br /&gt;
but the replacement partition will still be called single.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/10/17&lt;br /&gt;
| message =&lt;br /&gt;
Open OnDemand will be rebooted on October 17, 2023 for an update. It will be down for up to 30 minutes.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC/ARC cluster on &lt;br /&gt;
November 16 and 17, 2023. To facilitate this, we will be throttling &lt;br /&gt;
down the number of jobs on both clusters while the upgrades are &lt;br /&gt;
performed.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/05/3&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = ARC experienced a brief power outage around 11AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting. Most jobs running at that time &lt;br /&gt;
were lost. ARC administrators are actively working on restarting compute &lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation&lt;br /&gt;
| date = 2024/06/03&lt;br /&gt;
| message = Job submissions targeting the GPU a100 partition will be &lt;br /&gt;
affected by a temporary reservation on the nodes to accommodate the RCS&lt;br /&gt;
summer school class taking place on 2024/Jun/10. The reservation will end &lt;br /&gt;
shortly afterward. Please submit your jobs normally and the scheduler will &lt;br /&gt;
start them as soon as the nodes are available. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation Removed&lt;br /&gt;
| date = 2024/06/11&lt;br /&gt;
| message = GPU a100 Nodes in ARC have been returned to normal scheduling. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/23&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable &lt;br /&gt;
between Sept 23 and Sept 27 inclusive (subject to change). All compute nodes &lt;br /&gt;
in cpu2019, cpu2021/2, and gpu-v100, and most nodes from bigmem and gpu-a100, &lt;br /&gt;
will be affected. These nodes will return to service as soon as the work is complete.  &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage will occur starting next Tuesday for the same nodes.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control, the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling has resumed. &lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The ARC login node will be rebooted on Tuesday December 17 for scheduled maintenance. It will be down for a few minutes and return shortly. Job scheduling and jobs running on the cluster will not be affected. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades from 9AM Monday, January 20, 2025, through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported InfiniBand on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10AM, ARC experienced an electrical power brownout. An as-yet-undetermined percentage of the nodes lost electrical power during this time, causing the loss of a number of running jobs.  &lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since ARC is shutting down for maintenance on Monday, Jan 20, replacement jobs will likely not start unless they request a time limit shorter than the time remaining until 8AM Monday.  &lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC cluster upgrade is complete.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported InfiniBand on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal was upgraded.&lt;br /&gt;
&lt;br /&gt;
6. The Parallel partition was renamed to Legacy to reflect its lack of an interconnect suitable for parallel MPI work, and it was restricted to a maximum of 4 nodes per job.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Navbox ARC}}&lt;br /&gt;
[[Category:ARC]]&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3703</id>
		<title>ARC Cluster Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=ARC_Cluster_Status&amp;diff=3703"/>
		<updated>2025-01-22T19:11:20Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Cluster Status&lt;br /&gt;
| cluster = ARC&lt;br /&gt;
| status = yellow&lt;br /&gt;
| title = Cluster operational - Power Bump Jan 18&lt;br /&gt;
| message = System is operational. Updates are planned for Jan 20. Please see the MOTD.&lt;br /&gt;
&lt;br /&gt;
See the [[ARC Cluster Status]] page for system notices. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = January System Updates&lt;br /&gt;
| date = 2023/01/01&lt;br /&gt;
| message =&lt;br /&gt;
Beginning January 16, 2023, the ARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.&lt;br /&gt;
&lt;br /&gt;
The ARC login node will reboot on the morning of January 16. Please save your work and log out if possible.&lt;br /&gt;
&lt;br /&gt;
The upgrade is planned to be fully complete by January 20.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = System Updates Completed&lt;br /&gt;
| date = 2023/01/24&lt;br /&gt;
| message =&lt;br /&gt;
The upgrade has been completed. The following has been changed:&lt;br /&gt;
* OS Updated to Rocky Linux 8.7&lt;br /&gt;
* Slurm updated to 22.05.7&lt;br /&gt;
* Apptainer replaces Singularity&lt;br /&gt;
* Each job will have its own /tmp, /dev/shm, /run/user/$uid mounted&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/02/28&lt;br /&gt;
| message =&lt;br /&gt;
We are currently investigating a filesystem issue that is causing filesystem slowdowns across ARC.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues&lt;br /&gt;
| date = 2023/03/1&lt;br /&gt;
| message =&lt;br /&gt;
We are still currently investigating a filesystem issue that is causing filesystem slowdowns across ARC. Some jobs on ARC have been paused to help us find the root cause of the slowdowns.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ARC Login node reboot&lt;br /&gt;
| date = 2023/03/2&lt;br /&gt;
| message =&lt;br /&gt;
The ARC login node will be rebooted this afternoon for an emergency maintenance. This downtime is needed to help mitigate the filesystem slowdowns experienced on the login node.  Jobs will continue running and scheduling during this time.&lt;br /&gt;
&lt;br /&gt;
All logins to the ARC login node will be terminated at 3:00PM and will remain unavailable until 4:00PM.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Filesystem Issues&lt;br /&gt;
| date = 2023/03/2&lt;br /&gt;
| message =&lt;br /&gt;
We are still currently investigating a filesystem issue that is causing filesystem slowdowns on specific nodes in our MSRDC location.&lt;br /&gt;
&lt;br /&gt;
We will update you with more information as it becomes available.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience and thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Filesystem Issues Resolved&lt;br /&gt;
| date = 2023/03/10&lt;br /&gt;
| message =&lt;br /&gt;
We have upgraded the filesystem routers in our MSRDC location to address the performance issues.&lt;br /&gt;
&lt;br /&gt;
Please let us know if you experience any issues with the filesystem performance.&lt;br /&gt;
&lt;br /&gt;
Thank you for your patience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/05/01&lt;br /&gt;
| message =&lt;br /&gt;
On May 1, 2023, the ARC Open OnDemand node will be rebooted between 5PM and 6PM. Expected downtime will be approximately 15 minutes.&lt;br /&gt;
&lt;br /&gt;
If you encounter any system issues, do not hesitate to let us know.&lt;br /&gt;
&lt;br /&gt;
Thank you for your cooperation.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Apptainer (Singularity) on ARC Login Node&lt;br /&gt;
| date = 2023/06/22&lt;br /&gt;
| message =&lt;br /&gt;
Apptainer (Singularity) containers may experience an error when&lt;br /&gt;
running on the Arc login node. If apptainer complains that a system&lt;br /&gt;
administrator needs to enable user namespaces, simply run your&lt;br /&gt;
containers inside a job.&lt;br /&gt;
&lt;br /&gt;
This is a temporary measure due to a security vulnerability that will be patched soon.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Lattice, Single, cpu2013 partition changes&lt;br /&gt;
| date = 2023/07/13&lt;br /&gt;
| message =&lt;br /&gt;
The Lattice, Single, and cpu2013 partitions have all been decommissioned. The&lt;br /&gt;
Single partition will be replaced by the nodes formerly in the cpu2013 partition,&lt;br /&gt;
but the replacement partition will still be called single.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Open OnDemand reboot&lt;br /&gt;
| date = 2023/10/17&lt;br /&gt;
| message =&lt;br /&gt;
Open OnDemand will be rebooted on October 17, 2023 for an update. It will be down for up to 30 minutes.&lt;br /&gt;
&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Storage Upgrade MARC/ARC cluster&lt;br /&gt;
| date = 2023/10/23&lt;br /&gt;
| message =&lt;br /&gt;
We will be performing storage upgrades on the MARC/ARC cluster on &lt;br /&gt;
November 16 and 17, 2023. To facilitate this, we will be throttling &lt;br /&gt;
down the number of jobs on both clusters while the upgrades are &lt;br /&gt;
performed.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/05/3&lt;br /&gt;
| message =&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Power Interruption&lt;br /&gt;
| date = 2024/05/07&lt;br /&gt;
| message = ARC experienced a brief power outage around 11AM on May 7, 2024.&lt;br /&gt;
Most compute nodes have rebooted or are rebooting. Most jobs running at that time &lt;br /&gt;
were lost. ARC administrators are actively working on restarting compute &lt;br /&gt;
nodes. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation&lt;br /&gt;
| date = 2024/06/03&lt;br /&gt;
| message = Job submissions targeting the GPU a100 partition will be &lt;br /&gt;
affected by a temporary reservation on the nodes to accommodate the RCS&lt;br /&gt;
summer school class taking place on 2024/Jun/10. The reservation will end &lt;br /&gt;
shortly afterward. Please submit your jobs normally and the scheduler will &lt;br /&gt;
start them as soon as the nodes are available. Sorry for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = GPU a100 Node Reservation Removed&lt;br /&gt;
| date = 2024/06/11&lt;br /&gt;
| message = GPU a100 Nodes in ARC have been returned to normal scheduling. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/23&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable &lt;br /&gt;
between Sept 23 and Sept 27 inclusive (subject to change). All compute nodes &lt;br /&gt;
in cpu2019, cpu2021/2, and gpu-v100, and most nodes from bigmem and gpu-a100, &lt;br /&gt;
will be affected. These nodes will return to service as soon as the work is complete.  &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage will occur starting next Tuesday for the same nodes.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control, the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling has resumed. &lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance &lt;br /&gt;
| date = 2024/12/11&lt;br /&gt;
| message = The ARC login node will be rebooted on Tuesday December 17 for scheduled maintenance. It will be down for a few minutes and return shortly. Job scheduling and jobs running on the cluster will not be affected. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades from 9AM Monday, January 20, 2025, through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported InfiniBand on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10AM, ARC experienced an electrical power brownout. An as-yet-undetermined percentage of the nodes lost electrical power during this time, causing the loss of a number of running jobs.  &lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since ARC is shutting down for maintenance on Monday, January 20, replacement jobs will likely not start unless they request a time limit shorter than the time remaining until 8 AM Monday.  &lt;br /&gt;
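The advice above — requesting a time limit that ends before the maintenance window — can be sketched as a small shell snippet. This is an illustrative example, not an official RCS script: the job script name myjob.sh is a placeholder, and it assumes GNU date and Slurm's sbatch --time flag, which accepts a limit in minutes.

```shell
#!/bin/sh
# Compute the minutes remaining until 8 AM on Monday, January 20, 2025,
# and size the job's time limit to fit inside that window.
deadline=$(date -d "2025-01-20 08:00" +%s)   # GNU date: epoch seconds
now=$(date +%s)
remaining_min=$(( (deadline - now) / 60 ))
echo "Minutes until the maintenance shutdown: ${remaining_min}"

# Only submit if some time actually remains before the shutdown.
if [ "${remaining_min}" -gt 0 ]; then
    # sbatch --time="${remaining_min}" myjob.sh   # uncomment on a cluster with Slurm
    echo "Submitting with --time=${remaining_min} minutes"
fi
```

Jobs whose requested limit extends past the shutdown will sit in the queue until the maintenance is over, so a shorter limit is the only way to get a replacement job scheduled before Monday.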
⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC cluster upgrade is complete.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported InfiniBand interconnect on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal was upgraded.&lt;br /&gt;
&lt;br /&gt;
6. The Parallel partition was renamed to Legacy to reflect the lack of a high-speed interconnect for parallel MPI work, and it is now restricted to jobs of at most 4 nodes.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Navbox ARC}}&lt;br /&gt;
[[Category:ARC]]&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
	<entry>
		<id>https://rcs.ucalgary.ca/index.php?title=Think_Login_Node_Status&amp;diff=3702</id>
		<title>Think Login Node Status</title>
		<link rel="alternate" type="text/html" href="https://rcs.ucalgary.ca/index.php?title=Think_Login_Node_Status&amp;diff=3702"/>
		<updated>2025-01-22T19:10:55Z</updated>

		<summary type="html">&lt;p&gt;Dschulz: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{ARC Cluster Status}}&lt;br /&gt;
&lt;br /&gt;
== System Messages ==&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Systems Operating Normally&lt;br /&gt;
| date = 2024/09/03&lt;br /&gt;
| message =&lt;br /&gt;
The ARC Cluster and the Think login node are operational. No upcoming upgrades are planned.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Notice of Upcoming Partial Outage&lt;br /&gt;
| date = 2024/08/23&lt;br /&gt;
| message = Several compute nodes from the ARC cluster will be unavailable &lt;br /&gt;
between September 23 and September 27, inclusive (subject to change). Some Think GPU nodes will be affected during this maintenance window. These nodes will return to service as soon as the work is complete.  &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update I&lt;br /&gt;
| date = 2024/09/25&lt;br /&gt;
| message = Due to hardware issues blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage will begin next Tuesday for the same nodes.&lt;br /&gt;
&lt;br /&gt;
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024.&lt;br /&gt;
&lt;br /&gt;
We apologize for the inconvenience.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update II&lt;br /&gt;
| date = 2024/10/04&lt;br /&gt;
| message = The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Partial Outage Update III&lt;br /&gt;
| date = 2024/10/07&lt;br /&gt;
| message = Due to technical issues beyond our control the maintenance window will be extended until at least Tuesday, October 15, 2024.&lt;br /&gt;
&lt;br /&gt;
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF GPU nodes include: wdfgpu[1-2,6,8-12].&lt;br /&gt;
&lt;br /&gt;
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Normal Scheduling Has Resumed&lt;br /&gt;
| date = 2024/10/08&lt;br /&gt;
| message = The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = wdfgpu[1-12] System Update Reboots &lt;br /&gt;
| date = 2024/12/02&lt;br /&gt;
| message = wdfgpu[1-12] will be rebooted briefly today to install important system updates and will return to service shortly. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/07&lt;br /&gt;
| message = The ARC cluster will be rebooted for OS updates on Monday, January 20, 2025. Please save your work and log out before the reboot. Scheduling will be paused until the cluster is back online; queued jobs will remain in the queue and will start scheduling once the cluster is ready. Thank you for your understanding. &lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns. &lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = ⚠️ Scheduled Maintenance and OS Update&lt;br /&gt;
| date = 2025/01/15&lt;br /&gt;
| message = The ARC cluster will be down for maintenance and upgrades starting at 9 AM Monday, January 20, 2025, through Wednesday, January 22, 2025. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For the duration of the upgrade window:&lt;br /&gt;
* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.&lt;br /&gt;
* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.&lt;br /&gt;
&lt;br /&gt;
Please make sure to save your work prior to this outage window to avoid any loss of work.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes will happen:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet will replace the 11-year-old, unsupported InfiniBand interconnect on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system will be updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system will be upgraded.&lt;br /&gt;
&lt;br /&gt;
5. The Open OnDemand web portal will be upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
&lt;br /&gt;
⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
Update Jan 18, 2025&lt;br /&gt;
&lt;br /&gt;
Around 10 AM, ARC experienced an electrical power brownout. Some number of nodes (how many is unknown at this time) lost electrical power, causing the loss of a number of running jobs.  &lt;br /&gt;
&lt;br /&gt;
Sorry for the inconvenience.  &lt;br /&gt;
&lt;br /&gt;
Since ARC is shutting down for maintenance on Monday, January 20, replacement jobs will likely not start unless they request a time limit shorter than the time remaining until 8 AM Monday.  &lt;br /&gt;
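The advice above — requesting a time limit that ends before the maintenance window — can be sketched as a small shell snippet. This is an illustrative example, not an official RCS script: the job script name myjob.sh is a placeholder, and it assumes GNU date and Slurm's sbatch --time flag, which accepts a limit in minutes.

```shell
#!/bin/sh
# Compute the minutes remaining until 8 AM on Monday, January 20, 2025,
# and size the job's time limit to fit inside that window.
deadline=$(date -d "2025-01-20 08:00" +%s)   # GNU date: epoch seconds
now=$(date +%s)
remaining_min=$(( (deadline - now) / 60 ))
echo "Minutes until the maintenance shutdown: ${remaining_min}"

# Only submit if some time actually remains before the shutdown.
if [ "${remaining_min}" -gt 0 ]; then
    # sbatch --time="${remaining_min}" myjob.sh   # uncomment on a cluster with Slurm
    echo "Submitting with --time=${remaining_min} minutes"
fi
```

Jobs whose requested limit extends past the shutdown will sit in the queue until the maintenance is over, so a shorter limit is the only way to get a replacement job scheduled before Monday.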
⚠️⚠️⚠️⚠️⚠️&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{Message of the day item&lt;br /&gt;
| title = Maintenance Complete&lt;br /&gt;
| date = 2025/01/22&lt;br /&gt;
| message = The ARC/Think cluster upgrade is complete.&lt;br /&gt;
&lt;br /&gt;
During this time the following changes happened:&lt;br /&gt;
&lt;br /&gt;
1. Ethernet replaced the 11-year-old, unsupported InfiniBand interconnect on the following partitions:&lt;br /&gt;
* cpu2023 (temporary)&lt;br /&gt;
* Parallel&lt;br /&gt;
* Theia/Synergy/cpu2017-bf05&lt;br /&gt;
* Single&lt;br /&gt;
Any multi-node (MPI) jobs running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, or cpu2022.&lt;br /&gt;
&lt;br /&gt;
2. A component of the NetApp filer was replaced successfully.&lt;br /&gt;
&lt;br /&gt;
3. The compute node operating system was updated to Rocky Linux 8.10.&lt;br /&gt;
&lt;br /&gt;
4. The Slurm scheduling system was upgraded.&lt;br /&gt;
&lt;br /&gt;
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:ARC]]&lt;br /&gt;
{{Navbox ARC}}&lt;/div&gt;</summary>
		<author><name>Dschulz</name></author>
	</entry>
</feed>