Altis Login Node Status: Difference between revisions

← Older edit

VisualWikitext

Latest revision as of 20:50, 10 March 2025

ARC status: Cluster operational

System is operational. No updates are planned.

See the ARC Cluster Status page for system notices.

System Messages

Systems Operating Normally - 2024/09/03

The ARC Cluster and the Altis login node is operational. No upcoming upgrades are planned.

************************************************************************
2024/09/03
--- Systems Operating Normally

The ARC Cluster and the Altis login node is operational. No upcoming upgrades are planned.

Notice of Upcoming Partial Outage - 2024/08/27

Several compute nodes from the ARC cluster will be unavailable between Sept 23 to Sept 27 inclusive (subject to change). Some Altis GPU nodes will be affected during this maintenance window. These nodes will return to service as soon as the work is complete.

************************************************************************
2024/08/27
--- Notice of Upcoming Partial Outage

Several compute nodes from the ARC cluster will be unavailable

between Sept 23 to Sept 27 inclusive (subject to change). Some Altis GPU nodes will be affected during this maintenance window. These nodes will return to service as soon as the work is complete.

Partial Outage Update I - 2024/09/25

Due to hardware issues that is blocking our original maintenance window, most compute nodes that were taken offline on Monday has been brought back online today. An additional partial outage will occur again starting next Tuesday for the same nodes.

On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].

We apologise for the inconvenience.

************************************************************************
2024/09/25
--- Partial Outage Update I

Due to hardware issues that is blocking our original maintenance window, most compute nodes that were taken offline on Monday has been brought back online today. An additional partial outage will occur again starting next Tuesday for the same nodes.

On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until Friday October 4, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].

We apologise for the inconvenience.

Partial Outage Update II - 2024/10/04

The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.

Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].

We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.

************************************************************************
2024/10/04
--- Partial Outage Update II

The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.

Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].

We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.

Partial Outage Update III - 2024/10/07

Due to technical issues beyond our control the maintenance window will be extended until at least Tuesday, October 15, 2024.

Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].

We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.

************************************************************************
2024/10/07
--- Partial Outage Update III

Due to technical issues beyond our control the maintenance window will be extended until at least Tuesday, October 15, 2024.

Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].

We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.

Normal Scheduling has resumed. - 2024/10/08

The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime. Please reach out to support@hpc.ucalgary.ca with any issues or concerns.

************************************************************************
2024/10/08
--- Normal Scheduling has resumed.

The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime.

Please reach out to support@hpc.ucalgary.ca with any issues or concerns.

wdfgpu[1-12] System Update Reboots - 2024/12/02

wdfgpu[1-12] will be updated today for a short reboot to install important system updates and will return shortly. Please reach out to support@hpc.ucalgary.ca with any issues or concerns.

************************************************************************
2024/12/02
--- wdfgpu[1-12] System Update Reboots

wdfgpu[1-12] will be updated today for a short reboot to install important system updates and will return shortly.

Please reach out to support@hpc.ucalgary.ca with any issues or concerns.

Scheduled Maintenance and OS Update - 2025/01/07

The ARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding. Please reach out to support@hpc.ucalgary.ca with any issues or concerns.

************************************************************************
2025/01/07
--- Scheduled Maintenance and OS Update

The ARC cluster will be rebooted for OS updates on Monday January 20, 2025. Please make sure to save your work and log out before the reboot happens. Scheduling will be paused until the cluster is back, but queued jobs will remain in the queue and nodes will start scheduling when the cluster is ready. Thank you for understanding.

Please reach out to support@hpc.ucalgary.ca with any issues or concerns.

⚠️ Scheduled Maintenance and OS Update - 2025/01/15

The ARC cluster will be down for maintenance and upgrades starting 9AM Monday, January 20, 2025 through Wednesday, January 22, 2025.

For the duration of the upgrade window:

Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.
Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.

Please make sure to save your work prior to this outage window to avoid any loss of work.

During this time the following changes will happen:

1. Ethernet will replace the 11 year old, unsupported Infiniband on the following partitions:

cpu2023 (temporary)
Parallel
Theia/Synergy/cpu2017-bf05
Single

Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, cpu2022.

2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.

3. The compute node operating system will be updated to Rocky Linux 8.10.

4. The Slurm scheduling system will be upgraded.

5. The Open OnDemand web portal will be upgraded.

Please reach out to support@hpc.calgary.ca with any issues or concerns.

⚠️⚠️⚠️⚠️⚠️ Update Jan 18, 2025

Around 10AM Altis experienced an electrical power brownout. Some percentage (how many is unknown at this time) of the nodes lost electrical power during this time causing a loss of a number of running jobs.

Sorry for the inconvenience.

Since Altis is shutting down for maintenance on Monday Jan 20, replacement jobs will likely not start unless they request a timelimit less than the time until 8AM Monday.

⚠️⚠️⚠️⚠️⚠️

************************************************************************
2025/01/15
--- ⚠️ Scheduled Maintenance and OS Update

The ARC cluster will be down for maintenance and upgrades starting 9AM Monday, January 20, 2025 through Wednesday, January 22, 2025.

For the duration of the upgrade window:

Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.
Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.

Please make sure to save your work prior to this outage window to avoid any loss of work.

During this time the following changes will happen:

1. Ethernet will replace the 11 year old, unsupported Infiniband on the following partitions:

cpu2023 (temporary)
Parallel
Theia/Synergy/cpu2017-bf05
Single

Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, cpu2022.

2. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.

3. The compute node operating system will be updated to Rocky Linux 8.10.

4. The Slurm scheduling system will be upgraded.

5. The Open OnDemand web portal will be upgraded.

Please reach out to support@hpc.calgary.ca with any issues or concerns.

⚠️⚠️⚠️⚠️⚠️ Update Jan 18, 2025

Around 10AM Altis experienced an electrical power brownout. Some percentage (how many is unknown at this time) of the nodes lost electrical power during this time causing a loss of a number of running jobs.

Sorry for the inconvenience.

Since Altis is shutting down for maintenance on Monday Jan 20, replacement jobs will likely not start unless they request a timelimit less than the time until 8AM Monday. ⚠️⚠️⚠️⚠️⚠️

Maintenance Complete - 2025/01/22

The ARC/Altis cluster upgrade is complete

During this time the following changes happened:

1. Ethernet will replace the 11 year old, unsupported Infiniband on the following partitions:

cpu2023 (temporary)
Parallel
Theia/Synergy/cpu2017-bf05
Single

Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, cpu2022.

2. A component of the NetApp filer was replaced successfully.

3. The compute node operating was updated to Rocky Linux 8.10.

4. The Slurm scheduling system was upgraded.

Please reach out to support@hpc.calgary.ca with any issues or concerns.

************************************************************************
2025/01/22
--- Maintenance Complete

The ARC/Altis cluster upgrade is complete

During this time the following changes happened:

1. Ethernet will replace the 11 year old, unsupported Infiniband on the following partitions:

cpu2023 (temporary)
Parallel
Theia/Synergy/cpu2017-bf05
Single

Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, cpu2022.

2. A component of the NetApp filer was replaced successfully.

3. The compute node operating was updated to Rocky Linux 8.10.

4. The Slurm scheduling system was upgraded.

Please reach out to support@hpc.calgary.ca with any issues or concerns.

Support email address down - 2025/03/07

support@hpc.ucalgary.ca Unavailable

Please be informed that our support email address (support@hpc.ucalgary.ca) for RCS is currently not working. We are working to bring it back as soon as possible. Please keep an eye on this space for updates. The clusters are working normally, but support will not receive your messages at this time. We will begin responding as soon as we can get it back.

Apologies for the inconvenience.

************************************************************************
2025/03/07
--- Support email address down

support@hpc.ucalgary.ca Unavailable

Please be informed that our support email address (support@hpc.ucalgary.ca) for RCS is currently not working. We are working to bring it back as soon as possible. Please keep an eye on this space for updates. The clusters are working normally, but support will not receive your messages at this time. We will begin responding as soon as we can get it back.

Apologies for the inconvenience.

Support email address functional - 2025/03/07

support@hpc.ucalgary.ca is back

support@hpc.ucalgary.ca has been repaired and RCS can be contacted there. If you had reached out for assistance in recent days without response please follow up as we may not have received your initial email.

Apologies for the inconvenience.

************************************************************************
2025/03/07
--- Support email address functional

support@hpc.ucalgary.ca is back

support@hpc.ucalgary.ca has been repaired and RCS can be contacted there. If you had reached out for assistance in recent days without response please follow up as we may not have received your initial email.

Apologies for the inconvenience.

Altis Login Node Status: Difference between revisions

Latest revision as of 20:50, 10 March 2025

Contents

System Messages

Systems Operating Normally - 2024/09/03

Notice of Upcoming Partial Outage - 2024/08/27

Partial Outage Update I - 2024/09/25

Partial Outage Update II - 2024/10/04

Partial Outage Update III - 2024/10/07

Normal Scheduling has resumed. - 2024/10/08

wdfgpu[1-12] System Update Reboots - 2024/12/02

Scheduled Maintenance and OS Update - 2025/01/07

⚠️ Scheduled Maintenance and OS Update - 2025/01/15

Maintenance Complete - 2025/01/22

Support email address down - 2025/03/07

Support email address functional - 2025/03/07

Navigation menu

@@ Line 68: / Line 68: @@
 Please reach out to support@hpc.ucalgary.ca with any issues or concerns.
+}}
+{{Message of the day item
+| title = ⚠️ Scheduled Maintenance and OS Update
+| date = 2025/01/15
+| message = The ARC cluster will be down for maintenance and upgrades starting 9AM Monday, January 20, 2025 through Wednesday, January 22, 2025.
+For the duration of the upgrade window:
+* Scheduling will be paused and new jobs will be queued. Any queued jobs will start scheduling only after the upgrade is complete.
+* Access to files via the login node and arc-dtn will generally be available but intermittent. File transfers on the DTN node, including Globus file transfers, may be interrupted during this window.
+Please make sure to save your work prior to this outage window to avoid any loss of work.
+During this time the following changes will happen:
+. Ethernet will replace the 11 year old, unsupported Infiniband on the following partitions:
+* cpu2023 (temporary)
+* Parallel
+* Theia/Synergy/cpu2017-bf05
+* Single
+Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, cpu2022.
+. A component of the NetApp filer will be replaced. Access to /bulk will be unavailable on Wednesday, January 22, 2025.
+. The compute node operating system will be updated to Rocky Linux 8.10.
+. The Slurm scheduling system will be upgraded.
+. The Open OnDemand web portal will be upgraded.
+Please reach out to support@hpc.calgary.ca with any issues or concerns.
+⚠️⚠️⚠️⚠️⚠️
+Update Jan 18, 2025
+Around 10AM Altis experienced an electrical power brownout.  Some percentage (how many is unknown at this time) of the nodes lost electrical power during this time causing a loss
+of a number of running jobs.
+Sorry for the inconvenience.
+Since Altis is shutting down for maintenance on Monday Jan 20, replacement jobs will likely not start unless they request a timelimit less than the time until 8AM Monday.
+⚠️⚠️⚠️⚠️⚠️
+}}
+{{Message of the day item
+| title = Maintenance Complete
+| date = 2025/01/22
+| message = The ARC/Altis cluster upgrade is complete
+During this time the following changes happened:
+. Ethernet will replace the 11 year old, unsupported Infiniband on the following partitions:
+* cpu2023 (temporary)
+* Parallel
+* Theia/Synergy/cpu2017-bf05
+* Single
+Any multi-node jobs (MPI) running on these partitions will have increased latency going forward. If you run multi-node jobs, make sure to run on a partition such as cpu2019, cpu2021, cpu2022.
+. A component of the NetApp filer was replaced successfully.
+. The compute node operating was updated to Rocky Linux 8.10.
+. The Slurm scheduling system was upgraded.
+Please reach out to support@hpc.calgary.ca with any issues or concerns.
+}}
+{{Message of the day item
+| title = Support email address down
+| date = 2025/03/07
+| message = support@hpc.ucalgary.ca Unavailable
+Please be informed that our support email address (support@hpc.ucalgary.ca) for RCS is currently not working. We are working to bring it back as soon as possible. Please keep an eye on this space for updates. The clusters are working normally, but support will not receive your messages at this time. We will begin responding as soon as we can get it back.
+Apologies for the inconvenience.
+}}
+{{Message of the day item
+| title = Support email address functional
+| date = 2025/03/07
+| message = support@hpc.ucalgary.ca is back
+support@hpc.ucalgary.ca has been repaired and RCS can be contacted there. If you had reached out for assistance in recent days without response please follow up as we may not have received your initial email.
+Apologies for the inconvenience.
 }}
 [[Category:ARC]]
 {{Navbox ARC}}

Altis Login Node Status: Difference between revisions

Latest revision as of 20:50, 10 March 2025

System Messages

Systems Operating Normally - 2024/09/03

Notice of Upcoming Partial Outage - 2024/08/27

Partial Outage Update I - 2024/09/25

Partial Outage Update II - 2024/10/04

Partial Outage Update III - 2024/10/07

Normal Scheduling has resumed. - 2024/10/08

wdfgpu[1-12] System Update Reboots - 2024/12/02

Scheduled Maintenance and OS Update - 2025/01/07

⚠️ Scheduled Maintenance and OS Update - 2025/01/15

Maintenance Complete - 2025/01/22

Support email address down - 2025/03/07

Support email address functional - 2025/03/07

Navigation menu

Search