ARC status:
Cluster operational
System is operational. No updates are planned.
See the ARC Cluster Status page for system notices.
System Messages
January System Updates - 2023/01/01
Beginning January 16, 2023, the ARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.
The ARC login node will reboot on the morning of January 16. Please save your work and log out if possible.
The upgrade is planned to be fully complete by January 20.
If you encounter any system issues, do not hesitate to let us know.
Thank you for your cooperation.
System Updates Completed - 2023/01/24
The upgrade is complete. The following changes were made:
- OS updated to Rocky Linux 8.7
- Slurm updated to 22.05.7
- Apptainer replaces Singularity
- Each job now has its own private /tmp, /dev/shm, and /run/user/$uid mounts
If you encounter any system issues, do not hesitate to let us know.
Thank you for your cooperation.
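One way to see the new per-job mounts is to inspect them from inside a job; a minimal sketch (the srun options here are illustrative, not site requirements):

```shell
# Illustrative check of the per-job private mounts from inside a job.
# findmnt prints the mount entry backing each path; within a job these
# should now be job-private rather than shared node-wide mounts.
srun --time=00:05:00 bash -c 'findmnt /tmp; findmnt /dev/shm; findmnt "/run/user/$(id -u)"'
```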
Filesystem Issues - 2023/02/28
We are currently investigating a filesystem issue that is causing slowdowns across ARC.
We will update you with more information as it becomes available.
Thank you for your patience.
Filesystem Issues - 2023/03/01
We are still investigating a filesystem issue that is causing slowdowns across ARC. Some jobs on ARC have been paused to help us find the root cause of the slowdowns.
We will update you with more information as it becomes available.
Thank you for your patience.
ARC Login Node Reboot - 2023/03/02
The ARC login node will be rebooted this afternoon for emergency maintenance. This downtime is needed to help mitigate the filesystem slowdowns experienced on the login node. Jobs will continue running and scheduling during this time.
All sessions on the ARC login node will be terminated at 3:00 PM, and the node will remain unavailable until 4:00 PM.
We apologize for the inconvenience and thank you for your patience.
⚠️ Filesystem Issues - 2023/03/02
We are still investigating a filesystem issue that is causing slowdowns on specific nodes in our MSRDC location.
We will update you with more information as it becomes available.
We apologize for the inconvenience and thank you for your patience.
Filesystem Issues Resolved - 2023/03/10
We have upgraded the filesystem routers in our MSRDC location to address the performance issues.
Please let us know if you experience any issues with the filesystem performance.
Thank you for your patience.
Open OnDemand reboot - 2023/05/01
On May 1, 2023, the ARC Open OnDemand node will be rebooted between 5PM and 6PM. Expected downtime will be approximately 15 minutes.
If you encounter any system issues, do not hesitate to let us know.
Thank you for your cooperation.
Apptainer (Singularity) on ARC Login Node - 2023/06/22
Apptainer (Singularity) containers may fail with an error when run on the ARC login node. If Apptainer reports that a system administrator needs to enable user namespaces, simply run your containers inside a job.
This is a temporary measure due to a security vulnerability that will be patched soon.
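A minimal way to follow this advice is to wrap the container invocation in a batch job; a sketch (the job name, resource limits, and image path below are placeholders, not site requirements):

```shell
#!/bin/bash
# Hypothetical batch script: runs an Apptainer container on a compute
# node, where user namespaces are available, instead of the login node.
#SBATCH --job-name=apptainer-run
#SBATCH --time=00:10:00
#SBATCH --mem=4G
#SBATCH --cpus-per-task=1

# my_container.sif is a placeholder; substitute your own image path.
apptainer exec my_container.sif python3 --version
```

Submitted with `sbatch`, the container then runs on a compute node rather than on the login node.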
Lattice, Single, cpu2013 Partition Changes - 2023/07/13
The lattice, single, and cpu2013 partitions have all been decommissioned. The nodes formerly in the cpu2013 partition will form the replacement partition, which will keep the name single.
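Assuming standard Slurm usage on ARC, jobs that previously targeted the retired partitions would now name the new partition explicitly, for example:

```shell
# Illustrative: direct a job to the new single partition
# (the script name is a placeholder).
sbatch --partition=single my_job.sh
```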
Open OnDemand reboot - 2023/10/17
Open OnDemand will be rebooted on October 17, 2023 for an update. It will be down for up to 30 minutes.
Storage Upgrade MARC/ARC cluster - 2023/10/23
We will be performing storage upgrades on the MARC/ARC cluster on
November 16 and 17, 2023. To facilitate this, we will be throttling
down the number of jobs on both clusters while the upgrades are
performed.
Systems Operating Normally - 2024/05/03
Power Interruption - 2024/05/07
ARC experienced a brief power outage around 11 AM on May 7, 2024. Most compute nodes have rebooted or are rebooting. Most jobs running at the time were lost. ARC administrators are actively working on restarting the compute nodes. Sorry for the inconvenience.
GPU a100 Node Reservation - 2024/06/03
Job submissions targeting the GPU a100 partition will be affected by a temporary reservation on the nodes to accommodate the RCS summer school class taking place on 2024/Jun/10. The reservation will end shortly after. Please submit your jobs normally and the scheduler will start them as soon as the nodes are available. Sorry for the inconvenience.
GPU a100 Node Reservation Removed - 2024/06/11
GPU a100 nodes in ARC have been returned to normal scheduling.
Notice of Upcoming Partial Outage - 2024/08/23
Several compute nodes in the ARC cluster will be unavailable from Sept 23 to Sept 27 inclusive (subject to change). All compute nodes in cpu2019, cpu2021, cpu2022, and gpu-v100, and most nodes from bigmem and gpu-a100, will be affected. These nodes will return to service as soon as the work is complete.
Partial Outage Update I - 2024/09/25
Due to hardware issues blocking our original maintenance window, most compute nodes that were taken offline on Monday have been brought back online today. An additional partial outage for the same nodes will begin next Tuesday.
On Tuesday, October 1, 2024, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, and gpu-a100, and most nodes from bigmem, will be unavailable until Friday, October 4, 2024.
We apologize for the inconvenience.
Partial Outage Update II - 2024/10/04
The maintenance window will be extended until at least Monday, October 7, 2024 due to a power distribution issue in our renovated data centre.
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Monday, October 7, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.
Partial Outage Update III - 2024/10/07
Due to technical issues beyond our control, the maintenance window will be extended until at least Tuesday, October 15, 2024.
Currently, the compute nodes in cpu2019, cpu2021, cpu2022, gpu-v100, gpu-a100, and most nodes from bigmem will be unavailable until at least Tuesday, October 15, 2024. Affected WDF-Altis GPU nodes include: wdfgpu[1-2,6,8-12].
We apologize for the extended downtime and will update you as soon as we have additional information from our operations team.
Normal Scheduling Has Resumed - 2024/10/08
The ARC cluster has been successfully brought online and nodes are running jobs normally. We apologize for the extended downtime.
Please reach out to support@hpc.ucalgary.ca with any issues or concerns.