ARC Cluster Status: Difference between revisions

From RCSWiki

Revision as of 18:52, 23 October 2023

ARC status: Cluster operational


Open OnDemand will be rebooted Oct 17, 2023.

See the ARC Cluster Status page for system notices.

System Messages

January System Updates - 2023/01/01

Beginning January 16, 2023, the ARC cluster will undergo operating system updates. We shall do our utmost to minimize disruption and allow ongoing jobs to be completed. New jobs may be temporarily held from scheduling.

The ARC login node will reboot on the morning of January 16. Please save your work and log out if possible.

The upgrade is planned to be fully complete by January 20.

If you encounter any system issues, do not hesitate to let us know.

Thank you for your cooperation.

System Updates Completed - 2023/01/24

The upgrade has been completed. The following changes were made:
  • OS updated to Rocky Linux 8.7
  • Slurm updated to 22.05.7
  • Apptainer replaces Singularity
  • Each job now has its own /tmp, /dev/shm, and /run/user/$uid mounted
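
Because each job now gets its own /tmp, files written there by one job are not visible to other jobs or to login sessions. The Python sketch below shows one way a job might work in scratch space and copy its results somewhere persistent before finishing. It assumes, which this notice does not state, that the per-job /tmp is discarded when the job ends; `run_with_scratch` and its parameters are hypothetical names used only for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def run_with_scratch(work, keep_dir, scratch_root=None):
    """Run work(scratch_dir) in a temporary directory, then copy its files to keep_dir.

    scratch_root=None lets tempfile choose the default location
    (TMPDIR if set, otherwise usually /tmp), which inside a job would be
    the job's private mount.
    """
    keep = Path(keep_dir)
    keep.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory(dir=scratch_root) as scratch:
        scratch_path = Path(scratch)
        work(scratch_path)
        # Copy top-level result files out before the scratch directory is removed.
        for item in scratch_path.iterdir():
            if item.is_file():
                shutil.copy2(item, keep / item.name)
    return sorted(p.name for p in keep.iterdir())
```

A job would call this with a function that writes its output into the scratch directory and with a `keep_dir` on persistent storage, such as a directory under the user's home.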

If you encounter any system issues, do not hesitate to let us know.

Thank you for your cooperation.

Filesystem Issues - 2023/02/28

We are investigating an issue that is causing filesystem slowdowns across ARC.

We will update you with more information as it becomes available.

Thank you for your patience.


Filesystem Issues - 2023/03/01

We are still investigating an issue that is causing filesystem slowdowns across ARC. Some jobs on ARC have been paused to help us find the root cause of the slowdowns.

We will update you with more information as it becomes available.

Thank you for your patience.


ARC Login node reboot - 2023/03/02

The ARC login node will be rebooted this afternoon for emergency maintenance. This downtime is needed to help mitigate the filesystem slowdowns experienced on the login node. Jobs will continue running and scheduling during this time.

All sessions on the ARC login node will be terminated at 3:00 PM, and the node will remain unavailable until 4:00 PM.

We apologize for the inconvenience and thank you for your patience.


⚠️ Filesystem Issues - 2023/03/02

We are still investigating an issue that is causing filesystem slowdowns on specific nodes in our MSRDC location.

We will update you with more information as it becomes available.

We apologize for the inconvenience and thank you for your patience.


Filesystem Issues Resolved - 2023/03/10

We have upgraded the filesystem routers in our MSRDC location to address the performance issues.

Please let us know if you experience any issues with the filesystem performance.

Thank you for your patience.


Open OnDemand reboot - 2023/05/01

On May 1, 2023, the ARC Open OnDemand node will be rebooted between 5PM and 6PM. Expected downtime will be approximately 15 minutes.

If you encounter any system issues, do not hesitate to let us know.

Thank you for your cooperation.


Apptainer (Singularity) on ARC Login Node - 2023/06/22

Apptainer (Singularity) containers may experience an error when running on the ARC login node. If Apptainer complains that a system administrator needs to enable user namespaces, simply run your containers inside a job.

This is a temporary measure due to a security vulnerability that will be patched soon.
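
As an illustration of running a container inside a job rather than on the login node, the Python sketch below builds an `sbatch --wrap` command line around `apptainer exec`. It only constructs the command and does not submit anything. The image name, time limit, and the helper `apptainer_job_command` are hypothetical, and the resource options a real job needs on ARC are not covered here.

```python
import shlex


def apptainer_job_command(image, command, time_limit="00:10:00", partition=None):
    """Return an sbatch argument list that runs `command` inside `image` as a job."""
    # The container invocation that would otherwise be typed on the login node.
    inner = ["apptainer", "exec", image, *command]
    sbatch = ["sbatch", f"--time={time_limit}"]
    if partition:
        sbatch.append(f"--partition={partition}")
    # --wrap makes sbatch wrap the quoted command string in a minimal shell script.
    sbatch.append(f"--wrap={shlex.join(inner)}")
    return sbatch


if __name__ == "__main__":
    # "my.sif" is a placeholder image name.
    cmd = apptainer_job_command("my.sif", ["python3", "--version"])
    print(shlex.join(cmd))
```

The printed line can be reviewed and then pasted into a shell on the login node, so the container itself starts on a compute node.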

Lattice, Single, cpu2013 partition changes - 2023/07/13

The Lattice, Single, and cpu2013 partitions have all been decommissioned. The nodes formerly in the cpu2013 partition replace the Single partition and will be called single.
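
Job scripts that still request a retired partition name will need updating. The Python sketch below scans a job script's `#SBATCH` lines for such names. It assumes the retired names appear as `lattice` and `cpu2013` in scripts, and that `single` remains a valid name as described above; `retired_partitions` is a hypothetical helper, not an ARC tool.

```python
import re

# Assumed spellings of the partition names that no longer exist.
RETIRED = {"lattice", "cpu2013"}

# Matches "#SBATCH --partition=a,b", "#SBATCH --partition a" and "#SBATCH -p a".
_PARTITION = re.compile(r"^#SBATCH\s+(?:--partition(?:=|\s+)|-p\s*)(\S+)")


def retired_partitions(script_text):
    """Return the retired partition names requested in a job script, in order."""
    found = []
    for line in script_text.splitlines():
        match = _PARTITION.match(line.strip())
        if not match:
            continue
        # A partition request may be a comma-separated list.
        for name in match.group(1).split(","):
            if name.lower() in RETIRED:
                found.append(name)
    return found
```

An empty list means the script requests no retired partition by name; it says nothing about whether the remaining requests suit the job.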

Open OnDemand reboot - 2023/10/17

Open OnDemand will be rebooted on October 17, 2023 for an update. It will be down for up to 30 minutes.

Storage Upgrade MARC/ARC cluster - 2023/10/23

We will be performing storage upgrades on the MARC/ARC cluster on November 16 and 17, 2023. To facilitate this, we will be throttling down the number of jobs on both clusters while the upgrades are performed.