How to check states of ARC partitions and nodes
Background
The states of the nodes in ARC (and in clusters in general) change depending on how busy the cluster is. Nodes can also be taken out for maintenance, and they can fail. Sometimes it is useful to see at a glance what the nodes in the cluster are doing. For example, you may want to check before requesting an interactive job on ARC, to make sure that there are free nodes in the partition you are planning to use.
Node states
To see the current states of the nodes in ARC's partitions, use the arc.nodes command:
$ arc.nodes

Partitions: 16 (bigmem, cpu2017-bf05, cpu2019, cpu2019-bf05, cpu2021, cpu2021-bf24, cpu2022, cpu2022-bf24, cpu2023, disa, gpu-a100, gpu-v100, ood-vis, parallel, single, wdf-altis)

==========================================================================
               | Total  Allocated  Down  Drained  Idle  Maint  Mixed
--------------------------------------------------------------------------
        bigmem |     5          1     0        0     3      0      1
  cpu2017-bf05 |    41          7     1        0    11      0     22
       cpu2019 |    40         13     0        0     0      0     27
  cpu2019-bf05 |    87         12     3        1    70      0      1
       cpu2021 |    17          3     0        0     0      0     14
  cpu2021-bf24 |    28          0     0        0    27      0      1
       cpu2022 |    52          5     0        5     0      0     42
  cpu2022-bf24 |    16          0     0        2    13      0      1
       cpu2023 |    20          1     1        0     0      0     18
          disa |     1          0     0        0     1      0      0
      gpu-a100 |     5          2     0        0     1      0      2
      gpu-v100 |     7          0     2        2     1      0      2
       ood-vis |     1          0     0        0     1      0      0
      parallel |   576        129    92      108     0      4    243
        single |    14          0     1        1    10      0      2
     wdf-altis |    12          0     0        0    12      0      0
--------------------------------------------------------------------------
 logical total |   922        173   100      119   150      4    376
               |
physical total |   922        173   100      119   150      4    376
- The first column on the left shows the partition names. Each row presents information about one partition.
- The second column, Total, indicates the total number of nodes in this partition.
- The following columns show node states. The table is built dynamically, so it only shows the states that are currently present in ARC.
- These are standard SLURM states, and their descriptions can be found in the SLURM documentation:
- https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES
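The check described in the background section (are there idle nodes in the partition I want to use?) can also be done on the table programmatically. The following is only a minimal Python sketch, assuming the output keeps the "name | counts" layout shown above. The function names parse_node_table and idle_nodes are hypothetical helpers and not part of arc.nodes, and the sample text is a few rows copied from the listing above rather than live data.

```python
# Rows copied from the sample arc.nodes output above; this is not live data.
SAMPLE_OUTPUT = """\
               | Total  Allocated  Down  Drained  Idle  Maint  Mixed
        bigmem |     5          1     0        0     3      0      1
       cpu2019 |    40         13     0        0     0      0     27
      gpu-v100 |     7          0     2        2     1      0      2
      parallel |   576        129    92      108     0      4    243
"""


def parse_node_table(text):
    """Parse arc.nodes-style text into {partition: {state: count}}.

    Hypothetical helper: assumes each data row looks like "name | n n n ...".
    """
    header = []
    table = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # separator lines and the "Partitions:" summary line
        name, _, rest = line.partition("|")
        name = name.strip()
        fields = rest.split()
        if not name:
            if fields:
                header = fields  # header row: empty name, then column titles
            continue
        if not header or name.endswith("total") or len(fields) != len(header):
            continue  # skip the logical/physical total rows and odd lines
        table[name] = dict(zip(header, (int(f) for f in fields)))
    return table


def idle_nodes(table):
    """Return {partition: idle count} for partitions with idle nodes."""
    # The state columns are dynamic, so "Idle" may be missing entirely.
    return {name: states["Idle"] for name, states in table.items()
            if states.get("Idle", 0) > 0}


if __name__ == "__main__":
    print(idle_nodes(parse_node_table(SAMPLE_OUTPUT)))
```

For the sample rows this reports idle nodes only in bigmem and gpu-v100. Because the state columns appear only when a state is present, the sketch looks columns up by name instead of by position.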
At the bottom of the table there are two summary lines: the logical total and the physical total.
Sometimes the same node is assigned to more than one partition in the cluster.
Logically, such a node is counted once in each partition it belongs to.
The logical total therefore overcounts the actual number of nodes in the cluster.
The physical total corrects for this and reports the actual number of nodes in the cluster.
Thus, if the two totals are the same, each reported node is assigned to only one partition.
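The difference between the two totals can be illustrated with a small Python sketch. It is not how arc.nodes is implemented. The partition membership and the node names (n1 to n4) are made up for illustration, and the function names are hypothetical.

```python
def logical_total(membership):
    """Count every node once per partition it belongs to."""
    return sum(len(nodes) for nodes in membership.values())


def physical_total(membership):
    """Count every distinct node exactly once."""
    return len(set().union(*membership.values()))


# Hypothetical membership: node "n3" is assigned to two partitions.
membership = {
    "cpu2019": {"n1", "n2", "n3"},
    "single": {"n3", "n4"},
}

if __name__ == "__main__":
    print("logical total: ", logical_total(membership))
    print("physical total:", physical_total(membership))
```

Here the logical total is 5 and the physical total is 4, because n3 is counted in both partitions. In the sample output above both totals are 922, so no node is shared between partitions.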