How to check states of ARC partitions and nodes


Background

The states of the nodes in ARC (and in clusters in general) change depending on how busy the cluster is. Nodes can also be taken out for maintenance, and they can fail and break. Sometimes it is useful to get a quick overview of what the nodes in the cluster are doing. For example, you may want to check this before requesting an interactive job from ARC, to make sure that there are free nodes available in the partition you are planning to use.
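
If you only need to know whether a specific partition currently has free nodes, the standard SLURM sinfo command can give a quick answer. The following is a minimal sketch, assuming the usual sinfo options are available on the login node; the cpu2019 partition is used purely as an example, so substitute the partition you plan to use:

$ # List the nodes of the cpu2019 partition that are currently idle
$ sinfo --partition=cpu2019 --states=idle

$ # Include partially used (Mixed) nodes as well
$ sinfo --partition=cpu2019 --states=idle,mixed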

Node states

To see the current states of the nodes in ARC's partitions, you can use the arc.nodes command:

$ arc.nodes

          Partitions: 16 (bigmem, cpu2017-bf05, cpu2019, cpu2019-bf05, cpu2021, cpu2021-bf24, cpu2022, 
                          cpu2022-bf24, cpu2023, disa, gpu-a100, gpu-v100, ood-vis, parallel, single, wdf-altis)

      ==========================================================================
                     | Total  Allocated  Down  Drained  Idle  Maint  Mixed
      --------------------------------------------------------------------------
              bigmem |     5          1     0        0     3      0      1 
        cpu2017-bf05 |    41          7     1        0    11      0     22 
             cpu2019 |    40         13     0        0     0      0     27 
        cpu2019-bf05 |    87         12     3        1    70      0      1 
             cpu2021 |    17          3     0        0     0      0     14 
        cpu2021-bf24 |    28          0     0        0    27      0      1 
             cpu2022 |    52          5     0        5     0      0     42 
        cpu2022-bf24 |    16          0     0        2    13      0      1 
             cpu2023 |    20          1     1        0     0      0     18 
                disa |     1          0     0        0     1      0      0 
            gpu-a100 |     5          2     0        0     1      0      2 
            gpu-v100 |     7          0     2        2     1      0      2 
             ood-vis |     1          0     0        0     1      0      0 
            parallel |   576        129    92      108     0      4    243 
              single |    14          0     1        1    10      0      2 
           wdf-altis |    12          0     0        0    12      0      0 
      --------------------------------------------------------------------------
       logical total |   922        173   100      119   150      4    376 
                     |
      physical total |   922        173   100      119   150      4    376 

  • The first column on the left shows the partition names. Each line presents information about that specific partition.
  • The second column, Total, indicates the total number of nodes in the partition.
  • The following columns show node states. The table is built dynamically, so it only shows the states that are currently present in ARC.
    These are standard SLURM states; their descriptions can be found in the SLURM documentation:
    https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES


  • Free resources may be available on nodes in the Idle state and, possibly, on nodes in the Mixed state.
  • Nodes marked as Allocated are completely busy with work.
  • Nodes in the Down, Drained, Draining, or Maint states cannot be used at the moment.
  • The Down state means that the node is broken and off-line.


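If arc.nodes is not available, a similar per-partition summary can be obtained directly with the standard SLURM sinfo command. The following is a minimal sketch, assuming the usual sinfo options; the output layout differs from the arc.nodes table, but the same node-state information is shown:

$ # One line per partition with node counts in the form A/I/O/T (allocated/idle/other/total)
$ sinfo --summarize

$ # Alternatively, group the nodes of each partition by their state
$ sinfo --format="%P %D %T"
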
At the bottom of the table there are two summary lines: the logical total and the physical total. Sometimes the same node is assigned to more than one partition in the cluster; logically, it is then counted once in every partition it belongs to, so the logical count can overcount the actual number of nodes in the cluster. The physical count corrects for this and reports the actual number of nodes in the cluster. Thus, if both numbers are the same, each reported node is assigned to only one partition.
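
To check which partitions a particular node belongs to, you can query the node directly. The following is a minimal sketch, where <nodename> is a placeholder for an actual node name taken from the sinfo output:

$ # Show the node record; the Partitions= field lists every partition the node is assigned to
$ scontrol show node <nodename> | grep Partitions

$ # Alternatively, list every node together with the partition(s) it appears in
$ sinfo --Node --format="%N %P"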

Links

How-Tos