How to find nodenames allocated by SLURM for a job
Background
When a distributed job requests several compute nodes from SLURM, the program that performs the computations
needs to know the names of the compute nodes that SLURM has allocated to the job in order to use them.
If the program is based on the MPI library, then the computational processes are distributed to the nodes by the MPI program launcher, mpirun or mpiexec. Thus, these launchers need to know the list of the allocated nodes.
If the computational code (program) was built on ARC and compiled against the OpenMPI library provided on ARC (openmpi/4.1.1-gnu at the moment of writing),
then the launcher is aware of ARC's SLURM scheduler and can obtain the list of nodes directly from SLURM automatically.
This, however, is not always the case. Some software comes in the form of pre-compiled binary files and may be built using a different implementation of the MPI library, for example Intel MPI or IBM MPI. Alternatively, the software may not be based on the MPI library at all and may use some other kind of distribution mechanism. In such cases the software needs to be given the list of allocated nodes explicitly.
SLURM node list
The list of allocated nodes is provided to the job environment in the SLURM_NODELIST variable.
Inside an interactive job you can check it manually:
$ salloc -N4 -n4 -c1 -t 1:00:00 --mem=1gb
salloc: Granted job allocation 21486039
salloc: Waiting for resource configuration
salloc: Nodes fc[107-110] are ready for job
$ echo $SLURM_NODELIST
fc[107-110]
This compact form of the node list may be difficult to handle when passing it to the computational code. Alternatively, you can get the individual node names in the job script with the command:
$ scontrol show hostnames
fc107
fc108
fc109
fc110
This command generates a list that can be iterated over with a for loop:
$ for n in `scontrol show hostnames`; do echo "Allocated node: $n"; done
Allocated node: fc107
Allocated node: fc108
Allocated node: fc109
Allocated node: fc110
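The same loop can be used in a job script to write the node names to a file, which is a form that many launchers accept. The following is only a minimal sketch: the file name hosts.<jobid>.txt is a hypothetical choice, the #SBATCH values simply mirror the salloc request above, and the fallback node names are sample data so that the script can also be tried outside of a job.

```shell
#!/bin/bash
#SBATCH --nodes=4 --ntasks=4 --cpus-per-task=1
#SBATCH --time=1:00:00 --mem=1gb

# Inside a job, ask SLURM for the expanded node names. Outside a job
# (for example when trying this sketch on a login node), fall back to
# the sample names from the session above so the script still runs.
if [ -n "${SLURM_NODELIST:-}" ] && command -v scontrol >/dev/null 2>&1; then
    nodes=$(scontrol show hostnames "$SLURM_NODELIST")
else
    nodes="fc107 fc108 fc109 fc110"   # placeholder sample data
fi

# Hypothetical file name; the file gets one node name per line.
hostfile="hosts.${SLURM_JOB_ID:-test}.txt"
: > "$hostfile"                       # create or empty the file
for n in $nodes; do
    echo "$n" >> "$hostfile"
done

echo "Wrote $(wc -l < "$hostfile") node names to $hostfile"
```

How the file is then handed to the program depends on the software. The option name differs between MPI implementations and other launchers, so check the documentation of the one you use.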
If the output of the command is saved into a variable and the variable is then expanded without quotes, the node names are printed separated by the " " (space) character:
$ hh=`scontrol show hostnames`
$ echo $hh
fc107 fc108 fc109 fc110
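Note that the variable itself still holds one name per line. The spaces appear only because the shell splits the unquoted value into words before echo prints it. The sketch below illustrates this and shows one possible way to turn the saved output into a bash array and a comma-separated list. It assumes bash 4 or newer (for mapfile), and the value of hh is sample data standing in for the real scontrol output so that the example runs outside of a job.

```shell
#!/bin/bash
# Sample value standing in for: hh=`scontrol show hostnames`
# (command substitution keeps the newlines between the names).
hh=$'fc107\nfc108\nfc109\nfc110'

echo $hh      # unquoted: word splitting joins the names with spaces
echo "$hh"    # quoted: one name per line, as scontrol printed them

# Read the names into an array, one element per line (bash 4 or newer).
mapfile -t nodes <<< "$hh"
num_nodes=${#nodes[@]}
first_node=${nodes[0]}

# Join the array elements with commas; IFS is changed only in the subshell.
hostlist=$(IFS=,; echo "${nodes[*]}")

echo "nodes: $num_nodes, first: $first_node, list: $hostlist"
```

An array is convenient when the program needs the number of nodes or a designated first node, while the comma-separated form suits software that takes the node list as a single command-line argument.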
These are some of the ways to get the list of the compute nodes that SLURM has allocated for a job.