User:Tannistha.nandi: Difference between revisions

From RCSWiki

Revision as of 18:14, 14 September 2020

How to interact with ARC

The ARC cluster is a collection of compute nodes connected by a high-speed network. On ARC, computations are submitted as jobs. Once submitted, jobs are assigned to compute nodes by the job scheduler as resources become available.

You can access ARC with your UCalgary IT user credentials. Once connected, you are placed on the ARC login node, which is meant for basic tasks such as submitting jobs, monitoring job status, managing files, and editing text. The login node is a shared resource that many users connect to at the same time. Intensive tasks are therefore not allowed on it, because they may prevent other users from connecting or submitting their computations.

        [tannistha.nandi@arc ~]$ 

The job scheduling system on ARC is called SLURM. Two SLURM commands can allocate resources to a job: ‘salloc’ and ‘sbatch’. Both accept the same set of command-line options for resource allocation.

‘salloc’ launches an interactive session, typically for tasks under 5 hours.

Once an interactive job session is created, you can explore research datasets, start R or Python sessions to test your code, compile software applications, and so on.

a. Example 1: The following command requests 1 node, 1 task and 1 GB of RAM for one hour.

         [tannistha.nandi@arc ~]$ salloc --mem=1G -N 1 -n 1 -t 01:00:00
         salloc: Granted job allocation 6758015
         salloc: Waiting for resource configuration
         salloc: Nodes fc4 are ready for job
         [tannistha.nandi@fc4 ~]$ 


b. Example 2: The following command requests 1 GPU on 1 node in the gpu-v100 partition, along with 1 GB of RAM, for one hour. Generic resource scheduling (--gres) is used to request GPU resources.

        [tannistha.nandi@arc ~]$ salloc --mem=1G -t 01:00:00 -p gpu-v100 --gres=gpu:1
        salloc: Granted job allocation 6760460
        salloc: Waiting for resource configuration
        salloc: Nodes fg3 are ready for job
        [tannistha.nandi@fg3 ~]$
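The short options in the examples above have long forms, which match the names used in '#SBATCH' lines of a job script. As a small sketch, Example 1 can be written with long option names; the variable name 'request' is only illustrative, and the command is printed rather than executed so that it can be reviewed first.

```shell
#!/bin/bash
# Example 1 rewritten with long option names:
#   -N is --nodes, -n is --ntasks, -t is --time (and -p is --partition).
# "request" is an illustrative variable name, not part of SLURM.
request="salloc --mem=1G --nodes=1 --ntasks=1 --time=01:00:00"

# Print the command for review; on the login node you would type it directly.
echo "$request"
```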

Once you finish your work, type 'exit' at the command prompt to end the interactive session:

        [tannistha.nandi@fg3 ~]$ exit
        [tannistha.nandi@fg3 ~]$ salloc: Relinquishing job allocation 6760460

This ensures that the allocated resources are released from your job and become available to other users.
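SLURM sets the SLURM_JOB_ID environment variable in shells started inside a job allocation, so a quick check can tell you whether you are still inside an interactive session or back on the login node. The following is a minimal sketch; the function name 'where_am_i' is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: report whether this shell runs inside a SLURM allocation.
# SLURM exports SLURM_JOB_ID in shells started by salloc or sbatch.
where_am_i() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "inside job allocation ${SLURM_JOB_ID}"
    else
        echo "not inside a job allocation"
    fi
}

where_am_i
```

Running it before starting heavy work is one way to avoid accidentally computing on the login node.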

‘sbatch’ submits computations as jobs to run on the cluster. You can submit a job script, such as job-script.sh, with 'sbatch':
        [tannistha.nandi@arc ~]$ sbatch job-script.sh

When resources become available, they are allocated to the job. Batch jobs are suited to tasks that run for long periods without user supervision. When the job script terminates, the allocation is released. Please review the section on how to prepare job scripts for more information.
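On success, sbatch prints a confirmation line of the form "Submitted batch job <id>". A minimal sketch of capturing that ID, for example to check the job's status with squeue, might look like the following. The helper name 'job_id_from_sbatch' is hypothetical, and the job ID in the demonstration line is reused from the interactive example purely for illustration.

```shell
#!/bin/bash
# Hypothetical helper: extract the job ID from sbatch's confirmation line,
# which has the form "Submitted batch job <id>".
job_id_from_sbatch() {
    # Print the last whitespace-separated field of the line.
    echo "$1" | awk '{ print $NF }'
}

# Usage on the cluster (commented out because it needs SLURM):
#   output=$(sbatch job-script.sh)
#   squeue -j "$(job_id_from_sbatch "$output")"

# Demonstration with a sample confirmation line:
job_id_from_sbatch "Submitted batch job 6760460"
```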

Prepare job scripts

Job scripts are text files saved with the '.sh' extension, for example 'job-script.sh'. A job script looks something like this:

   #!/bin/bash
   ####### Reserve computing resources #############
   #SBATCH --nodes=1
   #SBATCH --ntasks=1
   #SBATCH --time=01:00:00
   #SBATCH --mem=1G
   #SBATCH --partition=cpu2019
   ####### Set environment variables ###############
   module load python/anaconda3-2018.12
   ####### Run your script #########################
   python myscript.py

The first line, "#!/bin/bash", tells the system to interpret the file as a bash script.

It is followed by lines that start with '#SBATCH', which pass options to SLURM. You may add as many #SBATCH directives as needed to reserve computing resources for your task. The above example requests a single node, 1 task and 1 GB of RAM for one hour on the cpu2019 partition.

Next, set up the environment, either by loading modules that are centrally installed on ARC or by exporting the path to software in your home directory. The above example loads an available Python module.
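The two options can be sketched as follows. The module name is the one from the example job script, while "$HOME/software/bin" is a hypothetical install location chosen only for illustration; substitute the directory where your software actually lives.

```shell
#!/bin/bash
# Option 1: load a centrally installed module (name taken from the example
# job script). Guarded so the snippet also runs where the module command
# is not available.
if command -v module >/dev/null 2>&1; then
    module load python/anaconda3-2018.12
fi

# Option 2: put software from your home directory on the search path.
# "$HOME/software/bin" is a hypothetical install location.
export PATH="$HOME/software/bin:$PATH"
```

Prepending to PATH, rather than appending, means your own installation is found before any system copy with the same name.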

Finally, include the Linux command to execute the local script.

Note that omitting part of a resource request (most notably time and memory) leads to poorly sized allocations, because the defaults are not appropriate for most cases.
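One way to catch such omissions before submitting is a small pre-submission check. The sketch below is not an ARC or SLURM tool: the function name 'check_job_script' is hypothetical, and it only looks for the long forms '--time' and '--mem' at the start of '#SBATCH' lines, so a script that uses other spellings (for instance '-t' or '--mem-per-cpu') would be reported as missing them.

```shell
#!/bin/bash
# Hypothetical pre-submission check: warn when a job script has no
# "#SBATCH --time" or "#SBATCH --mem" line. Only these long forms are checked.
check_job_script() {
    script="$1"
    missing=0
    for opt in --time --mem; do
        # Match e.g. "#SBATCH --mem=1G" or "#SBATCH --mem 1G", but not "--mem-per-cpu".
        if ! grep -Eq "^#SBATCH[[:space:]]+${opt}(=|[[:space:]])" "$script"; then
            echo "missing: #SBATCH ${opt}"
            missing=1
        fi
    done
    return "$missing"
}

# Usage: check_job_script job-script.sh && sbatch job-script.sh
```

With this usage line, the job is submitted only if both directives are present.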