ARC Cluster Guide

From RCSWiki
Revision as of 05:06, 13 March 2020 by Phillips (talk | contribs) (Phillips moved page ARC Quickstart Guide to ARC Guide without leaving a redirect: Remove QuickStart branding)

This guide gives an overview of the ARC (Advanced Research Computing) cluster at the University of Calgary.

It is intended to be read by new account holders getting started on ARC, covering such topics as the hardware and performance characteristics, available software, usage policies and how to log in and run jobs.

For ARC-related questions not answered here, please write to support@hpc.ucalgary.ca .

Cybersecurity awareness at the U of C

Please note that there are typically about 950 phishing attempts targeting University of Calgary accounts each month. This is just a reminder to be careful about computer security issues, both at home and at the University. See [1] for more information, such as tips for secure computing and how to report suspected security problems.

Introduction

ARC is a cluster composed primarily of Linux-based computers repurposed from several older clusters (Breezy, Lattice, Parallel) that were formerly offered to researchers from across Canada. In addition, a large-memory compute node (Bigbyte) was salvaged from the now-retired local Storm cluster. In January 2019, a major addition of modern hardware was purchased for ARC.

In this new configuration, as ARC, the computational resources of these clusters are now being restricted to research projects based at the University of Calgary. ARC is meant to supplement the resources available to researchers through Compute Canada. You can read about those machines at the Compute Canada web site.

The ARC cluster can be used for running large numbers (hundreds) of concurrent serial (one-core) jobs, OpenMP or other thread-based, shared-memory parallel codes using up to 40 threads per job (or even 80 on one large node), or distributed-memory (MPI-based) parallel codes using up to hundreds of cores. Software that makes use of Graphics Processing Units (GPUs) can also be used.

Accounts

If you have a project you think would be appropriate for ARC, please write to support@hpc.ucalgary.ca . To help you get started, please mention what software you plan to use.

Note that University of Calgary Information Technologies computing account credentials are used for ARC (the same as used for University of Calgary email accounts). If you don't have a University of Calgary email address, please register for one at https://itregport.ucalgary.ca/ and include your email address in your request for an ARC account.

If you don't have University of Calgary credentials because you are external to the University you may still apply for an account if you are collaborating on a project with a University of Calgary faculty member. In your email, please explain your situation and mention the project leader involved.

Hardware

Processors

Besides login and administrative servers, the ARC hardware consists of compute nodes of several types. As discussed in more detail in the Using ARC section below, when submitting jobs to run on the cluster you can specify a partition parameter to select the type of hardware that is most appropriate for your work. In the hardware descriptions below, the related partition name is indicated.

Hardware installed January 2019 and later

  • General purpose nodes (cpu2019, apophis, apophis-bf, razi and razi-bf partitions): These are 40-core nodes, each node having 2 sockets. Each socket has an Intel Xeon Gold 6148 20-core processor, running at 2.4 GHz. The 40 cores on the individual compute nodes share about 190 GB of RAM (memory), but jobs should request no more than 185000 MB.
  • GPU (Graphics Processing Unit)-enabled nodes (gpu-v100 partition): These are 40-core nodes, each node having 2 sockets. Each socket has an Intel Xeon Gold 6148 20-core processor, running at 2.4 GHz. The 40 cores on the individual compute nodes share about 750 GB of RAM (memory), but jobs should request no more than 753000 MB. In addition, each node has two Tesla V100-PCIE-16GB GPUs.
  • Large-memory nodes (bigmem partition): These are 80-core nodes with 4 sockets. Each socket has an Intel Xeon Gold 6148 20-core processor, running at 2.4 GHz. The 80 cores on the individual compute nodes share about 3 TB of RAM (memory), but jobs should request no more than 3000000 MB.

Legacy hardware migrated from older clusters

While the hardware described in this section is still quite useful for many codes, performance may typically be half or less that provided by the newer ARC hardware described above. Also, we expect more frequent hardware failures for these partitions as the components reach the end of their lifetimes.

  • Nodes moved from the former Hyperion cluster at the Cumming School of Medicine (cpu2013 partition): These are 16-core nodes, based on Intel Xeon E5-2670 8-core 2.6 GHz processors. The 16 cores associated with one of the individual nodes share about 126 GB of RAM (memory), but jobs should request no more than 120000 MB.
  • Nodes moved from the former Lattice cluster (lattice and single partitions): These are 8-core nodes, each node having 2 sockets. Each socket has an Intel Xeon L5520 (Nehalem) quad-core processor, running at 2.27 GHz. The 8 cores associated with one of the individual nodes share 12 GB of RAM (memory), but jobs should request no more than 12000 MB.
  • Non-GPU nodes moved from the former Parallel cluster (parallel partition): These are 12-core compute nodes based on the HP Proliant SL390 server architecture. Each node has 2 sockets, each of which has a 6-core Intel E5649 (Westmere) processor, running at 2.53 GHz. The 12 cores associated with one compute node share 24 GB of RAM, but jobs should request no more than 23000 MB.
  • GPU (Graphics Processing Unit)-enabled nodes moved from the former Parallel cluster (gpu partition): These 12-core nodes are similar to the non-GPU Parallel nodes described above, but also have 3 NVIDIA Tesla M2070 GPUs. Each GPU has about 5.5 GB of memory and what is known as Compute Capability 2.0 (which means that 64-bit floating point calculation is supported, along with other capabilities that make these graphics boards suitable for general-purpose use, beyond just graphics applications). However, the GPUs are typically too old to support modern machine-learning software, such as TensorFlow.
  • Nodes moved from the former Breezy cluster (breezy partition): These are 24-core nodes, each with four sockets having 6-core AMD Istanbul processors running at 2.4 GHz. The 24 cores associated with one compute node share 256 GB of RAM, but it is recommended that jobs request at most 255000 MB. Update 2019-11-27 - the breezy partition nodes are being repurposed as a cluster to support teaching and learning activities and are no longer available as part of ARC.
  • Bigbyte large-memory node moved from the former Storm cluster (bigbyte partition): This is a single 32-core node with four 8-core Intel Xeon E7-4830 processors running at 2.13 GHz. These processors share 1 TB of memory, of which you can request up to 1000000 MB.

Interconnect

The compute nodes within the cpu2019, apophis, apophis-bf, razi and razi-bf partitions communicate via a 100 Gbit/s Omni-Path network. The compute nodes within the lattice and parallel cluster partitions use an InfiniBand 4X QDR (Quad Data Rate) 40 Gbit/s switched fabric, with a two to one blocking factor. All those partitions are suitable for multi-node MPI parallel processing. The breezy partition, however, has a slower, Ethernet-based connection between nodes, so multi-node jobs should not be run on the breezy partition nodes.

Storage

About a petabyte of raw disk storage is available to the ARC cluster, but for error checking and performance reasons, the amount of usable storage for researchers' projects is considerably less than that. From a user's perspective, the total amount of storage is less important than the individual storage limits. As described below, there are three storage areas: home, scratch and work, with different limits and usage policies.

Home file system: /home

There is a per-user quota of 500 GB under /home. This limit is fixed and cannot be increased. Each user has a directory under /home, which is the default working directory when logging in to ARC. It is expected that most researchers will be able to do their work from within their home directories, but, there are two options (/work and /scratch) for accessing more space.

Note on file sharing: Due to security concerns, permissions set on your home directory with the chmod command, to allow others to share your files, will be automatically removed by a system monitoring process unless an explicit exception is made. If you need to share files with other researchers on the ARC cluster, please write to support@hpc.ucalgary.ca to ask for such an exception.

Scratch file system for large job-oriented storage: /scratch

Associated with each job, under the /scratch directory, a subdirectory is created that can be referenced in job scripts as /scratch/${SLURM_JOB_ID}. You can use that directory for temporary files needed during the course of a job. Up to 30 TB of storage may be used, per user (total for all your jobs) in the /scratch file system. Deletion policy: data in /scratch associated with a given job will be deleted automatically, without exception, five days after the job finishes.
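As a minimal sketch of how a job script might use this directory, the example below runs a placeholder calculation in the job's scratch directory and copies only the final output elsewhere before the job ends. The function name run_in_scratch, the file names tmp.dat and result.txt, and the results directory under ${HOME} are assumptions made for illustration; they are not ARC conventions.

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

# Run a calculation in a scratch directory, then copy the output somewhere
# permanent. Both directory arguments are supplied by the caller.
run_in_scratch() {
    local scratch_dir="$1"   # e.g. /scratch/${SLURM_JOB_ID}
    local results_dir="$2"   # hypothetical location under /home or /work

    mkdir -p "$scratch_dir" "$results_dir" || return 1
    (
        cd "$scratch_dir" || exit 1
        # Placeholder for the real program: one temporary and one final file.
        echo "intermediate data" > tmp.dat
        echo "final answer" > result.txt
    ) || return 1
    # Copy only what must be kept; /scratch data for a job is deleted
    # five days after the job finishes.
    cp "$scratch_dir/result.txt" "$results_dir/"
}

# Inside a Slurm job, SLURM_JOB_ID is set, so the scratch path can be built.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    run_in_scratch "/scratch/${SLURM_JOB_ID}" "${HOME}/results/${SLURM_JOB_ID}"
fi
```

Outside of a job the script defines the function but does nothing, so it can be checked on the login node without touching /scratch.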

Work file system for larger projects: /work

If you need more space than provided in /home and the /scratch job-oriented space is not appropriate for your case, please write to support@hpc.ucalgary.ca with an explanation, including an indication of how much storage you expect to need and for how long. If approved, you will then be assigned a directory under /work with an appropriately large quota.

Backup policy: you are responsible for your own backups. Many researchers will have accounts with Compute Canada and may choose to back up their data there (the Project file system accessible through the Cedar cluster would often be used). We can explain more about this option if you write to support@hpc.ucalgary.ca .

Software

Look for installed software under /global/software and through the module avail command. At this time only GNU compilers are available, but Intel compiler licenses are being arranged. The environment for some of the installed software is set up through the module command. An overview of modules on WestGrid is largely applicable to ARC.

To list available modules, type:

module avail

So, for example, to load a module for Python use:

module load python/anaconda-3.6-5.1.0

and to remove it use:

module remove python/anaconda-3.6-5.1.0

To see currently loaded modules, type:

module list

Unlike some clusters, there are no modules loaded by default. So, for example, to use Intel compilers, or to use Open MPI parallel programming, you must load an appropriate module.

Write to support@hpc.ucalgary.ca if you need additional software installed.

Using ARC

Logging in

To log in to ARC, connect to arc.ucalgary.ca using an ssh (secure shell) client. For more information about connecting and setting up your environment, the WestGrid Guide for New Users may be helpful. Note that connections are accepted only from on-campus IP addresses. You can connect from off-campus by using Virtual Private Network (VPN) software available from Information Technologies.

Storage

Please review the Storage section above for important policies and advice regarding file storage and file sharing.

Working interactively

ARC uses the Linux operating system. The program that responds to your typed commands and allows you to run other programs is called the Linux shell. There are several different shells available, but, by default you will use one called bash. It is useful to have some knowledge of the shell and a variety of other command-line programs that you can use to manipulate files. If you are new to Linux systems, we recommend that you work through one of the many online tutorials that are available, such as the UNIX Tutorial for Beginners provided by the University of Surrey. The tutorial covers such fundamental topics, among others, as creating, renaming and deleting files and directories, how to produce a listing of your files and how to tell how much disk space you are using. For a more comprehensive introduction to Linux, see The Linux Command Line.

The ARC login node may be used for such tasks as editing files, compiling programs and running short tests while developing programs (under 15 minutes, say). Processors may also be reserved for interactive sessions using the salloc command, with the resources needed specified through arguments on the salloc command line.
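For illustration, an interactive session request might look like the sketch below. The resource values (4 cores and 16000 MB on one cpu2019 node for one hour) and the function name interactive_session are assumptions chosen for the example; adjust them to what your work needs.

```shell
# Request 4 cores and 16000 MB on one cpu2019 node for one hour.
# Extra salloc options can be appended as arguments to the function.
interactive_session() {
    salloc --partition=cpu2019 --nodes=1 --ntasks=1 \
           --cpus-per-task=4 --mem=16000 --time=01:00:00 "$@"
}

# salloc exists only where Slurm is installed, such as the ARC login node.
if command -v salloc >/dev/null 2>&1; then
    interactive_session
fi
```

When the allocation is granted, salloc starts a shell in which the reserved resources can be used; exiting that shell releases them.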

Running non-interactive jobs (batch processing)

Production runs and longer test runs should be submitted as (non-interactive) batch jobs, in which commands to be executed are listed in a script (text file). Batch job scripts are submitted using the sbatch command, part of the Slurm job management and scheduling software. #SBATCH directive lines at the beginning of the script are used to specify the resources needed for the job (cores, memory, run time limit and any specialized hardware needed).
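A minimal sketch of this workflow is shown below: it writes a small job script and submits it with sbatch. The file name hello_job.slurm, the job name and the resource values are assumptions for the example only.

```shell
# Write a minimal job script; the file name and resource values are examples.
cat > hello_job.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1000
#SBATCH --time=00:05:00

echo "Running on $(hostname) as job ${SLURM_JOB_ID}"
EOF

# Submit it where Slurm is available; sbatch reports the assigned job ID.
if command -v sbatch >/dev/null 2>&1; then
    sbatch hello_job.slurm
fi
```

Unless an output file is specified, Slurm's default is to write the job's output to a file named slurm-<jobid>.out in the directory from which the job was submitted.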

Most of the information on the Running Jobs page on the Compute Canada web site is also relevant for submitting and managing batch jobs and reserving processors for interactive work on ARC. One major difference between running jobs on the ARC and Compute Canada clusters is in selecting the type of hardware that should be used for a job. On ARC, you choose the hardware to use primarily by specifying a partition, as described below.

Selecting a partition

The type of computer on which a job can or should be run is determined by characteristics of your software, such as whether it supports parallel processing, and by simulation or data-dependent factors such as the amount of memory required. If the program you are running uses MPI (Message Passing Interface) for parallel processing, which allows the memory usage to be distributed across multiple compute nodes, then the memory required per MPI process is an important factor. If you are running a serial code (that is, it is not able to use multiple CPU cores) or one that is parallelized with OpenMP or other thread-based techniques that restrict it to running on just a single compute node, then the total memory required is the main factor to consider. If your program can make use of graphics processing units, then that will be the determining factor. If you have questions about which ARC hardware to use, please write to support@hpc.ucalgary.ca and we would be happy to discuss this with you.

Once you have decided what type of hardware best suits your calculations, you can select it on a job-by-job basis by including the partition keyword for an SBATCH directive in your batch job. The tables below summarize the characteristics of the various partitions.

If you omit the partition specification, the system will try to assign your job to appropriate hardware based on other aspects of your request, but for more control you can specify one or more partitions yourself. You are allowed to specify a comma-separated list of partitions.

In some cases, you really should specify the partition explicitly. For example, if you are running single-node jobs with thread-based parallel processing requesting 8 cores you could use:

#SBATCH --mem=0 
#SBATCH --nodes=1 
#SBATCH --ntasks=1 
#SBATCH --cpus-per-task=8 
#SBATCH --partition=single,lattice

Since the single and lattice partitions both have the same type of hardware, it is appropriate to list them both. Specifying --mem=0 allows you to use all the available memory (12000 MB) on the compute node assigned to the job. Since the compute nodes in those partitions have 8 cores each and you will be using them all, you need not be concerned about other users' jobs sharing the memory with your job. However, if you didn't explicitly specify the partition in such a case, the system would try to assign your job to the cpu2019 or similar partition. Those nodes have 40 cores and much more memory than the single and lattice partitions. If you specified --mem=0 in such a case, you would be wasting 32 cores of processing. So, if you don't specify a partition yourself, you have to give greater thought to the memory specification to make sure that the scheduler will not assign your job more resources than are needed.

As time limits may be changed by administrators to adjust to maintenance schedules or system load, the values given in the tables are not definitive. See the Time limits section below for commands you can use on ARC itself to determine current limits.

Parameters such as --ntasks-per-node, --cpus-per-task, --mem and --mem-per-cpu also have to be adjusted according to the capabilities of the hardware. The product of --ntasks-per-node and --cpus-per-task should be less than or equal to the number given in the "Cores/node" column. The --mem parameter (or the product of --mem-per-cpu and the number of cores requested per node) should not exceed the "Memory limit" shown. If using whole nodes, you can specify --mem=0 to request the maximum amount of memory per node.
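The arithmetic can be checked before submitting a job. The sketch below is a hypothetical helper (fits_node is not an ARC or Slurm command) that takes the "Cores/node" and "Memory limit" values from the tables below, together with the tasks per node, cores per task and memory per core you intend to request, and reports whether the request fits on one node.

```shell
# Usage: fits_node CORES_PER_NODE MEM_LIMIT_MB TASKS_PER_NODE CPUS_PER_TASK MEM_PER_CPU_MB
# Returns 0 if the per-node request fits within the limits, 1 otherwise.
fits_node() {
    local cores_per_node=$1 mem_limit=$2 tasks=$3 cpus=$4 mem_per_cpu=$5
    local cores_needed=$(( tasks * cpus ))
    local mem_needed=$(( cores_needed * mem_per_cpu ))
    [ "$cores_needed" -le "$cores_per_node" ] && [ "$mem_needed" -le "$mem_limit" ]
}

# cpu2019: 40 cores/node, 185000 MB limit.
# 10 tasks x 4 cores each = 40 cores; 40 cores x 4000 MB = 160000 MB.
if fits_node 40 185000 10 4 4000; then
    echo "request fits on a cpu2019 node"
fi
```

The same request with 5000 MB per core would need 200000 MB and so would not fit on a cpu2019 node.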

Partitions for modern hardware:

Note, MPI codes using this hardware should be compiled with Omni-Path networking support. This is provided by loading the openmpi/2.1.3-opa or openmpi/3.1.2-opa modules prior to compiling.

Partition    Cores/node  Memory limit (MB)  Time limit (h)  GPUs/node
cpu2019      40          185000             168
apophis†     40          185000             168
apophis-bf†  40          185000             5
razi†        40          185000             168
razi-bf†     40          185000             5
bigmem       80          3000000            24
gpu-v100     40          753000             24              2

† The apophis and razi partitions contain hardware contributed to ARC by particular researchers. They should be used only by members of those researchers' groups. However, they have generously allowed their compute nodes to be shared with others outside their research groups for relatively short jobs by specifying the apophis-bf and razi-bf partitions. (In some cases in which a partition is not explicitly specified, these "back-fill" partitions may be automatically selected by the system).

Partitions for legacy hardware:

Partition  Cores/node  Memory limit (MB)  Time limit (h)  GPUs/node
cpu2013    16          120000             168
lattice    8           12000              168
parallel   12          23000              168
breezy‡    24          255000             72
bigbyte‡   32          1000000            24
single     8           12000              168
gpu        12          23000              72              3

‡ Update 2019-11-27 - the breezy and bigbyte partition nodes are being repurposed as a cluster to support teaching and learning activities and are no longer available as part of ARC.

Here are some examples of specifying the various partitions.

As mentioned in the Hardware section above, the ARC cluster was expanded in January 2019. To select the 40-core general purpose nodes specify:

#SBATCH --partition=cpu2019

To run on the Tesla V100 GPU-enabled nodes, use the gpu-v100 partition. You will also need to include an SBATCH directive in the form --gres=gpu:n to specify the number of GPUs, n, that you need. For example, if the software you are running can make use of both GPUs on a gpu-v100 partition compute node, use:

#SBATCH --partition=gpu-v100 --gres=gpu:2

For very large memory jobs (more than 185000 MB), specify the bigmem partition:

#SBATCH --partition=bigmem

If the more modern computers are too busy or you have a job well-suited to run on the compute nodes described in the legacy hardware section above, choose the cpu2013, Lattice or Parallel compute nodes (without graphics processing units) by specifying the corresponding partition keyword:

#SBATCH --partition=cpu2013

or

#SBATCH --partition=lattice

or

#SBATCH --partition=parallel

There is an additional partition called single that provides nodes similar to the lattice partition, but is intended for single-node jobs. Select the single partition with

#SBATCH --partition=single

For single-node jobs requiring more memory or processors than available through the breezy or single partitions, use the bigbyte partition:

#SBATCH --partition=bigbyte

To select the nodes that have GPUs, specify the gpu partition. Use an SBATCH directive in the form --gres=gpu:n to specify the number of GPUs, n, that you need. For example, if the software you are running can make use of all three GPUs on a compute node, use:

#SBATCH --partition=gpu --gres=gpu:3

Time limits

Use a directive of the form

#SBATCH --time=hh:mm:ss

to tell the job scheduler the maximum time that your job might run. You can use the command

scontrol show partitions

to see the current configuration of the partitions including the maximum time limit you can specify for each partition, as given by the MaxTime field. Alternatively, see the TIMELIMIT column in the output from

sinfo
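Since the partition tables above give time limits in hours while the --time directive takes a formatted string, a small conversion can help. The sketch below uses a hypothetical helper, hours_to_slurm_time, to produce the days-hh:mm:ss form that Slurm also accepts, and then lists the current limit for each partition where sinfo is available.

```shell
# Convert a whole number of hours into Slurm's days-hh:mm:ss time format.
hours_to_slurm_time() {
    local hours=$1
    printf '%d-%02d:00:00\n' $(( hours / 24 )) $(( hours % 24 ))
}

hours_to_slurm_time 168   # prints 7-00:00:00
hours_to_slurm_time 5     # prints 0-05:00:00

# On ARC itself, show each partition name (%P) with its time limit (%l).
if command -v sinfo >/dev/null 2>&1; then
    sinfo --format="%P %l"
fi
```

For example, a job that may need the full 168 hours could use #SBATCH --time=7-00:00:00, provided the partition's current limit still allows it.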

Hardware resource and job policy limits

There are limits on the number of cores, nodes and/or GPUs that one can use on ARC at any given time. There is also a limit on the number of jobs that a user can have pending or running at a given time (the MaxSubmitJobs parameter in the command below). The limits are generally applied on a partition-by-partition basis, so, using resources in one partition should not affect the amount you can use in a different partition. To see the current limits you can run the command:

sacctmgr show qos format=Name,MaxWall,MaxTRESPU%20,MaxSubmitJobs

Support

Please send ARC-related questions to support@hpc.ucalgary.ca.