Helix Cluster Guide

From RCSWiki
Revision as of 14:42, 9 April 2020 by Phillips (talk | contribs)

About

      • PLEASE NOTE: The Helix cluster is going to be absorbed by the main UofC cluster, ARC, some time later this year.
If you are planning to apply for a Helix account now, please consider applying for an ARC account instead,
to avoid having to move your work to ARC later, when Helix is retired.


This QuickStart guide gives an overview of the Helix cluster at the University of Calgary.

It is intended to be read by new account holders getting started on Helix, covering such topics as the Helix hardware and performance characteristics, available software, usage policies and how to log in and run jobs.

For Helix-related questions not answered here, please write to support@hpc.ucalgary.ca .

Introduction

Helix is a computing cluster installed in March 2016 as a test environment. It forms Phase 1 of an infrastructure project in support of advanced research computing for the Cumming School of Medicine.

Intended use

Helix is a test environment to explore how a mix of large-memory, general purpose and GPU compute nodes can serve the needs of researchers in the Cumming School of Medicine. It does not provide the extra security measures required for storing and processing restricted (such as patient-identifiable) data. Although it may be used for "real" research projects, it should be kept in mind that it is a small-scale test environment that will not be able to accommodate the needs of all researchers.

Note that Helix runs the Linux operating system and most calculations on Helix will be run through non-interactive batch-mode job scripts. If this kind of environment or these terms are unfamiliar to you, we would be happy to help you get started.

Accounts

If you have a project associated with research in the Cumming School of Medicine that you think would benefit from running computations on Helix, please write to support@hpc.ucalgary.ca to explain a bit about your project and request an account.

Accounts on the Helix cluster use the same credentials (user name and password) as for University of Calgary computing accounts offered by Information Technologies for email and other services.

If you don't have a University of Calgary email address, please register for one at https://itregport.ucalgary.ca/ and include your email address in your request for an ARC account.

Hardware

Processors

Besides login and administrative servers, the Helix hardware consists of 10 general-purpose compute servers (each with 24 CPU cores at 2.5 GHz and 256 GB of RAM), two large-memory servers (each with 64 CPU cores at 2.2 GHz, 2 TB of RAM and 6 TB of local disk), a GPU-accelerated compute server (16 CPU cores at 2.4 GHz, 2 K40m GPUs and 256 GB of RAM) and a special higher-clock-rate server for serial applications or applications requiring very fast disk access (16 CPU cores at 3.2 GHz, 256 GB of RAM and 5.8 TB of fast SSD-based local disk).

Interconnect

The compute nodes communicate via 10-gigabit/s Ethernet connections.

Storage

Initially there was 50 TB of usable disk space allocated for home directories, but in February 2018 this was increased to about 70 TB. The default per-user quota (disk usage limit) is 1 TB, with an option to extend the storage space on an individual basis.

Update 2017-05-19: On each of the two large-memory nodes and the GPU-enabled node there is 21.8 TB of special high-speed local disk, accessible as /local_scratch, that can be used for temporary storage.
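As an illustration of using node-local disk for temporary storage, the following is a minimal sketch of the usual pattern: create a private working directory on the local disk, do the work there, copy the results back to permanent storage, and clean up. The fallback to TMPDIR or /tmp, the directory name prefix helix_job_ and the file name scratch_output.txt are assumptions made for the example, not Helix conventions.

```shell
#!/bin/bash
# Use /local_scratch when it exists and is writable; otherwise fall back to
# TMPDIR or /tmp (the fallback is an assumption, not a Helix policy).
scratch_root="/local_scratch"
if [ ! -d "$scratch_root" ] || [ ! -w "$scratch_root" ]; then
    scratch_root="${TMPDIR:-/tmp}"
fi

# Slurm sets SLURM_SUBMIT_DIR inside a job; default to the current directory.
results_dir="${SLURM_SUBMIT_DIR:-$PWD}"

# Private working directory, removed automatically when the script exits.
workdir=$(mktemp -d "${scratch_root}/helix_job_XXXXXX")
trap 'rm -rf "$workdir"' EXIT

cd "$workdir" || exit 1
echo "intermediate data" > scratch_output.txt    # stands in for the real computation

# Copy anything worth keeping back to permanent storage before the cleanup runs.
cp scratch_output.txt "${results_dir}/scratch_output.txt"
```

Because the local disk is only visible on the node where the job runs, anything not copied back before the script exits is lost.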

Software

Look for installed software under /global/software and through the module avail command. GNU and Intel compilers are available. The environment for some of the installed software is set up with the module command. An overview of modules on WestGrid is largely applicable to Helix.

To list available modules, type:

module avail

For example, to load a module for Java, use:

module load java/1.8.0

and to remove it use:

module remove java

To see currently loaded modules, type:

module list

By default, modules are installed on Helix to set up Intel compilers and to support parallel programming with MPI (including the determination of which compilers are used by the wrapper scripts mpicc, mpif90, etc.).

Write to support@hpc.ucalgary.ca if you need additional software installed.

Using Helix

To log in to Helix, connect to helix.hpc.ucalgary.ca using an ssh (secure shell) client. For more information about connecting and setting up your environment, the WestGrid QuickStart Guide for New Users may be helpful. Note that connections are accepted only from on-campus IP addresses. You can connect from off-campus by using Virtual Private Network (VPN) software available from Information Technologies.

Prior to an update in October 2018, batch jobs were submitted through TORQUE and scheduled using Maui. Since the update, jobs have been managed by system software called Slurm. A "Rosetta stone" page shows how some of the commands for managing jobs and directives for requesting resources in a TORQUE environment map to the corresponding functionality under Slurm. If you need help converting existing job scripts, please contact us at support@hpc.ucalgary.ca.

The Helix login node may be used for short interactive runs (under 15 minutes) during development. Production runs and longer test runs should be submitted as batch jobs, in which the commands to be executed are listed in a script (text file). Batch job scripts are submitted using the sbatch command, part of the Slurm job management and scheduling software. #SBATCH directive lines at the beginning of the script specify the resources needed for the job (cores, memory, run time limit and any specialized hardware needed).
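A minimal sketch of such a job script is shown here. The job name and the resource values are placeholders chosen for illustration, not recommended settings, and the file name job.sh used in the note that follows is hypothetical.

```shell
#!/bin/bash
#SBATCH --job-name=example        # placeholder job name
#SBATCH --nodes=1                 # run on a single node
#SBATCH --ntasks=1                # one task (process)
#SBATCH --cpus-per-task=4         # CPU cores for that task
#SBATCH --mem=8G                  # memory for the job
#SBATCH --time=02:00:00           # run time limit (hh:mm:ss)

# Slurm sets these variables inside a job; the fallbacks let the script
# also be tried by hand outside of Slurm.
job_id="${SLURM_JOB_ID:-none}"
ncpus="${SLURM_CPUS_PER_TASK:-1}"

echo "Job ${job_id} running on $(hostname) with ${ncpus} CPU core(s)"

# Replace the line below with the real work, for example:
# ./my_program --threads "${ncpus}"
```

If the script is saved as job.sh, it can be submitted with sbatch job.sh, and the state of your jobs can be checked with squeue -u $USER.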

Processors may also be reserved for interactive sessions using the salloc command. In that case no job script is used, and the resources needed are specified as arguments on the salloc command line.

Non-interactive jobs on Helix have a maximum walltime limit of 7 days, except in the bigmem, fast and gpu partitions, where the limit is 24 hours. Interactive jobs are limited to 3 hours.
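Since interactive sessions are limited to 3 hours, a request can be checked against that limit before it is handed to salloc. The following is a minimal sketch of one way to do so. The function name start_interactive and the DRY_RUN variable are hypothetical helpers introduced for this example. The salloc options used (--ntasks, --cpus-per-task and --time, with the time given in minutes) are standard Slurm options.

```shell
# Hypothetical helper: request an interactive session of at most 3 hours.
# Usage: start_interactive MINUTES [CORES]
start_interactive() {
    local minutes="$1"
    local cores="${2:-1}"

    # Reject anything that is not a whole number of minutes.
    case "$minutes" in
        ''|*[!0-9]*) echo "MINUTES must be a whole number" >&2; return 2 ;;
    esac

    if [ "$minutes" -gt 180 ]; then
        echo "Interactive jobs on Helix are limited to 3 hours (180 minutes)" >&2
        return 1
    fi

    # With DRY_RUN=1 the command is printed instead of being run.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo salloc --ntasks=1 --cpus-per-task="$cores" --time="$minutes"
    else
        salloc --ntasks=1 --cpus-per-task="$cores" --time="$minutes"
    fi
}
```

For example, DRY_RUN=1 start_interactive 60 2 prints the salloc command for a one-hour session with two cores without submitting it.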

Most of the information on the Running Jobs page on the Compute Canada web site is also relevant for submitting and managing batch jobs and reserving processors for interactive work on Helix. One major difference between running jobs on Helix and on the Compute Canada clusters is in selecting the type of hardware that should be used for a job. Access to the various types of compute servers (nodes) on Helix is controlled through partitions. Partitions are selected using a directive in your batch job script of the form:

#SBATCH --partition=partition_name

where the partition_name is one of normal (the default partition, which will be used if you omit the --partition directive), bigmem, fast, and gpu.

  • The normal partition provides access to 10 general compute nodes (c2 through c9, c13 and c14), each with 24 CPU cores and 256 GB of RAM.
The time limit is 7 days (168 hours).
There is no need to specify the partition in the job script, as it is the default partition.
  • The bigmem partition provides access to the 2 big memory compute nodes (c10 and c11) with 64 cores and 2 TB of RAM each.
The time limit is 24 hours.
Each of the nodes also has 6 TB of fast local storage mounted on /tmp.
This partition is targeted at memory-intensive jobs, but can be used for shorter general computations as well.
Request it with a #SBATCH --partition=bigmem line in your job script.
  • The fast partition provides access to a single "fast" node (c12) with 16 higher-clocked cores and fast local disk.
The time limit is 24 hours.
The node has 5.8 TB of very fast local storage mounted on /tmp.
This partition is suitable for serial (single CPU) jobs and jobs requiring very fast disk access.
Request it with a #SBATCH --partition=fast line in your job script.
  • The gpu partition provides access to a single node (c1) with 2 NVIDIA K40 GPUs and 16 CPU cores.
The time limit is 24 hours.
The node can only run one job at a time, allocating both GPUs to the same job.
This partition is suitable for GPU-enabled jobs.
Request it with a #SBATCH --partition=gpu --gres=gpu:n line in your job script, where n=1 or 2.
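Putting the partition directives together, a job script for the gpu partition might look like the following minimal sketch. The memory and time values are placeholders kept within the 24-hour limit. The script assumes that Slurm exports CUDA_VISIBLE_DEVICES for the allocated GPUs and that nvidia-smi is available on the GPU node, neither of which is confirmed here for Helix.

```shell
#!/bin/bash
#SBATCH --partition=gpu           # select the GPU node
#SBATCH --gres=gpu:1              # request one of the two GPUs
#SBATCH --mem=16G                 # placeholder memory request
#SBATCH --time=12:00:00           # must not exceed the 24-hour limit

# Assumption: Slurm sets CUDA_VISIBLE_DEVICES for GPU allocations.
# The fallback lets the script also be tried outside of a job.
gpus="${CUDA_VISIBLE_DEVICES:-none}"
echo "Visible GPU(s): ${gpus}"

# Report GPU status if the NVIDIA tool is available on this node.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi || true
fi

# Replace the line below with the real GPU program, for example:
# ./my_gpu_program
```

For the bigmem or fast partitions, the same layout applies with the --partition value changed and the --gres line removed.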

Support

Send Helix-related questions to support@hpc.ucalgary.ca.

Updates:

2016-04-01 - Page created.
2017-05-19 - Added note regarding fast local disk on the GPU and high-memory nodes.
2018-03-01 - Noted that nodes=1:ppn=n should be used instead of procs=n.
2018-07-31 - Noted that the default user storage quota is 1 TB (raised from 500 GB in February 2018).
2018-10-12 - Change from TORQUE/Maui to Slurm for job management.