GLaDOS Cluster Guide

From RCSWiki
Revision as of 14:39, 9 April 2020 by Phillips (talk | contribs)

This guide gives an overview of the GlaDOS cluster at the University of Calgary.

It is intended to be read by new account holders getting started on GlaDOS, covering such topics as the hardware and performance characteristics, available software, usage policies and how to log in and run jobs.

For GlaDOS-related questions not answered here, please write to support@hpc.ucalgary.ca .

Introduction

GlaDOS is a computing cluster purchased through a collaboration of researchers in Biological Sciences, Chemistry and Geoscience and maintained by the Research Computing Services group within Information Technologies. Faculty members involved are:

  • Department of Biological Sciences
    • Sergei Noskov
    • Peter Tieleman
  • Department of Chemistry
    • Justin MacCallum
  • Department of Geoscience
    • Jan Dettmer
    • David Eaton
    • Hersh Gilbert
    • Kris Innanen
    • Daniel Trad

Accounts

If you have a project associated with one of the faculty members listed above and would like to use GlaDOS, please write to support@hpc.ucalgary.ca, copying the relevant faculty member on your request. To help you get started, please mention what software you plan to use. GlaDOS uses University of Calgary Information Technologies computing account credentials (the same as those used for University of Calgary email accounts). If you don't have a University of Calgary email address, please register for one at https://itregport.ucalgary.ca/ and include that email address in your request for a GlaDOS account.

Hardware

Processors

Besides login and administrative servers, the GlaDOS hardware consists of a number of different compute nodes.

---------------------------------------------------------------------------------------------------------------
Node name  Processors per node           CPU cores   Available memory   GPUs per node
                                         per node    per node (MB)
---------------------------------------------------------------------------------------------------------------
g1-g8      Two 2.4 GHz Intel Xeon CPU    20          250000             Four GeForce GTX 1080 Ti (Pascal) GPUs,
           E5-2640 V4 processors                                        each with 11 GB of memory

g9-g24     Two 2.1 GHz Intel Xeon CPU    16          62000              Four GeForce GTX 1080 Ti (Pascal) GPUs,
           E5-2620 V4 processors                                        each with 11 GB of memory

g25-g34    Two 2.4 GHz Intel Xeon CPU    20          62000              None
           E5-2640 V4 processors
---------------------------------------------------------------------------------------------------------------

Interconnect

The compute nodes communicate via a 100 Gbit/s Intel Omni-Path network.

Storage

There is approximately 1 PB (petabyte) of storage. Note that this storage is not backed up, so users are responsible for transferring any data they want to keep to their own systems (or perhaps to Compute Canada storage) for safekeeping.

Software

Look for installed software under /global/software and through the module avail command. GNU and Intel compilers are available. The environment for some of the installed software is set up with the module command. An overview of modules on WestGrid is largely applicable to GlaDOS.

To list available modules, type:

module avail

For example, to load a module for Python, use:

module load python/anaconda3-5.0.1

and to remove it use:

module remove python/anaconda3-5.0.1

To see currently loaded modules, type:

module list

By default, modules are installed on GlaDOS to set up Intel compilers and to support parallel programming with MPI (including the determination of which compilers are used by the wrapper scripts mpicc, mpif90, etc.).

Write to support@hpc.ucalgary.ca if you need additional software installed.

Using GlaDOS

To log in to GlaDOS, connect to glados.ucalgary.ca using an ssh (secure shell) client. For more information about connecting and setting up your environment, the WestGrid QuickStart Guide for New Users (external link) may be helpful. Note that connections are accepted only from on-campus IP addresses. You can connect from off-campus by using Virtual Private Network (VPN) software available from Information Technologies (external link).

The GlaDOS login node may be used for short interactive runs during development (under 15 minutes). Production runs and longer test runs should be submitted as batch jobs using the sbatch command. Batch jobs are submitted through the Slurm job management and scheduling software. Processors may also be reserved for interactive sessions, in a similar manner to batch jobs, using the salloc command.

Most of the information on the Running Jobs (external link) page on the Compute Canada web site is also relevant for submitting and managing batch jobs and reserving processors for interactive work on GlaDOS. One difference is that no --account specification is needed for GlaDOS. Another relates to partitions, as explained below.
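As a rough sketch of the batch workflow described above, the following submits a job script and lists your queued jobs. The file name job.slurm is a hypothetical placeholder, and the guard simply skips submission when run somewhere without Slurm.

```shell
#!/bin/bash
# job.slurm is a hypothetical job script name; substitute your own file.
job_script="job.slurm"

if command -v sbatch >/dev/null 2>&1 && [ -f "$job_script" ]; then
    # Submit the batch job; no --account option is needed on GlaDOS.
    sbatch "$job_script"
    # List your own pending and running jobs.
    squeue -u "$USER"
else
    echo "sbatch or ${job_script} not found; run this from the GlaDOS login node"
fi
```

A running or pending job can be cancelled with scancel followed by the job ID that sbatch reports.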

Choosing a partition

For the purposes of assigning resources, GlaDOS hardware is divided into a number of logical partitions. During system maintenance ending 2019-09-27, the partitions were reorganized to improve scheduling of GPU resources: the glados and short partitions were removed and three new partitions, glados-gpu, glados16 and glados12, were introduced.

Nodes g1 through g24 have 4 GPUs each and are all grouped together in a partition called glados-gpu. The large-memory 20-core nodes g1 through g8 are also accessible through a partition named glados16. Similarly the small-memory 16-core nodes g9 through g24 can be accessed through the name glados12. As explained below, the names of these partitions were chosen to indicate the number of cores per node in these partitions that can be used for CPU-only jobs.

Typical software that makes use of GPUs requires one CPU core per GPU requested. To prevent CPU-only jobs from occupying every core on a node, and thereby blocking jobs that need the more expensive GPU hardware, CPU-only jobs are not allowed to use 4 of the cores on each GPU node. Consequently, only 16 of the 20 cores on the glados16 nodes can be used for CPU-only jobs. Similarly, only 12 of the 16 cores on the glados12 nodes can be used for CPU-only jobs.

There is a separate partition called geo, intended for use by the Geoscience-based projects, which consists of compute nodes g25 through g34. These nodes do not have GPUs.

CPU-only jobs requesting more than 16 cores will be rejected unless you specifically request the geo partition.
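The per-node core limits above follow from subtracting one reserved core per GPU from the node's total. A small sketch of that arithmetic is shown here; the function name cpu_only_cores is hypothetical, and the figure of 20 usable cores for geo is an assumption based on those nodes having no GPUs.

```shell
#!/bin/bash
# Hypothetical helper: cores usable by CPU-only jobs on one node.
# One core per GPU is held back for GPU jobs, as described in the text.
cpu_only_cores() {
    local total_cores="$1" gpus="$2"
    echo $(( total_cores - gpus ))
}

echo "glados16: $(cpu_only_cores 20 4) cores per node"   # g1-g8
echo "glados12: $(cpu_only_cores 16 4) cores per node"   # g9-g24
echo "geo:      $(cpu_only_cores 20 0) cores per node"   # g25-g34 (assumed: nothing reserved)
```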

When reserving computing resources, a partition can be specified with the --partition=partition_name argument on the salloc command or with a directive of the form

#SBATCH --partition=partition_name

in batch job scripts submitted with sbatch. The partition_name is one of glados-gpu, glados16, glados12 or geo.
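For illustration, a minimal CPU-only job script that selects a partition might look like the following. The job name, core count and time limit are illustrative assumptions, and the script body only reports where it is running.

```shell
#!/bin/bash
# Hypothetical CPU-only job script; all values below are illustrative.
#SBATCH --job-name=cpu_example
# Partition is one of glados-gpu, glados16, glados12 or geo.
#SBATCH --partition=glados16
#SBATCH --nodes=1
#SBATCH --ntasks=1
# At most 16 cores per glados16 node are available to CPU-only jobs.
#SBATCH --cpus-per-task=16
#SBATCH --time=01:00:00

# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given; fall back to 1 elsewhere.
cores="${SLURM_CPUS_PER_TASK:-1}"
echo "Running on $(uname -n) with ${cores} core(s)"
```

For an interactive session, the same options can be passed on the salloc command line, for example salloc --partition=glados16 --cpus-per-task=4 --time=01:00:00.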

Using GPUs

For jobs that require GPUs, use a directive of the form

#SBATCH --gres=gpu:n

where n=1, 2, 3, or 4.
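A sketch of a single-GPU job script combining the GPU request with a partition is given below. It assumes the cluster's Slurm configuration exports CUDA_VISIBLE_DEVICES for GPU jobs, which is common but not confirmed in this guide; the other values are illustrative.

```shell
#!/bin/bash
# Hypothetical single-GPU job script; values are illustrative.
#SBATCH --partition=glados-gpu
# Request one of the four GPUs on a node.
#SBATCH --gres=gpu:1
# One CPU core per GPU, matching typical GPU software.
#SBATCH --cpus-per-task=1
#SBATCH --time=00:10:00

# Assumption: Slurm exports CUDA_VISIBLE_DEVICES for jobs that request GPUs.
gpus="${CUDA_VISIBLE_DEVICES:-none}"
echo "GPUs visible to this job: ${gpus}"
```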

Time limits

The maximum job time limit that can be requested is 7 days for any of the partitions. The time limit for a batch job is specified with a directive of the form

#SBATCH --time=hh:mm:ss
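As a worked example of the 7-day limit, 7 days is 168 hours, so the largest request in this format is 168:00:00. The helper below, whose name hours_to_time is hypothetical, converts a whole number of hours to hh:mm:ss and rejects anything over that limit.

```shell
#!/bin/bash
# Hypothetical helper: format whole hours as hh:mm:ss, enforcing the 7-day maximum.
max_hours=$((7 * 24))   # 168 hours

hours_to_time() {
    local hours="$1"
    # Accept only non-negative integers.
    case "$hours" in
        ''|*[!0-9]*) echo "hours must be a non-negative integer" >&2; return 1 ;;
    esac
    # Force base 10 so inputs like "08" are not read as octal.
    hours=$((10#$hours))
    if [ "$hours" -gt "$max_hours" ]; then
        echo "requested ${hours} h exceeds the ${max_hours} h (7-day) limit" >&2
        return 1
    fi
    printf '%02d:00:00\n' "$hours"
}

# Print the directive for a 36-hour job.
echo "#SBATCH --time=$(hours_to_time 36)"
```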

Support

Send GlaDOS-related questions to support@hpc.ucalgary.ca.

History

2018-03-15 - Page created.
2018-08-01 - Added hardware and partition descriptions.
2019-09-30 - New partition scheme with glados16, glados12, glados-gpu and geo.