TALC Cluster Guide: Difference between revisions

From RCSWiki
Revision as of 20:13, 28 July 2020


This guide gives an overview of the Teaching and Learning Cluster (TALC) at the University of Calgary and is intended for new account holders getting started on TALC. It covers topics such as hardware and performance characteristics, available software, usage policies, and how to log in and run jobs.

Introduction

TALC is a cluster of computers created by Research Computing Services in response to requests for a central computing resource to support academic courses and workshops offered at the University of Calgary. It complements the Advanced Research Computing (ARC) cluster, which is used for research rather than educational purposes. The software environments of the TALC and ARC clusters are very similar, and workflows on the two clusters are identical. What students learn about using TALC will apply directly to ARC should they go on to use ARC for research work.

If you are the instructor for a course that could benefit from using TALC, please review this guide and the TALC Terms of Use, then contact us at support@hpc.ucalgary.ca to discuss your requirements. To ensure that the appropriate software is available, student accounts are in place, and your teaching assistants have received appropriate training, it is best to start this discussion several months before the start of the course.

If you are a student in a course using TALC, please review this guide for basic instructions in using the cluster. Questions should first be directed to the teaching assistants or instructor for your course.

Accounts

TALC account requests are expected to be submitted by the course instructor rather than by individual students. You must have a University of Calgary IT account in order to use TALC. If you do not have a University of Calgary IT account or email address, please register for one at https://itregport.ucalgary.ca/.

Hardware

Processors

The TALC cluster is built from repurposed research clusters that are a few generations old. As a result, individual processor performance will not be comparable to the latest processors but should be sufficient for educational purposes and course work.

There are two types of compute node in the TALC cluster, grouped into separate partitions:

  • cpu24 partition nodes: These are 24-core nodes, each with four sockets holding 6-core AMD Istanbul processors running at 2.4 GHz. The 24 cores of one compute node share 256 GB of RAM, but it is recommended that jobs request at most 255000 MB. The cpu24 partition nodes are connected with a high-speed InfiniBand network.
  • cpu32-bigmem partition node: This is a single 32-core node with four 8-core Intel Xeon CPU E7-4830 processors running at 2.13 GHz. These processors share 1 TB of memory, of which you can request up to 1000000 MB.

Storage

No Backup Policy!

You are responsible for your own backups. Since accounts on TALC and related data are removed shortly after the associated course has finished, you should download anything you need to save to your own computer before the end of the course.

TALC is connected to a network disk storage system. This storage is split across the /home and /scratch file systems.


/home: Home file system

Each user has a directory under /home, which is the default working directory when logging in to TALC. Each home directory has a per-user quota of 500 GB. This limit is fixed and cannot be increased.
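To keep an eye on the quota, a small helper along the following lines can report how much space a directory uses. This is only a sketch: the function name usage_vs_quota is made up for illustration, it assumes the 500 GB quota is counted in binary units, and the figure reported by du may differ from what the storage system itself accounts against your quota.

```shell
#!/bin/bash
QUOTA_GB=500   # per-user /home quota stated in this guide

# Print the space used by a directory (default: $HOME) and its share of
# the quota. usage_vs_quota is a hypothetical helper, not a TALC command.
usage_vs_quota() {
    local dir="${1:-$HOME}"
    local used_kb
    # du -sk prints "<kilobytes><TAB><directory>"; keep the first field.
    used_kb=$(du -sk "$dir" 2>/dev/null | cut -f1)
    local quota_kb=$(( QUOTA_GB * 1024 * 1024 ))
    echo "${dir}: $(( used_kb / 1024 )) MB used, $(( 100 * used_kb / quota_kb ))% of ${QUOTA_GB} GB"
}
```

Calling usage_vs_quota with no argument checks your home directory; on a large directory tree the du scan can take a while.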

Note on file sharing: Due to security concerns, permissions set using chmod on your home directory to allow other users to read or write to it will be automatically reverted by an automated system process unless an explicit exception is made. If you need to share files with other researchers on the ARC cluster, please write to support@hpc.ucalgary.ca to ask for such an exception.

/scratch: Scratch file system for large job-oriented storage

For each job, a subdirectory is created under /scratch that can be referenced in job scripts as /scratch/${SLURM_JOB_ID}. You can use that directory for temporary files needed during the course of a job. Each user may use up to 30 TB of storage in the /scratch file system (total across all of your jobs).

Data in /scratch associated with a given job will be deleted automatically, without exception, five days after the job finishes.
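As a rough sketch of how the per-job scratch directory might be used inside a job script: the helper name job_scratch_dir and the file name results.txt are hypothetical, chosen only for illustration. SLURM_JOB_ID and SLURM_SUBMIT_DIR are environment variables that Slurm sets inside a job.

```shell
#!/bin/bash
# Print the per-job scratch directory, /scratch/<job id>.
# Takes the job ID as an argument, defaulting to SLURM_JOB_ID.
job_scratch_dir() {
    local job_id="${1:-${SLURM_JOB_ID:-}}"
    if [ -z "$job_id" ]; then
        echo "no job ID: not running inside a Slurm job?" >&2
        return 1
    fi
    echo "/scratch/${job_id}"
}

# Inside a job: work in scratch, then copy anything worth keeping back
# to the submission directory before scratch data is deleted.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    scratch=$(job_scratch_dir)
    cd "$scratch" || exit 1
    echo "intermediate data" > results.txt      # hypothetical output file
    cp results.txt "${SLURM_SUBMIT_DIR:-$HOME}/"
fi
```

Outside a Slurm job the final block is skipped, so the snippet is safe to try on the login node.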


Software

Software Package Requests

Course instructors or teaching assistants should write to support@hpc.ucalgary.ca if additional software is required for their course.

All ARC nodes run the latest version of CentOS 7 with the same set of base software packages. For your convenience, we have packaged commonly used software and its dependencies as modules available under /global/software. If your software package is not available as a module, you may also try Anaconda, which allows users to install and manage custom packages in an isolated environment.

For a list of the packages that have been made available, please see ARC Software pages.

Modules

The environment for some of the installed software is set up through the module command.

Software packages bundled as modules are available under /global/software and can be listed with the module avail command.

$ module avail

To enable Python, load the Python module by running:

$ module load python/anaconda-3.6-5.1.0

To unload the Python module, run:

$ module remove python/anaconda-3.6-5.1.0

To see currently loaded modules, run:

$ module list

Using TALC

Logging in

To log in to TALC, connect using SSH to talc.ucalgary.ca. Connections to TALC are accepted only from the University of Calgary network (on campus) or through the University of Calgary General VPN (off campus).

See Connecting to RCS HPC Systems for more information.

Working interactively

TALC uses the Linux operating system. The program that responds to your typed commands and allows you to run other programs is called the Linux shell. There are several different shells available, but by default you will use one called bash. It is useful to have some knowledge of the shell and of the other command-line programs that you can use to manipulate files. If you are new to Linux systems, we recommend that you work through one of the many online tutorials that are available, such as the UNIX Tutorial for Beginners provided by the University of Surrey. The tutorial covers fundamental topics such as creating, renaming and deleting files and directories, listing your files, and checking how much disk space you are using. For a more comprehensive introduction to Linux, see The Linux Command Line.

The TALC login node may be used for such tasks as editing files, compiling programs and running short tests while developing programs (under 15 minutes, say). Processors may also be reserved for interactive sessions using the salloc command specifying the resources needed through arguments on the salloc command line.
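For illustration, an interactive request could look something like the sketch below. The resource values (4 cores, 8000 MB, one hour) are arbitrary examples rather than recommended settings, and the snippet only prints the command when salloc is not installed, so it can be tried on any machine.

```shell
#!/bin/bash
# Illustrative request: 4 cores and 8000 MB on one cpu24 node for 1 hour.
salloc_args=(
    --partition=cpu24
    --nodes=1
    --ntasks=1
    --cpus-per-task=4
    --mem=8000M
    --time=01:00:00
)

if command -v salloc >/dev/null 2>&1; then
    # Starts a shell once the allocation has been granted.
    salloc "${salloc_args[@]}"
else
    echo "salloc not found; on TALC this would run: salloc ${salloc_args[*]}"
fi
```

Leave the interactive shell with exit when you are done so that the reserved processors are released for other users.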

Running non-interactive jobs (batch processing)

Production runs and longer test runs should be submitted as (non-interactive) batch jobs, in which the commands to be executed are listed in a script (text file). Batch job scripts are submitted using the sbatch command, part of the Slurm job management and scheduling software. #SBATCH directive lines at the beginning of the script specify the resources needed for the job (cores, memory, run time limit and any specialized hardware).

Most of the information on the Running Jobs page on the Compute Canada web site is also relevant for submitting and managing batch jobs and reserving processors for interactive work on TALC. One major difference between running jobs on the TALC and Compute Canada clusters is in selecting the type of hardware that should be used for a job. On TALC, you choose the hardware to use primarily by specifying a partition, as described below.
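A minimal sketch of the whole cycle, assuming a trivial one-core job: the snippet writes a small job script and submits it if sbatch is available. The file name hello_talc.slurm, the job name and the resource values are placeholders chosen for illustration; the module name is the one used in the Modules section of this guide.

```shell
#!/bin/bash
# Write a small batch script; the quoted 'EOF' keeps $(...) and ${...}
# unexpanded so they are evaluated when the job runs, not now.
cat > hello_talc.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello_talc
#SBATCH --partition=cpu24
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1000M
#SBATCH --time=00:05:00

module load python/anaconda-3.6-5.1.0
echo "Running on $(hostname) as job ${SLURM_JOB_ID}"
python --version
EOF

# Submit only where Slurm is installed (for example on the TALC login node).
if command -v sbatch >/dev/null 2>&1; then
    sbatch hello_talc.slurm
else
    echo "sbatch not found; script written to hello_talc.slurm"
fi
```

Unless an output file is requested with another directive, Slurm writes the job's output to a file named slurm-<jobid>.out in the directory from which the job was submitted.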

Selecting a partition

The type of computer on which a job can or should be run is determined by characteristics of your software, such as whether it supports parallel processing, and by simulation or data-dependent factors such as the amount of memory required. If the program you are running uses MPI (Message Passing Interface) for parallel processing, which allows the memory usage to be distributed across multiple compute nodes, the memory required per MPI process is an important factor. If you are running a serial code (that is, one that cannot use multiple CPU cores) or one that is parallelized with OpenMP or other thread-based techniques that restrict it to a single compute node, the total memory required is the main factor to consider.

Once you have decided what type of hardware best suits your calculations, you can select it on a job-by-job basis by including the partition keyword for an SBATCH directive in your batch job. The table below summarizes the characteristics of the various partitions.

Partition	Cores/node	Memory limit (MB)	Time limit (h)
cpu24	24	255000	24
cpu32-bigmem	32	1000000	24

For example, to select the cpu24 partition, include the following line in your batch job script:

#SBATCH --partition=cpu24
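For single-node jobs, the choice described in this section can be sketched as a small helper that maps a memory requirement to a partition using the limits from the table. The function name choose_partition is hypothetical, and the sketch deliberately ignores core counts: a job needing more than 24 cores on one node would also have to go to cpu32-bigmem.

```shell
#!/bin/bash
# Pick a TALC partition for a single-node job from its memory need in MB.
# The thresholds are the memory limits listed in the partition table.
choose_partition() {
    local mem_mb="$1"
    if [ "$mem_mb" -le 255000 ]; then
        echo "cpu24"
    elif [ "$mem_mb" -le 1000000 ]; then
        echo "cpu32-bigmem"
    else
        echo "no partition offers ${mem_mb} MB on one node" >&2
        return 1
    fi
}

# Print the directives for a job that needs 300000 MB of memory.
echo "#SBATCH --partition=$(choose_partition 300000)"
echo "#SBATCH --mem=300000M"
```

Because cpu32-bigmem is a single node shared by all users, it makes sense to prefer cpu24 whenever the job fits within its memory limit.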

Time limits

Use a directive of the form

#SBATCH --time=hh:mm:ss

to tell the job scheduler the maximum time that your job might run. You can use the command

scontrol show partitions

to see the current configuration of the partitions including the maximum time limit you can specify for each partition, as given by the MaxTime field. Alternatively, see the TIMELIMIT column in the output from

sinfo
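As a small illustration of the hh:mm:ss form, the sketch below converts a run time in minutes into a --time directive and rejects requests above the 24-hour limit listed in the partition table. The function name time_directive is hypothetical, and the limit is hard-coded from this guide; the commands above remain the authoritative source for the current values.

```shell
#!/bin/bash
MAX_HOURS=24   # time limit listed for both partitions in this guide

# Convert a run time in minutes to the hh:mm:ss form used by --time,
# refusing values above MAX_HOURS.
time_directive() {
    local minutes="$1"
    if [ "$minutes" -gt $(( MAX_HOURS * 60 )) ]; then
        echo "requested ${minutes} min exceeds the ${MAX_HOURS} h limit" >&2
        return 1
    fi
    printf '#SBATCH --time=%02d:%02d:00\n' $(( minutes / 60 )) $(( minutes % 60 ))
}

time_directive 90   # prints: #SBATCH --time=01:30:00
```

Requesting a realistic time rather than the maximum generally helps the scheduler fit a job in sooner.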

Hardware resource limits

There are limits on the number of cores, nodes and jobs one can use on TALC at any given time. The limits are generally applied on a partition-by-partition basis, so, using resources in one partition should not affect the amount you can use in a different partition. To see the current limits you can run the command:

sacctmgr show qos format=name,maxwall,maxtrespu%20,MaxSubmitJobs

Support

Please send TALC-related questions to your course instructor or teaching assistants (who can relay reports of system problems to support@hpc.ucalgary.ca).