TALC Cluster Guide: Difference between revisions

From RCSWiki
No edit summary
Added TALC navbox
(28 intermediate revisions by 6 users not shown)
Line 1: Line 1:
{{Message Box
{{TALC Cluster Status}}{{Message Box
|icon=Security Icon.png
|icon=Security Icon.png
|title=Cybersecurity awareness at the U of C
|title=Cybersecurity awareness at the U of C
|message=Please note that there are typically about 950 phishing attempts targeting University of Calgary accounts each month. This is just a reminder to be careful about computer security issues, both at home and at the University. Please visit https://it.ucalgary.ca/it-security for more information, tips on secure computing, and how to report suspected security problems.}}
|message=Please note that there are typically about 950 phishing attempts targeting University of Calgary accounts each month. This is just a reminder to be careful about computer security issues, both at home and at the University. Please visit https://it.ucalgary.ca/it-security for more information, tips on secure computing, and how to report suspected security problems.}}


This guide gives an overview of the Teaching and Learning Cluster (TALC) at the University of Calgary and is intended to be read by new account holders getting started on TALC. This guide covers topics such as hardware and performance characteristics, available software, usage policies, and how to log in and run jobs.
==Introduction==
TALC is a cluster of computers created by Research Computing Services (RCS) in response to requests for a central computing resource to support academic courses and workshops offered at the University of Calgary. It complements the Advanced Research Computing (ARC) cluster, which is used for research rather than educational purposes. The software environment in the TALC and ARC clusters is very similar and workflows between the two clusters are identical.  What students learn about using TALC will have direct applicability to using ARC should they go on to use ARC for research work.
If you are the instructor for a course that could benefit from using TALC, please review this guide and the [[TALC Terms of Use]] and then contact us at support@hpc.ucalgary.ca to discuss your requirements. 
Please note that in order to ensure that the appropriate software is available, student accounts are in place, and appropriate training has been provided for your teaching assistants, it is best to start this discussion several months prior to the start of the course.


{{Message Box
If you are a student in a course using TALC, please review this guide for basic instructions in using the cluster.  Questions should first be directed to the teaching assistants or instructor for your course.
|title=Need Help or have other TALC Related Questions?
 
|message='''Students''', please send TALC-related questions to your course instructor or teaching assistants.<br />
===Obtaining an account===
'''Course instructors and TAs''', please report system issues to support@hpc.ucalgary.ca).
TALC account requests are expected to be submitted by the course instructor rather than by individual students. You must have a University of Calgary IT account in order to use TALC. If you do not have a University of Calgary IT account or email address, please register for one at https://itregport.ucalgary.ca/. To ensure TALC is provisioned in time for a course start date, the instructor should submit the initial list of @ucalgary.ca accounts needed for the course 2 weeks before the start date.
|icon=Support Icon.png}}


User accounts for classes exist for the duration of the semester in which the class is taught, and are deleted along with the data in the home directories when the semester ends. Before the deletion date, you must copy anything you want to keep to another location. We do not keep backups of data on TALC.


This guide gives an overview of the Teaching and Learning Cluster (TALC) at the University of Calgary and is intended to be read by new account holders getting started on TALC. This guide covers topics such as hardware and performance characteristics, available software, usage policies, and how to log in and run jobs.
For the upcoming academic calendar, accounts will be deleted on the following dates:


==Introduction==
Spring: 23 June 2023
TALC is a cluster of computers created by Research Computing Services in response to requests for a central computing resource to support academic courses and workshops offered at the University of Calgary. It is a complement to the Advanced Research Computing (ARC) cluster that is used for research, rather than educational purposes. The software environment in the TALC and ARC clusters very similar and workflows between the two clusters are identical.  What students learn about using TALC will have direct applicability to using ARC should they go on to use ARC for research work.


If you are the instructor for a course that could benefit from using TALC, please review this guide, the [[TALC Terms of Use]], then contact us at support@hpc.ucalgary.ca to discuss your requirements.  To ensure that the appropriate software is available, student accounts are in place, and appropriate training has been provided for your teaching assistants, it is best to start this discussion several months prior to the start of the course.
Summer: 25 Aug 2023


If you are a student in a course using TALC, please review this guide for basic instructions in using the cluster.  Questions should first be directed to the teaching assistants or instructor for your course.
Fall: 22 Dec 2023


==Accounts==
Winter: 30 Apr 2024
TALC account requests are expected to be submitted by the course instructor rather than by individual students. You must have a University of Calgary IT account in order to use TALC. If you do not have a University of Calgary IT account or email address, please register for one at https://itregport.ucalgary.ca/.
 
=== Getting Support ===
{{Message Box
|title=Need Help or have other TALC Related Questions?
|message='''Students''', please send TALC-related questions to your course instructor or teaching assistants.<br />
'''Course instructors and TAs''', please report system issues to support@hpc.ucalgary.ca.
|icon=Support Icon.png}}


==Hardware==
==Hardware==
===Processors===
The TALC cluster is comprised of repurposed research clusters that are a few generations old. As a result, individual processor performance will not be comparable to the latest processors but should be sufficient for educational purposes and course work.
The TALC cluster is comprised of repurposed research clusters that are a few generations old. As a result, individual processor performance will not be comparable to the latest processors but should be sufficient for educational purposes and course work.  
{| class="wikitable"
 
!Partition
There are two types of compute node in the TALC cluster and are distinctly grouped into separate partitions:
!Description
*'''cpu24 partition nodes''': These are 24-core nodes, each with four sockets having 6-core AMD Istanbul processors running at 2.4 GHz. The 24 cores associated with one compute node share 256 GB of RAM, but, it is recommended that jobs request at most 255000 MB.  The cpu24 paritition nodes are connected with a high-speed InfiniBand network.
!Nodes
*'''cpu32-bigmem partition node''': This is a single 32-core node with four 8-core Intel Xeon CPU E7- 4830 processors running at 2.13 GHz. These processors share 1 TB of memory of which you can request up to 1000000 MB.
!CPU Cores, Model, and Year
!Installed Memory
!GPU
!Network
|-
|gpu
|GPU Compute
|3
|12 cores, 2x Intel Xeon Bronze 3204 CPU @ 1.90GHz (2019)
|192 GB
|5x NVIDIA Corporation TU104GL [Tesla T4]
|40 Gbit/s InfiniBand
|-
|cpu16
|General Purpose Compute
|36
|16 cores, 2x Eight-Core Intel Xeon CPU E5-2650 @ 2.00GHz (2012)
|64 GB
|N/A
|40 Gbit/s InfiniBand
|-
|bigmem
|General Purpose Compute
|2
|32 cores, 4x Intel(R) Xeon(R) CPU E7- 4830  @ 2.13GHz (2015)
|1024 GB
|N/A
|40 Gbit/s InfiniBand
|}


===Storage===
===Storage===
Line 39: Line 80:


TALC is connected to a network disk storage system. This storage is split across the <code>/home</code> and <code>/scratch</code> file systems.   
TALC is connected to a network disk storage system. This storage is split across the <code>/home</code> and <code>/scratch</code> file systems.   
====<code>/home</code>: Home file system====
====<code>/home</code>: Home file system====
Each user has a directory under /home and is the default working directory when logging in to TALC. Each home directory has a per-user quota of 500 GB. This limit is fixed and cannot be increased.
Each user has a directory under /home and is the default working directory when logging in to TALC. Each home directory has a per-user quota of 500 GB. This limit is fixed and cannot be increased.


Note on file sharing: Due to security concerns, permissions set using <code>chmod</code> on your home directory to allow other users to read/write to your home directory be automatically reverted by an automated system process unless an explicit exception is made.  If you need to share files with other researchers on the ARC cluster, please write to support@hpc.ucalgary.ca to ask for such an exception.
Note on file sharing: Due to security concerns, permissions set using <code>chmod</code> on your home directory to allow other users to read/write to your home directory will be automatically reverted by an automated system process unless an explicit exception is made.  If you need to share files with others on the TALC cluster, please write to support@hpc.ucalgary.ca to ask for such an exception.


====<code>/scratch</code>: Scratch file system for large job-oriented storage====
====<code>/scratch</code>: Scratch file system for large job-oriented storage====
Line 51: Line 90:
Data in <code>/scratch</code> associated with a given job will be deleted automatically, without exception, five days after the job finishes.
Data in <code>/scratch</code> associated with a given job will be deleted automatically, without exception, five days after the job finishes.


 
== Software ==
=== Software ===
{{Message Box
{{Message Box
| title=Software Package Requests
| title=Software Package Requests
Line 58: Line 96:
}}
}}


All ARC nodes run the latest version of CentOS 7 with the same set of base software packages. For your convenience, we have packaged commonly used software packages and dependencies as modules available under <code>/global/software</code>. If your software package is not available as a module, you may also try Anaconda which allows users to manage and install custom packages in an isolated environment.
All TALC nodes run a version of Rocky Linux. For your convenience, we have packaged commonly used software packages and dependencies as modules available under <code>/global/software</code>. If your software package is not available as a module, you may also try Anaconda which allows users to manage and install custom packages in an isolated environment.


For a list of packages that have been made available, please see [[ARC Software pages]]. 
For a list of packages that have been made available, please see [[ARC Software pages]]. 


==== Modules ====
=== Modules ===
The setup of the environment for using some of the installed software is through the <code>module</code> command.
The setup of the environment for using some of the installed software is through the <code>module</code> command.


Line 86: Line 124:


==Using TALC==
==Using TALC==
{{Message Box
|title=Usage subject to [[TALC Terms of Use]]
|message=Please review the [[TALC Terms of Use]] prior to using TALC.
|icon=Support Icon.png}}
===Logging in===
===Logging in===
To log in to TALC, connect using SSH to talc.ucalgary.ca. Connections to TALC are accepted only from the University of Calgary network (on campus) or through the University of Calgary General VPN (off campus).
To log in to TALC, connect using SSH to talc.ucalgary.ca. Connections to TALC are accepted only from the University of Calgary network (on campus) or through the University of Calgary General VPN (off campus).


See [Connecting to RCS HPC Systems] for more information.
When logging in to a new TALC account for '''the first time''', you must agree to the '''conditions of use''' for TALC.
Until the conditions are accepted, the account is not active.
 
See [[Connecting to RCS HPC Systems]] for more information.


===Working interactively===
===Working interactively===
<!-- original chunk -->
<!-- original chunk -->
ARC uses the Linux operating system. The program that responds to your typed commands and allows you to run other programs is called the Linux shell. There are several different shells available, but, by default you will use one called bash. It is useful to have some knowledge of the shell and a variety of other command-line programs that you can use to manipulate files. If you are new to Linux systems, we recommend that you work through one of the many online tutorials that are available, such as the [http://www.ee.surrey.ac.uk/Teaching/Unix/index.html UNIX Tutorial for Beginners (external link)] provided by the University of Surrey. The tutorial covers such fundamental topics, among others, as creating, renaming and deleting files and directories, how to produce a listing of your files and how to tell how much disk space you are using.  For a more comprehensive introduction to Linux, see [http://linuxcommand.sourceforge.net/tlcl.php The Linux Command Line (external link)].
TALC uses the Linux operating system. The program that responds to your typed commands and allows you to run other programs is called the Linux shell. Several shells are available, but by default you will use one called bash. It is useful to have some knowledge of the shell and of the other command-line programs that you can use to manipulate files. If you are new to Linux systems, we recommend that you work through one of the many online tutorials that are available, such as the [http://www.ee.surrey.ac.uk/Teaching/Unix/index.html UNIX Tutorial for Beginners (external link)] provided by the University of Surrey. The tutorial covers fundamental topics such as creating, renaming and deleting files and directories, listing your files, and checking how much disk space you are using.  For a more comprehensive introduction to Linux, see [http://linuxcommand.sourceforge.net/tlcl.php The Linux Command Line (external link)].


The TALC login node may be used for such tasks as editing files, compiling programs and running short tests while developing programs. We suggest CPU intensive workloads on the login node be restricted to under 15 minutes as per [[General Cluster Guidelines and Policies|our cluster guidelines]]. For interactive workloads exceeding 15 minutes, use the '''[[Running_jobs#Interactive_jobs|salloc command]]''' to allocate an interactive session on a compute node.
The TALC login node may be used for such tasks as editing files, compiling programs and running short tests while developing programs. CPU intensive workloads on the login node should be restricted to under 15 minutes as per [[General Cluster Guidelines and Policies|our cluster guidelines]]. For interactive workloads exceeding 15 minutes, use the '''[[Running_jobs#Interactive_jobs|salloc command]]''' to allocate an interactive session on a compute node.


The default salloc allocation is 1 CPU and 1 GB of memory. Adjust this by specifying <code>-n CPU#</code> and <code>--mem Megabytes</code>. You may request up to 5 hours of CPU time for interactive jobs.
The default <code>salloc</code> allocation is 1 CPU and 1 GB of memory. Adjust this by specifying <code>-n CPU#</code> and <code>--mem Megabytes</code>. You may request up to 5 hours of CPU time for interactive jobs.
  salloc --time 5:00:00 --partition cpu24
  salloc --time 5:00:00 --partition cpu16
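As a minimal sketch of adjusting the defaults, the snippet below assembles an <code>salloc</code> command line with explicit <code>-n</code> and <code>--mem</code> values. The figures used here (4 CPUs, 8000 MB, 2 hours) and the variable names are illustrative assumptions, not recommended settings; choose values that fit your course work and the limits of the partition.

<syntaxhighlight lang="bash">
# Illustrative values only (assumptions): 4 CPUs and 8000 MB for 2 hours.
CPUS=4
MEM_MB=8000

# Assemble the request; --mem is given in megabytes, as described above.
SALLOC_CMD="salloc --time 2:00:00 --partition cpu16 -n ${CPUS} --mem ${MEM_MB}"

# Print the command so it can be reviewed before running it on the login node.
echo "${SALLOC_CMD}"
</syntaxhighlight>

Running the printed command on the TALC login node should start an interactive session with the requested resources, provided they are within the limits for your account.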


===Running non-interactive jobs (batch processing)===
===Running non-interactive jobs (batch processing)===
Production runs and longer test runs should be submitted as (non-interactive) batch jobs, in which commands to be executed are listed in a script (text file). Batch jobs scripts are submitted using the sbatch command, part of the Slurm job management and scheduling software. #SBATCH directive lines at the beginning of the script are used to specify the resources needed for the job (cores, memory, run time limit and any specialized hardware needed).
Production runs and longer test runs should be submitted as (non-interactive) batch jobs, in which commands to be executed are listed in a script (text file). Batch job scripts are submitted using the <code>sbatch</code> command, part of the Slurm job management and scheduling software. <code>#SBATCH</code> directive lines at the beginning of the script are used to specify the resources needed for the job (cores, memory, run time limit and any specialized hardware needed).
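The following is a minimal sketch of such a batch script, not an official template. The file name <code>example_job.slurm</code>, the resource values, and the single <code>echo</code> command are assumptions chosen for illustration; replace them with what your course work requires. The snippet writes the script to a file and submits it only if <code>sbatch</code> is available on the machine where it is run.

<syntaxhighlight lang="bash">
# Write a small batch script to a file (the file name is a placeholder).
cat > example_job.slurm <<'EOF'
#!/bin/bash
#SBATCH --partition=cpu16
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1000
#SBATCH --time=00:10:00

# Commands to run on the compute node go below the directives.
echo "Running on $(hostname)"
EOF

# Submit the script only where Slurm is installed (for example, the login node).
if command -v sbatch >/dev/null 2>&1; then
    sbatch example_job.slurm
fi
</syntaxhighlight>

Here <code>--mem</code> is given in megabytes and <code>--time</code> as hours:minutes:seconds, so this sketch asks for one core, 1000 MB and ten minutes.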


Most of the information on the Running Jobs page on the Compute Canada web site is also relevant for submitting and managing batch jobs and reserving processors for interactive work on TALC.  One major difference between running jobs on the TALC and Compute Canada clusters is in selecting the type of hardware that should be used for a job. On TALC, you choose the hardware to use primarily by specifying a partition, as described below.
Most of the information on the Running Jobs page on the Compute Canada web site is also relevant for submitting and managing batch jobs and reserving processors for interactive work on TALC.  One major difference between running jobs on the TALC and Compute Canada clusters is in selecting the type of hardware that should be used for a job. On TALC, you choose the hardware to use primarily by specifying a partition, as described below.


===Selecting a partition===
===Using JupyterHub on TALC===
TALC has a JupyterHub server, which runs a Jupyter server on one of the TALC compute nodes and provides all the necessary encryption and plumbing to deliver the notebook to your computer.  To access this service you must have a TALC account. Point your browser at http://talc.ucalgary.ca and log in with your usual UC account.  As of this writing, the job that runs the Jupyter notebook is allocated 1 CPU and 10 GiB of memory on a cpu16 node.


There are some aspects to consider when selecting a partition including:
* Resource requirements in terms of memory and CPU cores
* Hardware specific requirements, such as GPU or CPU Instruction Set Extensions
* Partition resource limits and potential wait time
* Software support parallel processing using Message Passing Interface (MPI), OpenMP, etc.
** Eg. MPI for parallel processing can distribute memory across multiple nodes, per-node memory requirements could be lower. Whereas, OpenMP or single process code that is restricted to one node would require a higher memory node.
** Note: MPI code running on hardware with Omni-Path networking should be compiled with Omni-Path networking support. This is provided by loading the <code>openmpi/2.1.3-opa</code> or <code>openmpi/3.1.2-opa</code> modules prior to compiling.


Since resources that are requested are reserved for your job, please request only as much CPU and memory as your job requires to avoid reducing the cluster efficiency. If you are unsure which partition to use or the specific resource requests that are appropriate for your jobs, please contact us at [mailto:support@hpc.ucalgary.ca support@hpc.ucalgary.ca] and we would be happy to work with you.
'''Please note''' that before using JupyterHub on TALC, new users must log in to their TALC account using SSH at least once to '''accept the conditions of TALC use'''.
Until the conditions are accepted, the account is not activated and the JupyterHub login will not work.


{| class="wikitable" style="width: 100%;"
===Selecting a partition===
TALC currently has the following partitions available for use. The <code>gpu</code> and <code>cpu12</code> partitions refer to the same nodes. The <code>cpu12</code> partition was created to expose only the CPUs on the GPU hardware for general purpose use. Each GPU node has 5 Tesla T4 GPUs installed, but you may only request one per job within the TALC environment.
{| class="wikitable"
!Partition
!Partition
!Description
!Description
!Cores/node
!Nodes
!Cores
!Memory
!Memory Request Limit
!Memory Request Limit
!Time Limit
!Time Limit
!GPU
!GPU Request per Job
!Networking
!Network
|-
|gpu
|GPU Compute
|3
|12 cores
|192 GB
|190 GB
|24 hours
|1x NVIDIA Corporation TU104GL [Tesla T4]
|40 Gbit/s InfiniBand
|-
|cpu12
|General Purpose Compute
|3
|12 cores
|192 GB
|190 GB
|24 hours
|None
|40 Gbit/s InfiniBand
|-
|-
|cpu24
|cpu16
|General Purpose Compute
|General Purpose Compute
|24
|36
|255,000 MB
|16 cores
|64 GB
|62 GB
|24 hours
|24 hours
|
|None
|40 Gbit/s InfiniBand
|40 Gbit/s InfiniBand
|-
|-
|cpu32-bigmem
|bigmem
|General Purpose Compute
|General Purpose Compute
|32
|2
|1,000,000 MB
|32 cores
|1024 GB
|1022 GB
|24 hours
|24 hours
|
|None
|40 Gbit/s InfiniBand
|40 Gbit/s InfiniBand
|}
|}
There are some aspects to consider when selecting a partition including:
* Resource requirements in terms of memory and CPU cores
* Hardware specific requirements, such as GPU or CPU Instruction Set Extensions
* Partition resource limits and potential wait time
* Software support for parallel processing using Message Passing Interface (MPI), OpenMP, etc. For example, MPI can distribute memory across multiple nodes, so per-node memory requirements could be lower, whereas OpenMP or single-process serial code that is restricted to one node would require a node with more memory.


For example, to select the <code>cpu24</code> partition, include the following line in your batch job script:
Since requested resources are reserved for your job, please request only as much CPU and memory as your job requires to avoid reducing cluster efficiency.  If you are unsure which partition to use or which resource requests are appropriate for your jobs, '''course instructors and TAs''' may contact us at [mailto:support@hpc.ucalgary.ca support@hpc.ucalgary.ca] and we would be happy to work with you.


#SBATCH --partition=cpu24
=== Using a partition ===


==== CPU only jobs ====
To select the <code>cpu16</code> partition, include the following line in your batch job script:<syntaxhighlight lang="text">
#SBATCH --partition=cpu16
</syntaxhighlight>You may also start an interactive session with <code>salloc</code>:<syntaxhighlight lang="text">
$ salloc --time 1:00:00 -p cpu16
</syntaxhighlight>


==== GPU jobs ====
In TALC, you are limited to exactly 1 GPU per job. Jobs that request 0 GPUs or 2 or more GPUs will not be scheduled.


In addition to the hardware limitations, please be aware that there may also be policy limits imposed on your account for each partition. These limits restrict the number of cores, nodes, or GPUs that can be used at any given time. Since the limits are applied on a partition-by-partition basis, using resources in one partition should not affect the available resources you can use in another partition.
To submit a job to the <code>gpu</code> partition with one GPU, include the following in your batch job script:<syntaxhighlight lang="text">
#SBATCH --partition=gpu
#SBATCH --gpus-per-node=1
</syntaxhighlight>
 
Like the previous example, you may also request interactive sessions with GPU nodes using <code>salloc</code>. Just specify the <code>gpu</code> partition and the number of GPUs required. <syntaxhighlight lang="text">
$ salloc --time 1:00:00 -p gpu -n 1 --gpus-per-node 1
</syntaxhighlight>You may verify that a GPU was assigned to your job or interactive session by running <code>nvidia-smi</code>. This command will show you the status of the GPU that was assigned to you.<syntaxhighlight lang="text">
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|        Memory-Usage | GPU-Util  Compute M. |
|                              |                      |              MIG M. |
|===============================+======================+======================|
|  0  Tesla T4            Off  | 00000000:3B:00.0 Off |                    0 |
| N/A  36C    P0    14W /  70W |      0MiB / 15109MiB |      5%      Default |
|                              |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                             
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU  GI  CI        PID  Type  Process name                  GPU Memory |
|        ID  ID                                                  Usage      |
|=============================================================================|
|  No running processes found                                                |
+-----------------------------------------------------------------------------+
</syntaxhighlight>
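As a sketch of how the GPU directives above might fit into a complete batch script, the snippet below writes a small script that requests one GPU and runs <code>nvidia-smi</code> on the assigned node. The file name <code>gpu_job.slurm</code> and the memory and time values are assumptions for illustration only.

<syntaxhighlight lang="bash">
# Write a small GPU batch script to a file (the file name is a placeholder).
cat > gpu_job.slurm <<'EOF'
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gpus-per-node=1
#SBATCH --mem=4000
#SBATCH --time=00:10:00

# Report the GPU that Slurm assigned to this job.
nvidia-smi
EOF

# Submit the script only where Slurm is installed (for example, the login node).
if command -v sbatch >/dev/null 2>&1; then
    sbatch gpu_job.slurm
fi
</syntaxhighlight>

The script requests exactly one GPU, in line with the one-GPU-per-job rule described above.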
 
==== Partition limitations ====
In addition to the hardware limitations of the nodes within the partition, please be aware that there may also be policy limits imposed on your account for each partition. These limits restrict the number of cores, nodes, or GPUs that can be used at any given time. Since the limits are applied on a partition-by-partition basis, using resources in one partition should not affect the available resources you can use in another partition.


These limits can be listed by running:
These limits can be listed by running:
<syntaxhighlight lang="bash">
<syntaxhighlight lang="bash">
$ sacctmgr show qos format=Name,MaxWall,MaxTRESPU%20,MaxSubmitJobs
$ sacctmgr show qos format=Name,MaxWall,MaxTRESPU%20,MaxSubmitJobs
       Name    MaxWall            MaxTRESPU MaxSubmit
       Name    MaxWall            MaxTRESPU MaxSubmit  
---------- ----------- -------------------- ---------
---------- ----------- -------------------- ---------  
      cpu24 1-00:00:00                         2000
    normal 1-00:00:00                              
cpu32-bigmem  1-00:00:00            cpu=384      2000
  cpulimit                          cpu=48         
gpucpulim+                          cpu=18         
  gpulimit                cpu=2,gres/gpu=1               
</syntaxhighlight>
</syntaxhighlight>


=== Time limits ===
=== Time limits ===
Line 168: Line 282:
<syntaxhighlight lang="bash" highlight="6">
<syntaxhighlight lang="bash" highlight="6">
$ scontrol show partitions
$ scontrol show partitions
PartitionName=single                                                               
PartitionName=cpu16
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL                                  
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=single                                             
   AllocNodes=ALL Default=YES QoS=cpulimit
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO      
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=cn[001-168]                                                                
   Nodes=n[1-36]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO      
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF                                              
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=1344 TotalNodes=168 SelectTypeParameters=NONE                
   State=UP TotalCPUs=576 TotalNodes=36 SelectTypeParameters=NONE
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED                                  
  JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
 
 
 
</syntaxhighlight>
</syntaxhighlight>


Alternatively, with <code>sinfo</code> under the <code>TIMELIMIT</code> column:
Alternatively, with <code>sinfo</code> under the <code>TIMELIMIT</code> column:
<syntaxhighlight lang="bash">
<syntaxhighlight lang="bash">
$ sinfo                                                    
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST              
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
single       up 7-00:00:00      1 drain* cn097                 
cpu12       up 1-00:00:00      3  idle t[1-3]
single       up 7-00:00:00      1 maint cn002                 
cpu16       up 1-00:00:00    36  idle n[1-36]
single        up 7-00:00:00      4 drain* cn[001,061,133,154]  
bigmem      up 1-00:00:00      2  idle bigmem[1-2]
gpu          up 1-00:00:00      3  idle t[1-3]
...
...
</syntaxhighlight>
</syntaxhighlight>
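The time limits above are printed in Slurm's <code>days-hours:minutes:seconds</code> form. As a small convenience sketch, the hypothetical helper below (the function name <code>slurm_time_to_seconds</code> is not part of Slurm) converts such a value to seconds with <code>awk</code>. It handles only the <code>D-HH:MM:SS</code> and <code>HH:MM:SS</code> forms and returns a non-zero status for anything else.

<syntaxhighlight lang="bash">
# Hypothetical helper: convert D-HH:MM:SS or HH:MM:SS to seconds.
slurm_time_to_seconds() {
    echo "$1" | awk -F'[-:]' '
        NF == 4 { print $1 * 86400 + $2 * 3600 + $3 * 60 + $4; next }
        NF == 3 { print $1 * 3600 + $2 * 60 + $3; next }
        { exit 1 }'   # any other form is not supported by this sketch
}

# The 1-day partition limit shown above, expressed in seconds.
slurm_time_to_seconds 1-00:00:00
</syntaxhighlight>

This can be handy when comparing a job's expected run time against a partition's <code>MaxTime</code>.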
 
[[Category:TALC]]
 
[[Category:Guides]]
== Support ==
{{Navbox TALC}}
 
{{Message Box
|title=Need Help or have other TALC Related Questions?
|message='''Students''', please send TALC-related questions to your course instructor or teaching assistants.<br />
'''Course instructors and TAs''', please report system issues to support@hpc.ucalgary.ca).
|icon=Support Icon.png}}

Revision as of 18:32, 21 September 2023

TALC status: Cluster operational


No upgrades planned. Please contact us if you experience system issues.

See the TALC Cluster Status page for system notices.

Cybersecurity awareness at the U of C

Please note that there are typically about 950 phishing attempts targeting University of Calgary accounts each month. This is just a reminder to be careful about computer security issues, both at home and at the University. Please visit https://it.ucalgary.ca/it-security for more information, tips on secure computing, and how to report suspected security problems.

This guide gives an overview of the Teaching and Learning Cluster (TALC) at the University of Calgary and is intended to be read by new account holders getting started on TALC. This guide covers topics such as hardware and performance characteristics, available software, usage policies, and how to log in and run jobs.

Introduction

TALC is a cluster of computers created by Research Computing Services (RCS) in response to requests for a central computing resource to support academic courses and workshops offered at the University of Calgary. It complements the Advanced Research Computing (ARC) cluster, which is used for research rather than educational purposes. The software environment in the TALC and ARC clusters is very similar and workflows between the two clusters are identical. What students learn about using TALC will have direct applicability to using ARC should they go on to use ARC for research work.

If you are the instructor for a course that could benefit from using TALC, please review this guide and the TALC Terms of Use and then contact us at support@hpc.ucalgary.ca to discuss your requirements.

Please note that in order to ensure that the appropriate software is available, student accounts are in place, and appropriate training has been provided for your teaching assistants, it is best to start this discussion several months prior to the start of the course.

If you are a student in a course using TALC, please review this guide for basic instructions in using the cluster. Questions should first be directed to the teaching assistants or instructor for your course.

Obtaining an account

TALC account requests are expected to be submitted by the course instructor rather than by individual students. You must have a University of Calgary IT account in order to use TALC. If you do not have a University of Calgary IT account or email address, please register for one at https://itregport.ucalgary.ca/. To ensure TALC is provisioned in time for a course start date, the instructor should submit the initial list of @ucalgary.ca accounts needed for the course 2 weeks before the start date.

User accounts for classes exist for the duration of the semester in which the class is taught, and are deleted along with the data in the home directories when the semester ends. You must ensure that anything you want to access later is saved elsewhere before these dates. We do not keep backups of data on TALC.

For the upcoming academic calendar, accounts will be deleted on the following dates:

Spring: 23 June 2023

Summer: 25 Aug 2023

Fall: 22 Dec 2023

Winter: 30 Apr 2024

Getting Support

Need Help or have other TALC Related Questions?

Students, please send TALC-related questions to your course instructor or teaching assistants.
Course instructors and TAs, please report system issues to support@hpc.ucalgary.ca.

Hardware

The TALC cluster is composed of repurposed research clusters that are a few generations old. As a result, individual processor performance will not be comparable to the latest processors but should be sufficient for educational purposes and course work.

Partition | Description | Nodes | CPU Cores, Model, and Year | Installed Memory | GPU | Network
gpu | GPU Compute | 3 | 12 cores, 2x Intel Xeon Bronze 3204 CPU @ 1.90GHz (2019) | 192 GB | 5x NVIDIA Corporation TU104GL [Tesla T4] | 40 Gbit/s InfiniBand
cpu16 | General Purpose Compute | 36 | 16 cores, 2x Eight-Core Intel Xeon CPU E5-2650 @ 2.00GHz (2012) | 64 GB | N/A | 40 Gbit/s InfiniBand
bigmem | General Purpose Compute | 2 | 32 cores, 4x Intel(R) Xeon(R) CPU E7-4830 @ 2.13GHz (2015) | 1024 GB | N/A | 40 Gbit/s InfiniBand

Storage

No Backup Policy!

You are responsible for your own backups. Since accounts on TALC and related data are removed shortly after the associated course has finished, you should download anything you need to save to your own computer before the end of the course.

TALC is connected to a network disk storage system. This storage is split across the /home and /scratch file systems.

/home: Home file system

Each user has a directory under /home, which is the default working directory when logging in to TALC. Each home directory has a per-user quota of 500 GB. This limit is fixed and cannot be increased.

Note on file sharing: Due to security concerns, any permissions you set with chmod to let other users read or write to your home directory will be reverted by an automated system process unless an explicit exception is made. If you need to share files with others on the TALC cluster, please write to support@hpc.ucalgary.ca to ask for such an exception.

/scratch: Scratch file system for large job-oriented storage

For each job, a subdirectory is created under /scratch that can be referenced in job scripts as /scratch/${SLURM_JOB_ID}. You can use that directory for temporary files needed during the course of a job. Each user may use up to 30 TB of storage in the /scratch file system (total across all of your jobs).

Data in /scratch associated with a given job will be deleted automatically, without exception, five days after the job finishes.
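
As a rough illustration of the per-job scratch directory described above, the following batch script sketch writes its temporary files under /scratch/${SLURM_JOB_ID} and copies the result out before the job ends. The resource values, the file names, and the fallback to a temporary directory when the script is run outside of Slurm are illustrative assumptions, not TALC requirements.

```shell
#!/bin/bash
#SBATCH --partition=cpu16
#SBATCH --time=00:10:00
#SBATCH --mem=1000

# Use the per-job scratch directory when running under Slurm.
# Outside of a job (SLURM_JOB_ID unset), fall back to a temporary
# directory so the script can still be tried out; this fallback is
# an assumption made for illustration only.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    WORKDIR="/scratch/${SLURM_JOB_ID}"
else
    WORKDIR="$(mktemp -d)"
fi

# Where to copy results: the submission directory under Slurm,
# otherwise the current directory.
OUTDIR="${SLURM_SUBMIT_DIR:-$PWD}"

# Placeholder workload: write intermediate data to scratch.
seq 1 100 > "${WORKDIR}/numbers.txt"
awk '{ sum += $1 } END { print sum }' "${WORKDIR}/numbers.txt" > "${WORKDIR}/result.txt"

# Copy anything worth keeping out of scratch, because scratch data
# is deleted automatically after the job finishes.
cp "${WORKDIR}/result.txt" "${OUTDIR}/result.txt"
```

Only result.txt is kept here; the intermediate numbers.txt is left in scratch to be cleaned up automatically.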

Software

Software Package Requests

Course instructors or teaching assistants should write to support@hpc.ucalgary.ca if additional software is required for their course.

All TALC nodes run a version of Rocky Linux. For your convenience, we have packaged commonly used software packages and dependencies as modules available under /global/software. If your software package is not available as a module, you may also try Anaconda which allows users to manage and install custom packages in an isolated environment.

For a list of packages that have been made available, please see ARC Software pages.

Modules

The environment for some of the installed software is set up using the module command.

Software packages bundled as a module will be available under /global/software and can be listed with the module avail command.

$ module avail

To enable Python, load the Python module by running:

$ module load python/anaconda-3.6-5.1.0

To unload the Python module, run:

$ module remove python/anaconda-3.6-5.1.0

To see currently loaded modules, run:

$ module list

Using TALC

Usage subject to TALC Terms of Use

Please review the TALC Terms of Use prior to using TALC.

Logging in

To log in to TALC, connect using SSH to talc.ucalgary.ca. Connections to TALC are accepted only from the University of Calgary network (on campus) or through the University of Calgary General VPN (off campus).

When logging in to a new TALC account for the first time, you must agree to the conditions of use for TALC. The account is not active until the conditions are accepted.

See Connecting to RCS HPC Systems for more information.

Working interactively

TALC uses the Linux operating system. The program that responds to your typed commands and allows you to run other programs is called the Linux shell. Several shells are available but, by default, you will use one called bash. It is useful to have some knowledge of the shell and of the other command-line programs that you can use to manipulate files. If you are new to Linux systems, we recommend that you work through one of the many online tutorials that are available, such as the UNIX Tutorial for Beginners (external link) provided by the University of Surrey. The tutorial covers fundamental topics such as creating, renaming and deleting files and directories, listing your files, and checking how much disk space you are using. For a more comprehensive introduction to Linux, see The Linux Command Line (external link).

The TALC login node may be used for such tasks as editing files, compiling programs and running short tests while developing programs. CPU intensive workloads on the login node should be restricted to under 15 minutes as per our cluster guidelines. For interactive workloads exceeding 15 minutes, use the salloc command to allocate an interactive session on a compute node.

The default salloc allocation is 1 CPU and 1 GB of memory. Adjust this by specifying -n <number of CPUs> and --mem <megabytes>. You may request up to 5 hours of run time for interactive jobs.

$ salloc --time 5:00:00 --partition cpu16
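
For instance, the defaults described above could be adjusted as in the sketch below, which builds a salloc command line for a given number of CPUs and an amount of memory given in gigabytes. The helper name talc_salloc_cmd is hypothetical, and the conversion assumes 1 GB = 1024 MB. The function only prints the command, so it can be inspected before it is run.

```shell
# Print a salloc command for an interactive session on cpu16.
# Arguments: number of CPUs, memory in GB, time limit (h:mm:ss).
talc_salloc_cmd() {
    local cpus="$1"
    local mem_gb="$2"
    local time_limit="$3"
    # --mem is given in megabytes, so convert from gigabytes.
    local mem_mb=$(( mem_gb * 1024 ))
    echo "salloc --time ${time_limit} --partition cpu16 -n ${cpus} --mem ${mem_mb}"
}

# Example: 4 CPUs and 8 GB of memory for 2 hours.
talc_salloc_cmd 4 8 2:00:00
```

Running the printed command on the TALC login node would then request the interactive session.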

Running non-interactive jobs (batch processing)

Production runs and longer test runs should be submitted as (non-interactive) batch jobs, in which the commands to be executed are listed in a script (text file). Batch job scripts are submitted using the sbatch command, part of the Slurm job management and scheduling software. #SBATCH directive lines at the beginning of the script specify the resources needed for the job (cores, memory, run time limit and any specialized hardware needed).
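
A minimal sketch of this workflow is shown below: it writes a small job script and submits it with sbatch. The file name hello_talc.sh, the job name, and the resource values are placeholders chosen for illustration; adjust them to what your course work actually needs.

```shell
# Write a minimal job script (the file name is arbitrary).
cat > hello_talc.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello_talc
#SBATCH --partition=cpu16
#SBATCH --ntasks=1
#SBATCH --mem=1000
#SBATCH --time=00:05:00

# Commands below run on the allocated compute node.
echo "Running on $(hostname)"
EOF

# Submit the job if Slurm is available (that is, on TALC).
if command -v sbatch >/dev/null 2>&1; then
    sbatch hello_talc.sh
else
    echo "sbatch not found; copy hello_talc.sh to TALC and submit it there"
fi
```

By default, Slurm writes the job's output to a file named slurm-<jobid>.out in the directory from which the job was submitted.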

Most of the information on the Running Jobs page on the Compute Canada web site is also relevant for submitting and managing batch jobs and reserving processors for interactive work on TALC. One major difference between running jobs on the TALC and Compute Canada clusters is in selecting the type of hardware that should be used for a job. On TALC, you choose the hardware to use primarily by specifying a partition, as described below.

Using JupyterHub on TALC

TALC has a JupyterHub server that starts a Jupyter server on one of the TALC compute nodes and handles the encryption and plumbing needed to deliver the notebook to your computer. To access this service you must have a TALC account. Point your browser at http://talc.ucalgary.ca and log in with your usual UC account. As of this writing, the job that runs the Jupyter notebook is allocated 1 CPU and 10 GiB of memory on a cpu16 node.

Please note that before using JupyterHub on TALC, a new user must log in to their TALC account using SSH at least once to accept the conditions of TALC use. Until the conditions are accepted, the account is not activated and the JupyterHub login will not work either.

Selecting a partition

TALC currently has the following partitions available for use. The gpu and cpu12 partitions refer to the same nodes. The cpu12 partition was created to expose only the CPUs on the GPU hardware for general purpose use. Each GPU node has 5 Tesla T4 GPUs installed, but you may only request one per job within the TALC environment.

Partition | Description | Nodes | Cores | Memory | Memory Request Limit | Time Limit | GPU Request per Job | Network
gpu | GPU Compute | 3 | 12 cores | 192 GB | 190 GB | 24 hours | 1x NVIDIA Corporation TU104GL [Tesla T4] | 40 Gbit/s InfiniBand
cpu12 | General Purpose Compute | 3 | 12 cores | 192 GB | 190 GB | 24 hours | None | 40 Gbit/s InfiniBand
cpu16 | General Purpose Compute | 36 | 16 cores | 64 GB | 62 GB | 24 hours | None | 40 Gbit/s InfiniBand
bigmem | General Purpose Compute | 2 | 32 cores | 1024 GB | 1022 GB | 24 hours | None | 40 Gbit/s InfiniBand

There are some aspects to consider when selecting a partition including:

  • Resource requirements in terms of memory and CPU cores
  • Hardware specific requirements, such as GPU or CPU Instruction Set Extensions
  • Partition resource limits and potential wait time
  • Software support for parallel processing using Message Passing Interface (MPI), OpenMP, etc. For example, MPI can distribute memory across multiple nodes, so per-node memory requirements could be lower, whereas OpenMP or single-process serial code is restricted to one node and would require a node with more memory.

Since requested resources are reserved for your job, please request only as much CPU and memory as your job requires to avoid reducing cluster efficiency. If you are unsure which partition to use or which resource requests are appropriate for your jobs, course instructors and TAs may contact us at support@hpc.ucalgary.ca and we would be happy to help.
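
The core and memory request limits in the partition table can be turned into a simple lookup, sketched below for single-node jobs. The function name suggest_partition and the preference order (cpu16 first, then cpu12, then bigmem) are assumptions made for illustration. The sketch looks only at the hardware limits in the table and ignores per-account policy limits and expected wait times.

```shell
# Suggest a partition for a single-node job.
# Arguments: cores needed, memory needed in GB, GPUs needed (0 or 1).
suggest_partition() {
    local cores="$1"
    local mem_gb="$2"
    local gpus="${3:-0}"
    if [ "$gpus" -eq 1 ] && [ "$cores" -le 12 ] && [ "$mem_gb" -le 190 ]; then
        echo gpu
    elif [ "$gpus" -gt 0 ]; then
        echo none      # more than one GPU, or too large for a GPU node
    elif [ "$cores" -le 16 ] && [ "$mem_gb" -le 62 ]; then
        echo cpu16
    elif [ "$cores" -le 12 ] && [ "$mem_gb" -le 190 ]; then
        echo cpu12
    elif [ "$cores" -le 32 ] && [ "$mem_gb" -le 1022 ]; then
        echo bigmem
    else
        echo none      # does not fit on any single TALC node
    fi
}

suggest_partition 8 32 0     # fits the cpu16 limits
suggest_partition 8 128 0    # too much memory for cpu16
suggest_partition 24 500 0   # too many cores for cpu12
suggest_partition 2 16 1     # needs a GPU
```

A result of none suggests that the job should be split up, for example across nodes with MPI, or that the request should be reduced.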

Using a partition

CPU only jobs

To select the cpu16 partition, include the following line in your batch job script:

#SBATCH --partition=cpu16

You may also start an interactive session with salloc:

$ salloc --time 1:00:00 -p cpu16

GPU jobs

On TALC, you are limited to exactly 1 GPU per job. Jobs that request 0 GPUs or 2 or more GPUs will not be scheduled.

To submit a job to the gpu partition with a request for one GPU, include the following lines in your batch job script:

#SBATCH --partition=gpu
#SBATCH --gpus-per-node=1

Like the previous example, you may also request interactive sessions with GPU nodes using salloc. Just specify the gpu partition and the number of GPUs required.

$ salloc --time 1:00:00 -p gpu -n 1 --gpus-per-node 1

You may verify that a GPU was assigned to your job or interactive session by running nvidia-smi. This command will show you the status of the GPU that was assigned to you.

$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   36C    P0    14W /  70W |      0MiB / 15109MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
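
Putting the pieces above together, a complete GPU batch script might look like the following sketch. The memory and time values are placeholders, and the GPU_CHECK variable is introduced here only so that the script reports a clear status when it is run somewhere without nvidia-smi.

```shell
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gpus-per-node=1
#SBATCH --ntasks=1
#SBATCH --mem=4000
#SBATCH --time=00:10:00

# Report the assigned GPU in a compact CSV form. If nvidia-smi is
# missing or fails, record that instead of aborting the script.
if command -v nvidia-smi >/dev/null 2>&1 &&
   nvidia-smi --query-gpu=name,memory.total --format=csv; then
    GPU_CHECK=ok
else
    GPU_CHECK=unavailable
fi
echo "GPU check: ${GPU_CHECK}"

# The GPU workload itself would follow here.
```

Checking for the GPU at the top of the script makes a misconfigured request visible in the job output before any long-running work starts.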

Partition limitations

In addition to the hardware limitations of the nodes within the partition, please be aware that there may also be policy limits imposed on your account for each partition. These limits restrict the number of cores, nodes, or GPUs that can be used at any given time. Since the limits are applied on a partition-by-partition basis, using resources in one partition should not affect the available resources you can use in another partition.

These limits can be listed by running:

$ sacctmgr show qos format=Name,MaxWall,MaxTRESPU%20,MaxSubmitJobs
      Name     MaxWall            MaxTRESPU MaxSubmit 
---------- ----------- -------------------- --------- 
    normal  1-00:00:00                                
  cpulimit                           cpu=48           
gpucpulim+                           cpu=18           
  gpulimit                 cpu=2,gres/gpu=1

Time limits

Use the --time directive to tell the job scheduler the maximum time that your job might run. For example:

#SBATCH --time=hh:mm:ss
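
If you estimate run times in minutes, a small helper such as the one sketched below can produce the hh:mm:ss form. The function name minutes_to_slurm_time is hypothetical, and it does not check the result against the partition time limit.

```shell
# Convert a run time in whole minutes to the hh:mm:ss form used by --time.
minutes_to_slurm_time() {
    local total="$1"
    printf '%02d:%02d:00\n' $(( total / 60 )) $(( total % 60 ))
}

minutes_to_slurm_time 90      # prints 01:30:00
minutes_to_slurm_time 1440    # prints 24:00:00
```

The second call corresponds to the 24 hour limit listed for the TALC partitions.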

You can use scontrol show partitions or sinfo to see the current maximum time that a job can run.

$ scontrol show partitions
PartitionName=cpu16
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=cpulimit
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=n[1-36]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=576 TotalNodes=36 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

Alternatively, with sinfo under the TIMELIMIT column:

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
cpu12        up 1-00:00:00      3   idle t[1-3]
cpu16        up 1-00:00:00     36   idle n[1-36]
bigmem       up 1-00:00:00      2   idle bigmem[1-2]
gpu          up 1-00:00:00      3   idle t[1-3]
 
...