System Configuration


Networking

Firewall rules

ARC's rules:

  • ARC's login node only accepts connections from inside the UofC campus network (an illustrative sketch of such rules is given below).
  • The Compute Canada clusters are an exception: direct connections are accepted from the Cedar, Graham, Beluga, and Niagara clusters.
  • The login node is also open to connections from 3 IP addresses of the Harvard School of Public Health (Tatum D Mortimer), external collaborators of Dr. Ian Lewis.
Ian has been working to get them GA status with UofC, so at some point this exception should be removed.
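
An illustrative sketch of what such ingress filtering could look like with plain iptables; the CIDR ranges are placeholders rather than the real UofC or Alliance networks, and the actual ARC firewall may well be managed with a different tool:

iptables -A INPUT -p tcp --dport 22 -s <uofc-campus-CIDR> -j ACCEPT   # campus network
iptables -A INPUT -p tcp --dport 22 -s <cedar-CIDR> -j ACCEPT         # one rule per allowed cluster or collaborator IP
iptables -A INPUT -p tcp --dport 22 -j DROP                           # everything else is refused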


ARC-DTN:

  • The rules should be exactly the same as for the ARC login node.
  • There is an exception for Dmitri's home IP address.


TALC:

  • Just the campus network.


GlaDOS:

  • Not sure.

ARC's local network

  • Direct SSH connections to the compute nodes are only allowed when the user has an active job on that node. This is controlled by PAM (a sketch is given after this list).
    • RCS analysts are an exception and can SSH to any node on the cluster.
    • This rule is designed to prevent abusive use outside of the scheduler, and
    • it also kills any leftover zombie processes from crashed Ansys Fluent runs.
  • Compute nodes are not directly accessible from outside of the cluster.
One can use an SSH tunnel through the login node to connect to a compute node from the outside (an example is given after this list).
  • SSH connections without a password should be enabled between the compute nodes of the same job.
In practice, users should be able to SSH without a password to any node that has their jobs scheduled on it,
and without having an SSH key in their .ssh directory (one common mechanism is sketched after this list).
  • SSH connections without a password should be enabled for all relevant interfaces on the compute nodes.
That is, for parallel and lattice nodes this means both eth0 and ib0.
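
The PAM control mentioned in the first bullet is typically implemented with SLURM's pam_slurm_adopt module. The /etc/pam.d/sshd fragment below is a sketch of one common arrangement, not a copy of ARC's actual configuration; the analyst exception would then live in /etc/security/access.conf:

account    sufficient   pam_access.so         # listed admins (e.g. RCS analysts) are let in unconditionally
account    required     pam_slurm_adopt.so    # everyone else must have a running job on this node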
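
For the tunnel mentioned above, OpenSSH's ProxyJump or a port forward through the login node is the usual route; the login host and node name below are examples only:

ssh -J username@arc.ucalgary.ca username@cn0123     # shell on a compute node, jumping through the login node
ssh -L 8888:cn0123:8888 username@arc.ucalgary.ca    # forward a port (e.g. a notebook server) from that node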
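
Password-less, key-less SSH between the nodes of a job is commonly achieved with OpenSSH host-based authentication. The following is a sketch of that general mechanism and is not a statement of how ARC actually implements it:

HostbasedAuthentication yes     # in /etc/ssh/sshd_config on the compute nodes
HostbasedAuthentication yes     # in /etc/ssh/ssh_config on the compute nodes
EnableSSHKeysign yes            # also in /etc/ssh/ssh_config
# plus the cluster's node names listed in /etc/ssh/shosts.equiv and /etc/ssh/ssh_known_hosts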

DTN, login, and compute nodes

  • Egress access rules
  • Ingress access rules
  • Exclusions to rules

Per partition

  • Types of networking available (eg ethernet, OPA/IB)
  • Network speeds (1/10/40/100 Gb/s) per type
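
Once these are documented, the per-node reality can be cross-checked directly on a node with standard Linux tools (their availability on ARC nodes is an assumption):

ip -br link                  # lists the interfaces present (eth0, ib0, ...)
ethtool eth0 | grep Speed    # negotiated Ethernet link speed
ibstat                       # InfiniBand/OPA adapter state and rate, from infiniband-diags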

Software configuration

Tools and utilities

  • nedit -- an X11 GUI text editor for Linux; it is used as part of the training.
  • mc -- a console file manager;
  • screen -- a terminal multiplexer;
  • python 2.7


Ansys

  • Ansys Fluent 20219r2 required a specific library and the -pib.infinipath parameter to run on the cpu2019 partition using MPI over OPA (a sketch of such a job script is given below).
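
A sketch of a batch script for such a run; the module name, resource numbers, and input file are placeholders, and only the partition name and the -pib.infinipath flag come from the note above:

#!/bin/bash
#SBATCH --partition=cpu2019
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40          # placeholder core count
#SBATCH --mem=180G                    # placeholder memory request
#SBATCH --time=24:00:00

module load ansys                     # the module name on ARC is an assumption

# Build a host list for Fluent from the nodes SLURM assigned to this job.
scontrol show hostnames "$SLURM_JOB_NODELIST" > hosts.$SLURM_JOB_ID

# -pib.infinipath selects MPI over the OPA fabric, as noted above.
fluent 3ddp -g -t$SLURM_NTASKS -cnf=hosts.$SLURM_JOB_ID -pib.infinipath -i input.jou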

Scheduling (SLURM) Parameters

  • Cluster-level and per-partition maxima per user for jobs submitted and jobs running
  • CPU limits by partition
    • GPU limits by partition -- must reject 0-GPU jobs on GPU partitions
  • Time limits by partition
  • File handle limits by job
  • Core equivalency, if we end up doing that?
  • Decay half-life or threshold for the priority impact of jobs run
  • Default job parameters set by the Lua script
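
Most of the current values can be read back from SLURM itself; the partition name below is an example from this page, and which fields matter depends on how the limits end up being enforced (partition settings vs. QOS):

scontrol show partition cpu2019                        # time limit, nodes, defaults for one partition
sacctmgr show qos format=name,maxwall,maxtrespu%40     # per-user TRES and wall-time limits, if set through a QOS
scontrol show config | grep -i PriorityDecayHalfLife   # decay half-life used by the fair-share priority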

Something I was thinking of would be to use a table like this for this type of information:

Parameter   Original Setting   Current Setting   Reason for Change   Link to change log for historical events   Comments
Max Cores   1000               10                To annoy users      No link                                     Why do you want comments

Storage Parameters

  • /home quotas: 500 GB / 200K files per user.
  • /work quotas (by group)
  • /bulk quotas (by group)
  • Home directory access permissions and enforcement: chmod 700 by default, re-enforced every hour.
  • /tmp quotas (is this uniform across the compute nodes?)
  • inode limits by user and by group
  • backup/snapshot frequency
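
Usage against these quotas can be checked from a login node; which of the commands below applies depends on the underlying filesystems, which this page does not specify:

quota -s                        # classic per-user quota report, where quotas are exported this way
lfs quota -u $USER /work        # per-user usage on /work, assuming it is a Lustre filesystem
lfs quota -g <group> /work      # per-group usage on /work, under the same assumption
df -h /home /work /bulk         # free space on the underlying filesystems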

Automatic process management

Login and DTN nodes

ARC login node

  • Max 5 GB of RAM for all processes belonging to a single user (a possible implementation is sketched below).
  • No swapping is allowed; the maximum virtual memory size is also 5 GB.
  • Cron jobs should be disabled for normal users.
This is to prevent potential trouble arising from having multiple login nodes,
as well as abusive recurring jobs from users.
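
One possible way to express these limits is a systemd drop-in for the per-user slices plus a cron allow-list; this is a sketch under the assumption that the login node uses systemd with cgroup-based accounting, not a copy of ARC's actual setup:

# /etc/systemd/system/user-.slice.d/arc-login-limits.conf  (hypothetical file name)
[Slice]
MemoryMax=5G          # hard cap on RAM for all processes of a user
MemorySwapMax=0       # no swap use allowed

# The 5 GB virtual memory cap could instead be a ulimit in /etc/security/limits.conf:
# *    hard    as    5242880       # address-space limit in KiB (5 GB)

# Disabling cron for normal users: an /etc/cron.allow containing only root
# restricts crontab to the accounts listed there.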

ARC compute nodes

  • Swap is disabled on compute nodes. Swap is a requestable resource, and it will be set to 0 in the SLURM QOS (this is planned).
Discussed at the 2021-02-17 morning meeting.
This had been set before the December upgrade (on June 8th, 2020), but it was not re-implemented afterwards.

Dave's message from June 8th, 2020, about it (subject "Memory changes on Arc"):

Hi (many in bcc:),

Recently, it was noticed that a number of jobs rely on the swap partition to provide more virtual memory than was requested by the job.  Swap memory is around 100 times slower than RAM and its use can usually be avoided by requesting more memory or changing to a node with more memory.

Since running from swap is so slow, jobs run from it are basically stuck and waste resources doing nothing. To prevent this, Arc is being changed to kill the job instead of allowing this behaviour.

One or more of your jobs have been identified as using the swap memory on the compute nodes of Arc since May 1, 2020.  In the future if a job exceeds the --mem= request to sbatch/salloc, it will be automatically terminated by the scheduling system.

Research Computing Services, RCS, would be happy to consult with you on requesting sufficient memory to prevent termination. Please reply to this message requesting assistance if you would like someone to have a look at your job submissions.  We expect to make this change on Wednesday, June 10, 2020.  Jobs that have started running by this time will not be affected.

Sincerely
-Dave
Dave Schulz
Research Computing Services

This is accomplished by having the following in /etc/slurm/cgroup.conf (specifically, the addition of the AllowedSwapSpace and ConstrainSwapSpace lines):

CgroupAutomount=yes
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes

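# The two lines below were added to forbid swap use by jobs: a job exceeding its
# --mem request is terminated instead of spilling over into swap.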
AllowedSwapSpace=0
ConstrainSwapSpace=yes

Procedural Rules

Access

  • Who can access ARC, who requires approval (currently undergrads and external collaborators) and who can grant it (only the PI or also postdocs?)
  • Who needs to be consulted before granting access to a group, directory (always the PI?)
  • Account renewal and deletion procedures (is this settled in the TOU? I wasn't at the last several of those meetings)
  • Blocking users with a nologin for maintenance/changes

Storage

  • Process for requesting and confirming /work and /bulk storage (this is a longer discussion)

Jobs

  • What kinds of long-running processes we permit on the login node (are we going to insist on moving all data transfers to the DTN)
  • What kinds of long-running processes we permit on the DTN node (are tar and md5 allowed)

Other services

OnDemand portal

  • Currently managed by Leo.

In summary, this morning, we agreed that there will be 3 types of partitions on OOD:

  1. Single + Backfill, 5 hour limit
  2. Bigmem, 2 hour limit
  3. GPU-v100, 2 hour limit
  • Selecting the Single partition will also include all backfill partitions (apophis-bf,pawson-bf,razi-bf,synergy-bf,theia-bf). Bigmem and GPU-v100 will not.
  • We also discussed potentially limiting interactive jobs to 1 per user. We're exploring the possibilities of perhaps injecting job dependencies or custom GRES to do this if it's even possible.
  • I have added a partition dropdown to the desktop interactive app form. It will look like what I have attached to this email. The time/CPU/memory/GPU limits on this form will be constrained to the limits listed on our ARC cluster guide for the selected partition.
  • Question: Do we want to limit the number of GPUs that can be requested for one interactive job on gpu-v100? Currently it is set to accept interactive jobs requesting 1-2 GPUs only.