ARC Hardware

[[Category:ARC Cluster]]
Due to the complicated history of the ARC cluster, it comprises a mixture of old and relatively new hardware that can be grouped into two broad categories, as described below. Each type of compute node is associated with a partition name. Partitions are discussed on the main [[ARC Cluster Guide#Selecting a partition]] page.

The network interconnect between the compute nodes is described in the Interconnect section at the bottom of this page.

Hardware installed January 2019 and later

  • General purpose nodes (cpu2019, apophis, apophis-bf, razi and razi-bf partitions): These are 40-core nodes, each with two sockets. Each socket holds an Intel Xeon Gold 6148 20-core processor running at 2.4 GHz. The 40 cores on each compute node share about 190 GB of RAM (memory), but jobs should request no more than 185000 MB (a sample job script is shown after this list).
  • GPU (Graphics Processing Unit)-enabled nodes (gpu-v100 partition): These are 40-core nodes, each with two sockets. Each socket holds an Intel Xeon Gold 6148 20-core processor running at 2.4 GHz. The 40 cores on each compute node share about 750 GB of RAM (memory), but jobs should request no more than 753000 MB. In addition, each node has two Tesla V100-PCIE-16GB GPUs.
  • Large-memory nodes (bigmem partition): These are 80-core nodes with four sockets. Each socket holds an Intel Xeon Gold 6148 20-core processor running at 2.4 GHz. The 80 cores on each compute node share about 3 TB of RAM (memory), but jobs should request no more than 3000000 MB.
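
To illustrate how these limits translate into a job request, here is a minimal sketch of a Slurm batch script for one full node in the cpu2019 partition. The time limit and the program name are placeholders, and a job on the gpu-v100 partition would additionally request GPUs (for example, with --gres=gpu:1).

 #!/bin/bash
 # Minimal sketch: one full cpu2019 node (40 cores, up to 185000 MB of memory).
 # The time limit and the program being run are placeholders.
 #SBATCH --partition=cpu2019
 #SBATCH --nodes=1
 #SBATCH --ntasks=40
 #SBATCH --mem=185000M
 #SBATCH --time=02:00:00
 ./my_program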

Legacy hardware migrated from older clusters

While the hardware described in this section is still quite useful for many codes, its performance is typically half or less of that provided by the newer ARC hardware described above. We also expect more frequent hardware failures for these partitions as the components reach the end of their lifetimes.

  • Nodes moved from the former Hyperion cluster at the Cumming School of Medicine (cpu2013 partition): These are 16-core nodes based on Intel Xeon E5-2670 8-core 2.6 GHz processors. The 16 cores on each node share about 126 GB of RAM (memory), but jobs should request no more than 120000 MB.
  • Nodes moved from the former Lattice cluster (lattice and single partitions): These are 8-core nodes, each with two sockets. Each socket holds an Intel Xeon L5520 (Nehalem) quad-core processor running at 2.27 GHz. The 8 cores on each node share 12 GB of RAM (memory), but jobs should request no more than 12000 MB (see the sample job script after this list).
  • Non-GPU nodes moved from the former Parallel cluster (parallel partition): These are 12-core compute nodes based on the HP ProLiant SL390 server architecture. Each node has two sockets, each holding a 6-core Intel E5649 (Westmere) processor running at 2.53 GHz. The 12 cores on each compute node share 24 GB of RAM, but jobs should request no more than 23000 MB.
  • GPU (Graphics Processing Unit)-enabled nodes moved from the former Parallel cluster (gpu partition): These 12-core nodes are similar to the non-GPU Parallel nodes described above, but also have three NVIDIA Tesla M2070 GPUs. Each GPU has about 5.5 GB of memory and Compute Capability 2.0, which means that 64-bit floating point calculation is supported, along with other capabilities that make these boards suitable for general-purpose computing beyond graphics applications. However, these GPUs are too old to support modern machine-learning software such as TensorFlow.
  • Nodes moved from the former Breezy cluster (breezy partition): These are 24-core nodes, each with four sockets holding 6-core AMD Istanbul processors running at 2.4 GHz. The 24 cores on each compute node share 256 GB of RAM, but it is recommended that jobs request at most 255000 MB. Update 2019-11-27: the breezy partition nodes are being repurposed as a cluster to support teaching and learning activities and are no longer available as part of ARC.
  • Large-memory node moved from the former Storm cluster (bigbyte partition): This is a single 32-core node with four 8-core Intel Xeon E7-4830 processors running at 2.13 GHz. These processors share 1 TB of memory, of which jobs can request up to 1000000 MB.
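
For comparison, here is a minimal sketch of a job script for one of the legacy 8-core nodes in the single partition. The core count and memory limit come from the list above; the time limit and program name are placeholders.

 #!/bin/bash
 # Minimal sketch: one full legacy node in the single partition (8 cores, up to 12000 MB).
 # The time limit and the program being run are placeholders.
 #SBATCH --partition=single
 #SBATCH --nodes=1
 #SBATCH --ntasks=8
 #SBATCH --mem=12000M
 #SBATCH --time=02:00:00
 ./my_program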

Interconnect

The compute nodes within the cpu2019, apophis, apophis-bf, razi and razi-bf partitions communicate via a 100 Gbit/s Omni-Path network. The compute nodes within the lattice and parallel cluster partitions use an InfiniBand 4X QDR (Quad Data Rate) 40 Gbit/s switched fabric with a 2:1 blocking factor. All of these partitions are suitable for multi-node MPI parallel processing. A sample multi-node MPI job script is shown below.
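
As a sketch of how the interconnect is used in practice, the following hypothetical Slurm script requests two full cpu2019 nodes for an MPI program and launches the ranks with srun. The time limit and program name are placeholders, and the MPI environment to load depends on how the application was built.

 #!/bin/bash
 # Minimal sketch: a two-node MPI job on the cpu2019 partition (2 x 40 cores).
 # Memory is requested per node; the time limit and program name are placeholders.
 #SBATCH --partition=cpu2019
 #SBATCH --nodes=2
 #SBATCH --ntasks-per-node=40
 #SBATCH --mem=185000M
 #SBATCH --time=02:00:00
 # Load the MPI environment appropriate for your application here, then
 # launch one MPI rank per requested task across both nodes.
 srun ./my_mpi_program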