How to use local storage on ARC's compute nodes


Background

On the ARC cluster's compute nodes, the local file systems /tmp (disk-based) and /dev/shm (memory-based) offer fast, node-local storage that avoids network latency. They are ideal for workloads needing quick access to temporary data, such as intermediate files, caches, or scratch space. Access to data in these storage locations is much faster than to shared network storage, such as /home or /scratch. They are limited in total size but have no file count quotas. Since they are private to each node and cleared after the job ends, they are best for short-lived data that does not need to persist.

These file systems are not shared between compute nodes, so if you have a very large number of small files to process, your job may have to copy or decompress them onto the local storage at the beginning of the run.

Why is this important?

Imagine we have 1,000,000 small files that will be used in 1,000 concurrent jobs. The total size of these files is 3.9 GB, and we also have a bzip2 archive of them, 2.5 MB in size:

$ du -sh *
3.9G    files-1M
2.5M    files-1M.tar.bz2

Reading many small files from shared storage is slow and puts heavy load on the central file server. For example, 1,000 jobs each reading 1,000,000 small files (3.9 GB total) 10 times would cause 10 billion slow reads over the network.

If instead each job downloads the 2.5 MB archive from shared storage and extracts it to its local /tmp, the central storage handles only 1,000 slow reads. All subsequent access, 1,000,000 files (3.9 GB) read 10 times per job, happens on fast local storage, in parallel on each node, greatly reducing runtime and avoiding the central storage bottleneck.
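
As a quick sanity check on the numbers above, shell arithmetic gives the total read count for the naive approach:

$ echo $(( 1000 * 1000000 * 10 ))   # 1,000 jobs x 1,000,000 files x 10 passes each
10000000000

versus only 1,000 archive downloads (one per job) when each job works from its local /tmp copy.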

Local /tmp file system

The local temporary directory /tmp on all of ARC's compute nodes is located on a local storage drive, either a spinning hard disk drive (HDD) or a solid-state drive (SSD). Depending on the node, the drive providing /tmp ranges in size from 500 GB to 1 TB. Each job running on a compute node is allocated its own private /tmp directory within this local storage. To prevent a single job from consuming all the temporary space, a storage quota of 100 GB per job is currently enforced.


One can start an interactive job on the cpu2019-bf05 partition:

$ salloc -N1 -n1 -c4 --mem=16gb -t 1:00:00 -p cpu2019-bf05

and then check the current storage usage with the arc.quota command:

$ arc.quota
Filesystem              Available        Used / Total            Files Used / Total
----------              ---------    -------- / ----------       ---------- / ----------
Home Directory          423.5GB        76.4GB / 500.0GB (15%)       1.1 Mil / 1.5 Mil (76%)
/scratch                15.0TB          0.0TB / 15.0TB (0%)           0.0 K / 1000.0 K (0%)
/tmp (on fc106)         99.9GB          0.0GB / 100.0GB (0%)          0.0 K / Unlimited

The third line shows the local storage available to our job: 99.9 GB in the /tmp directory, with no limit on the number of files. The output also indicates that this storage is local to the fc106 compute node.
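
If you want to see the drive backing /tmp on the node your job landed on, the standard df command also works from inside the job. Note that df reports the whole local drive (roughly 500 GB to 1 TB depending on the node), while the 100 GB per-job quota shown by arc.quota is what actually limits your job:

$ df -h /tmp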

Using /tmp in a job

We have a job script that runs a command, say do_work, on a directory my_data that contains many small input files.

The data directory is in our home directory, which can be referred to as $HOME.

A job script for processing from the shared file system could look like this:

#! /bin/bash
#SBATCH --job-name=test1

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16gb
#SBATCH --time=0-06:00:00
#SBATCH --partition=single

DATA_DIR="$HOME/my_data"

do_work "$DATA_DIR"

To convert it to use the local /tmp file system, we first have to compress the data directory into an archive:

$ cd
$ tar cjf my_data.tar.bz2 my_data
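
If you like, you can list the first few entries of the archive to confirm it contains what you expect; this is just a sanity check and is not required:

$ tar tjf my_data.tar.bz2 | head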

The job script using the local /tmp storage can then look like this:

#! /bin/bash
#SBATCH --job-name=test1

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16gb
#SBATCH --time=0-06:00:00
#SBATCH --partition=single

# The archive on shared storage and the data copy on the node-local /tmp.
ARCHIVE="$HOME/my_data.tar.bz2"
DATA_DIR="/tmp/my_data"

# Extract the archive into the local /tmp once, at the start of the job.
tar xjf "$ARCHIVE" -C /tmp

# Process the data from fast local storage.
do_work "$DATA_DIR"

# /tmp is cleared when the job ends, but cleaning up explicitly does not hurt.
rm -rf "$DATA_DIR"
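
Assuming the script is saved as, for example, job_local.slurm (the file name is arbitrary), it is submitted with sbatch as usual:

$ sbatch job_local.slurm

Each job now transfers a single 2.5 MB archive from shared storage and performs all repeated reads against the node-local drive, which is exactly the access pattern described in the Background section.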