How to use local storage on ARC's compute nodes
Background
On the ARC cluster's compute nodes,
the local file systems /tmp (disk-based) and /dev/shm (memory-based)
offer fast, node-local storage that avoids network latency.
They are ideal for workloads needing quick access to temporary data, such as
intermediate files, caches, or scratch space.
Access to data in these storage locations is much faster than to shared network storage, such as /home or /scratch.
They are limited in total size but have no file count quotas.
Since they are private to each node and cleared after the job ends,
they’re best for short-lived data that doesn’t need to persist.
These file systems are not shared between compute nodes, so if you have a very large number of small files to process, your job will have to copy or decompress them to the local storage at the beginning, before the processing step.
Why this is important
Imagine we have 1,000,000 small files that will be used in 1,000 concurrent jobs. The total size of these files is 3.9 GB, and we also have a bzip2 archive of them, 2.5 MB in size:
$ du -sh *
3.9G files-1M
2.5M files-1M.tar.bz2
Each job will read every file 10 times during its runtime—possibly accessing different parts of each file.
Reading many small files from shared storage is slow and puts heavy load on the central file server.
For example, 1,000 jobs each reading 1,000,000 small files (3.9 GB total) 10 times would cause 10 billion slow reads over the network.
If instead each job downloads a 2.5 MB archive from shared storage and extracts it to its local /tmp,
the central storage only handles 1,000 slow reads.
All subsequent access, 3.9 GB of 1M files read 10 times, happens on fast local storage,
in parallel on each node, greatly reducing runtime and avoiding central storage bottlenecks.
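The 10 billion figure is simply the product of the counts involved and can be checked with shell arithmetic:
$ echo $(( 1000 * 1000000 * 10 ))
10000000000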
Local /tmp file system
The local temporary directory /tmp on all of ARC's compute nodes
is located on a local storage drive,
either a spinning hard disk drive (HDD) or a solid-state drive (SSD).
Depending on the node,
the total size of the drive providing /tmp ranges from 500 GB to 1 TB.
Each job running on a compute node is allocated its own private /tmp directory within this local storage.
To prevent a single job from consuming all temporary space, a storage quota of 100 GB per job is currently enforced.
One can start an interactive job on the cpu2019-bf05 partition:
$ salloc -N1 -n1 -c4 --mem=16gb -t 1:00:00 -p cpu2019-bf05
and then check the current storage usage with the arc.quota command:
$ arc.quota
Filesystem         Available   Used / Total               Files Used / Total
----------         ---------   -------- / ----------      ---------- / ----------
Home Directory     423.5GB     76.4GB / 500.0GB (15%)     1.1 Mil / 1.5 Mil (76%)
/scratch           15.0TB      0.0TB / 15.0TB (0%)        0.0 K / 1000.0 K (0%)
/tmp (on fc106)    99.9GB      0.0GB / 100.0GB (0%)       0.0 K / Unlimited
The third record shows the local storage available to our job: 99.9 GB in the /tmp directory, with an unlimited file count.
It also indicates that this storage is local to the fc106 compute node.
The actual storage device can be seen with:
$ df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       873G  4.4G  869G   1% /tmp
Using /tmp in a job
Suppose we have a job script that runs a command, say do_work, on a directory my_data containing many small input files.
The data directory is in our home directory, which can be referred to as $HOME.
A job script for processing from the shared file system could look like this:
#! /bin/bash
#SBATCH --job-name=test1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16gb
#SBATCH --time=0-06:00:00
#SBATCH --partition=single
# Set a variable to point to the data directory.
DATA_DIR="$HOME/my_data"
do_work $DATA_DIR
To convert it to use the local /tmp file system, we first have to compress the data directory into an archive:
$ cd
$ tar cjf my_data.tar.bz2 my_data
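To double-check what went into the archive, one can list its contents before submitting the job (an optional sanity check):
$ tar tjf my_data.tar.bz2 | head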
The job script for the local /tmp storage then can look like this:
#! /bin/bash
#SBATCH --job-name=test2
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16gb
#SBATCH --time=0-06:00:00
#SBATCH --partition=single
# Decompress files to the temporary location.
TMP=/tmp
tar xjf $HOME/my_data.tar.bz2 -C $TMP
# Set the variable to point to the data directory on the local temporary storage.
DATA_DIR="$TMP/my_data"
do_work $DATA_DIR
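If do_work writes its results into the data directory, keep in mind that the local /tmp is cleared when the job ends, so any output that should be kept has to be copied back to shared storage before the script exits. A minimal sketch, assuming, hypothetically, that the results end up in a results subdirectory:
# Copy the results (hypothetical location) back to the home directory before the job ends.
cp -r $TMP/my_data/results $HOME/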
Note that you do not have to delete the temporary files when the job is over; the space was created for your job by the scheduler and will be cleaned up by the scheduler when the job ends.
Local RAM-based /dev/shm file system
The /dev/shm file system is memory-backed (RAM).
On ARC compute nodes, it is created fresh for each job, with its size set equal to the memory limit requested.
It is local to the node and not shared between jobs (even between jobs on the same node).
Like /tmp, it has no file count limit.
When working with small files, it is expected to be 2–3x faster than the /tmp file system.
For example:
- A job requesting --mem=16gb will have a /dev/shm of 16 GB.
- A job requesting --mem=200gb will have a /dev/shm of 200 GB.
Because /dev/shm is part of the job’s memory allocation,
the requested memory must cover both the program’s needs and the space to hold data in /dev/shm.
If the file system is filled beyond this limit,
the job will terminate with an out-of-memory error.
That is, even though the size of /dev/shm is reported as 200 GB, you cannot expect to put 200 GB of data into it.
This is not accidental; it is the defined behavior of /dev/shm on ARC.
For instance:
- If a job needs 30 GB to run its code and processes 100 GB of data from normal storage, a 32 GB memory request will be sufficient.
- If the same job decompresses 100 GB into /dev/shm, the memory request must include 30 GB (for the code) + 100 GB (for the data) + overhead = about 140 GB in total (see the sketch after this list).
- The size of /dev/shm will be shown as 140 GB, but the job will only be putting 100 GB of data there.
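Translated into a job script header, the sizing from the second example might look like this (a sketch only; the exact overhead margin is a judgment call rather than a fixed rule):
#SBATCH --mem=140gb   # ~30 GB for the code + 100 GB of data in /dev/shm + overhead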
Please note that you have to make sure that the compute nodes in the partition you are going to submit the job to
have enough memory for the new memory request.
Compute nodes on ARC have different amounts of memory (RAM),
ranging from 180 GB to 500 GB on general partitions, and up to 2-8 TB on big-memory nodes.
Please use the arc.hardware command to find the partition with enough memory per node to fit your job's resource request.
/dev/shm in an interactive job
Start an interactive job:
$ salloc -N1 -n1 -c4 --mem=32gb -t 5:00:00 -p cpu2019-bf05
Check the available space in /dev/shm:
$ df -h /dev/shm
Filesystem      Size  Used Avail Use% Mounted on
tmpfs            32G     0   32G   0% /dev/shm
At this point, it can be used like any other directory for the duration of the job.
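For example, one could copy the archive from the earlier /tmp example into it and unpack it there (the file name is reused here only for illustration):
$ cp $HOME/my_data.tar.bz2 /dev/shm
$ cd /dev/shm
$ tar xjf my_data.tar.bz2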
Using /dev/shm in a job script
For example, suppose we have been running jobs to process multiple datasets, each consisting of many files stored in a corresponding data directory.
We have been using job scripts that run a command, do_work,
on a data directory dataset_NN that contains the dataset files,
where NN is the number of the dataset.
The data directories are in $HOME/datasets.
Thus, the job script to process dataset 10 could look like this:
#! /bin/bash
#SBATCH --job-name=dataset_10
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=32gb
#SBATCH --time=0-05:00:00
#SBATCH --partition=cpu2019-bf05
# Set a variable to point to the data directory.
DATA_DIR="$HOME/datasets/dataset_10"
do_work $DATA_DIR
However, dataset 11 contains 2M files with a total size of 100 GB.
This dataset cannot be decompressed into a home directory due to the file count quota of 1.5M files total.
The plan is to upload and save the dataset as $HOME/datasets/dataset_11.zip, and then decompress it to the
compute node's /dev/shm at the job's runtime.
The memory request for the job will have to include the runtime memory, 32 GB, as well as the memory for
storing the data, 100 GB, that is, 132 GB in total.
We have to find a partition that can provide this amount of memory using
the arc.hardware command.
After checking, we find that the nodes in the cpu2019-bf05 partition have 180000 MB of memory per node, which is sufficient for our purpose.
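As a quick sanity check with shell arithmetic, the 132 GB request expressed in MB fits within the 180000 MB available per node:
$ echo $(( 132 * 1024 ))
135168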
Thus, the job script for dataset 11 can look like this:
#! /bin/bash
#SBATCH --job-name=dataset_11
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=132gb
#SBATCH --time=0-05:00:00
#SBATCH --partition=cpu2019-bf05
# Decompress files to the temporary location.
TMP=/dev/shm
unzip $HOME/datasets/dataset_11.zip -d $TMP
# Set the variable to point to the data directory on the local temporary storage.
DATA_DIR="$TMP/dataset_11"
do_work $DATA_DIR
Note that you do not have to delete the temporary files when the job is over; the space was created for your job and will be cleaned up when the job ends.
Decompressing the files may take some time, for example 15-30 minutes.
If the requested time limit margin is tight, this time has to be added to the expected runtime.
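To see how much of the runtime is actually spent on decompression, the extraction step in the script above can be wrapped with the time command (an optional tweak; the rest of the script stays the same):
# Record how long the decompression takes in the job's output log.
time unzip -q $HOME/datasets/dataset_11.zip -d $TMP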