How to use local storage on ARC's compute nodes

Background

On the ARC cluster's compute nodes, the local file systems /tmp (disk-based) and /dev/shm (memory-based) offer fast, node-local storage that avoids network latency. They are ideal for workloads that need quick access to temporary data, such as intermediate files, caches, or scratch space. Access to data in these locations is much faster than access to shared network storage, such as /home or /scratch. They are limited in total size but have no file count quotas. Since they are private to each node and cleared after the job ends, they are best suited for short-lived data that does not need to persist.
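
For example, from a shell on a compute node, one can check the size and current usage of the underlying local file systems with the standard df command (the exact sizes vary from node to node, and df reports the whole local drive rather than the per-job /tmp quota discussed below):

$ df -h /tmp /dev/shm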

These file systems are not shared between compute nodes. If you have a very large number of small files to process, your job may need to copy or decompress them onto the local storage at the beginning of the run.

Why is this important?

Let us imagine that we have 1,000,000 small files that are going to be used by 1000 concurrent jobs. The total size of these 1M files is 3.9 GB. We also have a bzip2 archive of the files, 2.5 MB in size.

$ du -sh *
3.9G	files-1M
2.5M	files-1M.tar.bz2
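
For reference, an archive like this can be created from the directory of small files with tar; the command below is just an illustration using the names from the listing above:

$ tar -cjf files-1M.tar.bz2 files-1M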

Each job will read each file 10 times over its run time, likely accessing different parts of the files each time.

We also know that reads from a shared file system, such as /home or /scratch, go over the network to a file server and are relatively slow, while reads from a local drive go over a local data channel and are relatively fast.


Let us consider the case when all these files are stored in the user's home directory, on the /home file system.

  • 1000 jobs will read 1,000,000 files 10 times each from the central file server:
1000 x 10 x 1,000,000 = 10 billion slow reads from the central storage.


If each job runs on its own compute node and decompresses the data into the local /tmp directory:

  • Read the archive from central storage from 1000 jobs: 1000 slow reads.
  • Write 1M files to local storage: 1M fast writes.
  • Read each file 10 times: 1M x 10 = 10M fast reads.
  • Because each job runs on its own compute node, the fast local writes and reads can be done at the same time without interference.
That is, 1M writes on each of 1000 nodes will take about the same time as 1M writes on a single node.

Thus, the total read/write time will be equivalent to 1000 slow reads + 1M fast writes + 10M fast reads (a sketch of this workflow as a batch job script is shown at the end of this section).

The benefits:

  • The input/output time is expected to be much shorter in the second case.
  • The load on the central shared storage is much lower.
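
As a rough sketch of the second case, a batch job script could copy the archive to the node-local /tmp, unpack it there, and point the processing step at the local copy. The resource requests below mirror the interactive example in the next section; the archive location in the home directory and the myprocess command are placeholders, not part of ARC's documented setup.

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16gb
#SBATCH --time=01:00:00
#SBATCH --partition=cpu2019-bf05

# One slow read per job: copy the archive from shared storage to the node-local /tmp.
cp ~/files-1M.tar.bz2 /tmp/

# Fast local writes: unpack the 1M small files onto the local drive.
cd /tmp
tar -xjf files-1M.tar.bz2

# Fast local reads: the processing step (myprocess is a placeholder)
# reads the local copy in /tmp instead of /home or /scratch.
myprocess /tmp/files-1M

# Optional cleanup: the per-job /tmp is cleared when the job ends anyway.
rm -rf /tmp/files-1M /tmp/files-1M.tar.bz2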

Local /tmp file system

The local temporary directory /tmp on all ARC's compute nodes is located on a local storage drive, either a spinning hard disk drive (HDD) or a solid-state drive (SSD). Depending on the node, the total size of the drive providing /tmp ranges from 500 GB to 1 TB. Each job running on a compute node is allocated its own private /tmp directory within this local storage. To prevent a single job from consuming all temporary space, a storage quota of 100 GB per job is currently enforced.


One can start an interactive job on the cpu2019-bf05 partition:

$ salloc -N1 -n1 -c4 --mem=16gb -t 1:00:00 -p cpu2019-bf05

and then check the current storage usage with the arc.quota command:

$ arc.quota
Filesystem              Available        Used / Total            Files Used / Total
----------              ---------    -------- / ----------       ---------- / ----------
Home Directory          423.5GB        76.4GB / 500.0GB (15%)       1.1 Mil / 1.5 Mil (76%)
/scratch                15.0TB          0.0TB / 15.0TB (0%)           0.0 K / 1000.0 K (0%)
/tmp (on fc106)         99.9GB          0.0GB / 100.0GB (0%)          0.0 K / Unlimited

The third record shows the local storage available to our job: 99.9 GB in the /tmp directory, with an unlimited file count. The output also indicates that this storage is local to the fc106 compute node.
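
To see the quota in action, one can write some data into /tmp from the same interactive session and run arc.quota again; the Used value of the /tmp record should grow accordingly. The 1 GB test file below is only an illustration and is removed afterwards:

$ dd if=/dev/zero of=/tmp/testfile bs=1M count=1024
$ arc.quota
$ rm /tmp/testfile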