How to use ARC scratch space
Latest revision as of 20:35, 21 September 2023
Background
The scratch space provided on ARC is designed to handle large temporary files generated by a job during its run time.
The scratch space is created by SLURM when the job starts, on the /scratch file system, as a directory named /scratch/<job ID>.
For example, if the job ID is 1234567, the directory name will be /scratch/1234567.
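For illustration, the naming scheme can be sketched in shell; the job ID below is hypothetical:

```shell
# Hypothetical job ID, used only to illustrate the naming scheme.
JOBID=1234567
SCRATCH_DIR=/scratch/$JOBID
echo "$SCRATCH_DIR"   # prints /scratch/1234567
```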
The quota (the storage limit) for the /scratch file system is set to 15 TB and 1 million files per user, at the time of writing.
The current storage usage on ARC can be checked at any time with the arc.quota command.
The /scratch file system is not faster than other storage on ARC; its performance is about the same as that of the /home file system.
It only provides access to large temporary run-time storage.
If the scratch directory is empty at the end of the job, it is deleted automatically upon the job's completion.
If, however, the scratch directory is not empty when the job finishes, it is not deleted immediately; instead, it is allowed to stay for another 5 days, counted from the time of the job's end, to let the user move the data to a proper storage location. After 5 days the scratch directory is automatically deleted even if it still contains data.
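To see which directories under a scratch root have passed the 5-day window, a hedged sketch using the standard find tool; the SCRATCH_ROOT default of /scratch is assumed from the description above:

```shell
# List directories under a scratch root last modified more than 5 days
# ago; on ARC these would be the ones at risk of automatic deletion.
# SCRATCH_ROOT is an assumption for illustration; on ARC it is /scratch.
SCRATCH_ROOT=${SCRATCH_ROOT:-/scratch}
find "$SCRATCH_ROOT" -maxdepth 1 -mindepth 1 -type d -mtime +5 2>/dev/null || true
```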
Before a job starts, its job ID is not yet known, and the scratch directory does not yet exist.
Thus, it is impossible to stage any data into that directory before the job starts.
Any data staging, if required, has to be done during the job's run time, in the job script.
While a job is running, the SLURM environment variable $SLURM_JOBID is set to the job's actual job ID.
Hence, the scratch directory can be referenced as /scratch/$SLURM_JOBID.
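A minimal sketch of referencing the directory this way in a script; the fallback to a local temporary directory is an assumption added here only so the snippet can be tried outside SLURM:

```shell
# Build the scratch path from SLURM's job ID; fall back to a local
# temporary directory when not running under SLURM (for testing only).
SCRATCH=${SLURM_JOBID:+/scratch/$SLURM_JOBID}
SCRATCH=${SCRATCH:-$(mktemp -d)}
echo "Using scratch directory: $SCRATCH"
```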
Examples
Large data set
Let us assume that one has a large data set compressed into an archive data.zip in one's home directory.
The data set has to be processed by a Python 3 code, analysis.py, located in the bin directory in the home.
The data set is too large to be uncompressed into the home directory, so the scratch space has to be used to process the data.
The result of the analysis is saved by the Python code to results.log in the working directory.
Then it can be done like this:
- Job starts and changes its working directory to the scratch space;
- The data file is decompressed from the user's home directly into the scratch space;
- The data is processed by the analysis.py code in the scratch space;
- Once the analysis is done, the output results.log is copied back to the directory the job started from;
- The scratch space is cleaned up by deleting the data and the output from the scratch directory;
- Job ends and the scratch directory is automatically deleted.
This could be done with the following job script, job.slurm:
#! /bin/bash
# ======================================================================
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4gb
#SBATCH --time=0-02:00:00
# ======================================================================
module load python/3.10.4
# Remember the initial working directory and put the scratch storage path into a variable.
WORKDIR=$(pwd)
SCRATCH=/scratch/$SLURM_JOBID
# Change the working directory to the scratch location and unzip the data from the archive in the home.
cd "$SCRATCH"
unzip "$HOME/data.zip"
# Run the analysis code to process all the data.
python3 "$HOME/bin/analysis.py" .....
# Copy the results back to the original working directory.
cp results.log "$WORKDIR/"
# Clean up the scratch space and change the working directory back to the initial location.
rm -rf "$SCRATCH"/*
cd "$WORKDIR"
# The job ends here.
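One possible refinement of the cleanup step, sketched below: register an EXIT trap so the scratch contents are removed even if an earlier step fails, leaving the directory empty so it is deleted automatically when the job ends. The fallback to a temporary directory is an assumption added only so the sketch can run outside SLURM:

```shell
# Guarantee scratch cleanup with an EXIT trap, so the directory ends up
# empty (and is thus auto-deleted) even if the job script fails part-way.
# The mktemp fallback is for illustration outside SLURM only.
SCRATCH=${SLURM_JOBID:+/scratch/$SLURM_JOBID}
SCRATCH=${SCRATCH:-$(mktemp -d)}
cleanup() { rm -rf "${SCRATCH:?}"/*; }
trap cleanup EXIT

touch "$SCRATCH/data.tmp"   # stand-in for real job data
cleanup                     # same command the trap runs at script exit
ls -A "$SCRATCH" | wc -l    # prints 0: the directory is now empty
```

The ${SCRATCH:?} expansion aborts the rm if SCRATCH is ever unset or empty, which protects against accidentally deleting from the root directory.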