General Cluster Guidelines and Policies

From RCSWiki

Revision as of 20:30, 21 July 2020

General Rules

  • Never run anything related to your research on the login node.
The login node is for:
=> Data management: file management, compression / decompression, and possibly data transfer.
=> Job management: job script creation / submission / monitoring.
=> Software development: source editing / compilation.
=> Short data analysis computations that use no more than 100% of 1 CPU for up to 15 minutes.
Everything else should be run on compute nodes, either via the sbatch command or in an interactive job via the salloc command.
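
A minimal sketch of a batch job script for moving work off the login node, assuming a generic SLURM setup. The job name and the resource values are illustrative assumptions, not site recommendations, and the final command is a placeholder for your own computation.

```shell
#!/bin/bash
# Illustrative resource request; adjust every value to what your code really needs.
#SBATCH --job-name=example
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=01:00:00

# SLURM_JOB_ID is set by SLURM inside a job; the fallback only matters
# if the script is run by hand outside of a job.
job_id="${SLURM_JOB_ID:-not-in-a-job}"
echo "Job ${job_id} running on $(hostname)"

# Placeholder: replace with your actual computation.
# ./my_analysis input.dat
```

Saved under a hypothetical name such as job.slurm, the script would be submitted with `sbatch job.slurm` and monitored with `squeue -u $USER`.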


  • You have to make sure that the resources you request for the job are used by the job.
When a job script requests resources from SLURM, they are allocated to the job, but that does not mean the code run as part of the job knows how to use them. It is essential that the user makes sure the requested resources are actually used.
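
One common way a job leaves resources idle is requesting several CPUs for a program that then runs on a single thread. The sketch below assumes an OpenMP program (the program name is hypothetical) and ties its thread count to the CPUs SLURM actually allocated; the resource values are illustrative.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=02:00:00

# SLURM sets SLURM_CPUS_PER_TASK when --cpus-per-task is given;
# fall back to 1 thread if the variable is not set.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "Using ${OMP_NUM_THREADS} thread(s)"

# Hypothetical OpenMP program; replace with your own.
# ./my_openmp_program input.dat
```

After the job finishes, `sacct -j <jobid> --format=JobID,Elapsed,TotalCPU,MaxRSS,ReqMem,AllocCPUS` gives a rough picture of how much of the requested CPU time and memory was used.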

Guidelines

  • The bigmem partition, although it can be used for shorter general jobs, is intended for computations that need lots of memory.
Please avoid running low memory computations on the bigmem partition.


  • The gpu-v100 partition is strictly for computations that utilize GPUs.
Please do not run CPU-only computations on the gpu-v100 partition.
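
A sketch of a GPU job request, under stated assumptions: the `--gres=gpu:1` form and the other resource values are illustrative and may differ on this cluster, the program name is hypothetical, and the check relies on SLURM exporting CUDA_VISIBLE_DEVICES for jobs that were allocated a GPU, which depends on the site's GPU configuration.

```shell
#!/bin/bash
#SBATCH --partition=gpu-v100
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=04:00:00

# Succeeds only if a GPU has been made visible to this job.
has_gpu() {
    [ -n "${CUDA_VISIBLE_DEVICES:-}" ]
}

if has_gpu; then
    echo "GPU(s) allocated: ${CUDA_VISIBLE_DEVICES}"
    # Hypothetical GPU program; replace with your own.
    # ./my_gpu_program
else
    echo "No GPU allocated; submit CPU-only work to a CPU partition instead." >&2
fi
```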


  • Interactive jobs should be limited to less than 5 hours.
=> If an interactive job asks for more than 5 hours of run time, it is hardly interactive. Who can stare at the screen for more than 5 hours straight?
=> Interactive jobs tend to waste resources, because the job does not finish when the computation is done but keeps running until it times out.
=> The partition setup allows for much quicker resource allocation for jobs that are 5 hours or less, so it is *significantly* easier to get resources in the default partitions for shorter jobs.
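
As an illustration of the time limit, the hypothetical helper below prints a salloc command for a short interactive session and refuses requests of 5 hours or more. The function name and the resource values are assumptions made for this sketch; only the salloc options themselves are standard SLURM.

```shell
#!/bin/bash
# Print the salloc command for an interactive session of the given
# whole number of hours; refuse requests of 5 hours or more.
interactive_cmd() {
    hours="$1"
    if [ "$hours" -ge 5 ]; then
        echo "Request less than 5 hours, or submit a batch job with sbatch." >&2
        return 1
    fi
    printf 'salloc --time=%02d:00:00 --nodes=1 --ntasks=1 --cpus-per-task=1 --mem=4G\n' "$hours"
}

# Show the command for a 2-hour session.
interactive_cmd 2
```

Once the computation is done, typing `exit` in the interactive shell releases the allocation instead of leaving it idle until the time limit.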