General Cluster Guidelines and Policies
Revision as of 20:30, 21 July 2020
General Rules
- Never run research computations on the login node.
- The login node is for:
- => Data management: file management, compression and decompression, and, where needed, data transfer.
- => Job management: job script creation, submission, and monitoring.
- => Software development: source editing and compilation.
- => Short data analysis computations that use at most one CPU core for no more than 15 minutes.
- Everything else should be run on compute nodes, either as a batch job submitted with the sbatch command or in an interactive job started with the salloc command.
- Make sure that your job actually uses the resources you request for it.
- SLURM provides the resources that the job script requests, but that does not mean the code run by the job knows how to use them. It is essential that the user verifies that the requested resources are actually used.
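One way to check whether a finished job used what it requested is to compare the CPU time it consumed with the CPU time it was allocated. The sketch below assumes you have already obtained three numbers for the job, for example from the TotalCPU, Elapsed, and AllocCPUS fields of sacct, and converted the times to seconds. The function name and the sample values are hypothetical and only illustrate the arithmetic.

```python
def cpu_efficiency(total_cpu_seconds: float,
                   elapsed_seconds: float,
                   allocated_cpus: int) -> float:
    """Return the fraction of the allocated CPU time that the job really used.

    total_cpu_seconds: CPU time consumed by the job, summed over all cores
    elapsed_seconds:   wall-clock run time of the job
    allocated_cpus:    number of CPU cores the job was given
    """
    if elapsed_seconds <= 0 or allocated_cpus <= 0:
        raise ValueError("elapsed_seconds and allocated_cpus must be positive")
    return total_cpu_seconds / (elapsed_seconds * allocated_cpus)


# Hypothetical job: 16 cores requested, 1 hour of wall time,
# but the code ran single-threaded and consumed only 1 hour of CPU time.
efficiency = cpu_efficiency(total_cpu_seconds=3600,
                            elapsed_seconds=3600,
                            allocated_cpus=16)
print(f"CPU efficiency: {efficiency:.1%}")
```

A value close to 1.0 suggests the cores were kept busy. A value close to 1 divided by the number of cores suggests the code ran on a single core and the rest of the request was idle.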
Guidelines
- The bigmem partitions, while they can be used for general shorter jobs, are intended for computations that need a lot of memory.
- Please avoid running low-memory computations on the bigmem partitions.
- The gpu-v100 partition is strictly for computations that utilize GPUs.
- Please do not run CPU-only computations on the gpu-v100 partition.
- Interactive jobs should be limited to less than 5 hours.
- => If an interactive job asks for more than 5 hours of run time, it is hardly interactive. Who can stare at the screen for more than 5 hours straight?
- => Interactive jobs tend to waste resources because the job does not end when the computation is done; it keeps running until it times out.
- => The partition setup allocates resources much more quickly to jobs of 5 hours or less, so it is *significantly* easier to get resources in the default partitions for shorter jobs.
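As an illustration of the 5-hour guideline, the following sketch converts a Slurm-style time limit string into seconds and checks it against a 5-hour threshold. It assumes the time formats that Slurm documents for the --time option (minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes, days-hours:minutes:seconds). The function names are hypothetical, and treating exactly 5 hours as acceptable is an assumption based on the "5 hours or less" wording above.

```python
FIVE_HOURS_IN_SECONDS = 5 * 60 * 60


def slurm_time_to_seconds(spec: str) -> int:
    """Convert a Slurm-style time limit string to a number of seconds."""
    days = 0
    if "-" in spec:
        # Forms with a day count: D-HH, D-HH:MM, D-HH:MM:SS
        day_part, rest = spec.split("-", 1)
        days = int(day_part)
        parts = [int(p) for p in rest.split(":")]
        if len(parts) > 3:
            raise ValueError(f"unrecognized time limit: {spec!r}")
        parts += [0] * (3 - len(parts))  # pad missing minutes/seconds
        hours, minutes, seconds = parts
    else:
        parts = [int(p) for p in spec.split(":")]
        if len(parts) == 1:      # MM
            hours, minutes, seconds = 0, parts[0], 0
        elif len(parts) == 2:    # MM:SS
            hours, minutes, seconds = 0, parts[0], parts[1]
        elif len(parts) == 3:    # HH:MM:SS
            hours, minutes, seconds = parts
        else:
            raise ValueError(f"unrecognized time limit: {spec!r}")
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def fits_short_limit(spec: str) -> bool:
    """Return True if the requested time is 5 hours or less."""
    return slurm_time_to_seconds(spec) <= FIVE_HOURS_IN_SECONDS


for request in ("04:30:00", "300", "1-00:00:00"):
    print(request, fits_short_limit(request))
```

A check like this could be run on a planned time limit before submitting an interactive job, to confirm that the request stays within the range that the partition setup favors.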