PyTorch on ARC
Revision as of 17:38, 3 June 2022
Intro to Torch
- The Compute Canada article is not directly applicable to ARC, but it contains a lot of useful information:
Checkpointing
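Checkpointing matters on a cluster because every job is bounded by a time limit, such as the 1-hour allocations requested later on this page. The following is a minimal sketch, not an ARC-specific recipe: it saves model and optimizer state with torch.save and restores it with torch.load. The tiny Linear model, the file name checkpoint.pt, and the dictionary keys are placeholders chosen for illustration only.

```python
import importlib.util
import os
import tempfile


def checkpoint_roundtrip(path):
    """Save a checkpoint to `path`, load it back, return (epoch, weights_match)."""
    import torch

    # Placeholder model and optimizer; substitute your own.
    model = torch.nn.Linear(3, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # Save what is needed to resume: a progress counter plus both state dicts.
    torch.save(
        {
            "epoch": 5,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )

    # Resuming: rebuild the same objects, then load the saved state into them.
    restored = torch.nn.Linear(3, 1)
    restored_optimizer = torch.optim.SGD(restored.parameters(), lr=0.1)
    checkpoint = torch.load(path, map_location="cpu")
    restored.load_state_dict(checkpoint["model_state"])
    restored_optimizer.load_state_dict(checkpoint["optimizer_state"])

    return checkpoint["epoch"], torch.equal(model.weight, restored.weight)


def main():
    # Skip cleanly when run outside the pytorch environment.
    if importlib.util.find_spec("torch") is None:
        print("torch is not installed in this environment")
        return None
    with tempfile.TemporaryDirectory() as tmp:
        result = checkpoint_roundtrip(os.path.join(tmp, "checkpoint.pt"))
    print("epoch, weights match:", result)
    return result


if __name__ == "__main__":
    main()
```

In a real training loop the save step would typically run every few epochs, and the load step would run once at start-up if a checkpoint file already exists.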
Installing PyTorch
You will need a working local Conda install in your home directory first. If you do not have it yet, please follow these instructions to have it installed.
Once you have your own Conda, activate it with
$ ~/software/init-conda
We will install PyTorch into its own conda environment.
It is very important to create the environment with python and pytorch in the same command. This way conda can select a compatible combination of pytorch and python versions.
$ conda create -n pytorch python pytorch-gpu torchvision
Once it is done, activate your pytorch environment:
$ conda activate pytorch
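If you are unsure whether the activation took effect, a small check such as the sketch below may help. It relies on the CONDA_DEFAULT_ENV variable that conda sets on activation; the function name active_conda_env is made up for this example.

```python
import os
import sys


def active_conda_env(environ=None):
    """Return the name of the active conda environment, or None if none is active."""
    environ = os.environ if environ is None else environ
    return environ.get("CONDA_DEFAULT_ENV")


def main():
    env = active_conda_env()
    print("Active conda environment:", env)
    print("Python interpreter:", sys.executable)
    if env != "pytorch":
        print("The 'pytorch' environment does not appear to be active.")


if __name__ == "__main__":
    main()
```

The interpreter path printed by the script should point inside the environment's directory when the environment is active.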
You can test the installation with the torch-gpu-test.py script shown below. Copy and paste the text into a file and run it from the command line:
$ python torch-gpu-test.py
If you try this on the login node, it should tell you that CUDA is not available. This is normal, as the login node does not have any GPUs. You will need a GPU node to test the GPUs.
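Because the login node has no GPU while compute nodes do, it can be convenient to write code that runs on either. The sketch below is one common pattern, offered as an illustration rather than as the recommended ARC approach: it picks a device once and creates tensors directly on it. The helper name pick_device_name is hypothetical.

```python
import importlib.util


def pick_device_name(cuda_available):
    """Map the CUDA availability flag to a torch device name."""
    return "cuda" if cuda_available else "cpu"


def main():
    # Skip cleanly when run outside the pytorch environment.
    if importlib.util.find_spec("torch") is None:
        print("torch is not installed in this environment")
        return None
    import torch

    name = pick_device_name(torch.cuda.is_available())
    device = torch.device(name)

    # Create the tensors on the chosen device instead of moving them later.
    x = torch.rand(5, 3, device=device)
    y = torch.rand(5, 3, device=device)
    total = x + y
    print(f"Computed on {total.device}, result shape {tuple(total.shape)}")
    return name


if __name__ == "__main__":
    main()
```

The same script then reports cpu on the login node and cuda inside a GPU allocation, without any code changes.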
Once you know that your pytorch environment is working properly, you can add more packages to the environment using conda.
To deactivate the environment, use the command:
$ conda deactivate
Test script
torch-gpu-test.py:
#! /usr/bin/env python
# -------------------------------------------------------
import torch
# -------------------------------------------------------
print("Defining torch tensors:")
x = torch.Tensor(5, 3)
print(x)
y = torch.rand(5, 3)
print(y)
# -------------------------------------------------------
# let us run the following only if CUDA is available
if torch.cuda.is_available():
    print("CUDA is available.")
    x = x.cuda()
    y = y.cuda()
    print(x + y)
else:
    print("CUDA is NOT available.")
# -------------------------------------------------------
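When the test script reports that CUDA is not available on a GPU node, it often helps to know which build of torch was installed. The following sketch collects that information; it assumes nothing about ARC beyond a working Python, and the function name describe_torch is made up for this example.

```python
import importlib.util


def describe_torch():
    """Return a dict describing the installed torch build, or None if torch is missing."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    info = {
        "torch_version": torch.__version__,
        # CUDA version torch was built against; None for CPU-only builds.
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "gpu_names": [],
    }
    if info["cuda_available"]:
        info["gpu_names"] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    return info


if __name__ == "__main__":
    print(describe_torch())
```

A value of None for built_with_cuda would suggest that conda installed a CPU-only build, in which case no GPU will be detected regardless of the node.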
Using PyTorch on ARC
Requesting GPU Resources for PyTorch Jobs
Interactive Job
1 GPU and 4 CPUs on the gpu-v100 partition for 1 hour (16 GB of RAM):
$ salloc -N1 -n1 -c4 --mem=16GB --gres=gpu:1 -p gpu-v100 -t 1:00:00
1 GPU and 4 CPUs on the bigmem partition for 1 hour (256 GB of RAM):
$ salloc -N1 -n1 -c4 --mem=256gb --gres=gpu:1 -p bigmem -t 1:00:00
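Inside the allocation, a Python script can read what was granted from the environment. The sketch below assumes the standard Slurm variables SLURM_JOB_ID and SLURM_CPUS_PER_TASK (the latter is set when -c is given), and it assumes that the cluster exposes the assigned GPUs through CUDA_VISIBLE_DEVICES, which depends on the site configuration. The function name slurm_allocation is hypothetical.

```python
import os


def slurm_allocation(environ=None):
    """Summarise the current Slurm allocation from environment variables."""
    environ = os.environ if environ is None else environ
    cpus = environ.get("SLURM_CPUS_PER_TASK")
    return {
        "job_id": environ.get("SLURM_JOB_ID"),
        # Converted to int so it can be passed on as a worker or thread count.
        "cpus_per_task": int(cpus) if cpus else None,
        # Assumption: the site configuration sets this for --gres=gpu requests.
        "visible_gpus": environ.get("CUDA_VISIBLE_DEVICES"),
    }


if __name__ == "__main__":
    print(slurm_allocation())
```

With the salloc commands above, cpus_per_task would be expected to be 4. That number is a reasonable upper bound for settings such as the num_workers argument of a PyTorch DataLoader.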
Use the nvidia-smi command to check the GPU:
$ nvidia-smi
Fri Jun 3 11:35:14 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:17:00.0 Off |                    0 |
| N/A   39C    P0    42W / 250W |      0MiB / 40536MiB |     32%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
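The same information can be collected from a script by asking nvidia-smi for machine-readable output. The sketch below uses the --query-gpu and --format=csv,noheader options of nvidia-smi and returns an empty list when the command is not present, as on the login node. The helper names parse_gpu_csv and query_gpus are made up for this example.

```python
import shutil
import subprocess


def parse_gpu_csv(text):
    """Parse 'name, memory.total, utilization.gpu' CSV lines into dicts."""
    gpus = []
    for line in text.strip().splitlines():
        name, memory_total, utilization = [f.strip() for f in line.split(",")]
        gpus.append(
            {"name": name, "memory_total": memory_total, "utilization": utilization}
        )
    return gpus


def query_gpus():
    """Return one dict per visible GPU, or an empty list if none can be queried."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=name,memory.total,utilization.gpu",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```

Logging this at the start of a job makes it easier to confirm afterwards which GPU model the job actually ran on.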