PyTorch on ARC

From RCSWiki
Latest revision as of 22:52, 7 March 2024

Background

PyTorch is a Python package that provides two high-level features:

  • Tensor computation (like NumPy) with strong GPU acceleration
  • Deep neural networks built on a tape-based autograd system

You can reuse your favorite Python packages such as NumPy, SciPy, and Cython to extend PyTorch when needed.
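As a minimal sketch of the two features listed above (the tensor values are arbitrary and chosen only for illustration), the following computes a sum of squares and lets autograd produce the gradient:

```python
import torch

# Tensor computation: element-wise operations, much like NumPy arrays.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()   # y = 1 + 4 + 9 = 14

# Autograd: backward() fills x.grad with dy/dx = 2 * x.
y.backward()
print(y.item())   # 14.0
print(x.grad)     # tensor([2., 4., 6.])
```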

Installing PyTorch

You will need a working local Conda install in your home directory first. If you do not have it yet, please follow these instructions to have it installed.

Once you have your own Conda, activate it with

$ source ~/software/init-conda

We will install PyTorch into its own conda environment.

$ conda create -n pytorch 
....

Now we activate our new environment

$ conda activate pytorch

Check the versions of pytorch available on the conda-forge channel and decide which one you want to install. Make sure that a GPU (cuda) build exists for that version (the build string is in the 3rd column of the output table).

(pytorch) $ conda search -c conda-forge pytorch
....
pytorch                        2.1.2 cpu_mkl_py39h9c325db_100  conda-forge         
pytorch                        2.1.2 cuda112_py310hb684afd_301  conda-forge         
....
....

For example, if we want version 2.1.2, the output shows that both a CPU build and a GPU (cuda) build are available.
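The build column can also be checked programmatically. The sketch below is only an illustration: the helper name build_kind is hypothetical, and it assumes, based on the two rows shown above, that GPU build strings start with "cuda".

```python
def build_kind(build: str) -> str:
    """Classify a conda build string as 'cuda' or 'cpu' from its prefix (assumed convention)."""
    return "cuda" if build.startswith("cuda") else "cpu"

# Two rows copied from the search output above: name, version, build, channel.
rows = [
    "pytorch 2.1.2 cpu_mkl_py39h9c325db_100 conda-forge",
    "pytorch 2.1.2 cuda112_py310hb684afd_301 conda-forge",
]

# The build string is the 3rd whitespace-separated column.
kinds = [build_kind(row.split()[2]) for row in rows]
print(kinds)  # ['cpu', 'cuda']
```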


If we try to install this version on the login node, Conda will detect that there is no GPU available and will install the CPU version. To force installation of the GPU version, set the CONDA_OVERRIDE_CUDA variable, which overrides the auto-detection:

(pytorch) $ CONDA_OVERRIDE_CUDA="12.2" conda install -c conda-forge pytorch=2.1.2 python pip 
....
....

Before confirming the installation, check the list of the packages to be installed to make sure that the correct version and build are selected.

Once it is done, your PyTorch environment is ready.


At this point you can add more packages to the environment using conda.

(pytorch) $ CONDA_OVERRIDE_CUDA="12.2" conda install -c conda-forge torchvision cuda-nvcc
....
....

Note that you will have to use the override variable again if the additional packages depend on the presence of a GPU.

Testing

You can test the installation with the torch-gpu-test.py script shown below. Copy and paste the text into a file and run it from the command line:

(pytorch) $ python torch-gpu-test.py 

If you try this on the login node, it should tell you that GPUs are not available. This is normal, as the login node does not have any. You will need a GPU node to test the GPUs.
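Because the same code may run on nodes with and without GPUs, a common pattern is to select the device at run time. This is a minimal sketch, not part of the test scripts on this page; the tensor shapes are arbitrary:

```python
import torch

# Fall back to the CPU when no GPU is visible (for example, on the login node).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create the tensors directly on the selected device.
x = torch.rand(5, 3, device=device)
y = torch.rand(5, 3, device=device)
z = x + y
print(device.type, tuple(z.shape))
```

With this pattern the script runs unchanged on the login node (on the CPU) and on a GPU node.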

To deactivate the environment (and conda itself), use the following commands:

(pytorch) $ conda deactivate
(base) $ conda deactivate
$

Test script

torch-gpu-test.py:

#! /usr/bin/env python 
# -------------------------------------------------------
import torch
# -------------------------------------------------------
print("Defining torch tensors:")
x = torch.Tensor(5, 3)
print(x)
y = torch.rand(5, 3)
print(y)

# -------------------------------------------------------
# let us run the following only if CUDA is available
if torch.cuda.is_available():
    print("CUDA is available.")
    x = x.cuda()
    y = y.cuda()
    print(x + y)
else:
    print("CUDA is NOT available.")

# -------------------------------------------------------

Test script 2

torch-gpu-test2.py:

#! /usr/bin/env python
# -------------------------------------------------------
import os
import socket
import torch
# -------------------------------------------------------

# CUDA_VISIBLE_DEVICES may be unset outside of a GPU job, so do not index os.environ directly.
dev = os.environ.get('CUDA_VISIBLE_DEVICES', '(not set)')

host = socket.gethostname()
tavail = torch.cuda.is_available()
tcount = torch.cuda.device_count()

print("Host: %s\nENV Devices: %s\nCUDA is available: %s\nDevice count: %d" %
        (host, dev, tavail, tcount))

if tavail:
    # These calls raise an error when no GPU is visible, so run them only if CUDA is available.
    tdev = torch.cuda.current_device()
    tname = torch.cuda.get_device_name(tdev)
    print("CudaDev: %s\nDevice: %s" % (tdev, tname))
    print(os.popen("/usr/bin/nvidia-smi -L").read().strip())

print(os.popen("env | grep CUDA").read().strip())
print("")
# -------------------------------------------------------

Using PyTorch on ARC

Requesting GPU Resources for PyTorch Jobs

For interactive use see this How-To: How to request an interactive GPU on ARC.

An example of the job script torch_job.slurm:

#! /bin/bash
# ====================================
#SBATCH --job-name=torch-test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16GB
#SBATCH --time=0-04:00:00
#SBATCH --gres=gpu:1
#SBATCH --partition=gpu-v100
# ====================================

source ~/software/init-conda
conda activate pytorch

python torch-gpu-test.py
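Inside such a job, a Python script can read the CPU count granted by the --cpus-per-task option from the SLURM_CPUS_PER_TASK environment variable. The sketch below is an illustration only; the helper name cpus_for_job is hypothetical, and the fallback of 1 is an assumption for runs outside of SLURM:

```python
import os

def cpus_for_job(default: int = 1) -> int:
    """Return the CPU count set by --cpus-per-task, or a default outside of SLURM."""
    return int(os.environ.get("SLURM_CPUS_PER_TASK", default))

n_cpus = cpus_for_job()
print("CPUs available to this job:", n_cpus)
```

The value could then be passed, for example, to torch.set_num_threads() or used as the num_workers argument of a DataLoader, so that the script does not start more threads than the job requested.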

Checkpointing

Refer to the checkpointing tutorial at https://pytorch.org/tutorials/recipes/recipes/saving_and_loading_a_general_checkpoint.html.
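The general idea is to save the model and optimizer state periodically, so that a job that hits its time limit can be resubmitted and continue where it left off. The following is a minimal sketch in the spirit of that tutorial; the tiny model, the epoch number, and the file location are placeholders chosen for illustration:

```python
import os
import tempfile
import torch
import torch.nn as nn

model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Hypothetical location; in a real job, use a path that survives the job.
path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")

# Save everything needed to resume training.
torch.save({
    "epoch": 5,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}, path)

# Later (for example, in a resubmitted job): restore and continue from the next epoch.
checkpoint = torch.load(path)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1
print(start_epoch)  # 6
```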

Links

ARC Software pages