Tensorflow on ARC
Latest revision as of 22:44, 7 March 2024
Background
- Website: https://www.tensorflow.org/
Tensorflow is a tool for evaluating dataflow graphs that represent both the computations and model state in a machine learning algorithm. It enables distributed evaluation and explicit communication across a large number of computing devices (e.g. numerous CPUs or GPUs). The core tools of tensorflow consist of shared C libraries for constructing dataflow graphs and executing computations (typically linear algebra on tensors). This constitutes a fairly low level language for setting up data preprocessing, model training, and inference infrastructure for online machine learning applications. For an overview of tensorflow's design principles and its relationship with other popular machine learning technologies (like Caffe, Torch, and MXNet), Abadi's 2016 article is a good starting point.
However, in its most common usage, Tensorflow's implementations of common deep neural network operations are exposed through the Python API. Direct access to all of this functionality from a C++ API is under development, but most of the core deep neural network functionality that users are familiar with is still only available through specific client languages, with Python being the most developed. The high-level Tensorflow API is consistent with the modelling standards established by Keras (which supports multiple alternate backend evaluation engines). The higher-level abstractions do not necessarily offer the same flexibility of parallel computation as the core Tensorflow tool. However, they are generally more accessible to users who are primarily familiar with the modelling techniques common to artificial neural networks. Consequently, Keras has been integrated directly into the core utility libraries of the Tensorflow Python API.
Tensorflow documentation and tutorials
- Compute Canada article is not directly applicable on ARC but contains a lot of good information:
Installing Tensorflow
You will need a working local Conda install in your home directory first. If you do not have it yet, please follow these instructions to have it installed.
Once you have your own Conda, activate it with
$ source ~/software/init-conda
There are two ways to install Tensorflow:
(1) using Conda, or
(2) using pip, as explained on the Tensorflow site.
Installing with Conda
We will install Tensorflow into its own conda environment.
$ conda create -n tensorflow
....
Now we activate our new environment
$ conda activate tensorflow
Check for the available versions of tensorflow on the conda-forge channel and decide which one you want to install.
Please make sure that there is a GPU or cuda build for it (3rd column in the output table).
(tensorflow) $ conda search -c conda-forge tensorflow
....
tensorflow 2.14.0 cpu_py39h4655687_0 conda-forge
tensorflow 2.14.0 cuda118py310h148f8e3_0 conda-forge
....
....
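The build column of such a listing can also be filtered programmatically. The following is a minimal sketch, assuming each result line has the four whitespace-separated fields shown above (name, version, build, channel) and that GPU builds are the ones whose build string starts with `cuda`. The function name `gpu_builds` and the `SAMPLE` text are illustrative and not part of Conda.

```python
# Sample lines copied from the `conda search` listing above.
SAMPLE = """\
....
tensorflow 2.14.0 cpu_py39h4655687_0 conda-forge
tensorflow 2.14.0 cuda118py310h148f8e3_0 conda-forge
....
"""


def gpu_builds(listing, version):
    """Return the build strings of `version` that look like CUDA builds.

    Assumption: a CUDA build is recognisable by a build string that
    starts with "cuda", as in the listing above.
    """
    builds = []
    for line in listing.splitlines():
        fields = line.split()
        # Expect exactly: name, version, build, channel. Skip filler lines.
        if len(fields) != 4:
            continue
        name, ver, build, _channel = fields
        if name == "tensorflow" and ver == version and build.startswith("cuda"):
            builds.append(build)
    return builds


if __name__ == "__main__":
    print(gpu_builds(SAMPLE, "2.14.0"))
```

An empty result for the version you want means that only a CPU build was listed, so a different version should be chosen.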
For example, suppose we would like to get version 2.14.0. The listing shows that both CPU and GPU/cuda builds exist for it.
If we try to install this version on the login node, Conda will detect that there is no GPU available
and will install the CPU version.
We have to use the CONDA_OVERRIDE_CUDA variable to override auto-detection and force installation of the GPU version, like this:
(tensorflow) $ CONDA_OVERRIDE_CUDA="12.2" conda install -c conda-forge tensorflow=2.14.0 python pip
....
....
Before confirming the installation, check the list of the packages to be installed to make sure that the correct version and build are selected.
Once it is done, your Tensorflow environment is ready.
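After the installation you may want to confirm that the GPU build was really selected. A minimal sketch, assuming the environment is active: it uses `tf.test.is_built_with_cuda()` and `tf.sysconfig.get_build_info()` from the Tensorflow Python API, and the function name `describe_build` is hypothetical. It only inspects how the package was built, so it can be run on the login node, where no GPU is present.

```python
import importlib.util


def describe_build():
    """Report whether Tensorflow is installed and whether it is a CUDA build."""
    # Avoid an ImportError if the environment has no tensorflow package.
    if importlib.util.find_spec("tensorflow") is None:
        return {"installed": False}

    import tensorflow as tf

    info = tf.sysconfig.get_build_info()
    return {
        "installed": True,
        "version": tf.__version__,
        "cuda_build": tf.test.is_built_with_cuda(),
        # The "cuda_version" key is only expected for CUDA builds.
        "cuda_version": info.get("cuda_version"),
    }


if __name__ == "__main__":
    for key, value in describe_build().items():
        print("%s: %s" % (key, value))
```

If `cuda_build` is reported as `False`, the CPU build was installed and the install step should be repeated with the override variable set.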
Installing with PIP
If you want to follow the Tensorflow installation instructions (https://www.tensorflow.org/install/pip) and use pip,
you still need a separate Conda environment, and you have to install the required versions of Python and Pip.
(base) $ conda create -n tensorflow
....
(base) $ conda activate tensorflow
(tensorflow) $ conda install python=3.11.7 pip
....
If the installation has been successful, you now have the required versions of Python and Pip installed. You can now follow the instructions from the manual.
(tensorflow) $ pip install tensorflow[and-cuda]
....
After this installation is finished, Tensorflow is ready for testing.
Testing
You can test with the tensorflow-test.py script shown below.
Copy and paste the text into a file and run it from the command line:
(tensorflow) $ python tensorflow-test.py
If you try this on the login node, it should tell you that GPUs are not available. This is normal, as the login node does not have any. You will need a GPU node to test the GPUs.
Once you know that your tensorflow environment is working properly, you can add more packages to the environment using conda.
To deactivate the environment, use the command
(tensorflow) $ conda deactivate
To leave Conda as well, run it once more:
(base) $ conda deactivate
Test script
tensorflow-test.py:
#! /usr/bin/env python3
# ------------------------------------------
import os
import sys

# These variables must be set before tensorflow is imported to take effect.
os.environ['TF_XLA_FLAGS'] = '--tf_xla_enable_xla_devices'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import tensorflow as tf

sep = "=" * 80
# ------------------------------------------
print(sep)
print("Checking for available GPUs...\n")

gg = tf.config.list_physical_devices("GPU")
print("GPUs found: %d" % len(gg))
for g in gg:
    print("\t%s: %s" % (g.device_type, g.name))
# ------------------------------------------
print(sep)
print("Testing Tensorflow...\n")

try:
    # A small computation to confirm that Tensorflow can execute operations.
    x = tf.reduce_sum(tf.random.normal([1000, 1000]))
except Exception:
    print("FAILED!")
    sys.exit(1)

print("Tensorflow works properly. SUCCESS!")
# ------------------------------------------
print()
The --tf_xla_enable_xla_devices option allows GPUs to also be seen as XLA-capable devices.
- See here: https://www.tensorflow.org/xla
It is not critical and can be omitted.
Using Tensorflow on ARC
Requesting GPU Resources for Tensorflow Jobs
For interactive use see this How-To: How to request an interactive GPU on ARC.
An example of the job script tensorflow_job.slurm:
#! /bin/bash
# ====================================
#SBATCH --job-name=tf-test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16GB
#SBATCH --time=0-04:00:00
#SBATCH --gres=gpu:1
#SBATCH --partition=gpu-v100
# ====================================
source ~/software/init-conda
conda activate tensorflow
python tensorflow-test.py
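Inside a running job it can help to log what was actually allocated before Tensorflow starts. The following is a minimal sketch, assuming the standard Slurm variables SLURM_JOB_ID and SLURM_CPUS_PER_TASK are set by the job script above and that the GPU allocation is exposed through CUDA_VISIBLE_DEVICES. The function name `allocation_summary` is hypothetical.

```python
import os


def allocation_summary(env=os.environ):
    """Summarise the Slurm allocation visible in the given environment mapping."""
    cpus = env.get("SLURM_CPUS_PER_TASK")
    gpus = env.get("CUDA_VISIBLE_DEVICES", "")
    return {
        "job_id": env.get("SLURM_JOB_ID"),
        # None when the variable is missing, e.g. outside of a job.
        "cpus_per_task": int(cpus) if cpus else None,
        # Comma-separated device indices; an empty list means no GPU is visible.
        "visible_gpus": [g for g in gpus.split(",") if g],
    }


if __name__ == "__main__":
    for key, value in allocation_summary().items():
        print("%s: %s" % (key, value))
```

On the login node, outside of a job, the Slurm entries are expected to be empty, which is consistent with the test script reporting no GPUs there.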
Links