- Python2 vs Python3: https://learn.onemonth.com/python-2-vs-python-3/
- Pandas (0.24.2): https://pandas.pydata.org/
- Manual: http://pandas.pydata.org/pandas-docs/stable/
- Quick start: http://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html
- Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.
- StatsModels: http://www.statsmodels.org/stable/index.html
- A Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.
- mpi4py: requires MPI libraries.
- scikit-learn: https://scikit-learn.org/stable/
- Simple and efficient tools for data mining and data analysis
- TensorFlow: https://www.tensorflow.org
- An open source software library for high performance numerical computation. Its flexible architecture allows easy deployment of computation across a variety of platforms (CPUs, GPUs, TPUs), and from desktops to clusters of servers to mobile and edge devices. Originally developed by researchers and engineers from the Google Brain team within Google’s AI organization, it comes with strong support for machine learning and deep learning and the flexible numerical computation core is used across many other scientific domains.
- Keras (The Python Deep Learning library): https://keras.io
- Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.
- PyCaret: https://pycaret.org/
- PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows.
- PyTorch: https://pytorch.org/
- An open source machine learning framework that accelerates the path from research prototyping to production deployment.
High Performance Python
- NumPy (1.16): http://www.numpy.org/
- Manual: https://docs.scipy.org/doc/numpy/user/quickstart.html
- Reference: https://docs.scipy.org/doc/numpy/reference/
- NumPy is the fundamental package for scientific computing with Python. It contains among other things:
- -- a powerful N-dimensional array object
- -- sophisticated (broadcasting) functions
- -- tools for integrating C/C++ and Fortran code
- -- useful linear algebra, Fourier transform, and random number capabilities
- Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
- Dask -- an open source library designed to provide parallelism to the existing Python stack.
- Dask provides integrations with Python libraries like NumPy arrays, Pandas DataFrames and scikit-learn to enable parallel execution across multiple cores, processors and computers, without having to learn new libraries or languages.
- Numba -- an open-source JIT (Just-in-time) compiler for Python.
- It can translate a subset of Python and NumPy code into fast machine code.
- It usually only requires the programmer to add some decorators to Python code so that efficient machine code can be produced.
- In addition, some changes may have to be made to avoid using Python features which Numba cannot efficiently translate into machine code.
- Cython -- a programming language that aims to be a superset of the Python programming language, designed to give C-like performance with code that is written mostly in Python with optional additional C-inspired syntax.
- Cython is a compiled language that is typically used to generate CPython extension modules.
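To illustrate why NumPy sits at the base of this list, here is a minimal sketch comparing a pure-Python loop with the equivalent broadcasting expression. It assumes NumPy is installed; the function names and the sample data are made up for illustration.

```python
import numpy as np


def center_columns_loop(rows):
    """Pure-Python reference: subtract each column's mean from its entries."""
    n_rows, n_cols = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n_rows for j in range(n_cols)]
    return [[r[j] - means[j] for j in range(n_cols)] for r in rows]


def center_columns_numpy(data):
    """Same operation with broadcasting: (n_rows, n_cols) minus (n_cols,)."""
    data = np.asarray(data, dtype=float)
    # data.mean(axis=0) has one entry per column and is broadcast over all rows.
    return data - data.mean(axis=0)


if __name__ == "__main__":
    sample = [[1.0, 10.0], [3.0, 30.0]]  # hypothetical data
    print(center_columns_loop(sample))
    print(center_columns_numpy(sample))
```

Both functions return the same values; the NumPy version runs the loop in compiled code, which is where the speedup on large arrays typically comes from.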
Python and parallel programming
There are a lot of nuances to implementing parallelism in python depending on your expertise and what you are trying to achieve. The decision tree on how to implement it depends on a few things:
- If you are comfortable writing in C and using OpenMP or MPI directly, you could write an extension to python with a library interface and you could get very good performance with only high-level logic represented in python commands and low-level operations, parallelism, and inter-process communication represented in C.
- If you need to work in python but can use Cython, then you can generate the relevant C code from a python superset that gets precompiled. This can open up multithreading but (as far as we know) not multiprocessing. Parts of NumPy (for example, numpy.random) are built this way.
- If you need to work entirely in python but can use common libraries that go beyond the python standard library, then there are a lot of options with interesting tradeoffs:
- -- At a high level, you can automate the decomposition of task graphs into parallel parts with something like Dask (which is mostly useful for task-parallel problems with a natural data decomposition). This still depends on choosing a low-level engine among threading, multiprocessing, and mpi4py.
- -- You can create a Spark cluster with pyspark, which also provides a simple functional language for expressing the parallelism (see Apache Spark on ARC).
- -- You can use mpi4py to access an MPI-standard-based way of starting multiple python processes on multiple nodes.
- If you need to work entirely in python and can only use the standard library, then you are pretty limited in how parallel the code can be, as you almost have to use the threading and multiprocessing libraries. Threading is still limited by the GIL except where it is explicitly released in the underlying C code (e.g. using the Py_BEGIN_ALLOW_THREADS macro or the NPY_BEGIN_THREADS macro). Multiprocessing is not GIL limited but carries a lot of overhead and is, by default, bound to one node.
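The standard-library-only case can be sketched with concurrent.futures. The example below is an illustration, not a recommendation: the task function is a made-up stand-in for I/O-bound work, where threads help because time.sleep (like most blocking I/O) releases the GIL. For CPU-bound work, ProcessPoolExecutor from the same module offers the same map interface but runs separate processes.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(task_id, delay=0.05):
    """Hypothetical I/O-bound task; time.sleep releases the GIL while waiting."""
    time.sleep(delay)
    return task_id * task_id


def run_in_threads(n_tasks=8, max_workers=4):
    """Run the tasks on a thread pool and return results in submission order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, range(n_tasks)))


if __name__ == "__main__":
    start = time.perf_counter()
    results = run_in_threads()
    elapsed = time.perf_counter() - start
    print("results:", results)
    print("elapsed: %.2f s" % elapsed)
```

With 8 tasks and 4 workers the waits overlap, so the elapsed time should be well below the sum of the individual delays. A pure-Python numerical loop in place of the sleep would not show this gain under the GIL.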
Just moving all of the heavy lifting to C can produce an enormous speedup. Many popular C/C++ libraries (e.g. GDAL, FSL, TensorFlow) provide a python interface for writing a few lines of high-level logic, while the real work happens behind the scenes. If your code is going to use an interface to a fast low-level library, the performance improvement is immediately accessible, but making the code parallel may be difficult, as you have little control over how things are done inside the library calls. Also, note that the libraries may have their own parallel programming model implemented.
Installing your own Python
You can install a local copy of miniconda in your home directory on our clusters. It will give you flexibility to install packages needed for the workflow. Here are the steps to follow:
Once connected to the login node, in your SSH session, make sure you are in your home directory:
$ cd
Create a "software" subdirectory for all custom software you are going to have:
$ mkdir software; cd software
Download the latest Miniconda distribution file:
Run the downloaded installer:
$ bash Miniconda3-latest-Linux-x86_64.sh
Follow the instructions (choosing ~/software/miniconda3 as the directory to create), agree to the license, decline the offer to initialize.
Every time you launch a new terminal and want to use this version of python, set the path as follows
$ export PATH=/home/<username>/software/miniconda3/bin:$PATH
Ensure it is using the python installed in your home directory:
$ which python
~/software/miniconda3/bin/python
Create a virtual environment for your project
$ conda create -n <yourenvname>
Install additional Python packages to the virtual environment
$ conda install -n <yourenvname> [package]
Activate the virtual environment
$ source activate <yourenvname>
At this point you should be able to use your own python with the modules you added to it.
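A quick way to confirm this from inside python is sketched below. It uses only the standard library; the function names and the module list in the demo are placeholders to adapt to your own project.

```python
import importlib.util
import sys


def describe_interpreter():
    """Report which python is running and where it is installed."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "version": sys.version.split()[0],
    }


def missing_modules(names):
    """Return the top-level module names that cannot be found by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print(describe_interpreter())
    # Hypothetical requirement list; replace with the packages you installed.
    print("missing:", missing_modules(["numpy", "pandas"]))
```

If the reported executable is not under ~/software/miniconda3, or a package you installed shows up as missing, the PATH setting or the environment activation from the steps above did not take effect in the current shell.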
This section describes an installation from scratch, from the sources obtained from the original Python development web site. It does not use Conda.
First, create the directories for the new python installation and the source files:
$ cd
$ mkdir software
$ mkdir software/src
$ cd software/src
Now, get a new python from https://www.python.org/downloads/ as "Gzipped source tarball" and unpack it.
$ wget https://www.python.org/ftp/python/3.9.5/Python-3.9.5.tgz
$ tar xvf Python-3.9.5.tgz
$ ls -l
drwxr-xr-x 16 drozmano drozmano     4096 May 3 09:11 Python-3.9.5
-rw-r--r--  1 drozmano drozmano 25627989 May 3 10:11 Python-3.9.5.tgz
$ cd Python-3.9.5
Python is built with a tool called autoconf. This is manifested by the presence of the configure script in the top directory of the source tree. You can see the build options with its --help option:
$ ls
aclocal.m4          config.sub    Doc      install-sh  Mac              Modules       Parser   Programs       README.rst
CODE_OF_CONDUCT.md  configure     Grammar  Lib         Makefile.pre.in  netlify.toml  PC       pyconfig.h.in  setup.py
config.guess        configure.ac  Include  LICENSE     Misc             Objects       PCbuild  Python         Tools
$ ./configure --help | less
...
...
# press "q" to exit the text viewer.
The configure script checks for all the necessary prerequisites for the build and creates a Makefile that controls the build process. We have to provide all the build options we want on the command line. In this case, we want to specify the location of our installation in our home directory. We also want to enable the stable optimizations (--enable-optimizations) to improve python speed.
$ ./configure --prefix=$HOME/software/python-3.9.5 --enable-optimizations
....
config.status: pyconfig.h is unchanged
creating Modules/Setup.local
creating Makefile

# Let us check the new make file
$ ls -l Makefile
-rw-r--r-- 1 drozmano drozmano 80751 Jun 8 16:17 Makefile

# Now we are ready to compile and build our own python (we will use 8 CPUs to do that).
# This will take some time.
$ make -j8
.....
.....
renaming build/scripts-3.9/idle3 to build/scripts-3.9/idle3.9
renaming build/scripts-3.9/2to3 to build/scripts-3.9/2to3-3.9
make: Leaving directory '/home/drozmano/my_software/src/Python-3.9.5'

# Now, install it.
$ make install
.....
.....
Processing /tmp/tmpkh0d736f/pip-21.1.1-py3-none-any.whl
Installing collected packages: setuptools, pip
WARNING: The scripts pip3 and pip3.9 are installed in '/home/drozmano/software/python-3.9.5/bin' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed pip-21.1.1 setuptools-56.0.0
Ta-da! You have your own python installed!
It is still too early to use it. It is installed, but the system does not know about it and cannot find it yet. We have to create a short script that will initialize our python for use.
$ cd ~/software/
# Use your favourite text editor here.
$ vim init-python
Here is the content of the init-python script:
#! /bin/bash
# ===================================================================================================
ROOT=$HOME/software
PYROOT=$ROOT/python-3.9.5
# ===================================================================================================
PATH=$PYROOT/bin:$PATH
LD_LIBRARY_PATH=$PYROOT/lib:$LD_LIBRARY_PATH
LIBRARY_PATH=$PYROOT/lib:$LIBRARY_PATH
LD_RUN_PATH=$PYROOT/lib:$LD_RUN_PATH
INCLUDE=$PYROOT/include:$INCLUDE
CPATH=$PYROOT/include:$CPATH
# Man pages of the new python live under its own prefix.
MANPATH=$PYROOT/share/man:$MANPATH
PKG_CONFIG_PATH=$PYROOT/lib/pkgconfig:$PKG_CONFIG_PATH
# ===================================================================================================
export PATH
export LD_LIBRARY_PATH
export LD_RUN_PATH
export LIBRARY_PATH
export INCLUDE
export CPATH
export MANPATH
export PKG_CONFIG_PATH
# ===================================================================================================
To activate our new python we have to source the init script into our current environment, or into the environment of our batch job.
Let us test it.
# Before
$ which python3
/usr/bin/python3
$ which pip3
/usr/bin/pip3

# Init our new python.
$ source ~/software/init-python
$ which python3
~/software/python-3.9.5/bin/python3
$ which pip3
~/software/python-3.9.5/bin/pip3
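The same check can be made from inside the interpreter. The sketch below uses the standard sysconfig module to report how the running python was configured; on a source build like the one above, the CONFIG_ARGS variable normally holds the options passed to configure, but it may be empty on other platforms or distributions, which the code allows for. The function name is made up for illustration.

```python
import sys
import sysconfig


def build_summary():
    """Summarize where the running python lives and how it was configured."""
    # CONFIG_ARGS may be None (e.g. on non-autoconf builds), so fall back to "".
    config_args = sysconfig.get_config_var("CONFIG_ARGS") or ""
    return {
        "prefix": sys.prefix,
        "version": sysconfig.get_python_version(),
        "configure_args": config_args,
        "optimized": "--enable-optimizations" in config_args,
    }


if __name__ == "__main__":
    for key, value in build_summary().items():
        print("%-15s %s" % (key, value))
```

Run with the new python3, the prefix should point at the python-3.9.5 directory under ~/software.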
With the new python active, you can create a virtual environment for your project:
$ python3 -m venv my_environment
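The command above has a programmatic equivalent in the standard venv module. The following sketch creates an environment from python code and checks whether the current interpreter is running inside one. The helper names are hypothetical, and pip is left out by default to keep the example fast.

```python
import sys
import tempfile
import venv
from pathlib import Path


def create_env(path, with_pip=False):
    """Create a virtual environment at `path` and return its pyvenv.cfg path."""
    venv.create(path, with_pip=with_pip)
    return Path(path) / "pyvenv.cfg"


def in_virtualenv():
    """True when the running interpreter belongs to a venv-style environment."""
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    # Build a throwaway environment in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        cfg = create_env(Path(tmp) / "my_environment")
        print(cfg.read_text())
    print("inside a virtual environment:", in_virtualenv())
```

The pyvenv.cfg file records which base python the environment was created from, which is a handy way to confirm that it points at your own installation and not at the system one.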
Running Python scripts on ARC
If you have a python script (program) saved as a file and you want to run it as a job on the ARC cluster, you can follow the example shown below. The example code is saved as my_code.py:
# =========================================================================
import sys
import os
import socket
# =========================================================================
# sys.argv is a list; its first element is the path of the running script.
print("\nPython code: %s" % os.path.basename(sys.argv[0]))
print("\nPython version:\n%s" % sys.version)
print("\nHost name:\n%s" % socket.gethostname())
print("\nCurrent working directory:\n%s" % os.getcwd())
print("")
# =========================================================================
Once you have your python code, you need to create a job script that will
- request resources for the job to run on, as well as
- run the code on those resources when they become available.
#!/bin/bash
# ---------------------------------------------------------------------
#SBATCH --job-name=python_test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1gb
#SBATCH --time=0-01:00:00
# ---------------------------------------------------------------------
module load python/3.10.4
python3 my_code.py
Once you have both the files, my_code.py and job.slurm, in the same directory,
$ ls -l -rw-r--r-- 1 drozmano drozmano 487 Dec 8 11:42 my_code.py -rw-r--r-- 1 drozmano drozmano 342 Dec 9 14:46 job.slurm
you can submit your job with the sbatch command:
$ sbatch job.slurm
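The job script above requests resources with --cpus-per-task, and a python program can read that allocation back instead of hard-coding a worker count. The sketch below assumes the job runs under Slurm, which sets SLURM_CPUS_PER_TASK when --cpus-per-task is given and SLURM_JOB_ID for every job; the function names and the fallback values are illustrative choices.

```python
import os


def allocated_cpus(default=1):
    """Return the CPUs Slurm granted per task, or `default` outside a job."""
    value = os.environ.get("SLURM_CPUS_PER_TASK", "")
    if value.isdigit() and int(value) > 0:
        return int(value)
    return default


def job_summary():
    """Collect a few facts about the current job for logging."""
    return {
        "job_id": os.environ.get("SLURM_JOB_ID", "not running under Slurm"),
        "cpus": allocated_cpus(),
    }


if __name__ == "__main__":
    print(job_summary())
```

Sizing thread or process pools from allocated_cpus() keeps the program within the resources the scheduler actually reserved, which matters on shared nodes.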