Python: Difference between revisions

From RCSWiki

Revision as of 15:38, 22 November 2022

General

Important Libraries

Manual: http://pandas.pydata.org/pandas-docs/stable/
Quick start: http://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html
Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.


statsmodels is a Python module that provides classes and functions for estimating many different statistical models, conducting statistical tests, and exploring statistical data.


  • mpi4py: requires MPI libraries.

ML

scikit-learn provides simple and efficient tools for data mining and data analysis.
TensorFlow is an open source software library for high performance numerical computation. Its flexible architecture allows easy deployment of computation across a variety of platforms (CPUs, GPUs, TPUs), and from desktops to clusters of servers to mobile and edge devices. Originally developed by researchers and engineers from the Google Brain team within Google’s AI organization, it comes with strong support for machine learning and deep learning, and its flexible numerical computation core is used across many other scientific domains.


Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.


PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows.

High Performance Python

Manual: https://docs.scipy.org/doc/numpy/user/quickstart.html
Reference: https://docs.scipy.org/doc/numpy/reference/
NumPy is the fundamental package for scientific computing with Python. It contains among other things:
-- a powerful N-dimensional array object
-- sophisticated (broadcasting) functions
-- tools for integrating C/C++ and Fortran code
-- useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.


  • Dask -- an open source library designed to provide parallelism to the existing Python stack.
Dask provides integrations with Python libraries like NumPy Arrays, Pandas DataFrames and scikit-learn
to enable parallel execution across multiple cores, processors and computers, without having to learn new libraries or languages.
  • Numba -- an open-source JIT (Just-in-time) compiler for Python.
It can translate a subset of Python and NumPy code into fast machine code.
It usually only requires the programmer to add some decorators to Python code so that efficient machine code can be produced.
In addition, some changes may have to be made to avoid using Python features which Numba cannot efficiently translate into machine code.
  • Cython -- a programming language that aims to be a superset of the Python programming language,
designed to give C-like performance with code that is written mostly in Python with optional additional C-inspired syntax.
Cython is a compiled language that is typically used to generate CPython extension modules.

Python and parallel programming

Implementing parallelism in Python involves many nuances, and the right approach depends on your expertise and on what you are trying to achieve. The main decision points are:

  • If you are comfortable writing in C and using OpenMP or MPI directly, you can write a Python extension with a library interface. This can give very good performance, with only the high-level logic expressed in Python and the low-level operations, parallelism, and inter-process communication implemented in C.
  • If you need to work in Python but can use Cython, you can generate the relevant C code entirely from a Python superset that is compiled ahead of time. This can open up multithreading but (as far as we know) not multiprocessing. This is how much of NumPy works.
  • If you need to work entirely in Python but can use common libraries beyond the standard library, there are many options with interesting tradeoffs. At a high level, you can automate the decomposition of task graphs into parallel parts with something like Dask (which is mostly useful for task-parallel problems with a natural data decomposition); this still depends on choosing a low-level engine among threading, multiprocessing, and mpi4py. You can create a Spark cluster with pyspark, which also provides a simple functional language for expressing the parallelism (see Apache Spark on ARC). You can use mpi4py to start multiple Python processes on multiple nodes in an MPI-standard way.
  • If you need to work entirely in Python and can only use the standard library, you are fairly limited in how parallel the code can be, as you essentially have to use the threading and multiprocessing libraries. Threading is still limited by the GIL except where it is explicitly released in the underlying C code (e.g. using the Py_BEGIN_ALLOW_THREADS macro or the NPY_BEGIN_THREADS macro). Multiprocessing is not limited by the GIL but carries a lot of overhead and is, by default, bound to one node.
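
As a minimal sketch of the standard-library option, the example below splits a CPU-bound sum of squares into chunks and runs them through concurrent.futures. The function names (chunk_sum, make_chunks, parallel_sum_of_squares, available_workers) are made up for illustration, and reading the worker count from SLURM_CPUS_PER_TASK assumes the job was submitted with --cpus-per-task.

```python
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def chunk_sum(bounds):
    """Sum of squares over the half-open range [start, stop)."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))


def make_chunks(n, parts):
    """Split range(n) into `parts` contiguous (start, stop) pairs."""
    step, extra = divmod(n, parts)
    chunks, start = [], 0
    for k in range(parts):
        # The first `extra` chunks take one additional element each.
        stop = start + step + (1 if k < extra else 0)
        chunks.append((start, stop))
        start = stop
    return chunks


def parallel_sum_of_squares(n, workers, use_processes=True):
    """Compute sum(i*i for i in range(n)) with a pool of workers.

    Processes sidestep the GIL for pure-Python CPU-bound work;
    threads share one GIL, so they mainly help with I/O-bound tasks.
    """
    pool_cls = ProcessPoolExecutor if use_processes else ThreadPoolExecutor
    with pool_cls(max_workers=workers) as pool:
        return sum(pool.map(chunk_sum, make_chunks(n, workers)))


def available_workers(environ=os.environ):
    """Worker count from Slurm (assumes --cpus-per-task was set), else 1."""
    return int(environ.get("SLURM_CPUS_PER_TASK", "1"))
```

In a script, call parallel_sum_of_squares(10_000_000, available_workers()) under an if __name__ == "__main__": guard, which multiprocessing needs on platforms that spawn rather than fork worker processes. Both pools stay on a single node; spanning several nodes needs something like mpi4py or Dask.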


Just moving all of the heavy lifting to C can produce an enormous speedup. Many popular C/C++ libraries (e.g. GDAL, FSL, TensorFlow) provide a Python interface for writing a few lines of high-level logic, while the real work happens behind the scenes. If your code uses an interface to a fast low-level library, the performance improvement is immediately accessible, but making the code parallel may be difficult because you have little control over how things are done inside the library calls. Also note that such libraries may implement their own parallel programming model.

Installing your own Python

Using Miniconda

You can install a local copy of Miniconda in your home directory on our clusters. This gives you the flexibility to install the packages your workflow needs. Here are the steps to follow:

Once connected to the login node, in your SSH session, make sure you are in your home directory:

$ cd 

Create a "software" subdirectory for all custom software you are going to have:

$ mkdir software; cd software 

Download the latest Miniconda distribution file:

$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

Install the downloaded .sh file:

$ bash Miniconda3-latest-Linux-x86_64.sh

Follow the instructions: agree to the license, choose ~/software/miniconda3 as the directory to create, and decline the offer to initialize.


Every time you launch a new terminal and want to use this version of Python, set the path as follows:

$ export PATH=/home/<username>/software/miniconda3/bin:$PATH

Ensure it is using the python installed in your home directory

$ which python 
~/software/miniconda3/bin/python

Create a virtual environment for your project

$ conda create -n <yourenvname>

Install additional Python packages to the virtual environment

$ conda install -n <yourenvname> [package]

Activate the virtual environment

$ source activate <yourenvname>

At this point you should be able to use your own python with the modules you added to it.
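
To confirm from inside Python which installation is actually running, a short check along these lines can help. It is only a sketch: interpreter_is_under and describe_interpreter are hypothetical helper names, and the path ~/software/miniconda3 assumes the install location chosen in the steps above.

```python
import os
import sys


def interpreter_is_under(directory):
    """Return True if the running interpreter lives inside `directory`."""
    exe = os.path.abspath(sys.executable or "")
    root = os.path.abspath(os.path.expanduser(directory))
    return os.path.commonpath([exe, root]) == root


def describe_interpreter():
    """Collect a few facts that identify the active Python installation."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "version": sys.version.split()[0],
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
    # Assumed install location from the steps above.
    print("Miniconda python:", interpreter_is_under("~/software/miniconda3"))
```

An interpreter from an activated conda environment also counts as being under ~/software/miniconda3, because environments are created in its envs subdirectory by default.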

From Sources

This is an installation from scratch, using the sources obtained from the official Python website. It does not use Conda.

First, create the directories for the new python installation and the source files:

$ cd
$ mkdir software
$ mkdir software/src

$ cd software/src

Now download a Python release from https://www.python.org/downloads/ as a "Gzipped source tarball" and unpack it.

$ wget https://www.python.org/ftp/python/3.9.5/Python-3.9.5.tgz

$ tar xvf Python-3.9.5.tgz 
$ ls -l

drwxr-xr-x 16 drozmano drozmano     4096 May  3 09:11 Python-3.9.5
-rw-r--r--  1 drozmano drozmano 25627989 May  3 10:11 Python-3.9.5.tgz

$ cd Python-3.9.5

Python is built with the autoconf toolchain, which is why there is a configure script in the top directory of the source tree. You can list the build options with its --help option:

$ ls

aclocal.m4          config.sub    Doc      install-sh  Mac              Modules       Parser   Programs       README.rst
CODE_OF_CONDUCT.md  configure     Grammar  Lib         Makefile.pre.in  netlify.toml  PC       pyconfig.h.in  setup.py
config.guess        configure.ac  Include  LICENSE     Misc             Objects       PCbuild  Python         Tools

$ ./configure --help | less
...
... 
# press "q" to exit the text viewer.

The configure script checks for all the necessary prerequisites for the build and creates a Makefile that controls the build process. All the build options we want have to be given on the command line. In this case, we want to set the location of the installation in our home directory with the --prefix=... option. We also want to enable the stable optimizations with --enable-optimizations to make the resulting Python faster.

$ ./configure --prefix=$HOME/software/python-3.9.5 --enable-optimizations
....
config.status: pyconfig.h is unchanged
creating Modules/Setup.local
creating Makefile

# Let us check the new make file
$ ls -l Makefile

-rw-r--r-- 1 drozmano drozmano 80751 Jun  8 16:17 Makefile

# Now we are ready to compile and build our own python (we will use 8 CPUs to do that).
# This will take some time.
$ make -j8

.....
.....
renaming build/scripts-3.9/idle3 to build/scripts-3.9/idle3.9
renaming build/scripts-3.9/2to3 to build/scripts-3.9/2to3-3.9
make[1]: Leaving directory '/home/drozmano/my_software/src/Python-3.9.5'

# Now, install it.
$ make install
.....
.....
Processing /tmp/tmpkh0d736f/pip-21.1.1-py3-none-any.whl
Installing collected packages: setuptools, pip
  WARNING: The scripts pip3 and pip3.9 are installed in '/home/drozmano/software/python-3.9.5/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed pip-21.1.1 setuptools-56.0.0

Ta-da! You have your own python installed!

It is still too early to use it. Although it is installed, the system does not know about it and cannot find it yet. We have to create a short script that will initialize our Python for use.

$ cd ~/software/

# Use your favourite text editor here.
$ vim init-python-3.9

Here is the init-python-3.9 script:

#! /bin/bash
# ===================================================================================================
ROOT=$HOME/software
PYROOT=$ROOT/python-3.9.5

# ===================================================================================================
PATH=$PYROOT/bin:$PATH

LD_LIBRARY_PATH=$PYROOT/lib:$LD_LIBRARY_PATH
LIBRARY_PATH=$PYROOT/lib:$LIBRARY_PATH
LD_RUN_PATH=$PYROOT/lib:$LD_RUN_PATH

INCLUDE=$PYROOT/include:$INCLUDE
CPATH=$PYROOT/include:$CPATH

MANPATH=$PYROOT/share/man:$MANPATH

PKG_CONFIG_PATH=$PYROOT/lib/pkgconfig:$PKG_CONFIG_PATH
# ===================================================================================================
export PATH

export LD_LIBRARY_PATH 
export LD_RUN_PATH 
export LIBRARY_PATH 

export INCLUDE
export CPATH

export MANPATH
export PKG_CONFIG_PATH
# ===================================================================================================

To activate our new python we have to source the init script into our current environment, or into the environment of our batch job.

Let us test it.

# Before
$ which python3
/usr/bin/python3

$ which pip3
/usr/bin/pip3

# Init our new python.
$ source ~/software/init-python-3.9

$ which python3
~/software/python-3.9.5/bin/python3

$ which pip3
~/software/python-3.9.5/bin/pip3

Success.
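
A source build silently skips optional standard-library modules when the matching development libraries (for example OpenSSL, SQLite, bzip2, xz, zlib, libffi or readline) are not found at configure time. The sketch below, run with the newly built python3, reports which of them import. The helper name check_modules and the module list are illustrative choices, not an exhaustive test.

```python
import importlib


# Standard-library modules that need external C libraries at build time.
OPTIONAL_MODULES = ["ssl", "sqlite3", "bz2", "lzma", "zlib", "ctypes", "readline"]


def check_modules(names):
    """Map each module name to True if it imports, False otherwise."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


if __name__ == "__main__":
    for name, ok in check_modules(OPTIONAL_MODULES).items():
        print(f"{name:10s} {'ok' if ok else 'MISSING'}")
```

A missing ssl module is worth catching early, since pip needs it to download packages over HTTPS.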

Extending Python

Virtual environments

$ python -m venv my_environment
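
An environment created this way is normally activated with source my_environment/bin/activate. As a small sketch, the following check reports from inside Python whether the interpreter is running in such an environment. The function name in_virtualenv is a hypothetical helper, not part of the venv module.

```python
import sys


def in_virtualenv():
    """True when the interpreter runs inside a venv-style environment."""
    # In a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the Python it was created from.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("prefix:     ", sys.prefix)
    print("base prefix:", sys.base_prefix)
    print("in venv:    ", in_virtualenv())
```

This only detects venv-style environments. A conda environment has its own interpreter, so both prefixes are the same there.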

Running Python scripts on ARC

If you have a Python script (program) saved as a file, my_code.py, and you want to run it as a job on the ARC cluster, you can do so by following the example below.


my_code.py:

# =========================================================================
import sys
import os
import socket

# =========================================================================
print("\nPython code: %s" % os.path.basename(sys.argv[0]))

print("\nPython version:\n%s" % sys.version)

print("\nHost name:\n%s" % socket.gethostname())

print("\nCurrent working directory:\n%s" % os.getcwd())

print("")
# =========================================================================

Once you have your Python code, you need to create a job script that requests the resources for the job and runs the code on those resources when they become available.

job.slurm:

#!/bin/bash
# ---------------------------------------------------------------------
#SBATCH --job-name=python_test

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1gb
#SBATCH --time=0-01:00:00

# ---------------------------------------------------------------------
echo "Starting run at: `date`"
# ---------------------------------------------------------------------
module load python/3.10.4

python3 my_code.py

# ---------------------------------------------------------------------
echo "Job finished at: `date`"
# ---------------------------------------------------------------------

Once you have both the files, my_code.py and job.slurm, in the same directory, you can submit your job with the

$ sbatch job.slurm

command.
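
Inside a running job, Slurm exposes the allocation through environment variables, and a script can report them next to the host name and working directory printed by my_code.py. The sketch below is one way to do that. The helper names slurm_summary and format_summary are made up for illustration, and SLURM_CPUS_PER_TASK is only set when the job requests --cpus-per-task, as job.slurm above does.

```python
import os

# Environment variables that Slurm sets inside a running job.
SLURM_VARS = ("SLURM_JOB_ID", "SLURM_JOB_NAME", "SLURM_JOB_NODELIST",
              "SLURM_NTASKS", "SLURM_CPUS_PER_TASK")


def slurm_summary(environ=os.environ):
    """Return the Slurm variables that are set, keyed by name."""
    return {name: environ[name] for name in SLURM_VARS if name in environ}


def format_summary(summary):
    """Render the summary as printable lines, one variable per line."""
    if not summary:
        return "Not running inside a Slurm job."
    return "\n".join(f"{name}={value}" for name, value in summary.items())


if __name__ == "__main__":
    print(format_summary(slurm_summary()))
```

Unless the job script sets --output, anything the job prints ends up by default in a file named slurm-<jobid>.out in the directory the job was submitted from.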