Python
Revision as of 20:58, 17 March 2021
General
- Project site: https://www.python.org/
- Downloads: https://www.python.org/downloads/
- Python2 vs Python3: https://learn.onemonth.com/python-2-vs-python-3/
Important Libraries
- NumPy (1.16): http://www.numpy.org/
- Manual: https://docs.scipy.org/doc/numpy/user/quickstart.html
- Reference: https://docs.scipy.org/doc/numpy/reference/
- NumPy is the fundamental package for scientific computing with Python. It contains among other things:
-- a powerful N-dimensional array object
-- sophisticated (broadcasting) functions
-- tools for integrating C/C++ and Fortran code
-- useful linear algebra, Fourier transform, and random number capabilities
- Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
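For instance, the N-dimensional array and broadcasting features mentioned above can be sketched as follows (the array contents are purely illustrative):

```python
import numpy as np

# A 2-D array (matrix) of 64-bit floats.
a = np.arange(6, dtype=np.float64).reshape(2, 3)

# Broadcasting: the 1-D row is "stretched" across both rows of `a`.
row = np.array([10.0, 20.0, 30.0])
b = a + row

# Vectorized reduction from the same package.
col_sums = b.sum(axis=0)
print(col_sums)  # [23. 45. 67.]
```

The loop implied by `a + row` runs in compiled C code rather than in the Python interpreter, which is where most of NumPy's speed comes from.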
- Pandas (0.24.2): https://pandas.pydata.org/
- Manual: http://pandas.pydata.org/pandas-docs/stable/
- Quick start: http://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html
- Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.
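A minimal sketch of the labeled-data model described above (column names and values are illustrative):

```python
import pandas as pd

# A small labeled table: columns are named, rows carry an index.
df = pd.DataFrame({
    "city": ["Calgary", "Edmonton", "Calgary"],
    "temp_c": [21.5, 19.0, 23.1],
})

# Label-based selection and a grouped aggregate.
calgary = df[df["city"] == "Calgary"]
mean_temp = df.groupby("city")["temp_c"].mean()
```

Selection, joining, and aggregation all work by label rather than by position, which is what makes pandas convenient for "relational" data.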
- scikit-learn: https://scikit-learn.org/stable/
- Simple and efficient tools for data mining and data analysis
- StatsModels: http://www.statsmodels.org/stable/index.html
- A Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.
- TensorFlow: https://www.tensorflow.org
- An open source software library for high performance numerical computation. Its flexible architecture allows easy deployment of computation across a variety of platforms (CPUs, GPUs, TPUs), and from desktops to clusters of servers to mobile and edge devices. Originally developed by researchers and engineers from the Google Brain team within Google’s AI organization, it comes with strong support for machine learning and deep learning and the flexible numerical computation core is used across many other scientific domains.
- Keras ( The Python Deep Learning library): https://keras.io
- Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.
- mpi4py: https://mpi4py.readthedocs.io/ (requires MPI libraries)
- Dask: https://dask.org/ (flexible library for parallel computing in Python)
Python and parallel programming
There are many nuances to implementing parallelism in Python, depending on your expertise and what you are trying to achieve. The decision tree for how to implement it depends on a few things:
- If you are comfortable writing in C and using OpenMP or MPI directly, you can write a Python extension with a library interface and get very good performance: only the high-level logic is expressed in Python commands, while low-level operations, parallelism, and inter-process communication are implemented in C.
- If you need to work in Python but can use Cython, you can generate the relevant C code entirely from a Python superset that gets precompiled. This can open up multithreading but (as far as we know) not multiprocessing. This is how much of NumPy works.
- If you need to work entirely in Python but can use common libraries beyond the standard library, there are many options with interesting tradeoffs. At a high level, you can automate the decomposition of a task graph into parallel parts with something like Dask (mostly useful for task-parallel problems with natural data-decomposition parallelism); this still depends on choosing a low-level engine among threading, multiprocessing, and mpi4py. You can create a Spark cluster with PySpark, which also provides a simple functional language for expressing the parallelism: Apache Spark on ARC. You can use mpi4py to access an MPI-standard way of starting multiple Python processes on multiple nodes.
- If you need to work entirely in Python and can only use the standard library, then you are quite limited in how parallel the code can be, as you essentially have to use the threading and multiprocessing libraries. Threading is still limited by the GIL except where the underlying C code explicitly releases it (e.g. using the Py_BEGIN_ALLOW_THREADS or NPY_BEGIN_THREADS macros). Multiprocessing is not GIL-limited but carries a lot of overhead and is, by default, bound to one node.
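As a standard-library-only sketch of the last option, a process pool spreads a CPU-bound function over several worker processes on one node (the worker count here is illustrative):

```python
from multiprocessing import Pool

def square(x):
    # CPU-bound work; it runs in a separate process, so the GIL of the
    # parent interpreter does not serialize these calls.
    return x * x

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(square, range(8))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Note that each call crosses a process boundary, so arguments and results are pickled; that serialization overhead is part of why multiprocessing only pays off for sufficiently heavy tasks.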
Just moving all of the heavy lifting to C can produce an enormous speedup. Many popular C/C++ libraries (e.g. GDAL, FSL, TensorFlow) expose a Python interface for writing a few lines of high-level logic while the real work happens behind the scenes. If your code uses an interface to a fast low-level library, the maximum performance is available right away, but making the code parallel may be difficult because you have little control over how things are done inside the library calls. Also note that such libraries may implement a parallel programming model of their own.
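A rough illustration of this point (assuming NumPy is available): the same element-wise computation written as a Python loop versus a single vectorized call into NumPy's C core.

```python
import numpy as np

data = np.arange(1_000_000, dtype=np.float64)

# Pure-Python loop: every iteration goes through the interpreter.
slow = [x * x for x in data]

# Vectorized NumPy call: the loop runs in compiled C code and is
# typically orders of magnitude faster for arrays of this size.
fast = data ** 2

assert np.allclose(slow, fast)
```

Both compute identical results; only where the loop executes differs.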
Installing your own Python
Using Miniconda
You can install a local copy of Miniconda in your home directory on our clusters. It will give you the flexibility to install the packages needed for your workflow. Here are the steps to follow:
Once connected to the login node, in your SSH session, make sure you are in your home directory:
$ cd
Create a "software" subdirectory for all custom software you are going to have:
$ mkdir software; cd software
Download the latest Miniconda distribution file:
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Install the downloaded .sh file:
$ bash Miniconda3-latest-Linux-x86_64.sh
Follow the instructions (choosing ~/software/miniconda3 as the directory to create), agree to the license, and decline the offer to initialize.
Every time you launch a new terminal and want to use this version of Python, set the path as follows:
$ export PATH=/home/<username>/software/miniconda3/bin:$PATH
Ensure it is using the Python installed in your home directory:
$ which python
~/software/miniconda3/bin/python
Create a virtual environment for your project
$ conda create -n <yourenvname>
Install additional Python packages to the virtual environment
$ conda install -n <yourenvname> [package]
Activate the virtual environment
$ source activate <yourenvname>
At this point you should be able to use your own python with the modules you added to it.
Extending Python
Virtual environments
$ python -m venv my_environment
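Putting the venv workflow together (a minimal sketch; my_environment is just an example name):

```shell
# Create an isolated environment using the standard-library venv module.
python3 -m venv my_environment

# Activate it: python and pip now resolve inside my_environment/.
source my_environment/bin/activate

which python    # should print a path ending in my_environment/bin/python

# Leave the environment when done.
deactivate
```

Unlike conda environments, venv environments use whichever Python interpreter created them, so create them with the Python version you intend to run.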