Python
Available versions
Currently, we have a variety of Python versions available as module files. To list them all, run
$ module avail|& grep 'lang/Python'
Content of those modulefiles
The Python versions available as module files provide numpy, scipy, pandas, cython and more. However, matplotlib in particular is most likely missing, because our installation framework installs it separately. Hence, the matplotlib functionality has to be loaded additionally as a separate module file.
The intel versions are linked against Intel's MKL. Exporting OMP_NUM_THREADS enables multithreaded matrix handling with numpy.
Which version should be picked?
If you intend to use Python in combination with another module, ensure that the toolchain and the toolchain version of the additional module fit with your selected Python module. With regard to the Python version, try to stay as current as possible.
If you need additional Python packages, you can easily install them yourself either "globally" in your home directory or inside of a virtual environment.
Your Personal Environment (Additional Packages)
In general, setting up a personal Python environment where you can install third-party packages yourself (without needing root privileges) is very easy. The preparation steps needed on Mogon are described below.
While the first variant is already sufficient, we recommend using virtualenvs since they are a lot easier to work with.
Virtualenvs can also be shared between users if created in your group's project directory. Most importantly, virtual environments can spare you the setup hell you might otherwise experience.
Do not use any of the modules ending in -bare to construct your virtual environment, as they are installed as special dependencies for particular modules (or were actually installed by accident).
We strongly discourage using any *conda setup on our clusters: it has often messed up an existing environment, which was only discovered as a source of interference after switching back to our modules. There actually are *conda modules provided by us. If you do use any *conda-related material, double-check the altered environment to be sure what you are doing and what *conda did.
Personal Setup
- First load an appropriate Python module, see the implications above.
- Then navigate to your home directory (if in doubt, type cd).
- Create some directories in which installed packages will be placed:
$ mkdir -p ~/.local/bin
$ mkdir -p ~/.local/lib/python<VERSION>/site-packages
- Now add the created bin directory to your PATH in your .bashrc file and source it:
$ echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
$ source ~/.bashrc
- Next, create a configuration file for easy_install and pip, the Python package management tools:
$ echo -e '[easy_install]\nprefix = ~/.local' > ~/.pydistutils.cfg
$ mkdir -p ~/.pip
$ echo -e '[install]\nuser = true' > ~/.pip/pip.conf
If you now use easy_install or pip, it will automatically install packages to the correct paths in your home directory.
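To check where Python looks for such per-user packages, the standard library's site module can be queried. The following is a minimal sketch; the helper name user_install_locations is made up for illustration and is not part of the setup above.

```python
import site
import sys


def user_install_locations():
    """Return the per-user base directory, the per-user site-packages
    directory, and whether that directory is currently on sys.path."""
    base = site.getuserbase()               # typically ~/.local on Linux
    packages = site.getusersitepackages()   # .../lib/python<VERSION>/site-packages
    return base, packages, packages in sys.path


base, packages, active = user_install_locations()
print("user base:          %s" % base)
print("user site-packages: %s" % packages)
print("on sys.path:        %s" % active)
```

The directory only shows up on sys.path once it exists, which is one reason for creating it beforehand.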
Using virtualenv
A so called virtualenv can be seen as an isolated, self-contained Python environment of third-party packages.
Different virtualenvs do not interfere with each other nor with the system-wide installed packages.
We advise using virtualenv, especially if you intend to install different combinations or versions of various Python packages. Virtualenvs can also be shared between users if created in your group's project directory.
We need to remove the easy_install and pip configuration files created above, since the paths set there would interfere with virtualenv:
$ rm ~/.pydistutils.cfg
$ rm ~/.pip/pip.conf
Now you can simply create, activate, use, deactivate and destroy as many virtualenvs as you want:
Create
Creating a virtualenv will simply set up a directory structure and install some baseline packages:
$ virtualenv <ENV>
New python executable in <ENV>/bin/python
Installing Setuptools...done.
Installing Pip...done.
With virtualenvs, you can even make each virtualenv use its own version of the Python interpreter:
# after loading an appropriate module file
$ virtualenv --python=$(which python) --system-site-packages <ENV<VERSION>>
If you want to install the pre-installed third-party packages (numpy, scipy, matplotlib, etc.) yourself, just omit the --system-site-packages parameter when calling virtualenv.
Otherwise, append the LD_LIBRARY_PATH of the module you are using to the environment activation script:
# note the double quotes
$ echo "export LD_LIBRARY_PATH=$LD_LIBRARY_PATH" >> <ENV>/bin/activate
Activate
To work in a virtualenv, you first have to activate it, which sets some environment variables for you:
$ source <ENV>/bin/activate
(<ENV>)$ # Note the name of the virtualenv in front of your prompt - nice, heh?
Use
Now you can use your virtualenv - newly installed packages will just be installed inside the virtualenv and just be visible to the python interpreter you start from within the virtualenv:
(<ENV>)$ easy_install requests
Searching for requests
Reading https://pypi.python.org/simple/requests/
Best match: requests 1.2.3
[...]
Processing dependencies for requests
Finished processing dependencies for requests
or
(<ENV>)$ pip install requests
Downloading/unpacking requests
Downloading requests-1.2.3.tar.gz (348kB): 348kB downloaded
Running setup.py egg_info for package requests
Installing collected packages: requests
Running setup.py install for requests
Successfully installed requests
Cleaning up...
And now compare what happens with the python interpreter from inside the virtualenv and with the system python interpreter:
(<ENV>)$ python -c 'import requests'
(<ENV>)$ /usr/bin/python -c 'import requests'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: No module named requests
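From within a script you can also check which interpreter is running and whether it belongs to a virtualenv. This is a minimal sketch; the function name in_virtualenv is an assumption for illustration. It relies on sys.real_prefix, which older virtualenv releases set, and on sys.base_prefix, which exists in Python 3.3 and later.

```python
import sys


def in_virtualenv():
    """Return True if the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases mark the environment with sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # Python 3.3+: inside an environment, sys.prefix differs from sys.base_prefix.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


print("interpreter:   %s" % sys.executable)
print("in virtualenv: %s" % in_virtualenv())
```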
Deactivate
Deactivating a virtualenv reverts the activation step and all its changes to your environment:
(<ENV>)$ deactivate
$
Destroy
To destroy a virtualenv, simply delete its directory:
$ rm -r <ENV>
virtualenvwrapper
Using multiple virtualenvs can be made much more user friendly using virtualenvwrapper.
If you are using Python 2.6.5, you can install and configure it using
$ easy_install --prefix=$HOME/.local virtualenvwrapper
$ echo 'source $HOME/.local/bin/virtualenvwrapper.sh' >> ~/.bashrc
If you are using any other version of Python, virtualenvwrapper is already installed and you just need to
$ echo 'source /cluster/Apps/Python/<VERSION>/bin/virtualenvwrapper.sh' >> ~/.bashrc
Re-login to apply the changes.
Load Environment Modules (module load [mod])
To load environment modules in Python (execfile() is only available in Python 2):
execfile('/usr/share/Modules/init/python.py')
module('load', '<modulename>')
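Since execfile() is a Python 2 builtin, a Python 3 interpreter needs a different way to run the init file. The following is a sketch only: the helper name load_module_function is hypothetical, and it assumes that the init file at the path given above defines a function called module, as the snippet above implies.

```python
import os


def load_module_function(init_file="/usr/share/Modules/init/python.py"):
    """Execute the Environment Modules init file and return its module()
    function, or None if the init file is missing or defines no such function."""
    if not os.path.isfile(init_file):
        return None
    namespace = {}
    with open(init_file) as handle:
        # Python 3 replacement for execfile(init_file)
        exec(compile(handle.read(), init_file, "exec"), namespace)
    return namespace.get("module")


module = load_module_function()
print("module function available: %s" % (module is not None))
```

If a function is returned, it would be called as before, e.g. module('load', '<modulename>') with your module name filled in.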
Job submission
As with other interpreted languages, you can indicate the desired interpreter for the script using a shebang. Here is an example script. Obviously, you can adapt the submit() function to your needs (e.g. add logging functionality, account better or differently for multithreading, etc.):
#!/bin/env python

#SBATCH -p nodeshort
#SBATCH -A <your account>
#SBATCH -N1
#SBATCH -n 32
# assuming 2-threaded daughter processes
# otherwise do not specify '-c'
# (will be set to 1, implicitly)
#SBATCH -c 2 # number of cores per task, e.g. 2 threads
#SBATCH -t 10
#SBATCH -J python-demo
#SBATCH -o python-demo.%j.log

import subprocess
import shlex
import locale
import os
import glob

def submit(call, ignore_errors = False):
    # one thread per core requested with '-c'
    n_threads = os.environ['SLURM_CPUS_PER_TASK']
    os.environ['OMP_NUM_THREADS'] = n_threads
    if int(n_threads) > 1:
        call = 'srun -n 1 -c %s --hint=multithread --cpu_bind=q %s' % (n_threads, call)
    else:
        call = 'srun -n 1 %s' % call
    call = shlex.split(call)
    process = subprocess.Popen(call, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # wait for the job step to finish and collect its output
    out, err = process.communicate()
    out = out.decode(locale.getdefaultlocale()[1])
    err = err.decode(locale.getdefaultlocale()[1])
    if (not ignore_errors) and (process.returncode):
        print("call failed, call was: %s" % ' '.join(call))
        print("Message was: %s" % str(out))
        print("Error code was %s, stderr: %s" % (process.returncode, err))
    return process.returncode, out, err

if __name__== '__main__':
    print(os.getcwd())
    for fname in glob.glob('*.input'):
        call = "your application --threads=2 --infile=%s" % fname
        submit(call)
For multinode scripts, ensure that the environment is set remotely (in most cases srun takes care of it).
Scripts employing mpi4py should not submit themselves. Scripts employing Python's built-in multiprocessing module do not need the submit() function.
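For the multiprocessing case, a job script might size its worker pool from the Slurm environment instead of calling srun. This is a minimal sketch under the assumption that the job requests a single task with several cores via '-c', so that SLURM_CPUS_PER_TASK is set; worker_count, square and main are illustrative names only.

```python
import os
from multiprocessing import Pool


def worker_count(default=1):
    """Number of worker processes, taken from Slurm's '-c' setting if present."""
    return int(os.environ.get("SLURM_CPUS_PER_TASK", default))


def square(x):
    """Stand-in for the real work done per input item."""
    return x * x


def main():
    # One worker process per core granted to this task.
    with Pool(processes=worker_count()) as pool:
        print(pool.map(square, range(8)))


if __name__ == "__main__":
    main()
```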
Performance Hints
Many of the hints are inspired by O'Reilly's Python Cookbook chapter on performance. We discuss only a few of them explicitly here, so the chapter is worth reading. If you need help getting performance out of Python scripts, contact us.
Profiling and Timing
Profiling how much time a program, or a task within it, takes is better than guessing. Identifying bottlenecks by guesswork is hard, so profiling is often worth the effort. The above-mentioned Cookbook covers this topic.
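As a starting point, the standard library's cProfile and pstats modules can profile a single function call. The sketch below uses a made-up workload (slow_sum) and a hypothetical helper (profile_call); replace them with your own code.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Deliberately simple workload to have something to profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args):
    """Run func(*args) under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Show the five entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


result, report = profile_call(slow_sum, 100000)
print(report)
```

For whole scripts, running python -m cProfile -s cumulative your_script.py gives a similar report without touching the code.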
Regular Expressions
Avoid them as much as you can. If you have to use them, compile them prior to any looping, e.g.:
import re
myreg = re.compile(r'\d')
for stringitem in stringlist:
    re.search(myreg, stringitem) # or myreg.search(stringitem)
Use Functions
A little-known fact is that code defined in the global scope runs slower than the same code defined in a function. The speed difference has to do with the implementation of local versus global variables (operations involving locals are faster). So, if you want to make the program run faster, simply put the scripting statements in a function (see also O'Reilly's Python Cookbook chapter on performance).
The speed difference depends heavily on the processing being performed.
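The following sketch lets you measure the difference yourself. It runs the same loop once as module-level code and once wrapped in a function; the loop body and all names are illustrative assumptions. No timings are quoted here because they depend on the Python version and the machine.

```python
import timeit

# The loop as plain module-level code: 'total' and 'i' are global names.
MODULE_LEVEL = compile("""
total = 0
for i in range(100000):
    total += i
""", "<module-level>", "exec")

# The same loop inside a function: 'total' and 'i' are local names.
IN_FUNCTION = compile("""
def run():
    total = 0
    for i in range(100000):
        total += i
    return total
total = run()
""", "<in-function>", "exec")


def run_code(code):
    """Execute a code object in a fresh global namespace and return 'total'."""
    namespace = {}
    exec(code, namespace)
    return namespace["total"]


t_module = timeit.timeit(lambda: run_code(MODULE_LEVEL), number=20)
t_function = timeit.timeit(lambda: run_code(IN_FUNCTION), number=20)
print("module level: %.4f s" % t_module)
print("in function:  %.4f s" % t_function)
```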
Selectively Eliminate Attribute Access
Every use of the dot (.) operator to access attributes comes with a cost. Under the covers, this triggers special methods, such as __getattribute__() and __getattr__(), which often lead to dictionary lookups.
You can often avoid attribute lookups by using the from module import name form of import as well as making selected use of bound methods. See the illustration in O'Reilly's Python Cookbook chapter on performance.
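A small sketch of both techniques, with function names chosen for illustration: the second variant imports sqrt directly and binds the list's append method to a local name before the loop, so the loop body contains no dot operator.

```python
import math
from math import sqrt


def roots_with_lookups(values):
    """Two attribute lookups per iteration: result.append and math.sqrt."""
    result = []
    for value in values:
        result.append(math.sqrt(value))
    return result


def roots_without_lookups(values):
    """No attribute lookups inside the loop."""
    result = []
    append = result.append  # bound method, looked up once
    for value in values:
        append(sqrt(value))
    return result


data = list(range(10))
print(roots_without_lookups(data) == roots_with_lookups(data))
```

Whether the gain matters depends on how hot the loop is, so it is worth profiling before restructuring code this way.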
Too many print statements
To avoid constant flushing (particularly in Python 2.x) and use buffered output instead, use Python's logging module, as it supports buffered output. An alternative is to write to sys.stdout and only flush at the end of a logical block.
In Python 3.x the print() function comes with a keyword argument flush, which defaults to False. However, use of the logging module is still recommended.
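The sys.stdout alternative can be sketched as follows. The helper name write_block is an assumption; it collects the lines of one logical block, writes them in a single call and flushes once.

```python
import io
import sys


def write_block(lines, stream=None):
    """Write all lines of one logical block, then flush a single time.
    Returns the number of lines written."""
    if stream is None:
        stream = sys.stdout
    stream.write("".join(line + "\n" for line in lines))
    stream.flush()
    return len(lines)


messages = ["step %d done" % i for i in range(3)]
write_block(messages)

# The same helper works with any file-like object, e.g. an in-memory buffer.
buffer = io.StringIO()
write_block(messages, buffer)
```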
Working with Scalars in Numerics Code
Any constant scalar is best not calculated in any loop - regardless of the programming language. Compilers might(!) optimize this away, but are not always capable of doing so.
One example (timings for the module tools/IPython/6.2.1-foss-2017a-Python-3.6.4 on Mogon I; results on Mogon II may differ, but the message will hold):
Every trivial constant is re-computed if the interpreter is asked to do so:
In [1]: from math import pi
In [2]: %timeit [1*pi for _ in range(1000)]
149 µs ± 6.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [3]: %timeit [pi for _ in range(1000)]
87.1 µs ± 2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The effect also shows if division is involved:
In [4]: some_scalar = 300
In [5]: pi_2 = pi / 2
In [6]: %timeit [some_scalar / (pi / 2) for _ in range(1000)]
249 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [7]: %timeit [some_scalar / pi_2 for _ in range(1000)]
224 µs ± 5.62 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Solution: Some evaluations are best placed outside of loops and bound to a variable.
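In a script, this amounts to hoisting the constant part of an expression out of the loop. A minimal sketch with illustrative function names follows; both variants compute the same values up to floating-point rounding.

```python
from math import pi


def scale_naive(values, some_scalar):
    """Re-computes pi / 2 and the division for every element."""
    return [value * some_scalar / (pi / 2) for value in values]


def scale_hoisted(values, some_scalar):
    """Computes the constant factor once, outside the loop."""
    factor = some_scalar / (pi / 2)
    return [value * factor for value in values]


data = [0.5, 1.0, 2.0]
print(scale_naive(data, 300))
print(scale_hoisted(data, 300))
```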
Compile Code!!!
Remember that every Python Module on Mogon comes with Cython. Cython is an optimising static compiler for both the Python programming language and the extended Cython programming language.
While we cannot give a comprehensive intro in this wiki document, we recommend using Cython whenever possible and give this little example:
Imagine you have a (tested) script that you need to call frequently. Then create modules your main script can import and write a setup script like this:
#!/usr/bin/env python
# script: setup.py
import os
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

named_extension = Extension(
    "name of your extension",
    ["directory_of_your_module/<module_name1>.pyx",
     "directory_of_your_module/<module_name2>.pyx"],
    extra_compile_args=['-fopenmp'],
    extra_link_args=['-fopenmp'],
    # header search paths; distutils' Extension expects 'include_dirs'
    include_dirs = os.environ['CPATH'].split(':')
)

setup(
    name = "some_name",
    cmdclass = {'build_ext': build_ext},
    ext_modules = [named_extension]
)
Replace named_extension with a name of your liking, and fill in all placeholders. You can now call the setup script like this:
$ python ./setup.py build_ext --inplace
This will create a file directory_of_your_module/<module_name1>.c, and a file directory_of_your_module/<module_name1>.so will be the result of a subsequent compilation step.
In Cython you can release the global interpreter lock (GIL) when not dealing with pure Python objects; see the Cython documentation for details.
In particular, Cython works with numpy.
Memory Profiling
Profiling memory is a special topic in itself. There is, however, the Python module "memory profiler", which is really helpful if you have an idea where to look. There is also Pympler, another such module.
Things to consider
Python is an interpreted language. As such, it should not be used for lengthy runs in an HPC environment. Please take advantage of the possibility to compile your own modules with Cython; consult the relevant Cython documentation. If you do not know how to start, attend a local Python course or schedule a meeting at our local HPC workshop.
Special packages
Please note that we have already installed numpy, scipy and matplotlib in the versions of Python that we provide additionally.
NumPy
When installing NumPy, the first installation attempt fails at exit. Don't worry, the installation is already finished by then, but to be sure, you can simply run the command again to see it exit cleanly.
Note that NumPy can also be linked against the Intel Math Kernel Library or the AMD Core Math Library: