Python in TCM
Python is a remarkably useful programming language, and also a remarkably frustrating one.
One set of frustrations arises from the language itself. It is easy to write code in python. But it is also easy to write very slow, inefficient code accidentally, and it is easy to write code which almost no-one else will ever understand. And python's use of a Global Interpreter Lock, which allows only one thread at a time to execute python code, makes writing threaded code that actually runs in parallel unusually hard. Of course, other languages are imperfect too.
The other set comes from the reliance of most python scripts on non-standard modules. The core python language contains a number of standard modules in its "standard library": math, random, statistics, urllib, datetime, re, zlib, os, sys, shutil and glob to name a few. But most scripts rely on modules outside of this core collection.
This means that a python script will run only in an environment which can provide all of the extra modules that it needs, and which can provide sufficiently-recent versions of those modules. Given that many python modules evolve rapidly, both gaining new features and removing old features, this can cause difficulties. Often one needs access to an old version of a module for a particular script, whereas a different script requires a newer version. And whilst sensible people try to avoid dependencies in their scripts which might cause trouble, one can find oneself collaborating with people who are not sensible.
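As a minimal sketch of how a script might discover such problems before failing halfway through, the following reports which of a list of modules can be imported and, where metadata is available, which version is installed. The function name and the example module list are purely illustrative. It assumes Python 3.8 or later (for importlib.metadata), and it assumes that a package's distribution name matches its import name, which is not always so (sklearn, for example, is distributed as scikit-learn).

```python
import importlib.util
from importlib import metadata


def report_modules(names):
    """Map each module name to a version string, '' or None.

    None means the module cannot be found at all. An empty string means
    it is importable but has no distribution metadata under that name
    (true of standard-library modules such as math).
    """
    found = {}
    for name in names:
        # find_spec locates a top-level module without importing it
        if importlib.util.find_spec(name) is None:
            found[name] = None
            continue
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = ""
    return found


if __name__ == "__main__":
    # Illustrative list only: substitute the modules a script really needs
    for module, version in report_modules(["math", "numpy", "scipy"]).items():
        if version is None:
            print(module, "missing")
        else:
            print(module, version or "(no version metadata)")
```

Comparing the reported versions against the minimum a script requires is left to the reader, as version strings are not always simple to compare.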
TCM provides whatever python is provided by the Linux distribution it is using. To this it adds a modest number of modules as provided by that distribution. Currently the list includes ase, matplotlib, networkx, numba, numpy, pandas, scipy, h5py and sklearn. In total we currently have over 160 python3 packages installed on our Ubuntu 20.04 machines, which is almost exactly 5% of the number which Ubuntu offers!
This is sufficient to run many, many python scripts, but what are the alternatives if it is not?
Ask for an additional package to be installed
If Ubuntu supplies a suitable package, and its dependencies do not conflict with anything else we have installed (which mostly means that it does not require MPI), then ask [email protected], and it might well be installed on all machines.
Install a personal copy of a package using pip
See our pip page.
Using anaconda / miniconda
These maintain python installations that are completely independent of any already provided by the OS. They are large. An initial install of miniconda is about 300MB and 22,000 files, and of anaconda about 3GB and 160,000 files. Both are capable of growing significantly with use. So they are best installed to local /scratch disks, and certainly not to one's home directory. (Anything with a large number of small files will be slow on a remote directory, and the condas do not need to enter our backup system. Operations such as conda create can be ten times faster on a local disk, and conda remove forty times faster or more!)
Given that they are complete python installations, one does not need to download the one corresponding to the installed version of python. Here version 3.9 from the miniconda download page is used.
pc00:/scratch/spqr1$ wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.9.2-Linux-x86_64.sh
pc00:/scratch/spqr1$ bash Miniconda3-py39_4.9.2-Linux-x86_64.sh -p /scratch/${USER}/miniconda3
and accept the default for every question save the last, "Do you wish the installer to initialize Miniconda3 by running conda init?" to which one should probably answer "yes". This will update one's ~/.bashrc file to make conda available every time one logs in. One will notice this as the prompt will change to include the current conda environment.
Of course the above will install conda on a single PC only. If one wishes to copy an installation to another PC, then first check that the target has sufficient space:
pc00:/scratch/spqr1$ du -sh miniconda3
305M    miniconda3
pc00:/scratch/spqr1$ ssh pc99 df -h /scratch
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda6       1.6T  1.3T  250G  84% /scratch
pc00:/scratch/spqr1$ rsync -aAH --delete miniconda3/ pc99:/scratch/${USER}/miniconda3/
The use of rsync will be familiar to most for synchronising two directories whilst minimising data transfer, and is commonly used for backing up between computers, especially laptops. Those too old to know about rsync might prefer
pc00:/scratch/spqr1$ tar -cf - miniconda3 | ssh pc99 tar -C /scratch/${USER} -xf -
and others might try
pc00:/scratch/spqr1$ scp -r miniconda3 pc99:/scratch/${USER}/
Note that neither tar nor scp will delete files which appear on the destination but not the source, whereas rsync will. Also scp will not preserve hard links, and as conda uses them extensively, scp really cannot be recommended here.
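To see how much a given installation relies on hard links, one can count the regular files whose link count exceeds one. The following is a rough sketch using only the standard library; the function name is made up for this example, and it looks only at the link count stored for each file, so it says nothing about where the other names live.

```python
import os
import stat


def count_hard_linked(root):
    """Return (regular_files, files_with_extra_links) found beneath root."""
    total = linked = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            # lstat so that symbolic links are examined, not followed
            info = os.lstat(os.path.join(dirpath, filename))
            if not stat.S_ISREG(info.st_mode):
                continue  # skip symbolic links, sockets and so on
            total += 1
            # A link count above one means another name shares this data
            if info.st_nlink > 1:
                linked += 1
    return total, linked
```

Running it as count_hard_linked("miniconda3") on both the source and the copy should give the same pair of numbers if the hard links survived the transfer. A copy in which the second number has fallen, and the disk usage has risen, suggests that linked files were duplicated.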
If one manages to keep one's miniconda environment small, and one frequently uses many different computers, and one is not worried by the performance penalty on some operations, then it might be reasonable to consider installing it to /rscratch instead, so that it is trivially available on all PCs. But there were many ifs prefixing the above.
Using venv
Python3 includes the module venv for creating virtual environments, and, at first glance, it is quite attractive. An empty virtual environment is about 8MB and 600 files, so a fraction of the size of miniconda. But it is far from ideal. It achieves this small size by simply linking to the python binary supplied by one's OS. This is fine, until the OS on one's computer is upgraded (or simply the version of python), at which point everything is likely to stop working if either of the first two parts of python's version number changes. And the small size is also achieved by installing no packages at all. If packages are requested, using pip, they are always installed afresh, and not linked to copies already existing, so the size of the virtual environment quickly grows.
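A virtual environment created by venv records the python version which created it in a file called pyvenv.cfg at the top of the environment's directory. The sketch below compares the first two parts of that version with those of the running interpreter. The function names are invented for this example, and it assumes that the file consists of simple "key = value" lines including one whose key is "version".

```python
import sys


def venv_python_version(cfg_text):
    """Return (major, minor) from the 'version' line of pyvenv.cfg text.

    Returns None if there is no such line or it cannot be parsed.
    """
    for line in cfg_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "version":
            parts = value.strip().split(".")
            try:
                return int(parts[0]), int(parts[1])
            except (IndexError, ValueError):
                return None
    return None


def venv_matches(cfg_text, running=sys.version_info):
    """True if the recorded major.minor equals that of 'running'."""
    recorded = venv_python_version(cfg_text)
    return recorded is not None and recorded == tuple(running[:2])
```

Inside an activated virtual environment, sys.prefix is the environment's directory, so the text to pass in can be read from os.path.join(sys.prefix, "pyvenv.cfg"). Of course, if the version has changed, the environment's own python may fail to start at all, so the check is better run with the system python and an explicit path to the file.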
If only for the quality of the illustrations, I should mention this Guide to Python's Virtual Environments (best viewed in a private window unless one is a member of Medium). There is also a Primer at realpython.com.