What are containers and software environments and how can I become a more paranoid bioinformatician?
Slides: envs-primer.netlify.app
Michael Hall - Iqbal Group
Installing software sucks
Installing software sucks even more on a cluster
Work smarter, not harder
Makes it much easier to work with others on the same code/project
Create an isolated environment where you effectively get a "clean" python installation
venv
The venv module in the standard library creates virtual environments
Read PEP 405 for the gory details
$ mkdir paranoid && cd paranoid
$ python3 -c "import numpy; print(numpy.__version__)"
1.18.3 # i.e. whatever is installed in the system-wide python
$ python3 -m venv .venv # create new env
$ which python3 # shows the location of our system python
$ source .venv/bin/activate # makes our new env the "active" python
$ which python3 # shows the location of our environment python
$ python3 -c "import numpy; print(numpy.__version__)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'numpy'
$ pip install numpy==1.19.3
$ python3 -c "import numpy; print(numpy.__version__)"
1.19.3
$ deactivate
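If you want to convince yourself the isolation is real without installing anything, here is a minimal sketch. It assumes only that python3 is on your PATH; the scratch directory and variable names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: create a throwaway venv and confirm it is isolated from the python that made it.
workdir="$(mktemp -d)"                        # hypothetical scratch directory
venv_python="${workdir}/.venv/bin/python3"

# --without-pip skips installing pip into the env, which keeps the demo quick
python3 -m venv --without-pip "${workdir}/.venv"

# Inside a venv, sys.prefix points at the env, while sys.base_prefix
# still points at the python installation that created it.
in_venv="$("${venv_python}" -c 'import sys; print(sys.prefix != sys.base_prefix)')"

echo "created by: $(python3 -c 'import sys; print(sys.executable)')"
echo "env python: ${venv_python}"
echo "isolated:   ${in_venv}"
```

Note that no activation is needed here: calling the env's python3 by its full path is enough, and source .venv/bin/activate mostly just puts that directory first on your PATH.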
pyenv
Allows you to change the per-user global python version
Allows you to set a per-project python version
Ridiculously easy installation of any python version
Automatic activation of virtual environments
For local machine installation, refer to
https://github.com/pyenv/pyenv#installation
$ export PYENV_INSTALL=/path/to/install/dir # suggest NOT $HOME
$ export PYENV_ROOT=${PYENV_INSTALL}/.pyenv
$ git clone https://github.com/pyenv/pyenv.git "$PYENV_ROOT"
$ git clone https://github.com/pyenv/pyenv-virtualenv.git ${PYENV_ROOT}/plugins/pyenv-virtualenv
Add the following to the end of your shell configuration file (e.g. ~/.bashrc)
export PYENV_INSTALL=/path/to/install/dir
export PYENV_ROOT=${PYENV_INSTALL}/.pyenv
# add pyenv to path
export PATH="$PYENV_ROOT/bin:$PATH"
if command -v pyenv 1>/dev/null 2>&1; then
    eval "$(pyenv init -)"
    eval "$(pyenv virtualenv-init -)"
fi
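One wrinkle with the config above: every time the file is sourced again, the pyenv directory gets prepended to PATH again. A possible guard is sketched below. The path_prepend helper is hypothetical (it is not part of pyenv), and the fallback PYENV_ROOT is only the placeholder path used above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prepend a directory to PATH only if it is not already there.
path_prepend() {
    case ":${PATH}:" in
        *":$1:"*) ;;                # already present, do nothing
        *) PATH="$1:${PATH}" ;;     # otherwise put it first
    esac
}

# Placeholder location, as in the config above; adjust to your install dir
PYENV_ROOT="${PYENV_ROOT:-/path/to/install/dir/.pyenv}"

path_prepend "${PYENV_ROOT}/bin"
path_prepend "${PYENV_ROOT}/bin"    # second call is a no-op
export PATH

# Same guard as in the config: only initialise pyenv if it can be found
if command -v pyenv >/dev/null 2>&1; then
    echo "pyenv found at: $(command -v pyenv)"
else
    echo "pyenv not on PATH; skipping init"
fi
```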
Log out and then back in
$ pyenv versions
* system (set by /home/vagrant/.pyenv/version)
$ pyenv install --list
Available versions:
2.1.3
2.2.3
2.3.7
2.4.0
2.4.1
...
$ pyenv install 3.9.1
Downloading Python-3.9.1.tar.xz...
...
Installed Python-3.9.1 to /home/vagrant/.pyenv/versions/3.9.1
$ pyenv versions
* system (set by /home/vagrant/.pyenv/version)
3.9.1
$ python -V
Python 2.7.17
$ pyenv global 3.9.1 # log out and back in
$ python -V
Python 3.9.1
$ pyenv install 3.5.10
$ mkdir foo && cd foo && ls -la
$ python -V
Python 3.9.1
$ pyenv local 3.5.10
$ python -V
Python 3.5.10
$ ls -la
drwxrwxr-x 2 vagrant vagrant 4096 Jan 13 04:53 .
drwxrwxr-x 4 vagrant vagrant 4096 Jan 13 04:53 ..
-rw-rw-r-- 1 vagrant vagrant 7 Jan 13 04:53 .python-version
$ cat .python-version
3.5.10
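The .python-version file also applies to every subdirectory of the project. Roughly speaking, pyenv looks in the current directory and then in each parent until it finds one. The function below is a simplified sketch of that lookup, not pyenv's actual code: find_local_version is a made-up name, and falling back straight to "system" skips the global version file that pyenv would consult first.

```shell
#!/usr/bin/env bash
# Simplified sketch: walk up from a directory until a .python-version file is found.
find_local_version() {
    local dir
    dir="$(cd "$1" && pwd -P)" || return 1
    while :; do
        if [ -f "${dir}/.python-version" ]; then
            cat "${dir}/.python-version"
            return 0
        fi
        if [ "${dir}" = "/" ]; then
            break                   # reached the filesystem root
        fi
        dir="$(dirname "${dir}")"
    done
    echo "system"                   # simplification: real pyenv checks its global version file here
}

# Demo in a scratch directory (hypothetical layout)
demo="$(mktemp -d)"
mkdir -p "${demo}/foo/bar/baz"
echo "3.5.10" > "${demo}/foo/.python-version"

# A nested directory inherits the version pinned in foo/
find_local_version "${demo}/foo/bar/baz"
```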
$ mkdir myproject && cd myproject
$ pyenv virtualenv myproject
$ pyenv versions
system
3.5.10
* 3.9.1 (set by /home/vagrant/.pyenv/version)
3.9.1/envs/myproject
myproject
$ pyenv local myproject # or 3.9.1/envs/myproject
$ cat .python-version
myproject
$ pyenv versions
system
3.5.10
3.9.1
3.9.1/envs/myproject
* myproject (set by /home/vagrant/tmp/myproject/.python-version)
$ pip install pyjokes
$ pyjoke
There are 10 types of people: those who understand binary and those who don't.
$ cd .. && pyjoke
pyenv: pyjoke: command not found
The `pyjoke' command exists in these Python versions:
3.9.1/envs/myproject
myproject
Note: See 'pyenv help global' for tips on allowing both
python2 and python3 to be found.
Poetry
Python packaging and dependency management
Strongly recommended if developing a python package
Conda
Package, dependency and environment management for any language: Python, R, Ruby, Lua, Scala, Java, JavaScript, C/C++, FORTRAN, and more.
Why should I use it over virtual envs?
Support for languages other than Python - it's probably more like apt than pip
You want to use newer language standards, such as C++17
Ability to install pre-compiled software - e.g. nodejs, curl etc.
Bioconda - e.g. samtools, minimap2 etc.
You use Windows...
Why might I not use it?
Not all of PyPI is available (mostly obscure packages though)
"Heavier" way of managing Python environments/packages
Dependency resolution can be slow
More limited in Python version flexibility
Not isolated from your OS - C compiler can sometimes cause problems (see containers in the next section)
Non-pyenv installation can be found in the docs
Or, install with pyenv
$ pyenv install miniconda3-4.7.10
$ mkdir condaproj && cd condaproj
$ pyenv local miniconda3-4.7.10
$ conda config --add channels defaults # channel order VERY important
$ conda config --add channels bioconda
$ conda config --add channels conda-forge
$ conda create --name snp_paper samtools=1.11 bcftools=1.11
$ conda activate snp_paper
$ bcftools --version
bcftools 1.11
...
$ conda info --envs
# conda environments:
base /home/vagrant/.pyenv/versions/miniconda3-4.7.10
snp_paper * /home/vagrant/.pyenv/versions/miniconda3-4.7.10/envs/snp_paper
$ conda search snakemake
$ conda install snakemake=5.30.2
$ snakemake --version
5.30.2
$ conda deactivate
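To make an environment like this one reproducible for collaborators, the same pins can live in an environment file. The sketch below only writes the file and prints it; it does not call conda. The channel list is ordered highest priority first, which should match the result of the conda config --add calls above, since each --add puts the new channel at the top. The file location is an arbitrary scratch path.

```shell
#!/usr/bin/env bash
# Sketch: write an environment file with the same pinned versions as the demo.
env_file="$(mktemp -d)/environment.yml"     # hypothetical scratch location

cat > "${env_file}" <<'EOF'
name: snp_paper
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - samtools=1.11
  - bcftools=1.11
  - snakemake=5.30.2
EOF

cat "${env_file}"
echo "To recreate the environment: conda env create --file ${env_file}"
```

Keeping this file under version control next to the analysis code means the environment can be rebuilt from the repository alone.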
A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. That includes: files, environment variables, dependencies and libraries
Gold-standard for reproducibility
For those who would like more detailed information about what containers are, please refer to this fantastic slide deck from Josep Moscardo.
Singularity
Much easier to use than Docker
Can seamlessly use Docker images
We have it on the cluster
Supported by major workflow management systems
In its most basic form, you can execute a software program, via a container, even though you may not have that program installed on the system you are running it on.
$ module load singularity/3.5.0
$ wget https://github.com/mbhall88/eipp-2019-singularity/raw/master/data/toy.bam
$ img="docker://quay.io/biocontainers/samtools:1.9--h10a08f8_12"
$ singularity -s exec "$img" samtools view -h toy.bam
singularity -s exec tells Singularity to execute a given command inside a given container (and only print errors).
"$img" specifies the container for Singularity to operate on. We will look at this component in more detail soon.
samtools view -h toy.bam is the command we want Singularity to execute inside the container. Notice how we can specify files that exist on our local file system?!
Remote container registries
Build one locally
See this tutorial I ran for some predocs for a more detailed explanation of the above options
BioContainers is a project that provides the infrastructure and basic guidelines to create, manage and distribute bioinformatics packages (e.g. conda) and containers (e.g. docker, singularity). BioContainers is based on the popular frameworks Conda, Docker and Singularity.
All Bioconda packages automagically get a Biocontainer built on https://quay.io
tool="bwa"
tag="0.7.3a--h84994c4_4"
URI="docker://quay.io/biocontainers/${tool}:${tag}"
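Since every BioContainers URI follows the same pattern, it can be wrapped in a small function. This is only a sketch: biocontainer_uri is a made-up helper, and ref.fa and reads.fq are hypothetical input files. The script prints the Singularity command rather than running it, so it works on machines without Singularity.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a BioContainers URI from a tool name and an explicit tag.
biocontainer_uri() {
    printf 'docker://quay.io/biocontainers/%s:%s\n' "$1" "$2"
}

URI="$(biocontainer_uri bwa 0.7.3a--h84994c4_4)"
echo "${URI}"

# Dry run: show the command instead of executing it.
# ref.fa and reads.fq are placeholder file names.
echo "singularity -s exec \"${URI}\" bwa mem ref.fa reads.fq"
```

The tag is always passed explicitly, with no "latest" default, so the exact build used ends up written down in the script.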
Use pyenv to manage Python versions and environments
Use conda for managing environments with other language requirements
Use containers/environments for all analyses
Always be explicit with versions
Your worst collaborator is yourself six months ago because you weren't explicit enough and you don't reply to emails
It pays to be paranoid!
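One cheap way to act on "always be explicit with versions" is to check for unpinned dependencies automatically. The sketch below assumes your Python dependencies live in a pip-style requirements file, which these slides do not otherwise cover. The unpinned_requirements function is a made-up name, and the check is deliberately crude: anything without == counts as unpinned.

```shell
#!/usr/bin/env bash
# Sketch: print requirement lines that do not pin an exact version with '=='.
unpinned_requirements() {
    # drop comments and blank lines, then keep only lines without '=='
    grep -v -E '^[[:space:]]*(#|$)' "$1" | grep -v '==' || true
}

# Demo file in a scratch directory (hypothetical contents)
req="$(mktemp -d)/requirements.txt"
cat > "${req}" <<'EOF'
# analysis dependencies
numpy==1.19.3
pyjokes
EOF

unpinned_requirements "${req}"      # prints: pyjokes
```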
Slides repo: https://github.com/mbhall88/envs-primer
EMBL BioIT Singularity course
Pyenv Docs
Poetry Docs
Conda Docs
Bioconda Docs
Singularity Docs
EIPP 2019 Singularity group project
Contact me on the EBI Predoc Slack