Reproducibility (paranoia) for Predocs

What are containers and software environments and how can I become a more paranoid bioinformatician?

Slides: envs-primer.netlify.app

Michael Hall - Iqbal Group

Bird’s-eye view

Why?

Installing software sucks

Installing software sucks even more on a cluster

Work smarter, not harder

Makes it much easier to work with others on the same code/project

(Python) virtual environments

Create an isolated environment where you effectively get a "clean" python installation

venv module in the standard library creates virtual environments

Read PEP 405 for the gory details

(Python) virtual environments

$ mkdir paranoid && cd paranoid

$ python3 -c "import numpy; print(numpy.__version__)"
1.18.3  # i.e. whatever is installed in the system-wide python

$ python3 -m venv .venv  # create new env

$ which python3  # shows the location of our system python

$ source .venv/bin/activate  # makes our new env the "active" python

$ which python3  # shows the location of our environment python

$ python3 -c "import numpy; print(numpy.__version__)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'numpy'

$ pip install numpy==1.19.3

$ python3 -c "import numpy; print(numpy.__version__)"
1.19.3
$ deactivate
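
Pinning versions only helps if the pins are written down. Inside an active environment, `pip freeze > requirements.txt` records the exact installed versions and `pip install -r requirements.txt` restores them. The sketch below is one possible sanity check on such a file; `check_pinned` is a hypothetical helper written for this example, not part of pip.

```shell
# check_pinned is a hypothetical helper, not part of pip.
# It succeeds only if every requirement line pins an exact version with "==".
check_pinned() {
    req_file=$1
    # Drop blank lines and comments, then look for any line lacking "==".
    if grep -v -E '^[[:space:]]*(#|$)' "$req_file" | grep -v -q '=='; then
        echo "unpinned requirement found in $req_file" >&2
        return 1
    fi
    return 0
}

# Scratch directory for the demo; in practice the file lives next to your code.
workdir=$(mktemp -d)
printf 'numpy==1.19.3\n' > "$workdir/requirements.txt"
check_pinned "$workdir/requirements.txt" && echo "all requirements pinned"
```

A file containing a bare `numpy` line would make the check fail, which is the point: an unpinned requirement installs whatever is newest on the day you run it.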

Simple Python Version Management: pyenv

https://github.com/pyenv/pyenv

pyenv

Allows you to change the per-user global python version

Allows you to set a per-project python version

Ridiculously easy installation of any python version

Automatic activation of virtual environments

Install - local

For local machine installation, refer to
https://github.com/pyenv/pyenv#installation

Install - cluster

$ export PYENV_INSTALL=/path/to/install/dir  # suggest NOT $HOME

$ export PYENV_ROOT=${PYENV_INSTALL}/.pyenv

$ git clone https://github.com/pyenv/pyenv.git "$PYENV_ROOT"

$ git clone https://github.com/pyenv/pyenv-virtualenv.git "${PYENV_ROOT}/plugins/pyenv-virtualenv"

Add the following to the end of your shell configuration file (e.g. ~/.bashrc)

export PYENV_INSTALL=/path/to/install/dir
export PYENV_ROOT=${PYENV_INSTALL}/.pyenv
# add pyenv to path
export PATH="$PYENV_ROOT/bin:$PATH"

if command -v pyenv 1>/dev/null 2>&1; then
    eval "$(pyenv init -)"
    eval "$(pyenv virtualenv-init -)"
fi

Log out and then back in
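
After logging back in, it is worth confirming that the shell picked up the new configuration. A minimal check, assuming only that `pyenv` should now be on your PATH (the `pyenv_status` variable is just a name used for this example):

```shell
# Report whether pyenv is reachable from the current shell.
if command -v pyenv >/dev/null 2>&1; then
    pyenv_status="pyenv found: $(command -v pyenv)"
    pyenv --version   # prints the installed pyenv version
    pyenv root        # should print the PYENV_ROOT you configured
else
    pyenv_status="pyenv not found on PATH; re-check the lines added to your shell config"
fi
echo "$pyenv_status"
```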

Basic Usage

$ pyenv versions
* system (set by /home/vagrant/.pyenv/version)
$ pyenv install --list
Available versions:
  2.1.3
  2.2.3
  2.3.7
  2.4.0
  2.4.1
  ...
$ pyenv install 3.9.1
Downloading Python-3.9.1.tar.xz...
...
Installed Python-3.9.1 to /home/vagrant/.pyenv/versions/3.9.1
$ pyenv versions
* system (set by /home/vagrant/.pyenv/version)
  3.9.1
$ python -V
Python 2.7.17
$ pyenv global 3.9.1  # log out and back in
$ python -V
Python 3.9.1

Real Usage

$ pyenv install 3.5.10

$ mkdir foo && cd foo && ls -la

$ python -V
Python 3.9.1

$ pyenv local 3.5.10

$ python -V
Python 3.5.10

$ ls -la
drwxrwxr-x 2 vagrant vagrant 4096 Jan 13 04:53 .
drwxrwxr-x 4 vagrant vagrant 4096 Jan 13 04:53 ..
-rw-rw-r-- 1 vagrant vagrant    7 Jan 13 04:53 .python-version

$ cat .python-version
3.5.10
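
Why does this work? pyenv looks for a `.python-version` file in the current directory and then in each parent directory, falling back to the global version if none is found (the `PYENV_VERSION` environment variable, if set, takes precedence). The function below is a rough emulation of that upward search for illustration only; `find_version_file` is a hypothetical name and this is not pyenv's actual code.

```shell
# find_version_file is a hypothetical stand-in for pyenv's lookup, not its real code.
# Starting from a directory, walk up towards / and print the first .python-version found.
find_version_file() {
    dir=$(cd "$1" && pwd -P) || return 1
    while :; do
        if [ -f "$dir/.python-version" ]; then
            echo "$dir/.python-version"
            return 0
        fi
        [ "$dir" = "/" ] && return 1
        dir=$(dirname "$dir")
    done
}

# Build a throwaway project layout to search.
demo=$(cd "$(mktemp -d)" && pwd -P)
mkdir -p "$demo/foo/src"
echo "3.5.10" > "$demo/foo/.python-version"

# From a subdirectory of foo, the project's version file is still found.
cat "$(find_version_file "$demo/foo/src")"
```

This is also why a per-project version stops applying as soon as you `cd` out of the project directory.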

Virtual Env Usage

$ mkdir myproject && cd myproject

$ pyenv virtualenv myproject

$ pyenv versions
system
3.5.10
* 3.9.1 (set by /home/vagrant/.pyenv/version)
3.9.1/envs/myproject
myproject

$ pyenv local myproject  # or 3.9.1/envs/myproject

$ cat .python-version
myproject

$ pyenv versions
system
3.5.10
3.9.1
3.9.1/envs/myproject
* myproject (set by /home/vagrant/tmp/myproject/.python-version)

$ pip install pyjokes

$ pyjoke
There are 10 types of people: those who understand binary and those who don't.

$ cd .. && pyjoke
pyenv: pyjoke: command not found

The `pyjoke' command exists in these Python versions:
  3.9.1/envs/myproject
  myproject

Note: See 'pyenv help global' for tips on allowing both
      python2 and python3 to be found.

Poetry

https://python-poetry.org/

Python packaging and dependency management

Strongly recommended if developing a python package

Conda

https://conda.io

Package, dependency and environment management for any language: Python, R, Ruby, Lua, Scala, Java, JavaScript, C/C++, FORTRAN, and more.

Conda advantages

Why should I use it over virtual envs?

Support for languages other than Python - it's probably more like apt than pip

You want to use newer language standards, such as C++17

Ability to install pre-compiled software - e.g. nodejs, curl etc.

Bioconda - e.g. samtools, minimap2 etc.

You use Windows...

Conda disadvantages

Not all of PyPI is available (mostly obscure packages though)

"Heavier" way of managing Python environments/packages

Dependency resolution can be slow

More limited in Python version flexibility

Not isolated from your OS - C compiler can sometimes cause problems (see containers in the next section)

Setup

Non-pyenv installation can be found in the docs

Or, install with pyenv

$ pyenv install miniconda3-4.7.10

Usage

$ mkdir condaproj && cd condaproj

$ pyenv local miniconda3-4.7.10
$ conda config --add channels defaults  # order matters: each --add takes top priority, so add conda-forge last
$ conda config --add channels bioconda
$ conda config --add channels conda-forge

$ conda create --name snp_paper samtools=1.11 bcftools=1.11

$ conda activate snp_paper

$ bcftools --version
bcftools 1.11
...

$ conda info --envs
# conda environments:
base                     /home/vagrant/.pyenv/versions/miniconda3-4.7.10
snp_paper             *  /home/vagrant/.pyenv/versions/miniconda3-4.7.10/envs/snp_paper

$ conda search snakemake

$ conda install snakemake=5.30.2

$ snakemake --version
5.30.2

$ conda deactivate
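
To share the environment above, the same pins can live in a file. This is a sketch of an `environment.yml` using the versions from this example; the channel list mirrors the priority order configured earlier (highest first). The `envfile` variable and the scratch location are assumptions for the demo.

```shell
# Written to a scratch directory here; in practice keep it next to your analysis code.
envfile="$(mktemp -d)/environment.yml"

cat > "$envfile" <<'EOF'
name: snp_paper
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - samtools=1.11
  - bcftools=1.11
  - snakemake=5.30.2
EOF

cat "$envfile"

# Recreate the environment elsewhere (requires conda):
#   conda env create --file environment.yml
# Or capture an existing environment, including all transitive dependencies:
#   conda env export --name snp_paper > environment.yml
```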

Containers

A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. That includes files, environment variables, dependencies and libraries.

Gold standard for reproducibility

For those who would like more detailed information about what containers are, please refer to this fantastic slide deck from Josep Moscardo.

Singularity

Much easier to use than Docker

Can seamlessly use Docker images

We have it on the cluster

Supported by major workflow management systems

What can I do with a container?

In its most basic form, you can execute a software program via a container even though that program is not installed on the system you are running it on.

Example on EBI cluster

$ module load singularity/3.5.0

$ wget https://github.com/mbhall88/eipp-2019-singularity/raw/master/data/toy.bam

$ img="docker://quay.io/biocontainers/samtools:1.9--h10a08f8_12"

$ singularity -s exec "$img" samtools view -h toy.bam

singularity -s exec tells Singularity to execute a given command inside a given container (and only print errors).

"$img" specifies the container for Singularity to operate on. We will look at this component in more detail soon.

samtools view -h toy.bam is the command we want Singularity to execute inside the container. Notice that we can pass it files that exist on our local file system.
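
If you run the same containerised tool often, a small wrapper keeps the image (and therefore the version) fixed in one place. A minimal sketch, assuming Singularity is available; `samtools_in_container` is a hypothetical name chosen for this example.

```shell
# The tag pins the exact samtools build that will be used.
img="docker://quay.io/biocontainers/samtools:1.9--h10a08f8_12"

# Hypothetical wrapper: run samtools from the container as if it were installed locally.
# All arguments are passed through unchanged.
samtools_in_container() {
    singularity -s exec "$img" samtools "$@"
}

# Usage (needs Singularity, plus network access the first time the image is fetched):
#   samtools_in_container view -h toy.bam
```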

How do I get a container image?

Remote container registries

Build one locally

See this tutorial I ran for some predocs for a more detailed explanation of the above options

BioContainers

https://biocontainers.pro

BioContainers is a project that provides the infrastructure and basic guidelines to create, manage and distribute bioinformatics packages (e.g. Conda) and containers (e.g. Docker, Singularity). BioContainers is based on the popular frameworks Conda, Docker and Singularity.

Retrieving a Biocontainer

All Bioconda packages automagically get a Biocontainer built on https://quay.io

Retrieving a Biocontainer


tool="bwa"
tag="0.7.3a--h84994c4_4"
URI="docker://quay.io/biocontainers/${tool}:${tag}"
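
The URI pattern above can be wrapped in a small function so the tool and tag are always stated explicitly. This is a sketch only; `biocontainer_uri` is a hypothetical helper, and the output filename `bwa.sif` is an arbitrary choice.

```shell
# Hypothetical helper: build the Docker URI for a Biocontainer from a tool name and tag.
biocontainer_uri() {
    tool=$1
    tag=$2
    echo "docker://quay.io/biocontainers/${tool}:${tag}"
}

URI=$(biocontainer_uri bwa 0.7.3a--h84994c4_4)
echo "$URI"

# With Singularity available, download the image once to a local file and reuse it:
#   singularity pull bwa.sif "$URI"
#   singularity exec bwa.sif bwa
```

Keeping a local image file avoids fetching and converting the image on every run.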

Summary

Use pyenv to manage Python versions and environments

Use conda for managing environments with other language requirements

Use containers/environments for all analyses

Always be explicit with versions

Your worst collaborator is yourself six months ago because you weren't explicit enough and you don't reply to emails

It pays to be paranoid!

Resources

Slides: envs-primer.netlify.app
Slides repo: https://github.com/mbhall88/envs-primer
EMBL BioIT Singularity course
Pyenv Docs
Poetry Docs
Conda Docs
Bioconda Docs
Singularity Docs
EIPP 2019 Singularity group project
Contact me on the EBI Predoc Slack