9. Docker and Reproducibility

As computational work becomes increasingly integral to scientific research, computational reproducibility has become an issue of growing importance to computer systems researchers and domain scientists alike. Though computational reproducibility seems more straightforward than replicating physical experiments, the complex and rapidly changing nature of computing environments makes reproducing and extending such work a serious challenge.

Studies focusing on code that has been made available with scientific publications regularly find the same common issues that pose substantial barriers to reproducing the original results or building on that code:

  1. “Dependency Hell”
  2. Imprecise documentation
  3. Code rot
  4. Barriers to adoption and reuse in existing solutions

9.1. Docker

Docker is an open source project that builds on many long-familiar technologies from operating systems research: LXC containers, OS-level virtualization, and a hash-based, git-like versioning and differencing system, among others.

  1. Docker images: resolving “Dependency Hell”
  2. Dockerfiles: resolving imprecise documentation
  3. Image versions: tackling code rot
  4. Lowering the barriers to adoption and reuse

Containers are a way to package software in a format that can run isolated on a shared operating system. Unlike VMs, containers do not bundle a full operating system; they include only the libraries and settings required to make the software work. This makes for efficient, lightweight, self-contained systems and guarantees that software will always run the same, regardless of where it is deployed.

Docker is a wonderful tool for many things. A few of them are:

  • As a version control system for your entire app’s operating system, storing each configuration as a versioned build.
  • As a way to distribute your app’s operating system and collaborate on it with your team.
  • As a way to run your code on your laptop in the same environment as you have on your server.

We are going to explore Docker images: why we should use them, and how to go about it.

9.2. When to build a Docker image

Docker images are used to configure and distribute application states. Think of an image as a template from which containers are created.

With a Docker image, we can quickly spin up containers with the same configuration. We can then share these images with our team, so that we are all running containers with the same configuration.

There are several ways to create Docker images, but the best way is to create a Dockerfile and use the docker build command.
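A Dockerfile contains the step-by-step instructions used to assemble the image (a complete example appears in section 9.5). As a minimal sketch, assuming a Dockerfile already sits in the current directory and using the hypothetical image name my-analysis, the build command looks like this:

docker build -t my-analysis .

The -t flag tags the resulting image with a human-readable name, so we can refer to it later in docker run commands.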

9.3. Distributing a Docker Image

There are several ways to distribute our images.

  • With a tar archive
  • Using Docker Hub
  • Other methods
    • If our Dockerfile is hosted on GitHub or Bitbucket, we can configure an automated build on Docker Hub.
    • We can deploy our own Docker registry and push/pull images to it instead of Docker Hub.
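
For instance, the first two options might look like this; the image name my-analysis and the Docker Hub user myuser are hypothetical placeholders:

# Export the image to a tar archive, which can be loaded on another machine
docker save -o my-analysis.tar my-analysis
docker load -i my-analysis.tar

# Tag the image with our Docker Hub username and push it
docker tag my-analysis myuser/my-analysis
docker push myuser/my-analysis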

9.4. Running a Docker image

To get the image onto our machine and run it, we use

docker pull fpsom/jupyter-kernels
docker run --name=jupyter fpsom/jupyter-kernels

The above command will spin up a new container based on our image and run it. If we do not pass the --name parameter, Docker will pick a random name for our container.

Many images require some extra parameters to be passed to the run command, so take some time to read through the documentation of an image before you use it.
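
For example, a Jupyter image typically needs its network port published so that the notebook server is reachable from the host browser; the port mapping below uses the Jupyter default 8888 and is an illustrative assumption, not something every image requires:

docker run --name=jupyter -p 8888:8888 fpsom/jupyter-kernels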

Also, take some time to read the full documentation of the docker run command.

We can see our running containers using the docker ps command. To see all containers, including stopped ones, we add the -a flag: docker ps -a.
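
The two forms side by side:

docker ps        # only containers that are currently running
docker ps -a     # all containers, including stopped ones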

9.5. Jupyter, Docker, MyBinder

In our case, we will use MyBinder to host our Jupyter Notebook, so that anyone can run (and therefore reproduce) our code.

Binder allows you to create custom computing environments that can be shared and used by many remote users.

Binder makes it simple to generate reproducible computing environments from a GitHub repository. Binder uses the BinderHub technology to generate a Docker image from this repository. The image will have all the components that you specify along with the Jupyter Notebooks inside. You will be able to share a URL with users that can immediately begin interacting with this environment via the cloud.
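
As an illustration, a Binder link for a public GitHub repository follows this pattern, where user, repo and branch are placeholders for your own values:

https://mybinder.org/v2/gh/&lt;user&gt;/&lt;repo&gt;/&lt;branch&gt;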

If you or another Binder user clicks on a Binder link, the mybinder.org deployment will run the linked repository. While running, you are guaranteed at least 1 GB of RAM, with an upper limit of 4 GB (if you use more than 4 GB, your kernel will be restarted).

By default, Binder works with Python; however, given that our notebook is based on R, we need to define the execution environment ourselves. This is done through a dedicated Dockerfile that sets up R and the R kernel and installs the necessary libraries (the version below is based on the Dockerfile provided by Binder).

FROM rocker/tidyverse:3.4.2

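# Install pip and the Jupyter notebook server, then clean up apt caches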
RUN apt-get update && \
    apt-get -y install python3-pip && \
    pip3 install --no-cache-dir notebook==5.2 && \
    apt-get purge && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

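# Non-root user settings expected by Binder (rstudio is the rocker image's default user)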
ENV NB_USER rstudio
ENV NB_UID 1000
ENV HOME /home/rstudio
WORKDIR ${HOME}

USER ${NB_USER}

# Set up R Kernel for Jupyter
RUN R --quiet -e "install.packages(c('repr', 'IRdisplay', 'evaluate', 'crayon', 'pbdZMQ', 'devtools', 'uuid', 'digest', 'vegan'))"
RUN R --quiet -e "devtools::install_github('IRkernel/IRkernel')"
RUN R --quiet -e "IRkernel::installspec()"

# Make sure the contents of our repo are in ${HOME}
COPY . ${HOME}
USER root
RUN chown -R ${NB_UID}:${NB_UID} ${HOME}
USER ${NB_USER}

# Run install.r if it exists
RUN if [ -f install.r ]; then R --quiet -f install.r; fi

Copy the above commands into an empty file named Dockerfile, and add this file to the GitHub repository using the git commands (i.e. git add, git commit and git push).
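
For example (the commit message is illustrative):

git add Dockerfile
git commit -m "Add Dockerfile for Binder"
git push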

Finally, go to the MyBinder URL, paste the URL of your GitHub repository into the GitHub repo or URL field, and click launch. In a few minutes, your custom Jupyter environment will be launched.