8. Versioning with Git

8.1. Software Carpentry source lesson

8.2. Automated Version Control

We’ll start by exploring how version control can be used to keep track of what one person did and when. Even if you aren’t collaborating with other people, automated version control is much better than this situation:

Piled Higher and Deeper by Jorge Cham, http://www.phdcomics.com/comics/archive_print.php?comicid=1531

“Piled Higher and Deeper” by Jorge Cham, http://www.phdcomics.com

We’ve all been in this situation before: it seems ridiculous to have multiple nearly-identical versions of the same document. Some word processors let us deal with this a little better, such as Microsoft Word’s Track Changes, Google Docs’ version history, or LibreOffice’s Recording and Displaying Changes.

Version control systems start with a base version of the document and then save just the changes you made at each step of the way. You can think of it as a tape: if you rewind the tape and start at the base document, then you can play back each change and end up with your latest version.

Changes Are Saved Sequentially

Once you think of changes as separate from the document itself, you can then think about “playing back” different sets of changes onto the base document and getting different versions of the document. For example, two users can make independent sets of changes based on the same document.

Different Versions Can be Saved

Unless there are conflicts, you can even play two sets of changes onto the same base document.

Multiple Versions Can be Merged

A version control system is a tool that keeps track of these changes for us and helps us version and merge our files. It allows you to decide which changes make up the next version, called a commit and keeps useful metadata about them. The complete history of commits for a particular project and their metadata make up a repository. Repositories can be kept in sync across different computers facilitating collaboration among different people.

The Long History of Version Control Systems

Automated version control systems are nothing new. Tools like RCS, CVS, or Subversion have been around since the early 1980s and are used by many large companies. However, many of these are now becoming considered as legacy systems due to various limitations in their capabilities. In particular, the more modern systems, such as Git and Mercurial are distributed, meaning that they do not need a centralized server to host the repository. These modern systems also include powerful merging tools that make it possible for multiple authors to work within the same files concurrently.

8.3. How can version control help me make my work more open?

The opposite of “open” isn’t “closed”. The opposite of “open” is “broken”.

— John Wilbanks

Free sharing of information might be the ideal in science, but the reality is often more complicated. Normal practice today looks something like this:

  • A scientist collects some data and stores it on a machine that is occasionally backed up by her department.
  • She then writes or modifies a few small programs (which also reside on her machine) to analyze that data.
  • Once she has some results, she writes them up and submits her paper. She might include her data—a growing number of journals require this—but she probably doesn’t include her code.
  • Time passes.
  • The journal sends her reviews written anonymously by a handful of other people in her field. She revises her paper to satisfy them, during which time she might also modify the scripts she wrote earlier, and resubmits.
  • More time passes.
  • The paper is eventually published. It might include a link to an online copy of her data, but the paper itself will be behind a paywall: only people who have personal or institutional access will be able to read it.

For a growing number of scientists, though, the process looks like this:

  • The data that the scientist collects is stored in an open access repository like figshare or Zenodo, possibly as soon as it’s collected, and given its own Digital Object Identifier (DOI). Or the data was already published and is stored in Dryad.
  • The scientist creates a new repository on GitHub to hold her work.
  • As she does her analysis, she pushes changes to her scripts (and possibly some output files) to that repository. She also uses the repository for her paper; that repository is then the hub for collaboration with her colleagues.
  • When she’s happy with the state of her paper, she posts a version to arXiv or some other preprint server to invite feedback from peers.
  • Based on that feedback, she may post several revisions before finally submitting her paper to a journal.
  • The published paper includes links to her preprint and to her code and data repositories, which makes it much easier for other scientists to use her work as starting point for their own research.

This open model accelerates discovery: the more open work is, the more widely it is cited and re-used. However, people who want to work this way need to make some decisions about what exactly “open” means and how to do it. You can find more on the different aspects of Open Science in this book.

This is one of the (many) reasons we teach version control. When used diligently, it answers the “how” question by acting as a shareable electronic lab notebook for computational work:

  • The conceptual stages of your work are documented, including who did what and when. Every step is stamped with an identifier (the commit ID) that is for most intents and purposes unique.
  • You can tie documentation of rationale, ideas, and other intellectual work directly to the changes that spring from them.
  • You can refer to what you used in your research to obtain your computational results in a way that is unique and recoverable.
  • With a distributed version control system such as Git, the version control repository is easy to archive for perpetuity, and contains the entire history.

Making Code Citable

This short guide from GitHub explains how to create a Digital Object Identifier (DOI) for your code, your papers, or anything else hosted in a version control repository.

8.4. Storing our newly created Jupyter file to GitHub

8.4.1. Creating a repository

The folder that currently contains our Jupyter notebook and the data file should look like this:

$ ls -la
total 18756
drwxr-xr-x 1 fpsom 197609        0 Nov 16 14:09 ./
drwxr-xr-x 1 fpsom 197609        0 Nov 16 14:08 ../
-rw-r--r-- 1 fpsom 197609 19034567 Oct 22 03:18 Export_DANS_Parels_van_Datasets_Vogeltrekstation.csv
-rw-r--r-- 1 fpsom 197609   152229 Nov 16 13:52 Reproducible-analysis-and-Research-Transparency.ipynb

Then we tell Git to make this folder a repository — a place where Git can store versions of our files:

git init

If we use ls to show the directory’s contents, it appears that nothing has changed. But if we add the -a flag to show everything, we can see that Git has created a hidden directory within planets called .git:

total 18760
drwxr-xr-x 1 fpsom 197609        0 Nov 16 14:11 ./
drwxr-xr-x 1 fpsom 197609        0 Nov 16 14:08 ../
drwxr-xr-x 1 fpsom 197609        0 Nov 16 14:11 .git/
-rw-r--r-- 1 fpsom 197609 19034567 Oct 22 03:18 Export_DANS_Parels_van_Datasets_Vogeltrekstation.csv
-rw-r--r-- 1 fpsom 197609   152229 Nov 16 13:52 Reproducible-analysis-and-Research-Transparency.ipynb

Git stores information about the project in this special sub-directory. If we ever delete it, we will lose the project’s history.

We can check that everything is set up correctly by asking Git to tell us the status of our project. It shows that there are two new files that are currently not tracked (meaning that any changes there will not be monitored).

git status

On branch master

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)


nothing added to commit but untracked files present (use "git add" to track)

8.4.2. Our first commit

The untracked files message means that there’s a file in the directory that Git isn’t keeping track of. We can tell Git to track a file using git add:

git add Export_DANS_Parels_van_Datasets_Vogeltrekstation.csv Reproducible-analysis-and-Research-Transparency.ipynb

and then check that the right thing happened:

git status
On branch master

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

        new file:   Export_DANS_Parels_van_Datasets_Vogeltrekstation.csv
        new file:   Reproducible-analysis-and-Research-Transparency.ipynb

Git now knows that it’s supposed to keep track of these two files, but it hasn’t recorded these changes as a commit yet. To get it to do that, we need to run one more command:

git commit -m "Let's add the two files"

[master (root-commit) 8dde99b] Let's add the two files
2 files changed, 67898 insertions(+)
create mode 100644 Export_DANS_Parels_van_Datasets_Vogeltrekstation.csv
create mode 100644 Reproducible-analysis-and-Research-Transparency.ipynb

When we run git commit, Git takes everything we have told it to save by using git add and stores a copy permanently inside the special .git directory. This permanent copy is called a commit (or revision) and its short identifier is f22b25e (Your commit may have another identifier.)

We use the -m flag (for “message”) to record a short, descriptive, and specific comment that will help us remember later on what we did and why. If we just run git commit without the -m option, Git will launch nano (or whatever other editor we configured as core.editor) so that we can write a longer message.

Good commit messages start with a brief (<50 characters) summary of changes made in the commit. If you want to go into more detail, add a blank line between the summary line and your additional notes.

If we run git status now:

git status

On branch master
nothing to commit, working tree clean

This is the first steps in maintaining versions. There are a few more commands that you should be aware of, such as git diff and git log, but for the purposes of this exercise, this is sufficient.

8.4.3. Pushing our Jupyter notebook to GitHub

Version control really comes into its own when we begin to collaborate with other people. We already have most of the machinery we need to do this; the only thing missing is to copy changes from one repository to another.

Systems like Git allow us to move work between any two repositories. In practice, though, it’s easiest to use one copy as a central hub, and to keep it on the web rather than on someone’s laptop. Most programmers use hosting services like GitHub, BitBucket or GitLab to hold those master copies; we’ll explore the pros and cons of this in the final section of this lesson.

Let’s start by sharing the changes we’ve made to our current project with the world. Log in to GitHub, then click on the icon in the top right corner to create a new repository called reproducibilityWorkshop. As soon as the repository is created, GitHub displays a page with a URL and some information on how to configure your local repository.

The next step is to connect the two repositories; the local and the one we just created on GitHub. We do this by making the GitHub repository a remote for the local repository. The home page of the repository on GitHub includes the string we need to identify it:

  • Click on the ‘HTTPS’ link to change the protocol from SSH to HTTPS.
  • Copy that URL from the browser, go into the local repository, and run this command:
git remote add origin https://github.com/fpsom/reproducibilityWorkshop.git

Make sure to use the URL for your repository rather than mine: the only difference should be your username instead of fpsom.

We can check that the command has worked by running git remote -v:

git remote -v
origin  https://github.com/fpsom/reproducibilityWorkshop.git (fetch)
origin  https://github.com/fpsom/reproducibilityWorkshop.git (push)

The name origin is a local nickname for your remote repository. We could use something else if we wanted to, but origin is by far the most common choice.

Once the nickname origin is set up, this command will push the changes from our local repository to the repository on GitHub:

git push origin master

Counting objects: 4, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 1.02 MiB | 338.00 KiB/s, done.
Total 4 (delta 0), reused 0 (delta 0)
To https://github.com/fpsom/reproducibilityWorkshop.git
* [new branch]      master -> master

Excellent job! You now have both the remote and the local repositories in sync!

Exercise: Make a change to one of the two local files, commit, and push.