We’ll start by exploring how version control can be used to keep track of what one person did and when. Even if you aren’t collaborating with other people, automated version control is much better than this situation:
“Piled Higher and Deeper” by Jorge Cham, http://www.phdcomics.com
We’ve all been in this situation before: it seems ridiculous to have multiple nearly-identical versions of the same document. Some word processors let us deal with this a little better, such as Microsoft Word’s Track Changes, Google Docs’ version history, or LibreOffice’s Recording and Displaying Changes.
Version control systems start with a base version of the document and then save just the changes you made at each step of the way. You can think of it as a tape: if you rewind the tape and start at the base document, then you can play back each change and end up with your latest version.
Once you think of changes as separate from the document itself, you can then think about “playing back” different sets of changes onto the base document and getting different versions of the document. For example, two users can make independent sets of changes based on the same document.
Unless there are conflicts, you can even play two sets of changes onto the same base document.
A version control system is a tool that keeps track of these changes for us and helps us version and merge our files. It allows you to decide which changes make up the next version, called a commit and keeps useful metadata about them. The complete history of commits for a particular project and their metadata make up a repository. Repositories can be kept in sync across different computers facilitating collaboration among different people.
The Long History of Version Control Systems
Automated version control systems are nothing new. Tools like RCS, CVS, or Subversion have been around since the early 1980s and are used by many large companies. However, many of these are now becoming considered as legacy systems due to various limitations in their capabilities. In particular, the more modern systems, such as Git and Mercurial are distributed, meaning that they do not need a centralized server to host the repository. These modern systems also include powerful merging tools that make it possible for multiple authors to work within the same files concurrently.
The opposite of “open” isn’t “closed”. The opposite of “open” is “broken”.
— John Wilbanks
Free sharing of information might be the ideal in science, but the reality is often more complicated. Normal practice today looks something like this:
For a growing number of scientists, though, the process looks like this:
This open model accelerates discovery: the more open work is, the more widely it is cited and re-used. However, people who want to work this way need to make some decisions about what exactly “open” means and how to do it. You can find more on the different aspects of Open Science in this book.
This is one of the (many) reasons we teach version control. When used diligently, it answers the “how” question by acting as a shareable electronic lab notebook for computational work:
Making Code Citable
This short guide from GitHub explains how to create a Digital Object Identifier (DOI) for your code, your papers, or anything else hosted in a version control repository.
The folder that currently contains our Jupyter notebook and the data file should look like this:
$ ls -la
total 18756
drwxr-xr-x 1 fpsom 197609 0 Nov 16 14:09 ./
drwxr-xr-x 1 fpsom 197609 0 Nov 16 14:08 ../
-rw-r--r-- 1 fpsom 197609 19034567 Oct 22 03:18 Export_DANS_Parels_van_Datasets_Vogeltrekstation.csv
-rw-r--r-- 1 fpsom 197609 152229 Nov 16 13:52 Reproducible-analysis-and-Research-Transparency.ipynb
Then we tell Git to make this folder a repository — a place where Git can store versions of our files:
git init
If we use ls
to show the directory’s contents, it appears that nothing has changed. But if we add the -a
flag to show everything, we can see that Git has created a hidden directory within planets called .git
:
total 18760
drwxr-xr-x 1 fpsom 197609 0 Nov 16 14:11 ./
drwxr-xr-x 1 fpsom 197609 0 Nov 16 14:08 ../
drwxr-xr-x 1 fpsom 197609 0 Nov 16 14:11 .git/
-rw-r--r-- 1 fpsom 197609 19034567 Oct 22 03:18 Export_DANS_Parels_van_Datasets_Vogeltrekstation.csv
-rw-r--r-- 1 fpsom 197609 152229 Nov 16 13:52 Reproducible-analysis-and-Research-Transparency.ipynb
Git stores information about the project in this special sub-directory. If we ever delete it, we will lose the project’s history.
We can check that everything is set up correctly by asking Git to tell us the status of our project. It shows that there are two new files that are currently not tracked (meaning that any changes there will not be monitored).
git status
On branch master
No commits yet
Untracked files:
(use "git add <file>..." to include in what will be committed)
Export_DANS_Parels_van_Datasets_Vogeltrekstation.csv
Reproducible-analysis-and-Research-Transparency.ipynb
nothing added to commit but untracked files present (use "git add" to track)
The untracked files message means that there’s a file in the directory that Git isn’t keeping track of. We can tell Git to track a file using git add
:
git add Export_DANS_Parels_van_Datasets_Vogeltrekstation.csv Reproducible-analysis-and-Research-Transparency.ipynb
and then check that the right thing happened:
git status
On branch master
No commits yet
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: Export_DANS_Parels_van_Datasets_Vogeltrekstation.csv
new file: Reproducible-analysis-and-Research-Transparency.ipynb
Git now knows that it’s supposed to keep track of these two files, but it hasn’t recorded these changes as a commit yet. To get it to do that, we need to run one more command:
git commit -m "Let's add the two files"
[master (root-commit) 8dde99b] Let's add the two files
2 files changed, 67898 insertions(+)
create mode 100644 Export_DANS_Parels_van_Datasets_Vogeltrekstation.csv
create mode 100644 Reproducible-analysis-and-Research-Transparency.ipynb
When we run git commit
, Git takes everything we have told it to save by using git add
and stores a copy permanently inside the special .git
directory. This permanent copy is called a commit
(or revision
) and its short identifier is f22b25e
(Your commit may have another identifier.)
We use the -m
flag (for “message”) to record a short, descriptive, and specific comment that will help us remember later on what we did and why. If we just run git commit
without the -m
option, Git will launch nano
(or whatever other editor we configured as core.editor
) so that we can write a longer message.
Good commit messages start with a brief (<50 characters) summary of changes made in the commit. If you want to go into more detail, add a blank line between the summary line and your additional notes.
If we run git status
now:
git status
On branch master
nothing to commit, working tree clean
This is the first steps in maintaining versions. There are a few more commands that you should be aware of, such as git diff
and git log
, but for the purposes of this exercise, this is sufficient.
Version control really comes into its own when we begin to collaborate with other people. We already have most of the machinery we need to do this; the only thing missing is to copy changes from one repository to another.
Systems like Git
allow us to move work between any two repositories. In practice, though, it’s easiest to use one copy as a central hub, and to keep it on the web rather than on someone’s laptop. Most programmers use hosting services like GitHub, BitBucket or GitLab to hold those master copies; we’ll explore the pros and cons of this in the final section of this lesson.
Let’s start by sharing the changes we’ve made to our current project with the world. Log in to GitHub, then click on the icon in the top right corner to create a new repository called reproducibilityWorkshop
. As soon as the repository is created, GitHub displays a page with a URL and some information on how to configure your local repository.
The next step is to connect the two repositories; the local and the one we just created on GitHub. We do this by making the GitHub repository a remote for the local repository. The home page of the repository on GitHub includes the string we need to identify it:
‘HTTPS’
link to change the protocol from SSH
to HTTPS
.git remote add origin https://github.com/fpsom/reproducibilityWorkshop.git
Make sure to use the URL for your repository rather than mine: the only difference should be your username instead of fpsom
.
We can check that the command has worked by running git remote -v
:
git remote -v
origin https://github.com/fpsom/reproducibilityWorkshop.git (fetch)
origin https://github.com/fpsom/reproducibilityWorkshop.git (push)
The name origin
is a local nickname for your remote repository. We could use something else if we wanted to, but origin
is by far the most common choice.
Once the nickname origin is set up, this command will push the changes from our local repository to the repository on GitHub:
git push origin master
Counting objects: 4, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 1.02 MiB | 338.00 KiB/s, done.
Total 4 (delta 0), reused 0 (delta 0)
To https://github.com/fpsom/reproducibilityWorkshop.git
* [new branch] master -> master
Excellent job! You now have both the remote and the local repositories in sync!
Exercise: Make a change to one of the two local files, commit, and push.