This workshop was part of the Open Science Tools, Data & Technologies for Efficient Ecological & Evolutionary Research Symposium, organized by NIOO-KNAW and DANS-KNAW on 7 & 8 December 2017 at the Amsterdam Science Park.
Transparency, open sharing, and reproducibility are core values of science, but not always part of daily practice. This workshop provided an overview of the current status of reproducible analysis, with the goal of making research more transparent. The workshop covered methodological topics (such as the use of the Open Science Framework and reporting guidelines) as well as software tools (such as Git, Docker, RMarkdown/knitr and Jupyter). Going beyond a simple listing and presentations, the workshop focused on hands-on skill building, with exercises and tutorials covering most of the software aspects. Specifically, the agenda of the workshop was the following:
Contents:
If Git is not already available on your machine you can try to install it via your distro's package manager. For Debian/Ubuntu run sudo apt-get install git and for Fedora run sudo yum install git. On Windows, you can download and run the installer from https://git-scm.com/downloads.
In any case, also create an account on GitHub - it will be useful for the hands-on exercises.
The Jupyter Notebook is an interactive web application that allows you to type and edit lines of code and see the output. The software requires a Python installation, but currently supports interaction with over 40 languages.
For new users, installation of Anaconda is highly recommended. Anaconda conveniently installs Python, the Jupyter Notebook, and other commonly used packages for scientific computing and data science.
Use the following installation steps:
Install on UNIX machines:
First we will install Anaconda, a package manager for Python libraries.
curl -OL https://3230d63b5fc54e62148e-c95ac804525aac4b6dba79b00b39d1d3.ssl.cf1.rackcdn.com/Anaconda2-2.4.0-Linux-x86_64.sh
bash Anaconda2-2.4.0-Linux-x86_64.sh
Install on Mac OS X:
First we will install Anaconda, a package manager for Python libraries.
curl -OL http://repo.continuum.io/archive/Anaconda2-4.1.1-MacOSX-x86_64.sh
bash Anaconda2-4.1.1-MacOSX-x86_64.sh -b
(If you are prompted, press Enter, then continue pressing Enter through the instructions. Be careful not to keep pressing Enter without reading, otherwise you will end up answering No. You will need to type 'yes' to continue with the installation.)
After Anaconda install has finished, type:
source ~/.bashrc
conda install jupyter
conda install -c r r r-essentials
This will install packages allowing you to open either a new Python .ipynb or an R .ipynb.
Navigate to the directory on your computer with files you want to explore. Then type:
jupyter notebook
This will open your browser with a list of files. Click on "New" to bring down the pull-down menu. Under 'Notebooks', click on either the R or Python language to start your new notebook!
You should see the files in the directory.
The main keyboard command to remember is how to execute the code from a cell. Type code into a cell and then hit Shift-Enter.
If you’re in Python 2, type:
print "Hello World!"
or for Python 3:
print("Hello World!")
Then press Shift-Enter.
For more instructions, the Help menu has a good tour and detailed information. Notebooks can be downloaded locally by going to the File menu, then selecting Download and choosing a file type to download.
R is a programming language that is especially powerful for data exploration, visualization, and statistical analysis. To interact with R, we use RStudio.
Windows Install R by downloading and running the correct installer file from CRAN. Also, please install the RStudio IDE. Note that if you have separate user and admin accounts, you should run the installers as administrator (right-click on .exe file and select “Run as administrator” instead of double-clicking). Otherwise problems may occur later, for example when installing R packages.
Linux You can download the binary files for your distribution from CRAN. Or you can use your package manager (e.g. for Debian/Ubuntu run sudo apt-get install r-base and for Fedora run sudo yum install R). Also, please install the RStudio IDE.
install.packages(c("dplyr", "tidyr", "vegan", "ggplot2"));
Jupyter by default supports Python, but this functionality can be expanded by using additional kernels. In our case, we will also use the R kernel. After launching R, run the following commands:
install.packages('devtools')
devtools::install_github('IRkernel/IRkernel')
IRkernel::installspec() # to register the kernel in the current R installation
That’s it, all done! You have now all the tools in place!
The sciences have a reproducibility problem (Nature article: Many published studies cannot be reproduced)
Peng (2009) Reproducible research and Biostatistics. Biostatistics 10: 405-408
The information presented here is based on the following sources:
Numerous problems threaten the integrity, credibility, and utility of research. Improving reproducibility will ensure that research is as efficient and productive as possible. The following figure (retrieved from the report of the symposium "Reproducibility and reliability of biomedical research", organised by the Academy of Medical Sciences, BBSRC, MRC and Wellcome Trust in April 2015; the full report is available from here) summarizes aspects of the conduct of research that can cause irreproducible results, and potential strategies for counteracting poor practice in these areas. Overarching factors can further contribute to the causes of irreproducibility, but can also drive the implementation of specific measures to address these causes. The culture and environment in which research takes place is an important 'top-down' overarching factor. From a 'bottom-up' perspective, continuing education and training for researchers can raise awareness and disseminate good practice.
Here we will address the three main pillars, i.e. Transparency, Reproducibility and Openness, along with some strategies best outlined in the following figure.
Questions to ask yourself:
Researcher degrees of freedom (Wicherts et al. 2017)
Biases: a researcher performs the data analysis:
Fugelsang et al. (2004)
Schimmack (2012)
While understanding the full complement of factors that contribute to reproducibility is important, it can also be hard to break down these factors into steps that can immediately be adopted into an existing research program and immediately improve its reproducibility. One of the first steps to take is to assess the current state of affairs, and to track improvement as steps are taken to increase reproducibility even more.
Goodman, Fanelli, & Ioannidis (2016) note that in epidemiology, computational biology, economics, and clinical trials, reproducibility is often defined as:
the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. That is, a second researcher might use the same raw data to build the same analysis files and implement the same statistical analysis in an attempt to yield the same results.
This is distinct from replicability:
which refers to the ability of a researcher to duplicate the results of a prior study if the same procedures are followed but new data are collected.
Reproducibility can be assessed at several different levels: at the level of an individual project (e.g., a paper, an experiment, a method or a dataset), an individual researcher, a lab or research group, an institution, or even a research field. Slightly different kinds of criteria and points of assessment might apply to these different levels. For example, an institution upholds reproducibility practices if it institutes policies that reward researchers who conduct reproducible research. Meanwhile, a research field might be considered to have a higher level of reproducibility if it develops community-maintained resources that promote and enable reproducible research practices, such as data repositories, or common data-sharing standards.
Automation of the research process means that the main steps in the project: transformations of the data – various processing steps and calculations – as well as the visualization steps that lead to the important inferences, are encoded in software and documented in such a way that they can reliably and mechanically be replicated. In other words, the conclusions and illustrations that appear in the article are the result of a set of computational routines, or scripts that can be examined by others, and re-run to reproduce these results.
The public availability of the data and software are key components of computational reproducibility. To facilitate its evaluation, we suggest that researchers consider the following series of questions.
Crucial to reproducing a study is providing sufficient details about its execution through reports, papers, lab notebooks, etc. Researchers usually aim to publish their results in journals (or conference proceedings) with the aim to broadly distribute their discoveries. However, the choice of a journal may affect the availability and accessibility of their findings. Open access journals allow readers to access articles (usually online) without requiring any subscription or fees. While open access can take many forms, there are two common types of open access publication:
Clearly, gold access journals provide the easiest and most reliable access to the article. However, since there are no subscription fees to cover publishing costs at gold open access journals, the author is required to pay; often the amount is over a thousand dollars per article. As a compromise, journals sometimes have an embargo on open access (delayed open access), i.e. there is a period of time during which the article cannot be freely accessed, after which either the journal automatically makes the article available or the authors are allowed to self-archive it.
Green open access is an attractive approach to making articles openly available because it is affordable to both readers and authors. According to a study of a random sample of articles in 2009 (Björk, Welling, Laakso, Majlender, & Guðnason, 2010), approximately 20% of the articles were freely accessible (9.8 % on publishers’ websites and 11.9% elsewhere through search). A more recent larger study (Archambault et al., 2013) indicates that 43% of Scopus indexed papers between 2008 and 2011 were freely available by the end of 2012. It has been also shown that there is a substantial growth in the proportion of available articles. However, there are still many articles which have been given a green light for access, but they have not been self-archived. Thus it is important for authors to understand the journal’s publishing policy and use the available resources (within their field, institution, and beyond) to make their work accessible to a wide audience. Many research-intensive universities, usually via the libraries, provide services to help researchers self-archive their publications.
There are many other methods for sharing research online at different stages of the work (before final results are even available). Preregistration of the hypotheses that are being tested in a study can prevent overly flexible analysis practices and HARKing (hypothesizing after results are known (Kerr, 1998)), which reduce the reproducibility of the results reported. Regular public updates can be achieved through electronic lab notebooks, wiki pages, presentation slides, blog posts, technical reports, preprints, etc. Sharing progress allows for quick dissemination of ideas, easy collaboration, and early detection and correction of flaws. Storing preliminary results and supplementary materials in centralized repositories (preregistration registries, public version control repositories, institutional reports) has the potential to improve the discoverability and the availability lifespan of the work. Some important questions researchers can ask when evaluating publishing solutions include:
Taking into account the sustainability and the ease of access of these solutions in the decision process is integral to improving the research reproducibility. There is also empirical evidence that publication in open access promotes the downstream use of the scientific findings, as evidenced by an approximately 10% increase in citations (Hajjem, Harnad, & Gingras, 2006) (and see also http://opcit.eprints.org/oacitation-biblio.html).
Sources for all the above details include the following:
(information from the DataCamp article Jupyter And R Markdown)
In his talk, J.J. Allaire confirms that the efforts in R itself for reproducible research, the efforts of Emacs to combine text, code and input, the Pandoc, Markdown and knitr projects, and computational notebooks have been evolving in parallel and influencing each other for many years. He confirms that all of these factors have eventually led to the creation and development of notebooks for R.
Firstly, computational notebooks have quite a history: since the late 80s, when Mathematica's front end was released, there have been a lot of advancements. In 2001, Fernando Pérez started developing IPython, but it was only in 2011 that the team released version 0.12 of IPython. The SageMath project began in 2004. After that, there have been many notebooks. The most notable ones for the data science community are Beaker (2013), Jupyter (2014) and Apache Zeppelin (2015).
Then, there are also the markup languages and text editors that have influenced the creation of RStudio's notebook application, namely Emacs, Markdown, and Pandoc. Org-mode was released in 2003. It's an editing and organizing mode for notes, planning and authoring in the free software text editor Emacs. Six years later, Emacs org-R appeared to provide support for R users. Markdown, on the other hand, was released in 2004 as a markup language that allows you to format your plain text in such a way that it can be converted to HTML or other formats. Fast forward another couple of years, and Pandoc was released: a writing tool that also serves as a basis for publishing workflows.
Lastly, the efforts of the R community to make sure that research can be reproducible and transparent have also contributed to the rise of a notebook for R. Sweave was introduced in 2002 to allow the embedding of R code within LaTeX documents to generate PDF files. These PDF files combined the narrative and analysis, graphics, code, and the results of computations. Ten years later, knitr was developed to solve long-standing problems in Sweave and to combine features that were present in other add-on packages into one single package. It's a transparent engine for dynamic report generation in R; knitr supports any input language and any output markup language.
Also in 2012, R Markdown was created as a variant of Markdown that can embed R code chunks and that can be used with knitr to create reproducible web-based reports. The big advantage was, and still is, that it isn't necessary anymore to use LaTeX, which has a steep learning curve. The syntax of R Markdown is very similar to the regular Markdown syntax but does have some tweaks, as you can include, for example, LaTeX equations.
The OSF gives you free accounts for collaboration around files and other research artifacts.
Create an account and log in at: http://osf.io/
Each account can have up to 5 GB of files without any problem, and each project remains private until you make it public.
You can add other authorized users as well.
Once you are ready to make the project public, you can do so in two ways:
You can also cut DOIs for resources so that you can put them in papers.
The OSF is archival (they have a sustainability fund that guarantees their data will remain around for something like 30 years) and many publications will accept them as a place to make data public if you use a registration as above.
OSF Meetings - free poster and presentation sharing
OSF for institutions - branded for institutions, with logins/passwords tied into your institutional login.
The Dryad Digital Repository (aka Dryad) is a curated resource that makes the data underlying scientific publications discoverable, freely reusable, and citable. Dryad provides a general-purpose home for a wide diversity of datatypes.
Dryad’s vision is to promote a world where research data is openly available, integrated with the scholarly literature, and routinely re-used to create knowledge. They provide the infrastructure for, and promote the re-use of, data underlying the scholarly literature.
Dryad is governed by a nonprofit membership organization. Membership is open to any stakeholder organization, including but not limited to journals, scientific societies, publishers, research institutions, libraries, and funding organizations.
Publishers are encouraged to facilitate data archiving by coordinating the submission of manuscripts with submission of data to Dryad. Learn more about submission integration.
Dryad originated from an initiative among a group of leading journals and scientific societies in evolutionary biology and ecology to adopt a joint data archiving policy (JDAP) for their publications, and the recognition that easy-to-use, sustainable, community-governed data infrastructure was needed to support such a policy.
Zenodo is derived from Zenodotus, the first librarian of the Ancient Library of Alexandria and father of the first recorded use of metadata, a landmark in library history.
The OpenAIRE project, in the vanguard of the open access and open data movements in Europe, was commissioned by the EC to support their nascent Open Data policy by providing a catch-all repository for EC-funded research. CERN, an OpenAIRE partner and pioneer in open source, open access and open data, provided this capability and Zenodo was launched in May 2013.
The Jupyter Notebook is an interactive web application that allows you to type and edit lines of code and see the output. The software requires a Python installation, but currently supports interaction with over 40 languages.
Literate programming is the basic idea behind dynamic documents and was proposed by Donald Knuth in 1984. Originally, it was for mixing the source code and documentation of software together. Today, we will create dynamic documents in which program or analysis code is run to produce output (e.g. tables, plots, models, etc.), which is then explained through narrative writing.
The 3 steps of Literate Programming:
So that leaves us, the writers, with 2 steps: writing the code, and writing the narrative around it.
Note #1: R Notebooks and Jupyter notebooks are very similar! They are two sides of the same coin. We suggest that you adopt whichever one makes more sense to you and has a layout with a lower barrier for you to learn.
Markdown is a system for writing simple, readable text that is easily converted to HTML. Markdown is essentially two things: a plain-text formatting syntax, and a software tool that converts that plain-text formatting to HTML.
Main goal of Markdown: make the syntax of the raw (pre-HTML) document as readable as possible.
Would you rather read this code in HTML?
<body>
<section>
<h1>Fresh Berry Salad Recipe</h1>
<ul>
<li>Blueberries</li>
<li>Strawberries</li>
<li>Blackberries</li>
<li>Raspberries</li>
</ul>
</section>
</body>
Or this code in Markdown?
# Fresh Berry Salad Recipe
* Blueberries
* Strawberries
* Blackberries
* Raspberries
If you are human, the Markdown code is definitely easier to read! Let us take a moment to soak in how much easier our lives are/will be because Markdown exists! Thank you John Gruber and Aaron Swartz (RIP) for creating Markdown in 2004!
In this case, "notebook" or "notebook documents" denote documents that contain both code and rich text elements, such as figures, links, equations, etc. Because of this mix of code and text elements, these documents are the ideal place to bring together an analysis description and its results, and they can be executed to perform the data analysis in real time. These documents are produced by the Jupyter Notebook App.
As a fun note, “Jupyter” is a loose acronym meaning Julia, Python, and R. These programming languages were the first target languages of the Jupyter application, but nowadays, the notebook technology also supports many other languages.
The main components of the whole Jupyter environment are, on the one hand, the notebooks themselves and the application; on the other hand, you also have a notebook kernel (that is, the language interpreter that will be executing the code in the background) and a notebook dashboard.
And there you have it: the Jupyter Notebook - there are also several examples of Jupyter notebooks that you can see/browse.
There are the official and detailed installation notes, and you can also have a quick look at step-by-step guide and some references here.
Generally, you’ll need to install Python (which is a prerequisite). The general recommendation is that you use the Anaconda distribution to install both Python and the notebook application.
After installation, the only thing necessary is to actually start the notebook. This can be done at command line using the following command:
jupyter notebook
After running the command, you will see a bunch of information on the command line window, and at the same time, a new page will open on your browser that will look like the following:
There are three main tabs: Files, Running and Clusters. You'll be mostly using the first two (when not in the actual notebook):
- Files: the listing of your current working directory. When you first launch the notebook, the directory is the same one where you launched the app.
- Running: a list of all active notebooks, i.e. notebooks that have been running commands through one of the available kernels.
- Clusters: a listing of all clusters that are available for back-end execution (this will be empty, unless you have connected the Jupyter Notebook app to a cluster).
Creating a Notebook is as straightforward as clicking on the New button on the top right, and selecting the kernel (i.e. the engine that will be interpreting our commands).
Note: Jupyter really shines for Python and Julia notebooks. R users usually go with RMarkdown, which is much more optimized for R (as opposed to Jupyter). Eventually, however, it all comes down to personal preference (or lab inheritance…)
All notebooks look like this in the beginning:
You'll notice the following points:
- The In [ ] section in the middle is called a cell, and is essentially an interface where you can put your code, text, markdown, etc. Simply put, every cell that is an input is marked as In followed by an increasing number that corresponds to the relative order in which the particular cell was executed.
After writing some code/text in the currently active cell, the main keyboard commands to remember are how to execute the code:
- Shift-Enter: executes the code and creates a new cell underneath.
- Ctrl-Enter: executes the code without creating a new cell.
Type the following code into a cell and then hit Shift-Enter.
2 + 3
You will see the following screen (or similar):
First of all, you may have briefly seen an asterisk after In, i.e. In [ * ]. The asterisk means that the kernel is currently trying to run the code, so you should wait for the output. After successful execution, the * will change to the next number of the cell (1 in our instance), and the output of the command will be visible below (5 in our case). Finally, as we executed the code with Shift-Enter, a brand new cell has been created for us.
At this point, we can rename our Notebook by clicking on the Untitled entry; let's rename it to Jupyter-is-fun.
Well done! You've just created your first Jupyter Notebook!
A Jupyter Notebook can support multiple kernels at the same time. Let's try and run a cell using Python 3. Change the cell type to Python 3 (or Python 2 if you have a different version installed), and type:
print("Hello World!")
(if you're in Python 2, type print "Hello World!")
You'll notice that the output now is Hello World!, while at the same time, the numbering of the cell has reset to 1 - this is because each kernel recognizes its own cells with the order of execution.
Finally, let’s do a quick loop:
elements = ['oxygen', 'nitrogen', 'argon']
for char in elements:
    print(char)
Here, we have created a list of 3 elements, and we assigned the list to a variable aptly named elements. Next, we created a loop structure, where each of the elements in the elements list is assigned to the variable char in turn, and the command in the loop (i.e. print(char)) is executed for that element. You should see a final output similar to this screen:
One of the most powerful things in Jupyter is the fact that you can write both text and code in the same notebook - much like a real Lab notebook where you have your text notes and your equations/figures/etc.
Let's try and put some text in our notebook. To do that, we need to tell Jupyter that the cell should be interpreted as text (Markdown-formatted) and not as code. Click on the empty cell (it should have a green outline), and then go to Cell -> Cell Type -> Markdown. You will notice that the In [ ] indicator just disappeared, as there will be no need to execute anything (and therefore no output will be produced).
Let’s copy the following text into the cell:
# Writing Notebooks
We can write lots of formatted text here, using the [Markdown syntax](https://en.wikipedia.org/wiki/Markdown). It is an easy way to write pretty text easily and efficiently.
## Formatting
It does support several common formatting styles:
- It can do **bold**
- It can do _italics_
- It can also do sub lists
* with items one
* two
* and three
It also allows you to write [LaTeX](https://www.latex-project.org) equations, like this:
$$c = \sqrt{a^2 + b^2}$$
Pretty neat, right?
If you press Shift-Enter after putting in this text, it should look like this:
You'll also notice that, by default, Jupyter has changed the type of the new cell to R, so you won't have to change types constantly, but only when needed.
For more instructions, the Help menu has a good tour and detailed information. Notebooks can be downloaded locally by going to the File menu, then selecting Download and choosing a file type to download; both pdf and html are supported as file type choices.
You can also share the entire file that you have just created (there should be a file named Jupyter-is-fun.ipynb in your working directory). You can even grab the one we created right now from here.
That’s it!
According to the people that created and support Jupyter, it now has ~3 million users worldwide and over 500k notebooks on GitHub - a huge success!
The same team is now working on JupyterLab (currently in alpha), essentially the next generation of the Jupyter Notebook application. You can read more here and also have a look at their recent talk at SciPy2016. So, stay tuned!
In this session we assume a basic knowledge of R and its syntax. There is also a parallel session running on Data Carpentry for Ecology that provides an in-depth presentation of R. The material presented here is based on the following lessons:
We will use our new Jupyter environment to analyze the bird ringing data Netherlands 1960-1990 part 1, led by Henk van der Jeugd. The csv file containing this data, available through this workshop, is here, and the original source is here.
Let's try a general analysis of the data we downloaded earlier. First, we'll need to import some libraries that will help us with the analysis process.
# Data Analysis Libraries
library(dplyr)
library(tidyr)
# [Community Ecology Package](https://cran.r-project.org/web/packages/vegan/index.html)
library(vegan)
# Visualization Libraries
library(ggplot2)
For our first step, we will load the data and then view the top records as well as a summary of all variables included.
dansDataSet <- read.csv(file = "Export_DANS_Parels_van_Datasets_Vogeltrekstation.csv", header = TRUE)
head(dansDataSet)
summary(dansDataSet)
We observe that, even though the data was loaded correctly, it is not used in the best possible way. For example, Ringnumber, CatchDate and Age are treated as words rather than as numeric values. Also, missing values are encoded as NULL, which is not recognized as such by R (the correct value would be NA). The next block tidies the data, so that each attribute is treated as originally intended.
dansDataSet <- data.frame(lapply(dansDataSet, function(x) { gsub("NULL", NA, x) }))
# convert the factor columns to their intended types (via character, to avoid getting factor codes)
dansDataSet$Ringnumber <- as.numeric(as.character(dansDataSet$Ringnumber))
dansDataSet$CatchDate <- as.Date(dansDataSet$CatchDate)
dansDataSet$Age <- as.numeric(as.character(dansDataSet$Age))
dansDataSet$Broodsize <- as.numeric(as.character(dansDataSet$Broodsize))
dansDataSet$PullusAge <- as.numeric(as.character(dansDataSet$PullusAge))
head(dansDataSet)
summary(dansDataSet)
We can see that the data is now much better formatted and more useful for further analysis.
Let's now create a few subsets of the original data. Subset #1, dansDataSet_Castricum, will contain all the unique records for which Location is Castricum, Noord-Holland, NL. Then we will group the records by species and catch date, and calculate the number of each species on the particular catch date.
dansDataSet_Castricum <- dansDataSet %>%
filter(Location == "Castricum, Noord-Holland, NL") %>%
select(unique.RingID = RingID, Species, CatchDate) %>%
group_by(Species, CatchDate) %>%
summarise(count = n())
We could further filter this subset for a particular species. For example, the code below will retrieve all unique observations of Northern Lapwing in Castricum, Noord-Holland, NL.
dansDataSet_lapwing <- dansDataSet %>%
filter(Location == "Castricum, Noord-Holland, NL") %>%
select(unique.RingID = RingID, Species, CatchDate) %>%
group_by(Species, CatchDate) %>%
filter(as.POSIXct(CatchDate) >= as.POSIXct("1970-01-01 00:00:01")) %>%
filter(Species == "Northern Lapwing") %>%
summarise(count = n())
Our second subset will create a matrix of the distribution of unique species across the different locations. This will consequently allow us to calculate some diversity indexes.
dansDataSet_distribution <- dansDataSet %>%
select(unique.RingID = RingID, Species, Location) %>%
group_by(Species, Location) %>%
summarise(count = n()) %>%
filter(count > 0) %>%
na.omit()
# spread(data, key, value)
# data: A data frame
# key: The (unquoted) name of the column whose values will be used as column headings.
# value:The (unquoted) names of the column whose values will populate the cells
dansDataSet_distribution_matrix <- dansDataSet_distribution %>%
spread(Location, count)
We can also create a more specific subset, i.e. of species that have at least 100 unique observations in a given location. This will allow for a cleaner figure.
dansDataSet_distribution_min100 <- dansDataSet %>%
select(unique.RingID = RingID, Species, Location) %>%
group_by(Species, Location) %>%
summarise(count = n()) %>%
filter(count > 100) %>%
na.omit()
The vegan package
We will now use the vegan package to calculate the diversity in the locations.
vegan requirements
First, we replace the NAs with zeros and transpose the matrix so that locations are rows and species are columns:
dansDataSet_distribution_zero <- dansDataSet_distribution_matrix
dansDataSet_distribution_zero[is.na(dansDataSet_distribution_zero)] <- 0
dansDataSet_distribution_zero <- t(dansDataSet_distribution_zero[,2:length(dansDataSet_distribution_zero)])
For each of these indexes, we are going to call the corresponding function from vegan, using the default parameters:
The Shannon or Shannon–Weaver (or Shannon–Wiener) index is defined as:
$$H = -\sum_{i=1}^{R} p_i \log_b(p_i)$$
where $p_i$ is the proportional abundance of species $i$ and $b$ is the base of the logarithm. It is most common to use natural logarithms, but some argue for base $b = 2$.
Both variants of Simpson's index are based on $D = \sum_{i=1}^{R} p_i^2$. The choice simpson returns $1-D$ and invsimpson returns $\frac{1}{D}$.
Hshannon <- diversity(dansDataSet_distribution_zero, index = "shannon", MARGIN = 1, base = exp(1))
simp <- diversity(dansDataSet_distribution_zero, "simpson", MARGIN = 1)
invsimp <- diversity(dansDataSet_distribution_zero, "inv", MARGIN = 1)
The function rarefy gives the expected species richness in random subsamples of size sample from the community. The size of sample should be smaller than the total community size, but the function will silently work for a larger sample as well and return non-rarefied species richness (and standard error equal to 0). If sample is a vector, rarefaction is performed for each sample size separately. Rarefaction can be performed only with genuine counts of individuals. The function rarefy is based on Hurlbert's (1971) formulation, and the standard errors on Heck et al. (1975).
r.2 <- rarefy(dansDataSet_distribution_zero, 2)
fisher.alpha
This function estimates the $a$ parameter of Fisher's logarithmic series. The estimation is possible only for genuine counts of individuals. The function can optionally return standard errors of $a$. These should be regarded only as rough indicators of the accuracy: the confidence limits of $a$ are strongly non-symmetric and the standard errors cannot be used in Normal inference.
alpha <- fisher.alpha(dansDataSet_distribution_zero)
Species richness (S) is calculated by specnumber, which finds the number of species. If MARGIN is set to 2, it finds frequencies of species. Pielou's evenness (J) is calculated as $J = \frac{H_{Shannon}}{\ln(S)}$.
S <- specnumber(dansDataSet_distribution_zero, MARGIN = 1) ## rowSums(BCI > 0) does the same...
J <- Hshannon/log(S)
In order to have all these indices together, we will put them in a single data frame as follows:
metrics <- data.frame(
H_Shannon = Hshannon,
H_Simp = simp,
H_Inv_Simp = invsimp,
rarefy = r.2,
a = alpha,
richness = S,
evenness = J
)
Finally, let's also create some plots. First of all, let's create a plot based on our first subset, showing, for each species and catch date, the number of individuals captured.
png("files/figs/subset1a1.png", width = 4000, height = 2000, res = 300, pointsize = 5)
ggplot(data=dansDataSet_Castricum, aes(x=CatchDate, y=Species, color=count)) +
geom_point(aes(size=count))
dev.off()
png("files/figs/subset1a2.png", width = 4000, height = 2000, res = 300, pointsize = 5)
ggplot(data=dansDataSet_Castricum, aes(x=CatchDate, y=count, colour=Species)) +
geom_line()
dev.off()
We can do the same plots for the single species that we looked into earlier (Northern Lapwing in Castricum, Noord-Holland, NL).
png("files/figs/subset1b1.png", width = 4000, height = 2000, res = 300, pointsize = 5)
ggplot(data=dansDataSet_lapwing, aes(x=CatchDate, y=Species, color=count)) +
geom_point(aes(size=count))
dev.off()
This is not really easy to interpret. However, we can now create a more informative plot with lines, including a smoothing curve to show the overall trend:
png("files/figs/subset1b2.png", width = 4000, height = 2000, res = 300, pointsize = 5)
ggplot(data=dansDataSet_lapwing, aes(x=CatchDate, y=count, colour=Species)) +
geom_point(aes(x = CatchDate, y = count, colour = Species), size = 3) +
stat_smooth(aes(x = CatchDate, y = count), method = "lm", formula = y ~ poly(x, 3), se = FALSE)
dev.off()
We can also create a plot based on the second subset. In this case, let's see what the distribution of species across the seven locations looks like.
lvls <- unique(as.vector(dansDataSet_distribution$Location))
png("files/figs/subset2a.png", width = 4000, height = 2000, res = 300, pointsize = 5)
ggplot(data=dansDataSet_distribution, aes(x=Species, y=Location, color=Species)) +
geom_point(aes(size=count)) +
theme(text=element_text(family="Arial", size=12*(81/169)),
axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.3)) +
scale_y_discrete(breaks=lvls[seq(1,length(lvls),by=10)]) #scale_y_discrete(labels = abbreviate)
dev.off()
This is a very "dense" figure, so let's use the filtered version to see only the most abundant species.
png("files/figs/subset2b.png", width = 4000, height = 2000, res = 300, pointsize = 5)
ggplot(data=dansDataSet_distribution_min100, aes(x=Species, y=Location, color=Species)) +
geom_point(aes(size=count)) +
theme(text=element_text(family="Arial", size=12*(81/169)),
axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.3))
dev.off()
Finally, let's have a figure showing all of these indices together.
png("files/figs/metrics.png", width = 4000, height = 2000, res = 300, pointsize = 5)
plot(metrics, pch="+", col="blue")
dev.off()
We could also show the most diverse sites (i.e. those with a richness of at least 10).
top10_site_metrics <- metrics %>%
tibble::rownames_to_column() %>%
filter(richness >= 10) %>%
arrange(desc(richness))
top10_site_metrics
During this lesson, you'll learn how to use RMarkdown for reproducible data analysis. We will work with the same data as in the Jupyter session: bird ringing data Netherlands 1960-1990 part 1, led by Henk van der Jeugd, with the csv file containing this data here, and the original source here.
This lesson is heavily based on the Exploratory RNAseq data analysis using RMarkdown lesson from the workshop ANGUS: Analyzing High Throughput Sequencing Data. As a session, it will get you started with RMarkdown, but if you want more, here is a great tutorial.
RMarkdown is a variant of Markdown that makes it easy to create dynamic documents, presentations and reports within RStudio. It has embedded R (originally), python, perl and shell code chunks to be used with knitr (an R package) to make it easy to create reproducible reports, in the sense that they can be automatically regenerated when the underlying code is modified.
RMarkdown renders many different types of files including:
Briefly, to make a report: create a .Rmd file, write markdown narrative and code chunks in it, and knit it.
Overview of the steps RMarkdown takes to get to the rendered document:
- Create a .Rmd report that includes R code chunks and markdown narratives (as indicated in the steps above).
- Give the .Rmd file to knitr to execute the R code chunks and create a new .md file.
- Give the .md file to pandoc, which will create the final rendered document (e.g. html, Microsoft Word, pdf, etc.).
- In effect, this converts the file from one format (.Rmd) to another (in this case: HTML).
While this may seem complicated, we can simply hit the knit button at the top of the page. Knitting is the verb used to describe the combining of the code chunks, inline code, markdown and narrative.
Note: Knitting is different from rendering! Rendering refers to the writing of the final document, which occurs after knitting.
Creating a .Rmd File
It's go time! Let's start working with RMarkdown! In the menu bar, click File -> New File -> RMarkdown
4 main components:
1. YAML headers
2. Narrative/description of your analysis
3. Code
   a. Inline code
   b. Code chunks
YAML stands for "Yet Another Markup Language" or "YAML Ain't Markup Language" and is a nested list structure that includes the metadata of the document. It is enclosed between two lines of three dashes --- and, as we saw above, is automatically written by RStudio. A simple example:
---
title: "Reproducible analysis and Research Transparency"
author: "Fotis E. Psomopoulos"
date: "December 8th, 2017"
output: html_document
---
The above example will create an HTML document. However, the following output options are also available:
- html_document
- pdf_document
- word_document
- beamer_presentation (pdf slideshow)
- ioslides_presentation (HTML slideshow)
Today, we will create HTML files. Presentation slides take on a slightly different syntax (e.g. to specify when one slide ends and the next one starts), so please note that there is a bit of markdown syntax specific to presentations.
For this section of the document, you will use markdown to write descriptions of whatever the document is about. For example, you may write your abstract, introduction, or materials and methods to set the stage for the analysis to come in code chunks later on.
There are 2 ways to embed code within an RMarkdown document.
Inline code is created by using a back tick (the key next to the #1) (`) and the letter r followed by another back tick.
Imagine that you're reporting a p-value and you do not want to go back and add it every time the statistical test is re-run. Rather, the p-value is 0.0045.
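In the rendered report, that number can come from inline code rather than being typed by hand. A minimal sketch (assuming the test result has already been stored in a variable named pvalue, which is hypothetical here) would look like this in the .Rmd source:

The p-value of our test is `r pvalue`.

When the document is knit, the back-ticked expression is replaced by the current value of pvalue, so the reported number updates automatically whenever the analysis is re-run.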
This is really helpful when writing up the results section of a paper. For example, you may have run a bunch of statistics for your scientific questions, and this would be a way to have R save those values in named variables.
Cool, huh?!
Code chunks can be used to render code output into documents or to display code for illustration. The code chunks can be in shell/bash, python, Rcpp, SQL, or Stan.
The Anatomy of a code chunk:
To insert an R code chunk, you can type it manually by typing ```{r} followed by ``` on the next line. This will produce the following code chunk:
```{r}
n <- 10
seq(n)
```
Name the code chunk something meaningful as to what it is doing. Below I have named the code chunk 10_random_numbers:
```{r 10_random_numbers}
n <- 10
seq(n)
```
The code chunk input and output are then displayed as follows:
n <- 10
seq(n)
##  [1]  1  2  3  4  5  6  7  8  9 10
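Code chunks in other languages work the same way; as a small sketch (the engine name replaces r inside the braces, and the label list_files is arbitrary), a bash chunk would be:

```{bash list_files}
ls -la
```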
Always name/label your code chunks!
Chunk labels must be unique throughout the document (if not, there will be an error), they should accurately describe what's happening in the code chunk, and they make it much easier to navigate through your .Rmd documents.
When naming code chunks, use - or _ in between words instead of spaces. This will help you and other users of your document to navigate through it.
Pressing tab when inside the braces will bring up code chunk options.
- results = "asis" stands for "as is" and will output a non-formatted version.
- collapse is another chunk option which can be helpful. If a code chunk has many short R expressions with some output, you can collapse the output into the chunk.
There are too many chunk options to cover here. After the workshop, take a look around at the options.
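As a quick sketch of where such options go (the chunk label and code below are just placeholders), options are listed after the label inside the braces:

```{r cars_summary, results="asis", collapse=TRUE}
summary(cars)
```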
Great website for exploring Knitr Chunk Options.
Knitr makes producing figures really easy. If analysis code within a chunk is supposed to produce a figure, it will just print out into the document.
Some knitr chunk options that relate to figures:
- fig.width and fig.height: the figure width and height (default: fig.width = 7, fig.height = 7).
- fig.align: how to align the figure; options are "left", "right", and "center".
- fig.path: a file path to the directory where knitr should store the graphic output created by the chunk (default: 'figure/').
- fig.retina: (only for HTML output) for higher figure resolution with retina displays.
You may wish to have the same chunk settings throughout your document, so it might be nice to type the options once instead of re-typing them for each chunk. To do so, you can set global chunk options at the top of the document:
knitr::opts_chunk$set(echo = FALSE,
eval = TRUE,
message = FALSE,
warning = FALSE,
fig.path = "Figures/",
fig.width = 12,
fig.height = 8)
For example, if you're working with a collaborator who does not want to see the code, you could set eval = TRUE and echo = FALSE so the code is evaluated but not shown. In addition, you may want to use message = FALSE and warning = FALSE so your collaborator does not see any messages or warnings from R.
If you would like to save and store figures within a sub-directory within the project, set fig.path = "Figures/". Here, "Figures/" denotes a folder named Figures within the current directory where the figures produced within the document will be stored. Note: by default, figures are not saved.
Global chunk options will be set for the rest of the document. If you would like to have a particular chunk be different from the global options, specify at the beginning of that particular chunk.
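For instance, a single chunk can override the global figure settings; the sketch below only shows where the per-chunk options go (it assumes the dansDataSet_lapwing subset from the Jupyter session and the ggplot2 library are available):

```{r lapwing_trend, fig.width = 6, fig.height = 4, fig.align = "center"}
# per-chunk options above override the global opts_chunk settings for this chunk only
ggplot(data = dansDataSet_lapwing, aes(x = CatchDate, y = count)) +
  geom_line()
```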
Hand writing tables in Markdown can get tedious. We will not go over this here, however, if you’d like to learn more about Markdown tables check out the documentation on tables at the RMarkdown v2 website.
In his Knitr in a Knutshell, Dr. Karl Broman introduces kable, pander, and xtable, and many useRs like the first two:
- kable: within the knitr package - not many options, but it looks nice with ease.
- pander: within the pander package - has many more options and customization. Useful for bolding certain values (e.g. values below a threshold).
You should also check out the DT package for interactive tables. Check out more details here: http://www.htmlwidgets.org/showcase_datatables.html
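As a minimal sketch, the metrics data frame computed earlier in the Jupyter session could be rendered as a formatted table inside a chunk with kable:

```{r metrics_table}
# print the first rows of the diversity metrics as a formatted table, rounded to 2 digits
knitr::kable(head(metrics), digits = 2)
```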
It’s also possible to include a bibliography file in the YAML header. Bibliography formats that are readable by Pandoc include the following:
To create a bibliography in RMarkdown, two files are needed: a bibliography file containing the reference information, and a CSL file describing the citation style.
An example YAML header with a bibliography and a citation style language (CSL) file:
output: html_document
bibliography: bibliography.bib
csl: nature.csl
Check out the very helpful web page by the R Core team on bibliographies and citations.
If you would like to cite R packages, knitr even includes a function called write_bib() that creates .bib entries for R packages. It will even write them to a file!
write_bib(file = "r-packages.bib") # will write all packages
write_bib(c("knitr", "ggplot2"), file = "r-packages2.bib") # Only writes knitr and ggplot2 packages
The bibliography will automatically be placed at the end of the document. Therefore, you should finish your .Rmd document with # References so the bibliography comes after that header.
final words...
# References
Citation Style Language (CSL) is an XML-based language that identifies the format of citations and bibliographies. Reference management programs such as Zotero, Mendeley and Papers all use CSL.
Search for your favorite journal and CSL in the Zotero Style Repository, which currently has >8,000 CSLs. Is there a style that you’re looking for that is not there?
output: html_document
bibliography: bibliography.bib
csl: nature.csl
Citations go inside square brackets [ ] and are separated by semicolons ;. Each citation must have a key, composed of @ + the citation identifier from the database, and may optionally have a prefix, a locator, and a suffix. To check what the citation key is for a reference, take a look at the .bib file. In this file, you can also change the key for each reference. However, be careful that each ID is unique!
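For example (the keys smith2017 and jones2016 below are hypothetical and would have to match entries in your .bib file):

Blah blah [see @smith2017, pp. 33-35; also @jones2016, ch. 1].

@smith2017 says blah.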
Once you make a beautiful dynamic document, you may wish to share it with others. One option to share it with the world is to host it on RPubs. With RStudio, this is very easy! Do the following:
1. Create your .Rmd document.
2. Click the knit button to render your HTML document to be published.
3. Click the publish button and follow the directions.
Yay!
We’ll start by exploring how version control can be used to keep track of what one person did and when. Even if you aren’t collaborating with other people, automated version control is much better than this situation:
“Piled Higher and Deeper” by Jorge Cham, http://www.phdcomics.com
We’ve all been in this situation before: it seems ridiculous to have multiple nearly-identical versions of the same document. Some word processors let us deal with this a little better, such as Microsoft Word’s Track Changes, Google Docs’ version history, or LibreOffice’s Recording and Displaying Changes.
Version control systems start with a base version of the document and then save just the changes you made at each step of the way. You can think of it as a tape: if you rewind the tape and start at the base document, then you can play back each change and end up with your latest version.
Once you think of changes as separate from the document itself, you can then think about “playing back” different sets of changes onto the base document and getting different versions of the document. For example, two users can make independent sets of changes based on the same document.
Unless there are conflicts, you can even play two sets of changes onto the same base document.
A version control system is a tool that keeps track of these changes for us and helps us version and merge our files. It allows you to decide which changes make up the next version, called a commit, and keeps useful metadata about them. The complete history of commits for a particular project and their metadata make up a repository. Repositories can be kept in sync across different computers, facilitating collaboration among different people.
The Long History of Version Control Systems
Automated version control systems are nothing new. Tools like RCS, CVS, or Subversion have been around since the early 1980s and are used by many large companies. However, many of these are now considered legacy systems due to various limitations in their capabilities. In particular, the more modern systems, such as Git and Mercurial, are distributed, meaning that they do not need a centralized server to host the repository. These modern systems also include powerful merging tools that make it possible for multiple authors to work within the same files concurrently.
The opposite of “open” isn’t “closed”. The opposite of “open” is “broken”.
— John Wilbanks
Free sharing of information might be the ideal in science, but the reality is often more complicated. Normal practice today looks something like this:
For a growing number of scientists, though, the process looks like this:
This open model accelerates discovery: the more open work is, the more widely it is cited and re-used. However, people who want to work this way need to make some decisions about what exactly “open” means and how to do it. You can find more on the different aspects of Open Science in this book.
This is one of the (many) reasons we teach version control. When used diligently, it answers the “how” question by acting as a shareable electronic lab notebook for computational work:
Making Code Citable
This short guide from GitHub explains how to create a Digital Object Identifier (DOI) for your code, your papers, or anything else hosted in a version control repository.
The folder that currently contains our Jupyter notebook and the data file should look like this:
$ ls -la
total 18756
drwxr-xr-x 1 fpsom 197609 0 Nov 16 14:09 ./
drwxr-xr-x 1 fpsom 197609 0 Nov 16 14:08 ../
-rw-r--r-- 1 fpsom 197609 19034567 Oct 22 03:18 Export_DANS_Parels_van_Datasets_Vogeltrekstation.csv
-rw-r--r-- 1 fpsom 197609 152229 Nov 16 13:52 Reproducible-analysis-and-Research-Transparency.ipynb
Then we tell Git to make this folder a repository — a place where Git can store versions of our files:
git init
If we use ls to show the directory's contents, it appears that nothing has changed. But if we add the -a flag to show everything, we can see that Git has created a hidden directory called .git within our folder:
ls -la
total 18760
drwxr-xr-x 1 fpsom 197609 0 Nov 16 14:11 ./
drwxr-xr-x 1 fpsom 197609 0 Nov 16 14:08 ../
drwxr-xr-x 1 fpsom 197609 0 Nov 16 14:11 .git/
-rw-r--r-- 1 fpsom 197609 19034567 Oct 22 03:18 Export_DANS_Parels_van_Datasets_Vogeltrekstation.csv
-rw-r--r-- 1 fpsom 197609 152229 Nov 16 13:52 Reproducible-analysis-and-Research-Transparency.ipynb
Git stores information about the project in this special sub-directory. If we ever delete it, we will lose the project’s history.
We can check that everything is set up correctly by asking Git to tell us the status of our project. It shows that there are two new files that are currently not tracked (meaning that any changes there will not be monitored).
git status
On branch master
No commits yet
Untracked files:
(use "git add <file>..." to include in what will be committed)
Export_DANS_Parels_van_Datasets_Vogeltrekstation.csv
Reproducible-analysis-and-Research-Transparency.ipynb
nothing added to commit but untracked files present (use "git add" to track)
The untracked files message means that there are files in the directory that Git isn't keeping track of. We can tell Git to track a file using git add:
git add Export_DANS_Parels_van_Datasets_Vogeltrekstation.csv Reproducible-analysis-and-Research-Transparency.ipynb
and then check that the right thing happened:
git status
On branch master
No commits yet
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: Export_DANS_Parels_van_Datasets_Vogeltrekstation.csv
new file: Reproducible-analysis-and-Research-Transparency.ipynb
Git now knows that it’s supposed to keep track of these two files, but it hasn’t recorded these changes as a commit yet. To get it to do that, we need to run one more command:
git commit -m "Let's add the two files"
[master (root-commit) 8dde99b] Let's add the two files
2 files changed, 67898 insertions(+)
create mode 100644 Export_DANS_Parels_van_Datasets_Vogeltrekstation.csv
create mode 100644 Reproducible-analysis-and-Research-Transparency.ipynb
When we run git commit, Git takes everything we have told it to save by using git add and stores a copy permanently inside the special .git directory. This permanent copy is called a commit (or revision) and its short identifier is 8dde99b (your commit may have another identifier).
We use the -m flag (for "message") to record a short, descriptive, and specific comment that will help us remember later on what we did and why. If we just run git commit without the -m option, Git will launch nano (or whatever other editor we configured as core.editor) so that we can write a longer message.
Good commit messages start with a brief (<50 characters) summary of changes made in the commit. If you want to go into more detail, add a blank line between the summary line and your additional notes.
If we run git status now:
git status
On branch master
nothing to commit, working tree clean
These are the first steps in maintaining versions. There are a few more commands that you should be aware of, such as git diff and git log, but for the purposes of this exercise, this is sufficient.
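For reference, here is a quick sketch of what those two commands give you (the exact output will depend on your repository):

git log --oneline    # compact list of the commits made so far
git diff             # unstaged changes in tracked files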
Version control really comes into its own when we begin to collaborate with other people. We already have most of the machinery we need to do this; the only thing missing is to copy changes from one repository to another.
Systems like Git allow us to move work between any two repositories. In practice, though, it's easiest to use one copy as a central hub, and to keep it on the web rather than on someone's laptop. Most programmers use hosting services like GitHub, BitBucket or GitLab to hold those master copies; we'll explore the pros and cons of this in the final section of this lesson.
Let’s start by sharing the changes we’ve made to our current project with the world. Log in to GitHub, then click on the icon in the top right corner to create a new repository called reproducibilityWorkshop
. As soon as the repository is created, GitHub displays a page with a URL and some information on how to configure your local repository.
The next step is to connect the two repositories: the local one and the one we just created on GitHub. We do this by making the GitHub repository a remote for the local repository. The home page of the repository on GitHub includes the string we need to identify it; click on the 'HTTPS' link to change the protocol from SSH to HTTPS.
git remote add origin https://github.com/fpsom/reproducibilityWorkshop.git
Make sure to use the URL for your repository rather than mine: the only difference should be your username instead of fpsom.
We can check that the command has worked by running git remote -v:
git remote -v
origin https://github.com/fpsom/reproducibilityWorkshop.git (fetch)
origin https://github.com/fpsom/reproducibilityWorkshop.git (push)
The name origin is a local nickname for your remote repository. We could use something else if we wanted to, but origin is by far the most common choice.
Once the nickname origin is set up, this command will push the changes from our local repository to the repository on GitHub:
git push origin master
Counting objects: 4, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 1.02 MiB | 338.00 KiB/s, done.
Total 4 (delta 0), reused 0 (delta 0)
To https://github.com/fpsom/reproducibilityWorkshop.git
* [new branch] master -> master
Excellent job! You now have both the remote and the local repositories in sync!
Exercise: Make a change to one of the two local files, commit, and push.
As computational work becomes more and more integral to many aspects of scientific research, computational reproducibility has become an issue of increasing importance to computer systems researchers and domain scientists alike. Though computational reproducibility seems more straightforward than replicating physical experiments, the complex and rapidly changing nature of computer environments makes being able to reproduce and extend such work a serious challenge.
Studies focusing on code that has been made available with scientific publications regularly find the same common issues that pose substantial barriers to reproducing the original results or building on that code:
Docker is an open source project that builds on many long familiar technologies from operating systems research: LXC containers, virtualization of the OS, and a hash-based or git-like versioning and differencing system, among others.
Containers are a way to package software in a format that can run isolated on a shared operating system. Unlike VMs, containers do not bundle a full operating system - only the libraries and settings required to make the software work are needed. This makes for efficient, lightweight, self-contained systems and guarantees that software will always run the same, regardless of where it's deployed.
Docker is a wonderful tool for many things. A few of them are:
We are going to explore Docker images: why we should use them, and how to go about it.
Docker images are used to configure and distribute application states. Think of it as a template with which to create the container.
With a Docker image, we can quickly spin up containers with the same configuration. We can then share these images with our team, so we will all be running containers which all have the same configuration.
There are several ways to create Docker images, but the best way is to create a Dockerfile and use the docker build command.
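As a minimal sketch (the image name my-analysis is arbitrary), once a Dockerfile is present in the current directory, an image is built and tagged with:

docker build -t my-analysis .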
There are several ways to distribute our images. For example, if our Dockerfile is hosted on GitHub or Bitbucket, we can configure an automated build on Docker Hub.
Once we have our image on the machine, we can then run it using:
docker pull fpsom/jupyter-kernels
docker run --name=jupyter fpsom/jupyter-kernels
The above command will spin up a new container based on our image and run it. If we do not pass the --name parameter, Docker will pick a random name for our container.
Many images require some extra parameters to be passed to the run command, so take some time to read through the documentation of an image before you use it.
Also, take some time to read the full documentation of the docker run command.
We can see our running containers using the docker ps command. To see ALL containers, we add the -a flag: docker ps -a.
For our case, we will use MyBinder as the host of our Jupyter Notebook, so that anyone can run (and therefore reproduce) our code.
Binder allows you to create custom computing environments that can be shared and used by many remote users.
Binder makes it simple to generate reproducible computing environments from a GitHub repository. Binder uses the BinderHub technology to generate a Docker image from this repository. The image will have all the components that you specify along with the Jupyter Notebooks inside. You will be able to share a URL with users that can immediately begin interacting with this environment via the cloud.
If you or another Binder user clicks on a Binder link, the mybinder.org deployment will run the linked repository. While running, you are guaranteed to have at least 1G of RAM. There is an upper-limit of 4GB (if you use more than 4GB your kernel will be restarted).
By default, Binder works with Python; however, given that our notebook is based on R, we need to define the execution environment. This will be done through a dedicated Dockerfile that will set up R and the R kernel, as well as install the necessary libraries (the version below is based on the Dockerfile provided by binder).
FROM rocker/tidyverse:3.4.2
RUN apt-get update && \
apt-get -y install python3-pip && \
pip3 install --no-cache-dir notebook==5.2 && \
apt-get purge && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
ENV NB_USER rstudio
ENV NB_UID 1000
ENV HOME /home/rstudio
WORKDIR ${HOME}
USER ${NB_USER}
# Set up R Kernel for Jupyter
RUN R --quiet -e "install.packages(c('repr', 'IRdisplay', 'evaluate', 'crayon', 'pbdZMQ', 'devtools', 'uuid', 'digest', 'vegan'))"
RUN R --quiet -e "devtools::install_github('IRkernel/IRkernel')"
RUN R --quiet -e "IRkernel::installspec()"
# Make sure the contents of our repo are in ${HOME}
COPY . ${HOME}
USER root
RUN chown -R ${NB_UID}:${NB_UID} ${HOME}
USER ${NB_USER}
# Run install.r if it exists
RUN if [ -f install.r ]; then R --quiet -f install.r; fi
Copy the above commands into an empty file named Dockerfile - you can also download this file directly from here. Then add this file to the GitHub repository, using the git commands (i.e. git add, git commit and git push).
Finally, go to the MyBinder URL, paste the URL of your GitHub repository in the field GitHub repo or URL and click launch. In a few minutes, your custom Jupyter environment will be launched.