yDiv/HIGRADE Workshop Day 1

yDiv/HIGRADE Workshop Day 1: Skills and techniques in biodiversity research

When: 10.11.2015
Where: Leipzig, iDiv/BioCity, Deutscher Platz 5

For more information on the content of each session, please see the descriptions below.

Room/Time	Red Queen	Symbiosis in Interim II
09.30 - 12.30	Joona Lehtomäki: *Version control for science: Introduction to git and GitHub*
12.30 - 13.30	Lunch break (no catering)	Lunch break (no catering)
13.30 - 15.30	Christian Krause: *Linux command line basics*	Tilo Arnold: *Why should I be working with media?*
15.30 - 15.45	Break (no catering)	Break (no catering)
15.45 - 17.45	Joona Lehtomäki: *Simplifying data manipulation with R*	Felix May: *Integrating R and C++*

Session descriptions

Version control for science: Introduction to git and GitHub

10.11., 9:30-12:30

Dr. Joona Lehtomäki - Post-doctoral researcher at the Conservation Biodiversity Informatics Group, University of Helsinki

Objectives

Introduce the concept of version control and why it is useful for scientists
Introduce the git version control tool and its command-line interface
Demonstrate the basics of git
Demonstrate the basics of the GitHub web service

Description

A version control system is a tool for managing changes to a set of files. Version control is better than mailing files back and forth:

Nothing that is committed to version control is ever lost. Since all old versions of files are saved, it is always possible to go back in time to see exactly who wrote what on a particular day.
With the full change history, we know whom to ask if we have questions later on and, if need be, can revert to a previous version, much like the “undo” feature in an editor.
When several people collaborate on the same project, it is easy to accidentally overlook or overwrite someone’s changes: the version control system automatically notifies users whenever there is a conflict between one person’s work and another’s.

Version control is the lab notebook of the digital world: professionals use it to keep track of what they have done and to collaborate with other people. Every large software development project relies on it, and most programmers use it for their small jobs as well. And it is not just for software: books, papers, small data sets, and anything that changes over time or needs to be shared can and should be stored in a version control system.

Git has quickly become the most popular version control software around. It is a very versatile and fast tool, which unfortunately has quite a steep learning curve. However, because of its popularity, there is plenty of good and easy-to-approach documentation available.
GitHub is a web-based service that puts a lot of emphasis on the social aspect of code and content sharing. The company behind it hosts Git repositories on the web and provides a web interface for interacting with them.
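
As a minimal sketch of what this looks like on the command line, the following commands create a repository, record two versions of a file, list the history and restore the older version. The directory, file name, file contents and the user name and e-mail are placeholders chosen for illustration, not workshop material.

```shell
# Hypothetical demo project in a throwaway directory.
repo=$(mktemp -d)
cd "$repo" || exit 1

git init -q .
# Identity for this demo repository only (placeholder values).
git config user.name "Demo User"
git config user.email "demo@example.org"

# First version of a file, recorded as a commit.
echo "species,count" > data.csv
git add data.csv
git commit -q -m "Add empty data table"

# Second version: append a record and commit again.
echo "Parus major,12" >> data.csv
git commit -q -am "Add first observation"

# Full history, newest commit first.
git log --oneline

# Restore the file as it was one commit ago.
git checkout -q HEAD~1 -- data.csv
cat data.csv
```

Because the first version was committed, it can be recovered even after the file has changed. Committing the restored file again would record the rollback as a new step in the history.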

Prerequisites

Please bring a laptop (Windows/Mac/Linux) with a working installation of git. More complete installation instructions will be distributed later. Registering on GitHub before the workshop is highly recommended!
In this hands-on workshop we will be using git on the command line. There are several graphical user interfaces available for git, but the command-line interface is far better for learning the conceptual underpinnings.

Recommended reading

Hampton, S.E., Anderson, S.S., Bagby, S.C., Gries, C., Han, X., Hart, E.M., Jones, M.B., Lenhardt, C., MacDonald, A., Michener, W.K., Mudge, J., Pourmokhtarian, A., Schildhauer, M.P., Woo, K.H., & Zimmerman, N. (2015): The Tao of open science for ecology. Ecosphere, 6, 1–13.
Ram, K. (2013): Git can facilitate greater reproducibility and increased transparency in science. Source Code for Biology and Medicine, 8(1), 1-14.

Why should I be working with media?

10.11., 13:30-15:30

Tilo Arnold - Science Journalist, iDiv

More and more funded projects ask for outreach activities and expect a list of media coverage in the final project report. On the other hand, the media are often described as one of the worst ways to explain science because of their fast turnover, deadlines and limited space. However, there are very good reasons for using newspapers, radio, television or online media to get your messages from science to society. The better you know the system, the better you can use it and avoid frustrating experiences. This workshop therefore covers some basics of how the media work, why working with them can help your scientific career, and what is important when you speak with journalists about your research.

Linux command line basics

10.11., 13:30-15:30

Christian Krause - HPC (High-Performance Computing) Cluster Administrator, iDiv

Target audience: Linux beginners and people who have never used Linux
Prerequisites/Materials: Bring your laptop!
Content: Learning the Linux Command Line - Why Bother?

Why do you need to learn the command line anyway? Graphical user interfaces (GUIs) are helpful for many tasks, but they are not good for all tasks. Most computers today feel like they are not powered by electricity, but by the motion of the mouse! Computers were supposed to free us from manual labor, but how many times have you performed some task you felt sure the computer should be able to do but you ended up doing the work yourself by tediously working the mouse? Pointing and clicking, pointing and clicking ...
Children “read” books by looking at pictures. When they grow up, they learn how to read and write. Welcome to Computer Literacy 101.
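
A small sketch of the kind of task meant here: renaming a batch of files in one step instead of clicking through them one by one. The file names and contents are made up for the example, and the commands assume a POSIX shell such as the one found on most Linux systems.

```shell
# Hypothetical working directory with a few example files.
work=$(mktemp -d)
cd "$work" || exit 1
for site in A B C; do
  printf 'species,count\nParus major,3\n' > "plot $site.csv"
done

# Rename every CSV file in one go: replace spaces with underscores.
for f in *.csv; do
  new=$(printf '%s' "$f" | tr ' ' '_')
  mv -- "$f" "$new"
done

# List the result.
ls
```

The loop does the same work for three files or three thousand, which is the point of learning the command line.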

Simplifying data manipulation with R

10.11., 15:45-17:45

Dr. Joona Lehtomäki - Post-doctoral researcher at the Conservation Biodiversity Informatics Group, University of Helsinki

Objectives

Introduce the concept of “tidy data” and demonstrate how data can easily be reshaped using the tidyr package.
Show how data manipulation in R can be made more efficient using the dplyr package.

After this short workshop, participants should be aware of these options and be able to start exploring the topics and packages in more detail.

Content

For a good while, R has been the go-to computational environment for anyone in need of statistical tools. While R does offer a dizzying variety of analytic tools, much less effort has gone into simplifying the data wrangling that most of us actually spend most of our time struggling with. Luckily, things have gotten a lot better lately, specifically due to two packages: tidyr and dplyr.
tidyr is a new package that makes it easy to “tidy” your data. Tidy data is data that is easy to work with: it is easy to munge (with dplyr), visualise (with ggplot2 or ggvis) and model (with R’s hundreds of modelling packages). The two most important properties of tidy data are:

Each column is a variable.
Each row is an observation.
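
To make the two properties concrete, here is a small sketch of the reshaping step itself. It is written as a shell/awk one-off purely for illustration and says nothing about how tidyr works internally; in the workshop this kind of reshaping is done in R with tidyr. The table, its column names and the counts are invented.

```shell
# Hypothetical scratch directory for the example files.
tidy_dir=$(mktemp -d)
cd "$tidy_dir" || exit 1

# "Wide" table: the year, which is a variable, is spread over column names.
cat > wide.csv <<'EOF'
site,2013,2014
A,5,7
B,2,4
EOF

# Reshape to tidy form: one column per variable (site, year, count),
# one row per observation.
awk -F, -v OFS=, '
  NR == 1 {                       # header: remember the year of each column
    for (i = 2; i <= NF; i++) year[i] = $i
    print "site", "year", "count"
    next
  }
  {                               # data row: one output row per year column
    for (i = 2; i <= NF; i++) print $1, year[i], $i
  }
' wide.csv > tidy.csv

cat tidy.csv
# site,year,count
# A,2013,5
# A,2014,7
# B,2013,2
# B,2014,4
```

In the wide table a single row holds two observations; in the tidy table every count sits in its own row next to the site and year it belongs to.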

The dplyr package offers a simple, clear and efficient way of working with your data. It makes the most common data manipulation steps as fast and easy as possible by:

Elucidating the most common data manipulation operations, so that your options are helpfully constrained when thinking about how to tackle a problem.
Providing simple functions that correspond to the most common data manipulation verbs, so that you can easily translate your thoughts into code.
Using efficient data storage backends, so that you spend as little time waiting for the computer as possible.

Prerequisites

A working knowledge of R is assumed. If you have written more than 10 lines of code, then you can probably follow what’s going on. We will be running some example code (to be distributed in advance) in the workshop, so bring your own laptop. Pairing up with someone with a laptop is also a good option. Make sure you have R (>=3.1.0), tidyr (>=0.2.0) and dplyr (>=0.4.1) installed. We will not have time to install these in the workshop! RStudio is highly recommended but not required.

Recommended reading

White, E. P., Baldridge, E., Brym, Z., Locey, K., McGlinn, D., & Supp, S. (2013): Nine simple ways to make it easier to (re)use your data. Ideas in Ecology and Evolution, 6(2), 1–10.
Wickham, H. (2014): Tidy data. Journal of Statistical Software, 59(10), 1–23.

Additional links

“Data Wrangling with R” presentation by Garrett Grolemund (RStudio)
“Data Wrangling with dplyr and tidyr” cheat sheet by RStudio

Integrating R and C++

10.11., 15:45-17:45

Dr. Felix May - Post-doctoral researcher, Biodiversity Synthesis Group, iDiv

Content and Objectives

Do you want to run large computations, but your task takes too long in R? Do you work primarily in R, but have code for certain algorithms only in C++? In these cases, integrating R and C++ might help to solve your problems!
In this workshop you will learn how to combine the flexibility and versatility of R with the speed and computational power of C++. For this purpose we will use the package Rcpp. The course includes a few lectures and lots of exercises. You will learn (very) basic C++, how to call C++ functions from R and how to exchange data between R and C++. We will conclude the course with an outlook on further resources and readings on Rcpp.

Prerequisites

You should have basic knowledge of programming in R (writing functions, flow control with for-loops and if-statements). If you need a refresher, check out sections 9 and 10 of the R introduction (https://cran.r-project.org/doc/manuals/R-intro.html). Knowledge of C++ or another low-level programming language will be helpful, but is not expected.

Preparation and Material

Bring your laptop and make sure you have the newest version of R installed, R 3.2.2 (https://cran.r-project.org/), and that you have downloaded and installed the package Rcpp.
RStudio is highly recommended, but not necessary (https://www.rstudio.com/).
In addition, you will need a working C++ compiler.

To get one:

On Windows, install Rtools33 (https://cran.r-project.org/bin/windows/Rtools/) and adapt your PATH variable accordingly.
On Mac, install Xcode from the App Store.
On Linux, run sudo apt-get install r-base-dev or similar.
If you have more questions about the compiler installation, consult section 1.3 of the Rcpp FAQ (https://cran.r-project.org/web/packages/Rcpp/vignettes/Rcpp-FAQ.pdf).

These preparations are really important; without them you cannot work properly in the course. I can only provide advice on installation and setup on Windows, so make sure your setup is correct, especially if you use Mac or Linux.
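
For Mac and Linux users, a rough way to check the setup from a terminal is sketched below. It only looks for a compiler under a few common command names (an assumption; your system may use a different one) and then, if R is installed, asks Rcpp to compile and evaluate a one-line C++ expression. Finding a compiler command does not by itself prove that compilation works, so the second step is the more telling one.

```shell
# Look for a C++ compiler under a few common command names (assumed list).
cxx_found=""
for cxx in g++ clang++ c++; do
  if command -v "$cxx" >/dev/null 2>&1; then
    cxx_found=$cxx
    break
  fi
done

if [ -n "$cxx_found" ]; then
  echo "C++ compiler found: $cxx_found"
else
  echo "No C++ compiler found on PATH"
fi

# If R is available, let Rcpp compile and run a tiny C++ expression.
# This fails if Rcpp is not installed or the compiler is not usable from R.
if command -v Rscript >/dev/null 2>&1; then
  Rscript -e 'Rcpp::evalCpp("2 + 2")' || echo "Rcpp test failed"
else
  echo "Rscript not found on PATH"
fi
```

If the last step reports a failure, section 1.3 of the Rcpp FAQ is the place to start.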
