Reference

Key Points

What is Version Control
  • Version control is like an unlimited ‘undo’.

  • Version control also allows many people to work in parallel.

Setting Up Git
  • Use git config with the --global option to configure a user name, email address, editor, and other preferences once per machine.

  • GitHub needs an SSH key to allow access.
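
A minimal sketch of the one-time configuration described above. The name, email address, and editor are placeholders, not values from the lesson, and the throwaway HOME is only there so that trying the example does not change a real configuration.

```shell
# Use a throwaway HOME so this sketch does not touch your real ~/.gitconfig.
# Remove these two lines to configure your actual account.
export HOME="$(mktemp -d)"
unset XDG_CONFIG_HOME GIT_CONFIG_GLOBAL

# Placeholder identity: replace with your own name and email address.
git config --global user.name "Ada Example"
git config --global user.email "ada@example.org"

# The editor is an assumption; any editor command can go here.
git config --global core.editor "nano -w"

# Read the settings back to confirm they were stored.
git config --global --list
configured_name="$(git config --global user.name)"
```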

Creating a Repository
  • git clone creates a local copy of a repository from a URL.

  • Git stores all of its repository data in the .git directory.
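
As an illustration, the sketch below clones a repository and then lists the hidden .git directory. A bare repository on the local disk stands in for the remote URL, and the directory names origin.git and my-copy are made up for the example. With a real project the clone source would be the URL of the hosted repository.

```shell
work="$(mktemp -d)"
cd "$work"

# Stand-in for a hosted repository: a bare repository on the local disk.
git init --quiet --bare origin.git

# Clone it into a new directory (cloning an empty repository prints a warning).
git clone --quiet origin.git my-copy

# All repository data lives in the hidden .git directory.
ls -a my-copy
```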

Tracking Changes
  • git status shows the status of a repository.

  • Files can be stored in a project’s working directory (which users see), the staging area (where the next commit is being built up) and the local repository (where commits are permanently recorded).

  • git add puts files in the staging area.

  • git commit saves the staged content as a new commit in the local repository.

  • Write commit messages that accurately describe your changes.

  • git log --decorate lists the commits made to the local repository, along with whether or not they are up-to-date with any remote repository.
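
The path from working directory to staging area to local repository can be sketched as follows. The file name, commit message, and identity are placeholders chosen for the example.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet

# Repository-local identity so the commit works on a fresh machine (placeholders).
git config user.name "Ada Example"
git config user.email "ada@example.org"

echo "first draft" > notes.txt
git status --short      # notes.txt is listed as untracked

git add notes.txt       # working directory -> staging area
git status --short      # notes.txt is listed as staged

# Staging area -> local repository, with a message describing the change.
git commit --quiet -m "Add first draft of notes"

git --no-pager log --oneline --decorate
commit_count="$(git rev-list --count HEAD)"
```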

Exploring History
  • git diff displays differences between commits.

  • git checkout recovers old versions of files.
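
A small sketch of both commands on a file with two committed versions. The file contents and commit messages are invented for the example, and HEAD~1 refers to the commit before the current one.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Ada Example"        # placeholder identity
git config user.email "ada@example.org"

echo "version one" > notes.txt
git add notes.txt
git commit --quiet -m "Add notes"

echo "version two" > notes.txt
git commit --quiet -am "Revise notes"

# Show what changed between the previous commit and the current one.
git --no-pager diff HEAD~1 HEAD -- notes.txt

# Recover the file as it was one commit ago.
git checkout HEAD~1 -- notes.txt
restored="$(cat notes.txt)"
```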

Remote Repositories
  • Git can easily synchronise your local repository with a remote one.

  • GitHub needs an SSH key to allow access.

  • Git can resolve ‘conflicting’ modifications to text files.
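
Synchronising can be tried without a GitHub account by letting a bare repository on the local disk play the role of the remote. In this sketch the two clones, alice and bob, are hypothetical collaborators; with a hosted remote, the clone source would be its URL and access would go through the SSH key.

```shell
work="$(mktemp -d)"
cd "$work"
git init --quiet --bare remote.git        # stand-in for a hosted remote

# First collaborator clones, commits, and pushes.
git clone --quiet remote.git alice
cd alice
git config user.name "Alice Example"      # placeholder identity
git config user.email "alice@example.org"
echo "shared notes" > notes.txt
git add notes.txt
git commit --quiet -m "Add shared notes"
git push --quiet origin HEAD

# Second collaborator clones and receives that commit.
cd "$work"
git clone --quiet remote.git bob

# A later change is pushed from one copy and pulled into the other.
cd "$work/alice"
echo "second line" >> notes.txt
git commit --quiet -am "Extend shared notes"
git push --quiet origin HEAD
git -C "$work/bob" pull --quiet

line_count="$(wc -l < "$work/bob/notes.txt" | tr -d ' ')"
```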

Branching
  • Branches are parallel versions of a repository.

  • You can easily switch between branches, and merge their changes.

  • Branches help with code sharing and collaboration.
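
A minimal sketch of creating a branch, committing on it, switching back, and merging. The branch name experiment and the file names are made up for the example. The name of the default branch depends on the Git configuration, so it is read from the repository rather than assumed.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Ada Example"        # placeholder identity
git config user.email "ada@example.org"

echo "base" > notes.txt
git add notes.txt
git commit --quiet -m "Add notes"

# Remember the default branch name (for example main or master).
main_branch="$(git rev-parse --abbrev-ref HEAD)"

# Create a parallel version of the repository and switch to it.
git checkout --quiet -b experiment
echo "idea" > idea.txt
git add idea.txt
git commit --quiet -m "Try an idea"

# Switch back and merge the branch's changes.
git checkout --quiet "$main_branch"
git merge --quiet experiment
```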

Ignoring Things
  • The .gitignore file tells Git what files to ignore.
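
For example, a pattern such as *.log keeps log files out of the status listing. The file names below are invented for the sketch.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet

echo "keep me" > analysis.txt
echo "scratch output" > debug.log

# Ignore every file whose name ends in .log.
printf '%s\n' '*.log' > .gitignore

# debug.log no longer appears among the untracked files.
git status --short

# Ask Git directly whether a path is ignored.
ignored="$(git check-ignore debug.log)"
```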

Day 1: Starting with Data
  • Although R has a steeper learning curve than some other data analysis software, it has many advantages: it is interdisciplinary, extensible, well suited to data wrangling and reproducibility, and produces high-quality graphics.

  • Values can be assigned to objects, which have a number of attributes. Objects can then be used in arithmetic operations (and more).

  • Functions automate sets of commands. Many are predefined, but it’s also possible to write your own. Functions usually take one or more inputs (called arguments) and often return a value.

  • A vector is the most common and basic data structure in R. A vector is composed of a series of values, which can be either numbers or characters.

  • Vectors can be subset by providing one or several indices in square brackets or by using a logical vector (often the output of a logical test).

  • Missing data are represented in vectors as NA. You can add the argument na.rm = TRUE to calculate the result while ignoring the missing values.

  • CSV files can be read in using read.csv().

  • Data frames are a data structure for most tabular data, and what we use for statistics and plotting.

  • It is possible to subset data frames by specifying the coordinates in square brackets. Row numbers come first, followed by column numbers.

  • Factors represent categorical data. They are stored as integers associated with labels and they can be ordered or unordered. Factors can only contain a pre-defined set of values, known as levels.

Day 2: Manipulating Data
  • dplyr is a package that makes tabular data manipulation easier, and tidyr reshapes data into a format that is convenient for plotting or analysis. Both are part of the tidyverse collection of packages.

  • A subset of columns from a data frame can be selected using select().

  • To choose rows based on a specific criterion, use filter().

  • Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset.

  • To create new columns based on the values in existing columns, use mutate().

  • Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. This can be achieved using the group_by() and summarize() functions.

  • Dates can be formatted using the package ‘lubridate’.

  • To reshape data between wide and long formats, use pivot_wider() and pivot_longer() from the tidyr package.

  • Export data from a data frame to a CSV file using write_csv().

Day 3: Visualising Data
  • ggplot2 is a plotting package that makes it simple to create complex plots from data in a data frame.

  • Define an aesthetic mapping (using the aes function), by selecting the variables to be plotted and specifying how to present them in the graph.

  • Add ‘geoms’ (graphical representations of the data in the plot): geom_point() for a scatter plot, geom_boxplot() for a boxplot, and geom_line() for a line plot.

  • Faceting splits one plot into multiple plots based on a factor from the dataset.

  • Every single component of a ggplot graph can be customized using the generic theme() function. However, there are pre-loaded themes available that change the overall appearance of the graph without much effort.

  • The gridExtra package allows us to combine separate ggplots into a single figure using grid.arrange().

  • Use ggsave() to save a plot, and edit its arguments (height, width, dpi) to change the dimensions and resolution.

