Reference – Quantitative Cell Biology Computational Research Skills

Key Points

Lesson Schedule
What is Version Control	Version control is like an unlimited ‘undo’. Version control also allows many people to work in parallel.
Setting Up Git	Use `git config` with the `--global` option to configure a user name, email address, editor, and other preferences once per machine. GitHub needs an SSH key to allow access.
Creating a Repository	`git clone` creates a local copy of a repository from a URL. Git stores all of its repository data in the `.git` directory.
Tracking Changes	`git status` shows the status of a repository. Files can be stored in a project’s working directory (which users see), the staging area (where the next commit is being built up) and the local repository (where commits are permanently recorded). `git add` puts files in the staging area. `git commit` saves the staged content as a new commit in the local repository. Write commit messages that accurately describe your changes. `git log --decorate` lists the commits made to the local repository, along with whether or not they are up-to-date with any remote repository.
Exploring History	`git diff` displays differences between commits. `git checkout` recovers old versions of files.
Remote Repositories	Git can easily synchronise your local repository with a remote one. GitHub needs an SSH key to allow access. Git can resolve ‘conflicting’ modifications to text files.
Branching	Branches are parallel versions of a repository. You can easily switch between branches and merge their changes. Branches help with code sharing and collaboration.
Ignoring Things	The `.gitignore` file tells Git what files to ignore.
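
The Git key points above can be tried end to end with a short script. This is a minimal sketch, not the lesson's own exercise: it assumes a local bare repository can stand in for GitHub (so no SSH key or network is needed), and every file name, commit message, and the `analysis` branch name is hypothetical. It sets the identity per repository rather than with `--global` so that it leaves your own settings untouched.

```shell
#!/bin/sh
set -eu

# Throwaway workspace; all paths and file names below are hypothetical.
workdir="$(mktemp -d)"
cd "$workdir"

# Assumption: a local bare repository stands in for a remote such as GitHub.
git init --quiet --bare remote.git

# git clone creates a local copy of a repository from a URL (here, a path).
git clone --quiet remote.git project 2>/dev/null
cd project

# Identity set for this repository only; the lesson uses --global once per machine.
git config user.name "Demo User"
git config user.email "demo@example.org"
git config commit.gpgsign false
branch="$(git symbolic-ref --short HEAD)"   # default branch name varies by setup

# Tell Git which files to ignore, then stage and commit the rest.
printf '*.log\n' > .gitignore
printf 'cell counts\n' > notes.txt
printf 'scratch\n' > debug.log
git add .gitignore notes.txt
git commit --quiet -m "Add notes and ignore log files"

# Modify a tracked file, inspect the change, then commit it.
printf 'cell counts, day 2\n' >> notes.txt
git status --short
git --no-pager diff
git add notes.txt
git commit --quiet -m "Record day 2 counts"

# Work on a parallel branch, then merge it back.
git checkout --quiet -b analysis
printf 'placeholder result\n' > results.txt
git add results.txt
git commit --quiet -m "Add first results"
git checkout --quiet "$branch"
git merge --quiet analysis

# Synchronise with the remote and list commits with their references.
git push --quiet origin "$branch"
git --no-pager log --oneline --decorate

# Recover the version of a file recorded in the first commit.
first_commit="$(git rev-list --max-parents=0 HEAD)"
git checkout "$first_commit" -- notes.txt
```

After the script runs, `notes.txt` holds its original single line again, `debug.log` is still untracked because of `.gitignore`, and the local branch and its `origin` counterpart point at the same commit.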
Lesson Schedule
Day 1: Starting with Data	Although R has a steeper learning curve than some other data analysis software, it has many advantages: R is interdisciplinary, extensible, great for data wrangling and reproducibility, and produces high-quality graphics. Values can be assigned to objects, which have a number of attributes. Objects can then be used in arithmetic operations (and more). Functions automate sets of commands; many are predefined, but it’s also possible to write your own. Functions usually take one or more inputs (called arguments) and often return a value. A vector is the most common and basic data structure in R. A vector is composed of a series of values, which can be either numbers or characters. Vectors can be subset by providing one or several indices in square brackets or by using a logical vector (often the output of a logical test). Missing data are represented in vectors as `NA`. You can add the argument `na.rm = TRUE` to calculate the result while ignoring the missing values. CSV files can be read in using `read.csv()`. Data frames are the data structure for most tabular data, and what we use for statistics and plotting. It is possible to subset data frames by specifying the coordinates in square brackets. Row numbers come first, followed by column numbers. Factors represent categorical data. They are stored as integers associated with labels and they can be ordered or unordered. Factors can only contain a pre-defined set of values, known as levels.
Day 2: Manipulating Data	`dplyr` is a package for making tabular data manipulation easier, and `tidyr` reshapes data so that it is in a convenient format for plotting or analysis. Both are part of the tidyverse collection of packages. A subset of columns from a data frame can be selected using `select()`. To choose rows based on a specific criterion, use `filter()`. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset. To create new columns based on the values in existing columns, use `mutate()`. Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. This can be achieved using the `group_by()` and `summarize()` functions. Dates can be formatted using the `lubridate` package. To reshape data between wide and long formats, use `pivot_wider()` and `pivot_longer()` from the `tidyr` package. Export data from a data frame to a CSV file using `write_csv()`.
Day 3: Visualising Data	`ggplot2` is a plotting package that makes it simple to create complex plots from data in a data frame. Define an aesthetic mapping (using the `aes()` function) by selecting the variables to be plotted and specifying how to present them in the graph. Add ‘geoms’, the graphical representations of the data in the plot, using `geom_point()` for a scatter plot, `geom_boxplot()` for a boxplot, and `geom_line()` for a line plot. Faceting splits one plot into multiple plots based on a factor from the dataset. Every single component of a ggplot graph can be customized using the generic `theme()` function. However, there are pre-loaded themes available that change the overall appearance of the graph without much effort. The `gridExtra` package allows us to combine separate ggplots into a single figure using `grid.arrange()`. Use `ggsave()` to save a plot and edit the arguments (`height`, `width`, `dpi`) to change the dimensions and resolution.

Glossary

The glossary would go here, formatted as:

{:auto_ids}
key word 1
:   explanation 1

key word 2
:   explanation 2

({:auto_ids} is needed at the start so that Jekyll will automatically generate a unique ID for each item to allow other pages to hyperlink to specific glossary entries.) This renders as:

key word 1: explanation 1
key word 2: explanation 2