Reference

Key Points

Lesson Schedule
What is Version Control
  • Version control is like an unlimited ‘undo’.

  • Version control also allows many people to work in parallel.

Setting Up Git
  • Use git config with the --global option to configure a user name, email address, editor, and other preferences once per machine.

  • GitHub needs an SSH key to allow access.
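
A minimal sketch of the once-per-machine configuration described above. The name, email address, and editor are placeholders, and sandbox_git is a hypothetical wrapper that points Git at a throwaway home directory so the example leaves real settings alone. On your own machine you would run plain git config --global instead.

```shell
# Throwaway home directory so the sketch does not touch real settings.
sandbox="$(mktemp -d)"

# Hypothetical wrapper: run git with HOME pointing at the sandbox.
sandbox_git() {
    HOME="$sandbox" XDG_CONFIG_HOME="$sandbox/.config" git "$@"
}

# Placeholder identity and editor; substitute your own values.
sandbox_git config --global user.name "Vlad Dracula"
sandbox_git config --global user.email "vlad@example.org"
sandbox_git config --global core.editor "nano -w"

# List everything stored at the global (per-user) level.
sandbox_git config --global --list

configured_email="$(sandbox_git config --global user.email)"
```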

Creating a Repository
  • git clone creates a local copy of a repository from a URL.

  • Git stores all of its repository data in the .git directory.

Tracking Changes
  • git status shows the status of a repository.

  • Files can be stored in a project’s working directory (which users see), the staging area (where the next commit is being built up) and the local repository (where commits are permanently recorded).

  • git add puts files in the staging area.

  • git commit saves the staged content as a new commit in the local repository.

  • Write commit messages that accurately describe your changes.

  • git log --decorate lists the commits made to the local repository, along with whether or not they are up-to-date with any remote repository.
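
The status, add, commit, and log cycle above can be sketched as follows. The file name, commit message, and identity are illustrative assumptions, and everything happens in a temporary directory.

```shell
# Work in a throwaway directory; all names below are placeholders.
repo="$(mktemp -d)"
cd "$repo"
git init --quiet

echo "Cold and dry, but everything is my favorite color" > mars.txt
git status --short     # '?? mars.txt': present in the working directory only

git add mars.txt       # put the file in the staging area
git status --short     # 'A  mars.txt': staged for the next commit

# -c supplies an identity for this one command in case none is configured.
git -c user.name="Example User" -c user.email="user@example.org" \
    commit --quiet -m "Start notes on Mars as a base"

# One line per commit, decorated with branch (and remote) labels.
git log --decorate --oneline

commit_count="$(git rev-list --count HEAD)"
```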

Exploring History
  • git diff displays differences between commits.

  • git checkout recovers old versions of files.
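
As a hedged illustration of the two commands above, the sketch below makes two commits to a placeholder file, compares them, and then restores the older version. The commit helper function and the file contents are assumptions made for the example.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet

# Hypothetical helper so commits work without a configured identity.
commit() {
    git -c user.name="Example User" -c user.email="user@example.org" \
        commit --quiet -m "$1"
}

echo "first draft" > notes.txt
git add notes.txt
commit "Add notes"

echo "second draft" > notes.txt
git add notes.txt
commit "Revise notes"

# Show what changed between the previous commit and the latest one.
git diff HEAD~1 HEAD -- notes.txt

# Recover the version of the file recorded one commit ago.
git checkout HEAD~1 -- notes.txt
restored="$(cat notes.txt)"
```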

Remote Repositories
  • Git can easily synchronise your local repository with a remote one.

  • GitHub needs an SSH key to allow access.

  • Git can help resolve ‘conflicting’ modifications to text files.

Branching
  • Branches are parallel versions of a repository.

  • You can easily switch between branches, and merge their changes.

  • Branches help with code sharing and collaboration.
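
A small sketch of creating a branch, switching between branches, and merging, using a temporary repository. The branch name feature, the file names, and the commit helper are illustrative assumptions.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet

# Hypothetical helper so commits work without a configured identity.
commit() {
    git -c user.name="Example User" -c user.email="user@example.org" \
        commit --quiet -m "$1"
}

echo "base" > readme.txt
git add readme.txt
commit "Initial commit"

# The default branch may be called 'main' or 'master'; look it up.
default_branch="$(git rev-parse --abbrev-ref HEAD)"

# Create a parallel version of the repository and switch to it.
git checkout --quiet -b feature
echo "new idea" > idea.txt
git add idea.txt
commit "Add idea on feature branch"

# Switch back; idea.txt is absent here until the branches are merged.
git checkout --quiet "$default_branch"
git merge --quiet feature

git branch    # lists both branches, marking the current one with *
```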

Ignoring Things
  • The .gitignore file tells Git what files to ignore.
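
For instance, assuming a project whose generated .dat files should not be tracked (the file names are placeholders), a one-line .gitignore could be sketched like this:

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet

touch analysis.R results.dat

# Tell Git to ignore every file ending in .dat.
printf '%s\n' '*.dat' > .gitignore

# results.dat no longer appears among the untracked files.
git status --short

# check-ignore prints the path when an ignore rule matches it.
ignored="$(git check-ignore results.dat)"
```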

Survey
Introduction
  • OpenRefine is a powerful and free, open source tool that can be used for data cleaning.

  • OpenRefine will automatically track any steps you take in working with your data, and will leave your original data intact.

Opening and Exploring Data
  • Faceting can identify errors or outliers in data.

Transforming Data
  • Clustering can identify outliers in data and help us fix errors in bulk.

  • GREL (General Refine Expression Language) is a powerful tool for transforming data.

Filtering and Sorting Data
  • OpenRefine provides various ways to sort and filter data without affecting the raw data.

Exporting Data Cleaning Steps
  • All changes are being tracked in OpenRefine (apart from individual cell changes and sorting!), and this information can be used for scripts for future analyses or reproducing an analysis.

  • Scripts can (and should) be published together with the dataset as part of the digital appendix of the research output.

Exporting and Saving Data
  • Cleaned data or entire projects can be exported from OpenRefine.

  • Projects can be shared with collaborators, enabling them to see, reproduce and check all data cleaning steps you performed.

Further Resources on OpenRefine
  • Other examples and resources online are good for learning more about OpenRefine.

Day 1: Starting with Data
  • Although R has a steeper learning curve than some other data analysis software, it has many advantages: R is interdisciplinary, extensible, great for data wrangling and reproducibility, and produces high quality graphics.

  • Values can be assigned to objects, which have a number of attributes. Objects can then be used in arithmetic operations (and more).

  • Functions automate sets of commands. Many are predefined, but it’s also possible to write your own. Functions usually take one or more inputs (called arguments) and often return a value.

  • A vector is the most common and basic data structure in R. A vector is composed of a series of values, which can be either numbers or characters.

  • Vectors can be subset by providing one or several indices in square brackets or by using a logical vector (often the output of a logical test).

  • Missing data are represented in vectors as NA. You can add the argument na.rm = TRUE to calculate the result while ignoring the missing values.

  • CSV files can be read in using read.csv().

  • Data frames are a data structure for most tabular data, and what we use for statistics and plotting.

  • It is possible to subset dataframes by specifying the coordinates in square brackets. Row numbers come first, followed by column numbers.

  • Dates can be formatted using the package ‘lubridate’.

Day 2: Manipulating Data
  • Factors represent categorical data. They are stored as integers associated with labels and they can be ordered or unordered. Factors can only contain a pre-defined set of values, known as levels.

  • dplyr is a package for making tabular data manipulation easier, and tidyr reshapes data so that it is in a convenient format for plotting or analysis. Both are part of the tidyverse collection of packages.

  • A subset of columns from a dataframe can be selected using select().

  • To choose rows based on a specific criterion, use filter().

  • Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset.

  • To create new columns based on the values in existing columns, use mutate().

  • Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. This can be achieved using the group_by() and summarize() functions.

  • To reshape data between wide and long formats, use pivot_wider() and pivot_longer() from the tidyr package.

  • Export data from a dataframe to a csv file using write_csv().

Day 3: Visualising Data
  • ggplot2 is a plotting package that makes it simple to create complex plots from data in a data frame.

  • Define an aesthetic mapping (using the aes function), by selecting the variables to be plotted and specifying how to present them in the graph.

  • Add ‘geoms’ – graphical representations of the data in the plot using geom_point() for a scatter plot, geom_boxplot() for a boxplot, and geom_line() for a line plot.

  • Faceting splits one plot into multiple plots based on a factor from the dataset.

  • Every single component of a ggplot graph can be customized using the generic theme() function. However, there are pre-loaded themes available that change the overall appearance of the graph without much effort.

  • The gridExtra package allows us to combine separate ggplots into a single figure using grid.arrange().

  • Use ggsave() to save a plot and edit the arguments (height, width, dpi) to change the dimension and resolution.

Introducing the Shell
  • The shell lets you define repeatable workflows.

  • The shell is available on systems where graphical interfaces are not.

Files and Directories
  • The file system is responsible for managing information on the disk.

  • Information is stored in files, which are stored in directories (folders).

  • Directories can also store other directories, which then form a directory tree.

  • cd [path] changes the current working directory.

  • ls [path] prints a listing of a specific file or directory; ls on its own lists the current working directory.

  • pwd prints the user’s current working directory.

  • / on its own is the root directory of the whole file system.

  • Most commands take options (flags) that begin with a -.

  • A relative path specifies a location starting from the current location.

  • An absolute path specifies a location from the root of the file system.

  • Directory names in a path are separated with / on Unix, but \ on Windows.

  • .. means ‘the directory above the current one’; . on its own means ‘the current directory’.
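
The navigation commands above can be tried out with a short sketch. The directory names data and raw are placeholders created in a temporary location.

```shell
# Build a tiny directory tree to explore; names are placeholders.
base="$(mktemp -d)"
mkdir -p "$base/data/raw"

cd "$base/data"    # absolute path: starts from the root directory /
pwd                # print the current working directory
ls -F              # -F is an option (flag); it marks directories with /

cd raw             # relative path: starts from the current location
cd ..              # .. is the directory above the current one
cd ./raw           # . is the current directory, so this equals 'cd raw'

here="$(basename "$(pwd)")"
```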

Creating Things
  • Command line text editors let you edit files in the terminal.

  • You can open up files with either command-line or graphical text editors.

  • nano [path] creates a new text file at the location [path], or edits an existing one.

  • cat [path] prints the contents of a file.

  • rmdir [path] deletes an (empty) directory.

  • rm [path] deletes a file, rm -r [path] deletes a directory (and contents!).

  • mv [old_path] [new_path] moves a file or directory from [old_path] to [new_path].

  • mv can be used to rename files, e.g. mv a.txt b.txt.

  • Using . as the destination file name in mv moves a file without renaming it, e.g. mv a/file.txt b/. (the final . is part of the command).

  • cp [original_path] [copy_path] creates a copy of a file at a new location.
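
A brief sketch of the file-handling commands above, run in a temporary directory. The file and directory names are placeholders, and echo with redirection stands in for saving a file from nano, which is interactive.

```shell
work="$(mktemp -d)"
cd "$work"
mkdir a b c

echo "hello" > a/file.txt      # stands in for creating the file in nano
touch c/scratch.txt

cat a/file.txt                 # print the file's contents

mv a/file.txt b/.              # move into b/ without renaming
mv b/file.txt b/notes.txt      # same directory, new name: a rename
cp b/notes.txt b/backup.txt    # copy to a new location
rm b/backup.txt                # delete a single file

rmdir a                        # works only because a/ is now empty
rm -r c                        # deletes c/ and everything inside it

remaining="$(ls b)"
```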

Finding Things
  • find finds files with specific properties that match patterns.

  • grep selects lines in files that match patterns.

  • --help is an option supported by many Bash commands, and by programs that can be run from within Bash, to display more information on how to use them.

  • man [command] displays the manual page for a given command.

  • $([command]) inserts a command’s output in place.
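
To illustrate how find, grep, and $(...) fit together, here is a hedged sketch using a few placeholder files. It assumes file names without spaces, since the unquoted $(...) would otherwise split them.

```shell
work="$(mktemp -d)"
cd "$work"
printf 'alpha\nbeta\nalphabet\n' > one.txt
printf 'gamma\n' > two.txt
touch notes.md

# find matches on file properties, here the name pattern.
find . -name '*.txt'

# grep matches on file contents: prints 'alpha' and 'alphabet'.
grep alpha one.txt

# $(...) runs find first and inserts its output as grep's arguments.
# -l lists only the names of files that contain a match.
matching_files="$(grep -l alpha $(find . -name '*.txt'))"
echo "$matching_files"
```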

Pipes and Filters
  • wc counts lines, words, and characters in its inputs.

  • cat displays the contents of its inputs.

  • sort sorts its inputs.

  • head displays the first 10 lines of its input by default.

  • tail displays the last 10 lines of its input by default.

  • command > [file] redirects a command’s output to a file (overwriting any existing content).

  • command >> [file] appends a command’s output to a file.

  • [first] | [second] is a pipeline: the output of the first command is used as the input to the second.

  • The best way to use the shell is to use pipes to combine simple single-purpose programs (filters).
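
The filters and redirections above might be combined as in the following sketch. The file names are placeholders, and tr is used only to strip the padding that some versions of wc add.

```shell
work="$(mktemp -d)"
cd "$work"
printf '3\n1\n2\n' > numbers.txt

wc -l numbers.txt                  # count lines in the file

sort -n numbers.txt > sorted.txt   # > overwrites sorted.txt
echo 4 >> sorted.txt               # >> appends to sorted.txt

head -n 1 sorted.txt               # first line only (default is 10)
tail -n 1 sorted.txt               # last line only

# Pipeline: sort's output becomes head's input, with no temporary file.
smallest="$(sort -n numbers.txt | head -n 1)"
line_count="$(wc -l < sorted.txt | tr -d ' ')"
```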

Shell Scripts
  • Save commands in files (usually called shell scripts) for re-use.

  • bash [filename] runs the commands saved in a file.

  • $@ refers to all of a shell script’s command-line arguments.

  • $1, $2, etc., refer to the first command-line argument, the second command-line argument, etc.

  • Place variables in quotes if the values might have spaces in them.

  • Letting users decide what files to process is more flexible and more consistent with built-in Unix commands.
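
As a rough sketch of these points, the example below writes a two-line script and runs it on files chosen by the caller. The script name count_lines.sh and the data files are hypothetical. One file name contains a space to show why "$@" is quoted.

```shell
work="$(mktemp -d)"
cd "$work"
printf 'a\nb\n' > "my data.txt"
printf 'c\n' > other.txt

# Save commands in a file. The quoted 'EOF' keeps $1 and $@ literal
# so they are expanded when the script runs, not while writing it.
cat > count_lines.sh <<'EOF'
# Usage: bash count_lines.sh file [file ...]
echo "First argument: $1"
wc -l "$@"
EOF

# The caller decides which files to process.
output="$(bash count_lines.sh "my data.txt" other.txt)"
echo "$output"

first_line="$(echo "$output" | head -n 1)"
```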

Loops
  • A for loop repeats commands once for every thing in a list.

  • Every for loop needs a variable to refer to the thing it is currently operating on.

  • Use $name to expand a variable (i.e., get its value). ${name} can also be used.

  • Do not use spaces, quotes, or wildcard characters such as ‘*’ or ‘?’ in filenames, as they complicate variable expansion.

  • Give files consistent names that are easy to match with wildcard patterns to make it easy to select them for looping.

  • Use the up-arrow key to scroll up through previous commands to edit and repeat them.

  • Use Ctrl+R to search through the previously entered commands.

  • Use history to display recent commands, and ![number] to repeat a command by number.
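
A minimal for loop along these lines, assuming consistently named placeholder files that a single wildcard pattern can match:

```shell
work="$(mktemp -d)"
cd "$work"
touch sample-01.txt sample-02.txt sample-03.txt

# The loop variable 'filename' holds one matching name per iteration.
for filename in sample-*.txt
do
    echo "Processing ${filename}"
    # Braces mark where the variable name ends inside a longer word.
    cp "$filename" "backup-${filename}"
done

backup_count="$(ls backup-*.txt | wc -l | tr -d ' ')"
```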

Additional Exercises
  • date prints the current date in a specified format.

  • Scripts can save the output of a command to a variable using $(command).

  • basename removes directories from a path to a file, leaving only the name.

  • cut lets you select specific columns from files, with -d setting the column separator (e.g. -d',' for commas) and -f selecting the columns you want.
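
These commands could be combined as in the sketch below. The path and the comma-separated sample data are made up for illustration.

```shell
# Save a command's output in a variable with $(command).
today="$(date +%Y-%m-%d)"      # +FORMAT chooses the output format
echo "Run on ${today}"

# Hypothetical path; basename keeps only the final component.
name="$(basename /home/user/data/results.csv)"
echo "$name"

# Select the second comma-separated column from some sample rows.
second_col="$(printf 'id,species,weight\n1,mouse,20\n' | cut -d',' -f2)"
echo "$second_col"
```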

Introduction
  • Good data organisation is the foundation of any research project.

Organising data in spreadsheets
  • Never modify your raw data. Always make a copy before making any changes.

  • Keep track of all of the steps you take to clean your data in a plain text file.

  • Organise your data according to tidy data principles.

  • Record metadata in a separate plain text file (such as README.txt) in your project root folder or folder with data.

Common spreadsheet errors
  • Avoid using multiple tables within one spreadsheet.

  • Avoid spreading data across multiple tabs.

  • Record zeros as zeros.

  • Use an appropriate null value to record missing data.

  • Do not use formatting to convey information or to make your spreadsheet look pretty.

  • Place comments in a separate column.

  • Record units in column headers.

  • Include only one piece of information in a cell.

  • Avoid spaces, numbers and special characters in column headers.

  • Avoid special characters in your data.

Dates as data
  • Use extreme caution when working with date data.

  • Splitting dates into their component values can make them easier to handle.

Quality assurance and control
  • Always copy your original spreadsheet file and work with a copy so you do not affect the raw data.

  • Use data validation to prevent accidentally entering invalid data.

  • Use sorting to check for invalid data.

Exporting data
  • Data stored in common spreadsheet formats will often not be read correctly into data analysis software, introducing errors into your data.

  • Exporting data from spreadsheets to formats like CSV or TSV puts it in a format that can be used consistently by most programs.

Reference

Glossary

cleaned data
data that has been manipulated post-collection to remove errors or inaccuracies, introduce desired formatting changes, or otherwise prepare the data for analysis
conditional formatting
formatting that is applied to a specific cell or range of cells depending on a set of criteria
CSV (comma separated values) format
a plain text file format in which values are separated by commas
factor
a variable that takes on a limited number of possible values (i.e. categorical data)
metadata
data which describes other data
null value
a value used to record observations missing from a dataset
observation
a single measurement or record of the object being recorded (e.g. the weight of a particular mouse)
plain text
unformatted text
quality assurance
any process which checks data for validity during entry
quality control
any process which removes problematic data from a dataset
raw data
data that has not been manipulated and represents actual recorded values
rich text
formatted text (e.g. text that appears bolded, colored or italicized)
string
a collection of characters (e.g. “thisisastring”)
TSV (tab separated values) format
a plain text file format in which values are separated by tabs
variable
a category of data being collected on the object being recorded (e.g. a mouse’s weight)