This lesson is still being designed and assembled (Pre-Alpha version)

From Code to Concepts: Introduction to Data Science: Compendium of functions

Key Points

Introduction to R and RStudio
  • Create an ‘RStudio Project’ whenever you are initiating a new analysis.

  • When you come back to work on the project, open the .Rproj file to resume your work.

  • Ensure your project directory is well structured, for example with directories for scripts and data.

  • To document your analysis, write and save your code in scripts.

  • You should try to comment your code. Use # to write comments in your scripts.

  • Use install.packages() to install (or update) packages.

Basic objects and data types in R
  • Assign values to objects using <-

  • Functions perform operations on objects: they take inputs (arguments) and return outputs (values).

  • The basic data structure in R is called a vector, which you construct with the c() function.

  • The main types of vector values are: numeric (or double), integer, character and logical.

  • To subset vectors use []

  • When doing vector operations R will ‘recycle’ shorter vectors if it needs to.

  • Missing data is represented by the special value NA, and many functions provide options for handling it (e.g. na.rm = TRUE in mean()).

  • Vectors can only contain one type of value. If there are mixed types of values in a vector, R will coerce those values into a single type according to the following hierarchy: character > numeric > logical
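
The points above can be tried out in a few lines of base R. This is a minimal sketch; the object names (heights, recycled, mixed and so on) are made up for illustration.

```r
# Assign values to objects with <- and build vectors with c()
heights <- c(1.60, 1.75, 1.82)

# Subset with []: by position, or with a logical condition
second_height <- heights[2]
tall <- heights[heights > 1.7]

# Recycling: the shorter vector c(1, 10) is repeated to match length 4
recycled <- c(1, 2, 3, 4) * c(1, 10)

# Missing values are NA; many functions accept na.rm = TRUE to skip them
with_missing <- c(2, NA, 4)
mean_all <- mean(with_missing)                  # NA, because one value is unknown
mean_known <- mean(with_missing, na.rm = TRUE)  # mean of the known values only

# Mixed types are coerced following character > numeric > logical
mixed <- c(TRUE, 2, "three")
class(mixed)  # "character"
```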

Working with Tabular Data
  • Use library() to load a package into R. You need to do this every time you start a new R session.

  • Read data using the read_*() family of functions (read_csv() and read_tsv() are two common types for comma- and tab-delimited values, respectively).

  • In R, tabular data is stored in a data.frame object.

  • Columns in a data.frame are vectors. Therefore, a data.frame is a list of vectors of the same length.

  • A vector can only contain data of one type (e.g. all numeric, or all character). Therefore, each column of a data.frame can only be of one type also (although different columns may be of different types).
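
To see that a data.frame really is a list of equal-length vectors, the following sketch builds a small table by hand instead of reading a file. The table name surveys and its columns are hypothetical.

```r
# A small hypothetical table built by hand (no file is read here)
surveys <- data.frame(
  country = c("A", "B", "C"),
  year = c(2000L, 2005L, 2010L),
  income = c(1.5, 2.5, 3.5),
  stringsAsFactors = FALSE
)

# A data.frame is also a list, with one element per column
is.data.frame(surveys)
is.list(surveys)

# Each column is a vector of a single type, and all columns have the same length
col_types <- sapply(surveys, class)
col_lengths <- lengths(surveys)

# Extract one column as a plain vector
income_column <- surveys$income
```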

Data visualisation with `ggplot2`
  • To build a ggplot2 graph you need to define: data, aesthetics, geometries (and scales).

  • To change an aesthetic of our graph based on data, include it inside aes().

  • To set an aesthetic manually, regardless of the data, place it outside aes().

  • You can overlay multiple geometries in the same graph, and control their aesthetics individually.

  • Adjust scales of your graph using scale_* family of functions.

  • You can customise your graphs using pre-defined themes (e.g. theme_classic()) or more finely with the theme() function.

  • To save graphs use the ggsave() function.
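
As an illustration of these points, here is a minimal sketch that uses mtcars, a dataset that ships with R, rather than the lesson's own data. The choice of variables and the temporary output file are assumptions made only for the example.

```r
library(ggplot2)

p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  # colour depends on the data, so it goes inside aes()
  geom_point(aes(colour = factor(cyl)), size = 2) +
  # colour is fixed by hand, so it goes outside aes()
  geom_smooth(method = "lm", formula = y ~ x, colour = "black", se = FALSE) +
  # adjust the x-axis scale
  scale_x_continuous(limits = c(1, 6)) +
  # apply a pre-defined theme
  theme_classic()

# Save to a temporary PDF; in a real project, use a path inside your project directory
out_file <- tempfile(fileext = ".pdf")
ggsave(out_file, plot = p, width = 6, height = 4)
```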

Manipulating variables (columns) with `dplyr`
  • Use dplyr::select() to select columns from a table.

  • Select a range of columns using :, columns matching a string with contains(), and unselect columns by using -.

  • Rename columns using dplyr::rename().

  • Modify or update columns using dplyr::mutate().

  • Chain several commands together with %>% pipes.
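
A short sketch of these column operations chained with pipes follows. The table and its column names are hypothetical and exist only for this example.

```r
library(dplyr)

# Hypothetical table for illustration
countries <- data.frame(
  country = c("A", "B", "C"),
  year = c(2000, 2005, 2010),
  pop = c(1e6, 2e6, 3e6),
  gdp_total = c(10, 40, 90)
)

result <- countries %>%
  # a range of columns, columns containing "gdp", and drop year with -
  select(country:pop, contains("gdp"), -year) %>%
  # new_name = old_name
  rename(population = pop) %>%
  # add a column computed from existing ones
  mutate(gdp_per_million = gdp_total / (population / 1e6))

names(result)
```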

Manipulating observations (rows) with `dplyr`
  • Order rows in a table using arrange(). Use the desc() function to sort in descending order.

  • Retain unique rows in a table using distinct().

  • Choose rows based on conditions using filter().

  • Conditions can be set using several operators: >, >=, <, <=, ==, !=, %in%.

  • Conditions can be combined using & and |.

  • The function is.na() can be used to identify missing values. It can be negated as !is.na() to find non-missing values.

  • Use the ifelse() function to define two different outcomes of a condition.
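
The row operations above can be combined as in the sketch below. The table, the income thresholds and the group labels are invented for illustration.

```r
library(dplyr)

# Hypothetical table with one duplicated row and one missing value
incomes <- data.frame(
  country = c("A", "B", "B", "C", "D"),
  income = c(500, 1200, 1200, NA, 3000)
)

result <- incomes %>%
  # drop the duplicated row for country B
  distinct() %>%
  # keep rows with a known income of at least 1000
  filter(!is.na(income) & income >= 1000) %>%
  # two outcomes depending on a condition
  mutate(income_group = ifelse(income > 2000, "high", "middle")) %>%
  # largest income first
  arrange(desc(income))

# Match against a set of values with %in%
subset_ab <- incomes %>% filter(country %in% c("A", "B"))
```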

Grouped operations using `dplyr`
  • Use summarise() to calculate summary statistics in your data (e.g. mean, median, maximum, minimum, quantiles, etc.).

  • Chain together group_by() %>% summarise() to calculate those summaries across groups in the data (e.g. countries, years, world regions).

  • Chain together group_by() %>% mutate() or group_by() %>% filter() to apply these functions based on groups in the data.

  • As a safety measure, always remember to ungroup() tables after using group_by() operations.
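
A minimal sketch of grouped summaries and grouped mutate, using a made-up table with a region column:

```r
library(dplyr)

# Hypothetical table for illustration
measurements <- data.frame(
  region = c("north", "north", "south", "south"),
  value = c(1, 3, 10, 20)
)

# One row per group, holding the summary statistic
region_means <- measurements %>%
  group_by(region) %>%
  summarise(mean_value = mean(value))

# Same number of rows as the input; mean() is computed within each group
centred <- measurements %>%
  group_by(region) %>%
  mutate(deviation = value - mean(value)) %>%
  ungroup()  # remove the grouping so later steps are not affected by it
```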

Working with categorical data + Saving data
  • Use functions from the stringr package to manipulate strings. All these functions start with str_, making them easy to identify.

  • Use factors to encode ordinal variables, ensuring the levels are set in a logical order.
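
The sketch below cleans a hypothetical character vector with stringr and then encodes it as a factor with levels in a meaningful order. The values and level names are assumptions for the example.

```r
library(stringr)

# Hypothetical messy labels: inconsistent case and a stray trailing space
raw_sizes <- c("Small ", "LARGE", "medium", "small")

# Trim whitespace and convert to lower case
clean_sizes <- str_to_lower(str_trim(raw_sizes))

# Detect values that start with "m"
starts_with_m <- str_detect(clean_sizes, "^m")

# Set the levels explicitly; by default they would be sorted alphabetically
size_factor <- factor(clean_sizes, levels = c("small", "medium", "large"))

# Sorting now follows the level order rather than the alphabet
sorted_sizes <- sort(size_factor)
```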

Joining tables
  • Use full_join(), left_join(), right_join() and inner_join() to merge two tables together.

  • Specify the column(s) to match between tables using the by option.

  • Use anti_join() to identify the rows from the first table which do not have a match in the second table.
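
To compare the join types, here is a sketch with two small hypothetical tables that share an id column. Only ids 2 and 3 appear in both.

```r
library(dplyr)

# Hypothetical tables for illustration
table_a <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
table_b <- data.frame(id = c(2, 3, 4), y = c(TRUE, FALSE, TRUE))

# Keep only ids present in both tables
inner <- inner_join(table_a, table_b, by = "id")

# Keep all rows of the first table; unmatched rows get NA in y
left <- left_join(table_a, table_b, by = "id")

# Keep all rows from both tables
full <- full_join(table_a, table_b, by = "id")

# Rows of the first table with no match in the second
unmatched <- anti_join(table_a, table_b, by = "id")
```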

Data reshaping: from wide to long and back
  • Use pivot_wider() to reshape a table from long to wide format.

  • Use pivot_longer() to reshape a table from wide to long format.

  • To figure out which data format is more suited for a given analysis, it can help to think about what visualisation you want to make with ggplot: any aesthetics needed to build the graph should exist as columns of your table.
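
A round trip between the two formats might look like the following sketch. The table and the column names y2000 and y2010 are hypothetical.

```r
library(tidyr)

# Hypothetical wide table: one column per year
wide <- data.frame(
  country = c("A", "B"),
  y2000 = c(10, 20),
  y2010 = c(15, 25)
)

# Wide to long: year and population become columns, which suits
# a plot such as aes(x = year, y = pop, colour = country)
long <- pivot_longer(
  wide,
  cols = c(y2000, y2010),
  names_to = "year",
  values_to = "pop"
)

# Long back to wide: one column per distinct value of year
wide_again <- pivot_wider(long, names_from = year, values_from = pop)
```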

Data visualisation with `ggplot2` - part II
  • Use labs() to customise the labels of your plot’s aesthetics (e.g. x, y, colour, fill, size, etc.).

  • Use annotate() to freely add text, segments or rectangles to your plot.

  • Use built-in theme_*() functions to change the overall look of your graphs.

  • Use theme() to change the look of every single element of the graph.

  • Use theme_set() to change the theme for the rest of your R session.

  • Use the patchwork package to compose graphs, using the | and / operators to place two plots side-by-side or top-and-bottom, respectively.

  • The plot_layout() function can be used to adjust your plot arrangement. Useful options are widths and heights to adjust the relative size of the panels, and guides = 'collect' to make a single legend common to the whole figure.

  • Use plot_annotation(tag_levels = 'A') to automatically add a letter tag to each panel.
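
The sketch below puts several of these pieces together, again using the built-in mtcars dataset rather than the lesson's data. The label text and annotation position are chosen only for illustration; in patchwork, the automatic panel tags are requested through the tag_levels argument of plot_annotation().

```r
library(ggplot2)
library(patchwork)

p1 <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  # label each aesthetic used in the plot
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders") +
  # free-form text placed at data coordinates
  annotate("text", x = 4.5, y = 30, label = "example note") +
  theme_minimal()

p2 <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(cyl))) +
  geom_bar() +
  labs(x = "Cylinders", y = "Number of cars", fill = "Cylinders") +
  theme_minimal()

# Side by side with |, first panel twice as wide, legends collected,
# and panels tagged A, B
combined <- (p1 | p2) +
  plot_layout(widths = c(2, 1), guides = "collect") +
  plot_annotation(tag_levels = "A")
```

Replacing | with / would stack the two panels top-and-bottom instead.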

Extra practice exercises
  • The initial exploration of data is crucial to detect any data quality issues that need fixing.

Compendium of functions

to be added…

Useful keyboard shortcuts

Glossary

Arguments (functions)
Assignment operator
The assignment operator in R is <-. It is used to assign values to objects.
Comment
Comments are preceded by # symbol and are used to add information to your code. Your comments could describe the objective of a particular piece of code.
Data frame
This is the basic type of object in R that stores tabular data. A tibble is a variant of this type of object.
Function
Object
Vector
One of the basic types of object in R. The word vector refers to atomic vectors, which are one-dimensional collections of values. They are created with the c() function.
R projects
Working directory