--- title: "CSSS 569 Visualizing Data and Models" subtitle: "Lab 1: Supplemental R resource" author: "Brian Leung" institute: "Department of Political Science, UW" date: \today output: beamer_presentation: incremental: yes bibliography: datavis.bib link-citations: yes linkcolor: blue editor_options: chunk_output_type: console --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) options(width=50) ``` # Useful \texttt{R} resources - \texttt{R} - *R for Data Science* [@grolemund_r_2016] - *Quantitative Social Science : An Introduction* [@imai_quantitative_2017] - DataCamp: - R cheat sheets: - \texttt{R Markdown} - *R Markdown: The Definitive Guide* [@xie_r_2019] - Data visualization - *Data Visualization: A Practical Introduction* [@healy_data_2018] - *Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures* [@wilke_fundamentals_2019] - Others - Stack Overflow: - TidyTuesday Project: # \texttt{R} boot camp - \texttt{R} is a language and environment for statistical computing and graphics - *Object-oriented* style of programming - System-supplied or user-defined functionality as *functions* - Extended via *packages* - \texttt{RStudio} is an integrated development environment for \texttt{R}, which includes: - a console to run \texttt{R} code - an editor to write code and text - tools for plotting, history, debugging and workspace management - Let's open \texttt{RStudio} and a plain \texttt{R Script} # Running \texttt{R} code and operators \small ```{r} # Arithmetic Operators 1 + 1 2 * 8 9 / 3 2^3 ``` # Running \texttt{R} code and operators \small ```{r} # Relational Operators 10 > 8 7 <= 6 (2 * 5) == 10 1 != 2 ``` # Objects in \texttt{R}: vectors and assignment \small ```{r} # Concatenate vectors into a new vector c(1, 2, 3) # Assign them to a new object for manipulation x <- c(1, 2, 3) print(x) # or simply, x # Operators on vector x + 1 x == 1 ``` # Objects in \texttt{R}: vectors and functions \small ```{r} # Use an object as input to a function x <- c(1, 2, 3) class(x) length(x) mean(x) ``` # Objects in \texttt{R}: three beginner tips 1. Unless you assign (`<-` ) some operations or transformations to an object, those chances will not be registered \small ```{r} x <- c(1, 2, 3) print(x + 1) print(x) x <- x + 1 print(x) ``` # Objects in \texttt{R}: three beginner tips 2. New assignment will overwrite the original values if you assign some values to an existing object. It is a **major** source of errors. One advise is to keep distinct object names \small ```{r} x <- c(1, 2, 3) length(x) x <- c(1, 2, 3, 4, 5) length(x) ``` # Objects in \texttt{R}: three beginner tips 3. When using functions, we often bump into unexpected outputs, or error messages: \small ```{r} y <- c(1, 2, 3, NA) mean(y) # It's essential to know how to seek help: help(mean) ?mean # Specify appropriate arguments for functions: mean(y, na.rm = TRUE) ``` # Objects in \texttt{R}: atomic vectors - What are vectors exactly? - (Atomic) vectors are the most basic units of data in \texttt{R} - Most common types of atomic vectors: **numeric (integer, double)**, **logical**, **character** # Objects in \texttt{R}: atomic vectors - Most common types of atomic vectors: **numeric (integer, double)**, **logical**, **character** \small ```{r} x <- c(1, 2, 3) class(x) y <- c(TRUE, FALSE, FALSE) class(y) names <- c("Peter", "Paul", "Mary") class(names) ``` # Objects in \texttt{R}: atomic vectors - You can also coerce one type of vector into another: \small ```{r} x <- c(1, 2, 3) x <- as.character(x) print(x) class(x) ``` # Objects in \texttt{R}: matrix and data frame - To deal with massive data, we need efficient data structures to store and manipulate vectors: **matrices** and **data frames** # Objects in \texttt{R}: matrix and data frame - To create a matrix: \small ```{r} # Create a vector numbers <- 1:12 print(numbers) # Store it as a matrix matrix1 <- matrix(data = numbers, nrow = 3, byrow = TRUE) print(matrix1) ``` # Objects in \texttt{R}: matrix and data frame \small ```{r} # Basic information class(matrix1) dim(matrix1) # dimensions ``` # Objects in \texttt{R}: matrix and data frame \small ```{r} # We can change the row/column names of matrices rownames(matrix1) rownames(matrix1) <- c("row1", "row2", "row3") print(matrix1) ``` # Objects in \texttt{R}: matrix and data frame \small ```{r} # Automate any repetitive process col_names <- paste0("column", 1:4) print(col_names) colnames(matrix1) <- col_names print(matrix1) ``` # Objects in \texttt{R}: matrix and data frame \small ```{r} # To augment the matrix with new column column5 <- c(13, 14, 15) matrix1 <- cbind(matrix1, column5) print(matrix1) ``` # Objects in \texttt{R}: matrix and data frame \small ```{r} # To augment the matrix with new row row4 <- c("a", "b", "c", "d", "e") matrix1 <- rbind(matrix1, row4) print(matrix1) ``` Why do all vectors become characters? # Objects in \texttt{R}: matrix and data frame - Matrices vs. data frames - Matrices can only contain one **homogenous** type of vectors - Data frames can contain **heterogeneous** types of vectors, and thus are more flexible # Objects in \texttt{R}: matrix and data frame - Data frames can contain **heterogeneous** types of vectors, and thus are more flexible \small ```{r} df1 <- data.frame( names = c("Peter", "Paul", "Mary"), age = c(14, 15, 16), female = c(FALSE, FALSE, TRUE), stringsAsFactors = FALSE ) print(df1) ``` # Objects in \texttt{R}: matrix and data frame \small ```{r} # Basic information class(df1) dim(df1) str(df1) ``` # Objects in \texttt{R}: subsetting data - There are several ways to subset data: row/column indices, variable names, or evaluations \small ```{r} # 1) Subsetting by row/column indices # For the element in row 1, column 1 df1[1, 1] # For all elements in row 1, regardless of columns df1[1, ] # For all elements in column 1, regardless of rows df1[, 1] ``` # Objects in \texttt{R}: subsetting data \small ```{r} # 2) Subsetting by variable names df1$names df1$age df1$female ``` # Objects in \texttt{R}: subsetting data \small ```{r} # 3) Subsetting by evaluations df1[df1$age >= 15, ] df1[df1$female == TRUE, ] df1[df1$name %in% c("Peter", "Paul"), ] ``` # Objects in \texttt{R}: creating new variable in data frame \small ```{r} print(df1) df1$edu df1$edu <- c("hs", "col", "phd") print(df1) ``` # Summary of data structures in \texttt{R} ----------------------------------------- Homogeneous Heterogeneous ---- ---------------- ----------------- 1d Atomic vector List 2d Matrix Data frame nd Array ----------------------------------------- - Another important data structure: \texttt{factor} for categorical data, which will be important for visualization purpose # Vector practices - Create the following objects: 1. vector1: {a1, a2, a3, b1, b2, b3, c1, c2, c3 ... z1, z2, z3} - Hint: break downs the question into two parts; check out function \texttt{rep(..., times = ..., each = ...)} 2. vector2: The sequence from 1 to 49 by an increment of 2 - Hint: check out function \texttt{seq(...)} - Subset the 3rd, 16th, and 25th elements of the vector - Subset those elements whose values are either smaller than 10, or greater than 40 # Vector practices \small ```{r} # Q1 chr <- rep(letters, each = 3) print(chr) num <- rep(1:3, times = length(letters)) print(num) ``` # Vector practices \small ```{r} # Q1 vector1 <- paste0(chr, num) print(vector1) ``` # Vector practices \small ```{r} # Q2 vector2 <- seq(from = 1, to = 49, by = 2) print(vector2) vector2[c(3, 16, 25)] vector2[vector2 < 10 | vector2 > 40] ``` # Vector practices 3. matrix1: a 5 by 5 matrix containing values from vector2 - Assign the row names: row_a, row_b, row_c, row_d, row_e - Assign the column names: col1, col2, col3, col4, col5 - Multiply the values in the first column of matrix 1 by 100; overwrite the original column 4. df1: a dataframe with two variables: - country = {US, UK, CA, FR, IT} - pop = {327, 66, 37, 67, 60} - Subset top-three observations in term of the level of population - Hint: check out function \texttt{order(...)} # Vector practices \small ```{r} # Q3 matrix1 <- matrix(data = vector2, nrow = 5, ncol = 5) rownames(matrix1) <- paste("row", letters[1:5], sep = "_") colnames(matrix1) <- paste0("col", 1:5) matrix1[, 1] <- matrix1[, 1] * 100 print(matrix1) ``` # Vector practices \scriptsize ```{r} # Q4 df1 <- data.frame(country = c("US", "UK", "CA", "FR", "IT"), pop = c(327, 66, 37, 67, 60)) print(df1) order(df1$pop, decreasing = TRUE) top3 <- order(df1$pop, decreasing = TRUE)[1:3] df1[top3, ] ``` # Workflow in \texttt{R} - Usual workflow for data anlaysis [@grolemund_r_2016]: ```{r, echo = FALSE, out.width='85%', fig.align='center'} #knitr::include_graphics("C:/Users/ak915/git/MethodRA/data-science-explore.png") knitr::include_graphics("data-science-explore.png") ``` # Tidyverse and tidy data - \texttt{Tidyverse} is a collection of packages designed for data science with unified grammar and data structures - *Tidy data*: - Each **variable** must have its own **column** - Each **observation** must have its own **row** - Each value must have its own cell # Tidyverse and tidy data >- To install \texttt{Tidyverse} package, run: \small ```{r, eval=FALSE} install.packages("tidyverse") ``` >- To load a package, run (usually at the top of your R document): \small ```{r, eval=FALSE} library(tidyverse) ``` # Importing data in \texttt{R} \small ```{r, message=F, warning=F} # Load package library(tidyverse) ``` \small ```{r} # Load econ.csv econ <- read_csv("http://staff.washington.edu/kpleung/vis/data/econ.csv") # tibble (tbl) is a special class of data frame class(econ) ``` # Importing data in \texttt{R} \scriptsize ```{r} # Get a sense of the dataset glimpse(econ) head(econ) ``` # Basic data wrangling - Below are just scratching the surface; check out - \href{https://www.datacamp.com/courses/introduction-to-the-tidyverse}{Introductory course to tidyverse at DataCamp} - \href{https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf}{Cheat sheet for data wrangling} - \href{https://r4ds.had.co.nz}{\textit{R for Data Science}} # Basic data wrangling: `count()` Count number of rows in each group: \scriptsize ```{r} econ %>% count(country) ``` # Basic data wrangling: `%>%` >- What is `%>%` ("pipe")? > - `x %>% fun(y)` is equivalent to `fun(x, y)` > - Its advantage will be apparent when you perform numerous steps of manipulation \scriptsize ```{r} count(econ, country) # Equivalent to econ %>% count(country) ``` # Basic data wrangling: `arrange()` Order rows by values of column(s) from low to high: \scriptsize ```{r} econ %>% count(country) %>% arrange(n) # Rather than: arrange(count(econ, country), n) ``` # Basic data wrangling: `arrange()` Order rows by values of column(s) from high to low: \scriptsize ```{r} econ %>% count(country) %>% arrange(desc(n)) ``` # Basic data wrangling: `filter()` Extract rows that meet logical criteria: \scriptsize ```{r} econ %>% filter(country == "Brazil") ``` # Basic data wrangling: `filter()` Extract rows that meet **multiple** logical criteria: \scriptsize ```{r} econ %>% filter( country == "Brazil" | country == "Russia (Soviet Union)" | country == "India" | country == "China" ) ``` # Basic data wrangling: `filter()` Alternatively: \scriptsize ```{r} econ %>% filter(country %in% c("Brazil", "Russia (Soviet Union)", "India", "China")) ``` # Basic data wrangling: `select()` Extract columns (variables): \scriptsize ```{r} econ %>% select(country, year, gdpPercap) ``` # Basic data wrangling: `filter()` & `select()` Filter USA observations from 2000 to 2010 with `year` and `gdpPercap` as the only variables: \scriptsize ```{r} USAdata <- econ %>% filter(country == "United States of America", year %in% 2000:2010) %>% select(year, gdpPercap) print(USAdata) ``` # Basic data wrangling: `summarize()` Compute table of summaries: \small ```{r} USAdata %>% summarize(avg_gdpPercap = mean(gdpPercap)) ``` What if we want to calculate the average GDP per capita for all countries in our data set? # Basic data wrangling: `group_by()` & `summarize()` >- Create a grouped version of the table with `group_by()` > - Subsequent functions will manipulate each group *separately* \scriptsize ```{r} econ %>% group_by(country) %>% summarize(avg_gdpPercap = mean(gdpPercap)) %>% arrange(desc(avg_gdpPercap)) ``` # Basic data wrangling: more `summarize()` What if we want to know the numbers of distinct countries and years in the data set? \small ```{r} econ %>% summarize_at(c("country", "year"), n_distinct) ``` # Basic data wrangling: `mutate()` Compute new columns (variables): \scriptsize ```{r} econ %>% mutate( id = row_number(), decade = year %/% 10 * 10 ) %>% select(id, country, GWn, year, decade, gdpPercap) ``` # Basic data wrangling: `group_by()` & `summarize()` What if we want to know countries' average GDP per capita over decades? \scriptsize ```{r} econ %>% mutate(decade = year %/% 10 * 10) %>% group_by(country, decade) %>% summarize(decAvg_gdp = mean(gdpPercap)) ``` # Saving wrangled data When you save the wrangled data, don't overwrite the original data with the same file name: \small ```{r, eval = FALSE} write_csv(econ, "econ_wrangled.csv") ``` # Intermediate data wranggling: second data set ```{r, include = F} options(width=150) ``` \scriptsize ```{r, message=F} pop <- read_csv("http://staff.washington.edu/kpleung/vis/data/pop.csv") head(pop) # Compare with econ head(econ) ``` # Intermediate data wranggling: `join` family How do we combine two data sets such that: \scriptsize ```{r,echo=FALSE} econ %>% left_join(pop, by = c("GWn", "year")) %>% select(-country.y) %>% rename(country = country.x) ``` # Intermediate data wranggling: `join` family Family of `join` functions: `inner_join`, `left_join`, `right_join`, `full_join`... \scriptsize ```{r} data <- econ %>% left_join(pop, by = c("GWn", "year")) %>% select(-country.y) %>% rename(country = country.x) ``` \scriptsize ```{r, echo = F} print(data) ``` # Intermediate data wranggling: `separate` (or `Regex`) How to separate the `region` column into `continent` and `sub_region`? \scriptsize ```{r, echo = F} data %>% separate(region, into = c("continent", "sub_region"), sep = ": ") ``` # Intermediate data wranggling: `separate` (or `Regex`) How to separate the `region` column into `continent` and `sub_region`? \scriptsize ```{r} data %>% separate(region, into = c("continent", "sub_region"), sep = ": ") ``` # Intermediate data wranggling: `separate` (or `Regex`) How to separate the `region` column into `continent` and `sub_region`? \scriptsize ```{r} # Or using regular expression data %>% mutate(continent = str_extract(region, ".*(?=: )"), sub_region = str_extract(region, "(?<=: ).*")) %>% select(-region) ``` ```{r, include=F} data <- data %>% separate(region, into = c("continent", "sub_region"), sep = ": ") ``` # Intermediate data wranggling: `case_when` - How to convert `pop` into a new categorical variable, called `popCat`: - Countries with `pop` value lower than the first quartile of all `pop` is classified as "low" - Countries with `pop` value equal to or higher than the first quartile, but lower than the third quartile is classified as "middle" - Countries with `pop` value equal to or higher than the third quartile is classified as "high" # Intermediate data wranggling: `case_when` \scriptsize ```{r} Qts <- quantile(data$pop, prob = c(0.25, 0.75), na.rm = TRUE) print(Qts) Q1 <- Qts[1] Q3 <- Qts[2] data <- data %>% mutate(popCat = case_when(pop < Q1 ~ "low", pop >= Q1 & pop < Q3 ~ "middle", pop > Q3 ~ "high")) ``` \tiny ```{r, echo = F} print(data) ``` # Intermediate data wranggling: `mutate` and `lag` Focus on USA data again. How to create a variable, named `growth`, thats computes the percentage change in `gdpPercap` compared to the immediate last year? \scriptsize ```{r, echo = F} data %>% filter(country == "United States of America") %>% mutate(gdpPercap_lag = lag(gdpPercap), growth = (gdpPercap - gdpPercap_lag) / gdpPercap_lag) %>% select(country, year, gdpPercap, growth) ``` # Intermediate data wranggling: `mutate` and `lag` \scriptsize ```{r} # Extract USA data USAdata <- data %>% filter(country == "United States of America") %>% select(country, year, gdpPercap) # Use `lag` to create a column of gdpPercap in past year USAdata <- USAdata %>% mutate(gdpPercap_lag1 = lag(gdpPercap, n = 1)) print(USAdata) ``` # Intermediate data wranggling: `mutate` and `lag` \scriptsize ```{r} USAdata <- USAdata %>% mutate(growth = (gdpPercap - gdpPercap_lag1) / gdpPercap_lag1) print(USAdata) ``` # References \scriptsize