---
title: "lab1_code"
author: "Kai Ping (Brian) Leung"
date: "9/27/2019"
output:
  pdf_document: default
  html_document:
    df_print: paged
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Prerequisites
```{r, message=FALSE}
rm(list = ls()) # Clear the workspace
library(tidyverse) # Load packages
```
# Working directory
Check that your working directory is set to the folder where you saved `lab1_data.csv` and `lab1_survey.csv`.
```{r}
#getwd()
#setwd()
```
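If reading the data fails later, a quick sanity check (assuming the file names above) is to test whether the files are visible from the current working directory:

```{r}
# Quick sanity check: are the data files visible from here?
# FALSE means you need to setwd() to the right folder first.
file.exists("lab1_data.csv")
file.exists("lab1_survey.csv")
```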
# Data Frames Practice 1
## 1. Load `lab1_data.csv` into R
```{r}
# Load data
data <- read.csv("lab1_data.csv", header = TRUE, stringsAsFactors = FALSE)
```
## 2. What is the data structure? What does that tell us about type?
```{r}
# Check structure
dim(data)
class(data)
is.data.frame(data)
is.matrix(data)
# Alternatively
str(data)
```
## 3. Check the names and summary statistics of the data. Fix any unclear names.
```{r}
# Check and fix names
names(data)
names(data)[3] <- "gdp.per.cap"
names(data) # Check again
# Summary Statistics
summary(data)
```
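Renaming by position (`names(data)[3] <- ...`) is fragile if the column order ever changes. A tidy alternative is `dplyr::rename()`, sketched here on a toy data frame (not the lab data; the column names are made up):

```{r}
toy <- data.frame(country = c("A", "B"), year = c(2000, 2001), gdppc = c(1, 2))
toy <- dplyr::rename(toy, gdp.per.cap = gdppc) # rename by name, not position
names(toy)
```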
## 4. Remove observations with missing values
```{r}
# Remove NAs
dataClean <- na.omit(data) # listwise deletion!!
dim(data)
dim(dataClean)
```
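Note that `na.omit()` performs listwise deletion: a row is dropped if *any* of its columns is missing. A minimal sketch on toy data (not the lab data) makes this concrete:

```{r}
toy <- data.frame(country = c("A", "B", "C"),
                  gdp     = c(100, NA, 300),
                  polity  = c(5, 6, NA))
complete.cases(toy) # which rows are fully observed?
na.omit(toy)        # only those rows survive
```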
## 5. Calculate the average GDP per capita for Brazil for the observed period. Repeat the calculation for all countries.
```{r}
# Base R
mean(dataClean[dataClean$country == "Brazil", "gdp.per.cap"])
# Tidy way
dataClean %>%
  filter(country == "Brazil") %>%
  summarize(mean(gdp.per.cap))
# Average gdp.per.cap for all countries
dataClean %>%
  group_by(country) %>%
  summarize(mean(gdp.per.cap))
```
## 6. Plot GDP per capita (on the x-axis) and polity2 (on the y-axis)
```{r}
# Base Graphics
plot(x = dataClean$gdp.per.cap,
     y = dataClean$polity2)
# Try logging GDP
plot(x = log(dataClean$gdp.per.cap),
     y = dataClean$polity2,
     xlab = "Logged GDP per capita",
     ylab = "Polity2")
# ggplot2
ggplot(dataClean, aes(y = polity2, x = log(gdp.per.cap))) +
  geom_point() +
  labs(x = "Logged GDP per capita", y = "Polity2") +
  theme_classic()
```
## 7. Create a new variable called "democracy". Assign 0 to observations with a zero or negative polity2 score, and 1 to those with a positive score.
```{r}
# Create a variable called "democracy"
dataClean$democracy <- NA
head(dataClean)
# You can subset data based on a logical statement
dataClean$polity2 <= 0
dataClean[dataClean$polity2 <= 0, ]
# Take advantage of this: Assign values to "democracy" based on polity2 values
dataClean$democracy[dataClean$polity2 <= 0] <- 0
# Do the same for positive Polity2 score
dataClean$democracy[dataClean$polity2 > 0] <- 1
# Tidy way
dataClean %>%
  mutate(democracy = case_when(polity2 <= 0 ~ 0,
                               TRUE ~ 1))
```
## 8. Use a loop to do the same recoding
```{r}
dataClean$democracy <- NA
n <- nrow(dataClean)
for (i in 1:n) {
  if (dataClean$polity2[i] <= 0) {
    dataClean$democracy[i] <- 0
  } else {
    dataClean$democracy[i] <- 1
  }
}
```
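The loop works, but R is vectorized, so the same recoding needs no loop at all. A sketch using `ifelse()` on a toy vector (not the lab data):

```{r}
polity2_toy <- c(-7, 0, 3, 10)
democracy_toy <- ifelse(polity2_toy <= 0, 0, 1) # one vectorized pass
democracy_toy
```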
# Data Frames Practice 2
## 1. Read in the data "lab1_survey.csv"
```{r}
# Clear and load data
rm(list = ls())
survey_data <- read.csv(file = "lab1_survey.csv")
```
## 2. Inspect the data. What format are they in? What values do the data take, and how do those values correspond with the survey?
```{r}
str(survey_data)
```
## 3. Generate some summary statistics.
```{r}
summary(survey_data)
mean(survey_data$R)
mean(survey_data$latex)
median(survey_data$R)
median(survey_data$latex)
sd(survey_data$R)
sd(survey_data$latex)
# Tidy way
survey_data %>%
  # funs() is deprecated; supply a named list of functions instead
  summarize_all(list(mean = mean, median = median, sd = sd, min = min, max = max))
# %>% gather(key = "stat")
```
## 4. How are these two variables related to each other (assuming equal intervals b/w categories)?
```{r}
cor1 <- cor(survey_data$R, survey_data$latex)
```
The correlation b/w R knowledge and LaTeX knowledge is `r cor1`, or more nicely, `r round(cor1, 2)`.
## 5. Are there any problems with the way the data are coded? (Think about lecture yesterday.)
## 6. Recode the data
```{r}
survey_data %>%
  mutate(# Recode R into categories
         R_cat = case_when(R == 0 ~ "What's that?",
                           R == 1 ~ "I've heard of it",
                           R == 2 ~ "I can use it or apply it",
                           TRUE ~ "I understand it well"),
         # Recode latex into categories
         latex_cat = case_when(latex == 0 ~ "What's that?",
                               latex == 1 ~ "I've heard of it",
                               latex == 2 ~ "I can use it or apply it",
                               TRUE ~ "I understand it well"))
# We're repeating ourselves... Must be a faster way
survey_data <-
  survey_data %>%
  mutate_at(vars(R, latex),
            function(x) case_when(x == 0 ~ "What's that?",
                                  x == 1 ~ "I've heard of it",
                                  x == 2 ~ "I can use it or apply it",
                                  TRUE ~ "I understand it well"))
```
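For what it's worth, when the codes are consecutive integers starting at 0, a lookup vector indexed by `code + 1` avoids the `case_when()` repetition entirely. A toy sketch, assuming the 0-3 coding used in the survey (the response values below are made up):

```{r}
labels <- c("What's that?",
            "I've heard of it",
            "I can use it or apply it",
            "I understand it well")
codes <- c(0, 2, 3, 1) # toy responses
labels[codes + 1]      # map each code to its label
```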
## 7. Why is this coding method better?
## 8. Generate some plots of the data: bar charts are good here, scatterplots even better.
```{r}
# Bar charts
ggplot(survey_data, aes(x = R)) +
  geom_bar() +
  labs(x = "R knowledge")
ggplot(survey_data, aes(x = latex)) +
  geom_bar() +
  labs(x = "LaTeX knowledge")
# Scatter plot
ggplot(survey_data, aes(x = R, y = latex)) +
  geom_jitter(alpha = .7, height = .2, width = .2) +
  labs(x = "R knowledge", y = "LaTeX knowledge") +
  theme_classic()
##### Something is wrong? #####
# Convert the two variables into factors with ordered levels
knowledge_levels <- c("What's that?",
                      "I've heard of it",
                      "I can use it or apply it",
                      "I understand it well")
survey_data <-
  survey_data %>%
  mutate(R = factor(R, levels = knowledge_levels),
         latex = factor(latex, levels = knowledge_levels))
# Redo the scatter plot
ggplot(survey_data, aes(x = R, y = latex)) +
  geom_jitter(alpha = .7, height = .2, width = .2) +
  labs(x = "R knowledge", y = "LaTeX knowledge") +
  scale_x_discrete(limits = knowledge_levels) +
  theme_classic()
```
# LaTeX in R Markdown
$$
1 + 1 = 2
$$
$$
11 \times 11 = 121
$$
$$
E = mc^2
$$
I think it's Einstein who proposed $E = mc^2$.
$$
x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
$$
$$
\begin{split}
X & = (x+a)(x-b) \\
& = x(x-b) + a(x-b) \\
& = x^2 + x(a-b) - ab
\end{split}
$$