--- title: "R and SQL" author: "Philip M. Hurvitz" # date: '`r format(Sys.time(), "%Y-%m-%d %H:%M")`' header-includes: #allows you to add in your own Latex packages - \usepackage{float} #use the 'float' package - \floatplacement{figure}{H} #make every figure with caption = h output: html_document: number_sections: true self_contained: true code_folding: hide toc: true toc_float: collapsed: true smooth_scroll: false pdf_document: number_sections: true toc: true fig_cap: yes keep_tex: yes urlcolor: blue --- ```{r setup, message=FALSE} # key packages library(RPostgreSQL) library(RSQLite) library(sqldf) library(sf) library(kableExtra) library(dplyr) library(tidyr) library(knitr) library(ggplot2) library(readstata13) library(DT) library(rgdal) # captions library(captioner) table_nums <- captioner(prefix = "Table") figure_nums <- captioner(prefix = "Figure") knitr::opts_chunk$set(warning = FALSE, message = FALSE) # db connection script source("../scripts/dbconnect.R") # establish the connection rsql <- connectdb(dbname = "rsql", username = "csde") ```
[R and SQL databases](http://staff.washington.edu/phurvitz/r_sql){target="_blank"}

# Simple database interaction
## Showing connection parameters
Sometimes we want to see the database connection parameters. The basic connection parameters are shown with the `dbGetInfo(conn)` function. The connection parameters below:

```{r getinfo}
dbGetInfo(rsql)
```

show that the connection is to the local server (`localhost`) on port `5432`, connecting as user `csde` to the database `rsql`, and that the server is running PostgreSQL version `12.0.2` (though your version may vary).

## Test that the connection actually works
Here we will run a simple function to show that we are able to access data in the database. The `dbListTables()` function lists the user-accessible tables in the database search path. See `r table_nums(name = "dbt", display = "cite")`.

_`r table_nums(name = "dbt", caption = "Current database tables")`_

```{r q1}
dbt <- data.frame(table_name = dbListTables(conn = rsql))
kable(x = dbt, format = "html") %>% 
    kable_styling(bootstrap_options = c("striped", "hover"), full_width = F, fixed_thead = T, position = "left")
```

Currently the only table in the database (other than the internal tables that users typically do not use directly) is the one set up by the PostGIS extension. If this Rmd file is rendered after we have added any tables to the database, there should be more tables listed.

# Disconnecting
To disconnect from the database, we can use the custom function `disconnectdb()`. Although we will not disconnect yet ... there's still work to be done.

```{r disconnect, eval=FALSE}
disconnectdb(conn = rsql)
```
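If more detail is wanted than `dbListTables()` provides (for example, which schema each table belongs to), the standard `information_schema.tables` view can be queried directly. A minimal sketch (not evaluated here):

```{r list_tables_schema, eval=FALSE}
# list user tables with their schemas, excluding the system catalogs
dbGetQuery(conn = rsql, statement = 
    "select table_schema, table_name
     from information_schema.tables
     where table_schema not in ('pg_catalog', 'information_schema')
     order by table_schema, table_name;")
```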
# Writing to the database
## `dbWriteTable()`
Here we are going to read the [Add Health](https://www.cpc.unc.edu/projects/addhealth/documentation/publicdata){target="_blank"} data in as a data frame using the `readstata13::read.dta13()` function. The table is large (6504 r $\times$ 2794 c). Because PostgreSQL has a limit of 1600 columns per table, we will reshape this into a long format using `tidyr::pivot_longer()` and finally write the table into the database. The result of `dbWriteTable()` is a Boolean value indicating whether the table was written.

```{r read_dta}
# check whether the table exists -- if it does not, then run this
if(!dbTableExists(conn = rsql, table_schema = "addhealth", table_name = "data_21600_0001")){
    # unzip if necessary
    if(!file.exists("../data/21600-0001-Data.dta")){
        x <- unzip(zipfile = "../data/21600-0001-Data.zip", exdir = "../data")
    }
    
    # read the file in, raw (do not convert values to factors)
    ah <- readstata13::read.dta13("../data/21600-0001-Data.dta", convert.factors = FALSE)
    
    # lowercase the column names, better for PostgreSQL
    colnames(ah) <- tolower(colnames(ah))
    
    # convert all variables except the ID (first column) to integer
    ah <- ah %>% mutate_at(., -1, as.integer)
    
    # make long
    ah1 <- ah %>% pivot_longer(-aid, names_to = "varname", values_to = "val")
    
    # write to the database -- may take a while because there are about 18 million records.
    # also use row.names = FALSE to avoid making a "row.names" column
    success <- dbWriteTable(conn = rsql, name = c("addhealth", "data_21600_0001"), value = ah1, row.names = FALSE, overwrite = TRUE)
}
```

## SQL and psql
Here we will use a large table of GPS records collected from my phone since Sept 10, 2019. We first read in the large table of point records to examine the column names and values. Seeing that some of the columns contain unusable data, we drop those columns, save a new CSV, and then write to the database using the psql `\copy` statement. Constructing a Windows system command in R is no fun at all. We are jumping the gun a bit with respect to running SQL commands (e.g., `CREATE TABLE` and `SELECT`), but we do need to get some data into the database and verify that the data were written.
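As a quick sanity check that the long table made it into the database, we can count its rows and distinct respondents with a small query. A minimal sketch (assuming the table was written as above; not evaluated here):

```{r verify_addhealth, eval=FALSE}
# row count and distinct respondent count for the long-format table
dbGetQuery(conn = rsql, statement = 
    "select count(*) as n_rows, count(distinct aid) as n_respondents
     from addhealth.data_21600_0001;")
```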
```{r gps}
# only run if gps.tracks does not exist
if(!dbTableExists(conn = rsql, table_schema = "gps", table_name = "tracks")){
    # unzip
    if(!file.exists("../data/gpslogger.csv")){
        unzip("../data/gpslogger.csv.zip", exdir = "../data")
    }
    
    # read in the file -- so we can see the column names and values
    gps <- read.csv("../data/gpslogger.csv", as.is = TRUE)
    
    # there are a few columns worth dropping because they have no data
    gps$ageofdgpsdata <- gps$dgpsid <- gps$activity <- gps$annotation <- NULL
    
    # save the GPS data minus dropped columns
    write.csv(x = gps, file = "../data/gpslogger_dropped_cols.csv", row.names = FALSE, na = "")
    
    # an SQL query to define the table
    sql <- "
    --drop the table if it exists
    drop table if exists gps.tracks;
    --create the table
    create table gps.tracks
    (fname text
    , time_gps timestamptz
    , calendar_date date
    , lat float
    , lon float
    , elevation float
    , accuracy float
    , bearing float
    , speed float
    , satellites integer
    , provider text
    , hdop float
    , vdop float
    , pdop float
    , geoidheight float
    , battery integer);"
    
    # run the query to make the table
    res <- dbGetQuery(conn = rsql, statement = sql)
    
    # write to the database from the CSV
    # test OS
    if(Sys.info()['sysname'] == "Windows"){
        # formulate a horrible windows command string
        mycmd <- "cmd /c C:\\PROGRA~1\\PostgreSQL\\12\\bin\\psql -U csde -c \"\\copy gps.tracks from '../data/gpslogger_dropped_cols.csv' with csv header\" rsql"
        # run the "\copy"
        system(mycmd)
    } else {
        # formulate a nicer looking POSIX command string -- still need to escape internal quotes and backslashes.
        mycmd <- "psql -U csde -c \"\\copy gps.tracks from '../data/gpslogger_dropped_cols.csv' with csv header\" rsql"
        # run the "\copy"
        system(mycmd)
    }
}
```

It does take some time for the table to read and write, but the command succeeds. We can verify:

```{r gpsverify}
dbGetQuery(conn = rsql, statement = "select count(*) from gps.tracks")
```
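If `psql` is not available on the system, the same load can be done entirely from R with `dbWriteTable()`. This is slower than `\copy` for a table of this size but avoids building a shell command. A minimal sketch (assuming the `gps` data frame from the chunk above, with column names matching the `gps.tracks` definition; not evaluated here):

```{r gps_dbwritetable, eval=FALSE}
# portable alternative to psql \copy: append the data frame to the existing table
res <- dbWriteTable(conn = rsql, name = c("gps", "tracks"), value = gps,
                    row.names = FALSE, append = TRUE)
```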
# Queries
## Race from Add Health
To get the count of respondents who identified as White and/or African American we will perform a simple query. First, to speed up the query we will make a few indexes on the table. Indexes can greatly increase the speed of queries:

```{r idx1}
X <- dbGetQuery(conn = rsql, statement = "create index if not exists idx_addhealth_aid on addhealth.data_21600_0001 using btree(aid);")
X <- dbGetQuery(conn = rsql, statement = "create index if not exists idx_addhealth_varname on addhealth.data_21600_0001 using btree(varname);")
X <- dbGetQuery(conn = rsql, statement = "create index if not exists idx_addhealth_val on addhealth.data_21600_0001 using btree(val);")
```

The query itself is fairly simple: we want the total number of unique `aid`s, the count of records where `h1gi6a` = 1 (White), and the count of records where `h1gi6b` = 1 (African American). We will be using common table expressions to simplify the steps.

```{r query_addhealth}
# establish the connection
rsql <- connectdb(dbname = "rsql", username = "csde")

sql <- "with
--count of distinct IDs
nrec as (select count(distinct aid) as n_total from addhealth.data_21600_0001)
--count of self-identified White
, nwhite as (select count(*) as n_white from addhealth.data_21600_0001 where varname = 'h1gi6a' and val = 1)
--count of self-identified African American
, naa as (select count(*) as n_afram from addhealth.data_21600_0001 where varname = 'h1gi6b' and val = 1)
--combine and calculate percentages
select n_total
, n_white
--count of White / total * 100, rounded to 2 places as pct_white
, round((n_white / n_total::numeric) * 100, 2) as pct_white
, n_afram
, round((n_afram / n_total::numeric) * 100, 2) as pct_afram
from nrec, nwhite, naa;"

# only run once per R session
if(!exists("addhealthsumdat") || !exists("elapsed")){
    t0 <- Sys.time()
    addhealthsumdat <- dbGetQuery(conn = rsql, statement = sql)
    t1 <- Sys.time()
    elapsed <- round(difftime(t1, t0, units = "secs"), 1)
}

kable(x = addhealthsumdat, format = "html") %>% 
    kable_styling(bootstrap_options = c("striped", "hover"), full_width = F, fixed_thead = T, position = "left")
```

In a way this is a silly example because it may very well have taken less time in base R to run this summary. It took `r elapsed` seconds to run on my laptop (16 GB RAM). However, one needs to also consider the time necessary for reading a large `dta` file into R memory. For the SQL process, the data have already been loaded into the database, which only needs to happen once.

## Count of GPS data per day
Here we already have the column `calendar_date` in the table. We will make a simple count of records based on that.

```{r gpsdate}
sql <- "select calendar_date, count(*) from gps.tracks group by calendar_date order by calendar_date;"
gpsdate <- dbGetQuery(conn = rsql, statement = sql)
DT::datatable(data = gpsdate)
```

Suppose we also wanted to know the mean and standard deviation of the `accuracy` variable, ordered by mean accuracy in descending order:

```{r gpsdate_accuracy}
sql <- "select calendar_date
, count(*)
, avg(accuracy) as mean_accuracy
, stddev(accuracy) as sd_accuracy
from gps.tracks
group by calendar_date
order by avg(accuracy) desc;"
gpsdate_acc <- dbGetQuery(conn = rsql, statement = sql)
DT::datatable(data = gpsdate_acc)
```
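For comparison, the same summary can be computed directly in R from the wide data frame. A minimal sketch, assuming `ah` (read in earlier) is still in memory (not evaluated here):

```{r addhealth_dplyr, eval=FALSE}
# equivalent summary in R on the wide data frame
ah %>% 
    summarise(n_total = n_distinct(aid),
              n_white = sum(h1gi6a == 1, na.rm = TRUE),
              n_afram = sum(h1gi6b == 1, na.rm = TRUE)) %>% 
    mutate(pct_white = round(n_white / n_total * 100, 2),
           pct_afram = round(n_afram / n_total * 100, 2))
```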
## Spatial SQL with PostGIS
This analysis will count the number of GPS points within each census block group. The census block groups are from a tabular dump from another PostGIS database. First we will restore the dump file to the database. This creates the table `census.acs5_2018_bg`.

```{r census}
# does the table exist?
if(!dbTableExists(conn = rsql, table_schema = "census", table_name = "acs5_2018_bg")){
    # unzip file if necessary
    if(!file.exists("../data/census.acs5_2018_bg.dump") & file.exists("../data/census.acs5_2018_bg.dump.zip")){
        unzip(zipfile = "../data/census.acs5_2018_bg.dump.zip", exdir = "../data")
    }
    
    # read in a dump file
    if(Sys.info()['sysname'] == "Windows"){
        # formulate a horrible windows command string
        mycmd <- "cmd /c C:\\PROGRA~1\\PostgreSQL\\12\\bin\\psql -U csde -f ../data/census.acs5_2018_bg.dump rsql"
        # restore
        system(mycmd)
    } else {
        # formulate a nicer looking POSIX command string
        mycmd <- "psql -U csde -f ../data/census.acs5_2018_bg.dump rsql"
        # restore
        system(mycmd)
    }
}
```

Let's check that the table exists in the database:

```{r list_tables_spatial}
dbListTables(conn = rsql)
```

Before we can measure the points, we need to create a geometry column within the GPS point table:

```{r gpspoint}
sql <- "
--add the column
alter table gps.tracks add column if not exists geom geometry(point, 4326);
--update the value
update gps.tracks set geom = st_setsrid(st_makepoint(lon, lat), 4326);
--create the spatial index
create index if not exists idx_gps_tracks on gps.tracks using gist(geom);
"
res <- dbGetQuery(conn = rsql, statement = sql)
```

Now we formulate a query to create a new table with the count of GPS points per census block group, using `count(*)` with a `GROUP BY` aggregation, where containment is tested by `st_contains()`:

```{r pip}
sql <- "
--create a new table
create table if not exists gps.tracks_bg as
--select the census BG ID and count of points
select geoid, count(*) as n_points
--from a subquery
from
    (select c.geoid, g.time_gps
    --aliasing block groups as c
    from census.acs5_2018_bg as c
    --aliasing GPS as g
    , gps.tracks as g
    --the secret sauce, where the block group contains the GPS point
    where st_contains(c.geom, g.geom)) as foo
--aggregate count by block group
group by geoid
--order by number of points per BG
order by n_points desc;

--select the records
select * from gps.tracks_bg;
"

# run the query
gps.tracks_bg <- dbGetQuery(conn = rsql, statement = sql)

# print the table
DT::datatable(gps.tracks_bg)
```

This provided one simple example of how GIS functions can be used within the database. PostGIS provides a plethora of analytic methods for processing spatially referenced data. See [PostGIS Reference](https://postgis.net/docs/reference.html){target="_blank"}.
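Because both the block group polygons and the new count table live in the database, the result can also be pulled back into R as a spatial layer for mapping. A minimal sketch, assuming `census.acs5_2018_bg` stores its polygons in a `geom` column and that `sf::st_read()` accepts the open connection (not evaluated here):

```{r map_bg, eval=FALSE}
# read block groups joined to their point counts as an sf object
bg_counts <- sf::st_read(dsn = rsql, query = 
    "select c.geoid, t.n_points, c.geom
     from census.acs5_2018_bg as c
     inner join gps.tracks_bg as t using (geoid);")

# simple choropleth of GPS points per block group
ggplot(bg_counts) +
    geom_sf(aes(fill = n_points)) +
    scale_fill_viridis_c(name = "GPS points")
```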
# SQLite
## Create an SQLite database

```{r sqlite1}
# make the connection if it does not exist or is no longer valid.
# If the file does not exist, it will be created.
if(!exists("mydb") || !dbIsValid(mydb)){
    mydb <- dbConnect(RSQLite::SQLite(), "../data/rsql.sqlite")
}
```

## Copy some data into the SQLite database
We write the GPS data into the database. If the table already exists, we will get an error, so a test first will help.

```{r sqlite2}
if(!any(grepl(pattern = "gps_tracks", x = dbListTables(conn = mydb)))){
    gps <- read.csv("../data/gpslogger_dropped_cols.csv")
    res <- dbWriteTable(conn = mydb, name = "gps_tracks", value = gps, row.names = FALSE)
}
```

## Verify data

```{r sqlite3}
dbListTables(conn = mydb)
```

## Summarize data

```{r sqlite4}
gps_caldate_sum <- dbGetQuery(conn = mydb, statement = "select calendar_date, count(*) as n_records from gps_tracks group by calendar_date order by calendar_date;")
DT::datatable(gps_caldate_sum)
```

# Disconnect
Now that we are done, we will disconnect all connections.

```{r disconnectall}
# our custom PostgreSQL function
disconnectall(verbose = TRUE)

# the SQLite connection
dbDisconnect(mydb)
```
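As a closing aside, the `sqldf` package (loaded in the setup chunk but not otherwise used here) offers yet another route: it runs SQL against ordinary data frames through a temporary SQLite database, with no explicit connection object needed. A minimal sketch, assuming the `gps` data frame is still in memory (not evaluated here):

```{r sqldf_example, eval=FALSE}
# run SQL directly against an in-memory data frame via a temporary SQLite database
sqldf("select calendar_date, count(*) as n_records
       from gps
       group by calendar_date
       order by calendar_date;")
```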