Section 6 Querying the database

Queries are the workhorses of SQL databases: they can select records from tables, join tables, update records, alter table structure, and more. For this workshop we will focus on running a few simple queries that demonstrate how R can direct the database to perform some function, optionally returning a data frame to R.

Queries can be made on the database using the syntax

res <- dbGetQuery(conn, statement)

where conn is an existing database connection and statement is a valid SQL statement. Again, the statement can specify data management, selection, summarization, joins, etc.

For example, to select the first 5 records from the GPS data we would use

gps5 <- dbGetQuery(conn = rsql, statement = "select * from gps.tracks limit 5;")

One needs to take special care when using dbGetQuery() as a user with high-level privileges, because queries can be run that delete records, drop tables, and even destroy entire databases.
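
If a statement could modify data, one defensive pattern is to run it inside a transaction so the effect can be inspected and, if necessary, undone. Below is a minimal sketch using DBI's transaction functions; the table scratch.test_table is a hypothetical example, not part of the workshop data.

```r
library(DBI)

# open a transaction so any changes can be undone
dbBegin(conn = rsql)

# a potentially destructive statement (scratch.test_table is hypothetical)
dbExecute(conn = rsql, statement = "delete from scratch.test_table where val < 0;")

# inspect the effect before deciding whether to keep it
dbGetQuery(conn = rsql, statement = "select count(*) from scratch.test_table;")

# undo the change; use dbCommit(conn = rsql) instead to keep it
dbRollback(conn = rsql)
```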

6.1 Summarizing (tabulation)

Handling large data sets in R can be unwieldy because entire files must typically be read into memory. In this exercise we will perform summaries on the Add Health and GPS data that were written to the database.

6.1.1 Race from Add Health

The Add Health variable that represents identifying as White is H1GI6A, and the one for identifying as African American is H1GI6B. We will make a count of responses for each (where the value of 1 indicates the respondent identified with that racial category).

6.1.2 Count of GPS records per calendar date

The GPS data have a column calendar_date that will be used to tabulate the count of records per date. We will also summarize accuracy by calendar date.


While some of these examples are a bit silly, they point to the utility of pre-loading the data into the database. It took quite some time to load the GPS data from a CSV file into a data frame; because the data were written to the database once, that load time does not need to be paid again each time a summary query is run.

Particularly for very large tables, the gain from pre-loading into a database can be substantial. Nevertheless, one should always consider the trade-offs. One of the benefits of PostgreSQL and other enterprise databases is that they can be set up in client-server mode, allowing multiple users to connect to and query the same data concurrently.

Add these lines to the end of your Rmd file:

<!--<start rmd_add_04.txt>-->
# Queries

## Race from Add Health

To get the count of respondents who identified as White and/or African American we will perform a simple query.

First, to speed up the query we will create a few indexes on the table; indexes can greatly increase the speed of queries. Because `create index` returns no rows, we use `dbExecute()` rather than `dbGetQuery()`:

```{r idx1}
X <- dbGetQuery(conn = rsql, statement = "create index if not exists idx_addhealth_aid on addhealth.data_21600_0001 using btree(aid);")
X <- dbGetQuery(conn = rsql, statement = "create index if not exists idx_addhealth_varname on addhealth.data_21600_0001 using btree(varname);")
```
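
To verify that the planner actually uses an index, you can ask PostgreSQL for its query plan with `explain`. A minimal sketch; the exact plan text will vary with table size and statistics, but an "Index Scan" or "Bitmap Index Scan" line indicates the index is being used.

```{r idx_explain}
# ask PostgreSQL how it will execute a query that filters on varname
plan <- dbGetQuery(conn = rsql, statement = "explain select * from addhealth.data_21600_0001 where varname = 'h1gi6a';")
print(plan)
```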

The query itself is fairly simple: we just want the total number of unique `aid`s and the count of records where `h1gi6a` = 1 and the count of records where `h1gi6b` = 1. We will be using common table expressions to simplify the steps.

```{r query_addhealth}
# establish the connection
rsql <- connectdb(dbname = "rsql", username = "csde")

sql <- "with
--count of distinct IDs
nrec as (select count(distinct aid) as n_total from addhealth.data_21600_0001)
--count of self-identified White
, nwhite as (select count(*) as n_white from addhealth.data_21600_0001 where varname = 'h1gi6a' and val = 1)
--count of self-identified African American
, naa as (select count(*) as n_afram from addhealth.data_21600_0001 where varname = 'h1gi6b' and val = 1)
--combine and calculate percentages
select n_total
, n_white
--count of White / total * 100, rounded to 2 places as percent
, round((n_white / n_total::numeric) * 100, 2) as pct_white
, n_afram
, round((n_afram / n_total::numeric) * 100, 2) as pct_afram
from nrec, nwhite, naa;"

# only run once per R session
if(!exists("addhealthsumdat")){
    t0 <- Sys.time()
    addhealthsumdat <- dbGetQuery(conn = rsql, statement = sql)
    t1 <- Sys.time()
    elapsed <- round(difftime(t1, t0, units = "secs"), 1)
}
kable(x = addhealthsumdat, format = "html") %>% kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE, fixed_thead = TRUE, position = "left")

```

In a way this is a silly example because it may very well have taken less time to run this summary in base R. It took `r elapsed` seconds to run on my laptop (16 GB RAM). However, one also needs to consider the time necessary for reading a large `dta` file into R memory. For the SQL process, the data have already been loaded into the database, which only needs to happen once.
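
For comparison, one could time that one-time load step directly. A hypothetical sketch using `haven::read_dta()`; the file path is an assumption and should be adjusted for your setup (hence `eval=FALSE`):

```{r time_dta, eval=FALSE}
library(haven)

# hypothetical location of the Add Health Stata file; adjust for your setup
t0 <- Sys.time()
addhealth_raw <- read_dta(file = "data/21600-0001-Data.dta")
t1 <- Sys.time()
round(difftime(t1, t0, units = "secs"), 1)
```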

## Count of GPS data per day

The table already has a `calendar_date` column; we will make a simple count of records grouped by that column.

```{r gpsdate}
sql <- "select calendar_date, count(*) from gps.tracks group by calendar_date order by calendar_date;"

gpsdate <- dbGetQuery(conn = rsql, statement = sql)

DT::datatable(data = gpsdate)
```

Suppose we also wanted to know the mean and standard deviation of the `accuracy` variable, ordered by mean accuracy in descending order:

```{r gpsdate_accuracy}
sql <- "select
calendar_date
, count(*)
, avg(accuracy) as mean_accuracy
, stddev(accuracy) as sd_accuracy
from gps.tracks
group by calendar_date
order by avg(accuracy) desc;"

gpsdate_acc <- dbGetQuery(conn = rsql, statement = sql)

DT::datatable(data = gpsdate_acc)
```
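
These summary queries return only a handful of rows, but a query against `gps.tracks` itself could return millions. Rather than pulling everything into memory with `dbGetQuery()`, DBI can fetch results incrementally. A minimal sketch (`eval=FALSE` because the per-chunk processing is left open):

```{r gps_chunked, eval=FALSE}
# send the query without retrieving any rows yet
res <- dbSendQuery(conn = rsql, statement = "select * from gps.tracks;")

# retrieve and process the result 10,000 rows at a time
while(!dbHasCompleted(res)){
    chunk <- dbFetch(res, n = 10000)
    # ... process chunk here ...
}

# release the server-side result set
dbClearResult(res)
```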

<hr class="new4">
<!--<end rmd_add_04.txt>-->

and then render the file.

rmarkdown::render("rmd/r_sql.rmd")