Section 5 Importing to a database using R

There are a number of different ways of “pushing” data frames from R to PostgreSQL. `dbWriteTable()` is the most straightforward. Alternatively, an SQL statement can be written to define a new table structure, and then a psql statement of the form `\copy table_schema.table_name from 'systemfile.csv' with csv header` can bulk-load the records from a CSV file.
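
In outline, the two approaches look like this (a sketch only; the connection `conn`, the data frame `df`, and all table, column, and file names are placeholders):

```r
# 1. push an existing data frame directly into a table
dbWriteTable(conn = conn, name = c("table_schema", "table_name"), value = df, row.names = FALSE)

# 2. define the table structure in SQL, then bulk-load a CSV with psql
dbExecute(conn = conn, statement = "create table table_schema.table_name (id integer, val text);")
system("psql -c \"\\copy table_schema.table_name from 'systemfile.csv' with csv header\" dbname")
```

Both approaches are demonstrated below.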

Copy the following code at the end of your Rmd file:

<!--<start rmd_add_03.txt>-->
# Writing to the database

## `dbWriteTable()`

Here we are going to read the [Add Health](https://www.cpc.unc.edu/projects/addhealth/documentation/publicdata){target="_blank"} data in as a data frame using the `readstata13::read.dta13()` function. The table is large (6504 r $\times$ 2794 c). Because PostgreSQL has a limit of 1600 columns per table, we will reshape the data into a long format using `tidyr::pivot_longer()` and finally write the long table into the database. The result of `dbWriteTable()` is a Boolean value indicating whether the table was written.
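
Before running the reshape on 18 million values, here is what `pivot_longer()` does on a toy data frame (a sketch; `q1` and `q2` are made-up variables, and the `varname`/`val` names mirror the real call below):

```{r pivot_demo}
# a 2-row x 3-column wide table becomes a 4-row x 3-column long table (aid, varname, val)
library(tidyr)
toy <- data.frame(aid = c("a1", "a2"), q1 = 1:2, q2 = 3:4)
pivot_longer(toy, -aid, names_to = "varname", values_to = "val")
```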

```{r read_dta}
# check whether the table exists -- if it does not, then run this
if(!dbTableExists(conn = rsql, table_schema = "addhealth", table_name = "data_21600_0001")){
    # unzip if necessary
    if(!file.exists("../data/21600-0001-Data.dta")){
        x <- unzip(zipfile = "../data/21600-0001-Data.zip", exdir = "../data")
    }
    # read the file in, raw (do not convert values to factors)
    ah <- readstata13::read.dta13("../data/21600-0001-Data.dta", convert.factors = FALSE)

    # lowercase the column names, better for PostgreSQL
    colnames(ah) <- tolower(colnames(ah))

    # convert all variables except the first (aid) to integer
    ah <- ah %>% mutate_at(., -1, as.integer)

    # make long
    ah1 <- ah %>% pivot_longer(-aid, names_to = "varname", values_to = "val")

    # write to the database -- may take a while because there are ~18 million records.
    #   also use row.names = FALSE to avoid making a "row.names" column
    success <- dbWriteTable(conn = rsql, name = c("addhealth", "data_21600_0001"), value = ah1, row.names = FALSE, overwrite = TRUE)
}
```
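
Whether the table was just written or already present, a quick row count confirms the load. The expected value follows from the dimensions above: $6504 \times 2793 = 18{,}165{,}672$ records (2794 columns minus `aid`, which becomes the id column):

```{r ah_verify}
# expect 6504 rows x 2793 pivoted variables = 18,165,672 records
dbGetQuery(conn = rsql, statement = "select count(*) from addhealth.data_21600_0001")
```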

## SQL and psql
Here we will use a large table of GPS records collected from my phone since Sept 10, 2019. We first read the table of point records into R to examine the column names and values. Seeing that some of the columns contain no usable data, we drop those columns, save a new CSV, and then load it into the database using the psql `\copy` statement.

Constructing a Windows system command in R is no fun at all.

We are jumping the gun a bit with respect to running SQL commands (e.g., `CREATE TABLE` and `SELECT`), but we do need to get some data into the database and verify that the data were written.

```{r gps}
# only run if gps.tracks does not exist
if(!dbTableExists(conn = rsql, table_schema = "gps", table_name = "tracks")){
    # unzip
    if(!file.exists("../data/gpslogger.csv")){
        unzip("../data/gpslogger.csv.zip", exdir = "../data")
    }

    # read the file in -- so we can see the column names and values
    gps <- read.csv("../data/gpslogger.csv", as.is = TRUE)

    # there are a few columns worth dropping because they have no data
    gps$ageofdgpsdata <- gps$dgpsid <- gps$activity <- gps$annotation <- NULL

    # save the GPS data minus dropped columns
    write.csv(x = gps, file = "../data/gpslogger_dropped_cols.csv", row.names = FALSE, na = "")

    # an SQL query to define the table
    sql <- "
    --drop the table if it exists
    drop table if exists gps.tracks;
    --create the table
    create table gps.tracks
    (fname text
        , time_gps timestamptz
        , calendar_date date
        , lat float
        , lon float
        , elevation float
        , accuracy float
        , bearing float
        , speed float
        , satellites integer
        , provider text
        , hdop float
        , vdop float
        , pdop float
        , geoidheight float
        , battery integer);"

    # run the query to make the table -- DDL statements return no rows, so the result is NULL
    res <- dbGetQuery(conn = rsql, statement = sql)

    # write to the database from the CSV
    # test OS
    if(Sys.info()['sysname'] == "Windows"){
        # formulate a horrible windows command string; PROGRA~1 is the 8.3 short
        #   name for "Program Files" (adjust the PostgreSQL version in the path to match your installation)
        mycmd <- "cmd /c C:\\PROGRA~1\\PostgreSQL\\12\\bin\\psql -U csde -c \"\\copy gps.tracks from '../data/gpslogger_dropped_cols.csv' with csv header\" rsql"
        # run the "\copy"
        system(mycmd)
    } else {
        # formulate a nicer looking POSIX command string -- still need to escape internal quotes and backslashes.
        mycmd <- "psql -U csde -c \"\\copy gps.tracks from '../data/gpslogger_dropped_cols.csv' with csv header\" rsql"
        # run the "\copy"
        system(mycmd)
    }
}
```

It takes some time to read the CSV and write the table, but the command succeeds (running psql non-interactively like this assumes it can authenticate without a password prompt, e.g., via a pgpass file). We can verify:

```{r gpsverify}
dbGetQuery(conn = rsql, statement = "select count(*) from gps.tracks")
```
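
Beyond the row count, it is worth eyeballing a few records (any small `limit` will do):

```{r gpspeek}
# spot-check the first few records
dbGetQuery(conn = rsql, statement = "select * from gps.tracks limit 5")
```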

<hr class="new4">
<!--<end rmd_add_03.txt>-->

5.1 dbWriteTable() to load data into PostgreSQL

For this exercise we will read into R the full raw data table of the public version of the Add Health data in Stata DTA format (using the readstata13 library), and then write it directly to the database.

Download the 21600-0001-Data.zip into the data folder.

5.2 SQL and psql

With this method, an SQL query is written to create an empty table by defining column names and data types. Then the records from a CSV file are loaded into the table with the psql \copy command. The columns of the CSV file must match the table definition in order and data type.
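
Schematically, with a hypothetical two-column table standing in for the real GPS table defined above:

```r
# a sketch of the two-step pattern -- schema, table, columns, and file name are hypothetical
dbExecute(conn = rsql, statement = "create schema if not exists demo;")
dbExecute(conn = rsql, statement = "create table if not exists demo.points (id integer, label text);")
# the CSV must have a header row and columns matching the definition, in order: id,label
system("psql -U csde -c \"\\copy demo.points from 'points.csv' with csv header\" rsql")
```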

For this exercise, download a table of GPS points.


Render your Rmd file with these new changes:

```r
rmarkdown::render("rmd/r_sql.rmd")
```