Section 9 sqldf()

The sqldf() function can be used to run SQL syntax on data frames in the R environment. By default, if no database connection specifications are made, R will use an SQLite connection to a database stored in RAM. Thus, with no connection specifications, SQLite syntax will be used. This may be a problem if you are trying to perform functions such as select * from a full join b using(id) because SQLite does not have a native full join.

For this reason I have included lines in the script for PostgreSQL connections to set some sqldf options:

# options for using sqldf (connect to PostgreSQL rather than SQLite)
sqldf.pg.options <- function(dbname="test", host="localhost", port=5432, user=Sys.getenv("USER") ){
    # get the password
    if(Sys.info()['sysname'] == "Windows"){
        pgpass = scan(file.path(Sys.getenv("APPDATA"), "postgresql/pgpass.conf"), what = "character")
    } else {
        pgpass = scan("~/.pgpass", what = "character")
    }
    passwd <- strsplit(pgpass, ":")[[1]][5]

    3 set the options
    options(
        sqldf.RPostgreSQL.user = user,
        sqldf.RPostgreSQL.password = passwd,
        sqldf.RPostgreSQL.dbname = dbname,
        sqldf.RPostgreSQL.host = host,
        sqldf.RPostgreSQL.port = port
    )
}
.Options$sqldf.RPostgreSQL.password <- NULL

In this code, the default database to connect to is test and the default user is the currently logged on user. Also, the function assumes you have set up the pgpass file for passwordless database connections. To use this function for this workshop, you would use the function as

sqldf.pg.options(dbname = "rsql", user = "csde")

In this final exercise, we will read a few GPS records from the rsql database and then ise sqldf() to perform some simple opertions. We will use the custon sqldf.pg.options() as described above to show that we can run ad hoc SQL-formatted queries on data frames. We will also show the difference between using the SQLite and PostgreSQL drivers in the use of sqldf().

Add these lines to your Rmd file:

<!--<start rmd_add_07.txt>-->
# `sqldf()`
To demonstrate the use of `sqldf()` we will start by using a small set of GPS records from the previous database.

```{r gps_sel}
# filename, time
gps_10_records_meta <- dbGetQuery(conn = rsql, statement = "select fname, time_gps from gps.tracks limit 10;")
# time, lat, long
gps_10_records_ll <- dbGetQuery(conn = rsql, statement = "select time_gps, lat, lon from gps.tracks limit 10;")

sql <- "select a.fname, a.time_gps, b.lat, b.lon from gps_10_records_meta as a full join gps_10_records_ll as b on a.time_gps = b.time_gps;"
```

The query we want to run with `sqldf()` is intended to join the two tables based on common time stamp:

```
`r sql`
```

Using `sqldf()` without connection parameters usually instantiates an SQLite database stored in RAM. However, if a PostgreSQL connection has already been made, the default connection for `sqldf()` is with a database named "test" and the user "postgres". _[Note: the functions below use `verbose = TRUE` to list the sqldf details.]_

```{r sqldf_pg1}
sqldf(x = sql, verbose = TRUE)
```

This fails with a new or restarted R session because there is no database named "test" and we have not provided a password for the user "postgres". _[If you have already set sqldf options, or if you render again without restarting R the sqldf()` statement above should actually produce desired results.]_

Let's see if SQLite will work. To use a specific driver, include the `drv = ...` option. In this example we specify `SQLite`.

```{r sqldf_sqlite}
sqldf(x = sql, drv = "SQLite", verbose = TRUE)
```

The query fails because SQLite does not allow for full joins.

Here we can try with the `PostgreSQL` driver and also specifying the database and user names:

```{r sqldf_pg2}
sqldf(x = sql, drv = "PostgreSQL", dbname = "rsql", user = "csde")
```

The query fails because `sqldf()` requires an explicit password (i.e., it does not use the `pgpass` file as does `dbConnect()`, etc.). To get around this, we can set up some options from the `dbconnect.R` script.

The workaround here is to set several R global options, which are wrapped in the `dbconnect.R` script. _[Note: the script assumes that the `pgpass` file has been set up.]_

Now the query should produce desired results.

```{r sql_pg3}
# load the options
sqldf.pg.options(dbname = "rsql", user = "csde")

# run the query
junk <- sqldf(x = sql, verbose = TRUE)

DT::datatable(junk)
```

What we have just done is to use the PostgreSQL engine to peform a query on a data frame that is stored in R's environment, and to return a data frame (`junk`) as a result.

<hr class="new4">

# Disconnect

Now that we are done, we will disconnect all connections.

```{r disconnectall}
# our custom PostgreSQL function
disconnectall(verbose = TRUE)

dbDisconnect(mydb)
```

<hr class="new4">
<!--<end rmd_add_07.txt>-->

Restart the R session (in RStudio select Session > Restart R from the menu). This will clear out any objects or settings.

Next, render the Rmd file:

rmarkdown::render("rmd/r_sql.Rmd")