Congrats on getting credentials! This document will explain how to connect to our database and how to use it effectively.
The database is hosted through Amazon Web Services (AWS) RDS. RDS is a type of AWS product that is strictly for relational databases. We are also construcitng an AWS EC2 account that alsao links to the RDS. For now, as we transition, it is easier to connect to the RDS directly. To access, users need the following information:
Item | Value |
---|---|
host url | ballot.cppysy51d7px.us-east-1.rds.amazonaws.com |
port to access | 5432 |
name of datbase | ballot |
username | [given] |
password | [given] |
Common packages
Next, we will set up a connection object. You will need a username (using kuriwaki
as one example) and password. R will prompt your for your password.
The ODBC interface is a connection to drivers that faciliates users to connect to any type of database (Postgres, Redshift, etc..). This setup is recommended by RStudio, but it requires you to have a PostgreSQL driver set up via ODBC (https://db.rstudio.com/best-practices/drivers/). If you do not, you can use the RpostgreSQL
below package instead of odbc
.
A more direct way to use a postgres-specific interface:
This databse currently has two tables - "vote"
(each observation is a vote for an office), and "person"
(each observation is a voter). Let’s explore the databse. We will use the dbplyr
package that allows execution of dplyr
verbs on a remote database.
To retrive a table from a database, we use the "tbl"
command:
If using odbc, you can use dbplyr
’s in_schema
syntax
This returns an object that can be used much like a data frame. In this example below, we count the number of votes for each candidate in the 2016 Republican Presidential Primary.
This will give :
## Source: lazy query [?? x 6]
# Database: postgres [kuriwaki@localhost:/ballot]
elec st voter_id choice_id choice_name contest_name
<date> <chr> <chr> <int> <chr> <chr>
1 2016-11-08 SC 45087-09449 46 Melanie Lawson Clerk of Court
2 2016-11-08 SC 45087-09449 53 Kacie McCall Petrie CCL0006 County Council District 6
3 2016-11-08 SC 45087-09449 56 Mike Scales Soil and Water District Commissioners
4 2016-11-08 SC 45087-09449 62 Kakie Nichols White SCH0006 School Board District 6
5 2016-11-08 SC 45087-09449 69 Robert Fred Small MUNUN00 Union Mayor
6 2016-11-08 SC 45087-09449 78 Ricky Todd Harris CTYUN04 Union City Council Dist 4
7 2016-11-08 SC 45087-09449 81 Yes, in favor of the question SALES TAX REFERENDUM
8 2016-11-08 SC 45087-09450 14 Donald J Trump President and Vice President
9 2016-11-08 SC 45087-09450 22 Tim Scott U S SENATOR
10 2016-11-08 SC 45087-09450 28 Mick Mulvaney CON0005 U.S. House of Reps District 5
# ... with more rows
Tables can be combined for custom browsing:
# the person table has information about at the voter-level (e.g. location)
person <- tbl(con, "person")
# get votes for 2012 Generals in Charleston county
charleston_2012 <- person %>%
filter(elec == "2012-11-06", county == "Charleston") %>%
select(elec, county, precinct_id, voter_id) %>%
left_join(vote, by = c("elec", "voter_id"))
Commands that compute the whole dataset take a lot of time to compute. For example, tally
and count
should be used sparingly. This takes some time to run.
Here, we see that Obama got more votes than Romney in 2012, Charleston:
If not via R, you can directly connect from the Terminal. For example, for the username kuriwaki
,
psql --host=ballot.cppysy51d7px.us-east-1.rds.amazonaws.com \
--port=5432 \
--username=kuriwaki \
--dbname=ballot;
you need postgresql program and its proper path to use this command.
Once connected, you can look at tables from the Terminal.
psql (10.4, server 9.6.6)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.
The \d
command in Postgres will display the list of tables in the dataset.
ballot=> \d
List of relations
Schema | Name | Type | Owner
--------+---------------+-------+----------
public | EL155_SC_samp | table | kuriwaki
public | candidate | table | kuriwaki
public | person | table | kuriwaki
public | samp | table | kuriwaki
public | vote | table | kuriwaki
(5 rows)
And you can look up tables from SQL syntax. For example, this takes all columns from the vote
table, limiting to just ten rows.
ballot=> select * from vote limit 10;
elec | st | voter_id | choice_id | choice_name | contest_name
------------+----+------------+-----------+------------------+--------------
2016-06-14 | SC | 45059-0001 | 1 | Ricky W Chastain | Sheriff
2016-06-14 | SC | 45059-0001 | 5 | David Tribble Jr | Auditor
2016-06-14 | SC | 45059-0002 | 2 | Don Reynolds | Sheriff
2016-06-14 | SC | 45059-0002 | 5 | David Tribble Jr | Auditor
2016-06-14 | SC | 45059-0003 | 2 | Don Reynolds | Sheriff
2016-06-14 | SC | 45059-0003 | 4 | Jim Coleman | Auditor
2016-06-14 | SC | 45059-0004 | 2 | Don Reynolds | Sheriff
2016-06-14 | SC | 45059-0004 | 4 | Jim Coleman | Auditor
2016-06-14 | SC | 45059-0005 | 2 | Don Reynolds | Sheriff
2016-06-14 | SC | 45059-0005 | 4 | Jim Coleman | Auditor
If you’ve given a username for the server
Item | Value |
---|---|
host url | ec2-35-171-26-70.compute-1.amazonaws.com |
username | [given] |
password | [given] |
To get into the EC2 server, you need a separate user that has access to PEM file issued by AWS. Suppose your PEM file is located is called ec2aslett.pem
. Then, moving to the directory where that pem file is located, you can ssh in by noting the EC2 server.
To login in the terminal with a PEM file:
Working without a PEM could work:
This EC2 server also has a Rstudio Server wrapped around it (http://louisaslett.com/RStudio_AMI/). In this setup, the EC2 server leads directly to a RStudio IDE: