I’ve been doing some pro bono data and statistics work for a nonprofit focused on keeping constituents informed of how well their elected representatives are representing them between elections. The goal is to allow users to vote along with their representatives and to receive information on how aligned their views are with the votes of their representatives.
We are trying to maintain and analyze data from a number of different sources: about the users (e.g. congressional district, join date); about the bills being voted on; about user engagement; and about numerous other topics. In most of my past work, I was given a dataset to analyze. Now I’m faced with the challenge of building up a database to hopefully maintain and use well into the future.
To that end, I’ve set up a (local)
PostgreSQL database. I’ve had passing encounters with
SQL in the past, but this is my first time working with it in any serious capacity. I’m starting from scratch. I mostly followed the Official PostgreSQL Tutorial, which provides something of a “quick start” guide. I pulled information and tips from a bunch of other sources over the course of setting up the database. This is my attempt to organize that information so I can reproduce it in the future if needed.
I am using Arch Linux, so the installation and setup notes are specific to that context. I mostly followed the materials posted on the Arch Linux wiki.
First, install the
sudo pacman -S PostgreSQL
Switch to the PostgreSQL user
$ su $ su -l postgres
Initialize the database cluster
$ initdb -D /var/lib/postgres/data
Return to regular user (with
systemctl start postgresql.service systemctl enable postgresql.service
Make a user
Make a database
We use the
-Oflag to assign the owner of the database. In the previous step, I created a user named
dan. I assigned the new database (
createdb <database-name> -O dan
Basic Operations with
postgresql package comes with
psql, a PostgreSQL interactive terminal.
psql allows us to interactively work with our PostgreSQL databases – we can create, modify, and query databases from the
psql terminal. It also provides various tools for working with scripts and for automating commonly-used procedures.
We can access a given database, from bash, as follows:
psql -d <database-name> -U dan
psql provides many useful tools that are not part of the
SQL language. A sampling of useful utilities include:
\help: get help
\c <database>: connect to database
\du: list users and permissions
<C-d>: exit the
\?: list other meta-commands
Creating a Table
Tables can be created as follows (the below is written in
SQL and is not a
psql terminal command, though it can be entered directly into the
CREATE TABLE <table-name> ( col1 varchar(80), -- variable-length character col2 int, -- integer col3 timestamp, -- date/time col4 real, -- real number col5 date -- date );
Each line above includes the column name (e.g.
col1), the variable type (e.g.
varchar(80)), and a comment (text following
--). This basic syntax creates an empty table within our database.
Adding Some Data to the Table
There are numerous ways to enter data into the table. We can make use of the
INSERT INTO commands:
INSERT INTO <table-name> VALUES ('San Francisco', 46, '2014-11-17 01:22:50', 0.25, '1997-01-13');
This will fill in one row of data. It assumes we have positioned the data according to the order of the columns in the table. We can also specify exactly which columns we are filling (for example, if we don’t know the order):
INSERT INTO <table-name> (col1, col2, col3, col4, col5) VALUES ('San Francisco', 46, '2014-11-17 01:22:50', 0.25, '1997-01-13');
In my case, I had some
CSV files I wanted to load into the database. This can be accomplished with the
COPY command or with the
\copy command. I opted to use the
\copy approach, following a lot of advice I found from various sources online.
\copy <table-name>(col1, col2, col3, col4, col5) FROM '/path/to/file/' DELIMITER ',' CSV HEADER;
We specify the (already-created but as-of-yet unpopulated) table in our database to which we want to import the data first, and the file source second. By noting that the
CSV has a header, we tell the
\copy command to omit the first row.
The above represents my very first steps into the world of database design and
SQL. Much more will follow. I have set up several tables in this database have connected them with primary and foreign keys – a concept I will explore further later. I plan, in the near future, to figure out how to interact with this database using
R; how to set up easily-reproducible queries to answer common questions of interest; and how, in general, to make the best use of the database in a data science and statistics workflow.