This is the start of a series about partitioning in PostgreSQL. For this and the following posts I will use PostgreSQL 12 (which currently is in beta) so some stuff might not work if you are on PostgreSQL 11 or even on PostgreSQL 10 when declarative partitioning was introduced. This very first post is only about preparing the data set and I decided to separate this one from the next post because it shows various things around loading data in PostgreSQL and therefore might be useful on its one.
When you are looking for free data sets “The home of the U.S. Government’s open data” is a great source. It provides free data sets for research, development or just data you can play with for whatever reason. For the scope of this and the following posts we will use the Traffic violations data set. It contains more the 1.5 millions of rows and is a good starting point for diving into PostgreSQL partitioning.
The initial import goes into a table which only contains text columns and from there we can load our partitions later on. The structure of the table is:
create table traffic_violations ( seqid text , date_of_stop text , time_of_stop text , agency text , subagency text , description text , location text , latitude text , longitude text , accident text , belts text , personal_injury text , property_damage text , fatal text , commercial_license text , hazmat text , commercial_vehicle text , alcohol text , workzone text , state text , vehicletype text , year text , make text , model text , color text , violation_type text , charge text , article text , contributed_to_accident text , race text , gender text , driver_city text , driver_state text , dl_state text , arrest_type text , geolocation text , council_districts text , councils text , communities text , zip_codes text , municipalities text );
For loading the table the easiest way is to use PostgreSQL’s copy command:
postgres=# ! ls -l Traffic_Violations.csv -rw-rw-r--. 1 postgres postgres 634161687 May 28 11:28 Traffic_Violations.csv postgres=# copy traffic_violations from '/home/postgres/Traffic_Violations.csv' with ( format csv, header true, null 'null', delimiter ','); COPY 1528078
So we have around 1.5 millions of rows with real data. As this table contains only text columns we will create a materialized view with data types that fit the data. Looking at one row of the table:
postgres=# x Expanded display is on. postgres=# select * from traffic_violations limit 1; -[ RECORD 1 ]-----------+------------------------------------------------------------------------- seqid | e13d2082-55a7-4a93-8836-173be19d2648 date_of_stop | 06/01/2019 time_of_stop | 13:38:00 agency | MCP subagency | 2nd District, Bethesda description | DRIVER FAILURE TO YIELD RIGHT-OF-WAY TO VEH. UPON EMERGING FROM DRIVEWAY location | RANDOLPH RD / NEW HAMPSHIRE latitude | 39.07592 longitude | -77.0011316666667 accident | Yes belts | No personal_injury | No property_damage | Yes fatal | No commercial_license | No hazmat | No commercial_vehicle | No alcohol | No workzone | No state | MD vehicletype | 02 - Automobile year | 2014 make | HYUN model | TK color | SILVER violation_type | Warning charge | 21-705(c) article | Transportation Article contributed_to_accident | false race | WHITE gender | F driver_city | SILVER SPRING driver_state | MD dl_state | MD arrest_type | A - Marked Patrol geolocation | (39.07592, -77.0011316666667) council_districts | 5 councils | 5 communities | 26 zip_codes | 12 municipalities | 1
The materialized view becomes:
create materialized view mv_traffic_violations ( seqid , date_of_stop , time_of_stop , agency , subagency , description , location , latitude , longitude , accident , belts , personal_injury , property_damage , fatal , commercial_license , hazmat , commercial_vehicle , alcohol , workzone , state , vehicletype , year , make , model , color , violation_type , charge , article , contributed_to_accident , race , gender , driver_city , driver_state , dl_state , arrest_type , geolocation , council_districts , councils , communities , zip_codes , municipalities ) as select seqid , to_date(date_of_stop,'MM/DD/YYYY') , time_of_stop::time , agency , subagency , description , location , latitude::numeric , longitude::numeric , accident , belts::boolean , personal_injury::boolean , property_damage::boolean , fatal::boolean , commercial_license::boolean , hazmat::boolean , commercial_vehicle::boolean , alcohol::boolean , workzone::boolean , state , vehicletype , case year when '' then null else year::smallint end , make , model , color , violation_type , charge , article , contributed_to_accident::boolean , race , gender , driver_city , driver_state , dl_state , arrest_type , geolocation::point , case council_districts when '' then null else council_districts::smallint end , case councils when '' then null else councils::smallint end , case communities when '' then null else communities::smallint end , case zip_codes when '' then null else zip_codes::smallint end , case municipalities when '' then null else municipalities::smallint end from traffic_violations;
The statement contains some “case” expressions to test for empty strings. If you do not do this you get issues like this when you cast into a specific data type:
psql: ERROR: invalid input syntax for type smallint: ""
The beauty of a materialized view is, that you can refresh whenever the underlying data set changed, e.g.:
postgres=# refresh materialized view mv_traffic_violations WITH data; REFRESH MATERIALIZED VIEW
Note that this will block access to the materialized view for the time of the refresh. You could avoid that with refreshing concurrently but that only works when you have a unique index on the materialized view. With this data set we can not create a unique index because some of the rows are duplicates, e.g.:
postgres=# select * from mv_traffic_violations where seqid='b87c908c-ce2d-4c10-89fa-ca48735af485' and date_of_stop = '2012-11-07' and time_of_stop = '05:07:00' and description = 'FAILURE OF VEH. DRIVER IN ACCIDENT TO LOCATE AND NOTIFY OWNER OF UNATTENDED VEH. OF DAMAGE' and charge = '20-105(b)' and violation_type = 'Citation'; -[ RECORD 1 ]-----------+------------------------------------------------------------------------------------------- seqid | b87c908c-ce2d-4c10-89fa-ca48735af485 date_of_stop | 2012-11-07 time_of_stop | 05:07:00 agency | MCP subagency | 2nd District, Bethesda description | FAILURE OF VEH. DRIVER IN ACCIDENT TO LOCATE AND NOTIFY OWNER OF UNATTENDED VEH. OF DAMAGE location | IFO 2531 ROSS RD latitude | 39.0222578333333 longitude | -77.04575825 accident | Yes belts | f personal_injury | f property_damage | t fatal | f commercial_license | f hazmat | f commercial_vehicle | f alcohol | f workzone | f state | MD vehicletype | 02 - Automobile year | 2002 make | SUBARU model | FORESTER color | SILVER violation_type | Citation charge | 20-105(b) article | Transportation Article contributed_to_accident | f race | HISPANIC gender | M driver_city | SILVER SPRING driver_state | MD dl_state | MD arrest_type | A - Marked Patrol geolocation | (39.022257833333,-77.04575825) council_districts | 5 councils | 5 communities | 10 zip_codes | 17 municipalities | 1 -[ RECORD 2 ]-----------+------------------------------------------------------------------------------------------- seqid | b87c908c-ce2d-4c10-89fa-ca48735af485 date_of_stop | 2012-11-07 time_of_stop | 05:07:00 agency | MCP subagency | 2nd District, Bethesda description | FAILURE OF VEH. DRIVER IN ACCIDENT TO LOCATE AND NOTIFY OWNER OF UNATTENDED VEH. OF DAMAGE location | IFO 2531 ROSS RD latitude | 39.0222578333333 longitude | -77.04575825 accident | Yes belts | f personal_injury | f property_damage | t fatal | f commercial_license | f hazmat | f commercial_vehicle | f alcohol | f workzone | f state | MD vehicletype | 02 - Automobile year | 2002 make | SUBARU model | FORESTER color | SILVER violation_type | Citation charge | 20-105(b) article | Transportation Article contributed_to_accident | f race | HISPANIC gender | M driver_city | SILVER SPRING driver_state | MD dl_state | MD arrest_type | A - Marked Patrol geolocation | (39.022257833333,-77.04575825) council_districts | 5 councils | 5 communities | 10 zip_codes | 17 municipalities | 1
So now we have the data set we can use for the partitioning examples in the next posts. To summarize, what did we do:
- https://www.data.gov/ is a great source for free data sets and we used one of them to load data into PostgreSQL
- copy is great for loading data into PostgreSQL
- The “case” statement is great for testing various conditions in a statement
- materialized view are great when you want to refresh from an underlying data set that changes, but you will need a unique index for refreshing concurrently
- You might need to cast one data type into another
In the next post we will look at the first partitioning strategy: Range partitioning.