Nowadays everybody is talking about columnar storage formats. What can PostgreSQL do in this area? There is no native support for that in PostgreSQL but, thanks to the fact that PostgreSQL is highly extensible, there is a foreign data wrapper called cstore_fdw. Let's take a look at what it can do.
To install cstore_fdw we first need the protobuf-c-devel package, which is available in the EPEL repository if you are on a Red Hat based distribution:
wget http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-5.noarch.rpm
yum localinstall epel-release-7-5.noarch.rpm
yum install protobuf-c-devel
Once this is installed we can continue with installing the extension. Obviously we need the sources as a first step:
su - postgres
cd /var/tmp
git clone git://github.com/citusdata/cstore_fdw
cd cstore_fdw
Building is straightforward and easy:
export PATH=[YOUR_PG_HOME]/bin/:$PATH
make
make install
That’s it. Now we need to add the extension to PostgreSQL:
grep shared_pre postgresql.conf
shared_preload_libraries = 'cstore_fdw'		# (change requires restart)
After a restart of the database cstore_fdw will be available:
postgres=# CREATE EXTENSION cstore_fdw;
CREATE EXTENSION
Ok, and now? Let's create our first columnar store table:
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;
CREATE FOREIGN TABLE demo2
  ( id int
  , text varchar(15)
  , number int
  , time timestamp
  , text1 varchar(100)
  , text2 varchar(100)
  , text3 varchar(100)
  , text4 varchar(100)
  , text5 varchar(100)
  , text6 varchar(100)
  )
SERVER cstore_server
OPTIONS (compression 'pglz');
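By the way: compression is not the only option here. According to the cstore_fdw README the wrapper also accepts filename, stripe_row_count and block_row_count options. A sketch using the documented default values (demo3 is just a hypothetical example table, not something we'll use below):

```sql
CREATE FOREIGN TABLE demo3
  ( id int
  , number int
  )
SERVER cstore_server
OPTIONS (compression 'pglz'
       , stripe_row_count '150000'
       , block_row_count '10000');
```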
Let's create some sample data (10 million rows):
\timing on
drop table if exists demo;
create table demo
  ( id int
  , text varchar(15)
  , number int
  , time timestamp
  , text1 varchar(100)
  , text2 varchar(100)
  , text3 varchar(100)
  , text4 varchar(100)
  , text5 varchar(100)
  , text6 varchar(100)
  );
DO $$DECLARE
  l_people_array varchar(15)[12] := '{"Marc", "Bill", "George", "Eliot", "Matt", "Trey", "Tracy","Greg", "Steve", "Kristina", "Katie", "Jeff"}';
  n int;
  linterval varchar(20);
  text varchar(100) := md5('aaa');
BEGIN
  for i in 0..1e7 loop
    n := trunc(random()*1000+1);
    linterval := n||' days';
    insert into demo values
      ( i
      , l_people_array[(floor(random()*11)+1)::int]
      , n
      , current_timestamp + linterval::interval
      , text
      , text
      , text
      , text
      , text
      , text
      );
  end loop;
END$$;
To load the data into the columnar table I'll write the demo table to a flat file using copy as usual (this will take some time for this amount of rows):
postgres=# copy demo to '/home/postgres/column_store_data.txt';
COPY 10000001
Time: 219766.916 ms
The resulting file is 2.3 GB:
-rw-r--r--. 1 postgres postgres 2.3G May 13 15:55 column_store_data.txt
Now we can load the same data to the demo2 table:
postgres=# copy demo2 from '/home/postgres/column_store_data.txt';
COPY 10000001
Time: 46389.124 ms
postgres=#
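As an aside, the detour over a flat file is not strictly required: cstore_fdw also accepts INSERT INTO ... SELECT, so the data could be loaded directly from the source table:

```sql
insert into demo2 select * from demo;
```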
First, let's see how much smaller the demo2 table really is:
postgres=# select pg_size_pretty(pg_relation_size('demo'));
 pg_size_pretty
----------------
 2441 MB
(1 row)

Time: 0.366 ms
postgres=# \! ls -lha $PGDATA/cstore_fdw/13056
total 153M
drwx------. 2 postgres postgres   37 May 13 15:58 .
drwx------. 3 postgres postgres   18 May 13 15:04 ..
-rw-------. 1 postgres postgres 153M May 13 15:58 17247
-rw-------. 1 postgres postgres 1.2K May 13 15:58 17247.footer
Not bad: 153MB compared to 2441MB for the traditional table. Now let's look at the response time for the traditional table:
postgres=# select id,text,number from demo where id = 50;
 id | text | number
----+------+--------
 50 | Trey |    625
(1 row)

Time: 26842.250 ms
postgres=# select id,text,number from demo where id = 50;
 id | text | number
----+------+--------
 50 | Trey |    625
(1 row)

Time: 28530.903 ms
postgres=# select id,text,number from demo where id = 50;
 id | text | number
----+------+--------
 50 | Trey |    625
(1 row)

Time: 29815.443 ms
The column store should be faster:
postgres=# select id,text,number from demo2 where id = 50;
 id | text | number
----+------+--------
 50 | Trey |    625
(1 row)

Time: 631.668 ms
postgres=# select id,text,number from demo2 where id = 50;
 id | text | number
----+------+--------
 50 | Trey |    625
(1 row)

Time: 20.740 ms
postgres=# select id,text,number from demo2 where id = 50;
 id | text | number
----+------+--------
 50 | Trey |    625
(1 row)

Time: 15.368 ms
And yes, it is much faster: less than a second compared to around 28 seconds. Note that you cannot create an index on the columnar table:
postgres=# create index i2 on demo2(id);
ERROR:  cannot create index on foreign table "demo2"
Time: 0.596 ms
With an index on id, the traditional demo table responds faster than demo2:
postgres=# select id,text,number from demo where id = 50;
 id | text | number
----+------+--------
 50 | Trey |    625
(1 row)

Time: 0.706 ms
postgres=# select id,text,number from demo where id = 50;
 id | text | number
----+------+--------
 50 | Trey |    625
(1 row)

Time: 0.304 ms
postgres=# select id,text,number from demo where id = 50;
 id | text | number
----+------+--------
 50 | Trey |    625
(1 row)

Time: 0.286 ms
postgres=#
But of course this is only true if you query only a few rows. As soon as we query more data it is the other way around:
postgres=# select sum(number), avg(id) from demo2 where id between 666 and 6666666;
    sum     |         avg
------------+----------------------
 3336866872 | 3333666.000000000000
(1 row)

Time: 1485.294 ms
postgres=# select sum(number), avg(id) from demo2 where id between 666 and 6666666;
    sum     |         avg
------------+----------------------
 3336866872 | 3333666.000000000000
(1 row)

Time: 939.993 ms
postgres=# select sum(number), avg(id) from demo2 where id between 666 and 6666666;
    sum     |         avg
------------+----------------------
 3336866872 | 3333666.000000000000
(1 row)

Time: 938.552 ms
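If you want to verify that such a query really goes through the foreign data wrapper, the execution plan should show a Foreign Scan node on demo2 instead of a plain sequential scan:

```sql
explain select sum(number), avg(id)
  from demo2
 where id between 666 and 6666666;
```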
For the traditional table:
postgres=# select sum(number), avg(id) from demo where id between 666 and 6666666;
    sum     |         avg
------------+----------------------
 3336866872 | 3333666.000000000000
(1 row)

Time: 23447.962 ms
postgres=# select sum(number), avg(id) from demo where id between 666 and 6666666;
    sum     |         avg
------------+----------------------
 3336866872 | 3333666.000000000000
(1 row)

Time: 23488.484 ms
postgres=# select sum(number), avg(id) from demo where id between 666 and 6666666;
    sum     |         avg
------------+----------------------
 3336866872 | 3333666.000000000000
(1 row)

Time: 23660.244 ms
postgres=#
But there are restrictions currently: you can neither update nor delete from the columnar table:
postgres=# update demo2 set number = 10 where id = 5;
ERROR:  operation is not supported
Time: 51.183 ms
postgres=# delete from demo2 where id = 5;
ERROR:  operation is not supported
Time: 0.260 ms
postgres=#
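If you really need to change data in the columnar table, the only way currently is to rebuild it. A possible sketch, assuming you keep the original heap table around as the source of truth: apply the change there, then drop, recreate and reload the foreign table:

```sql
-- apply the change on the regular table
update demo set number = 10 where id = 5;
-- throw away and rebuild the columnar copy
drop foreign table demo2;
CREATE FOREIGN TABLE demo2
  ( id int
  , text varchar(15)
  , number int
  , time timestamp
  , text1 varchar(100)
  , text2 varchar(100)
  , text3 varchar(100)
  , text4 varchar(100)
  , text5 varchar(100)
  , text6 varchar(100)
  )
SERVER cstore_server
OPTIONS (compression 'pglz');
insert into demo2 select * from demo;
```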
But you can append data:
postgres=# insert into demo2 select * from demo limit 5;
INSERT 0 5
Time: 165.618 ms
postgres=#
Depending on the use case, cstore_fdw might be a great alternative (data warehouse or archive data, for example).
Want to meet our experts from all technologies in one day? Come to our event "In-Memory: boost your IT performance!", where we talk about SQL Server, Oracle and SAP HANA.