Now that we have the basic knowledge for operating a Greenplum cluster, we should talk about why Greenplum can be an option and what the benefits of such a system are. Again, here are the previous posts: Getting started with Greenplum – 1 – Installation, Getting started with Greenplum – 2 – Initializing and bringing up the cluster, Getting started with Greenplum – 3 – Behind the scenes, Getting started with Greenplum – 4 – Backup & Restore – databases, Getting started with Greenplum – 5 – Recovering from failed segment nodes.
First of all, Greenplum is made for data analytics and data warehousing; it is not meant as an OLTP system. Greenplum describes itself as an MPP system, which stands for “Massively Parallel Processing”. The idea behind that is quite simple: use as many hosts as you can, each of them with its own CPUs, memory, disks, and operating system, process the required work in parallel on those nodes, combine the results, and give them back to the client. This is also known as a “shared nothing” architecture.
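You can see this pattern directly in an execution plan. As a small sketch (t1 is the table we’ll create further down, distributed by id across the two segments of this demo cluster):
postgres=# explain (costs off) select count(*) from t1;
-- look for a "Gather Motion 2:1" node at the top of the plan: both segments
-- scan and aggregate their local portion of t1 in parallel, and the
-- coordinator combines the partial results before returning the final count
That is exactly the “process in parallel, then combine” part of the MPP story.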
Coming back to the setup we have running now:
|-------------------|
|                   |
| 5432|Coordinator  |----|
|                   |    |
|-------------------|    |
                         |         |-------------------|
                         |---->6000|primary            |
                         |         |    Segment 1      |
                         |     7000|mirror             |<-- replica of Segment 2 primary
                         |         |-------------------|
                         |
                         |         |-------------------|
                         |---->6000|primary            |
                                   |    Segment 2      |
                               7000|mirror             |<-- replica of Segment 1 primary
                                   |-------------------|
We have the coordinator node in front, which receives all the client requests (the coordinator host can also be made fault tolerant by adding a standby coordinator host). Work is then distributed to the segment nodes you have available in the cluster. In my case these are just small VMs running on the same host, so it will not give me any benefit performance wise. In a real setup all those nodes would run either on bare metal or at least on different host nodes in a virtualized setup. The more segment nodes get added to the cluster, the more compute resources become available. A critical part in such a setup is of course the networking: all the traffic goes through the network, and the faster the network between the Greenplum nodes (which is called the interconnect), the better the whole cluster will perform.
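You can see exactly this layout from within psql by querying gp_segment_configuration in the Greenplum catalog. Content ID -1 is the coordinator; every other content ID appears twice, once as primary (role “p”) and once as mirror (role “m”):
postgres=# select content, role, port, hostname
from gp_segment_configuration
order by content, role;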
Let’s assume we got all these building blocks right. Do we need to consider more points to get the most out of such a setup? Yes, we do.
In PostgreSQL all tables are heap organized. With Greenplum you can choose between heap organized and append optimized tables. Heap organized tables should be used when you expect frequent updates and deletes; append optimized tables should be used for initial load tables which only receive bulk inserts after loading.
The following statements create two simple tables, one heap organized, the other one append optimized, both distributed by id:
postgres=# create table t1 ( id int primary key
, dummy text
)
using heap
distributed by (id);
CREATE TABLE
postgres=# create table t2 ( id int primary key
, dummy text
)
using ao_row
distributed by (id);
CREATE TABLE
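If you later want to verify how a table is distributed, the distribution key is recorded in the catalog. One way to check is the gp_distribution_policy catalog table:
postgres=# select localoid::regclass, policytype, distkey
from gp_distribution_policy
where localoid in ('t1'::regclass, 't2'::regclass);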
Populating both tables with the same amount of data and comparing the size of the tables gives this:
postgres=# \timing
Timing is on.
postgres=# insert into t1 select i, i::text from generate_series(1,1000000) i;
INSERT 0 1000000
Time: 2639.981 ms (00:02.640)
postgres=# insert into t2 select i, i::text from generate_series(1,1000000) i;
INSERT 0 1000000
Time: 2878.901 ms (00:02.879)
postgres=# select pg_relation_size('t1');
pg_relation_size
------------------
44072960
(1 row)
Time: 1.394 ms
postgres=# select pg_relation_size('t2');
pg_relation_size
------------------
25226336
(1 row)
Time: 2.035 ms
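As a side note: every table carries a virtual gp_segment_id column which tells you on which segment a row is stored, so you can quickly verify that the hash distribution on id is reasonably even (the exact counts will of course differ from cluster to cluster):
postgres=# select gp_segment_id, count(*)
from t1
group by gp_segment_id
order by gp_segment_id;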
The append optimized table is much smaller than the traditional heap organized table, even without compression. The reason is that the tuple headers are much smaller for append optimized tables. This is also the reason why they should not be used for frequent update and delete operations: append optimized tables are meant for bulk loading and bulk insert operations only.
Another option you have if you go for append optimized tables is columnar storage but, again, consider when to use it: columnar storage is read optimized and not meant for write heavy operations. If you only access a small number of columns, this can reduce the required I/O significantly. A table which is organized by columns is created like this:
postgres=# create table t4 ( id int primary key
, dummy text
)
using ao_column
distributed by (id);
CREATE TABLE
On top of all that you can choose to go for compression and partitioning, which gives you even more choice in how you want to lay out your data (see the sketch below). Connectors are something you might want to look at as well.
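To give an idea of how these options combine, here is a sketch (t5 is just a made-up example; the statement assumes Greenplum 7 and its classic partitioning syntax): a column oriented table, compressed with zstd, and range partitioned by month:
postgres=# create table t5 ( id int
, created date
, dummy text
)
using ao_column
with (compresstype=zstd, compresslevel=5)
distributed by (id)
partition by range (created)
( start (date '2024-01-01') inclusive
  end (date '2025-01-01') exclusive
  every (interval '1 month') );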
Coming back to the initial question: Why Greenplum?
If you have the requirement for massively parallel data processing and you want a system which feels very much like a standard PostgreSQL to its clients, Greenplum is a valid option. As there is an open source edition, give it a try and explore the possibilities. There is a lot of choice for various use cases and access patterns.