Because PostgreSQL is fully open source, there are many forks of it. One of them is Greenplum, which describes itself as “an advanced, fully featured, open source data warehouse, based on PostgreSQL”. That sounds interesting, so let’s give it a try. This will be a series of blog posts, and in this first one we’re going to prepare the operating system, install the software, and verify the installation afterwards.

What follows is basically a short version of the installation guide which you can find here.

One of the requirements is either to disable SELinux or to configure it properly for the Greenplum installation. As this is only a playground, let’s do it the easy way and just disable it. This can be done by setting SELINUX to “disabled” in /etc/sysconfig/selinux and rebooting the system (I am using Rocky Linux 9 here):

[gpadmin@rocky9-gp7-master ~]$ grep -w SELINUX /etc/sysconfig/selinux 
# SELINUX= can take one of these three values:
# NOTE: Up to RHEL 8 release included, SELINUX=disabled would also
SELINUX=disabled
[root@rocky9-gp7-master ~]$ reboot
[root@rocky9-gp7-master ~]$ getenforce 
Disabled
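
If you prefer to script the change instead of editing the file by hand, a minimal sketch could look like the following. The function name set_selinux_mode is made up for this example, and the --follow-symlinks option assumes GNU sed; it matters because /etc/sysconfig/selinux is usually a symbolic link to /etc/selinux/config on RHEL-family systems, and a plain "sed -i" would replace the link with a regular file.

```shell
# Hypothetical helper (not part of Greenplum or SELinux tooling):
# set the SELINUX= mode in the given configuration file.
set_selinux_mode() {
    config_file=$1
    mode=$2
    # Only the line starting with "SELINUX=" is touched; comment lines
    # and SELINUXTYPE= are left alone.
    sed -i --follow-symlinks "s/^SELINUX=.*/SELINUX=${mode}/" "$config_file"
}

# Example (as root, followed by a reboot):
#   set_selinux_mode /etc/selinux/config disabled
```

The same has to be done on all three nodes.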

The same applies to the local firewall, which must either be disabled or configured properly:

[root@rocky9-gp7-master ~]$ systemctl stop firewalld
[root@rocky9-gp7-master ~]$ systemctl disable firewalld
Removed "/etc/systemd/system/multi-user.target.wants/firewalld.service".
Removed "/etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service".

To avoid depending on DNS, the hosts file on all three of my nodes looks like this:

[root@rocky9-gp7-master ~]$ cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.122.200 rocky9-gp7-master rocky9-gp7-master.it.dbi-services.com
192.168.122.201 rocky9-gp7-segment1 rocky9-gp7-segment1.it.dbi-services.com
192.168.122.202 rocky9-gp7-segment2 rocky9-gp7-segment2.it.dbi-services.com
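
As the same file has to exist on every node, a quick check that no host name was forgotten can be handy. The following is only a sketch; the helper name missing_hosts is an assumption of this example and it simply parses a hosts-format file, it does not go through the system resolver.

```shell
# Hypothetical helper: print every given host name that has no entry
# in a hosts-format file (IP address followed by one or more names).
missing_hosts() {
    hosts_file=$1
    shift
    for name in "$@"; do
        # Skip comment lines, then look for the name in fields 2..NF.
        awk -v n="$name" '
            /^[[:space:]]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit (found ? 0 : 1) }
        ' "$hosts_file" || echo "$name"
    done
}

# Example:
#   missing_hosts /etc/hosts rocky9-gp7-master rocky9-gp7-segment1 rocky9-gp7-segment2
```

No output means all names are present. To test the actual name resolution, "getent hosts rocky9-gp7-segment1" can be used as well.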

The first node is the so-called “Coordinator Host”. It receives all the client requests and routes them to the so-called “Segment Hosts”. In this case there are two segment nodes, and those will host the actual data.

For the kernel and system requirements, these are the recommended settings:

[root@rocky9-gp7-master ~]$ cat /etc/sysctl.conf
# kernel.shmall = _PHYS_PAGES / 2 # See Shared Memory Pages
kernel.shmall = 197951838
# kernel.shmmax = kernel.shmall * PAGE_SIZE 
kernel.shmmax = 810810728448
kernel.shmmni = 4096
vm.overcommit_memory = 2 # See Segment Host Memory
vm.overcommit_ratio = 95 # See Segment Host Memory

net.ipv4.ip_local_port_range = 10000 65535 # See Port Settings
kernel.sem = 250 2048000 200 8192
kernel.sysrq = 1
kernel.core_uses_pid = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.msgmni = 2048
kernel.core_pattern=/var/core/core.%h.%t
net.ipv4.tcp_syncookies = 1
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.conf.all.arp_filter = 1
net.ipv4.ipfrag_high_thresh = 41943040
net.ipv4.ipfrag_low_thresh = 31457280
net.ipv4.ipfrag_time = 60
net.core.netdev_max_backlog = 10000
net.core.rmem_max = 2097152
net.core.wmem_max = 2097152
vm.swappiness = 10
vm.zone_reclaim_mode = 0
vm.dirty_expire_centisecs = 500
vm.dirty_writeback_centisecs = 100
vm.dirty_background_ratio = 0 # See System Memory
vm.dirty_ratio = 0
vm.dirty_background_bytes = 1610612736
vm.dirty_bytes = 4294967296
[root@rocky9-gp7-master ~]$ sysctl -p
[root@rocky9-gp7-master ~]$ egrep "^\*" /etc/security/limits.conf
* soft nofile 524288
* hard nofile 524288
* soft nproc 131072
* hard nproc 131072
* soft  core unlimited
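
The two shared memory values depend on the memory of the host, as the comments in sysctl.conf indicate (the kernel.shmmax shown above is exactly kernel.shmall multiplied by a page size of 4096 bytes). A small sketch to calculate both values for the current machine, following the formulas from those comments, could look like this:

```shell
# Number of physical memory pages and the page size of this host.
phys_pages=$(getconf _PHYS_PAGES)
page_size=$(getconf PAGE_SIZE)

# kernel.shmall = _PHYS_PAGES / 2
shmall=$((phys_pages / 2))
# kernel.shmmax = kernel.shmall * PAGE_SIZE
shmmax=$((shmall * page_size))

echo "kernel.shmall = ${shmall}"
echo "kernel.shmmax = ${shmmax}"
```

The output can be pasted into /etc/sysctl.conf before running "sysctl -p".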

Another requirement is that rc.local needs to be enabled or, in other words, it needs to be executable when the systems are starting up:

[root@rocky9-gp7-master ~]$ chmod +x /etc/rc.d/rc.local
[root@rocky9-gp7-master ~]$ reboot

As usual on systems which host a database, it is recommended to disable transparent huge pages (this requires a reboot as well):

[root@rocky9-gp7-master ~]$ grubby --update-kernel=ALL --args="transparent_hugepage=never"
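
After the reboot, the kernel reports the active setting in /sys/kernel/mm/transparent_hugepage/enabled, with the active value in square brackets. A minimal sketch for extracting it (the helper name thp_active_setting is made up for this example):

```shell
# Hypothetical helper: read a line such as "always madvise [never]"
# from standard input and print the bracketed (active) value.
thp_active_setting() {
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

# Example, after the reboot:
#   thp_active_setting < /sys/kernel/mm/transparent_hugepage/enabled
```

If this prints “never”, the kernel argument was picked up.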

Deactivate systemd’s IPC object removal (this is already the default on Rocky Linux 9, but anyway):

[root@rocky9-gp7-master ~]$ sed -i 's/#RemoveIPC=no/RemoveIPC=no/g' /etc/systemd/logind.conf 
[root@rocky9-gp7-master ~]$ systemctl restart systemd-logind.service
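
Note that the sed command above only matches a line which reads exactly “#RemoveIPC=no”. If the file on your system contains “#RemoveIPC=yes” or an active “RemoveIPC=yes”, nothing is changed. A slightly more tolerant variant, as a sketch (the function name is an assumption of this example, and a missing RemoveIPC line is not added):

```shell
# Hypothetical helper: force RemoveIPC=no, whether the existing line is
# commented out or set to a different value.
set_removeipc_no() {
    logind_conf=$1
    sed -i -E 's/^#?[[:space:]]*RemoveIPC=.*/RemoveIPC=no/' "$logind_conf"
}

# Example (as root, followed by a restart of systemd-logind):
#   set_removeipc_no /etc/systemd/logind.conf
```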

As Greenplum should run under a dedicated user, let’s create it:

[root@rocky9-gp7-master ~]$ groupadd gpadmin
[root@rocky9-gp7-master ~]$ useradd -g gpadmin -m gpadmin
[root@rocky9-gp7-master ~]$ passwd gpadmin
Changing password for user gpadmin.
New password: 
BAD PASSWORD: The password fails the dictionary check - it is based on a dictionary word
Retype new password: 
passwd: all authentication tokens updated successfully.

The sudo configuration is optional, but as it makes life a lot easier, let’s configure this as well for the gpadmin user:

[root@rocky9-gp7-master ~]$ grep gpadmin /etc/sudoers
gpadmin ALL=(ALL)       NOPASSWD: ALL

The installation of Greenplum is just a matter of installing the rpm, which can be downloaded from the project’s GitHub repository:

[root@rocky9-gp7-master ~]$ su - gpadmin
Last login: Wed Feb 28 14:48:01 CET 2024 on pts/0
[gpadmin@rocky9-gp7-master ~]$ ls -l
total 50320
-rw-r--r-- 1 gpadmin gpadmin 51527129 Feb 28 14:50 open-source-greenplum-db-7.1.0-el9-x86_64.rpm
[gpadmin@rocky9-gp7-master ~]$ sudo dnf localinstall ./open-source-greenplum-db-7.1.0-el9-x86_64.rpm 
Rocky Linux 9 - BaseOS                                                                                14 kB/s | 4.1 kB     00:00    
Rocky Linux 9 - BaseOS                                                                               5.6 MB/s | 2.2 MB     00:00    
Rocky Linux 9 - AppStream                                                                             22 kB/s | 4.5 kB     00:00    
Rocky Linux 9 - AppStream                                                                             12 MB/s | 7.4 MB     00:00    
Rocky Linux 9 - Extras                                                                               6.7 kB/s | 2.9 kB     00:00    
Rocky Linux 9 - Extras                                                                                24 kB/s |  14 kB     00:00    
Dependencies resolved.
=====================================================================================================================================
 Package                                  Architecture       Version                                  Repository                Size
=====================================================================================================================================
Installing:
 open-source-greenplum-db-7               x86_64             7.1.0-1.el9                              @commandline              49 M
Installing dependencies:
 annobin                                  x86_64             12.12-1.el9                              appstream                977 k
 apr                                      x86_64             1.7.0-12.el9_3                           appstream                122 k
 apr-util                                 x86_64             1.6.1-23.el9                             appstream                 94 k
 apr-util-bdb                             x86_64             1.6.1-23.el9                             appstream                 12 k
...
  tar-2:1.34-6.el9_1.x86_64                                       unzip-6.0-56.el9.x86_64                                           
  zip-3.0-35.el9.x86_64                                          

Complete!

[gpadmin@rocky9-gp7-master ~]$ sudo chown -R gpadmin:gpadmin /usr/local/greenplum-db*

(The last step could also be done automatically by the package, but as it is not and the documentation recommends doing it, let’s do so.)

As password-less SSH is a requirement as well, let’s generate SSH keys on the coordinator node, create the authorized_keys file, and then copy the whole “.ssh” directory over to the other nodes. Once this is done, password-less SSH connections should already work between the nodes:

[gpadmin@rocky9-gp7-master ~]$ ssh-keygen 
Generating public/private rsa key pair.
Enter file in which to save the key (/home/gpadmin/.ssh/id_rsa): 
Created directory '/home/gpadmin/.ssh'.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/gpadmin/.ssh/id_rsa
Your public key has been saved in /home/gpadmin/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:+8uzanzyzxWSPhvtEHQiZ3s5+qSM9/YdUshPjEz4ojg [email protected]
The key's randomart image is:
+---[RSA 3072]----+
|                 |
|            .    |
|          ..=..  |
|           ===+. |
|        S  .=*=+ |
|        .....*+o |
|       E..  *.+o |
|        =oo+.@o o|
|       ..=B**o+.o|
+----[SHA256]-----+

[gpadmin@rocky9-gp7-master ~]$ cat .ssh/id_rsa.pub >> .ssh/authorized_keys
[gpadmin@rocky9-gp7-master ~]$ scp -r .ssh/ rocky9-gp7-segment1:/home/gpadmin/
[gpadmin@rocky9-gp7-master ~]$ scp -r .ssh/ rocky9-gp7-segment2:/home/gpadmin/
[gpadmin@rocky9-gp7-master ~]$ ssh rocky9-gp7-segment1
Last login: Wed Feb 28 15:04:18 2024 from 192.168.122.200
[gpadmin@rocky9-gp7-segment1 ~]$ 
logout
Connection to rocky9-gp7-segment1 closed.
[gpadmin@rocky9-gp7-master ~]$ ssh rocky9-gp7-segment2
Last login: Wed Feb 28 14:50:50 2024
[gpadmin@rocky9-gp7-segment2 ~]$ 
logout
Connection to rocky9-gp7-segment2 closed.
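
The creation of the authorized_keys file mentioned above can be sketched as follows. The helper name authorize_key is made up for this example; the permissions are the ones sshd usually insists on (its StrictModes setting is enabled by default), so they are set explicitly.

```shell
# Hypothetical helper: add a public key to authorized_keys (only once)
# and apply restrictive permissions to the directory and the file.
authorize_key() {
    ssh_dir=$1
    pub_key_file=$2
    auth_file="${ssh_dir}/authorized_keys"

    mkdir -p "$ssh_dir"
    touch "$auth_file"
    # -x: whole-line match, -F: fixed string, so the key is not duplicated.
    grep -qxF -- "$(cat "$pub_key_file")" "$auth_file" \
        || cat "$pub_key_file" >> "$auth_file"
    chmod 700 "$ssh_dir"
    chmod 600 "$auth_file"
}

# Example (as gpadmin on the coordinator):
#   authorize_key "$HOME/.ssh" "$HOME/.ssh/id_rsa.pub"
```

Copying the complete “.ssh” directory also copies the private key to the segment hosts. For a playground this is fine, and it means every node can reach every other node without a password.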

To verify the SSH setup there is a utility called “gpssh”. Before using it, create a file called “hostfile_exkeys” and add all the host names which will be part of the cluster:

[gpadmin@rocky9-gp7-master ~]$ echo "rocky9-gp7-master
rocky9-gp7-segment1
rocky9-gp7-segment2" > /home/gpadmin/hostfile_exkeys

Testing the SSH setup can then be done by asking “gpssh” to execute commands on all the nodes like this:

[gpadmin@rocky9-gp7-master ~]$ /usr/local/greenplum-db/bin/gpssh -f /home/gpadmin/hostfile_exkeys -e 'ls -l /usr/local/greenplum-db'
Traceback (most recent call last):
  File "/usr/local/greenplum-db/bin/gpssh", line 32, in <module>
    from gppylib.util import ssh_utils
ModuleNotFoundError: No module named 'gppylib'

… and this fails. The reason is that the Greenplum environment is not yet set properly. This can be done by sourcing “greenplum_path.sh” into the gpadmin user’s environment:

[gpadmin@rocky9-gp7-master ~]$ tail -1 .bash_profile 
. /usr/local/greenplum-db/greenplum_path.sh
[gpadmin@rocky9-gp7-master ~]$ /usr/local/greenplum-db/bin/gpssh -f /home/gpadmin/hostfile_exkeys -e 'ls -l /usr/local/greenplum-db'

This fails again with:

Traceback (most recent call last):
  File "/usr/local/greenplum-db/bin/gpssh", line 32, in <module>
    from gppylib.util import ssh_utils
  File "/usr/local/greenplum-db-7.1.0/lib/python/gppylib/util/ssh_utils.py", line 13, in <module>
    from gppylib.commands.unix import Hostname, Echo
  File "/usr/local/greenplum-db-7.1.0/lib/python/gppylib/commands/unix.py", line 18, in <module>
    from pkg_resources import parse_version
ModuleNotFoundError: No module named 'pkg_resources'

The reason is that the python3-setuptools package is not installed on the system. So, let’s install it and try again:

[gpadmin@rocky9-gp7-master ~]$ sudo dnf install -y python3-setuptools
[gpadmin@rocky9-gp7-master ~]$ /usr/local/greenplum-db/bin/gpssh -f /home/gpadmin/hostfile_exkeys -e 'ls -l /usr/local/greenplum-db'
[rocky9-gp7-segment1] ls -l /usr/local/greenplum-db
[rocky9-gp7-segment1] lrwxrwxrwx 1 gpadmin gpadmin 29 Feb 28 14:53 /usr/local/greenplum-db -> /usr/local/greenplum-db-7.1.0
[  rocky9-gp7-master] ls -l /usr/local/greenplum-db
[  rocky9-gp7-master] lrwxrwxrwx 1 gpadmin gpadmin 29 Feb 28 14:52 /usr/local/greenplum-db -> /usr/local/greenplum-db-7.1.0
[rocky9-gp7-segment2] ls -l /usr/local/greenplum-db
[rocky9-gp7-segment2] lrwxrwxrwx 1 gpadmin gpadmin 29 Feb 28 14:53 /usr/local/greenplum-db -> /usr/local/greenplum-db-7.1.0
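
Both fixes from above (sourcing greenplum_path.sh and installing python3-setuptools) need to be in place on every node, so it can be useful to have them in a repeatable form. The following is just a sketch under the assumption that python3 is in the PATH; the helper names ensure_line and python_has_module are made up for this example.

```shell
# Hypothetical helper: append a line to a file unless it is already there.
ensure_line() {
    file=$1
    line=$2
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Hypothetical helper: succeed if python3 can import the given module.
python_has_module() {
    python3 -c "import $1" >/dev/null 2>&1
}

# Example (as gpadmin):
#   ensure_line "$HOME/.bash_profile" ". /usr/local/greenplum-db/greenplum_path.sh"
#   python_has_module pkg_resources || sudo dnf install -y python3-setuptools
```

Calling ensure_line several times does not add the line more than once, so the snippet is safe to re-run.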

Now everything looks fine and we can proceed with creating the “Data Storage Areas”, but this will be the topic of the next post.