This blog refers to an older version of EDB’s Postgres on Kubernetes offering that is no longer available.

In the last post we had a look at the basic configuration of EDB EFM and confirmed that we can do manual switchovers for maintenance operations. Being able to do a switchover and switchback is fine, but what really matters is automatic failover. Without automatic failover this setup would be rather useless, as a) EDB EFM is supposed to do exactly that and b) without it we would need to implement something on our own. So let's have a look at whether that works.
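
Whether EFM reacts on its own at all is controlled by the auto.failover property in the EFM properties file. As a quick sanity check you can grep for it from inside one of the database pods; the path and file name below are the defaults for an EFM 3.0 cluster named "edb" and might differ in the container image:

sh-4.2$ grep auto.failover /etc/edb/efm-3.0/edb.properties

If this reports auto.failover=true, EFM will promote a standby on its own once the master is considered down; with false it would only send notifications.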

For this little test I scaled the PostgreSQL instances to three, meaning one master and two replica instances. Three is the minimum required, as EFM needs at least three agents so that at least two of them can agree on what to do. In a traditional setup this is usually achieved by making a non-database host the EDB EFM witness host. In this OpenShift/MiniShift setup we do not have a witness node, so the minimum number of database hosts (where EFM runs) is three. This is how it currently looks (the scale command itself is sketched after the listing):

dwe@dwe:~$ oc get pods -o wide -L role
NAME                 READY     STATUS    RESTARTS   AGE       IP            NODE        ROLE
edb-as10-0-1-dzcj6   0/1       Running   2          3h        172.17.0.9    localhost   standbydb
edb-as10-0-1-jmfjk   0/1       Running   2          4h        172.17.0.3    localhost   masterdb
edb-as10-0-1-rlzdb   0/1       Running   2          3h        172.17.0.6    localhost   standbydb
edb-pgpool-1-jtwg2   0/1       Running   2          4h        172.17.0.4    localhost   queryrouter
edb-pgpool-1-pjxzs   0/1       Running   2          4h        172.17.0.5    localhost   queryrouter
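
For reference, the scale-up from one database pod to three is a single oc command against the deployment config. A sketch, assuming the deployment config is named edb-as10-0 (derived from the pod names above, so it may be called differently in your project):

dwe@dwe:~$ oc scale dc edb-as10-0 --replicas=3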

What I will do is kill the master pod (edb-as10-0-1-jmfjk) to see what happens to my current setup. Before I do that, let's open a second session into one of the standby pods and check the current EFM status:

dwe@dwe:~$ oc rsh edb-as10-0-1-dzcj6
sh-4.2$ /usr/edb/efm-3.0/bin/efm cluster-status edb
Cluster Status: edb
VIP: 

	Agent Type  Address              Agent  DB       Info
	--------------------------------------------------------------
	Master      172.17.0.3           UP     UP        
	Standby     172.17.0.6           UP     UP        
	Standby     172.17.0.9           UP     UP        

Allowed node host list:
	172.17.0.3

Membership coordinator: 172.17.0.3

Standby priority host list:
	172.17.0.9 172.17.0.6

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      172.17.0.3           0/F000060        
	Standby     172.17.0.9           0/F000060        
	Standby     172.17.0.6           0/F000060        

	Standby database(s) in sync with master. It is safe to promote.
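
To follow the failover in (near) real time it helps to keep this status refreshing in the second session. A simple shell loop around the same command is enough, for example:

sh-4.2$ while true; do /usr/edb/efm-3.0/bin/efm cluster-status edb; sleep 2; done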

Looks fine. So, from the first session, let's kill the master pod using the oc command line utility. What I expect, at the very least, is that one of the standby instances gets promoted and that the remaining standby is reconfigured to connect to the new master.

dwe@dwe:~$ oc delete pod edb-as10-0-1-jmfjk
pod "edb-as10-0-1-jmfjk" deleted
dwe@dwe:~$ oc get pods
NAME                 READY     STATUS        RESTARTS   AGE
edb-as10-0-1-dzcj6   1/1       Running       2          3h
edb-as10-0-1-jmfjk   1/1       Terminating   2          4h
edb-as10-0-1-rlzdb   1/1       Running       2          4h
edb-as10-0-1-snnxc   0/1       Running       0          2s
edb-bart-1-s2fgj     1/1       Running       2          4h
edb-pgpool-1-jtwg2   1/1       Running       2          4h
edb-pgpool-1-pjxzs   1/1       Running       2          4h
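
Instead of re-running oc get pods you can also follow the replacement pod coming up with the --watch flag (stop it with CTRL-C once the new pod is ready):

dwe@dwe:~$ oc get pods --watch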

The master pod is terminating. What does the second session in the standby pod tell us when we ask EFM for the cluster status?

sh-4.2$ /usr/edb/efm-3.0/bin/efm cluster-status edb
Cluster Status: edb
VIP: 

	Agent Type  Address              Agent  DB       Info
	--------------------------------------------------------------
	Standby     172.17.0.6           UP     UP        
	Promoting   172.17.0.9           UP     UP        

Allowed node host list:
	172.17.0.3

Membership coordinator: 172.17.0.6

Standby priority host list:
	172.17.0.6

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      172.17.0.9           0/F000250        
	Standby     172.17.0.6           0/F000140        

	One or more standby databases are not in sync with the master database.

The master is gone and one of the standby instances is being promoted. Let's wait a few seconds and check again:

sh-4.2$ /usr/edb/efm-3.0/bin/efm cluster-status edb
Cluster Status: edb
VIP: 

	Agent Type  Address              Agent  DB       Info
	--------------------------------------------------------------
	Idle        172.17.0.11          UP     UNKNOWN   
	Standby     172.17.0.6           UP     UP        
	Master      172.17.0.9           UP     UP        

Allowed node host list:
	172.17.0.3

Membership coordinator: 172.17.0.6

Standby priority host list:
	172.17.0.6

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      172.17.0.9           0/F000288        
	Standby     172.17.0.6           0/F000288        

	Standby database(s) in sync with master. It is safe to promote.

Idle Node Status (idle nodes ignored in XLog location comparisons):

	Address              XLog Loc         Info
	--------------------------------------------------------------
	172.17.0.11          0/F000288        DB is in recovery.

The new master is ready and the remaining standby has been reconfigured to connect to it. Even better, a few seconds later the old master is back as a new standby:

sh-4.2$ /usr/edb/efm-3.0/bin/efm cluster-status edb
Cluster Status: edb
VIP: 

	Agent Type  Address              Agent  DB       Info
	--------------------------------------------------------------
	Standby     172.17.0.11          UP     UP        
	Standby     172.17.0.6           UP     UP        
	Master      172.17.0.9           UP     UP        

Allowed node host list:
	172.17.0.3

Membership coordinator: 172.17.0.6

Standby priority host list:
	172.17.0.6 172.17.0.11

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      172.17.0.9           0/F000288        
	Standby     172.17.0.6           0/F000288        
	Standby     172.17.0.11          0/F000288        

	Standby database(s) in sync with master. It is safe to promote.
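
If you do not want to rely on the EFM view alone, you can ask PostgreSQL directly whether an instance is in recovery. A minimal sketch against the re-created pod, assuming psql is in the PATH and the image's defaults allow a local connection (otherwise add the appropriate -U and database name):

dwe@dwe:~$ oc rsh edb-as10-0-1-snnxc
sh-4.2$ psql -c "select pg_is_in_recovery();"

For a standby this should return true, for the master false.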

Works fine, and this is exactly what you want in a good setup. When we kill one of the standby pods the expected result is that the pod restarts and the instance comes back as a standby. Let's check:

20:32:47 dwe@dwe:~$ oc delete pod edb-as10-0-1-hf27d
pod "edb-as10-0-1-hf27d" deleted
dwe@dwe:~$ oc get pods -o wide -L role
NAME                 READY     STATUS        RESTARTS   AGE       IP            NODE        ROLE
edb-as10-0-1-8p5rd   0/1       Running       0          3s        172.17.0.6    localhost   
edb-as10-0-1-dzcj6   1/1       Running       2          4h        172.17.0.9    localhost   masterdb
edb-as10-0-1-hf27d   0/1       Terminating   1          12m       172.17.0.3    localhost   
edb-as10-0-1-snnxc   1/1       Running       0          20m       172.17.0.11   localhost   standbydb
edb-pgpool-1-jtwg2   1/1       Running       2          5h        172.17.0.4    localhost   queryrouter
edb-pgpool-1-pjxzs   1/1       Running       2          5h        172.17.0.5    localhost   queryrouter

A few moments later we are back to the previous state:

dwe@dwe:~$ oc get pods -o wide -L role
NAME                 READY     STATUS    RESTARTS   AGE       IP            NODE        ROLE
edb-as10-0-1-8p5rd   1/1       Running   0          4m        172.17.0.6    localhost   standbydb
edb-as10-0-1-dzcj6   1/1       Running   2          4h        172.17.0.9    localhost   masterdb
edb-as10-0-1-snnxc   1/1       Running   0          24m       172.17.0.11   localhost   standbydb
edb-pgpool-1-jtwg2   1/1       Running   2          5h        172.17.0.4    localhost   queryrouter
edb-pgpool-1-pjxzs   1/1       Running   2          5h        172.17.0.5    localhost   queryrouter
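
To confirm from the master's point of view that both standbys are streaming again, you can look at pg_stat_replication on the current master pod (same psql assumptions as above):

dwe@dwe:~$ oc rsh edb-as10-0-1-dzcj6
sh-4.2$ psql -c "select client_addr, state, sync_state from pg_stat_replication;"

Two rows in state "streaming" would confirm that both standbys are attached to the new master.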

Fine as well. To complete this little series about EDB containers in MiniShift/OpenShift we will have a look at how to add an EDB BART container to the setup, because backup and recovery is still missing 🙂