This blog refers to an older version of EDB’s Postgres on Kubernetes offering that is no longer available.
In the last post we had a look at the basic configuration of EDB EFM and confirmed that we can do manual switchovers for maintenance operations. Being able to do a switchover and switchback is fine, but what really matters is automatic failover. Without automatic failover this setup would be of little use, as a) EDB EFM is supposed to do exactly that and b) without it we would need to implement something on our own. So let's have a look whether that works.
For this little test I scaled the PostgreSQL instances to three, meaning one master and two replica instances. Three is the minimum required, as EFM needs at least three agents so that at least two of them can agree on what to do. In a traditional setup this is usually achieved by making a non-database host the EDB EFM witness host. In this OpenShift/MiniShift setup we do not have a witness node, so the minimum number of database hosts (on which EFM runs) is three. This is how it currently looks:
dwe@dwe:~$ oc get pods -o wide -L role
NAME                 READY     STATUS    RESTARTS   AGE       IP            NODE        ROLE
edb-as10-0-1-dzcj6   0/1       Running   2          3h        172.17.0.9    localhost   standbydb
edb-as10-0-1-jmfjk   0/1       Running   2          4h        172.17.0.3    localhost   masterdb
edb-as10-0-1-rlzdb   0/1       Running   2          3h        172.17.0.6    localhost   standbydb
edb-pgpool-1-jtwg2   0/1       Running   2          4h        172.17.0.4    localhost   queryrouter
edb-pgpool-1-pjxzs   0/1       Running   2          4h        172.17.0.5    localhost   queryrouter
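For reference, the scale-out itself is done with the oc client. A minimal sketch, assuming the database deployment configuration is named edb-as10-0 (check what oc get dc reports in your project and adjust the name accordingly):

dwe@dwe:~$ oc get dc
dwe@dwe:~$ oc scale dc edb-as10-0 --replicas=3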
What I will do is kill the master pod (edb-as10-0-1-jmfjk) to see what happens to my current setup. Before I do that, let's open a second session into one of the standby pods and check the current EFM status:
dwe@dwe:~$ oc rsh edb-as10-0-1-dzcj6
sh-4.2$ /usr/edb/efm-3.0/bin/efm cluster-status edb
Cluster Status: edb

        VIP:

        Agent Type  Address              Agent  DB       Info
        --------------------------------------------------------------
        Master      172.17.0.3           UP     UP
        Standby     172.17.0.6           UP     UP
        Standby     172.17.0.9           UP     UP

Allowed node host list:
        172.17.0.3

Membership coordinator: 172.17.0.3

Standby priority host list:
        172.17.0.9 172.17.0.6

Promote Status:

        DB Type     Address              XLog Loc         Info
        --------------------------------------------------------------
        Master      172.17.0.3           0/F000060
        Standby     172.17.0.9           0/F000060
        Standby     172.17.0.6           0/F000060

        Standby database(s) in sync with master. It is safe to promote.
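As a side note: if the EFM release you are running provides the cluster-status-json subcommand, the same information can also be retrieved in JSON format, which is handier for scripting than parsing the table above (a sketch under that assumption, using the same binary path and cluster name as throughout this post):

sh-4.2$ /usr/edb/efm-3.0/bin/efm cluster-status-json edb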
Looks fine. So, from the first session, let's kill the master pod using the oc command line utility. What I expect at a minimum is that one of the standby instances gets promoted and the remaining standby is reconfigured to connect to the new master.
dwe@dwe:~$ oc delete pod edb-as10-0-1-jmfjk
pod "edb-as10-0-1-jmfjk" deleted
dwe@dwe:~$ oc get pods
NAME                 READY     STATUS        RESTARTS   AGE
edb-as10-0-1-dzcj6   1/1       Running       2          3h
edb-as10-0-1-jmfjk   1/1       Terminating   2          4h
edb-as10-0-1-rlzdb   1/1       Running       2          4h
edb-as10-0-1-snnxc   0/1       Running       0          2s
edb-bart-1-s2fgj     1/1       Running       2          4h
edb-pgpool-1-jtwg2   1/1       Running       2          4h
edb-pgpool-1-pjxzs   1/1       Running       2          4h
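Instead of re-running oc get pods over and over, the replacement of the pod can also be followed live with the watch flag (nothing EDB specific here, just plain oc):

dwe@dwe:~$ oc get pods -w -L role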
The master pod is terminating. What does the second session in the standby pod tell us when we ask EFM for the cluster status?
sh-4.2$ /usr/edb/efm-3.0/bin/efm cluster-status edb
Cluster Status: edb

        VIP:

        Agent Type  Address              Agent  DB       Info
        --------------------------------------------------------------
        Standby     172.17.0.6           UP     UP
        Promoting   172.17.0.9           UP     UP

Allowed node host list:
        172.17.0.3

Membership coordinator: 172.17.0.6

Standby priority host list:
        172.17.0.6

Promote Status:

        DB Type     Address              XLog Loc         Info
        --------------------------------------------------------------
        Master      172.17.0.9           0/F000250
        Standby     172.17.0.6           0/F000140

        One or more standby databases are not in sync with the master database.
The master is gone and one of the standby instances is being promoted. Let's wait a few seconds and check again:
sh-4.2$ /usr/edb/efm-3.0/bin/efm cluster-status edb
Cluster Status: edb

        VIP:

        Agent Type  Address              Agent  DB       Info
        --------------------------------------------------------------
        Idle        172.17.0.11          UP     UNKNOWN
        Standby     172.17.0.6           UP     UP
        Master      172.17.0.9           UP     UP

Allowed node host list:
        172.17.0.3

Membership coordinator: 172.17.0.6

Standby priority host list:
        172.17.0.6

Promote Status:

        DB Type     Address              XLog Loc         Info
        --------------------------------------------------------------
        Master      172.17.0.9           0/F000288
        Standby     172.17.0.6           0/F000288

        Standby database(s) in sync with master. It is safe to promote.

Idle Node Status (idle nodes ignored in XLog location comparisons):

        Address              XLog Loc         Info
        --------------------------------------------------------------
        172.17.0.11          0/F000288        DB is in recovery.
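Rather than re-checking by hand, a small loop inside the standby pod could poll the cluster status until EFM reports that the standbys are in sync again. This is just a sketch based on the output above (binary path and cluster name as used throughout this post):

#!/bin/bash
# poll the EFM cluster status until the standby databases are reported in sync
EFM_BIN=/usr/edb/efm-3.0/bin/efm
CLUSTER=edb
until ${EFM_BIN} cluster-status ${CLUSTER} | grep -q "It is safe to promote"; do
  sleep 5
done
echo "standby database(s) in sync with the master"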
The new master is ready and the remaining standby has been reconfigured to connect to it. Even better, a few seconds later the old master is back as a new standby:
sh-4.2$ /usr/edb/efm-3.0/bin/efm cluster-status edb
Cluster Status: edb

        VIP:

        Agent Type  Address              Agent  DB       Info
        --------------------------------------------------------------
        Standby     172.17.0.11          UP     UP
        Standby     172.17.0.6           UP     UP
        Master      172.17.0.9           UP     UP

Allowed node host list:
        172.17.0.3

Membership coordinator: 172.17.0.6

Standby priority host list:
        172.17.0.6 172.17.0.11

Promote Status:

        DB Type     Address              XLog Loc         Info
        --------------------------------------------------------------
        Master      172.17.0.9           0/F000288
        Standby     172.17.0.6           0/F000288
        Standby     172.17.0.11          0/F000288

        Standby database(s) in sync with master. It is safe to promote.
This works fine and is exactly what you need in a good setup. When we kill one of the standby pods, the expected result is that the pod is restarted and the instance comes back as a standby. Let's check:
20:32:47 dwe@dwe:~$ oc delete pod edb-as10-0-1-hf27d
pod "edb-as10-0-1-hf27d" deleted
dwe@dwe:~$ oc get pods -o wide -L role
NAME                 READY     STATUS        RESTARTS   AGE       IP            NODE        ROLE
edb-as10-0-1-8p5rd   0/1       Running       0          3s        172.17.0.6    localhost
edb-as10-0-1-dzcj6   1/1       Running       2          4h        172.17.0.9    localhost   masterdb
edb-as10-0-1-hf27d   0/1       Terminating   1          12m       172.17.0.3    localhost
edb-as10-0-1-snnxc   1/1       Running       0          20m       172.17.0.11   localhost   standbydb
edb-pgpool-1-jtwg2   1/1       Running       2          5h        172.17.0.4    localhost   queryrouter
edb-pgpool-1-pjxzs   1/1       Running       2          5h        172.17.0.5    localhost   queryrouter
A few moments later we are back to the previous state:
dwe@dwe:~$ oc get pods -o wide -L role
NAME                 READY     STATUS    RESTARTS   AGE       IP            NODE        ROLE
edb-as10-0-1-8p5rd   1/1       Running   0          4m        172.17.0.6    localhost   standbydb
edb-as10-0-1-dzcj6   1/1       Running   2          4h        172.17.0.9    localhost   masterdb
edb-as10-0-1-snnxc   1/1       Running   0          24m       172.17.0.11   localhost   standbydb
edb-pgpool-1-jtwg2   1/1       Running   2          5h        172.17.0.4    localhost   queryrouter
edb-pgpool-1-pjxzs   1/1       Running   2          5h        172.17.0.5    localhost   queryrouter
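If you want to double check on the database side that both standbys are really streaming from the new master, pg_stat_replication can be queried inside the master pod. A sketch, assuming psql is available on the PATH inside the container and the enterprisedb user can connect locally (adjust user and port to your setup):

dwe@dwe:~$ oc rsh edb-as10-0-1-dzcj6
sh-4.2$ psql -U enterprisedb -c "select client_addr, state, sent_lsn, replay_lsn from pg_stat_replication"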
Fine as well. To complete this little series about EDB containers in MiniShift/OpenShift we will have a look at how we can add an EDB BART container to the setup, because backup and recovery is still missing 🙂