In the last post I looked at how you can do maintenance operations on the standby node when you are working in a PostgreSQL cluster protected by EDB Failover Manager. In this post I'll look at how you can do maintenance on the primary node (more precisely: the node the primary instance currently runs on). This requires slightly more work and attention. Let's go.
As a quick reminder this is the setup:
| IP | Description |
|---|---|
| 192.168.22.243 | Current PostgreSQL hot standby instance |
| 192.168.22.245 | Current PostgreSQL primary instance |
| 192.168.22.244 | EDB Failover Manager Witness Node + EDB BART |
| 192.168.22.250 | Virtual IP that is used for client connections to the master database |
When we want to do maintenance on the current primary node this requires a failover of the PostgreSQL instance. In addition, the VIP shall fail over too, so that clients connect to the same address as before the failover. Let's check the current status of the failover cluster:
```
postgres@edbbart:/home/postgres/ [pg950] efmstat
Cluster Status: efm
Automatic failover is disabled.

Agent Type  Address              Agent  DB       Info
--------------------------------------------------------------
Master      192.168.22.245       UP     UP
Witness     192.168.22.244       UP     N/A
Standby     192.168.22.243       UP     UP

Allowed node host list:
	192.168.22.244 192.168.22.245 192.168.22.243

Standby priority host list:
	192.168.22.243

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      192.168.22.245       0/350000D0
	Standby     192.168.22.243       0/350000D0

	Standby database(s) in sync with master. It is safe to promote.
```
The last line is telling us that it is probably safe to promote. Consider these two cases:
- Case 1: You want to perform a maintenance which requires a reboot of the whole node
- Case 2: You want to perform a maintenance which requires only a restart of the PostgreSQL instance on that node
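In both cases it pays off to verify programmatically that the standby is in sync before you promote. A minimal sketch, assuming you capture the `cluster-status` output shown above (the helper function is mine, not part of EFM):

```shell
#!/bin/sh
# Hedged sketch: decide whether a promote is safe by checking the
# cluster-status output for the in-sync message EFM prints.
safe_to_promote() {
    # $1: output of "efm cluster-status efm"
    printf '%s\n' "$1" | grep -q 'It is safe to promote.'
}

# On a real node you would capture the output of the efm binary, e.g.:
#   status="$(/usr/efm-2.0/bin/efm cluster-status efm)"
status='Standby database(s) in sync with master. It is safe to promote.'

if safe_to_promote "$status"; then
    echo "standby in sync - promote is safe"
else
    echo "standby NOT in sync - do not promote" >&2
fi
```

The same check works unchanged for both scenarios below, since the promote command is identical.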
It is important to distinguish the two cases because it affects how you have to deal with EDB Failover Manager. Let's see what happens in case 1 when we do a promote and then reboot the node. I'll run the promote command on the witness node, but it doesn't really matter as you can execute it from any node in the failover cluster:
```
-bash-4.2$ /usr/efm-2.0/bin/efm promote efm
Promote command accepted by local agent. Proceeding with promotion. Run the 'cluster-status' command for information about the new cluster state.
```
What happened? First, let's check the cluster status:
```
-bash-4.2$ /usr/efm-2.0/bin/efm cluster-status efm
Cluster Status: efm
Automatic failover is disabled.

Agent Type  Address              Agent  DB       Info
--------------------------------------------------------------
Idle        192.168.22.245       UP     UNKNOWN
Witness     192.168.22.244       UP     N/A
Master      192.168.22.243       UP     UP

Allowed node host list:
	192.168.22.244 192.168.22.245 192.168.22.243

Standby priority host list:
	(List is empty.)

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      192.168.22.243       0/350001E0

	No standby databases were found.

Idle Node Status (idle nodes ignored in XLog location comparisons):

	Address              XLog Loc         Info
	--------------------------------------------------------------
	192.168.22.245       0/350000D0       DB is not in recovery.
```
The old master is gone (Idle/UNKNOWN) and the old standby became the new master. You can double-check this by logging in to the new master and verifying that the instance is no longer in recovery mode:
```
postgres@edbppas:/home/postgres/ [PGSITE1] ip a
1: lo: mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp0s3: mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:6d:d8:b7 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic enp0s3
       valid_lft 85491sec preferred_lft 85491sec
    inet6 fe80::a00:27ff:fe6d:d8b7/64 scope link
       valid_lft forever preferred_lft forever
3: enp0s8: mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:e4:19:ec brd ff:ff:ff:ff:ff:ff
    inet 192.168.22.243/24 brd 192.168.22.255 scope global enp0s8
       valid_lft forever preferred_lft forever
    inet 192.168.22.250/24 brd 192.168.22.255 scope global secondary enp0s8:0
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fee4:19ec/64 scope link
       valid_lft forever preferred_lft forever
postgres@edbppas:/home/postgres/ [PGSITE1] sqh
psql.bin (9.5.0.5)
Type "help" for help.

postgres=# select * from pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 f
(1 row)
```
The IP configuration output also proves that the VIP failed over to the new master node. But what is the status of the old master instance?
```
postgres@ppasstandby:/home/postgres/ [PGSITE2] ip a
1: lo: mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp0s3: mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:19:6f:0a brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic enp0s3
       valid_lft 85364sec preferred_lft 85364sec
    inet6 fe80::a00:27ff:fe19:6f0a/64 scope link
       valid_lft forever preferred_lft forever
3: enp0s8: mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:ba:c0:6a brd ff:ff:ff:ff:ff:ff
    inet 192.168.22.245/24 brd 192.168.22.255 scope global enp0s8
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:feba:c06a/64 scope link
       valid_lft forever preferred_lft forever
postgres@ppasstandby:/home/postgres/ [PGSITE2] sqh
psql.bin (9.5.0.5)
Type "help" for help.

postgres=# select * from pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 f
(1 row)
```
This is dangerous: you now have two masters. If any of your applications use the local IP address to connect, they would still be connected to the old instance, which keeps accepting writes. Failover Manager did create a new recovery.conf file, though:
```
postgres@ppasstandby:/home/postgres/ [PGSITE2] cat $PGDATA/recovery.conf
# EDB Failover Manager
# This generated recovery.conf file prevents the db server from accidentally
# being restarted as a master since a failover or promotion has occurred
standby_mode = on
restore_command = 'echo 2>"recovery suspended on failed server node"; exit 1'
```
But as long as you do not restart the old primary instance it will keep accepting modifications. Once you restart it:
```
2016-04-11 11:37:40.114 GMT - 8 - 3454 -  - @ LOCATION:  ReadRecord, xlog.c:3983
2016-04-11 11:37:40.114 GMT - 6 - 3452 -  - @ LOG:  00000: database system is ready to accept read only connections
```
… the instance can no longer do any harm. But EDB Failover Manager does not perform this restart automatically: it is left to you to restart the old master instance immediately to prevent the two-master scenario. This should definitely be improved in EDB Failover Manager.
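A small guard before restarting the old instance makes this step less error-prone. A hedged sketch (the function name and the check are mine; the recovery.conf behaviour is the one described above, and the data directory path is an example from this setup):

```shell
#!/bin/sh
# Sketch: refuse to restart the old master unless the recovery.conf
# written by EDB Failover Manager is in place, so the instance cannot
# come back as a second master.
guard_restart() {
    # $1: data directory of the old master instance
    if [ -f "$1/recovery.conf" ]; then
        echo "recovery.conf present - restart keeps the instance suspended"
        return 0
    fi
    echo "recovery.conf missing - a restart would bring up a second master" >&2
    return 1
}

# Example invocation with the data directory from this setup:
guard_restart "/u02/pgdata/PGSITE2" || echo "fix recovery.conf before pg_ctl start"
```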
Once the maintenance is finished and the old master node is rebooted, what is the status of the cluster? Let's check:
```
[root@edbbart ~] /usr/edb-efm/bin/efm cluster-status efm
Cluster Status: efm
Automatic failover is disabled.

Agent Type  Address              Agent  DB       Info
--------------------------------------------------------------
Master      192.168.22.243       UP     UP
Witness     192.168.22.244       UP     N/A

Allowed node host list:
	192.168.22.244 192.168.22.243

Standby priority host list:
	(List is empty.)

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      192.168.22.243       0/37021258

	No standby databases were found.
```
The information about the old master has completely disappeared. So how can we recover from that and bring back the old configuration? The first step is to add the old master node back to the allowed hosts of the cluster:
```
[root@edbbart ~] /usr/edb-efm/bin/efm add-node efm 192.168.22.245
add-node signal sent to local agent.
[root@edbbart ~]
```
This should result in the following output:
```
[root@edbbart ~] /usr/edb-efm/bin/efm cluster-status efm | grep -A 2 "Allowed"
Allowed node host list:
	192.168.22.244 192.168.22.243 192.168.22.245
```
Now we need to rebuild the old master as a new standby. This can be done in various ways; the two most common are to take a new base backup from the new master or to use pg_rewind. I'll use pg_rewind here. To rebuild the old master as a new standby:
```
postgres@ppasstandby:/u02/pgdata/PGSITE2/ [PGSITE2] pg_rewind -D /u02/pgdata/PGSITE2/ --source-server="port=4445 host=192.168.22.243 user=postgres dbname=postgres"
servers diverged at WAL position 0/350000D0 on timeline 2
rewinding from last common checkpoint at 0/35000028 on timeline 2
Done!
```
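Note that pg_rewind only works when the source cluster was initialized with data checksums or runs with `wal_log_hints = on`. A pre-flight check could look like the following sketch (the helper function is mine; host and port in the comments are the values from this setup):

```shell
#!/bin/sh
# Sketch: verify the pg_rewind prerequisites before tearing anything down.
# rewind_possible() is a hypothetical helper; feed it the two settings as
# returned by "show wal_log_hints;" and "show data_checksums;".
rewind_possible() {
    [ "$1" = "on" ] || [ "$2" = "on" ]
}

# On a real system you would query the new master, e.g.:
#   hints=$(psql -h 192.168.22.243 -p 4445 -U postgres -Atc 'show wal_log_hints;')
#   sums=$(psql -h 192.168.22.243 -p 4445 -U postgres -Atc 'show data_checksums;')
hints=off
sums=on

if rewind_possible "$hints" "$sums"; then
    echo "pg_rewind is an option"
else
    echo "take a new base backup instead" >&2
fi
```

If neither setting is on, fall back to a fresh base backup from the new master.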
Make sure your recovery.conf matches your environment:
```
postgres@ppasstandby:/u02/pgdata/PGSITE2/ [PGSITE2] cat recovery.conf
standby_mode = 'on'
primary_slot_name = 'standby1'
primary_conninfo = 'user=postgres password=admin123 host=192.168.22.243 port=4445 sslmode=prefer sslcompression=1'
recovery_target_timeline = 'latest'
trigger_file='/u02/pgdata/PGSITE2/trigger_file'
```
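One thing worth double-checking: the recovery.conf references `primary_slot_name = 'standby1'`. If that physical replication slot does not yet exist on the new master, streaming will fail once the standby starts. A hedged sketch of the check and, only if the first query returns no row, the fix:

```sql
-- On the new master: does the slot already exist?
select slot_name from pg_replication_slots where slot_name = 'standby1';
-- Only if it is missing:
select pg_create_physical_replication_slot('standby1');
```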
Start the new standby instance and check the log file:
```
2016-04-17 11:21:40.021 GMT - 1 - 2440 -  - @ LOG:  database system was interrupted while in recovery at log time 2016-04-17 11:09:28 GMT
2016-04-17 11:21:40.021 GMT - 2 - 2440 -  - @ HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2016-04-17 11:21:40.554 GMT - 3 - 2440 -  - @ LOG:  entering standby mode
2016-04-17 11:21:40.608 GMT - 4 - 2440 -  - @ LOG:  redo starts at 0/35000098
2016-04-17 11:21:40.704 GMT - 5 - 2440 -  - @ LOG:  consistent recovery state reached at 0/37047858
2016-04-17 11:21:40.704 GMT - 6 - 2440 -  - @ LOG:  invalid record length at 0/37047858
2016-04-17 11:21:40.704 GMT - 4 - 2438 -  - @ LOG:  database system is ready to accept read only connections
2016-04-17 11:21:40.836 GMT - 1 - 2444 -  - @ LOG:  started streaming WAL from primary at 0/37000000 on timeline 3
```
So much for the database part. To bring back the EDB Failover Manager configuration we need to adjust the efm.nodes file on the new standby to include all the hosts of the configuration:
```
[root@ppasstandby efm-2.0] pwd
/etc/efm-2.0
[root@ppasstandby efm-2.0] cat efm.nodes
# List of node address:port combinations separated by whitespace.
192.168.22.244:9998 192.168.22.243:9998 192.168.22.245:9998
```
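Editing efm.nodes by hand is easy to get wrong; a small idempotent sketch can do it as well. This is an assumption on my part, not an EFM tool: it simply appends any missing node:port pair. It runs against a temporary file here; point `NODES_FILE` at `/etc/efm-2.0/efm.nodes` on the actual standby.

```shell
#!/bin/sh
# Sketch: make sure every node:port pair of the cluster is present in
# efm.nodes before restarting the agent. Demonstrated on a temporary
# file; set NODES_FILE=/etc/efm-2.0/efm.nodes on the real node.
NODES_FILE=${NODES_FILE:-$(mktemp)}

for node in 192.168.22.244:9998 192.168.22.243:9998 192.168.22.245:9998; do
    # append the pair only if it is not listed yet
    grep -q "$node" "$NODES_FILE" 2>/dev/null || printf '%s ' "$node" >> "$NODES_FILE"
done

cat "$NODES_FILE"
```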
Once this is done, EDB Failover Manager can be restarted and the configuration is fine again:
```
[root@ppasstandby efm-2.0] systemctl start efm-2.0.service
[root@ppasstandby efm-2.0] /usr/efm-2.0/bin/efm cluster-status efm
Cluster Status: efm
Automatic failover is disabled.

Agent Type  Address              Agent  DB       Info
--------------------------------------------------------------
Witness     192.168.22.244       UP     N/A
Standby     192.168.22.245       UP     UP
Master      192.168.22.243       UP     UP

Allowed node host list:
	192.168.22.244 192.168.22.243 192.168.22.245

Standby priority host list:
	192.168.22.245

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      192.168.22.243       0/37048540
	Standby     192.168.22.245       0/37048540

	Standby database(s) in sync with master. It is safe to promote.
```
Coming to the second scenario: when you do not need to reboot the server but only have to take down the master database for maintenance, what are the steps to follow? As in scenario 1 you'll have to promote to activate the standby database and then immediately shut down the old master:
```
[root@edbbart ~]# /usr/edb-efm/bin/efm promote efm
Promote command accepted by local agent. Proceeding with promotion. Run the 'cluster-status' command for information about the new cluster state.
[root@edbbart ~]# /usr/edb-efm/bin/efm cluster-status efm
Cluster Status: efm
Automatic failover is disabled.

Agent Type  Address              Agent  DB       Info
--------------------------------------------------------------
Idle        192.168.22.243       UP     UNKNOWN
Witness     192.168.22.244       UP     N/A
Master      192.168.22.245       UP     UP

Allowed node host list:
	192.168.22.244 192.168.22.243 192.168.22.245

Standby priority host list:
	(List is empty.)

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      192.168.22.245       0/37048730

	No standby databases were found.

Idle Node Status (idle nodes ignored in XLog location comparisons):

	Address              XLog Loc         Info
	--------------------------------------------------------------
	192.168.22.243       UNKNOWN          Connection refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
```
Now you can do your maintenance and, once finished, rebuild the old master as a new standby:
```
postgres@edbppas:/u02/pgdata/PGSITE1/ [PGSITE1] pg_rewind -D /u02/pgdata/PGSITE1/ --source-server="port=4445 host=192.168.22.245 user=postgres dbname=postgres"
servers diverged at WAL position 0/37048620 on timeline 3
rewinding from last common checkpoint at 0/37048578 on timeline 3
Done!
postgres@edbppas:/u02/pgdata/PGSITE1/ [PGSITE1] cat recovery.conf
standby_mode = 'on'
primary_slot_name = 'standby1'
primary_conninfo = 'user=postgres password=admin123 host=192.168.22.245 port=4445 sslmode=prefer sslcompression=1'
recovery_target_timeline = 'latest'
trigger_file='/u02/pgdata/PGSITE1/trigger_file'
```
Once you have restarted the instance, EDB Failover Manager shows:
```
[root@edbbart ~]# /usr/edb-efm/bin/efm cluster-status efm
Cluster Status: efm
Automatic failover is disabled.

Agent Type  Address              Agent  DB       Info
--------------------------------------------------------------
Master      192.168.22.245       UP     UP
Witness     192.168.22.244       UP     N/A
Idle        192.168.22.243       UP     UNKNOWN

Allowed node host list:
	192.168.22.244 192.168.22.243 192.168.22.245

Standby priority host list:
	(List is empty.)

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      192.168.22.245       0/37068EF0

	No standby databases were found.

Idle Node Status (idle nodes ignored in XLog location comparisons):

	Address              XLog Loc         Info
	--------------------------------------------------------------
	192.168.22.243       0/37068EF0       DB is in recovery.
```
The new standby is detected as "in recovery" but the agent still reports "UNKNOWN". To fix this, execute the "resume" command on the new standby and then check the cluster status again:
```
[root@edbppas ~] /usr/edb-efm/bin/efm resume efm
Resume command successful on local agent.
[root@edbppas ~] /usr/edb-efm/bin/efm cluster-status efm
Cluster Status: efm
Automatic failover is disabled.

Agent Type  Address              Agent  DB       Info
--------------------------------------------------------------
Standby     192.168.22.243       UP     UP
Witness     192.168.22.244       UP     N/A
Master      192.168.22.245       UP     UP

Allowed node host list:
	192.168.22.244 192.168.22.243 192.168.22.245

Standby priority host list:
	192.168.22.243

Promote Status:

	DB Type     Address              XLog Loc         Info
	--------------------------------------------------------------
	Master      192.168.22.245       0/37068EF0
	Standby     192.168.22.243       0/37068EF0

	Standby database(s) in sync with master. It is safe to promote.
```
Everything back to normal operations.
Conclusion: how you get the failover cluster configuration back to normal operations depends on what exactly you want to do. The steps themselves are easy; the main issue is to perform them in the right order. Having that documented is a must, as you'll probably not do these kinds of tasks every day.