By Mouhamadou Diaw
Edb Failover Manager edb-efm30 is now released and supports PostgreSQL 10.1.
Efm is an EDB tool which can be used to perform a switchover and a failover in a PostgreSQL standby environment.
In this blog we are going to see how we can install and configure efm in a standby environment. We suppose that the standby is already confugured and is running. We also have not configured any VIP. I just use 3 virtual machines to test. Below the configuration we are using 3 servers with Oracle Linux 7.2
pgservertools: 192.168.56.30 which is the witness
pgserver1: 192.168.56.36 which is the primary server
pgserver2: 192.168.56.37 which is the standby
EFM must be installed on both 3 nodes.To install EDB EFM,I used the rpm provided by EnterpriseDB. Note that you will need an Edb account.We will show the installation on only one node, but is the same for both nodes.
1
2
|
[root@host tmp] # wget http://yum.enterprisedb.com/edbrepos/edb-repo-latest.noarch.rpm [root@host tmp] # yum localinstall -y edb-repo-latest.noarch.rpm |
After we have to enable the corresponding edb repository (You will need a login and password)
1
2
3
4
5
6
7
|
[root@pgservertools yum.repos.d] # cat /etc/yum.repos.d/edb.repo [edbas10] name=EnterpriseDB Advanced Server 10 $releasever - $basearch enabled=1 gpgcheck=1 gpgkey= file : ///etc/pki/rpm-gpg/ENTERPRISEDB-GPG-KEY |
And then we can search for the package
1
2
3
4
5
6
|
[root@pgservertools ~] # yum search efm Loaded plugins: langpacks, ulninfo =============================== N /S matched: efm =============================== edb-efm30.x86_64 : EnterpriseDB Failover Manager efm20.x86_64 : EnterpriseDB Failover Manager efm21.x86_64 : EnterpriseDB Failover Manager |
And then install it (ouputs are truncated)
1
2
3
4
5
6
7
8
9
10
11
|
[root@pgservertools ~] # yum install edb-efm30.x86_64 Loaded plugins: langpacks, ulninfo Resolving Dependencies --> Running transaction check ---> Package edb-efm30.x86_64 0:3.0.0-1.rhel7 will be installed --> Finished Dependency Resolution … … Installed: edb-efm30.x86_64 0:3.0.0-1.rhel7 Complete! |
efm requires also openjdk. So we also have installed openjdk 1.8 on all nodes:
1
|
root@host tmp] # yum install java-1.8.0-openjdk.x86_64 |
To manage efm we create a dedicated user in the primary cluster
1
2
3
|
postgres= # create user efm with login password 'root' superuser; CREATE ROLE postgres= # |
We have to add following entries to pg_hba.conf of all databases clusters to allow connection for user efm.
1
2
3
4
|
##for efm host postgres efm 192.168.56.36 /32 md5 host postgres efm 192.168.56.37 /32 md5 host postgres efm 192.168.56.30 /32 md5 |
The configuration of efm consists of editing 2 main configuration files: efm.nodes and efm.properties. In my case these files are located in /etc/edb/efm-3.0. There are already two sample files that we can copy and then edit.
First we need to encrypt the password of user efm and after we have to configure efm.nodes and efm.properties on both nodes.
1
2
3
4
5
6
7
8
|
[root@pgserver1 efm-3.0] # /usr/edb/efm-3.0/bin/efm encrypt efm This utility will generate an encrypted password for you to place in your EFM cluster property file . Please enter the password and hit enter: Please enter the password again to confirm: The encrypted password is: ff7f041651e5e864013c1102d26a5e08 Please paste this into your cluster properties file . db.password.encrypted= ff7f041651e5e864013c1102d26a5e08 |
Below we show the contents of our two configuration files
On pgserver1 which is the primary
efm.nodes
1
2
3
4
5
|
root@pgserver1 101] # cat /etc/edb/efm-3.0/efm.nodes # List of node address:port combinations separated by whitespace. # The list should include at least the membership coordinator's address. 192.168.56.30:9998 192.168.56.37:9998 [root@pgserver1 101] # |
efm.properties
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
|
[root@pgserver1 101] # cat /etc/edb/efm-3.0/efm.properties | grep -v ^# db.user=efm db.password.encrypted=ff7f041651e5e864013c1102d26a5e08 db.port=5436 db.database=postgres db.service.owner=postgres db.service.name= db.bin= /u01/app/postgres/product/10/db_1/bin db.recovery.conf. dir = /u90/mydata/101 jdbc.sslmode=disable script.notification= bind.address=192.168.56.36:9998 admin.port=9999 is.witness= false local .period=10 local .timeout=60 local .timeout.final=10 remote.timeout=10 node.timeout=50 stop.isolated.master= false pingServerIp=192.168.56.30 pingServerCommand= /bin/ping -q -c3 -w5 auto.allow.hosts= false db.reuse.connection.count=0 auto.failover= true auto.reconfigure= true promotable= true minimum.standbys=0 recovery.check.period=2 auto.resume.period=0 script.fence= script.post.promotion= script.resumed= script.db.failure= script.master.isolated= script.remote.pre.promotion= script.remote.post.promotion= script.custom.monitor= custom.monitor.interval= custom.monitor.timeout= custom.monitor.safe.mode= sudo . command = sudo sudo .user. command = sudo -u %u log. dir = /var/log/efm-3 .0 jgroups.loglevel=INFO efm.loglevel=INFO jvm.options=-Xmx32m [root@pgserver1 101] # |
On pgserver2 which is the standby
efm.nodes
1
2
3
4
5
|
[root@pgserver2 tmp] # cat /etc/edb/efm-3.0/efm.nodes # List of node address:port combinations separated by whitespace. # The list should include at least the membership coordinator's address. 192.168.56.30:9998 192.168.56.36:9998 [root@pgserver2 tmp] # |
efm.properties
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
|
[root@pgserver2 tmp] # cat /etc/edb/efm-3.0/efm.properties | grep -v ^# db.user=efm db.password.encrypted=ff7f041651e5e864013c1102d26a5e08 db.port=5436 db.database=postgres db.service.owner=postgres db.service.name= db.bin= /u01/app/postgres/product/10/db_1/bin db.recovery.conf. dir = /u90/mydata/101 jdbc.sslmode=disable script.notification= bind.address=192.168.56.37:9998 admin.port=9999 is.witness= false local .period=10 local .timeout=60 local .timeout.final=10 remote.timeout=10 node.timeout=50 stop.isolated.master= false pingServerIp=192.168.56.30 pingServerCommand= /bin/ping -q -c3 -w5 auto.allow.hosts= true db.reuse.connection.count=0 auto.failover= true auto.reconfigure= true promotable= true minimum.standbys=0 recovery.check.period=2 auto.resume.period=0 script.fence= script.post.promotion= script.resumed= script.db.failure= script.master.isolated= script.remote.pre.promotion= script.remote.post.promotion= script.custom.monitor= custom.monitor.interval= custom.monitor.timeout= custom.monitor.safe.mode= sudo . command = sudo sudo .user. command = sudo -u %u log. dir = /var/log/efm-3 .0 jgroups.loglevel=INFO efm.loglevel=INFO jvm.options=-Xmx32m [root@pgserver2 tmp] # |
On pgservertools which is the witness
efm.nodes
1
2
3
4
5
|
[root@pgservertools efm-3.0] # cat /etc/edb/efm-3.0/efm.nodes # List of node address:port combinations separated by whitespace. # The list should include at least the membership coordinator's address. 192.168.56.36:9998 192.168.56.37:9998 [root@pgservertools efm-3.0] # |
efm.properties
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
|
[root@pgservertools efm-3.0] # cat /etc/edb/efm-3.0/efm.properties | grep -v ^# db.user=efm db.password.encrypted=ff7f041651e5e864013c1102d26a5e08 db.port=5436 db.database=postgres db.service.owner=postgres db.service.name= db.bin= /u01/app/postgres/product/10/db_1/bin db.recovery.conf. dir = /u90/mydata/101 jdbc.sslmode=disable script.notification= bind.address=192.168.56.30:9998 admin.port=9999 is.witness= true local .period=10 local .timeout=60 local .timeout.final=10 remote.timeout=10 node.timeout=50 stop.isolated.master= false pingServerIp=192.168.56.30 pingServerCommand= /bin/ping -q -c3 -w5 auto.allow.hosts= false db.reuse.connection.count=0 auto.failover= true auto.reconfigure= true promotable= true minimum.standbys=0 recovery.check.period=2 auto.resume.period=0 script.fence= script.post.promotion= script.resumed= script.db.failure= script.master.isolated= script.remote.pre.promotion= script.remote.post.promotion= script.custom.monitor= custom.monitor.interval= custom.monitor.timeout= custom.monitor.safe.mode= sudo . command = sudo sudo .user. command = sudo -u %u log. dir = /var/log/efm-3 .0 jgroups.loglevel=INFO efm.loglevel=INFO jvm.options=-Xmx32m [root@pgservertools efm-3.0] # |
Now let’s start efm on both nodes. If there is any error check logs on /var/log/efm-3.0/.
I started on following order: pgserver1, pgserver2 and pgservertools. Services can be configured to be started automatically when the server starts.
Below an example on pgserverools
1
2
3
4
5
6
7
8
9
10
11
12
13
14
|
[root@pgservertools efm-3.0] # service efm-3.0 start . [root@pgservertools efm-3.0] # service efm-3.0 status Redirecting to /bin/systemctl status efm-3.0.service ● efm-3.0.service - EnterpriseDB Failover Manager 3.0 Loaded: loaded ( /usr/lib/systemd/system/efm-3 .0.service; disabled; vendor preset: disabled) Active: active (running) since Tue 2018-03-06 15:58:51 CET; 1h 46min ago Process: 22260 ExecStart= /bin/bash -c /usr/edb/efm-3 .0 /bin/runefm .sh start ${CLUSTER} (code=exited, status=0 /SUCCESS ) Main PID: 22321 (java) CGroup: /system .slice /efm-3 .0.service └─22321 /usr/lib/jvm/java-1 .8.0-openjdk-1.8.0.151-1.b12.el7_4.x86_64 /jre/bin/java - cp /usr/edb/efm-3 .0 /lib/EFM-3 .0.0.jar -Xmx32m com.enterprisedb.efm.main.ServiceCommand __int_start /e ... Mar 06 15:58:45 pgservertools.localdomain systemd[1]: Starting EnterpriseDB Failover Manager 3.0... Mar 06 15:58:51 pgservertools.localdomain systemd[1]: Started EnterpriseDB Failover Manager 3.0. [root@pgservertools efm-3.0] # |
Once started we can verified from any node the status of our cluster
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
|
[root@pgservertools efm-3.0] # /usr/edb/efm-3.0/bin/efm cluster-status efm Cluster Status: efm VIP: Agent Type Address Agent DB Info -------------------------------------------------------------- Witness 192.168.56.30 UP N /A Master 192.168.56.36 UP UP Standby 192.168.56.37 UP UP Allowed node host list: 192.168.56.36 192.168.56.37 192.168.56.30 Membership coordinator: 192.168.56.30 Standby priority host list: 192.168.56.37 Promote Status: DB Type Address XLog Loc Info -------------------------------------------------------------- Master 192.168.56.36 0 /430001A8 Standby 192.168.56.37 0 /430001A8 Standby database(s) in sync with master. It is safe to promote. [root@pgservertools efm-3.0] # |
Now that everything is ok, let’s do a switchover
1
2
|
[root@pgservertools efm-3.0] # /usr/edb/efm-3.0/bin/efm promote efm -switchover Promote /switchover command accepted by local agent. Proceeding with promotion and will reconfigure original master. Run the 'cluster-status' command for information about the new cluster state. |
And if we run the cluster-status command during the switchover
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
|
[root@pgservertools efm-3.0] # /usr/edb/efm-3.0/bin/efm cluster-status efm Cluster Status: efm VIP: Agent Type Address Agent DB Info -------------------------------------------------------------- Witness 192.168.56.30 UP N /A Idle 192.168.56.36 UP UNKNOWN Standby 192.168.56.37 UP UP Allowed node host list: 192.168.56.36 192.168.56.37 192.168.56.30 Membership coordinator: 192.168.56.30 Standby priority host list: 192.168.56.37 Promote Status: DB Type Address XLog Loc Info -------------------------------------------------------------- Standby 192.168.56.37 0 /44000098 No master database was found. Idle Node Status (idle nodes ignored in XLog location comparisons): Address XLog Loc Info -------------------------------------------------------------- 192.168.56.36 UNKNOWN Connection to 192.168.56.36:5436 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP /IP connections. |
And we can see the promoting phase
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
|
[root@pgservertools efm-3.0] # /usr/edb/efm-3.0/bin/efm cluster-status efm Cluster Status: efm VIP: Agent Type Address Agent DB Info -------------------------------------------------------------- Witness 192.168.56.30 UP N /A Idle 192.168.56.36 UP UNKNOWN Promoting 192.168.56.37 UP UP Allowed node host list: 192.168.56.36 192.168.56.37 192.168.56.30 Membership coordinator: 192.168.56.30 Standby priority host list: (List is empty.) Promote Status: DB Type Address XLog Loc Info -------------------------------------------------------------- Master 192.168.56.37 0 /44000170 No standby databases were found. Idle Node Status (idle nodes ignored in XLog location comparisons): Address XLog Loc Info -------------------------------------------------------------- 192.168.56.36 UNKNOWN Connection to 192.168.56.36:5436 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP /IP connections. |
And after a few time we can see that the new master is on pgserver2
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
|
[root@pgservertools efm-3.0] # /usr/edb/efm-3.0/bin/efm cluster-status efm Cluster Status: efm VIP: Agent Type Address Agent DB Info -------------------------------------------------------------- Witness 192.168.56.30 UP N /A Standby 192.168.56.36 UP UP Master 192.168.56.37 UP UP Allowed node host list: 192.168.56.36 192.168.56.37 192.168.56.30 Membership coordinator: 192.168.56.30 Standby priority host list: 192.168.56.36 Promote Status: DB Type Address XLog Loc Info -------------------------------------------------------------- Master 192.168.56.37 0 /440001A8 Standby 192.168.56.36 0 /440001A8 Standby database(s) in sync with master. It is safe to promote. [root@pgservertools efm-3.0] # |
The purpose of the witness is to do an automatic failover when the primary is down. Let’s simulate a crash of our primary database by killing the corresponding process.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
|
[root@pgserver2 tmp] # ps -ef | grep postgres ... ... root 17529 14103 0 16:45 pts /1 00:00:00 tail -f /u90/mydata/101/log/postgresql-2018-03-06 .log <strong>postgres 20612 1 0 17:56 ? 00:00:00 /u01/app/postgres/product/10/db_1/bin/postgres -D /u90/mydata/101 < /strong > postgres 20613 20612 0 17:56 ? 00:00:00 postgres: logger process postgres 20615 20612 0 17:56 ? 00:00:00 postgres: checkpointer process postgres 20616 20612 0 17:56 ? 00:00:00 postgres: writer process postgres 20617 20612 0 17:56 ? 00:00:00 postgres: stats collector process postgres 20819 20612 0 18:00 ? 00:00:00 postgres: wal writer process postgres 20820 20612 0 18:00 ? 00:00:00 postgres: autovacuum launcher process postgres 20821 20612 0 18:00 ? 00:00:00 postgres: archiver process last was 00000008. history postgres 20822 20612 0 18:00 ? 00:00:00 postgres: bgworker: logical replication launcher postgres 20832 20612 0 18:00 ? 00:00:00 postgres: wal sender process repliuser 192.168.56.36(45827) streaming 0 /440001A8 root 21481 16868 0 18:16 pts /0 00:00:00 grep --color=auto postgres [root@pgserver2 tmp] # |
And let’s execute the kill command
1
|
[root@pgserver2 tmp] # kill -9 20612 |
If we check the cluster status from the witness server, we can see that the master is in an UNKNOWN status
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
|
[root@pgservertools efm-3.0] # /usr/edb/efm-3.0/bin/efm cluster-status efm Cluster Status: efm VIP: Agent Type Address Agent DB Info -------------------------------------------------------------- Witness 192.168.56.30 UP N /A Standby 192.168.56.36 UP UP Idle 192.168.56.37 UP UNKNOWN Allowed node host list: 192.168.56.36 192.168.56.37 192.168.56.30 Membership coordinator: 192.168.56.30 Standby priority host list: 192.168.56.36 Promote Status: DB Type Address XLog Loc Info -------------------------------------------------------------- Standby 192.168.56.36 0 /440001A8 No master database was found. Idle Node Status (idle nodes ignored in XLog location comparisons): Address XLog Loc Info -------------------------------------------------------------- 192.168.56.37 UNKNOWN Connection to 192.168.56.37:5436 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP /IP connections. [root@pgservertools efm-3.0] # |
In the alert log in our standby server pgserver1, we can see that that the database is converted to a primary one a few time after.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
|
2018-03-06 18:17:49.381 CET [18384] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. 2018-03-06 18:17:49.382 CET [18380] LOG: invalid record length at 0 /440001A8 : wanted 24, got 0 2018-03-06 18:17:49.387 CET [19049] FATAL: could not connect to the primary server: could not connect to server: Connection refused Is the server running on host "192.168.56.37" and accepting TCP /IP connections on port 5436? 2018-03-06 18:17:54.404 CET [19055] FATAL: could not connect to the primary server: could not connect to server: Connection refused Is the server running on host "192.168.56.37" and accepting TCP /IP connections on port 5436? 2018-03-06 18:17:59.406 CET [19107] FATAL: could not connect to the primary server: could not connect to server: Connection refused Is the server running on host "192.168.56.37" and accepting … …. …. TCP /IP connections on port 5436? TCP /IP connections on port 5436? 2018-03-06 18:18:34.450 CET [19128] FATAL: could not connect to the primary server: could not connect to server: Connection refused Is the server running on host "192.168.56.37" and accepting TCP /IP connections on port 5436? 2018-03-06 18:18:39.451 CET [19134] FATAL: could not connect to the primary server: could not connect to server: Connection refused Is the server running on host "192.168.56.37" and accepting TCP /IP connections on port 5436? 2018-03-06 18:18:44.462 CET [19135] FATAL: could not connect to the primary server: could not connect to server: Connection refused Is the server running on host "192.168.56.37" and accepting TCP /IP connections on port 5436? 2018-03-06 18:18:49.456 CET [18380] LOG: trigger file found: /tmp/postgresql .trigger 2018-03-06 18:18:49.456 CET [18380] LOG: redo done at 0 /44000170 2018-03-06 18:18:49.479 CET [18380] LOG: selected new timeline ID: 9 2018-03-06 18:18:50.128 CET [18380] LOG: archive recovery complete 2018-03-06 18:18:50.229 CET [18378] LOG: database system is ready to accept connections |
What we can confirm by querying the cluster status
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
|
[root@pgservertools efm-3.0] # /usr/edb/efm-3.0/bin/efm cluster-status efm Cluster Status: efm VIP: Agent Type Address Agent DB Info -------------------------------------------------------------- Witness 192.168.56.30 UP N /A Master 192.168.56.36 UP UP Idle 192.168.56.37 UP UNKNOWN Allowed node host list: 192.168.56.36 192.168.56.37 192.168.56.30 Membership coordinator: 192.168.56.30 Standby priority host list: (List is empty.) Promote Status: DB Type Address XLog Loc Info -------------------------------------------------------------- Master 192.168.56.36 0 /440002B8 No standby databases were found. Idle Node Status (idle nodes ignored in XLog location comparisons): Address XLog Loc Info -------------------------------------------------------------- 192.168.56.37 UNKNOWN Connection to 192.168.56.37:5436 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP /IP connections. [root@pgservertools efm-3.0] # |
and
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
|
[root@pgservertools efm-3.0] # /usr/edb/efm-3.0/bin/efm cluster-status efm Cluster Status: efm VIP: Agent Type Address Agent DB Info -------------------------------------------------------------- Witness 192.168.56.30 UP N /A Master 192.168.56.36 UP UP Allowed node host list: 192.168.56.36 192.168.56.37 192.168.56.30 Membership coordinator: 192.168.56.30 Standby priority host list: (List is empty.) Promote Status: DB Type Address XLog Loc Info -------------------------------------------------------------- Master 192.168.56.36 0 /440002B8 No standby databases were found. [root@pgservertools efm-3.0] # |
On the old primary pgserver2 we can see the contents of the file recovery.conf automatically created by EDB Failover manager
1
2
3
4
5
6
7
|
[postgres@pgserver2 101]$ cat recovery.conf # EDB Failover Manager # This generated recovery.conf file prevents the db server from accidentally # being restarted as a master since a failover or promotion has occurred standby_mode = on restore_command = 'echo 2>"recovery suspended on failed server node"; exit 1' [postgres@pgserver2 101]$ |
To rebuild our standby database we have to edit the recovery.conf file
Conclusion
We have seen in this blog how to configure edb-efm30. Note that a virtual IP can be also configured. The official documentation can help.