In a distributed database system like PostgreSQL managed by Patroni, etcd plays a critical role as the distributed key-value store for cluster coordination and configuration. As your system evolves, upgrading etcd becomes necessary to leverage new features, bug fixes, and security enhancements. However, upgrading etcd in a live cluster requires careful planning and execution to ensure data integrity. In this guide, we’ll walk through the process of upgrading etcd from version 3.4.25 to 3.5.12 in a Patroni cluster, based on the detailed notes I took during the upgrade process.

Check the upgrade checklist

Before trying to upgrade, it is important to have a look at all the deprecated features and at the upgrade requirements. In our case, to upgrade to version 3.5.x, it is mandatory that the running cluster is healthy and at least in version 3.4 already.
You can find all this information on the official etcd documentation:
https://etcd.io/docs/v3.3/upgrades/upgrade_3_5/

No downtime for patroni

During the upgrade process, an etcd cluster can accommodate a mix of etcd member versions, functioning based on the protocol of the lowest common version present. The cluster achieves the upgraded status only when all its members are updated to version 3.5. Internally, etcd members negotiate among themselves to establish the overall cluster version, influencing the reported version and the features supported by the cluster.

In most scenarios, transitioning from etcd 3.4 to 3.5 can be accomplished seamlessly through a rolling upgrade process, ensuring zero downtime. Sequentially halt the etcd v3.4 processes, substituting them with etcd v3.5 processes. Upon completion of the migration to v3.5 across all nodes, the enhanced functionalities introduced in v3.5 become accessible to the cluster.

Preparing for the Upgrade

Before starting the upgrade process, it’s essential to make adequate preparations to minimize any potential risks or disruptions. Here are some preliminary steps:

  • Check current etcd version
[pgt001] postgres@patroni-1:/postgres/app/postgres> etcdctl version
etcdctl version: 3.4.25
API version: 3.4
  • Backup etcd data

Use etcdctl to create a snapshot of the etcd data. This ensures that you have a fallback option in case something goes wrong during the upgrade process.

[pgt001] postgres@patroni-1:~> etcdctl snapshot save backup.db
{"level":"info","ts":1710507460.523724,"caller":"snapshot/v3_snapshot.go:119","msg":"created temporary db file","path":"backup.db.part"}
{"level":"info","ts":"2024-03-15T13:57:40.538461+0100","caller":"clientv3/maintenance.go:200","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":1710507460.539052,"caller":"snapshot/v3_snapshot.go:127","msg":"fetching snapshot","endpoint":"127.0.0.1:2379"}
{"level":"info","ts":"2024-03-15T13:57:40.548342+0100","caller":"clientv3/maintenance.go:208","msg":"completed snapshot read; closing"}
{"level":"info","ts":1710507460.5576544,"caller":"snapshot/v3_snapshot.go:142","msg":"fetched snapshot","endpoint":"127.0.0.1:2379","size":"57 kB","took":0.030259485}
{"level":"info","ts":1710507460.5580025,"caller":"snapshot/v3_snapshot.go:152","msg":"saved","path":"backup.db"}
Snapshot saved at backup.db
[pgt001] postgres@patroni-1:~> ll
total 60
-rw------- 1 postgres postgres 57376 Mar 15 13:57 backup.db

[pgt001] postgres@patroni-1:~> etcdctl --write-out=table snapshot status backup.db
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 29c96081 |      107 |        117 |      57 kB |
+----------+----------+------------+------------+
  • Pause Cluster Management

Use Patroni’s patronictl to pause cluster management. This prevents any automated failover or configuration changes during the upgrade process. (https://patroni.readthedocs.io/en/latest/pause.html)

[pgt001] postgres@patroni-1:/postgres/app/postgres/local/dmk/bin> patronictl pause --wait
'pause' request sent, waiting until it is recognized by all nodes
Success: cluster management is paused

Performing the Upgrade

Now that you’ve prepared your cluster for the upgrade, you can proceed with the actual upgrade steps. All the steps are performed node by node, as mentioned earlier. I will start the upgrade on the third node of my cluster, patroni-3.

  • Stop etcd

Stop the etcd service. This ensures that no changes are made to the cluster while the upgrade is in progress.

[pg133] postgres@patroni-3:/postgres/app/postgres/local/dmk/bin> sudo systemctl stop etcd
  • Extract and Install New etcd Version

Download the new etcd binary and extract it. Then, replace the existing etcd binaries with the new ones.

[pg133] postgres@patroni-3:/postgres/app/postgres/local/dmk/bin> tar axf etcd-v3.5.12-linux-amd64.tar.gz
[pg133] postgres@patroni-3:/postgres/app/postgres/local/dmk/bin> mv etcd-v3.5.12-linux-amd64/etcd* /postgres/app/postgres/local/dmk/bin/
[pg133] postgres@patroni-3:/postgres/app/postgres/local/dmk/bin> etcdctl version
etcdctl version: 3.5.12
API version: 3.5
  • Start etcd

Start the upgraded etcd service

[pg133] postgres@patroni-3:/postgres/app/postgres/local/dmk/bin> sudo systemctl start etcd
[pg133] postgres@patroni-3:/postgres/app/postgres/local/dmk/bin> sudo systemctl status etcd
● etcd.service - dbi services etcd service
     Loaded: loaded (/etc/systemd/system/etcd.service; enabled; preset: enabled)
     Active: active (running) since Fri 2024-03-15 14:02:39 CET; 10s ago
   Main PID: 1561 (etcd)
      Tasks: 9 (limit: 9454)
     Memory: 13.1M
        CPU: 369ms
     CGroup: /system.slice/etcd.service
             └─1561 /postgres/app/postgres/local/dmk/bin/etcd --config-file /postgres/app/postgres/local/dmk/etc/etcd.conf

Mar 15 14:02:38 patroni-3 etcd[1561]: {"level":"info","ts":"2024-03-15T14:02:38.292751+0100","caller":"etcdserver/server.go:783","msg":"initialized peer connections; fast-forwarding electi>
Mar 15 14:02:39 patroni-3 etcd[1561]: {"level":"info","ts":"2024-03-15T14:02:39.282054+0100","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"raft.node: f1457fc5460d0329 elected>
Mar 15 14:02:39 patroni-3 etcd[1561]: {"level":"info","ts":"2024-03-15T14:02:39.302529+0100","caller":"etcdserver/server.go:2068","msg":"published local member to cluster through raft","lo>
Mar 15 14:02:39 patroni-3 etcd[1561]: {"level":"info","ts":"2024-03-15T14:02:39.302985+0100","caller":"embed/serve.go:103","msg":"ready to serve client requests"}
Mar 15 14:02:39 patroni-3 etcd[1561]: {"level":"info","ts":"2024-03-15T14:02:39.30307+0100","caller":"embed/serve.go:103","msg":"ready to serve client requests"}
Mar 15 14:02:39 patroni-3 etcd[1561]: {"level":"info","ts":"2024-03-15T14:02:39.302942+0100","caller":"etcdmain/main.go:44","msg":"notifying init daemon"}
Mar 15 14:02:39 patroni-3 etcd[1561]: {"level":"info","ts":"2024-03-15T14:02:39.303671+0100","caller":"etcdmain/main.go:50","msg":"successfully notified init daemon"}
Mar 15 14:02:39 patroni-3 systemd[1]: Started etcd.service - dbi services etcd service.
Mar 15 14:02:39 patroni-3 etcd[1561]: {"level":"info","ts":"2024-03-15T14:02:39.304964+0100","caller":"embed/serve.go:187","msg":"serving client traffic insecurely; this is strongly discou>
Mar 15 14:02:39 patroni-3 etcd[1561]: {"level":"info","ts":"2024-03-15T14:02:39.305719+0100","caller":"embed/serve.go:187","msg":"serving client traffic insecurely; this is strongly discou>

Verifying Cluster Health

After each etcd upgrade, it’s always nice to verify the health and functionality of the etcd and Patroni cluster. You can notice from the etcdtcl command that the version was upgraded on the third node.

[pgt001] postgres@patroni-1:/postgres/app/postgres/local/dmk/bin> etcdctl endpoint status --cluster -w table
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://192.168.56.123:2379 | 90015c533cbf2e84 |  3.4.25 |   61 kB |     false |      false |        15 |        150 |                150 |        |
| http://192.168.56.124:2379 | 9fe85e3cebf257e3 |  3.4.25 |   61 kB |     false |      false |        15 |        150 |                150 |        |
| http://192.168.56.125:2379 | f1457fc5460d0329 |  3.5.12 |   61 kB |      true |      false |        15 |        150 |                150 |        |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[pgt001] postgres@patroni-1:/postgres/app/postgres/local/dmk/bin> patronictl list
+ Cluster: pgt001 (7346518467491201916) ----------+----+-----------+
| Member   | Host           | Role    | State     | TL | Lag in MB |
+----------+----------------+---------+-----------+----+-----------+
| pgt001_1 | 192.168.56.123 | Leader  | running   |  5 |           |
| pgt001_2 | 192.168.56.124 | Replica | streaming |  5 |         0 |
+----------+----------------+---------+-----------+----+-----------+
 Maintenance mode: on

Continuing the installation on the other nodes

  • Upgrading etcd on the second node of the cluster
[pgt001] postgres@patroni-2:/postgres/app/postgres/local/dmk/bin> sudo systemctl stop etcd
[pgt001] postgres@patroni-2:/postgres/app/postgres/local/dmk/bin> tar axf etcd-v3.5.12-linux-amd64.tar.gz
[pgt001] postgres@patroni-2:/postgres/app/postgres/local/dmk/bin> mv etcd-v3.5.12-linux-amd64/etcd* /postgres/app/postgres/local/dmk/bin/
[pgt001] postgres@patroni-2:/postgres/app/postgres/local/dmk/bin> etcdctl version
etcdctl version: 3.5.12
API version: 3.5
[pgt001] postgres@patroni-2:/postgres/app/postgres/local/dmk/bin> sudo systemctl start etcd
[pgt001] postgres@patroni-2:/postgres/app/postgres/local/dmk/bin> sudo systemctl status etcd
● etcd.service - dbi services etcd service
     Loaded: loaded (/etc/systemd/system/etcd.service; enabled; preset: enabled)
     Active: active (running) since Fri 2024-03-15 14:04:46 CET; 4s ago
   Main PID: 1791 (etcd)
      Tasks: 7 (limit: 9454)
     Memory: 9.7M
        CPU: 295ms
     CGroup: /system.slice/etcd.service
             └─1791 /postgres/app/postgres/local/dmk/bin/etcd --config-file /postgres/app/postgres/local/dmk/etc/etcd.conf

Mar 15 14:04:45 patroni-2 etcd[1791]: {"level":"info","ts":"2024-03-15T14:04:45.690431+0100","caller":"rafthttp/stream.go:274","msg":"established TCP streaming connection with remote peer">
Mar 15 14:04:46 patroni-2 etcd[1791]: {"level":"info","ts":"2024-03-15T14:04:46.739502+0100","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"raft.node: 9fe85e3cebf257e3 elected>
Mar 15 14:04:46 patroni-2 etcd[1791]: {"level":"info","ts":"2024-03-15T14:04:46.75204+0100","caller":"etcdserver/server.go:2068","msg":"published local member to cluster through raft","loc>
Mar 15 14:04:46 patroni-2 etcd[1791]: {"level":"info","ts":"2024-03-15T14:04:46.752889+0100","caller":"embed/serve.go:103","msg":"ready to serve client requests"}
Mar 15 14:04:46 patroni-2 etcd[1791]: {"level":"info","ts":"2024-03-15T14:04:46.753543+0100","caller":"etcdmain/main.go:44","msg":"notifying init daemon"}
Mar 15 14:04:46 patroni-2 systemd[1]: Started etcd.service - dbi services etcd service.
Mar 15 14:04:46 patroni-2 etcd[1791]: {"level":"info","ts":"2024-03-15T14:04:46.754213+0100","caller":"embed/serve.go:187","msg":"serving client traffic insecurely; this is strongly discou>
Mar 15 14:04:46 patroni-2 etcd[1791]: {"level":"info","ts":"2024-03-15T14:04:46.757187+0100","caller":"embed/serve.go:103","msg":"ready to serve client requests"}
Mar 15 14:04:46 patroni-2 etcd[1791]: {"level":"info","ts":"2024-03-15T14:04:46.757933+0100","caller":"embed/serve.go:187","msg":"serving client traffic insecurely; this is strongly discou>
Mar 15 14:04:46 patroni-2 etcd[1791]: {"level":"info","ts":"2024-03-15T14:04:46.75994+0100","caller":"etcdmain/main.go:50","msg":"successfully notified init daemon"}


[pgt001] postgres@patroni-2:/postgres/app/postgres/local/dmk/bin> patronictl list
+ Cluster: pgt001 (7346518467491201916) ----------+----+-----------+
| Member   | Host           | Role    | State     | TL | Lag in MB |
+----------+----------------+---------+-----------+----+-----------+
| pgt001_1 | 192.168.56.123 | Leader  | running   |  5 |           |
| pgt001_2 | 192.168.56.124 | Replica | streaming |  5 |         0 |
+----------+----------------+---------+-----------+----+-----------+
 Maintenance mode: on
  • Upgrading etcd on the third node of the cluster
[pgt001] postgres@patroni-1:/postgres/app/postgres/local/dmk/bin> sudo systemctl stop etcd

Now that we stopped etcd on the server where is our Patroni leader node, let’s take a look at our patroni cluster status.

[pgt001] postgres@patroni-2:/postgres/app/postgres/local/dmk/bin> patronictl list
2024-03-15 14:05:52,778 - ERROR - Failed to get list of machines from http://192.168.56.123:2379/v3beta: MaxRetryError("HTTPConnectionPool(host='192.168.56.123', port=2379): Max retries exceeded with url: /version (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3584365590>: Failed to establish a new connection: [Errno 111] Connection refused'))")
+ Cluster: pgt001 (7346518467491201916) ----------+----+-----------+
| Member   | Host           | Role    | State     | TL | Lag in MB |
+----------+----------------+---------+-----------+----+-----------+
| pgt001_1 | 192.168.56.123 | Leader  | running   |  5 |           |
| pgt001_2 | 192.168.56.124 | Replica | streaming |  5 |         0 |
+----------+----------------+---------+-----------+----+-----------+
 Maintenance mode: on
[pgt001] postgres@patroni-2:/postgres/app/postgres/local/dmk/bin> sq
psql (14.7 dbi services build)
Type "help" for help.

postgres=# exit

We can notice that our Patroni cluster is still up and running and that PostgreSQL cluster is still reachable. Also, thanks to patroni maintenance mode, no failover or configuration changes are happening.

Let’s continue with the installation

[pgt001] postgres@patroni-1:/postgres/app/postgres/local/dmk/bin> tar axf etcd-v3.5.12-linux- amd64.tar.gz
[pgt001] postgres@patroni-1:/postgres/app/postgres/local/dmk/bin> mv etcd-v3.5.12-linux-amd64 /etcd* /postgres/app/postgres/local/dmk/bin/
[pgt001] postgres@patroni-1:/postgres/app/postgres/local/dmk/bin> etcdctl version
etcdctl version: 3.5.12
API version: 3.5
[pgt001] postgres@patroni-1:/postgres/app/postgres/local/dmk/bin> sudo systemctl start etcd
[pgt001] postgres@patroni-1:/postgres/app/postgres/local/dmk/bin> sudo systemctl status etcd
● etcd.service - dbi services etcd service
Loaded: loaded (/etc/systemd/system/etcd.service; enabled; preset: enabled)
Active: active (running) since Fri 2024-03-15 14:07:12 CET; 3s ago
Main PID: 1914 (etcd)
Tasks: 7 (limit: 9454)
Memory: 15.9M
CPU: 160ms
CGroup: /system.slice/etcd.service
└─1914 /postgres/app/postgres/local/dmk/bin/etcd --config-file /postgres/app/postgres/local/dmk/etc/etcd.conf

Mar 15 14:07:12 patroni-1 etcd[1914]: {"level":"info","ts":"2024-03-15T14:07:12.180191+0100","caller":"etcdserver/server.go:2068","msg":"published local member to cluster through raft","lo>
Mar 15 14:07:12 patroni-1 etcd[1914]: {"level":"info","ts":"2024-03-15T14:07:12.180266+0100","caller":"embed/serve.go:103","msg":"ready to serve client requests"}
Mar 15 14:07:12 patroni-1 etcd[1914]: {"level":"info","ts":"2024-03-15T14:07:12.181162+0100","caller":"embed/serve.go:103","msg":"ready to serve client requests"}
Mar 15 14:07:12 patroni-1 etcd[1914]: {"level":"info","ts":"2024-03-15T14:07:12.182377+0100","caller":"embed/serve.go:187","msg":"serving client traffic insecurely; this is strongly discou>
Mar 15 14:07:12 patroni-1 etcd[1914]: {"level":"info","ts":"2024-03-15T14:07:12.182625+0100","caller":"embed/serve.go:187","msg":"serving client traffic insecurely; this is strongly discou>
Mar 15 14:07:12 patroni-1 etcd[1914]: {"level":"info","ts":"2024-03-15T14:07:12.183861+0100","caller":"etcdmain/main.go:44","msg":"notifying init daemon"}
Mar 15 14:07:12 patroni-1 systemd[1]: Started etcd.service - dbi services etcd service.
Mar 15 14:07:12 patroni-1 etcd[1914]: {"level":"info","ts":"2024-03-15T14:07:12.187771+0100","caller":"etcdmain/main.go:50","msg":"successfully notified init daemon"}
Mar 15 14:07:12 patroni-1 etcd[1914]: {"level":"info","ts":"2024-03-15T14:07:12.195369+0100","caller":"membership/cluster.go:576","msg":"updated cluster version","cluster-id":"571a53e78674>
Mar 15 14:07:12 patroni-1 etcd[1914]: {"level":"info","ts":"2024-03-15T14:07:12.195541+0100","caller":"api/capability.go:75","msg":"enabled capabilities for version","cluster-version":"3.5>

We now have upgraded etcd on all our nodes and we need to control the status of our clusters.

[pgt001] postgres@patroni-1:/postgres/app/postgres/local/dmk/bin> patronictl list
+ Cluster: pgt001 (7346518467491201916) ----------+----+-----------+
| Member   | Host           | Role    | State     | TL | Lag in MB |
+----------+----------------+---------+-----------+----+-----------+
| pgt001_1 | 192.168.56.123 | Leader  | running   |  5 |           |
| pgt001_2 | 192.168.56.124 | Replica | streaming |  5 |         0 |
+----------+----------------+---------+-----------+----+-----------+

[pgt001] postgres@patroni-1:/postgres/app/postgres/local/dmk/bin> etcdctl endpoint status --cluster -w table
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://192.168.56.123:2379 | 90015c533cbf2e84 |  3.5.12 |   61 kB |     false |      false |        15 |        150 |                150 |        |
| http://192.168.56.124:2379 | 9fe85e3cebf257e3 |  3.5.12 |   61 kB |     false |      false |        15 |        150 |                150 |        |
| http://192.168.56.125:2379 | f1457fc5460d0329 |  3.5.12 |   61 kB |      true |      false |        15 |        150 |                150 |        |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

[pgt001] postgres@patroni-1:/postgres/app/postgres/local/dmk/bin> etcdctl version
etcdctl version: 3.5.12
API version: 3.5

Resume Cluster Management

Once you’ve confirmed that the upgrade was successful, resume cluster management to allow Patroni to resume its normal operations and quit maintenance mode.

[pgt001] postgres@patroni-1:/postgres/app/postgres/local/dmk/bin> patronictl resume --wait
'resume' request sent, waiting until it is recognized by all nodes
Success: cluster management is resumed
[pgt001] postgres@patroni-1:/postgres/app/postgres/local/dmk/bin> patronictl list
+ Cluster: pgt001 (7346518467491201916) ----------+----+-----------+
| Member   | Host           | Role    | State     | TL | Lag in MB |
+----------+----------------+---------+-----------+----+-----------+
| pgt001_1 | 192.168.56.123 | Leader  | running   |  5 |           |
| pgt001_2 | 192.168.56.124 | Replica | streaming |  5 |         0 |
+----------+----------------+---------+-----------+----+-----------+

Conclusion

Upgrading etcd in a Patroni cluster is a critical maintenance task that requires careful planning and execution. By following the steps outlined in this guide and leveraging the detailed notes taken during the upgrade process, you can ensure a smooth and successful upgrade while ensuring data integrity. Remember to always test the upgrade process in a staging environment before performing it in production to mitigate any potential risks.