Patroni runs your PostgreSQL cluster and handles failover, promoting a replica when the primary dies and recording the change in its distributed configuration store (etcd, Consul, or ZooKeeper). That part works on its own.

Your applications still need one stable address to connect to, and they need writes to reach the primary while reads spread across replicas. HAProxy handles that routing, with a floating IP from Keepalived in front of it.

How the tools work together

Three components stand between your application and the database.

Patroni manages replication and failover, and it runs an agent on every PostgreSQL node. Each agent exposes a small REST API (port 8008 by default) that reports that node’s role.

HAProxy accepts client connections and forwards them to the right node. It asks Patroni’s REST API which node is the primary and which are replicas, then sends each connection to a matching node.

Keepalived publishes a virtual IP that floats between your HAProxy hosts using VRRP. Your application connects to the VIP, so one HAProxy host going down doesn’t take the whole entry point with it.

Your application talks to the VIP. Keepalived points the VIP at a live HAProxy. HAProxy forwards the connection to whichever PostgreSQL node Patroni reports as healthy for that role.

The health-check method

HAProxy checks one port and routes to another.

Patroni’s REST API returns an HTTP status that depends on the node’s role:

  • GET / returns 200 only on the leader (the primary). A non-leader node returns 503.
  • GET /primary is the explicit name for the same leader check.
  • GET /replica returns 200 only on a running replica.
  • GET /read-only returns 200 on the primary or a replica, that is, on any node that can serve a read.

In our case, HAProxy runs its health check against the API port (8008) and reads that status code, then forwards the SQL connection to the database port (5432). A node receives traffic only when its API answers 200 for the role that listener cares about. Point a listener’s check at / and it follows the primary. Point it at /replica and it follows the replicas. When Patroni promotes a new leader, the status codes change and HAProxy moves traffic to match within a few health-check cycles.
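To make the status-code logic concrete, here is a minimal Python sketch of it. It models only the four endpoints and the two roles described above, and it never talks to a real Patroni API. The names patroni_status and usable_servers are made up for this illustration.

```python
from __future__ import annotations

# Simplified model of the Patroni health endpoints described in this article.
# Only "primary" and "replica" are modelled; a real cluster has more states.
OK, UNAVAILABLE = 200, 503

# Which roles answer 200 on which endpoint.
ENDPOINT_ROLES = {
    "/": {"primary"},
    "/primary": {"primary"},
    "/replica": {"replica"},
    "/read-only": {"primary", "replica"},
}


def patroni_status(role: str, path: str = "/") -> int:
    """Return the status code a running node with `role` gives for `path`."""
    if path not in ENDPOINT_ROLES:
        raise ValueError(f"endpoint not modelled: {path}")
    return OK if role in ENDPOINT_ROLES[path] else UNAVAILABLE


def usable_servers(roles: dict[str, str], check_path: str) -> list[str]:
    """Nodes a listener keeps in rotation when it checks `check_path`."""
    return sorted(
        node for node, role in roles.items()
        if patroni_status(role, check_path) == OK
    )


if __name__ == "__main__":
    cluster = {"10.5.5.147": "primary", "10.5.5.148": "replica"}
    print("write pool:", usable_servers(cluster, "/"))
    print("read pool: ", usable_servers(cluster, "/replica"))
```

Swap the two roles in the cluster dictionary and the pools swap with them, which is all a failover looks like from HAProxy’s side.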

A first, simple working configuration

A two-node setup with 10.5.5.147 and 10.5.5.148 looks like this. One listener handles writes, the other handles reads.

listen PG1
    bind *:5000
    option httpchk
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server postgresql_10.5.5.147_5432 10.5.5.147:5432 maxconn 100 check port 8008
    server postgresql_10.5.5.148_5432 10.5.5.148:5432 maxconn 100 check port 8008

listen PG1_ro
    bind *:5001
    option httpchk GET /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server postgresql_10.5.5.147_5432 10.5.5.147:5432 maxconn 100 check port 8008
    server postgresql_10.5.5.148_5432 10.5.5.148:5432 maxconn 100 check port 8008

This example assumes the default ports, PostgreSQL on 5432 and the Patroni API on 8008. Swap in whatever ports your deployment uses.

Line by line:

  • bind *:5000 and bind *:5001 are the two addresses your applications connect to. Send writes to 5000 and reads to 5001.
  • option httpchk (with no path) on the first listener checks Patroni’s root endpoint. Only the leader answers 200, so HAProxy sends port 5000 traffic to the current primary.
  • option httpchk GET /replica on the second listener checks the replica endpoint, so HAProxy sends port 5001 traffic to a replica.
  • http-check expect status 200 tells HAProxy that 200 means healthy and anything else means down.
  • inter 3s fall 3 rise 2 checks every 3 seconds, marks a server down after 3 consecutive failures, and brings it back after 2 consecutive successes.
  • on-marked-down shutdown-sessions kills existing connections to a server the instant HAProxy marks it down, so clients reconnect and get rerouted instead of hanging on a dead node.
  • check port 8008 is the trick in action: health checks hit the Patroni API on 8008 while HAProxy forwards traffic to PostgreSQL on 5432.
  • maxconn 100 limits connections per server so you don’t exhaust PostgreSQL’s connection slots.

For a primary plus one or more replicas, this routes writes and reads to the right node and survives a failover.
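If the fall and rise counters feel abstract, this small Python sketch replays a series of check results using the inter 3s fall 3 rise 2 settings from the config. It is a simplified model of the "consecutive checks" behaviour, not HAProxy’s actual code, and track_state is a name invented for the example.

```python
def track_state(results, fall=3, rise=2, start_up=True):
    """Replay health-check results (True = check passed).

    Returns the server state (True = up) after each check. The server flips
    from up to down after `fall` consecutive failures, and from down to up
    after `rise` consecutive successes.
    """
    up = start_up
    streak = 0  # consecutive results that disagree with the current state
    states = []
    for passed in results:
        if bool(passed) == up:
            streak = 0  # result agrees with current state, so reset the count
        else:
            streak += 1
            if streak >= (fall if up else rise):
                up = not up
                streak = 0
        states.append(up)
    return states


if __name__ == "__main__":
    # One check every 3 seconds: a pass, three failures, then two passes.
    checks = [True, False, False, False, True, True]
    for second, state in zip(range(0, 3 * len(checks), 3), track_state(checks)):
        print(f"t={second:>2}s  {'UP' if state else 'DOWN'}")
```

With a 3 second interval, three failures in a row span roughly nine seconds, so that is about how long a node that stopped answering 200 can stay in rotation before HAProxy drops it.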

The failure mode hiding in the read path

Imagine a two-node cluster: one primary, one replica. The replica goes down. Maybe it crashed, maybe Patroni is mid-switchover and no standby exists for a few seconds.

Your read traffic hits port 5001. That listener marks a server up only when GET /replica returns 200, and right now no node is a replica. HAProxy has zero usable servers in the pool, so it has nowhere to send the connection and drops it. Read queries start failing.

The primary is up the entire time, and it can serve those reads. Your config won’t send them there, because you told the read listener to look for replicas and nothing else. You’ve turned a degraded cluster that could still serve reads into a read outage. You feel this most on small clusters, where each failover passes through a window in which the old primary is still becoming a replica and no standby is available yet. In the worst case, your replica is down and one of your applications is connecting to port 5001 and getting errors.

The fix: fall back to the primary

Send reads to the primary when the read listener runs out of replicas, instead of dropping them.

listen PG1
    bind *:5000
    option httpchk
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server postgresql_10.5.5.147_5432 10.5.5.147:5432 maxconn 100 check port 8008
    server postgresql_10.5.5.148_5432 10.5.5.148:5432 maxconn 100 check port 8008

listen PG1_ro
    bind *:5001
    option httpchk GET /replica
    http-check expect status 200
    use_backend PG1_ro_leader if { nbsrv(PG1_ro) eq 0 }
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server postgresql_10.5.5.147_5432 10.5.5.147:5432 maxconn 100 check port 8008
    server postgresql_10.5.5.148_5432 10.5.5.148:5432 maxconn 100 check port 8008

backend PG1_ro_leader
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server postgresql_10.5.5.147_5432 10.5.5.147:5432 maxconn 100 check port 8008
    server postgresql_10.5.5.148_5432 10.5.5.148:5432 maxconn 100 check port 8008

Two things are new: the PG1_ro_leader backend and this line, which does the switching:

use_backend PG1_ro_leader if { nbsrv(PG1_ro) eq 0 }

nbsrv(PG1_ro) counts the usable servers in the PG1_ro pool, which here means the number of available replicas, since those servers pass the check only when GET /replica returns 200. While at least one replica is up, the count stays above zero, the condition is false, and reads stay on the replicas. Once HAProxy marks the last replica down, nbsrv(PG1_ro) hits zero, the condition fires, and HAProxy diverts reads to the PG1_ro_leader backend.

That backend health-checks with GET /primary, so the only server it counts as up is the current primary. HAProxy sends reads to the primary until a replica passes its check again, then shifts them back to the replica pool.
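The decision HAProxy makes for each new read connection can be sketched in a few lines of Python. This is only a model of the rule, not something HAProxy runs. The backend names come from the config above, while pick_read_pool and the "down" role are inventions for this sketch.

```python
def pick_read_pool(roles):
    """Decide where a new connection on port 5001 goes.

    `roles` maps node address -> "primary", "replica" or "down".
    Returns (backend name, usable servers in that backend).
    """
    # Servers in PG1_ro are up only when GET /replica answers 200.
    replicas = sorted(n for n, r in roles.items() if r == "replica")
    if len(replicas) > 0:
        # nbsrv(PG1_ro) is above zero, so the use_backend condition is false.
        return "PG1_ro", replicas

    # nbsrv(PG1_ro) eq 0: divert to the backend that checks GET /primary.
    leaders = sorted(n for n, r in roles.items() if r == "primary")
    return "PG1_ro_leader", leaders


if __name__ == "__main__":
    healthy = {"10.5.5.147": "primary", "10.5.5.148": "replica"}
    degraded = {"10.5.5.147": "primary", "10.5.5.148": "down"}
    print(pick_read_pool(healthy))
    print(pick_read_pool(degraded))
```

If every node is down, the fallback backend is empty too and reads fail. At that point there is nothing left to read from.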

Three names have to agree for this to work. The backend you define (backend PG1_ro_leader), the backend you route to (use_backend PG1_ro_leader), and the pool you count (nbsrv(PG1_ro)) must all match the actual section names. Drop in a stale name from an earlier version and HAProxy either refuses to start or counts the wrong pool.

The /read-only shortcut and what it costs

Patroni offers GET /read-only, which returns 200 on the primary and the replicas alike. Point the read listener there and both the primary and the replicas serve reads, no fallback backend needed.

The cost is read load on the primary even when your replicas are healthy and idle. The fallback approach keeps reads off the primary until the replicas are gone, then leans on it as a safety net. You protect the primary’s write capacity during normal operation and still keep reads alive during a replica outage.

To keep lagging replicas out of the read pool, Patroni accepts a threshold on the replica check, for example GET /replica?lag=10MB, which fails any replica more than 10 MB behind. Pair that with the fallback and HAProxy drops the lagging replicas from rotation while reads still have somewhere to go.
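Here is a rough Python model of what the lag threshold does to the replica check. It assumes 1024-based units (so 10MB is 10 × 1024 × 1024 bytes) and assumes a replica fails only when its lag is strictly above the threshold. Check both against the Patroni documentation for your version. parse_size and replica_check are hypothetical helper names.

```python
# Assumption: size suffixes are 1024-based.
UNITS = {"kB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_size(text):
    """Turn a string such as "10MB" into bytes; a bare number is bytes."""
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    return int(text)


def replica_check(role, lag_bytes, max_lag=None):
    """Model of GET /replica, optionally with ?lag=<max_lag>.

    Returns 200 for a replica within the allowed lag, 503 otherwise.
    """
    if role != "replica":
        return 503
    if max_lag is not None and lag_bytes > parse_size(max_lag):
        return 503  # too far behind: the check fails and HAProxy drops it
    return 200


if __name__ == "__main__":
    mb = 1024 ** 2
    print(replica_check("replica", 2 * mb, "10MB"))   # within the threshold
    print(replica_check("replica", 50 * mb, "10MB"))  # lagging too far
```

When every replica exceeds the threshold, the read pool empties exactly as if they had crashed, which is why the lag check works best together with the fallback to the primary.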

Keepalived: removing HAProxy as a single point of failure

One HAProxy host fronting the cluster moves the single point of failure up a layer. Run HAProxy on two hosts and let Keepalived float a virtual IP between them with VRRP. Your application connects to the VIP, and whichever HAProxy holds it answers.

A minimal keepalived.conf on the primary HAProxy host:

vrrp_script chk_haproxy {
    script "killall -0 haproxy"   # succeeds while the haproxy process is alive
    interval 2
    weight 2
}

vrrp_instance VI_1 {
    interface eth0
    state MASTER
    virtual_router_id 51
    priority 101
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass changeme
    }
    virtual_ipaddress {
        10.0.0.1
    }
    track_script {
        chk_haproxy
    }
}

The second HAProxy host runs the same file with state BACKUP and a lower priority (100). Both advertise over VRRP, and the higher priority holds the VIP. chk_haproxy runs every two seconds. If HAProxy dies on the active host, its priority drops and the backup takes the VIP, so an HAProxy crash on one host no longer takes the entry point down with it.
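The priority arithmetic is easy to get wrong, so here is a small Python sketch of it. It relies on keepalived’s behaviour for a positive weight: the weight is added to the base priority while the tracked script succeeds. The host names haproxy1 and haproxy2 and both function names are invented for the example, and VRRP’s tie-breaking is not modelled.

```python
def effective_priority(base, weight, script_ok):
    """Base priority plus the (positive) weight while the script succeeds."""
    return base + weight if script_ok else base


def vip_holder(hosts):
    """Pick the host with the highest effective priority.

    `hosts` maps host name -> (base priority, weight, script_ok).
    """
    return max(hosts, key=lambda name: effective_priority(*hosts[name]))


if __name__ == "__main__":
    # Both HAProxy processes alive: 101 + 2 = 103 beats 100 + 2 = 102.
    print(vip_holder({"haproxy1": (101, 2, True), "haproxy2": (100, 2, True)}))
    # HAProxy dies on haproxy1: it falls back to 101, below the backup's 102.
    print(vip_holder({"haproxy1": (101, 2, False), "haproxy2": (100, 2, True)}))
```

The failover works only because the gap between the two base priorities (1) is smaller than the weight (2). With base priorities of 103 and 100, a dead HAProxy on the first host would still leave it at 103 against 102, and the VIP would stay put.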

Point your applications at 10.0.0.1:5000 for writes and 10.0.0.1:5001 for reads. Your applications never see which physical HAProxy does the work.

Summary

Patroni keeps the cluster healthy and picks the leader. HAProxy turns Patroni’s REST API into routing, sending writes to the primary and reads to the replicas by health-checking the API port while forwarding to the database port. The naive read-only listener drops reads when the last replica goes down, even though the primary could serve them. Adding use_backend ... if { nbsrv(...) eq 0 } with a primary-checking backend closes that gap, and a lag threshold on the replica check keeps stale standbys out of rotation. Keepalived puts a floating VIP in front of two HAProxy instances so the proxy layer survives a host failure too.

Writes reach the primary and reads spread across the replicas. Reads stay up as long as one node in the cluster is alive.

Let me know if you find any improvements to this configuration 😀