In the last post about PostgreSQL on ZFS, the pgbench tests showed that ZFS was slower for the standard tests compared to ext4 on Debian. A comment on that post made me wonder if changing the ZFS record size would change the results, so this blog post is about exactly that. There is plenty of information out there when it comes to ZFS for PostgreSQL, and most of it recommends aligning the ZFS record size and the PostgreSQL block size in one way or another (e.g. see here). Reading that, there are basically three options: change the ZFS record size, change the PostgreSQL block size, or change both of them.
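
If you want to check which values you are currently running with, both can be queried easily (a quick sketch, using the dataset name that will be created below):

$ psql -c "show block_size"        # PostgreSQL block size in bytes, 8192 by default
$ zfs get recordsize pgpool/pgzfs  # ZFS record size, 128K by default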

As I don’t like changing the PostgreSQL block size (as far as I know, pretty much everybody runs with the default of 8kB), the first test runs against a ZFS dataset with the record size aligned to the PostgreSQL block size:

[postgres@freebsd ~]$ geom disk list
Geom name: nda0
Providers:
1. Name: nda0
   Mediasize: 10737418240 (10G)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r2w2e6
   descr: Amazon Elastic Block Store
   ident: vol0a89ebfc6cc9173ff
   rotationrate: 0
   fwsectors: 0
   fwheads: 0

Geom name: nda1
Providers:
1. Name: nda1
   Mediasize: 21474836480 (20G)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r0w0e0
   descr: Amazon Elastic Block Store
   ident: vol0a6dcd2a6a1a1b898
   rotationrate: 0
   fwsectors: 0
   fwheads: 0

[postgres@freebsd ~]$ sudo zpool create pgpool /dev/nda1
[postgres@freebsd ~]$ sudo zpool list
NAME     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
pgpool  19.5G   360K  19.5G        -         -     0%     0%  1.00x    ONLINE  -
[postgres@freebsd ~]$ sudo zfs create pgpool/pgzfs
[postgres@freebsd ~]$ sudo zfs set mountpoint=/u02/pgdata/zfs pgpool/pgzfs
[postgres@freebsd ~]$ sudo zfs set compression=zstd pgpool/pgzfs
[postgres@freebsd ~]$ sudo zfs set recordsize=8K pgpool/pgzfs
[postgres@freebsd ~]$ df -h | grep zfs
df: minimum blocksize is 512
pgpool/pgzfs        19G     96K     19G     0%    /u02/pgdata/zfs
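
To double check that the properties are really in place, the dataset can be queried like this (just a sanity check, the values should match what was set above):

$ # should report 8K, zstd and /u02/pgdata/zfs
$ zfs get recordsize,compression,mountpoint pgpool/pgzfs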

For PostgreSQL, exactly the same setup as before (no checksums, and the same three parameter changes):

$ initdb --pgdata=/u02/pgdata/zfs/pg
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "C.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

creating directory /u02/pgdata/zfs/pg ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default "max_connections" ... 100
selecting default "shared_buffers" ... 128MB
selecting default time zone ... UTC
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

initdb: warning: enabling "trust" authentication for local connections
initdb: hint: You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

    pg_ctl -D /u02/pgdata/zfs/pg -l logfile start

$ echo "wal_init_zero='off'" >> /u02/pgdata/zfs/pg/postgresql.auto.conf
$ echo "full_page_writes='off'" >> /u02/pgdata/zfs/pg/postgresql.auto.conf
$ echo "port=5433" >> /u02/pgdata/zfs/pg/postgresql.auto.conf
$ pg_ctl --pgdata=/u02/pgdata/zfs/pg/ --log=/u02/pgdata/zfs/pg/log.log start
waiting for server to start.... done
$ export PGPORT=5433
$ psql
psql (17.1 dbi services build)
Type "help" for help.

postgres=# \q
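
As a quick sanity check (optional, but cheap) you can confirm that the three settings from postgresql.auto.conf are active:

$ psql -c "select name, setting from pg_settings where name in ('wal_init_zero','full_page_writes','port')"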

We’ll use the same data set as in the last post:

$ pgbench -p 5433 -i -s 1000 postgres
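
As the dataset is compressed with zstd, it can be interesting to check the compression ratio once the data is loaded (an optional check, not part of the benchmark itself):

$ zfs get compressratio,used pgpool/pgzfs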

Same test as before:

$ for i in {1..3}; do pgbench --client=4 --jobs=4 --time=900 --port=5433; done
pgbench (17.1 dbi services build)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 4
number of threads: 4
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 778622
number of failed transactions: 0 (0.000%)
latency average = 4.624 ms
initial connection time = 36.558 ms
tps = 865.137263 (without initial connection time)

pgbench (17.1 dbi services build)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 4
number of threads: 4
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 812209
number of failed transactions: 0 (0.000%)
latency average = 4.432 ms
initial connection time = 11.599 ms
tps = 902.449933 (without initial connection time)

pgbench (17.1 dbi services build)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 4
number of threads: 4
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 828141
number of failed transactions: 0 (0.000%)
latency average = 4.347 ms
initial connection time = 11.673 ms
tps = 920.148011 (without initial connection time)

Compared to the previous numbers, this is almost double the speed.

Same test with the simple-update run:

$ for i in {1..3}; do pgbench --client=4 --jobs=4 --time=900 --port=5433 --builtin=simple-update; done
pgbench (17.1 dbi services build)
starting vacuum...end.
transaction type: <builtin: simple update>
scaling factor: 1000
query mode: simple
number of clients: 4
number of threads: 4
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 900677
number of failed transactions: 0 (0.000%)
latency average = 3.997 ms
initial connection time = 14.278 ms
tps = 1000.689595 (without initial connection time)

pgbench (17.1 dbi services build)
starting vacuum...end.
transaction type: <builtin: simple update>
scaling factor: 1000
query mode: simple
number of clients: 4
number of threads: 4
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 928162
number of failed transactions: 0 (0.000%)
latency average = 3.879 ms
initial connection time = 10.493 ms
tps = 1031.289907 (without initial connection time)

pgbench (17.1 dbi services build)
starting vacuum...end.
transaction type: <builtin: simple update>
scaling factor: 1000
query mode: simple
number of clients: 4
number of threads: 4
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 939785
number of failed transactions: 0 (0.000%)
latency average = 3.831 ms
initial connection time = 16.170 ms
tps = 1044.205709 (without initial connection time)

Same picture here, considerably faster. This means that aligning the PostgreSQL block size and the ZFS record size indeed matters a lot: with the default 128kB record size, every 8kB page PostgreSQL writes can force ZFS to read, modify and write back a much larger record.
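
If you want to observe this yourself, recent OpenZFS versions can print request size histograms for the pool while a benchmark is running (a suggestion only, this was not captured for these runs):

$ # request size histograms, refreshed every 5 seconds
$ zpool iostat -r pgpool 5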

What does it look like if we increase the PostgreSQL block size as well and align the ZFS record size to it? The next test uses a 32kB block size in PostgreSQL and a 32kB record size in ZFS. Keep in mind that the PostgreSQL block size cannot be changed with initdb or at runtime; it has to be set when PostgreSQL is compiled.
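
Such a build could look roughly like this (the source directory and installation prefix are only placeholders, not the build that was actually used here):

$ cd /path/to/postgresql-17.1-source
$ ./configure --with-blocksize=32 --prefix=/usr/local/pgsql-32k
$ make && make install

With a 32kB build in place, the rest of the setup follows the same steps as before: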

$ sudo zfs set recordsize=32K pgpool/pgzfs
$ zfs get recordsize /u02/pgdata/zfs
NAME          PROPERTY    VALUE    SOURCE
pgpool/pgzfs  recordsize  32K      local
$ initdb -D /u02/pgdata/zfs/pg/
$ echo "wal_init_zero='off'" >> /u02/pgdata/zfs/pg/postgresql.auto.conf
$ echo "full_page_writes='off'" >> /u02/pgdata/zfs/pg/postgresql.auto.conf
$ echo "port=5433" >> /u02/pgdata/zfs/pg/postgresql.auto.conf
$ pg_ctl --pgdata=/u02/pgdata/zfs/pg/ --log=/u02/pgdata/zfs/pg/log.log start
$ psql -c "show block_size"
 block_size 
------------
 32768
$ pgbench -p 5433 -i -s 1000 postgres
$ psql -c "\l+"
                                                                                  List of databases
   Name    |  Owner   | Encoding | Locale Provider | Collate |  Ctype  | Locale | ICU Rules |   Access privileges   | Size  | Tablespace |                Description                 
-----------+----------+----------+-----------------+---------+---------+--------+-----------+-----------------------+-------+------------+--------------------------------------------
 postgres  | postgres | UTF8     | libc            | C.UTF-8 | C.UTF-8 |        |           |                       | 14 GB | pg_default | default administrative connection database
 template0 | postgres | UTF8     | libc            | C.UTF-8 | C.UTF-8 |        |           | =c/postgres          +| 16 MB | pg_default | unmodifiable empty database
           |          |          |                 |         |         |        |           | postgres=CTc/postgres |       |            | 
 template1 | postgres | UTF8     | libc            | C.UTF-8 | C.UTF-8 |        |           | =c/postgres          +| 16 MB | pg_default | default template for new databases
           |          |          |                 |         |         |        |           | postgres=CTc/postgres |       |            | 
(3 rows)

Same standard test again (summaries only):

$ for i in {1..3}; do pgbench --client=4 --jobs=4 --time=900 --port=5433; done

# run 1
number of transactions actually processed: 964541
tps = 1071.697122 (without initial connection time)

# run 2
number of transactions actually processed: 1045888
tps = 1162.073678 (without initial connection time)

# run 3
number of transactions actually processed: 1063922
tps = 1182.126126 (without initial connection time)

Even better than with the aligned 8kB block and record size. Same test with the simple-update run:

$ for i in {1..3}; do pgbench --client=4 --jobs=4 --time=900 --port=5433 --builtin=simple-update; done

# run 1
number of transactions actually processed: 1233855
tps = 1370.926077 (without initial connection time)

# run 2
number of transactions actually processed: 1242710
tps = 1380.790526 (without initial connection time)

# run 3
number of transactions actually processed: 1251885
tps = 1390.963532 (without initial connection time)

Same picture: using a 32kB block and record size gives better results.

To summarize: if you want to go for ZFS with PostgreSQL, it indeed matters that the PostgreSQL block size and the ZFS record size are aligned. I am still not sure if I would go for a 32kB block size in PostgreSQL; this requires more testing. We haven’t even looked at other things to adapt (shared_buffers, checkpointing, …), but I think that ZFS today is a valid option for running PostgreSQL workloads. But, again: please test for yourself, using your own workload, before making any decisions.
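
If you want to get closer to your own workload, pgbench can also replay custom scripts instead of the built-in ones; a minimal sketch (the file name and the query are only examples):

$ cat > my_workload.sql <<'EOF'
\set aid random(1, 100000000)
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
EOF
$ pgbench --client=4 --jobs=4 --time=900 --port=5433 --file=my_workload.sql postgres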