In the last post we set the baseline for comparing ext4 to ZFS when it comes to PostgreSQL. We ended up with two PostgreSQL clusters, one running on an ext4 file system and the other running on top of ZFS. In this post we'll look at how they compare when it comes to performance and the size of the data on disk.

Warning: Please test with your own specific workloads before taking anything you read here for granted. What follows are very basic tests, and the results might look totally different in your environment.

Before we start with any tests, let's have a look at the size of the data directories of both clusters. On the ext4 file system the size of the empty cluster is 39MB, which is expected:

postgres@ip-10-0-1-159:/home/postgres/ [pg17] du -sh /u02/pgdata/ext4/pg/
39M     /u02/pgdata/ext4/pg/

On the ZFS file system the size of the empty cluster is significantly smaller, only 4.7MB:

postgres@ip-10-0-1-159:/home/postgres/ [pg17] du -sh /u02/pgdata/zfs/pg
4.7M    /u02/pgdata/zfs/pg

The reason is ZFS compression (which we enabled in the last post):

postgres@ip-10-0-1-159:/home/postgres/ [pg17] zfs get compression pgpool/pgzfs
NAME          PROPERTY     VALUE           SOURCE
pgpool/pgzfs  compression  zstd            local
postgres@ip-10-0-1-159:/home/postgres/ [pg17] zfs get compressratio pgpool/pgzfs
NAME          PROPERTY       VALUE  SOURCE
pgpool/pgzfs  compressratio  6.76x  -
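
If you want to see the numbers this ratio is derived from, ZFS can also report the logical (uncompressed) size next to the physically used space. This was not captured in the session above, but it would look something like this:

# logicalused is the amount of data before compression, used is what actually
# landed on disk; compressratio is (roughly) logicalused divided by used
zfs list -o name,used,logicalused,compressratio pgpool/pgzfs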

This is quite nice. Let's see how it looks when we load some data with pgbench (a scale factor of 1000 means 100 million rows in pgbench_accounts). For ext4:

postgres@ip-10-0-1-159:/home/postgres/ [pg17] pgbench -p 5432 -i -s 1000 postgres
dropping old tables...
NOTICE:  table "pgbench_accounts" does not exist, skipping
NOTICE:  table "pgbench_branches" does not exist, skipping
NOTICE:  table "pgbench_history" does not exist, skipping
NOTICE:  table "pgbench_tellers" does not exist, skipping
creating tables...
generating data (client-side)...
vacuuming...                                                                                      
creating primary keys...
done in 382.58 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 200.98 s, vacuum 7.11 s, primary keys 174.48 s).
postgres@ip-10-0-1-159:/home/postgres/ [pg17] du -sh /u02/pgdata/ext4/pg/
16G     /u02/pgdata/ext4/pg/

For ZFS:

postgres@ip-10-0-1-159:/home/postgres/ [pg17] pgbench -p 5433 -i -s 1000 postgres
dropping old tables...
NOTICE:  table "pgbench_accounts" does not exist, skipping
NOTICE:  table "pgbench_branches" does not exist, skipping
NOTICE:  table "pgbench_history" does not exist, skipping
NOTICE:  table "pgbench_tellers" does not exist, skipping
creating tables...
generating data (client-side)...
vacuuming...                                                                                      
creating primary keys...
done in 267.87 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 163.79 s, vacuum 3.14 s, primary keys 100.93 s).
postgres@ip-10-0-1-159:/home/postgres/ [pg17] du -sh /u02/pgdata/zfs/pg/
914M    /u02/pgdata/zfs/pg/

This is 16GB on the ext4 file system compared to 914MB on ZFS, which is quite impressive. Since du on ZFS reports the physical (compressed) space, that corresponds to an apparent reduction of roughly 16GB / 914MB ≈ 18x for this data set. The runtime was also lower: 382.58 s for ext4 versus 267.87 s for ZFS, most likely because compression reduces the amount of data that has to be written to disk.

Writing is one thing, but we're also interested in reading, so let's run some simple pgbench tests: three runs of 15 minutes each. For ext4:

postgres@ip-10-0-1-159:/home/postgres/ [pg17] for i in {1..3}; do pgbench --client=4 --jobs=4 --time=900 --port=5432; done
pgbench (17.0 dbi services build)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 4
number of threads: 4
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 714533
number of failed transactions: 0 (0.000%)
latency average = 5.038 ms
initial connection time = 24.493 ms
tps = 793.932864 (without initial connection time)

pgbench (17.0 dbi services build)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 4
number of threads: 4
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 763618
number of failed transactions: 0 (0.000%)
latency average = 4.714 ms
initial connection time = 9.015 ms
tps = 848.455044 (without initial connection time)

pgbench (17.0 dbi services build)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 4
number of threads: 4
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 793100
number of failed transactions: 0 (0.000%)
latency average = 4.539 ms
initial connection time = 11.945 ms
tps = 881.218923 (without initial connection time)

For ZFS, same settings:

for i in {1..3}; do pgbench --client=4 --jobs=4 --time=900 --port=5433; done
pgbench (17.0 dbi services build)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 4
number of threads: 4
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 441902
number of failed transactions: 0 (0.000%)
latency average = 8.147 ms
initial connection time = 13.602 ms
tps = 491.000544 (without initial connection time)

pgbench (17.0 dbi services build)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 4
number of threads: 4
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 469018
number of failed transactions: 0 (0.000%)
latency average = 7.676 ms
initial connection time = 20.941 ms
tps = 521.136559 (without initial connection time)

pgbench (17.0 dbi services build)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 4
number of threads: 4
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 486532
number of failed transactions: 0 (0.000%)
latency average = 7.399 ms
initial connection time = 15.169 ms
tps = 540.595008 (without initial connection time)
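
To put these numbers into context: the builtin "TPC-B (sort of)" transaction is actually quite write-heavy. According to the pgbench documentation it consists of three UPDATEs, one SELECT and one INSERT per transaction, roughly like this:

\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;

So every transaction produces small random page writes plus a WAL flush at commit, a pattern where the overhead of ZFS's copy-on-write design may well show up.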

This means that for this type of workload, the instance running on ZFS is considerably slower than the instance running on ext4. Let's do the same test with the "simple-update" builtin script, which skips the updates on pgbench_tellers and pgbench_branches.
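
The exact invocation isn't shown below; presumably it only differs from the previous runs in the builtin script selection, e.g.:

for i in {1..3}; do pgbench --builtin=simple-update --client=4 --jobs=4 --time=900 --port=5432; done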

ext4 (summary only):

# run 1
number of transactions actually processed: 757846
tps = 842.051961 (without initial connection time)

# run 2
number of transactions actually processed: 770246
tps = 855.823802 (without initial connection time)

# run 3
number of transactions actually processed: 786752
tps = 874.172287 (without initial connection time)

ZFS (summary only):

# run 1
number of transactions actually processed: 553939
tps = 615.494547 (without initial connection time)

# run 2
number of transactions actually processed: 558915
tps = 621.035480 (without initial connection time)

# run 3
number of transactions actually processed: 568043
tps = 631.161413 (without initial connection time)

Same picture as above: the ext4 file system is faster than the ZFS one.

The same tests on FreeBSD 14.1 (shown as one big block to make clear that I used the same settings and data set), on the same instance type and storage:

[postgres@freebsd ~]$ uname -a
FreeBSD freebsd 14.1-RELEASE-p5 FreeBSD 14.1-RELEASE-p5 GENERIC amd64
[postgres@freebsd ~]$ sudo geom disk list
Geom name: nda0
Providers:
1. Name: nda0
   Mediasize: 10737418240 (10G)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r2w2e6
   descr: Amazon Elastic Block Store
   ident: vol0acdef620c8b7d665
   rotationrate: 0
   fwsectors: 0
   fwheads: 0

Geom name: nda1
Providers:
1. Name: nda1
   Mediasize: 21474836480 (20G)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r0w0e0
   descr: Amazon Elastic Block Store
   ident: vol0bfdba7a5ac577099
   rotationrate: 0
   fwsectors: 0
   fwheads: 0

[postgres@freebsd ~]$ sudo zpool create pgpool /dev/nda1
[postgres@freebsd ~]$ sudo zpool list
NAME     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
pgpool  19.5G   360K  19.5G        -         -     0%     0%  1.00x    ONLINE  -
[postgres@freebsd ~]$ sudo zfs create pgpool/pgzfs
[postgres@freebsd ~]$ sudo zfs set mountpoint=/u02/pgdata/zfs pgpool/pgzfs
[postgres@freebsd ~]$ sudo zfs set compression=zstd pgpool/pgzfs
[postgres@freebsd ~]$ df -h | grep zfs
df: minimum blocksize is 512
pgpool/pgzfs        19G     96K     19G     0%    /u02/pgdata/zfs
[postgres@freebsd ~]$ sudo chown postgres:postgres /u02/pgdata/zfs
[postgres@freebsd ~]$ /u01/app/postgres/product/17/db_0/bin/initdb --pgdata=/u02/pgdata/zfs/pg
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "C.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

creating directory /u02/pgdata/zfs/pg ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default "max_connections" ... 100
selecting default "shared_buffers" ... 128MB
selecting default time zone ... UTC
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

initdb: warning: enabling "trust" authentication for local connections
initdb: hint: You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

    /u01/app/postgres/product/17/db_0/bin/pg_ctl -D /u02/pgdata/zfs/pg -l logfile start

[postgres@freebsd ~]$ echo "wal_init_zero='off'" >> /u02/pgdata/zfs/pg/postgresql.auto.conf
[postgres@freebsd ~]$ echo "full_page_writes='off'" >> /u02/pgdata/zfs/pg/postgresql.auto.conf
[postgres@freebsd ~]$ echo "port=5433" >> /u02/pgdata/zfs/pg/postgresql.auto.conf
[postgres@freebsd ~]$ /u01/app/postgres/product/17/db_0/bin/pg_ctl --pgdata=/u02/pgdata/zfs/pg/  start
waiting for server to start....2024-11-13 07:47:54.324 UTC [26000] LOG:  starting PostgreSQL 17.0 dbi services build on x86_64-unknown-freebsd14.1, compiled by FreeBSD clang version 18.1.5 (https://github.com/llvm/llvm-project.git llvmorg-18.1.5-0-g617a15a9eac9), 64-bit
2024-11-13 07:47:54.324 UTC [26000] LOG:  listening on IPv6 address "::1", port 5433
2024-11-13 07:47:54.324 UTC [26000] LOG:  listening on IPv4 address "127.0.0.1", port 5433
2024-11-13 07:47:54.328 UTC [26000] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5433"
2024-11-13 07:47:54.334 UTC [26003] LOG:  database system was shut down at 2024-11-13 07:46:02 UTC
2024-11-13 07:47:54.347 UTC [26000] LOG:  database system is ready to accept connections
 done
server started
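
wal_init_zero and full_page_writes are the same ZFS-specific settings as on Linux: because ZFS is copy-on-write, zero-filling new WAL segments does not reserve contiguous space, and since blocks are never partially overwritten in place, torn pages cannot happen, so full page writes are unnecessary. Once the server is up, the settings can be verified, for example, like this:

/u01/app/postgres/product/17/db_0/bin/psql --port=5433 -c "SELECT name, setting FROM pg_settings WHERE name IN ('wal_init_zero','full_page_writes');" postgres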

[postgres@freebsd ~]$ /u01/app/postgres/product/17/db_0/bin/pgbench -p 5433 -i -s 1000 postgres
dropping old tables...
NOTICE:  table "pgbench_accounts" does not exist, skipping
NOTICE:  table "pgbench_branches" does not exist, skipping
NOTICE:  table "pgbench_history" does not exist, skipping
NOTICE:  table "pgbench_tellers" does not exist, skipping
creating tables...
generating data (client-side)...
vacuuming...                                                                                      
creating primary keys...
done in 363.67 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 208.49 s, vacuum 15.66 s, primary keys 139.50 s).

Same test as above, three times in a row (summaries only):

[postgres@freebsd ~]$ for i in {1..3}; do /u01/app/postgres/product/17/db_0/bin/pgbench --client=4 --jobs=4 --time=900 --port=5433; done

# run 1
number of transactions actually processed: 424597
tps = 471.765244 (without initial connection time)

# run 2
number of transactions actually processed: 467988
tps = 519.976884 (without initial connection time)

# run 3
number of transactions actually processed: 506474
tps = 562.733462 (without initial connection time)

## simple-update
# run 1
number of transactions actually processed: 580056
tps = 644.504340 (without initial connection time)

# run 2
number of transactions actually processed: 593871
tps = 659.866618 (without initial connection time)

# run 3
number of transactions actually processed: 620252
tps = 689.163219 (without initial connection time)

In summary: writing (the initial data load) was faster on ZFS, but reading (or mixed workloads) is not as fast as on ext4, both on Linux and on FreeBSD. Should you go for ZFS? I cannot tell you; this depends on your workload and your requirements. If space on disk is a major concern, it might very well be an option. If snapshots and replication on the file system level are important, ZFS might be what you need. If your workload is mostly about reads, it might not be the best option for you. There is no way around testing it yourself in your environment.
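
To illustrate the snapshot and replication point (just a sketch, not something tested above; the "standby" host and "backuppool" pool are placeholders): since the whole data directory, including the WAL, lives on a single dataset here, an atomic ZFS snapshot is crash-consistent and can be shipped elsewhere:

# take an atomic, crash-consistent snapshot of the dataset
zfs snapshot pgpool/pgzfs@demo
# replicate it to another pool on another host
zfs send pgpool/pgzfs@demo | ssh standby zfs receive backuppool/pgzfs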

If someone thinks I’ve missed something in the ZFS setup, please let me know and I will be happy to re-test.