In the last post we set the baseline for comparing ext4 to ZFS when it comes to PostgreSQL. We ended up with two PostgreSQL clusters, one running on an ext4 file system and the other on top of ZFS. In this post we’ll look at how they compare in terms of performance and the size of the data on disk.
Warning: Please run your own tests with your specific workloads before taking anything you read here for granted. What follows are very basic tests, and the results might look totally different in your environment.
Before we start with any tests, let’s have a look at the size of the data directories of both clusters. On the ext4 file system the size of the empty cluster is 39MB, which is expected:
postgres@ip-10-0-1-159:/home/postgres/ [pg17] du -sh /u02/pgdata/ext4/pg/
39M /u02/pgdata/ext4/pg/
On the ZFS file system the size of the empty cluster is significantly smaller, only 4.7MB:
postgres@ip-10-0-1-159:/home/postgres/ [pg17] du -sh /u02/pgdata/zfs/pg
4.7M /u02/pgdata/zfs/pg
The reason is ZFS compression (which we’ve enabled in the last post):
postgres@ip-10-0-1-159:/home/postgres/ [pg17] zfs get compression pgpool/pgzfs
NAME PROPERTY VALUE SOURCE
pgpool/pgzfs compression zstd local
postgres@ip-10-0-1-159:/home/postgres/ [pg17] zfs get compressratio pgpool/pgzfs
NAME PROPERTY VALUE SOURCE
pgpool/pgzfs compressratio 6.76x -
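As a rough cross-check, here is a quick sketch using only the numbers reported above. Note that du and the zfs compressratio property measure different things (allocated blocks versus logical data written to the dataset), so the two factors are not expected to match exactly:

```python
# Sizes of the empty clusters as reported by du above, in MB
ext4_mb = 39.0
zfs_mb = 4.7

# Naive ratio between the two data directories
print(f"du ratio: {ext4_mb / zfs_mb:.2f}x")

# ZFS reports compressratio = 6.76x: the logical data written to the
# dataset would occupy roughly 6.76 times the allocated space
print(f"implied logical size: {zfs_mb * 6.76:.1f} MB")
```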
This is quite nice; let’s see how that looks when we load some data with pgbench. For ext4:
postgres@ip-10-0-1-159:/home/postgres/ [pg17] pgbench -p 5432 -i -s 1000 postgres
dropping old tables...
NOTICE: table "pgbench_accounts" does not exist, skipping
NOTICE: table "pgbench_branches" does not exist, skipping
NOTICE: table "pgbench_history" does not exist, skipping
NOTICE: table "pgbench_tellers" does not exist, skipping
creating tables...
generating data (client-side)...
vacuuming...
creating primary keys...
done in 382.58 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 200.98 s, vacuum 7.11 s, primary keys 174.48 s).
postgres@ip-10-0-1-159:/home/postgres/ [pg17] du -sh /u02/pgdata/ext4/pg/
16G /u02/pgdata/ext4/pg/
For ZFS:
postgres@ip-10-0-1-159:/home/postgres/ [pg17] pgbench -p 5433 -i -s 1000 postgres
dropping old tables...
NOTICE: table "pgbench_accounts" does not exist, skipping
NOTICE: table "pgbench_branches" does not exist, skipping
NOTICE: table "pgbench_history" does not exist, skipping
NOTICE: table "pgbench_tellers" does not exist, skipping
creating tables...
generating data (client-side)...
vacuuming...
creating primary keys...
done in 267.87 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 163.79 s, vacuum 3.14 s, primary keys 100.93 s).
postgres@ip-10-0-1-159:/home/postgres/ [pg17] du -sh /u02/pgdata/zfs/pg/
914M /u02/pgdata/zfs/pg/
That’s 16GB on the ext4 file system compared to 914MB on the ZFS file system, which is quite impressive. The load time was also lower: 382.58 s on ext4 versus 267.87 s on ZFS.
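Pulling those numbers together, a small sketch (using only the values shown above):

```python
# On-disk size after pgbench -i -s 1000, as reported by du above
ext4_size_mb = 16 * 1024   # 16G
zfs_size_mb = 914          # 914M
print(f"size factor: {ext4_size_mb / zfs_size_mb:.1f}x")

# Total load time reported by pgbench
ext4_load_s = 382.58
zfs_load_s = 267.87
print(f"load speedup on ZFS: {ext4_load_s / zfs_load_s:.2f}x")
```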
Writing is one thing, but we’re also interested in reading, so let’s run some simple pgbench tests, three times for 15 minutes each. For ext4:
postgres@ip-10-0-1-159:/home/postgres/ [pg17] for i in {1..3}; do pgbench --client=4 --jobs=4 --time=900 --port=5432; done
pgbench (17.0 dbi services build)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 4
number of threads: 4
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 714533
number of failed transactions: 0 (0.000%)
latency average = 5.038 ms
initial connection time = 24.493 ms
tps = 793.932864 (without initial connection time)
pgbench (17.0 dbi services build)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 4
number of threads: 4
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 763618
number of failed transactions: 0 (0.000%)
latency average = 4.714 ms
initial connection time = 9.015 ms
tps = 848.455044 (without initial connection time)
pgbench (17.0 dbi services build)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 4
number of threads: 4
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 793100
number of failed transactions: 0 (0.000%)
latency average = 4.539 ms
initial connection time = 11.945 ms
tps = 881.218923 (without initial connection time)
For ZFS, same settings:
for i in {1..3}; do pgbench --client=4 --jobs=4 --time=900 --port=5433; done
pgbench (17.0 dbi services build)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 4
number of threads: 4
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 441902
number of failed transactions: 0 (0.000%)
latency average = 8.147 ms
initial connection time = 13.602 ms
tps = 491.000544 (without initial connection time)
pgbench (17.0 dbi services build)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 4
number of threads: 4
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 469018
number of failed transactions: 0 (0.000%)
latency average = 7.676 ms
initial connection time = 20.941 ms
tps = 521.136559 (without initial connection time)
pgbench (17.0 dbi services build)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 4
number of threads: 4
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 486532
number of failed transactions: 0 (0.000%)
latency average = 7.399 ms
initial connection time = 15.169 ms
tps = 540.595008 (without initial connection time)
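To make the three runs per file system easier to compare, here is a quick sketch averaging the TPS values reported above:

```python
# tps values (without initial connection time) from the runs above
ext4_tps = [793.932864, 848.455044, 881.218923]
zfs_tps = [491.000544, 521.136559, 540.595008]

def avg(values):
    return sum(values) / len(values)

print(f"ext4 avg: {avg(ext4_tps):.1f} tps")
print(f"zfs avg:  {avg(zfs_tps):.1f} tps")
print(f"factor:   {avg(ext4_tps) / avg(zfs_tps):.2f}x")
```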
This means that, for this type of workload, the instance running on ZFS is considerably slower than the one running on ext4. Let’s do the same test with the “simple-update” builtin script.
ext4 (summary only):
# run 1
number of transactions actually processed: 757846
tps = 842.051961 (without initial connection time)
# run 2
number of transactions actually processed: 770246
tps = 855.823802 (without initial connection time)
# run 3
number of transactions actually processed: 786752
tps = 874.172287 (without initial connection time)
ZFS (summary only):
# run 1
number of transactions actually processed: 553939
tps = 615.494547 (without initial connection time)
# run 2
number of transactions actually processed: 558915
tps = 621.035480 (without initial connection time)
# run 3
number of transactions actually processed: 568043
tps = 631.161413 (without initial connection time)
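Averaging the simple-update runs the same way (numbers taken from the summaries above):

```python
# tps values (without initial connection time) from the summaries above
ext4_su_tps = [842.051961, 855.823802, 874.172287]
zfs_su_tps = [615.494547, 621.035480, 631.161413]

avg = lambda xs: sum(xs) / len(xs)
print(f"ext4 avg: {avg(ext4_su_tps):.1f} tps")
print(f"zfs avg:  {avg(zfs_su_tps):.1f} tps")
```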
Same picture as above: the ext4 file system is faster than ZFS.
Here are the same tests on FreeBSD 14.1 (shown as one big block to demonstrate that I used the same settings and data set), on the same instance type and storage:
[postgres@freebsd ~]$ uname -a
FreeBSD freebsd 14.1-RELEASE-p5 FreeBSD 14.1-RELEASE-p5 GENERIC amd64
[postgres@freebsd ~]$ sudo geom disk list
Geom name: nda0
Providers:
1. Name: nda0
Mediasize: 10737418240 (10G)
Sectorsize: 512
Stripesize: 4096
Stripeoffset: 0
Mode: r2w2e6
descr: Amazon Elastic Block Store
ident: vol0acdef620c8b7d665
rotationrate: 0
fwsectors: 0
fwheads: 0
Geom name: nda1
Providers:
1. Name: nda1
Mediasize: 21474836480 (20G)
Sectorsize: 512
Stripesize: 4096
Stripeoffset: 0
Mode: r0w0e0
descr: Amazon Elastic Block Store
ident: vol0bfdba7a5ac577099
rotationrate: 0
fwsectors: 0
fwheads: 0
[postgres@freebsd ~]$ sudo zpool create pgpool /dev/nda1
[postgres@freebsd ~]$ sudo zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
pgpool 19.5G 360K 19.5G - - 0% 0% 1.00x ONLINE -
[postgres@freebsd ~]$ sudo zfs create pgpool/pgzfs
[postgres@freebsd ~]$ sudo zfs set mountpoint=/u02/pgdata/zfs pgpool/pgzfs
[postgres@freebsd ~]$ sudo zfs set compression=zstd pgpool/pgzfs
[postgres@freebsd ~]$ df -h | grep zfs
df: minimum blocksize is 512
pgpool/pgzfs 19G 96K 19G 0% /u02/pgdata/zfs
[postgres@freebsd ~]$ sudo chown postgres:postgres /u02/pgdata/zfs
[postgres@freebsd ~]$ /u01/app/postgres/product/17/db_0/bin/initdb --pgdata=/u02/pgdata/zfs/pg
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale "C.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
creating directory /u02/pgdata/zfs/pg ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default "max_connections" ... 100
selecting default "shared_buffers" ... 128MB
selecting default time zone ... UTC
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok
initdb: warning: enabling "trust" authentication for local connections
initdb: hint: You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb.
Success. You can now start the database server using:
/u01/app/postgres/product/17/db_0/bin/pg_ctl -D /u02/pgdata/zfs/pg -l logfile start
[postgres@freebsd ~]$ echo "wal_init_zero='off'" >> /u02/pgdata/zfs/pg/postgresql.auto.conf
[postgres@freebsd ~]$ echo "full_page_writes='off'" >> /u02/pgdata/zfs/pg/postgresql.auto.conf
[postgres@freebsd ~]$ echo "port=5433" >> /u02/pgdata/zfs/pg/postgresql.auto.conf
[postgres@freebsd ~]$ /u01/app/postgres/product/17/db_0/bin/pg_ctl --pgdata=/u02/pgdata/zfs/pg/ start
waiting for server to start....2024-11-13 07:47:54.324 UTC [26000] LOG: starting PostgreSQL 17.0 dbi services build on x86_64-unknown-freebsd14.1, compiled by FreeBSD clang version 18.1.5 (https://github.com/llvm/llvm-project.git llvmorg-18.1.5-0-g617a15a9eac9), 64-bit
2024-11-13 07:47:54.324 UTC [26000] LOG: listening on IPv6 address "::1", port 5433
2024-11-13 07:47:54.324 UTC [26000] LOG: listening on IPv4 address "127.0.0.1", port 5433
2024-11-13 07:47:54.328 UTC [26000] LOG: listening on Unix socket "/tmp/.s.PGSQL.5433"
2024-11-13 07:47:54.334 UTC [26003] LOG: database system was shut down at 2024-11-13 07:46:02 UTC
2024-11-13 07:47:54.347 UTC [26000] LOG: database system is ready to accept connections
done
server started
[postgres@freebsd ~]$ /u01/app/postgres/product/17/db_0/bin/pgbench -p 5433 -i -s 1000 postgres
dropping old tables...
NOTICE: table "pgbench_accounts" does not exist, skipping
NOTICE: table "pgbench_branches" does not exist, skipping
NOTICE: table "pgbench_history" does not exist, skipping
NOTICE: table "pgbench_tellers" does not exist, skipping
creating tables...
generating data (client-side)...
vacuuming...
creating primary keys...
done in 363.67 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 208.49 s, vacuum 15.66 s, primary keys 139.50 s).
Same test as above, three times in a row (summaries only):
[postgres@freebsd ~]$ for i in {1..3}; do /u01/app/postgres/product/17/db_0/bin/pgbench --client=4 --jobs=4 --time=900 --port=5433; done
# run 1
number of transactions actually processed: 424597
tps = 471.765244 (without initial connection time)
# run 2
number of transactions actually processed: 467988
tps = 519.976884 (without initial connection time)
# run 3
number of transactions actually processed: 506474
tps = 562.733462 (without initial connection time)
## simple-update
# run 1
number of transactions actually processed: 580056
tps = 644.504340 (without initial connection time)
# run 2
number of transactions actually processed: 593871
tps = 659.866618 (without initial connection time)
# run 3
number of transactions actually processed: 620252
tps = 689.163219 (without initial connection time)
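As a quick sanity check, a small sketch averaging the FreeBSD numbers and putting them next to the Linux ZFS TPC-B averages (all values taken from the outputs in this post):

```python
# tps values from the FreeBSD runs above
freebsd_tpcb = [471.765244, 519.976884, 562.733462]
freebsd_su = [644.504340, 659.866618, 689.163219]
# tps values from the Linux ZFS TPC-B runs earlier in this post
linux_tpcb = [491.000544, 521.136559, 540.595008]

avg = lambda xs: sum(xs) / len(xs)
print(f"FreeBSD ZFS TPC-B avg:        {avg(freebsd_tpcb):.1f} tps")
print(f"Linux ZFS TPC-B avg:          {avg(linux_tpcb):.1f} tps")
print(f"FreeBSD ZFS simple-update avg: {avg(freebsd_su):.1f} tps")
```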
In summary: writing is faster on ZFS, but reads (and mixed workloads) are not as fast as on ext4, both on Linux and on FreeBSD. Should you go for ZFS? I cannot tell you; that depends on your workload and your requirements. If space on disk is a major concern, it might very well be an option. If snapshots and replication at the file system level are important, ZFS might be what you need. If your workload is mostly reads, it might not be the best option for you. There is no way around testing it yourself in your environment.
If someone thinks I’ve missed something in the ZFS setup, please let me know and I will be happy to re-test.
Yerry
20.11.2024
Hi,
Comparing file systems with benchmark tools like fio, pgbench, … is very difficult (comparing apples and oranges).
ext4 and xfs are older file systems when compared with ZFS. ZFS does a lot more and is more sophisticated and advanced. If you look only at performance, then those older file systems are faster. But do you really need that in all situations?
Testing Postgres on ZFS, you must tune some parameters, like the record size. A very good explanation was written by Klara:
https://klarasystems.com/articles/5-reasons-why-your-zfs-storage-benchmarks-are-wrong/
Another example is checksums in ZFS. Every time ZFS reads data, it checks for corruption and will repair it automatically. I do not think that ext4 will do that? But those actions take IOPS. Maybe disable it when testing.
Also worth reading: https://bun.uptrace.dev/postgres/tuning-zfs-aws-ebs.html
With newer types of file storage (NVMe, …) ZFS has some performance problems. The main reason is that ZFS is optimized for spinning disks. But the community is working on it.
Regards,
Yerry
Daniel Westermann
20.11.2024
Hi Yerry,
> ext4 and xfs are older file systems when compared with ZFS. ZFS does a lot more and is more sophisticated and advanced. If you look only at performance, those older file systems are faster. But do you really need that in all situations?
That's why I wrote: please test for yourself.
> Testing postgres and zfs, you must tune some parameters, like the record size. A very good explanation is written by Klara
> https://klarasystems.com/articles/5-reasons-why-your-zfs-storage-benchmarks-are-wrong/
Thanks for the hint.
> Another example is checksums in ZFS. Every time ZFS reads data, it checks for corruption and will repair it automatically. I do not think that ext4 will do that? But those actions take IOPS. Maybe disable it when testing.
I've disabled checksums in PostgreSQL for that reason.
> With newer types of file storage (nvme’s, …) zfs has some performance problems. The main reason is that zfs is optimized for spinning disks. But the community is working on it.
Thank you.
Cheers,
Daniel