In the last post about PostgreSQL on ZFS the pgbench tests showed that ZFS was slower for the standard tests than ext4 on Debian. A comment on that post made me wonder whether changing the ZFS record size would change the results, so this blog post is about exactly that. There is plenty of information out there when it comes to ZFS for PostgreSQL, and most of it recommends aligning the ZFS record size and the PostgreSQL block size in one way or another (e.g. see here). Reading that, there are basically three options: change the ZFS record size, change the PostgreSQL block size, or change both of them.
As I don’t like changing the PostgreSQL block size (everybody, as far as I know, runs with the default 8kB), the first test runs against a ZFS dataset with the record size aligned to the PostgreSQL block size:
[postgres@freebsd ~]$ geom disk list
Geom name: nda0
Providers:
1. Name: nda0
   Mediasize: 10737418240 (10G)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r2w2e6
   descr: Amazon Elastic Block Store
   ident: vol0a89ebfc6cc9173ff
   rotationrate: 0
   fwsectors: 0
   fwheads: 0

Geom name: nda1
Providers:
1. Name: nda1
   Mediasize: 21474836480 (20G)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r0w0e0
   descr: Amazon Elastic Block Store
   ident: vol0a6dcd2a6a1a1b898
   rotationrate: 0
   fwsectors: 0
   fwheads: 0

[postgres@freebsd ~]$ sudo zpool create pgpool /dev/nda1
[postgres@freebsd ~]$ sudo zpool list
NAME     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
pgpool  19.5G   360K  19.5G        -         -     0%     0%  1.00x    ONLINE  -
[postgres@freebsd ~]$ sudo zfs create pgpool/pgzfs
[postgres@freebsd ~]$ sudo zfs set mountpoint=/u02/pgdata/zfs pgpool/pgzfs
[postgres@freebsd ~]$ sudo zfs set compression=zstd pgpool/pgzfs
[postgres@freebsd ~]$ sudo zfs set recordsize=8K pgpool/pgzfs
[postgres@freebsd ~]$ df -h | grep zfs
df: minimum blocksize is 512
pgpool/pgzfs    19G     96K     19G     0%    /u02/pgdata/zfs
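Before initializing the cluster it makes sense to double check that the three properties really are active on the dataset. A quick sanity check (just one way of doing it) could be:

$ zfs get recordsize,compression,mountpoint pgpool/pgzfs

This should report 8K for recordsize, zstd for compression, /u02/pgdata/zfs as the mountpoint and "local" as the source for all three.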
For PostgreSQL, exactly the same setup as before (no checksums, and the same three parameter changes):
$ initdb --pgdata=/u02/pgdata/zfs/pg
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "C.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

creating directory /u02/pgdata/zfs/pg ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default "max_connections" ... 100
selecting default "shared_buffers" ... 128MB
selecting default time zone ... UTC
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

initdb: warning: enabling "trust" authentication for local connections
initdb: hint: You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

    pg_ctl -D /u02/pgdata/zfs/pg -l logfile start

$ echo "wal_init_zero='off'" >> /u02/pgdata/zfs/pg/postgresql.auto.conf
$ echo "full_page_writes='off'" >> /u02/pgdata/zfs/pg/postgresql.auto.conf
$ echo "port=5433" >> /u02/pgdata/zfs/pg/postgresql.auto.conf
$ pg_ctl --pgdata=/u02/pgdata/zfs/pg/ --log=/u02/pgdata/zfs/pg/log.log start
waiting for server to start.... done
$ export PGPORT=5433
$ psql
psql (17.1 dbi services build)
Type "help" for help.

postgres=# \q
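As the parameters were only appended to postgresql.auto.conf, it is worth confirming that the running instance really picked them up. One way to do that is to ask pg_settings:

$ psql -p 5433 -c "select name, setting from pg_settings where name in ('wal_init_zero','full_page_writes','block_size')"

full_page_writes and wal_init_zero should both come back as "off", and block_size should be 8192 for this first test.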
We’ll use the same data set as in the last post:
$ pgbench -p 5433 -i -s 1000 postgres
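A scale factor of 1000 creates 100 million rows in pgbench_accounts, which is roughly 15GB of data with the default 8kB block size. Since compression=zstd is set on the dataset, it can be interesting to see how much of that actually ends up on disk. The logical and physical sizes can be compared on the ZFS side, and the database size on the PostgreSQL side (the exact numbers will of course differ on your system):

$ zfs get compressratio,logicalused,used pgpool/pgzfs
$ psql -p 5433 -c "select pg_size_pretty(pg_database_size('postgres'))"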
Same test as before:
$ for i in {1..3}; do pgbench --client=4 --jobs=4 --time=900 --port=5433; done
pgbench (17.1 dbi services build)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 4
number of threads: 4
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 778622
number of failed transactions: 0 (0.000%)
latency average = 4.624 ms
initial connection time = 36.558 ms
tps = 865.137263 (without initial connection time)
pgbench (17.1 dbi services build)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 4
number of threads: 4
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 812209
number of failed transactions: 0 (0.000%)
latency average = 4.432 ms
initial connection time = 11.599 ms
tps = 902.449933 (without initial connection time)
pgbench (17.1 dbi services build)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 4
number of threads: 4
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 828141
number of failed transactions: 0 (0.000%)
latency average = 4.347 ms
initial connection time = 11.673 ms
tps = 920.148011 (without initial connection time)
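Only the transaction counts and the tps really matter for the comparison, so instead of reading through the full output the three runs can also be reduced to an average on the fly, e.g. like this (just a convenience, the numbers in this post were taken from the full output):

$ for i in {1..3}; do pgbench --client=4 --jobs=4 --time=900 --port=5433; done | awk '/^tps/ {sum+=$3; n++} END {printf "average tps: %.2f\n", sum/n}'

For the three runs above this gives an average of roughly 896 tps.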
Comparing that to the previous numbers, this is almost double the speed.
Same test with the simple-update run:
$ for i in {1..3}; do pgbench --client=4 --jobs=4 --time=900 --port=5433 --builtin=simple-update; done
pgbench (17.1 dbi services build)
starting vacuum...end.
transaction type: <builtin: simple update>
scaling factor: 1000
query mode: simple
number of clients: 4
number of threads: 4
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 900677
number of failed transactions: 0 (0.000%)
latency average = 3.997 ms
initial connection time = 14.278 ms
tps = 1000.689595 (without initial connection time)
pgbench (17.1 dbi services build)
starting vacuum...end.
transaction type: <builtin: simple update>
scaling factor: 1000
query mode: simple
number of clients: 4
number of threads: 4
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 928162
number of failed transactions: 0 (0.000%)
latency average = 3.879 ms
initial connection time = 10.493 ms
tps = 1031.289907 (without initial connection time)
pgbench (17.1 dbi services build)
starting vacuum...end.
transaction type: <builtin: simple update>
scaling factor: 1000
query mode: simple
number of clients: 4
number of threads: 4
maximum number of tries: 1
duration: 900 s
number of transactions actually processed: 939785
number of failed transactions: 0 (0.000%)
latency average = 3.831 ms
initial connection time = 16.170 ms
tps = 1044.205709 (without initial connection time)
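If you want to see what the simple-update script actually does compared to the default tpcb-like script, pgbench can print the builtin scripts, so there is no need to guess what is being measured:

$ pgbench --show-script=simple-update
$ pgbench --show-script=tpcb-like

The main difference is that simple-update skips the updates on pgbench_branches and pgbench_tellers, which is why it shows higher throughput here.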
Same picture here, considerably faster. This means that aligning the PostgreSQL block size and the ZFS record size indeed matters a lot.
How does that look if we increase the PostgreSQL block size and align the ZFS record size accordingly? Same test with a 32kB block size in PostgreSQL and a 32kB record size in ZFS:
$ sudo zfs set recordsize=32K pgpool/pgzfs
$ zfs get recordsize /u02/pgdata/zfs
NAME          PROPERTY    VALUE  SOURCE
pgpool/pgzfs  recordsize  32K    local
$ initdb -D /u02/pgdata/zfs/pg/
$ echo "wal_init_zero='off'" >> /u02/pgdata/zfs/pg/postgresql.auto.conf
$ echo "full_page_writes='off'" >> /u02/pgdata/zfs/pg/postgresql.auto.conf
$ echo "port=5433" >> /u02/pgdata/zfs/pg/postgresql.auto.conf
$ pg_ctl --pgdata=/u02/pgdata/zfs/pg/ --log=/u02/pgdata/zfs/pg/log.log start
$ psql -c "show block_size"
 block_size
------------
 32768
$ pgbench -p 5433 -i -s 1000 postgres
$ psql -c "\l+"
                                                                      List of databases
   Name    |  Owner   | Encoding | Locale Provider | Collate |  Ctype  | Locale | ICU Rules |   Access privileges   | Size  | Tablespace |                Description
-----------+----------+----------+-----------------+---------+---------+--------+-----------+-----------------------+-------+------------+--------------------------------------------
 postgres  | postgres | UTF8     | libc            | C.UTF-8 | C.UTF-8 |        |           |                       | 14 GB | pg_default | default administrative connection database
 template0 | postgres | UTF8     | libc            | C.UTF-8 | C.UTF-8 |        |           | =c/postgres          +| 16 MB | pg_default | unmodifiable empty database
           |          |          |                 |         |         |        |           | postgres=CTc/postgres |       |            |
 template1 | postgres | UTF8     | libc            | C.UTF-8 | C.UTF-8 |        |           | =c/postgres          +| 16 MB | pg_default | default template for new databases
           |          |          |                 |         |         |        |           | postgres=CTc/postgres |       |            |
(3 rows)
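One thing to keep in mind: the PostgreSQL block size is not a runtime or initdb parameter, it is fixed at compile time. To get binaries with a 32kB block size, PostgreSQL has to be built with the corresponding configure option, roughly like this (the prefix is only an example, adapt it to your environment):

$ ./configure --with-blocksize=32 --prefix=/u01/app/postgres/product/17/db_1
$ make && make install

After that, "show block_size" reports 32768 as seen above, and every cluster created with these binaries uses 32kB pages.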
Same standard test again (summaries only):
$ for i in {1..3}; do pgbench --client=4 --jobs=4 --time=900 --port=5433; done
# run 1
number of transactions actually processed: 964541
tps = 1071.697122 (without initial connection time)
# run 2
number of transactions actually processed: 1045888
tps = 1162.073678 (without initial connection time)
# run 3
number of transactions actually processed: 1063922
tps = 1182.126126 (without initial connection time)
Even better than with the aligned 8kB block size and record size. Same test with the simple-update run:
$ for i in {1..3}; do pgbench --client=4 --jobs=4 --time=900 --port=5433 --builtin=simple-update; done
# run 1
number of transactions actually processed: 1233855
tps = 1370.926077 (without initial connection time)
# run 2
number of transactions actually processed: 1242710
tps = 1380.790526 (without initial connection time)
# run 3
number of transactions actually processed: 1251885
tps = 1390.963532 (without initial connection time)
Same picture: using a 32kB block size and record size gives better results.
To summarize: If you want to go for ZFS with PostgreSQL, it indeed matters that the PostgreSQL block size and the ZFS record size are aligned. I am still not sure if I would go for a 32kB block size in PostgreSQL; this requires more testing. We haven’t even looked at other stuff to adapt (shared_buffers, checkpointing, …), but I think that ZFS today is a valid option for running PostgreSQL workloads. But, again: please test on your own, with your workload, before making any decisions.
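Just to give an idea of what “other stuff” means: a typical next step would be to adapt parameters like these (the values below are only placeholders, not recommendations, and none of them were used for the tests in this post):

$ echo "shared_buffers='2GB'" >> /u02/pgdata/zfs/pg/postgresql.auto.conf
$ echo "checkpoint_timeout='30min'" >> /u02/pgdata/zfs/pg/postgresql.auto.conf
$ echo "max_wal_size='8GB'" >> /u02/pgdata/zfs/pg/postgresql.auto.conf

How shared_buffers interacts with the ZFS ARC, and how checkpointing behaves on a copy-on-write file system, would need its own round of testing.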