Coming back to the last post, 8000 databases in one PostgreSQL cluster?: as we've seen, this is doable, but it comes with a lot of pain points. Another solution for this kind of requirement would be to host 8000 PostgreSQL clusters on one host. Again, I am not saying this is something you should do, and again, this is not how it will finally be implemented. This is more about whether you can do it, and what it means for operating such a setup. Eight thousand is a huge number, no matter if it is this number of databases in a single cluster or this number of clusters on a single node.
The default PostgreSQL configuration on Linux comes with 128MB of shared_buffers. Again, if we do the math, that would be 1'024'000MB (roughly 1TB) for 8000 clusters on a single host. Surprisingly, I do not have such a machine available right now. We'll use an AWS EC2 t3a.xlarge, which comes with 4 vCPUs and 16GB of memory. If we reduce shared_buffers to 128kB (which is the minimum), we come to 1'024'000kB, which is around 1GB of shared_buffers across all clusters. This should be doable, not counting any other memory-related parameters.
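A quick sanity check of that arithmetic in the shell (nothing PostgreSQL-specific here):

# 128MB of shared_buffers per cluster, 8000 clusters
echo "$(( 8000 * 128 / 1024 )) GB"   # 1000 GB, roughly 1TB
# with the 128kB minimum instead
echo "$(( 8000 * 128 / 1024 )) MB"   # 1000 MB, roughly 1GB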
The next bit we need to calculate is the required disk space. A freshly initialized PGDATA for PostgreSQL 13.1 consumes around 39MB. For 8000 clusters we will therefore need roughly 312'000MB of disk space. So, here we go:
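If you want to verify that per-cluster footprint yourself, a throwaway cluster is enough (the scratch path here is just an example):

# initialize a scratch cluster and measure its size
initdb -D /var/tmp/pgdata_scratch > /dev/null 2>&1
du -sh /var/tmp/pgdata_scratch   # around 39MB for PostgreSQL 13.1
rm -rf /var/tmp/pgdata_scratch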
postgres@pg13:/home/postgres/ [pg131] grep MemTotal /proc/meminfo
MemTotal:       16284668 kB
postgres@pg13:/home/postgres/ [pg131] grep proc /proc/cpuinfo
processor       : 0
processor       : 1
processor       : 2
processor       : 3
postgres@pg13:/home/postgres/ [pg131] df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            7.8G     0  7.8G   0% /dev
tmpfs           1.6G  8.4M  1.6G   1% /run
/dev/nvme0n1p1  492G  2.9G  469G   1% /
tmpfs           7.8G     0  7.8G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           7.8G     0  7.8G   0% /sys/fs/cgroup
/dev/nvme0n1p15 124M  262K  124M   1% /boot/efi
tmpfs           1.6G     0  1.6G   0% /run/user/1000
Initializing 8000 clusters will take some time, but the procedure is quite simple. We need a separate directory for each PGDATA and a different port for each cluster. In addition, we need to reduce shared_buffers to 128kB and then start up all the clusters. First, let's create all the data directories and check whether we already run into any issues with this:
port="5432"
for i in {1..8000}
do
  PGDATA="/u02/pgdata/pgdata${i}"
  initdb -D ${PGDATA} > /dev/null 2>&1
  echo "port=${port}" >> ${PGDATA}/postgresql.auto.conf
  echo "shared_buffers=128kB" >> ${PGDATA}/postgresql.auto.conf
  (( port += 1 ))
done
If you want to monitor the progress, you might do something like this:
postgres@pg13:/home/postgres/ [pg131] watch "ls -l /u02/pgdata | grep -v total | wc -l"

Every 2.0s: ls -l /u02/pgdata | grep -v total | wc -l    pg13: Fri Feb 12 10:50:41 2021
146
...
Every 2.0s: ls -l /u02/pgdata | grep -v total | wc -l    pg13: Fri Feb 12 11:06:14 2021
1197
...
Every 2.0s: ls -l /u02/pgdata | grep -v total | wc -l    pg13: Fri Feb 12 11:13:27 2021
1696

Every 2.0s: ls -l /u02/pgdata | grep -v total | wc -l    pg13: Fri Feb 12 11:54:09 2021
4452

Every 2.0s: ls -l /u02/pgdata | grep -v total | wc -l    pg13: Fri Feb 12 12:19:44 2021
6116

Every 2.0s: ls -l /u02/pgdata | grep -v total | wc -l    pg13: Fri Feb 12 13:35:24 2021
8000
That takes quite some time. How much space are we consuming?
postgres@pg13:/home/postgres/ [pg131] du -sh /u02/pgdata/
308G    /u02/pgdata/
Not far from what we calculated above. Before we try to start all the clusters, there is another bit we need to think about: if we start up a PostgreSQL 13.1 instance, how many processes do we get by default?
postgres@debian10pg:/home/postgres/ [pg131] ps -ef | egrep "postgres -D|postgres:"
postgres 28408     1  0 07:42 ?        00:00:00 /u01/app/postgres/product/DEV/db_1/bin/postgres -D /u02/pgdata/DEV
postgres 28409 28408  0 07:42 ?        00:00:00 postgres: logger
postgres 28411 28408  0 07:42 ?        00:00:00 postgres: checkpointer
postgres 28412 28408  0 07:42 ?        00:00:00 postgres: background writer
postgres 28413 28408  0 07:42 ?        00:00:00 postgres: walwriter
postgres 28414 28408  0 07:42 ?        00:00:00 postgres: autovacuum launcher
postgres 28415 28408  0 07:42 ?        00:00:00 postgres: stats collector
postgres 28416 28408  0 07:42 ?        00:00:00 postgres: logical replication launcher
postgres@debian10pg:/home/postgres/ [pg131] ps -ef | egrep "postgres -D|postgres:" | wc -l
8
That’s eight processes per cluster; multiplied by 8000, that makes 64’000 processes. In the default configuration on Debian we have these limits:
postgres@pg13:/home/postgres/ [pg131] ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 63541
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 63541
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
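Comparing the required process count against the limits above already shows the problem:

# 8 processes per cluster, 8000 clusters
echo $(( 8 * 8000 ))   # 64000 processes needed
ulimit -u              # but only 63541 allowed per user
ulimit -n              # and 1024 open files per process is also very low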
This will for sure not work, as we can easily verify:
port="5432"
for i in {1..8000}
do
  PGDATA="/u02/pgdata/pgdata${i}"
  pg_ctl start -D ${PGDATA} -l /dev/null > /dev/null 2>&1
  (( port += 1 ))
done
This will go fine until around 1500 postmasters have been started:
unable to fork process: Resource temporarily unavailable
… so we are clearly hitting some limits here. Let's increase the maximum number of open files and the maximum number of processes for the postgres user (-1 means unlimited; we set only the hard limits here):
postgres@pg13:/home/postgres/ [pg131] cat /etc/security/limits.conf | egrep -v "^#|^$"
postgres hard nproc -1
postgres hard nofile -1
After restarting the session we can check the actual values of the hard limits:
postgres@pg13:/home/postgres/ [pg131] ulimit -H -n
1048576
postgres@pg13:/home/postgres/ [pg131] ulimit -H -u
unlimited
This means we should be able to create an unlimited number of user processes, and we should be able to open 1048576 files (this seems to be the maximum on Debian). As these are the hard limits, we need to increase the soft limits as well:
postgres@pg13:/home/postgres/ [pg131] ulimit -S -u unlimited
postgres@pg13:/home/postgres/ [pg131] ulimit -S -n 1048576
postgres@pg13:/home/postgres/ [pg131] ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 63540
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
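As a side note: instead of raising the soft limits by hand in every new session, they could also be persisted in /etc/security/limits.conf (assuming PAM applies them at login):

# /etc/security/limits.conf (a sketch)
postgres soft nproc  unlimited
postgres soft nofile 1048576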
Let's check if that is sufficient for starting the 8000 clusters:
postgres@pg13:/home/postgres/ [pg131] port="5432"
postgres@pg13:/home/postgres/ [pg131] for i in {1..8000}
> do
>   PGDATA="/u02/pgdata/pgdata${i}"
>   pg_ctl start -D ${PGDATA} -l /dev/null > /dev/null 2>&1
>   (( port += 1 ))
> done
No, we’ll still hit limits (again around 1500 clusters):
no data was returned by command ""/u01/app/postgres/product/13/db_1/bin/postgres" -V"
The program "postgres" is needed by pg_ctl but was not found in the
same directory as "/u01/app/postgres/product/13/db_1/bin/pg_ctl".
Check your installation.
sh: 1: Cannot fork
Let's try a larger instance type, c5.9xlarge, which comes with 36 vCPUs and 72GiB of memory. Will that work?
postgres@pg13:/home/postgres/ [pg131] cat /proc/cpuinfo | grep proc | wc -l
36
postgres@pg13:/home/postgres/ [pg131] cat /proc/meminfo | grep Mem
MemTotal:       72033208 kB
MemFree:        71577940 kB
MemAvailable:   71139240 kB
postgres@pg13:/home/postgres/ [pg131] ulimit -S -u unlimited
postgres@pg13:/home/postgres/ [pg131] ulimit -S -n 1048576
postgres@pg13:/home/postgres/ [pg131] for i in {1..8000}
> do
>   PGDATA="/u02/pgdata/pgdata${i}"
>   pg_ctl start -D ${PGDATA} -l /dev/null > /dev/null 2>&1
>   (( port += 1 ))
> done
We can go a bit further and get to approximately 1740 running clusters, but this is far from 8000:
postgres@pg13:/home/postgres/ [pg131] ps -ef | grep postgres | grep "postgres -D" | wc -l
1736
If we increase even more and go to a c5a.24xlarge (96 vCPUs and 192GiB of memory) we get:
postgres@pg13:/home/postgres/ [pg131] cat /proc/cpuinfo | grep proc | wc -l
96
postgres@pg13:/home/postgres/ [pg131] cat /proc/meminfo | grep Mem
MemTotal:       195998200 kB
MemFree:        195223608 kB
MemAvailable:   194067016 kB
postgres@pg13:/home/postgres/ [pg131] ulimit -S -u unlimited
postgres@pg13:/home/postgres/ [pg131] ulimit -S -n 1048576
postgres@pg13:/home/postgres/ [pg131] for i in {1..8000}; do PGDATA="/u02/pgdata/pgdata${i}"; pg_ctl start -D ${PGDATA} -l /dev/null > /dev/null 2>&1; (( port += 1 )); done
This gives us around 4096 clusters, which is a bit more than half of the 8000 we wanted, and it means there is yet another limit we're hitting. Roughly 4096 clusters consume about 32'768 processes, which is nowhere near the limits we've set above. What is it then? Increasing the number of vCPUs clearly gave us more running clusters, so there must be a relation between the number of vCPUs and the number of processes we can start. There is a kernel parameter which limits the maximum number of processes on a Linux system:
postgres@pg13:/home/postgres/ [pg131] cat /proc/sys/kernel/pid_max
98304
postgres@pg13:/home/postgres/ [pg131] man 5 proc
/proc/sys/kernel/pid_max (since Linux 2.5.34)
    This file specifies the value at which PIDs wrap around (i.e., the value in this file is one
    greater than the maximum PID). PIDs greater than this value are not allocated; thus, the value
    in this file also acts as a system-wide limit on the total number of processes and threads.
    The default value for this file, 32768, results in the same range of PIDs as on earlier
    kernels. On 32-bit platforms, 32768 is the maximum value for pid_max. On 64-bit systems,
    pid_max can be set to any value up to 2^22 (PID_MAX_LIMIT, approximately 4 million).
We're already higher than the default of 32768; the kernel scales this value with the number of possible CPUs (roughly 1024 PIDs per CPU, and 96 × 1024 = 98304). It is also not a hard limit in the strict sense: once that number is reached, the PID counter wraps around and starts from the low end again, and only when no free PIDs are left can no additional process be started. If we scale our instance down to a c5.xlarge (4 vCPUs and 8GiB of memory) we get the default again:
postgres@pg13:/home/postgres/ [pg131] cat /proc/sys/kernel/pid_max
32768
Although we should not actually be hitting the limit of 98304, let's increase it anyway and do another run. This time we'll also have a look at the memory usage:
postgres@pg13:/home/postgres/ [pg131] sudo bash
root@pg13:/home/postgres$ echo "196608" > /proc/sys/kernel/pid_max
root@pg13:/home/postgres$ exit
postgres@pg13:/home/postgres/ [pg131] cat /proc/sys/kernel/pid_max
196608
postgres@pg13:/home/postgres/ [pg131] ulimit -S -u unlimited
postgres@pg13:/home/postgres/ [pg131] ulimit -S -n 1048576
postgres@pg13:/home/postgres/ [pg131] for i in {1..8000}
> do
>   PGDATA="/u02/pgdata/pgdata${i}"
>   pg_ctl start -D ${PGDATA} -l /dev/null > /dev/null 2>&1
>   (( port += 1 ))
> done
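Note that writing to /proc like this does not survive a reboot. If the setting should be persistent, a sysctl drop-in would be the usual way (the file name is arbitrary):

echo "kernel.pid_max = 196608" | sudo tee /etc/sysctl.d/90-pid_max.conf
sudo sysctl --system   # reload all sysctl configuration files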
In a second session we monitor the memory usage:
postgres@pg13:/home/postgres/ [pg131] watch "free -m"

## with approx. 1000 clusters
Every 2.0s: free -m    pg13: Tue Feb 16 13:16:50 2021
              total        used        free      shared  buff/cache   available
Mem:         189152        6412      177810        4183        4929      177353
Swap:             0           0           0

## with approx. 2000 clusters
Every 2.0s: free -m    pg13: Tue Feb 16 13:18:38 2021
              total        used        free      shared  buff/cache   available
Mem:         189152       12098      167470        8276        9583      167516
Swap:             0           0           0

## with approx. 3000 clusters
Every 2.0s: free -m    pg13: Tue Feb 16 13:20:26 2021
              total        used        free      shared  buff/cache   available
Mem:         189152       17803      157119       12369       14229      157664
Swap:             0           0           0

## with approx. 4000 clusters
Every 2.0s: free -m    pg13: Tue Feb 16 13:22:16 2021
              total        used        free      shared  buff/cache   available
Mem:         189152       23673      146495       16540       18983      147562
Swap:             0           0           0
… and it still fails at a bit more than 4000 clusters. The issue is not overall memory consumption; we still have plenty of memory left. The answer is in the PostgreSQL startup log:
2021-02-16 13:43:17.990 UTC [97614] FATAL:  could not create shared memory segment: No space left on device
2021-02-16 13:43:17.990 UTC [97614] DETAIL:  Failed system call was shmget(key=21380032, size=56, 03600).
2021-02-16 13:43:17.990 UTC [97614] HINT:  This error does *not* mean that you have run out of disk space. It occurs either if all available shared memory IDs have been taken, in which case you need to raise the SHMMNI parameter in your kernel, or because the system's overall limit for shared memory has been reached. The PostgreSQL documentation contains more information about shared memory configuration.
2021-02-16 13:43:17.990 UTC [97614] LOG:  database system is shut down
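The shmget call in the DETAIL line is the one small System V shared memory segment every cluster allocates at startup. We can therefore count the allocated segments and compare that against the kernel limit:

ipcs -m | grep -c postgres    # roughly one segment per running cluster
cat /proc/sys/kernel/shmmni   # defaults to 4096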
So we hit the default of kernel.shmmni, which is 4096 on Linux and also explains why we got stuck at around 4000 clusters. We need to increase it so that enough shared memory segments are available. Next try:
postgres@pg13:/home/postgres/ [pg131] sudo sysctl -w kernel.shmmni=8192
kernel.shmmni = 8192
postgres@pg13:/home/postgres/ [pg131] sysctl -p
postgres@pg13:/home/postgres/ [pg131] sudo sysctl -a | grep kernel.shmmni
kernel.shmmni = 8192
postgres@pg13:/home/postgres/ [pg131] for i in {1..8000}
> do
>   PGDATA="/u02/pgdata/pgdata${i}"
>   pg_ctl start -D ${PGDATA} -l /dev/null > /dev/null 2>&1
>   (( port += 1 ))
> done
sh: 1: Cannot fork
no data was returned by command ""/u01/app/postgres/product/13/db_1/bin/postgres" -V"
The program "postgres" is needed by pg_ctl but was not found in the
same directory as "/u01/app/postgres/product/13/db_1/bin/pg_ctl".
Check your installation.
Next limit hit. Now we get to around 4600 clusters, which is better, but still not 8000. This time we are hitting the number of open files:
postgres@pg13:/home/postgres/ [pg131] sudo lsof -u postgres 2>/dev/null | wc -l
1884863
postgres@pg13:/home/postgres/ [pg131] ulimit -Hn
1048576
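The system-wide file handle usage is visible in /proc as well:

# three numbers: allocated handles, free handles, maximum (fs.file-max)
cat /proc/sys/fs/file-nr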
But even if we increase the system-wide limit, we'll run into the same issue, because the per-user limit cannot be raised beyond 1048576:
postgres@pg13:/home/postgres/ [pg131] sudo sysctl -w fs.file-max=100000000
fs.file-max = 100000000
postgres@pg13:/home/postgres/ [pg131] sudo sysctl -a | grep fs.file-max
fs.file-max = 100000000
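As a side note, the 1048576 ceiling for the per-process nofile limit corresponds to the kernel parameter fs.nr_open, which defaults to exactly that value and could in principle be raised as well:

cat /proc/sys/fs/nr_open             # 1048576 by default
sudo sysctl -w fs.nr_open=10485760   # raise the per-process ceiling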
Apart from that ceiling, the reason the limits get applied to our sessions at all is in the sshd_config:
postgres@pg13:/home/postgres/ [pg131] sudo cat /etc/ssh/sshd_config | grep -i UsePam
UsePAM yes
PAM is responsible for applying the user limits at login. Let's try to skip that by switching PAM off in sshd_config and restarting sshd:
postgres@pg13:/home/postgres/ [pg131] sudo sed -i 's/UsePAM yes/UsePAM no/g' /etc/ssh/sshd_config
postgres@pg13:/home/postgres/ [pg131] grep UsePAM /etc/ssh/sshd_config
UsePAM no
postgres@pg13:/home/postgres/ [pg131] sudo systemctl restart sshd
Once sshd has been restarted, let's open a new ssh session:
dwe@ltdwe:~$ ssh -i /home/dwe/Documents/aws/dwe-key-pair.pem admin@18.195.216.2
admin@18.195.216.2: Permission denied (publickey).
Next issue: we are not able to log in anymore. The better way is to revert UsePAM to yes and instead comment out the line in /etc/pam.d/sshd which enforces the limits:
postgres@debian10pg:/home/postgres/ [pg14] grep -A 1 "limits" /etc/pam.d/sshd
# access limits that are hard to express in sshd_config.
# account  required     pam_access.so
--
# Set up user limits from /etc/security/limits.conf.
# session    required     pam_limits.so
After restarting sshd, let's try again:
port="5432"
for i in {1..8000}
do
  PGDATA="/u02/pgdata/pgdata${i}"
  pg_ctl start -D ${PGDATA} -l /home/postgres/${i}log.log
  (( port += 1 ))
done
… and it still fails:
2021-02-19 15:49:38.457 UTC [109716] LOG:  listening on Unix socket "/tmp/.s.PGSQL.10063"
2021-02-19 15:49:38.461 UTC [109717] LOG:  database system was shut down at 2021-02-19 15:39:59 UTC
2021-02-19 15:49:38.465 UTC [109716] LOG:  could not fork worker process: Resource temporarily unavailable
2021-02-19 15:49:38.465 UTC [109716] LOG:  database system is ready to accept connections
2021-02-19 15:49:50.889 UTC [109716] LOG:  received smart shutdown request
The next test is to check whether we are hitting a per-user/session limit or a system-wide one. For that we will start the first 4000 clusters with the postgres user we already have. For the remaining 4000 clusters we'll create another user, change the ownership of those $PGDATA directories, and then try to start them. If that works, we can at least run 8000 clusters using two separate OS users:
root@ip-10-0-1-21:/home/admin# useradd -g postgres -m postgres2
root@ip-10-0-1-21:/home/admin# for i in {4001..8000}
> do
>   PGDATA="/u02/pgdata/pgdata${i}"
>   chown -R postgres2:postgres ${PGDATA}
> done
First 4000 clusters:
postgres@ip-10-0-1-21:/home/postgres/ [pg132] for i in {1..4000}; do PGDATA="/u02/pgdata/pgdata${i}"; pg_ctl start -D ${PGDATA} -l /home/postgres/${i}log.log; done

Every 2.0s: ps -ef | egrep 'postgres -D' | wc -l    ip-10-0-1-21: Tue Feb 23 10:38:49 2021
4000
Let's check if we can start the other 4000 clusters using the second user:
postgres2@ip-10-0-1-21:/home/postgres2/ [pg132] id -a
uid=1002(postgres2) gid=1001(postgres) groups=1001(postgres)
postgres2@ip-10-0-1-21:/home/postgres2/ [pg132] for i in {4001..8000}; do PGDATA="/u02/pgdata/pgdata${i}"; pg_ctl start -D ${PGDATA} -l /home/postgres2/${i}log.log; done
postgres2@ip-10-0-1-21:/home/postgres2/ [pg132] cat 4632log.log
2021-02-23 10:45:38.573 UTC [61580] LOG:  starting PostgreSQL 13.2 on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
2021-02-23 10:45:38.573 UTC [61580] LOG:  listening on IPv4 address "127.0.0.1", port 10063
2021-02-23 10:45:38.578 UTC [61580] LOG:  listening on Unix socket "/tmp/.s.PGSQL.10063"
2021-02-23 10:45:38.585 UTC [61581] LOG:  database system was shut down at 2021-02-21 17:53:04 UTC
2021-02-23 10:45:38.597 UTC [61580] LOG:  could not fork worker process: Resource temporarily unavailable
2021-02-23 10:45:38.597 UTC [61580] LOG:  database system is ready to accept connections
Which leads to the same issue once we reach around 4600 clusters in total. At this point I decided to stop the experiment, as the conclusion is clear: don't do it, for obvious reasons:
- You’ll run into all kinds of limits
- Managing 8000 clusters on one host is not doable; you'll for sure make a mistake at some point, and then one or more clusters are damaged
- Looking only at the number of open files and the number of processes required should already tell you, again: don't do it
- Many, many other reasons (backup/restore/patching/upgrades)
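For completeness: if you ever build something like this for testing, cleaning up is just the mirror image of the start loop (a sketch; run it as the user owning the respective clusters):

for i in {1..8000}
do
  PGDATA="/u02/pgdata/pgdata${i}"
  pg_ctl stop -D ${PGDATA} -m fast > /dev/null 2>&1
  rm -rf ${PGDATA}
done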