When it comes to sorting strings it is all about locales and how PostgreSQL is using them. PostgreSQL uses the locales provided by the underlying operating system, by default this is “C” or “POSIX”. To see how that works lets do some simple demos to start with and then go a bit deeper.
The system I am using here is a Debian 12:
1 2 | postgres@debian12-pg: /home/postgres/ [pgdev] cat /etc/debian_version 12.2 |
I’ve modified the current environment to not set any locale specific environment variables:
1 2 | postgres@debian12-pg: /home/postgres/ [pgdev] env | grep LC postgres@debian12-pg: /home/postgres/ [pgdev] |
When we initialize a new PostgreSQL cluster from this environment, what do we get as locale and encoding settings in the new cluster?
1 2 3 4 5 6 7 8 9 10 11 12 13 | postgres@debian12-pg: /home/postgres/ [pgdev] initdb --pgdata= /var/tmp/xx postgres@debian12-pg: /home/postgres/ [pgdev] export PGPORT=8888 postgres@debian12-pg: /home/postgres/ [pgdev] pg_ctl --pgdata= /var/tmp/xx --log= /dev/null start postgres@debian12-pg: /home/postgres/ [pgdev] psql -p 8888 -c "\l" List of databases Name | Owner | Encoding | Locale Provider | Collate | Ctype | ICU Locale | ICU Rules | Access privileges -----------+----------+-----------+-----------------+---------+-------+------------+-----------+----------------------- postgres | postgres | SQL_ASCII | libc | C | C | | | template0 | postgres | SQL_ASCII | libc | C | C | | | =c /postgres + | | | | | | | | postgres=CTc /postgres template1 | postgres | SQL_ASCII | libc | C | C | | | =c /postgres + | | | | | | | | postgres=CTc /postgres (3 rows) |
As nothing specific was defined in the environment and we did not specify anything with initdb, we got “C” as the locale and “SQL_ASCII” as the encoding. Especially the encoding of “SQL_ASCII” is not a good choice as it is clearly stated in the documentation linked above:
“The SQL_ASCII
setting behaves considerably differently from the other settings. When the server character set is SQL_ASCII
, the server interprets byte values 0–127 according to the ASCII standard, while byte values 128–255 are taken as uninterpreted characters. No encoding conversion will be done when the setting is SQL_ASCII
. Thus, this setting is not so much a declaration that a specific encoding is in use, as a declaration of ignorance about the encoding. In most cases, if you are working with any non-ASCII data, it is unwise to use the SQL_ASCII
setting because PostgreSQL will be unable to help you by converting or validating non-ASCII characters.”
In other words, you should avoid that. How can you change the default encoding? There are two way of doing this. The first option is to properly set the locale settings in your environment:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | postgres@debian12-pg: /home/postgres/ [pgdev] export LANG=en_US.UTF-8 postgres@debian12-pg: /home/postgres/ [pgdev] initdb --pgdata= /var/tmp/yy postgres@debian12-pg: /home/postgres/ [pgdev] export PGPORT=8889 postgres@debian12-pg: /home/postgres/ [pgdev] pg_ctl --pgdata= /var/tmp/yy --log= /dev/null start postgres@debian12-pg: /home/postgres/ [pgdev] psql -p 8889 -c "\l" List of databases Name | Owner | Encoding | Locale Provider | Collate | Ctype | ICU Locale | ICU Rules | Access privileges -----------+----------+----------+-----------------+-------------+-------------+------------+-----------+----------------------- postgres | postgres | UTF8 | libc | en_US.UTF-8 | en_US.UTF-8 | | | template0 | postgres | UTF8 | libc | en_US.UTF-8 | en_US.UTF-8 | | | =c /postgres + | | | | | | | | postgres=CTc /postgres template1 | postgres | UTF8 | libc | en_US.UTF-8 | en_US.UTF-8 | | | =c /postgres + | | | | | | | | postgres=CTc /postgres (3 rows) |
This will give you an encoding of UTF8 and a locale of en_US. The other option is to tell initdb what you want to have:
1 2 3 4 5 6 7 8 9 10 11 12 13 | postgres@debian12-pg: /home/postgres/ [pgdev] initdb --pgdata= /var/tmp/zz --locale= "en_US.UTF8" postgres@debian12-pg: /home/postgres/ [pgdev] export PGPORT=8890 postgres@debian12-pg: /home/postgres/ [pgdev] pg_ctl --pgdata= /var/tmp/zz --log= /dev/null start postgres@debian12-pg: /home/postgres/ [pgdev] psql -p 8890 -c "\l" List of databases Name | Owner | Encoding | Locale Provider | Collate | Ctype | ICU Locale | ICU Rules | Access privileges -----------+----------+----------+-----------------+------------+------------+------------+-----------+----------------------- postgres | postgres | UTF8 | libc | en_US.UTF8 | en_US.UTF8 | | | template0 | postgres | UTF8 | libc | en_US.UTF8 | en_US.UTF8 | | | =c /postgres + | | | | | | | | postgres=CTc /postgres template1 | postgres | UTF8 | libc | en_US.UTF8 | en_US.UTF8 | | | =c /postgres + | | | | | | | | postgres=CTc /postgres (3 rows) |
This gives you the exact same thing. If you want to use your native locale, you can do that as well:
1 2 3 4 5 6 | postgres@debian12-pg: /home/postgres/ [pgdev] initdb --pgdata= /var/tmp/oo --locale= "de_CH.UTF8" The files belonging to this database system will be owned by user "postgres" . This user must also own the server process. initdb: error: invalid locale name "de_CH.UTF8" initdb: hint: If the locale name is specific to ICU, use --icu-locale. |
What is the issue here? The answer is quite simple: This locale does not exist on the operating system:
1 2 3 4 5 | postgres@debian12-pg: /home/postgres/ [pgdev] locale -a C C.utf8 en_US.utf8 POSIX |
On Debian this needs to be generated by enabling the locale in /etc/locale.gen and then run “locale-gen” afterwards:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | postgres@debian12-pg: /home/postgres/ [pgdev] sudo vi /etc/locale .gen postgres@debian12-pg: /home/postgres/ [pgdev] grep de_CH /etc/locale .gen # de_CH ISO-8859-1 de_CH.UTF-8 UTF-8 postgres@debian12-pg: /home/postgres/ [pgdev] sudo locale-gen Generating locales (this might take a while )... de_CH.UTF-8... done en_US.UTF-8... done Generation complete. postgres@debian12-pg: /home/postgres/ [pgdev] locale -a C C.utf8 de_CH.utf8 en_US.utf8 POSIX |
Now, that the new locale is there, it can also be given to initdb (or specified in the environment):
1 2 3 4 5 6 7 8 9 10 11 12 13 | postgres@debian12-pg: /home/postgres/ [pgdev] initdb --pgdata= /var/tmp/oo --locale= "de_CH.UTF8" postgres@debian12-pg: /home/postgres/ [pgdev] export PGPORT=8891 postgres@debian12-pg: /home/postgres/ [pgdev] pg_ctl --pgdata= /var/tmp/oo --log= /dev/null start postgres@debian12-pg: /home/postgres/ [pgdev] psql -p 8891 -c "\l" List of databases Name | Owner | Encoding | Locale Provider | Collate | Ctype | ICU Locale | ICU Rules | Access privileges -----------+----------+----------+-----------------+------------+------------+------------+-----------+----------------------- postgres | postgres | UTF8 | libc | de_CH.UTF8 | de_CH.UTF8 | | | template0 | postgres | UTF8 | libc | de_CH.UTF8 | de_CH.UTF8 | | | =c /postgres + | | | | | | | | postgres=CTc /postgres template1 | postgres | UTF8 | libc | de_CH.UTF8 | de_CH.UTF8 | | | =c /postgres + | | | | | | | | postgres=CTc /postgres (3 rows) |
If you don’t want to use the locales from glibc you can also use ICU (International Components for Unicode). When your version of PostgreSQL is compiled with support for ICU and it is installed on your system then this is another choice you have. On Debian 12 it is this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | postgres@debian12-pg: /home/postgres/ [pgdev] apt info libicu72 Package: libicu72 Version: 72.1-3 Priority: optional Section: libs Source: icu Maintainer: Laszlo Boszormenyi (GCS) gcs@debian.org Installed-Size: 37.0 MB Depends: libc6 (>= 2.34), libgcc-s1 (>= 3.0), libstdc++6 (>= 12) Breaks: libiculx63 (<< 63.1-5), openttd (<< 1.8.0-2~) Replaces: libiculx63 (<< 63.1-5) Homepage: https: //icu .unicode.org/ Tag: role::shared-lib Download-Size: 9,376 kB APT-Manual-Installed: no APT-Sources: http: //deb .debian.org /debian bookworm /main amd64 Packages Description: International Components for Unicode ICU is a C++ and C library that provides robust and full-featured Unicode and locale support. This package contains the runtime libraries for ICU. |
For using this you need to switch the locale provider when you are initializing a new cluster:
1 2 3 4 5 6 7 8 9 10 11 12 13 | postgres@debian12-pg: /home/postgres/ [pgdev] initdb --pgdata= /var/tmp/qq --locale-provider=icu --icu-locale=en-US-x-icu postgres@debian12-pg: /home/postgres/ [pgdev] export PGPORT=8892 postgres@debian12-pg: /home/postgres/ [pgdev] pg_ctl --pgdata= /var/tmp/qq --log= /dev/null start postgres@debian12-pg: /home/postgres/ [pgdev] psql -p 8892 -c "\l" List of databases Name | Owner | Encoding | Locale Provider | Collate | Ctype | ICU Locale | ICU Rules | Access privileges -----------+----------+----------+-----------------+-------------+-------------+-------------+-----------+----------------------- postgres | postgres | UTF8 | icu | en_US.UTF-8 | en_US.UTF-8 | en-US-x-icu | | template0 | postgres | UTF8 | icu | en_US.UTF-8 | en_US.UTF-8 | en-US-x-icu | | =c /postgres + | | | | | | | | postgres=CTc /postgres template1 | postgres | UTF8 | icu | en_US.UTF-8 | en_US.UTF-8 | en-US-x-icu | | =c /postgres + | | | | | | | | postgres=CTc /postgres (3 rows) |
Once a cluster is running you may ask PostgreSQL what is available:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | postgres=# select * from pg_collation ; oid | collname | collnamespace | collowner | collprovider | collisdeterministic | collencoding | collcollate | collctype | colliculocale | collicurules | collversion -------+------------------------+---------------+-----------+--------------+---------------------+--------------+-------------+------------+------------------+--------------+------------- 100 | default | 11 | 10 | d | t | -1 | | | | | 950 | C | 11 | 10 | c | t | -1 | C | C | | | 951 | POSIX | 11 | 10 | c | t | -1 | POSIX | POSIX | | | 962 | ucs_basic | 11 | 10 | c | t | 6 | C | C | | | 963 | unicode | 11 | 10 | i | t | -1 | | | und | | 153.120 12348 | C.utf8 | 11 | 10 | c | t | 6 | C.utf8 | C.utf8 | | | 12349 | de_CH.utf8 | 11 | 10 | c | t | 6 | de_CH.utf8 | de_CH.utf8 | | | 2.36 12350 | en_US.utf8 | 11 | 10 | c | t | 6 | en_US.utf8 | en_US.utf8 | | | 2.36 12351 | de_CH | 11 | 10 | c | t | 6 | de_CH.utf8 | de_CH.utf8 | | | 2.36 12352 | en_US | 11 | 10 | c | t | 6 | en_US.utf8 | en_US.utf8 | | | 2.36 12353 | und-x-icu | 11 | 10 | i | t | -1 | | | und | | 153.120 12354 | af-x-icu | 11 | 10 | i | t | -1 | | | af | | 153.120.42 12355 | af-NA-x-icu | 11 | 10 | i | t | -1 | | | af-NA | | 153.120.42 12356 | af-ZA-x-icu | 11 | 10 | i | t | -1 | | | af-ZA | | 153.120.42 |
That’s it for now. In one of the next posts we’ll look into the sort order.