When it comes to sorting strings it is all about locales and how PostgreSQL is using them. PostgreSQL uses the locales provided by the underlying operating system, by default this is “C” or “POSIX”. To see how that works lets do some simple demos to start with and then go a bit deeper.

The system I am using here is a Debian 12:

postgres@debian12-pg:/home/postgres/ [pgdev] cat /etc/debian_version 
12.2

I’ve modified the current environment to not set any locale specific environment variables:

postgres@debian12-pg:/home/postgres/ [pgdev] env | grep LC
postgres@debian12-pg:/home/postgres/ [pgdev] 

When we initialize a new PostgreSQL cluster from this environment, what do we get as locale and encoding settings in the new cluster?

postgres@debian12-pg:/home/postgres/ [pgdev] initdb --pgdata=/var/tmp/xx
postgres@debian12-pg:/home/postgres/ [pgdev] export PGPORT=8888
postgres@debian12-pg:/home/postgres/ [pgdev] pg_ctl --pgdata=/var/tmp/xx --log=/dev/null start
postgres@debian12-pg:/home/postgres/ [pgdev] psql -p 8888 -c "\l"
                                                   List of databases
   Name    |  Owner   | Encoding  | Locale Provider | Collate | Ctype | ICU Locale | ICU Rules |   Access privileges   
-----------+----------+-----------+-----------------+---------+-------+------------+-----------+-----------------------
 postgres  | postgres | SQL_ASCII | libc            | C       | C     |            |           | 
 template0 | postgres | SQL_ASCII | libc            | C       | C     |            |           | =c/postgres          +
           |          |           |                 |         |       |            |           | postgres=CTc/postgres
 template1 | postgres | SQL_ASCII | libc            | C       | C     |            |           | =c/postgres          +
           |          |           |                 |         |       |            |           | postgres=CTc/postgres
(3 rows)

As nothing specific was defined in the environment and we did not specify anything with initdb, we got “C” as the locale and “SQL_ASCII” as the encoding. Especially the encoding of “SQL_ASCII” is not a good choice as it is clearly stated in the documentation linked above:

“The SQL_ASCII setting behaves considerably differently from the other settings. When the server character set is SQL_ASCII, the server interprets byte values 0–127 according to the ASCII standard, while byte values 128–255 are taken as uninterpreted characters. No encoding conversion will be done when the setting is SQL_ASCII. Thus, this setting is not so much a declaration that a specific encoding is in use, as a declaration of ignorance about the encoding. In most cases, if you are working with any non-ASCII data, it is unwise to use the SQL_ASCII setting because PostgreSQL will be unable to help you by converting or validating non-ASCII characters.”

In other words, you should avoid that. How can you change the default encoding? There are two way of doing this. The first option is to properly set the locale settings in your environment:

postgres@debian12-pg:/home/postgres/ [pgdev] export LANG=en_US.UTF-8
postgres@debian12-pg:/home/postgres/ [pgdev] initdb --pgdata=/var/tmp/yy
postgres@debian12-pg:/home/postgres/ [pgdev] export PGPORT=8889
postgres@debian12-pg:/home/postgres/ [pgdev] pg_ctl --pgdata=/var/tmp/yy --log=/dev/null start
postgres@debian12-pg:/home/postgres/ [pgdev] psql -p 8889 -c "\l"
                                                       List of databases
   Name    |  Owner   | Encoding | Locale Provider |   Collate   |    Ctype    | ICU Locale | ICU Rules |   Access privileges   
-----------+----------+----------+-----------------+-------------+-------------+------------+-----------+-----------------------
 postgres  | postgres | UTF8     | libc            | en_US.UTF-8 | en_US.UTF-8 |            |           | 
 template0 | postgres | UTF8     | libc            | en_US.UTF-8 | en_US.UTF-8 |            |           | =c/postgres          +
           |          |          |                 |             |             |            |           | postgres=CTc/postgres
 template1 | postgres | UTF8     | libc            | en_US.UTF-8 | en_US.UTF-8 |            |           | =c/postgres          +
           |          |          |                 |             |             |            |           | postgres=CTc/postgres
(3 rows)

This will give you an encoding of UTF8 and a locale of en_US. The other option is to tell initdb what you want to have:

postgres@debian12-pg:/home/postgres/ [pgdev] initdb --pgdata=/var/tmp/zz --locale="en_US.UTF8"
postgres@debian12-pg:/home/postgres/ [pgdev] export PGPORT=8890
postgres@debian12-pg:/home/postgres/ [pgdev] pg_ctl --pgdata=/var/tmp/zz --log=/dev/null start
postgres@debian12-pg:/home/postgres/ [pgdev] psql -p 8890 -c "\l"
                                                      List of databases
   Name    |  Owner   | Encoding | Locale Provider |  Collate   |   Ctype    | ICU Locale | ICU Rules |   Access privileges   
-----------+----------+----------+-----------------+------------+------------+------------+-----------+-----------------------
 postgres  | postgres | UTF8     | libc            | en_US.UTF8 | en_US.UTF8 |            |           | 
 template0 | postgres | UTF8     | libc            | en_US.UTF8 | en_US.UTF8 |            |           | =c/postgres          +
           |          |          |                 |            |            |            |           | postgres=CTc/postgres
 template1 | postgres | UTF8     | libc            | en_US.UTF8 | en_US.UTF8 |            |           | =c/postgres          +
           |          |          |                 |            |            |            |           | postgres=CTc/postgres
(3 rows)

This gives you the exact same thing. If you want to use your native locale, you can do that as well:

postgres@debian12-pg:/home/postgres/ [pgdev] initdb --pgdata=/var/tmp/oo --locale="de_CH.UTF8"
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

initdb: error: invalid locale name "de_CH.UTF8"
initdb: hint: If the locale name is specific to ICU, use --icu-locale.

What is the issue here? The answer is quite simple: This locale does not exist on the operating system:

postgres@debian12-pg:/home/postgres/ [pgdev] locale -a
C
C.utf8
en_US.utf8
POSIX

On Debian this needs to be generated by enabling the locale in /etc/locale.gen and then run “locale-gen” afterwards:

postgres@debian12-pg:/home/postgres/ [pgdev] sudo vi /etc/locale.gen 
postgres@debian12-pg:/home/postgres/ [pgdev] grep de_CH /etc/locale.gen 
# de_CH ISO-8859-1
de_CH.UTF-8 UTF-8
postgres@debian12-pg:/home/postgres/ [pgdev] sudo locale-gen
Generating locales (this might take a while)...
  de_CH.UTF-8... done
  en_US.UTF-8... done
Generation complete.
postgres@debian12-pg:/home/postgres/ [pgdev] locale -a
C
C.utf8
de_CH.utf8
en_US.utf8
POSIX

Now, that the new locale is there, it can also be given to initdb (or specified in the environment):

postgres@debian12-pg:/home/postgres/ [pgdev] initdb --pgdata=/var/tmp/oo --locale="de_CH.UTF8"
postgres@debian12-pg:/home/postgres/ [pgdev] export PGPORT=8891
postgres@debian12-pg:/home/postgres/ [pgdev] pg_ctl --pgdata=/var/tmp/oo --log=/dev/null start
postgres@debian12-pg:/home/postgres/ [pgdev] psql -p 8891 -c "\l"
                                                      List of databases
   Name    |  Owner   | Encoding | Locale Provider |  Collate   |   Ctype    | ICU Locale | ICU Rules |   Access privileges   
-----------+----------+----------+-----------------+------------+------------+------------+-----------+-----------------------
 postgres  | postgres | UTF8     | libc            | de_CH.UTF8 | de_CH.UTF8 |            |           | 
 template0 | postgres | UTF8     | libc            | de_CH.UTF8 | de_CH.UTF8 |            |           | =c/postgres          +
           |          |          |                 |            |            |            |           | postgres=CTc/postgres
 template1 | postgres | UTF8     | libc            | de_CH.UTF8 | de_CH.UTF8 |            |           | =c/postgres          +
           |          |          |                 |            |            |            |           | postgres=CTc/postgres
(3 rows)

If you don’t want to use the locales from glibc you can also use ICU (International Components for Unicode). When your version of PostgreSQL is compiled with support for ICU and it is installed on your system then this is another choice you have. On Debian 12 it is this:

postgres@debian12-pg:/home/postgres/ [pgdev] apt info libicu72
Package: libicu72
Version: 72.1-3
Priority: optional
Section: libs
Source: icu
Maintainer: Laszlo Boszormenyi (GCS) [email protected]
Installed-Size: 37.0 MB
Depends: libc6 (>= 2.34), libgcc-s1 (>= 3.0), libstdc++6 (>= 12)
Breaks: libiculx63 (<< 63.1-5), openttd (<< 1.8.0-2~)
Replaces: libiculx63 (<< 63.1-5)
Homepage: https://icu.unicode.org/
Tag: role::shared-lib
Download-Size: 9,376 kB
APT-Manual-Installed: no
APT-Sources: http://deb.debian.org/debian bookworm/main amd64 Packages
Description: International Components for Unicode
ICU is a C++ and C library that provides robust and full-featured
Unicode and locale support. This package contains the runtime
libraries for ICU.

For using this you need to switch the locale provider when you are initializing a new cluster:

postgres@debian12-pg:/home/postgres/ [pgdev] initdb --pgdata=/var/tmp/qq --locale-provider=icu --icu-locale=en-US-x-icu
postgres@debian12-pg:/home/postgres/ [pgdev] export PGPORT=8892
postgres@debian12-pg:/home/postgres/ [pgdev] pg_ctl --pgdata=/var/tmp/qq --log=/dev/null start
 postgres@debian12-pg:/home/postgres/ [pgdev] psql -p 8892 -c "\l"
                                                        List of databases
   Name    |  Owner   | Encoding | Locale Provider |   Collate   |    Ctype    | ICU Locale  | ICU Rules |   Access privileges   
-----------+----------+----------+-----------------+-------------+-------------+-------------+-----------+-----------------------
 postgres  | postgres | UTF8     | icu             | en_US.UTF-8 | en_US.UTF-8 | en-US-x-icu |           | 
 template0 | postgres | UTF8     | icu             | en_US.UTF-8 | en_US.UTF-8 | en-US-x-icu |           | =c/postgres          +
           |          |          |                 |             |             |             |           | postgres=CTc/postgres
 template1 | postgres | UTF8     | icu             | en_US.UTF-8 | en_US.UTF-8 | en-US-x-icu |           | =c/postgres          +
           |          |          |                 |             |             |             |           | postgres=CTc/postgres
(3 rows)

Once a cluster is running you may ask PostgreSQL what is available:

postgres=# select * from pg_collation ;
  oid  |        collname        | collnamespace | collowner | collprovider | collisdeterministic | collencoding | collcollate | collctype  |  colliculocale   | collicurules | collversion 
-------+------------------------+---------------+-----------+--------------+---------------------+--------------+-------------+------------+------------------+--------------+-------------
   100 | default                |            11 |        10 | d            | t                   |           -1 |             |            |                  |              | 
   950 | C                      |            11 |        10 | c            | t                   |           -1 | C           | C          |                  |              | 
   951 | POSIX                  |            11 |        10 | c            | t                   |           -1 | POSIX       | POSIX      |                  |              | 
   962 | ucs_basic              |            11 |        10 | c            | t                   |            6 | C           | C          |                  |              | 
   963 | unicode                |            11 |        10 | i            | t                   |           -1 |             |            | und              |              | 153.120
 12348 | C.utf8                 |            11 |        10 | c            | t                   |            6 | C.utf8      | C.utf8     |                  |              | 
 12349 | de_CH.utf8             |            11 |        10 | c            | t                   |            6 | de_CH.utf8  | de_CH.utf8 |                  |              | 2.36
 12350 | en_US.utf8             |            11 |        10 | c            | t                   |            6 | en_US.utf8  | en_US.utf8 |                  |              | 2.36
 12351 | de_CH                  |            11 |        10 | c            | t                   |            6 | de_CH.utf8  | de_CH.utf8 |                  |              | 2.36
 12352 | en_US                  |            11 |        10 | c            | t                   |            6 | en_US.utf8  | en_US.utf8 |                  |              | 2.36
 12353 | und-x-icu              |            11 |        10 | i            | t                   |           -1 |             |            | und              |              | 153.120
 12354 | af-x-icu               |            11 |        10 | i            | t                   |           -1 |             |            | af               |              | 153.120.42
 12355 | af-NA-x-icu            |            11 |        10 | i            | t                   |           -1 |             |            | af-NA            |              | 153.120.42
 12356 | af-ZA-x-icu            |            11 |        10 | i            | t                   |           -1 |             |            | af-ZA            |              | 153.120.42

That’s it for now. In one of the next posts we’ll look into the sort order.