In one of the last posts we looked at how, starting with PostgreSQL 17, you can tell COPY to ignore any rows which do not fit into the target table. While this is great, it also hides information you might want to know: how many rows have been discarded during the load? This commit goes in that direction and extends pg_stat_progress_copy with an additional column: tuples_skipped. This does not record how many rows were skipped once the load has completed, but at least you can now monitor how many rows are being skipped while the data load is running.
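As a quick sketch of what that monitoring looks like (run from a second session while a COPY is in progress; the column names are the ones documented for pg_stat_progress_copy):

```sql
-- Monitor a running COPY from another session.
select pid,
       relid::regclass  as target_table,
       tuples_processed,
       tuples_excluded,  -- rows filtered out by a WHERE clause on COPY
       tuples_skipped    -- rows discarded because of ON_ERROR 'ignore'
  from pg_stat_progress_copy;
```

The view only has rows while a COPY is actually running; once the command finishes, the row disappears.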
To have a look at this, we cannot use a very small text file as in the last post, as it would load too fast:
```
postgres=# create table import ( a int, b varchar(4), c int, d varchar(4));
CREATE TABLE
postgres=# copy import from '/home/postgres/data.txt' with (delimiter ' ');
COPY 8
-- second/another session
postgres=# select * from pg_stat_progress_copy;
 pid | datid | datname | relid | command | type | bytes_processed | bytes_total | tuples_processed | tuples_excluded | tuples_skipped
-----+-------+---------+-------+---------+------+-----------------+-------------+------------------+-----------------+----------------
(0 rows)
```
To really see something in pg_stat_progress_copy, let's generate a simple, but larger, data set:
```
postgres=# create table t ( a varchar(50), b varchar(50) );
CREATE TABLE
postgres=# insert into t select md5(i::text), md5(i::text) from generate_series(1,3000000) i;
INSERT 0 3000000
postgres=# copy t to '/var/tmp/a';
COPY 3000000
postgres=# \! head /var/tmp/a
c4ca4238a0b923820dcc509a6f75849b c4ca4238a0b923820dcc509a6f75849b
c81e728d9d4c2f636f067f89cc14862c c81e728d9d4c2f636f067f89cc14862c
eccbc87e4b5ce2fe28308fd9f2a7baf3 eccbc87e4b5ce2fe28308fd9f2a7baf3
a87ff679a2f3e71d9181a67b7542122c a87ff679a2f3e71d9181a67b7542122c
e4da3b7fbbce2345d7772b0674a318d5 e4da3b7fbbce2345d7772b0674a318d5
1679091c5a880faf6fb5e6087eb1b2dc 1679091c5a880faf6fb5e6087eb1b2dc
8f14e45fceea167a5a36dedd4bea2543 8f14e45fceea167a5a36dedd4bea2543
c9f0f895fb98ab9159f51fd0297e236d c9f0f895fb98ab9159f51fd0297e236d
45c48cce2e2d7fbdea1afc51c7c6ad26 45c48cce2e2d7fbdea1afc51c7c6ad26
d3d9446802a44259755d38e6d163e820 d3d9446802a44259755d38e6d163e820
```
This gives us a simple text file with two hashed columns. Before loading this, we’ll introduce some rows which will not fit into the table:
```
postgres=# \! sed -i '1500000 i 123456789012345678901234567890123456789012345678901234567890' /var/tmp/a
postgres=# \! sed -i '1500000 i 123456789012345678901234567890123456789012345678901234567890' /var/tmp/a
postgres=# \! sed -i '1500000 i 123456789012345678901234567890123456789012345678901234567890' /var/tmp/a
postgres=# \! sed -i '1500000 i 123456789012345678901234567890123456789012345678901234567890' /var/tmp/a
postgres=# \! sed -i '1500000 i 123456789012345678901234567890123456789012345678901234567890' /var/tmp/a
postgres=# \! grep 123456789012345678901234567890123456789012345678901234567890 /var/tmp/a
123456789012345678901234567890123456789012345678901234567890
123456789012345678901234567890123456789012345678901234567890
123456789012345678901234567890123456789012345678901234567890
123456789012345678901234567890123456789012345678901234567890
123456789012345678901234567890123456789012345678901234567890
```
Those 60 characters will not fit into the first column, so these rows will fail to load. Before starting the load, we open a second session and monitor pg_stat_progress_copy:
```
postgres=# select datname, tuples_processed, tuples_skipped from pg_stat_progress_copy;
 datname | tuples_processed | tuples_skipped
---------+------------------+----------------
(0 rows)

postgres=# \watch 1
        Thu 25 Jan 2024 01:08:31 PM CET (every 1s)

 datname | tuples_processed | tuples_skipped
---------+------------------+----------------
(0 rows)
...
```
In the first session we’ll create a new table with the same structure as the previous one and start to load the data:
```
postgres=# create table z ( like t );
CREATE TABLE
postgres=# copy z from '/var/tmp/a' with ( ON_ERROR 'ignore' );
COPY 3000000
```
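A quick sanity check on the numbers (a sketch, using the figures from above): the file contained the 3,000,000 original rows plus the five injected bad lines, five of which were skipped, so the table should hold exactly 3,000,000 rows:

```sql
-- 3,000,000 original rows + 5 injected bad rows - 5 skipped = 3,000,000
select count(*) from z;
```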
In the second session we can see that exactly the five rows we added above have been skipped:
```
 datname | tuples_processed | tuples_skipped
---------+------------------+----------------
(0 rows)

        Thu 25 Jan 2024 01:11:18 PM CET (every 1s)

 datname  | tuples_processed | tuples_skipped
----------+------------------+----------------
 postgres |           163000 |              0
(1 row)

        Thu 25 Jan 2024 01:11:19 PM CET (every 1s)

 datname  | tuples_processed | tuples_skipped
----------+------------------+----------------
 postgres |          2346000 |              5
(1 row)

        Thu 25 Jan 2024 01:11:20 PM CET (every 1s)

 datname | tuples_processed | tuples_skipped
---------+------------------+----------------
(0 rows)
```
Nice, thanks to all involved.