In one of the last posts we looked at how, starting with PostgreSQL 17, you can tell COPY to ignore any rows which do not fit into the target table. While this is great, it also hides information you might want to know: how many rows have been discarded during the load? This commit goes in that direction and extends pg_stat_progress_copy with an additional column: tuples_skipped. This does not record how many rows were skipped once the load has completed, but at least you can now monitor how many rows are being skipped while the data load is running.
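As a quick sketch of what that monitoring looks like (run from a second session while a COPY is in progress; the column names are the ones documented for pg_stat_progress_copy):

```sql
-- Monitor a running COPY from another session.
select pid,
       relid::regclass  as target_table,
       tuples_processed,
       tuples_excluded,  -- rows filtered out by a WHERE clause on COPY
       tuples_skipped    -- rows discarded because of ON_ERROR 'ignore'
  from pg_stat_progress_copy;
```

The view only has rows while a COPY is actually running; once the command finishes, the row disappears.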
To have a look at this, we cannot use a very small text file as in the last post, as it would load too fast:
```
postgres=# create table import ( a int, b varchar(4), c int, d varchar(4));
CREATE TABLE
postgres=# copy import from '/home/postgres/data.txt' with (delimiter ' ');
COPY 8
-- second/another session
postgres=# select * from pg_stat_progress_copy;
 pid | datid | datname | relid | command | type | bytes_processed | bytes_total | tuples_processed | tuples_excluded | tuples_skipped
-----+-------+---------+-------+---------+------+-----------------+-------------+------------------+-----------------+----------------
(0 rows)
```
To really see something in pg_stat_progress_copy, let's generate a simple, but larger, data set:
```
postgres=# create table t ( a varchar(50), b varchar(50) );
CREATE TABLE
postgres=# insert into t select md5(i::text), md5(i::text) from generate_series(1,3000000) i;
INSERT 0 3000000
postgres=# copy t to '/var/tmp/a';
COPY 3000000
postgres=# \! head /var/tmp/a
c4ca4238a0b923820dcc509a6f75849b c4ca4238a0b923820dcc509a6f75849b
c81e728d9d4c2f636f067f89cc14862c c81e728d9d4c2f636f067f89cc14862c
eccbc87e4b5ce2fe28308fd9f2a7baf3 eccbc87e4b5ce2fe28308fd9f2a7baf3
a87ff679a2f3e71d9181a67b7542122c a87ff679a2f3e71d9181a67b7542122c
e4da3b7fbbce2345d7772b0674a318d5 e4da3b7fbbce2345d7772b0674a318d5
1679091c5a880faf6fb5e6087eb1b2dc 1679091c5a880faf6fb5e6087eb1b2dc
8f14e45fceea167a5a36dedd4bea2543 8f14e45fceea167a5a36dedd4bea2543
c9f0f895fb98ab9159f51fd0297e236d c9f0f895fb98ab9159f51fd0297e236d
45c48cce2e2d7fbdea1afc51c7c6ad26 45c48cce2e2d7fbdea1afc51c7c6ad26
d3d9446802a44259755d38e6d163e820 d3d9446802a44259755d38e6d163e820
```
This gives us a simple text file with two hashed columns. Before loading this, we’ll introduce some rows which will not fit into the table:
```
postgres=# \! sed -i '1500000 i 123456789012345678901234567890123456789012345678901234567890' /var/tmp/a
postgres=# \! sed -i '1500000 i 123456789012345678901234567890123456789012345678901234567890' /var/tmp/a
postgres=# \! sed -i '1500000 i 123456789012345678901234567890123456789012345678901234567890' /var/tmp/a
postgres=# \! sed -i '1500000 i 123456789012345678901234567890123456789012345678901234567890' /var/tmp/a
postgres=# \! sed -i '1500000 i 123456789012345678901234567890123456789012345678901234567890' /var/tmp/a
postgres=# \! grep 123456789012345678901234567890123456789012345678901234567890 /var/tmp/a
123456789012345678901234567890123456789012345678901234567890
123456789012345678901234567890123456789012345678901234567890
123456789012345678901234567890123456789012345678901234567890
123456789012345678901234567890123456789012345678901234567890
123456789012345678901234567890123456789012345678901234567890
```
Those 60 characters will not fit into the first column, so these rows will fail to load. Before starting the load, we open a second session and monitor pg_stat_progress_copy:
```
postgres=# select datname, tuples_processed, tuples_skipped from pg_stat_progress_copy;
 datname | tuples_processed | tuples_skipped
---------+------------------+----------------
(0 rows)

postgres=# \watch 1
        Thu 25 Jan 2024 01:08:31 PM CET (every 1s)

 datname | tuples_processed | tuples_skipped
---------+------------------+----------------
(0 rows)
...
```
In the first session we’ll create a new table with the same structure as the previous one and start to load the data:
```
postgres=# create table z ( like t );
CREATE TABLE
postgres=# copy z from '/var/tmp/a' with ( ON_ERROR 'ignore' );
COPY 3000000
```
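A quick sanity check on the numbers (a sketch, using the figures from above): the file contained the 3,000,000 original rows plus the five injected bad lines, five of which were skipped, so the table should hold exactly 3,000,000 rows:

```sql
-- 3,000,000 original rows + 5 injected bad rows - 5 skipped = 3,000,000
select count(*) from z;
```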
In the second session we can see that exactly the five rows we added above have been skipped:
```
 datname | tuples_processed | tuples_skipped
---------+------------------+----------------
(0 rows)

        Thu 25 Jan 2024 01:11:18 PM CET (every 1s)

 datname  | tuples_processed | tuples_skipped
----------+------------------+----------------
 postgres |           163000 |              0
(1 row)

        Thu 25 Jan 2024 01:11:19 PM CET (every 1s)

 datname  | tuples_processed | tuples_skipped
----------+------------------+----------------
 postgres |          2346000 |              5
(1 row)

        Thu 25 Jan 2024 01:11:20 PM CET (every 1s)

 datname | tuples_processed | tuples_skipped
---------+------------------+----------------
(0 rows)
```
Nice, thanks to all involved.