Often there is a requirement to populate a test or development database with data from production, but this comes with a risk: do you really want developers or testers to have access to sensitive data? In many companies this might not be an issue, but for others, sensitive data must not be available in any database other than production. Oracle offers Data Masking, but there is nothing in Community PostgreSQL that helps you with that. Of course you could develop something on your own, but there is another solution: [Data]nymizer. This tool produces a native dump file in which sensitive data is masked based on flexible rules. Because the result is a dump file, its size might become an issue if your source database is large and you want to dump the whole database. But usually you do not have sensitive data in all tables, and you have the option to dump only specific tables. Let's have a look at how this can be installed, and how it works.
The installation itself is straightforward:
postgres@debian10pg:/home/postgres/ [pgdev] curl -sSfL https://git.io/pg_datanymizer | sh -s -- -b bin v0.1.0
pg_datanymizer installer: Version v0.1.0 will be installed
pg_datanymizer installer: Successfully installed pg_datanymizer 0.1.0 to bin/pg_datanymizer
postgres@debian10pg:/home/postgres/ [pgdev] ls -l bin/
total 14452
-rwxr-xr-x 1 postgres postgres 14796464 Feb 24 10:51 pg_datanymizer
To check if it works in general, we can print the help:
postgres@debian10pg:/home/postgres/ [pgdev] bin/pg_datanymizer --help
pg_datanymizer 0.1.0

USAGE:
    pg_datanymizer [OPTIONS]

FLAGS:
        --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -f, --file        Path to dump file, example: /tmp/dump.sql
    -c, --config      Path to config file. Default: ./config.yml
    -d, --dbname      database to dump [default: postgres]
    -h, --host        database server host or socket directory [default: localhost]
    -W, --password    force password prompt (should happen automatically)
        --pg_dump     pg_dump file location [default: pg_dump]
    -p, --port        database server port number
    -U, --username    connect as specified database user

ARGS:
There are not too many options, and you'll notice that the tool can be used over the network as well. This is quite important, as access to the production host is usually limited.
Before we proceed with the tool we need some data, so let's use a standard pgbench database for this:
postgres@debian10pg:/home/postgres/ [pgdev] psql
psql (14devel)
Type "help" for help.

postgres=# create database test;
CREATE DATABASE
postgres=# \! pgbench -i -s 10 test
dropping old tables...
NOTICE:  table "pgbench_accounts" does not exist, skipping
NOTICE:  table "pgbench_branches" does not exist, skipping
NOTICE:  table "pgbench_history" does not exist, skipping
NOTICE:  table "pgbench_tellers" does not exist, skipping
creating tables...
generating data (client-side)...
1000000 of 1000000 tuples (100%) done (elapsed 1.73 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 2.83 s (drop tables 0.00 s, create tables 0.03 s, client-side generate 1.77 s, vacuum 0.33 s, primary keys 0.70 s).
postgres=# \c test
You are now connected to database "test" as user "postgres".
test=# \d
        List of relations
 Schema |       Name       | Type  |  Owner
--------+------------------+-------+----------
 public | pgbench_accounts | table | postgres
 public | pgbench_branches | table | postgres
 public | pgbench_history  | table | postgres
 public | pgbench_tellers  | table | postgres
(4 rows)

test=#
The “filler” column of pgbench_accounts is not populated by default. Let's assume we have some sensitive data there, e.g. email addresses:
test=# update pgbench_accounts set filler = 'email@_'||md5(bid::text)||'.com';
UPDATE 1000000
test=# select * from pgbench_accounts limit 5;
  aid   | bid | abalance |                    filler
--------+-----+----------+----------------------------------------------
 999947 |  10 |        0 | email@_d3d9446802a44259755d38e6d163e820.com
 999948 |  10 |        0 | email@_d3d9446802a44259755d38e6d163e820.com
 999949 |  10 |        0 | email@_d3d9446802a44259755d38e6d163e820.com
 999950 |  10 |        0 | email@_d3d9446802a44259755d38e6d163e820.com
 999951 |  10 |        0 | email@_d3d9446802a44259755d38e6d163e820.com
(5 rows)
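Note that this dummy data is derived deterministically from the branch id (bid), which is exactly the kind of pattern that leaks information: every account in the same branch ends up with the same address. A small Python sketch (not part of the tool, just reproducing the SQL expression above) shows this:

```python
import hashlib

# Reproduces the SQL expression 'email@_'||md5(bid::text)||'.com':
# the address depends only on bid, so all accounts of a branch share it.
def fake_email(bid: int) -> str:
    return "email@_" + hashlib.md5(str(bid).encode()).hexdigest() + ".com"

# bid 10 yields exactly the value visible in the query output above
print(fake_email(10))  # email@_d3d9446802a44259755d38e6d163e820.com
```

This is why proper anonymization needs randomized, rule-based masking rather than a simple hash of an existing column.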
Using a simple configuration file like this:
postgres@debian10pg:/home/postgres/ [pgdev] cat config.yaml
tables:
  - name: pgbench_accounts
    rules:
      filler:
        template:
          format: user-{{_1}}-{{_2}}
          rules:
            - random_num: {}
            - email:
                kind: Safe
… we can easily obfuscate this:
postgres@debian10pg:/home/postgres/ [pgdev] bin/pg_datanymizer -c config.yaml -U postgres -f output.sql test
Prepare data scheme...
Fetch tables metadata...
[1 / 4] Prepare to dump table: public.pgbench_tellers
[Dumping: public.pgbench_tellers] [|##################################################|] 100 of 100 rows [100%] (0s)
[Dumping: public.pgbench_tellers] Finished in 0 seconds
[Dumping: public.pgbench_accounts] [|##################################################|] 1016397 of 1016397 rows [100%] (0s)
[Dumping: public.pgbench_accounts] Finished in 54 seconds
[Dumping: public.pgbench_history] [|##################################################|] 0 of 0 rows [100%] (0s)
[Dumping: public.pgbench_history] Finished in 0 seconds
[Dumping: public.pgbench_branches] [|##################################################|] 10 of 10 rows [100%] (0s)
[Dumping: public.pgbench_branches] Finished in 0 seconds
Finishing with indexes...
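Conceptually, the template rule renders each sub-rule and substitutes the results into the {{_1}} and {{_2}} placeholders of the format string. The following Python sketch only illustrates that idea; the function names random_num and safe_email are invented stand-ins, not the tool's actual API:

```python
import random

# Hypothetical stand-ins for the config's sub-rules; Datanymizer uses a
# fake-data generator internally, this just mimics the composition.
def random_num() -> str:
    return str(random.randint(0, 999999))

def safe_email() -> str:
    # "Safe" kind: addresses in a reserved example domain
    return f"user{random.randint(0, 9999)}@example.com"

def apply_template(fmt: str, rules) -> str:
    # Replace {{_1}}, {{_2}}, ... with the output of each sub-rule in order
    out = fmt
    for i, rule in enumerate(rules, start=1):
        out = out.replace("{{_%d}}" % i, rule())
    return out

masked = apply_template("user-{{_1}}-{{_2}}", [random_num, safe_email])
print(masked)  # e.g. user-123456-user42@example.com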
Looking at the result, the filler column still contains email addresses, but no longer the original values:
postgres@debian10pg:/home/postgres/ [pgdev] grep "@" output.sql | head -10
999947 10 0 [email protected]
999948 10 0 [email protected]
999949 10 0 [email protected]
999950 10 0 [email protected]
999951 10 0 [email protected]
999952 10 0 [email protected]
999953 10 0 [email protected]
110689 2 0 [email protected]
110690 2 0 [email protected]
110691 2 0 [email protected]
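If you want a quick sanity check that no original addresses survived the masking, you can scan the dump for the original "email@_&lt;md5&gt;.com" pattern. A minimal Python sketch, with hypothetical sample lines standing in for reading output.sql:

```python
import re

# Pattern of the original (sensitive) addresses generated earlier
ORIGINAL = re.compile(r"email@_[0-9a-f]{32}\.com")

# Sample dump lines for illustration; in practice iterate over output.sql
sample_dump_lines = [
    "999947\t10\t0\tuser-661605-abc@example.com",
    "999948\t10\t0\tuser-143937-def@example.com",
]

leaks = [line for line in sample_dump_lines if ORIGINAL.search(line)]
print("leaked rows:", len(leaks))  # leaked rows: 0
```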
“Email” is only one of the available rules. Have a look at the README for the other options.
Filtering tables, either to include or to exclude them, is possible as well:
postgres@debian10pg:/home/postgres/ [pg14] cat config.yaml
filter:
  only:
    - public.pgbench_accounts
tables:
  - name: pgbench_accounts
    rules:
      filler:
        template:
          format: user-{{_1}}-{{_2}}
          rules:
            - random_num: {}
            - email:
                kind: Safe
Using this configuration, only the pgbench_accounts table will be in the dump file:
postgres@debian10pg:/home/postgres/ [pgdev] bin/pg_datanymizer -c config.yaml -U postgres -f output.sql test
Prepare data scheme...
Fetch tables metadata...
[1 / 4] Prepare to dump table: public.pgbench_tellers
[Dumping: public.pgbench_tellers] --- SKIP ---
[2 / 4] Prepare to dump table: public.pgbench_history
[Dumping: public.pgbench_history] --- SKIP ---
[3 / 4] Prepare to dump table: public.pgbench_branches
[Dumping: public.pgbench_branches] --- SKIP ---
[4 / 4] Prepare to dump table: public.pgbench_accounts
[Dumping: public.pgbench_accounts] [|##################################################|] 1016397 of 1016397 rows [100%] (0s)
[Dumping: public.pgbench_accounts] Finished in 35 seconds
Finishing with indexes...
Full Dump finished in 35 seconds
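The filter semantics boil down to a membership test: with an "only" list, a table is dumped exactly when it appears in the list. A Python sketch of that logic (an illustration only, not the tool's implementation; I am assuming the complementary "except" filter works analogously):

```python
# Decide whether a table ends up in the dump, given an optional
# "only" allow-list or "except" deny-list (as in the filter config).
def should_dump(table: str, only=None, except_=None) -> bool:
    if only is not None:
        return table in only
    if except_ is not None:
        return table not in except_
    return True  # no filter configured: dump everything

tables = ["public.pgbench_tellers", "public.pgbench_history",
          "public.pgbench_branches", "public.pgbench_accounts"]
only = ["public.pgbench_accounts"]

for t in tables:
    print(t, "DUMP" if should_dump(t, only=only) else "SKIP")
```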
Really a nice tool, and very flexible, if you have a requirement for data anonymization.