
pgvector, a guide for DBA – Part 2: Indexes (updated March 2026)

Adrien Obernesser — Sun, 01 Mar 2026 19:09:15 +0000

Introduction

In Part 1 of this series, we covered what pgvector is, how embeddings work, and how to store them in PostgreSQL. We ended with a working similarity search — but on a sequential scan. That works fine for a demo table with 1,000 rows. It does not work for production.

This post is about what comes next: indexes. Specifically, it covers the three index families in the pgvector ecosystem as of February 2026 (HNSW, IVFFlat, and DiskANN), including two DiskANN implementations targeting different deployment models: what they’re good at, where they break, and the patterns you need, whether you’re the DBA tuning them or the developer looking to understand the strengths of PostgreSQL as a vector store.

Everything in this post was tested on a public dataset: 25,000 Wikipedia articles embedded with OpenAI’s text-embedding-3-large at 3,072 dimensions, the maximum the model supports. The high dimension count is a deliberate choice: it exposes some limitations that are useful to see for pedagogical reasons. You would be fine running and testing with fewer dimensions or with other embedding models. For more on that, you might want to look into the RAG series; I will probably write a post on how to test embedding models against your own data sets.
The environment is PostgreSQL 18 with pgvector 0.8.1 and pgvectorscale 0.9.0.

All the SQL scripts, Python code, and Docker configuration are in the companion lab: lab/06_pgvector_indexes.

The Index Types

Before we dive in, here’s the landscape. pgvector ships with two built-in index types (HNSW and IVFFlat), and two DiskANN implementations are available from different vendors:

	HNSW	IVFFlat	DiskANN (pgvectorscale)	DiskANN (pg_diskann)
Provider	pgvector	pgvector	Timescale	Microsoft
Availability	Built-in	Built-in	Open source, self-hosted	Azure DB for PostgreSQL
Algorithm	Multi-layer graph	Voronoi cell partitioning	Vamana graph + SBQ	Vamana graph + PQ
Best for	General purpose	Fast build	Storage-constrained	Azure + high recall
Build time (25K, 3072d)	29s	5s	49s	N/A (Azure)
Index size	193 MB	193 MB	21 MB	Similar
Query time	2-6 ms	2-10 ms	3 ms	~3 ms

That pgvectorscale number is not a typo. 21 MB vs 193 MB for the same data.

Note: This post uses pgvectorscale for all DiskANN benchmarks since it’s the open-source, self-hosted option. We’ll compare both DiskANN implementations in detail in the DiskANN section below. pg_diskann is available only on Azure Database for PostgreSQL Flexible Server, Microsoft’s managed PostgreSQL service.

HNSW: The Default Choice

HNSW (Hierarchical Navigable Small World) is the most commonly recommended index type for vector search. It builds a multi-layered graph where each node connects to its nearest neighbors, and search navigates this graph from top to bottom.

The 2,000-Dimension Wall

Here’s the first thing you’ll hit with modern embedding models:

CREATE INDEX idx_content_hnsw
ON articles USING hnsw (content_vector vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

ERROR:  column cannot have more than 2000 dimensions for hnsw index

The vector type in pgvector has a 2,000-dimension limit for HNSW indexes. If you’re using text-embedding-3-large (3,072 dimensions), or voyage-3-large at its 2,048-dimension setting, this is a blocker.

The workaround: halfvec.

-- Step 1: Store a half-precision copy
ALTER TABLE articles ADD COLUMN content_halfvec halfvec(3072);
UPDATE articles SET content_halfvec = content_vector::halfvec;

-- Step 2: Index the halfvec column (limit: 4,000 dimensions)
CREATE INDEX idx_content_hnsw_halfvec
ON articles USING hnsw (content_halfvec halfvec_cosine_ops)
WITH (m = 16, ef_construction = 64);

Time: 28974.392 ms (00:28.974)

SELECT pg_size_pretty(pg_relation_size('idx_content_hnsw_halfvec')) AS hnsw_size;

 hnsw_size
-----------
 193 MB

29 seconds to build, 193 MB for 25,000 articles at 3,072 dimensions. That’s roughly 8 KB per row in the index alone.

Important: halfvec stores each dimension in 2 bytes instead of 4. You lose some floating-point precision, but in practice the recall difference is negligible for similarity search. The storage savings are real: your halfvec column is 6 KB per row vs 12 KB for the full vector.
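To get a feel for what "some floating-point precision" means, here is a minimal Python sketch using only the standard library. It round-trips synthetic random vectors (not the Wikipedia embeddings) through IEEE 754 half precision, which is assumed here to be a reasonable stand-in for the halfvec cast, and compares both the raw payload size and the resulting cosine distance.

```python
import math
import random
import struct

DIMS = 3072  # dimension count used throughout the post


def to_half_and_back(values):
    """Round-trip floats through IEEE 754 half precision (struct format 'e')."""
    packed = struct.pack(f"<{len(values)}e", *values)
    return list(struct.unpack(f"<{len(values)}e", packed))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


rng = random.Random(42)  # synthetic data, purely illustrative
doc = [rng.gauss(0.0, 1.0) for _ in range(DIMS)]
query = [rng.gauss(0.0, 1.0) for _ in range(DIMS)]

# Raw payload sizes, ignoring PostgreSQL's per-value header
float32_bytes = len(struct.pack(f"<{DIMS}f", *doc))
float16_bytes = len(struct.pack(f"<{DIMS}e", *doc))

exact = cosine_distance(query, doc)
rounded = cosine_distance(to_half_and_back(query), to_half_and_back(doc))
drift = abs(exact - rounded)

print(float32_bytes, float16_bytes)
print(f"cosine distance drift after half-precision rounding: {drift:.2e}")
```

The drift on random data says nothing definitive about recall on your own embeddings, so treat it as an intuition pump and measure recall on real queries before switching.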

Alternative: Instead of a separate column, you can create an expression index that casts on the fly: CREATE INDEX ... ON articles USING hnsw ((content_vector::halfvec(3072)) halfvec_cosine_ops); The trade-off is that your queries must use the matching expression (content_vector::halfvec(3072) <=> ...) for the planner to pick it up, which is harder to read in application code. The separate column approach gives cleaner queries.

Tuning ef_search: The Recall vs Speed Dial

ef_search controls how many candidates the HNSW algorithm considers during search. Higher values mean more candidates examined, better recall, but more work. The default is 40.

-- ef_search = 40 (default)
SET hnsw.ef_search = 40;
EXPLAIN ANALYZE
SELECT id, title, content_halfvec <=> (
    SELECT content_halfvec FROM articles WHERE id = 1
) AS distance
FROM articles
ORDER BY content_halfvec <=> (
    SELECT content_halfvec FROM articles WHERE id = 1
)
LIMIT 10;

Index Scan using idx_content_hnsw_halfvec on articles
  Order By: (content_halfvec <=> (InitPlan 1).col1)
  Index Searches: 1
  Buffers: shared hit=551
Execution Time: 6.004 ms

-- ef_search = 100
SET hnsw.ef_search = 100;

Index Scan using idx_content_hnsw_halfvec on articles
  Buffers: shared hit=716
Execution Time: 2.365 ms

-- ef_search = 200
SET hnsw.ef_search = 200;

Index Scan using idx_content_hnsw_halfvec on articles
  Buffers: shared hit=883
Execution Time: 2.542 ms

Wait — ef_search=100 was faster than ef_search=40? Not really. Those numbers came from a warm cache (shared hit=551, zero disk reads). The apparent speedup is a cache warming effect, not a property of the algorithm. To prove this, I restarted PostgreSQL and ran the full sweep from cold:

ef_search	Execution Time	Buffers	read (disk)
40 (cold)	91 ms	hit=189, read=362	362 pages from disk
100	33 ms	hit=616, read=132	fewer cold pages
200	22 ms	hit=850, read=65	even fewer
40 (warm)	0.8 ms	hit=551, read=0	all cached

The second run at ef_search=40 clocked 0.8 ms, faster than any ef_search=100 or 200 run. On a warm cache, all three (40/100/200) land in the 0.8-6 ms range. The variance is cache state, not algorithmic shortcuts. The real cost cliff is at ef_search=400, where the optimizer switches plans entirely:

-- ef_search = 400
SET hnsw.ef_search = 400;

Sort  (actual time=364.693..364.695 rows=10.00 loops=1)
  Sort Key: ((articles.content_halfvec <=> (InitPlan 1).col1))
  Sort Method: top-N heapsort  Memory: 25kB
  ->  Seq Scan on articles  (actual time=0.143..356.482 rows=24700.00 loops=1)
        Filter: (content_halfvec IS NOT NULL)
        Rows Removed by Filter: 300
Execution Time: 364.724 ms

The planner chose a Seq Scan + Sort path. At ef_search=400, PostgreSQL’s cost model estimated the index scan would be more expensive than reading every row sequentially, and the query went from 2.5 ms to 365 ms.

This is your optimizer flip-flop. It’s not a bug — it’s the planner doing its job. But it means you need to be aware of the threshold.

ef_search	Execution Time	Plan	Buffers
40	6.0 ms	Index Scan	551
100	2.4 ms	Index Scan	716
200	2.5 ms	Index Scan	883
400	365 ms	Seq Scan	209,192

DBA Takeaway: For HNSW on halfvec(3072), stay in the ef_search 40-200 range. Past that, you’re fighting the optimizer. If you need ef_search > 200 for recall, you probably need a bigger m parameter at build time.

Build Parameters: m and ef_construction

m is the number of connections per node in the graph. ef_construction is the candidate list size during build. Higher values = better graph quality but slower builds and (potentially) larger indexes.

-- m=16, ef_construction=64 (default-ish)
Time: 28,974 ms    Size: 193 MB

-- m=32, ef_construction=128
Time: 54,077 ms    Size: 193 MB

At 25K rows, doubling both m and ef_construction roughly doubled the build time (29 s to 54 s) but didn’t change the index size. The size effect becomes more visible at larger scales. In general: start with m=16, ef_construction=64 and only increase if recall is insufficient after tuning ef_search.

IVFFlat: Fast Build

IVFFlat (Inverted File with Flat storage, meaning the vectors inside each list are kept uncompressed) partitions the vector space into Voronoi cells using k-means clustering, then searches only the cells closest to the query vector.

Same Dimension Limit

IVFFlat has the same 2,000-dimension limit as HNSW for the vector type. Same workaround:

CREATE INDEX idx_content_ivfflat
ON articles USING ivfflat (content_halfvec halfvec_cosine_ops)
WITH (lists = 25);

Time: 5008.765 ms (00:05.009)

SELECT pg_size_pretty(pg_relation_size('idx_content_ivfflat'));
-- 193 MB

5 seconds to build vs 29 for HNSW. The index size is nearly identical (193 MB vs 193 MB), but IVFFlat builds 5.8x faster.

The lists parameter controls the number of Voronoi cells. The pgvector documentation recommends: rows / 1000 for tables up to 1M rows, sqrt(rows) for larger tables. For 25,000 articles: 25000 / 1000 = 25 lists, giving roughly 1,000 rows per cell. A common mistake is applying sqrt(rows) to small tables. That gives 158 lists here, creating cells with only ~158 rows each, which fragments the index and causes the optimizer to flip to sequential scan at surprisingly low probe counts.
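The sizing rule is easy to get wrong by hand, so here is a small sketch of it in Python. The function names are made up for illustration; the thresholds are the ones quoted from the pgvector documentation (rows / 1000 up to 1M rows, sqrt(rows) above, and sqrt(lists) as a starting point for probes).

```python
import math


def suggested_lists(rows):
    """Starting point for lists: rows/1000 up to 1M rows, sqrt(rows) above."""
    if rows <= 1_000_000:
        return max(1, rows // 1000)
    return round(math.sqrt(rows))


def suggested_probes(lists):
    """Starting point for ivfflat.probes: sqrt(lists)."""
    return max(1, round(math.sqrt(lists)))


for rows in (25_000, 500_000, 4_000_000):
    lists = suggested_lists(rows)
    print(f"rows={rows:>9,} lists={lists:>5} "
          f"rows/list={rows // lists:>5} probes={suggested_probes(lists)}")

# The small-table mistake: sqrt(rows) applied to 25,000 rows
wrong_lists = round(math.sqrt(25_000))
print(wrong_lists, 25_000 // wrong_lists)
```

These are starting points only. The probe sweep is still what tells you where the optimizer flips on your own table.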

Tuning Probes: How Many Cells to Search

probes controls how many Voronoi cells are searched at query time. Default is 1 — fast but low recall. A good starting point is sqrt(lists). Here’s the sweep with lists=25:

SET ivfflat.probes = 1;
-- Index Scan, Execution Time: 1.0 ms, Buffers: 571

SET ivfflat.probes = 2;
-- Index Scan, Execution Time: 3.7 ms, Buffers: 1,944

SET ivfflat.probes = 3;
-- Index Scan, Execution Time: 4.4 ms, Buffers: 2,793

SET ivfflat.probes = 4;
-- Index Scan, Execution Time: 5.9 ms, Buffers: 3,937

And then:

SET ivfflat.probes = 5;

Sort  (actual time=152.548..152.549 rows=10.00 loops=1)
  Sort Key: ((articles.content_halfvec <=> (InitPlan 1).col1))
  ->  Seq Scan on articles  (actual time=0.144..148.844 rows=24700.00 loops=1)
        Filter: (content_halfvec IS NOT NULL)
Execution Time: 152.584 ms

Same story as HNSW. At probes=5 (searching 5 of 25 cells = 20%), the optimizer decided a sequential scan was cheaper. The query went from 6 ms to 153 ms.

probes	Execution Time	Plan
1	1.0 ms	Index Scan
2	3.7 ms	Index Scan
3	4.4 ms	Index Scan
4	5.9 ms	Index Scan
5	153 ms	Seq Scan

Takeaway: For IVFFlat with 25 lists, the optimizer flips between probes=4 and probes=5. With sqrt(25) = 5, you’re right at the tipping point. Use SET LOCAL ivfflat.probes = 3 or 4 for a good recall/speed balance. On larger tables (100K+ rows), the flip happens at much higher probe counts because sequential scans become proportionally more expensive.

SET LOCAL: The Production Pattern

Never set ivfflat.probes at the session level in production. Use SET LOCAL inside a transaction:

BEGIN;
    SET LOCAL ivfflat.probes = 3;
    SELECT id, title, content_halfvec <=> (
        SELECT content_halfvec FROM articles WHERE id = 1
    ) AS distance
    FROM articles
    ORDER BY content_halfvec <=> (
        SELECT content_halfvec FROM articles WHERE id = 1
    )
    LIMIT 10;
COMMIT;

-- Verify: probes reverted to default
SHOW ivfflat.probes;  -- 1

The setting reverts automatically after COMMIT/ROLLBACK. No global state leakage. Do the same for hnsw.ef_search.

DO NOT PLAY AROUND WITH SESSION PARAMETERS ON PRODUCTION. Use SET LOCAL inside a transaction, or set them at the function/procedure level. Session-level changes persist until disconnect and can affect every query on that connection, including connection pooler reuse. Hinting is not a strategy.
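On the application side, the same pattern could look like the sketch below. It assumes a DB-API style connection with autocommit off and the %s parameter style (as psycopg uses); the function name is hypothetical and the table and column names are the ones from this post. It relies on PostgreSQL's set_config(name, value, is_local) function, which with is_local = true behaves like SET LOCAL but accepts bind parameters.

```python
def search_with_local_probes(conn, query_vector, probes=3, limit=10):
    """Run one KNN query with ivfflat.probes scoped to a single transaction.

    Assumption: `conn` is a DB-API connection with autocommit off, so the
    first execute() opens a transaction that commit()/rollback() closes.
    """
    # pgvector accepts the text form '[x1,x2,...]' as vector input.
    literal = "[" + ",".join(str(float(x)) for x in query_vector) + "]"
    cur = conn.cursor()
    try:
        # Third argument true = local to the current transaction (like SET LOCAL)
        cur.execute(
            "SELECT set_config('ivfflat.probes', %s, true)",
            (str(int(probes)),),
        )
        cur.execute(
            "SELECT id, title, content_halfvec <=> %s::halfvec(3072) AS distance "
            "FROM articles "
            "ORDER BY content_halfvec <=> %s::halfvec(3072) "
            "LIMIT %s",
            (literal, literal, limit),
        )
        rows = cur.fetchall()
        conn.commit()      # the setting reverts here
        return rows
    except Exception:
        conn.rollback()    # and here, if anything failed
        raise
```

Because the setting dies with the transaction, a pooled connection goes back to the pool in its default state regardless of which code path ran.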

DiskANN: Leveraging the B-tree Index Principle

DiskANN is provided by the pgvectorscale project from Timescale (SQL extension name: vectorscale). It implements the DiskANN algorithm with Statistical Binary Quantization (SBQ) compression built in.

No 2,000-Dimension Wall

Unlike HNSW and IVFFlat, DiskANN supports the vector type natively up to 16,000 dimensions — no halfvec workaround needed:

CREATE INDEX idx_content_diskann ON articles USING diskann (content_vector)
WITH (storage_layout = memory_optimized);

NOTICE:  Starting index build with num_neighbors=50, search_list_size=100,
         max_alpha=1.2, storage_layout=SbqCompression.  -- memory_optimized maps to SBQ
NOTICE:  Indexed 24700 tuples
Time: 49140.736 ms (00:49.141)

SELECT pg_size_pretty(pg_relation_size('idx_content_diskann'));
-- 21 MB

49 seconds to build, but 21 MB. Compare:

        indexname         |  size  | size_bytes
--------------------------+--------+------------
 idx_content_diskann      | 21 MB  |   22,511,616
 idx_content_hnsw_halfvec | 193 MB |  202,350,592
 idx_content_ivfflat      | 193 MB |  202,522,624

DiskANN is 9x smaller than HNSW and IVFFlat on the same data. SBQ compression is the default — storage_layout = memory_optimized is what you get if you don’t specify a layout. Specifying it explicitly (as in the CREATE INDEX above) is good practice for readability. The alternative plain layout stores full vectors in the index and does not compress.

Query performance is competitive:

Index Scan using idx_content_diskann on articles
  Order By: (content_vector <=> (InitPlan 1).col1)
  Buffers: shared hit=1437 read=132
Execution Time: 2.915 ms

3 ms for a 3,072-dimension nearest neighbor search on 25K articles. On the same data, HNSW does it in 2-6 ms and IVFFlat in 2-10 ms. All three are in the same ballpark for query speed.

Takeaway: DiskANN is the right choice when your HNSW index outgrows shared_buffers. At 193 MB for 25K rows, HNSW on halfvec(3072) would reach ~77 GB at 10 million rows, which is well beyond what most buffer pools can keep hot. DiskANN’s 9x compression keeps the navigational structure cached while full vectors stay in the heap, fetched only for the final rescore of a few dozen candidates. Same access pattern as a B-tree: compact index in memory, selective heap lookups on demand. The trade-off is longer build times and fewer operator class options (vector type only, no halfvec/bit/sparsevec).

CREATE INDEX CONCURRENTLY

As of pgvectorscale 0.9.0, DiskANN supports CONCURRENTLY:

CREATE INDEX CONCURRENTLY idx_content_diskann
ON articles USING diskann (content_vector)
WITH (storage_layout = memory_optimized);

This is critical for production: you can build the index without blocking writes to the table. HNSW and IVFFlat also support CREATE INDEX CONCURRENTLY.

Why These Dimension Limits Exist: Buffer Pages and Bits

If the 2,000 / 4,000 / 16,000 dimension limits seem arbitrary, they’re not. The intuition comes from how PostgreSQL stores data: in 8 KB buffer pages (8,192 bytes). Every index tuple — including the vector representation in a vector index — has to fit within a page. The fewer bytes per dimension, the more dimensions you can pack.

Here’s the back-of-the-envelope arithmetic:

Encoding	Bytes per dimension	Theoretical max in 8 KB
`vector` (float32)	4 bytes	8,192 / 4 = 2,048
`halfvec` (float16)	2 bytes	8,192 / 2 = 4,096
4-bit quantized (PQ)	0.5 bytes	8,192 * 2 = 16,384
1-bit binary (SBQ)	0.125 bytes	8,192 * 8 = 65,536

The theoretical numbers explain the intuition — why halfvec doubles the limit and quantized encodings push it further. The actual limits are slightly lower because of page headers, tuple overhead, and the index metadata stored alongside each vector. For HNSW, each page also stores neighbor connection lists (up to m * 2 neighbor IDs per node); for IVFFlat, each page carries centroid references and list pointers. These eat into the available space, which is why pgvector’s HNSW sets 2,000 (not 2,048) and pgvectorscale’s DiskANN sets 16,000 (not 16,384). But the pattern is unmistakable: the fewer bits per dimension, the more dimensions you can fit in a page.
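The back-of-the-envelope numbers can be reproduced in a few lines of Python. This is only the theoretical upper bound, assuming a whole 8 KB page holds one vector and nothing else; the real limits are lower for the overhead reasons given above.

```python
PAGE_BYTES = 8192  # default PostgreSQL block size

# Bits per dimension for each encoding discussed in the table
ENCODINGS = {
    "vector (float32)": 32,
    "halfvec (float16)": 16,
    "4-bit quantized (PQ)": 4,
    "1-bit binary (SBQ)": 1,
}


def max_dims_per_page(bits_per_dim, page_bytes=PAGE_BYTES):
    """Upper bound on dimensions if the page held one vector and no overhead."""
    return page_bytes * 8 // bits_per_dim


limits = {name: max_dims_per_page(bits) for name, bits in ENCODINGS.items()}
for name, dims in limits.items():
    print(f"{name:22s} {dims:>6d}")
```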

This is why DiskANN can handle 16,000 dimensions where HNSW on vector tops out at 2,000 — DiskANN stores a compressed representation in the index page, not the full vector. And why halfvec doubled the HNSW/IVFFlat limit from 2,000 to 4,000: half the bytes per dimension, twice the capacity.
This is more than enough for the vast majority of use cases. Most embedding models in production today default to 768–1536 dimensions, well within the 2,000-dimension limit. It also shows how future-proof the current vector store implementation on PostgreSQL is.

How pgvectorscale Compresses: Statistical Binary Quantization (SBQ)

pgvectorscale’s DiskANN from Timescale (now Tiger Data) uses a method called Statistical Binary Quantization (SBQ). The idea is deceptively simple: for each dimension, replace the float value with a 1 or 2-bit code.

1-bit mode (default for dimensions >= 900): Each dimension is compressed to a single bit. But not naively — standard binary quantization uses 0.0 as the threshold (positive = 1, negative = 0), which works poorly because real embedding distributions are rarely centered on zero. SBQ instead computes the per-dimension mean across all vectors during index build and uses that as the threshold:

if value > mean_of_this_dimension → 1
else → 0

A 3,072-dimension float32 vector (12,288 bytes) becomes a 3,072-bit string (384 bytes). That’s 32x compression. During search, the query vector is also SBQ-encoded, and distances are computed using XOR + popcount on the bit strings, two operations that modern CPUs provide as single instructions on machine words.

2-bit mode (default for dimensions < 900): Each dimension gets two bits, encoding four “zones” based on the z-score (how many standard deviations from the mean). This gives finer granularity at 16x compression instead of 32x.

The accuracy loss is real but small. On common benchmarks, SBQ achieves 96-99% recall compared to exact search. The rescore step (controlled by diskann.query_rescore, default 50) compensates: after the graph traversal finds the top-50 candidates using quantized distances, pgvectorscale fetches the full-precision vectors from the heap and re-computes exact distances to produce the final top-10.
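As an illustration of the 1-bit mode, here is a toy Python sketch of mean-threshold encoding followed by an XOR + popcount distance. It is a conceptual model with made-up 4-dimensional vectors, not pgvectorscale's actual implementation, and the function names are hypothetical. The second encoding uses a 0.0 threshold to show why naive binary quantization struggles when values are not centered on zero.

```python
def dimension_means(vectors):
    """Per-dimension mean across all vectors (computed once at build time)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def sbq_encode(vector, thresholds):
    """One bit per dimension: 1 if the value is above that dimension's threshold."""
    code = 0
    for value, threshold in zip(vector, thresholds):
        code = (code << 1) | (1 if value > threshold else 0)
    return code


def hamming(code_a, code_b):
    """XOR marks the differing bits, popcount counts them."""
    return bin(code_a ^ code_b).count("1")


# Toy vectors: every value is positive, so a 0.0 threshold cannot separate them.
vectors = [
    [0.9, 0.1, 0.45, 0.2],
    [0.8, 0.2, 0.40, 0.3],
    [0.1, 0.9, 0.70, 0.8],
]
means = dimension_means(vectors)
codes = [sbq_encode(v, means) for v in vectors]
zero_codes = [sbq_encode(v, [0.0] * 4) for v in vectors]

print([format(c, "04b") for c in codes])       # mean threshold
print([format(c, "04b") for c in zero_codes])  # naive 0.0 threshold
print(hamming(codes[0], codes[1]), hamming(codes[0], codes[2]))
```

With the mean threshold, the two similar vectors share a code and the dissimilar one differs in every bit. With the zero threshold, all three collapse to the same code and the coarse distance carries no signal.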

What’s stored where:

Location	What’s stored	Accessed when
Index pages	SBQ-compressed vectors + graph edges	Every query (graph traversal)
Heap (table)	Full float32 vectors	Only during rescore (top-N candidates)

This two-tier architecture is why DiskANN achieves 21 MB index size: the index only stores 384-byte compressed vectors, not 12 KB originals. The full vectors stay in the table, fetched only for the final re-ranking of a few dozen candidates.

Microsoft’s pg_diskann: A Different Approach

There’s a second DiskANN implementation for PostgreSQL: Microsoft’s pg_diskann, currently documented and distributed for Azure Database for PostgreSQL Flexible Server. It uses the same Vamana graph algorithm but a fundamentally different compression strategy: Product Quantization (PQ).

Where SBQ asks “is this dimension above or below the mean?”, Product Quantization asks “which of 16 codewords best represents this group of dimensions?”

Here’s how PQ works, step by step:

Divide the vector into chunks. A 3,072-dimension vector is split into, say, 1,024 chunks of 3 dimensions each.
Train a codebook per chunk. For each chunk, k-means clustering finds 16 representative codewords (centroids). Why 16? Because 16 values fit in 4 bits — a single hex digit.
Encode each chunk as a 4-bit symbol. During index build, each chunk of 3 dimensions is replaced by the index (0-15) of its closest codeword. The entire 3,072-dimension vector becomes 1,024 symbols of 4 bits each = 512 bytes.
Decode via lookup table at query time. To compute the distance to a query vector, you pre-compute the distance from the query to all 16 codewords for each chunk, creating a 16-row x 1,024-column lookup table. Then for each stored vector, you sum up the table entries corresponding to its symbols. No floating-point multiplication needed — just table lookups and additions.

The compression is dramatic: 3,072 dimensions * 4 bytes = 12,288 bytes → 512 bytes with PQ. That’s 24x compression, in the same ballpark as SBQ’s 32x.
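The four steps can be sketched in plain Python. This is a toy model under stated assumptions, not pg_diskann's code: the codebooks are drawn at random instead of being trained with k-means, the vectors have 12 dimensions instead of 3,072, and the lookup table is indexed as table[chunk][codeword] (the transpose of the 16 x 1,024 layout described above).

```python
import random

CHUNK_DIMS = 3   # dimensions per chunk, as in the example above
CODEWORDS = 16   # 16 codewords, so each symbol fits in 4 bits


def split_chunks(vector):
    return [vector[i:i + CHUNK_DIMS] for i in range(0, len(vector), CHUNK_DIMS)]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vector, codebooks):
    """Replace each chunk by the index (0-15) of its closest codeword."""
    return [
        min(range(CODEWORDS), key=lambda k: squared_l2(chunk, book[k]))
        for chunk, book in zip(split_chunks(vector), codebooks)
    ]


def build_lookup_table(query, codebooks):
    """table[c][k] = squared distance from query chunk c to codeword k."""
    return [
        [squared_l2(chunk, codeword) for codeword in book]
        for chunk, book in zip(split_chunks(query), codebooks)
    ]


def pq_distance(symbols, table):
    """Approximate squared L2 distance: one table lookup per chunk, then a sum."""
    return sum(table[c][k] for c, k in enumerate(symbols))


rng = random.Random(7)
dims = 12  # toy size
num_chunks = dims // CHUNK_DIMS
# Random codebooks stand in for the k-means training step.
codebooks = [
    [[rng.uniform(-1, 1) for _ in range(CHUNK_DIMS)] for _ in range(CODEWORDS)]
    for _ in range(num_chunks)
]
stored = [rng.uniform(-1, 1) for _ in range(dims)]
query = [rng.uniform(-1, 1) for _ in range(dims)]

symbols = pq_encode(stored, codebooks)
table = build_lookup_table(query, codebooks)
approx = pq_distance(symbols, table)
print(symbols, round(approx, 4), round(squared_l2(query, stored), 4))

# Size arithmetic for the 3,072-dimension case from the text
pq_bytes = (3072 // CHUNK_DIMS) * 4 // 8
print(pq_bytes, 3072 * 4 // pq_bytes)  # 512 bytes, 24x smaller than float32
```

The lookup table is built once per query, so the per-vector cost is one table read and one addition per chunk, whatever the original dimension count.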

Comparing the Two DiskANN Implementations

	pgvectorscale (Timescale)	pg_diskann (Microsoft)
Compression	SBQ (1-2 bits/dim)	Product Quantization (4 bits/chunk)
Compression ratio	32x (1-bit) or 16x (2-bit)	~24x (depends on chunks)
How it works	Per-dimension thresholding	Codebook lookup per chunk
Distance computation	XOR + popcount (very fast)	Table lookup + sum
Trainable	Minimal (just means + stddev)	Heavy (k-means per chunk)
Max dimensions	16,000	16,000
Availability	Open source, self-hosted	Azure Database for PostgreSQL
License	PostgreSQL License	Distributed via Azure
Iterative scan	No	Yes (relaxed/strict, ON by default)
PG version	PG 14-18 (self-hosted)	Azure DB for PostgreSQL

Both implementations share the core DiskANN algorithm (Vamana graph) and the two-phase search pattern (compressed scan + full-precision rescore). The difference is how they compress:

SBQ is simpler and faster to build (just compute means). It’s a blunt instrument — 1 bit per dimension loses a lot of information, but XOR + popcount is blazingly fast, and the rescore step recovers accuracy.
PQ is more sophisticated and retains more information per bit (a 4-bit symbol captures relationships between groups of dimensions). It’s slower to build (k-means training) but can achieve better recall at the same compression ratio, especially for vectors with correlated dimensions.

Takeaway: If you’re self-hosting PostgreSQL, pgvectorscale is your DiskANN option — open source, well-maintained, and the SBQ compression is effective. If you’re on Azure Database for PostgreSQL, you also have pg_diskann, whose PQ compression may give better recall on very high-dimensional data. The underlying algorithm is the same; the compression strategy is the difference.

Iterative Scans

This is the most important query-time feature added to pgvector since HNSW support. If you take one thing from this post, let it be this section.

The Problem

Vector search indexes return the K nearest neighbors, then PostgreSQL applies your WHERE clause. If your filter is selective, you get fewer results than requested.

-- Our dataset has 9 categories with varying selectivity
  category   | cnt   | % of 25K
-------------+-------+---------
 History     | 8719  | 34.9%
 General     | 6221  | 24.9%
 Science     | 2232  |  8.9%
 Mathematics |  116  |  0.5%

When you search for the 10 nearest “Science” articles, the HNSW index returns its ef_search nearest neighbors (say, 40), PostgreSQL filters to keep only Science articles, and you get however many matched. For Science (8.9%), you’ll usually get your 10 results. For Mathematics (0.5%), you won’t.

Let’s prove it. Here’s a search for Mathematics articles with ef_search=40, forcing the HNSW index path. (We disable sequential scan here to force the index path on our small 25K dataset. On production tables with 100K+ rows, the optimizer would naturally choose the index without this hint.)

SET enable_seqscan = off;
SET hnsw.iterative_scan = 'off';
SET hnsw.ef_search = 40;

SELECT id, title, category,
       content_halfvec <=> (SELECT content_halfvec FROM articles WHERE id = 1) AS distance
FROM articles
WHERE category = 'Mathematics'
ORDER BY content_halfvec <=> (SELECT content_halfvec FROM articles WHERE id = 1)
LIMIT 10;

Index Scan using idx_content_hnsw_halfvec on articles
  Order By: (content_halfvec <=> (InitPlan 1).col1)
  Filter: (category = 'Mathematics'::text)
  Rows Removed by Filter: 40
  Index Searches: 1
Execution Time: 0.766 ms
(0 rows)

Zero rows returned. The index fetched 40 candidates, all 40 were filtered out (none were Mathematics), and the query silently returned an empty result set. This is the core problem: your application asked for 10 results and got 0.

Before iterative scans, you had two bad options:

Over-fetch (LIMIT 1000) and hope enough rows match — wasteful and unreliable
Sequential scan — correct but slow on large tables

The Solution: Iterative Index Scans

pgvector 0.8.0 introduced iterative scans. Instead of fetching a fixed batch and filtering, the index keeps fetching more candidates until the filter is satisfied.

SET enable_seqscan = off;
SET hnsw.iterative_scan = 'relaxed_order';
SET hnsw.ef_search = 40;

SELECT id, title, category,
       content_halfvec <=> (SELECT content_halfvec FROM articles WHERE id = 1) AS distance
FROM articles
WHERE category = 'Mathematics'
ORDER BY content_halfvec <=> (SELECT content_halfvec FROM articles WHERE id = 1)
LIMIT 10;

Index Scan using idx_content_hnsw_halfvec on articles
  Order By: (content_halfvec <=> (InitPlan 1).col1)
  Filter: (category = 'Mathematics'::text)
  Rows Removed by Filter: 6809
  Index Searches: 171
Execution Time: 4235.805 ms
(10 rows)

10 rows returned. The index scanned 171 times, examined 6,819 candidates, filtered out 6,809 non-Mathematics rows, and delivered exactly 10 results. It took 4.2 seconds — much slower than the 0.8 ms empty result — but you got a correct answer instead of silence.

At 6,819 tuples scanned, this was well within the default max_scan_tuples of 20,000. On a table with millions of rows and the same 0.5% selectivity, the scan might hit the 20,000 limit before finding 10 matches — you’d get a partial result set. That’s the trade-off the safety valve makes: bounded latency vs guaranteed result count.

That 4.2-second cost reflects the extreme selectivity: Mathematics is only 0.5% of the data, so the index had to traverse deep into the graph. For moderate selectivity like Science (8.9%), the overhead is negligible:

SET hnsw.iterative_scan = 'relaxed_order';
SET hnsw.ef_search = 40;

SELECT ... WHERE category = 'Science' ORDER BY ... LIMIT 10;

Index Scan using idx_content_hnsw_halfvec on articles
  Order By: (content_halfvec <=> (InitPlan 1).col1)
  Filter: (category = 'Science'::text)
  Rows Removed by Filter: 26
  Index Searches: 1
Execution Time: 1.058 ms

Same feature, but for Science the index found 10 matching rows in a single search pass — no extra work needed.

Two Modes

relaxed_order: Results are approximately ordered by distance. Slightly faster. Good enough for most use cases.
strict_order: Results are exactly ordered by distance. Slightly slower. Use when ranking precision matters.

SET hnsw.iterative_scan = 'strict_order';
-- Execution Time: 0.885 ms

The Safety Valve: max_scan_tuples

To prevent runaway scans on extremely selective filters (imagine filtering for a category that has 1 row in 10 million), there’s a safety limit:

SET hnsw.max_scan_tuples = 500;   -- Restrictive: stop after 500 index tuples
SET hnsw.max_scan_tuples = 20000; -- Default: generous enough for most workloads
SET hnsw.max_scan_tuples = 0;     -- Unlimited (use with caution)
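To make the interplay between batch size, filtering, and the safety valve concrete, here is a simplified Python simulation. It is a mental model only: the candidate stream is synthetic, the function and its parameters are hypothetical, and pgvector's real scan logic differs in detail (for instance in how it decides when to stop).

```python
def filtered_knn(candidates, wanted, limit, batch_size, iterative,
                 max_scan_tuples=20000):
    """Post-filter a candidate stream that is already ordered by distance.

    Returns (matches, tuples_scanned).
    """
    matches, scanned = [], 0
    while scanned < len(candidates) and len(matches) < limit:
        if scanned >= max_scan_tuples:
            break  # safety valve: give up with a partial result
        batch = candidates[scanned:scanned + batch_size]
        scanned += len(batch)
        matches.extend(row for row in batch if row[1] == wanted)
        if not iterative:
            break  # classic behaviour: one batch, filter, stop
    return matches[:limit], scanned


# Synthetic stream: 1 row in 200 is "Mathematics" (0.5% selectivity).
stream = [(i, "Mathematics" if i % 200 == 199 else "Other")
          for i in range(25_000)]

one_shot = filtered_knn(stream, "Mathematics", 10, batch_size=40, iterative=False)
iter_result = filtered_knn(stream, "Mathematics", 10, batch_size=40, iterative=True)
capped = filtered_knn(stream, "Mathematics", 10, batch_size=40, iterative=True,
                      max_scan_tuples=400)

print(len(one_shot[0]), one_shot[1])        # single batch, then filter
print(len(iter_result[0]), iter_result[1])  # keeps fetching until 10 matches
print(len(capped[0]), capped[1])            # stopped early by the cap
```

The three calls correspond to the three outcomes described in this section: an empty result without iterative scans, a full result at the cost of scanning many more tuples, and a partial result when the cap is reached first.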

IVFFlat Too

Same concept, different GUC prefix:

SET ivfflat.iterative_scan = 'relaxed_order';
SET ivfflat.probes = 3;

SELECT ... WHERE category = 'Science' ORDER BY ... LIMIT 10;
-- Execution Time: 2.001 ms

Filtered Results in Action

With iterative scan, every result matches the filter:

SET hnsw.iterative_scan = 'relaxed_order';
SET hnsw.ef_search = 100;

SELECT id, title, category,
       round((content_halfvec <=> (SELECT content_halfvec FROM articles WHERE id = 1))::numeric, 4) AS distance
FROM articles
WHERE category = 'Science'
ORDER BY content_halfvec <=> (SELECT content_halfvec FROM articles WHERE id = 1)
LIMIT 10;

  id   |  title   | category | distance
-------+----------+----------+----------
     1 | April    | Science  |   0.0000
  7862 | April 23 | Science  |   0.2953
  9878 | April 25 | Science  |   0.3076
   469 | May      | Science  |   0.3082
  9880 | April 24 | Science  |   0.3451
   402 | July     | Science  |   0.3453
  5156 | April 4  | Science  |   0.3531
  9530 | April 7  | Science  |   0.3588
 34906 | 2013     | Science  |   0.3674
  9149 | April 8  | Science  |   0.3690

All 10 results are Science. All sorted by cosine distance. No over-fetching, no sequential scan. (The titles are Wikipedia date articles — “April”, “May”, etc. — that happen to be classified under Science in this dataset.)

Multi-Filter Combinations

Iterative scans work with compound WHERE clauses too:

SET hnsw.iterative_scan = 'relaxed_order';
SET hnsw.ef_search = 100;

SELECT id, title, category, word_count
FROM articles
WHERE category = 'Science' AND word_count > 1000
ORDER BY content_halfvec <=> (SELECT content_halfvec FROM articles WHERE id = 1)
LIMIT 10;

Limit  (actual time=7.098..7.105 rows=10.00 loops=1)
  ->  Sort  (actual time=7.097..7.101 rows=10.00 loops=1)
        Sort Key: ((content_halfvec <=> (InitPlan 1).col1))
        Sort Method: top-N heapsort  Memory: 25kB
        ->  Bitmap Heap Scan on articles  (actual time=0.690..7.027 rows=444.00 loops=1)
              Recheck Cond: ((word_count > 1000) AND (category = 'Science'::text))
              ->  BitmapAnd  (actual time=0.641..0.643 rows=0.00 loops=1)
                    ->  Bitmap Index Scan on idx_articles_word_count  (rows=1806.00)
                    ->  Bitmap Index Scan on idx_articles_category    (rows=2232.00)
Execution Time: 7.542 ms

Here the PostgreSQL optimizer did something clever: it combined two B-tree indexes (BitmapAnd) instead of using the HNSW iterative scan. On a 25K-row table that fits in memory, this is often cheaper. At scale (millions of rows), the iterative scan path wins because the bitmap approach requires reading too many heap pages.

Takeaway: Iterative scans are the answer to “vector search + WHERE doesn’t work.” Enable them with SET LOCAL hnsw.iterative_scan = 'relaxed_order' inside transactions. The safety valve max_scan_tuples = 20000 is a sensible default.

Quantization and Storage

With 3,072-dimension embeddings, storage is the elephant in the room. Here’s what each representation costs per row:

SELECT
  pg_size_pretty(avg(pg_column_size(content_vector)))  AS vector_size,
  pg_size_pretty(avg(pg_column_size(content_halfvec))) AS halfvec_size,
  pg_size_pretty(avg(pg_column_size(content_bq)))      AS binary_size
FROM articles WHERE content_vector IS NOT NULL;

 vector_size | halfvec_size | binary_size
-------------+--------------+-------------
 12 kB       | 6148 bytes   | 392 bytes

Type	Bytes per dimension	Per-row (3072d)	Savings
`vector(3072)`	4 bytes (float32)	12,288 bytes	baseline
`halfvec(3072)`	2 bytes (float16)	6,144 bytes	50%
`bit(3072)`	1/8 byte (1 bit)	384 bytes	97%

At 1 million rows:

Type	Column storage	HNSW index
`vector(3072)`	~12 GB	N/A (2000-dim limit)
`halfvec(3072)`	~6 GB	~7.7 GB
`bit(3072)`	~0.4 GB	~0.8 GB (bit_hamming_ops)

The storage math is brutal for high-dimensional embeddings. This is why DiskANN’s built-in SBQ compression matters: it gets you to 21 MB where HNSW on halfvec costs 193 MB.

Binary Quantize + Re-Ranking

Binary quantization crushes each dimension to a single bit (positive = 1, negative = 0). It’s lossy, but very fast for coarse filtering. The pattern is a two-phase search:

-- Phase 1: Hamming distance on binary (fast, coarse) → 100 candidates
-- Phase 2: Cosine distance on full vector (precise) → 10 results

WITH coarse AS (
    SELECT id, title, content_vector
    FROM articles
    WHERE content_bq IS NOT NULL
    ORDER BY content_bq <~> (
        SELECT binary_quantize(content_vector)::bit(3072)
        FROM articles WHERE id = 1
    )
    LIMIT 100
)
SELECT id, title,
       content_vector <=> (SELECT content_vector FROM articles WHERE id = 1) AS distance
FROM coarse
ORDER BY content_vector <=> (SELECT content_vector FROM articles WHERE id = 1)
LIMIT 10;

Limit  (actual time=29.021..29.026 rows=10.00 loops=1)
  ->  Sort on coarse  (actual time=29.020..29.023 rows=10.00 loops=1)
        ->  Subquery Scan  (actual time=27.805..28.999 rows=100.00 loops=1)
              ->  Sort by content_bq <~>  (actual time=27.744..27.754 rows=100.00 loops=1)
                    ->  Seq Scan on articles  (actual time=0.435..24.288 rows=24700.00 loops=1)
Execution Time: 29.127 ms

That is 29 ms without an index on content_bq, and the plan shows that about 24 ms of it is the sequential scan. With an HNSW index using bit_hamming_ops, Phase 1 would likely be sub-millisecond. The re-ranking in Phase 2 only touches 100 full vectors instead of 25,000.
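To make the two phases concrete outside the database, here is a minimal pure-Python sketch of the same idea on four toy vectors. The function names mirror the SQL (binary_quantize, Hamming, cosine), but this is an illustration of the technique, not pgvector's implementation, and the sample rows are made up:

```python
import math


def binary_quantize(vec):
    """One bit per dimension: 1 if the value is positive, else 0."""
    return [1 if x > 0 else 0 for x in vec]


def hamming(a, b):
    """Number of positions where two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def two_phase_search(query, rows, coarse_k=100, final_k=10):
    """rows is a list of (id, vector) pairs; returns [(id, cosine_distance)]."""
    q_bits = binary_quantize(query)
    # Phase 1: cheap Hamming ranking on the quantized vectors.
    coarse = sorted(rows, key=lambda r: hamming(binary_quantize(r[1]), q_bits))[:coarse_k]
    # Phase 2: exact cosine distance, computed on the surviving candidates only.
    reranked = sorted(coarse, key=lambda r: cosine_distance(r[1], query))
    return [(rid, cosine_distance(vec, query)) for rid, vec in reranked[:final_k]]


# Hypothetical toy data: four 4-dimensional "embeddings".
rows = [
    (1, [0.9, 0.1, -0.2, 0.4]),
    (2, [0.8, 0.2, -0.1, 0.5]),
    (3, [-0.7, -0.3, 0.6, -0.1]),
    (4, [0.1, 0.9, -0.8, 0.2]),
]
top = two_phase_search(rows[0][1], rows, coarse_k=3, final_k=2)
print(top)
```

Rows 2 and 4 have the same bit pattern as the query, so Phase 1 cannot tell them apart; only the exact distances in Phase 2 separate them. That is the reason for over-fetching candidates (100 for a final top 10 in the SQL above).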

DBA Takeaway: Use halfvec as your default for 3072-dimension embeddings. Use binary quantize + re-ranking when you need to search billions of rows and can tolerate a two-phase approach. DiskANN’s SBQ gives you similar compression automatically.

Operators and Sargability

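pgvector provides four distance operators for vector and halfvec columns (bit columns use <~> for Hamming distance, as in the re-ranking query above, and <%> for Jaccard distance):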

Operator	Distance	Operator Class	Use When
`<=>`	Cosine	`vector_cosine_ops`	Normalized embeddings (most common)
`<->`	L2 (Euclidean)	`vector_l2_ops`	Absolute distance matters
`<#>`	Inner Product (negative)	`vector_ip_ops`	Pre-normalized, slight speed edge
`<+>`	L1 (Manhattan)	`vector_l1_ops`	Sparse-like behavior on dense vectors

Wrong Operator = No Index

This is the single most common mistake. If your index uses halfvec_cosine_ops but your query uses <-> (L2), the index cannot be used:

-- CORRECT: cosine operator on cosine index → Index Scan
EXPLAIN (COSTS OFF)
SELECT ... ORDER BY content_halfvec <=> (...) LIMIT 10;

 Limit
   ->  Index Scan using idx_content_hnsw_halfvec on articles
         Order By: (content_halfvec <=> (InitPlan 1).col1)

-- WRONG: L2 operator on cosine index → Seq Scan!
EXPLAIN (COSTS OFF)
SELECT ... ORDER BY content_halfvec <-> (...) LIMIT 10;

 Limit
   ->  Sort
         Sort Key: ((articles.content_halfvec <-> (InitPlan 1).col1))
         ->  Seq Scan on articles

The planner can’t use a cosine-distance index for an L2-distance query. They’re different metrics with different orderings. If you see an unexpected Seq Scan on a vector query, check your operator first.
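A small numeric illustration of "different orderings", as a sketch with two made-up 2-dimensional vectors. The document names are hypothetical:

```python
import math


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


query = [1.0, 0.0]
docs = {
    "same_direction_far": [10.0, 0.0],  # identical direction, large magnitude
    "close_but_tilted": [1.0, 0.5],     # nearby point, different direction
}

by_cosine = sorted(docs, key=lambda k: cosine_distance(docs[k], query))
by_l2 = sorted(docs, key=lambda k: l2_distance(docs[k], query))
print("cosine order:", by_cosine)
print("L2 order:    ", by_l2)
```

Cosine ranks the far vector first because its direction matches exactly, while L2 ranks the nearby point first. On unit-normalized vectors the two orderings coincide, but the planner matches operators to operator classes, not to properties of your data, so the index still cannot be used.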

Sargable Queries: The Cross-Join Trap

This is the pattern I see most often in the wild, and it’s wrong:

-- BAD: cross-join prevents index use
SELECT a.id, a.title,
       a.content_halfvec <=> b.content_halfvec AS distance
FROM articles a, articles b
WHERE b.id = 1
ORDER BY a.content_halfvec <=> b.content_halfvec
LIMIT 10;

Limit
  ->  Sort
        Sort Key: ((a.content_halfvec <=> b.content_halfvec))
        ->  Nested Loop
              ->  Index Scan using articles_pkey on articles b
                    Index Cond: (id = 1)
              ->  Seq Scan on articles a     ← NO INDEX!

The planner sees a.content_halfvec <=> b.content_halfvec as a join condition, not an index-scan ordering. It can’t push the ORDER BY into the vector index because the right-hand side comes from a different table reference.

The fix is a scalar subquery:

-- GOOD: scalar subquery → index used
SELECT id, title,
       content_halfvec <=> (
           SELECT content_halfvec FROM articles WHERE id = 1
       ) AS distance
FROM articles
ORDER BY content_halfvec <=> (
    SELECT content_halfvec FROM articles WHERE id = 1
)
LIMIT 10;

Limit
  InitPlan 1
    ->  Index Scan using articles_pkey on articles articles_1
          Index Cond: (id = 1)
  ->  Index Scan using idx_content_hnsw_halfvec on articles
        Order By: (content_halfvec <=> (InitPlan 1).col1)

The scalar subquery is evaluated once (InitPlan), then the result is treated as a constant. Now the planner can push the ORDER BY into the HNSW index scan. Same results, but with the index instead of a sequential scan.

Takeaway: For this nearest-neighbor ORDER BY pattern, always use scalar subqueries for vector distance calculations, never cross-joins. This is the vector-search equivalent of writing sargable WHERE clauses for B-tree indexes.
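From application code, the query vector usually arrives as a bind parameter rather than from a subquery. A parameter is also a single value fixed before the scan starts, so it should leave the ORDER BY just as index-friendly as the InitPlan above. A minimal sketch, assuming a driver with %s placeholders (psycopg style); the helper names are hypothetical and nothing is executed against a database here:

```python
def to_vector_literal(values):
    """Format a Python sequence in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def knn_query(query_vector, k=10):
    """Return (sql, params) for a top-k cosine search on articles.content_halfvec."""
    dims = len(query_vector)
    literal = to_vector_literal(query_vector)
    sql = (
        "SELECT id, title, "
        f"content_halfvec <=> %s::halfvec({dims}) AS distance "
        "FROM articles "
        f"ORDER BY content_halfvec <=> %s::halfvec({dims}) "
        f"LIMIT {int(k)}"
    )
    # The same literal is bound twice: once for the output column, once for ORDER BY.
    return sql, (literal, literal)


sql, params = knn_query([0.1, 0.2, 0.3], k=5)
print(sql)
print(params)
```

With a real connection this would be passed as cursor.execute(sql, params). Check the resulting plan with EXPLAIN, as for any other query shape.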

Partial Indexes

If you frequently filter by a specific category, a partial index is dramatically more efficient:

CREATE INDEX idx_content_hnsw_science
ON articles USING hnsw (content_halfvec halfvec_cosine_ops)
WITH (m = 16, ef_construction = 64)
WHERE category = 'Science';

Time: 1420.207 ms (00:01.420)

        indexname         |  size
--------------------------+--------
 idx_content_hnsw_halfvec | 193 MB   ← full table (25,000 rows)
 idx_content_hnsw_science |  17 MB   ← Science only (2,232 rows)

11x smaller, 20x faster to build. If your application queries consistently filter by a known set of categories, partial indexes are the single biggest optimization available. Build one per high-value category.
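"One per high-value category" quickly turns into repetitive DDL. A minimal sketch that generates the statements, reusing the index definition shown above; the helper name and every category except Science are hypothetical:

```python
import re


def partial_index_ddl(category, m=16, ef_construction=64):
    """Build a CREATE INDEX statement for one category value."""
    # Index name suffix: lowercase letters, digits and underscores only.
    slug = re.sub(r"[^a-z0-9]+", "_", category.lower()).strip("_")
    # Double single quotes so the category is a valid SQL string literal.
    literal = category.replace("'", "''")
    return (
        f"CREATE INDEX idx_content_hnsw_{slug}\n"
        "ON articles USING hnsw (content_halfvec halfvec_cosine_ops)\n"
        f"WITH (m = {int(m)}, ef_construction = {int(ef_construction)})\n"
        f"WHERE category = '{literal}';"
    )


for cat in ["Science", "History", "Arts & Culture"]:
    print(partial_index_ddl(cat), end="\n\n")
```

The planner only considers a partial index when the query's WHERE clause implies the index predicate, so application queries need to filter with the same category = '...' condition.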

Monitoring

Index Inventory

SELECT indexname,
       pg_size_pretty(pg_relation_size(indexname::regclass)) AS size,
       indexdef
FROM pg_indexes
WHERE tablename = 'articles'
ORDER BY pg_relation_size(indexname::regclass) DESC;

        indexname        |  size   | indexdef
-------------------------+---------+-------------------------------------------
 idx_content_hnsw_halfvec| 193 MB  | USING hnsw ... WITH (m='16', ef_construction='64')
 idx_content_ivfflat     | 193 MB  | USING ivfflat ... WITH (lists='25')
 idx_content_diskann     | 21 MB   | USING diskann ... WITH (storage_layout=memory_optimized)
 articles_pkey           | 1384 kB | USING btree (id)
 idx_articles_category   | 1192 kB | USING btree (category)
 idx_articles_word_count | 904 kB  | USING btree (word_count)

Adding up the three vector indexes (193 + 193 + 21 = 407 MB) plus B-tree indexes (3 MB), the total index footprint is over 410 MB for a 90 MB table. The indexes are ~4.5x the data. This is typical for high-dimensional vector data — plan your storage accordingly.

Note: In practice you’d pick one vector index, not all three. With just HNSW + B-tree indexes, the ratio drops to ~2x.

Settings

Know what settings exist and what they default to:

SELECT name, setting, short_desc
FROM pg_settings
WHERE name ~ '^(hnsw|ivfflat|diskann)\.'
ORDER BY name;

                    name                    | setting | short_desc
--------------------------------------------+---------+-------------------------------------------
 diskann.query_rescore                      | 50      | Rescore candidates
 diskann.query_search_list_size             | 100     | Search list size for queries
 hnsw.ef_search                             | 40      | Dynamic candidate list size for search
 hnsw.iterative_scan                        | off     | Mode for iterative scans
 hnsw.max_scan_tuples                       | 20000   | Max tuples to visit for iterative scans
 hnsw.scan_mem_multiplier                   | 1       | Multiple of work_mem for iterative scans
 ivfflat.iterative_scan                     | off     | Mode for iterative scans
 ivfflat.max_probes                         | 32768   | Max probes for iterative scans
 ivfflat.probes                             | 1       | Number of probes

Takeaway: hnsw.iterative_scan and ivfflat.iterative_scan default to off. If your application relies on vector search with WHERE clauses, you need to explicitly enable iterative scans.

Access Method Capabilities

Not sure which operator class works with which index type? Query the catalog:

SELECT am.amname AS access_method,
       opc.opcname AS operator_class,
       t.typname AS data_type
FROM pg_opclass opc
JOIN pg_am am ON am.oid = opc.opcmethod
JOIN pg_type t ON t.oid = opc.opcintype
WHERE am.amname IN ('hnsw', 'ivfflat', 'diskann')
ORDER BY am.amname, opc.opcname;

Access Method	Data Types	Operator Classes
HNSW	vector, halfvec, bit, sparsevec	18 classes (broadest support)
IVFFlat	vector, halfvec, bit	7 classes
DiskANN	vector only (+ label filtering)	4 classes

HNSW is the most versatile. DiskANN is the most constrained. IVFFlat falls in between.

Build Progress Monitoring

While building a large index:

SELECT phase,
       round(100.0 * blocks_done / nullif(blocks_total, 0), 1) AS pct_done,
       tuples_done, tuples_total
FROM pg_stat_progress_create_index;

This works for HNSW and IVFFlat builds (pgvector reports progress). DiskANN builds from pgvectorscale don’t currently report to this view.

Decision Guidelines

Here’s how to choose:

Start with HNSW unless you have a reason not to. It has the broadest operator support, good recall, and predictable performance. Use halfvec if your dimensions exceed 2,000.

Choose IVFFlat if build time matters more than query time. IVFFlat builds 5-6x faster than HNSW. If your data distribution shifts materially, plan a REINDEX to refresh clustering quality (centroid drift). A practical signal: if recall drops without any change in query patterns, or if you’ve inserted more than ~30% new rows since the last build, the centroids are likely stale. For a deeper look at how embedding model upgrades trigger reindexing at scale, see Embedding Versioning with pgvector.
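The two staleness signals can be turned into a simple check. This is a sketch only: the 30% growth threshold comes from the paragraph above and is measured here against the row count at the last build (an assumption), while the 0.05 recall-drop threshold is a hypothetical placeholder to tune for your workload:

```python
def ivfflat_needs_reindex(rows_at_build, rows_now,
                          recall_at_build=None, recall_now=None,
                          growth_threshold=0.30, recall_drop=0.05):
    """Return a list of reasons to REINDEX; an empty list means no signal."""
    reasons = []
    if rows_at_build > 0:
        growth = (rows_now - rows_at_build) / rows_at_build
        if growth > growth_threshold:
            reasons.append(f"row count grew {growth:.0%} since the last build")
    if recall_at_build is not None and recall_now is not None:
        if recall_at_build - recall_now > recall_drop:
            reasons.append(
                f"recall fell from {recall_at_build:.2f} to {recall_now:.2f}"
            )
    return reasons


print(ivfflat_needs_reindex(25_000, 34_000))
print(ivfflat_needs_reindex(25_000, 26_000, recall_at_build=0.95, recall_now=0.94))
```

Recall has to be measured against an exact (sequential scan) baseline on a fixed query sample, otherwise a drop cannot be attributed to the index.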

Choose DiskANN if storage is the constraint. The 9x compression is decisive at scale. It handles high dimensions natively (no halfvec needed) and supports CONCURRENTLY for production deployments.

Enable iterative scans for vector search + filtering in production. The trade-off is potentially higher latency on very selective filters (the index does more work to find matching rows), but that’s usually the right trade-off for correctness. Tune max_scan_tuples / max_probes to bound worst-case work. Use relaxed_order by default, strict_order when ranking precision matters.

Use partial indexes for category-specific searches. An 11x size reduction and 20x faster build is hard to argue with.

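Use SET LOCAL for all vector parameter changes in production, and only after testing them. SET LOCAL scopes the change to the current transaction, so it cannot leak into other queries on a pooled connection.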
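A minimal sketch of what this looks like from an application: the search query is wrapped in its own transaction, preceded by the SET LOCAL statements. The helper and its whitelist are hypothetical; the setting names and values are the pgvector ones listed in the Settings section:

```python
# Settings this sketch knows about: int means "any integer", a set lists allowed words.
KNOWN_SETTINGS = {
    "hnsw.ef_search": int,
    "hnsw.iterative_scan": {"off", "relaxed_order", "strict_order"},
    "hnsw.max_scan_tuples": int,
    "ivfflat.probes": int,
    "ivfflat.max_probes": int,
}


def render_value(name, value):
    """Validate a setting and return its SQL text form."""
    allowed = KNOWN_SETTINGS.get(name)
    if allowed is None:
        raise ValueError(f"unknown setting: {name}")
    if allowed is int:
        return str(int(value))
    if value not in allowed:
        raise ValueError(f"invalid value for {name}: {value!r}")
    return value


def scoped_statements(query, settings):
    """Wrap `query` in a transaction with one SET LOCAL per setting."""
    statements = ["BEGIN"]
    for name, value in settings.items():
        statements.append(f"SET LOCAL {name} = {render_value(name, value)}")
    statements += [query, "COMMIT"]
    return statements


for stmt in scoped_statements(
    "SELECT id FROM articles ORDER BY content_halfvec <=> $1 LIMIT 10",
    {"hnsw.ef_search": 100, "hnsw.iterative_scan": "relaxed_order"},
):
    print(stmt + ";")
```

Outside a transaction block, SET LOCAL only raises a warning and changes nothing, which is why the BEGIN and COMMIT are part of the pattern.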

Summary

Feature	HNSW	IVFFlat	DiskANN (pgvectorscale)	DiskANN (pg_diskann)
Provider	pgvector	pgvector	Timescale (open source)	Microsoft (Azure DB for PG)
Max dims (vector)	2,000	2,000	16,000	16,000
Max dims (halfvec)	4,000	4,000	N/A	N/A
Compression	Via halfvec	Via halfvec	SBQ (1-2 bits/dim)	PQ (4 bits/chunk)
Build time (25K, 3072d)	29s	5s	49s	N/A
Index size	193 MB	193 MB	21 MB	Similar
Query time	2-6 ms	2-10 ms	3 ms	~3 ms
Key tuning param	ef_search	probes	query_search_list_size	search list / PQ params
Iterative scan	Yes	Yes	No	Yes (ON by default)
CONCURRENTLY	Yes	Yes	Yes (0.9.0+)	Yes
Data types	vector, halfvec, bit, sparsevec	vector, halfvec, bit	vector only	vector only

Dimension limits are largely explained by PostgreSQL’s 8 KB page size and encoding density (exact cutoffs are implementation-defined):

Encoding	Bits/dim	Theoretical max	Actual limit	Who uses it
float32 (`vector`)	32	2,048	2,000	HNSW, IVFFlat
float16 (`halfvec`)	16	4,096	4,000	HNSW, IVFFlat
PQ symbol	4	16,384	16,000	pg_diskann
SBQ binary	1	65,536	16,000	pgvectorscale
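The "Theoretical max" column is just the number of encoded dimensions that fit in one 8 KB page. A short sketch of that arithmetic, assuming the default 8,192-byte block size and ignoring page and tuple headers:

```python
PAGE_BYTES = 8192  # default PostgreSQL block size

# name -> (bits per dimension, documented limit from the table above)
ENCODINGS = {
    "float32 (vector)": (32, 2_000),
    "float16 (halfvec)": (16, 4_000),
    "PQ symbol": (4, 16_000),
    "SBQ binary": (1, 16_000),
}


def theoretical_max_dims(bits_per_dim, page_bytes=PAGE_BYTES):
    """Dimensions that fit in one page if the page held nothing else."""
    return page_bytes * 8 // bits_per_dim


for name, (bits, actual) in ENCODINGS.items():
    print(f"{name:18s} {bits:2d} bits/dim  "
          f"theoretical {theoretical_max_dims(bits):6,d}  actual {actual:6,d}")
```

For float32 and float16 the actual limit sits just below the theoretical one, leaving room for headers. For SBQ the 16,000 cap is far below 65,536, so page size alone does not explain it.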

The compression story is ultimately a story about how many bits of information you need per dimension to navigate the index. pgvector stores full-precision values; DiskANN stores just enough to find the right neighborhood, then goes back to the heap for exact distances.
But here’s the thing: exact results are not necessarily what you want. In a RAG pipeline, the retrieved documents are context for a language model that will synthesize and rephrase an answer — not return rows verbatim. Whether your top-10 results are ranked by exact cosine distance or by a 96%-accurate approximation rarely changes the generated answer. The same is true for recommendations, semantic deduplication, and most classification workflows: the downstream consumer is tolerant of small ranking variations.

The real production trade-off is not precision vs approximation — it’s the balance between retrieval speed, resource efficiency, and result quality at your scale. An HNSW index that doesn’t fit in shared_buffers and hits disk on every query will give you worse effective results than a DiskANN index that stays cached and returns slightly less precise distances in a fraction of the time. The best retrieval is the one that actually runs within your latency budget.

The lab, with all SQL scripts and the Python embedding pipeline, is available here: lab/06_pgvector_indexes.

All benchmarks: PostgreSQL 18, pgvector 0.8.1, pgvectorscale 0.9.0, 25K Wikipedia articles, 3072d text-embedding-3-large embeddings (4 vCPUs, 8 GB RAM).

The post pgvector, a guide for DBA – Part 2: Indexes (update march 2026) first appeared on dbi Blog.

PostgreSQL Anonymizer: Simple Data Masking for DBAs

Joan Frey — Fri, 27 Feb 2026 10:08:50 +0000

Sensitive data (names, emails, phone numbers, personal identifiers…) should not be freely exposed outside production. When you refresh a production database to a test or staging environment, or when analysts need access to real-looking data, anonymization becomes critical.

The PostgreSQL Anonymizer extension (often called anon) is an open‑source extension that lets you mask, fake, shuffle, or generalize data directly inside PostgreSQL, using simple SQL rules. This article explains what it is, how it works, how to install it on PostgreSQL 18.1 running on Red Hat Enterprise Linux 10.1, and how to use it with clear, beginner‑friendly examples.

The target audience is everyone, but especially beginner DBAs who want a practical, command‑line–focused introduction.

I. What is PostgreSQL Anonymizer?

PostgreSQL Anonymizer is an extension that helps protect sensitive data by replacing it with fake or obfuscated values.

Instead of exporting data and anonymizing it with scripts, the rules live inside the database itself. You declare how a column should be anonymized, and PostgreSQL applies that rule automatically.

Typical use cases:

Refreshing production data into test / staging environments
Giving developers or analysts access to realistic but non‑sensitive data
Producing anonymized database dumps for external sharing
Helping comply with GDPR and other privacy regulations

The extension supports three main approaches:

Static masking – permanently replaces data in tables
Dynamic masking – masks data on the fly for specific users
Anonymous dumps – exports an already anonymized dump

II. How does it work?

PostgreSQL Anonymizer uses PostgreSQL’s security labels mechanism. You attach a label to a column that says: “When this data is anonymized, use this function.”

Example:

SECURITY LABEL FOR anon ON COLUMN customer.email IS 'MASKED WITH FUNCTION anon.fake_email()';

Once declared:

Static masking rewrites the table using those rules
Dynamic masking rewrites query results for masked users
Dumps automatically apply the same rules

The rules stay attached to the schema, not to scripts or applications.

III. Installing PostgreSQL Anonymizer on RHEL 10.1 (PostgreSQL 18.1)

In this guide, PostgreSQL 18.1 is already installed following dbi services standard. PostgreSQL binaries are located in:

/u01/app/postgres/product/18/db_1/bin

The PostgreSQL data directory (PGDATA) is:

/u02/pgdata/18/demo-cluster

We will only focus on installing and enabling the PostgreSQL Anonymizer extension.

1. Install the Anonymizer extension

Since we are installing from source, we’ll start by installing the necessary prerequisites. We’ll use cargo to handle the PGRX system requirements. You can find the official documentation here, but keep in mind that I’ve updated the commands to reflect a more recent version.

# Cargo install

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
rustc --version
cargo --version

# postgresql_anonymizer install

cargo install cargo-pgrx --version 0.16.1 --locked
cargo pgrx init --pg18 /u01/app/postgres/product/18/db_1/bin/pg_config
git clone https://gitlab.com/dalibo/postgresql_anonymizer.git
cd postgresql_anonymizer/
make extension PG_CONFIG=/u01/app/postgres/product/18/db_1/bin/pg_config PGVER=pg18
sudo make install PG_CONFIG=/u01/app/postgres/product/18/db_1/bin/pg_config PGVER=pg18

2. Enable Anonymizer in postgresql.conf

Because PostgreSQL Anonymizer hooks into query execution, it must be loaded at session start. Edit the configuration file:

vi /u02/pgdata/18/demo-cluster/postgresql.conf

Add or update:

session_preload_libraries = 'anon'

Restart PostgreSQL:

pg_ctl -D /u02/pgdata/18/demo-cluster restart

3. Install and enable the Anonymizer extension

Connect as the PostgreSQL superuser:

psql -U postgres

Create a database:

CREATE DATABASE anonymizer_demo;

Create and initialize the extension:

postgres=# \c anonymizer_demo
You are now connected to database "anonymizer_demo" as user "postgres".
anonymizer_demo=# CREATE EXTENSION anon CASCADE;
CREATE EXTENSION

anonymizer_demo=# \dx
                                  List of installed extensions
  Name   | Version | Default version |   Schema   |                 Description
---------+---------+-----------------+------------+---------------------------------------------
 anon    | 3.0.0   | 3.0.0           | public     | Anonymization & Data Masking for PostgreSQL
 plpgsql | 1.0     | 1.0             | pg_catalog | PL/pgSQL procedural language
(2 rows)

anonymizer_demo=# SELECT anon.init();
 init
------
 t
(1 row)

anon.init() loads fake data dictionaries (names, companies, cities, etc.) used by anonymization functions.

IV. Demo: anonymizing a simple table

For this demo, I will keep it simple and create everything in the anonymizer_demo database, in the default schema and as the postgres superuser. For real projects, I recommend following best practices and using a dedicated user and schema.

Create sample data:

CREATE TABLE customer (
  id         SERIAL PRIMARY KEY,
  first_name TEXT,
  last_name  TEXT,
  birthdate  DATE,
  email      TEXT,
  company    TEXT
);

INSERT INTO customer (first_name, last_name, birthdate, email, company) VALUES
('Alice', 'Martin', '1987-02-14', 'alice.martin@example.com', 'Acme Corp'),
('Bob',   'Dupont', '1979-11-03', 'bob.dupont@example.com',   'Globex');

anonymizer_demo=# select * from customer;
 id | first_name | last_name | birthdate  |          email           |  company
----+------------+-----------+------------+--------------------------+-----------
  1 | Alice      | Martin    | 1987-02-14 | alice.martin@example.com | Acme Corp
  2 | Bob        | Dupont    | 1979-11-03 | bob.dupont@example.com   | Globex
(2 rows)

1. Static masking (permanent anonymization)

Static masking is a “fire and forget” approach where the original sensitive data is physically overwritten on the disk with faked or scrambled values. This process is destructive. Once the data is masked, the original values are gone forever. This should only be performed on non-production environments (like Staging or Dev) or on a backup copy of your database.

Before applying any rules, you must explicitly enable the static masking engine at both the database and role levels. This acts as a safety switch to prevent accidental data loss.

-- Enable static masking for the current database
anonymizer_demo=# ALTER DATABASE anonymizer_demo SET anon.static_masking = TRUE;
ALTER DATABASE

-- Enable static masking for the postgres role as well
anonymizer_demo=# ALTER ROLE postgres SET anon.static_masking = TRUE;
ALTER ROLE

Next, you define how the data should be transformed. We use SECURITY LABEL to attach masking logic to specific columns. This doesn’t change the data yet; it simply tells the anon extension which functions to use during the anonymization process.

-- Replace names with realistic dummy values
anonymizer_demo=# SECURITY LABEL FOR anon ON COLUMN customer.first_name IS 'MASKED WITH FUNCTION anon.dummy_first_name()';
SECURITY LABEL
anonymizer_demo=# SECURITY LABEL FOR anon ON COLUMN customer.last_name IS 'MASKED WITH FUNCTION anon.dummy_last_name()';
SECURITY LABEL

-- Generate a random birthdate within a fixed date range (1950-01-01 to 2000-12-31)
anonymizer_demo=# SECURITY LABEL FOR anon ON column customer.birthdate IS 'MASKED WITH FUNCTION anon.random_date_between(''1950-01-01'', ''2000-12-31'')';
SECURITY LABEL

-- Generate syntactically correct but fake emails and company names
anonymizer_demo=# SECURITY LABEL FOR anon ON column customer.email IS 'MASKED WITH FUNCTION anon.fake_email()';
SECURITY LABEL
anonymizer_demo=# SECURITY LABEL FOR anon ON column customer.company IS 'MASKED WITH FUNCTION anon.fake_company()';
SECURITY LABEL
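When many columns need rules, the statements can be generated from a simple mapping. A minimal sketch that reproduces the five labels above; the helper name is hypothetical, and the masking functions are the ones used in this demo. The only subtle part is the quoting: single quotes inside the label have to be doubled, as in the birthdate rule.

```python
# Column -> masking function call, written with ordinary single quotes.
MASKING_RULES = {
    "first_name": "anon.dummy_first_name()",
    "last_name": "anon.dummy_last_name()",
    "birthdate": "anon.random_date_between('1950-01-01', '2000-12-31')",
    "email": "anon.fake_email()",
    "company": "anon.fake_company()",
}


def security_label(table, column, function_call):
    """Build one SECURITY LABEL statement for a column masking rule."""
    # The label is itself a SQL string literal, so inner quotes are doubled.
    label = f"MASKED WITH FUNCTION {function_call}".replace("'", "''")
    return f"SECURITY LABEL FOR anon ON COLUMN {table}.{column} IS '{label}';"


for column, call in MASKING_RULES.items():
    print(security_label("customer", column, call))
```

Keeping the mapping in version control next to the schema migrations makes the masking rules reviewable like any other DDL.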

This is the final execution step. Running anonymize_database() triggers the engine to scan your rules and overwrite the table data globally. Depending on the size of your database, this may take some time as it performs UPDATE operations on the disk.

anonymizer_demo=# SELECT anon.anonymize_database();
 anonymize_database
--------------------
 t
(1 row)

The masking rules have now been applied and the data has been anonymized. You can check the result with the following query:

anonymizer_demo=# SELECT * FROM customer;
 id | birthdate  |        email         |    company     | first_name | last_name
----+------------+----------------------+----------------+------------+------------
  1 | 1960-11-30 | lpeters@example.net  | Brown and Sons | Chyna      | Mertz
  2 | 1997-09-08 | avaughan@example.com | Rice PLC       | Damien     | Williamson
(2 rows)

You can now turn off static masking to continue with the next demo, dynamic masking:

anonymizer_demo=# ALTER DATABASE anonymizer_demo SET anon.static_masking TO off;
ALTER DATABASE
anonymizer_demo=# ALTER ROLE postgres SET anon.static_masking TO off;
ALTER ROLE

2. Dynamic masking

Dynamic masking allows you to hide sensitive information from specific users (like developers or analysts) while preserving the original data for administrators or the application itself. The masking happens in memory at the moment the query is executed.

First, we tell the database to activate the transparent masking engine. This allows the anon extension to intercept queries from specific roles and apply masking rules before the results are returned.

anonymizer_demo=# ALTER DATABASE anonymizer_demo SET anon.transparent_dynamic_masking = TRUE;
ALTER DATABASE

To showcase masking in action, we will use two types of users: a masked user who sees fake data, and an unmasked user (like the superuser) who sees the actual data stored on disk.

In this step, we create demo_user and “tag” them with a security label that forces the masking engine to engage whenever they log in.

anonymizer_demo=# CREATE ROLE demo_user LOGIN;
CREATE ROLE
anonymizer_demo=# SECURITY LABEL FOR anon ON ROLE demo_user IS 'MASKED';
SECURITY LABEL
anonymizer_demo=# GRANT pg_read_all_data to demo_user;
GRANT ROLE

-- As the postgres user:
-- Ensure the postgres user remains unmasked so we can see the 'real' data

anonymizer_demo=# SECURITY LABEL FOR anon ON ROLE postgres IS NULL;
SECURITY LABEL

We don’t need to redefine our masking rules (for names, emails, etc.) because the SECURITY LABEL definitions we created in the static masking section are still stored in the database schema.

Watch what happens when we query the table as demo_user:

anonymizer_demo=> select * from customer;
 id | birthdate  |          email           |     company      | first_name | last_name
----+------------+--------------------------+------------------+------------+-----------
  1 | 1957-02-07 | walterkristi@example.org | Fernandez-Tucker | Emelie     | Rohan
  2 | 1950-05-15 | hannah76@example.com     | Leonard Group    | Jane       | Durgan
(2 rows)

If we run the exact same command again, the output changes:

anonymizer_demo=> select * from customer;
 id | birthdate  |         email          |    company    | first_name | last_name
----+------------+------------------------+---------------+------------+------------
  1 | 1963-12-26 | blairpeter@example.com | Simpson Group | Tod        | Balistreri
  2 | 1971-09-14 | steven27@example.com   | Davis-Hardin  | Sonny      | Wintheiser
(2 rows)

Because the data is being generated “on-the-fly” by the masking functions, the results are dynamic. Each request produces a fresh set of data.
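As a mental model only (this is not how the anon extension is implemented), dynamic masking behaves like a read path that swaps in generated values for masked roles and leaves storage untouched. A toy sketch with made-up rows and stand-in fake-data pools:

```python
import random

# What is "on disk" in this toy model.
STORED_ROWS = [
    {"id": 1, "email": "alice.martin@example.com", "company": "Acme Corp"},
    {"id": 2, "email": "bob.dupont@example.com", "company": "Globex"},
]

# Toy stand-ins for masking functions such as anon.fake_email().
FAKE_USERS = ["lpeters", "avaughan", "sfox", "hannah76"]
FAKE_COMPANIES = ["Brown and Sons", "Rice PLC", "Davis-Hardin"]

RULES = {
    "email": lambda: f"{random.choice(FAKE_USERS)}@example.net",
    "company": lambda: random.choice(FAKE_COMPANIES),
}

MASKED_ROLES = {"demo_user"}


def select_all(role):
    """Return the rows as `role` would see them."""
    if role not in MASKED_ROLES:
        return [dict(row) for row in STORED_ROWS]
    # Masked roles get freshly generated values on every call.
    return [
        {col: (RULES[col]() if col in RULES else val) for col, val in row.items()}
        for row in STORED_ROWS
    ]


print(select_all("demo_user"))
print(select_all("demo_user"))  # usually differs from the previous call
print(select_all("postgres"))   # the stored values
```

Columns without a rule (id here) pass through unchanged, which matches the query output above where the id column stays stable between runs.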

Now, let’s switch back to the postgres user. Since we removed the MASKED label from this role, the engine steps aside and shows us the actual data residing on the disk (which, in this case, is the data we masked statically in the previous step).

anonymizer_demo=# SELECT * FROM customer;
 id | birthdate  |        email         |    company     | first_name | last_name
----+------------+----------------------+----------------+------------+------------
  1 | 1960-11-30 | lpeters@example.net  | Brown and Sons | Chyna      | Mertz
  2 | 1997-09-08 | avaughan@example.com | Rice PLC       | Damien     | Williamson
(2 rows)

3. Anonymized dump

An Anonymized Dump allows you to export your database into a .sql file where the sensitive data is already replaced by fake values. This is incredibly powerful because you can create a “safe” backup that can be shared with developers or consultants without ever giving them access to your live production server.

Rather than using a superuser, we create a specific role whose sole purpose is to retrieve the masked version of the data during the export process.

-- Create a specialized user for the dump process
anonymizer_demo=# CREATE ROLE demo_ano_dumper LOGIN PASSWORD 'secret';
CREATE ROLE

-- Force the masking engine to be active for this user session
anonymizer_demo=# ALTER ROLE demo_ano_dumper SET anon.transparent_dynamic_masking = TRUE;
ALTER ROLE

-- Apply the MASKED label to the role
anonymizer_demo=# SECURITY LABEL FOR anon ON ROLE demo_ano_dumper IS 'MASKED';
SECURITY LABEL

-- Grant permission to read all tables
anonymizer_demo=# GRANT pg_read_all_data TO demo_ano_dumper;
GRANT ROLE

Now, we use the standard pg_dump utility. Because we are logging in as demo_ano_dumper, the anon extension intercepts the data export on-the-fly. We use a few specific flags to ensure the resulting file is clean and doesn’t contain the masking logic itself:

/u01/app/postgres/product/18/db_1/bin/pg_dump anonymizer_demo --username=demo_ano_dumper --password --no-security-labels --exclude-extension="anon" --file=anonymized_dump.sql

--no-security-labels: Prevents the “MASKED” tags from being exported (the new database doesn’t need to know how the data was masked).

--exclude-extension="anon": Ensures the recipient doesn’t need the anon extension installed to restore the file.

If you open the generated anonymized_dump.sql file in a text editor, you will see that the COPY commands contain the fake data, not the original sensitive information.

--
-- Data for Name: customer; Type: TABLE DATA; Schema: public; Owner: postgres
--

COPY public.customer (id, birthdate, email, company, first_name, last_name) FROM stdin;
1       1980-12-17      cannonlauren@example.com        Carr-Doyle      Katharina       Shanahan
2       1998-03-27      sfox@example.net        Pollard and Sons        Ayla    Spencer
\.

V. Conclusion

In modern development, the goal is to work with realistic data without the real-world risk. PostgreSQL Anonymizer bridges this gap by allowing you to transform sensitive production information into safe, functional datasets.

Now that we’ve explained how to use PostgreSQL Anonymizer and covered all three methods, here is a quick guide on when to use each:

Static Masking: Best for “cleaning” a staging database after a production refresh.
Dynamic Masking: Best for internal users (DBAs, support staff) who need to work on the live database but shouldn’t see production data.
Anonymized Dump: Best for sharing data with external partners or creating local development environments.

In the end, whether you choose Static, Dynamic, or Dump masking, the benefits remain the same:

Utility: Because the data is masked with realistic functions (like fake_email or dummy_first_name), your application logic—like email validation or UI layout—still works perfectly.
Compliance: Helps you meet GDPR, HIPAA, and internal security requirements.
Safety: Developers and analysts can work on real-world bugs and features without ever seeing a customer’s actual PII (Personally Identifiable Information).

I hope you enjoyed this guide and found these examples clear and easy to follow! My goal was to show that data privacy doesn’t have to be painful for your development workflow. Don’t forget to follow the extension’s latest news on the official website: https://postgresql-anonymizer.readthedocs.io/en/latest/

If you have any questions about these commands or how to implement them in your own environment, feel free to reach out.

The post PostgreSQL Anonymizer: Simple Data Masking for DBAs first appeared on dbi Blog.

RAG Series – Embedding Versioning LAB

Adrien Obernesser — Sun, 22 Feb 2026 21:36:36 +0000

Introduction

This is Part 2 of the embedding versioning series. In Part 1, I covered the theory: why event-driven embedding refresh matters, the three levels of architecture (triggers, logical replication, Flink CDC), and how to detect and skip insignificant changes. If you haven’t read it, go there first: this post won’t walk through the entire intent of the designs, it just demonstrates how they can work.

Here, I’m going to run the whole thing on the Wikipedia dataset from the pgvector_RAG_search_lab repository. 25,000 articles, triggers, OpenAI API calls, real numbers.

The goal is to answer the questions you’d actually have when implementing this:

How do you adapt the schema to an existing table that wasn’t designed for versioning?
What do the SKIP vs EMBED decisions actually look like with real data?
Does SELECT FOR UPDATE SKIP LOCKED really work with concurrent workers?
What does the freshness monitoring report show in practice?
How does the quality feedback loop close the circle?

All the code is in the lab/05_embedding_versioning/ directory of the repository.

What’s in the lab directory

Before diving in, here’s what each file does — so you know what you’re running:

lab/05_embedding_versioning/
├── schema.sql                          # DDL: tables, triggers, indexes
├── worker.py                           # Embedding worker (claims queue items, calls OpenAI, writes vectors)
├── change_detector.py                  # Compares new vs old embeddings to decide SKIP or EMBED
├── freshness_monitor.py                # Generates a full health report on embedding staleness
└── examples/
    ├── simulate_document_changes.py    # Generates a realistic mix of article mutations
    ├── targeted_mutations.py           # Applies specific change types to specific articles
    ├── demo_skip_locked.py             # Demonstrates concurrent worker queue distribution
    ├── demo_trigger_flow.py            # End-to-end: UPDATE → trigger → queue → embed
    └── demo_quality_drift.py           # Simulates declining search quality + automatic re-queuing

Every script connects to the local wikipedia database and uses the same embedding queue. They’re designed to run sequentially — each step builds on the state left by the previous one.

The Starting Point

My lab environment runs PostgreSQL 17.6 with pgvector 0.8.0 and pgvectorscale (DiskANN). The articles table already has 25,000 Wikipedia articles with dense and sparse embeddings from the previous labs (the sparsevec(30522) column holds SPLADE sparse vectors — 30,522 is the BERT WordPiece vocabulary size):

wikipedia=# \d articles
                          Table "public.articles"
         Column         |       Type       | Collation | Nullable | Default
------------------------+------------------+-----------+----------+---------
 id                     | integer          |           | not null |
 url                    | text             |           |          |
 title                  | text             |           |          |
 content                | text             |           |          |
 title_vector           | vector(1536)     |           |          |
 content_vector         | vector(1536)     |           |          |
 vector_id              | integer          |           |          |
 content_tsv            | tsvector         |           |          |
 title_content_tsvector | tsvector         |           |          |
 content_sparse         | sparsevec(30522) |           |          |
 title_vector_3072      | vector(3072)     |           |          |
 content_vector_3072    | vector(3072)     |           |          |
Indexes:
    "articles_pkey" PRIMARY KEY, btree (id)
    "articles_content_3072_diskann" diskann (content_vector_3072)
    "articles_sparse_hnsw" hnsw (content_sparse sparsevec_cosine_ops)
    "articles_title_vector_3072_diskann" diskann (title_vector_3072)
    "idx_articles_content_tsv" gin (content_tsv)
    "idx_articles_title_content_tsvector" gin (title_content_tsvector)
Triggers:
    tsvectorupdate BEFORE INSERT OR UPDATE ON articles FOR EACH ROW ...
    tsvupdate BEFORE INSERT OR UPDATE ON articles FOR EACH ROW ...

No content_hash, no updated_at, no versioned embeddings. This is the reality of most existing deployments — you need to retrofit versioning without breaking what’s already working.

Step 1: Apply the Versioning Schema

What `schema.sql` does

The schema file adapts the generic pattern from Part 1 to the existing articles table. It runs inside a single transaction and performs these operations in order:

Adds two columns to articles: content_hash TEXT and updated_at TIMESTAMPTZ DEFAULT now()
Creates a BEFORE trigger (trg_content_hash) that automatically computes md5(content) before every INSERT or UPDATE of the content column — this is our change detection fingerprint
Backfills content_hash for all 25,000 existing articles with UPDATE articles SET content_hash = md5(content)
Creates article_embeddings_versioned — the versioned embeddings table with model_name, model_version, source_hash, is_current, and a partial DiskANN index on WHERE is_current = true
Creates embedding_queue — the work queue with status, content_hash, change_type, claimed_at, and retry tracking
Creates embedding_change_log — records every SKIP/EMBED decision with similarity scores for audit
Creates retrieval_quality_log — for the quality feedback loop (Step 9b)
Creates an AFTER trigger (trg_queue_embedding) that fires on INSERT OR UPDATE OF content and inserts a queue entry automatically

Key differences from the “clean” schema in Part 1:

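No generated column for content_hash: adding GENERATED ALWAYS AS (md5(content)) STORED would force PostgreSQL to rewrite the entire table under an ACCESS EXCLUSIVE lock. The BEFORE trigger achieves the same result with a plain column. The backfill UPDATE still touches every row, but on large production tables it can be run in batches instead of as one blocking rewrite.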
Column-targeted trigger: AFTER UPDATE OF content instead of AFTER UPDATE. The trigger only fires when the content column is touched — title-only or metadata-only updates are ignored at the PostgreSQL level, not inside application code.
Table naming: article_embeddings_versioned (not document_embeddings) to match the existing articles table naming convention.

psql -d wikipedia -f lab/05_embedding_versioning/schema.sql

BEGIN
ALTER TABLE
CREATE FUNCTION
DROP TRIGGER
CREATE TRIGGER
UPDATE 25000
CREATE TABLE
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE TABLE
CREATE INDEX
CREATE INDEX
CREATE TABLE
CREATE TABLE
CREATE FUNCTION
DROP TRIGGER
CREATE TRIGGER
COMMIT

Let me walk through the important lines:

ALTER TABLE — adds content_hash and updated_at columns
CREATE FUNCTION + CREATE TRIGGER (first pair) — the BEFORE trigger that computes md5(content)
UPDATE 25000 — the backfill. This is the most expensive line: PostgreSQL computes MD5 for every article and writes the hash. On 25K rows it takes a few seconds; on millions of rows, plan a maintenance window
CREATE TABLE + CREATE INDEX (×3) — the versioned embeddings table with its partial DiskANN index, version lookup index, and staleness detection index
CREATE FUNCTION + CREATE TRIGGER (second pair) — the AFTER trigger that queues embedding work

After applying, the table now has versioning infrastructure:

wikipedia=# \d articles
         Column         |           Type           | Nullable | Default
------------------------+--------------------------+----------+---------
 ...existing columns...
 content_hash           | text                     |          |
 updated_at             | timestamp with time zone |          | now()
Referenced by:
    TABLE "article_embeddings_versioned" CONSTRAINT ... FOREIGN KEY (article_id) ...
    TABLE "embedding_queue" CONSTRAINT ... FOREIGN KEY (article_id) ...
Triggers:
    trg_content_hash BEFORE INSERT OR UPDATE OF content ON articles ...
    trg_queue_embedding AFTER INSERT OR UPDATE OF content ON articles ...
    tsvectorupdate BEFORE INSERT OR UPDATE ON articles ...
    tsvupdate BEFORE INSERT OR UPDATE ON articles ...

Two new triggers alongside the existing tsvector triggers. They coexist without conflict because trg_content_hash is BEFORE (updates the hash) and trg_queue_embedding is AFTER (queues the embedding work using the already-computed hash).

Four new tables: article_embeddings_versioned, embedding_queue, embedding_change_log, and retrieval_quality_log, plus their supporting indexes.

Step 2: Test the Trigger Manually

Before running anything complex, verify the trigger actually works. This is just a sanity check — one UPDATE, then look at the queue:

wikipedia=# SELECT id, title, content_hash FROM articles WHERE id = 1;
 id | title |           content_hash
----+-------+----------------------------------
  1 | April | 47761052aee1158134fc07f3f7337952

wikipedia=# UPDATE articles SET content = content || ' [test trigger]' WHERE id = 1;
UPDATE 1

wikipedia=# SELECT id, article_id, status, content_hash, change_type, queued_at
  FROM embedding_queue ORDER BY queued_at DESC LIMIT 5;
 id | article_id | status  |           content_hash           |  change_type   |           queued_at
----+------------+---------+----------------------------------+----------------+-------------------------------
  1 |          1 | pending | 59e5ebe6fa9fce7ab87beccf6523dda6 | content_update | 2026-02-18 14:38:01.626792+00

What happened here, step by step:

We checked article 1 (“April”) — its content_hash was 4776...
We appended ' [test trigger]' to its content
The BEFORE trigger (trg_content_hash) fired first, recomputing content_hash to 59e5... (the new MD5)
The AFTER trigger (trg_queue_embedding) fired next, inserting a row into embedding_queue with the new hash and change_type = 'content_update'
The queue entry has status = 'pending' — nothing has processed it yet

The change_type column is important: it’s how we’ll later distinguish content-triggered re-embeddings from quality-triggered ones (Step 9b).

Step 3: Simulate 50 Document Mutations

Real knowledge bases don’t get 1 change at a time. The simulate_document_changes.py script generates a realistic mix of changes to random articles.

What the script does

The script picks 50 random articles from the database and applies one of five mutation types to each, chosen by a weighted random distribution that mimics real-world editing patterns:

typo_fix (most common): appends a period or fixes a word — the kind of minor edit that shouldn’t trigger re-embedding
paragraph_add: appends a substantial paragraph (3-5 sentences) — new information that changes the semantic content
section_rewrite: replaces a portion of the article with new text — significant semantic shift
major_rewrite: rewrites most of the article — entirely new embedding needed
metadata_only: changes only the title (not the content) — should NOT trigger the embedding pipeline at all

python examples/simulate_document_changes.py --count 50

Mutation Summary:
----------------------------------------
  major_rewrite        3
  metadata_only        6
  paragraph_add       15
  section_rewrite      4
  typo_fix            22
  TOTAL               50

This distribution is realistic: most changes are minor fixes, a smaller portion adds new content, and a few are major rewrites. The 6 metadata_only changes simulate edits to fields other than content — think correcting a title or updating a URL.

Now check the queue:

wikipedia=# SELECT status, count(*) FROM embedding_queue GROUP BY status;
 status  | count
---------+-------
 pending |    44

50 mutations, but only 44 queue entries. Where did the other 6 go?

The 6 metadata_only mutations changed only the title (not content), so the trigger — which fires on UPDATE OF content — didn’t fire for them. Those 6 changes never reached the embedding pipeline. This is the first cost optimization, and it happens at the PostgreSQL trigger level with zero application code, zero API calls, zero overhead.

Why this matters: In a real knowledge base, a meaningful fraction of updates are metadata-only — tags, categories, status flags, author fields (in some orgs, 30-50% of all UPDATEs). Filtering them at the trigger level means your embedding worker never even sees them.
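The column-targeted filter is easy to reproduce in miniature. The sketch below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, purely so it runs without a server (SQLite also supports AFTER UPDATE OF column triggers). The table and trigger names mirror the lab, but the sample rows are invented, and the helper update_content() plays the role of the BEFORE trigger by writing the hash in the same UPDATE.

```python
import hashlib
import sqlite3


def md5_hex(text):
    """Hex MD5 of UTF-8 text, used here as the change-detection fingerprint."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def build_demo_db():
    """Create an in-memory database with a column-targeted queue trigger."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE articles (
            id INTEGER PRIMARY KEY,
            title TEXT,
            content TEXT,
            content_hash TEXT
        );
        CREATE TABLE embedding_queue (
            queue_id INTEGER PRIMARY KEY AUTOINCREMENT,
            article_id INTEGER NOT NULL,
            status TEXT NOT NULL DEFAULT 'pending',
            content_hash TEXT,
            change_type TEXT
        );
        -- Fires only when the UPDATE statement assigns the content column.
        CREATE TRIGGER trg_queue_embedding
        AFTER UPDATE OF content ON articles
        BEGIN
            INSERT INTO embedding_queue (article_id, content_hash, change_type)
            VALUES (NEW.id, NEW.content_hash, 'content_update');
        END;
        """
    )
    # Invented sample rows; only the titles echo articles seen in the lab.
    rows = [(1, "April", "first text"), (2, "Needle", "second text"), (3, "1787", "third text")]
    conn.executemany(
        "INSERT INTO articles (id, title, content, content_hash) VALUES (?, ?, ?, ?)",
        [(i, title, content, md5_hex(content)) for i, title, content in rows],
    )
    return conn


def update_content(conn, article_id, new_content):
    """Content edit: sets content and its hash together, so the trigger fires."""
    conn.execute(
        "UPDATE articles SET content = ?, content_hash = ? WHERE id = ?",
        (new_content, md5_hex(new_content), article_id),
    )


def update_title(conn, article_id, new_title):
    """Metadata-only edit: content is not assigned, so the trigger stays silent."""
    conn.execute("UPDATE articles SET title = ? WHERE id = ?", (new_title, article_id))


def pending_count(conn):
    return conn.execute(
        "SELECT count(*) FROM embedding_queue WHERE status = 'pending'"
    ).fetchone()[0]


conn = build_demo_db()
update_content(conn, 1, "first text, revised")
update_title(conn, 2, "Needle (tool)")
update_content(conn, 3, "third text, revised")
print(pending_count(conn))  # 2: the title-only edit never reached the queue
```

Three UPDATEs, two queue entries: the same effect as the 50 mutations and 44 queue entries above, at toy scale.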

Step 4: Change Detection Without a Baseline

Now let’s run the change detector to see which items should be embedded vs skipped.

What `change_detector.py` does

The change detector is the “smart filter” in our pipeline. For each pending queue item, it:

Fetches the article’s current content from the articles table
Looks up the most recent embedding for that article in article_embeddings_versioned
If no previous embedding exists: marks the item as EMBED (similarity = 0.0) — there’s nothing to compare against
If a previous embedding exists: generates a new embedding for the current content via OpenAI, computes the cosine similarity between old and new embeddings, and applies the threshold:
- Similarity ≥ 0.95 → SKIP (the semantic meaning barely changed, re-embedding would be wasteful)
- Similarity < 0.95 → EMBED (the meaning shifted enough to warrant a new embedding)
Logs every decision to embedding_change_log with the similarity score — this is your audit trail
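The decision rule above fits in a few lines. This is a minimal sketch in plain Python, not the code of change_detector.py: the function names are mine, and the vectors are passed in as plain lists so that nothing depends on OpenAI or the database.

```python
import math

SIMILARITY_THRESHOLD = 0.95  # the threshold used throughout this lab


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def decide(old_embedding, new_embedding, threshold=SIMILARITY_THRESHOLD):
    """Return (decision, similarity) for one queue item.

    old_embedding is None when the article has no previous embedding,
    which always results in EMBED with similarity 0.0.
    """
    if old_embedding is None:
        return "EMBED", 0.0
    similarity = cosine_similarity(old_embedding, new_embedding)
    decision = "SKIP" if similarity >= threshold else "EMBED"
    return decision, similarity


print(decide(None, [1.0, 0.0]))        # no baseline: always EMBED
print(decide([1.0, 0.0], [1.0, 0.1]))  # small shift: SKIP
print(decide([1.0, 0.0], [1.0, 0.5]))  # larger shift: EMBED
```

Note the boundary: a similarity of exactly 0.95 counts as SKIP, matching the "≥ 0.95" rule.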

Multi-chunk articles: When an article has multiple chunks (like “Dean Martin” with 3), the detector compares against chunk_index = 0 only — the lead section, which concentrates the article’s core topic. This is a deliberate tradeoff: it’s fast (one comparison, not N), and for Wikipedia-style content where the introduction summarizes the whole article, it’s a reliable proxy. For corpora where meaning is spread more evenly across chunks, you’d want a centroid approach (average the L2-normalized chunk vectors) or max pairwise similarity across corresponding chunks. The threshold may need recalibration depending on which strategy you choose.
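For reference, here is what the centroid alternative could look like. This is a sketch of the idea only, since the lab's detector compares chunk_index = 0: each chunk vector is L2-normalized, the normalized vectors are averaged, and the two centroids are compared with cosine similarity. It also works when the old and new versions have different chunk counts. All function names are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def centroid(chunk_vectors):
    """Average of the L2-normalized chunk vectors."""
    if not chunk_vectors:
        raise ValueError("need at least one chunk vector")
    normalized = [l2_normalize(v) for v in chunk_vectors]
    dim = len(normalized[0])
    return [sum(v[i] for v in normalized) / len(normalized) for i in range(dim)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid_similarity(old_chunks, new_chunks):
    """Compare two versions of an article by their chunk centroids."""
    return cosine_similarity(centroid(old_chunks), centroid(new_chunks))


old = [[1.0, 0.0], [0.0, 1.0]]              # two chunks before the edit
new = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # a third chunk was added
print(round(centroid_similarity(old, new), 4))
```

Normalizing before averaging keeps one unusually long vector from dominating the centroid. Because a centroid dilutes a change confined to one chunk, a threshold tuned on chunk 0 will usually not carry over unchanged.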

The --analyze-queue flag tells it to analyze all pending items without actually embedding anything. Think of it as a dry run that records decisions.

python change_detector.py --analyze-queue

2026-02-18 14:43:22 [DETECTOR] INFO Analyzing 44 pending queue items (threshold=0.95)
2026-02-18 14:43:22 [DETECTOR] INFO Article 6607: EMBED (similarity=0.0000)
2026-02-18 14:43:22 [DETECTOR] INFO Article 36870: EMBED (similarity=0.0000)
...all 44 show similarity=0.0000...
2026-02-18 14:43:22 [DETECTOR] INFO Results: 44 EMBED, 0 SKIP

Every single article shows similarity=0.0. Why?

Because article_embeddings_versioned is empty. There are no previous embeddings to compare against. The change detector hit step 3 for every article: “no previous embedding exists → must EMBED.”

This is an important operational insight: the change detector needs a baseline to work. On the very first run, or when you deploy to a new system, everything must be embedded. The SKIP optimization only kicks in on subsequent changes, after embeddings exist to compare against. If you’re migrating from a system that already has embeddings in a different format, you’d need to load them into article_embeddings_versioned first, with source_hash set to the hash of the content they were generated from, to bootstrap the comparison.

Step 5: Create Baseline Embeddings

Now we need to establish that baseline. Let’s run the worker for one small batch.

What `worker.py` does

The worker is the component that actually calls the OpenAI API and writes embeddings to PostgreSQL. Here’s its internal flow:

Claim items from the queue using SELECT ... FOR UPDATE SKIP LOCKED — this is the concurrency primitive from Part 1. Multiple workers can run simultaneously, and each gets a non-overlapping set of items.
For each claimed item: fetch the article content, split it into chunks (2000-character windows with overlap), and call the OpenAI text-embedding-3-small API to generate a 1536-dimensional vector for each chunk.
Write the embeddings to article_embeddings_versioned with is_current = true, model_name, model_version, and source_hash (the content’s MD5 at the moment of embedding).
Mark old embeddings for the same article as is_current = false (soft delete — they’re kept for rollback).
Update the queue item to status = 'completed' with processed_at = now().
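The chunking step is the only part of that flow with real logic of its own, so here is a minimal sketch of it. The 2000-character window comes from the lab. The 200-character overlap and the function name are assumptions for illustration; the worker's actual overlap may differ.

```python
def chunk_text(text, chunk_size=2000, overlap=200):
    """Split text into fixed-size character windows that overlap.

    Each chunk starts (chunk_size - overlap) characters after the previous
    one, so the tail of one chunk is repeated at the head of the next.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


article = "x" * 5000  # stand-in for an article body
for index, chunk in enumerate(chunk_text(article)):
    print(index, len(chunk))
```

The index printed here corresponds to chunk_index in article_embeddings_versioned: one embedding request and one stored vector per chunk.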

The --once flag means “process one batch and exit” (instead of running in an infinite polling loop). The --batch-size 10 flag means “claim up to 10 items at a time.”

python worker.py --once --batch-size 10

2026-02-18 14:45:58 [8526] INFO Worker worker-once claimed 10 items
2026-02-18 14:46:02 [8526] INFO Article 6607: embedded 1 chunks
2026-02-18 14:46:02 [8526] INFO Article 36870: embedded 1 chunks
2026-02-18 14:46:05 [8526] INFO Article 19078: embedded 1 chunks
2026-02-18 14:46:05 [8526] INFO Article 7947: embedded 1 chunks
2026-02-18 14:46:05 [8526] INFO Article 75802: embedded 2 chunks
2026-02-18 14:46:05 [8526] INFO Article 5150: embedded 1 chunks
2026-02-18 14:46:06 [8526] INFO Article 55579: embedded 1 chunks
2026-02-18 14:46:06 [8526] INFO Article 92697: embedded 1 chunks
2026-02-18 14:46:06 [8526] INFO Article 49417: embedded 3 chunks
2026-02-18 14:46:06 [8526] INFO Article 70595: embedded 1 chunks
Processed 10 items

Reading the output:

claimed 10 items — the worker took 10 items from the queue using SKIP LOCKED. If another worker ran simultaneously, it would get different items.
Article 6607: embedded 1 chunks — this article’s content fit within a single 2000-character chunk. One API call, one embedding vector stored.
Article 75802: embedded 2 chunks — “Brandenburg Gate” was longer and required two chunks. Two API calls, two embedding vectors, both linked to the same article with chunk_index 0 and 1.
Article 49417: embedded 3 chunks — “Dean Martin” was the longest article in this batch, requiring three chunks.

Let’s verify the data in PostgreSQL:

wikipedia=# SELECT count(DISTINCT article_id) AS articles, count(*) AS chunks
  FROM article_embeddings_versioned WHERE is_current = true;
 articles | chunks
----------+--------
       10 |     13

10 articles, 13 chunks. The numbers match the worker output.

Total time: ~8 seconds for 10 articles. The bottleneck is the OpenAI API call (~300-600ms per embedding request), not PostgreSQL. In this lab, the trigger overhead, queue operations, and embedding writes were all negligible compared to API latency. If you need faster throughput, the answer is more workers (see Step 8) or a local embedding model — not database optimization.

Step 6: The Real Demo — SKIP vs EMBED

Now we have a baseline: 10 articles with embeddings and known source_hash values. This is the step where the change detector can finally do its job properly.

What `targeted_mutations.py` does

This script applies specific, known mutation types to the 10 articles we just embedded. Unlike simulate_document_changes.py (which picks random articles and random mutations), this script is deterministic — we control exactly what changes happen so we can verify the detector’s decisions:

5 articles: append a single period character (.) to the content — the smallest possible content change. This is a typo-level edit that should not change the semantic meaning at all.
3 articles: append a substantial paragraph (~100 words of new information) — this adds genuine semantic content that should shift the embedding.
2 articles: rewrite the second half of the content — a major structural change that dramatically alters the meaning.

python examples/targeted_mutations.py

Embedded articles: [5150, 6607, 7947, 19078, 36870, 49417, 55579, 70595, 75802, 92697]
  Article 5150: appended period (typo fix)
  Article 6607: appended period (typo fix)
  Article 7947: appended period (typo fix)
  Article 19078: appended period (typo fix)
  Article 36870: appended period (typo fix)
  Article 49417: appended major paragraph
  Article 55579: appended major paragraph
  Article 70595: appended major paragraph
  Article 75802: rewrote second half
  Article 92697: rewrote second half
Done - 10 targeted mutations applied

Each of these UPDATEs fires the trigger, which creates a new queue entry. But now — unlike Step 4 — we have existing embeddings to compare against.

Now run the change detector again:

What happens inside the detector this time

For each of the 10 mutated articles, the detector:

Takes the article’s current (modified) content
Generates a new embedding via OpenAI
Retrieves the existing embedding from article_embeddings_versioned
Computes cosine similarity between old and new
Applies the 0.95 threshold

For the 34 other pending items (from Step 3, still without baseline embeddings), it still returns similarity=0.0.

python change_detector.py --analyze-queue

...34 articles without baseline still show EMBED (similarity=0.0000)...

2026-02-18 14:54:03 [DETECTOR] INFO Article 5150: SKIP (similarity=0.9981)
2026-02-18 14:54:03 [DETECTOR] INFO Article 6607: SKIP (similarity=0.9994)
2026-02-18 14:54:03 [DETECTOR] INFO Article 7947: SKIP (similarity=0.9993)
2026-02-18 14:54:03 [DETECTOR] INFO Article 19078: SKIP (similarity=0.9993)
2026-02-18 14:54:04 [DETECTOR] INFO Article 36870: SKIP (similarity=0.9997)
2026-02-18 14:54:04 [DETECTOR] INFO Article 49417: EMBED (similarity=0.9263)
2026-02-18 14:54:04 [DETECTOR] INFO Article 55579: EMBED (similarity=0.9255)
2026-02-18 14:54:04 [DETECTOR] INFO Article 70595: EMBED (similarity=0.9369)
2026-02-18 14:54:04 [DETECTOR] INFO Article 75802: EMBED (similarity=0.6256)
2026-02-18 14:54:04 [DETECTOR] INFO Article 92697: EMBED (similarity=0.5090)
2026-02-18 14:54:04 [DETECTOR] INFO Results: 39 EMBED, 5 SKIP

Reading the results

What the similarity numbers mean:

0.998–0.999 (typo fixes): The old and new embeddings are nearly identical. Adding a period barely shifts the vector in 1536-dimensional space. The detector correctly says: “this content hasn’t meaningfully changed — skip the re-embed.” That avoids 5 unnecessary write operations, index churn, and version flips.
0.925–0.937 (paragraph additions): Adding about 100 words of new information shifts the embedding enough to drop below 0.95. The detector correctly says: “the semantic content changed, re-embed.” The new paragraph about Dean Martin’s film career, for example, needs to be reflected in the vector.
0.509–0.626 (section rewrites): Rewriting half the article dramatically changes the meaning. These similarities are far below the threshold — clearly needing re-embedding.
0.0 (no baseline): The 34 articles from Step 3 that still have no embeddings. Can’t compare what doesn’t exist yet.

Cost honesty note: The detector uses embedding similarity, which means it calls OpenAI once per article to generate the comparison vector, even for articles it ultimately SKIPs. So SKIP doesn’t eliminate API spend; it eliminates unnecessary writes, index churn, and version flips. For single-chunk articles (the majority in this lab), the detection call costs the same as the embedding call itself. The real savings show up with multi-chunk articles: the detector spends 1 API call to decide, versus N calls to re-embed all chunks. In production, you’d add cheaper pre-filters first: a content_hash comparison (free, catches identical content) and a text diff ratio (cheap, catches typos). You’d then reserve embedding-similarity checks for borderline cases where the content changed but the semantic impact is unclear. That’s the graduation path Part 1 describes.
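To make the pre-filter idea concrete, here is a sketch of the two cheap tiers using only the standard library. It is not part of the lab scripts: the function name, the return labels, and the 0.995 diff-ratio cutoff are assumptions you would tune on your own corpus.

```python
import difflib
import hashlib


def md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def prefilter(old_text, new_text, trivial_ratio=0.995):
    """Cheap checks to run before spending an embedding API call.

    Returns one of:
      "SKIP_IDENTICAL"  - hashes match, nothing changed
      "SKIP_TRIVIAL"    - texts are nearly identical character-wise
      "CHECK_EMBEDDING" - changed enough to justify the similarity check
    """
    # Tier 1: hash comparison, effectively free.
    if md5_hex(old_text) == md5_hex(new_text):
        return "SKIP_IDENTICAL"

    # Tier 2: character-level diff ratio (1.0 means identical).
    # autojunk=False disables a heuristic that can distort the ratio
    # on long texts.
    matcher = difflib.SequenceMatcher(None, old_text, new_text, autojunk=False)
    if matcher.ratio() >= trivial_ratio:
        return "SKIP_TRIVIAL"

    # Tier 3 (not shown): embedding similarity against the stored vector.
    return "CHECK_EMBEDDING"


base = "The quick brown fox jumps over the lazy dog. " * 20
print(prefilter(base, base))
print(prefilter(base, base + "."))
print(prefilter(base, base + " An entirely new paragraph follows. " * 10))
```

One caveat: SequenceMatcher can be slow on very long documents, so for large articles you might compute the ratio on a sample or switch to a faster diff library.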

The key insight: there’s a clean gap between the typo group (lowest: 0.9981) and the paragraph group (highest: 0.9369). That gap from 0.937 to 0.998 is where our 0.95 threshold sits. It doesn’t fall in ambiguous territory. The change types cluster naturally, which is what makes threshold-based detection practical in the real world.

The queue now reflects the decisions:

wikipedia=# SELECT status, count(*) FROM embedding_queue GROUP BY status;
  status   | count
-----------+-------
 skipped   |     5
 completed |    10
 pending   |    39

5 skipped: the typo-level changes — unnecessary writes avoided, no quality loss
10 completed: the baseline embeddings from Step 5
39 pending: 34 no-baseline articles + 5 newly-detected EMBED items, waiting for the worker

The skipped status is an audit trail — you can always go back and see what was skipped, when, and at what similarity score (recorded in embedding_change_log).

Step 7: Freshness Monitoring Report

In production, you need a dashboard — not individual log lines. The freshness_monitor.py script consolidates all the monitoring queries from Part 1 into a single diagnostic report.

What `freshness_monitor.py` does

The script runs five monitoring queries against the database and formats them into a human-readable report:

Freshness summary: How many articles have embeddings? How many are stale (content changed since last embedding)?
Stale articles detail: Which specific articles have drifted — showing both the current content hash and the embedding’s source hash so you can see the mismatch
Queue health: Breakdown by status with timestamps — tells you if items are stuck or if the queue is draining properly
Version coverage: Which embedding models are in use and how many articles/chunks each covers
Change detection decisions: Aggregated SKIP/EMBED statistics with average similarity scores

python freshness_monitor.py --report

Embedding Freshness Report — 2026-02-18 14:57:53

============================================================
  Freshness Summary
============================================================
  Total articles:        25000
  With embeddings:       10  (0.0%)
  Without embeddings:    24990
  Stale embeddings:      10  (100.0%)

Reading this: Only 10 of 25,000 articles have versioned embeddings (from Step 5). All 10 are “stale” because we just mutated all of them in Step 6. In a real deployment, you’d see something like “23,450 with embeddings (93.8%), 312 stale (1.3%)” — and you’d alert if stale exceeded, say, 5%.

============================================================
  Stale Articles (content changed since embedding)
============================================================
  ID    | Title                                  | Current Hash     | Embed Hash       | ...
  ------+----------------------------------------+------------------+------------------+----
  5150  | 1787                                   | 5b14bc4a2d...    | 11e81bc4de...    | ...
  6607  | Needle                                 | 3ebb3c3cbb...    | 5c5290b5a7...    | ...
  49417 | Dean Martin                            | 7061f1803f...    | f7fd9f30e6...    | ...
  75802 | Brandenburg Gate                       | 7da53df7a0...    | 5a2dcc01f9...    | ...
  ...6 more...

The Current Hash and Embed Hash columns are the two MD5 fingerprints. When they don’t match, it means the article’s content has changed since we last generated its embedding. Article 5150 (“1787”) shows different hashes even though we only appended a period — the MD5 captures any content change, even trivial ones. The change detector is what decides whether the difference matters semantically (and it said SKIP for this one).
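The staleness logic behind these two report sections comes down to one comparison per article. The sketch below works on plain dictionaries instead of the database, and its names are hypothetical; the 5% alert level is the example figure mentioned above.

```python
def freshness_summary(article_hashes, embedding_hashes, alert_pct=5.0):
    """Summarize embedding freshness.

    article_hashes:   {article_id: current content_hash}
    embedding_hashes: {article_id: source_hash of the current embedding}
    """
    total = len(article_hashes)
    with_embeddings = [a for a in article_hashes if a in embedding_hashes]
    # Stale means the content hash no longer matches the hash the
    # embedding was generated from.
    stale = [a for a in with_embeddings if embedding_hashes[a] != article_hashes[a]]

    coverage_pct = 100.0 * len(with_embeddings) / total if total else 0.0
    stale_pct = 100.0 * len(stale) / len(with_embeddings) if with_embeddings else 0.0
    return {
        "total": total,
        "with_embeddings": len(with_embeddings),
        "without_embeddings": total - len(with_embeddings),
        "coverage_pct": round(coverage_pct, 1),
        "stale_ids": sorted(stale),
        "stale_pct": round(stale_pct, 1),
        "alert": stale_pct > alert_pct,
    }


# Invented toy data: four articles, two embedded, one of those stale.
articles = {1: "aaa", 2: "bbb", 3: "ccc", 4: "ddd"}
embeddings = {1: "aaa", 2: "old"}
print(freshness_summary(articles, embeddings))
```

Stale here means "hash mismatch", not "needs re-embedding": article 5150 is stale by this definition even though the change detector decided to SKIP it.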

============================================================
  Queue Health
============================================================
  Status    | Count | Oldest                 | Newest
  ----------+-------+------------------------+------------------------
  pending   | 39    | 2026-02-18 14:39:59    | 2026-02-18 14:53:55
  completed | 10    | 2026-02-18 14:39:58    | 2026-02-18 14:39:59
  skipped   | 5     | 2026-02-18 14:53:55    | 2026-02-18 14:53:55

The queue is healthy but has a backlog: 39 items pending, the oldest queued about 18 minutes before this report ran. In production, you’d watch the age of the “Oldest” pending item. If it keeps growing while new items are added, your workers can’t keep up. That’s when you scale up workers (see Step 8) or increase batch size.

The 10 completed items are from Step 5, the 5 skipped from Step 6’s change detector.

============================================================
  Embedding Version Coverage
============================================================
  Model Version          | Articles | Chunks | Current
  -----------------------+----------+--------+--------
  text-embedding-3-small | 10       | 13     | 13

Only one model version in use, covering 10 articles with 13 chunks, all current. During a blue-green model upgrade (Part 1’s model versioning section), you’d see two rows here — v1 and v2 — and track coverage convergence.

============================================================
  Change Detection Decisions
============================================================
  Decision | Count | Avg Similarity
  ---------+-------+---------------
  EMBED    | 83    | 0.0473
  SKIP     | 5     | 0.9992

The average similarity for EMBED decisions is 0.0473 because most of those 83 decisions had similarity=0.0 (no baseline). The 5 SKIPs have an average of 0.9992 — confirming these were truly trivial changes. In a mature deployment, the EMBED average similarity would be higher (0.7–0.9 range) and the SKIP/EMBED ratio would tell you how efficient your threshold is.
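You can check these averages by hand from the two detector runs. The snippet below just replays the logged similarity scores: 44 zero-similarity EMBEDs from Step 4, then 34 zero-similarity EMBEDs, 5 compared EMBEDs, and 5 SKIPs from Step 6.

```python
# Similarity scores as logged by the two change_detector.py runs.
step4_embed = [0.0] * 44                      # Step 4: no baseline at all
step6_embed_no_baseline = [0.0] * 34          # Step 6: still no baseline
step6_embed_compared = [0.9263, 0.9255, 0.9369, 0.6256, 0.5090]
step6_skip = [0.9981, 0.9994, 0.9993, 0.9993, 0.9997]

embed_scores = step4_embed + step6_embed_no_baseline + step6_embed_compared


def mean(values):
    return sum(values) / len(values)


print(len(embed_scores), round(mean(embed_scores), 4))  # 83 0.0473
print(len(step6_skip), round(mean(step6_skip), 4))      # 5 0.9992
print(round(mean(step6_embed_compared), 4))             # 0.7847
```

The last line is the interesting one: once the 78 no-baseline zeros are excluded, the EMBED average in this lab is 0.7847, which already sits in the 0.7–0.9 range expected for a mature deployment.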

Step 8: SKIP LOCKED — Multi-Worker Concurrency

This is the demo that proves the theory from Part 1’s deep dive on SELECT FOR UPDATE SKIP LOCKED.

What `demo_skip_locked.py` does

The script launches multiple Python threads that each behave like independent embedding workers. Each thread:

Opens its own database connection
Runs UPDATE embedding_queue SET status='processing' WHERE queue_id IN (SELECT queue_id FROM embedding_queue WHERE status='pending' ORDER BY queued_at FOR UPDATE SKIP LOCKED LIMIT n) — the exact same claim query the real worker uses (note the ORDER BY queued_at — without it, selection order is not deterministic and oldest-first is not guaranteed)
Records which queue_id values it got
Does NOT actually call OpenAI (this is a concurrency demo, not an embedding demo)

After all threads finish, the script checks for overlap: did any two workers claim the same item? The answer should always be zero.

python examples/demo_skip_locked.py --workers 4 --items 39

============================================================
  Demo: SKIP LOCKED Multi-Worker Concurrency
  Workers: 4  |  Target items: 39
============================================================

Launching 4 workers (each requesting up to 14 items)...

  demo-worker-0: claimed 14 items  (articles: [96746, 37330, 67708, 32834, 46541]...)
  demo-worker-1: claimed 14 items  (articles: [57924, 20028, 65749, 92016, 24921]...)
  demo-worker-2: claimed 11 items  (articles: [66390, 27221, 30148, 97917, 30449]...)
  demo-worker-3: claimed 0 items   (articles: [])

========================================
  Total items claimed:  39
  Unique articles:      39
  Elapsed time:         0.05s

  ZERO OVERLAP — SKIP LOCKED working correctly!
============================================================

Reading the output

14 + 14 + 11 + 0 = 39 — every pending item was claimed exactly once
Zero overlap — no item was processed by more than one worker
0.05 seconds — the entire distribution happened in 50 milliseconds
Worker 3 got 0 items: This is expected in a demo that makes no API calls. The first 3 workers drained the queue before Worker 3’s SELECT ... SKIP LOCKED could find any unlocked rows. In a real deployment where each item takes 300-600ms (the OpenAI API call), all 4 workers would stay busy and you’d see an approximately even distribution.

Why SKIP LOCKED and not regular FOR UPDATE? With regular FOR UPDATE, Worker 1 would lock rows and Worker 2 would wait (block) until Worker 1’s transaction commits. With SKIP LOCKED, Worker 2 skips the locked rows and grabs the next available ones immediately. No blocking, no deadlocks, no coordination.
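If you want to see the skip-instead-of-wait behavior without a database, the semantics can be imitated with threads. This is an analogy and not the code of demo_skip_locked.py: each row carries a threading.Lock standing in for PostgreSQL's row lock, and a non-blocking acquire plays the role of SKIP LOCKED. Unlike PostgreSQL, which holds row locks until the transaction ends, this sketch releases the lock as soon as the status is flipped.

```python
import threading


class QueueRow:
    """One embedding_queue row with a lock standing in for the row-level lock."""

    def __init__(self, queue_id):
        self.queue_id = queue_id
        self.status = "pending"
        self.lock = threading.Lock()


def claim_batch(rows, limit):
    """Claim up to `limit` pending rows, skipping any row another worker holds."""
    claimed = []
    for row in rows:  # rows are assumed to be ordered by queued_at already
        if len(claimed) >= limit:
            break
        if not row.lock.acquire(blocking=False):
            continue  # SKIP LOCKED: do not wait, move on to the next row
        try:
            if row.status == "pending":
                row.status = "processing"
                claimed.append(row.queue_id)
        finally:
            row.lock.release()
    return claimed


def run_demo(workers=4, items=39, limit=14):
    """Run several workers against one shared queue; return each worker's claims."""
    rows = [QueueRow(queue_id) for queue_id in range(1, items + 1)]
    results = [[] for _ in range(workers)]

    def worker(index):
        results[index] = claim_batch(rows, limit)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


claims = run_demo()
all_ids = [queue_id for batch in claims for queue_id in batch]
print("total claimed:", len(all_ids), "unique:", len(set(all_ids)))
```

How the items are split between workers varies with thread scheduling, but the two totals are always equal: every row is claimed exactly once.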

This is pure PostgreSQL. No Redis, no RabbitMQ, no SQS. One SQL query, one feature (SKIP LOCKED), and you have a production-grade concurrent work queue. If you need to process your embedding queue faster, add workers: claims never block each other, so throughput scales roughly linearly until the embedding API’s rate limits become the bottleneck.

Step 9a: End-to-End Trigger Flow

Every previous step ran parts of the pipeline in isolation. This demo shows the complete lifecycle of a single article change — from UPDATE to searchable embedding.

What `demo_trigger_flow.py` does

The script picks one article and walks through the full pipeline synchronously:

Checks the queue for this article (should be empty)
Updates the article’s content (appending demo text)
Verifies the trigger fired by checking the queue again (should now have a pending entry)
Shows the article’s metadata (new content_hash, updated_at)
Runs the worker for exactly this one item (calls OpenAI, writes embeddings)
Verifies the embeddings are in article_embeddings_versioned

python examples/demo_trigger_flow.py

============================================================
  Demo: End-to-End Trigger Flow
  Article: [86698] Thin film transistor liquid crystal display
============================================================

1. Queue entries (pending) for article 86698 BEFORE update: 0

Nothing in the queue yet — clean starting state.

2. Updated article content (appended demo text)

An UPDATE articles SET content = content || '...' WHERE id = 86698 just ran. Two triggers fired: trg_content_hash (BEFORE, recomputed the MD5) and trg_queue_embedding (AFTER, inserted a queue entry).

3. Trigger fired! Queue entry created:
   Queue ID:     57
   Status:       pending
   Content Hash: b5a7c0820832fd54...
   Queued At:    2026-02-18 15:05:29.303062+00:00

The trigger did its job. A new pending item is in the queue with the article’s current content hash. Note the timestamp — in this lab, the trigger overhead was negligible.

4. Article metadata updated:
   Content Hash: b5a7c0820832fd54...
   Updated At:   2026-02-18 15:05:29.303062+00:00

The article’s content_hash matches the queue entry’s hash: the BEFORE trigger computed it, and the AFTER trigger copied the already-computed value into the queue row. This hash will later be stored as source_hash on the embedding, creating the audit chain: “this embedding was generated from this exact version of the content.”

5. Running worker for one batch...
   Article 86698: embedded 3 chunks
   Processed 1 items

The worker claimed this item, called OpenAI 3 times (3 chunks), and wrote the embeddings to article_embeddings_versioned.

6. Embeddings for article 86698:
   Current chunks: 3
   Last created:   2026-02-18 15:05:29.388968+00:00

============================================================
  Demo complete!
============================================================

The complete flow — from content modification to searchable embeddings — took about 1 second. The latency breakdown: ~50ms for PostgreSQL (trigger + queue + insert), ~900ms for OpenAI (3 embedding API calls). In a production system with a continuously running worker, this latency would be the norm for every content change.

Step 9b: Quality Feedback Loop

The final piece, and the one that closes the architecture. Everything so far reacts to content changes. But what if the embeddings are technically “fresh” (content hasn’t changed) yet search quality is degrading? Maybe the model isn’t capturing certain topics well, or the chunking strategy doesn’t work for some article types.

What `demo_quality_drift.py` does

This script simulates the quality feedback loop described in Part 1’s monitoring section. It works in four phases:

Phase 1 — Simulate retrieval quality logs: The script generates 20 fake search queries with associated quality metrics (nDCG, precision@k, user satisfaction scores). It deliberately creates a pattern where quality metrics decline for certain topics — simulating what would happen if embeddings for some subject areas became less effective over time.

Phase 2 — Quality analysis: The script scans retrieval_quality_log looking for queries with poor results: low nDCG scores (below a configurable threshold) or negative user feedback. It identifies 8 queries where quality dropped.

Phase 3 — Article correlation: For each poor-performing query, the script finds related articles using title ILIKE '%keyword%' matching. This is a simplified version of what a production system would do (where you’d use the query’s actual retrieved results instead of keyword matching). It identifies 29 articles that might be causing poor search results.

Phase 4 — Automatic re-queuing: All 29 articles are inserted into embedding_queue with change_type = 'quality_reembed' instead of the usual 'content_update'. This distinction is critical — it means the re-embedding is happening not because the content changed, but because the quality metrics flagged a problem.

python examples/demo_quality_drift.py

The demo runs through all four phases and produces a final queue state:

wikipedia=# SELECT change_type, status, count(*) 
  FROM embedding_queue GROUP BY change_type, status ORDER BY change_type, status;
  change_type    |  status   | count
-----------------+-----------+-------
 content_update  | completed |    50
 content_update  | skipped   |     5
 quality_reembed | pending   |    29

Reading the queue state

Three distinct categories tell the full pipeline story:

50 content_update / completed: the normal pipeline flow — content changed, trigger fired, worker embedded. This is Layers 1 and 2 doing their job.
5 content_update / skipped: the typo-level changes from Step 6 — the change detector said “not worth re-embedding.” This is Layer 2’s cost optimization.
29 quality_reembed / pending: the feedback loop’s contribution — these articles weren’t re-queued because their content changed (it may not have). They were re-queued because search quality dropped for queries related to them.

Why the quality_reembed change type matters: When the worker processes these items, it bypasses the change significance detector. If the detector were to analyze them, it might say “similarity=0.998 → SKIP” because the content barely changed. But that’s the whole point — the content didn’t change, yet the embeddings aren’t serving search well. The quality feedback overrides the filter.
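A minimal sketch of that bypass, assuming a hypothetical decide_action helper (the lab's worker may structure this differently) and assuming that a similarity at or above the 0.95 threshold counts as trivial:

```python
# Hypothetical routing helper; names are illustrative, not the lab's API.
SIMILARITY_THRESHOLD = 0.95  # threshold quoted in this series


def decide_action(change_type: str, similarity: float) -> str:
    """Return 'EMBED' or 'SKIP' for one queue item."""
    if change_type == "quality_reembed":
        # Quality feedback overrides the significance filter entirely.
        return "EMBED"
    # content_update: similarity between old and new content decides.
    # Treating "equal to the threshold" as trivial is an assumption.
    return "SKIP" if similarity >= SIMILARITY_THRESHOLD else "EMBED"


print(decide_action("content_update", 0.998))   # typo-level change
print(decide_action("quality_reembed", 0.998))  # same similarity, forced re-embed
print(decide_action("content_update", 0.93))    # paragraph addition
```

The only point of the sketch is the ordering: the change_type check comes before any similarity comparison, so a quality-driven item never reaches the filter.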

This is the three-layer architecture from Part 1 working in practice:

Triggers (Layer 1): react to content changes immediately — the broadest net
Change significance (Layer 2): filter out trivial changes, saving API cost — the optimization layer
Quality feedback (Layer 3): catch what the filter missed or what wasn’t about content changes at all — the safety net

Each layer compensates for the blind spots of the previous one.

Key Takeaways

1. The trigger is smarter than you think. Using UPDATE OF content means metadata-only changes never touch the embedding pipeline. In our test, 12% of mutations (6 out of 50) were filtered out at the trigger level, before any Python code ran. In a real knowledge base with tag edits, status changes, and metadata updates, this fraction could be substantially higher.

2. The change detector needs a baseline. On the first run, every article shows similarity=0.0 because there’s nothing to compare against. This is correct behavior, but you need to plan for the initial backfill being 100% EMBED. Budget the API cost and time accordingly.

3. The 0.95 threshold is validated. Typo-level changes (appending a period) scored 0.998+, paragraph additions scored ~0.93, and section rewrites scored 0.51–0.63. There’s a clear gap between “trivial” and “significant” that the threshold exploits. You don’t need machine learning or complex heuristics — cosine similarity with a simple threshold works.

4. SKIP LOCKED is production-ready. 4 workers, 39 items, zero overlap, 0.05 seconds. No external dependencies, no coordination service. This is the simplest correct way to build a concurrent work queue in PostgreSQL. Need more throughput? Add workers.

5. Quality metrics close the loop. The change significance filter reduces unnecessary writes and index churn, but it can’t know if a small change was semantically important — or if the embedding was poor to begin with. The quality feedback loop catches those cases by correlating low-quality retrievals with specific articles and forcing re-embedding. Three layers, each compensating for the blind spots of the previous one.

6. The bottleneck is the API, not PostgreSQL. 10 articles embedded in ~8 seconds, with each OpenAI call taking 300-600ms. In this lab, PostgreSQL’s trigger + queue overhead was negligible compared to API latency. If you need more throughput, add workers (the SKIP LOCKED queue adds no contention between them, so throughput grows until you hit the API’s rate limits) or switch to a local embedding model like nomic-embed-text via Ollama.
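As a rough illustration of takeaway 3, the threshold check can be sketched in plain Python. The vectors below are toy values, not real embeddings, and the function names are hypothetical; the lab's change_detector.py may compute similarity differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_significant(old_vec, new_vec, threshold=0.95):
    """True when the change is large enough to justify re-embedding."""
    return cosine_similarity(old_vec, new_vec) < threshold


old = [1.0, 0.0, 0.0]
tiny_edit = [1.0, 0.05, 0.0]   # toy stand-in for a typo-level change
rewrite = [1.0, 1.0, 0.0]      # toy stand-in for a section rewrite

print(round(cosine_similarity(old, tiny_edit), 4), is_significant(old, tiny_edit))
print(round(cosine_similarity(old, rewrite), 4), is_significant(old, rewrite))
```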

Running It Yourself

git clone https://github.com/boutaga/pgvector_RAG_search_lab.git
cd pgvector_RAG_search_lab

# Ensure Wikipedia database is loaded (see Lab 2 in README)
# You'll need: PostgreSQL 17+, pgvector, pgvectorscale, an OpenAI API key

# Step 1: Apply schema
psql -d wikipedia -f lab/05_embedding_versioning/schema.sql

# Step 3: Simulate changes (Step 2 is a manual SQL test)
python lab/05_embedding_versioning/examples/simulate_document_changes.py --count 50

# Step 4: Run change detector (all EMBED on first run — no baseline yet)
python lab/05_embedding_versioning/change_detector.py --analyze-queue

# Step 5: Create baseline embeddings (requires OPENAI_API_KEY env var)
python lab/05_embedding_versioning/worker.py --once --batch-size 10

# Step 6: Apply targeted mutations, then re-run detector
python lab/05_embedding_versioning/examples/targeted_mutations.py
python lab/05_embedding_versioning/change_detector.py --analyze-queue

# Step 7: Full freshness report
python lab/05_embedding_versioning/freshness_monitor.py --report

# Step 8: SKIP LOCKED concurrency demo
python lab/05_embedding_versioning/examples/demo_skip_locked.py --workers 4

# Step 9a: End-to-end trigger flow
python lab/05_embedding_versioning/examples/demo_trigger_flow.py

# Step 9b: Quality feedback loop
python lab/05_embedding_versioning/examples/demo_quality_drift.py

What’s Next

In the next post, I’ll explore benchmarking pgvectorscale’s StreamingDiskANN at scale — with real numbers on query latency, recall, index build time, and memory footprint at different dataset sizes. We’ll use the same Wikipedia dataset and the versioned embedding infrastructure from this lab.

The post RAG Series – Embedding Versioning LAB first appeared on dbi Blog.

RAG Series – Embedding Versioning with pgvector: Why Event-Driven Architecture Is a Precondition to AI data workflows

Adrien Obernesser — Sun, 22 Feb 2026 14:23:42 +0000

Introduction

“Make it simple.” This is a principle I keep repeating, and I’ll repeat it again here. Because when it comes to keeping your RAG system’s embeddings fresh, the industry has somehow made it complicated. External orchestrators, custom Python cron jobs, microservices that call microservices, Airflow DAGs with 47 tasks, all to answer a simple question: when my source data changes, how do I update the corresponding embeddings?

If you’ve followed this RAG series from Naive RAG through Hybrid Search, Adaptive RAG, and Agentic RAG, you’ve seen how retrieval quality is the backbone of any RAG system. But here’s what I didn’t cover explicitly: what happens when your retrieval quality silently degrades because your embeddings are stale?

This is the silent killer of RAG in production. Nobody complains about the embedding pipeline, they complain that the chatbot gives wrong answers. And by the time you trace it back to stale embeddings, the trust is already gone.

In this post, I want to bridge two worlds that I’ve been working in simultaneously: the CDC/event-driven pipelines I demonstrated in my PostgreSQL CDC to JDBC Sink and Oracle to PostgreSQL Migration with Flink CDC posts, and the RAG/pgvector world from this series.

The thesis is straightforward: if you’re serious about production RAG, you need event-driven embedding refresh. Batch re-embedding is technical debt waiting to happen. Event-driven architecture and data pipelines are a precondition to hosting similarity search. Many organizations that are still 100% batch-processed are already moving towards event-driven pipelines, typically because they need live KPIs instead of daily refreshes, and the available tooling is now mature enough to make that move practical. The “hidden” bonus of streaming data from your sources to a data lake and on to your data marts is that it makes embedding refreshes easier as well.

This is Part 1 covering the architecture and design patterns I feel are relevant. In Part 2, I walk through a hands-on LAB on 25,000 Wikipedia articles with real output, actual numbers, and some of the edge cases you would encounter applying this in practice.

The Problem: Stale Embeddings

Let me paint a picture that I’ve seen in real consulting engagements.

A company builds a RAG system for internal documentation. Knowledge base: 50,000 documents in PostgreSQL. Embeddings generated with text-embedding-3-small, stored in pgvector. Everything works great on day one.

Three months later:

2,000 documents have been updated
500 new documents have been added
300 documents have been deprecated
The embedding pipeline? It ran once during initial setup. Maybe someone re-ran it manually last month. Maybe not.

The result: your vector index is lying to you. Similarity search returns chunks from outdated documents. The LLM generates answers based on stale context. Users lose trust.

This is not a hypothetical. This is the reality of most RAG deployments I’ve encountered.

Why batch re-embedding doesn’t scale

The naive approach is: “just re-embed everything periodically.” Let’s do the math.

For 50,000 documents, assuming an average of 10 chunks per document:

500,000 chunks to embed, ~500 tokens each — that’s 250 million tokens
At ~$0.02 per 1M tokens with text-embedding-3-small: ~$5 per full re-embed (not terrible)
The OpenAI embeddings endpoint accepts arrays of inputs, so you can batch ~100 chunks per request. That’s ~5,000 requests. At Tier 1’s 3,000 RPM, RPM isn’t the bottleneck; TPM is. The real constraint is how fast the API will accept 250M tokens under your tier’s tokens-per-minute limit (check your project limits), so wall-clock time can range from under an hour to several hours.
During which, if you’re replacing embeddings in-place (the typical batch approach), your index is in a partially-stale state — some embeddings are new, some are old. The versioned schema I’ll show below avoids this, but most batch implementations don’t bother with versioning.
In our lab experience, heavy churn from bulk re-inserts can degrade StreamingDiskANN recall (pgvectorscale). The index handles incremental updates well, but re-embedding 500K rows at once is not “incremental” — validate this on your own workload and treat large backfills as an operational event.
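The arithmetic above can be reproduced with a small helper. The function name and the 1,000,000 tokens-per-minute figure are assumptions for illustration only; substitute your own tier's limit.

```python
import math


def estimate_full_reembed(docs, chunks_per_doc, tokens_per_chunk,
                          usd_per_million_tokens, chunks_per_request,
                          tokens_per_minute):
    """Back-of-the-envelope numbers for one full batch re-embed."""
    chunks = docs * chunks_per_doc
    tokens = chunks * tokens_per_chunk
    return {
        "chunks": chunks,
        "tokens": tokens,
        "cost_usd": tokens / 1_000_000 * usd_per_million_tokens,
        "requests": math.ceil(chunks / chunks_per_request),
        # Lower bound on wall-clock time if TPM is the only constraint.
        "minutes_at_tpm": tokens / tokens_per_minute,
    }


est = estimate_full_reembed(
    docs=50_000, chunks_per_doc=10, tokens_per_chunk=500,
    usd_per_million_tokens=0.02, chunks_per_request=100,
    tokens_per_minute=1_000_000,  # hypothetical limit, check your own tier
)
for key, value in est.items():
    print(f"{key}: {value}")
```

With the assumed limit, the token budget alone accounts for a bit over four hours, which is why the per-request count matters far less than TPM.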

Now multiply this by:

Multiple embedding models you might want to test
Multiple environments (dev, staging, production)
Frequency: weekly? daily? hourly?

The cost isn’t the API calls. The cost is the operational complexity: coordinating the backfill, monitoring progress, handling rate limit errors, and — critically — the lack of observability into which documents actually changed. Batch treats every document the same, whether it was modified yesterday or hasn’t been touched in six months.

The deeper problem: you can’t fix what you don’t measure

But there’s a problem that comes before stale embeddings, and in my consulting experience, it’s far more common: most organizations don’t measure retrieval quality at all. They deploy a RAG system, it works in demo, it goes to production, and then nobody instruments it. There is no precision@k, no nDCG, no confidence scoring. The embedding pipeline might be stale, or it might be fine — they literally cannot tell.

In the Adaptive RAG post, I introduced the metrics framework that makes retrieval quality measurable: precision@k (are the retrieved documents relevant?), recall@k (are we finding all the relevant documents?), nDCG@k (are the best results ranked first?), and confidence scores (how certain is the system about its top result?). In the Agentic RAG post, I added decision metrics on top of that — tracking whether the agent made the right call about when to retrieve. The evaluation framework in the pgvector_RAG_search_lab repository (lab/evaluation/metrics.py, compare_search_configs.py, k_balance_experiment.py) implements all of this concretely.

These metrics were originally designed to compare search strategies and tune parameters. But here’s the connection to embedding freshness that I want to make explicit: the same metrics that tell you whether your search is working also tell you whether your embeddings are drifting. If your weekly nDCG is declining, if your confidence distribution is shifting toward lower values, if precision@10 is dropping for a subset of queries — those are the leading indicators that your embeddings are falling behind your content. Not the queue depth, not the pipeline latency. The quality metrics.

I have seen architectures where teams built elaborate embedding pipelines — cron jobs, Airflow DAGs, custom orchestrators — but never implemented the measurement layer. The pipeline runs on schedule, embeddings get refreshed, and everyone assumes it’s working. But without retrieval quality metrics, you have no way to know if you are going in the right direction. You might be re-embedding documents that don’t need it (wasting API spend) and missing documents that do (degrading search quality). Worse, I have seen setups where the metrics exist but are so poorly instrumented — wrong ground truth sets, no temporal dimension, no per-topic breakdown — that the numbers are misleading. An aggregate nDCG of 0.82 can hide the fact that an entire topic cluster has dropped to 0.45.
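To make the last point concrete, here is a minimal sketch of a per-topic breakdown. The topic names, scores, and the 0.6 alert floor are invented for illustration; they are chosen so the aggregate comes out at 0.82 while one topic sits at 0.45.

```python
from collections import defaultdict
from statistics import mean


def ndcg_by_topic(rows):
    """Average nDCG per topic from (topic, ndcg) pairs."""
    buckets = defaultdict(list)
    for topic, ndcg in rows:
        buckets[topic].append(ndcg)
    return {topic: mean(scores) for topic, scores in buckets.items()}


def topics_below(rows, floor=0.6):
    """Topics whose average nDCG is under the alert floor."""
    return sorted(t for t, avg in ndcg_by_topic(rows).items() if avg < floor)


# Hypothetical evaluation log: (topic, nDCG of one query)
rows = (
    [("hr", 0.9125)] * 4
    + [("networking", 0.9125)] * 4
    + [("billing", 0.45)] * 2
)

overall = mean(ndcg for _, ndcg in rows)
print(f"aggregate nDCG: {overall:.2f}")
print("topics below floor:", topics_below(rows))
```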

Building the pipeline is one thing. Proving you’re going in the right direction is everything.

This is why this post covers both. The first two-thirds address the pipeline: how to detect changes, how to queue and process them, how to decide what’s worth re-embedding. But the final section — Monitoring Embedding Freshness — is where it all comes together. That’s where the retrieval quality metrics from the Adaptive RAG post become operational canaries for embedding drift. The pipeline reacts to content changes; the monitoring layer tells you whether the pipeline is actually keeping your RAG system healthy. You need both.

The Solution: Event-Driven Embedding Refresh

The answer is the same pattern I demonstrated in the CDC posts: react to changes as they happen.

Instead of asking “when should I re-embed?”, the question becomes: “a row changed — which embeddings need updating?”

Here’s the architecture I’m proposing:

There are three levels of sophistication here, and I want to walk through each one because not every project needs the most complex solution.

Level 1: PostgreSQL Triggers — The Simplest Path

If your source data and embeddings live in the same PostgreSQL instance (which they probably do if you’ve been following this series), you don’t need Flink. You don’t need Kafka. You need a trigger.

Schema design with versioning

First, let’s design a proper embedding table that supports versioning. This is the piece most tutorials skip:

-- Source table (your knowledge base)
CREATE TABLE documents (
    doc_id          BIGSERIAL PRIMARY KEY,
    title           TEXT NOT NULL,
    content         TEXT NOT NULL,
    category        TEXT,
    content_hash    TEXT GENERATED ALWAYS AS (md5(content)) STORED,
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    is_active       BOOLEAN NOT NULL DEFAULT true
);

-- Embedding table with versioning support
CREATE TABLE document_embeddings (
    embedding_id    BIGSERIAL PRIMARY KEY,
    doc_id          BIGINT NOT NULL REFERENCES documents(doc_id) ON DELETE CASCADE,
    chunk_index     INT NOT NULL,
    chunk_text      TEXT NOT NULL,
    embedding       vector(1536),       -- text-embedding-3-small
    model_name      TEXT NOT NULL DEFAULT 'text-embedding-3-small',
    model_version   TEXT NOT NULL DEFAULT 'v1',
    source_hash     TEXT NOT NULL,       -- md5 of the source content at embed time
    embedded_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
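    is_current      BOOLEAN NOT NULL DEFAULT true
    -- No table-level UNIQUE(doc_id, chunk_index, model_name, model_version):
    -- it would also cover is_current = false rows and reject the re-insert of
    -- the same chunk_index after a content change. Uniqueness of the current
    -- set is enforced by the partial unique index uq_doc_current_per_space below.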
);

-- Index for similarity search (only current embeddings)
-- Using pgvectorscale's StreamingDiskANN for better performance at scale
CREATE INDEX idx_embeddings_diskann ON document_embeddings 
    USING diskann (embedding vector_cosine_ops)
    WHERE is_current = true;

-- Index for version lookups
CREATE INDEX idx_embeddings_version ON document_embeddings (doc_id, model_version, is_current);

-- Index for staleness detection
CREATE INDEX idx_embeddings_staleness ON document_embeddings (source_hash, is_current)
    WHERE is_current = true;

-- Safety: prevent two "current" chunk sets for the same doc + model space
CREATE UNIQUE INDEX uq_doc_current_per_space
    ON document_embeddings (doc_id, model_name, model_version, chunk_index)
    WHERE is_current;

A few things to notice here:

content_hash: a generated column that gives us a fast way to detect if content actually changed (not just updated_at). If you’re adding this to an existing table with data, note that ALTER TABLE ... ADD COLUMN ... GENERATED ALWAYS AS ... STORED has to compute and store the value for every existing row, so plan a maintenance window. The alternative is a plain column maintained by a BEFORE INSERT OR UPDATE trigger that sets NEW.content_hash := md5(NEW.content), plus a one-off backfill of existing rows. For the queue trigger below, both approaches are functionally equivalent.
source_hash on the embedding: captures what the source content looked like when the embedding was generated
is_current: soft versioning — old embeddings are kept for rollback. The partial unique index uq_doc_current_per_space guarantees at the database level that you can never have two “current” chunk sets for the same document within the same model space — even if your application has a bug.
Partial DiskANN index: only indexes current embeddings, so similarity search is clean and performant at scale. Partial indexes (CREATE INDEX ... WHERE ...) are standard PostgreSQL — validated in our lab with pgvectorscale’s StreamingDiskANN (see Part 2 — Lab Walkthrough). If your pgvectorscale version doesn’t support partial predicates, pgvector’s HNSW partial index is an equivalent fallback.
model_version: critical for model upgrades (more on this later)
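The staleness check that content_hash and source_hash enable can be sketched outside the database as follows. This is an illustrative in-memory model with hypothetical names; it assumes a UTF-8 database so that Python's MD5 of the encoded text matches PostgreSQL's md5(content).

```python
import hashlib


def md5_hex(text: str) -> str:
    """Hex MD5 of the UTF-8 bytes of text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def find_stale_docs(documents, current_source_hashes):
    """
    documents: {doc_id: content} for active documents.
    current_source_hashes: {doc_id: source_hash} taken from is_current rows.
    Returns doc_ids whose current embeddings are missing or built from old content.
    """
    stale = []
    for doc_id, content in documents.items():
        if current_source_hashes.get(doc_id) != md5_hex(content):
            stale.append(doc_id)
    return sorted(stale)


documents = {1: "alpha", 2: "beta (edited)", 3: "gamma"}
current_source_hashes = {
    1: md5_hex("alpha"),  # fresh
    2: md5_hex("beta"),   # embedded before the edit
    # doc 3 has never been embedded
}
print(find_stale_docs(documents, current_source_hashes))
```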

The embedding queue pattern

Rather than embedding synchronously in a trigger (which would block the transaction and hit external APIs), we use a queue pattern:

-- Queue table for pending embedding work
CREATE TABLE embedding_queue (
    queue_id        BIGSERIAL PRIMARY KEY,
    doc_id          BIGINT NOT NULL,        -- no FK: 'delete' events are queued after the source row is gone
    change_type     TEXT NOT NULL DEFAULT 'content_update',
    content_hash    TEXT,
    queued_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
    claimed_at      TIMESTAMPTZ,            -- set when a worker claims the item
    processed_at    TIMESTAMPTZ,
    status          TEXT NOT NULL DEFAULT 'pending' 
                    CHECK (status IN ('pending', 'processing', 'completed', 'failed', 'skipped')),
    error_message   TEXT,
    retry_count     INT NOT NULL DEFAULT 0
);

CREATE INDEX idx_queue_pending ON embedding_queue (status, queued_at) 
    WHERE status = 'pending';

Note the skipped status — this is used by the change significance detector (covered later) when it determines that a content change is too minor to warrant re-embedding. The item stays in the queue for audit purposes, but no embedding API call is made.

The trigger

CREATE OR REPLACE FUNCTION fn_queue_embedding_update()
RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO embedding_queue (doc_id, change_type, content_hash)
        VALUES (NEW.doc_id, 'content_update', NEW.content_hash);
        RETURN NEW;
        
    ELSIF TG_OP = 'UPDATE' THEN
        -- Only queue if content actually changed (not just metadata)
        IF OLD.content_hash IS DISTINCT FROM NEW.content_hash THEN
            INSERT INTO embedding_queue (doc_id, change_type, content_hash)
            VALUES (NEW.doc_id, 'content_update', NEW.content_hash);
        END IF;
        RETURN NEW;
        
    ELSIF TG_OP = 'DELETE' THEN
        INSERT INTO embedding_queue (doc_id, change_type, content_hash)
        VALUES (OLD.doc_id, 'delete', OLD.content_hash);
        RETURN OLD;
    END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_embedding_queue
    AFTER INSERT OR UPDATE OR DELETE ON documents
    FOR EACH ROW
    EXECUTE FUNCTION fn_queue_embedding_update();

The key insight here is the content_hash comparison on UPDATE. If someone updates the category or title but the actual content hasn’t changed, we don’t waste an API call re-embedding identical text. This is a simple optimization but it saves real money at scale. In my lab tests on 25K Wikipedia articles, 12% of simulated mutations were metadata-only — the trigger correctly skipped all of them.

An alternative approach that’s even more targeted: use AFTER INSERT OR UPDATE OF content to only fire the trigger when the content column is modified. This is what I did in the LAB (see Part 2) because the articles table didn’t have a content_hash column originally. Both approaches achieve the same goal.

DBA note on UPDATE OF: PostgreSQL decides whether a column-specific trigger fires from the SET list of the UPDATE command, not from the actual row diff. If a BEFORE UPDATE trigger on the same table silently modifies NEW.content without content appearing in the original SET clause, an AFTER UPDATE OF content trigger won’t fire: the content changed, but PostgreSQL doesn’t know. This is documented behavior. The content_hash comparison approach above doesn’t have this blind spot, because it compares actual values regardless of which columns were in the SET list.
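The difference between the two trigger styles can be modeled in a few lines. This is a toy model of the firing rules described above, not PostgreSQL code; the function names are hypothetical.

```python
import hashlib


def update_of_fires(set_list, watched_column="content"):
    """Column-specific trigger: fires only if the column is in the SET list."""
    return watched_column in set_list


def hash_compare_fires(old_row, new_row):
    """content_hash style: fires whenever the stored content actually differs."""
    old_hash = hashlib.md5(old_row["content"].encode("utf-8")).hexdigest()
    new_hash = hashlib.md5(new_row["content"].encode("utf-8")).hexdigest()
    return old_hash != new_hash


# Scenario: UPDATE documents SET title = 'B' ...
# and a BEFORE UPDATE trigger rewrites NEW.content behind the scenes.
old_row = {"title": "A", "content": "hello"}
new_row = {"title": "B", "content": "hello (normalized)"}
set_list = {"title"}

print("UPDATE OF content fires:", update_of_fires(set_list))
print("hash comparison fires:  ", hash_compare_fires(old_row, new_row))
```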

The worker (Python)

The worker process polls the queue and generates embeddings. This is intentionally simple — no frameworks, no dependencies beyond psycopg and openai:

#!/usr/bin/env python3
"""
embedding_worker.py — polls the embedding_queue and processes pending items.
Run as: python3 embedding_worker.py
(This listing has no argument parsing; the lab's worker.py additionally
accepts --once and --batch-size for single-batch test runs.)
"""

import os, time, hashlib
import psycopg
from openai import OpenAI

DB_URL = os.environ["DATABASE_URL"]
client = OpenAI()

MODEL_NAME    = "text-embedding-3-small"
MODEL_VERSION = "v1"
CHUNK_SIZE    = 500   # tokens (approximate via chars / 4)
CHUNK_OVERLAP = 50
BATCH_SIZE    = 10    # queue items per cycle
POLL_INTERVAL = 5     # seconds


def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    """Simple character-based chunking. Replace with your preferred strategy."""
    char_size = size * 4  # rough token-to-char ratio
    char_overlap = overlap * 4
    chunks = []
    start = 0
    while start < len(text):
        end = start + char_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # last chunk reached; avoid emitting an overlap-only tail
        start = end - char_overlap
    return chunks


def generate_embeddings(texts: list[str]) -> list[list[float]]:
    """Batch embedding call to OpenAI."""
    response = client.embeddings.create(
        input=texts,
        model=MODEL_NAME
    )
    return [item.embedding for item in response.data]


def process_insert_or_update(conn, doc_id: int, content_hash: str):
    """Generate fresh embeddings for a document."""
    with conn.cursor() as cur:
        # Fetch current document content
        cur.execute(
            "SELECT content FROM documents WHERE doc_id = %s AND is_active = true",
            (doc_id,)
        )
        row = cur.fetchone()
        if not row:
            return  # document was deleted or deactivated since queuing
        
        content = row[0]
        
        # Verify content hasn't changed again since queuing
        current_hash = hashlib.md5(content.encode()).hexdigest()
        if current_hash != content_hash:
            return  # content changed again, a newer queue entry will handle it
        
        # Check if embeddings already exist for this hash (idempotency)
        # Scoped to model_name + model_version so parallel shadow-mode workers
        # don't falsely consider each other's embeddings as "already done"
        cur.execute(
            """SELECT 1 FROM document_embeddings 
               WHERE doc_id = %s AND source_hash = %s
                 AND model_name = %s AND model_version = %s
                 AND is_current = true
               LIMIT 1""",
            (doc_id, content_hash, MODEL_NAME, MODEL_VERSION)
        )
        if cur.fetchone():
            return  # already embedded with this content
        
        # Chunk and embed
        chunks = chunk_text(content)
        embeddings = generate_embeddings(chunks)
        
        # Mark old embeddings as not current — scoped to this model space only
        # so shadow-mode v2 embeddings aren't flipped by v1 workers (or vice versa)
        cur.execute(
            """UPDATE document_embeddings 
               SET is_current = false 
               WHERE doc_id = %s
                 AND model_name = %s AND model_version = %s
                 AND is_current = true""",
            (doc_id, MODEL_NAME, MODEL_VERSION)
        )
        
        # Insert new embeddings
        for idx, (chunk, emb) in enumerate(zip(chunks, embeddings)):
            cur.execute(
                """INSERT INTO document_embeddings 
                   (doc_id, chunk_index, chunk_text, embedding, 
                    model_name, model_version, source_hash)
                   VALUES (%s, %s, %s, %s, %s, %s, %s)""",
                (doc_id, idx, chunk, emb, MODEL_NAME, MODEL_VERSION, content_hash)
            )
        
        conn.commit()


def process_delete(conn, doc_id: int):
    """Mark embeddings as not current when source is deleted."""
    with conn.cursor() as cur:
        cur.execute(
            """UPDATE document_embeddings 
               SET is_current = false 
               WHERE doc_id = %s
                 AND model_name = %s AND model_version = %s
                 AND is_current = true""",
            (doc_id, MODEL_NAME, MODEL_VERSION)
        )
        conn.commit()


def poll_and_process():
    """Main loop: claim a batch, process, repeat."""
    with psycopg.connect(DB_URL) as conn:
        while True:
            with conn.cursor() as cur:
                # Claim a batch (SELECT FOR UPDATE SKIP LOCKED)
                cur.execute("""
                    UPDATE embedding_queue 
                    SET status = 'processing', claimed_at = now()
                    WHERE queue_id IN (
                        SELECT queue_id FROM embedding_queue
                        WHERE status = 'pending'
                        ORDER BY queued_at
                        LIMIT %s
                        FOR UPDATE SKIP LOCKED
                    )
                    RETURNING queue_id, doc_id, change_type, content_hash
                """, (BATCH_SIZE,))
                
                batch = cur.fetchall()
                conn.commit()
            
            if not batch:
                time.sleep(POLL_INTERVAL)
                continue
            
            for queue_id, doc_id, change_type, content_hash in batch:
                try:
                    if change_type in ('content_update',):
                        process_insert_or_update(conn, doc_id, content_hash)
                    elif change_type == 'delete':
                        process_delete(conn, doc_id)
                    
                    with conn.cursor() as cur:
                        cur.execute(
                            """UPDATE embedding_queue 
                               SET status = 'completed', processed_at = now()
                               WHERE queue_id = %s""",
                            (queue_id,)
                        )
                        conn.commit()
                        
                except Exception as e:
                    conn.rollback()
                    with conn.cursor() as cur:
                        cur.execute(
                            """UPDATE embedding_queue 
                               SET status = CASE WHEN retry_count >= 3 THEN 'failed' ELSE 'pending' END,
                                   retry_count = retry_count + 1,
                                   error_message = %s
                               WHERE queue_id = %s""",
                            (str(e), queue_id)
                        )
                        conn.commit()
                    print(f"Error processing queue_id={queue_id}: {e}")


if __name__ == "__main__":
    print("Embedding worker started. Polling...")
    poll_and_process()

What I like about this pattern:

It’s transactional: the trigger and the queue insert are in the same transaction. If the INSERT/UPDATE fails, no queue entry is created.
It’s idempotent: the worker checks content_hash before embedding, so duplicate queue entries are harmless.
It uses SELECT FOR UPDATE SKIP LOCKED for safe concurrency (see below).
It handles retries gracefully: failed items go back to pending with a counter.
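The overlap logic in chunk_text is easier to follow on a tiny input. The sketch below is a standalone variant that works directly in characters and stops once the end of the text is reached; chunk_chars and its parameters are illustrative names, not part of the worker.

```python
def chunk_chars(text: str, size: int, overlap: int) -> list[str]:
    """Fixed-size character chunks where consecutive chunks share `overlap` chars."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # the text is fully covered
        start = end - overlap  # step back so the next chunk overlaps
    return chunks


chunks = chunk_chars("abcdefghij", size=4, overlap=1)
print(chunks)
```

Each chunk starts on the last character of the previous one, which is the same effect CHUNK_OVERLAP has in the worker, only at a much larger scale.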

Deep dive: SELECT FOR UPDATE SKIP LOCKED

This is the core of why this queue pattern works, and it’s a PostgreSQL feature that most people underuse. Let me explain it properly because it’s one of those things that looks simple in the SQL but has profound implications for how you scale workers.

The problem: you want to run multiple embedding workers in parallel to process the queue faster. But if two workers pick the same queue item, you’ve wasted an API call (double embedding) or worse, you get race conditions on the document_embeddings table.

The classic solutions are:

External locking (Redis, ZooKeeper): adds infrastructure, adds failure modes
Application-level partitioning (worker 1 handles doc_id % 3 = 0, worker 2 handles doc_id % 3 = 1…): rigid, doesn’t adapt to load
SELECT FOR UPDATE: locks the rows, but the second worker blocks and waits until the first one commits. This serializes your workers — you’re back to single-threaded throughput.

SKIP LOCKED changes everything. Here’s what happens step by step:

Timeline:
─────────────────────────────────────────────────────────────────

Worker A (t=0):
    BEGIN;
    UPDATE embedding_queue SET status = 'processing', claimed_at = now()
    WHERE queue_id IN (
        SELECT queue_id FROM embedding_queue
        WHERE status = 'pending'
        ORDER BY queued_at
        LIMIT 5
        FOR UPDATE SKIP LOCKED    -- ← locks rows 1,2,3,4,5
    )
    RETURNING queue_id, doc_id, change_type, content_hash;
    
    → Returns: queue_id 1, 2, 3, 4, 5
    → These 5 rows are now locked by Worker A's transaction

Worker B (t=1, while Worker A's claim transaction is still open):
    BEGIN;
    UPDATE embedding_queue SET status = 'processing', claimed_at = now()
    WHERE queue_id IN (
        SELECT queue_id FROM embedding_queue
        WHERE status = 'pending'
        ORDER BY queued_at
        LIMIT 5
        FOR UPDATE SKIP LOCKED    -- ← sees rows 1-5 are locked, SKIPS them
    )
    RETURNING queue_id, doc_id, change_type, content_hash;
    
    → Returns: queue_id 6, 7, 8, 9, 10
    → No blocking, no waiting, no conflict

Worker C (t=2):
    → Gets queue_id 11, 12, 13, 14, 15
    → Same story: zero contention

The key behaviors:

FOR UPDATE: tells PostgreSQL “I intend to modify these rows, lock them for me”
SKIP LOCKED: tells PostgreSQL “if a row is already locked by someone else, don’t wait — just pretend it doesn’t exist and move to the next one”

This means:

Workers never block each other — no waiting, no deadlocks
Workers never process the same item — each item is claimed by exactly one worker
You can scale horizontally just by starting more worker processes
If a worker crashes before its claim transaction commits, the transaction is rolled back, the locks are released, and the rows are 'pending' and visible to other workers again. If it crashes after the claim has been committed (the worker above commits right after the UPDATE), the rows stay in 'processing' with no lock held, so you need a cleanup mechanism for crashed workers (more on that below)

What happens without SKIP LOCKED?

Let’s compare. With plain FOR UPDATE (no SKIP LOCKED):

Worker A (t=0):  SELECT ... FOR UPDATE LIMIT 5;  → gets rows 1-5, locks them
Worker B (t=1):  SELECT ... FOR UPDATE LIMIT 5;  → tries row 1... BLOCKED ⏳
                                                    waits for Worker A to COMMIT
Worker A (t=10): COMMIT;                          → releases locks
Worker B (t=10): → lock acquired, rows 1-5 re-checked (but they're already processed!)
                 → returns empty set because status is no longer 'pending'
                 → wasted 10 seconds waiting for nothing

With SKIP LOCKED:

Worker A (t=0):  SELECT ... FOR UPDATE SKIP LOCKED LIMIT 5;  → gets rows 1-5
Worker B (t=1):  SELECT ... FOR UPDATE SKIP LOCKED LIMIT 5;  → gets rows 6-10 instantly
                 → zero wait time, immediate useful work

This is exactly the behavior you want for a work queue.
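
To make the difference concrete, here is a toy in-memory model of the two locking modes. It is only a sketch: claim(), the locked dictionary and the worker names are hypothetical, and PostgreSQL's real row-locking machinery is far more involved. The model captures only the visible behavior described above: with SKIP LOCKED a worker steps over rows held by someone else, without it the worker has to wait.

```python
# Toy model of FOR UPDATE [SKIP LOCKED]; not how PostgreSQL implements locks.

def claim(queue, locked, worker, limit, skip_locked):
    """Return the queue_ids claimed by `worker`, or None if it would block.

    queue  : list of pending queue_ids, already in queued_at order
    locked : dict queue_id -> worker currently holding the row lock
    """
    claimed = []
    for queue_id in queue:
        if len(claimed) == limit:
            break
        holder = locked.get(queue_id)
        if holder is not None and holder != worker:
            if skip_locked:
                continue   # SKIP LOCKED: act as if the row were not there
            return None    # plain FOR UPDATE: wait for the holder to finish
        claimed.append(queue_id)
    for queue_id in claimed:
        locked[queue_id] = worker
    return claimed


queue = list(range(1, 16))   # queue_id 1..15, all pending
locks = {}
batch_a = claim(queue, locks, "A", 5, skip_locked=True)
batch_b = claim(queue, locks, "B", 5, skip_locked=True)
batch_c = claim(queue, locks, "C", 5, skip_locked=True)
print(batch_a, batch_b, batch_c)

# Without SKIP LOCKED, worker B blocks on row 1 while A holds it
blocked = claim(queue, {1: "A"}, "B", 5, skip_locked=False)
print(blocked)
```

The three batches come out disjoint (1-5, 6-10, 11-15), while the plain FOR UPDATE call returns None, which stands in for "blocked until Worker A commits".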

The crash recovery problem

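There’s one subtlety: if Worker A claims rows 1-5, commits the claim (status = 'processing'), and then crashes while embedding (process killed, OOM, network failure), those rows are stuck in 'processing' forever. The PostgreSQL row locks are gone (they were released at commit), but the status column still says 'processing', so no other worker will ever pick them up. A crash before the commit is harmless: the rollback puts the rows back to 'pending'.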

You need a reaper — a periodic cleanup that reclaims stale items:

-- Reclaim items stuck in 'processing' for more than 5 minutes
-- (embedding should never take that long)
-- Uses claimed_at, not queued_at — an item queued 30 minutes ago
-- but claimed 10 seconds ago should NOT be reclaimed
UPDATE embedding_queue 
SET status = 'pending',
    retry_count = retry_count + 1,
    error_message = 'reclaimed: worker timeout after 5 minutes'
WHERE status = 'processing' 
AND claimed_at < now() - INTERVAL '5 minutes';

Run this every minute via pg_cron or a simple cron job. It’s a safety net, not the primary flow.
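
One thing the reaper SQL above does not do is cap retry_count, so an item that always crashes its worker would be reclaimed forever. The sketch below shows one possible decision rule in plain Python. The max_retries parameter and the 'failed' status are assumptions added for illustration; they are not part of the schema shown in this post.

```python
from datetime import datetime, timedelta, timezone

def reap(item, now, timeout=timedelta(minutes=5), max_retries=3):
    """Return the item's state after one reaper pass (as a new dict).

    Mirrors the SQL condition: status = 'processing' AND claimed_at < now - timeout.
    'failed' and max_retries are hypothetical additions.
    """
    if item["status"] != "processing" or item["claimed_at"] >= now - timeout:
        return dict(item)                      # not stale: leave untouched
    retries = item["retry_count"] + 1
    status = "pending" if retries <= max_retries else "failed"
    return {**item, "status": status, "retry_count": retries}


now = datetime(2026, 3, 1, 12, 0, tzinfo=timezone.utc)
stale = {"status": "processing", "claimed_at": now - timedelta(minutes=7), "retry_count": 0}
poison = {**stale, "retry_count": 3}           # already reclaimed three times
fresh = {**stale, "claimed_at": now - timedelta(seconds=10)}

for item in (stale, poison, fresh):
    print(reap(item, now)["status"])
```

A 'failed' item stays out of the claim query (which only selects 'pending') until someone looks at its error_message.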

Why this is better than external queue systems

For this specific use case (embedding queue), SKIP LOCKED gives you an in-database work queue with:

ACID guarantees: the queue and the embeddings are in the same database, same transaction scope
No external dependencies: no Redis, no RabbitMQ, no SQS
Effectively-once processing: the queue delivers at least once (an item can be reclaimed and retried), and the content_hash idempotency check makes a retry harmless
Observability: it’s just a table — SELECT count(*) FROM embedding_queue WHERE status = 'pending' is your queue depth, queryable from any SQL client or monitoring tool

The limitation is throughput: if you’re processing millions of queue items per second, you want Kafka or SQS. For an embedding queue where each item takes 100-500ms to process (API call), PostgreSQL can easily handle thousands of items per minute. That’s more than enough for any knowledge base I’ve seen in production.
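
The arithmetic behind that claim is easy to check. The sketch below assumes each worker processes one item at a time and that the embedding API call dominates the processing time; database overhead and API rate limits are ignored, and both function names are hypothetical.

```python
import math

def items_per_minute(workers, ms_per_item):
    """Throughput of `workers` sequential workers at a given per-item latency."""
    return workers * 60_000 / ms_per_item

def workers_needed(target_per_minute, ms_per_item):
    """Smallest number of workers that reaches the target throughput."""
    return math.ceil(target_per_minute * ms_per_item / 60_000)

print(items_per_minute(1, 500))    # one worker, slow API call  -> 120.0
print(items_per_minute(1, 100))    # one worker, fast API call  -> 600.0
print(workers_needed(1000, 500))   # 1,000 items/min at 500 ms  -> 9
print(workers_needed(1000, 100))   # 1,000 items/min at 100 ms  -> 2
```

So "thousands of items per minute" means somewhere between a couple of workers and a few dozen, which SKIP LOCKED handles without contention.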

What I don’t like:

It’s polling-based: the worker checks every 5 seconds. For most use cases this is fine, but if you need sub-second latency, you want LISTEN/NOTIFY.
It requires a separate process to run. In production, that means a systemd service or a Kubernetes deployment.

Upgrading to LISTEN/NOTIFY

If you want to eliminate polling and react instantly, PostgreSQL’s LISTEN/NOTIFY mechanism is your friend. Add this to the trigger:

-- Add to fn_queue_embedding_update(), after each INSERT INTO embedding_queue:
-- Use NEW.doc_id for INSERT/UPDATE, OLD.doc_id for DELETE
PERFORM pg_notify('embedding_work', json_build_object(
    'doc_id', COALESCE(NEW.doc_id, OLD.doc_id),
    'operation', TG_OP
)::text);

And in the worker, replace the time.sleep() loop with:

import psycopg

# Requires psycopg >= 3.2 for the timeout / stop_after parameters
conn = psycopg.connect(DB_URL, autocommit=True)
conn.execute("LISTEN embedding_work")

while True:
    # Wait up to 5 seconds for a notification; the generator stops
    # after the first one or when the timeout expires
    for _ in conn.notifies(timeout=5.0, stop_after=1):
        pass
    # Woken by a NOTIFY or by the timeout: drain the queue either way,
    # so items whose notification was missed still get processed
    process_pending_batch(conn)

This gives you near-real-time embedding refresh. The 5-second timeout is only a safety net for missed notifications, not the primary trigger.

Level 2: Logical Replication — Cross-Database Embedding Sync

Now let’s go a level up. What if your source data lives in a different PostgreSQL instance than your vector store? Or what if the team that manages the knowledge base doesn’t want triggers on their production tables?

This is where PostgreSQL logical replication becomes the CDC mechanism. It’s built into PostgreSQL, it reads the WAL, and it has near-zero impact on the source.

The setup

On the source (knowledge base database):

-- Ensure WAL is configured for logical replication
ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_replication_slots = 10;
ALTER SYSTEM SET max_wal_senders = 10;
-- Restart required

-- Create a publication for the documents table
CREATE PUBLICATION pub_documents FOR TABLE documents;

On the target (vector database, different instance):

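-- Create the target table. Logical replication matches tables by
-- schema-qualified name, so it must be called documents here too.
-- The subscriber needs every published column (to replicate only a
-- subset, add a column list to the publication, PG 15+)
CREATE TABLE documents (
    doc_id          BIGINT PRIMARY KEY,
    title           TEXT NOT NULL,
    content         TEXT NOT NULL,
    content_hash    TEXT,
    updated_at      TIMESTAMPTZ,
    is_active       BOOLEAN
);

-- Create the subscription
CREATE SUBSCRIPTION sub_documents
    CONNECTION 'host=source-db port=5432 dbname=knowledge_base user=replicator password=...'
    PUBLICATION pub_documents
    WITH (copy_data = true);  -- initial snapshot

Now the documents table on your vector database is a replica that is automatically kept in sync via WAL streaming. Every INSERT, UPDATE, DELETE on the source is replicated in near-real-time.

From here, you add the same trigger + queue + worker pattern from Level 1, but on the replicated documents table. One catch: the apply worker runs with session_replication_role = replica, so an ordinary trigger does not fire for replicated changes. Enable it with ALTER TABLE ... ENABLE ALWAYS TRIGGER (or ENABLE REPLICA TRIGGER). The source database team doesn’t need to know or care about your embedding pipeline.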

Architecture

Why this is powerful:

Minimal impact on source: no triggers on the production tables, just one replication connection (walsender) decoding the WAL
Separation of concerns: the DBA managing the knowledge base doesn’t need to understand embeddings
Built-in catch-up: if the subscriber goes down, the replication slot retains the WAL on the source; if only the embedding worker goes down, changes pile up in the queue table. Either way, when the component comes back, all changes are processed in order
No external dependencies: this is pure PostgreSQL, no Kafka, no Flink, no cloud services

Limitations:

Logical replication is PG → PG only (unlike Flink CDC which can source from Oracle, MySQL, etc.)
DDL is not replicated: if the source adds a column, you need to handle it manually
The replication slot retains WAL until consumed — ⚠️ Production pitfall: if the subscriber is down for too long, WAL can fill up the source disk. Set max_slot_wal_keep_size (PG 13+) to cap retention, and monitor pg_replication_slots for inactive slots. DBAs: this is the #1 risk with logical replication.
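
The slot monitoring mentioned in the last point can be sketched as a small check. The SQL uses the pg_replication_slots view and pg_wal_lsn_diff(); the slots_at_risk() helper, its 10 GB default and the sample rows are assumptions for illustration. How you run the query and where you send the alerts depends on your monitoring stack.

```python
# Query to run on the source; wal_status is available from PG 13.
SLOT_LAG_SQL = """
SELECT slot_name, active, wal_status,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
"""

def slots_at_risk(rows, max_retained_bytes=10 * 1024**3):
    """rows: iterable of (slot_name, active, wal_status, retained_bytes).

    Returns a list of (slot_name, [reasons]) for slots worth alerting on.
    """
    alerts = []
    for slot_name, active, wal_status, retained_bytes in rows:
        reasons = []
        if not active:
            reasons.append("inactive")
        if wal_status in ("unreserved", "lost"):
            reasons.append(f"wal_status={wal_status}")
        if retained_bytes is not None and retained_bytes > max_retained_bytes:
            reasons.append("retained WAL above limit")
        if reasons:
            alerts.append((slot_name, reasons))
    return alerts


# Made-up sample rows, shaped like the query result
sample = [
    ("sub_documents", True, "reserved", 1024),
    ("old_slot", False, "extended", 50 * 1024**3),
    ("lost_slot", False, "lost", None),
]
alerts = slots_at_risk(sample)
for name, reasons in alerts:
    print(name, reasons)
```

With a real connection, the rows would come from something like conn.execute(SLOT_LAG_SQL).fetchall().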

Level 3: Flink CDC — When the Source Isn’t PostgreSQL (and When to Skip Re-embedding)

If your knowledge base lives in Oracle, MySQL, or you need to fan out to multiple targets (pgvector + Elasticsearch + data lake), then we’re back in the territory of my CDC posts.

But here’s where it gets really interesting. Flink CDC lets us do something the trigger and logical replication approaches can only do inside the database: work with both the before and after images of every row change in the pipeline itself. Debezium, which Flink CDC uses under the hood, captures the row state before and after the UPDATE (for a PostgreSQL source, the full before image requires REPLICA IDENTITY FULL on the table). This means we can evaluate whether a change is significant enough to warrant re-embedding directly inside the pipeline, before hitting any embedding API.

Why this matters

Not every UPDATE to a document requires a new embedding. Think about it:

Someone fixes a typo: “PostgreSLQ” → “PostgreSQL” — probably not worth re-embedding
Someone updates a metadata field (status, last_reviewed_by) — definitely not worth re-embedding (metadata filtering should be done in the WHERE clause)
Someone rewrites two paragraphs and adds a new section — yes, re-embed
Someone changes a single KPI number in a financial report — depends on context, but the semantic meaning shifted

In a busy knowledge base, most row-level changes are minor. If your pipeline blindly re-embeds on every UPDATE, you’re burning API credits, creating unnecessary load on the embedding worker, and churning your DiskANN index for no semantic gain. The question is: can we be smarter about this?

The architecture with change significance filtering

The key insight: separate the data replication (all changes) from the embedding trigger (only significant changes). The data mart gets everything — it’s a faithful replica. But the embedding queue only receives changes where the content actually shifted enough to matter semantically.

Change significance: the approaches

There are several ways to evaluate whether a change is “significant enough” for re-embedding. I want to walk through each one because they have very different trade-offs.

Approach 1: Column-aware filtering (simplest, start here)

The cheapest filter: only trigger re-embedding when specific content columns change. If someone updates status, last_reviewed_by, category, or any metadata field, skip the embedding entirely.
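
Stripped of any Flink specifics, the rule is a comparison of the content columns between the before and after images. A minimal sketch, assuming title and content are the only columns that feed the embedding (CONTENT_COLUMNS and needs_embedding() are hypothetical names):

```python
CONTENT_COLUMNS = ("title", "content")   # assumption: only these feed the embedding

def needs_embedding(before, after, content_columns=CONTENT_COLUMNS):
    """Decide from the before/after row images whether to queue a re-embedding.

    before is None for an INSERT, after is None for a DELETE.
    """
    if before is None:
        return True          # new document: always embed
    if after is None:
        return False         # deleted document: nothing to embed
    return any(before.get(col) != after.get(col) for col in content_columns)


old = {"doc_id": 1, "title": "WAL", "content": "Write-ahead log", "status": "draft"}
metadata_only = {**old, "status": "published"}
rewritten = {**old, "content": "Write-ahead logging explained"}

print(needs_embedding(old, metadata_only))   # False
print(needs_embedding(old, rewritten))       # True
print(needs_embedding(None, old))            # True
```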

In Flink SQL, the Debezium-based CDC connector carries the operation type (op) and both the old and new values in each change event. Here’s the source and sink setup:

-- CDC source table with before/after access
CREATE TABLE src_documents (
    doc_id          BIGINT,
    title           STRING,
    content         STRING,
    category        STRING,
    status          STRING,
    updated_at      TIMESTAMP(3),
    PRIMARY KEY (doc_id) NOT ENFORCED
) WITH (
    'connector' = 'postgres-cdc',
    'hostname' = '172.19.0.4',
    'port' = '5432',
    'username' = 'postgres',
    'password' = '...',
    'database-name' = 'knowledge_base',
    'schema-name' = 'public',
    'table-name' = 'documents',
    'slot.name' = 'flink_documents_slot',
    'decoding.plugin.name' = 'pgoutput',
    'scan.incremental.snapshot.enabled' = 'true'
);

-- JDBC sink for ALL changes (data mart replication)
CREATE TABLE dm_documents (
    doc_id          BIGINT,
    title           STRING,
    content         STRING,
    category        STRING,
    status          STRING,
    updated_at      TIMESTAMP(3),
    PRIMARY KEY (doc_id) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:postgresql://172.20.0.4:5432/vector_db',
    'table-name' = 'documents_replica',
    'username' = 'postgres',
    'password' = '...',
    'driver' = 'org.postgresql.Driver'
);

-- Replicate everything to the data mart
INSERT INTO dm_documents SELECT * FROM src_documents;

For the embedding queue, we need to be selective. This is where a Flink SQL view or a ProcessFunction comes in. Since Flink SQL CDC doesn’t natively expose the before-image in the SELECT, the simplest approach is to use the content_hash strategy from Level 1: the trigger on documents_replica compares content_hash and only queues when it actually changed.

But if you want the filtering to happen inside Flink (before hitting the target at all), you need a UDF.

Approach 2: Text diff ratio (UDF — the sweet spot)

This is where it gets interesting. We register a custom Flink UDF that computes the similarity ratio between the old and new content, and only emits the row to the embedding queue when the change exceeds a threshold.

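package com.example;

import org.apache.flink.table.annotation.DataTypeHint;
import org.apache.flink.table.annotation.FunctionHint;
import org.apache.flink.table.functions.ScalarFunction;

/**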
 * Flink UDF: computes text change ratio between two strings.
 * Returns a value between 0.0 (completely different) and 1.0 (identical).
 * 
 * Uses a simplified approach: character-level diff ratio.
 * For production, consider token-level or sentence-level comparison.
 */
@FunctionHint(output = @DataTypeHint("DOUBLE"))
public class TextChangeRatio extends ScalarFunction {
    
    public Double eval(String before, String after) {
        if (before == null || after == null) return 0.0;
        if (before.equals(after)) return 1.0;
        
        // Longest Common Subsequence ratio
        int lcs = lcsLength(before, after);
        int maxLen = Math.max(before.length(), after.length());
        
        return maxLen == 0 ? 1.0 : (double) lcs / maxLen;
    }
    
    private int lcsLength(String a, String b) {
        // Optimized for streaming: use rolling array, not full matrix
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i-1) == b.charAt(j-1)) {
                    curr[j] = prev[j-1] + 1;
                } else {
                    curr[j] = Math.max(prev[j], curr[j-1]);
                }
            }
            int[] tmp = prev; prev = curr; curr = tmp;
            java.util.Arrays.fill(curr, 0);
        }
        return prev[b.length()];
    }
}

-- Register the UDF
CREATE FUNCTION text_change_ratio AS 'com.example.TextChangeRatio';

Now, the challenge here is that Flink SQL CDC doesn’t directly expose the “before” image as a column you can SELECT. The changelog stream has INSERT (+I), UPDATE_BEFORE (-U), UPDATE_AFTER (+U), and DELETE (-D) operations, but in a standard SELECT * FROM cdc_table, you only see the latest state.

To access both before and after, you have two options:

Option A: Stateful ProcessFunction (Java/Python)

This is the cleanest approach. You write a KeyedProcessFunction that maintains the previous state of each document in Flink’s managed state, and compares it with the incoming change:

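# Pseudocode for the ProcessFunction approach (PyFlink imports shown for orientation)
from pyflink.common import Row, Types
from pyflink.datastream import KeyedProcessFunction
from pyflink.datastream.state import ValueStateDescriptor
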
class ChangeSignificanceFilter(KeyedProcessFunction):
    
    def __init__(self, threshold=0.95):
        self.threshold = threshold  # 0.95 = skip if 95%+ similar
    
    def open(self, runtime_context):
        # Flink managed state: stores last known content per doc_id
        self.last_content = runtime_context.get_state(
            ValueStateDescriptor("last_content", Types.STRING())
        )
    
    def process_element(self, row, ctx):
        doc_id = row['doc_id']
        new_content = row['content']
        # Baseline = content as of the last emitted embedding request
        old_content = self.last_content.value()
        
        if old_content is None:
            # INSERT: always emit (new document) and record the baseline
            self.last_content.update(new_content)
            yield Row(doc_id=doc_id, needs_embedding=True, 
                      change_ratio=0.0, operation='INSERT')
            return
        
        if old_content == new_content:
            # Content identical: metadata-only change, skip
            return
        
        # Compute change ratio against the baseline
        ratio = text_similarity(old_content, new_content)
        
        if ratio < self.threshold:
            # Significant change: move the baseline and emit for re-embedding
            self.last_content.update(new_content)
            yield Row(doc_id=doc_id, needs_embedding=True,
                      change_ratio=round(1.0 - ratio, 4), 
                      operation='UPDATE')
        else:
            # Minor change (typo fix, formatting): skip embedding.
            # The baseline is NOT updated, so successive small edits
            # accumulate and eventually cross the threshold
            yield Row(doc_id=doc_id, needs_embedding=False,
                      change_ratio=round(1.0 - ratio, 4),
                      operation='SKIP')


def text_similarity(a: str, b: str) -> float:
    """Fast similarity using difflib SequenceMatcher."""
    from difflib import SequenceMatcher
    return SequenceMatcher(None, a, b).ratio()

Option B: Self-join with temporal table (Flink SQL)

If you want to stay in pure SQL, you can maintain a “previous version” table and join against it:

-- Maintain a snapshot of previous content in a JDBC-backed table
CREATE TABLE content_snapshots (
    doc_id      BIGINT,
    content     STRING,
    content_md5 STRING,
    PRIMARY KEY (doc_id) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:postgresql://172.20.0.4:5432/vector_db',
    'table-name' = 'content_snapshots',
    'username' = 'postgres',
    'password' = '...',
    'driver' = 'org.postgresql.Driver'
);

-- Write ALL changes to the snapshot table (upsert)
INSERT INTO content_snapshots
SELECT doc_id, content, MD5(content) FROM src_documents;

Then in the embedding trigger on the target side, compare the incoming content_md5 against the previously stored one. If they differ, queue for embedding. This is essentially what the Level 1 trigger does, but now the CDC pipeline is handling the cross-database transport.

Approach 3: Structural change analysis (most sophisticated)

For knowledge bases with structured content (Markdown, HTML, technical documentation), you can go deeper than raw text diff. Analyze what kind of change happened:

def analyze_change_significance(old_content: str, new_content: str) -> dict:
    """
    Analyze the structural significance of a content change.
    Returns a dict with metrics to decide whether re-embedding is needed.
    """
    import re
    from difflib import SequenceMatcher
    
    result = {
        'char_ratio': SequenceMatcher(None, old_content, new_content).ratio(),
        'paragraphs_added': 0,
        'paragraphs_removed': 0,
        'paragraphs_modified': 0,
        'heading_changed': False,
        'needs_embedding': False
    }
    
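    # Split into paragraphs
    old_paras = [p.strip() for p in old_content.split('\n\n') if p.strip()]
    new_paras = [p.strip() for p in new_content.split('\n\n') if p.strip()]
    
    # Align the paragraph lists so an edited paragraph counts as "modified",
    # not as one removal plus one addition (which would flag every typo fix)
    matcher = SequenceMatcher(None, old_paras, new_paras, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            continue
        common = min(i2 - i1, j2 - j1)   # 0 for a pure insert or delete
        result['paragraphs_modified'] += common
        result['paragraphs_removed'] += (i2 - i1) - common
        result['paragraphs_added'] += (j2 - j1) - common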
    
    # Check if headings changed (strong signal for semantic shift)
    old_headings = set(re.findall(r'^#{1,3}\s+(.+)$', old_content, re.MULTILINE))
    new_headings = set(re.findall(r'^#{1,3}\s+(.+)$', new_content, re.MULTILINE))
    result['heading_changed'] = old_headings != new_headings
    
    # Decision logic
    if result['heading_changed']:
        result['needs_embedding'] = True
        result['reason'] = 'heading_changed'
    elif result['paragraphs_added'] > 0 or result['paragraphs_removed'] > 0:
        result['needs_embedding'] = True
        result['reason'] = 'structural_change'
    elif result['char_ratio'] < 0.90:
        result['needs_embedding'] = True
        result['reason'] = 'significant_text_change'
    else:
        result['needs_embedding'] = False
        result['reason'] = 'minor_change'
    
    return result

The idea here is that structural changes (new headings, added/removed sections) almost always shift the semantic meaning enough to warrant re-embedding, while inline text modifications need to cross a threshold.

Choosing the right threshold

This is the part where I have to be honest: I don’t have a definitive answer on the optimal threshold. It depends on your data, your embedding model, and your quality requirements.

What I can tell you from experimentation:

Change Type	Text Diff Ratio	Should Re-embed?	Why
Typo fix (“PostgreSLQ” → “PostgreSQL”)	0.99+	No	Semantic meaning unchanged
Reformatting (whitespace, punctuation)	0.95+	No	Embedding models are robust to formatting
Single sentence rewritten	0.85-0.95	Maybe	Depends on the sentence’s importance
Paragraph added/removed	0.70-0.85	Yes	New information or removed context
Major rewrite (>30% changed)	<0.70	Absolutely	Different document semantically
Metadata-only (status, category)	1.0 (content)	No	Content columns unchanged

My starting recommendation: set the threshold at 0.95 (i.e., re-embed when more than 5% of the text changed). Then monitor your RAG quality metrics (nDCG, retrieval precision from the Adaptive RAG post of the RAG series) and adjust. If you’re missing relevant results, raise the threshold so that fewer changes are skipped. If you’re burning too many API credits on trivial changes, lower it.

I validated these numbers on the Wikipedia dataset in Part 2 of this post. The results cleanly confirmed the 0.95 threshold: typo fixes scored 0.998+ (SKIP), paragraph additions scored ~0.93 (EMBED), and section rewrites scored 0.51–0.63 (definitely EMBED).
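
To see which way the threshold moves the re-embedding volume, it helps to replay logged similarity scores against a few candidate values. The sketch below uses the same rule as the filter above (re-embed when similarity < threshold); embed_rate() is a hypothetical helper and the five scores are illustrative, not measured results.

```python
def embed_rate(similarities, threshold):
    """Share of logged changes that would be re-embedded at this threshold."""
    embedded = sum(1 for s in similarities if s < threshold)
    return embedded / len(similarities)

# Illustrative similarity scores, one per logged content change
logged = [0.998, 0.97, 0.93, 0.63, 0.51]

for threshold in (0.90, 0.95, 0.98):
    print(threshold, embed_rate(logged, threshold))
```

A higher threshold re-embeds more changes (better freshness, more API calls), a lower one skips more.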

The monitoring table

Whatever approach you choose, log the decisions. This is invaluable for tuning:

CREATE TABLE embedding_change_log (
    log_id          BIGSERIAL PRIMARY KEY,
    doc_id          BIGINT NOT NULL,
    similarity      NUMERIC(5,4),       -- 0.0000 to 1.0000
    decision        TEXT NOT NULL,       -- 'EMBED' or 'SKIP'
    reason          TEXT,                -- 'structural_change', 'minor_change', etc.
    old_content_md5 TEXT,
    new_content_md5 TEXT,
    details         JSONB,              -- optional: paragraph_similarity, char_diff, etc.
    decided_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- How many re-embeddings are we avoiding?
SELECT decision, count(*), 
       round(100.0 * count(*) / sum(count(*)) OVER (), 1) AS pct
FROM embedding_change_log
WHERE decided_at > now() - INTERVAL '7 days'
GROUP BY decision;

-- Result example:
--  decision | count | pct
-- ----------+-------+------
--  SKIP     |  1847 | 73.2
--  EMBED    |   675 | 26.8

In this example, 73% of the content changes were minor enough to skip. That’s 73% fewer embedding API calls, 73% less index churn, and a quieter, more stable RAG system.

A note on baseline: the first run

One thing that’s not obvious until you deploy this: the change detector needs a stored baseline to compare against. On the very first run, or for any document that has never been embedded, the similarity will be 0.0 (nothing to compare with), and the decision will always be EMBED. The SKIP optimization only kicks in on subsequent changes after a baseline exists.

This is correct behavior, but it means your initial backfill will process everything regardless of the threshold setting. Plan for it.

Full architecture recap

I won’t repeat the full Flink setup here; refer to my CDC to JDBC Sink and Oracle to PostgreSQL Migration with Flink CDC posts for the step-by-step lab. The addition here is the significance filter sitting between the CDC source and the embedding sink.

One option I want to flag but that I haven’t fully tested at scale: embedding directly in the Flink pipeline. You could write a custom ProcessFunction that calls the embedding API and writes both the source data and the embeddings to the target in one atomic checkpoint. This eliminates the queue entirely. The concern is rate limiting and latency of embedding API calls within a streaming pipeline: if the API is slow, it creates backpressure all the way to the CDC source. For now, I’d recommend the JDBC sink + trigger + worker approach as the safer pattern, and explore inline embedding only if you have a local embedding model (like Ollama) with predictable latency.

Model Versioning: The Upgrade Problem

Everything above handles content changes. But there’s another dimension: model changes.

When you upgrade from text-embedding-3-small to text-embedding-3-large, or from v1 to v2 of any model, all your existing embeddings become incompatible. This is not optional. Different models produce different vector spaces. You cannot mix embeddings from different models in the same index — the similarity scores become meaningless.

This is why the model_version column in our schema matters. Here’s the upgrade procedure:

Step 1: Deploy new embeddings alongside old ones

-- Create a new worker (or update the config) with the new model
-- MODEL_VERSION = "v2"
-- MODEL_NAME = "text-embedding-3-large"

-- The worker will populate document_embeddings with model_version = 'v2'
-- while model_version = 'v1' embeddings remain untouched and is_current = true

Step 2: Build a separate index for the new model

-- New partial index for v2 embeddings (3072 dimensions for text-embedding-3-large)
CREATE INDEX idx_embeddings_diskann_v2 ON document_embeddings 
    USING diskann (embedding vector_cosine_ops)
    WHERE is_current = true AND model_version = 'v2';

Step 3: Run both in parallel (shadow mode)

During shadow mode, both v1 and v2 have is_current = true; that’s intentional. Your search queries must always scope by model_version, not just is_current. Each partial index covers one version, so PostgreSQL uses the correct index when the query includes AND model_version = 'v2'.

# In your RAG query pipeline, query both and compare
results_v1 = search(query, model_version='v1')
results_v2 = search(query, model_version='v2')

# Log both, serve v1 to users, compare nDCG scores
log_comparison(query, results_v1, results_v2)
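
search() and log_comparison() above are placeholders for your own pipeline. One cheap number such a comparison could record, even without relevance judgments, is how much the two top-k lists overlap. A minimal sketch, assuming each result list is an ordered list of doc_ids (overlap_at_k() is a hypothetical helper):

```python
def overlap_at_k(ids_v1, ids_v2, k=10):
    """Jaccard overlap between the top-k doc_ids returned by two model versions."""
    top_v1, top_v2 = set(ids_v1[:k]), set(ids_v2[:k])
    if not top_v1 and not top_v2:
        return 1.0                     # both empty: trivially identical
    return len(top_v1 & top_v2) / len(top_v1 | top_v2)


print(overlap_at_k([1, 2, 3, 4], [1, 2, 3, 4], k=4))   # identical lists
print(overlap_at_k([1, 2, 3, 4], [3, 4, 5, 6], k=4))   # half of each list shared
```

A low overlap is not a problem in itself (the new model is supposed to rank differently), but it tells you which queries to inspect by hand before the cut-over.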

Step 4: Cut over

-- Once confident, mark v1 as not current
UPDATE document_embeddings 
SET is_current = false 
WHERE model_version = 'v1';

-- Drop old index
DROP INDEX idx_embeddings_diskann;

-- Optionally purge old embeddings (archive them first if you want a rollback path)
-- DELETE FROM document_embeddings WHERE model_version = 'v1';

Step 5: Update the worker config

Switch the worker to produce v2 embeddings for all new changes going forward.

The point is: with the versioned schema and partial indexes, model upgrades become a blue-green deployment for embeddings. No downtime, no inconsistent state, full rollback capability. This is exactly the same principle as the PostgreSQL 17→18 blue-green upgrade I wrote about, applied to vector data.

A Note on pgai Vectorizer

I want to mention pgai Vectorizer by Timescale because it solves a lot of what I’ve described above out of the box. It uses PostgreSQL triggers internally, handles automatic synchronization, supports chunking configuration, and manages the embedding lifecycle with a declarative SQL command:

SELECT ai.create_vectorizer(
    'documents'::regclass,
    loading     => ai.loading_column('content'),
    destination => ai.destination_table('document_embeddings'),
    embedding   => ai.embedding_openai('text-embedding-3-small', 768),
    chunking    => ai.chunking_recursive_character_text_splitter(500, 50)
);

After this, any INSERT/UPDATE/DELETE on documents automatically triggers re-embedding. The vectorizer worker handles batching, rate limit retries, and error recovery. It’s essentially the Level 1 pattern I described, but packaged as a production-ready tool, and since April 2025, it works with any PostgreSQL database (not just Timescale Cloud) via a Python library.

Why I still showed you the manual approach first: because in consulting, I rarely see a greenfield setup. Most projects have constraints: specific PostgreSQL versions, restricted extensions, air-gapped environments, or the need to integrate with an existing CDC pipeline. Understanding the underlying pattern lets you adapt it to your context. pgai Vectorizer is excellent if it fits your deployment, but the principles remain the same regardless of the tooling.

Monitoring Embedding Freshness

One more thing that nobody talks about: how do you know your embeddings are stale?

There are two categories of signals: infrastructure signals (is the pipeline healthy?) and quality signals (is retrieval degrading?). Most teams only monitor the first. The second is what actually matters to your users.

Infrastructure signals: pipeline health

Here are the queries I use in production to monitor the embedding pipeline itself:

-- 1. Documents with no current embeddings
SELECT d.doc_id, d.title, d.updated_at
FROM documents d
LEFT JOIN document_embeddings e 
    ON d.doc_id = e.doc_id AND e.is_current = true
WHERE e.embedding_id IS NULL AND d.is_active = true
ORDER BY d.updated_at DESC;

-- 2. Documents where content changed since last embedding
-- Uses LATERAL join to pick one representative row per document deterministically,
-- avoiding edge cases where chunks have mixed source_hash values (partial retries, etc.)
SELECT d.doc_id, d.title,
       d.updated_at AS doc_updated,
       e.embedded_at AS last_embedded,
       d.updated_at - e.embedded_at AS staleness
FROM documents d
JOIN LATERAL (
    SELECT embedded_at, source_hash
    FROM document_embeddings
    WHERE doc_id = d.doc_id
      AND is_current = true
    ORDER BY embedded_at DESC
    LIMIT 1
) e ON true
WHERE d.is_active = true
  AND d.content_hash IS DISTINCT FROM e.source_hash
ORDER BY staleness DESC;

-- 3. Queue health check
SELECT status, count(*), 
       avg(EXTRACT(EPOCH FROM (processed_at - claimed_at)))::int AS avg_processing_sec,
       avg(EXTRACT(EPOCH FROM (claimed_at - queued_at)))::int AS avg_wait_sec
FROM embedding_queue
WHERE queued_at > now() - INTERVAL '24 hours'
GROUP BY status;

-- 4. Embedding coverage by model version
SELECT model_version, 
       count(DISTINCT doc_id) AS documents,
       count(*) AS total_chunks,
       count(*) FILTER (WHERE is_current) AS current_chunks
FROM document_embeddings
GROUP BY model_version;

Put these in a Grafana dashboard or your monitoring of choice. The staleness query (#2) is your early warning system — if documents are drifting from their embeddings, something is wrong in your pipeline.

But here’s the thing: a healthy pipeline doesn’t guarantee good retrieval. Your queue could be empty, your workers could be processing in sub-second latency, and your embeddings could still be degraded. Why? Because the pipeline only tells you that something was embedded — not that the embeddings are good.

Quality signals: when your RAG tells you embeddings are stale

This is the section I promised earlier when I said building the pipeline is one thing, but proving you’re going in the right direction is everything. This is where the work we did in the Adaptive RAG post becomes critical. The metrics we introduced there, precision@k, recall@k, nDCG@k, and confidence scores are not just evaluation tools for tuning your search weights. They are early warning signals for embedding drift.

Think about what happens when embeddings go stale:

A document was updated with important new information, but the embedding still reflects the old content
Similarity search retrieves the document (the old embedding is close enough), but the chunk text no longer matches the query’s intent
The LLM generates an answer based on outdated context
Precision drops: retrieved documents are less relevant
nDCG drops: the ranking quality degrades because truly relevant (updated) documents are ranked lower than stale ones that happen to have closer embeddings
Confidence drops: the gap between top results narrows, the system becomes less certain

The pattern is subtle but measurable. Here’s how to capture it.

Retrieval quality logging table

Extend the evaluation log from the Adaptive RAG post to include a timestamp dimension that allows you to track drift over time:

CREATE TABLE retrieval_quality_log (
    log_id          BIGSERIAL PRIMARY KEY,
    query_text      TEXT NOT NULL,
    query_type      TEXT,                -- 'factual', 'conceptual', 'exploratory'
    search_method   TEXT NOT NULL,       -- 'adaptive', 'hybrid', 'naive'
    confidence      NUMERIC(4,3),        -- 0.000 to 1.000
    precision_at_10 NUMERIC(4,3),
    recall_at_10    NUMERIC(4,3),
    ndcg_at_10      NUMERIC(4,3),
    avg_similarity  NUMERIC(4,3),        -- average cosine similarity of top-10
    top1_score      NUMERIC(4,3),        -- score of the #1 result
    score_gap       NUMERIC(4,3),        -- gap between #1 and #2 (confidence signal)
    logged_at       TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Index for time-series analysis
CREATE INDEX idx_quality_log_time ON retrieval_quality_log (logged_at DESC);
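
For reference, here is one way the precision_at_10, recall_at_10 and ndcg_at_10 columns could be computed before logging. This is a sketch with binary relevance (a document is either in the judged-relevant set or not) and distinct doc_ids in the result list; the Adaptive RAG post may use graded relevance or slightly different definitions, so treat the function names and formulas as assumptions.

```python
import math

def precision_at_k(retrieved, relevant, k=10):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k=10):
    """Fraction of all relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def ndcg_at_k(retrieved, relevant, k=10):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


relevant = {1, 3}
print(precision_at_k([1, 2, 3, 4], relevant, k=4))
print(recall_at_k([1, 2, 3, 4], relevant, k=4))
print(ndcg_at_k([1, 3, 2, 4], relevant, k=4))   # relevant docs ranked first
print(ndcg_at_k([1, 2, 3, 4], relevant, k=4))   # one relevant doc pushed to rank 3
```

The last two calls show why nDCG is the drift-sensitive metric: precision and recall are identical for both rankings, but nDCG drops as soon as a relevant document slides down the list.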

The drift detection queries

Now the interesting part. These queries detect embedding staleness through retrieval quality degradation, not through pipeline metrics:

-- 5. Weekly nDCG trend — is ranking quality degrading over time?
SELECT date_trunc('week', logged_at) AS week,
       round(avg(ndcg_at_10), 3) AS avg_ndcg,
       round(avg(precision_at_10), 3) AS avg_precision,
       round(avg(confidence), 3) AS avg_confidence,
       count(*) AS queries
FROM retrieval_quality_log
WHERE logged_at > now() - INTERVAL '3 months'
GROUP BY week
ORDER BY week;

-- What you're looking for:
-- A slow, steady decline in avg_ndcg and avg_confidence over weeks
-- This is the signature of embedding drift — the pipeline is running,
-- but the embeddings are gradually falling behind the content

-- 6. Confidence distribution shift — are more queries becoming uncertain?
SELECT date_trunc('week', logged_at) AS week,
       round(100.0 * count(*) FILTER (WHERE confidence >= 0.7) / count(*), 1) 
           AS pct_high_confidence,
       round(100.0 * count(*) FILTER (WHERE confidence >= 0.5 AND confidence < 0.7) / count(*), 1)
           AS pct_medium_confidence,
       round(100.0 * count(*) FILTER (WHERE confidence < 0.5) / count(*), 1) 
           AS pct_low_confidence
FROM retrieval_quality_log
WHERE logged_at > now() - INTERVAL '3 months'
GROUP BY week
ORDER BY week;

-- If pct_low_confidence is climbing week over week, your embeddings 
-- are losing alignment with the actual content
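To turn the visual check on the weekly trend into a single number, one option is a least-squares slope over the weekly averages. The following is a minimal sketch, assuming the retrieval_quality_log table defined above; how steep a decline should raise an alert is left open here and depends on your own baseline.

```sql
-- Sketch: quantify the weekly nDCG trend with a least-squares slope.
-- Assumes the retrieval_quality_log table defined above.
WITH weekly AS (
    SELECT date_trunc('week', logged_at) AS week,
           avg(ndcg_at_10)               AS avg_ndcg
    FROM retrieval_quality_log
    WHERE logged_at > now() - INTERVAL '3 months'
    GROUP BY 1
)
SELECT count(*) AS weeks_observed,
       -- The x axis is expressed in weeks (604800 seconds each), so the slope
       -- reads as "average nDCG change per week".
       -- regr_slope returns NULL when fewer than two weeks are available.
       round(
           regr_slope(
               avg_ndcg::float8,
               (extract(epoch FROM week) / 604800.0)::float8
           )::numeric,
           4
       ) AS ndcg_change_per_week
FROM weekly;
```

A value that stays negative from one run to the next matches the "slow, steady decline" described above. A single noisy week moves the slope much less than it moves a week-over-week comparison.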

Closing the loop: from quality signal to re-embedding trigger

Here’s where this connects back to the event-driven architecture. The quality metrics don’t just sit in a dashboard; they can trigger re-embedding for documents that the pipeline’s change significance filter might have skipped.

Remember the threshold discussion from Level 3: we set a 0.95 similarity ratio as the default, meaning changes under 5% are skipped. But what if a 3% change in a critical document is causing retrieval failures?

The feedback loop:

In practice, you would implement this as a periodic job (daily or weekly) that correlates low-quality retrievals with stale documents. The correlation can be as simple as ILIKE matching query terms against document titles, or as sophisticated as tracking which document IDs were returned in low-confidence results. The key is that change_type = 'quality_reembed' is a distinct signal in your queue — it tells you the re-embedding was triggered by quality degradation, not by a content change event.
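As an illustration of the document-ID variant, here is a minimal sketch of such a periodic job. The tables demo_retrieved_docs and demo_embedding_queue, their columns, the 7-day window, and the "3 occurrences" threshold are all assumptions made for this example; map them to your own queue and logging schema. Only retrieval_quality_log and the 'quality_reembed' label come from this post.

```sql
-- Hypothetical helper tables (names and columns are assumptions, not from the post).
-- demo_retrieved_docs records which documents each logged query returned.
CREATE TABLE IF NOT EXISTS demo_retrieved_docs (
    log_id      BIGINT NOT NULL REFERENCES retrieval_quality_log (log_id),
    doc_id      BIGINT NOT NULL,
    result_pos  INT    NOT NULL,
    PRIMARY KEY (log_id, result_pos)
);

-- demo_embedding_queue stands in for your existing re-embedding queue.
CREATE TABLE IF NOT EXISTS demo_embedding_queue (
    queue_id    BIGSERIAL PRIMARY KEY,
    doc_id      BIGINT NOT NULL,
    change_type TEXT   NOT NULL,
    status      TEXT   NOT NULL DEFAULT 'pending',
    queued_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Periodic job (daily or weekly): enqueue documents that keep appearing
-- in low-confidence results and are not already waiting in the queue.
INSERT INTO demo_embedding_queue (doc_id, change_type)
SELECT r.doc_id, 'quality_reembed'
FROM demo_retrieved_docs r
JOIN retrieval_quality_log q USING (log_id)
WHERE q.logged_at > now() - INTERVAL '7 days'
  AND q.confidence < 0.5            -- same low-confidence cut-off as query 6
  AND NOT EXISTS (
        SELECT 1
        FROM demo_embedding_queue e
        WHERE e.doc_id = r.doc_id
          AND e.status = 'pending'
      )
GROUP BY r.doc_id
HAVING count(*) >= 3;               -- arbitrary threshold, tune for your workload
```

The NOT EXISTS guard keeps the job idempotent: running it twice in a row does not enqueue the same document twice while the first entry is still pending.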

This is the complete picture: the event-driven pipeline handles the primary flow (react to data changes), the change significance filter optimizes it (skip trivial changes), and the quality monitoring loop catches what the filter missed. Three layers, each progressively more sophisticated, each compensating for the blind spots of the previous one.

As I wrote in the Adaptive RAG post, an old BI principle is to know your KPI: what it really measures, but also when it fails to measure. The infrastructure metrics (queue depth, latency, skip rate) measure pipeline health. The quality metrics (precision, nDCG, confidence) measure what your users experience. You need both.

Summary

Throughout this post, we’ve covered a progression from simple to complex:

Level 1 — Triggers + Queue: Best for single-database setups. Zero external dependencies. PostgreSQL does the heavy lifting. Use LISTEN/NOTIFY for sub-second latency. This covers 80% of use cases.

Level 2 — Logical Replication: Best when source and vector databases are separate PostgreSQL instances. The source team doesn’t need to modify anything. Built-in WAL-based CDC with automatic catch-up.

Level 3 — Flink CDC + Change Significance Filtering: Best for heterogeneous sources (Oracle, MySQL) or fan-out to multiple targets. The change significance filter is the key addition — by comparing before/after images in the pipeline, you skip re-embedding for minor changes (typo fixes, metadata-only updates, formatting), which in practice eliminates 60-80% of unnecessary embedding API calls. Start with column-aware filtering, graduate to text diff ratio with a 0.95 threshold, and tune based on your RAG quality metrics.

Model Versioning: Regardless of which level you choose, version your embeddings. Track model_name, model_version, and source_hash. Use partial DiskANN indexes (pgvectorscale). Treat model upgrades like blue-green deployments.

Measurement: None of the above matters if you don’t instrument retrieval quality. The precision@k, recall@k, nDCG@k, and confidence metrics from the Adaptive RAG post aren’t a nice-to-have — they’re the only way to know whether your pipeline is actually keeping your RAG system healthy. Track them over time. Break them down by topic. Watch for drift. If you build the pipeline without the measurement layer, you’re flying blind. The evaluation framework in the pgvector_RAG_search_lab repository (lab/evaluation/) gives you a concrete starting point.

The core principle: Event-driven architecture is a precondition for production RAG — but it’s not sufficient. The moment you accept batch re-embedding as “good enough,” you’re accepting that your RAG system will silently degrade between batches. The trigger/CDC approach doesn’t just keep embeddings fresh — it gives you observability into what changed, when it was embedded, and whether the change was significant enough to matter. But the pipeline only proves that work was done. The quality metrics prove the work was effective. Log every decision. Measure the skip rate. Tune the threshold. Track nDCG weekly. This is how you operationalize RAG.

If you’ve been building RAG systems without thinking about embedding freshness, now is the time to retrofit it. And if you’re starting a new RAG project, please design the embedding pipeline as event-driven from day one. Your future self will thank you. One thing I didn’t cover is what should be embedded. That is less a technical question than one tied to your knowledge of the data, your business, your applications, your data workflows…

What’s Next

In Part 2, I apply everything from this post to the 25,000-article Wikipedia dataset from the pgvector_RAG_search_lab repository. You’ll see:

How to adapt the schema to an existing table (no greenfield luxury)
Real SKIP vs EMBED decisions with actual similarity scores
The SKIP LOCKED multi-worker demo with 4 concurrent workers and zero overlap
A complete freshness monitoring report
The quality feedback loop triggering re-embeddings automatically

A probable future blog post will, I guess, cover benchmarks with pgvector and DiskANN indexes…

The article RAG Series – Embedding Versioning with pgvector: Why Event-Driven Architecture Is a Precondition to AI data workflows first appeared on dbi Blog.

RAG, MCP, Skills — Three Paradigms for LLMs Talking to Your Database, and Why Governance Changes Everything

Adrien Obernesser — Sun, 15 Feb 2026 18:10:05 +0000

Introduction

Throughout my RAG series, I’ve explored how to build retrieval-augmented generation systems with PostgreSQL and pgvector, from Naive RAG through Hybrid Search, Adaptive RAG, Agentic RAG, and Embedding Versioning (not yet released). The focus has always been on retrieval quality: precision, recall, nDCG, confidence.

But I want to take a step back.

In conversations with colleagues, clients, and fellow PostgreSQL enthusiasts — most recently at the PG Day at CERN — I keep running into the same fundamental question: which paradigm should we use to connect LLMs to our data? Not “which is best for retrieval quality” but “which can we actually deploy in a regulated environment where data governance isn’t optional?”

Because in banking, healthcare, or any FINMA-regulated context, the question is never just “does it work?” The question is: can you prove it works safely, explain how it works, and guarantee it respects access controls?

Today, three paradigms dominate how LLMs interact with structured and unstructured data:

RAG (Retrieval-Augmented Generation): the LLM receives pre-retrieved context, never touches the database directly
MCP (Model Context Protocol): the LLM generates and executes queries against live data sources through standardized tool interfaces
Skills: the LLM follows procedural instructions that define not just what to answer, but how to perform tasks step by step

Each one draws the trust boundary in a fundamentally different place. And that changes everything for governance.

But let me be clear upfront: these three paradigms are complementary, not competing. There is overlap between them, and in some cases one can partially do what another does. But they each have distinct strengths, and the real architectural question is not “which one should I pick?” but “where in my system does each one belong?” A well-designed LLM-powered application might use RAG for knowledge retrieval, MCP for analytical queries, and Skills for procedural workflows — all within the same product, each governed appropriately for its trust model. The mistake I see teams make is treating this as an either/or decision when it’s fundamentally a composition problem.

It’s also worth noting that the boundaries between these paradigms are already blurring. Some MCP servers now ship with built-in RAG components: vector search becomes a tool the LLM can call rather than a separate pipeline. And over the past months, MCP itself has become increasingly abstracted: higher-level orchestration layers sit on top, hiding the protocol details from both the LLM and the developer. The consequence is that the frontiers between RAG, MCP, and Skills are becoming harder to draw cleanly.

This mirrors something I’ve observed more broadly in IT. Before AI-driven data workflows, roles had clear separations and were (too) easily siloed: infrastructure admins managed servers, developers wrote application code, DBAs owned the database layer. Each team had its perimeter, its tools, its responsibilities. With AI data workflows, the overlap and shared responsibilities are growing — the DBA needs to understand how the LLM consumes data, the developer needs to understand database-level access controls, the security team needs to understand embedding pipelines and prompt behavior. The silos don’t hold anymore. And I think this is actually one of the positive consequences of this shift: at least in IT, people will have to talk more and break silos. The governance challenges I’ll describe in this post are not solvable by any single team in isolation.

A Quick Overview: What Each Paradigm Does

Before diving into governance, let’s make sure we’re on the same page about what each paradigm actually does. I’ll keep this brief because the nuances matter more than the definitions.

RAG — The LLM Reads What You Give It

The LLM never sees your database. It sees chunks of text that a retrieval layer selected based on vector similarity. The database is behind a wall — the RAG pipeline decides what crosses it.

This is the paradigm I’ve spent the most time on in my series, and there’s a reason: from a DBA’s perspective, it’s the most controllable. Row-Level Security, data masking, column filtering, TDE — all the tools we’ve spent years mastering apply directly, because the retrieval layer queries the database on behalf of the user, and the database enforces its own rules.

MCP — The LLM Talks to Your Database

With MCP (Anthropic’s Model Context Protocol), the LLM doesn’t receive pre-filtered chunks. It receives tools — standardized interfaces to data sources, APIs, and services. The LLM decides which tool to call, formulates the query (often SQL), and interprets the results.

This is powerful. The LLM can explore data, join across tables, aggregate, filter — things that are impossible in a RAG pipeline where the retrieval is limited to vector similarity over pre-chunked text.

But the trust boundary has moved. The LLM is now an active agent inside your data perimeter, not a passive consumer of pre-filtered context.

Skills — The LLM Follows Your Procedures

Skills go beyond querying data. A Skill is a set of instructions that tells the LLM not just what to answer, but how to perform a task — step by step, with specific tools, in a specific order, following specific constraints. Think of it as encoding a procedure, a runbook, or an expertise template that the LLM executes.

For example: a Skill might instruct the LLM to read a client’s portfolio from the database, apply risk weighting rules from a compliance document, generate a report in a specific format, and flag any positions that exceed regulatory thresholds — all as a single orchestrated workflow.

The trust boundary hasn’t just moved — it has expanded. The LLM is no longer just reading data or generating queries. It’s executing multi-step procedures that combine data access, computation, and output generation.

The Trust Boundary Problem

Here’s the framework I use when advising clients on these paradigms. It comes down to one question: where does the security perimeter sit, and who crosses it?

Feature	RAG	MCP	Skills
Trust boundary	Pipeline acts as gatekeeper	LLM acts as query agent	LLM acts as procedure executor with tool access
LLM sees	Pre-filtered chunks only	Raw query results from live DB	Raw data + intermediate computation results
DBA controls	✅ Full (RLS, masking, filtering at retrieval time)	⚠️ Partial (connection-level, but LLM can infer beyond single query)	❌ Minimal (depends on Skill design and LLM behavioral compliance)

Let me unpack each one.

RAG: The DBA’s Comfort Zone

I’ll be direct: RAG is where I’m most comfortable as a DBA, and there are solid reasons for that.

Why governance is straightforward

In a RAG architecture, the retrieval layer is a regular database client. It connects with a specific role, it runs queries (vector similarity search), and it returns results. Everything the DBA has built over the past decades applies:

Row-Level Security (RLS): You can enforce per-user access policies at the PostgreSQL level. If user A shouldn’t see documents classified as “confidential-level-3”, the RLS policy filters them out before the embedding search even returns results. The LLM never sees them — not in the chunks, not in the context, not in the answer.

-- Example: RLS on the document_embeddings table
CREATE POLICY embedding_access ON document_embeddings
    FOR SELECT
    USING (
        doc_id IN (
            SELECT doc_id FROM documents 
            WHERE classification_level <= current_setting('app.user_clearance')::int
        )
    );
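A policy only takes effect once row-level security is enabled on the table, and the clearance has to be set for each request. The following self-contained sketch shows both steps on throwaway tables. Everything prefixed demo_ is an assumption for illustration, and the embedding column is left out so that the example runs without the vector extension.

```sql
-- Hypothetical demo schema; all demo_ names are assumptions.
CREATE TABLE demo_documents (
    doc_id               BIGINT PRIMARY KEY,
    classification_level INT NOT NULL
);
CREATE TABLE demo_document_embeddings (
    chunk_id BIGSERIAL PRIMARY KEY,
    doc_id   BIGINT NOT NULL REFERENCES demo_documents (doc_id),
    chunk    TEXT NOT NULL      -- embedding column omitted in this sketch
);

INSERT INTO demo_documents VALUES (1, 1), (2, 3);
INSERT INTO demo_document_embeddings (doc_id, chunk) VALUES
    (1, 'public product sheet'),
    (2, 'confidential-level-3 memo');

-- Policies are inert until RLS is enabled; FORCE also applies them to the table owner.
ALTER TABLE demo_document_embeddings ENABLE ROW LEVEL SECURITY;
ALTER TABLE demo_document_embeddings FORCE ROW LEVEL SECURITY;

CREATE POLICY demo_embedding_access ON demo_document_embeddings
    FOR SELECT
    USING (
        doc_id IN (
            SELECT doc_id FROM demo_documents
            WHERE classification_level <= current_setting('app.user_clearance')::int
        )
    );

-- Per request: set the clearance for this transaction only, then query.
BEGIN;
SELECT set_config('app.user_clearance', '2', true);
SELECT chunk FROM demo_document_embeddings;
COMMIT;
```

Run as an ordinary role, the final SELECT should return only the level-1 chunk. Superusers and roles with BYPASSRLS skip the policy entirely, so test with the role your retrieval layer actually uses. Note also that current_setting raises an error when the setting is missing; passing true as a second argument returns NULL instead, which makes the policy match no rows.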

Data masking: If certain fields need to be masked (client names, account numbers), you apply masking rules at the view or column level. The chunks the LLM receives already have masked data. There’s no risk of the LLM “accidentally” revealing something it shouldn’t — it never had it.

Audit logging: Every query to the vector store is a standard SQL query. pgaudit, log_statement, your existing monitoring — it all works. You can trace exactly which chunks were retrieved for which query at which time.

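TDE (Transparent Data Encryption): Embeddings at rest are protected like any other column by filesystem- or volume-level encryption, or by the TDE offered in some PostgreSQL distributions (community PostgreSQL has no built-in TDE). pgcrypto can encrypt ordinary columns, but an embedding encrypted that way can no longer be indexed or searched by similarity, so it fits the surrounding metadata better than the vectors themselves.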

The measurement advantage

This is the point I made in my Adaptive RAG post: RAG has a well-defined quality measurement framework.

You build golden question/answer pairs with domain experts. You run the retrieval pipeline against these pairs. You measure precision@k, recall@k, nDCG@k. You track these over time (as I described in the embedding versioning post).

The key property is that the output space is bounded: given a query, the system returns a ranked list of chunks. You can evaluate whether the right chunks were returned, in the right order, with the right confidence. This is a well-studied problem in information retrieval — decades of academic work support it.
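Because the output is a ranked list, the core metrics reduce to set arithmetic. Here is a minimal sketch with inline toy data; the query and document IDs are made up for illustration, and k is fixed at 5.

```sql
-- Toy data: query 1 has three relevant documents in the golden set,
-- and the pipeline returned five ranked results.
WITH golden(query_id, doc_id) AS (
    VALUES (1, 10), (1, 11), (1, 12)
),
retrieved(query_id, doc_id, pos) AS (
    VALUES (1, 10, 1), (1, 99, 2), (1, 12, 3), (1, 98, 4), (1, 97, 5)
),
golden_size AS (
    SELECT query_id, count(*) AS n_relevant
    FROM golden
    GROUP BY query_id
),
hits AS (
    -- count(g.doc_id) only counts retrieved rows that matched a golden document
    SELECT r.query_id, count(g.doc_id) AS n_hits
    FROM retrieved r
    LEFT JOIN golden g
           ON g.query_id = r.query_id AND g.doc_id = r.doc_id
    WHERE r.pos <= 5
    GROUP BY r.query_id
)
SELECT h.query_id,
       round(h.n_hits / 5.0, 3)                   AS precision_at_5,  -- 0.400
       round(h.n_hits::numeric / s.n_relevant, 3) AS recall_at_5      -- 0.667
FROM hits h
JOIN golden_size s USING (query_id);
```

Two of the five returned documents are relevant (precision 2/5) and two of the three relevant documents were found (recall 2/3). In a real evaluation the two CTEs would be replaced by your golden table and your retrieval log.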

The limitations

RAG is not universally applicable, and I don’t want to oversell it:

It can’t do aggregation: “What’s the total exposure across all clients in sector X?” requires a SQL query, not a similarity search over chunks.
It can’t join across data sources: if the answer requires combining data from multiple tables or systems, the pre-chunked approach breaks down.
It’s bound by the chunking strategy: if the relevant information spans multiple chunks, or if the chunking split a critical paragraph, retrieval quality suffers regardless of how good your embeddings are.
It requires upfront indexing: every document must be chunked and embedded before it can be retrieved. The event-driven pipelines I described in the embedding versioning post address the freshness problem, but the fundamental constraint remains.

These limitations are exactly what push teams toward MCP.

MCP: When the LLM Writes the Queries

MCP is compelling because it removes the constraints I just listed. The LLM can query live data, aggregate, join, filter — anything SQL can express. For analytical workloads, it’s transformative.

But the governance implications are profound.

The identity problem: who is the user?

In a traditional application, the database sees a connection from a known application user with a defined role. RLS policies evaluate current_user or current_setting('app.user_id'), and access is scoped accordingly.

With MCP, the LLM is the intermediary. The MCP server connects to PostgreSQL, but who is the user? Is it:

The human who asked the question?
The LLM instance processing the request?
The MCP server’s service account?

In most current MCP implementations, it’s the service account. The MCP server connects with a single set of credentials, and the LLM’s queries execute with that role’s permissions. This creates a fundamental problem in regulated environments: the database cannot distinguish between queries on behalf of different users with different access levels.

You can work around this with application-level filtering — the MCP server injects WHERE clauses based on the requesting user’s profile. But this moves access control from the database (where it’s declarative, auditable, and enforced by the engine) to the application layer (where it depends on correct implementation, and where the LLM might find creative ways around it).

This is where I have a hard time enforcing data privacy rules outside of the RAG paradigm. As a DBA, tools like RLS, TDE, data masking, and filtering are well-known and battle-tested. In an MCP paradigm, these tools still exist at the database level, but the trust that they’ll be invoked correctly now depends on the MCP server’s implementation and the LLM’s behavioral compliance.
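One way to narrow this gap is to have the MCP server pass the human identity into the session before each generated statement runs, so that database policies evaluate against the requesting user rather than the service account. This is a sketch under assumptions: 'alice' and the app.user_id setting name are placeholders, and nothing here is part of the MCP specification.

```sql
-- Sketch of what an MCP server could run around each LLM-generated statement.
BEGIN READ ONLY;                       -- the generated SQL cannot write

-- Third argument true = the value lasts only until the end of this transaction,
-- which matters when pooled connections are shared between users.
SELECT set_config('app.user_id', 'alice', true);

-- The LLM-generated statement would run here. RLS policies that read
-- current_setting('app.user_id') now evaluate against the human user.
SELECT current_setting('app.user_id') AS acting_user;   -- alice

COMMIT;
```

This is not a complete control on its own. Any session can call set_config on a custom setting, so a generated statement could overwrite app.user_id unless the server validates it first (for example, by accepting only a single SELECT). That dependency on correct server-side implementation is exactly the trust shift described above.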

The inference problem

This is the subtler issue and, in my opinion, the more dangerous one for FINMA-regulated environments.

Even if you perfectly control what data the LLM can query, the LLM can infer information by combining results from multiple queries. Consider:

Query 1: “How many clients are in sector Energy?” → 3
Query 2: “What’s the total exposure in sector Energy?” → CHF 450M
Query 3: “What’s the largest single position across all sectors?” → CHF 200M

Individually, each query might be within the user’s access rights. But combined, the LLM can infer: “One client in the Energy sector represents nearly half the total exposure — approximately CHF 200M.” This might be information the user is not authorized to derive.

This is not a new problem — it’s the classic inference attack from database security literature. But LLMs make it dramatically easier because:

They’re excellent at combining information from multiple queries
They do it naturally, without being explicitly instructed to
The user might not even realize they received derived information they shouldn’t have

In a RAG architecture, the inference surface is much smaller because the LLM only sees pre-selected chunks, not raw query results it can freely combine.
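The three queries above can be reproduced on toy data to see how little is needed. The table and figures below are invented for illustration and chosen to match the numbers in the example.

```sql
CREATE TEMP TABLE demo_positions (
    client_id    INT,
    sector       TEXT,
    exposure_chf NUMERIC
);
INSERT INTO demo_positions VALUES
    (1, 'Energy', 200000000),
    (2, 'Energy', 150000000),
    (3, 'Energy', 100000000),
    (4, 'Pharma',  80000000),
    (5, 'Pharma',  60000000);

-- Query 1: how many clients are in sector Energy?  -> 3
SELECT count(DISTINCT client_id) AS energy_clients
FROM demo_positions WHERE sector = 'Energy';

-- Query 2: total exposure in sector Energy?  -> 450000000
SELECT sum(exposure_chf) AS energy_exposure
FROM demo_positions WHERE sector = 'Energy';

-- Query 3: largest single position across all sectors?  -> 200000000
SELECT max(exposure_chf) AS largest_position
FROM demo_positions;

-- None of the three answers names a client, yet simple arithmetic on them
-- yields a per-client figure: 200M out of 450M.
SELECT round(100.0 * 200000000 / 450000000, 1) AS pct_of_energy_exposure;  -- 44.4
```

Strictly speaking, the derivation holds only if the largest position belongs to an Energy client, as it does in this toy data. With only three clients in the sector, even that uncertainty narrows the candidates to a very small group.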

The auditability challenge

In FINMA’s operational risk framework (and Basel III/IV more broadly), financial institutions must maintain comprehensive audit trails for data access. This means you need to answer: who accessed what data, when, and for what purpose?

With MCP:

The “who” is partially obscured (service account vs. actual user)
The “what” is a generated SQL query that might be complex and non-obvious
The “for what purpose” is the natural language question, which may not clearly map to the data accessed

You can log everything — the natural language input, the generated SQL, the results, the final answer. But the audit trail is now multi-layered and interpretive: a compliance officer reviewing the logs needs to understand not just that a query ran, but why the LLM chose that particular query, and whether the answer faithfully represents the data without unauthorized inference.

Compare this to RAG, where the audit trail is: “User asked X. The system retrieved chunks Y1, Y2, Y3 from documents the user has access to. The LLM generated answer Z based on those chunks.” Simpler. More linear. Easier to review.
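If you do log every layer for MCP, it helps to keep the layers in one row per executed statement so a reviewer can reconstruct a request end to end. The table below is a hypothetical layout, not an established schema; every name in it is an assumption.

```sql
-- Hypothetical audit table: one row per generated statement.
CREATE TABLE IF NOT EXISTS demo_mcp_audit_log (
    audit_id      BIGSERIAL PRIMARY KEY,
    request_id    UUID NOT NULL,                      -- groups all statements of one question
    end_user      TEXT NOT NULL,                      -- the human, not the service account
    db_role       TEXT NOT NULL DEFAULT current_user, -- role that executed the SQL
    question_text TEXT NOT NULL,                      -- the "for what purpose"
    generated_sql TEXT NOT NULL,                      -- the "what"
    row_count     INT,
    answer_text   TEXT,
    executed_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX IF NOT EXISTS idx_demo_mcp_audit_request
    ON demo_mcp_audit_log (request_id);

-- Reviewer's view: one line per question, with every statement it triggered.
SELECT request_id,
       end_user,
       min(question_text)  AS question,
       count(*)            AS statements,
       string_agg(generated_sql, E'\n---\n' ORDER BY executed_at) AS sql_trail
FROM demo_mcp_audit_log
GROUP BY request_id, end_user
ORDER BY min(executed_at);
```

The table records what ran and for whom. It does not record why the LLM chose those statements, which remains the interpretive part of the review.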

Where MCP shines despite the challenges

I don’t want to paint MCP as undeployable. It solves real problems:

Analytical queries: “Show me the monthly trend of client onboarding over the past year” — this requires aggregation that RAG simply cannot do
Exploratory data access: when users don’t know exactly what they’re looking for, the ability to query flexibly is invaluable
Multi-source integration: MCP servers can connect to multiple backends (PostgreSQL, APIs, file systems) through a single standardized protocol
Reduced indexing overhead: no need to chunk, embed, and maintain vector indexes — the data is queried live

The measurement story is also more tractable than Skills (see below). Academic work on text-to-SQL evaluation provides established benchmarks. Projects like WikiSQL and BIRD-SQL offer golden datasets of natural language questions paired with correct SQL queries and expected results.

However — and this is a crucial point I raised during discussion with colleagues — these academic benchmarks don’t transfer to your specific domain. WikiSQL doesn’t know your banking schema. BIRD-SQL doesn’t understand your compliance rules. You still need to build your own evaluation dataset with domain experts, golden queries, and expected results specific to your data model.

The good news: the methodology transfers. You know what to measure (query correctness, result accuracy, execution safety). You know how to structure the evaluation (golden pairs). The work is in building the domain-specific test suite, not in inventing the evaluation framework.
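Result accuracy for a golden pair can be checked directly in the database by comparing the result sets of the golden query and the generated query. A minimal sketch on a toy table follows; the demo_onboarding table and both queries are invented for illustration.

```sql
CREATE TEMP TABLE demo_onboarding (client_id INT, onboarded_on DATE);
INSERT INTO demo_onboarding VALUES
    (1, '2025-01-10'), (2, '2025-01-25'), (3, '2025-02-03');

WITH golden AS (        -- query written by the domain expert
    SELECT date_trunc('month', onboarded_on)::date AS month, count(*) AS n
    FROM demo_onboarding
    GROUP BY 1
),
generated AS (          -- query produced by the LLM for the same question
    SELECT date_trunc('month', onboarded_on)::date AS month, count(client_id) AS n
    FROM demo_onboarding
    GROUP BY 1
),
diff AS (
    -- Symmetric difference; EXCEPT ALL keeps duplicates so row counts matter too.
    (SELECT * FROM golden    EXCEPT ALL SELECT * FROM generated)
    UNION ALL
    (SELECT * FROM generated EXCEPT ALL SELECT * FROM golden)
)
SELECT count(*) = 0 AS results_match
FROM diff;
```

The two queries are textually different but return the same rows, so results_match is true here. This comparison ignores row order and requires matching column types, and a match on one dataset does not prove the queries are equivalent in general.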

Skills: When the LLM Executes Procedures

Skills represent the most powerful — and the most challenging — paradigm from a governance perspective.

In a Skill, the LLM is not just retrieving information or generating queries. It’s following a procedural set of instructions that might involve: reading data, applying business logic, making decisions based on intermediate results, generating outputs in specific formats, and potentially writing data back.

The instruction adherence problem

With RAG, you measure retrieval quality. With MCP, you measure query correctness. With Skills, you need to measure something much harder: did the LLM follow the procedure correctly?

Two responses can both be “compliant” with a Skill’s instructions and yet differ significantly in quality. Consider a Skill that instructs the LLM to:

Read a client’s transaction history from the database
Apply anti-money-laundering (AML) screening rules
Flag suspicious patterns according to FINMA guidelines
Generate a structured report

Response A might flag 5 transactions as suspicious, with clear reasoning linked to specific FINMA rules. Response B might flag 3 of the same 5, miss 2 edge cases, but also flag 1 false positive — with equally valid-sounding reasoning. Both followed the Skill’s instructions. Both produced structured reports. But one is meaningfully better than the other.

How do you measure this? This is the question that keeps me up at night, and I don’t have a clean answer.

For RAG, we have precision, recall, nDCG — well-defined metrics with decades of research behind them. For MCP/text-to-SQL, we have execution accuracy, result set matching, and query equivalence. For Skills, we need to evaluate:

Instruction adherence: did the LLM follow each step in order?
Completeness: were all required steps executed?
Correctness of intermediate decisions: at each decision point, did the LLM make the right call?
Output quality: does the final deliverable meet the specification?
Safety: did the LLM stay within the boundaries of the Skill, or did it improvise?

This is far more subjective and creates many more edge cases. You can build evaluation frameworks for each of these dimensions, but the composite “is this response good?” judgment requires human review at a level that doesn’t scale easily.

The behavioral safety gap

In regulated environments, the process matters as much as the result. In banking, it’s not enough to produce the correct AML report — you must produce it through the correct procedure, applying the correct rules, in the correct order, with the correct audit trail.

Skills encode this process. But the LLM is a probabilistic system. It will sometimes:

Reorder steps when it “thinks” a different order is equivalent (from a regulatory standpoint, it might not be)
Skip a step it deems unnecessary (it might be required for compliance)
Interpolate between the Skill’s instructions and its general training (introducing reasoning that wasn’t prescribed)
Handle edge cases creatively (which might mean non-compliantly)

In a FINMA audit, “the AI decided to skip step 3 because it seemed redundant” is not an acceptable explanation. The procedure exists for a reason. Compliance is about following the prescribed process, not just arriving at a plausible result.
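Skipped and reordered steps are at least detectable if the Skill's execution layer logs each step it runs. The sketch below compares a prescribed procedure with an execution trace; the step names mirror the AML example above, and the trace itself is invented.

```sql
WITH prescribed(step_no, step_name) AS (
    VALUES (1, 'read_transactions'),
           (2, 'apply_aml_rules'),
           (3, 'flag_patterns'),
           (4, 'generate_report')
),
executed(seq, step_name) AS (   -- what the LLM actually did, in order
    VALUES (1, 'read_transactions'),
           (2, 'flag_patterns'),
           (3, 'generate_report')
)
SELECT p.step_no,
       p.step_name,
       e.seq AS executed_at,
       CASE
           WHEN e.seq IS NULL THEN 'skipped'
           -- ran before some step that the procedure places earlier
           WHEN e.seq < max(e.seq) OVER (
                    ORDER BY p.step_no
                    ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
                ) THEN 'out_of_order'
           ELSE 'ok'
       END AS status
FROM prescribed p
LEFT JOIN executed e USING (step_name)
ORDER BY p.step_no;
```

For this trace, step 2 (apply_aml_rules) is reported as skipped and the other three as ok. The check covers structural adherence only: it says nothing about whether the decisions taken inside each step were correct.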

The data access surface

Skills often need broad data access to perform their multi-step procedures. An AML screening Skill might need access to transaction history, client profiles, country risk ratings, and regulatory threshold configurations. This is a wide data surface — wider than most RAG retrieval patterns and potentially wider than individual MCP queries.

The challenge is that this access is implicit in the Skill design, not explicit in a per-query access control check. When the Skill says “read the client’s transaction history,” the underlying MCP call or database query is executed with whatever permissions the Skill’s execution context has. There’s no natural point where a per-user RLS check happens unless you’ve specifically engineered it into the Skill’s execution layer.

The Governance Matrix

Let me bring this together in a framework that I’ve found useful when discussing these paradigms with CISOs and compliance teams:

Governance Dimension	RAG	MCP	Skills
Data access control	✅ Database-native (RLS, views, masking) — data is filtered before the LLM sees it	⚠️ Depends on MCP server implementation; LLM sees raw query results	❌ Broad access often required; implicit in Skill design
Inference protection	✅ Limited surface — LLM sees only pre-selected chunks	⚠️ LLM can combine results from multiple queries to derive unauthorized information	❌ Multi-step procedures inherently combine information
Audit trail clarity	✅ Linear: query → chunks → answer. Easy to review	⚠️ Multi-step: question → SQL(s) → results → answer. Requires interpretation	❌ Complex: task → steps → intermediate results → decisions → output. Hard to audit
Identity propagation	✅ Retrieval runs as the user (or user-scoped service)	⚠️ MCP server connects with service account; user identity must be passed through	⚠️ Execution context may not map to individual user identity
Relevance measurement	✅ Mature: precision, recall, nDCG on golden datasets	⚠️ Text-to-SQL benchmarks exist but must be domain-adapted	❌ Instruction adherence is subjective; no standard metrics
Behavioral predictability	✅ Output bounded by retrieved context	⚠️ LLM chooses which queries to run; output depends on query strategy	❌ LLM executes procedures; may reorder, skip, or improvise steps
Regulatory explainability	✅ “The system retrieved these documents and generated this answer”	⚠️ “The system ran these queries and synthesized this answer”	❌ “The system followed this procedure, making these intermediate decisions”
Data residency / TDE	✅ Standard PostgreSQL encryption; chunks are just rows	✅ Standard — queries execute within the DB perimeter	⚠️ Intermediate results may exist outside the DB during processing
Anonymization-utility balance	✅ Transformation at embedding time; LLM only sees pre-abstracted chunks	⚠️ Transformation at query time; every query path must return abstracted data	❌ Transformation needed at every procedure step; hard to guarantee consistency

The trend is clear: as you move from RAG to MCP to Skills, the governance burden shifts from the database to the application layer, and the DBA’s ability to enforce controls diminishes.

This doesn’t mean MCP and Skills are unusable in regulated environments. It means the governance responsibility shifts, and different teams need to own different pieces.

The Real Challenge: Anonymization Is Not Enough

The governance matrix above might give the impression that the hard problem is access control — deciding whether the LLM sees the data. In practice, at least in the enterprise environments where I consult, the harder problem is what happens to the data before the LLM sees it.

Everyone agrees on the baseline: strip PII before sending anything to an external LLM. Replace client names with tokens, mask account numbers, redact personal identifiers. This is table stakes, and there are mature tools for it — both at the PostgreSQL level (dynamic masking, views) and at the application layer (regex-based scrubbing, NER-based entity detection).

But here’s what nobody warns you about: if you anonymize too aggressively, the LLM can’t do its job. And if you don’t anonymize enough, you’ve just sent regulated data to a third-party API.

The balance is not a binary “mask or don’t mask.” It’s a spectrum of semantic-preserving transformation — and finding the right point on that spectrum is, in my experience, the most under-discussed practical challenge of deploying LLMs in regulated environments.

The anonymization-utility trade-off

Let me illustrate with a concrete example from a banking context. Suppose a relationship manager asks the RAG system: “What are the key risk factors for this client’s portfolio?”

The original chunk from the knowledge base might contain:

Client: Jean-Pierre Müller (ID: CH-98234)
Portfolio value: CHF 4.2M
Concentrated position: 47% in Nestlé S.A. (NESN.SW)
Recent transactions: Sold CHF 200K of Roche Holding AG on 2025-11-15
KYC renewal due: 2026-03-01
Risk rating: Medium-High (upgraded from Medium on 2025-09-20)

Now let’s look at what happens at different levels of anonymization:

Level 1 — PII-only masking (replace identifiers, keep everything else):

Client: [CLIENT_A] (ID: [REDACTED])
Portfolio value: CHF 4.2M
Concentrated position: 47% in Nestlé S.A. (NESN.SW)
Recent transactions: Sold CHF 200K of Roche Holding AG on 2025-11-15
KYC renewal due: 2026-03-01
Risk rating: Medium-High (upgraded from Medium on 2025-09-20)

The LLM can still reason perfectly about the portfolio risk. It sees the concentration in a specific stock, the transaction history, the risk upgrade. This is semantically rich — the answer will be excellent.

But the problem: Nestlé + Roche + CHF 4.2M + specific dates might be enough to re-identify the client through correlation. A determined actor with access to client lists could narrow this down to a handful of people, potentially one. The PII is gone, but the data fingerprint remains.

Level 2 — Aggressive anonymization (mask everything that could identify):

Client: [CLIENT_A] (ID: [REDACTED])
Portfolio value: [AMOUNT]
Concentrated position: [PERCENTAGE] in [COMPANY_A]
Recent transactions: Sold [AMOUNT] of [COMPANY_B] on [DATE]
KYC renewal due: [DATE]
Risk rating: [RATING] (upgraded from [PREVIOUS_RATING] on [DATE])

Now the data is safe from re-identification. But the LLM is blind. It can’t tell you that a 47% concentration in a single stock is a risk factor, because it doesn’t know it’s 47%. It can’t assess whether the recent sale was material, because it doesn’t know the amount relative to the portfolio. It can’t flag the KYC timeline. The answer will be generic boilerplate about portfolio risk — useless to the relationship manager.

Level 3 — Semantic-preserving abstraction (this is where the real work happens):

Client: [CLIENT_A] (ID: [REDACTED])
Portfolio value: CHF 4-5M range
Concentrated position: >40% in a single Swiss large-cap equity
Recent transactions: Significant sale in Swiss pharma sector, Q4 2025
KYC renewal due: Q1 2026
Risk rating: Medium-High (recently upgraded)

Now we’ve achieved something interesting. The data is:

Not re-identifiable: ranges, sectors, and quarters instead of exact values, company names, and dates
Semantically sufficient: the LLM can still reason that >40% in a single stock is a concentration risk, that a significant sale in pharma might be rebalancing, that a recent risk upgrade plus upcoming KYC renewal warrants attention
Contextually accurate: the abstraction preserves the relationships between data points (concentration → risk rating → KYC timeline)

This is the sweet spot — but getting there requires deliberate design, not just running a PII scanner.

Building the abstraction layer: data mapping in PostgreSQL

In practice, I implement this as a transformation layer in PostgreSQL — views or functions that produce the “LLM-safe” version of the data, with the abstraction rules encoded declaratively.

-- Abstraction mapping for amounts
CREATE OR REPLACE FUNCTION anonymize_amount(amount NUMERIC, currency TEXT DEFAULT 'CHF')
RETURNS TEXT AS $$
BEGIN
    -- Preserve magnitude and currency, remove precision
    RETURN CASE
        WHEN amount IS NULL THEN NULL      -- without this, a NULL amount falls through to ELSE ('50M+')
        WHEN amount < 100000 THEN currency || ' <100K'
        WHEN amount < 500000 THEN currency || ' 100K-500K range'
        WHEN amount < 1000000 THEN currency || ' 500K-1M range'
        WHEN amount < 5000000 THEN currency || ' 1-5M range'
        WHEN amount < 10000000 THEN currency || ' 5-10M range'
        WHEN amount < 50000000 THEN currency || ' 10-50M range'
        ELSE currency || ' 50M+'
    END;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

-- Abstraction mapping for dates (reduce to quarter)
CREATE OR REPLACE FUNCTION anonymize_date(d DATE)
RETURNS TEXT AS $$
BEGIN
    RETURN 'Q' || EXTRACT(QUARTER FROM d)::TEXT || ' ' || EXTRACT(YEAR FROM d)::TEXT;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

-- Abstraction mapping for percentages (bucketize)
CREATE OR REPLACE FUNCTION anonymize_percentage(pct NUMERIC)
RETURNS TEXT AS $$
BEGIN
    RETURN CASE
        WHEN pct IS NULL THEN NULL         -- without this, a NULL percentage falls through to ELSE ('>80%')
        WHEN pct < 10 THEN '<10%'
        WHEN pct < 25 THEN '10-25%'
        WHEN pct < 40 THEN '25-40%'
        WHEN pct < 60 THEN '>40%'          -- "concentrated"
        WHEN pct < 80 THEN '>60%'          -- "highly concentrated"
        ELSE '>80%'                         -- "dominant position"
    END;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

-- Sector mapping: company → sector (prevents re-identification via company name)
CREATE TABLE company_sector_map (
    company_name    TEXT PRIMARY KEY,
    sector          TEXT NOT NULL,       -- 'Swiss pharma', 'European tech', etc.
    market_cap_tier TEXT NOT NULL        -- 'large-cap', 'mid-cap', 'small-cap'
);

-- The LLM-safe view: this is what gets chunked and embedded (for RAG)
-- or what the MCP server exposes (for MCP)
CREATE OR REPLACE VIEW v_portfolio_llm_safe AS
SELECT
    p.client_id,                                          -- internal ref only
    anonymize_amount(p.portfolio_value) AS portfolio_size,
    anonymize_percentage(p.concentration_pct) 
        || ' in a single ' 
        || csm.market_cap_tier 
        || ' ' || csm.sector AS concentration_description,
    anonymize_amount(t.amount) || ' in ' || csm_t.sector 
        || ', ' || anonymize_date(t.trade_date) AS recent_activity,
    anonymize_date(p.kyc_renewal_date) AS kyc_timeline,
    p.risk_rating,
    CASE WHEN p.risk_rating_changed_at > now() - INTERVAL '6 months'
         THEN 'recently upgraded' ELSE 'stable' 
    END AS risk_trend
FROM portfolios p
LEFT JOIN company_sector_map csm ON p.top_holding = csm.company_name
LEFT JOIN LATERAL (
    SELECT amount, trade_date, company_name 
    FROM transactions 
    WHERE client_id = p.client_id 
    ORDER BY trade_date DESC LIMIT 1
) t ON true
LEFT JOIN company_sector_map csm_t ON t.company_name = csm_t.company_name;

The key design principles:

Bucketize, don’t mask. Instead of replacing CHF 4.2M with [REDACTED], replace it with CHF 1-5M range. The LLM loses precision but retains the magnitude — and magnitude is what drives most reasoning. The bucket boundaries should be defined with the business: what granularity does the LLM need to produce useful answers?

Abstract to sector, not to nothing. Instead of removing Nestlé S.A., replace it with Swiss large-cap equity. The LLM can still reason about sector concentration, geographic exposure, and market cap risk. A company_sector_map table (maintained by the business or compliance team) drives this consistently.

Preserve relationships, anonymize individuals. The temporal relationship between risk rating upgrade, KYC renewal, and recent trading activity is preserved — recently upgraded + Q1 2026 KYC + Q4 2025 sale. The LLM can reason about the pattern without knowing who, exactly how much, or which stock.

Encode the rules declaratively. The abstraction logic lives in PostgreSQL views and functions, not in application code. This means it’s auditable (you can review the view definition), testable (run the view and inspect the output), and consistent (every query path through this view applies the same rules).
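The "testable" part is worth spelling out, because bucket boundaries are exactly where off-by-one mistakes hide. The sketch below is a Python mirror of the amount bucketing rules, written only to show what boundary tests could look like; the authoritative rules stay in PostgreSQL, and the choice to map a missing amount to `None` is my assumption for this sketch.

```python
from bisect import bisect_right

# Exclusive upper bounds and labels, mirroring the SQL amount buckets.
BOUNDS = [100_000, 500_000, 1_000_000, 5_000_000, 10_000_000, 50_000_000]
LABELS = [
    "<100K",
    "100K-500K range",
    "500K-1M range",
    "1-5M range",
    "5-10M range",
    "10-50M range",
    "50M+",
]


def anonymize_amount(amount, currency="CHF"):
    """Map an exact amount to its magnitude bucket.

    A missing amount (None) stays None instead of landing in a bucket.
    """
    if amount is None:
        return None
    # bisect_right counts how many bounds are <= amount, which is the
    # index of the first bucket whose upper bound is strictly greater.
    return f"{currency} {LABELS[bisect_right(BOUNDS, amount)]}"


# Boundary checks: a value equal to a bound belongs to the next bucket up.
for value in (99_999, 100_000, 4_200_000, 50_000_000, None):
    print(value, "->", anonymize_amount(value))
```

The same boundary values can be run against the SQL function to confirm that both sides agree before the view is exposed to any pipeline.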

The mapping problem: when context requires real entities

Here’s where it gets genuinely hard. Some questions require the LLM to know real entity names to be useful.

“Should the client reduce their Nestlé position given the recent earnings miss?”

If you’ve abstracted Nestlé to “Swiss large-cap FMCG,” the LLM can’t connect it to actual earnings data. It can give generic advice about concentration risk, but it can’t reason about Nestlé-specific fundamentals — which is what the relationship manager actually needs.

There are two approaches I’ve seen work in practice:

Approach A — Split context, split model calls. Use the anonymized context for client-specific reasoning (portfolio risk, concentration, suitability) and a separate, non-anonymized call for market/public data reasoning (Nestlé earnings, sector outlook). The LLM never sees both the client identity and the company name in the same context. The application layer merges the two responses.

Call 1 (anonymized): "Client has >40% concentration in a single 
                      Swiss large-cap FMCG stock. Risk rating recently 
                      upgraded. Assess concentration risk."

Call 2 (public data): "Analyze recent Nestlé S.A. earnings and 
                       outlook for institutional holders."

Application layer:    Merge both responses for the RM.

This is architecturally clean but operationally complex. You need a reliable merging layer, and the LLM can’t reason holistically across both contexts.
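A minimal sketch of that merging layer, assuming a generic `ask_llm` callable that stands in for whatever client library you use. The function names, the guard, and the `fake_llm` stub are hypothetical; the guard is a simple substring check, not a complete leak detector.

```python
from typing import Callable, Iterable


def check_no_entities(prompt: str, forbidden: Iterable[str]) -> None:
    """Refuse to send a prompt that mentions any forbidden term."""
    hits = [term for term in forbidden if term.lower() in prompt.lower()]
    if hits:
        raise ValueError(f"prompt mentions forbidden terms: {hits}")


def answer_for_rm(
    client_prompt: str,
    market_prompt: str,
    real_entities: Iterable[str],
    ask_llm: Callable[[str], str],
) -> str:
    """Run the two calls separately and merge them in the application layer."""
    # Call 1 must stay anonymized: no real company or client names.
    check_no_entities(client_prompt, real_entities)
    client_view = ask_llm(client_prompt)
    # Call 2 carries public data only; it is built without any client context.
    market_view = ask_llm(market_prompt)
    return f"Client assessment:\n{client_view}\n\nMarket view:\n{market_view}"


# Stand-in for a real model call, so the sketch runs without network access.
def fake_llm(prompt: str) -> str:
    return f"[answer to: {prompt[:40]}...]"


merged = answer_for_rm(
    "Client has >40% concentration in a single Swiss large-cap FMCG stock. "
    "Risk rating recently upgraded. Assess concentration risk.",
    "Analyze recent Nestlé S.A. earnings and outlook for institutional holders.",
    real_entities=["Nestlé", "Roche"],
    ask_llm=fake_llm,
)
print(merged)
```

The two prompts never share a context, so the only place where client data and company names meet is the merged string, which stays inside your own application.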

Approach B — Reversible pseudonymization with a mapping table. Replace real entities with consistent pseudonyms that the LLM treats as real entities. The mapping is stored in PostgreSQL, never sent to the LLM, and used by the application layer to re-hydrate the response before displaying it to the user.

-- Pseudonym mapping (never exposed to the LLM)
CREATE TABLE entity_pseudonyms (
    real_name       TEXT PRIMARY KEY,
    pseudonym       TEXT UNIQUE NOT NULL,
    entity_type     TEXT NOT NULL  -- 'company', 'person', 'fund'
);

-- Example entries:
-- ('Nestlé S.A.', 'Alpine Corp', 'company')
-- ('Roche Holding AG', 'Glacier Pharma', 'company')
-- ('Jean-Pierre Müller', 'Client Alpha', 'person')

The LLM sees: “Client Alpha has a 47% position in Alpine Corp and recently sold Glacier Pharma stock.” It can reason about concentration, sector correlation, and transaction patterns using these pseudonyms as if they were real entities. The application layer maps the pseudonyms back to real names before showing the response to the user.

The advantage: the LLM gets rich, entity-level context. It can reason about “Alpine Corp” as a coherent entity across multiple chunks and queries.

The risk: pseudonym consistency must be airtight. If “Alpine Corp” appears as “Nestlé” in one chunk due to a mapping error, you’ve leaked. The mapping table must be maintained carefully, and the transformation must happen at a single controlled layer — ideally the PostgreSQL view, not scattered across application code.
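To illustrate the round trip, here is a small Python sketch that pseudonymizes a text, checks for leaks, and re-hydrates the response. It assumes the mapping has been loaded from `entity_pseudonyms` into a dictionary; the helper names are hypothetical, and in the design described above the forward transformation would live in the PostgreSQL view rather than in application code.

```python
import re

# Hypothetical in-memory copy of entity_pseudonyms (real_name -> pseudonym).
PSEUDONYMS = {
    "Nestlé S.A.": "Alpine Corp",
    "Roche Holding AG": "Glacier Pharma",
    "Jean-Pierre Müller": "Client Alpha",
}


def _replace_all(text, mapping):
    """Replace every key of mapping found in text, in a single pass."""
    # Longest keys first so a short name never matches inside a longer one.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


def pseudonymize(text):
    """Real names -> pseudonyms, applied before anything reaches the LLM."""
    return _replace_all(text, PSEUDONYMS)


def rehydrate(text):
    """Pseudonyms -> real names, applied to the LLM response for the user."""
    return _replace_all(text, {v: k for k, v in PSEUDONYMS.items()})


def leaked_names(text):
    """Return real names still present in text that is about to be sent."""
    return [name for name in PSEUDONYMS if name in text]


raw = ("Jean-Pierre Müller has a 47% position in Nestlé S.A. "
       "and recently sold Roche Holding AG stock.")
safe = pseudonymize(raw)
print(safe)
print(leaked_names(safe))
print(rehydrate(safe) == raw)
```

Note that exact string matching only catches the spellings stored in the mapping. A variant such as "Nestlé" without "S.A." would pass through untouched, which is one reason the mapping table needs aliases and careful maintenance.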

How this differs across paradigms

The anonymization-utility balance plays out differently depending on the paradigm:

RAG: You control the transformation at embedding time. The chunks stored in pgvector can already be the anonymized/abstracted version. The LLM never has the opportunity to see raw data — it’s been transformed before it was even indexed. This is the safest model, and it’s where the PostgreSQL view approach works best: embed from v_portfolio_llm_safe, not from the raw tables.

MCP: The transformation must happen at query time, in the MCP server. This is harder because the LLM generates arbitrary SQL, and you need to ensure that every possible query path returns abstracted data. You can force the MCP server to query through the anonymized views rather than the base tables, but you need to be rigorous about not exposing the raw tables at all. One misconfigured permission and the LLM can SELECT * FROM portfolios directly.

Skills: The transformation must happen at every step of the procedure where data flows through the LLM’s context. A multi-step Skill might read raw data in step 1, transform it in step 2, and reason about it in step 3 — but if the Skill’s instructions aren’t precise about when and how to transform, the LLM might shortcut the process and pass raw data into its reasoning context.

The pattern is consistent with the governance matrix: as you move from RAG to MCP to Skills, maintaining the anonymization-utility balance gets progressively harder because you have less control over when and how the LLM encounters the data.

The measurement gap

Finally, there’s a quality measurement dimension to this that I haven’t seen well-addressed anywhere.

When you anonymize or abstract data before sending it to the LLM, you need to verify that the transformation didn’t destroy the LLM’s ability to answer correctly. This means your golden question/answer evaluation pairs (from the Adaptive RAG post) need to be run on the abstracted data, not the raw data.

If your nDCG score drops from 0.85 on raw data to 0.62 on abstracted data, your bucketization is too aggressive — the LLM is losing too much context. If it stays at 0.83, the abstraction is working. You need to measure this explicitly during development, and re-measure whenever you change the abstraction rules.
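As a rough sketch of that measurement, the Python below computes nDCG (linear-gain variant) for a ranked list of relevance grades and flags an abstraction whose mean score drops too far. The `max_drop` threshold of 0.05 is a hypothetical value for illustration; the acceptable loss is something to agree with the business.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain for relevance grades in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def abstraction_too_aggressive(raw_scores, abstracted_scores, max_drop=0.05):
    """Flag when the mean nDCG loss across golden questions exceeds max_drop."""
    raw_mean = sum(raw_scores) / len(raw_scores)
    abstracted_mean = sum(abstracted_scores) / len(abstracted_scores)
    return (raw_mean - abstracted_mean) > max_drop


# One golden question, graded relevance of the top 3 retrieved chunks.
print(ndcg([3, 2, 1]))   # ideal order
print(ndcg([1, 2, 3]))   # best chunk ranked last

# The two scenarios from the text: a large drop and a small one.
print(abstraction_too_aggressive([0.85], [0.62]))
print(abstraction_too_aggressive([0.85], [0.83]))
```

Running the same golden questions against the raw tables and against the abstracted view gives the two score lists to compare.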

This creates a feedback loop between the security team (who wants maximum anonymization) and the business team (who wants maximum answer quality). The DBA sits in the middle, tuning the PostgreSQL views and measuring the impact. In my experience, this negotiation — finding the right bucket boundaries, the right sector mappings, the right level of temporal abstraction — takes more time than building the RAG pipeline itself. But it’s the work that makes the difference between a system that compliance signs off on and one that stays in the sandbox.

What FINMA Expects (and Where Each Paradigm Stands)

Without going into a full regulatory analysis, FINMA’s Circular 2023/1 on operational risks and resilience and the broader EBA guidelines on ICT and security risk management establish expectations that directly affect paradigm choice:

Traceability and auditability: every data access must be traceable to a specific user, purpose, and time. RAG satisfies this naturally through database audit logs. MCP requires careful logging at the server layer. Skills require comprehensive step-by-step execution logging.

Data classification enforcement: sensitive data must be protected according to its classification level. RAG enforces this at retrieval time via RLS and masking. MCP and Skills require the enforcement to happen at the tool/execution layer, with no guarantee that the LLM won’t combine or infer beyond what’s permitted.

Outsourcing and third-party risk: if the LLM is a cloud service (OpenAI, Anthropic API), the data sent to it matters. RAG sends chunks — you control exactly what leaves your perimeter. MCP sends query results — potentially broader. Skills might send intermediate computation results, client data, or procedure outputs.

Model risk management: FINMA expects institutions to understand and manage model risk. For RAG, the “model” is the embedding model + the retrieval logic — both well-defined and testable. For MCP, the model risk includes the LLM’s query generation behavior. For Skills, the model risk encompasses the LLM’s procedural execution behavior — much harder to bound.

The Irony: We Need Databases Again

I want to close with an observation that keeps coming back to me.

As LLM-powered applications grow more complex — longer conversations, more context, more tools, more steps — we’re running into a fundamental problem: how do you manage context effectively across long-running interactions?

Today’s LLMs have fixed context windows. When the conversation exceeds that window, information is lost. The current solutions? Summarization (lossy), truncation (lossy), or… indexing the context into retrievable storage and fetching relevant pieces when needed.

Sound familiar? That’s essentially what databases have been doing since the 1960s. There’s a recent paper on Recursive Language Models that literally recreates context indexing into files to improve accuracy and avoid losing details in long-running conversations. It’s the application-database paradigm, but at the ISAM level.

The irony is not lost on me. We spent decades building sophisticated data management systems — indexing, caching, query optimization, transaction management, access control. Now the AI community is rediscovering these patterns from first principles, often without the benefit of the lessons we’ve already learned.

As the AI ecosystem matures, I believe we’ll see the database layer become more central, not less. Whether it’s pgvector for RAG, PostgreSQL as an MCP server, or a context store for long-running agent conversations — the principles of data management don’t change just because the client is a language model instead of a human.

And that’s good news for those of us who’ve spent our careers in databases. The skills transfer. The patterns transfer. RLS still works. Audit logging still matters. ACID still matters. The challenge is adapting our expertise to new trust boundaries and new failure modes.

Summary

The choice between RAG, MCP, and Skills is not primarily a technical decision. It’s a governance decision.

RAG keeps the LLM outside the data perimeter. MCP lets the LLM inside, as a query agent. Skills give the LLM the keys to execute procedures. Each step along that path increases capability but also increases the surface area for data leakage, inference attacks, and compliance failures.

There’s another dimension to this progression that deserves explicit attention: as you move from RAG to MCP to Skills, you are also shifting trust from your own architecture to the LLM platform provider. With RAG, the LLM is a stateless text generator — your retrieval pipeline, your database, your access controls do the heavy lifting. With MCP and Skills, you are increasingly relying on the LLM’s behavioral compliance, its tool-use reliability, and the platform’s guarantees around data handling, logging, and isolation. In practice, this means trusting Anthropic, OpenAI, or whichever provider powers your inference layer to uphold the security properties your regulator demands.

To their credit, these providers are investing heavily in enterprise readiness. Anthropic and OpenAI both now offer features specifically designed for regulated environments — data residency controls, zero data retention options, SOC 2 compliance, audit logging, and increasingly granular access management. The MCP specification itself was donated to the Linux Foundation’s Agentic AI Foundation in December 2025, signaling a move toward vendor-neutral governance of the protocol layer. These are meaningful steps. But they don’t change the fundamental architectural reality: every capability you delegate to the LLM platform is a capability you no longer enforce within your own perimeter. For a CISO in a FINMA-regulated institution, “the provider is SOC 2 compliant” is a necessary condition, not a sufficient one.

For regulated environments — banking, healthcare, critical infrastructure — this progression matters. The question is not “which paradigm is most powerful?” but “which paradigm can I govern, audit, and explain to my regulator?”

As a DBA, my bias is clear: keep as much as possible within the database’s governance perimeter. PostgreSQL’s security model has been battle-tested for decades. The LLM is a powerful new client, but it’s still a client, and the database’s rules should still apply. Your data pipelines should let you build vertical defensibility into the architecture and keep your governance and business logic decoupled from the LLM.

The industry will figure out governance for MCP and Skills. The academic work on text-to-SQL evaluation is advancing. The tooling for behavioral evaluation of AI agents is improving. Enterprise features from LLM providers will continue to mature. But today, in February 2026, if a CISO asks me “can I deploy this safely?” — for RAG, I can say yes with confidence. For MCP, I can say yes with caveats. For Skills, I say: let’s build the evaluation framework first.

The article RAG, MCP, Skills — Three Paradigms for LLMs Talking to Your Database, and Why Governance Changes Everything first appeared on dbi Blog.

Oracle to PostgreSQL Migration with Flink CDC

Adrien Obernesser — Sun, 30 Nov 2025 22:04:54 +0000

Introduction

When migrating from the big red to PostgreSQL, most of the time you can afford the downtime of an export/import process and start from something fresh. It is simple and reliable, and Ora2pg is one of the go-to tools for that. But sometimes you can’t afford the downtime, either because the database is critical for business operations or because it is too big for the export/import process to finish in an acceptable window.
Hence the following example of “logical replication” between Oracle and PostgreSQL using Flink CDC. I call it that even though it is really an event stream process, because for DBAs it has roughly the same limitations and constraints as standard logical replication.

Here is the layout:

Oracle Source → Flink CDC → JDBC Sink → PostgreSQL Target

This approach is based on production experience migrating large Oracle databases, where we achieved throughput of 19,500 records per second—a 65x improvement over our initial baseline. But more importantly, it transformed a “big bang” migration event into a controlled, observable, and recoverable process.
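To put those two numbers in perspective, here is a quick back-of-the-envelope calculation in Python. The throughput and the 65x factor come from the paragraph above; the one-billion-row table is a hypothetical size chosen for illustration, not a figure from the project.

```python
FINAL_RATE = 19_500        # records per second, from the text above
IMPROVEMENT = 65           # stated speed-up over the initial baseline
ROWS = 1_000_000_000       # hypothetical table size, not from the project

# Implied initial baseline: 19,500 / 65 = 300 records per second.
baseline_rate = FINAL_RATE / IMPROVEMENT


def hours_to_copy(rows, rate):
    """Hours needed to move `rows` records at `rate` records per second."""
    return rows / rate / 3600


print(f"baseline rate: {baseline_rate:.0f} rec/s")
print(f"at baseline:   {hours_to_copy(ROWS, baseline_rate):.0f} h")
print(f"at final rate: {hours_to_copy(ROWS, FINAL_RATE):.1f} h")
```

For a table of that size, the difference is between a copy that takes weeks and one that fits into a single night, which is what makes the tuning effort worthwhile.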

The geek in me says that Flink CDC is a powerful tool for migrations. The consultant says it should not be used blindly—it’s relevant for specific use cases where the benefits outweigh the operational complexity.

What Each Piece Does

Oracle (source): The source database. Flink CDC reads directly from tables via JDBC for snapshot mode, or from redo logs for streaming CDC mode.
Flink CDC source (Oracle): A Flink table that wraps the Debezium Oracle connector. It reads data and turns it into a changelog stream (insert/update/delete). Key options control snapshot mode, parallelism, and fetch sizes.
Flink runtime: Runs a streaming job that:
- Snapshot: Reads current table state, optionally in parallel chunks
- Checkpoints: State is stored so restarts resume exactly from the last acknowledged point
- Transforms: You can filter, project, cast types, and even aggregate in Flink SQL
JDBC sink (PostgreSQL): Another Flink table. With a PRIMARY KEY defined, the connector performs UPSERT semantics (INSERT ... ON CONFLICT DO UPDATE in PostgreSQL). It writes in batches, flushes on checkpoints, and retries on transient errors.
PostgreSQL (target): Receives the stream and ends up with the migrated data. With proper tuning (especially rewriteBatchedInserts=true), it can handle high throughput.

Flink and Debezium: How CDC Works

Flink CDC connectors use Debezium, an open-source platform for Change Data Capture that captures row-level changes in databases by reading their transaction logs.

┌───────────────────────────────────────────────────────────────────────┐
│                        Flink CDC Architecture                         │
│                                                                       │
│  ┌──────────────┐    ┌────────────────────────────────┐    ┌────────┐ │
│  │    Oracle    │    │      Flink CDC Connector       │    │  Sink  │ │
│  │   Database   │    │  ┌─────────────────────────┐   │    │Database│ │
│  │              │───►│  │   Debezium (embedded)   │   │───►│        │ │
│  │  • Redo logs │    │  │   ─────────────────     │   │    │  Post- │ │
│  │  • Tables    │    │  │   • Oracle connector    │   │    │  greSQL│ │
│  │              │    │  │   • Log parsing         │   │    │        │ │
│  └──────────────┘    │  │   • Event streaming     │   │    └────────┘ │
│                      │  └─────────────────────────┘   │               │
│                      └────────────────────────────────┘               │
└───────────────────────────────────────────────────────────────────────┘

Why Debezium?

Log-based CDC: Reads database transaction logs, not polling tables—much lower overhead
Low impact: Minimal performance hit on source database
Exactly-once delivery: When combined with Flink’s checkpointing
Schema tracking: Handles schema evolution in streaming scenarios

Snapshot vs. CDC Modes

When you configure a Flink CDC source, you can choose:

Snapshot Only: Read current table state (what we use in this demo)—fastest for one-time migrations
Snapshot + CDC: Initial snapshot, then stream ongoing changes—for zero-downtime migrations
CDC Only: Stream only new changes (requires existing snapshot)

Note: the snapshot itself can be done within one transaction (which can take long for big tables) or as an incremental snapshot. Since I am using Oracle Express Edition for this demo, I will stick with the normal snapshot. If you have big tables to load, Standard or Enterprise Edition is required for supplemental logging.

Anatomy of a Flink SQL Pipeline

A Flink SQL migration pipeline has four distinct parts. Understanding each part is critical for troubleshooting and optimization.

Part 1: Runtime Configuration (SET Statements)

These settings control how the Flink job executes. Think of them as the “knobs” you turn to tune behavior:

-- Pipeline identification
SET 'pipeline.name' = 'Oracle-to-PostgreSQL: CUSTOMERS Migration';

-- Runtime mode: STREAMING for CDC, BATCH for one-time loads
SET 'execution.runtime-mode' = 'STREAMING';

-- Parallelism: how many workers process data concurrently
SET 'parallelism.default' = '4';

-- Checkpointing: how often Flink saves progress for recovery
SET 'execution.checkpointing.mode' = 'AT_LEAST_ONCE';
SET 'execution.checkpointing.interval' = '60 s';
SET 'execution.checkpointing.timeout' = '10 min';
SET 'execution.checkpointing.min-pause' = '30 s';

-- Restart strategy: what happens on failure
SET 'restart-strategy.type' = 'fixed-delay';
SET 'restart-strategy.fixed-delay.attempts' = '3';
SET 'restart-strategy.fixed-delay.delay' = '10 s';

Key points:

AT_LEAST_ONCE is faster than EXACTLY_ONCE for snapshot migrations where idempotency is guaranteed by upserts
Checkpoint interval affects both recovery granularity and overhead
Higher parallelism isn’t always better—you can hit contention on the target
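The first point deserves a small illustration. The Python sketch below models a keyed sink as a dictionary and shows why replayed events are harmless under upsert semantics. It is a toy model with invented data, assuming stable primary keys and replays that preserve event order; it is not how Flink or the JDBC connector is implemented.

```python
def apply_upserts(target, events):
    """Apply (key, row) events with upsert semantics: last write per key wins."""
    for key, row in events:
        target[key] = row
    return target


events = [(1, "alice"), (2, "bob"), (3, "carol")]

# Every event delivered exactly once.
clean = apply_upserts({}, events)

# At-least-once: a failure after event 2 causes a restart from the last
# checkpoint, so events 1 and 2 are delivered a second time.
delivered = events[:2] + events
replayed = apply_upserts({}, delivered)

# A keyed upsert ends in the same state; a plain append would hold duplicates.
print(clean == replayed)
print(len(delivered), "deliveries,", len(replayed), "rows in the target")
```

Without a primary key on the sink table there is no upsert, and the same replay would leave duplicate rows in PostgreSQL.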

Part 2: Source Table Definition (Oracle CDC)

This defines how Flink reads from Oracle. The column definitions must match your Oracle schema, using Flink SQL types:

DROP TABLE IF EXISTS src_customers;

CREATE TABLE src_customers (
    -- Column definitions must match Oracle schema
    -- Use Flink SQL types that map to Oracle types
    CUSTOMER_ID   DECIMAL(10,0),
    FIRST_NAME    STRING,
    LAST_NAME     STRING,
    EMAIL         STRING,
    PHONE         STRING,
    CREATED_AT    TIMESTAMP(6),
    STATUS        STRING,
    -- Primary key is required for CDC (NOT ENFORCED = Flink won't validate)
    PRIMARY KEY (CUSTOMER_ID) NOT ENFORCED
) WITH (
    -- Connector type: oracle-cdc (uses Debezium internally)
    'connector' = 'oracle-cdc',

    -- Oracle connection details
    'hostname' = 'oracle',
    'port' = '1521',
    'username' = 'demo',
    'password' = 'demo',

    -- Database configuration (pluggable database for Oracle XE)
    -- Use url to connect via service name instead of SID
    'url' = 'jdbc:oracle:thin:@//oracle:1521/XEPDB1',
    'database-name' = 'XEPDB1',
    'schema-name' = 'DEMO',
    'table-name' = 'CUSTOMERS',

    -- Snapshot mode: 'initial' = full snapshot, then stop (for snapshot-only)
    'scan.startup.mode' = 'initial',

    -- IMPORTANT: Disable incremental snapshot for this demo
    -- Incremental snapshot requires additional Oracle configuration
    'scan.incremental.snapshot.enabled' = 'false',

    -- Debezium snapshot configuration
    'debezium.snapshot.mode' = 'initial',
    'debezium.snapshot.fetch.size' = '10000'
);

Key concepts:

PRIMARY KEY NOT ENFORCED: Tells Flink the key exists but it won’t validate uniqueness
scan.incremental.snapshot.enabled: Set to false for simple snapshots; true requires Oracle archive log mode and supplemental logging
debezium.snapshot.fetch.size: How many rows to fetch per database round-trip—larger = fewer round-trips
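The effect of the fetch size is simple arithmetic, sketched below in Python. The table size and the alternative fetch size of 1,000 are hypothetical values for comparison; only the 10,000 setting comes from the configuration above.

```python
import math


def round_trips(rows, fetch_size):
    """Number of fetch calls needed to read `rows` rows."""
    return math.ceil(rows / fetch_size)


ROWS = 50_000_000  # hypothetical table size

for fetch_size in (1_000, 10_000):
    print(f"fetch size {fetch_size:>6}: {round_trips(ROWS, fetch_size):>6} round-trips")
```

The trade-off is memory: each fetch has to hold that many rows at once, so very wide rows may call for a smaller value.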

Part 3: Sink Table Definition (PostgreSQL JDBC)

This defines how Flink writes to PostgreSQL:

DROP TABLE IF EXISTS sink_customers;

CREATE TABLE sink_customers (
    -- Column definitions for PostgreSQL target
    customer_id   BIGINT,
    first_name    STRING,
    last_name     STRING,
    email         STRING,
    phone         STRING,
    created_at    TIMESTAMP(6),
    status        STRING,
    PRIMARY KEY (customer_id) NOT ENFORCED
) WITH (
    -- Connector type: jdbc
    'connector' = 'jdbc',

    -- PostgreSQL connection with batch optimization
    -- rewriteBatchedInserts=true is CRITICAL for performance (5-10x improvement)
    'url' = 'jdbc:postgresql://postgres:5432/demo?rewriteBatchedInserts=true',
    'table-name' = 'customers',
    'username' = 'demo',
    'password' = 'demo',
    'driver' = 'org.postgresql.Driver',

    -- Sink parallelism (tune based on target DB capacity)
    -- Too high can cause contention; 4-8 is usually optimal
    'sink.parallelism' = '4',

    -- Buffer configuration for throughput
    'sink.buffer-flush.max-rows' = '10000',
    'sink.buffer-flush.interval' = '5 s',

    -- Retry configuration
    'sink.max-retries' = '3'
);

Key optimization: rewriteBatchedInserts=true is critical for PostgreSQL performance. This tells the JDBC driver to rewrite individual INSERT statements into a single multi-row INSERT:

Without this:

INSERT INTO t (a,b) VALUES (1,'x');
INSERT INTO t (a,b) VALUES (2,'y');
INSERT INTO t (a,b) VALUES (3,'z');

With rewriteBatchedInserts=true:

INSERT INTO t (a,b) VALUES (1,'x'),(2,'y'),(3,'z');

This single change gave us a 5-10x throughput improvement in production.
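To show the shape of that rewrite, here is a small Python sketch that collapses a batch of single-row inserts into one multi-row statement with placeholders. It only illustrates the idea; it is not the PostgreSQL JDBC driver's implementation, and the `rewrite_batch` helper is hypothetical.

```python
def rewrite_batch(table, columns, rows):
    """Collapse a batch of single-row inserts into one multi-row statement.

    Returns (sql, params), where params is the flattened list of values
    in the same order as the placeholders.
    """
    if not rows:
        raise ValueError("empty batch")
    row_placeholder = "(" + ",".join("?" for _ in columns) + ")"
    sql = (
        f"INSERT INTO {table} ({','.join(columns)}) VALUES "
        + ",".join(row_placeholder for _ in rows)
    )
    params = [value for row in rows for value in row]
    return sql, params


sql, params = rewrite_batch("t", ["a", "b"], [(1, "x"), (2, "y"), (3, "z")])
print(sql)
print(params)
```

With a flush size of 10,000 rows, the sink sends far fewer statements per batch, which is where the saving comes from.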

Part 4: Data Flow (INSERT…SELECT)

This starts the actual data migration. The CAST operations convert Oracle types to PostgreSQL types:

INSERT INTO sink_customers
SELECT
    CAST(CUSTOMER_ID AS BIGINT) AS customer_id,
    FIRST_NAME AS first_name,
    LAST_NAME AS last_name,
    EMAIL AS email,
    PHONE AS phone,
    CREATED_AT AS created_at,
    STATUS AS status
FROM src_customers;

This single statement:

Reads from the Oracle source table
Transforms data types (CAST operations)
Writes to the PostgreSQL sink table
Handles batching, parallelism, and fault tolerance automatically
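Why cast `DECIMAL(10,0)` to `BIGINT` and not to a plain integer? A quick range check in Python makes the reasoning explicit. The helper name is hypothetical; the integer limits are those of PostgreSQL's `smallint`, `integer`, and `bigint` types.

```python
# Upper limits of PostgreSQL's signed integer types.
PG_INTEGER_TYPES = [
    ("smallint", 2**15 - 1),
    ("integer", 2**31 - 1),
    ("bigint", 2**63 - 1),
]


def smallest_integer_type(precision):
    """Smallest PostgreSQL integer type that can hold any DECIMAL(precision, 0)."""
    max_value = 10**precision - 1   # e.g. DECIMAL(10,0) -> 9,999,999,999
    for name, limit in PG_INTEGER_TYPES:
        if max_value <= limit:
            return name
    return "numeric"                # too wide for any fixed-size integer


for precision in (4, 9, 10, 18, 19):
    print(f"DECIMAL({precision},0) -> {smallest_integer_type(precision)}")
```

A ten-digit identifier can exceed the `integer` limit of 2,147,483,647, so `BIGINT` is the safe target for `CUSTOMER_ID` in this example.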

Complete SQL Example

Here is the complete migration pipeline that you can run in Flink SQL Client. This is production-ready code with all the optimizations we’ve discussed:

-- =============================================================================
-- Flink CDC Pipeline: Migrate CUSTOMERS table (Oracle -> PostgreSQL)
-- =============================================================================
-- Mode: Snapshot-only (no incremental, no streaming)
-- Source: Oracle XE 21c
-- Target: PostgreSQL 18
-- =============================================================================

-- ============================================================================
-- PART 1: Runtime Configuration (SET statements)
-- ============================================================================
-- These settings control how the Flink job executes

SET 'pipeline.name' = 'Oracle-to-PostgreSQL: CUSTOMERS Migration';
SET 'execution.runtime-mode' = 'STREAMING';
SET 'parallelism.default' = '4';

-- Checkpointing configuration
-- AT_LEAST_ONCE is faster for snapshot/migration workloads
SET 'execution.checkpointing.mode' = 'AT_LEAST_ONCE';
SET 'execution.checkpointing.interval' = '60 s';
SET 'execution.checkpointing.timeout' = '10 min';
SET 'execution.checkpointing.min-pause' = '30 s';

-- Restart strategy for fault tolerance
SET 'restart-strategy.type' = 'fixed-delay';
SET 'restart-strategy.fixed-delay.attempts' = '3';
SET 'restart-strategy.fixed-delay.delay' = '10 s';

-- ============================================================================
-- PART 2: Source Table Definition (Oracle CDC)
-- ============================================================================
-- This defines how Flink reads from Oracle using Debezium under the hood

DROP TABLE IF EXISTS src_customers;

CREATE TABLE src_customers (
    -- Column definitions must match Oracle schema
    -- Use Flink SQL types that map to Oracle types
    CUSTOMER_ID   DECIMAL(10,0),
    FIRST_NAME    STRING,
    LAST_NAME     STRING,
    EMAIL         STRING,
    PHONE         STRING,
    CREATED_AT    TIMESTAMP(6),
    STATUS        STRING,
    -- Primary key is required for CDC (NOT ENFORCED = Flink won't validate)
    PRIMARY KEY (CUSTOMER_ID) NOT ENFORCED
) WITH (
    -- Connector type: oracle-cdc (uses Debezium internally)
    'connector' = 'oracle-cdc',

    -- Oracle connection details
    'hostname' = 'oracle',
    'port' = '1521',
    'username' = 'demo',
    'password' = 'demo',

    -- Database configuration (pluggable database for Oracle XE)
    -- Use url to connect via service name instead of SID
    'url' = 'jdbc:oracle:thin:@//oracle:1521/XEPDB1',
    'database-name' = 'XEPDB1',
    'schema-name' = 'DEMO',
    'table-name' = 'CUSTOMERS',

    -- Snapshot mode: 'initial' = full snapshot, then stop (for snapshot-only)
    'scan.startup.mode' = 'initial',

    -- IMPORTANT: Disable incremental snapshot for this demo
    -- Incremental snapshot requires additional Oracle configuration
    'scan.incremental.snapshot.enabled' = 'false',

    -- Debezium snapshot configuration
    'debezium.snapshot.fetch.size' = '10000'
);

-- ============================================================================
-- PART 3: Sink Table Definition (PostgreSQL JDBC)
-- ============================================================================
-- This defines how Flink writes to PostgreSQL

DROP TABLE IF EXISTS sink_customers;

CREATE TABLE sink_customers (
    -- Column definitions for PostgreSQL target
    customer_id   BIGINT,
    first_name    STRING,
    last_name     STRING,
    email         STRING,
    phone         STRING,
    created_at    TIMESTAMP(6),
    status        STRING,
    PRIMARY KEY (customer_id) NOT ENFORCED
) WITH (
    -- Connector type: jdbc
    'connector' = 'jdbc',

    -- PostgreSQL connection with batch optimization
    -- rewriteBatchedInserts=true is CRITICAL for performance (5-10x improvement)
    'url' = 'jdbc:postgresql://postgres:5432/demo?rewriteBatchedInserts=true',
    'table-name' = 'customers',
    'username' = 'demo',
    'password' = 'demo',
    'driver' = 'org.postgresql.Driver',

    -- Sink parallelism (tune based on target DB capacity)
    -- Too high can cause contention; 4-8 is usually optimal
    'sink.parallelism' = '4',

    -- Buffer configuration for throughput
    'sink.buffer-flush.max-rows' = '10000',
    'sink.buffer-flush.interval' = '5 s',

    -- Retry configuration
    'sink.max-retries' = '3'
);

-- ============================================================================
-- PART 4: Data Flow (INSERT...SELECT)
-- ============================================================================
-- This starts the actual data migration
-- CAST operations convert Oracle types to PostgreSQL types

INSERT INTO sink_customers
SELECT
    CAST(CUSTOMER_ID AS BIGINT) AS customer_id,
    FIRST_NAME AS first_name,
    LAST_NAME AS last_name,
    EMAIL AS email,
    PHONE AS phone,
    CREATED_AT AS created_at,
    STATUS AS status
FROM src_customers;

LAB Setup

In my lab I am using PostgreSQL 18 and Oracle XE Docker containers, together with the Flink Job Manager and Task Manager containers, with the following definition:

Create a docker-compose.yml:

services:
  # Oracle Database 21c XE (Source)
  oracle:
    image: gvenzl/oracle-xe:21-slim-faststart
    container_name: oracle-demo
    environment:
      ORACLE_PASSWORD: OracleDemo123
      APP_USER: demo
      APP_USER_PASSWORD: demo
    ports:
      - "1521:1521"
    volumes:
      - oracle-data:/opt/oracle/oradata
      - ./oracle-init:/container-entrypoint-initdb.d
    healthcheck:
      test: ["CMD", "healthcheck.sh"]
      interval: 30s
      timeout: 10s
      retries: 10
      start_period: 120s
    networks:
      - flink-network

  # PostgreSQL 18 (Target)
  postgres:
    image: postgres:18
    container_name: postgres-demo
    environment:
      POSTGRES_USER: demo
      POSTGRES_PASSWORD: demo
      POSTGRES_DB: demo
    ports:
      - "5432:5432"
    volumes:
      - postgres-data:/var/lib/postgresql
      - ./postgres-init:/docker-entrypoint-initdb.d
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U demo -d demo"]
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - flink-network

  # Flink Job Manager
  flink-jobmanager:
    build:
      context: ./flink
      dockerfile: Dockerfile
    container_name: flink-jobmanager
    ports:
      - "8081:8081"
    command: jobmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: flink-jobmanager
        jobmanager.memory.process.size: 1600m
        parallelism.default: 4
        state.backend.type: rocksdb
    volumes:
      - ./pipelines:/opt/flink/pipelines
    networks:
      - flink-network
    depends_on:
      oracle:
        condition: service_healthy
      postgres:
        condition: service_healthy

  # Flink Task Manager
  flink-taskmanager:
    build:
      context: ./flink
      dockerfile: Dockerfile
    container_name: flink-taskmanager
    command: taskmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: flink-jobmanager
        taskmanager.memory.process.size: 2048m
        taskmanager.numberOfTaskSlots: 8
    networks:
      - flink-network
    depends_on:
      - flink-jobmanager

volumes:
  oracle-data:
  postgres-data:

networks:
  flink-network:
    driver: bridge

Create flink/Dockerfile:

FROM flink:1.20.3-scala_2.12-java11

# Download Flink CDC connector for Oracle
RUN wget -P /opt/flink/lib/ \
    https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-oracle-cdc/3.5.0/flink-sql-connector-oracle-cdc-3.5.0.jar

# Download JDBC connector
RUN wget -P /opt/flink/lib/ \
    https://repo1.maven.org/maven2/org/apache/flink/flink-connector-jdbc/3.2.0-1.19/flink-connector-jdbc-3.2.0-1.19.jar

# Download PostgreSQL JDBC driver
RUN wget -P /opt/flink/lib/ \
    https://repo1.maven.org/maven2/org/postgresql/postgresql/42.7.4/postgresql-42.7.4.jar

# Download Oracle JDBC driver
RUN wget -P /opt/flink/lib/ \
    https://repo1.maven.org/maven2/com/oracle/database/jdbc/ojdbc11/23.5.0.24.07/ojdbc11-23.5.0.24.07.jar

Access the Flink Web UI at: http://localhost:8081

Running the Migration

Let’s execute the actual migration with full command outputs.

Step 1: Verify Source Data (Before Migration)

$ docker exec oracle-demo bash -c "echo 'SELECT COUNT(*) FROM customers;' | \
    sqlplus -s demo/demo@//localhost:1521/XEPDB1"

  COUNT(*)
----------
     10000

Step 2: Verify Target is Empty (Before Migration)

$ docker exec postgres-demo psql -U demo -d demo -c "SELECT COUNT(*) FROM customers;"

 count
-------
     0
(1 row)

Step 3: Run the Migration Pipeline

$ docker exec flink-jobmanager /opt/flink/bin/sql-client.sh \
    -f /opt/flink/pipelines/migrate-customers.sql

[INFO] Executing SQL from file.
Flink SQL> SET 'pipeline.name' = 'Oracle-to-PostgreSQL: CUSTOMERS Migration';
[INFO] Execute statement succeeded.
...
Flink SQL> INSERT INTO sink_customers SELECT ...
[INFO] Submitting SQL update statement to the cluster...
[INFO] SQL update statement has been successfully submitted to the cluster:
Job ID: c554d99dce69b084607080502c13ffca

Step 4: Monitor Progress

Check the Flink Web UI at http://localhost:8081 or use the REST API:

$ curl -s http://localhost:8081/jobs | jq '.jobs[] | select(.status == "RUNNING" or .status == "FINISHED")'
{
  "id": "c554d99dce69b084607080502c13ffca",
  "status": "FINISHED"
}
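
For scripted monitoring, the same check can be sketched in Python with only the standard library. This assumes the /jobs response shape shown above (a "jobs" array of id/status objects). The helper names fetch_jobs and jobs_in_states are illustrative and not part of Flink.

```python
import json
from urllib.request import urlopen

def fetch_jobs(base_url="http://localhost:8081"):
    """Call the Flink REST endpoint /jobs and return the decoded JSON payload."""
    with urlopen(f"{base_url}/jobs", timeout=10) as resp:
        return json.load(resp)

def jobs_in_states(payload, states=("RUNNING", "FINISHED")):
    """Keep only the jobs whose status is in `states` (same filter as the jq call)."""
    return [job for job in payload.get("jobs", []) if job.get("status") in states]

# Offline example using the payload shape shown above; the second job is made up
sample = {"jobs": [
    {"id": "c554d99dce69b084607080502c13ffca", "status": "FINISHED"},
    {"id": "hypothetical-failed-job", "status": "FAILED"},
]}
print(jobs_in_states(sample))
```

Against the lab, jobs_in_states(fetch_jobs()) would return the same information as the curl and jq pipeline.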

Step 5: Verify Migration (After)

$ docker exec postgres-demo psql -U demo -d demo -c "SELECT COUNT(*) FROM customers;"

 count
-------
 10000
(1 row)

$ docker exec postgres-demo psql -U demo -d demo -c "SELECT * FROM customers LIMIT 3;"

 customer_id |  first_name   |  last_name   |          email           |    phone
-------------+---------------+--------------+--------------------------+-------------
        8836 | FirstName8836 | LastName8836 | customer8836@example.com | +1-555-8836
        4740 | FirstName4740 | LastName4740 | customer4740@example.com | +1-555-4740
        8835 | FirstName8835 | LastName8835 | customer8835@example.com | +1-555-8835
(3 rows)

Type Mapping Reference

Before migrating data, you need to understand the type conversions. Here’s a reference:

Oracle	Flink SQL	PostgreSQL	Notes
NUMBER(10)	DECIMAL(10,0)	BIGINT	Use CAST in SELECT
NUMBER(12,2)	DECIMAL(12,2)	NUMERIC(12,2)	Direct mapping
VARCHAR2(n)	STRING	VARCHAR(n)	Direct mapping
DATE	TIMESTAMP(6)	TIMESTAMP	Oracle DATE includes time
TIMESTAMP	TIMESTAMP(6)	TIMESTAMP	Direct mapping
CLOB	STRING	TEXT	Large text
BLOB	BYTES	BYTEA	Binary data
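
When pipelines are generated for many tables, the reference above can be encoded as a small lookup. The following is a minimal sketch that covers only the rows of this table. The regular expressions and the function name map_oracle_type are assumptions made for illustration, not part of Flink or any driver.

```python
import re

# (Oracle pattern, Flink SQL type, PostgreSQL type), taken from the table above.
# Note: the table only lists NUMBER(10) -> BIGINT; precisions above 18 digits
# would not fit into BIGINT and need NUMERIC instead.
TYPE_MAP = [
    (r"NUMBER\((\d+)\)", "DECIMAL({0},0)", "BIGINT"),
    (r"NUMBER\((\d+),(\d+)\)", "DECIMAL({0},{1})", "NUMERIC({0},{1})"),
    (r"VARCHAR2\((\d+)\)", "STRING", "VARCHAR({0})"),
    (r"DATE", "TIMESTAMP(6)", "TIMESTAMP"),
    (r"TIMESTAMP", "TIMESTAMP(6)", "TIMESTAMP"),
    (r"CLOB", "STRING", "TEXT"),
    (r"BLOB", "BYTES", "BYTEA"),
]

def map_oracle_type(oracle_type):
    """Return (flink_type, postgres_type) for an Oracle column type, or raise."""
    normalized = oracle_type.strip().upper().replace(" ", "")
    for pattern, flink, postgres in TYPE_MAP:
        match = re.fullmatch(pattern, normalized)
        if match:
            groups = match.groups()
            return flink.format(*groups), postgres.format(*groups)
    raise ValueError(f"No mapping defined for {oracle_type!r}")

print(map_oracle_type("NUMBER(12,2)"))
print(map_oracle_type("VARCHAR2(100)"))
```

Raising on unknown types is a deliberate choice here: an unmapped column should stop pipeline generation rather than be converted silently.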

Performance Optimization: What We Learned

Based on production experience, here are the key optimizations that improved throughput from 300 rec/sec to 19,500 rec/sec (65x improvement).

Understanding CPU-Bound vs. IOPS-Bound Pipelines

Before tuning, you need to understand what’s limiting your pipeline. This is critical because the solutions are different:

CPU-Bound Pipeline:

Symptoms: High CPU usage on Flink Task Manager, low disk I/O on target database
Cause: Complex transformations, serialization/deserialization overhead, too few parallel workers
Solution: Increase parallelism, simplify transformations, use more Task Manager slots

IOPS-Bound Pipeline:

Symptoms: Low CPU usage on Flink, high disk I/O or lock contention on target database
Cause: Too many small writes, target database bottleneck, excessive parallelism causing lock contention
Solution: Larger batch sizes, rewriteBatchedInserts=true, reduce sink parallelism, tune target database

Network-Bound Pipeline:

Symptoms: High network wait times, gaps between source reads and sink writes
Cause: Small fetch sizes, high latency between Flink and databases
Solution: Larger fetch sizes, co-locate components when possible

How to Identify Your Bottleneck

In the Flink Web UI, look at:

Backpressure indicators: Red/yellow backpressure on source = sink can’t keep up (IOPS-bound)
Records sent/received: Compare source output rate vs. sink input rate
Checkpoint duration: Long checkpoints often indicate IOPS issues on state backend
Task Manager metrics: CPU%, memory usage, GC pauses

On your databases:

-- Oracle: Check redo log generation rate
SELECT * FROM V$SYSSTAT WHERE NAME LIKE '%redo%';

-- PostgreSQL: Check write activity
SELECT * FROM pg_stat_bgwriter;
SELECT * FROM pg_stat_database WHERE datname = 'demo';

Critical Optimizations

1. JDBC Batch Rewriting (5-10x Improvement)

The single most impactful optimization for IOPS-bound pipelines:

'url' = 'jdbc:postgresql://host/db?rewriteBatchedInserts=true'

This is so important I’ll repeat it: this single parameter gave us 5-10x throughput improvement. Without it, every row is a separate INSERT statement. With it, rows are batched into efficient multi-row INSERTs.
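
To make the difference concrete, here is a conceptual sketch of the two statement shapes. It is an illustration of the idea only, not the PostgreSQL JDBC driver's actual code, and the function names are hypothetical.

```python
def single_row_inserts(table, columns, rows):
    """One INSERT statement per row (the shape described above without rewriting)."""
    cols = ", ".join(columns)
    placeholders = ", ".join(["?"] * len(columns))
    return [f"INSERT INTO {table} ({cols}) VALUES ({placeholders})" for _ in rows]

def multi_row_insert(table, columns, rows):
    """One INSERT carrying every row of the batch (the shape produced by rewriting)."""
    cols = ", ".join(columns)
    one_row = "(" + ", ".join(["?"] * len(columns)) + ")"
    return f"INSERT INTO {table} ({cols}) VALUES " + ", ".join([one_row] * len(rows))

rows = [(1, "a@example.com"), (2, "b@example.com"), (3, "c@example.com")]
print(len(single_row_inserts("customers", ["customer_id", "email"], rows)))  # 3 statements
print(multi_row_insert("customers", ["customer_id", "email"], rows))
```

With a flush size of 10,000 rows, the first shape means 10,000 statements for the server to process per flush, while the second collapses them into far fewer, larger statements.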

2. Sink Parallelism (2-4x Improvement)

More workers can process more data—but there’s a sweet spot:

'sink.parallelism' = '12'

Our testing showed:

Parallelism	Throughput	Notes
1	5,000 rec/sec	Baseline
4	12,000 rec/sec	Good improvement
8	17,000 rec/sec	Still scaling
12	19,500 rec/sec	Sweet spot
16	18,000 rec/sec	Contention starts
24	15,000 rec/sec	Too much contention

Why does too much parallelism hurt? Lock contention on the target database. Each parallel writer tries to acquire locks, and beyond a certain point, they spend more time waiting than writing.
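
If you record your own measurements in the same form, finding the sweet spot is a one-liner. The sketch below uses the numbers from the table above; the function names are illustrative.

```python
# Measurements copied from the table above: parallelism -> records/sec
measurements = {1: 5_000, 4: 12_000, 8: 17_000, 12: 19_500, 16: 18_000, 24: 15_000}

def sweet_spot(results):
    """Return the parallelism with the highest measured throughput."""
    return max(results, key=results.get)

def speedup_vs_baseline(results, baseline=1):
    """Throughput of each setting relative to the baseline parallelism."""
    return {p: round(rate / results[baseline], 2) for p, rate in results.items()}

print(sweet_spot(measurements))
print(speedup_vs_baseline(measurements))
```

The speedup column makes the diminishing returns visible: going from 8 to 12 writers adds much less than going from 1 to 4, and beyond 12 the ratio drops.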

3. Buffer Size Tuning

Larger buffers = fewer flushes = better throughput (at cost of memory and latency):

'sink.buffer-flush.max-rows' = '50000'
'sink.buffer-flush.interval' = '10 s'

For IOPS-bound pipelines, larger buffers are critical. For CPU-bound pipelines, smaller buffers with higher parallelism may be better.
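
A quick back-of-the-envelope calculation shows why the buffer size matters. This sketch counts only size-triggered flushes for a single writer and ignores the interval trigger; the table size of one million rows is a hypothetical value.

```python
import math

def flush_count(total_rows, max_rows):
    """Number of size-triggered flushes needed to write total_rows."""
    return math.ceil(total_rows / max_rows)

total = 1_000_000  # hypothetical table size
for max_rows in (1_000, 10_000, 50_000):
    print(f"max-rows={max_rows}: {flush_count(total, max_rows)} flushes")
```

Each flush is a round trip and a commit on the target, so going from 1,000 to 50,000 rows per flush reduces that overhead by a factor of 50 for the same data volume.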

4. Source Fetch Size

Reduce round-trips to the source database:

-- For JDBC connector:
'scan.fetch-size' = '20000'

-- For CDC connector:
'debezium.snapshot.fetch.size' = '20000'

Larger fetch sizes reduce network overhead but increase memory usage. Find your balance based on available memory.

5. Checkpointing Mode

For migrations (where exactly-once is less critical):

SET 'execution.checkpointing.mode' = 'AT_LEAST_ONCE';

AT_LEAST_ONCE is faster than EXACTLY_ONCE because it doesn’t require barriers to align data across all parallel paths. Since our sink uses upserts (INSERT ON CONFLICT), duplicate processing is idempotent anyway.

6. Checkpoint Interval

Longer intervals = less overhead, but longer recovery time on failure:

SET 'execution.checkpointing.interval' = '60 s';

For our production migrations, 45-60 seconds was optimal. Shorter intervals caused excessive state backend I/O (another IOPS consideration).

Performance Reference

Setting	Baseline	Optimized	Impact
rewriteBatchedInserts	false	true	5-10x
sink.parallelism	1	12	2-4x
buffer-flush.max-rows	1000	50000	1.5-2x
fetch-size	1000	20000	1.3-1.5x
checkpoint.mode	EXACTLY_ONCE	AT_LEAST_ONCE	1.2-1.3x
Combined Throughput	300 rec/sec	19,500 rec/sec	65x
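
As a sanity check, the individual impact ranges can be multiplied and compared with the observed gain. This is a naive sketch that assumes the optimizations are independent and simply multiply, which is a simplification; real gains interact with each other.

```python
import math

# Impact ranges copied from the table above: (low, high) multipliers
impacts = {
    "rewriteBatchedInserts": (5.0, 10.0),
    "sink.parallelism": (2.0, 4.0),
    "buffer-flush.max-rows": (1.5, 2.0),
    "fetch-size": (1.3, 1.5),
    "checkpoint.mode": (1.2, 1.3),
}

low = math.prod(lo for lo, _ in impacts.values())
high = math.prod(hi for _, hi in impacts.values())
observed = 19_500 / 300  # optimized vs. baseline throughput

print(f"naive product range: {low:.1f}x to {high:.1f}x, observed: {observed:.0f}x")
```

The observed 65x falls inside the range spanned by the individual factors, so the combined figure is consistent with the per-setting impacts.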

Real-World Tuning Process

Here’s how I approach tuning a new migration:

Start with defaults: Run the pipeline and observe behavior in Flink UI
Identify the bottleneck: Is it CPU, IOPS, or network?
Apply the biggest lever first: Usually rewriteBatchedInserts=true for PostgreSQL
Increase parallelism gradually: Watch for the point where throughput stops improving
Tune batch sizes: Larger for IOPS-bound, smaller for CPU-bound
Monitor the target database: Watch for lock contention, checkpoint lag, WAL accumulation
Document your findings: Each environment is different; what works for one may not work for another

Incremental Snapshot for Large Databases

For databases larger than ~100GB, incremental snapshot mode is essential. Instead of reading entire tables at once (which can cause locks and memory issues), incremental snapshot divides tables into chunks.

What is Incremental Snapshot?

┌─────────────────────────────────────────────────────────────────┐
│                   Incremental Snapshot                          │
│                                                                 │
│  Table (1M rows, chunked by ID):                                │
│                                                                 │
│  ┌───────┬───────┬───────┬───────┬───────┐                      │
│  │ Chunk │ Chunk │ Chunk │ Chunk │ Chunk │                      │
│  │  1    │  2    │  3    │  4    │  5    │  ...                 │
│  │ 1-200K│200K-  │400K-  │600K-  │800K-  │                      │
│  │       │ 400K  │ 600K  │ 800K  │ 1M    │                      │
│  └───┬───┴───┬───┴───┬───┴───┬───┴───┬───┘                      │
│      │       │       │       │                                  │
│      ▼       ▼       ▼       ▼                                  │
│   Process in parallel, no table locks                           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
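
The chunking idea in the diagram can be sketched as follows. This assumes a numeric, evenly distributed key and fixed-size ranges. The real Flink CDC splitter is more involved (for example with unevenly distributed keys), and the table and column names are placeholders.

```python
def chunk_ranges(min_key, max_key, chunk_size):
    """Split the inclusive key range [min_key, max_key] into consecutive chunks."""
    ranges = []
    start = min_key
    while start <= max_key:
        end = min(start + chunk_size - 1, max_key)
        ranges.append((start, end))
        start = end + 1
    return ranges

chunks = chunk_ranges(1, 1_000_000, 200_000)
for low, high in chunks:
    # Each chunk can be read independently with a bounded range predicate
    print(f"SELECT * FROM LARGE_TABLE WHERE ID BETWEEN {low} AND {high}")
```

Because every chunk is a bounded range scan, chunks can be distributed across parallel readers and a failed chunk can be retried without restarting the whole table.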

Oracle Requirements

Incremental snapshot with CDC requires additional Oracle configuration:

Archive Log Mode: Must be enabled

-- Check current mode
SELECT LOG_MODE FROM V$DATABASE;

-- Enable (requires DB restart)
SHUTDOWN IMMEDIATE;
STARTUP MOUNT;
ALTER DATABASE ARCHIVELOG;
ALTER DATABASE OPEN;

Supplemental Logging:

ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS;

LogMiner Privileges for the CDC user:

GRANT SELECT ON V_$DATABASE TO cdc_user;
GRANT SELECT ON V_$LOG TO cdc_user;
GRANT SELECT ON V_$LOGFILE TO cdc_user;
GRANT SELECT ON V_$ARCHIVED_LOG TO cdc_user;
GRANT EXECUTE ON DBMS_LOGMNR TO cdc_user;
GRANT EXECUTE ON DBMS_LOGMNR_D TO cdc_user;
GRANT SELECT ON V_$LOGMNR_LOGS TO cdc_user;
GRANT SELECT ON V_$LOGMNR_CONTENTS TO cdc_user;
GRANT FLASHBACK ANY TABLE TO cdc_user;

Pipeline Configuration

CREATE TABLE src_large_table (...) WITH (
    'connector' = 'oracle-cdc',
    'url' = 'jdbc:oracle:thin:@//oracle:1521/XEPDB1',
    'database-name' = 'XEPDB1',
    'schema-name' = 'DEMO',
    'table-name' = 'LARGE_TABLE',

    -- Enable incremental snapshot
    'scan.incremental.snapshot.enabled' = 'true',
    'scan.incremental.snapshot.chunk.size' = '100000',
    'scan.incremental.snapshot.chunk.key-column' = 'ID',

    -- Debezium settings
    'debezium.snapshot.fetch.size' = '20000'
);

When to Use Incremental Snapshot

Database Size	Recommendation
< 10 GB	Standard snapshot (JDBC)
10-100 GB	Either approach works
> 100 GB	Incremental snapshot required
Active production DB	Incremental snapshot recommended

Production Implementation Advice

Before taking this approach to production, there are several considerations to keep in mind. First, this lab setup runs Flink in standalone mode which is fine for testing but lacks persistence—if you restart the Flink processes, your pipelines are lost. For production, you’ll want to deploy on Kubernetes using the official Flink Kubernetes Operator, which provides proper state management, automatic recovery, and horizontal scaling. Second, pay close attention to version compatibility because not all latest versions of Flink, CDC connectors, and JDBC drivers work together—I learned this the hard way, so check the compatibility matrix before building your stack and stick with LTS versions like Flink 1.20 when possible. Third, externalize your checkpoints to durable storage like S3, MinIO, or HDFS rather than local filesystem, as this enables true fault tolerance and job recovery across restarts. Fourth, implement proper monitoring by connecting Flink’s metrics to Prometheus and Grafana, setting up alerts for checkpoint failures, backpressure, and throughput drops—the Web UI is great for debugging but not for 24/7 operations. Fifth, secure your connections by using SSL/TLS for database connections, storing credentials in a secrets manager rather than plain text in SQL files, and implementing network segmentation between Flink and your databases. Finally, if your organization allows it, seriously consider managed services like AWS Managed Flink, Confluent Cloud, or Azure Stream Analytics, which eliminate most of the operational burden of running Flink clusters yourself. The official documentation provides comprehensive guidance for production deployments: Apache Flink CDC Introduction.

As an example, in a migration project for an 800 GB Oracle database with around 1,500 tables and 4.8 billion rows, the VM hosting the Flink services had 16 cores and 48 GB of RAM. The initial incremental snapshot took 3.5 days at a throughput of 18,000 records/sec and 15k IOPS. Several pieces of automation had to be built, for example to generate the pipelines for all tables and to move sequentially from the initial load to the streaming phase while keeping the CPU cores busy.

What We’ve Learned

Through this guide, we’ve explored database migration with Flink CDC and learned several important lessons. On the technical side, start simple with snapshot mode first and add complexity like incremental or streaming CDC only when needed—don’t overengineer for a one-time migration. Understanding your bottleneck is critical because the tuning strategy differs completely depending on whether your pipeline is CPU-bound, IOPS-bound, or network-bound. The rewriteBatchedInserts=true parameter is magic for PostgreSQL, giving us a 5-10x improvement with a single setting, and parallelism has a sweet spot where more isn’t always better—we found 12 workers optimal before lock contention started hurting performance. Checkpointing is a trade-off between throughput and recovery time, with 45-60 seconds being optimal for migrations, and type mapping matters because incorrect Oracle → Flink → PostgreSQL conversions cause silent data corruption. Operationally, monitor everything using the Flink Web UI alongside source and target database metrics, test thoroughly on a test environment first because production surprises are expensive, have a rollback plan by keeping the source database running until cutover is verified, and document your tuning because each environment is different. Strategically, know when NOT to use Flink since simpler tools are better for small databases or same-technology migrations, factor in the operational complexity of maintaining another system, and consider cloud-managed Flink/CDC solutions if your organization allows it.

Conclusion

Flink CDC transforms database migrations from anxious “big bang” events into controlled, observable, and recoverable processes by combining real-time monitoring in the Flink Web UI, fault tolerance through checkpointing, configurable parallelism for performance, and transform capabilities in Flink SQL—making it a powerful tool for cross-technology migrations. We achieved a 65x throughput improvement (300 → 19,500 rec/sec) by understanding our bottlenecks and applying targeted optimizations, with the key insight being to identify whether you’re CPU-bound or IOPS-bound and tune accordingly. As with any tool, use it where it fits: for large, cross-technology migrations with near zero-downtime requirements, Flink CDC is excellent, but for small databases or simple same-technology copies, stick with native tools.

Resources

The post Oracle to PostgreSQL Migration with Flink CDC first appeared on dbi Blog.

PostgreSQL 19: Two nice little improvements: log_autoanalyze_min_duration and search_path in the psql prompt

Daniel Westermann — Wed, 29 Oct 2025 07:10:34 +0000

Two nice little improvements have been committed for PostgreSQL 19. The first one is about logging the duration of automatic analyze while the second one is about displaying the current search_path in psql’s prompt.

Let’s start with the improvement for psql. As you probably know, the default prompt in psql looks like this:

postgres@:/home/postgres/ [pgdev] psql
psql (19devel)
Type "help" for help.

postgres=#

While I do not have an issue with the default prompt, you may want to see more information. An example of what you might do is this:

postgres=# \set PROMPT1 '%M:%> %n@%/%R%#%x '
[local]:5432 postgres@postgres=#

Now you immediately see that this is a connection over a socket on port 5432, and you’re connected as the “postgres” user to the “postgres” database. If you want to make this permanent, add it to your “.psqlrc” file.

The new prompting option which will come with PostgreSQL 19 is “%S”, and this will give you the search_path:

postgres=# show search_path;
   search_path   
-----------------
 "$user", public
(1 row)

postgres=# \set PROMPT1 '%/%R%x%..%S..# '
postgres=.."$user", public..# set search_path = 'xxxxxxx';
SET
postgres=..xxxxxxx..#

Nice. You can find all the other prompting options in the documentation of psql, the commit is here.

The second improvement is about logging the duration of automatic analyze. Before PostgreSQL 19 we only had log_autovacuum_min_duration. This logs all actions of autovacuum if they cross the specified threshold. This of course includes auto analyze as well, but usually it is the vacuum part that takes most of the time. Logging for analyze is now separated, and there is a new parameter called “log_autoanalyze_min_duration”. You can easily test this with the following snippet:

postgres=# create table t ( a int , b text );
CREATE TABLE
postgres=# insert into t select i, i::text from generate_series(1,1000000) i;
INSERT 0 1000000
postgres=# alter system set log_autoanalyze_min_duration = '1ms';
ALTER SYSTEM
postgres=# select pg_reload_conf();
 pg_reload_conf 
----------------
 t
(1 row)

postgres=# insert into t select i, i::text from generate_series(1,1000000) i;
INSERT 0 1000000

Looking at the log file there is now this:

2025-10-29 08:05:52.744 CET - 1 - 3454 -  - @ - 771LOG:  automatic analyze of table "postgres.public.t"
        avg read rate: 0.033 MB/s, avg write rate: 0.033 MB/s
        buffer usage: 10992 hits, 1 reads, 1 dirtied
        WAL usage: 5 records, 1 full page images, 10021 bytes, 0 buffers full
        system usage: CPU: user: 0.10 s, system: 0.01 s, elapsed: 0.23 s

Also nice, the commit is here.

The post PostgreSQL 19: Two nice little improvements: log_autoanalyze_min_duration and search_path in the psql prompt first appeared on dbi Blog.

RAG Series – Agentic RAG

Adrien Obernesser — Sun, 26 Oct 2025 20:29:09 +0000

Introduction

In earlier parts, we moved from Naive RAG (vector search) to Hybrid RAG (dense + sparse) to Adaptive RAG (query classification and dynamic weighting). Each step improved what we retrieve. Agentic RAG goes further: the LLM decides if and when to retrieve at all and can take multiple steps (retrieve → inspect → refine → retrieve) before answering. Retrieval stops being a fixed stage and becomes a tool the model invokes when and how it needs to. This post explains the fundamental principles of agentic RAG from a DBA perspective, a foundation on which you can build your business logic and governance rules.

RAG Type	Decision Logic	Flexibility	Typical Latency	Best For
Naive	None	Fixed	~0.5 s	Simple FAQ
Hybrid	Static weights	Moderate	~0.6 s	Mixed queries
Adaptive	Query classifier	Dynamic	~0.7 s	Varied, predictable query types
Agentic	LLM agent (tool use)	Fully dynamic	~2.0–2.5 s	Complex, exploratory, multi-hop

When One Retrieval Isn’t Enough

Example: “Compare PostgreSQL and MySQL indexing approaches.”

Traditional (single-pass) RAG

Retrieves mixed docs about both systems
LLM synthesizes from noisy context
Often misses nuanced differences or secondary linked subjects.

Agentic RAG

Agent searches “PostgreSQL indexing mechanisms”
Reads snippets; detects a gap for MySQL
Searches “MySQL indexing mechanisms”
Synthesizes a side-by-side comparison from focused contexts

This loop generalizes: the agent decomposes questions, detects missing context, and invokes tools until it has enough evidence to answer confidently.

Generalization is an agent’s ability to take on new tasks, or variations of existing ones, by reusing and combining what it already knows rather than memorizing patterns. This is very useful for handling variations in inputs and lets an agent adapt faster with fewer codified examples, but it also comes with new limitations. There are different ways to generalize, and you need to measure this capability to detect when it fails. So far we have covered monitoring at the ranking level of retrieval. Reliably solving new tasks or handling shift is a separate topic that I will not expand on here; one way to measure it would be to implement failure detection wired to the agent logs.

Architecture Overview

Agentic RAG inserts a decision loop between query and retrieval; the database is explicitly a tool.

About This Implementation

The code examples in this post are based on the complete, production-ready implementation
available in the pgvector_RAG_search_lab repository.

What’s included:

✅ Agentic search engine
✅ Modern OpenAI tools API
✅ Interactive demo and CLI
✅ FastAPI integration
✅ n8n workflow template

You don’t need to build from scratch — the implementation is ready to use. The post explains
the concepts and design decisions behind the working code.

Picking an orchestration style

Approach	Best For	Complexity	Control
LangGraph	Production agent graphs & branches	Medium	Medium
n8n	Low-code demos / single-decision flows	Low	Low
Custom Python (ours)	Full transparency & tight DB integration	Medium	High

We’ll keep the loop in custom Python (testable, observable), while remaining compatible with LangGraph or n8n if you want to wrap it later.

Implementing the Agent Loop (Python)

Below are compact, production-minded snippets using the new client style and gpt-5 / gpt-5-mini. We use gpt-5-mini for the decision step (cheap/fast) and gpt-5 for the final synthesis (quality). You can also run everything on gpt-5-mini if cost/latency is critical.

1) Expose the retrieval tool

# lab/search/agentic_search.py
from typing import List
from dataclasses import dataclass

@dataclass
class SearchResult:
    content: str
    metadata: dict

class VectorSearchService:
    def __init__(self, pg_conn):
        self.pg = pg_conn  # psycopg / asyncpg / SQLAlchemy
    def search(self, query: str, top_k: int = 5) -> List[SearchResult]:
        # Implement hybrid/dense search as in prior posts; this is the dense-only core.
        # SELECT title, text FROM wiki ORDER BY embedding <-> embed($1) LIMIT $2;
        ...

def format_snippets(results: List[SearchResult]) -> str:
    lines = []
    for i, r in enumerate(results, 1):
        title = r.metadata.get("title", "Untitled")
        snippet = (r.content or "").replace("\n", " ")[:220]
        lines.append(f"[{i}] {title}: {snippet}...")
    return "\n".join(lines)

def search_wikipedia(vector_service: VectorSearchService, query: str, top_k: int = 5) -> str:
    results = vector_service.search(query, top_k=top_k)
    return format_snippets(results)

2) Register tool schema for the LLM

search_tool = {
  "type": "function",
  "function": {
    "name": "search_wikipedia",
    "description": "Retrieve relevant Wikipedia snippets from Postgres+pgvector.",
    "parameters": {
      "type": "object",
      "properties": {
        "query": {"type": "string"},
        "top_k": {"type": "integer", "default": 5}
      },
      "required": ["query"]
    }
  }
}

3) Prompts (concise, outcome-oriented)

SYSTEM_PROMPT = """
You are an expert assistant with access to a Wikipedia database via the tool `search_wikipedia`.
Decide if retrieval is needed before answering. If you are unsure, retrieve first.
If context is insufficient after the first retrieval, you may request one additional retrieval.
Base answers strictly on provided snippets; otherwise reply "Unknown".
Conclude with a one-line decision note: `Decision: used search` or `Decision: skipped search`.
"""

4) The agent loop with gpt-5-mini (decision) + gpt-5 (final)

import json, time
from openai import OpenAI
client = OpenAI()

def agentic_answer(pg_conn, user_question: str, max_retries: int = 1):
    vs = VectorSearchService(pg_conn)

    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_question},
    ]

    # Phase 1: Decision / planning (cheap & fast)
    start = time.time()
    decision = client.chat.completions.create(
        model="gpt-5-mini",
        messages=messages,
        tools=[search_tool],
        tool_choice="auto",  # let the model decide
        temperature=0.2,
    )
    msg = decision.choices[0].message
    tool_used = False
    loops = 0

    # Handle tool calls (allow at most 1 retry cycle)
    while msg.tool_calls and loops <= max_retries:
        tool_used = True
        messages.append(msg)  # assistant turn that requested the tool call(s)
        got_results = False

        # Answer every tool call; each result must carry the matching tool_call_id
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments or "{}")
            query = args.get("query") or user_question
            top_k = int(args.get("top_k") or 5)

            snippets = search_wikipedia(vs, query=query, top_k=top_k)
            got_results = got_results or bool(snippets.strip())
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": snippets if snippets.strip() else "NO_RESULTS",
            })

        if not got_results:
            # No results safeguard: stop looping and go straight to the final step
            break

        # Optionally allow one more decision round on mini
        decision = client.chat.completions.create(
            model="gpt-5-mini",
            messages=messages,
            tools=[search_tool],
            tool_choice="auto",
            temperature=0.2,
        )
        msg = decision.choices[0].message
        loops += 1

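    # Phase 2: Final synthesis (high quality)
    final = client.chat.completions.create(
        model="gpt-5",
        messages=messages + ([msg] if not msg.tool_calls else []),
        tools=[search_tool],  # tool_choice is only accepted when tools are supplied
        tool_choice="none",   # force a plain-text answer, no further tool calls
        temperature=0.3,
    )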
    answer = final.choices[0].message.content or "Unknown"
    total_ms = int((time.time() - start) * 1000)

    return {
        "answer": answer,
        "tool_used": tool_used,
        "loops": loops,
        "latency_ms": total_ms,
    }

The two-phase strategy (gpt-5-mini for decisions, gpt-5 for synthesis)
is an optional optimization for high-volume production use. The repository implementation uses a
single configurable model (default: gpt-5-mini), which works well for most use cases. Implement the
two-phase approach only if you need to optimize the cost/quality balance.

Decision logic

LLM Output	Action	Next Step
Direct answer	Return	Done
Tool call	Execute search	Feed snippets back; allow one re-decision
Tool call + no results	Log low confidence	Stop (avoid loops) → “Unknown”
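
The decision table can be written as a small pure function, which makes it easy to unit test independently of the LLM. This is a sketch only; the Action enum and the next_action name are illustrative and are not taken from the repository.

```python
from enum import Enum

class Action(Enum):
    RETURN_ANSWER = "direct answer: return, done"
    FEED_SNIPPETS = "tool call: feed snippets back, allow one re-decision"
    STOP_UNKNOWN = "tool call without results: log low confidence, stop with 'Unknown'"

def next_action(has_tool_call, snippets=""):
    """Map one LLM turn to the next step, following the decision table above."""
    if not has_tool_call:
        return Action.RETURN_ANSWER
    if snippets.strip():
        return Action.FEED_SNIPPETS
    return Action.STOP_UNKNOWN

print(next_action(False))
print(next_action(True, "[1] PostgreSQL: B-tree is the default index type..."))
print(next_action(True, ""))
```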

Guards

Max loop (max_retries) to prevent infinite cycles
Empty-results check
Low temperature for consistent decisions
Optional rate limiting (e.g., sleep/backoff) if your OpenAI or DB tier needs it
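
For the optional rate-limiting guard, a generic exponential backoff wrapper is usually enough. The following is a minimal sketch with hypothetical names and default values. It retries on any exception, whereas a real deployment would restrict this to rate-limit or transient errors.

```python
import time

def with_backoff(fn, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # retries exhausted: surface the last error
            sleep(base_delay * 2 ** attempt)

# Demo with a function that fails twice before succeeding
attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

recorded_delays = []
print(with_backoff(flaky_call, sleep=recorded_delays.append), recorded_delays)
```

The sleep function is injectable so that the wrapper can be tested without actually waiting; in the agent loop, fn would wrap the LLM call or the database search.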

Evaluating Agentic Decisions

Agentic RAG adds a new layer to measure: decision quality. Keep the retrieval metrics (precision@k, nDCG), but add decision metrics and overhead.

1) Decision Accuracy (4 outcomes)

TP (True Positive): Agent retrieved when external info was needed ✓
FP (False Positive): Agent retrieved unnecessarily (latency/cost waste)
FN (False Negative): Agent skipped retrieval but should have (hallucination risk)
TN (True Negative): Agent skipped retrieval appropriately ✓
N : total questions evaluated

Accuracy = (TP + TN) / N

Interpretation:

How often did the agent make the right call about retrieval?
“Right call” means either:
- it retrieved when it should (TP), or
- it skipped when it could safely skip (TN).

Example:

TP = 40
TN = 50
FP = 5
FN = 5
N = 100

Accuracy = (40 + 50) / 100 = 90%

That means: in 90% of cases, the agent made the correct decision about using retrieval.

Note: This doesn’t judge how good the final answer is — that’s a separate metric. This only measures the decision to retrieve or not retrieve.
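
The calculation above can be sketched in a few lines. Precision and recall of the retrieval decision are not defined in this post; they are added here as an assumption because they separate the two error types (unnecessary retrieval vs. missed retrieval).

```python
def decision_metrics(tp, tn, fp, fn):
    """Compute decision accuracy, plus precision/recall of the 'retrieve' decision."""
    n = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / n,
        # Of all retrievals the agent made, how many were actually needed?
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of all cases that needed retrieval, how many did the agent retrieve for?
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

metrics = decision_metrics(tp=40, tn=50, fp=5, fn=5)
print(f"accuracy={metrics['accuracy']:.0%}")  # accuracy=90%
```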

2) Tool Usage Rate

% of queries that triggered retrieval.
Too low → overconfident model; too high → cautious and costly. Track per domain/query type.

3) Latency/Cost Impact

Scenario	LLM Calls	DB Queries	Avg Latency
No retrieval	1	0	~0.5 s
Single retrieval	2	1	~2.1 s
Double retrieval	3	2	~3.8 s

Report both p50/p95 to capture long-tail tool loops.

4) Answer Quality (Agentic vs. Adaptive)

Run the same query set through Adaptive and Agentic:

Compare precision@k/nDCG of retrieved sets
Human-rate final answers for factuality and completeness
Track “Unknown” rate (good: avoids hallucination; too high: under-retrieval)

Tip: log a one-liner in every response:
decision=used_search|skipped_search loops=0|1 latency_ms=...
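
One possible way to emit that one-liner consistently is a small formatting helper. The function name is a hypothetical choice; only the field names come from the tip above.

```python
def decision_log_line(used_search: bool, loops: int, latency_ms: float) -> str:
    """Format the per-response decision summary as a single greppable line."""
    decision = "used_search" if used_search else "skipped_search"
    return f"decision={decision} loops={loops} latency_ms={latency_ms:.0f}"


print(decision_log_line(True, 1, 2100.4))
```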

When Agentic RAG May Not Be Worth It

Prefer Adaptive if:

p95 latency must be <1s (agentic adds ~1-2s when searching)
Requests are predictable and schema-bound
Compliance needs strong deterministic behavior
Token cost is the primary constraint

Choose Agentic when:

Queries are exploratory / multi-hop
Multiple tools or sources are available
Context quality > raw speed
You’re building research assistants, not simple FAQ bots

Repository Integration (pgvector_RAG_search_lab)

Proposed tree

lab/
├── search/
│   ├── ...other search scripts
│   ├── agentic_search.py           # Main agentic engine 
│   └── examples/
│       └── agentic_demo.py         # Interactive demo 
├── core/
│   └── generation.py               # Includes generate_with_tools() for function calling
├── api/
│   └── fastapi_server.py           # REST endpoint: POST /search with method="agentic"
├── workflows/
│   └── agentic_rag_workflow.json   # n8n visual workflow
└── evaluation/
    └── metrics.py                  # nDCG and ranking metrics (not agentic-specific)

FastAPI endpoint (sketch)

# lab/search/api_agentic.py
from fastapi import APIRouter
from .agentic_search import agentic_answer
router = APIRouter()

@router.post("/agent_search")
def agent_search(payload: dict):
    q = payload.get("query", "")
    result = agentic_answer(pg_conn=..., user_question=q)
    return result

n8n (optional)

Manual trigger → HTTP Request /agent_search → Markdown render
Show 🔄 if tool_used: true and add latency badge

LangGraph (optional)

Map our tool into a graph node; Agent node → Tool node → Agent node
Useful if you later add web search / SQL tools in parallel branches

Prompt Tips (small changes, big impact)

Goal	Prompt Additions
Fewer false positives	“Only search when facts are needed; avoid unnecessary retrieval.”
Fewer false negatives	“Never guess facts; reply ‘Unknown’ if not in snippets.”
Lower latency	“Limit to at most one additional retrieval if context is missing.”
Better observability	“End with: `Decision: used search` or `Decision: skipped search`.”

Quickstart

git clone https://github.com/boutaga/pgvector_RAG_search_lab
cd pgvector_RAG_search_lab

# Export your API key
export OPENAI_API_KEY=...

# Interactive demo (recommended first try)
python lab/search/examples/agentic_demo.py

# Command-line single query
python lab/search/agentic_search.py \
  --source wikipedia \
  --query "Compare PostgreSQL and MySQL indexing" \
  --show-decision \
  --show-sources

# Interactive mode
python lab/search/agentic_search.py --source wikipedia --interactive

# Start API server
python lab/api/fastapi_server.py

# Call API endpoint
curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How does PostgreSQL MVCC work?",
    "method": "agentic",
    "source": "wikipedia",
    "top_k": 5,
    "generate_answer": true
  }'

Appendix A — Minimal “all-mini” variant (cheapest path)

# Use gpt-5-mini for both phases
resp = client.chat.completions.create(
    model="gpt-5-mini",
    messages=messages,
    tools=[search_tool],
    tool_choice="auto",
    temperature=0.2,
)
# ... identical loop, then final:
final = client.chat.completions.create(
    model="gpt-5-mini",
    messages=messages + ([msg] if not msg.tool_calls else []),
    tool_choice="none",
    temperature=0.3,
)

Appendix B — Simple metrics logger

# lab/search/metrics.py
from dataclasses import dataclass

@dataclass
class DecisionLog:
    query: str
    tool_used: bool
    loops: int
    latency_ms: int
    label_retrieval_required: bool | None = None  # optional gold label
    hallucinated: bool | None = None             # set via eval

def summarize(logs: list[DecisionLog]):
    n = len(logs)
    if n == 0:
        # Nothing logged yet: avoid dividing by zero below
        return {"tool_usage_rate": None, "p95_latency_ms": None, "decision_accuracy": None}
    usage = sum(1 for x in logs if x.tool_used) / n
    # Nearest-rank p95: value at rank ceil(0.95 * n), computed with integer math
    p95 = sorted(x.latency_ms for x in logs)[-(-95 * n // 100) - 1]
    # If gold labels are present, count the correct decisions (TP and TN)
    labeled = [x for x in logs if x.label_retrieval_required is not None]
    if labeled:
        tp = sum(1 for x in labeled if x.tool_used and x.label_retrieval_required)
        tn = sum(1 for x in labeled if (not x.tool_used) and (not x.label_retrieval_required))
        acc = (tp + tn) / len(labeled)
    else:
        acc = None
    return {"tool_usage_rate": usage, "p95_latency_ms": p95, "decision_accuracy": acc}

This is a simplified example for illustration. The repository currently tracks decision
metadata within the response object (decision, tool_used, search_count, cost). You can
implement this DecisionLog class separately if you need persistent decision analytics.

Conclusion

Agentic RAG makes retrieval intentional. By letting the model decide if and when to search—and by limiting the loop to one safe retry—you gain better answers on complex queries with measured, predictable overhead.

Takeaways

Expect latency to be several seconds when the tool is used; near-naive latency when skipped
Measure decision accuracy and tool usage rate alongside precision/nDCG
Start with one tool (wiki search) and one retry; expand only if metrics justify it
Use smaller models for decisions and bigger ones for synthesis to balance quality/cost

In this post we focused only on retrieval quality: teaching the agent when to call the vector store and when to skip it, and giving it a controlled retrieval loop. That is the foundation. The next step is to extend the loop with governance and compliance checks (who is asking, what data can they see, should this query even be answered), and only then layer domain-specific business logic on top. That’s how an agentic workflow evolves from “smart retrieval” into something you can actually trust in production.

The article RAG Series – Agentic RAG first appeared on dbi Blog.

pgconf.eu 2025 – RECAP

Adrien Obernesser — Sun, 26 Oct 2025 18:30:40 +0000

I was fortunate to be able to attend pgconf.eu 2025.
This year’s event took place in Riga and once again brought together key members of the community, contributors, committers, sponsors, and users from across the world.
I would summarize this year’s event with three main topics: AI/LLMs, PG18, and monitoring.

AI/LLMs

Compared to last year, the format changed a bit for the Tuesday community events day, where different “Summits” were organized for the first time. If you want full details on the event and the schedule, as well as the presentation slides of each talk, you can find them here: Schedule – PostgreSQL Conference Europe 2025
I had the chance to be selected as a speaker for the AI Summit, which was quite interesting for me. In total there were 13 short talks (10 minutes each) on various topics related to PostgreSQL and AI/LLMs. It was dense, with a lot of interesting implementation ideas; you can find the details and slides here: PGConf.EU 2025 PostgreSQL AI Summit – PostgreSQL wiki. AI/LLMs are the hot topic of the moment, and naturally they came up often during this event, both in the talks and in the discussions. You can find the PDF of my presentation here. I explained a business case: a self-service BI agentic RAG that finds the relevant fields for a target KPI and produces data marts as output.
Since the talks were short, there was time for a debate at the end between the audience and the speakers. The discussion, nicely moderated by the organizers, was interesting because it exposed the same strong opinions people have in general about AI/LLMs: a blend of distrust and an incomplete understanding of what they are about or how they could help organizations. This, in itself, shows that the PostgreSQL community has the same difficulty as everyone else in separating technical challenges from organizational and human ones. My view is that the technical challenges are almost irrelevant to most of these arguments; what matters is human relations and understanding the value that a DBA, for example, brings to the organization. To me, installing and configuring PostgreSQL has no benefit in terms of personal growth, so automating it is quite natural, and adding AI/LLMs on top is “nice to have” but not fundamentally different from an Ansible playbook. For a junior DBA, however, this is an additional abstraction that can be dangerous, because it provides tools whose consequences users cannot fully grasp. This shows that the main issue in integrating AI/LLM workflows is more a governance and C-level management issue than a technical one, and it should not become yet another excuse for adding to the technical debt.
Jay Miller from Aiven explained how you can fail at exposing PII from LLMs and MCPs. This is a really relevant topic, given that more and more organizations are facing issues like shadow IT. He was also quite the show host and fun to listen to. I strongly recommend watching the recording once it is released.

PG18

This year’s conference came just after the release of PostgreSQL 18, one of the versions that brought major improvements and is initiating changes for future releases. I was quite enthusiastic to listen to Melanie Plageman explain how she worked on the freezing improvements in this release. I have to say that usually, when I go to an advanced internals talk, I come out more confused than before. But here, Melanie did an amazing job of talking about a technically complex topic without losing the audience.
Gülçin Yıldırım Jelínek, for her part, explained what’s new in PG18 regarding constraints such as NOT ENFORCED and NOT NULL, and how to use them. Raj Verma, the COO of Cybertec, explained during a sponsor talk why compliance matters, how to minimize the risks, and how PostgreSQL helps us be PCI DSS, GDPR, nLPD, or HIPAA compliant.
Another interesting talk I was happy to attend was from Floor Drees and Gabriele Bartolini. They explained how they went about bringing the CloudNativePG project into the CNCF.

Monitoring

This leads me to another important topic. I wasn’t looking for it, but it has become something of a main subject for me over the years, as a DBA interested in performance tuning. Monitoring in PostgreSQL was covered by several talks, such as the one from Luigi Nardi and his idea of a workload fingerprint with the DBtune tool. Additionally, Lukas Fittl presented pg_stat_plans, an extension that aims to track execution plans over time. This is definitely something I am going to try, and I will push for its inclusion in the core extensions, if not in the core code itself.
The reason is obvious to me: PostgreSQL is becoming more and more central to enterprise organizations, and apart from subjects like TDE, monitoring is going to become a key aspect of automation, CloudNativePG, and AI/LLM workflows. Making PostgreSQL easier and better to monitor at its core will provide leverage at all these levels. Cloud companies have already realized that, hence their involvement in similar projects.

In the end, this year was once again an occasion for me to think about many relevant topics and to exchange with PostgreSQL hackers as well as users from around the world. I came back home with my head full of ideas to investigate.

Additionally, after the conference, the videos of each talk will be uploaded to the pgconf Europe YouTube channel: PostgreSQL Europe. You can already check out amazing talks from previous years and from this year’s pgDay Paris.

So once again the PostgreSQL flag was flying high!

The article pgconf.eu 2025 – RECAP first appeared on dbi Blog.

RAG Series – Adaptive RAG, understanding Confidence, Precision & nDCG

Adrien Obernesser — Sun, 12 Oct 2025 12:17:54 +0000

Introduction

So far in this RAG series, we have tried to introduce new concepts of the RAG workflow with each article. This article introduces more key concepts at the heart of retrieval. Adaptive RAG will allow us to talk about measuring the quality of the retrieved data and how we can leverage it to push our optimizations further.
A now famous study from MIT states that 95% of organizations fail to get ROI within 6 months of their “AI projects”. Although we could argue about the relevance of the study and what it actually measured, one of the key elements of a successful implementation is measurement.
An old BI principle is to know your KPI: what it really measures, but also when it fails to measure. For example, if you use the speedometer on your car’s dashboard to measure the speed at which you are going, you’d be right as long as the wheels are touching the ground. With that in mind, let’s see how we can create smart and reliable retrieval.

From Hybrid to Adaptive

Hybrid search significantly improves retrieval quality by combining dense semantic vectors with sparse lexical signals. However, real-world queries vary:

Some are factual, asking for specific names, numbers, or entities.
Others are conceptual, exploring ideas, reasons, or relationships.

A single static weighting between dense and sparse methods cannot perform optimally across all query types.

Adaptive RAG introduces a lightweight classifier that analyzes each query to determine its type and dynamically adjusts the hybrid weights before searching.
For example:

Query Type	Example	Dense Weight	Sparse Weight
Factual	“Who founded PostgreSQL?”	0.3	0.7
Conceptual	“How does PostgreSQL handle concurrency?”	0.7	0.3
Exploratory	“Tell me about Postgres performance tuning”	0.5	0.5

This dynamic weighting ensures that each search leverages the right signals:

Sparse when exact matching matters.
Dense when semantic similarity matters.

Under the hood, our AdaptiveSearchEngine wraps dense and sparse retrieval modules. Before executing, it classifies the query, assigns weights, and fuses the results via a weighted Reciprocal Rank Fusion (RRF), giving us the best of both worlds — adaptivity without complexity.
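
To make the fusion step concrete, here is a minimal sketch of a weighted Reciprocal Rank Fusion. It assumes the common RRF form weight / (k + rank) with k = 60; the lab's AdaptiveSearchEngine may use a different constant or tie-breaking rule, and the function name is illustrative.

```python
def weighted_rrf(dense_ids, sparse_ids, dense_weight, sparse_weight, k=60):
    """Fuse two ranked lists of document ids into one ranking.

    Each list contributes weight / (k + rank) per document, with rank starting at 1.
    k = 60 is the conventional RRF constant (an assumption here).
    """
    scores = {}
    for weight, ranking in ((dense_weight, dense_ids), (sparse_weight, sparse_ids)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first; ties broken by id for deterministic output
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], str(doc_id)))


dense = ["a", "b", "c"]
sparse = ["c", "a", "d"]
print(weighted_rrf(dense, sparse, dense_weight=0.3, sparse_weight=0.7))  # factual weights
print(weighted_rrf(dense, sparse, dense_weight=0.7, sparse_weight=0.3))  # conceptual weights
```

Documents found by both retrievers accumulate two contributions, so agreement between dense and sparse search is rewarded regardless of the weights.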

Confidence-Driven Retrieval

Once we make retrieval adaptive, the next challenge is trust. How confident are we in the results we just returned?

Confidence from Classification

Each query classification includes a confidence score (e.g., 0.92 “factual” vs 0.58 “conceptual”).
When classification confidence is low, Adaptive RAG defaults to a balanced retrieval (dense 0.5, sparse 0.5) — avoiding extreme weighting that might miss relevant content.
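
The weight selection with its low-confidence fallback could look like the sketch below. The weights come from the table above; the min_confidence threshold of 0.6 and the function name are assumptions for illustration, since the article does not state the exact cutoff.

```python
# (dense_weight, sparse_weight) per query type, taken from the table above
WEIGHTS = {
    "factual": (0.3, 0.7),
    "conceptual": (0.7, 0.3),
    "exploratory": (0.5, 0.5),
}
BALANCED = (0.5, 0.5)


def choose_weights(query_type, confidence, min_confidence=0.6):
    """Return (dense_weight, sparse_weight) for a classified query.

    min_confidence is a hypothetical threshold: below it, the classifier's
    label is ignored and balanced weights are used.
    """
    if confidence < min_confidence:
        return BALANCED
    # Unknown labels also fall back to balanced weights
    return WEIGHTS.get(query_type, BALANCED)


print(choose_weights("factual", 0.92))
print(choose_weights("conceptual", 0.58))
```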

Confidence from Retrieval

We also compute confidence based on retrieval statistics:

The similarity gap between the first and second ranked results (large gap = high confidence).
Average similarity score of the top-k results.
Ratio of sparse vs dense agreement (when both find the same document, confidence increases).

These metrics are aggregated into a normalized confidence score between 0 and 1:

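def compute_confidence(top_scores, overlap_ratio):
    # Mean of the top 3 similarity scores (always divided by 3), capped at 1.0
    sim_conf = min(1.0, sum(top_scores[:3]) / 3)
    # Dense/sparse agreement, rescaled from [0, 1] to [0.3, 1.0]
    overlap_conf = 0.3 + 0.7 * overlap_ratio
    # Final score: mean of both signals, rounded to two decimals
    return round((sim_conf + overlap_conf) / 2, 2)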

If confidence < 0.5, the system triggers a fallback strategy:

Expands top_k results (e.g., from 10 → 30).
Broadens search to both dense and sparse equally.
Logs the event for later evaluation.
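
The three fallback steps can be sketched as a function that adjusts the search parameters before a rerun. The 0.5 threshold and the 10 → 30 expansion follow the text; the function name, the returned dictionary shape, and the logger name are illustrative assumptions.

```python
import logging

logger = logging.getLogger("adaptive_rag")  # hypothetical logger name


def apply_fallback(confidence, top_k, dense_weight, sparse_weight,
                   threshold=0.5, expand_factor=3):
    """Return the search parameters to use, widened when confidence is low."""
    if confidence >= threshold:
        return {"top_k": top_k, "dense_weight": dense_weight,
                "sparse_weight": sparse_weight, "fallback": False}
    # Log the event so it can be reviewed during later evaluation
    logger.warning("low confidence %.2f: top_k %d -> %d, balanced weights",
                   confidence, top_k, top_k * expand_factor)
    # Expand top_k and search dense and sparse equally
    return {"top_k": top_k * expand_factor, "dense_weight": 0.5,
            "sparse_weight": 0.5, "fallback": True}


print(apply_fallback(0.42, top_k=10, dense_weight=0.3, sparse_weight=0.7))
```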

The retrieval API now returns a structured response:

{
  "query": "When was PostgreSQL 1.0 released?",
  "query_type": "factual",
  "confidence": 0.87,
  "precision@10": 0.8,
  "recall@10": 0.75
}

This allows monitoring not just what was retrieved, but how sure the system is, which enables alerting, adaptive reruns, or downstream LLM prompt adjustments (e.g., “Answer cautiously” when confidence < 0.6).

Evaluating Quality with nDCG

Precision and recall are fundamental metrics for retrieval systems, but they don’t consider the order of results. If a relevant document appears at rank 10 instead of rank 1, the user experience is still poor even if recall is high.

That’s why we now add nDCG@k (normalized Discounted Cumulative Gain) — a ranking-aware measure that rewards systems for ordering relevant results near the top.

The idea:

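DCG@k evaluates gain by position, where rel_i is the relevance grade of the document at rank i:

$$\mathrm{DCG@k} = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$

nDCG@k normalizes this against the DCG of the ideal order (IDCG):

$$\mathrm{nDCG@k} = \frac{\mathrm{DCG@k}}{\mathrm{IDCG@k}}$$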

A perfect ranking yields nDCG = 1.0. Poorly ordered but complete results may still have high recall, but lower nDCG.

In practice, we calculate nDCG@10 for each query and average it over the dataset.
Our evaluation script (lab/04_evaluate/metrics.py) integrates this directly:

from evaluation import ndcg_at_k

score = ndcg_at_k(actual=relevant_docs, predicted=retrieved_docs, k=10)
print(f"nDCG@10: {score:.3f}")

Results on the Wikipedia dataset (25K articles)

Method	Precision@10	Recall@10	nDCG@10
Dense only	0.61	0.54	0.63
Hybrid fixed weights	0.72	0.68	0.75
Adaptive (dynamic)	0.78	0.74	0.82

These results confirm that adaptive weighting not only improves raw accuracy but also produces better-ranked results, giving users relevant documents earlier in the list.

Implementation in our LAB

You can explore the implementation in the GitHub repository:

git clone https://github.com/boutaga/pgvector_RAG_search_lab
cd pgvector_RAG_search_lab

Key components:

lab/04_search/adaptive_search.py — query classification, adaptive weights, confidence scoring.
lab/04_evaluate/metrics.py — precision, recall, and nDCG evaluation.
Streamlit UI (streamlit run streamlit_demo.py) — visualize retrieved chunks, scores, and confidence in real time.

Example usage:

python lab/04_search/adaptive_search.py --query "Who invented SQL?"

Output:

Query type: factual (0.91 confidence)
Dense weight: 0.3 | Sparse weight: 0.7
Precision@10: 0.82 | Recall@10: 0.77 | nDCG@10: 0.84

This feedback loop closes the gap between research and production — making RAG not only smarter but measurable.

What is “Relevance”?

When we talk about precision, recall, or nDCG, all three depend on one hidden thing:

a ground truth of which documents are relevant for each query.

There are two main ways to establish that ground truth:

Approach	Who decides relevance	Pros	Cons
Human labeling	Experts mark which documents correctly answer each query	Most accurate; useful for benchmarks	Expensive and slow
Automated or LLM-assisted labeling	An LLM (or rules) judges if a retrieved doc contains the correct answer	Scalable and repeatable	Risk of bias / noise

In some business domains you are almost forced to use human labeling, because the business technicalities run so deep that automating it is hard. Labeling can be slow and expensive for a business, but I learned that it is also a way to introduce change management towards AI workflows: it enables key employees of the company to participate and build a solution with their own expertise, without going through the harder project of asking an external organization to build specific business logic into software that was never made to handle it in the first place. As a DBA, I witnessed business logic move away from databases towards ORMs and application code, and this time the business logic is moving towards AI workflows. Starting this human labeling project may be the first step in that direction, and it guarantees solid foundations.
Managers need to keep in mind that AI workflows are not just a technical solution; they are a socio-technical framework that enables organizational growth. You can’t just ship an AI chatbot into an app and expect 10x returns with minimal effort. This simplistic mindset has already cost billions, according to the MIT study.

In a research setup (like your pgvector_RAG_search_lab), you can mix both approaches:

Start with a seed dataset of (query, relevant_doc_ids) pairs (e.g. small set labeled manually).
Use the LLM to extend or validate relevance judgments automatically.

For example:

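from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
prompt = f"""
Query: {query}
Document: {doc_text[:2000]}
Is this document relevant to answering the query? (yes/no)
"""
llm_response = client.chat.completions.create(
    model="gpt-5-mini",  # assumption: any chat model can act as the judge
    messages=[{"role": "user", "content": prompt}],
)
label = llm_response.choices[0].message.content.strip().lower().startswith("yes")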

Then you store that in a simple table or CSV:

query_id	doc_id	relevant
1	101	true
1	102	false
2	104	true

Precision & Recall in Practice

Once you have that table of true relevances, you can compute:

Precision@k → “Of the top k documents I retrieved, how many were actually relevant?”
Recall@k → “Of all truly relevant documents, how many did I retrieve in my top k?”

They’re correlated but not the same:

High precision → few false positives.
High recall → few false negatives.

For example:

Query	Retrieved docs (top 5)	True relevant	Precision@5	Recall@5
“Who founded PostgreSQL?”	[d3, d7, d9, d1, d4]	[d1, d4]	0.4	1.0

You got both relevant docs (good recall = 1.0), but only 2 of the 5 retrieved were correct (precision = 0.4).
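
The two definitions translate directly into a few lines of Python. This is a minimal sketch with illustrative function names, reusing the example row above.

```python
def precision_at_k(retrieved, relevant, k):
    """Share of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k):
    """Share of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0  # no ground truth for this query
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


retrieved = ["d3", "d7", "d9", "d1", "d4"]
relevant = {"d1", "d4"}
print(precision_at_k(retrieved, relevant, 5), recall_at_k(retrieved, relevant, 5))
```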

Why nDCG is Needed

Precision and recall only measure which docs were retrieved, not where they appeared in the ranking.

nDCG@k adds ranking quality:

Each relevant document gets a relevance grade (commonly 0, 1, 2 — irrelevant, relevant, highly relevant).
The higher it appears in the ranked list, the higher the gain.

So if a highly relevant doc is ranked 1st, you get more credit than if it’s ranked 10th.

In your database, you can store relevance grades in a table like:

query_id	doc_id	rel_grade
1	101	2
1	102	1
1	103	0

Then your evaluator computes:

import math

def dcg_at_k(relevances, k):
    return sum((2**rel - 1) / math.log2(i+2) for i, rel in enumerate(relevances[:k]))

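def ndcg_at_k(actual_relevances, k):
    # actual_relevances: relevance grades in the order the documents were returned
    ideal = sorted(actual_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    # No relevant document at all: define nDCG as 0.0 instead of dividing by zero
    return dcg_at_k(actual_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0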

You do need to keep track of rank (the order in which docs were returned).
In PostgreSQL, you could log that like:

query_id	doc_id	rank	score	rel_grade
1	101	1	0.92	2
1	102	2	0.87	1
1	103	3	0.54	0

Then it’s easy to run SQL to evaluate:

SELECT query_id,
       SUM((POWER(2, rel_grade) - 1) / LOG(2, rank + 1)) AS dcg
FROM eval_results
WHERE rank <= 10
GROUP BY query_id;
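
The SQL above returns the raw DCG per query. As a complementary sketch, the same logged rows can be turned into nDCG in Python once they are fetched. The row shape (query_id, rank, rel_grade) mirrors the table above; the function name is illustrative, and the ideal ordering is built only from the logged documents, as in the evaluator shown earlier, so relevant documents that were never retrieved are not penalized here.

```python
import math
from collections import defaultdict


def ndcg_per_query(rows, k=10):
    """rows: iterable of (query_id, rank, rel_grade) with 1-based, contiguous ranks."""
    by_query = defaultdict(list)
    for query_id, rank, rel_grade in rows:
        if rank <= k:
            by_query[query_id].append((rank, rel_grade))

    def dcg(grades):
        # Position 1 is discounted by log2(2) = 1, position 2 by log2(3), and so on
        return sum((2 ** g - 1) / math.log2(pos + 1) for pos, g in enumerate(grades, start=1))

    result = {}
    for query_id, items in by_query.items():
        grades = [grade for _, grade in sorted(items)]  # order by rank
        ideal = dcg(sorted(grades, reverse=True))
        result[query_id] = dcg(grades) / ideal if ideal > 0 else 0.0
    return result


rows = [(1, 1, 2), (1, 2, 1), (1, 3, 0),   # already in ideal order
        (2, 1, 0), (2, 2, 2)]              # best document ranked second
print(ndcg_per_query(rows))
```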

In a real system (like your Streamlit or API demo), you can:

Log each retrieval attempt (query, timestamp, ranking list, scores, confidence).
Periodically recompute metrics (precision, recall, nDCG) using a fixed ground-truth set.

This lets you track if tuning (e.g., changing dense/sparse weights) is improving performance.

Your evaluation log table could be structured as follows:

run_id	query_id	method	rank	doc_id	score	confidence	rel_grade
2025-10-12_01	1	adaptive_rrf	1	101	0.92	0.87	2
2025-10-12_01	1	adaptive_rrf	2	102	0.85	0.87	1

From there, you can generate:

nDCG@10 trend over runs (e.g., in Prometheus or Streamlit chart)
Precision vs Confidence correlation
Recall improvements per query type

⚠️ Note: While nDCG is a strong metric for ranking quality, it’s not free from bias. Because it normalizes per query, easier questions (with few relevant documents) can inflate the average score. In our lab, we mitigate this by logging both raw DCG and nDCG, and by comparing results across query categories (factual vs conceptual vs exploratory). This helps ensure improvements reflect true retrieval quality rather than statistical artifacts.
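
One way to apply that mitigation is to average raw DCG and nDCG per query category instead of over the whole dataset. The sketch below assumes per-query results are available as (category, dcg, ndcg) tuples; that shape, the function name, and the sample values are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_by_category(per_query):
    """per_query: iterable of (category, dcg, ndcg) tuples, one per evaluated query."""
    buckets = defaultdict(list)
    for category, dcg, ndcg in per_query:
        buckets[category].append((dcg, ndcg))
    return {
        category: {
            "queries": len(values),
            "mean_dcg": mean(d for d, _ in values),
            "mean_ndcg": mean(n for _, n in values),
        }
        for category, values in buckets.items()
    }


# Illustrative values only, not lab results
sample = [("factual", 3.0, 1.0), ("factual", 1.0, 0.5), ("conceptual", 4.0, 0.8)]
print(mean_by_category(sample))
```

Reporting the number of queries per category next to the averages makes it easier to spot a score that is driven by a handful of easy questions.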

Human + LLM Hybrid Evaluation (Practical Middle Ground)

For your PostgreSQL lab setup:

Label a small gold set manually (e.g., 20–50 queries × 3–5 relevant docs each).
For larger coverage, use the LLM as an auto-grader.
You can even use self-consistency: ask the LLM to re-evaluate relevance twice and keep consistent labels only.

This gives you a semi-automated evaluation dataset, good enough to monitor:

Precision@10
Recall@10
nDCG@10 over time
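
The self-consistency idea mentioned above can be sketched independently of any LLM client: call the judge several times and keep the label only when all runs agree. The judge callable is a hypothetical wrapper around whatever LLM relevance prompt you use.

```python
from typing import Callable, Optional


def consistent_label(judge: Callable[[str, str], bool], query: str,
                     doc_text: str, runs: int = 2) -> Optional[bool]:
    """Return the judge's label if all runs agree, otherwise None (discard the pair)."""
    votes = {bool(judge(query, doc_text)) for _ in range(runs)}
    return votes.pop() if len(votes) == 1 else None


# Stub judge for demonstration; a real one would call the LLM
always_yes = lambda query, doc_text: True
print(consistent_label(always_yes, "Who founded PostgreSQL?", "some document text"))
```

Pairs that come back as None are good candidates for the manually labeled gold set, since they are the ones the auto-grader is unsure about.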

Lessons Learned

Through Adaptive RAG, we’ve transformed retrieval from a static process into a self-aware one.

Precision increased by ~6–7%, especially for conceptual queries.
Recall improved by ~8% for factual questions thanks to better keyword anchoring.
nDCG@10 rose from 0.75 → 0.82, confirming that relevant results are appearing earlier.
Confidence scoring provides operational visibility: we now know when the system is uncertain, enabling safe fallbacks and trust signals.

The combination of adaptive routing, confidence estimation, and nDCG evaluation makes this pipeline suitable for enterprise-grade RAG use cases — where explainability, reliability, and observability are as important as accuracy.

Conclusion and Next Steps

Adaptive RAG is the bridge between smart retrieval and reliable retrieval.
By classifying queries, tuning dense/sparse balance dynamically, and measuring ranking quality with nDCG, we now have a system that understands what kind of question it’s facing and how well it performed in answering it.

This version of the lab introduces the first metrics-driven feedback loop for RAG in PostgreSQL:

Retrieve adaptively,
Measure precisely,
Adjust intelligently.

In the next part, we’ll push even further — introducing Agentic RAG, and how it plans and executes multi-step reasoning chains to improve retrieval and answer quality even more.

Try Adaptive RAG in the pgvector_RAG_search_lab repository, explore your own datasets, and start measuring nDCG@10 to see how adaptive retrieval changes the game.

The article RAG Series – Adaptive RAG, understanding Confidence, Precision & nDCG first appeared on dbi Blog.
