<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Archives des PostgreSQL - dbi Blog</title>
	<atom:link href="https://www.dbi-services.com/blog/category/postgresql/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.dbi-services.com/blog/category/postgresql/</link>
	<description></description>
	<lastBuildDate>Thu, 09 Apr 2026 07:37:13 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	

<image>
	<url>https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/05/cropped-favicon_512x512px-min-32x32.png</url>
	<title>Archives des PostgreSQL - dbi Blog</title>
	<link>https://www.dbi-services.com/blog/category/postgresql/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>pgvector, a guide for DBA &#8211; Part 2: Indexes (update march 2026)</title>
		<link>https://www.dbi-services.com/blog/pgvector-a-guide-for-dba-part-2-indexes-update-march-2026/</link>
					<comments>https://www.dbi-services.com/blog/pgvector-a-guide-for-dba-part-2-indexes-update-march-2026/#respond</comments>
		
		<dc:creator><![CDATA[Adrien Obernesser]]></dc:creator>
		<pubDate>Sun, 01 Mar 2026 19:09:15 +0000</pubDate>
				<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[AI/LLM]]></category>
		<category><![CDATA[pgvector]]></category>
		<category><![CDATA[postgresql]]></category>
		<category><![CDATA[RAG]]></category>
		<guid isPermaLink="false">https://www.dbi-services.com/blog/?p=43256</guid>

					<description><![CDATA[<p>Introduction In Part 1 of this series, we covered what pgvector is, how embeddings work, and how to store them in PostgreSQL. We ended with a working similarity search &#8212; but on a sequential scan. That works fine for a demo table with 1,000 rows. It does not work for production. This post is about [&#8230;]</p>
<p>The post <a href="https://www.dbi-services.com/blog/pgvector-a-guide-for-dba-part-2-indexes-update-march-2026/">pgvector, a guide for DBA &#8211; Part 2: Indexes (update march 2026)</a> first appeared on <a href="https://www.dbi-services.com/blog">dbi Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading" id="h-introduction">Introduction</h2>



<p>In <a href="https://www.dbi-services.com/blog/pgvector-a-guide-for-dba-part1-lab-demo/">Part 1</a> of this series, we covered what pgvector is, how embeddings work, and how to store them in PostgreSQL. We ended with a working similarity search &#8212; but on a sequential scan. That works fine for a demo table with 1,000 rows. It does not work for production.</p>



<p>This post is about what comes next: <strong>indexes</strong>. Specifically, the <strong>three index families</strong> in the pgvector ecosystem as of February 2026 (HNSW, IVFFlat, and DiskANN), including <strong>two DiskANN implementations targeting different deployment models</strong>: what they&#8217;re good at, where they break, and the patterns you need, whether you&#8217;re the DBA tuning them or the developer looking to understand the strengths of PostgreSQL as a vector store.</p>



<p>Everything in this post was tested on a public dataset: <strong>25,000 Wikipedia articles</strong> embedded with OpenAI&#8217;s <code>text-embedding-3-large</code> at <strong>3,072 dimensions</strong>, the maximum the model supports. The high dimension count is a deliberate choice: it exposes some limitations that are worth seeing for teaching purposes. Lower dimensions or other embedding models work just as well for running and testing. You might also want to look into the RAG series, and I will probably write a post on how to test embedding models against your own data sets.<br>The environment is PostgreSQL 18 with pgvector 0.8.1 and pgvectorscale 0.9.0.</p>



<p>All the SQL scripts, Python code, and Docker configuration are in the companion lab: <a href="https://github.com/boutaga/pgvector_RAG_search_lab/tree/main/lab/06_pgvector_indexes"><code>lab/06_pgvector_indexes</code></a>.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-the-index-types">The Index Types</h2>



<p>Before we dive in, here&#8217;s the landscape. pgvector ships with two built-in index types (HNSW and IVFFlat), and two DiskANN implementations are available from different vendors:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th></th><th>HNSW</th><th>IVFFlat</th><th>DiskANN (pgvectorscale)</th><th>DiskANN (pg_diskann)</th></tr></thead><tbody><tr><td><strong>Provider</strong></td><td>pgvector</td><td>pgvector</td><td>Timescale</td><td>Microsoft</td></tr><tr><td><strong>Availability</strong></td><td>Built-in</td><td>Built-in</td><td>Open source, self-hosted</td><td>Azure DB for PostgreSQL</td></tr><tr><td><strong>Algorithm</strong></td><td>Multi-layer graph</td><td>Voronoi cell partitioning</td><td>Vamana graph + SBQ</td><td>Vamana graph + PQ</td></tr><tr><td><strong>Best for</strong></td><td>General purpose</td><td>Fast build</td><td>Storage-constrained</td><td>Azure + high recall</td></tr><tr><td><strong>Build time (25K, 3072d)</strong></td><td>29s</td><td>5s</td><td>49s</td><td>N/A (Azure)</td></tr><tr><td><strong>Index size</strong></td><td>193 MB</td><td>193 MB</td><td><strong>21 MB</strong></td><td>Similar</td></tr><tr><td><strong>Query time</strong></td><td>2-6 ms</td><td>2-10 ms</td><td>3 ms</td><td>~3 ms</td></tr></tbody></table></figure>



<p>That pgvectorscale number is not a typo: 21 MB vs 193 MB for the same data.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>Note:</strong> This post uses pgvectorscale for all DiskANN benchmarks since it&#8217;s the open-source, self-hosted option. We&#8217;ll compare both DiskANN implementations in detail in Section 3. pg_diskann is available only on Azure Database for PostgreSQL Flexible Server, Microsoft&#8217;s managed PostgreSQL service.</p>
</blockquote>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-hnsw-the-default-choice">HNSW: The Default Choice</h2>



<p>HNSW (Hierarchical Navigable Small World) is the most commonly recommended index type for vector search. It builds a multi-layered graph where each node connects to its nearest neighbors, and search navigates this graph from top to bottom.</p>



<h3 class="wp-block-heading" id="h-the-2-000-dimension-wall">The 2,000-Dimension Wall</h3>



<p>Here&#8217;s the first thing you&#8217;ll hit with modern embedding models:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
CREATE INDEX idx_content_hnsw
ON articles USING hnsw (content_vector vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

ERROR:  column cannot have more than 2000 dimensions for hnsw index
</pre></div>


<p>The <code>vector</code> type in pgvector has a <strong>2,000-dimension limit</strong> for HNSW indexes. If you&#8217;re using <code>text-embedding-3-large</code> (3,072 dimensions), or <code>voyage-3-large</code> at its 2,048-dimension setting, this is a blocker.</p>



<p>The workaround: <strong><code>halfvec</code></strong>.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Step 1: Store a half-precision copy
ALTER TABLE articles ADD COLUMN content_halfvec halfvec(3072);
UPDATE articles SET content_halfvec = content_vector::halfvec;

-- Step 2: Index the halfvec column (limit: 4,000 dimensions)
CREATE INDEX idx_content_hnsw_halfvec
ON articles USING hnsw (content_halfvec halfvec_cosine_ops)
WITH (m = 16, ef_construction = 64);

Time: 28974.392 ms (00:28.974)

SELECT pg_size_pretty(pg_relation_size(&#039;idx_content_hnsw_halfvec&#039;)) AS hnsw_size;

 hnsw_size
-----------
 193 MB
</pre></div>


<p>29 seconds to build, 193 MB for 25,000 articles at 3,072 dimensions. That&#8217;s roughly <strong>8 KB per row</strong> in the index alone.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>Important:</strong> <code>halfvec</code> stores each dimension in 2 bytes instead of 4. You lose some floating-point precision, but in practice the recall difference is negligible for similarity search. The storage savings are real: your halfvec column is 6 KB per row vs 12 KB for the full vector.</p>



<p><strong>Alternative:</strong> Instead of a separate column, you can create an expression index that casts on the fly: <code>CREATE INDEX ... ON articles USING hnsw ((content_vector::halfvec(3072)) halfvec_cosine_ops);</code> The trade-off is that your queries must use the matching expression (<code>content_vector::halfvec(3072) &lt;=&gt; ...</code>) for the planner to pick it up, which is harder to read in application code. The separate column approach gives cleaner queries.</p>
</blockquote>
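


<p>The per-row figures above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the helper name <code>payload_bytes</code> is made up for illustration, and it ignores PostgreSQL&#8217;s per-value and per-tuple headers.</p>

```python
# Raw storage per embedding, ignoring PostgreSQL's per-value and
# per-tuple headers (a simplifying assumption for this sketch).
BYTES_PER_DIM = {"vector": 4, "halfvec": 2}  # float32 vs float16


def payload_bytes(dims: int, type_name: str) -> int:
    """Bytes needed for the vector components only (hypothetical helper)."""
    return dims * BYTES_PER_DIM[type_name]


dims = 3072
vector_bytes = payload_bytes(dims, "vector")    # 12288 bytes, about 12 KB
halfvec_bytes = payload_bytes(dims, "halfvec")  # 6144 bytes, about 6 KB

# HNSW index size reported above (193 MB), spread over the 25,000 articles.
index_bytes = 193 * 1024 * 1024
index_bytes_per_row = index_bytes / 25_000

print(vector_bytes, halfvec_bytes, round(index_bytes_per_row))
```

<p>The index cost per row comes out a little under 8 KB, which is the 6 KB halfvec payload plus graph links and page overhead.</p>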



<h3 class="wp-block-heading" id="h-tuning-ef-search-the-recall-vs-speed-dial">Tuning ef_search: The Recall vs Speed Dial</h3>



<p><code>ef_search</code> controls how many candidates the HNSW algorithm considers during search. Higher values mean more candidates examined, better recall, but more work. The default is 40.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- ef_search = 40 (default)
SET hnsw.ef_search = 40;
EXPLAIN ANALYZE
SELECT id, title, content_halfvec &lt;=&gt; (
    SELECT content_halfvec FROM articles WHERE id = 1
) AS distance
FROM articles
ORDER BY content_halfvec &lt;=&gt; (
    SELECT content_halfvec FROM articles WHERE id = 1
)
LIMIT 10;

Index Scan using idx_content_hnsw_halfvec on articles
  Order By: (content_halfvec &lt;=&gt; (InitPlan 1).col1)
  Index Searches: 1
  Buffers: shared hit=551
Execution Time: 6.004 ms

-- ef_search = 100
SET hnsw.ef_search = 100;

Index Scan using idx_content_hnsw_halfvec on articles
  Buffers: shared hit=716
Execution Time: 2.365 ms

-- ef_search = 200
SET hnsw.ef_search = 200;

Index Scan using idx_content_hnsw_halfvec on articles
  Buffers: shared hit=883
Execution Time: 2.542 ms
</pre></div>


<p>Wait &#8212; ef_search=100 was <em>faster</em> than ef_search=40? Not really. Those numbers came from a warm cache (<code>shared hit=551</code>, zero disk reads). The apparent speedup is a <strong>cache warming effect</strong>, not a property of the algorithm. To prove this, I restarted PostgreSQL and ran the full sweep from cold:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>ef_search</th><th>Execution Time</th><th>Buffers</th><th>read (disk)</th></tr></thead><tbody><tr><td>40 (cold)</td><td>91 ms</td><td>hit=189, <strong>read=362</strong></td><td>362 pages from disk</td></tr><tr><td>100</td><td>33 ms</td><td>hit=616, <strong>read=132</strong></td><td>fewer cold pages</td></tr><tr><td>200</td><td>22 ms</td><td>hit=850, <strong>read=65</strong></td><td>even fewer</td></tr><tr><td>40 (warm)</td><td><strong>0.8 ms</strong></td><td><strong>hit=551, read=0</strong></td><td>all cached</td></tr></tbody></table></figure>



<p>The second run at ef_search=40 clocked 0.8 ms &#8212; faster than any ef_search=100 or 200 run. On a warm cache, all three (40/100/200) land in the 0.8-5 ms range. The variance is cache state, not algorithmic shortcuts. <strong>The real cost cliff is at ef_search=400</strong> where the optimizer switches plans entirely:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- ef_search = 400
SET hnsw.ef_search = 400;

Sort  (actual time=364.693..364.695 rows=10.00 loops=1)
  Sort Key: ((articles.content_halfvec &lt;=&gt; (InitPlan 1).col1))
  Sort Method: top-N heapsort  Memory: 25kB
  -&gt;  Seq Scan on articles  (actual time=0.143..356.482 rows=24700.00 loops=1)
        Filter: (content_halfvec IS NOT NULL)
        Rows Removed by Filter: 300
Execution Time: 364.724 ms
</pre></div>


<p><strong>The planner chose a Seq Scan + Sort path.</strong> At ef_search=400, PostgreSQL&#8217;s cost model estimated the index scan would be more expensive than reading every row sequentially, and the query went from 2.5 ms to 365 ms.</p>



<p>This is your <strong>optimizer flip-flop</strong>. It&#8217;s not a bug &#8212; it&#8217;s the planner doing its job. But it means you need to be aware of the threshold.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>ef_search</th><th>Execution Time</th><th>Plan</th><th>Buffers</th></tr></thead><tbody><tr><td>40</td><td>6.0 ms</td><td>Index Scan</td><td>551</td></tr><tr><td>100</td><td>2.4 ms</td><td>Index Scan</td><td>716</td></tr><tr><td>200</td><td>2.5 ms</td><td>Index Scan</td><td>883</td></tr><tr><td>400</td><td><strong>365 ms</strong></td><td><strong>Seq Scan</strong></td><td>209,192</td></tr></tbody></table></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>DBA Takeaway:</strong> For HNSW on halfvec(3072), stay in the ef_search 40-200 range. Past that, you&#8217;re fighting the optimizer. If you need ef_search &gt; 200 for recall, you probably need a bigger <code>m</code> parameter at build time.</p>
</blockquote>



<h3 class="wp-block-heading" id="h-build-parameters-m-and-ef-construction">Build Parameters: m and ef_construction</h3>



<p><code>m</code> is the number of connections per node in the graph. <code>ef_construction</code> is the candidate list size during build. Higher values = better graph quality but slower builds and (potentially) larger indexes.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- m=16, ef_construction=64 (default-ish)
Time: 28,974 ms    Size: 193 MB

-- m=32, ef_construction=128
Time: 54,077 ms    Size: 193 MB
</pre></div>


<p>At 25K rows, doubling <code>m</code> doubled the build time but didn&#8217;t change the index size. The size effect becomes more visible at larger scales. In general: <strong>start with m=16, ef_construction=64 and only increase if recall is insufficient after tuning ef_search</strong>.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-ivfflat-fast-build">IVFFlat: Fast Build</h2>



<p>IVFFlat (Inverted File with Flat storage, meaning the vectors inside each list are stored as-is, without quantization) partitions the vector space into Voronoi cells using k-means clustering, then searches only the cells closest to the query vector.</p>



<h3 class="wp-block-heading" id="h-same-dimension-limit">Same Dimension Limit</h3>



<p>IVFFlat has the <strong>same 2,000-dimension limit</strong> as HNSW for the <code>vector</code> type. Same workaround:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
CREATE INDEX idx_content_ivfflat
ON articles USING ivfflat (content_halfvec halfvec_cosine_ops)
WITH (lists = 25);

Time: 5008.765 ms (00:05.009)

SELECT pg_size_pretty(pg_relation_size(&#039;idx_content_ivfflat&#039;));
-- 193 MB
</pre></div>


<p>5 seconds to build vs 29 for HNSW. The index size is nearly identical (193 MB vs 193 MB), but <strong>IVFFlat builds 5.8x faster</strong>.</p>



<p>The <code>lists</code> parameter controls the number of Voronoi cells. The pgvector documentation recommends: <strong><code>rows / 1000</code> for tables up to 1M rows</strong>, <code>sqrt(rows)</code> for larger tables. For 25,000 articles: 25000 / 1000 = 25 lists, giving roughly 1,000 rows per cell. A common mistake is applying <code>sqrt(rows)</code> to small tables. That gives 158 lists here, creating cells with only ~158 rows each, which fragments the index and causes the optimizer to flip to sequential scan at surprisingly low probe counts.</p>
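


<p>As a rough illustration, the sizing rule can be written as a small Python helper. The function name <code>recommended_lists</code> is hypothetical and the 1M-row cutoff is simply the rule of thumb quoted above. Treat it as a starting point, not as something pgvector computes for you.</p>

```python
import math


def recommended_lists(rows: int) -> int:
    """Rule-of-thumb value for the IVFFlat `lists` parameter.

    rows / 1000 up to 1M rows, sqrt(rows) beyond that
    (hypothetical helper, not part of pgvector).
    """
    if rows <= 1_000_000:
        return max(1, rows // 1000)
    return max(1, round(math.sqrt(rows)))


rows = 25_000
lists = recommended_lists(rows)
rows_per_list = rows // lists

# What the "common mistake" would produce on the same table.
sqrt_lists = round(math.sqrt(rows))

print(lists, rows_per_list, sqrt_lists)
```

<p>For the 25,000-article table this yields 25 lists of about 1,000 rows each, while the square-root rule would have asked for 158 lists.</p>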



<h3 class="wp-block-heading" id="h-tuning-probes-how-many-cells-to-search">Tuning Probes: How Many Cells to Search</h3>



<p><code>probes</code> controls how many Voronoi cells are searched at query time. Default is 1 &#8212; fast but low recall. A good starting point is <code>sqrt(lists)</code>. Here&#8217;s the sweep with lists=25:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
SET ivfflat.probes = 1;
-- Index Scan, Execution Time: 1.0 ms, Buffers: 571

SET ivfflat.probes = 2;
-- Index Scan, Execution Time: 3.7 ms, Buffers: 1,944

SET ivfflat.probes = 3;
-- Index Scan, Execution Time: 4.4 ms, Buffers: 2,793

SET ivfflat.probes = 4;
-- Index Scan, Execution Time: 5.9 ms, Buffers: 3,937
</pre></div>


<p>And then:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
SET ivfflat.probes = 5;

Sort  (actual time=152.548..152.549 rows=10.00 loops=1)
  Sort Key: ((articles.content_halfvec &lt;=&gt; (InitPlan 1).col1))
  -&gt;  Seq Scan on articles  (actual time=0.144..148.844 rows=24700.00 loops=1)
        Filter: (content_halfvec IS NOT NULL)
Execution Time: 152.584 ms
</pre></div>


<p>Same story as HNSW. At probes=5 (searching 5 of 25 cells = 20%), the optimizer decided a sequential scan was cheaper. The query went from 6 ms to 153 ms.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>probes</th><th>Execution Time</th><th>Plan</th></tr></thead><tbody><tr><td>1</td><td>1.0 ms</td><td>Index Scan</td></tr><tr><td>2</td><td>3.7 ms</td><td>Index Scan</td></tr><tr><td>3</td><td>4.4 ms</td><td>Index Scan</td></tr><tr><td>4</td><td>5.9 ms</td><td>Index Scan</td></tr><tr><td>5</td><td><strong>153 ms</strong></td><td><strong>Seq Scan</strong></td></tr></tbody></table></figure>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>Takeaway:</strong> For IVFFlat with 25 lists, the optimizer flips between probes=4 and probes=5. With <code>sqrt(25) = 5</code>, you&#8217;re right at the tipping point. Use <code>SET LOCAL ivfflat.probes = 3</code> or <code>4</code> for a good recall/speed balance. On larger tables (100K+ rows), the flip happens at much higher probe counts because sequential scans become proportionally more expensive.</p>
</blockquote>



<h3 class="wp-block-heading" id="h-set-local-the-production-pattern">SET LOCAL: The Production Pattern</h3>



<p>Never set <code>ivfflat.probes</code> at the session level in production. Use <code>SET LOCAL</code> inside a transaction:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
BEGIN;
    SET LOCAL ivfflat.probes = 3;
    SELECT id, title, content_halfvec &lt;=&gt; (
        SELECT content_halfvec FROM articles WHERE id = 1
    ) AS distance
    FROM articles
    ORDER BY content_halfvec &lt;=&gt; (
        SELECT content_halfvec FROM articles WHERE id = 1
    )
    LIMIT 10;
COMMIT;

-- Verify: probes reverted to default
SHOW ivfflat.probes;  -- 1
</pre></div>


<p>The setting reverts automatically after COMMIT/ROLLBACK. No global state leakage. <strong>Do the same for <code>hnsw.ef_search</code>.</strong></p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>DO NOT PLAY AROUND WITH SESSION PARAMETERS ON PRODUCTION.</strong> Use <code>SET LOCAL</code> inside a transaction, or set them at the function/procedure level. Session-level changes persist until disconnect and can affect every query on that connection, including connection pooler reuse. Hinting is not a strategy. </p>
</blockquote>
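


<p>From application code, the same transaction-scoped behaviour is available through PostgreSQL&#8217;s <code>set_config(name, value, is_local)</code> function, which acts like <code>SET LOCAL</code> when the third argument is true and, unlike <code>SET</code>, accepts bind parameters. The sketch below assumes a DB-API style cursor (psycopg, for example) that is already inside a transaction. The function name <code>knn_with_local_probes</code> and the <code>RecordingCursor</code> stand-in are made up for illustration, so the example runs without a database.</p>

```python
def knn_with_local_probes(cur, query_id: int, probes: int, k: int = 10):
    """Run a KNN query with ivfflat.probes scoped to the current transaction.

    `cur` is assumed to be a DB-API cursor on a connection that is inside
    a transaction; the caller commits or rolls back.
    """
    # set_config(..., true) behaves like SET LOCAL: the value is dropped
    # at COMMIT or ROLLBACK.
    cur.execute("SELECT set_config('ivfflat.probes', %s, true)", (str(probes),))
    cur.execute(
        """
        SELECT id, title
        FROM articles
        ORDER BY content_halfvec <=> (
            SELECT content_halfvec FROM articles WHERE id = %s
        )
        LIMIT %s
        """,
        (query_id, k),
    )
    return cur.fetchall()


class RecordingCursor:
    """Stand-in cursor (hypothetical) that records statements instead of running them."""

    def __init__(self):
        self.statements = []

    def execute(self, sql, params=None):
        # Collapse whitespace so the recorded SQL is easy to read.
        self.statements.append((" ".join(sql.split()), params))

    def fetchall(self):
        return []


cur = RecordingCursor()
rows = knn_with_local_probes(cur, query_id=1, probes=3)
for sql, params in cur.statements:
    print(sql, params)
```

<p>With a real driver, the call would sit inside whatever transaction block the driver provides. The point of the design is that the setting cannot outlive the transaction, even when a pooler hands the connection to another client afterwards.</p>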



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-diskann-leveraging-b-tree-index-principle">DiskANN: Leveraging the B-tree Index Principle</h2>



<p>DiskANN is provided by the <a href="https://github.com/timescale/pgvectorscale">pgvectorscale</a> project from Timescale (SQL extension name: <code>vectorscale</code>). It implements the DiskANN algorithm with <strong>Statistical Binary Quantization (SBQ)</strong> compression built in. </p>



<h3 class="wp-block-heading" id="h-no-2-000-dimension-wall">No 2,000-Dimension Wall</h3>



<p>Unlike HNSW and IVFFlat, DiskANN supports the <code>vector</code> type natively up to <strong>16,000 dimensions</strong> &#8212; no halfvec workaround needed:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
CREATE INDEX idx_content_diskann ON articles USING diskann (content_vector)
WITH (storage_layout = memory_optimized);

NOTICE:  Starting index build with num_neighbors=50, search_list_size=100,
         max_alpha=1.2, storage_layout=SbqCompression.  -- memory_optimized maps to SBQ
NOTICE:  Indexed 24700 tuples
Time: 49140.736 ms (00:49.141)

SELECT pg_size_pretty(pg_relation_size(&#039;idx_content_diskann&#039;));
-- 21 MB
</pre></div>


<p>49 seconds to build, but <strong>21 MB</strong>. Compare:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
        indexname         |  size  | size_bytes
--------------------------+--------+------------
 idx_content_diskann      | 21 MB  |   22,511,616
 idx_content_hnsw_halfvec | 193 MB |  202,350,592
 idx_content_ivfflat      | 193 MB |  202,522,624
</pre></div>


<p>DiskANN is <strong>9x smaller</strong> than HNSW and IVFFlat on the same data. SBQ compression is the default &#8212; <code>storage_layout = memory_optimized</code> is what you get if you don&#8217;t specify a layout. Specifying it explicitly (as in the <code>CREATE INDEX</code> above) is good practice for readability. The alternative <code>plain</code> layout stores full vectors in the index and does not compress.</p>



<p>Query performance is competitive:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
Index Scan using idx_content_diskann on articles
  Order By: (content_vector &lt;=&gt; (InitPlan 1).col1)
  Buffers: shared hit=1437 read=132
Execution Time: 2.915 ms
</pre></div>


<p>3 ms for a 3,072-dimension nearest neighbor search on 25K articles. On the same data, HNSW does it in 2-6 ms and IVFFlat in 2-10 ms. All three are in the same ballpark for query speed.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>Takeaway:</strong> DiskANN is the right choice when your HNSW index outgrows <code>shared_buffers</code>. At 193 MB for 25K rows, HNSW on halfvec(3072) would reach ~77 GB at 10 million rows, well beyond what most buffer pools can keep hot. DiskANN&#8217;s 9x compression keeps the navigational structure cached while full vectors stay in the heap, fetched only for the final rescore of a few dozen candidates. It is the same access pattern as a B-tree: compact index in memory, selective heap lookups on demand. The trade-off is longer build times and fewer operator class options (vector type only, no halfvec/bit/sparsevec).</p>
</blockquote>
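


<p>The 10-million-row projection is a straight linear extrapolation from the measured sizes. A small sketch makes the assumption explicit: it supposes index size grows proportionally with row count, which is only an approximation, and the helper name <code>projected_index_gb</code> is invented for this example.</p>

```python
def projected_index_gb(measured_mb: float, measured_rows: int, target_rows: int) -> float:
    """Scale a measured index size linearly to a larger table.

    Assumes size grows proportionally with row count (an approximation),
    and uses 1 GB = 1000 MB for a round figure.
    """
    return measured_mb * (target_rows / measured_rows) / 1000


# Sizes measured in this post on the 25,000-article table.
hnsw_gb = projected_index_gb(193, 25_000, 10_000_000)
diskann_gb = projected_index_gb(21, 25_000, 10_000_000)

print(f"HNSW halfvec(3072): ~{hnsw_gb:.1f} GB")
print(f"DiskANN (SBQ):      ~{diskann_gb:.1f} GB")
```

<p>Under that assumption, the compressed DiskANN index stays in single-digit gigabytes at 10 million rows, a size that can plausibly remain cached.</p>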



<h3 class="wp-block-heading" id="h-create-index-concurrently">CREATE INDEX CONCURRENTLY</h3>



<p>As of pgvectorscale 0.9.0, DiskANN supports <code>CONCURRENTLY</code>:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
CREATE INDEX CONCURRENTLY idx_content_diskann
ON articles USING diskann (content_vector)
WITH (storage_layout = memory_optimized);
</pre></div>


<p>This is critical for production &#8212; you can build the index without locking the table. HNSW and IVFFlat also support <code>CREATE INDEX CONCURRENTLY</code>.</p>



<h3 class="wp-block-heading" id="h-why-these-dimension-limits-exist-buffer-pages-and-bits">Why These Dimension Limits Exist: Buffer Pages and Bits</h3>



<p>If the 2,000 / 4,000 / 16,000 dimension limits seem arbitrary, they&#8217;re not. The <strong>intuition</strong> comes from how PostgreSQL stores data: in <strong>8 KB buffer pages</strong> (8,192 bytes). Every index tuple &#8212; including the vector representation in a vector index &#8212; has to fit within a page. The fewer bytes per dimension, the more dimensions you can pack.</p>



<p>Here&#8217;s the back-of-the-envelope arithmetic:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Encoding</th><th>Bytes per dimension</th><th>Theoretical max in 8 KB</th></tr></thead><tbody><tr><td><code>vector</code> (float32)</td><td>4 bytes</td><td>8,192 / 4 = <strong>2,048</strong></td></tr><tr><td><code>halfvec</code> (float16)</td><td>2 bytes</td><td>8,192 / 2 = <strong>4,096</strong></td></tr><tr><td>4-bit quantized (PQ)</td><td>0.5 bytes</td><td>8,192 * 2 = <strong>16,384</strong></td></tr><tr><td>1-bit binary (SBQ)</td><td>0.125 bytes</td><td>8,192 * 8 = <strong>65,536</strong></td></tr></tbody></table></figure>



<p>The theoretical numbers explain the <strong>intuition</strong> &#8212; why halfvec doubles the limit and quantized encodings push it further. The <strong>actual limits</strong> are slightly lower because of page headers, tuple overhead, and the index metadata stored alongside each vector. For HNSW, each page also stores <strong>neighbor connection lists</strong> (up to <code>m * 2</code> neighbor IDs per node); for IVFFlat, each page carries <strong>centroid references and list pointers</strong>. These eat into the available space, which is why pgvector&#8217;s HNSW sets 2,000 (not 2,048) and pgvectorscale&#8217;s DiskANN sets 16,000 (not 16,384). But the pattern is unmistakable: <strong>the fewer bits per dimension, the more dimensions you can fit in a page</strong>.</p>



<p>This is why DiskANN can handle 16,000 dimensions where HNSW on <code>vector</code> tops out at 2,000: DiskANN stores a compressed representation in the index page, not the full vector. It is also why halfvec doubled the HNSW/IVFFlat limit from 2,000 to 4,000: half the bytes per dimension, twice the capacity.<br>This is more than enough for the vast majority of use cases. Most embedding models in production today default to 768–1536 dimensions, well within the 2,000-dimension limit. It also shows how future-proof the current vector store implementation in PostgreSQL is.</p>
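


<p>The page arithmetic from the table can be reproduced in a few lines. This sketch only computes the theoretical ceiling for an 8 KB page. It deliberately ignores page headers, tuple overhead and neighbor lists, which is why the real limits are lower. The names used are illustrative.</p>

```python
PAGE_BYTES = 8192  # default PostgreSQL block size

# Bits used per dimension by each encoding discussed above.
BITS_PER_DIM = {
    "vector (float32)": 32,
    "halfvec (float16)": 16,
    "4-bit quantized (PQ)": 4,
    "1-bit binary (SBQ)": 1,
}


def theoretical_max_dims(bits_per_dim: int) -> int:
    """Dimensions that fit in one page if the page held nothing else."""
    return PAGE_BYTES * 8 // bits_per_dim


ceilings = {name: theoretical_max_dims(bits) for name, bits in BITS_PER_DIM.items()}
for name, dims in ceilings.items():
    print(f"{name:22s} {dims:>6,d}")
```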



<h3 class="wp-block-heading" id="h-how-pgvectorscale-compresses-statistical-binary-quantization-sbq">How pgvectorscale Compresses: Statistical Binary Quantization (SBQ)</h3>



<p>pgvectorscale&#8217;s DiskANN, from Timescale (now Tiger Data), uses a method called <strong>Statistical Binary Quantization</strong> (SBQ). The idea is deceptively simple: for each dimension, replace the float value with a 1-bit or 2-bit code.</p>



<p><strong>1-bit mode</strong> (default for dimensions &gt;= 900): Each dimension is compressed to a single bit. But not naively &#8212; standard binary quantization uses 0.0 as the threshold (positive = 1, negative = 0), which works poorly because real embedding distributions are rarely centered on zero. SBQ instead computes the <strong>per-dimension mean</strong> across all vectors during index build and uses that as the threshold:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
if value &gt; mean_of_this_dimension → 1
else → 0
</pre></div>


<p>A 3,072-dimension float32 vector (12,288 bytes) becomes a 3,072-bit string (384 bytes). That&#8217;s <strong>32x compression</strong>. During search, the query vector is also SBQ-encoded, and distances are computed using <strong>XOR + popcount</strong> on the bit strings, two operations that modern CPUs each execute as a single instruction per machine word.</p>



<p><strong>2-bit mode</strong> (default for dimensions &lt; 900): Each dimension gets two bits, encoding four &#8220;zones&#8221; based on the z-score (how many standard deviations from the mean). This gives finer granularity at 16x compression instead of 32x.</p>



<p>The accuracy loss is real but small. On common benchmarks, SBQ achieves 96-99% recall compared to exact search. The rescore step (controlled by <code>diskann.query_rescore</code>, default 50) compensates: after the graph traversal finds the top-50 candidates using quantized distances, pgvectorscale fetches the <strong>full-precision vectors from the heap</strong> and re-computes exact distances to produce the final top-10.</p>
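


<p>To make the 1-bit mode and the rescore step concrete, here is a minimal pure-Python sketch. It is not pgvectorscale&#8217;s implementation. It scans every code instead of walking a Vamana graph, it uses squared Euclidean distance for the exact rescore rather than cosine, and it runs on random 64-dimension data. All function names are illustrative.</p>

```python
import random


def train_means(vectors):
    """Per-dimension mean across all vectors (the SBQ threshold)."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def sbq_encode(vec, means):
    """1-bit SBQ: bit d is set when vec[d] is above the mean of dimension d."""
    bits = 0
    for d, (value, mean) in enumerate(zip(vec, means)):
        if value > mean:
            bits |= 1 << d
    return bits


def hamming(a, b):
    """XOR + popcount on the packed bit strings."""
    return bin(a ^ b).count("1")


def l2_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(query, vectors, codes, means, rescore=50, k=10):
    q_code = sbq_encode(query, means)
    # Phase 1: cheap ranking on compressed codes (exhaustive here;
    # a real index would traverse the graph instead).
    candidates = sorted(range(len(codes)), key=lambda i: hamming(q_code, codes[i]))[:rescore]
    # Phase 2: exact distances on the full-precision vectors.
    return sorted(candidates, key=lambda i: l2_squared(query, vectors[i]))[:k]


random.seed(0)
dims = 64
vectors = [[random.gauss(0.3, 1.0) for _ in range(dims)] for _ in range(500)]
means = train_means(vectors)
codes = [sbq_encode(v, means) for v in vectors]

top = search(vectors[0], vectors, codes, means)
print(top)
```

<p>Searching for a vector that is already in the set returns that vector first, since its Hamming distance and its exact distance to itself are both zero. The other nine results depend on how well the 1-bit codes preserve neighborhood structure, which is exactly what the rescore step is there to compensate for.</p>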



<p><strong>What&#8217;s stored where:</strong></p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Location</th><th>What&#8217;s stored</th><th>Accessed when</th></tr></thead><tbody><tr><td><strong>Index pages</strong></td><td>SBQ-compressed vectors + graph edges</td><td>Every query (graph traversal)</td></tr><tr><td><strong>Heap (table)</strong></td><td>Full float32 vectors</td><td>Only during rescore (top-N candidates)</td></tr></tbody></table></figure>



<p>This two-tier architecture is why DiskANN achieves 21 MB index size: the index only stores 384-byte compressed vectors, not 12 KB originals. The full vectors stay in the table, fetched only for the final re-ranking of a few dozen candidates.</p>



<h3 class="wp-block-heading" id="h-microsoft-s-pg-diskann-a-different-approach">Microsoft&#8217;s pg_diskann: A Different Approach</h3>



<p>There&#8217;s a second DiskANN implementation for PostgreSQL: Microsoft&#8217;s <strong>pg_diskann</strong>, currently documented and distributed for <a href="https://learn.microsoft.com/en-us/azure/postgresql/extensions/how-to-use-pgdiskann">Azure Database for PostgreSQL Flexible Server</a>. It uses the same Vamana graph algorithm but a fundamentally different compression strategy: <strong>Product Quantization (PQ)</strong>.</p>



<p>Where SBQ asks &#8220;is this dimension above or below the mean?&#8221;, Product Quantization asks &#8220;which of 16 codewords best represents this group of dimensions?&#8221;</p>



<p>Here&#8217;s how PQ works, step by step:</p>



<ol class="wp-block-list">
<li><strong>Divide the vector into chunks.</strong> A 3,072-dimension vector is split into, say, 1,024 chunks of 3 dimensions each.</li>



<li><strong>Train a codebook per chunk.</strong> For each chunk, k-means clustering finds 16 representative codewords (centroids). Why 16? Because 16 values fit in <strong>4 bits</strong> &#8212; a single hex digit.</li>



<li><strong>Encode each chunk as a 4-bit symbol.</strong> During index build, each chunk of 3 dimensions is replaced by the index (0-15) of its closest codeword. The entire 3,072-dimension vector becomes 1,024 symbols of 4 bits each = <strong>512 bytes</strong>.</li>



<li><strong>Decode via lookup table at query time.</strong> To compute the distance to a query vector, you pre-compute the distance from the query to all 16 codewords for each chunk, creating a <strong>16-row x 1,024-column lookup table</strong>. Then for each stored vector, you sum up the table entries corresponding to its symbols. No floating-point multiplication needed &#8212; just table lookups and additions.</li>
</ol>



<p>The compression is dramatic: 3,072 dimensions * 4 bytes = 12,288 bytes → 512 bytes with PQ. That&#8217;s <strong>24x compression</strong>, in the same ballpark as SBQ&#8217;s 32x.</p>
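


<p>The four steps above can be sketched in pure Python. This is a toy illustration under stated assumptions, not pg_diskann&#8217;s code: the k-means is a bare-bones Lloyd iteration, the distance is squared Euclidean (which adds up cleanly across chunks), the data is random 12-dimension vectors, and every function name is made up for the example.</p>

```python
import random

CHUNK = 3        # dimensions per chunk
CODEWORDS = 16   # 16 centroids per chunk, so each symbol fits in 4 bits


def l2sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def chunks(vec):
    """Step 1: split a vector into consecutive chunks of CHUNK dimensions."""
    return [vec[i:i + CHUNK] for i in range(0, len(vec), CHUNK)]


def kmeans(points, k, rng, iters=10):
    """Bare-bones Lloyd iteration; returns k centroids."""
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: l2sq(p, centroids[c]))
            buckets[nearest].append(p)
        for c, bucket in enumerate(buckets):
            if bucket:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids


def train(vectors, rng):
    """Step 2: one codebook of CODEWORDS centroids per chunk position."""
    split = [chunks(v) for v in vectors]
    return [kmeans([s[j] for s in split], CODEWORDS, rng) for j in range(len(split[0]))]


def encode(vec, codebooks):
    """Step 3: replace each chunk by the index (0-15) of its closest codeword."""
    return [min(range(CODEWORDS), key=lambda c: l2sq(part, book[c]))
            for part, book in zip(chunks(vec), codebooks)]


def lookup_table(query, codebooks):
    """Step 4a: distance from each query chunk to all 16 codewords."""
    return [[l2sq(part, word) for word in book]
            for part, book in zip(chunks(query), codebooks)]


def approx_distance(code, table):
    """Step 4b: sum the table entries selected by the stored symbols."""
    return sum(table[j][symbol] for j, symbol in enumerate(code))


def pq_bytes(dims):
    """Compressed size: one 4-bit symbol per chunk."""
    return (dims // CHUNK) * 4 // 8


rng = random.Random(0)
vectors = [[rng.gauss(0.0, 1.0) for _ in range(12)] for _ in range(200)]
codebooks = train(vectors, rng)
code = encode(vectors[0], codebooks)
table = lookup_table(vectors[1], codebooks)
print(code, approx_distance(code, table), pq_bytes(3072))
```

<p>The lookup table is built once per query and reused for every stored vector, which is where PQ gets its speed: comparing against a stored code costs one table read and one addition per chunk.</p>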



<h3 class="wp-block-heading" id="h-comparing-the-two-diskann-implementations">Comparing the Two DiskANN Implementations</h3>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th></th><th>pgvectorscale (Timescale)</th><th>pg_diskann (Microsoft)</th></tr></thead><tbody><tr><td><strong>Compression</strong></td><td>SBQ (1-2 bits/dim)</td><td>Product Quantization (4 bits/chunk)</td></tr><tr><td><strong>Compression ratio</strong></td><td>32x (1-bit) or 16x (2-bit)</td><td>~24x (depends on chunks)</td></tr><tr><td><strong>How it works</strong></td><td>Per-dimension thresholding</td><td>Codebook lookup per chunk</td></tr><tr><td><strong>Distance computation</strong></td><td>XOR + popcount (very fast)</td><td>Table lookup + sum</td></tr><tr><td><strong>Trainable</strong></td><td>Minimal (just means + stddev)</td><td>Heavy (k-means per chunk)</td></tr><tr><td><strong>Max dimensions</strong></td><td>16,000</td><td>16,000</td></tr><tr><td><strong>Availability</strong></td><td>Open source, self-hosted</td><td>Azure Database for PostgreSQL</td></tr><tr><td><strong>License</strong></td><td>PostgreSQL License</td><td>Distributed via Azure</td></tr><tr><td><strong>Iterative scan</strong></td><td>No</td><td>Yes (relaxed/strict, ON by default)</td></tr><tr><td><strong>PG version</strong></td><td>PG 14-18 (self-hosted)</td><td>Azure DB for PostgreSQL</td></tr></tbody></table></figure>



<p>Both implementations share the core DiskANN algorithm (Vamana graph) and the two-phase search pattern (compressed scan + full-precision rescore). The difference is <em>how</em> they compress:</p>



<ul class="wp-block-list">
<li><strong>SBQ</strong> is simpler and faster to build (just compute means). It&#8217;s a blunt instrument &#8212; 1 bit per dimension loses a lot of information, but XOR + popcount is blazingly fast, and the rescore step recovers accuracy.</li>



<li><strong>PQ</strong> is more sophisticated and retains more information per bit (a 4-bit symbol captures relationships between groups of dimensions). It&#8217;s slower to build (k-means training) but can achieve better recall at the same compression ratio, especially for vectors with correlated dimensions.</li>
</ul>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>Takeaway:</strong> If you&#8217;re self-hosting PostgreSQL, pgvectorscale is your DiskANN option &#8212; open source, well-maintained, and the SBQ compression is effective. If you&#8217;re on Azure Database for PostgreSQL, you also have pg_diskann, whose PQ compression may give better recall on very high-dimensional data. The underlying algorithm is the same; the compression strategy is the difference.</p>
</blockquote>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-iterative-scans">Iterative Scans</h2>



<p>This is the most important query-time feature added to pgvector since HNSW support. If you take one thing from this post, let it be this section.</p>



<h3 class="wp-block-heading" id="h-the-problem">The Problem</h3>



<p>Vector search indexes return the K nearest neighbors, then PostgreSQL applies your WHERE clause. If your filter is selective, you get fewer results than requested.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Our dataset has 9 categories with varying selectivity
  category   | cnt   | % of 25K
-------------+-------+---------
 History     | 8719  | 34.9%
 General     | 6221  | 24.9%
 Science     | 2232  |  8.9%
 Mathematics |  116  |  0.5%
</pre></div>


<p>When you search for the 10 nearest &#8220;Science&#8221; articles, the HNSW index returns its ef_search nearest neighbors (say, 40), PostgreSQL filters to keep only Science articles, and you get however many matched. For Science (8.9%), you&#8217;ll usually get your 10 results. For Mathematics (0.5%), you won&#8217;t.</p>



<p>Let&#8217;s prove it. Here&#8217;s a search for Mathematics articles with ef_search=40, forcing the HNSW index path. (We disable sequential scan here to force the index path on our small 25K dataset. On production tables with 100K+ rows, the optimizer would naturally choose the index without this hint.)</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
SET enable_seqscan = off;
SET hnsw.iterative_scan = &#039;off&#039;;
SET hnsw.ef_search = 40;

SELECT id, title, category,
       content_halfvec &lt;=&gt; (SELECT content_halfvec FROM articles WHERE id = 1) AS distance
FROM articles
WHERE category = &#039;Mathematics&#039;
ORDER BY content_halfvec &lt;=&gt; (SELECT content_halfvec FROM articles WHERE id = 1)
LIMIT 10;

Index Scan using idx_content_hnsw_halfvec on articles
  Order By: (content_halfvec &lt;=&gt; (InitPlan 1).col1)
  Filter: (category = &#039;Mathematics&#039;::text)
  Rows Removed by Filter: 40
  Index Searches: 1
Execution Time: 0.766 ms
(0 rows)
</pre></div>


<p><strong>Zero rows returned.</strong> The index fetched 40 candidates, all 40 were filtered out (none were Mathematics), and the query silently returned an empty result set. This is the core problem: your application asked for 10 results and got 0.</p>



<p>Before iterative scans, you had two bad options:</p>



<ol class="wp-block-list">
<li><strong>Over-fetch</strong> (LIMIT 1000) and hope enough rows match &#8212; wasteful and unreliable</li>



<li><strong>Sequential scan</strong> &#8212; correct but slow on large tables</li>
</ol>
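
<p>For reference, the over-fetch workaround usually looks like the sketch below: pull a larger, unfiltered candidate set from the index in an inner query, then filter in an outer query. It reuses the <code>articles</code> table from this post; the inner <code>LIMIT 1000</code> and the <code>nearest</code> alias are illustrative choices, not values from the lab.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
SET hnsw.iterative_scan = &#039;off&#039;;
SET hnsw.ef_search = 1000;  -- upper bound for this setting in pgvector
-- On the small lab table you may also need SET enable_seqscan = off, as above

SELECT id, title, category, distance
FROM (
    -- Inner query: unfiltered nearest neighbors, served by the HNSW index
    SELECT id, title, category,
           content_halfvec &lt;=&gt; (SELECT content_halfvec FROM articles WHERE id = 1) AS distance
    FROM articles
    ORDER BY distance
    LIMIT 1000
) AS nearest
WHERE category = &#039;Mathematics&#039;   -- filter applied after the index scan
ORDER BY distance
LIMIT 10;
</pre></div>


<p>This still returns fewer than 10 rows whenever fewer than 10 of the fetched candidates match the filter, which is why it is unreliable as well as wasteful.</p>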



<h3 class="wp-block-heading" id="h-the-solution-iterative-index-scans">The Solution: Iterative Index Scans</h3>



<p>pgvector 0.8.0 introduced iterative scans. Instead of fetching a fixed batch and filtering, the index keeps fetching more candidates until the filter is satisfied.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
SET enable_seqscan = off;
SET hnsw.iterative_scan = &#039;relaxed_order&#039;;
SET hnsw.ef_search = 40;

SELECT id, title, category,
       content_halfvec &lt;=&gt; (SELECT content_halfvec FROM articles WHERE id = 1) AS distance
FROM articles
WHERE category = &#039;Mathematics&#039;
ORDER BY content_halfvec &lt;=&gt; (SELECT content_halfvec FROM articles WHERE id = 1)
LIMIT 10;

Index Scan using idx_content_hnsw_halfvec on articles
  Order By: (content_halfvec &lt;=&gt; (InitPlan 1).col1)
  Filter: (category = &#039;Mathematics&#039;::text)
  Rows Removed by Filter: 6809
  Index Searches: 171
Execution Time: 4235.805 ms
(10 rows)
</pre></div>


<p><strong>10 rows returned.</strong> The index scanned 171 times, examined 6,819 candidates, filtered out 6,809 non-Mathematics rows, and delivered exactly 10 results. It took 4.2 seconds &#8212; much slower than the 0.8 ms empty result &#8212; but you got a <strong>correct</strong> answer instead of silence.</p>



<p>At 6,819 tuples scanned, this was well within the default <code>max_scan_tuples</code> of 20,000. On a table with millions of rows and the same 0.5% selectivity, the scan might hit the 20,000 limit before finding 10 matches &#8212; you&#8217;d get a partial result set. That&#8217;s the trade-off the safety valve makes: bounded latency vs guaranteed result count.</p>



<p>That 4.2-second cost reflects the extreme selectivity: Mathematics is only 0.5% of the data, so the index had to traverse deep into the graph. For moderate selectivity like Science (8.9%), the overhead is negligible:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
SET hnsw.iterative_scan = &#039;relaxed_order&#039;;
SET hnsw.ef_search = 40;

SELECT ... WHERE category = &#039;Science&#039; ORDER BY ... LIMIT 10;

Index Scan using idx_content_hnsw_halfvec on articles
  Order By: (content_halfvec &lt;=&gt; (InitPlan 1).col1)
  Filter: (category = &#039;Science&#039;::text)
  Rows Removed by Filter: 26
  Index Searches: 1
Execution Time: 1.058 ms
</pre></div>


<p>Same feature, but for Science the index found 10 matching rows in a single search pass &#8212; no extra work needed.</p>



<h3 class="wp-block-heading" id="h-two-modes">Two Modes</h3>



<ul class="wp-block-list">
<li><strong><code>relaxed_order</code></strong>: Results are approximately ordered by distance. Slightly faster. Good enough for most use cases.</li>



<li><strong><code>strict_order</code></strong>: Results are exactly ordered by distance. Slightly slower. Use when ranking precision matters. Enable it with <code>SET hnsw.iterative_scan = &#039;strict_order&#039;;</code> (measured execution time: 0.885 ms).</li>
</ul>
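
<p>If you want the speed of <code>relaxed_order</code> but a final result that is sorted exactly, one option described in the pgvector README is to wrap the relaxed query in a materialized CTE and re-sort the handful of rows it returns. A minimal sketch against the <code>articles</code> table, with the CTE name chosen for illustration:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
SET hnsw.iterative_scan = &#039;relaxed_order&#039;;

WITH relaxed_results AS MATERIALIZED (
    -- Relaxed iterative scan: rows may come back slightly out of order
    SELECT id, title, category,
           content_halfvec &lt;=&gt; (SELECT content_halfvec FROM articles WHERE id = 1) AS distance
    FROM articles
    WHERE category = &#039;Science&#039;
    ORDER BY distance
    LIMIT 10
)
SELECT *
FROM relaxed_results
ORDER BY distance + 0;  -- re-sort the 10 rows; the README notes &quot;+ 0&quot; is needed on PostgreSQL 17+
</pre></div>


<p>The outer sort only touches the rows already selected, so it fixes their order but cannot bring back a closer neighbor that the relaxed scan missed.</p>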



<h3 class="wp-block-heading" id="h-the-safety-valve-max-scan-tuples">The Safety Valve: max_scan_tuples</h3>



<p>To prevent runaway scans on extremely selective filters (imagine filtering for a category that has 1 row in 10 million), there&#8217;s a safety limit:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
SET hnsw.max_scan_tuples = 500;   -- Restrictive: stop after 500 index tuples
SET hnsw.max_scan_tuples = 20000; -- Default: generous enough for most workloads
SET hnsw.max_scan_tuples = 2147483647; -- Effectively unlimited: the integer maximum (use with caution)
</pre></div>


<h3 class="wp-block-heading" id="h-ivfflat-too">IVFFlat Too</h3>



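<p>Same concept, different GUC prefix. IVFFlat supports only <code>relaxed_order</code>, and its safety valve is <code>ivfflat.max_probes</code> rather than <code>max_scan_tuples</code>:</p>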


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
SET ivfflat.iterative_scan = &#039;relaxed_order&#039;;
SET ivfflat.probes = 3;

SELECT ... WHERE category = &#039;Science&#039; ORDER BY ... LIMIT 10;
-- Execution Time: 2.001 ms
</pre></div>


<h3 class="wp-block-heading" id="h-filtered-results-in-action">Filtered Results in Action</h3>



<p>With iterative scan enabled, the query returns the full 10 rows, and every one of them matches the filter:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
SET hnsw.iterative_scan = &#039;relaxed_order&#039;;
SET hnsw.ef_search = 100;

SELECT id, title, category,
       round((content_halfvec &lt;=&gt; (SELECT content_halfvec FROM articles WHERE id = 1))::numeric, 4) AS distance
FROM articles
WHERE category = &#039;Science&#039;
ORDER BY content_halfvec &lt;=&gt; (SELECT content_halfvec FROM articles WHERE id = 1)
LIMIT 10;

  id   |  title   | category | distance
-------+----------+----------+----------
     1 | April    | Science  |   0.0000
  7862 | April 23 | Science  |   0.2953
  9878 | April 25 | Science  |   0.3076
   469 | May      | Science  |   0.3082
  9880 | April 24 | Science  |   0.3451
   402 | July     | Science  |   0.3453
  5156 | April 4  | Science  |   0.3531
  9530 | April 7  | Science  |   0.3588
 34906 | 2013     | Science  |   0.3674
  9149 | April 8  | Science  |   0.3690
</pre></div>


<p>All 10 results are Science. All sorted by cosine distance. No over-fetching, no sequential scan. <em>(The titles are Wikipedia date articles &#8212; &#8220;April&#8221;, &#8220;May&#8221;, etc. &#8212; that happen to be classified under Science in this dataset.)</em></p>



<h3 class="wp-block-heading" id="h-multi-filter-combinations">Multi-Filter Combinations</h3>



<p>Iterative scans work with compound WHERE clauses too, though the planner is free to pick a different path, as it does here:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
SET hnsw.iterative_scan = &#039;relaxed_order&#039;;
SET hnsw.ef_search = 100;

SELECT id, title, category, word_count
FROM articles
WHERE category = &#039;Science&#039; AND word_count &gt; 1000
ORDER BY content_halfvec &lt;=&gt; (SELECT content_halfvec FROM articles WHERE id = 1)
LIMIT 10;

Limit  (actual time=7.098..7.105 rows=10.00 loops=1)
  -&gt;  Sort  (actual time=7.097..7.101 rows=10.00 loops=1)
        Sort Key: ((content_halfvec &lt;=&gt; (InitPlan 1).col1))
        Sort Method: top-N heapsort  Memory: 25kB
        -&gt;  Bitmap Heap Scan on articles  (actual time=0.690..7.027 rows=444.00 loops=1)
              Recheck Cond: ((word_count &gt; 1000) AND (category = &#039;Science&#039;::text))
              -&gt;  BitmapAnd  (actual time=0.641..0.643 rows=0.00 loops=1)
                    -&gt;  Bitmap Index Scan on idx_articles_word_count  (rows=1806.00)
                    -&gt;  Bitmap Index Scan on idx_articles_category    (rows=2232.00)
Execution Time: 7.542 ms
</pre></div>


<p>Here the PostgreSQL optimizer did something clever: it combined two B-tree indexes (BitmapAnd) instead of using the HNSW iterative scan. On a 25K-row table that fits in memory, this is often cheaper. At scale (millions of rows), the iterative scan path wins because the bitmap approach requires reading too many heap pages.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>Takeaway:</strong> Iterative scans are the answer to &#8220;vector search + WHERE doesn&#8217;t work.&#8221; Enable them with <code>SET LOCAL hnsw.iterative_scan = 'relaxed_order'</code> inside transactions. The safety valve <code>max_scan_tuples = 20000</code> is a sensible default.</p>
</blockquote>
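
<p>A minimal sketch of that takeaway, assuming the same <code>articles</code> table and HNSW index used throughout this post. <code>SET LOCAL</code> only lasts until the end of the current transaction, so the settings do not leak into other work on the same connection:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
BEGIN;

-- Both settings apply to this transaction only
SET LOCAL hnsw.iterative_scan = &#039;relaxed_order&#039;;
SET LOCAL hnsw.max_scan_tuples = 20000;

SELECT id, title, category
FROM articles
WHERE category = &#039;Mathematics&#039;
ORDER BY content_halfvec &lt;=&gt; (SELECT content_halfvec FROM articles WHERE id = 1)
LIMIT 10;

COMMIT;  -- the settings revert to their previous values here
</pre></div>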



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-quantization-and-storage">Quantization and Storage</h2>



<p>With 3,072-dimension embeddings, storage is the elephant in the room. Here&#8217;s what each representation costs per row:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
SELECT
  pg_size_pretty(avg(pg_column_size(content_vector)))  AS vector_size,
  pg_size_pretty(avg(pg_column_size(content_halfvec))) AS halfvec_size,
  pg_size_pretty(avg(pg_column_size(content_bq)))      AS binary_size
FROM articles WHERE content_vector IS NOT NULL;

 vector_size | halfvec_size | binary_size
-------------+--------------+-------------
 12 kB       | 6148 bytes   | 392 bytes
</pre></div>


<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Type</th><th>Bytes per dimension</th><th>Per-row (3072d)</th><th>Savings</th></tr></thead><tbody><tr><td><code>vector(3072)</code></td><td>4 bytes (float32)</td><td>12,288 bytes</td><td>baseline</td></tr><tr><td><code>halfvec(3072)</code></td><td>2 bytes (float16)</td><td>6,144 bytes</td><td><strong>50%</strong></td></tr><tr><td><code>bit(3072)</code></td><td>1/8 byte (1 bit)</td><td>384 bytes</td><td><strong>97%</strong></td></tr></tbody></table></figure>



<p>At 1 million rows:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Type</th><th>Column storage</th><th>HNSW index</th></tr></thead><tbody><tr><td><code>vector(3072)</code></td><td>~12 GB</td><td>N/A (2000-dim limit)</td></tr><tr><td><code>halfvec(3072)</code></td><td>~6 GB</td><td>~7.7 GB</td></tr><tr><td><code>bit(3072)</code></td><td>~0.4 GB</td><td>~0.8 GB (bit_hamming_ops)</td></tr></tbody></table></figure>
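
<p>The column-storage figures follow from dimensions × bytes per dimension × rows. Here is a sketch of that arithmetic in plain SQL (no pgvector required). It counts raw payload only and ignores per-value headers, TOAST, and page overhead, so treat the results as lower bounds; the column aliases are illustrative:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
SELECT t.type_name,
       3072 * t.bytes_per_dim                            AS payload_bytes_per_row,
       round(3072 * t.bytes_per_dim * 1000000 / 1e9, 2)  AS approx_gb_at_1m_rows
FROM (VALUES (&#039;vector&#039;,  4.0),     -- float32
             (&#039;halfvec&#039;, 2.0),     -- float16
             (&#039;bit&#039;,     0.125))   -- 1 bit per dimension
     AS t(type_name, bytes_per_dim);
</pre></div>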



<p>The storage math is brutal for high-dimensional embeddings. This is why DiskANN&#8217;s built-in SBQ compression matters: on our 25K-row dataset it gets you to 21 MB where HNSW on halfvec costs 193 MB.</p>



<h3 class="wp-block-heading" id="h-binary-quantize-re-ranking">Binary Quantize + Re-Ranking</h3>



<p>Binary quantization crushes each dimension to a single bit (positive = 1, negative = 0). It&#8217;s lossy, but very fast for coarse filtering. The pattern is a two-phase search:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Phase 1: Hamming distance on binary (fast, coarse) → 100 candidates
-- Phase 2: Cosine distance on full vector (precise) → 10 results

WITH coarse AS (
    SELECT id, title, content_vector
    FROM articles
    WHERE content_bq IS NOT NULL
    ORDER BY content_bq &lt;~&gt; (
        SELECT binary_quantize(content_vector)::bit(3072)
        FROM articles WHERE id = 1
    )
    LIMIT 100
)
SELECT id, title,
       content_vector &lt;=&gt; (SELECT content_vector FROM articles WHERE id = 1) AS distance
FROM coarse
ORDER BY content_vector &lt;=&gt; (SELECT content_vector FROM articles WHERE id = 1)
LIMIT 10;

Limit  (actual time=29.021..29.026 rows=10.00 loops=1)
  -&gt;  Sort on coarse  (actual time=29.020..29.023 rows=10.00 loops=1)
        -&gt;  Subquery Scan  (actual time=27.805..28.999 rows=100.00 loops=1)
              -&gt;  Sort by content_bq &lt;~&gt;  (actual time=27.744..27.754 rows=100.00 loops=1)
                    -&gt;  Seq Scan on articles  (actual time=0.435..24.288 rows=24700.00 loops=1)
Execution Time: 29.127 ms
</pre></div>


<p>That is 29 ms without an index on <code>content_bq</code> (note the Seq Scan feeding Phase 1). With an HNSW index using <code>bit_hamming_ops</code>, Phase 1 would likely be sub-millisecond. The re-ranking in Phase 2 only touches 100 full vectors instead of 25,000.</p>
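
<p>To try the indexed variant of Phase 1, a sketch like the following should work, assuming <code>content_bq</code> is declared as <code>bit(3072)</code> as in the storage table. The index name is made up for illustration:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Hypothetical index name; bit_hamming_ops matches the &lt;~&gt; operator used in Phase 1
CREATE INDEX idx_content_bq_hnsw
ON articles USING hnsw (content_bq bit_hamming_ops);
</pre></div>


<p>Re-run the two-phase query under <code>EXPLAIN (ANALYZE)</code> afterwards to confirm that the inner sort on <code>content_bq</code> has been replaced by an index scan, and compare recall against the sequential version before relying on it.</p>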



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>DBA Takeaway:</strong> Use halfvec as your default for 3072-dimension embeddings. Use binary quantize + re-ranking when you need to search billions of rows and can tolerate a two-phase approach. DiskANN&#8217;s SBQ gives you similar compression automatically.</p>
</blockquote>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-operators-and-sargability">Operators and Sargability</h2>



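<p>pgvector provides four distance operators for <code>vector</code> and <code>halfvec</code> columns (binary vectors use <code>&lt;~&gt;</code> for Hamming distance, as in the re-ranking query above, and <code>&lt;%&gt;</code> for Jaccard distance):</p>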



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Operator</th><th>Distance</th><th>Operator Class</th><th>Use When</th></tr></thead><tbody><tr><td><code>&lt;=&gt;</code></td><td>Cosine</td><td><code>vector_cosine_ops</code></td><td>Normalized embeddings (most common)</td></tr><tr><td><code>&lt;-&gt;</code></td><td>L2 (Euclidean)</td><td><code>vector_l2_ops</code></td><td>Absolute distance matters</td></tr><tr><td><code>&lt;#&gt;</code></td><td>Inner Product (negative)</td><td><code>vector_ip_ops</code></td><td>Pre-normalized, slight speed edge</td></tr><tr><td><code>&lt;+&gt;</code></td><td>L1 (Manhattan)</td><td><code>vector_l1_ops</code></td><td>Sparse-like behavior on dense vectors</td></tr></tbody></table></figure>



<h3 class="wp-block-heading" id="h-wrong-operator-no-index">Wrong Operator = No Index</h3>



<p>This is the single most common mistake. If your index uses <code>halfvec_cosine_ops</code> but your query uses <code>&lt;-&gt;</code> (L2), the index <strong>cannot</strong> be used:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- CORRECT: cosine operator on cosine index → Index Scan
EXPLAIN (COSTS OFF)
SELECT ... ORDER BY content_halfvec &lt;=&gt; (...) LIMIT 10;

 Limit
   -&gt;  Index Scan using idx_content_hnsw_halfvec on articles
         Order By: (content_halfvec &lt;=&gt; (InitPlan 1).col1)

-- WRONG: L2 operator on cosine index → Seq Scan!
EXPLAIN (COSTS OFF)
SELECT ... ORDER BY content_halfvec &lt;-&gt; (...) LIMIT 10;

 Limit
   -&gt;  Sort
         Sort Key: ((articles.content_halfvec &lt;-&gt; (InitPlan 1).col1))
         -&gt;  Seq Scan on articles
</pre></div>


<p>The planner can&#8217;t use a cosine-distance index for an L2-distance query. They&#8217;re different metrics with different orderings. If you see an unexpected Seq Scan on a vector query, <strong>check your operator first</strong>.</p>
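
<p>To check which operator class an existing index was built with, you can read it from the catalog. A sketch for the <code>articles</code> table; it looks only at the first index column, which is enough for the single-column vector indexes in this post:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
SELECT i.indexrelid::regclass AS index_name,
       am.amname              AS access_method,
       opc.opcname            AS operator_class
FROM pg_index i
JOIN pg_class c     ON c.oid = i.indexrelid
JOIN pg_am am       ON am.oid = c.relam
JOIN pg_opclass opc ON opc.oid = i.indclass[0]   -- operator class of the first index column
WHERE i.indrelid = &#039;articles&#039;::regclass
  AND am.amname IN (&#039;hnsw&#039;, &#039;ivfflat&#039;, &#039;diskann&#039;);
</pre></div>


<p>Match the <code>operator_class</code> you see here against the operator table above: a <code>*_cosine_ops</code> index serves <code>&lt;=&gt;</code>, a <code>*_l2_ops</code> index serves <code>&lt;-&gt;</code>, and so on.</p>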



<h3 class="wp-block-heading" id="h-sargable-queries-the-cross-join-trap">Sargable Queries: The Cross-Join Trap</h3>



<p>This is the pattern I see most often in the wild, and it&#8217;s wrong:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- BAD: cross-join prevents index use
SELECT a.id, a.title,
       a.content_halfvec &lt;=&gt; b.content_halfvec AS distance
FROM articles a, articles b
WHERE b.id = 1
ORDER BY a.content_halfvec &lt;=&gt; b.content_halfvec
LIMIT 10;

Limit
  -&gt;  Sort
        Sort Key: ((a.content_halfvec &lt;=&gt; b.content_halfvec))
        -&gt;  Nested Loop
              -&gt;  Index Scan using articles_pkey on articles b
                    Index Cond: (id = 1)
              -&gt;  Seq Scan on articles a     ← NO INDEX!
</pre></div>


<p>The planner sees <code>a.content_halfvec &lt;=&gt; b.content_halfvec</code> as an expression spanning <strong>two table references</strong>, not as a distance to a fixed query vector. It can&#8217;t push the ORDER BY into the vector index because the right-hand side is a column from another table reference rather than a constant.</p>



<p>The fix: use a <strong>scalar subquery</strong>:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- GOOD: scalar subquery → index used
SELECT id, title,
       content_halfvec &lt;=&gt; (
           SELECT content_halfvec FROM articles WHERE id = 1
       ) AS distance
FROM articles
ORDER BY content_halfvec &lt;=&gt; (
    SELECT content_halfvec FROM articles WHERE id = 1
)
LIMIT 10;

Limit
  InitPlan 1
    -&gt;  Index Scan using articles_pkey on articles articles_1
          Index Cond: (id = 1)
  -&gt;  Index Scan using idx_content_hnsw_halfvec on articles
        Order By: (content_halfvec &lt;=&gt; (InitPlan 1).col1)
</pre></div>


<p>The scalar subquery is evaluated once (InitPlan), then the result is treated as a constant. Now the planner can push the ORDER BY into the HNSW index scan. Same results, but with the index instead of a sequential scan.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>Takeaway:</strong> For this nearest-neighbor ORDER BY pattern, always use scalar subqueries for vector distance calculations, never cross-joins. This is the vector-search equivalent of writing sargable WHERE clauses for B-tree indexes.</p>
</blockquote>



<h3 class="wp-block-heading" id="h-partial-indexes">Partial Indexes</h3>



<p>If you frequently filter by a specific category, a partial index is dramatically more efficient:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
CREATE INDEX idx_content_hnsw_science
ON articles USING hnsw (content_halfvec halfvec_cosine_ops)
WITH (m = 16, ef_construction = 64)
WHERE category = &#039;Science&#039;;

Time: 1420.207 ms (00:01.420)

        indexname         |  size
--------------------------+--------
 idx_content_hnsw_halfvec | 193 MB   ← full table (25,000 rows)
 idx_content_hnsw_science |  17 MB   ← Science only (2,232 rows)
</pre></div>


<p><strong>11x smaller, 20x faster to build.</strong> If your application queries consistently filter by a known set of categories, partial indexes are the single biggest optimization available. Build one per high-value category.</p>
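
<p>A partial index is only considered when the query&#8217;s WHERE clause implies the index predicate. A quick way to verify this is to look at the plan for a query that repeats the predicate literally, as in this sketch:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
EXPLAIN (COSTS OFF)
SELECT id, title
FROM articles
WHERE category = &#039;Science&#039;     -- must imply the predicate of idx_content_hnsw_science
ORDER BY content_halfvec &lt;=&gt; (SELECT content_halfvec FROM articles WHERE id = 1)
LIMIT 10;
</pre></div>


<p>A query for another category, or one with no category filter at all, cannot use <code>idx_content_hnsw_science</code>. Be careful with prepared statements that pass the category as a parameter: under a generic plan the planner cannot prove that the parameter matches the predicate, so the partial index may be skipped.</p>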



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-monitoring">Monitoring</h2>



<h3 class="wp-block-heading" id="h-index-inventory">Index Inventory</h3>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
SELECT indexname,
       pg_size_pretty(pg_relation_size(indexname::regclass)) AS size,
       indexdef
FROM pg_indexes
WHERE tablename = &#039;articles&#039;
ORDER BY pg_relation_size(indexname::regclass) DESC;

        indexname        |  size   | indexdef
-------------------------+---------+-------------------------------------------
 idx_content_hnsw_halfvec| 193 MB  | USING hnsw ... WITH (m=&#039;16&#039;, ef_construction=&#039;64&#039;)
 idx_content_ivfflat     | 193 MB  | USING ivfflat ... WITH (lists=&#039;25&#039;)
 idx_content_diskann     | 21 MB   | USING diskann ... WITH (storage_layout=memory_optimized)
 articles_pkey           | 1384 kB | USING btree (id)
 idx_articles_category   | 1192 kB | USING btree (category)
 idx_articles_word_count | 904 kB  | USING btree (word_count)
</pre></div>


<p>Adding up the three vector indexes (193 + 193 + 21 = 407 MB) plus B-tree indexes (3 MB), the total index footprint is over <strong>410 MB</strong> for a <strong>90 MB</strong> table. The indexes are ~4.5x the data. This is typical for high-dimensional vector data &#8212; plan your storage accordingly.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>Note:</strong> In practice you&#8217;d pick one vector index, not all three. With just HNSW + B-tree indexes, the ratio drops to ~2x.</p>
</blockquote>
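
<p>To track that ratio on your own tables, the built-in size functions are enough. A sketch for <code>articles</code>; note that <code>pg_table_size</code> includes TOAST storage, so the table size it reports may differ from a figure measured another way:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
SELECT pg_size_pretty(pg_table_size(&#039;articles&#039;))   AS table_size,        -- heap + TOAST, no indexes
       pg_size_pretty(pg_indexes_size(&#039;articles&#039;)) AS total_index_size,  -- all indexes on the table
       round(pg_indexes_size(&#039;articles&#039;)::numeric
             / nullif(pg_table_size(&#039;articles&#039;), 0), 1) AS index_to_table_ratio;
</pre></div>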



<h3 class="wp-block-heading" id="h-settings">Settings</h3>



<p>Know what settings exist and what they default to:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
SELECT name, setting, short_desc
FROM pg_settings
WHERE name ~ &#039;^(hnsw|ivfflat|diskann)\.&#039;
ORDER BY name;

                    name                    | setting | short_desc
--------------------------------------------+---------+-------------------------------------------
 diskann.query_rescore                      | 50      | Rescore candidates
 diskann.query_search_list_size             | 100     | Search list size for queries
 hnsw.ef_search                             | 40      | Dynamic candidate list size for search
 hnsw.iterative_scan                        | off     | Mode for iterative scans
 hnsw.max_scan_tuples                       | 20000   | Max tuples to visit for iterative scans
 hnsw.scan_mem_multiplier                   | 1       | Multiple of work_mem for iterative scans
 ivfflat.iterative_scan                     | off     | Mode for iterative scans
 ivfflat.max_probes                         | 32768   | Max probes for iterative scans
 ivfflat.probes                             | 1       | Number of probes
</pre></div>


<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>Takeaway:</strong> <code>hnsw.iterative_scan</code> and <code>ivfflat.iterative_scan</code> default to <code>off</code>. If your application relies on vector search with WHERE clauses, you need to explicitly enable iterative scans.</p>
</blockquote>



<h3 class="wp-block-heading" id="h-access-method-capabilities">Access Method Capabilities</h3>



<p>Not sure which operator class works with which index type? Query the catalog:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
SELECT am.amname AS access_method,
       opc.opcname AS operator_class,
       t.typname AS data_type
FROM pg_opclass opc
JOIN pg_am am ON am.oid = opc.opcmethod
JOIN pg_type t ON t.oid = opc.opcintype
WHERE am.amname IN (&#039;hnsw&#039;, &#039;ivfflat&#039;, &#039;diskann&#039;)
ORDER BY am.amname, opc.opcname;
</pre></div>


<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Access Method</th><th>Data Types</th><th>Operator Classes</th></tr></thead><tbody><tr><td><strong>HNSW</strong></td><td>vector, halfvec, bit, sparsevec</td><td>18 classes (broadest support)</td></tr><tr><td><strong>IVFFlat</strong></td><td>vector, halfvec, bit</td><td>7 classes</td></tr><tr><td><strong>DiskANN</strong></td><td>vector only (+ label filtering)</td><td>4 classes</td></tr></tbody></table></figure>



<p>HNSW is the most versatile. DiskANN is the most constrained. IVFFlat falls in between.</p>



<h3 class="wp-block-heading" id="h-build-progress-monitoring">Build Progress Monitoring</h3>



<p>While building a large index:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
SELECT phase,
       round(100.0 * blocks_done / nullif(blocks_total, 0), 1) AS pct_done,
       tuples_done, tuples_total
FROM pg_stat_progress_create_index;
</pre></div>


<p>This works for HNSW and IVFFlat builds (pgvector reports progress). DiskANN builds from pgvectorscale don&#8217;t currently report to this view.</p>
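
<p>When several sessions are building indexes, it helps to see which backend each progress row belongs to and how long it has been running. A sketch that joins the progress view to <code>pg_stat_activity</code>; the column aliases are illustrative:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
SELECT p.pid,
       p.relid::regclass        AS table_name,
       p.command,
       p.phase,
       round(100.0 * p.blocks_done / nullif(p.blocks_total, 0), 1) AS pct_done,
       now() - a.query_start    AS running_for,
       left(a.query, 60)        AS statement
FROM pg_stat_progress_create_index p
JOIN pg_stat_activity a USING (pid);
</pre></div>


<p>The view is empty when no build is running, and <code>pct_done</code> is NULL during phases that do not report block counts.</p>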



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-decision-guidelines">Decision Guidelines</h2>



<p>Here&#8217;s how to choose:</p>



<p><strong>Start with HNSW</strong> unless you have a reason not to. It has the broadest operator support, good recall, and predictable performance. Use halfvec if your dimensions exceed 2,000.</p>



<p><strong>Choose IVFFlat</strong> if build time matters more than query time. IVFFlat builds 5-6x faster than HNSW. If your data distribution shifts materially, plan a <code>REINDEX</code> to refresh clustering quality (centroid drift). A practical signal: if recall drops without any change in query patterns, or if you&#8217;ve inserted more than ~30% new rows since the last build, the centroids are likely stale. For a deeper look at how embedding model upgrades trigger reindexing at scale, see <a href="https://www.dbi-services.com/blog/rag-series-embedding-versioning-with-pgvector-why-event-driven-architecture-is-a-precondition-to-ai-data-workflows/">Embedding Versioning with pgvector</a>.</p>



<p><strong>Choose DiskANN</strong> if storage is the constraint. The 9x compression is decisive at scale. It handles high dimensions natively (no halfvec needed) and supports <code>CONCURRENTLY</code> for production deployments.</p>



<p><strong>Enable iterative scans</strong> for vector search + filtering in production. The trade-off is potentially higher latency on very selective filters (the index does more work to find matching rows), but that&#8217;s usually the right trade-off for correctness. Tune <code>max_scan_tuples</code> / <code>max_probes</code> to bound worst-case work. Use <code>relaxed_order</code> by default, and <code>strict_order</code> (HNSW only) when ranking precision matters.</p>



<p><strong>Use partial indexes</strong> for category-specific searches. An 11x size reduction and 20x faster build is hard to argue with.</p>



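<p><strong>Use SET LOCAL</strong> for vector parameter changes in production, and only after you have tested them. <code>SET LOCAL</code> scopes a setting to the current transaction, so it does not leak into other queries on the same connection.</p>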



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Feature</th><th>HNSW</th><th>IVFFlat</th><th>DiskANN (pgvectorscale)</th><th>DiskANN (pg_diskann)</th></tr></thead><tbody><tr><td><strong>Provider</strong></td><td>pgvector</td><td>pgvector</td><td>Timescale (open source)</td><td>Microsoft (Azure DB for PG)</td></tr><tr><td><strong>Max dims (vector)</strong></td><td>2,000</td><td>2,000</td><td>16,000</td><td>16,000</td></tr><tr><td><strong>Max dims (halfvec)</strong></td><td>4,000</td><td>4,000</td><td>N/A</td><td>N/A</td></tr><tr><td><strong>Compression</strong></td><td>Via halfvec</td><td>Via halfvec</td><td>SBQ (1-2 bits/dim)</td><td>PQ (4 bits/chunk)</td></tr><tr><td><strong>Build time (25K, 3072d)</strong></td><td>29s</td><td>5s</td><td>49s</td><td>N/A</td></tr><tr><td><strong>Index size</strong></td><td>193 MB</td><td>193 MB</td><td><strong>21 MB</strong></td><td>Similar</td></tr><tr><td><strong>Query time</strong></td><td>2-6 ms</td><td>2-10 ms</td><td>3 ms</td><td>~3 ms</td></tr><tr><td><strong>Key tuning param</strong></td><td>ef_search</td><td>probes</td><td>query_search_list_size</td><td>search list / PQ params</td></tr><tr><td><strong>Iterative scan</strong></td><td>Yes</td><td>Yes</td><td>No</td><td>Yes (ON by default)</td></tr><tr><td><strong>CONCURRENTLY</strong></td><td>Yes</td><td>Yes</td><td>Yes (0.9.0+)</td><td>Yes</td></tr><tr><td><strong>Data types</strong></td><td>vector, halfvec, bit, sparsevec</td><td>vector, halfvec, bit</td><td>vector only</td><td>vector only</td></tr></tbody></table></figure>



<p id="h-dimension-limits-are-largely-explained-by-postgresql-s-8-kb-page-size-and-encoding-density-exact-cutoffs-are-implementation-defined">Dimension limits are largely explained by PostgreSQL&#8217;s 8 KB page size and encoding density (exact cutoffs are implementation-defined):</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Encoding</th><th>Bits/dim</th><th>Theoretical max</th><th>Actual limit</th><th>Who uses it</th></tr></thead><tbody><tr><td>float32 (<code>vector</code>)</td><td>32</td><td>2,048</td><td>2,000</td><td>HNSW, IVFFlat</td></tr><tr><td>float16 (<code>halfvec</code>)</td><td>16</td><td>4,096</td><td>4,000</td><td>HNSW, IVFFlat</td></tr><tr><td>PQ symbol</td><td>4</td><td>16,384</td><td>16,000</td><td>pg_diskann</td></tr><tr><td>SBQ binary</td><td>1</td><td>65,536</td><td>16,000</td><td>pgvectorscale</td></tr></tbody></table></figure>



<p>The compression story is ultimately a story about <strong>how many bits of information you need per dimension to navigate the index</strong>. pgvector stores full-precision values; DiskANN stores just enough to find the right neighborhood, then goes back to the heap for exact distances. <br>But here&#8217;s the thing: exact results are not necessarily what you want. In a RAG pipeline, the retrieved documents are context for a language model that will synthesize and rephrase an answer — not return rows verbatim. Whether your top-10 results are ranked by exact cosine distance or by a 96%-accurate approximation rarely changes the generated answer. The same is true for recommendations, semantic deduplication, and most classification workflows: the downstream consumer is tolerant of small ranking variations.</p>



<p>The real production trade-off is not precision vs approximation — it&#8217;s <strong>the balance between retrieval speed, resource efficiency, and result quality at your scale</strong>. An HNSW index that doesn&#8217;t fit in <code>shared_buffers</code> and hits disk on every query will give you worse <em>effective</em> results than a DiskANN index that stays cached and returns slightly less precise distances in a fraction of the time. The best retrieval is the one that actually runs within your latency budget.</p>



<p>The lab, with all SQL scripts and the Python embedding pipeline, is available here: <a href="https://github.com/boutaga/pgvector_RAG_search_lab/tree/main/lab/06_pgvector_indexes"><code>lab/06_pgvector_indexes</code></a>.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<p><em>All benchmarks: PostgreSQL 18, pgvector 0.8.1, pgvectorscale 0.9.0, 25K Wikipedia articles, 3072d text-embedding-3-large embeddings, 4 vCPUs, 8 GB RAM.</em></p>
<p>The article <a href="https://www.dbi-services.com/blog/pgvector-a-guide-for-dba-part-2-indexes-update-march-2026/">pgvector, a guide for DBA &#8211; Part 2: Indexes (update march 2026)</a> first appeared on <a href="https://www.dbi-services.com/blog">dbi Blog</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.dbi-services.com/blog/pgvector-a-guide-for-dba-part-2-indexes-update-march-2026/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>PostgreSQL Anonymizer: Simple Data Masking for DBAs</title>
		<link>https://www.dbi-services.com/blog/postgresql-anonymizer-simple-data-masking-for-dbas/</link>
					<comments>https://www.dbi-services.com/blog/postgresql-anonymizer-simple-data-masking-for-dbas/#respond</comments>
		
		<dc:creator><![CDATA[Joan Frey]]></dc:creator>
		<pubDate>Fri, 27 Feb 2026 10:08:50 +0000</pubDate>
				<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[anon]]></category>
		<category><![CDATA[anonymization]]></category>
		<category><![CDATA[pg_anonymizer]]></category>
		<guid isPermaLink="false">https://www.dbi-services.com/blog/?p=43200</guid>

					<description><![CDATA[<p>Sensitive data (names, emails, phone numbers, personal identifiers…) should not be freely exposed outside production. When you refresh a production database to a test or staging environment, or when analysts need access to real-looking data, anonymization becomes critical. The PostgreSQL Anonymizer extension (often called anon) is an open‑source extension that lets you mask, fake, shuffle, [&#8230;]</p>
<p>The article <a href="https://www.dbi-services.com/blog/postgresql-anonymizer-simple-data-masking-for-dbas/">PostgreSQL Anonymizer: Simple Data Masking for DBAs</a> first appeared on <a href="https://www.dbi-services.com/blog">dbi Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Sensitive data (names, emails, phone numbers, personal identifiers…) should not be freely exposed outside production. When you refresh a production database to a test or staging environment, or when analysts need access to real-looking data, anonymization becomes critical.</p>



<p>The PostgreSQL Anonymizer extension (often called anon) is an open‑source extension that lets you mask, fake, shuffle, or generalize data directly inside PostgreSQL, using simple SQL rules. This article explains what it is, how it works, how to install it on PostgreSQL 18.1 running on Red Hat Enterprise Linux 10.1, and how to use it with clear, beginner‑friendly examples.</p>



<p>The target audience is everyone, but especially beginner DBAs who want a practical, command‑line–focused introduction.</p>



<h2 class="wp-block-heading" id="h-i-what-is-postgresql-anonymizer">I. What is PostgreSQL Anonymizer?</h2>



<p>PostgreSQL Anonymizer is an extension that helps protect sensitive data by replacing it with fake or obfuscated values.</p>



<p>Instead of exporting data and anonymizing it with scripts, the rules live inside the database itself. You declare how a column should be anonymized, and PostgreSQL applies that rule automatically.</p>



<p>Typical use cases:</p>



<ul class="wp-block-list">
<li>Refreshing production data into test / staging environments</li>



<li>Giving developers or analysts access to realistic but non‑sensitive data</li>



<li>Producing anonymized database dumps for external sharing</li>



<li>Helping comply with GDPR and other privacy regulations</li>
</ul>



<p>The extension supports three main approaches:</p>



<ol start="1" class="wp-block-list">
<li>Static masking – permanently replaces data in tables</li>



<li>Dynamic masking – masks data on the fly for specific users</li>



<li>Anonymous dumps – exports an already anonymized dump</li>
</ol>



<h2 class="wp-block-heading" id="h-ii-how-does-it-work">II. How does it work?</h2>



<p>PostgreSQL Anonymizer uses PostgreSQL’s security labels mechanism. You attach a label to a column that says: “When this data is anonymized, use <em>this function</em>.”</p>



<p>Example:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
SECURITY LABEL FOR anon ON COLUMN customer.email IS &#039;MASKED WITH FUNCTION anon.fake_email()&#039;;
</pre></div>


<p>Once declared:</p>



<ul class="wp-block-list">
<li>Static masking rewrites the table using those rules</li>



<li>Dynamic masking rewrites query results for masked users</li>



<li>Dumps automatically apply the same rules</li>
</ul>



<p>The rules stay attached to the schema, not to scripts or applications.</p>



<h2 class="wp-block-heading" id="h-iii-installing-postgresql-anonymizer-on-rhel-10-1-postgresql-18-1">III. Installing PostgreSQL Anonymizer on RHEL 10.1 (PostgreSQL 18.1)</h2>



<p>In this guide, PostgreSQL 18.1 is already installed following the dbi services standard. The PostgreSQL binaries are located in:</p>



<ul class="wp-block-list">
<li>/u01/app/postgres/product/18/db_1/bin</li>
</ul>



<p>The PostgreSQL data directory (PGDATA) is:</p>



<ul class="wp-block-list">
<li>/u02/pgdata/18/demo-cluster</li>
</ul>



<p>We will only focus on installing and enabling the PostgreSQL Anonymizer extension.</p>



<h3 class="wp-block-heading" id="h-1-install-the-anonymizer-extension">1. Install the Anonymizer extension</h3>



<p>Since we are installing from source, we&#8217;ll start by installing the necessary prerequisites. We&#8217;ll use cargo to handle the PGRX system requirements. You can find the official documentation <a href="https://postgresql-anonymizer.readthedocs.io/en/latest/INSTALL/#install-on-redhat-rocky-linux-alma-linux" id="https://postgresql-anonymizer.readthedocs.io/en/latest/INSTALL/#install-on-redhat-rocky-linux-alma-linux">here</a>, but keep in mind that I&#8217;ve updated the commands to reflect a more recent version.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
# Install Rust and Cargo via rustup
curl --proto &#039;=https&#039; --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
rustc --version
cargo --version

# Install PGRX and register the existing PostgreSQL 18 installation
cargo install cargo-pgrx --version 0.16.1 --locked
cargo pgrx init --pg18 /u01/app/postgres/product/18/db_1/bin/pg_config

# Build and install postgresql_anonymizer from source
git clone https://gitlab.com/dalibo/postgresql_anonymizer.git
cd postgresql_anonymizer/
make extension PG_CONFIG=/u01/app/postgres/product/18/db_1/bin/pg_config PGVER=pg18
sudo make install PG_CONFIG=/u01/app/postgres/product/18/db_1/bin/pg_config PGVER=pg18
</pre></div>


<h3 class="wp-block-heading" id="h-2-enable-anonymizer-in-postgresql-conf">2. Enable Anonymizer in postgresql.conf</h3>



<p>Because PostgreSQL Anonymizer hooks into query execution, it must be loaded at session start. Edit the configuration file:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
vi /u02/pgdata/18/demo-cluster/postgresql.conf
</pre></div>


<p>Add or update:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
session_preload_libraries = &#039;anon&#039;
</pre></div>


<p>Restart PostgreSQL:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
pg_ctl -D /u02/pgdata/18/demo-cluster restart
</pre></div>


<h3 class="wp-block-heading" id="h-3-install-and-enable-the-anonymizer-extension">3. Install and enable the Anonymizer extension</h3>



<p>Connect as the PostgreSQL superuser:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
psql -U postgres
</pre></div>


<p>Create a database:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
CREATE DATABASE anonymizer_demo;
</pre></div>


<p>Create and initialize the extension:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
postgres=# \c anonymizer_demo
You are now connected to database &quot;anonymizer_demo&quot; as user &quot;postgres&quot;.
anonymizer_demo=# CREATE EXTENSION anon CASCADE;
CREATE EXTENSION

anonymizer_demo=# \dx
                                  List of installed extensions
  Name   | Version | Default version |   Schema   |                 Description
---------+---------+-----------------+------------+---------------------------------------------
 anon    | 3.0.0   | 3.0.0           | public     | Anonymization &amp; Data Masking for PostgreSQL
 plpgsql | 1.0     | 1.0             | pg_catalog | PL/pgSQL procedural language
(2 rows)

anonymizer_demo=# SELECT anon.init();
 init
------
 t
(1 row)
</pre></div>


<p>anon.init() loads fake data dictionaries (names, companies, cities, etc.) used by anonymization functions.</p>



<h2 class="wp-block-heading" id="h-iv-demo-anonymizing-a-simple-table">IV. Demo: anonymizing a simple table</h2>



<p>For this demo, I will keep it simple and create everything as the postgres superuser in the default schema of the anonymizer_demo database, but I recommend following best practices and using a dedicated user and schema.</p>



<p>Create sample data:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
CREATE TABLE customer (
  id         SERIAL PRIMARY KEY,
  first_name TEXT,
  last_name  TEXT,
  birthdate  DATE,
  email      TEXT,
  company    TEXT
);

INSERT INTO customer (first_name, last_name, birthdate, email, company) VALUES
(&#039;Alice&#039;, &#039;Martin&#039;, &#039;1987-02-14&#039;, &#039;alice.martin@example.com&#039;, &#039;Acme Corp&#039;),
(&#039;Bob&#039;,   &#039;Dupont&#039;, &#039;1979-11-03&#039;, &#039;bob.dupont@example.com&#039;,   &#039;Globex&#039;);

anonymizer_demo=# select * from customer;
 id | first_name | last_name | birthdate  |          email           |  company
----+------------+-----------+------------+--------------------------+-----------
  1 | Alice      | Martin    | 1987-02-14 | alice.martin@example.com | Acme Corp
  2 | Bob        | Dupont    | 1979-11-03 | bob.dupont@example.com   | Globex
(2 rows)
</pre></div>


<h3 class="wp-block-heading" id="h-1-static-masking-permanent-anonymization">1. Static masking (permanent anonymization)</h3>



<p>Static masking is a &#8220;fire and forget&#8221; approach where the original sensitive data is physically overwritten on the disk with faked or scrambled values. This process is destructive. Once the data is masked, the original values are gone forever. This should only be performed on non-production environments (like Staging or Dev) or on a backup copy of your database.</p>



<p>Before applying any rules, you must explicitly enable the static masking engine at both the database and role levels. This acts as a safety switch to prevent accidental data loss.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Enable static masking for the current database
anonymizer_demo=# ALTER DATABASE anonymizer_demo SET anon.static_masking = TRUE;
ALTER DATABASE

-- Enable static masking for sessions opened by the postgres role
anonymizer_demo=# ALTER ROLE postgres SET anon.static_masking = TRUE;
ALTER ROLE
</pre></div>


<p>Next, you define how the data should be transformed. We use <code>SECURITY LABEL</code> to attach masking logic to specific columns. This doesn&#8217;t change the data yet; it simply tells the anon extension which functions to use during the anonymization process.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Replace names with realistic dummy values
anonymizer_demo=# SECURITY LABEL FOR anon ON COLUMN customer.first_name IS &#039;MASKED WITH FUNCTION anon.dummy_first_name()&#039;;
SECURITY LABEL
anonymizer_demo=# SECURITY LABEL FOR anon ON COLUMN customer.last_name IS &#039;MASKED WITH FUNCTION anon.dummy_last_name()&#039;;
SECURITY LABEL

-- Generate a random date within a fixed range (1950-2000)
anonymizer_demo=# SECURITY LABEL FOR anon ON COLUMN customer.birthdate IS &#039;MASKED WITH FUNCTION anon.random_date_between(&#039;&#039;1950-01-01&#039;&#039;, &#039;&#039;2000-12-31&#039;&#039;)&#039;;
SECURITY LABEL

-- Generate syntactically correct but fake emails and company names
anonymizer_demo=# SECURITY LABEL FOR anon ON COLUMN customer.email IS &#039;MASKED WITH FUNCTION anon.fake_email()&#039;;
SECURITY LABEL
anonymizer_demo=# SECURITY LABEL FOR anon ON COLUMN customer.company IS &#039;MASKED WITH FUNCTION anon.fake_company()&#039;;
SECURITY LABEL
</pre></div>
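<p>When many columns need rules, the statements can be generated from a mapping instead of being typed by hand. The following is a minimal Python sketch that only builds the SQL text and executes nothing. The helper name and the rule mapping are illustrative and not part of the extension, and the sketch assumes trusted, simple table and column names. It also shows why single quotes are doubled when a masking function takes string arguments:</p>

```python
def security_label_sql(table, column, masking_function):
    """Build a SECURITY LABEL statement for one column.

    Single quotes inside the function call are doubled because the whole
    rule is itself passed as a SQL string literal.
    """
    rule = f"MASKED WITH FUNCTION {masking_function}".replace("'", "''")
    return f"SECURITY LABEL FOR anon ON COLUMN {table}.{column} IS '{rule}';"


# A subset of the rules declared in the demo
RULES = {
    "first_name": "anon.dummy_first_name()",
    "birthdate": "anon.random_date_between('1950-01-01', '2000-12-31')",
    "email": "anon.fake_email()",
}

for column, function in RULES.items():
    print(security_label_sql("customer", column, function))
```

<p>The printed statements can be reviewed and then run with psql like any other SQL script.</p>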


<p>This is the final execution step. Running anonymize_database() triggers the engine to scan your rules and overwrite the table data globally. Depending on the size of your database, this may take some time as it performs UPDATE operations on the disk.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
anonymizer_demo=# SELECT anon.anonymize_database();
 anonymize_database
--------------------
 t
(1 row)
</pre></div>


<p>The masking rules have now been applied and the data has been anonymized. You can check the result with the following query:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
anonymizer_demo=# SELECT * FROM customer;
 id | birthdate  |        email         |    company     | first_name | last_name
----+------------+----------------------+----------------+------------+------------
  1 | 1960-11-30 | lpeters@example.net  | Brown and Sons | Chyna      | Mertz
  2 | 1997-09-08 | avaughan@example.com | Rice PLC       | Damien     | Williamson
(2 rows)
</pre></div>


<p>You can now turn off static masking to continue with the next demo, dynamic masking:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
anonymizer_demo=# ALTER DATABASE anonymizer_demo SET anon.static_masking TO off;
ALTER DATABASE
anonymizer_demo=# ALTER ROLE postgres SET anon.static_masking TO off;
ALTER ROLE
</pre></div>


<h3 class="wp-block-heading" id="h-2-dynamic-masking">2. Dynamic masking</h3>



<p>Dynamic masking allows you to hide sensitive information from specific users (like developers or analysts) while preserving the original data for administrators or the application itself. The masking happens in memory at the moment the query is executed.</p>



<p>First, we tell the database to activate the transparent masking engine. This allows the anon extension to intercept queries from specific roles and apply masking rules before the results are returned.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
anonymizer_demo=# ALTER DATABASE anonymizer_demo SET anon.transparent_dynamic_masking = TRUE;
ALTER DATABASE
</pre></div>


<p>To showcase masking in action, we will use two types of users: a masked user who sees fake data, and an unmasked user (like the superuser) who sees the actual data stored on disk.</p>



<p>In this step, we create <code>demo_user</code> and &#8220;tag&#8221; them with a security label that forces the masking engine to engage whenever they log in.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
anonymizer_demo=# CREATE ROLE demo_user LOGIN;
CREATE ROLE
anonymizer_demo=# SECURITY LABEL FOR anon ON ROLE demo_user IS &#039;MASKED&#039;;
SECURITY LABEL
anonymizer_demo=# GRANT pg_read_all_data to demo_user;
GRANT ROLE

-- As PostgreSQL user:
-- Ensure the postgres user remains unmasked so we can see the &#039;real&#039; data

anonymizer_demo=# SECURITY LABEL FOR anon ON ROLE postgres IS NULL;
SECURITY LABEL
</pre></div>


<p>We don&#8217;t need to redefine our masking rules (for names, emails, etc.) because the SECURITY LABEL definitions we created in the static masking section are still stored in the database schema.</p>



<p>Watch what happens when we query the table as demo_user:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
anonymizer_demo=&gt; select * from customer;
 id | birthdate  |          email           |     company      | first_name | last_name
----+------------+--------------------------+------------------+------------+-----------
  1 | 1957-02-07 | walterkristi@example.org | Fernandez-Tucker | Emelie     | Rohan
  2 | 1950-05-15 | hannah76@example.com     | Leonard Group    | Jane       | Durgan
(2 rows)
</pre></div>


<p>If we run the exact same command again, the output changes:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
anonymizer_demo=&gt; select * from customer;
 id | birthdate  |         email          |    company    | first_name | last_name
----+------------+------------------------+---------------+------------+------------
  1 | 1963-12-26 | blairpeter@example.com | Simpson Group | Tod        | Balistreri
  2 | 1971-09-14 | steven27@example.com   | Davis-Hardin  | Sonny      | Wintheiser
(2 rows)
</pre></div>


<p>Because the data is being generated &#8220;on-the-fly&#8221; by the masking functions, the results are dynamic. Each request produces a fresh set of data.</p>



<p>Now, let&#8217;s switch back to the postgres user. Since we removed the MASKED label from this role, the engine steps aside and shows us the actual data residing on the disk (which, in this case, is the data we masked statically in the previous step).</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
anonymizer_demo=# SELECT * FROM customer;
 id | birthdate  |        email         |    company     | first_name | last_name
----+------------+----------------------+----------------+------------+------------
  1 | 1960-11-30 | lpeters@example.net  | Brown and Sons | Chyna      | Mertz
  2 | 1997-09-08 | avaughan@example.com | Rice PLC       | Damien     | Williamson
(2 rows)
</pre></div>


<h3 class="wp-block-heading" id="h-3-anonymized-dump">3. Anonymized dump</h3>



<p>An Anonymized Dump allows you to export your database into a <code>.sql</code> file where the sensitive data is already replaced by fake values. This is incredibly powerful because you can create a &#8220;safe&#8221; backup that can be shared with developers or consultants without ever giving them access to your live production server.</p>



<p>Rather than using a superuser, we create a specific role whose sole purpose is to retrieve the masked version of the data during the export process.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Create a specialized user for the dump process
anonymizer_demo=# CREATE ROLE demo_ano_dumper LOGIN PASSWORD &#039;secret&#039;;
CREATE ROLE

-- Force the masking engine to be active for this user session
anonymizer_demo=# ALTER ROLE demo_ano_dumper SET anon.transparent_dynamic_masking = TRUE;
ALTER ROLE

-- Apply the MASKED label to the role
anonymizer_demo=# SECURITY LABEL FOR anon ON ROLE demo_ano_dumper IS &#039;MASKED&#039;;
SECURITY LABEL

-- Grant permission to read all tables
anonymizer_demo=# GRANT pg_read_all_data TO demo_ano_dumper;
GRANT ROLE
</pre></div>


<p>Now, we use the standard pg_dump utility. Because we are logging in as demo_ano_dumper, the anon extension intercepts the data export on-the-fly. We use a few specific flags to ensure the resulting file is clean and doesn&#8217;t contain the masking logic itself:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
/u01/app/postgres/product/18/db_1/bin/pg_dump anonymizer_demo --username=demo_ano_dumper --password --no-security-labels --exclude-extension=&quot;anon&quot; --file=anonymized_dump.sql
</pre></div>


<p><strong><code>--no-security-labels</code></strong>: Prevents the &#8220;MASKED&#8221; tags from being exported (the new database doesn&#8217;t need to know how the data was masked).</p>



<p><strong><code>--exclude-extension="anon"</code></strong>: Ensures the recipient doesn&#8217;t need the <code>anon</code> extension installed to restore the file.</p>
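<p>If the export has to be scripted, the same command line can be assembled programmatically. This is a minimal Python sketch, assuming <code>pg_dump</code> is on the PATH; the helper name is illustrative. It only builds the argument list, so nothing is executed until you pass it to <code>subprocess.run</code> yourself:</p>

```python
def anonymized_dump_command(database, role, outfile, pg_dump="pg_dump"):
    """Return the pg_dump argument list for an anonymized export."""
    return [
        pg_dump,
        database,
        f"--username={role}",
        "--password",                # always prompt for the password
        "--no-security-labels",      # leave the masking rules out of the dump
        "--exclude-extension=anon",  # the restored database does not need anon
        f"--file={outfile}",
    ]


cmd = anonymized_dump_command("anonymizer_demo", "demo_ano_dumper", "anonymized_dump.sql")
print(" ".join(cmd))
```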



<p>If you open the generated anonymized_dump.sql file in a text editor, you will see that the COPY commands contain the fake data, not the original sensitive information.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
--
-- Data for Name: customer; Type: TABLE DATA; Schema: public; Owner: postgres
--

COPY public.customer (id, birthdate, email, company, first_name, last_name) FROM stdin;
1       1980-12-17      cannonlauren@example.com        Carr-Doyle      Katharina       Shanahan
2       1998-03-27      sfox@example.net        Pollard and Sons        Ayla    Spencer
\.
</pre></div>


<h2 class="wp-block-heading" id="h-v-conclusion">V. Conclusion</h2>



<p>In modern development, the goal is to work with realistic data without the real-world risk. PostgreSQL Anonymizer bridges this gap by allowing you to transform sensitive production information into safe, functional datasets.</p>



<p>Now that we&#8217;ve explained how to use PostgreSQL Anonymizer and covered all three methods, here is a quick guide on when to use each:</p>



<ul class="wp-block-list">
<li><strong>Static Masking:</strong> Best for &#8220;cleaning&#8221; a staging database after a production refresh.</li>



<li><strong>Dynamic Masking:</strong> Best for internal users (DBAs, support staff) who need to work on the live database but shouldn&#8217;t see production data.</li>



<li><strong>Anonymized Dump:</strong> Best for sharing data with external partners or creating local development environments.</li>
</ul>



<p>In the end, whether you choose Static, Dynamic, or Dump masking, the benefits remain the same:</p>



<ul class="wp-block-list">
<li><strong>Utility:</strong> Because the data is masked with realistic functions (like <code>fake_email</code> or <code>dummy_first_name</code>), your application logic—like email validation or UI layout—still works perfectly.</li>



<li><strong>Compliance:</strong> Helps you meet GDPR, HIPAA, and internal security requirements.</li>



<li><strong>Safety:</strong> Developers and analysts can work on real-world bugs and features without ever seeing a customer&#8217;s actual PII (Personally Identifiable Information).</li>
</ul>



<p>I hope you enjoyed this guide and found these examples clear and easy to follow! My goal was to show that data privacy doesn&#8217;t have to be painful for your development workflow. Don&#8217;t forget to follow the extension&#8217;s latest news on the official website: <a href="https://postgresql-anonymizer.readthedocs.io/en/latest/">https://postgresql-anonymizer.readthedocs.io/en/latest/</a></p>



<p>If you have any questions about these commands or how to implement them in your own environment, feel free to reach out <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<p>The article <a href="https://www.dbi-services.com/blog/postgresql-anonymizer-simple-data-masking-for-dbas/">PostgreSQL Anonymizer: Simple Data Masking for DBAs</a> first appeared on <a href="https://www.dbi-services.com/blog">dbi Blog</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.dbi-services.com/blog/postgresql-anonymizer-simple-data-masking-for-dbas/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>RAG Series – Embedding Versioning LAB</title>
		<link>https://www.dbi-services.com/blog/rag-series-embedding-versioning-lab/</link>
					<comments>https://www.dbi-services.com/blog/rag-series-embedding-versioning-lab/#respond</comments>
		
		<dc:creator><![CDATA[Adrien Obernesser]]></dc:creator>
		<pubDate>Sun, 22 Feb 2026 21:36:36 +0000</pubDate>
				<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[pgvector]]></category>
		<category><![CDATA[postgresql]]></category>
		<category><![CDATA[RAG]]></category>
		<guid isPermaLink="false">https://www.dbi-services.com/blog/?p=43036</guid>

					<description><![CDATA[<p>Introduction This is Part 2 of the embedding versioning series. In Part 1, I covered the theory: why event-driven embedding refresh matters, the three levels of architecture (triggers, logical replication, Flink CDC), and how to detect and skip insignificant changes. If you haven&#8217;t read it, go there first: this post won&#8217;t walk through the entire intent of [&#8230;]</p>
<p>The article <a href="https://www.dbi-services.com/blog/rag-series-embedding-versioning-lab/">RAG Series – Embedding Versioning LAB</a> first appeared on <a href="https://www.dbi-services.com/blog">dbi Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h1 class="wp-block-heading" id="h-introduction">Introduction</h1>



<p>This is Part 2 of the embedding versioning series. In <a href="https://www.dbi-services.com/blog/rag-series-embedding-versioning-with-pgvector-why-event-driven-architecture-is-a-precondition-to-ai-data-workflows/" target="_blank" rel="noreferrer noopener">Part 1</a>, I covered the theory: why event-driven embedding refresh matters, the three levels of architecture (triggers, logical replication, Flink CDC), and how to detect and skip insignificant changes. If you haven&#8217;t read it, go there first: this post won&#8217;t walk through the full intent of the designs; it just demonstrates how they can work.</p>



<p>Here, I&#8217;m going to <strong>run the whole thing</strong> on the Wikipedia dataset from the <a href="https://github.com/boutaga/pgvector_RAG_search_lab">pgvector_RAG_search_lab</a> repository. 25,000 articles, triggers, OpenAI API calls, real numbers.</p>



<p>The goal is to answer the questions you&#8217;d actually have when implementing this:</p>



<ul class="wp-block-list">
<li>How do you adapt the schema to an existing table that wasn&#8217;t designed for versioning?</li>



<li>What do the SKIP vs EMBED decisions actually look like with real data?</li>



<li>Does <code>SELECT FOR UPDATE SKIP LOCKED</code> really work with concurrent workers? </li>



<li>What does the freshness monitoring report show in practice?</li>



<li>How does the quality feedback loop close the circle?</li>
</ul>



<p>All the code is in the <code>lab/05_embedding_versioning/</code> directory of the repository.</p>



<h3 class="wp-block-heading" id="h-what-s-in-the-lab-directory">What&#8217;s in the lab directory</h3>



<p>Before diving in, here&#8217;s what each file does — so you know what you&#8217;re running:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
lab/05_embedding_versioning/
├── schema.sql                          # DDL: tables, triggers, indexes
├── worker.py                           # Embedding worker (claims queue items, calls OpenAI, writes vectors)
├── change_detector.py                  # Compares new vs old embeddings to decide SKIP or EMBED
├── freshness_monitor.py                # Generates a full health report on embedding staleness
└── examples/
    ├── simulate_document_changes.py    # Generates a realistic mix of article mutations
    ├── targeted_mutations.py           # Applies specific change types to specific articles
    ├── demo_skip_locked.py             # Demonstrates concurrent worker queue distribution
    ├── demo_trigger_flow.py            # End-to-end: UPDATE → trigger → queue → embed
    └── demo_quality_drift.py           # Simulates declining search quality + automatic re-queuing

</pre></div>


<p>Every script connects to the local <code>wikipedia</code> database and uses the same embedding queue. They&#8217;re designed to run sequentially — each step builds on the state left by the previous one.</p>
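<p>To make the SKIP vs EMBED idea concrete before running anything, here is a minimal Python sketch of a similarity-based decision. It is not the code from <code>change_detector.py</code>: the function names and the 0.95 threshold are assumptions chosen for illustration, and the vectors are toy two-dimensional examples:</p>

```python
import math

SKIP_THRESHOLD = 0.95  # hypothetical cutoff, not taken from the lab code


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def decide(old_embedding, new_embedding, threshold=SKIP_THRESHOLD):
    """Return 'SKIP' when the two embeddings are nearly identical, else 'EMBED'."""
    if cosine_similarity(old_embedding, new_embedding) >= threshold:
        return "SKIP"
    return "EMBED"


print(decide([1.0, 0.0], [1.0, 0.0]))  # identical vectors
print(decide([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```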



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-the-starting-point">The Starting Point</h2>



<p>My lab environment runs PostgreSQL 17.6 with pgvector 0.8.0 and pgvectorscale (DiskANN). The <code>articles</code> table already has 25,000 Wikipedia articles with dense and sparse embeddings from the previous labs (the <code>sparsevec(30522)</code> column holds SPLADE sparse vectors — 30,522 is the BERT WordPiece vocabulary size):</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
wikipedia=# \d articles
                          Table &quot;public.articles&quot;
         Column         |       Type       | Collation | Nullable | Default
------------------------+------------------+-----------+----------+---------
 id                     | integer          |           | not null |
 url                    | text             |           |          |
 title                  | text             |           |          |
 content                | text             |           |          |
 title_vector           | vector(1536)     |           |          |
 content_vector         | vector(1536)     |           |          |
 vector_id              | integer          |           |          |
 content_tsv            | tsvector         |           |          |
 title_content_tsvector | tsvector         |           |          |
 content_sparse         | sparsevec(30522) |           |          |
 title_vector_3072      | vector(3072)     |           |          |
 content_vector_3072    | vector(3072)     |           |          |
Indexes:
    &quot;articles_pkey&quot; PRIMARY KEY, btree (id)
    &quot;articles_content_3072_diskann&quot; diskann (content_vector_3072)
    &quot;articles_sparse_hnsw&quot; hnsw (content_sparse sparsevec_cosine_ops)
    &quot;articles_title_vector_3072_diskann&quot; diskann (title_vector_3072)
    &quot;idx_articles_content_tsv&quot; gin (content_tsv)
    &quot;idx_articles_title_content_tsvector&quot; gin (title_content_tsvector)
Triggers:
    tsvectorupdate BEFORE INSERT OR UPDATE ON articles FOR EACH ROW ...
    tsvupdate BEFORE INSERT OR UPDATE ON articles FOR EACH ROW ...

</pre></div>


<p>No <code>content_hash</code>, no <code>updated_at</code>, no versioned embeddings. This is the reality of most existing deployments — you need to retrofit versioning without breaking what&#8217;s already working.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-step-1-apply-the-versioning-schema">Step 1: Apply the Versioning Schema</h2>



<h3 class="wp-block-heading" id="h-what-schema-sql-does">What <code>schema.sql</code> does</h3>



<p>The schema file adapts the generic pattern from Part 1 to the existing <code>articles</code> table. It runs inside a single transaction and performs these operations in order:</p>



<ol class="wp-block-list">
<li><strong>Adds two columns</strong> to <code>articles</code>: <code>content_hash TEXT</code> and <code>updated_at TIMESTAMPTZ DEFAULT now()</code></li>



<li><strong>Creates a BEFORE trigger</strong> (<code>trg_content_hash</code>) that automatically computes <code>md5(content)</code> before every INSERT or UPDATE of the <code>content</code> column — this is our change detection fingerprint</li>



<li><strong>Backfills</strong> <code>content_hash</code> for all 25,000 existing articles with <code>UPDATE articles SET content_hash = md5(content)</code></li>



<li><strong>Creates <code>article_embeddings_versioned</code></strong> — the versioned embeddings table with <code>model_name</code>, <code>model_version</code>, <code>source_hash</code>, <code>is_current</code>, and a partial DiskANN index on <code>WHERE is_current = true</code></li>



<li><strong>Creates <code>embedding_queue</code></strong> — the work queue with <code>status</code>, <code>content_hash</code>, <code>change_type</code>, <code>claimed_at</code>, and retry tracking</li>



<li><strong>Creates <code>embedding_change_log</code></strong> — records every SKIP/EMBED decision with similarity scores for audit</li>



<li><strong>Creates <code>retrieval_quality_log</code></strong> — for the quality feedback loop (Step 9b)</li>



<li><strong>Creates an AFTER trigger</strong> (<code>trg_queue_embedding</code>) that fires on <code>INSERT OR UPDATE OF content</code> and inserts a queue entry automatically</li>
</ol>



<p>Key differences from the &#8220;clean&#8221; schema in Part 1:</p>



<ul class="wp-block-list">
<li><strong>No generated column for <code>content_hash</code></strong>: <code>GENERATED ALWAYS AS (md5(content)) STORED</code> would rewrite the entire 25K-row table. The BEFORE trigger achieves the same result without a table rewrite — important for large production tables.</li>



<li><strong>Column-targeted trigger</strong>: <code>AFTER UPDATE OF content</code> instead of <code>AFTER UPDATE</code>. The trigger only fires when the <code>content</code> column is touched — title-only or metadata-only updates are ignored at the PostgreSQL level, not inside application code.</li>



<li><strong>Table naming</strong>: <code>article_embeddings_versioned</code> (not <code>document_embeddings</code>) to match the existing <code>articles</code> table naming convention.</li>
</ul>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
psql -d wikipedia -f lab/05_embedding_versioning/schema.sql

</pre></div>

<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
BEGIN
ALTER TABLE
CREATE FUNCTION
DROP TRIGGER
CREATE TRIGGER
UPDATE 25000
CREATE TABLE
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE TABLE
CREATE INDEX
CREATE INDEX
CREATE TABLE
CREATE TABLE
CREATE FUNCTION
DROP TRIGGER
CREATE TRIGGER
COMMIT

</pre></div>


<p>Let me walk through the important lines:</p>



<ul class="wp-block-list">
<li><strong><code>ALTER TABLE</code></strong> — adds <code>content_hash</code> and <code>updated_at</code> columns</li>



<li><strong><code>CREATE FUNCTION</code> + <code>CREATE TRIGGER</code></strong> (first pair) — the BEFORE trigger that computes <code>md5(content)</code></li>



<li><strong><code>UPDATE 25000</code></strong> — the backfill. This is the most expensive line: PostgreSQL computes MD5 for every article and writes the hash. On 25K rows it takes a few seconds; on millions of rows, plan a maintenance window</li>



<li><strong><code>CREATE TABLE</code> + <code>CREATE INDEX</code> (×3)</strong> — the versioned embeddings table with its partial DiskANN index, version lookup index, and staleness detection index</li>



<li><strong><code>CREATE FUNCTION</code> + <code>CREATE TRIGGER</code></strong> (second pair) — the AFTER trigger that queues embedding work</li>
</ul>



<p>After applying, the table now has versioning infrastructure:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
wikipedia=# \d articles
         Column         |           Type           | Nullable | Default
------------------------+--------------------------+----------+---------
 ...existing columns...
 content_hash           | text                     |          |
 updated_at             | timestamp with time zone |          | now()
Referenced by:
    TABLE &quot;article_embeddings_versioned&quot; CONSTRAINT ... FOREIGN KEY (article_id) ...
    TABLE &quot;embedding_queue&quot; CONSTRAINT ... FOREIGN KEY (article_id) ...
Triggers:
    trg_content_hash BEFORE INSERT OR UPDATE OF content ON articles ...
    trg_queue_embedding AFTER INSERT OR UPDATE OF content ON articles ...
    tsvectorupdate BEFORE INSERT OR UPDATE ON articles ...
    tsvupdate BEFORE INSERT OR UPDATE ON articles ...

</pre></div>


<p>Two new triggers alongside the existing tsvector triggers. They coexist without conflict because <code>trg_content_hash</code> is BEFORE (updates the hash) and <code>trg_queue_embedding</code> is AFTER (queues the embedding work using the already-computed hash).</p>



<p>Four new tables: <code>article_embeddings_versioned</code>, <code>embedding_queue</code>, <code>embedding_change_log</code>, and <code>retrieval_quality_log</code>, plus their supporting indexes.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-step-2-test-the-trigger-manually">Step 2: Test the Trigger Manually</h2>



<p>Before running anything complex, verify the trigger actually works. This is just a sanity check — one UPDATE, then look at the queue:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
wikipedia=# SELECT id, title, content_hash FROM articles WHERE id = 1;
 id | title |           content_hash
----+-------+----------------------------------
  1 | April | 47761052aee1158134fc07f3f7337952

wikipedia=# UPDATE articles SET content = content || &#039; &#x5B;test trigger]&#039; WHERE id = 1;
UPDATE 1

wikipedia=# SELECT id, article_id, status, content_hash, change_type, queued_at
  FROM embedding_queue ORDER BY queued_at DESC LIMIT 5;
 id | article_id | status  |           content_hash           |  change_type   |           queued_at
----+------------+---------+----------------------------------+----------------+-------------------------------
  1 |          1 | pending | 59e5ebe6fa9fce7ab87beccf6523dda6 | content_update | 2026-02-18 14:38:01.626792+00

</pre></div>


<p><strong>What happened here, step by step:</strong></p>



<ol class="wp-block-list">
<li>We checked article 1 (&#8220;April&#8221;) — its <code>content_hash</code> was <code>4776...</code></li>



<li>We appended <code>' [test trigger]'</code> to its content</li>



<li>The <strong>BEFORE trigger</strong> (<code>trg_content_hash</code>) fired first, recomputing <code>content_hash</code> to <code>59e5...</code> (the new MD5)</li>



<li>The <strong>AFTER trigger</strong> (<code>trg_queue_embedding</code>) fired next, inserting a row into <code>embedding_queue</code> with the new hash and <code>change_type = 'content_update'</code></li>



<li>The queue entry has <code>status = 'pending'</code> — nothing has processed it yet</li>
</ol>
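<p>The hash change in steps 1 to 3 can be reproduced outside the database. This is a minimal sketch, assuming a UTF-8 encoded database, in which case Python&#8217;s <code>hashlib.md5</code> over the UTF-8 bytes should produce the same hex string as PostgreSQL&#8217;s <code>md5(content)</code>. The sample text and the <code>content_hash()</code> helper are made up for illustration; they are not the real article or the lab&#8217;s code.</p>

```python
import hashlib


def content_hash(content: str) -> str:
    """Hex MD5 of the UTF-8 bytes, mirroring md5(content) in a UTF-8 database."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()


# Hypothetical stand-in for the article body (not the real "April" text).
before = "April is the fourth month of the year."
after = before + " [test trigger]"

old_hash = content_hash(before)
new_hash = content_hash(after)

print(old_hash)
print(new_hash)
print("changed:", old_hash != new_hash)
```

<p>Any edit, however small, yields a completely different 32-character fingerprint. That is why the hash is useful for detecting <em>that</em> something changed, but says nothing about <em>how much</em> it changed.</p>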



<p>The <code>change_type</code> column is important: it&#8217;s how we&#8217;ll later distinguish content-triggered re-embeddings from quality-triggered ones (Step 9b).</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-step-3-simulate-50-document-mutations">Step 3: Simulate 50 Document Mutations</h2>



<p>Real knowledge bases rarely change one document at a time. The <code>simulate_document_changes.py</code> script generates a realistic mix of changes to random articles.</p>



<h3 class="wp-block-heading" id="h-what-the-script-does">What the script does</h3>



<p>The script picks 50 random articles from the database and applies one of five mutation types to each, chosen by a weighted random distribution that mimics real-world editing patterns:</p>



<ul class="wp-block-list">
<li><strong><code>typo_fix</code></strong> (most common): appends a period or fixes a word — the kind of minor edit that shouldn&#8217;t trigger re-embedding</li>



<li><strong><code>paragraph_add</code></strong>: appends a substantial paragraph (3-5 sentences) — new information that changes the semantic content</li>



<li><strong><code>section_rewrite</code></strong>: replaces a portion of the article with new text — significant semantic shift</li>



<li><strong><code>major_rewrite</code></strong>: rewrites most of the article — entirely new embedding needed</li>



<li><strong><code>metadata_only</code></strong>: changes only the title (not the content) — should NOT trigger the embedding pipeline at all</li>
</ul>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
python examples/simulate_document_changes.py --count 50

</pre></div>

<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
Mutation Summary:
----------------------------------------
  major_rewrite        3
  metadata_only        6
  paragraph_add       15
  section_rewrite      4
  typo_fix            22
  TOTAL               50

</pre></div>


<p>This distribution is realistic: most changes are minor fixes, a smaller portion adds new content, and a few are major rewrites. The 6 <code>metadata_only</code> changes simulate edits to fields other than <code>content</code> — think correcting a title or updating a URL.</p>
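<p>A weighted draw like this can be sketched with <code>random.choices</code>. The weights below are assumptions chosen to roughly resemble the summary above; the real script&#8217;s weights are not shown in this post, and <code>pick_mutations()</code> is a hypothetical helper.</p>

```python
import random
from collections import Counter
from typing import List, Optional

# Hypothetical weights: the real script's values are not shown in this post.
MUTATION_WEIGHTS = {
    "typo_fix": 45,
    "paragraph_add": 30,
    "metadata_only": 12,
    "section_rewrite": 8,
    "major_rewrite": 5,
}


def pick_mutations(count: int, seed: Optional[int] = None) -> List[str]:
    """Draw `count` mutation types according to MUTATION_WEIGHTS."""
    rng = random.Random(seed)
    names = list(MUTATION_WEIGHTS)
    weights = list(MUTATION_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=count)


summary = Counter(pick_mutations(50, seed=42))
for name in sorted(summary):
    print(f"  {name:<18}{summary[name]:>4}")
print(f"  {'TOTAL':<18}{sum(summary.values()):>4}")
```

<p>Passing a seed makes the draw repeatable, which helps when you want to re-run the same lab scenario.</p>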



<p>Now check the queue:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
wikipedia=# SELECT status, count(*) FROM embedding_queue GROUP BY status;
 status  | count
---------+-------
 pending |    44

</pre></div>


<p><strong>50 mutations, but only 44 queue entries.</strong> Where did the other 6 go?</p>



<p>The 6 <code>metadata_only</code> mutations changed only the title (not content), so the trigger — which fires on <code>UPDATE OF content</code> — <strong>didn&#8217;t fire for them</strong>. Those 6 changes never reached the embedding pipeline. This is the first cost optimization, and it happens at the PostgreSQL trigger level with zero application code, zero API calls, zero overhead.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>Why this matters</strong>: In a real knowledge base, a meaningful fraction of updates are metadata-only — tags, categories, status flags, author fields (in some orgs, 30-50% of all UPDATEs). Filtering them at the trigger level means your embedding worker never even sees them.</p>
</blockquote>
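<p>The 50-to-44 filtering can be modelled in a few lines. This is only a model of the trigger rule, not the trigger itself: the counts come from the mutation summary above, while the mapping from mutation type to updated columns is an assumption based on the descriptions in this step.</p>

```python
# Mutation counts copied from the Step 3 summary.
mutation_summary = {
    "major_rewrite": 3,
    "metadata_only": 6,
    "paragraph_add": 15,
    "section_rewrite": 4,
    "typo_fix": 22,
}

# Assumption: columns each mutation type lists in its UPDATE ... SET clause.
SET_COLUMNS = {
    "major_rewrite": {"content"},
    "metadata_only": {"title"},
    "paragraph_add": {"content"},
    "section_rewrite": {"content"},
    "typo_fix": {"content"},
}


def trigger_fires(set_columns, watched=("content",)):
    """Model of AFTER UPDATE OF content: fires only when a watched
    column appears in the statement's SET list."""
    return any(col in set_columns for col in watched)


queued = sum(n for name, n in mutation_summary.items()
             if trigger_fires(SET_COLUMNS[name]))
filtered = sum(mutation_summary.values()) - queued

print("queued:", queued)
print("filtered at trigger level:", filtered)
```

<p>One detail worth knowing: a column-specific trigger fires whenever the column is listed in the <code>SET</code> clause, even if the new value equals the old one. An application that always writes every column back would defeat this filter.</p>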



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-step-4-change-detection-without-a-baseline">Step 4: Change Detection Without a Baseline</h2>



<p>Now let&#8217;s run the change detector to see which items should be embedded vs skipped.</p>



<h3 class="wp-block-heading" id="h-what-change-detector-py-does">What <code>change_detector.py</code> does</h3>



<p>The change detector is the &#8220;smart filter&#8221; in our pipeline. For each pending queue item, it:</p>



<ol class="wp-block-list">
<li><strong>Fetches the article&#8217;s current content</strong> from the <code>articles</code> table</li>



<li><strong>Looks up the most recent embedding</strong> for that article in <code>article_embeddings_versioned</code></li>



<li><strong>If no previous embedding exists</strong>: marks the item as EMBED (similarity = 0.0) — there&#8217;s nothing to compare against</li>



<li><strong>If a previous embedding exists</strong>: generates a new embedding for the current content via OpenAI, computes the <strong>cosine similarity</strong> between old and new embeddings, and applies the threshold:
<ul class="wp-block-list">
<li>Similarity ≥ 0.95 → <strong>SKIP</strong> (the semantic meaning barely changed, re-embedding would be wasteful)</li>



<li>Similarity &lt; 0.95 → <strong>EMBED</strong> (the meaning shifted enough to warrant a new embedding)</li>
</ul>
</li>



<li><strong>Logs every decision</strong> to <code>embedding_change_log</code> with the similarity score — this is your audit trail</li>
</ol>
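<p>The decision rule in steps 3 and 4 can be sketched in plain Python. The function names are illustrative, not taken from <code>change_detector.py</code>, and the tiny three-dimensional vectors stand in for real 1536-dimensional embeddings.</p>

```python
import math
from typing import Optional, Sequence, Tuple

SIMILARITY_THRESHOLD = 0.95  # threshold used throughout this lab


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def decide(old: Optional[Sequence[float]], new: Sequence[float],
           threshold: float = SIMILARITY_THRESHOLD) -> Tuple[str, float]:
    """Return (decision, similarity) for one queue item."""
    if old is None:
        return "EMBED", 0.0  # no baseline to compare against
    similarity = cosine_similarity(old, new)
    return ("SKIP" if similarity >= threshold else "EMBED"), similarity


print(decide(None, [0.1, 0.2, 0.3]))                 # no baseline
print(decide([1.0, 0.0, 0.0], [0.99, 0.05, 0.0]))    # nearly parallel
print(decide([1.0, 0.0, 0.0], [0.5, 0.8, 0.0]))      # clearly different
```

<p>Returning the similarity alongside the decision keeps the audit log simple: both values can be written to <code>embedding_change_log</code> in one step.</p>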



<p><strong>Multi-chunk articles</strong>: When an article has multiple chunks (like &#8220;Dean Martin&#8221; with 3), the detector compares against <code>chunk_index = 0</code> only — the lead section, which concentrates the article&#8217;s core topic. This is a deliberate tradeoff: it&#8217;s fast (one comparison, not N), and for Wikipedia-style content where the introduction summarizes the whole article, it&#8217;s a reliable proxy. For corpora where meaning is spread more evenly across chunks, you&#8217;d want a centroid approach (average the L2-normalized chunk vectors) or max pairwise similarity across corresponding chunks. The threshold may need recalibration depending on which strategy you choose.</p>
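<p>The centroid alternative could look like the sketch below. It assumes each chunk vector is a plain list of floats. Re-normalizing the averaged vector is my addition: it does not change the cosine similarity, it only lets a dot product stand in for it.</p>

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def centroid(chunks: Sequence[Sequence[float]]) -> List[float]:
    """Average the L2-normalized chunk vectors, then re-normalize the result."""
    if not chunks:
        raise ValueError("need at least one chunk vector")
    normalized = [l2_normalize(c) for c in chunks]
    dim = len(normalized[0])
    mean = [sum(v[i] for v in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


# Toy 3-dimensional "chunk embeddings"; only the last chunk changes.
old_chunks = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 1.0, 0.0]]
new_chunks = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]

similarity = dot(centroid(old_chunks), centroid(new_chunks))
print(round(similarity, 4))
```

<p>In this toy case a <code>chunk_index = 0</code> comparison would report no change at all, because only the last chunk moved. The centroid picks it up, which is the situation the paragraph above describes.</p>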



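<p>The <code>--analyze-queue</code> flag tells it to analyze all pending items without writing any embeddings to <code>article_embeddings_versioned</code>. Think of it as a dry run for the embedding step that still records its decisions: every verdict goes to <code>embedding_change_log</code>, and items judged SKIP are marked <code>skipped</code> in the queue (as Step 6 shows).</p>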


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
python change_detector.py --analyze-queue

</pre></div>

<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
2026-02-18 14:43:22 &#x5B;DETECTOR] INFO Analyzing 44 pending queue items (threshold=0.95)
2026-02-18 14:43:22 &#x5B;DETECTOR] INFO Article 6607: EMBED (similarity=0.0000)
2026-02-18 14:43:22 &#x5B;DETECTOR] INFO Article 36870: EMBED (similarity=0.0000)
...all 44 show similarity=0.0000...
2026-02-18 14:43:22 &#x5B;DETECTOR] INFO Results: 44 EMBED, 0 SKIP

</pre></div>


<p>Every single article shows <code>similarity=0.0</code>. Why?</p>



<p>Because <code>article_embeddings_versioned</code> is <strong>empty</strong>. There are no previous embeddings to compare against. The change detector hit step 3 for every article: &#8220;no previous embedding exists → must EMBED.&#8221;</p>



<p><strong>This is an important operational insight</strong>: the change detector needs a baseline to work. On the very first run, or when you deploy to a new system, everything must be embedded. The SKIP optimization only kicks in on <strong>subsequent</strong> changes, after embeddings exist to compare against. If you&#8217;re migrating from a system that already has embeddings in a different format, you&#8217;d need to load them into <code>article_embeddings_versioned</code> first, with <code>source_hash</code> set to the hash of the content they were generated from, to bootstrap the comparison. That only helps if they come from the same embedding model: cosine similarity between vectors from different models is not meaningful.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-step-5-create-baseline-embeddings">Step 5: Create Baseline Embeddings</h2>



<p>Now we need to establish that baseline. Let&#8217;s run the worker for one small batch.</p>



<h3 class="wp-block-heading" id="h-what-worker-py-does">What <code>worker.py</code> does</h3>



<p>The worker is the component that actually calls the OpenAI API and writes embeddings to PostgreSQL. Here&#8217;s its internal flow:</p>



<ol class="wp-block-list">
<li><strong>Claim items from the queue</strong> using <code>SELECT ... FOR UPDATE SKIP LOCKED</code> — this is the concurrency primitive from Part 1. Multiple workers can run simultaneously, and each gets a non-overlapping set of items.</li>



<li><strong>For each claimed item</strong>: fetch the article content, split it into chunks (2000-character windows with overlap), and call the OpenAI <code>text-embedding-3-small</code> API to generate a 1536-dimensional vector for each chunk.</li>



<li><strong>Write the embeddings</strong> to <code>article_embeddings_versioned</code> with <code>is_current = true</code>, <code>model_name</code>, <code>model_version</code>, and <code>source_hash</code> (the content&#8217;s MD5 at the moment of embedding).</li>



<li><strong>Mark old embeddings</strong> for the same article as <code>is_current = false</code> (soft delete — they&#8217;re kept for rollback).</li>



<li><strong>Update the queue item</strong> to <code>status = 'completed'</code> with <code>processed_at = now()</code>.</li>
</ol>
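<p>The chunking in step 2 can be sketched as a sliding window. The 2000-character window comes from the description above; the 200-character overlap is an assumption, since the post does not state the overlap size, and <code>chunk_text()</code> is a hypothetical name rather than the worker&#8217;s actual function.</p>

```python
from typing import List

CHUNK_SIZE = 2000     # characters, as described in this post
CHUNK_OVERLAP = 200   # assumption: the overlap size is not given in the post


def chunk_text(text: str, size: int = CHUNK_SIZE,
               overlap: int = CHUNK_OVERLAP) -> List[str]:
    """Split text into windows of `size` characters that share `overlap`
    characters with the previous window."""
    if size <= 0 or overlap < 0 or overlap >= size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window reached the end of the text
        start += step
    return chunks


for length in (1500, 3000, 5000):
    print(length, "chars ->", len(chunk_text("x" * length)), "chunk(s)")
```

<p>Each chunk then becomes one embedding request, so the chunk count is what drives both API cost and the number of rows written per article.</p>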



<p>The <code>--once</code> flag means &#8220;process one batch and exit&#8221; (instead of running in an infinite polling loop). The <code>--batch-size 10</code> flag means &#8220;claim up to 10 items at a time.&#8221;</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
python worker.py --once --batch-size 10

</pre></div>

<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
2026-02-18 14:45:58 &#x5B;8526] INFO Worker worker-once claimed 10 items
2026-02-18 14:46:02 &#x5B;8526] INFO Article 6607: embedded 1 chunks
2026-02-18 14:46:02 &#x5B;8526] INFO Article 36870: embedded 1 chunks
2026-02-18 14:46:05 &#x5B;8526] INFO Article 19078: embedded 1 chunks
2026-02-18 14:46:05 &#x5B;8526] INFO Article 7947: embedded 1 chunks
2026-02-18 14:46:05 &#x5B;8526] INFO Article 75802: embedded 2 chunks
2026-02-18 14:46:05 &#x5B;8526] INFO Article 5150: embedded 1 chunks
2026-02-18 14:46:06 &#x5B;8526] INFO Article 55579: embedded 1 chunks
2026-02-18 14:46:06 &#x5B;8526] INFO Article 92697: embedded 1 chunks
2026-02-18 14:46:06 &#x5B;8526] INFO Article 49417: embedded 3 chunks
2026-02-18 14:46:06 &#x5B;8526] INFO Article 70595: embedded 1 chunks
Processed 10 items

</pre></div>


<p><strong>Reading the output:</strong></p>



<ul class="wp-block-list">
<li><code>claimed 10 items</code> — the worker took 10 items from the queue using SKIP LOCKED. If another worker ran simultaneously, it would get different items.</li>



<li><code>Article 6607: embedded 1 chunks</code> — this article&#8217;s content fit within a single 2000-character chunk. One API call, one embedding vector stored.</li>



<li><code>Article 75802: embedded 2 chunks</code> — &#8220;Brandenburg Gate&#8221; was longer and required two chunks. Two API calls, two embedding vectors, both linked to the same article with <code>chunk_index</code> 0 and 1.</li>



<li><code>Article 49417: embedded 3 chunks</code> — &#8220;Dean Martin&#8221; was the longest article in this batch, requiring three chunks.</li>
</ul>



<p>Let&#8217;s verify the data in PostgreSQL:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
wikipedia=# SELECT count(DISTINCT article_id) AS articles, count(*) AS chunks
  FROM article_embeddings_versioned WHERE is_current = true;
 articles | chunks
----------+--------
       10 |     13

</pre></div>


<p>10 articles, 13 chunks. The numbers match the worker output.</p>



<p>Total time: ~8 seconds for 10 articles. <strong>The bottleneck is the OpenAI API call</strong> (~300-600ms per embedding request), not PostgreSQL. In this lab, the trigger overhead, queue operations, and embedding writes were all negligible compared to API latency. If you need faster throughput, the answer is more workers (see Step 8) or a local embedding model — not database optimization.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-step-6-the-real-demo-skip-vs-embed">Step 6: The Real Demo — SKIP vs EMBED</h2>



<p>Now we have a baseline: 10 articles with embeddings and known <code>source_hash</code> values. This is the step where the change detector can finally do its job properly.</p>



<h3 class="wp-block-heading" id="h-what-targeted-mutations-py-does">What <code>targeted_mutations.py</code> does</h3>



<p>This script applies <strong>specific, known mutation types</strong> to the 10 articles we just embedded. Unlike <code>simulate_document_changes.py</code> (which picks random articles and random mutations), this script is deterministic — we control exactly what changes happen so we can verify the detector&#8217;s decisions:</p>



<ul class="wp-block-list">
<li><strong>5 articles</strong>: append a single period character (<code>.</code>) to the content — the smallest possible content change. This is a typo-level edit that should not change the semantic meaning at all.</li>



<li><strong>3 articles</strong>: append a substantial paragraph (~100 words of new information) — this adds genuine semantic content that should shift the embedding.</li>



<li><strong>2 articles</strong>: rewrite the second half of the content — a major structural change that dramatically alters the meaning.</li>
</ul>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
python examples/targeted_mutations.py

</pre></div>

<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
Embedded articles: &#x5B;5150, 6607, 7947, 19078, 36870, 49417, 55579, 70595, 75802, 92697]
  Article 5150: appended period (typo fix)
  Article 6607: appended period (typo fix)
  Article 7947: appended period (typo fix)
  Article 19078: appended period (typo fix)
  Article 36870: appended period (typo fix)
  Article 49417: appended major paragraph
  Article 55579: appended major paragraph
  Article 70595: appended major paragraph
  Article 75802: rewrote second half
  Article 92697: rewrote second half
Done - 10 targeted mutations applied

</pre></div>


<p>Each of these UPDATEs fires the trigger, which creates a new queue entry. But now — unlike Step 4 — we have <strong>existing embeddings</strong> to compare against.</p>



<p>Before running the change detector again, here is what it will do differently.</p>



<h3 class="wp-block-heading" id="h-what-happens-inside-the-detector-this-time">What happens inside the detector this time</h3>



<p>For each of the 10 mutated articles, the detector:</p>



<ol class="wp-block-list">
<li>Takes the article&#8217;s current (modified) content</li>



<li>Generates a new embedding via OpenAI</li>



<li>Retrieves the existing embedding from <code>article_embeddings_versioned</code></li>



<li>Computes cosine similarity between old and new</li>



<li>Applies the 0.95 threshold</li>
</ol>



<p>For the 34 other pending items (from Step 3, still without baseline embeddings), it still returns similarity=0.0.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
python change_detector.py --analyze-queue

</pre></div>

<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
...34 articles without baseline still show EMBED (similarity=0.0000)...

2026-02-18 14:54:03 &#x5B;DETECTOR] INFO Article 5150: SKIP (similarity=0.9981)
2026-02-18 14:54:03 &#x5B;DETECTOR] INFO Article 6607: SKIP (similarity=0.9994)
2026-02-18 14:54:03 &#x5B;DETECTOR] INFO Article 7947: SKIP (similarity=0.9993)
2026-02-18 14:54:03 &#x5B;DETECTOR] INFO Article 19078: SKIP (similarity=0.9993)
2026-02-18 14:54:04 &#x5B;DETECTOR] INFO Article 36870: SKIP (similarity=0.9997)
2026-02-18 14:54:04 &#x5B;DETECTOR] INFO Article 49417: EMBED (similarity=0.9263)
2026-02-18 14:54:04 &#x5B;DETECTOR] INFO Article 55579: EMBED (similarity=0.9255)
2026-02-18 14:54:04 &#x5B;DETECTOR] INFO Article 70595: EMBED (similarity=0.9369)
2026-02-18 14:54:04 &#x5B;DETECTOR] INFO Article 75802: EMBED (similarity=0.6256)
2026-02-18 14:54:04 &#x5B;DETECTOR] INFO Article 92697: EMBED (similarity=0.5090)
2026-02-18 14:54:04 &#x5B;DETECTOR] INFO Results: 39 EMBED, 5 SKIP

</pre></div>


<h3 class="wp-block-heading" id="h-reading-the-results">Reading the results</h3>



<p><strong>What the similarity numbers mean:</strong></p>



<ul class="wp-block-list">
<li><strong>0.998–0.999 (typo fixes)</strong>: The old and new embeddings are nearly identical. Adding a period barely shifts the vector in 1536-dimensional space. The detector correctly says: &#8220;this content hasn&#8217;t meaningfully changed — skip the re-embed.&#8221; That avoids 5 unnecessary write operations, index churn, and version flips.</li>



<li><strong>0.925–0.937 (paragraph additions)</strong>: Adding about 100 words of new information shifts the embedding enough to drop below 0.95. The detector correctly says: &#8220;the semantic content changed, so re-embed.&#8221; The new paragraph about Dean Martin&#8217;s film career, for example, needs to be reflected in the vector.</li>



<li><strong>0.509–0.626 (section rewrites)</strong>: Rewriting half the article dramatically changes the meaning. These similarities are far below the threshold — clearly needing re-embedding.</li>



<li><strong>0.0 (no baseline)</strong>: The 34 articles from Step 3 that still have no embeddings. Can&#8217;t compare what doesn&#8217;t exist yet.</li>
</ul>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>Cost honesty note</strong>: The detector uses embedding similarity, which means it calls OpenAI once per article to generate the comparison vector, even for articles it ultimately SKIPs. So SKIP doesn&#8217;t eliminate API spend; it eliminates <strong>unnecessary writes, index churn, and version flips</strong>. For single-chunk articles (the majority in this lab), the detection call costs the same as the embedding call itself. The real savings show up with multi-chunk articles: the detector spends 1 API call to decide, versus N calls to re-embed all chunks. In production, you&#8217;d add <strong>cheaper pre-filters first</strong>: a <code>content_hash</code> comparison (free, catches identical content) and a text diff ratio (cheap, catches typos). Embedding-similarity checks are then reserved for borderline cases where the content changed but the semantic impact is unclear. That&#8217;s the graduation path Part 1 describes.</p>
</blockquote>
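<p>A tiered pre-filter along those lines might look like the sketch below. It uses <code>difflib.SequenceMatcher</code> from the standard library for the diff ratio. The 0.995 cutoff and the <code>prefilter()</code> name are assumptions for illustration and would need calibration on real data.</p>

```python
import difflib
import hashlib

DIFF_RATIO_SKIP = 0.995  # hypothetical cutoff; calibrate on your own corpus


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def prefilter(old_content: str, new_content: str) -> str:
    """Return 'SKIP' when a cheap test settles it, otherwise
    'CHECK_EMBEDDING' to hand over to the embedding-similarity check."""
    # Tier 1: identical hash means identical content, nothing to do.
    if md5_hex(old_content) == md5_hex(new_content):
        return "SKIP"
    # Tier 2: character-level similarity catches typo-sized edits.
    ratio = difflib.SequenceMatcher(
        None, old_content, new_content, autojunk=False
    ).ratio()
    if ratio >= DIFF_RATIO_SKIP:
        return "SKIP"
    # Tier 3 (not shown): compare embeddings, which costs one API call.
    return "CHECK_EMBEDDING"


base = "The Brandenburg Gate is an 18th-century monument in Berlin. " * 20

print(prefilter(base, base))                            # unchanged
print(prefilter(base, base + "."))                      # typo-level edit
print(prefilter(base, base + " New paragraph. " * 30))  # real addition
```

<p><code>SequenceMatcher</code> can be slow on very long texts, so for large articles you may prefer to diff per chunk or use a faster similarity measure. A high character-level ratio also does not prove the meaning is unchanged (a single inserted &#8220;not&#8221; is a small diff), which is one reason the threshold deserves care.</p>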



<p><strong>The key insight</strong>: there&#8217;s a <strong>clean gap</strong> between the typo group (lowest: 0.9981) and the paragraph group (highest: 0.9369). Our 0.95 threshold sits inside that gap, from 0.937 to 0.998, not in ambiguous territory. In this lab the change types cluster naturally, which is what makes threshold-based detection practical. Ten samples from one corpus are not a calibration, though, so validate the threshold against your own data before relying on it.</p>
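<p>The gap can be checked directly from the logged similarities. The values below are copied from the detector output in this step; the variable names are just for illustration.</p>

```python
# Similarities copied from the Step 6 detector log, keyed by article ID.
typo_fixes = {5150: 0.9981, 6607: 0.9994, 7947: 0.9993,
              19078: 0.9993, 36870: 0.9997}
paragraph_adds = {49417: 0.9263, 55579: 0.9255, 70595: 0.9369}
rewrites = {75802: 0.6256, 92697: 0.5090}

THRESHOLD = 0.95

# Highest similarity that should still be re-embedded.
gap_low = max(max(paragraph_adds.values()), max(rewrites.values()))
# Lowest similarity that should be skipped.
gap_high = min(typo_fixes.values())

margin_below = THRESHOLD - gap_low
margin_above = gap_high - THRESHOLD

print(f"gap: {gap_low} .. {gap_high}")
print(f"margin below threshold: {margin_below:.4f}")
print(f"margin above threshold: {margin_above:.4f}")
```

<p>The threshold sits closer to the paragraph group than to the typo group, so a slightly shorter added paragraph is the case most likely to be skipped by mistake.</p>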



<p>The queue now reflects the decisions:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
wikipedia=# SELECT status, count(*) FROM embedding_queue GROUP BY status;
  status   | count
-----------+-------
 skipped   |     5
 completed |    10
 pending   |    39

</pre></div>


<ul class="wp-block-list">
<li><strong>5 skipped</strong>: the typo-level changes — unnecessary writes avoided, no quality loss</li>



<li><strong>10 completed</strong>: the baseline embeddings from Step 5</li>



<li><strong>39 pending</strong>: 34 no-baseline articles + 5 newly-detected EMBED items, waiting for the worker</li>
</ul>



<p>The <code>skipped</code> status is an audit trail — you can always go back and see what was skipped, when, and at what similarity score (recorded in <code>embedding_change_log</code>).</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-step-7-freshness-monitoring-report">Step 7: Freshness Monitoring Report</h2>



<p>In production, you need a dashboard — not individual log lines. The <code>freshness_monitor.py</code> script consolidates all the monitoring queries from Part 1 into a single diagnostic report.</p>



<h3 class="wp-block-heading" id="h-what-freshness-monitor-py-does">What <code>freshness_monitor.py</code> does</h3>



<p>The script runs five monitoring queries against the database and formats them into a human-readable report:</p>



<ol class="wp-block-list">
<li><strong>Freshness summary</strong>: How many articles have embeddings? How many are stale (content changed since last embedding)?</li>



<li><strong>Stale articles detail</strong>: Which specific articles have drifted — showing both the current content hash and the embedding&#8217;s source hash so you can see the mismatch</li>



<li><strong>Queue health</strong>: Breakdown by status with timestamps — tells you if items are stuck or if the queue is draining properly</li>



<li><strong>Version coverage</strong>: Which embedding models are in use and how many articles/chunks each covers</li>



<li><strong>Change detection decisions</strong>: Aggregated SKIP/EMBED statistics with average similarity scores</li>
</ol>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
python freshness_monitor.py --report

</pre></div>

<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
Embedding Freshness Report — 2026-02-18 14:57:53

============================================================
  Freshness Summary
============================================================
  Total articles:        25000
  With embeddings:       10  (0.0%)
  Without embeddings:    24990
  Stale embeddings:      10  (100.0%)

</pre></div>


<p><strong>Reading this</strong>: Only 10 of 25,000 articles have versioned embeddings (from Step 5). All 10 are &#8220;stale&#8221; because we just mutated all of them in Step 6. In a real deployment, you&#8217;d see something like &#8220;23,450 with embeddings (93.8%), 312 stale (1.3%)&#8221; — and you&#8217;d alert if stale exceeded, say, 5%.</p>
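<p>The staleness rule itself is a hash comparison. The sketch below applies it to plain dictionaries rather than SQL, with made-up article IDs and hashes; the 5% alert level is the example figure from the paragraph above, and <code>freshness_summary()</code> is a hypothetical helper, not part of <code>freshness_monitor.py</code>.</p>

```python
from typing import Dict

STALE_ALERT_PCT = 5.0  # example alert level, not a recommendation


def freshness_summary(content_hashes: Dict[int, str],
                      embed_hashes: Dict[int, str]) -> dict:
    """content_hashes: article_id -> current content_hash.
    embed_hashes: article_id -> source_hash of the current embedding."""
    with_embeddings = [a for a in content_hashes if a in embed_hashes]
    stale = [a for a in with_embeddings
             if content_hashes[a] != embed_hashes[a]]
    stale_pct = (100.0 * len(stale) / len(with_embeddings)
                 if with_embeddings else 0.0)
    return {
        "total": len(content_hashes),
        "with_embeddings": len(with_embeddings),
        "stale": len(stale),
        "stale_pct": stale_pct,
        "alert": stale_pct > STALE_ALERT_PCT,
    }


# Made-up data: four articles, three embedded, one of them out of date.
articles = {1: "aaa", 2: "bbb", 3: "ccc", 4: "ddd"}
embeddings = {1: "aaa", 2: "zzz", 3: "ccc"}

report = freshness_summary(articles, embeddings)
print(report)
```

<p>Note that the stale percentage is computed over articles that have embeddings, matching the report above where 10 stale out of 10 embedded gives 100%.</p>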


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
============================================================
  Stale Articles (content changed since embedding)
============================================================
  ID    | Title                                  | Current Hash     | Embed Hash       | ...
  ------+----------------------------------------+------------------+------------------+----
  5150  | 1787                                   | 5b14bc4a2d...    | 11e81bc4de...    | ...
  6607  | Needle                                 | 3ebb3c3cbb...    | 5c5290b5a7...    | ...
  49417 | Dean Martin                            | 7061f1803f...    | f7fd9f30e6...    | ...
  75802 | Brandenburg Gate                       | 7da53df7a0...    | 5a2dcc01f9...    | ...
  ...6 more...

</pre></div>


<p>The <code>Current Hash</code> and <code>Embed Hash</code> columns are the two MD5 fingerprints. When they don&#8217;t match, it means the article&#8217;s content has changed since we last generated its embedding. Article 5150 (&#8220;1787&#8221;) shows different hashes even though we only appended a period — the MD5 captures <em>any</em> content change, even trivial ones. The <strong>change detector</strong> is what decides whether the difference matters semantically (and it said SKIP for this one).</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
============================================================
  Queue Health
============================================================
  Status    | Count | Oldest                 | Newest
  ----------+-------+------------------------+------------------------
  pending   | 39    | 2026-02-18 14:39:59    | 2026-02-18 14:53:55
  completed | 10    | 2026-02-18 14:39:58    | 2026-02-18 14:39:59
  skipped   | 5     | 2026-02-18 14:53:55    | 2026-02-18 14:53:55

</pre></div>


<p>The queue is healthy but has a backlog: 39 items pending, the oldest queued ~18 minutes before this report was generated. In production, you&#8217;d watch the age of the oldest pending item. If it keeps growing while new items are added, your workers can&#8217;t keep up. That&#8217;s when you scale up workers (see Step 8) or increase batch size.</p>
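<p>The age of the oldest pending item is simple to compute from the report. The timestamps below are copied from the output above; the 30-minute alert level and the <code>pending_age_alert()</code> helper are assumptions for illustration.</p>

```python
from datetime import datetime, timedelta

MAX_PENDING_AGE = timedelta(minutes=30)  # hypothetical alert level


def pending_age_alert(now: datetime, oldest_pending: datetime,
                      max_age: timedelta = MAX_PENDING_AGE):
    """Return (age of the oldest pending item, whether it breaches max_age)."""
    age = now - oldest_pending
    return age, age > max_age


report_time = datetime(2026, 2, 18, 14, 57, 53)     # report header
oldest_pending = datetime(2026, 2, 18, 14, 39, 59)  # "Oldest" for pending

age, alert = pending_age_alert(report_time, oldest_pending)
print("oldest pending age:", age)
print("alert:", alert)
```

<p>Tracking this single number over time is usually enough to tell a draining queue from one that is falling behind.</p>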



<p>The 10 <code>completed</code> items are from Step 5, the 5 <code>skipped</code> from Step 6&#8217;s change detector.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
============================================================
  Embedding Version Coverage
============================================================
  Model Version          | Articles | Chunks | Current
  -----------------------+----------+--------+--------
  text-embedding-3-small | 10       | 13     | 13

</pre></div>


<p>Only one model version is in use, covering 10 articles with 13 chunks, all current. During a blue-green model upgrade (Part 1&#8217;s model versioning section), you&#8217;d see two rows here (v1 and v2) and track coverage convergence.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
============================================================
  Change Detection Decisions
============================================================
  Decision | Count | Avg Similarity
  ---------+-------+---------------
  EMBED    | 83    | 0.0473
  SKIP     | 5     | 0.9992

</pre></div>


<p>The average similarity for EMBED decisions is 0.0473 because most of those 83 decisions had similarity=0.0 (no baseline). The 5 SKIPs have an average of 0.9992 — confirming these were truly trivial changes. In a mature deployment, the EMBED average similarity would be higher (0.7–0.9 range) and the SKIP/EMBED ratio would tell you how efficient your threshold is.</p>
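
<p>To see how a handful of real changes gets buried under first-time embeddings, here is a small worked example. The similarity values are a hypothetical mix chosen to land near the reported average, not the lab&#8217;s actual rows, and treating <code>0.0</code> as &#8220;no baseline&#8221; is an assumption based on how the report describes it.</p>

```python
from statistics import fmean

def embed_similarity_summary(similarities: list[float]) -> dict[str, float]:
    """Average similarity over all EMBED decisions and over those with a baseline.

    A similarity of exactly 0.0 is treated as "no previous embedding".
    """
    with_baseline = [s for s in similarities if s > 0.0]
    return {
        "avg_all": round(fmean(similarities), 4) if similarities else 0.0,
        "avg_with_baseline": round(fmean(with_baseline), 4) if with_baseline else 0.0,
        "no_baseline_count": len(similarities) - len(with_baseline),
    }

# Hypothetical mix: 78 first-time embeddings and 5 real content changes.
sample = [0.0] * 78 + [0.93, 0.63, 0.51, 0.93, 0.93]
summary = embed_similarity_summary(sample)
print(summary)
```

<p>With this sample the overall average is about 0.047 while the five real changes average about 0.79. Reporting the second number separately is what makes the EMBED average meaningful once the baseline exists.</p>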



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-step-8-skip-locked-multi-worker-concurrency">Step 8: SKIP LOCKED — Multi-Worker Concurrency</h2>



<p>This is the demo that proves the theory from Part 1&#8217;s deep dive on <code>SELECT FOR UPDATE SKIP LOCKED</code>.</p>



<h3 class="wp-block-heading" id="h-what-demo-skip-locked-py-does">What <code>demo_skip_locked.py</code> does</h3>



<p>The script launches multiple Python threads that each behave like independent embedding workers. Each thread:</p>



<ol class="wp-block-list">
<li>Opens its own database connection</li>



<li>Runs <code>UPDATE embedding_queue SET status='processing' WHERE queue_id IN (SELECT queue_id FROM embedding_queue WHERE status='pending' ORDER BY queued_at FOR UPDATE SKIP LOCKED LIMIT n)</code> — the exact same claim query the real worker uses (note the <code>ORDER BY queued_at</code> — without it, selection order is not deterministic and oldest-first is not guaranteed)</li>



<li>Records which <code>queue_id</code> values it got</li>



<li>Does NOT actually call OpenAI (this is a concurrency demo, not an embedding demo)</li>
</ol>



<p>After all threads finish, the script checks for <strong>overlap</strong>: did any two workers claim the same item? The answer should always be zero.</p>
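
<p>The overlap check itself is simple. A minimal sketch, assuming each worker returns the list of <code>queue_id</code> values it claimed (the function name and the sample data are hypothetical, not taken from <code>demo_skip_locked.py</code>):</p>

```python
from collections import Counter

def find_overlap(claims: dict[str, list[int]]) -> dict[int, int]:
    """Return {queue_id: times_claimed} for every id claimed more than once."""
    counts = Counter(queue_id for ids in claims.values() for queue_id in ids)
    return {queue_id: n for queue_id, n in counts.items() if n > 1}

# Hypothetical claim records, one list of queue_id values per worker.
clean = {"demo-worker-0": [1, 2, 3], "demo-worker-1": [4, 5], "demo-worker-2": []}
broken = {"demo-worker-0": [1, 2, 3], "demo-worker-1": [3, 4]}

print(find_overlap(clean))   # {}
print(find_overlap(broken))  # {3: 2}
```

<p>An empty result means every claimed item belongs to exactly one worker.</p>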


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
python examples/demo_skip_locked.py --workers 4 --items 39

</pre></div>

<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
============================================================
  Demo: SKIP LOCKED Multi-Worker Concurrency
  Workers: 4  |  Target items: 39
============================================================

Launching 4 workers (each requesting up to 14 items)...

  demo-worker-0: claimed 14 items  (articles: &#x5B;96746, 37330, 67708, 32834, 46541]...)
  demo-worker-1: claimed 14 items  (articles: &#x5B;57924, 20028, 65749, 92016, 24921]...)
  demo-worker-2: claimed 11 items  (articles: &#x5B;66390, 27221, 30148, 97917, 30449]...)
  demo-worker-3: claimed 0 items   (articles: &#x5B;])

========================================
  Total items claimed:  39
  Unique articles:      39
  Elapsed time:         0.05s

  ZERO OVERLAP — SKIP LOCKED working correctly!
============================================================

</pre></div>


<h3 class="wp-block-heading" id="h-reading-the-output">Reading the output</h3>



<ul class="wp-block-list">
<li><strong>14 + 14 + 11 + 0 = 39</strong> — every pending item was claimed exactly once</li>



<li><strong>Zero overlap</strong> — no item was processed by more than one worker</li>



<li><strong>0.05 seconds</strong> — the entire distribution happened in 50 milliseconds</li>



<li><strong>Worker 3 got 0 items</strong>: This is expected, not a bug. Each worker requested up to 14 items, so the first 3 workers drained all 39 pending rows before Worker 3&#8217;s <code>SELECT ... SKIP LOCKED</code> could find an unclaimed one. In a real deployment where each item takes 300-500ms (OpenAI API call), all 4 workers would stay busy and you&#8217;d see an approximately even distribution.</li>
</ul>



<p><strong>Why <code>SKIP LOCKED</code> and not regular <code>FOR UPDATE</code>?</strong> With regular <code>FOR UPDATE</code>, Worker 1 would lock rows and Worker 2 would <strong>wait</strong> (block) until Worker 1&#8217;s transaction commits. With <code>SKIP LOCKED</code>, Worker 2 <strong>skips</strong> the locked rows and grabs the next available ones immediately. No blocking, no deadlocks, no coordination.</p>
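
<p>The skip-instead-of-wait behaviour can be imitated in plain Python with non-blocking locks. This is an analogy, not how PostgreSQL implements row locking: each <code>threading.Lock</code> stands in for one queue row, and <code>acquire(blocking=False)</code> plays the role of <code>SKIP LOCKED</code>. All names are hypothetical.</p>

```python
import threading

def claim_batch(row_locks: list[threading.Lock], limit: int) -> list[int]:
    """Claim up to `limit` rows, skipping any row another worker already holds.

    acquire(blocking=False) returns False immediately instead of waiting.
    A claimed lock is never released here, standing in for a row that has
    moved on to status 'processing'.
    """
    claimed = []
    for row_id, lock in enumerate(row_locks):
        if len(claimed) == limit:
            break
        if lock.acquire(blocking=False):
            claimed.append(row_id)
    return claimed

def run_workers(n_rows: int, n_workers: int, limit: int) -> dict[int, list[int]]:
    """Start n_workers threads that all claim from the same set of rows."""
    row_locks = [threading.Lock() for _ in range(n_rows)]
    results: dict[int, list[int]] = {}

    def worker(worker_id: int) -> None:
        results[worker_id] = claim_batch(row_locks, limit)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

results = run_workers(n_rows=39, n_workers=4, limit=14)
all_claimed = [row for rows in results.values() for row in rows]
print(len(all_claimed), len(set(all_claimed)))  # 39 39
```

<p>How the 39 rows are split between workers varies from run to run, but every row is claimed exactly once and no thread ever waits. A blocking <code>acquire()</code> would correspond to plain <code>FOR UPDATE</code>: workers would queue up behind each other on the same rows.</p>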



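<p>This is pure PostgreSQL. No Redis, no RabbitMQ, no SQS. One SQL query, one feature (<code>SKIP LOCKED</code>), and you have a production-grade concurrent work queue. If you need to process your embedding queue faster, add workers: because they never wait on each other&#8217;s rows, throughput grows roughly in proportion to the worker count until the embedding API&#8217;s rate limits become the constraint.</p>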



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-step-9a-end-to-end-trigger-flow">Step 9a: End-to-End Trigger Flow</h2>



<p>Every previous step ran parts of the pipeline in isolation. This demo shows the <strong>complete lifecycle</strong> of a single article change — from <code>UPDATE</code> to searchable embedding.</p>



<h3 class="wp-block-heading" id="h-what-demo-trigger-flow-py-does">What <code>demo_trigger_flow.py</code> does</h3>



<p>The script picks one article and walks through the full pipeline synchronously:</p>



<ol class="wp-block-list">
<li><strong>Checks the queue</strong> for this article (should be empty)</li>



<li><strong>Updates the article&#8217;s content</strong> (appending demo text)</li>



<li><strong>Verifies the trigger fired</strong> by checking the queue again (should now have a pending entry)</li>



<li><strong>Shows the article&#8217;s metadata</strong> (new content_hash, updated_at)</li>



<li><strong>Runs the worker</strong> for exactly this one item (calls OpenAI, writes embeddings)</li>



<li><strong>Verifies the embeddings</strong> are in <code>article_embeddings_versioned</code></li>
</ol>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
python examples/demo_trigger_flow.py

</pre></div>

<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
============================================================
  Demo: End-to-End Trigger Flow
  Article: &#x5B;86698] Thin film transistor liquid crystal display
============================================================

1. Queue entries (pending) for article 86698 BEFORE update: 0

</pre></div>


<p>Nothing in the queue yet — clean starting state.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
2. Updated article content (appended demo text)

</pre></div>


<p>An <code>UPDATE articles SET content = content || '...' WHERE id = 86698</code> just ran. Two triggers fired: <code>trg_content_hash</code> (BEFORE, recomputed the MD5) and <code>trg_queue_embedding</code> (AFTER, inserted a queue entry).</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
3. Trigger fired! Queue entry created:
   Queue ID:     57
   Status:       pending
   Content Hash: b5a7c0820832fd54...
   Queued At:    2026-02-18 15:05:29.303062+00:00

</pre></div>


<p>The trigger did its job. A new <code>pending</code> item is in the queue with the article&#8217;s current content hash. Note the timestamp — in this lab, the trigger overhead was negligible.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
4. Article metadata updated:
   Content Hash: b5a7c0820832fd54...
   Updated At:   2026-02-18 15:05:29.303062+00:00

</pre></div>


<p>The article&#8217;s <code>content_hash</code> matches the queue entry&#8217;s hash because both come from the same <code>UPDATE</code>: <code>trg_content_hash</code> recomputed the hash, and <code>trg_queue_embedding</code> stored that value in the queue entry. This hash will later be stored as <code>source_hash</code> on the embedding, creating the audit chain: <em>&#8220;this embedding was generated from this exact version of the content.&#8221;</em></p>
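
<p>The audit chain boils down to a hash comparison. Here is a minimal sketch of the staleness check it enables, assuming the stored hash is the MD5 hex digest of the UTF-8 content. The helper names are hypothetical and the sample text is only a stand-in for an article body.</p>

```python
import hashlib

def content_hash(content: str) -> str:
    """MD5 hex digest of the UTF-8 content, the same shape as the stored hash."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()

def is_stale(current_content: str, source_hash: str) -> bool:
    """True when the embedding was generated from a different content version."""
    return content_hash(current_content) != source_hash

original = "Thin film transistor liquid crystal display"
embedded_from = content_hash(original)

print(is_stale(original, embedded_from))                   # False
print(is_stale(original + " (demo edit)", embedded_from))  # True
```

<p>Any embedding whose <code>source_hash</code> no longer matches the article&#8217;s current hash was built from an older version of the text.</p>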


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
5. Running worker for one batch...
   Article 86698: embedded 3 chunks
   Processed 1 items

</pre></div>


<p>The worker claimed this item, called OpenAI 3 times (3 chunks), and wrote the embeddings to <code>article_embeddings_versioned</code>.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
6. Embeddings for article 86698:
   Current chunks: 3
   Last created:   2026-02-18 15:05:29.388968+00:00

============================================================
  Demo complete!
============================================================

</pre></div>


<p><strong>The complete flow — from content modification to searchable embeddings — took about 1 second.</strong> The latency breakdown: ~50ms for PostgreSQL (trigger + queue + insert), ~900ms for OpenAI (3 embedding API calls). In a production system with a continuously running worker, this latency would be the norm for every content change.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-step-9b-quality-feedback-loop">Step 9b: Quality Feedback Loop</h2>



<p>The final piece, and the one that closes the architecture. Everything so far reacts to <strong>content changes</strong>. But what if the embeddings are technically &#8220;fresh&#8221; (content hasn&#8217;t changed) yet <strong>search quality is degrading</strong>? Maybe the model isn&#8217;t capturing certain topics well, or the chunking strategy doesn&#8217;t work for some article types.</p>



<h3 class="wp-block-heading" id="h-what-demo-quality-drift-py-does">What <code>demo_quality_drift.py</code> does</h3>



<p>This script simulates the quality feedback loop described in Part 1&#8217;s monitoring section. It works in four phases:</p>



<p><strong>Phase 1 — Simulate retrieval quality logs</strong>: The script generates 20 fake search queries with associated quality metrics (nDCG, precision@k, user satisfaction scores). It deliberately creates a pattern where quality metrics decline for certain topics — simulating what would happen if embeddings for some subject areas became less effective over time.</p>



<p><strong>Phase 2 — Quality analysis</strong>: The script scans <code>retrieval_quality_log</code> looking for queries with poor results: low nDCG scores (below a configurable threshold) or negative user feedback. It identifies 8 queries where quality dropped.</p>



<p><strong>Phase 3 — Article correlation</strong>: For each poor-performing query, the script finds related articles using <code>title ILIKE '%keyword%'</code> matching. This is a simplified version of what a production system would do (where you&#8217;d use the query&#8217;s actual retrieved results instead of keyword matching). It identifies 29 articles that might be causing poor search results.</p>



<p><strong>Phase 4 — Automatic re-queuing</strong>: All 29 articles are inserted into <code>embedding_queue</code> with <code>change_type = 'quality_reembed'</code> instead of the usual <code>'content_update'</code>. This distinction is critical — it means the re-embedding is happening not because the content changed, but because the quality metrics flagged a problem.</p>
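
<p>Phases 2 to 4 can be sketched in plain Python to make the data flow concrete. This is a simplified illustration, not the code of <code>demo_quality_drift.py</code>: the function names, the 0.5 nDCG floor, the dictionary keys, and the sample rows are all assumptions, and the substring match only approximates <code>ILIKE '%keyword%'</code>.</p>

```python
def poor_queries(quality_log: list[dict], ndcg_floor: float) -> list[dict]:
    """Phase 2: keep queries with low nDCG or negative user feedback."""
    return [q for q in quality_log if ndcg_floor > q["ndcg"] or q["feedback"] == "negative"]

def correlated_articles(queries: list[dict], articles: dict[int, str]) -> list[int]:
    """Phase 3: case-insensitive substring match on titles."""
    hits = set()
    for q in queries:
        keyword = q["keyword"].lower()
        hits.update(a_id for a_id, title in articles.items() if keyword in title.lower())
    return sorted(hits)

def requeue(article_ids: list[int]) -> list[dict]:
    """Phase 4: build queue rows tagged as quality-driven re-embeds."""
    return [
        {"article_id": a_id, "change_type": "quality_reembed", "status": "pending"}
        for a_id in article_ids
    ]

# Hypothetical quality log and article titles.
quality_log = [
    {"keyword": "display", "ndcg": 0.31, "feedback": "neutral"},
    {"keyword": "river", "ndcg": 0.88, "feedback": "positive"},
    {"keyword": "transistor", "ndcg": 0.74, "feedback": "negative"},
]
articles = {
    86698: "Thin film transistor liquid crystal display",
    1001: "Rivers of Europe",
    1002: "Plasma display",
}

flagged = poor_queries(quality_log, ndcg_floor=0.5)
queue_rows = requeue(correlated_articles(flagged, articles))
print([row["article_id"] for row in queue_rows])  # [1002, 86698]
```

<p>Two of the three queries are flagged, one for low nDCG and one for negative feedback, and both matching articles are queued once even though article 86698 matches two keywords.</p>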


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
python examples/demo_quality_drift.py

</pre></div>


<p>The demo runs through all four phases and produces a final queue state:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
wikipedia=# SELECT change_type, status, count(*) 
  FROM embedding_queue GROUP BY change_type, status ORDER BY change_type, status;
  change_type    |  status   | count
-----------------+-----------+-------
 content_update  | completed |    50
 content_update  | skipped   |     5
 quality_reembed | pending   |    29

</pre></div>


<h3 class="wp-block-heading" id="h-reading-the-queue-state">Reading the queue state</h3>



<p>Three distinct categories tell the full pipeline story:</p>



<ul class="wp-block-list">
<li><strong>50 <code>content_update</code> / <code>completed</code></strong>: the normal pipeline flow — content changed, trigger fired, worker embedded. This is Layers 1 and 2 doing their job.</li>



<li><strong>5 <code>content_update</code> / <code>skipped</code></strong>: the typo-level changes from Step 6 — the change detector said &#8220;not worth re-embedding.&#8221; This is Layer 2&#8217;s cost optimization.</li>



<li><strong>29 <code>quality_reembed</code> / <code>pending</code></strong>: the feedback loop&#8217;s contribution — these articles weren&#8217;t re-queued because their content changed (it may not have). They were re-queued because <strong>search quality dropped</strong> for queries related to them.</li>
</ul>



<p><strong>Why the <code>quality_reembed</code> change type matters</strong>: When the worker processes these items, it bypasses the change significance detector. If the detector were to analyze them, it might say &#8220;similarity=0.998 → SKIP&#8221; because the content barely changed. But that&#8217;s the whole point — the content didn&#8217;t change, yet the embeddings aren&#8217;t serving search well. The quality feedback overrides the filter.</p>
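
<p>The routing rule described above fits in one small function. A sketch under stated assumptions: <code>route</code> is a hypothetical name, the 0.95 default is the threshold used in this lab, and whether the real detector skips at &#8220;greater than&#8221; or &#8220;greater than or equal&#8221; is not something this post specifies.</p>

```python
def route(change_type: str, similarity: float, threshold: float = 0.95) -> str:
    """Decide whether a queue item is re-embedded or skipped."""
    if change_type == "quality_reembed":
        return "EMBED"  # quality feedback bypasses the significance filter
    if similarity >= threshold:
        return "SKIP"   # trivial change, the old embedding is close enough
    return "EMBED"      # significant change, or no baseline (similarity 0.0)

print(route("content_update", 0.998))   # SKIP
print(route("content_update", 0.93))    # EMBED
print(route("content_update", 0.0))     # EMBED
print(route("quality_reembed", 0.998))  # EMBED
```

<p>The last call is the important one: the same similarity that leads to SKIP for a content update still leads to EMBED when the item comes from the feedback loop.</p>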



<p>This is the three-layer architecture from Part 1 working in practice:</p>



<ol class="wp-block-list">
<li><strong>Triggers</strong> (Layer 1): react to content changes immediately — the broadest net</li>



<li><strong>Change significance</strong> (Layer 2): filter out trivial changes, saving API cost — the optimization layer</li>



<li><strong>Quality feedback</strong> (Layer 3): catch what the filter missed or what wasn&#8217;t about content changes at all — the safety net</li>
</ol>



<p>Each layer compensates for the blind spots of the previous one.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-key-takeaways">Key Takeaways</h2>



<p><strong>1. The trigger is smarter than you think.</strong> Using <code>UPDATE OF content</code> means metadata-only changes never touch the embedding pipeline. In our test, 12% of mutations (6 out of 50) were filtered out at the trigger level, before any Python code ran. In a real knowledge base with tag edits, status changes, and metadata updates, this fraction could be substantially higher.</p>



<p><strong>2. The change detector needs a baseline.</strong> On the first run, every article shows <code>similarity=0.0</code> because there&#8217;s nothing to compare against. This is correct behavior, but you need to plan for the initial backfill being 100% EMBED. Budget the API cost and time accordingly.</p>



<p><strong>3. The 0.95 threshold held up in this lab.</strong> Typo-level changes (appending a period) scored 0.998+, paragraph additions scored ~0.93, and section rewrites scored 0.51–0.63. There&#8217;s a clear gap between &#8220;trivial&#8221; and &#8220;significant&#8221; that the threshold exploits. You don&#8217;t need machine learning or complex heuristics for this: cosine similarity with a simple threshold works. Paragraph additions sit close to the cutoff, though, so re-check the threshold against your own content before relying on it.</p>



<p><strong>4. SKIP LOCKED is production-ready.</strong> 4 workers, 39 items, zero overlap, 0.05 seconds. No external dependencies, no coordination service. This is the simplest correct way to build a concurrent work queue in PostgreSQL. Need more throughput? Add workers.</p>



<p><strong>5. Quality metrics close the loop.</strong> The change significance filter reduces unnecessary writes and index churn, but it can&#8217;t know if a small change was semantically important — or if the embedding was poor to begin with. The quality feedback loop catches those cases by correlating low-quality retrievals with specific articles and forcing re-embedding. Three layers, each compensating for the blind spots of the previous one.</p>



<p><strong>6. The bottleneck is the API, not PostgreSQL.</strong> 10 articles embedded in ~8 seconds, with each OpenAI call taking 300-600ms. In this lab, PostgreSQL&#8217;s trigger + queue overhead was negligible compared to API latency. If you need faster throughput, add workers (with <code>SKIP LOCKED</code> they run in parallel without blocking each other, up to your API rate limits) or switch to a local embedding model like <code>nomic-embed-text</code> via Ollama.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-running-it-yourself">Running It Yourself</h2>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
git clone https://github.com/boutaga/pgvector_RAG_search_lab.git
cd pgvector_RAG_search_lab

# Ensure Wikipedia database is loaded (see Lab 2 in README)
# You&#039;ll need: PostgreSQL 17+, pgvector, pgvectorscale, an OpenAI API key

# Step 1: Apply schema
psql -d wikipedia -f lab/05_embedding_versioning/schema.sql

# Step 3: Simulate changes (Step 2 is a manual SQL test)
python lab/05_embedding_versioning/examples/simulate_document_changes.py --count 50

# Step 4: Run change detector (all EMBED on first run — no baseline yet)
python lab/05_embedding_versioning/change_detector.py --analyze-queue

# Step 5: Create baseline embeddings (requires OPENAI_API_KEY env var)
python lab/05_embedding_versioning/worker.py --once --batch-size 10

# Step 6: Apply targeted mutations, then re-run detector
python lab/05_embedding_versioning/examples/targeted_mutations.py
python lab/05_embedding_versioning/change_detector.py --analyze-queue

# Step 7: Full freshness report
python lab/05_embedding_versioning/freshness_monitor.py --report

# Step 8: SKIP LOCKED concurrency demo
python lab/05_embedding_versioning/examples/demo_skip_locked.py --workers 4

# Step 9a: End-to-end trigger flow
python lab/05_embedding_versioning/examples/demo_trigger_flow.py

# Step 9b: Quality feedback loop
python lab/05_embedding_versioning/examples/demo_quality_drift.py

</pre></div>


<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-what-s-next">What&#8217;s Next</h2>



<p>In the next post, I&#8217;ll explore <strong>benchmarking pgvectorscale&#8217;s StreamingDiskANN at scale</strong> — with real numbers on query latency, recall, index build time, and memory footprint at different dataset sizes. We&#8217;ll use the same Wikipedia dataset and the versioned embedding infrastructure from this lab.</p>
<p>The post <a href="https://www.dbi-services.com/blog/rag-series-embedding-versioning-lab/">RAG Series – Embedding Versioning LAB</a> first appeared on <a href="https://www.dbi-services.com/blog">dbi Blog</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.dbi-services.com/blog/rag-series-embedding-versioning-lab/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>RAG Series – Embedding Versioning with pgvector: Why Event-Driven Architecture Is a Precondition to AI data workflows</title>
		<link>https://www.dbi-services.com/blog/rag-series-embedding-versioning-with-pgvector-why-event-driven-architecture-is-a-precondition-to-ai-data-workflows/</link>
					<comments>https://www.dbi-services.com/blog/rag-series-embedding-versioning-with-pgvector-why-event-driven-architecture-is-a-precondition-to-ai-data-workflows/#respond</comments>
		
		<dc:creator><![CDATA[Adrien Obernesser]]></dc:creator>
		<pubDate>Sun, 22 Feb 2026 14:23:42 +0000</pubDate>
				<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[pgvector]]></category>
		<category><![CDATA[RAG]]></category>
		<guid isPermaLink="false">https://www.dbi-services.com/blog/?p=43038</guid>

					<description><![CDATA[<p>Introduction &#8220;Make it simple.&#8221; This is a principle I keep repeating, and I&#8217;ll repeat it again here. Because when it comes to keeping your RAG system&#8217;s embeddings fresh, the industry has somehow made it complicated. External orchestrators, custom Python cron jobs, microservices that call microservices, Airflow DAGs with 47 tasks, all to answer a simple [&#8230;]</p>
<p>The post <a href="https://www.dbi-services.com/blog/rag-series-embedding-versioning-with-pgvector-why-event-driven-architecture-is-a-precondition-to-ai-data-workflows/">RAG Series – Embedding Versioning with pgvector: Why Event-Driven Architecture Is a Precondition to AI data workflows</a> first appeared on <a href="https://www.dbi-services.com/blog">dbi Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading" id="h-introduction">Introduction</h2>



<p>&#8220;Make it simple.&#8221; This is a principle I keep repeating, and I&#8217;ll repeat it again here. Because when it comes to keeping your RAG system&#8217;s embeddings fresh, the industry has somehow made it complicated. External orchestrators, custom Python cron jobs, microservices that call microservices, Airflow DAGs with 47 tasks, all to answer a simple question: <strong>when my source data changes, how do I update the corresponding embeddings?</strong></p>



<p>If you&#8217;ve followed this RAG series from <a href="https://www.dbi-services.com/blog/rag-series-naive-rag/">Naive RAG</a> through <a href="https://www.dbi-services.com/blog/rag-series-hybrid-search-with-re-ranking/">Hybrid Search</a>, <a href="https://www.dbi-services.com/blog/rag-series-adaptive-rag-understanding-confidence-precision-ndcg/">Adaptive RAG</a>, and <a href="https://www.dbi-services.com/blog/rag-series-agentic-rag/">Agentic RAG</a>, you&#8217;ve seen how retrieval quality is the backbone of any RAG system. But here&#8217;s what I didn&#8217;t cover explicitly: <strong>what happens when your retrieval quality silently degrades because your embeddings are stale?</strong></p>



<p>This is the silent killer of RAG in production. Nobody complains about the embedding pipeline; they complain that the chatbot gives wrong answers. And by the time you trace it back to stale embeddings, the trust is already gone.</p>



<p>In this post, I want to bridge two worlds that I&#8217;ve been working in simultaneously: the <strong>CDC/event-driven pipelines</strong> I demonstrated in my <a href="https://www.dbi-services.com/blog/postgresql-cdc-to-jdbc-sink-minimal-event-driven-architecture/">PostgreSQL CDC to JDBC Sink</a> and <a href="https://www.dbi-services.com/blog/oracle-to-postgresql-migration-with-flink-cdc/">Oracle to PostgreSQL Migration with Flink CDC</a> posts, and the <strong>RAG/pgvector</strong> world from this series.</p>



<p>The thesis is straightforward: <strong>if you&#8217;re serious about production RAG, you need event-driven embedding refresh. Batch re-embedding is technical debt waiting to happen.</strong> Event-driven architecture and data pipelines are a precondition to hosting similarity search. Many organizations that are still 100% batch-processed are moving towards event-driven pipelines, typically because they want live KPIs instead of daily refreshes, and the maturity of today&#8217;s tooling makes that move easier. The &#8220;hidden&#8221; bonus of streaming data from your data sources to a data lake and to your data marts is that it facilitates embedding refreshes as well.</p>



<p>This is Part 1 covering the architecture and design patterns I feel are relevant. In <a href="https://www.dbi-services.com/blog/rag-series-embedding-versioning-lab/" target="_blank" rel="noreferrer noopener">Part 2</a>, I walk through a hands-on LAB on 25,000 Wikipedia articles with real output, actual numbers, and some of the edge cases you would encounter applying this in practice.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-the-problem-stale-embeddings">The Problem: Stale Embeddings</h2>



<p>Let me paint a picture that I&#8217;ve seen in real consulting engagements.</p>



<p>A company builds a RAG system for internal documentation. Knowledge base: 50,000 documents in PostgreSQL. Embeddings generated with <code>text-embedding-3-small</code>, stored in pgvector. Everything works great on day one.</p>



<p>Three months later:</p>



<ul class="wp-block-list">
<li>2,000 documents have been updated</li>



<li>500 new documents have been added</li>



<li>300 documents have been deprecated</li>



<li>The embedding pipeline? It ran once during initial setup. Maybe someone re-ran it manually last month. Maybe not.</li>
</ul>



<p>The result: <strong>your vector index is lying to you</strong>. Similarity search returns chunks from outdated documents. The LLM generates answers based on stale context. Users lose trust.</p>



<p>This is not a hypothetical. This is the reality of most RAG deployments I&#8217;ve encountered.</p>



<h3 class="wp-block-heading" id="h-why-batch-re-embedding-doesn-t-scale">Why batch re-embedding doesn&#8217;t scale</h3>



<p>The naive approach is: &#8220;just re-embed everything periodically.&#8221; Let&#8217;s do the math.</p>



<p>For 50,000 documents, assuming an average of 10 chunks per document:</p>



<ul class="wp-block-list">
<li><strong>500,000 chunks</strong> to embed, ~500 tokens each — that&#8217;s 250 million tokens</li>



<li>At ~$0.02 per 1M tokens with <code>text-embedding-3-small</code>: <strong>~$5 per full re-embed</strong> (not terrible)</li>



<li>The OpenAI embeddings endpoint accepts <strong>arrays of inputs</strong>, so you can batch ~100 chunks per request. That&#8217;s ~5,000 requests. At Tier 1&#8217;s 3,000 RPM, RPM isn&#8217;t the bottleneck; <strong>TPM is</strong>. The real constraint is how fast the API will accept 250M tokens, which depends on your tier&#8217;s token-per-minute limit (check your <a href="https://platform.openai.com/settings/organization/limits">project limits</a>). Expect <strong>anywhere from under an hour to several hours</strong> of wall-clock time.</li>



<li>During which, if you&#8217;re replacing embeddings in-place (the typical batch approach), your index is in a <strong>partially-stale state</strong> — some embeddings are new, some are old. The versioned schema I&#8217;ll show below avoids this, but most batch implementations don&#8217;t bother with versioning.</li>



<li>In our lab experience, heavy churn from bulk re-inserts can degrade <strong>StreamingDiskANN recall</strong> (pgvectorscale). The index handles incremental updates well, but re-embedding 500K rows at once is not &#8220;incremental&#8221; — validate this on your own workload and treat large backfills as an operational event.</li>
</ul>
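
<p>The arithmetic in the list above is easy to reproduce and adapt to your own corpus. In this sketch the function name is hypothetical, and the two tokens-per-minute values are illustrative assumptions rather than quoted OpenAI limits, so substitute the figure from your own project limits page.</p>

```python
import math

def reembed_estimate(docs: int, chunks_per_doc: int, tokens_per_chunk: int,
                     usd_per_million_tokens: float, chunks_per_request: int,
                     tokens_per_minute: int) -> dict[str, float]:
    """Rough size, cost, and duration of a full re-embed."""
    chunks = docs * chunks_per_doc
    tokens = chunks * tokens_per_chunk
    return {
        "chunks": chunks,
        "tokens": tokens,
        "cost_usd": tokens / 1_000_000 * usd_per_million_tokens,
        "requests": math.ceil(chunks / chunks_per_request),
        "hours_at_tpm": tokens / tokens_per_minute / 60,
    }

# Same corpus, two assumed token-per-minute limits.
slow = reembed_estimate(50_000, 10, 500, 0.02, 100, tokens_per_minute=1_000_000)
fast = reembed_estimate(50_000, 10, 500, 0.02, 100, tokens_per_minute=5_000_000)

print(slow["tokens"], slow["requests"], round(slow["cost_usd"], 2))    # 250000000 5000 5.0
print(round(slow["hours_at_tpm"], 1), round(fast["hours_at_tpm"], 1))  # 4.2 0.8
```

<p>The cost stays at about $5 in both cases. Only the wall-clock time moves with the token limit, which is why the tier matters more than the price.</p>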



<p>Now multiply this by:</p>



<ul class="wp-block-list">
<li>Multiple embedding models you might want to test</li>



<li>Multiple environments (dev, staging, production)</li>



<li>Frequency: weekly? daily? hourly?</li>
</ul>



<p>The cost isn&#8217;t the API calls. The cost is the <strong>operational complexity</strong>: coordinating the backfill, monitoring progress, handling rate limit errors, and — critically — the lack of <strong>observability</strong> into which documents actually changed. Batch treats every document the same, whether it was modified yesterday or hasn&#8217;t been touched in six months.</p>



<h3 class="wp-block-heading" id="h-the-deeper-problem-you-can-t-fix-what-you-don-t-measure">The deeper problem: you can&#8217;t fix what you don&#8217;t measure</h3>



<p>But there&#8217;s a problem that comes before stale embeddings, and in my consulting experience, it&#8217;s far more common: <strong>most organizations don&#8217;t measure retrieval quality at all.</strong> They deploy a RAG system, it works in demo, it goes to production, and then nobody instruments it. There is no precision@k, no nDCG, no confidence scoring. The embedding pipeline might be stale, or it might be fine — they literally cannot tell.</p>



<p>In the <a href="https://www.dbi-services.com/blog/rag-series-adaptive-rag-understanding-confidence-precision-ndcg/">Adaptive RAG</a> post, I introduced the metrics framework that makes retrieval quality measurable: <strong>precision@k</strong> (are the retrieved documents relevant?), <strong>recall@k</strong> (are we finding all the relevant documents?), <strong>nDCG@k</strong> (are the best results ranked first?), and <strong>confidence scores</strong> (how certain is the system about its top result?). In the <a href="https://www.dbi-services.com/blog/rag-series-agentic-rag/">Agentic RAG</a> post, I added decision metrics on top of that — tracking whether the agent made the right call about when to retrieve. The evaluation framework in the <a href="https://github.com/boutaga/pgvector_RAG_search_lab">pgvector_RAG_search_lab repository</a> (<code>lab/evaluation/metrics.py</code>, <code>compare_search_configs.py</code>, <code>k_balance_experiment.py</code>) implements all of this concretely.</p>



<p>These metrics were originally designed to compare search strategies and tune parameters. But here&#8217;s the connection to embedding freshness that I want to make explicit: <strong>the same metrics that tell you whether your search is working also tell you whether your embeddings are drifting.</strong> If your weekly nDCG is declining, if your confidence distribution is shifting toward lower values, if precision@10 is dropping for a subset of queries — those are the leading indicators that your embeddings are falling behind your content. Not the queue depth, not the pipeline latency. The quality metrics.</p>



<p>I have seen architectures where teams built elaborate embedding pipelines — cron jobs, Airflow DAGs, custom orchestrators — but never implemented the measurement layer. The pipeline runs on schedule, embeddings get refreshed, and everyone assumes it&#8217;s working. But without retrieval quality metrics, you have no way to know if you are going in the right direction. You might be re-embedding documents that don&#8217;t need it (wasting API spend) and missing documents that do (degrading search quality). Worse, I have seen setups where the metrics exist but are so poorly instrumented — wrong ground truth sets, no temporal dimension, no per-topic breakdown — that the numbers are misleading. An aggregate nDCG of 0.82 can hide the fact that an entire topic cluster has dropped to 0.45.</p>
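
<p>The aggregate-versus-cluster effect is easy to reproduce. The sketch below uses an invented evaluation set (45 queries across two hypothetical topics) constructed so that the overall nDCG is 0.82 while one topic sits at 0.45. The function names and the 0.6 alert floor are assumptions.</p>

```python
from collections import defaultdict
from statistics import fmean

def ndcg_by_topic(rows: list[tuple[str, float]]) -> dict[str, float]:
    """Mean nDCG per topic from (topic, ndcg) rows."""
    grouped = defaultdict(list)
    for topic, ndcg in rows:
        grouped[topic].append(ndcg)
    return {topic: round(fmean(scores), 2) for topic, scores in grouped.items()}

def weak_topics(per_topic: dict[str, float], floor: float) -> list[str]:
    """Topics whose mean nDCG falls below the alert floor."""
    return sorted(topic for topic, score in per_topic.items() if floor > score)

# Hypothetical evaluation set: 37 healthy queries, 8 from a degraded topic.
rows = [("general", 0.9)] * 37 + [("networking", 0.45)] * 8

aggregate = round(fmean([score for _, score in rows]), 2)
per_topic = ndcg_by_topic(rows)

print(aggregate)                           # 0.82
print(per_topic)                           # {'general': 0.9, 'networking': 0.45}
print(weak_topics(per_topic, floor=0.6))   # ['networking']
```

<p>A dashboard that only shows the first number looks healthy. The per-topic breakdown is what surfaces the cluster that needs re-embedding.</p>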



<p>Building the pipeline is one thing. Proving you&#8217;re going in the right direction is everything.</p>



<p>This is why this post covers both. The first two-thirds address the pipeline: how to detect changes, how to queue and process them, how to decide what&#8217;s worth re-embedding. But the final section, Monitoring Embedding Freshness, is where it all comes together. That&#8217;s where the retrieval quality metrics from the Adaptive RAG post become <strong>operational canaries</strong> for embedding drift. The pipeline reacts to content changes; the monitoring layer tells you whether the pipeline is actually keeping your RAG system healthy. You need both.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-the-solution-event-driven-embedding-refresh">The Solution: Event-Driven Embedding Refresh</h2>



<p>The answer is the same pattern I demonstrated in the CDC posts: <strong>react to changes as they happen.</strong></p>



<p>Instead of asking &#8220;when should I re-embed?&#8221;, the question becomes: <strong>&#8220;a row changed — which embeddings need updating?&#8221;</strong></p>



<p>Here&#8217;s the architecture I&#8217;m proposing:</p>



<figure class="wp-block-image size-large"><img fetchpriority="high" decoding="async" width="1024" height="680" src="https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/image-9-1024x680.png" alt="" class="wp-image-43062" srcset="https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/image-9-1024x680.png 1024w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/image-9-300x199.png 300w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/image-9-768x510.png 768w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/image-9-1536x1019.png 1536w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/image-9.png 1588w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>There are three levels of sophistication here, and I want to walk through each one because <strong>not every project needs the most complex solution</strong>.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-level-1-postgresql-triggers-the-simplest-path">Level 1: PostgreSQL Triggers — The Simplest Path</h2>



<p>If your source data and embeddings live in the same PostgreSQL instance (which they probably do if you&#8217;ve been following this series), you don&#8217;t need Flink. You don&#8217;t need Kafka. You need a trigger.</p>



<h3 class="wp-block-heading" id="h-schema-design-with-versioning">Schema design with versioning</h3>



<p>First, let&#8217;s design a proper embedding table that supports versioning. This is the piece most tutorials skip:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Source table (your knowledge base)
CREATE TABLE documents (
    doc_id          BIGSERIAL PRIMARY KEY,
    title           TEXT NOT NULL,
    content         TEXT NOT NULL,
    category        TEXT,
    content_hash    TEXT GENERATED ALWAYS AS (md5(content)) STORED,
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    is_active       BOOLEAN NOT NULL DEFAULT true
);

-- Embedding table with versioning support
CREATE TABLE document_embeddings (
    embedding_id    BIGSERIAL PRIMARY KEY,
    doc_id          BIGINT NOT NULL REFERENCES documents(doc_id) ON DELETE CASCADE,
    chunk_index     INT NOT NULL,
    chunk_text      TEXT NOT NULL,
    embedding       vector(1536),       -- text-embedding-3-small
    model_name      TEXT NOT NULL DEFAULT &#039;text-embedding-3-small&#039;,
    model_version   TEXT NOT NULL DEFAULT &#039;v1&#039;,
    source_hash     TEXT NOT NULL,       -- md5 of the source content at embed time
    embedded_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
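    is_current      BOOLEAN NOT NULL DEFAULT true
    -- No table-level UNIQUE on (doc_id, chunk_index, model_name, model_version):
    -- superseded rows (is_current = false) keep the same key, so uniqueness is
    -- enforced for current rows only, by uq_doc_current_per_space below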
);

-- Index for similarity search (only current embeddings)
-- Using pgvectorscale&#039;s StreamingDiskANN for better performance at scale
CREATE INDEX idx_embeddings_diskann ON document_embeddings 
    USING diskann (embedding vector_cosine_ops)
    WHERE is_current = true;

-- Index for version lookups
CREATE INDEX idx_embeddings_version ON document_embeddings (doc_id, model_version, is_current);

-- Index for staleness detection
CREATE INDEX idx_embeddings_staleness ON document_embeddings (source_hash, is_current)
    WHERE is_current = true;

-- Safety: prevent two &quot;current&quot; chunk sets for the same doc + model space
CREATE UNIQUE INDEX uq_doc_current_per_space
    ON document_embeddings (doc_id, model_name, model_version, chunk_index)
    WHERE is_current;

</pre></div>


<p>A few things to notice here:</p>



<ul class="wp-block-list">
<li><strong><code>content_hash</code></strong>: a generated column that gives us a fast way to detect whether the content actually changed (not just <code>updated_at</code>). If you&#8217;re adding this to an existing table with data, note that <code>ALTER TABLE ... ADD COLUMN ... GENERATED ALWAYS AS ... STORED</code> rewrites the table to compute the value for every row, so plan a maintenance window. Alternatively, add a plain <code>content_hash</code> column maintained by a <code>BEFORE INSERT OR UPDATE</code> trigger with <code>NEW.content_hash := md5(NEW.content)</code>, and backfill existing rows in batches. For change detection, both approaches are functionally equivalent.</li>



<li><strong><code>source_hash</code></strong> on the embedding: captures what the source content looked like when the embedding was generated</li>



<li><strong><code>is_current</code></strong>: soft versioning — old embeddings are kept for rollback. The partial unique index <code>uq_doc_current_per_space</code> guarantees at the database level that you can never have two &#8220;current&#8221; chunk sets for the same document within the same model space — even if your application has a bug.</li>



<li><strong>Partial DiskANN index</strong>: only indexes current embeddings, so similarity search is clean and performant at scale. Partial indexes (<code>CREATE INDEX ... WHERE ...</code>) are standard PostgreSQL — validated in our lab with pgvectorscale&#8217;s StreamingDiskANN (see <a href="https://www.dbi-services.com/blog/rag-series-embedding-versioning-lab/" target="_blank" rel="noreferrer noopener">Part 2 — Lab Walkthrough</a>). If your pgvectorscale version doesn&#8217;t support partial predicates, pgvector&#8217;s HNSW partial index is an equivalent fallback.</li>



<li><strong><code>model_version</code></strong>: critical for model upgrades (more on this later)</li>
</ul>



<h3 class="wp-block-heading" id="h-the-embedding-queue-pattern">The embedding queue pattern</h3>



<p>Rather than embedding synchronously in a trigger (which would block the transaction and hit external APIs), we use a <strong>queue pattern</strong>:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Queue table for pending embedding work
CREATE TABLE embedding_queue (
    queue_id        BIGSERIAL PRIMARY KEY,
    doc_id          BIGINT NOT NULL,        -- no FK: &#039;delete&#039; entries must outlive the document row
    change_type     TEXT NOT NULL DEFAULT &#039;content_update&#039;,
    content_hash    TEXT,
    queued_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
    claimed_at      TIMESTAMPTZ,            -- set when a worker claims the item
    processed_at    TIMESTAMPTZ,
    status          TEXT NOT NULL DEFAULT &#039;pending&#039; 
                    CHECK (status IN (&#039;pending&#039;, &#039;processing&#039;, &#039;completed&#039;, &#039;failed&#039;, &#039;skipped&#039;)),
    error_message   TEXT,
    retry_count     INT NOT NULL DEFAULT 0
);

CREATE INDEX idx_queue_pending ON embedding_queue (status, queued_at) 
    WHERE status = &#039;pending&#039;;

</pre></div>


<p>Note the <strong><code>skipped</code></strong> status — this is used by the change significance detector (covered later) when it determines that a content change is too minor to warrant re-embedding. The item stays in the queue for audit purposes, but no embedding API call is made.</p>



<h3 class="wp-block-heading" id="h-the-trigger">The trigger</h3>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
CREATE OR REPLACE FUNCTION fn_queue_embedding_update()
RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = &#039;INSERT&#039; THEN
        INSERT INTO embedding_queue (doc_id, change_type, content_hash)
        VALUES (NEW.doc_id, &#039;content_update&#039;, NEW.content_hash);
        RETURN NEW;
        
    ELSIF TG_OP = &#039;UPDATE&#039; THEN
        -- Only queue if content actually changed (not just metadata)
        IF OLD.content_hash IS DISTINCT FROM NEW.content_hash THEN
            INSERT INTO embedding_queue (doc_id, change_type, content_hash)
            VALUES (NEW.doc_id, &#039;content_update&#039;, NEW.content_hash);
        END IF;
        RETURN NEW;
        
    ELSIF TG_OP = &#039;DELETE&#039; THEN
        INSERT INTO embedding_queue (doc_id, change_type, content_hash)
        VALUES (OLD.doc_id, &#039;delete&#039;, OLD.content_hash);
        RETURN OLD;
    END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_embedding_queue
    AFTER INSERT OR UPDATE OR DELETE ON documents
    FOR EACH ROW
    EXECUTE FUNCTION fn_queue_embedding_update();

</pre></div>


<p><strong>The key insight here</strong> is the <code>content_hash</code> comparison on UPDATE. If someone updates the <code>category</code> or <code>title</code> but the actual content hasn&#8217;t changed, we don&#8217;t waste an API call re-embedding identical text. This is a simple optimization but it saves real money at scale. In my lab tests on 25K Wikipedia articles, 12% of simulated mutations were metadata-only — the trigger correctly skipped all of them.</p>
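<p>To make the rule concrete outside the database, here is a minimal Python sketch of the decision the trigger makes. The function name <code>should_queue</code> and the dict-shaped rows are assumptions for illustration, not part of the lab code; only the md5-of-content comparison comes from the schema above, and the hashes match PostgreSQL&#8217;s <code>md5()</code> only if the database encoding is UTF-8.</p>

```python
import hashlib
from typing import Optional


def content_hash(content: str) -> str:
    """Mirror of the generated column: md5 over the UTF-8 encoded content."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()


def should_queue(old_row: Optional[dict], new_row: dict) -> bool:
    """Return True when the change needs a queue entry.

    old_row is None for an INSERT. For an UPDATE the hashes are compared,
    so edits to title or category alone are ignored.
    """
    if old_row is None:
        return True
    return content_hash(old_row["content"]) != content_hash(new_row["content"])


# Hypothetical sample rows
before = {"title": "Vacuum basics", "category": "admin",
          "content": "VACUUM reclaims dead tuples."}
metadata_edit = {**before, "category": "maintenance"}
content_edit = {**before,
                "content": "VACUUM reclaims dead tuples and updates the visibility map."}

print("metadata-only edit queued:", should_queue(before, metadata_edit))
print("content edit queued:", should_queue(before, content_edit))
```

<p>The metadata-only edit produces the same hash on both sides, so nothing is queued and no embedding call is made.</p>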



<p>An alternative approach that&#8217;s even more targeted: use <code>AFTER INSERT OR UPDATE OF content</code> so the trigger only fires when the content column is modified. This is what I did in the lab (see <a href="https://www.dbi-services.com/blog/rag-series-embedding-versioning-lab/" target="_blank" rel="noreferrer noopener">Part 2</a>) because the <code>articles</code> table didn&#8217;t originally have a <code>content_hash</code> column. Both approaches achieve the same goal.</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p><strong>DBA note on <code>UPDATE OF</code></strong>: PostgreSQL&#8217;s column-specific trigger fires based on the <code>SET</code> list of the <code>UPDATE</code> command, not the actual row diff. If another <code>BEFORE UPDATE</code> trigger silently modifies <code>NEW.content</code> without <code>content</code> appearing in the original <code>SET</code> clause, an <code>AFTER UPDATE OF content</code> trigger won&#8217;t fire: the content changed, but the column-specific trigger never sees it. This is <a href="https://www.postgresql.org/docs/current/sql-createtrigger.html">documented behavior</a>. The <code>content_hash</code> comparison approach above doesn&#8217;t have this blind spot, because it compares actual values regardless of which columns were in the <code>SET</code> list.</p>
</blockquote>



<h3 class="wp-block-heading" id="h-the-worker-python">The worker (Python)</h3>



<p>The worker process polls the queue and generates embeddings. This is intentionally simple — no frameworks, no dependencies beyond <code>psycopg</code> and <code>openai</code>:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
#!/usr/bin/env python3
&quot;&quot;&quot;
embedding_worker.py: polls the embedding_queue and processes pending items.
Run as: python3 embedding_worker.py
This listing parses no command-line flags; to scale out, start several
copies of the process (SKIP LOCKED keeps them from claiming the same items).
&quot;&quot;&quot;

import hashlib
import os
import time
import psycopg
from openai import OpenAI

DB_URL = os.environ&#x5B;&quot;DATABASE_URL&quot;]
client = OpenAI()

MODEL_NAME    = &quot;text-embedding-3-small&quot;
MODEL_VERSION = &quot;v1&quot;
CHUNK_SIZE    = 500   # tokens (approximate via chars / 4)
CHUNK_OVERLAP = 50
BATCH_SIZE    = 10    # queue items per cycle
POLL_INTERVAL = 5     # seconds


def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -&gt; list&#x5B;str]:
    &quot;&quot;&quot;Simple character-based chunking. Replace with your preferred strategy.&quot;&quot;&quot;
    char_size = size * 4  # rough token-to-char ratio
    char_overlap = overlap * 4
    chunks = &#x5B;]
    start = 0
    while start &lt; len(text):
        end = start + char_size
        chunks.append(text&#x5B;start:end])
        if end &gt;= len(text):
            break  # last chunk reached; avoids a trailing chunk made only of overlap
        start = end - char_overlap
    return chunks


def generate_embeddings(texts: list&#x5B;str]) -&gt; list&#x5B;list&#x5B;float]]:
    &quot;&quot;&quot;Batch embedding call to OpenAI.&quot;&quot;&quot;
    response = client.embeddings.create(
        input=texts,
        model=MODEL_NAME
    )
    return &#x5B;item.embedding for item in response.data]


def process_insert_or_update(conn, doc_id: int, content_hash: str):
    &quot;&quot;&quot;Generate fresh embeddings for a document.&quot;&quot;&quot;
    with conn.cursor() as cur:
        # Fetch current document content
        cur.execute(
            &quot;SELECT content FROM documents WHERE doc_id = %s AND is_active = true&quot;,
            (doc_id,)
        )
        row = cur.fetchone()
        if not row:
            return  # document was deleted or deactivated since queuing
        
        content = row&#x5B;0]
        
        # Verify content hasn&#039;t changed again since queuing
        current_hash = hashlib.md5(content.encode()).hexdigest()
        if current_hash != content_hash:
            return  # content changed again, a newer queue entry will handle it
        
        # Check if embeddings already exist for this hash (idempotency)
        # Scoped to model_name + model_version so parallel shadow-mode workers
        # don&#039;t falsely consider each other&#039;s embeddings as &quot;already done&quot;
        cur.execute(
            &quot;&quot;&quot;SELECT 1 FROM document_embeddings 
               WHERE doc_id = %s AND source_hash = %s
                 AND model_name = %s AND model_version = %s
                 AND is_current = true
               LIMIT 1&quot;&quot;&quot;,
            (doc_id, content_hash, MODEL_NAME, MODEL_VERSION)
        )
        if cur.fetchone():
            return  # already embedded with this content
        
        # Chunk and embed
        chunks = chunk_text(content)
        embeddings = generate_embeddings(chunks)
        
        # Mark old embeddings as not current — scoped to this model space only
        # so shadow-mode v2 embeddings aren&#039;t flipped by v1 workers (or vice versa)
        cur.execute(
            &quot;&quot;&quot;UPDATE document_embeddings 
               SET is_current = false 
               WHERE doc_id = %s
                 AND model_name = %s AND model_version = %s
                 AND is_current = true&quot;&quot;&quot;,
            (doc_id, MODEL_NAME, MODEL_VERSION)
        )
        
        # Insert new embeddings
        for idx, (chunk, emb) in enumerate(zip(chunks, embeddings)):
            cur.execute(
                &quot;&quot;&quot;INSERT INTO document_embeddings 
                   (doc_id, chunk_index, chunk_text, embedding, 
                    model_name, model_version, source_hash)
                   VALUES (%s, %s, %s, %s, %s, %s, %s)&quot;&quot;&quot;,
                (doc_id, idx, chunk, emb, MODEL_NAME, MODEL_VERSION, content_hash)
            )
        
        conn.commit()


def process_delete(conn, doc_id: int):
    &quot;&quot;&quot;Mark embeddings as not current when source is deleted.&quot;&quot;&quot;
    with conn.cursor() as cur:
        cur.execute(
            &quot;&quot;&quot;UPDATE document_embeddings 
               SET is_current = false 
               WHERE doc_id = %s
                 AND model_name = %s AND model_version = %s
                 AND is_current = true&quot;&quot;&quot;,
            (doc_id, MODEL_NAME, MODEL_VERSION)
        )
        conn.commit()


def poll_and_process():
    &quot;&quot;&quot;Main loop: claim a batch, process, repeat.&quot;&quot;&quot;
    with psycopg.connect(DB_URL) as conn:
        while True:
            with conn.cursor() as cur:
                # Claim a batch (SELECT FOR UPDATE SKIP LOCKED)
                cur.execute(&quot;&quot;&quot;
                    UPDATE embedding_queue 
                    SET status = &#039;processing&#039;, claimed_at = now()
                    WHERE queue_id IN (
                        SELECT queue_id FROM embedding_queue
                        WHERE status = &#039;pending&#039;
                        ORDER BY queued_at
                        LIMIT %s
                        FOR UPDATE SKIP LOCKED
                    )
                    RETURNING queue_id, doc_id, change_type, content_hash
                &quot;&quot;&quot;, (BATCH_SIZE,))
                
                batch = cur.fetchall()
                conn.commit()
            
            if not batch:
                time.sleep(POLL_INTERVAL)
                continue
            
            for queue_id, doc_id, change_type, content_hash in batch:
                try:
                    if change_type in (&#039;content_update&#039;,):
                        process_insert_or_update(conn, doc_id, content_hash)
                    elif change_type == &#039;delete&#039;:
                        process_delete(conn, doc_id)
                    
                    with conn.cursor() as cur:
                        cur.execute(
                            &quot;&quot;&quot;UPDATE embedding_queue 
                               SET status = &#039;completed&#039;, processed_at = now()
                               WHERE queue_id = %s&quot;&quot;&quot;,
                            (queue_id,)
                        )
                        conn.commit()
                        
                except Exception as e:
                    conn.rollback()
                    with conn.cursor() as cur:
                        cur.execute(
                            &quot;&quot;&quot;UPDATE embedding_queue 
                               SET status = CASE WHEN retry_count &gt;= 3 THEN &#039;failed&#039; ELSE &#039;pending&#039; END,
                                   retry_count = retry_count + 1,
                                   error_message = %s
                               WHERE queue_id = %s&quot;&quot;&quot;,
                            (str(e), queue_id)
                        )
                        conn.commit()
                    print(f&quot;Error processing queue_id={queue_id}: {e}&quot;)


if __name__ == &quot;__main__&quot;:
    print(&quot;Embedding worker started. Polling...&quot;)
    poll_and_process()

</pre></div>


<p><strong>What I like about this pattern:</strong></p>



<ul class="wp-block-list">
<li>It&#8217;s <strong>transactional</strong>: the document change and the queue insert done by the trigger run in the same transaction. If the INSERT/UPDATE rolls back, no queue entry is created.</li>



<li>It&#8217;s <strong>idempotent</strong>: the worker checks <code>content_hash</code> before embedding, so duplicate queue entries are harmless.</li>



<li>It uses <strong><code>SELECT FOR UPDATE SKIP LOCKED</code></strong> for safe concurrency (see below).</li>



<li>It handles <strong>retries</strong> gracefully: failed items go back to pending with a counter.</li>
</ul>
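<p>The retry rule is easy to misread, so here is a small sketch that traces it by hand. It mirrors the <code>CASE WHEN retry_count &gt;= 3</code> expression from the worker&#8217;s error handler; the function name <code>after_failure</code> is an assumption for illustration.</p>

```python
MAX_RETRIES = 3  # same threshold as the CASE expression in the worker


def after_failure(retry_count: int) -> tuple:
    """Apply the worker's failure rule to one queue item.

    The status is decided from the counter value *before* the increment,
    exactly as the SQL UPDATE evaluates it.
    """
    status = "failed" if retry_count >= MAX_RETRIES else "pending"
    return status, retry_count + 1


# Simulate an item whose embedding call keeps failing
history = []
status, retries = "processing", 0
while status != "failed":
    status, retries = after_failure(retries)
    history.append((status, retries))

for step in history:
    print(step)
```

<p>Because the comparison uses the counter before it is incremented, an item is attempted four times in total before it is parked as <code>failed</code>. If you want three attempts, compare against 2 instead.</p>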



<h3 class="wp-block-heading" id="h-deep-dive-select-for-update-skip-locked">Deep dive: SELECT FOR UPDATE SKIP LOCKED</h3>



<p>This is the core of why this queue pattern works, and it&#8217;s a PostgreSQL feature that most people underuse. Let me explain it properly because it&#8217;s one of those things that looks simple in the SQL but has profound implications for how you scale workers.</p>



<p>The problem: you want to run <strong>multiple embedding workers in parallel</strong> to process the queue faster. But if two workers pick the same queue item, you&#8217;ve wasted an API call (double embedding) or worse, you get race conditions on the <code>document_embeddings</code> table.</p>



<p>The classic solutions are:</p>



<ul class="wp-block-list">
<li><strong>External locking</strong> (Redis, ZooKeeper): adds infrastructure, adds failure modes</li>



<li><strong>Application-level partitioning</strong> (worker 1 handles doc_id % 3 = 0, worker 2 handles doc_id % 3 = 1&#8230;): rigid, doesn&#8217;t adapt to load</li>



<li><strong>SELECT FOR UPDATE</strong>: locks the rows, but the second worker <strong>blocks and waits</strong> until the first one commits. This serializes your workers — you&#8217;re back to single-threaded throughput.</li>
</ul>



<p><code>SKIP LOCKED</code> changes everything. Here&#8217;s what happens step by step:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
Timeline:
─────────────────────────────────────────────────────────────────

Worker A (t=0):
    BEGIN;
    UPDATE embedding_queue SET status = &#039;processing&#039;, claimed_at = now()
    WHERE queue_id IN (
        SELECT queue_id FROM embedding_queue
        WHERE status = &#039;pending&#039;
        ORDER BY queued_at
        LIMIT 5
        FOR UPDATE SKIP LOCKED    -- ← locks rows 1,2,3,4,5
    )
    RETURNING queue_id, doc_id, change_type, content_hash;
    
    → Returns: queue_id 1, 2, 3, 4, 5
    → These 5 rows are now locked by Worker A&#039;s transaction

Worker B (t=1, while Worker A&#039;s transaction is still open):
    BEGIN;
    UPDATE embedding_queue SET status = &#039;processing&#039;, claimed_at = now()
    WHERE queue_id IN (
        SELECT queue_id FROM embedding_queue
        WHERE status = &#039;pending&#039;
        ORDER BY queued_at
        LIMIT 5
        FOR UPDATE SKIP LOCKED    -- ← sees rows 1-5 are locked, SKIPS them
    )
    RETURNING queue_id, doc_id, change_type, content_hash;
    
    → Returns: queue_id 6, 7, 8, 9, 10
    → No blocking, no waiting, no conflict

Worker C (t=2):
    → Gets queue_id 11, 12, 13, 14, 15
    → Same story: zero contention

</pre></div>


<p>The key behaviors:</p>



<ul class="wp-block-list">
<li><strong><code>FOR UPDATE</code></strong>: tells PostgreSQL &#8220;I intend to modify these rows, lock them for me&#8221;</li>



<li><strong><code>SKIP LOCKED</code></strong>: tells PostgreSQL &#8220;if a row is already locked by someone else, <strong>don&#8217;t wait</strong> — just pretend it doesn&#8217;t exist and move to the next one&#8221;</li>
</ul>



<p>This means:</p>



<ul class="wp-block-list">
<li>Workers <strong>never block each other</strong> — no waiting, no deadlocks</li>



<li>Workers <strong>never process the same item</strong> — each item is claimed by exactly one worker</li>



<li>You can <strong>scale horizontally</strong> just by starting more worker processes</li>



<li>If a worker crashes before its claim transaction commits, the transaction is rolled back, the locks are released, and the rows become visible to other workers again as <code>'pending'</code>. If it crashes after the claim has committed, the rows keep <code>status = 'processing'</code>, so you need a cleanup mechanism for crashed workers (more on that below)</li>
</ul>
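<p>The claim semantics can be modelled in a few lines of plain Python. This is a single-threaded toy model, not real concurrency: the <code>locked_by</code> field stands in for a PostgreSQL row lock, and <code>QueueRow</code>, <code>claim_batch</code> and <code>commit</code> are hypothetical names introduced for the sketch.</p>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueueRow:
    queue_id: int
    status: str = "pending"
    locked_by: Optional[str] = None  # stands in for a row lock held by a transaction


def claim_batch(rows: list, worker: str, limit: int) -> list:
    """Lock up to `limit` pending rows, skipping rows locked by someone else."""
    claimed = []
    for row in sorted(rows, key=lambda r: r.queue_id):
        if len(claimed) == limit:
            break
        if row.status != "pending":
            continue  # filtered out by WHERE status = 'pending'
        if row.locked_by is not None:
            continue  # SKIP LOCKED: do not wait, move to the next row
        row.locked_by = worker
        claimed.append(row.queue_id)
    return claimed


def commit(rows: list, worker: str) -> None:
    """Make the status change visible and release the worker's locks."""
    for row in rows:
        if row.locked_by == worker:
            row.status = "processing"
            row.locked_by = None


queue = [QueueRow(i) for i in range(1, 16)]
batch_a = claim_batch(queue, "A", 5)   # A locks the first five rows
batch_b = claim_batch(queue, "B", 5)   # A has not committed yet: B skips its rows
commit(queue, "A")                     # A's rows are now 'processing'
batch_c = claim_batch(queue, "C", 5)   # C skips A's rows (status) and B's rows (lock)
print(batch_a, batch_b, batch_c)
```

<p>Note the two different reasons a row is passed over: a lock while the claiming transaction is open, and the <code>status</code> value once it has committed.</p>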



<h4 class="wp-block-heading" id="h-what-happens-without-skip-locked">What happens without SKIP LOCKED?</h4>



<p>Let&#8217;s compare. With plain <code>FOR UPDATE</code> (no SKIP LOCKED):</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
Worker A (t=0):  SELECT ... FOR UPDATE LIMIT 5;  → gets rows 1-5, locks them
Worker B (t=1):  SELECT ... FOR UPDATE LIMIT 5;  → tries row 1... BLOCKED ⏳
                                                    waits for Worker A to COMMIT
Worker A (t=10): COMMIT;                          → releases locks
Worker B (t=10): → finally gets rows 1-5 (but they&#039;re already processed!)
                 → returns empty set because status is no longer &#039;pending&#039;
                 → wasted 10 seconds waiting for nothing

</pre></div>


<p>With <code>SKIP LOCKED</code>:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
Worker A (t=0):  SELECT ... FOR UPDATE SKIP LOCKED LIMIT 5;  → gets rows 1-5
Worker B (t=1):  SELECT ... FOR UPDATE SKIP LOCKED LIMIT 5;  → gets rows 6-10 instantly
                 → zero wait time, immediate useful work

</pre></div>


<p>This is exactly the behavior you want for a work queue.</p>



<h4 class="wp-block-heading" id="h-the-crash-recovery-problem">The crash recovery problem</h4>



<p>There&#8217;s one subtlety: the worker commits its claim before it starts embedding, so the row locks are released at that commit and only the <code>status</code> column marks the rows as taken. If Worker A claims rows 1-5, sets their <code>status = 'processing'</code>, and then crashes (process killed, OOM, network failure), nothing rolls that committed status back: those rows are stuck in <code>'processing'</code> forever.</p>



<p>You need a reaper — a periodic cleanup that reclaims stale items:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Reclaim items stuck in &#039;processing&#039; for more than 5 minutes
-- (embedding should never take that long)
-- Uses claimed_at, not queued_at — an item queued 30 minutes ago
-- but claimed 10 seconds ago should NOT be reclaimed
UPDATE embedding_queue 
SET status = &#039;pending&#039;,
    retry_count = retry_count + 1,
    error_message = &#039;reclaimed: worker timeout after 5 minutes&#039;
WHERE status = &#039;processing&#039; 
AND claimed_at &lt; now() - INTERVAL &#039;5 minutes&#039;;

</pre></div>


<p>Run this every minute via <code>pg_cron</code> or a simple cron job. It&#8217;s a safety net, not the primary flow.</p>
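<p>The reason the reaper keys on <code>claimed_at</code> rather than <code>queued_at</code> is easiest to see on sample data. The sketch below applies the same rule to in-memory dicts; <code>reclaim_stale</code> and the sample items are hypothetical, and in production the SQL statement above does this work.</p>

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=5)  # same window as the SQL reaper
REAPER_MESSAGE = "reclaimed: worker timeout after 5 minutes"


def reclaim_stale(items: list, now: datetime) -> int:
    """Put items stuck in 'processing' back to 'pending'; return how many."""
    cutoff = now - STALE_AFTER
    reclaimed = 0
    for item in items:
        if item["status"] == "processing" and item["claimed_at"] < cutoff:
            item["status"] = "pending"
            item["retry_count"] += 1
            item["error_message"] = REAPER_MESSAGE
            reclaimed += 1
    return reclaimed


now = datetime(2026, 3, 1, 12, 0, tzinfo=timezone.utc)
items = [
    # queued long ago but claimed 10 seconds ago: a worker is on it
    {"queue_id": 1, "status": "processing", "retry_count": 0, "error_message": None,
     "queued_at": now - timedelta(minutes=30), "claimed_at": now - timedelta(seconds=10)},
    # claimed 6 minutes ago and never finished: the worker is presumed dead
    {"queue_id": 2, "status": "processing", "retry_count": 0, "error_message": None,
     "queued_at": now - timedelta(minutes=7), "claimed_at": now - timedelta(minutes=6)},
    # still waiting for a worker
    {"queue_id": 3, "status": "pending", "retry_count": 0, "error_message": None,
     "queued_at": now - timedelta(minutes=1), "claimed_at": None},
]
reclaimed_count = reclaim_stale(items, now)
print(reclaimed_count, [item["status"] for item in items])
```

<p>Had the rule used <code>queued_at</code>, item 1 would have been handed to a second worker while the first was still embedding it.</p>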



<h4 class="wp-block-heading" id="h-why-this-is-better-than-external-queue-systems">Why this is better than external queue systems</h4>



<p>For this specific use case (embedding queue), <code>SKIP LOCKED</code> gives you an <strong>in-database work queue</strong> with:</p>



<ul class="wp-block-list">
<li><strong>ACID guarantees</strong>: the queue and the embeddings are in the same database, same transaction scope</li>



<li><strong>No external dependencies</strong>: no Redis, no RabbitMQ, no SQS</li>



<li><strong>Effectively-once processing</strong>: delivery is at-least-once (the reaper can hand an item to a second worker), and the <code>content_hash</code> idempotency check makes the repeats harmless</li>



<li><strong>Observability</strong>: it&#8217;s just a table — <code>SELECT count(*) FROM embedding_queue WHERE status = 'pending'</code> is your queue depth, queryable from any SQL client or monitoring tool</li>
</ul>



<p>The limitation is throughput: if you&#8217;re processing millions of queue items per second, you want Kafka or SQS. For an embedding queue where each item takes 100-500ms to process (API call), PostgreSQL can easily handle thousands of items per minute. That&#8217;s more than enough for any knowledge base I&#8217;ve seen in production.</p>
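<p>A quick back-of-the-envelope calculation shows what those latencies mean per worker. This sketch assumes each worker processes one queue item at a time and that the API call dominates the time per item; the worker counts are arbitrary examples.</p>

```python
def items_per_minute(workers: int, ms_per_item: int) -> float:
    """Upper bound on throughput when every item costs ms_per_item milliseconds."""
    return workers * 60_000 / ms_per_item


for ms_per_item in (100, 500):
    for workers in (1, 4, 8):
        rate = items_per_minute(workers, ms_per_item)
        print(f"{workers} worker(s) at {ms_per_item} ms/item: {rate:.0f} items/minute")
```

<p>Under these assumptions a single worker tops out between 120 and 600 items per minute, so reaching thousands of items per minute means running several workers in parallel, which is exactly what <code>SKIP LOCKED</code> makes safe.</p>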



<p><strong>What I don&#8217;t like:</strong></p>



<ul class="wp-block-list">
<li>It&#8217;s <strong>polling-based</strong>: the worker checks every 5 seconds. For most use cases this is fine, but if you need sub-second latency, you want <code>LISTEN/NOTIFY</code>.</li>



<li>It requires a <strong>separate process</strong> to run. In production, that means a systemd service or a Kubernetes deployment.</li>
</ul>



<h3 class="wp-block-heading" id="h-upgrading-to-listen-notify">Upgrading to LISTEN/NOTIFY</h3>



<p>If you want to eliminate polling and react instantly, PostgreSQL&#8217;s <code>LISTEN/NOTIFY</code> mechanism is your friend. Add this to the trigger:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Add to fn_queue_embedding_update(), after each INSERT INTO embedding_queue:
-- Use NEW.doc_id for INSERT/UPDATE, OLD.doc_id for DELETE
PERFORM pg_notify(&#039;embedding_work&#039;, json_build_object(
    &#039;doc_id&#039;, COALESCE(NEW.doc_id, OLD.doc_id),
    &#039;operation&#039;, TG_OP
)::text);

</pre></div>


<p>And in the worker, replace the <code>time.sleep()</code> loop with:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
conn = psycopg.connect(DB_URL, autocommit=True)
conn.execute(&quot;LISTEN embedding_work&quot;)

while True:
    # Block until a notification arrives or 5 seconds pass.
    # The timeout and stop_after parameters require psycopg 3.2 or later.
    for _ in conn.notifies(timeout=5.0, stop_after=1):
        pass
    # Drain the queue in both cases: the notification is only a wake-up call,
    # and the timeout path picks up items whose notification was missed.
    # process_pending_batch() stands for the claim-and-process body of poll_and_process().
    process_pending_batch(conn)

</pre></div>


<p>This gives you near-real-time embedding refresh; the 5-second timeout remains only as a safety net for missed notifications.</p>
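<p>One practical detail: a bulk update fires one notification per row, but the worker only needs to wake up once, because the queue table is the source of truth. The sketch below parses payloads shaped like the <code>json_build_object</code> output above and collapses a burst; <code>summarize_burst</code> is a hypothetical helper, useful mainly for logging.</p>

```python
import json


def summarize_burst(payloads: list) -> dict:
    """Collapse a burst of NOTIFY payloads into a count and the distinct doc_ids."""
    doc_ids = []
    for payload in payloads:
        event = json.loads(payload)
        if event["doc_id"] not in doc_ids:
            doc_ids.append(event["doc_id"])  # keep arrival order
    return {"notifications": len(payloads), "doc_ids": doc_ids}


# Payloads as PostgreSQL would render json_build_object(...)::text
burst = [
    '{"doc_id" : 42, "operation" : "UPDATE"}',
    '{"doc_id" : 42, "operation" : "UPDATE"}',
    '{"doc_id" : 7, "operation" : "INSERT"}',
]
summary = summarize_burst(burst)
print(summary)
```

<p>Three notifications, two documents, one queue drain: the payload is informational, and the worker should never rely on it instead of reading <code>embedding_queue</code>.</p>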



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-level-2-logical-replication-cross-database-embedding-sync">Level 2: Logical Replication — Cross-Database Embedding Sync</h2>



<p>Now let&#8217;s go a level up. What if your source data lives in a different PostgreSQL instance than your vector store? Or what if the team that manages the knowledge base doesn&#8217;t want triggers on their production tables?</p>



<p>This is where <strong>PostgreSQL logical replication</strong> becomes the CDC mechanism. It&#8217;s built into PostgreSQL, it reads the WAL, and it has near-zero impact on the source.</p>



<h3 class="wp-block-heading" id="h-the-setup">The setup</h3>



<p>On the <strong>source</strong> (knowledge base database):</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Ensure WAL is configured for logical replication
ALTER SYSTEM SET wal_level = &#039;logical&#039;;
ALTER SYSTEM SET max_replication_slots = 10;
ALTER SYSTEM SET max_wal_senders = 10;
-- Restart required

-- Create a publication for the documents table
CREATE PUBLICATION pub_documents FOR TABLE documents;

</pre></div>


<p>On the <strong>target</strong> (vector database, different instance):</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
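-- Same table name (and schema) as on the source: logical replication
-- matches tables by their fully qualified name
CREATE TABLE documents (
    doc_id          BIGINT PRIMARY KEY,
    title           TEXT NOT NULL,
    content         TEXT NOT NULL,
    category        TEXT,
    -- computed locally: the publisher does not send generated columns by default
    content_hash    TEXT GENERATED ALWAYS AS (md5(content)) STORED,
    updated_at      TIMESTAMPTZ,
    is_active       BOOLEAN
);

-- Create the subscription
CREATE SUBSCRIPTION sub_documents
    CONNECTION &#039;host=source-db port=5432 dbname=knowledge_base user=replicator password=...&#039;
    PUBLICATION pub_documents
    WITH (copy_data = true);  -- initial snapshot

</pre></div>


<p>Now the <code>documents</code> table on your vector database is automatically kept in sync via WAL streaming. Every INSERT, UPDATE, DELETE on the source is replicated in near-real-time. The subscriber table must carry the same name as the published table and every published column, which is why <code>category</code> is included here.</p>



<p>From here, you add the same <strong>trigger + queue + worker</strong> pattern from Level 1, but on the replicated <code>documents</code> table in the vector database. One detail matters: the apply worker runs with <code>session_replication_role = replica</code>, so ordinary triggers do not fire for replicated rows. Enable the trigger explicitly with <code>ALTER TABLE documents ENABLE ALWAYS TRIGGER trg_embedding_queue</code>. The source database team doesn&#8217;t need to know or care about your embedding pipeline.</p>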



<h3 class="wp-block-heading" id="h-architecture">Architecture</h3>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="383" src="https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/image-10-1024x383.png" alt="" class="wp-image-43063" srcset="https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/image-10-1024x383.png 1024w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/image-10-300x112.png 300w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/image-10-768x287.png 768w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/image-10-1536x575.png 1536w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/image-10.png 1596w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p><strong>Why this is powerful:</strong></p>



<ul class="wp-block-list">
<li><strong>Minimal impact on source</strong>: no triggers and no extra queries against the tables, just one replication connection decoding the WAL</li>



<li><strong>Separation of concerns</strong>: the DBA managing the knowledge base doesn&#8217;t need to understand embeddings</li>



<li><strong>Built-in catch-up</strong>: if the subscriber goes down, the replication slot retains the WAL on the source, and when it reconnects all changes are applied in commit order. If only the embedding worker goes down, the queue table buffers the pending work</li>



<li><strong>No external dependencies</strong>: this is pure PostgreSQL, no Kafka, no Flink, no cloud services</li>
</ul>



<p><strong>Limitations:</strong></p>



<ul class="wp-block-list">
<li>Logical replication is <strong>PG → PG only</strong> (unlike Flink CDC which can source from Oracle, MySQL, etc.)</li>



<li><strong>DDL is not replicated</strong>: if the source adds a column, you need to handle it manually</li>



<li>The replication slot retains WAL until consumed — <strong>⚠️ Production pitfall</strong>: if the subscriber is down for too long, WAL can fill up the source disk. Set <code>max_slot_wal_keep_size</code> (PG 13+) to cap retention, and monitor <code>pg_replication_slots</code> for inactive slots. DBAs: this is the #1 risk with logical replication.</li>
</ul>
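<p>Monitoring that slot does not need much. Here is a minimal sketch of an alerting check: the query uses the standard <code>pg_replication_slots</code> view and <code>pg_wal_lsn_diff()</code>, to be run on the source primary, while the 10 GiB threshold and the function <code>slot_alerts</code> are assumptions to adapt to your disk size. The function takes already-fetched rows so it can be tried without a database.</p>

```python
# Query to run on the source primary (for example with psycopg's cursor.execute)
SLOT_LAG_SQL = """
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical'
"""

GIB = 1024 ** 3
MAX_RETAINED_BYTES = 10 * GIB  # hypothetical alert threshold


def slot_alerts(rows, max_retained_bytes=MAX_RETAINED_BYTES):
    """Turn (slot_name, active, retained_bytes) rows into alert messages."""
    alerts = []
    for slot_name, active, retained_bytes in rows:
        retained = int(retained_bytes or 0)  # restart_lsn can be NULL
        if not active:
            alerts.append(f"{slot_name}: inactive slot retaining {retained} bytes of WAL")
        elif retained > max_retained_bytes:
            alerts.append(f"{slot_name}: {retained} bytes of WAL retained, above threshold")
    return alerts


# Hypothetical rows, shaped like the query result
sample_rows = [
    ("sub_documents", True, 512 * 1024 ** 2),
    ("sub_old", False, 50 * GIB),
    ("sub_slow", True, 20 * GIB),
]
alerts = slot_alerts(sample_rows)
for line in alerts:
    print(line)
```

<p>An inactive slot is flagged regardless of size, because it keeps growing until someone reconnects or drops it.</p>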



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-level-3-flink-cdc-when-the-source-isn-t-postgresql-and-when-to-skip-re-embedding">Level 3: Flink CDC — When the Source Isn&#8217;t PostgreSQL (and When to Skip Re-embedding)</h2>



<p>If your knowledge base lives in Oracle, MySQL, or you need to fan out to multiple targets (pgvector + Elasticsearch + data lake), then we&#8217;re back in the territory of my CDC posts.</p>



<p>But here&#8217;s where it gets really interesting. Flink CDC gives us something that the trigger and logical replication approaches don&#8217;t offer in one place: a stream processor, outside the database, with <strong>access to both the before and after images of every row change</strong>. (A trigger does see <code>OLD</code> and <code>NEW</code>, but any comparison it makes runs inside the writing transaction.) Debezium, which Flink CDC uses under the hood, captures the full row state before and after the UPDATE. This means we can <strong>evaluate whether a change is significant enough to warrant re-embedding</strong> directly inside the pipeline, before hitting any embedding API.</p>



<h3 class="wp-block-heading" id="h-why-this-matters">Why this matters</h3>



<p>Not every UPDATE to a document requires a new embedding. Think about it:</p>



<ul class="wp-block-list">
<li>Someone fixes a typo: &#8220;PostgreSLQ&#8221; → &#8220;PostgreSQL&#8221; — <strong>probably not worth re-embedding</strong></li>



<li>Someone updates a metadata field (status, last_reviewed_by): <strong>definitely not worth re-embedding</strong> (metadata filtering should be done in the WHERE clause)</li>



<li>Someone rewrites two paragraphs and adds a new section — <strong>yes, re-embed</strong></li>



<li>Someone changes a single KPI number in a financial report — <strong>depends on context, but the semantic meaning shifted</strong></li>
</ul>



<p>In a busy knowledge base, most row-level changes are minor. If your pipeline blindly re-embeds on every UPDATE, you&#8217;re burning API credits, creating unnecessary load on the embedding worker, and churning your DiskANN index for no semantic gain. The question is: <strong>can we be smarter about this?</strong></p>



<h3 class="wp-block-heading" id="h-the-architecture-with-change-significance-filtering">The architecture with change significance filtering</h3>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="934" src="https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/image-11-1024x934.png" alt="" class="wp-image-43064" srcset="https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/image-11-1024x934.png 1024w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/image-11-300x274.png 300w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/image-11-768x701.png 768w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/image-11.png 1189w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>The key insight: <strong>separate the data replication (all changes) from the embedding trigger (only significant changes)</strong>. The data mart gets everything — it&#8217;s a faithful replica. But the embedding queue only receives changes where the content actually shifted enough to matter semantically.</p>



<h3 class="wp-block-heading" id="h-change-significance-the-approaches">Change significance: the approaches</h3>



<p>There are several ways to evaluate whether a change is &#8220;significant enough&#8221; for re-embedding. I want to walk through each one because they have very different trade-offs.</p>



<h4 class="wp-block-heading" id="h-approach-1-column-aware-filtering-simplest-start-here">Approach 1: Column-aware filtering (simplest, start here)</h4>



<p>The cheapest filter: only trigger re-embedding when specific content columns change. If someone updates <code>status</code>, <code>last_reviewed_by</code>, <code>category</code>, or any metadata field, skip the embedding entirely.</p>
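
<p>Stripped of any Flink specifics, the rule is a comparison of the before and after row images restricted to the content columns. A minimal sketch, assuming both images are available as dictionaries; which columns count as content (<code>title</code> and <code>content</code> here) and the function name are assumptions to adapt to your schema:</p>

```python
# Columns whose changes alter the embedded text (an assumption).
CONTENT_COLUMNS = ("title", "content")

def content_changed(before, after, content_columns=CONTENT_COLUMNS):
    """Return True when the change should be queued for re-embedding.

    before/after are dicts holding the row images of one change event;
    before is None for an INSERT.
    """
    if before is None:
        return True  # new document: nothing has been embedded yet
    return any(before.get(col) != after.get(col) for col in content_columns)

old = {"doc_id": 1, "title": "WAL basics", "content": "text", "status": "draft"}
metadata_update = dict(old, status="published")
content_update = dict(old, content="rewritten text")

print(content_changed(old, metadata_update))  # False
print(content_changed(old, content_update))   # True
print(content_changed(None, old))             # True
```
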



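<p>Debezium change events carry an <code>op</code> field (operation type) together with the old and new row values, but as discussed further down, Flink SQL does not expose the before image as a column you can select. The starting point is therefore a plain CDC source plus a sink that replicates every change:</p>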


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- CDC source table (changelog stream of the documents table)
CREATE TABLE src_documents (
    doc_id          BIGINT,
    title           STRING,
    content         STRING,
    category        STRING,
    status          STRING,
    updated_at      TIMESTAMP(3),
    PRIMARY KEY (doc_id) NOT ENFORCED
) WITH (
    &#039;connector&#039; = &#039;postgres-cdc&#039;,
    &#039;hostname&#039; = &#039;172.19.0.4&#039;,
    &#039;port&#039; = &#039;5432&#039;,
    &#039;username&#039; = &#039;postgres&#039;,
    &#039;password&#039; = &#039;...&#039;,
    &#039;database-name&#039; = &#039;knowledge_base&#039;,
    &#039;schema-name&#039; = &#039;public&#039;,
    &#039;table-name&#039; = &#039;documents&#039;,
    &#039;slot.name&#039; = &#039;flink_documents_slot&#039;,
    &#039;decoding.plugin.name&#039; = &#039;pgoutput&#039;,
    &#039;scan.incremental.snapshot.enabled&#039; = &#039;true&#039;
);

-- JDBC sink for ALL changes (data mart replication)
CREATE TABLE dm_documents (
    doc_id          BIGINT,
    title           STRING,
    content         STRING,
    category        STRING,
    status          STRING,
    updated_at      TIMESTAMP(3),
    PRIMARY KEY (doc_id) NOT ENFORCED
) WITH (
    &#039;connector&#039; = &#039;jdbc&#039;,
    &#039;url&#039; = &#039;jdbc:postgresql://172.20.0.4:5432/vector_db&#039;,
    &#039;table-name&#039; = &#039;documents_replica&#039;,
    &#039;username&#039; = &#039;postgres&#039;,
    &#039;password&#039; = &#039;...&#039;,
    &#039;driver&#039; = &#039;org.postgresql.Driver&#039;
);

-- Replicate everything to the data mart
INSERT INTO dm_documents SELECT * FROM src_documents;

</pre></div>


<p>For the embedding queue, we need to be selective. This is where a Flink SQL view or a <code>ProcessFunction</code> comes in. Since Flink SQL CDC doesn&#8217;t natively expose the before-image in the SELECT, the simplest approach is to use the <strong>content_hash</strong> strategy from Level 1: the trigger on <code>documents_replica</code> compares <code>content_hash</code> and only queues when it actually changed.</p>



<p>But if you want the filtering to happen <strong>inside Flink</strong> (before hitting the target at all), you need a UDF.</p>



<h4 class="wp-block-heading" id="h-approach-2-text-diff-ratio-udf-the-sweet-spot">Approach 2: Text diff ratio (UDF — the sweet spot)</h4>



<p>This is where it gets interesting. We register a custom Flink UDF that computes the <strong>similarity ratio</strong> between the old and new content, and only emits the row to the embedding queue when the change exceeds a threshold.</p>


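<div class="wp-block-syntaxhighlighter-code "><pre class="brush: java; title: ; notranslate">
import org.apache.flink.table.annotation.DataTypeHint;
import org.apache.flink.table.annotation.FunctionHint;
import org.apache.flink.table.functions.ScalarFunction;

/**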
 * Flink UDF: computes text change ratio between two strings.
 * Returns a value between 0.0 (completely different) and 1.0 (identical).
 * 
 * Uses a simplified approach: character-level diff ratio.
 * For production, consider token-level or sentence-level comparison.
 */
@FunctionHint(output = @DataTypeHint(&quot;DOUBLE&quot;))
public class TextChangeRatio extends ScalarFunction {
    
    public Double eval(String before, String after) {
        if (before == null || after == null) return 0.0;
        if (before.equals(after)) return 1.0;
        
        // Longest Common Subsequence ratio
        int lcs = lcsLength(before, after);
        int maxLen = Math.max(before.length(), after.length());
        
        return maxLen == 0 ? 1.0 : (double) lcs / maxLen;
    }
    
    private int lcsLength(String a, String b) {
        // Optimized for streaming: use rolling array, not full matrix
        int&#x5B;] prev = new int&#x5B;b.length() + 1];
        int&#x5B;] curr = new int&#x5B;b.length() + 1];
        for (int i = 1; i &lt;= a.length(); i++) {
            for (int j = 1; j &lt;= b.length(); j++) {
                if (a.charAt(i-1) == b.charAt(j-1)) {
                    curr&#x5B;j] = prev&#x5B;j-1] + 1;
                } else {
                    curr&#x5B;j] = Math.max(prev&#x5B;j], curr&#x5B;j-1]);
                }
            }
            int&#x5B;] tmp = prev; prev = curr; curr = tmp;
            java.util.Arrays.fill(curr, 0);
        }
        return prev&#x5B;b.length()];
    }
}

</pre></div>


<p>Register and use it in Flink SQL:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Register the UDF
CREATE FUNCTION text_change_ratio AS &#039;com.example.TextChangeRatio&#039;;

</pre></div>


<p>Now, the challenge here is that <strong>Flink SQL CDC doesn&#8217;t directly expose the &#8220;before&#8221; image as a column you can SELECT</strong>. The changelog stream has INSERT (+I), UPDATE_BEFORE (-U), UPDATE_AFTER (+U), and DELETE (-D) operations, but in a standard <code>SELECT * FROM cdc_table</code>, you only see the latest state.</p>



<p>To access both before and after, you have two options:</p>



<p><strong>Option A: Stateful ProcessFunction (Java/Python)</strong></p>



<p>This is the cleanest approach. You write a <code>KeyedProcessFunction</code> that maintains the previous state of each document in Flink&#8217;s managed state, and compares it with the incoming change:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
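# Pseudocode for the ProcessFunction approach (imports shown for PyFlink)
from pyflink.common import Row, Types
from pyflink.datastream.functions import KeyedProcessFunction
from pyflink.datastream.state import ValueStateDescriptor

class ChangeSignificanceFilter(KeyedProcessFunction):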
    
    def __init__(self, threshold=0.95):
        self.threshold = threshold  # 0.95 = skip if 95%+ similar
    
    def open(self, runtime_context):
        # Flink managed state: stores last known content per doc_id
        self.last_content = runtime_context.get_state(
            ValueStateDescriptor(&quot;last_content&quot;, Types.STRING())
        )
    
    def process_element(self, row, ctx):
        doc_id = row&#x5B;&#039;doc_id&#039;]
        new_content = row&#x5B;&#039;content&#039;]
        old_content = self.last_content.value()
        
        # Always update state
        self.last_content.update(new_content)
        
        if old_content is None:
            # INSERT: always emit (new document)
            yield Row(doc_id=doc_id, needs_embedding=True, 
                      change_ratio=0.0, operation=&#039;INSERT&#039;)
            return
        
        if old_content == new_content:
            # Content identical: metadata-only change, skip
            return
        
        # Compute change ratio
        ratio = text_similarity(old_content, new_content)
        
        if ratio &lt; self.threshold:
            # Significant change: emit for re-embedding
            yield Row(doc_id=doc_id, needs_embedding=True,
                      change_ratio=round(1.0 - ratio, 4), 
                      operation=&#039;UPDATE&#039;)
        else:
            # Minor change (typo fix, formatting): skip embedding
            # Optionally log for monitoring
            yield Row(doc_id=doc_id, needs_embedding=False,
                      change_ratio=round(1.0 - ratio, 4),
                      operation=&#039;SKIP&#039;)


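def text_similarity(a: str, b: str) -&gt; float:
    &quot;&quot;&quot;Similarity between 0.0 and 1.0 using difflib SequenceMatcher.&quot;&quot;&quot;
    from difflib import SequenceMatcher
    # autojunk=False: the default heuristic treats frequently occurring
    # characters as junk once the second string reaches 200 characters, which
    # can badly underestimate similarity on real documents (at some CPU cost)
    return SequenceMatcher(None, a, b, autojunk=False).ratio()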

</pre></div>
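
<p>The decision logic can be exercised without a Flink cluster. The sketch below is a plain-Python stand-in for the keyed state, with one deliberate difference from the pseudocode above that should be read as an assumption rather than as the lab&#8217;s behavior: the stored baseline only advances when a change is actually sent for embedding. With that variant, a long series of small edits is always compared against the text that was last embedded, so accumulated drift eventually crosses the threshold instead of being skipped forever.</p>

```python
from difflib import SequenceMatcher

class SignificanceFilter:
    """Dict-backed stand-in for Flink keyed state (illustrative only)."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.baseline = {}  # doc_id -> content that was last sent for embedding

    def decide(self, doc_id, new_content):
        old_content = self.baseline.get(doc_id)
        if old_content is None:
            self.baseline[doc_id] = new_content
            return "EMBED"  # first version seen for this document
        if old_content == new_content:
            return "SKIP"   # metadata-only change
        # autojunk=False keeps the ratio reliable for texts of 200+ characters
        matcher = SequenceMatcher(None, old_content, new_content, autojunk=False)
        if matcher.ratio() >= self.threshold:
            return "SKIP"   # minor change: keep comparing against the old baseline
        self.baseline[doc_id] = new_content  # new baseline after re-embedding
        return "EMBED"

f = SignificanceFilter(threshold=0.95)
doc = "PostgreSLQ stores embeddings in a vector column. " * 5
decisions = [
    f.decide(1, doc),                                          # new document
    f.decide(1, doc.replace("PostgreSLQ", "PostgreSQL", 1)),   # typo fix
    f.decide(1, doc + "A new section about DiskANN index tuning. " * 3),
]
print(decisions)  # ['EMBED', 'SKIP', 'EMBED']
```

<p>In a real job the dictionary would be replaced by <code>ValueState</code>, and the same choice applies there: updating the state on every event compares each change with the previous version, while updating it only on EMBED compares with the last embedded version.</p>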


<p><strong>Option B: Self-join with temporal table (Flink SQL)</strong></p>



<p>If you want to stay in pure SQL, you can maintain a &#8220;previous version&#8221; table and join against it:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Maintain a snapshot of previous content in a JDBC-backed table
CREATE TABLE content_snapshots (
    doc_id      BIGINT,
    content     STRING,
    content_md5 STRING,
    PRIMARY KEY (doc_id) NOT ENFORCED
) WITH (
    &#039;connector&#039; = &#039;jdbc&#039;,
    &#039;url&#039; = &#039;jdbc:postgresql://172.20.0.4:5432/vector_db&#039;,
    &#039;table-name&#039; = &#039;content_snapshots&#039;,
    &#039;username&#039; = &#039;postgres&#039;,
    &#039;password&#039; = &#039;...&#039;,
    &#039;driver&#039; = &#039;org.postgresql.Driver&#039;
);

-- Write ALL changes to the snapshot table (upsert)
INSERT INTO content_snapshots
SELECT doc_id, content, MD5(content) FROM src_documents;

</pre></div>


<p>Then in the embedding trigger on the target side, compare the incoming <code>content_md5</code> against the previously stored one. If they differ, queue for embedding. This is essentially what the Level 1 trigger does, but now the CDC pipeline is handling the cross-database transport.</p>
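
<p>The comparison itself is tiny. A minimal sketch in Python, assuming the stored hash is the hex MD5 of the UTF-8 text (which is what PostgreSQL&#8217;s <code>md5(text)</code> returns on a UTF-8 database); the function names are hypothetical:</p>

```python
import hashlib

def content_md5(content):
    """Hex MD5 of the UTF-8 bytes of the content."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()

def should_queue(stored_md5, incoming_content):
    """Queue for embedding when there is no stored hash or it differs."""
    return stored_md5 is None or stored_md5 != content_md5(incoming_content)

stored = content_md5("pgvector stores embeddings")
print(should_queue(stored, "pgvector stores embeddings"))        # False
print(should_queue(stored, "pgvector stores embeddings in PG"))  # True
print(should_queue(None, "brand new document"))                  # True
```

<p>Note that a hash only answers whether the content changed at all, not by how much. It filters out metadata-only updates, but a one-character typo fix still produces a different hash.</p>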



<h4 class="wp-block-heading" id="h-approach-3-structural-change-analysis-most-sophisticated">Approach 3: Structural change analysis (most sophisticated)</h4>



<p>For knowledge bases with structured content (Markdown, HTML, technical documentation), you can go deeper than raw text diff. Analyze <strong>what kind of change</strong> happened:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
def analyze_change_significance(old_content: str, new_content: str) -&gt; dict:
    &quot;&quot;&quot;
    Analyze the structural significance of a content change.
    Returns a dict with metrics to decide whether re-embedding is needed.
    &quot;&quot;&quot;
    import re
    from difflib import SequenceMatcher
    
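    # autojunk=False avoids difflib&#039;s popularity heuristic, which can
    # underestimate similarity for texts of 200+ characters
    result = {
        &#039;char_ratio&#039;: SequenceMatcher(None, old_content, new_content, autojunk=False).ratio(),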
        &#039;paragraphs_added&#039;: 0,
        &#039;paragraphs_removed&#039;: 0,
        &#039;paragraphs_modified&#039;: 0,
        &#039;heading_changed&#039;: False,
        &#039;needs_embedding&#039;: False
    }
    
    # Split into paragraphs
    old_paras = &#x5B;p.strip() for p in old_content.split(&#039;\n\n&#039;) if p.strip()]
    new_paras = &#x5B;p.strip() for p in new_content.split(&#039;\n\n&#039;) if p.strip()]
    
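    # Align old and new paragraphs in order. A plain set difference would
    # count every edited paragraph as one removal plus one addition, so even
    # a typo fix would be flagged as a structural change.
    matcher = SequenceMatcher(None, old_paras, new_paras, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == &#039;equal&#039;:
            continue
        n_old, n_new = i2 - i1, j2 - j1
        result&#x5B;&#039;paragraphs_modified&#039;] += min(n_old, n_new)
        result&#x5B;&#039;paragraphs_added&#039;] += max(0, n_new - n_old)
        result&#x5B;&#039;paragraphs_removed&#039;] += max(0, n_old - n_new)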
    
    # Check if headings changed (strong signal for semantic shift)
    old_headings = set(re.findall(r&#039;^#{1,3}\s+(.+)$&#039;, old_content, re.MULTILINE))
    new_headings = set(re.findall(r&#039;^#{1,3}\s+(.+)$&#039;, new_content, re.MULTILINE))
    result&#x5B;&#039;heading_changed&#039;] = old_headings != new_headings
    
    # Decision logic
    if result&#x5B;&#039;heading_changed&#039;]:
        result&#x5B;&#039;needs_embedding&#039;] = True
        result&#x5B;&#039;reason&#039;] = &#039;heading_changed&#039;
    elif result&#x5B;&#039;paragraphs_added&#039;] &gt; 0 or result&#x5B;&#039;paragraphs_removed&#039;] &gt; 0:
        result&#x5B;&#039;needs_embedding&#039;] = True
        result&#x5B;&#039;reason&#039;] = &#039;structural_change&#039;
    elif result&#x5B;&#039;char_ratio&#039;] &lt; 0.90:
        result&#x5B;&#039;needs_embedding&#039;] = True
        result&#x5B;&#039;reason&#039;] = &#039;significant_text_change&#039;
    else:
        result&#x5B;&#039;needs_embedding&#039;] = False
        result&#x5B;&#039;reason&#039;] = &#039;minor_change&#039;
    
    return result

</pre></div>


<p>The idea here is that structural changes (new headings, added/removed sections) almost always shift the semantic meaning enough to warrant re-embedding, while inline text modifications need to cross a threshold.</p>



<h3 class="wp-block-heading" id="h-choosing-the-right-threshold">Choosing the right threshold</h3>



<p>This is the part where I have to be honest: <strong>I don&#8217;t have a definitive answer on the optimal threshold</strong>. It depends on your data, your embedding model, and your quality requirements.</p>



<p>What I can tell you from experimentation:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Change Type</th><th>Similarity Ratio (1.0 = identical)</th><th>Should Re-embed?</th><th>Why</th></tr></thead><tbody><tr><td>Typo fix (&#8220;PostgreSLQ&#8221; → &#8220;PostgreSQL&#8221;)</td><td>0.99+</td><td>No</td><td>Semantic meaning unchanged</td></tr><tr><td>Reformatting (whitespace, punctuation)</td><td>0.95+</td><td>No</td><td>Embedding models are robust to formatting</td></tr><tr><td>Single sentence rewritten</td><td>0.85-0.95</td><td>Maybe</td><td>Depends on the sentence&#8217;s importance</td></tr><tr><td>Paragraph added/removed</td><td>0.70-0.85</td><td>Yes</td><td>New information or removed context</td></tr><tr><td>Major rewrite (&gt;30% changed)</td><td>&lt;0.70</td><td>Absolutely</td><td>Different document semantically</td></tr><tr><td>Metadata-only (status, category)</td><td>1.0 (content)</td><td>No</td><td>Content columns unchanged</td></tr></tbody></table></figure>



<p>My <strong>starting recommendation</strong>: set the threshold at <strong>0.95</strong> (i.e., re-embed when the similarity ratio drops below 0.95, roughly when more than 5% of the text changed). Then monitor your RAG quality metrics (nDCG, retrieval precision from the <a href="https://www.dbi-services.com/blog/rag-series-adaptive-rag-understanding-confidence-precision-ndcg/" target="_blank" rel="noreferrer noopener">RAG series &#8211; Adaptive RAG</a> post) and adjust. If you&#8217;re missing relevant results because too many changes are skipped, raise the threshold so that smaller edits also trigger re-embedding. If you&#8217;re burning too many API credits on trivial changes, lower it.</p>



<p>I validated these numbers on the Wikipedia dataset in <a href="https://www.dbi-services.com/blog/rag-series-embedding-versioning-lab/" target="_blank" rel="noreferrer noopener">Part 2 of this post</a>. The results cleanly confirmed the 0.95 threshold: typo fixes scored 0.998+ (SKIP), paragraph additions scored ~0.93 (EMBED), and section rewrites scored 0.51–0.63 (definitely EMBED).</p>
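
<p>Before changing the threshold in production, it can help to replay logged similarity ratios against a few candidate values. A minimal sketch; the sample ratios below are made up for illustration and the function name is hypothetical:</p>

```python
def embed_rate(similarities, threshold):
    """Share of changes that would be re-embedded at a given threshold.

    A change is skipped when its similarity ratio is at or above the
    threshold, and re-embedded when it falls strictly below it.
    """
    if not similarities:
        return 0.0
    skipped = sum(1 for s in similarities if s >= threshold)
    return (len(similarities) - skipped) / len(similarities)

# Hypothetical similarity ratios, e.g. read back from a decision log
logged = [0.998, 0.997, 0.99, 0.97, 0.96, 0.93, 0.88, 0.63, 0.51, 1.0]

for threshold in (0.90, 0.95, 0.98):
    print(threshold, embed_rate(logged, threshold))
# 0.9 0.3
# 0.95 0.4
# 0.98 0.6
```

<p>The re-embedding rate rises with the threshold, because the rule re-embeds everything that falls below it. Replaying real log data this way gives a cost estimate for each candidate value before you commit to it.</p>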



<h3 class="wp-block-heading" id="h-the-monitoring-table">The monitoring table</h3>



<p>Whatever approach you choose, log the decisions. This is invaluable for tuning:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
CREATE TABLE embedding_change_log (
    log_id          BIGSERIAL PRIMARY KEY,
    doc_id          BIGINT NOT NULL,
    similarity      NUMERIC(5,4),       -- 0.0000 to 1.0000
    decision        TEXT NOT NULL,       -- &#039;EMBED&#039; or &#039;SKIP&#039;
    reason          TEXT,                -- &#039;structural_change&#039;, &#039;minor_change&#039;, etc.
    old_content_md5 TEXT,
    new_content_md5 TEXT,
    details         JSONB,              -- optional: paragraph_similarity, char_diff, etc.
    decided_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- How many re-embeddings are we avoiding?
SELECT decision, count(*), 
       round(100.0 * count(*) / sum(count(*)) OVER (), 1) AS pct
FROM embedding_change_log
WHERE decided_at &gt; now() - INTERVAL &#039;7 days&#039;
GROUP BY decision;

-- Result example:
--  decision | count | pct
-- ----------+-------+------
--  SKIP     |  1847 | 73.2
--  EMBED    |   675 | 26.8

</pre></div>


<p>In this example, 73% of the content changes were minor enough to skip. That&#8217;s 73% fewer embedding API calls, 73% less index churn, and a quieter, more stable RAG system.</p>



<h3 class="wp-block-heading" id="h-a-note-on-baseline-the-first-run">A note on baseline: the first run</h3>



<p>One thing that&#8217;s not obvious until you deploy this: <strong>the change detector needs existing embeddings to compare against</strong>. On the very first run, or for any document that has never been embedded, the similarity will be 0.0 (no previous embedding to compare), and the decision will always be EMBED. The SKIP optimization only kicks in on subsequent changes after a baseline exists.</p>



<p>This is correct behavior, but it means your initial backfill will process everything regardless of the threshold setting. Plan for it.</p>



<h3 class="wp-block-heading" id="h-full-architecture-recap">Full architecture recap</h3>



<p>I won&#8217;t repeat the full Flink setup here; refer to my <a href="https://www.dbi-services.com/blog/postgresql-cdc-to-jdbc-sink-minimal-event-driven-architecture/">CDC to JDBC Sink</a> and <a href="https://www.dbi-services.com/blog/oracle-to-postgresql-migration-with-flink-cdc/">Oracle to PostgreSQL Migration with Flink CDC</a> posts for the step-by-step lab. The addition here is the significance filter sitting between the CDC source and the embedding sink.</p>



<p>One option I want to flag, although I haven&#8217;t fully tested it at scale, is <strong>embedding directly in the Flink pipeline</strong>. You could write a custom <code>ProcessFunction</code> that calls the embedding API and writes both the source data and the embeddings to the target in one atomic checkpoint. This eliminates the queue entirely. The concern is rate limiting and latency of embedding API calls within a streaming pipeline: if the API is slow, it creates backpressure all the way to the CDC source. For now, I&#8217;d recommend the <strong>JDBC sink + trigger + worker</strong> approach as the safer pattern, and explore inline embedding only if you have a local embedding model (like Ollama) with predictable latency.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-model-versioning-the-upgrade-problem">Model Versioning: The Upgrade Problem</h2>



<p>Everything above handles <strong>content changes</strong>. But there&#8217;s another dimension: <strong>model changes</strong>.</p>



<p>When you upgrade from <code>text-embedding-3-small</code> to <code>text-embedding-3-large</code>, or from <code>v1</code> to <code>v2</code> of any model, <strong>all your existing embeddings become incompatible</strong>. This is not optional. Different models produce different vector spaces. You cannot mix embeddings from different models in the same index — the similarity scores become meaningless.</p>
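
<p>One way to make this rule hard to violate in application code is to carry the model version alongside every vector and refuse cross-version comparisons outright. A minimal sketch with hypothetical names (<code>StoredEmbedding</code>, <code>cosine_similarity</code>) and toy vectors; note that two models producing the same number of dimensions are still incompatible, which is why the check is on the version and not only on the length:</p>

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class StoredEmbedding:
    model_version: str
    vector: tuple

def cosine_similarity(a: StoredEmbedding, b: StoredEmbedding) -> float:
    """Cosine similarity that refuses to compare across model versions.

    Assumes non-zero vectors (a zero vector would divide by zero).
    """
    if a.model_version != b.model_version:
        raise ValueError(
            f"cannot compare {a.model_version} with {b.model_version} embeddings"
        )
    if len(a.vector) != len(b.vector):
        raise ValueError("dimension mismatch")
    dot = sum(x * y for x, y in zip(a.vector, b.vector))
    norm_a = math.sqrt(sum(x * x for x in a.vector))
    norm_b = math.sqrt(sum(y * y for y in b.vector))
    return dot / (norm_a * norm_b)

query_v1 = StoredEmbedding("v1", (1.0, 0.0, 0.0))
doc_v1 = StoredEmbedding("v1", (1.0, 0.0, 0.0))
doc_v2 = StoredEmbedding("v2", (0.2, 0.9, 0.1, 0.4))

print(cosine_similarity(query_v1, doc_v1))  # 1.0
try:
    cosine_similarity(query_v1, doc_v2)
except ValueError as exc:
    print("rejected:", exc)
```
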



<p>This is why the <code>model_version</code> column in our schema matters. Here&#8217;s the upgrade procedure:</p>



<h3 class="wp-block-heading" id="h-step-1-deploy-new-embeddings-alongside-old-ones">Step 1: Deploy new embeddings alongside old ones</h3>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Create a new worker (or update the config) with the new model
-- MODEL_VERSION = &quot;v2&quot;
-- MODEL_NAME = &quot;text-embedding-3-large&quot;

-- The worker will populate document_embeddings with model_version = &#039;v2&#039;
-- while model_version = &#039;v1&#039; embeddings remain untouched and is_current = true

</pre></div>


<h3 class="wp-block-heading" id="h-step-2-build-a-separate-index-for-the-new-model">Step 2: Build a separate index for the new model</h3>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- New partial index for v2 embeddings (3072 dimensions for text-embedding-3-large)
CREATE INDEX idx_embeddings_diskann_v2 ON document_embeddings 
    USING diskann (embedding vector_cosine_ops)
    WHERE is_current = true AND model_version = &#039;v2&#039;;

</pre></div>


<h3 class="wp-block-heading" id="h-step-3-run-both-in-parallel-shadow-mode">Step 3: Run both in parallel (shadow mode)</h3>



<p>During shadow mode, <strong>both v1 and v2 have <code>is_current = true</code></strong>. That&#8217;s intentional. Your search queries must always scope by <code>model_version</code>, not just <code>is_current</code>. Each partial index covers one version, so PostgreSQL can use the matching index when the query includes <code>AND model_version = 'v2'</code>.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
# In your RAG query pipeline, query both and compare
results_v1 = search(query, model_version=&#039;v1&#039;)
results_v2 = search(query, model_version=&#039;v2&#039;)

# Log both, serve v1 to users, compare nDCG scores
log_comparison(query, results_v1, results_v2)

</pre></div>
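
<p>What the comparison step computes is up to you. As a hedged illustration, the sketch below shows two metrics that could be logged for each shadow query: the overlap between the two top-k lists, and nDCG@k for each version when relevance judgments exist. The function names, document ids and relevance grades are hypothetical; the nDCG formula is the standard one with a <code>log2(rank + 1)</code> discount.</p>

```python
import math

def overlap_at_k(ids_a, ids_b, k=10):
    """Jaccard overlap of the top-k document ids returned by two searches."""
    top_a, top_b = set(ids_a[:k]), set(ids_b[:k])
    if not top_a and not top_b:
        return 1.0
    return len(top_a.intersection(top_b)) / len(top_a.union(top_b))

def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k with graded relevance; ids missing from relevance count as 0."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    # enumerate starts at 0, so position 1 gets a discount of log2(2) = 1
    dcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical judged query: doc 7 is highly relevant, docs 3 and 9 partially
relevance = {7: 3, 3: 1, 9: 1}
results_v1 = [3, 7, 12, 9]
results_v2 = [7, 3, 9, 15]

print(overlap_at_k(results_v1, results_v2, k=4))        # 0.6
print(round(ndcg_at_k(results_v1, relevance, k=4), 3))  # 0.805
print(ndcg_at_k(results_v2, relevance, k=4))            # 1.0
```

<p>A low overlap is not a problem in itself, since a better model is expected to return different documents. It becomes interesting when combined with nDCG on the judged queries, which tells you whether the differences are improvements.</p>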


<h3 class="wp-block-heading" id="h-step-4-cut-over">Step 4: Cut over</h3>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Once confident, mark v1 as not current
UPDATE document_embeddings 
SET is_current = false 
WHERE model_version = &#039;v1&#039;;

-- Drop old index
DROP INDEX idx_embeddings_diskann;

-- Optionally purge old embeddings once you no longer need a rollback path
-- DELETE FROM document_embeddings WHERE model_version = &#039;v1&#039;;

</pre></div>


<h3 class="wp-block-heading" id="h-step-5-update-the-worker-config">Step 5: Update the worker config</h3>



<p>Switch the worker to produce <code>v2</code> embeddings for all new changes going forward.</p>



<p><strong>The point is</strong>: with the versioned schema and partial indexes, model upgrades become a <strong>blue-green deployment for embeddings</strong>. No downtime, no inconsistent state, full rollback capability. This is exactly the same principle as <a href="https://www.dbi-services.com/blog/postgresql-17-to-18-migration-blue-green/">the PostgreSQL 17→18 blue-green upgrade</a> I wrote about, applied to vector data.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-a-note-on-pgai-vectorizer">A Note on pgai Vectorizer</h2>



<p>I want to mention <strong>pgai Vectorizer</strong> by Timescale because it solves a lot of what I&#8217;ve described above out of the box. It uses PostgreSQL triggers internally, handles automatic synchronization, supports chunking configuration, and manages the embedding lifecycle with a declarative SQL command:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
SELECT ai.create_vectorizer(
    &#039;documents&#039;::regclass,
    loading     =&gt; ai.loading_column(&#039;content&#039;),
    destination =&gt; ai.destination_table(&#039;document_embeddings&#039;),
    embedding   =&gt; ai.embedding_openai(&#039;text-embedding-3-small&#039;, 768),
    chunking    =&gt; ai.chunking_recursive_character_text_splitter(500, 50)
);

</pre></div>


<p>After this, any INSERT/UPDATE/DELETE on <code>documents</code> automatically triggers re-embedding. The vectorizer worker handles batching, rate limit retries, and error recovery. It&#8217;s essentially the Level 1 pattern I described, but packaged as a production-ready tool, and since April 2025, it works with any PostgreSQL database (not just Timescale Cloud) via a Python library.</p>



<p><strong>Why I still showed you the manual approach first</strong>: because in consulting, I rarely see a greenfield setup. Most projects have constraints: specific PostgreSQL versions, restricted extensions, air-gapped environments, or the need to integrate with an existing CDC pipeline. Understanding the underlying pattern lets you adapt it to your context. pgai Vectorizer is excellent if it fits your deployment, but the principles remain the same regardless of the tooling.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-monitoring-embedding-freshness">Monitoring Embedding Freshness</h2>



<p>One more thing that nobody talks about: <strong>how do you know your embeddings are stale?</strong></p>



<p>There are two categories of signals: <strong>infrastructure signals</strong> (is the pipeline healthy?) and <strong>quality signals</strong> (is retrieval degrading?). Most teams only monitor the first. The second is what actually matters to your users.</p>



<h3 class="wp-block-heading" id="h-infrastructure-signals-pipeline-health">Infrastructure signals: pipeline health</h3>



<p>Here are the queries I use in production to monitor the embedding pipeline itself:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- 1. Documents with no current embeddings
SELECT d.doc_id, d.title, d.updated_at
FROM documents d
LEFT JOIN document_embeddings e 
    ON d.doc_id = e.doc_id AND e.is_current = true
WHERE e.embedding_id IS NULL AND d.is_active = true
ORDER BY d.updated_at DESC;

-- 2. Documents where content changed since last embedding
-- Uses LATERAL join to pick one representative row per document deterministically,
-- avoiding edge cases where chunks have mixed source_hash values (partial retries, etc.)
SELECT d.doc_id, d.title,
       d.updated_at AS doc_updated,
       e.embedded_at AS last_embedded,
       d.updated_at - e.embedded_at AS staleness
FROM documents d
JOIN LATERAL (
    SELECT embedded_at, source_hash
    FROM document_embeddings
    WHERE doc_id = d.doc_id
      AND is_current = true
    ORDER BY embedded_at DESC
    LIMIT 1
) e ON true
WHERE d.is_active = true
  AND d.content_hash IS DISTINCT FROM e.source_hash
ORDER BY staleness DESC;

-- 3. Queue health check
SELECT status, count(*), 
       avg(EXTRACT(EPOCH FROM (processed_at - claimed_at)))::int AS avg_processing_sec,
       avg(EXTRACT(EPOCH FROM (claimed_at - queued_at)))::int AS avg_wait_sec
FROM embedding_queue
WHERE queued_at &gt; now() - INTERVAL &#039;24 hours&#039;
GROUP BY status;

-- 4. Embedding coverage by model version
SELECT model_version, 
       count(DISTINCT doc_id) AS documents,
       count(*) AS total_chunks,
       count(*) FILTER (WHERE is_current) AS current_chunks
FROM document_embeddings
GROUP BY model_version;

</pre></div>


<p>Put these in a Grafana dashboard or your monitoring of choice. The staleness query (#2) is your early warning system — if documents are drifting from their embeddings, something is wrong in your pipeline.</p>



<p>But here&#8217;s the thing: <strong>a healthy pipeline doesn&#8217;t guarantee good retrieval</strong>. Your queue could be empty, your workers could be processing in sub-second latency, and your embeddings could still be degraded. Why? Because the pipeline only tells you that <em>something was embedded</em> — not that <em>the embeddings are good</em>.</p>



<h3 class="wp-block-heading" id="h-quality-signals-when-your-rag-tells-you-embeddings-are-stale">Quality signals: when your RAG tells you embeddings are stale</h3>



<p>This is the section I promised earlier when I said building the pipeline is one thing, but proving you&#8217;re going in the right direction is everything. This is where the work we did in the <a href="https://www.dbi-services.com/blog/rag-series-adaptive-rag-understanding-confidence-precision-ndcg/">Adaptive RAG</a> post becomes critical. The metrics we introduced there, <strong>precision@k, recall@k, nDCG@k, and confidence scores</strong>, are not just evaluation tools for tuning your search weights. They are early warning signals for embedding drift.</p>



<p>Think about what happens when embeddings go stale:</p>



<ul class="wp-block-list">
<li>A document was updated with important new information, but the embedding still reflects the old content</li>



<li>Similarity search retrieves the document (the old embedding is close enough), but the chunk text no longer matches the query&#8217;s intent</li>



<li>The LLM generates an answer based on outdated context</li>



<li><strong>Precision drops</strong>: retrieved documents are less relevant</li>



<li><strong>nDCG drops</strong>: the ranking quality degrades because truly relevant (updated) documents are ranked lower than stale ones that happen to have closer embeddings</li>



<li><strong>Confidence drops</strong>: the gap between top results narrows, the system becomes less certain</li>
</ul>



<p>The pattern is subtle but measurable. Here&#8217;s how to capture it.</p>



<h4 class="wp-block-heading" id="h-retrieval-quality-logging-table">Retrieval quality logging table</h4>



<p>Extend the evaluation log from the Adaptive RAG post to include a timestamp dimension that allows you to track drift over time:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
CREATE TABLE retrieval_quality_log (
    log_id          BIGSERIAL PRIMARY KEY,
    query_text      TEXT NOT NULL,
    query_type      TEXT,                -- &#039;factual&#039;, &#039;conceptual&#039;, &#039;exploratory&#039;
    search_method   TEXT NOT NULL,       -- &#039;adaptive&#039;, &#039;hybrid&#039;, &#039;naive&#039;
    confidence      NUMERIC(4,3),        -- 0.000 to 1.000
    precision_at_10 NUMERIC(4,3),
    recall_at_10    NUMERIC(4,3),
    ndcg_at_10      NUMERIC(4,3),
    avg_similarity  NUMERIC(4,3),        -- average cosine similarity of top-10
    top1_score      NUMERIC(4,3),        -- score of the #1 result
    score_gap       NUMERIC(4,3),        -- gap between #1 and #2 (confidence signal)
    logged_at       TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Index for time-series analysis
CREATE INDEX idx_quality_log_time ON retrieval_quality_log (logged_at DESC);

</pre></div>


<h4 class="wp-block-heading" id="h-the-drift-detection-queries">The drift detection queries</h4>



<p>Now the interesting part. These queries detect embedding staleness <em>through retrieval quality degradation</em>, not through pipeline metrics:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- 5. Weekly nDCG trend — is ranking quality degrading over time?
SELECT date_trunc(&#039;week&#039;, logged_at) AS week,
       round(avg(ndcg_at_10), 3) AS avg_ndcg,
       round(avg(precision_at_10), 3) AS avg_precision,
       round(avg(confidence), 3) AS avg_confidence,
       count(*) AS queries
FROM retrieval_quality_log
WHERE logged_at &gt; now() - INTERVAL &#039;3 months&#039;
GROUP BY week
ORDER BY week;

-- What you&#039;re looking for:
-- A slow, steady decline in avg_ndcg and avg_confidence over weeks
-- This is the signature of embedding drift — the pipeline is running,
-- but the embeddings are gradually falling behind the content

-- 6. Confidence distribution shift — are more queries becoming uncertain?
SELECT date_trunc(&#039;week&#039;, logged_at) AS week,
       round(100.0 * count(*) FILTER (WHERE confidence &gt;= 0.7) / count(*), 1) 
           AS pct_high_confidence,
       round(100.0 * count(*) FILTER (WHERE confidence &gt;= 0.5 AND confidence &lt; 0.7) / count(*), 1) 
           AS pct_medium_confidence,
       round(100.0 * count(*) FILTER (WHERE confidence &lt; 0.5) / count(*), 1) 
           AS pct_low_confidence
FROM retrieval_quality_log
WHERE logged_at &gt; now() - INTERVAL &#039;3 months&#039;
GROUP BY week
ORDER BY week;

-- If pct_low_confidence is climbing week over week, your embeddings 
-- are losing alignment with the actual content

</pre></div>
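

<p>To turn the weekly trend into an alert, one option is to compare each week with the previous one. The following is a minimal sketch against the <code>retrieval_quality_log</code> table; the 0.02 drop threshold is an arbitrary assumption that should be tuned against your own baseline:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Sketch: list the weeks where average nDCG fell by more than 0.02
-- compared with the previous week (0.02 is an assumed threshold)
WITH weekly AS (
    SELECT date_trunc(&#039;week&#039;, logged_at) AS week,
           avg(ndcg_at_10) AS avg_ndcg
    FROM retrieval_quality_log
    WHERE logged_at &gt; now() - INTERVAL &#039;3 months&#039;
    GROUP BY 1
),
trend AS (
    SELECT week,
           avg_ndcg,
           avg_ndcg - lag(avg_ndcg) OVER (ORDER BY week) AS delta
    FROM weekly
)
SELECT week,
       round(avg_ndcg, 3) AS avg_ndcg,
       round(delta, 3)    AS delta_vs_prev_week
FROM trend
WHERE delta &lt; -0.02
ORDER BY week;

</pre></div>


<p>A single bad week can be noise. Several consecutive rows in this result are a stronger hint that embeddings are drifting.</p>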


<h4 class="wp-block-heading" id="h-closing-the-loop-from-quality-signal-to-re-embedding-trigger">Closing the loop: from quality signal to re-embedding trigger</h4>



<p>Here&#8217;s where this connects back to the event-driven architecture. The quality metrics don&#8217;t just sit in a dashboard: they can <strong>trigger re-embedding</strong> for documents that the pipeline&#8217;s change significance filter might have skipped.</p>



<p>Remember the threshold discussion from Level 3: we set a 0.95 similarity ratio as the default, meaning changes under 5% are skipped. But what if a 3% change in a critical document is causing retrieval failures?</p>



<p>The feedback loop:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="705" src="https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/image-12-1024x705.png" alt="" class="wp-image-43065" srcset="https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/image-12-1024x705.png 1024w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/image-12-300x207.png 300w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/image-12-768x529.png 768w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/image-12.png 1474w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>In practice, you would implement this as a periodic job (daily or weekly) that correlates low-quality retrievals with stale documents. The correlation can be as simple as ILIKE matching query terms against document titles, or as sophisticated as tracking which document IDs were returned in low-confidence results. The key is that <code>change_type = 'quality_reembed'</code> is a distinct signal in your queue — it tells you the re-embedding was triggered by quality degradation, not by a content change event.</p>
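

<p>As an illustration of the document-ID variant, the periodic job could look like the sketch below. Both <code>retrieval_quality_hits</code> (which documents were returned for each logged query) and <code>embedding_queue</code> are hypothetical names, and the 7-day window, the 0.5 confidence cutoff, and the minimum of 5 appearances are assumptions to adapt:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Assumed tables (hypothetical, adapt to your schema):
--   retrieval_quality_hits(log_id, doc_id) : documents returned per logged query
--   embedding_queue(doc_id, change_type)   : the re-embedding queue
INSERT INTO embedding_queue (doc_id, change_type)
SELECT h.doc_id, &#039;quality_reembed&#039;
FROM retrieval_quality_log l
JOIN retrieval_quality_hits h USING (log_id)
WHERE l.logged_at &gt; now() - INTERVAL &#039;7 days&#039;
  AND l.confidence &lt; 0.5
GROUP BY h.doc_id
HAVING count(*) &gt;= 5;   -- assumed cutoff: at least 5 low-confidence appearances

</pre></div>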



<p>This is the complete picture: the event-driven pipeline handles the primary flow (react to data changes), the change significance filter optimizes it (skip trivial changes), and the quality monitoring loop catches what the filter missed. Three layers, each progressively more sophisticated, each compensating for the blind spots of the previous one.</p>



<p>As I wrote in the <a href="https://www.dbi-services.com/blog/rag-series-adaptive-rag-understanding-confidence-precision-ndcg/">Adaptive RAG post</a>: an old BI principle is to know your KPI, what it really measures but also when it fails to measure. The infrastructure metrics (queue depth, latency, skip rate) measure pipeline health. The quality metrics (precision, nDCG, confidence) measure <strong>what your users experience</strong>. You need both.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<p>Throughout this post, we&#8217;ve covered a progression from simple to complex:</p>



<p><strong>Level 1 — Triggers + Queue</strong>: Best for single-database setups. Zero external dependencies. PostgreSQL does the heavy lifting. Use <code>LISTEN/NOTIFY</code> for sub-second latency. This covers 80% of use cases.</p>



<p><strong>Level 2 — Logical Replication</strong>: Best when source and vector databases are separate PostgreSQL instances. The source team doesn&#8217;t need to modify anything. Built-in WAL-based CDC with automatic catch-up.</p>



<p><strong>Level 3 — Flink CDC + Change Significance Filtering</strong>: Best for heterogeneous sources (Oracle, MySQL) or fan-out to multiple targets. The change significance filter is the key addition — by comparing before/after images in the pipeline, you skip re-embedding for minor changes (typo fixes, metadata-only updates, formatting), which in practice eliminates 60-80% of unnecessary embedding API calls. Start with column-aware filtering, graduate to text diff ratio with a 0.95 threshold, and tune based on your RAG quality metrics.</p>



<p><strong>Model Versioning</strong>: Regardless of which level you choose, version your embeddings. Track <code>model_name</code>, <code>model_version</code>, and <code>source_hash</code>. Use partial DiskANN indexes (pgvectorscale). Treat model upgrades like blue-green deployments.</p>



<p><strong>Measurement</strong>: None of the above matters if you don&#8217;t instrument retrieval quality. The precision@k, recall@k, nDCG@k, and confidence metrics from the <a href="https://www.dbi-services.com/blog/rag-series-adaptive-rag-understanding-confidence-precision-ndcg/">Adaptive RAG</a> post aren&#8217;t a nice-to-have — they&#8217;re the only way to know whether your pipeline is actually keeping your RAG system healthy. Track them over time. Break them down by topic. Watch for drift. If you build the pipeline without the measurement layer, you&#8217;re flying blind. The evaluation framework in the <a href="https://github.com/boutaga/pgvector_RAG_search_lab">pgvector_RAG_search_lab repository</a> (<code>lab/evaluation/</code>) gives you a concrete starting point.</p>



<p><strong>The core principle</strong>: Event-driven architecture is a precondition for production RAG — but it&#8217;s not sufficient. The moment you accept batch re-embedding as &#8220;good enough,&#8221; you&#8217;re accepting that your RAG system will silently degrade between batches. The trigger/CDC approach doesn&#8217;t just keep embeddings fresh — it gives you <strong>observability</strong> into what changed, when it was embedded, and whether the change was significant enough to matter. But the pipeline only proves that work was done. The quality metrics prove the work was effective. Log every decision. Measure the skip rate. Tune the threshold. Track nDCG weekly. This is how you operationalize RAG.</p>



<p>If you&#8217;ve been building RAG systems without thinking about embedding freshness, now is the time to retrofit it. And if you&#8217;re starting a new RAG project, please design the embedding pipeline as event-driven from day one. Your future self will thank you. One thing I did not cover here is <em>what</em> should be embedded in the first place. That is less a technical question than a matter of knowing your data, your business, your applications, and your data workflows.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-what-s-next">What&#8217;s Next</h2>



<p>In <a href="https://www.dbi-services.com/blog/rag-series-embedding-versioning-lab/" target="_blank" rel="noreferrer noopener">Part 2</a>, I apply everything from this post to the <strong>25,000-article Wikipedia dataset</strong> from the <a href="https://github.com/boutaga/pgvector_RAG_search_lab">pgvector_RAG_search_lab</a> repository. You&#8217;ll see:</p>



<ul class="wp-block-list">
<li>How to adapt the schema to an existing table (no greenfield luxury)</li>



<li>Real SKIP vs EMBED decisions with actual similarity scores</li>



<li>The <code>SKIP LOCKED</code> multi-worker demo with 4 concurrent workers and zero overlap</li>



<li>A complete freshness monitoring report</li>



<li>The quality feedback loop triggering re-embeddings automatically</li>
</ul>



<p>A likely future blog post will cover benchmarks with pgvector and DiskANN indexes.</p>
<p>The post <a href="https://www.dbi-services.com/blog/rag-series-embedding-versioning-with-pgvector-why-event-driven-architecture-is-a-precondition-to-ai-data-workflows/">RAG Series – Embedding Versioning with pgvector: Why Event-Driven Architecture Is a Precondition to AI data workflows</a> first appeared on <a href="https://www.dbi-services.com/blog">dbi Blog</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.dbi-services.com/blog/rag-series-embedding-versioning-with-pgvector-why-event-driven-architecture-is-a-precondition-to-ai-data-workflows/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>RAG, MCP, Skills — Three Paradigms for LLMs Talking to Your Database, and Why Governance Changes Everything</title>
		<link>https://www.dbi-services.com/blog/rag-mcp-skills-three-paradigms-for-llms-talking-to-your-database-and-why-governance-changes-everything/</link>
					<comments>https://www.dbi-services.com/blog/rag-mcp-skills-three-paradigms-for-llms-talking-to-your-database-and-why-governance-changes-everything/#respond</comments>
		
		<dc:creator><![CDATA[Adrien Obernesser]]></dc:creator>
		<pubDate>Sun, 15 Feb 2026 18:10:05 +0000</pubDate>
				<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[AI/LLM]]></category>
		<category><![CDATA[embeddings]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[MCP Server]]></category>
		<category><![CDATA[pgvector]]></category>
		<category><![CDATA[RAG]]></category>
		<category><![CDATA[Skills]]></category>
		<guid isPermaLink="false">https://www.dbi-services.com/blog/?p=42941</guid>

					<description><![CDATA[<p>Introduction Throughout my RAG series, I&#8217;ve explored how to build retrieval-augmented generation systems with PostgreSQL and pgvector, from Naive RAG through Hybrid Search, Adaptive RAG, Agentic RAG, and Embedding Versioning (not yet released). The focus has always been on retrieval quality: precision, recall, nDCG, confidence. But I want to take a step back. [&#8230;]</p>
<p>The post <a href="https://www.dbi-services.com/blog/rag-mcp-skills-three-paradigms-for-llms-talking-to-your-database-and-why-governance-changes-everything/">RAG, MCP, Skills — Three Paradigms for LLMs Talking to Your Database, and Why Governance Changes Everything</a> first appeared on <a href="https://www.dbi-services.com/blog">dbi Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading" id="h-introduction">Introduction</h2>



<p>Throughout my <a href="https://www.dbi-services.com/blog/rag-series-naive-rag/">RAG series</a>, I&#8217;ve explored how to build retrieval-augmented generation systems with PostgreSQL and pgvector, from <a href="https://www.dbi-services.com/blog/rag-series-naive-rag/">Naive RAG</a> through <a href="https://www.dbi-services.com/blog/rag-series-hybrid-search-with-re-ranking/">Hybrid Search</a>, <a href="https://www.dbi-services.com/blog/rag-series-adaptive-rag-understanding-confidence-precision-ndcg/">Adaptive RAG</a>, <a href="https://www.dbi-services.com/blog/rag-series-agentic-rag/">Agentic RAG</a>, and <a href="https://www.dbi-services.com/blog/rag-series-embedding-versioning-pgvector/">Embedding Versioning</a> (not yet released). The focus has always been on retrieval quality: precision, recall, nDCG, confidence.</p>



<p>But I want to take a step back.</p>



<p>In conversations with colleagues, clients, and fellow PostgreSQL enthusiasts — most recently at the PG Day at CERN — I keep running into the same fundamental question: <strong>which paradigm should we use to connect LLMs to our data?</strong> Not &#8220;which is best for retrieval quality&#8221; but &#8220;which can we actually deploy in a regulated environment where data governance isn&#8217;t optional?&#8221;</p>



<p>Because in banking, healthcare, or any FINMA-regulated context, the question is never just &#8220;does it work?&#8221; The question is: <strong>can you prove it works safely, explain how it works, and guarantee it respects access controls?</strong></p>



<p>Today, three paradigms dominate how LLMs interact with structured and unstructured data:</p>



<ul class="wp-block-list">
<li><strong>RAG</strong> (Retrieval-Augmented Generation): the LLM receives pre-retrieved context, never touches the database directly</li>



<li><strong>MCP</strong> (Model Context Protocol): the LLM generates and executes queries against live data sources through standardized tool interfaces</li>



<li><strong>Skills</strong>: the LLM follows procedural instructions that define not just what to answer, but <em>how</em> to perform tasks step by step</li>
</ul>



<p>Each one draws the trust boundary in a fundamentally different place. And that changes everything for governance.</p>



<p>But let me be clear upfront: <strong>these three paradigms are complementary, not competing.</strong> There is overlap between them, and in some cases one can partially do what another does. But they each have distinct strengths, and the real architectural question is not &#8220;which one should I pick?&#8221; but &#8220;where in my system does each one belong?&#8221; A well-designed LLM-powered application might use RAG for knowledge retrieval, MCP for analytical queries, and Skills for procedural workflows — all within the same product, each governed appropriately for its trust model. The mistake I see teams make is treating this as an either/or decision when it&#8217;s fundamentally a composition problem.</p>



<p>It&#8217;s also worth noting that the boundaries between these paradigms are already blurring. Some MCP servers now ship with built-in RAG components: vector search is exposed as a tool the LLM can call, rather than as a separate pipeline. And over the past months, MCP itself has become increasingly abstracted: higher-level orchestration layers sit on top, hiding the protocol details from both the LLM and the developer. The consequence is that the frontiers between RAG, MCP, and Skills are becoming harder to draw cleanly.</p>



<p>This mirrors something I&#8217;ve observed more broadly in IT. Before AI-driven data workflows, roles had clear separations and were (too) easily siloed: infrastructure admins managed servers, developers wrote application code, DBAs owned the database layer. Each team had its perimeter, its tools, its responsibilities. With AI data workflows, the overlap and shared responsibilities are growing — the DBA needs to understand how the LLM consumes data, the developer needs to understand database-level access controls, the security team needs to understand embedding pipelines and prompt behavior. The silos don&#8217;t hold anymore. And I think this is actually one of the positive consequences of this shift: at least in IT, people will have to talk more and break silos. The governance challenges I&#8217;ll describe in this post are not solvable by any single team in isolation.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-a-quick-overview-what-each-paradigm-does">A Quick Overview: What Each Paradigm Does</h2>



<p>Before diving into governance, let&#8217;s make sure we&#8217;re on the same page about what each paradigm actually does. I&#8217;ll keep this brief because the nuances matter more than the definitions.</p>



<h3 class="wp-block-heading" id="h-rag-the-llm-reads-what-you-give-it">RAG — The LLM Reads What You Give It</h3>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="335" src="https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/Gemini_Generated_Image_babw14babw14babw-1024x335.png" alt="" class="wp-image-42944" srcset="https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/Gemini_Generated_Image_babw14babw14babw-1024x335.png 1024w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/Gemini_Generated_Image_babw14babw14babw-300x98.png 300w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/Gemini_Generated_Image_babw14babw14babw-768x251.png 768w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/Gemini_Generated_Image_babw14babw14babw-1536x503.png 1536w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/Gemini_Generated_Image_babw14babw14babw-2048x671.png 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>The LLM never sees your database. It sees chunks of text that a retrieval layer selected based on vector similarity. The database is behind a wall — the RAG pipeline decides what crosses it.</p>



<p>This is the paradigm I&#8217;ve spent the most time on in my series, and there&#8217;s a reason: <strong>from a DBA&#8217;s perspective, it&#8217;s the most controllable.</strong> Row-Level Security, data masking, column filtering, TDE — all the tools we&#8217;ve spent years mastering apply directly, because the retrieval layer queries the database on behalf of the user, and the database enforces its own rules.</p>



<h3 class="wp-block-heading" id="h-mcp-the-llm-talks-to-your-database">MCP — The LLM Talks to Your Database</h3>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="285" src="https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/Gemini_Generated_Image_ywj4ttywj4ttywj4-1024x285.png" alt="" class="wp-image-42945" srcset="https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/Gemini_Generated_Image_ywj4ttywj4ttywj4-1024x285.png 1024w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/Gemini_Generated_Image_ywj4ttywj4ttywj4-300x84.png 300w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/Gemini_Generated_Image_ywj4ttywj4ttywj4-768x214.png 768w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/Gemini_Generated_Image_ywj4ttywj4ttywj4-1536x428.png 1536w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/Gemini_Generated_Image_ywj4ttywj4ttywj4.png 1952w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>With MCP (Anthropic&#8217;s <a href="https://modelcontextprotocol.io/">Model Context Protocol</a>), the LLM doesn&#8217;t receive pre-filtered chunks. It receives <em>tools</em> — standardized interfaces to data sources, APIs, and services. The LLM decides which tool to call, formulates the query (often SQL), and interprets the results.</p>



<p>This is powerful. The LLM can explore data, join across tables, aggregate, filter — things that are impossible in a RAG pipeline where the retrieval is limited to vector similarity over pre-chunked text.</p>



<p>But the trust boundary has moved. The LLM is now an <strong>active agent</strong> inside your data perimeter, not a passive consumer of pre-filtered context.</p>



<h3 class="wp-block-heading" id="h-skills-the-llm-follows-your-procedures">Skills — The LLM Follows Your Procedures</h3>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="318" src="https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/Gemini_Generated_Image_pn6wwxpn6wwxpn6w-1024x318.png" alt="" class="wp-image-42946" srcset="https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/Gemini_Generated_Image_pn6wwxpn6wwxpn6w-1024x318.png 1024w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/Gemini_Generated_Image_pn6wwxpn6wwxpn6w-300x93.png 300w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/Gemini_Generated_Image_pn6wwxpn6wwxpn6w-768x238.png 768w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/Gemini_Generated_Image_pn6wwxpn6wwxpn6w-1536x477.png 1536w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2026/02/Gemini_Generated_Image_pn6wwxpn6wwxpn6w.png 1856w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>Skills go beyond querying data. A Skill is a set of instructions that tells the LLM not just <em>what</em> to answer, but <em>how</em> to perform a task — step by step, with specific tools, in a specific order, following specific constraints. Think of it as encoding a procedure, a runbook, or an expertise template that the LLM executes.</p>



<p>For example: a Skill might instruct the LLM to read a client&#8217;s portfolio from the database, apply risk weighting rules from a compliance document, generate a report in a specific format, and flag any positions that exceed regulatory thresholds — all as a single orchestrated workflow.</p>



<p>The trust boundary hasn&#8217;t just moved — it has expanded. The LLM is no longer just reading data or generating queries. It&#8217;s <strong>executing multi-step procedures</strong> that combine data access, computation, and output generation.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-the-trust-boundary-problem">The Trust Boundary Problem</h2>



<p>Here&#8217;s the framework I use when advising clients on these paradigms. It comes down to one question: <strong>where does the security perimeter sit, and who crosses it?</strong></p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td><strong>Feature</strong></td><td><strong>RAG</strong></td><td><strong>MCP</strong></td><td><strong>Skills</strong></td></tr><tr><td><strong>Trust boundary</strong></td><td>Pipeline acts as gatekeeper</td><td>LLM acts as query agent</td><td>LLM acts as procedure executor with tool access</td></tr><tr><td><strong>LLM sees</strong></td><td>Pre-filtered chunks only</td><td>Raw query results from live DB</td><td>Raw data + intermediate computation results</td></tr><tr><td><strong>DBA controls</strong></td><td>✅ Full (RLS, masking, filtering at retrieval time)</td><td>⚠️ Partial (connection-level, but LLM can infer beyond single query)</td><td>❌ Minimal (depends on Skill design and LLM behavioral compliance)</td></tr></tbody></table></figure>



<p>Let me unpack each one.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-rag-the-dba-s-comfort-zone">RAG: The DBA&#8217;s Comfort Zone</h2>



<p>I&#8217;ll be direct: RAG is where I&#8217;m most comfortable as a DBA, and there are solid reasons for that.</p>



<h3 class="wp-block-heading" id="h-why-governance-is-straightforward">Why governance is straightforward</h3>



<p>In a RAG architecture, the retrieval layer is a <strong>regular database client</strong>. It connects with a specific role, it runs queries (vector similarity search), and it returns results. Everything the DBA has built over the past decades applies:</p>



<p><strong>Row-Level Security (RLS):</strong> You can enforce per-user access policies at the PostgreSQL level. If user A shouldn&#8217;t see documents classified as &#8220;confidential-level-3&#8221;, the RLS policy filters them out before the embedding search even returns results. The LLM never sees them — not in the chunks, not in the context, not in the answer.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Example: RLS on the document_embeddings table
CREATE POLICY embedding_access ON document_embeddings
    FOR SELECT
    USING (
        doc_id IN (
            SELECT doc_id FROM documents 
            WHERE classification_level &lt;= current_setting(&#039;app.user_clearance&#039;)::int
        )
    );
</pre></div>
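

<p>A policy has no effect until row-level security is enabled on the table, and the retrieval layer has to set the clearance for each request. A minimal sketch, assuming the retrieval layer connects as a non-owner role and that the clearance value (2 here) comes from your authentication layer:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Policies are ignored until RLS is enabled on the table
ALTER TABLE document_embeddings ENABLE ROW LEVEL SECURITY;

-- Per request: scope the clearance to the current transaction only
BEGIN;
SELECT set_config(&#039;app.user_clearance&#039;, &#039;2&#039;, true);  -- true = transaction-local

-- Only rows allowed by the embedding_access policy are visible here
SELECT count(*) AS visible_rows FROM document_embeddings;
COMMIT;
</pre></div>


<p>Keep in mind that superusers and roles with the <code>BYPASSRLS</code> attribute always bypass these policies, and that table owners bypass them too unless <code>ALTER TABLE ... FORCE ROW LEVEL SECURITY</code> is set. The retrieval role should be none of these.</p>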


<p><strong>Data masking:</strong> If certain fields need to be masked (client names, account numbers), you apply masking rules at the view or column level. The chunks the LLM receives already have masked data. There&#8217;s no risk of the LLM &#8220;accidentally&#8221; revealing something it shouldn&#8217;t — it never had it.</p>
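

<p>A view-based version of this idea could look like the sketch below. The table <code>document_chunks</code>, its columns, and the role <code>rag_retriever</code> are hypothetical names used for illustration only:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Hypothetical source table:
--   document_chunks(chunk_id, doc_id, chunk_text, client_name, account_number)
CREATE VIEW document_chunks_masked AS
SELECT chunk_id,
       doc_id,
       chunk_text,
       left(client_name, 1) || &#039;***&#039;          AS client_name,     -- keep the initial only
       &#039;XXXX-&#039; || right(account_number, 4)    AS account_number   -- keep the last 4 characters
FROM document_chunks;

-- The retrieval role reads the masked view, never the base table
REVOKE ALL ON document_chunks FROM rag_retriever;
GRANT SELECT ON document_chunks_masked TO rag_retriever;
</pre></div>


<p>Column masking does not touch names or numbers that appear inside <code>chunk_text</code> itself. Those have to be redacted before chunking and embedding.</p>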



<p><strong>Audit logging:</strong> Every query to the vector store is a standard SQL query. <code>pgaudit</code>, <code>log_statement</code>, your existing monitoring — it all works. You can trace exactly which chunks were retrieved for which query at which time.</p>
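

<p>With <code>pgaudit</code>, read auditing can be scoped to the retrieval role alone. This sketch assumes the extension is already listed in <code>shared_preload_libraries</code> and created in the database, and that <code>rag_retriever</code> (a hypothetical name) is the role used by the retrieval layer:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Log every read statement issued by the retrieval role
ALTER ROLE rag_retriever SET pgaudit.log = &#039;read&#039;;

-- Also log the parameters passed with each statement
ALTER ROLE rag_retriever SET pgaudit.log_parameter = on;
</pre></div>


<p>Role-level settings apply to new sessions of that role, so existing connections need to be recycled.</p>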



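<p><strong>Encryption at rest (TDE):</strong> Embeddings at rest are protected like any other data. Filesystem-level or volume-level encryption applies without modification, and <code>pgcrypto</code> can additionally protect sensitive text columns. The vector column itself has to remain readable by the engine, otherwise similarity search cannot work on it.</p>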



<h3 class="wp-block-heading" id="h-the-measurement-advantage">The measurement advantage</h3>



<p>This is the point I made in my <a href="https://www.dbi-services.com/blog/rag-series-adaptive-rag-understanding-confidence-precision-ndcg/">Adaptive RAG post</a>: RAG has a well-defined quality measurement framework.</p>



<p>You build <strong>golden question/answer pairs</strong> with domain experts. You run the retrieval pipeline against these pairs. You measure precision@k, recall@k, nDCG@k. You track these over time (as I described in the <a href="https://www.dbi-services.com/blog/rag-series-embedding-versioning-pgvector/">embedding versioning post</a>).</p>



<p>The key property is that the output space is bounded: given a query, the system returns a ranked list of chunks. You can evaluate whether the right chunks were returned, in the right order, with the right confidence. This is a well-studied problem in information retrieval — decades of academic work support it.</p>



<h3 class="wp-block-heading" id="h-the-limitations">The limitations</h3>



<p>RAG is not universally applicable, and I don&#8217;t want to oversell it:</p>



<ul class="wp-block-list">
<li><strong>It can&#8217;t do aggregation</strong>: &#8220;What&#8217;s the total exposure across all clients in sector X?&#8221; requires a SQL query, not a similarity search over chunks.</li>



<li><strong>It can&#8217;t join across data sources</strong>: if the answer requires combining data from multiple tables or systems, the pre-chunked approach breaks down.</li>



<li><strong>It&#8217;s bound by the chunking strategy</strong>: if the relevant information spans multiple chunks, or if the chunking split a critical paragraph, retrieval quality suffers regardless of how good your embeddings are.</li>



<li><strong>It requires upfront indexing</strong>: every document must be chunked and embedded before it can be retrieved. The event-driven pipelines I described in the embedding versioning post address the freshness problem, but the fundamental constraint remains.</li>
</ul>



<p>These limitations are exactly what push teams toward MCP.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-mcp-when-the-llm-writes-the-queries">MCP: When the LLM Writes the Queries</h2>



<p>MCP is compelling because it removes the constraints I just listed. The LLM can query live data, aggregate, join, filter — anything SQL can express. For analytical workloads, it&#8217;s transformative.</p>



<p>But the governance implications are profound.</p>



<h3 class="wp-block-heading" id="h-the-identity-problem-who-is-the-user">The identity problem: who is the user?</h3>



<p>In a traditional application, the database sees a connection from a known application user with a defined role. RLS policies evaluate <code>current_user</code> or <code>current_setting('app.user_id')</code>, and access is scoped accordingly.</p>



<p>With MCP, the LLM is the intermediary. The MCP server connects to PostgreSQL, but <strong>who is the user?</strong> Is it:</p>



<ul class="wp-block-list">
<li>The human who asked the question?</li>



<li>The LLM instance processing the request?</li>



<li>The MCP server&#8217;s service account?</li>
</ul>



<p>In most current MCP implementations, it&#8217;s the service account. The MCP server connects with a single set of credentials, and the LLM&#8217;s queries execute with that role&#8217;s permissions. This creates a fundamental problem in regulated environments: <strong>the database cannot distinguish between queries on behalf of different users with different access levels.</strong></p>



<p>You can work around this with application-level filtering — the MCP server injects <code>WHERE</code> clauses based on the requesting user&#8217;s profile. But this moves access control from the database (where it&#8217;s declarative, auditable, and enforced by the engine) to the application layer (where it depends on correct implementation, and where the LLM might find creative ways around it).</p>



<p>This is where I have a hard time enforcing data privacy rules outside of the RAG paradigm. As a DBA, tools like RLS, TDE, data masking, and filtering are well-known and battle-tested. In an MCP paradigm, these tools still exist at the database level, but <strong>the trust that they&#8217;ll be invoked correctly now depends on the MCP server&#8217;s implementation and the LLM&#8217;s behavioral compliance.</strong></p>
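

<p>One way to hand part of this control back to the database is for the MCP server to propagate the end user&#8217;s identity on every request, so that RLS policies evaluate against the human and not the service account. The sketch below is an assumption about how such a server could be written, not a feature of the protocol; <code>analyst_readonly</code> and the user id 42 are hypothetical:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
BEGIN;
-- Drop to a low-privilege role mapped from the authenticated end user
SET LOCAL ROLE analyst_readonly;

-- Expose the end user identity to RLS policies, for this transaction only
SELECT set_config(&#039;app.user_id&#039;, &#039;42&#039;, true);

-- ... the LLM-generated statement runs here ...

COMMIT;
</pre></div>


<p>This only holds if the server refuses generated statements that change the role or the setting themselves (<code>RESET ROLE</code>, <code>SET</code>, <code>set_config</code>), which brings back the dependency on a correct MCP server implementation.</p>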



<h3 class="wp-block-heading" id="h-the-inference-problem">The inference problem</h3>



<p>This is the subtler issue and, in my opinion, the more dangerous one for FINMA-regulated environments.</p>



<p>Even if you perfectly control what data the LLM can query, the LLM can <strong>infer</strong> information by combining results from multiple queries. Consider:</p>



<ol class="wp-block-list">
<li>Query 1: &#8220;How many clients are in sector Energy?&#8221; → 3</li>



<li>Query 2: &#8220;What&#8217;s the total exposure in sector Energy?&#8221; → CHF 450M</li>



<li>Query 3: &#8220;What&#8217;s the largest single position across all sectors?&#8221; → CHF 200M</li>
</ol>



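<p>Individually, each query might be within the user&#8217;s access rights. But combined, the LLM can bound a figure it was never shown. Three clients sharing CHF 450M average CHF 150M each, and no single position exceeds CHF 200M, so (if each client holds one position) the largest Energy client sits between CHF 150M and CHF 200M: a third to nearly half of the sector&#8217;s exposure. This might be information the user is not authorized to derive.</p>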



<p>This is not a new problem — it&#8217;s the classic <strong>inference attack</strong> from database security literature. But LLMs make it dramatically easier because:</p>



<ul class="wp-block-list">
<li>They&#8217;re excellent at combining information from multiple queries</li>



<li>They do it naturally, without being explicitly instructed to</li>



<li>The user might not even realize they received derived information they shouldn&#8217;t have</li>
</ul>



<p>In a RAG architecture, the inference surface is much smaller because the LLM only sees pre-selected chunks, not raw query results it can freely combine.</p>
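

<p>A classic partial mitigation from statistical disclosure control is to suppress aggregates over small groups, and to expose only such aggregates to the MCP role. A minimal sketch, assuming a hypothetical <code>positions(client_id, sector, exposure_chf)</code> table and an arbitrary minimum group size of 5:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Hypothetical table: positions(client_id, sector, exposure_chf)
CREATE VIEW sector_exposure_safe AS
SELECT sector,
       count(DISTINCT client_id) AS clients,
       sum(exposure_chf)         AS total_exposure_chf
FROM positions
GROUP BY sector
HAVING count(DISTINCT client_id) &gt;= 5;   -- suppress small groups (5 is an assumption)
</pre></div>


<p>With this view, the Energy sector from the example (3 clients) would simply not be returned. It narrows the inference surface but does not close it: differences between overlapping aggregates can still leak information.</p>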



<h3 class="wp-block-heading" id="h-the-auditability-challenge">The auditability challenge</h3>



<p>In FINMA&#8217;s <a href="https://www.finma.ch/en/">operational risk</a> framework (and Basel III/IV more broadly), financial institutions must maintain comprehensive audit trails for data access. This means you need to answer: <strong>who accessed what data, when, and for what purpose?</strong></p>



<p>With MCP:</p>



<ul class="wp-block-list">
<li>The &#8220;who&#8221; is partially obscured (service account vs. actual user)</li>



<li>The &#8220;what&#8221; is a generated SQL query that might be complex and non-obvious</li>



<li>The &#8220;for what purpose&#8221; is the natural language question, which may not clearly map to the data accessed</li>
</ul>



<p>You can log everything — the natural language input, the generated SQL, the results, the final answer. But the audit trail is now <strong>multi-layered and interpretive</strong>: a compliance officer reviewing the logs needs to understand not just that a query ran, but why the LLM chose that particular query, and whether the answer faithfully represents the data without unauthorized inference.</p>
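
<p>As a rough sketch of what &#8220;log everything&#8221; could look like, the table below stores one row per tool call. The name <code>mcp_audit_log</code> and all of its columns are assumptions for illustration, not part of any MCP specification or server implementation.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Hypothetical audit table: one row per query executed on behalf of a user
CREATE TABLE mcp_audit_log (
    audit_id        BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    logged_at       TIMESTAMPTZ NOT NULL DEFAULT now(),
    end_user        TEXT NOT NULL,                       -- the human, not the service account
    db_role         TEXT NOT NULL DEFAULT current_user,  -- role the MCP server connected as
    conversation_id UUID NOT NULL,
    user_question   TEXT NOT NULL,                       -- the stated purpose
    generated_sql   TEXT NOT NULL,                       -- what was actually executed
    row_count       INT,                                 -- size of the result, not its content
    final_answer    TEXT
);

-- Reconstruct the chain for one conversation, in execution order
SELECT logged_at, end_user, user_question, generated_sql, row_count
FROM mcp_audit_log
WHERE conversation_id = &#039;00000000-0000-0000-0000-000000000001&#039;
ORDER BY logged_at, audit_id;
</pre></div>


<p>Storing only the row count (or a hash of the result) rather than the result itself is a deliberate choice in this sketch: copying full result sets into the log would duplicate the sensitive data the log is meant to protect.</p>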



<p>Compare this to RAG, where the audit trail is: &#8220;User asked X. The system retrieved chunks Y1, Y2, Y3 from documents the user has access to. The LLM generated answer Z based on those chunks.&#8221; Simpler. More linear. Easier to review.</p>



<h3 class="wp-block-heading" id="h-where-mcp-shines-despite-the-challenges">Where MCP shines despite the challenges</h3>



<p>I don&#8217;t want to paint MCP as undeployable. It solves real problems:</p>



<ul class="wp-block-list">
<li><strong>Analytical queries</strong>: &#8220;Show me the monthly trend of client onboarding over the past year&#8221; — this requires aggregation that RAG simply cannot do</li>



<li><strong>Exploratory data access</strong>: when users don&#8217;t know exactly what they&#8217;re looking for, the ability to query flexibly is invaluable</li>



<li><strong>Multi-source integration</strong>: MCP servers can connect to multiple backends (PostgreSQL, APIs, file systems) through a single standardized protocol</li>



<li><strong>Reduced indexing overhead</strong>: no need to chunk, embed, and maintain vector indexes — the data is queried live</li>
</ul>



<p>The measurement story is also more tractable than Skills (see below). Academic work on <strong>text-to-SQL</strong> evaluation provides established benchmarks. Projects like <a href="https://github.com/salesforce/WikiSQL">WikiSQL</a> and <a href="https://bird-bench.github.io/">BIRD-SQL</a> offer golden datasets of natural language questions paired with correct SQL queries and expected results.</p>



<p>However — and this is a crucial point I raised during discussion with colleagues — <strong>these academic benchmarks don&#8217;t transfer to your specific domain</strong>. WikiSQL doesn&#8217;t know your banking schema. BIRD-SQL doesn&#8217;t understand your compliance rules. You still need to build your own evaluation dataset with domain experts, golden queries, and expected results specific to your data model.</p>



<p>The good news: the <em>methodology</em> transfers. You know what to measure (query correctness, result accuracy, execution safety). You know how to structure the evaluation (golden pairs). The work is in building the domain-specific test suite, not in inventing the evaluation framework.</p>
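
<p>One way to check a golden pair is to compare result sets rather than SQL text. The sketch below assumes a hypothetical <code>clients</code> table and uses the onboarding question from the list above; the two CTEs stand in for the expert-written query and the LLM-generated one.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Hypothetical schema for the onboarding-trend question
CREATE TABLE clients (
    client_id    INT PRIMARY KEY,
    onboarded_on DATE NOT NULL
);

WITH golden AS (      -- query written by the domain expert
    SELECT date_trunc(&#039;month&#039;, onboarded_on) AS month, count(*) AS new_clients
    FROM clients
    WHERE onboarded_on &gt;= current_date - INTERVAL &#039;1 year&#039;
    GROUP BY 1
),
generated AS (        -- query produced by the LLM for the same question
    SELECT date_trunc(&#039;month&#039;, onboarded_on) AS month, count(client_id) AS new_clients
    FROM clients
    WHERE onboarded_on &gt;= current_date - INTERVAL &#039;12 months&#039;
    GROUP BY 1
)
-- The results match when neither side has rows the other lacks
SELECT NOT EXISTS (
    (SELECT * FROM golden    EXCEPT ALL SELECT * FROM generated)
    UNION ALL
    (SELECT * FROM generated EXCEPT ALL SELECT * FROM golden)
) AS results_match;
</pre></div>


<p><code>EXCEPT ALL</code> compares the results as multisets, so duplicate rows count, while row order is ignored. If the question implies an ordering, that has to be checked separately.</p>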



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-skills-when-the-llm-executes-procedures">Skills: When the LLM Executes Procedures</h2>



<p>Skills represent the most powerful — and the most challenging — paradigm from a governance perspective.</p>



<p>In a Skill, the LLM is not just retrieving information or generating queries. It&#8217;s following a procedural set of instructions that might involve: reading data, applying business logic, making decisions based on intermediate results, generating outputs in specific formats, and potentially writing data back.</p>



<h3 class="wp-block-heading" id="h-the-instruction-adherence-problem">The instruction adherence problem</h3>



<p>With RAG, you measure retrieval quality. With MCP, you measure query correctness. With Skills, you need to measure something much harder: <strong>did the LLM follow the procedure correctly?</strong></p>



<p>Two responses can both be &#8220;compliant&#8221; with a Skill&#8217;s instructions and yet differ significantly in quality. Consider a Skill that instructs the LLM to:</p>



<ol class="wp-block-list">
<li>Read a client&#8217;s transaction history from the database</li>



<li>Apply anti-money-laundering (AML) screening rules</li>



<li>Flag suspicious patterns according to FINMA guidelines</li>



<li>Generate a structured report</li>
</ol>



<p>Response A might flag 5 transactions as suspicious, with clear reasoning linked to specific FINMA rules. Response B might flag 3 of the same 5, miss 2 edge cases, but also flag 1 false positive — with equally valid-sounding reasoning. Both followed the Skill&#8217;s instructions. Both produced structured reports. But one is meaningfully better than the other.</p>



<p><strong>How do you measure this?</strong> This is the question that keeps me up at night, and I don&#8217;t have a clean answer.</p>



<p>For RAG, we have precision, recall, nDCG — well-defined metrics with decades of research behind them. For MCP/text-to-SQL, we have execution accuracy, result set matching, and query equivalence. For Skills, we need to evaluate:</p>



<ul class="wp-block-list">
<li><strong>Instruction adherence</strong>: did the LLM follow each step in order?</li>



<li><strong>Completeness</strong>: were all required steps executed?</li>



<li><strong>Correctness of intermediate decisions</strong>: at each decision point, did the LLM make the right call?</li>



<li><strong>Output quality</strong>: does the final deliverable meet the specification?</li>



<li><strong>Safety</strong>: did the LLM stay within the boundaries of the Skill, or did it improvise?</li>
</ul>



<p>This is far more subjective than retrieval or query evaluation, and it produces many more edge cases. You can build evaluation frameworks for each of these dimensions, but the composite &#8220;is this response good?&#8221; judgment requires human review at a level that doesn&#8217;t scale easily.</p>
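
<p>Some of these dimensions can at least be partly quantified. As a sketch, the correctness of the flagging decisions in the Response A / Response B example can be scored against a golden set labeled by AML experts. The tables and transaction IDs below are hypothetical.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Hypothetical golden labels and the flags produced by each response
CREATE TABLE golden_flags (txn_id INT PRIMARY KEY);
CREATE TABLE response_flags (
    response TEXT NOT NULL,
    txn_id   INT  NOT NULL,
    PRIMARY KEY (response, txn_id)
);

INSERT INTO golden_flags VALUES (1), (2), (3), (4), (5);
INSERT INTO response_flags VALUES
    (&#039;A&#039;, 1), (&#039;A&#039;, 2), (&#039;A&#039;, 3), (&#039;A&#039;, 4), (&#039;A&#039;, 5),
    (&#039;B&#039;, 1), (&#039;B&#039;, 2), (&#039;B&#039;, 3), (&#039;B&#039;, 9);   -- 9 is the false positive

SELECT r.response,
       count(g.txn_id) AS true_positives,
       round(count(g.txn_id)::numeric / count(*), 2) AS flag_precision,
       round(count(g.txn_id)::numeric
             / (SELECT count(*) FROM golden_flags), 2) AS flag_recall
FROM response_flags r
LEFT JOIN golden_flags g ON g.txn_id = r.txn_id
GROUP BY r.response
ORDER BY r.response;
</pre></div>


<p>With this data, Response A scores 1.00 on both measures and Response B scores 0.75 precision and 0.60 recall. That separates the two responses numerically, but it says nothing about the quality of the reasoning attached to each flag, which still needs human review.</p>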



<h3 class="wp-block-heading" id="h-the-behavioral-safety-gap">The behavioral safety gap</h3>



<p>In regulated environments, the <em>process</em> matters as much as the <em>result</em>. In banking, it&#8217;s not enough to produce the correct AML report — you must produce it through the correct procedure, applying the correct rules, in the correct order, with the correct audit trail.</p>



<p>Skills encode this process. But the LLM is a probabilistic system. It will sometimes:</p>



<ul class="wp-block-list">
<li>Reorder steps when it &#8220;thinks&#8221; a different order is equivalent (from a regulatory standpoint, it might not be)</li>



<li>Skip a step it deems unnecessary (it might be required for compliance)</li>



<li>Interpolate between the Skill&#8217;s instructions and its general training (introducing reasoning that wasn&#8217;t prescribed)</li>



<li>Handle edge cases creatively (which might mean non-compliantly)</li>
</ul>



<p>In a FINMA audit, &#8220;the AI decided to skip step 3 because it seemed redundant&#8221; is not an acceptable explanation. The procedure exists for a reason. Compliance is about following the prescribed process, not just arriving at a plausible result.</p>
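
<p>Reordered or skipped steps can be detected mechanically if the execution layer records each step as it completes. The sketch below assumes a hypothetical <code>skill_step_log</code> table and hypothetical step names for the AML Skill described earlier.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Hypothetical step log, written by the execution layer (not self-reported by the LLM)
CREATE TABLE skill_step_log (
    step_seq    BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    run_id      UUID NOT NULL,
    step_name   TEXT NOT NULL,
    finished_at TIMESTAMPTZ NOT NULL DEFAULT clock_timestamp()
);

-- Runs whose executed sequence deviates from the prescribed procedure
SELECT run_id,
       string_agg(step_name, &#039; &gt; &#039; ORDER BY step_seq) AS executed_sequence
FROM skill_step_log
GROUP BY run_id
HAVING string_agg(step_name, &#039; &gt; &#039; ORDER BY step_seq)
       IS DISTINCT FROM
       &#039;read_transactions &gt; apply_aml_rules &gt; flag_patterns &gt; generate_report&#039;;
</pre></div>


<p>This only proves that the prescribed steps were recorded in the prescribed order. Whether each step applied the right rules is a separate question that the log alone cannot answer.</p>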



<h3 class="wp-block-heading" id="h-the-data-access-surface">The data access surface</h3>



<p>Skills often need broad data access to perform their multi-step procedures. An AML screening Skill might need access to transaction history, client profiles, country risk ratings, and regulatory threshold configurations. This is a wide data surface — wider than most RAG retrieval patterns and potentially wider than individual MCP queries.</p>



<p>The challenge is that this access is <strong>implicit in the Skill design</strong>, not explicit in a per-query access control check. When the Skill says &#8220;read the client&#8217;s transaction history,&#8221; the underlying MCP call or database query is executed with whatever permissions the Skill&#8217;s execution context has. There&#8217;s no natural point where a per-user RLS check happens unless you&#8217;ve specifically engineered it into the Skill&#8217;s execution layer.</p>
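
<p>One way to engineer that check into the execution layer is to pass the end user&#8217;s identity as a transaction-local setting and let an RLS policy read it. This is a minimal sketch: the table, the policy, and the setting name <code>app.end_user</code> are assumptions, and the Skill runtime is assumed to open one transaction per user request.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Hypothetical table holding the data the Skill reads
CREATE TABLE client_transactions (
    txn_id               BIGINT PRIMARY KEY,
    client_id            INT NOT NULL,
    relationship_manager TEXT NOT NULL,
    amount               NUMERIC NOT NULL
);

ALTER TABLE client_transactions ENABLE ROW LEVEL SECURITY;
ALTER TABLE client_transactions FORCE ROW LEVEL SECURITY;  -- also applies to the table owner

-- Rows are visible only to the relationship manager named in the session setting.
-- With the setting absent, current_setting(..., true) returns NULL and no row matches.
CREATE POLICY rm_sees_own_clients ON client_transactions
    FOR SELECT
    USING (relationship_manager = current_setting(&#039;app.end_user&#039;, true));

-- What the execution layer would run for each user request
BEGIN;
SELECT set_config(&#039;app.end_user&#039;, &#039;rm_jdoe&#039;, true);  -- true = transaction-local
SELECT txn_id, client_id, amount FROM client_transactions;
COMMIT;
</pre></div>


<p>Two caveats apply. Superusers and roles with <code>BYPASSRLS</code> skip policies entirely, so the Skill&#8217;s service role must be neither. And the setting must be written by trusted code only: if LLM-generated SQL can call <code>set_config</code> itself, the check is meaningless.</p>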



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-the-governance-matrix">The Governance Matrix</h2>



<p>Let me bring this together in a framework that I&#8217;ve found useful when discussing these paradigms with CISOs and compliance teams:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Governance Dimension</th><th>RAG</th><th>MCP</th><th>Skills</th></tr></thead><tbody><tr><td><strong>Data access control</strong></td><td>✅ Database-native (RLS, views, masking) — data is filtered <em>before</em> the LLM sees it</td><td>⚠️ Depends on MCP server implementation; LLM sees raw query results</td><td>❌ Broad access often required; implicit in Skill design</td></tr><tr><td><strong>Inference protection</strong></td><td>✅ Limited surface — LLM sees only pre-selected chunks</td><td>⚠️ LLM can combine results from multiple queries to derive unauthorized information</td><td>❌ Multi-step procedures inherently combine information</td></tr><tr><td><strong>Audit trail clarity</strong></td><td>✅ Linear: query → chunks → answer. Easy to review</td><td>⚠️ Multi-step: question → SQL(s) → results → answer. Requires interpretation</td><td>❌ Complex: task → steps → intermediate results → decisions → output. Hard to audit</td></tr><tr><td><strong>Identity propagation</strong></td><td>✅ Retrieval runs as the user (or user-scoped service)</td><td>⚠️ MCP server connects with service account; user identity must be passed through</td><td>⚠️ Execution context may not map to individual user identity</td></tr><tr><td><strong>Relevance measurement</strong></td><td>✅ Mature: precision, recall, nDCG on golden datasets</td><td>⚠️ Text-to-SQL benchmarks exist but must be domain-adapted</td><td>❌ Instruction adherence is subjective; no standard metrics</td></tr><tr><td><strong>Behavioral predictability</strong></td><td>✅ Output bounded by retrieved context</td><td>⚠️ LLM chooses which queries to run; output depends on query strategy</td><td>❌ LLM executes procedures; may reorder, skip, or improvise steps</td></tr><tr><td><strong>Regulatory explainability</strong></td><td>✅ &#8220;The system retrieved these documents and generated this answer&#8221;</td><td>⚠️ &#8220;The system ran these queries and synthesized 
this answer&#8221;</td><td>❌ &#8220;The system followed this procedure, making these intermediate decisions&#8221;</td></tr><tr><td><strong>Data residency / TDE</strong></td><td>✅ Standard PostgreSQL encryption; chunks are just rows</td><td>✅ Standard — queries execute within the DB perimeter</td><td>⚠️ Intermediate results may exist outside the DB during processing</td></tr><tr><td><strong>Anonymization-utility balance</strong></td><td>✅ Transformation at embedding time; LLM only sees pre-abstracted chunks</td><td>⚠️ Transformation at query time; every query path must return abstracted data</td><td>❌ Transformation needed at every procedure step; hard to guarantee consistency</td></tr></tbody></table></figure>



<p>The trend is clear: <strong>as you move from RAG to MCP to Skills, the governance burden shifts from the database to the application layer</strong>, and the DBA&#8217;s ability to enforce controls diminishes.</p>



<p>This doesn&#8217;t mean MCP and Skills are unusable in regulated environments. It means the governance responsibility shifts, and different teams need to own different pieces.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-the-real-challenge-anonymization-is-not-enough">The Real Challenge: Anonymization Is Not Enough</h2>



<p>The governance matrix above might give the impression that the hard problem is access control — deciding <em>whether</em> the LLM sees the data. In practice, at least in the enterprise environments where I consult, the harder problem is what happens <em>to</em> the data before the LLM sees it.</p>



<p>Everyone agrees on the baseline: <strong>strip PII before sending anything to an external LLM</strong>. Replace client names with tokens, mask account numbers, redact personal identifiers. This is table stakes, and there are mature tools for it — both at the PostgreSQL level (dynamic masking, views) and at the application layer (regex-based scrubbing, NER-based entity detection).</p>



<p>But here&#8217;s what nobody warns you about: <strong>if you anonymize too aggressively, the LLM can&#8217;t do its job.</strong> And if you don&#8217;t anonymize enough, you&#8217;ve just sent regulated data to a third-party API.</p>



<p>The balance is not a binary &#8220;mask or don&#8217;t mask.&#8221; It&#8217;s a spectrum of <strong>semantic-preserving transformation</strong> — and finding the right point on that spectrum is, in my experience, the most under-discussed practical challenge of deploying LLMs in regulated environments.</p>



<h3 class="wp-block-heading" id="h-the-anonymization-utility-trade-off">The anonymization-utility trade-off</h3>



<p>Let me illustrate with a concrete example from a banking context. Suppose a relationship manager asks the RAG system: &#8220;What are the key risk factors for this client&#8217;s portfolio?&#8221;</p>



<p>The original chunk from the knowledge base might contain:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
Client: Jean-Pierre Müller (ID: CH-98234)
Portfolio value: CHF 4.2M
Concentrated position: 47% in Nestlé S.A. (NESN.SW)
Recent transactions: Sold CHF 200K of Roche Holding AG on 2025-11-15
KYC renewal due: 2026-03-01
Risk rating: Medium-High (upgraded from Medium on 2025-09-20)
</pre></div>


<p>Now let&#8217;s look at what happens at different levels of anonymization:</p>



<p><strong>Level 1 — PII-only masking</strong> (replace identifiers, keep everything else):</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
Client: &#x5B;CLIENT_A] (ID: &#x5B;REDACTED])
Portfolio value: CHF 4.2M
Concentrated position: 47% in Nestlé S.A. (NESN.SW)
Recent transactions: Sold CHF 200K of Roche Holding AG on 2025-11-15
KYC renewal due: 2026-03-01
Risk rating: Medium-High (upgraded from Medium on 2025-09-20)
</pre></div>


<p>The LLM can still reason perfectly about the portfolio risk. It sees the concentration in a specific stock, the transaction history, the risk upgrade. This is semantically rich — the answer will be excellent.</p>



<p>But the problem: Nestlé + Roche + CHF 4.2M + specific dates might be enough to <strong>re-identify the client</strong> through correlation. A determined actor with access to client lists could narrow this down to a handful of people, potentially one. The PII is gone, but the <em>data fingerprint</em> remains.</p>



<p><strong>Level 2 — Aggressive anonymization</strong> (mask everything that could identify):</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
Client: &#x5B;CLIENT_A] (ID: &#x5B;REDACTED])
Portfolio value: &#x5B;AMOUNT]
Concentrated position: &#x5B;PERCENTAGE] in &#x5B;COMPANY_A]
Recent transactions: Sold &#x5B;AMOUNT] of &#x5B;COMPANY_B] on &#x5B;DATE]
KYC renewal due: &#x5B;DATE]
Risk rating: &#x5B;RATING] (upgraded from &#x5B;PREVIOUS_RATING] on &#x5B;DATE])
</pre></div>


<p>Now the data is safe from re-identification. But the LLM is blind. It can&#8217;t tell you that a 47% concentration in a single stock is a risk factor, because it doesn&#8217;t know it&#8217;s 47%. It can&#8217;t assess whether the recent sale was material, because it doesn&#8217;t know the amount relative to the portfolio. It can&#8217;t flag the KYC timeline. The answer will be generic boilerplate about portfolio risk — useless to the relationship manager.</p>



<p><strong>Level 3 — Semantic-preserving abstraction</strong> (this is where the real work happens):</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
Client: &#x5B;CLIENT_A] (ID: &#x5B;REDACTED])
Portfolio value: CHF 1-5M range
Concentrated position: &gt;40% in a single Swiss large-cap equity
Recent transactions: Significant sale in Swiss pharma sector, Q4 2025
KYC renewal due: Q1 2026
Risk rating: Medium-High (recently upgraded)
</pre></div>


<p>Now we&#8217;ve achieved something interesting. The data is:</p>



<ul class="wp-block-list">
<li><strong>Not re-identifiable</strong>: ranges, sectors, and quarters instead of exact values, company names, and dates</li>



<li><strong>Semantically sufficient</strong>: the LLM can still reason that &gt;40% in a single stock is a concentration risk, that a significant sale in pharma might be rebalancing, that a recent risk upgrade plus upcoming KYC renewal warrants attention</li>



<li><strong>Contextually accurate</strong>: the abstraction preserves the relationships between data points (concentration → risk rating → KYC timeline)</li>
</ul>



<p>This is the sweet spot — but getting there requires deliberate design, not just running a PII scanner.</p>



<h3 class="wp-block-heading" id="h-building-the-abstraction-layer-data-mapping-in-postgresql">Building the abstraction layer: data mapping in PostgreSQL</h3>



<p>In practice, I implement this as a <strong>transformation layer</strong> in PostgreSQL — views or functions that produce the &#8220;LLM-safe&#8221; version of the data, with the abstraction rules encoded declaratively.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Abstraction mapping for amounts
CREATE OR REPLACE FUNCTION anonymize_amount(amount NUMERIC, currency TEXT DEFAULT &#039;CHF&#039;)
RETURNS TEXT AS $$
BEGIN
    -- Preserve magnitude and currency, remove precision
    RETURN CASE
        WHEN amount &lt; 100000 THEN currency || &#039; &lt;100K&#039;
        WHEN amount &lt; 500000 THEN currency || &#039; 100K-500K range&#039;
        WHEN amount &lt; 1000000 THEN currency || &#039; 500K-1M range&#039;
        WHEN amount &lt; 5000000 THEN currency || &#039; 1-5M range&#039;
        WHEN amount &lt; 10000000 THEN currency || &#039; 5-10M range&#039;
        WHEN amount &lt; 50000000 THEN currency || &#039; 10-50M range&#039;
        ELSE currency || &#039; 50M+&#039;
    END;
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;  -- STRICT: a NULL input returns NULL instead of falling into the ELSE bucket

-- Abstraction mapping for dates (reduce to quarter)
CREATE OR REPLACE FUNCTION anonymize_date(d DATE)
RETURNS TEXT AS $$
BEGIN
    RETURN &#039;Q&#039; || EXTRACT(QUARTER FROM d)::TEXT || &#039; &#039; || EXTRACT(YEAR FROM d)::TEXT;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

-- Abstraction mapping for percentages (bucketize)
CREATE OR REPLACE FUNCTION anonymize_percentage(pct NUMERIC)
RETURNS TEXT AS $$
BEGIN
    RETURN CASE
        WHEN pct &lt; 10 THEN &#039;&lt;10%&#039;
        WHEN pct &lt; 25 THEN &#039;10-25%&#039;
        WHEN pct &lt; 40 THEN &#039;25-40%&#039;
        WHEN pct &lt; 60 THEN &#039;&gt;40%&#039;          -- &quot;concentrated&quot;
        WHEN pct &lt; 80 THEN &#039;&gt;60%&#039;          -- &quot;highly concentrated&quot;
        ELSE &#039;&gt;80%&#039;                         -- &quot;dominant position&quot;
    END;
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;  -- STRICT: a NULL percentage returns NULL instead of &#039;&gt;80%&#039;

-- Sector mapping: company → sector (prevents re-identification via company name)
CREATE TABLE company_sector_map (
    company_name    TEXT PRIMARY KEY,
    sector          TEXT NOT NULL,       -- &#039;Swiss pharma&#039;, &#039;European tech&#039;, etc.
    market_cap_tier TEXT NOT NULL        -- &#039;large-cap&#039;, &#039;mid-cap&#039;, &#039;small-cap&#039;
);

-- The LLM-safe view: this is what gets chunked and embedded (for RAG)
-- or what the MCP server exposes (for MCP)
CREATE OR REPLACE VIEW v_portfolio_llm_safe AS
SELECT
    p.client_id,                                          -- internal ref only
    anonymize_amount(p.portfolio_value) AS portfolio_size,
    anonymize_percentage(p.concentration_pct) 
        || &#039; in a single &#039; 
        || csm.market_cap_tier 
        || &#039; &#039; || csm.sector AS concentration_description,
    anonymize_amount(t.amount) || &#039; in &#039; || csm_t.sector 
        || &#039;, &#039; || anonymize_date(t.trade_date) AS recent_activity,
    anonymize_date(p.kyc_renewal_date) AS kyc_timeline,
    p.risk_rating,
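    -- Note: any recent rating change is labeled as an upgrade here;
    -- compare against the previous rating if downgrades can occur
    CASE WHEN p.risk_rating_changed_at &gt; now() - INTERVAL &#039;6 months&#039;
         THEN &#039;recently upgraded&#039; ELSE &#039;stable&#039; 
    END AS risk_trend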
FROM portfolios p
LEFT JOIN company_sector_map csm ON p.top_holding = csm.company_name
LEFT JOIN LATERAL (
    SELECT amount, trade_date, company_name 
    FROM transactions 
    WHERE client_id = p.client_id 
    ORDER BY trade_date DESC LIMIT 1
) t ON true
LEFT JOIN company_sector_map csm_t ON t.company_name = csm_t.company_name;
</pre></div>


<p>The key design principles:</p>



<p><strong>Bucketize, don&#8217;t mask.</strong> Instead of replacing <code>CHF 4.2M</code> with <code>[REDACTED]</code>, replace it with <code>CHF 1-5M range</code>. The LLM loses precision but retains the magnitude — and magnitude is what drives most reasoning. The bucket boundaries should be defined with the business: what granularity does the LLM need to produce useful answers?</p>



<p><strong>Abstract to sector, not to nothing.</strong> Instead of removing <code>Nestlé S.A.</code>, replace it with <code>Swiss large-cap equity</code>. The LLM can still reason about sector concentration, geographic exposure, and market cap risk. A <code>company_sector_map</code> table (maintained by the business or compliance team) drives this consistently.</p>



<p><strong>Preserve relationships, anonymize individuals.</strong> The temporal relationship between risk rating upgrade, KYC renewal, and recent trading activity is preserved — <code>recently upgraded</code> + <code>Q1 2026 KYC</code> + <code>Q4 2025 sale</code>. The LLM can reason about the pattern without knowing who, exactly how much, or which stock.</p>



<p><strong>Encode the rules declaratively.</strong> The abstraction logic lives in PostgreSQL views and functions, not in application code. This means it&#8217;s auditable (you can review the view definition), testable (run the view and inspect the output), and consistent (every query path through this view applies the same rules).</p>
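
<p>To illustrate the &#8220;testable&#8221; point, here is a small check that could run after every change to the abstraction rules. It relies only on the functions, view, and mapping table defined above; the sample inputs are the values from the original chunk.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Spot-check the bucketing functions with the values from the example chunk
SELECT anonymize_amount(4200000)         AS amount_bucket,  -- CHF 1-5M range
       anonymize_date(DATE &#039;2025-11-15&#039;) AS trade_quarter,  -- Q4 2025
       anonymize_percentage(47)          AS concentration;  -- &gt;40%

-- Leak check: no text column of the view should contain a mapped company name
SELECT count(*) AS leaking_rows   -- expected: 0
FROM v_portfolio_llm_safe v
JOIN company_sector_map m
  ON position(m.company_name IN v.concentration_description) &gt; 0
  OR position(m.company_name IN v.recent_activity) &gt; 0;
</pre></div>


<p>The leak check only catches exact matches against <code>company_sector_map</code>. Abbreviations, tickers, or misspellings would need their own entries or a fuzzier comparison.</p>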



<h3 class="wp-block-heading" id="h-the-mapping-problem-when-context-requires-real-entities">The mapping problem: when context requires real entities</h3>



<p>Here&#8217;s where it gets genuinely hard. Some questions require the LLM to know real entity names to be useful.</p>



<p>&#8220;Should the client reduce their Nestlé position given the recent earnings miss?&#8221;</p>



<p>If you&#8217;ve abstracted Nestlé to &#8220;Swiss large-cap FMCG,&#8221; the LLM can&#8217;t connect it to actual earnings data. It can give generic advice about concentration risk, but it can&#8217;t reason about Nestlé-specific fundamentals — which is what the relationship manager actually needs.</p>



<p>There are two approaches I&#8217;ve seen work in practice:</p>



<p><strong>Approach A — Split context, split model calls.</strong> Use the anonymized context for client-specific reasoning (portfolio risk, concentration, suitability) and a separate, non-anonymized call for market/public data reasoning (Nestlé earnings, sector outlook). The LLM never sees both the client identity and the company name in the same context. The application layer merges the two responses.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
Call 1 (anonymized): &quot;Client has &gt;40% concentration in a single 
                      Swiss large-cap FMCG stock. Risk rating recently 
                      upgraded. Assess concentration risk.&quot;

Call 2 (public data): &quot;Analyze recent Nestlé S.A. earnings and 
                       outlook for institutional holders.&quot;

Application layer:    Merge both responses for the RM.
</pre></div>


<p>This is architecturally clean but operationally complex. You need a reliable merging layer, and the LLM can&#8217;t reason holistically across both contexts.</p>



<p><strong>Approach B — Reversible pseudonymization with a mapping table.</strong> Replace real entities with consistent pseudonyms that the LLM treats as real entities. The mapping is stored in PostgreSQL, never sent to the LLM, and used by the application layer to re-hydrate the response before displaying it to the user.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Pseudonym mapping (never exposed to the LLM)
CREATE TABLE entity_pseudonyms (
    real_name       TEXT PRIMARY KEY,
    pseudonym       TEXT UNIQUE NOT NULL,
    entity_type     TEXT NOT NULL  -- &#039;company&#039;, &#039;person&#039;, &#039;fund&#039;
);

-- Example entries:
-- (&#039;Nestlé S.A.&#039;, &#039;Alpine Corp&#039;, &#039;company&#039;)
-- (&#039;Roche Holding AG&#039;, &#039;Glacier Pharma&#039;, &#039;company&#039;)
-- (&#039;Jean-Pierre Müller&#039;, &#039;Client Alpha&#039;, &#039;person&#039;)
</pre></div>


<p>The LLM sees: &#8220;Client Alpha has a 47% position in Alpine Corp and recently sold Glacier Pharma stock.&#8221; It can reason about concentration, sector correlation, and transaction patterns using these pseudonyms as if they were real entities. The application layer maps the pseudonyms back to real names before showing the response to the user.</p>



<p>The advantage: the LLM gets rich, entity-level context. It can reason about &#8220;Alpine Corp&#8221; as a coherent entity across multiple chunks and queries.</p>



<p>The risk: pseudonym consistency must be airtight. If &#8220;Alpine Corp&#8221; appears as &#8220;Nestlé&#8221; in one chunk due to a mapping error, you&#8217;ve leaked. The mapping table must be maintained carefully, and the transformation must happen at a single controlled layer — ideally the PostgreSQL view, not scattered across application code.</p>
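
<p>A single controlled layer could be as small as two functions over <code>entity_pseudonyms</code>. The sketch below is an assumption about how such a layer might look, not a hardened implementation: the function names are made up, and it relies on exact substring replacement.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Replace every mapped real name with its pseudonym (outbound, before the LLM call)
CREATE OR REPLACE FUNCTION pseudonymize(txt TEXT)
RETURNS TEXT AS $$
DECLARE
    result TEXT := txt;
    m      RECORD;
BEGIN
    -- Longest names first, so a long name is never partially replaced by a shorter one
    FOR m IN SELECT real_name, pseudonym
             FROM entity_pseudonyms
             ORDER BY length(real_name) DESC
    LOOP
        result := replace(result, m.real_name, m.pseudonym);
    END LOOP;
    RETURN result;
END;
$$ LANGUAGE plpgsql STABLE;

-- Map pseudonyms back to real names (inbound, before display to the user)
CREATE OR REPLACE FUNCTION rehydrate(txt TEXT)
RETURNS TEXT AS $$
DECLARE
    result TEXT := txt;
    m      RECORD;
BEGIN
    FOR m IN SELECT real_name, pseudonym
             FROM entity_pseudonyms
             ORDER BY length(pseudonym) DESC
    LOOP
        result := replace(result, m.pseudonym, m.real_name);
    END LOOP;
    RETURN result;
END;
$$ LANGUAGE plpgsql STABLE;

-- Usage with the example entries above
SELECT pseudonymize(&#039;Jean-Pierre Müller sold Roche Holding AG and holds Nestlé S.A.&#039;);
</pre></div>


<p>With the three example entries loaded, the call returns &#8220;Client Alpha sold Glacier Pharma and holds Alpine Corp&#8221;. Exact matching is also the weak spot: a chunk that says &#8220;Nestlé&#8221; without &#8220;S.A.&#8221; passes through untouched, so name variants need their own mapping rows or a normalization step before this function runs.</p>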



<h3 class="wp-block-heading" id="h-how-this-differs-across-paradigms">How this differs across paradigms</h3>



<p>The anonymization-utility balance plays out differently depending on the paradigm:</p>



<p><strong>RAG</strong>: You control the transformation at <strong>embedding time</strong>. The chunks stored in pgvector can already be the anonymized/abstracted version. The LLM never has the opportunity to see raw data — it&#8217;s been transformed before it was even indexed. This is the safest model, and it&#8217;s where the PostgreSQL view approach works best: embed from <code>v_portfolio_llm_safe</code>, not from the raw tables.</p>



<p><strong>MCP</strong>: The transformation must happen at <strong>query time</strong>, in the MCP server. This is harder because the LLM generates arbitrary SQL, and you need to ensure that <em>every possible query path</em> returns abstracted data. You can force the MCP server to query through the anonymized views rather than the base tables, but you need to be rigorous about not exposing the raw tables at all. One misconfigured permission and the LLM can <code>SELECT * FROM portfolios</code> directly.</p>
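
<p>A minimal sketch of that permission setup follows, assuming a hypothetical <code>mcp_reader</code> role that the MCP server connects as. The verification query at the end is the part worth automating.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Hypothetical role used by the MCP server
CREATE ROLE mcp_reader LOGIN;

-- No direct access to base tables or mapping tables
REVOKE ALL ON portfolios, transactions, company_sector_map FROM mcp_reader;

-- Access only through the abstraction layer
GRANT SELECT ON v_portfolio_llm_safe TO mcp_reader;

-- Verification: should return false, false, true
SELECT has_table_privilege(&#039;mcp_reader&#039;, &#039;portfolios&#039;,           &#039;SELECT&#039;) AS raw_portfolios,
       has_table_privilege(&#039;mcp_reader&#039;, &#039;transactions&#039;,         &#039;SELECT&#039;) AS raw_transactions,
       has_table_privilege(&#039;mcp_reader&#039;, &#039;v_portfolio_llm_safe&#039;, &#039;SELECT&#039;) AS safe_view;
</pre></div>


<p>This works because a view is, by default, executed with the privileges of its owner, so the reader needs no rights on the underlying tables. <code>has_table_privilege</code> also accounts for privileges inherited through <code>PUBLIC</code> or role membership, which is exactly where a misconfiguration tends to hide.</p>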



<p><strong>Skills</strong>: The transformation must happen at <strong>every step of the procedure</strong> where data flows through the LLM&#8217;s context. A multi-step Skill might read raw data in step 1, transform it in step 2, and reason about it in step 3 — but if the Skill&#8217;s instructions aren&#8217;t precise about when and how to transform, the LLM might shortcut the process and pass raw data into its reasoning context.</p>



<p>The pattern is consistent with the governance matrix: <strong>as you move from RAG to MCP to Skills, maintaining the anonymization-utility balance gets progressively harder</strong> because you have less control over when and how the LLM encounters the data.</p>



<h3 class="wp-block-heading" id="h-the-measurement-gap">The measurement gap</h3>



<p>Finally, there&#8217;s a quality measurement dimension to this that I haven&#8217;t seen well-addressed anywhere.</p>



<p>When you anonymize or abstract data before sending it to the LLM, you need to verify that the transformation didn&#8217;t destroy the LLM&#8217;s ability to answer correctly. This means your golden question/answer evaluation pairs (from the <a href="https://www.dbi-services.com/blog/rag-series-adaptive-rag-understanding-confidence-precision-ndcg/">Adaptive RAG post</a>) need to be run <strong>on the abstracted data, not the raw data</strong>.</p>



<p>If your nDCG score drops from 0.85 on raw data to 0.62 on abstracted data, your bucketization is too aggressive — the LLM is losing too much context. If it stays at 0.83, the abstraction is working. You need to measure this explicitly during development, and re-measure whenever you change the abstraction rules.</p>
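
<p>The comparison can be computed directly in PostgreSQL if the evaluation runs are stored in a table. The sketch below assumes a hypothetical <code>eval_results</code> table with one row per retrieved chunk. It uses linear gain and derives the ideal ordering from the retrieved list only, which is a common simplification rather than the full definition of nDCG.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Hypothetical evaluation results: graded relevance per retrieved chunk
CREATE TABLE eval_results (
    variant     TEXT    NOT NULL,   -- &#039;raw&#039; or &#039;abstracted&#039;
    question_id INT     NOT NULL,
    rank_pos    INT     NOT NULL CHECK (rank_pos &gt;= 1),
    relevance   NUMERIC NOT NULL CHECK (relevance &gt;= 0),
    PRIMARY KEY (variant, question_id, rank_pos)
);

WITH ranked AS (
    SELECT variant, question_id, relevance, rank_pos,
           -- position each chunk would have in the ideal (best-first) ordering
           row_number() OVER (PARTITION BY variant, question_id
                              ORDER BY relevance DESC) AS ideal_pos
    FROM eval_results
),
per_question AS (
    SELECT variant, question_id,
           sum(relevance / log(2.0, rank_pos  + 1)) AS dcg,
           sum(relevance / log(2.0, ideal_pos + 1)) AS idcg
    FROM ranked
    GROUP BY variant, question_id
)
SELECT variant,
       round(avg(dcg / NULLIF(idcg, 0)), 3) AS mean_ndcg
FROM per_question
GROUP BY variant
ORDER BY variant;
</pre></div>


<p>Questions with no relevant chunk at all yield a NULL ratio and are ignored by <code>avg</code>. Depending on how you want to penalize empty retrievals, counting them as zero may be the more honest choice.</p>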



<p>This creates a feedback loop between the security team (who wants maximum anonymization) and the business team (who wants maximum answer quality). The DBA sits in the middle, tuning the PostgreSQL views and measuring the impact. In my experience, this negotiation — finding the right bucket boundaries, the right sector mappings, the right level of temporal abstraction — takes more time than building the RAG pipeline itself. But it&#8217;s the work that makes the difference between a system that compliance signs off on and one that stays in the sandbox.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-what-finma-expects-and-where-each-paradigm-stands">What FINMA Expects (and Where Each Paradigm Stands)</h2>



<p>Without going into a full regulatory analysis, FINMA&#8217;s <a href="https://www.finma.ch/en/">Circular 2023/1 on operational risks and resilience</a> and the broader EBA guidelines on ICT and security risk management establish expectations that directly affect paradigm choice:</p>



<p><strong>Traceability and auditability</strong>: every data access must be traceable to a specific user, purpose, and time. RAG satisfies this naturally through database audit logs. MCP requires careful logging at the server layer. Skills require comprehensive step-by-step execution logging.</p>



<p><strong>Data classification enforcement</strong>: sensitive data must be protected according to its classification level. RAG enforces this at retrieval time via RLS and masking. MCP and Skills require the enforcement to happen at the tool/execution layer, with no guarantee that the LLM won&#8217;t combine or infer beyond what&#8217;s permitted.</p>



<p><strong>Outsourcing and third-party risk</strong>: if the LLM is a cloud service (OpenAI, Anthropic API), the data sent to it matters. RAG sends chunks — you control exactly what leaves your perimeter. MCP sends query results — potentially broader. Skills might send intermediate computation results, client data, or procedure outputs.</p>



<p><strong>Model risk management</strong>: FINMA expects institutions to understand and manage model risk. For RAG, the &#8220;model&#8221; is the embedding model + the retrieval logic — both well-defined and testable. For MCP, the model risk includes the LLM&#8217;s query generation behavior. For Skills, the model risk encompasses the LLM&#8217;s procedural execution behavior — much harder to bound.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-the-irony-we-need-databases-again">The Irony: We Need Databases Again</h2>



<p>I want to close with an observation that keeps coming back to me.</p>



<p>As LLM-powered applications grow more complex — longer conversations, more context, more tools, more steps — we&#8217;re running into a fundamental problem: <strong>how do you manage context effectively across long-running interactions?</strong></p>



<p>Today&#8217;s LLMs have fixed context windows. When the conversation exceeds that window, information is lost. The current solutions? Summarization (lossy), truncation (lossy), or… indexing the context into retrievable storage and fetching relevant pieces when needed.</p>



<p>Sound familiar? That&#8217;s essentially what databases have been doing since the 1960s. There&#8217;s a <a href="https://arxiv.org/pdf/2512.24601" target="_blank" rel="noreferrer noopener">recent paper on Recursive Language Models</a> that literally recreates context indexing into files to improve accuracy and avoid losing details in long-running conversations. It&#8217;s the application-database paradigm, but at the ISAM level.</p>



<p>The irony is not lost on me. We spent decades building sophisticated data management systems — indexing, caching, query optimization, transaction management, access control. Now the AI community is rediscovering these patterns from first principles, often without the benefit of the lessons we&#8217;ve already learned.</p>



<p>As the AI ecosystem matures, I believe we&#8217;ll see the database layer become <em>more</em> central, not less. Whether it&#8217;s pgvector for RAG, PostgreSQL as an MCP server, or a context store for long-running agent conversations — the principles of data management don&#8217;t change just because the client is a language model instead of a human.</p>



<p>And that&#8217;s good news for those of us who&#8217;ve spent our careers in databases. The skills transfer. The patterns transfer. RLS still works. Audit logging still matters. ACID still matters. The challenge is adapting our expertise to new trust boundaries and new failure modes.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-summary"><strong>Summary</strong></h2>



<p>The choice between RAG, MCP, and Skills is not primarily a technical decision. It&#8217;s a governance decision.</p>



<p>RAG keeps the LLM outside the data perimeter. MCP lets the LLM inside, as a query agent. Skills give the LLM the keys to execute procedures. Each step outward increases capability but also increases the surface area for data leakage, inference attacks, and compliance failures.</p>



<p>There&#8217;s another dimension to this progression that deserves explicit attention: as you move from RAG to MCP to Skills, you are also shifting trust from your own architecture to the LLM platform provider. With RAG, the LLM is a stateless text generator — your retrieval pipeline, your database, your access controls do the heavy lifting. With MCP and Skills, you are increasingly relying on the LLM&#8217;s behavioral compliance, its tool-use reliability, and the platform&#8217;s guarantees around data handling, logging, and isolation. In practice, this means trusting Anthropic, OpenAI, or whichever provider powers your inference layer to uphold the security properties your regulator demands.</p>



<p>To their credit, these providers are investing heavily in enterprise readiness. Anthropic and OpenAI both now offer features specifically designed for regulated environments — data residency controls, zero data retention options, SOC 2 compliance, audit logging, and increasingly granular access management. The MCP specification itself was donated to the Linux Foundation&#8217;s Agentic AI Foundation in December 2025, signaling a move toward vendor-neutral governance of the protocol layer. These are meaningful steps. But they don&#8217;t change the fundamental architectural reality: every capability you delegate to the LLM platform is a capability you no longer enforce within your own perimeter. For a CISO in a FINMA-regulated institution, &#8220;the provider is SOC 2 compliant&#8221; is a necessary condition, not a sufficient one.</p>



<p>For regulated environments — banking, healthcare, critical infrastructure — this progression matters. The question is not &#8220;which paradigm is most powerful?&#8221; but &#8220;which paradigm can I govern, audit, and explain to my regulator?&#8221;</p>



<p>As a DBA, my bias is clear: keep as much as possible within the database&#8217;s governance perimeter. PostgreSQL&#8217;s security model has been battle-tested for decades. The LLM is a powerful new client, but it&#8217;s still a client, and the database&#8217;s rules should still apply. Your data pipelines should let you build defense in depth into the architecture and decouple your governance and business logic from the LLM.</p>



<p>The industry will figure out governance for MCP and Skills. The academic work on text-to-SQL evaluation is advancing. The tooling for behavioral evaluation of AI agents is improving. Enterprise features from LLM providers will continue to mature. But today, in February 2026, if a CISO asks me &#8220;can I deploy this safely?&#8221; — for RAG, I can say yes with confidence. For MCP, I can say yes with caveats. For Skills, I say: let&#8217;s build the evaluation framework first.</p>



<p>The article <a href="https://www.dbi-services.com/blog/rag-mcp-skills-three-paradigms-for-llms-talking-to-your-database-and-why-governance-changes-everything/">RAG, MCP, Skills — Three Paradigms for LLMs Talking to Your Database, and Why Governance Changes Everything</a> first appeared on <a href="https://www.dbi-services.com/blog">dbi Blog</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.dbi-services.com/blog/rag-mcp-skills-three-paradigms-for-llms-talking-to-your-database-and-why-governance-changes-everything/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Oracle to PostgreSQL Migration with Flink CDC</title>
		<link>https://www.dbi-services.com/blog/oracle-to-postgresql-migration-with-flink-cdc/</link>
					<comments>https://www.dbi-services.com/blog/oracle-to-postgresql-migration-with-flink-cdc/#respond</comments>
		
		<dc:creator><![CDATA[Adrien Obernesser]]></dc:creator>
		<pubDate>Sun, 30 Nov 2025 22:04:54 +0000</pubDate>
				<category><![CDATA[Oracle]]></category>
		<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[cdc]]></category>
		<category><![CDATA[Flink]]></category>
		<category><![CDATA[migration]]></category>
		<category><![CDATA[postgresql]]></category>
		<guid isPermaLink="false">https://www.dbi-services.com/blog/?p=41688</guid>

					<description><![CDATA[<p>Introduction When migrating from Oracle (&#8220;the big red&#8221;) to PostgreSQL, most of the time you can afford the downtime of the export/import process and start from something fresh. It is simple and reliable, and Ora2pg is one of the go-to tools for that. But sometimes you cannot afford the downtime, either because the database is [&#8230;]</p>
<p>The article <a href="https://www.dbi-services.com/blog/oracle-to-postgresql-migration-with-flink-cdc/">Oracle to PostgreSQL Migration with Flink CDC</a> first appeared on <a href="https://www.dbi-services.com/blog">dbi Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading" id="h-introduction">Introduction</h2>



<p>When migrating from Oracle (&#8220;the big red&#8221;) to PostgreSQL, most of the time you can afford the downtime of the export/import process and start from something fresh. It is simple and reliable, and Ora2pg is one of the go-to tools for that. But sometimes you cannot afford the downtime, either because the database is critical for business operations or because it is too big to run the export/import process. <br>Hence the following example of &#8220;logical replication&#8221; between Oracle and PostgreSQL using Flink CDC. I call it that even though it is really an event stream process, because for DBAs it has roughly the same limitations and constraints as standard logical replication.</p>



<p>Here is the layout:</p>



<p><strong>Oracle Source → Flink CDC → JDBC Sink → PostgreSQL Target</strong></p>



<p>This approach is based on production experience migrating large Oracle databases, where we achieved throughput of <strong>19,500 records per second</strong>—a 65x improvement over our initial baseline. But more importantly, it transformed a &#8220;big bang&#8221; migration event into a controlled, observable, and recoverable process.</p>



<p>The geek in me says that Flink CDC is a powerful tool for migrations. The consultant says it should not be used blindly—it&#8217;s relevant for specific use cases where the benefits outweigh the operational complexity.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-what-each-piece-does">What Each Piece Does</h2>



<ul class="wp-block-list">
<li><strong>Oracle (source)</strong>: The source database. Flink CDC reads directly from tables via JDBC for snapshot mode, or from redo logs for streaming CDC mode.</li>



<li><strong>Flink CDC source (Oracle)</strong>: A Flink table that wraps the Debezium Oracle connector. It reads data and turns it into a changelog stream (insert/update/delete). Key options control snapshot mode, parallelism, and fetch sizes.</li>



<li><strong>Flink runtime</strong>: Runs a streaming job that:
<ul class="wp-block-list">
<li><strong>Snapshot</strong>: Reads current table state, optionally in parallel chunks</li>



<li><strong>Checkpoints</strong>: State is stored so restarts resume exactly from the last acknowledged point</li>



<li><strong>Transforms</strong>: You can filter, project, cast types, and even aggregate in Flink SQL</li>
</ul>
</li>



<li><strong>JDBC sink (PostgreSQL)</strong>: Another Flink table. With a PRIMARY KEY defined, the connector performs UPSERT semantics (<code>INSERT ... ON CONFLICT DO UPDATE</code> in PostgreSQL). It writes in batches, flushes on checkpoints, and retries on transient errors.</li>



<li><strong>PostgreSQL (target)</strong>: Receives the stream and ends up with the migrated data. With proper tuning (especially <code>reWriteBatchedInserts=true</code>), it can handle high throughput.</li>
</ul>
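


<p>For the <code>customers</code> table used later in this post, the upsert issued by the JDBC sink has roughly the following shape. This is a hand-written sketch to illustrate the semantics, assuming <code>customer_id</code> is the primary key on the PostgreSQL side. The sample values are made up, and the exact statement text generated by the connector may differ:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Illustrative upsert: insert the row, or overwrite it if the key already exists
-- (sample values are hypothetical)
INSERT INTO customers (customer_id, first_name, last_name, email, phone, created_at, status)
VALUES (1, &#039;FirstName1&#039;, &#039;LastName1&#039;, &#039;customer1@example.com&#039;, &#039;+1-555-0001&#039;, now(), &#039;ACTIVE&#039;)
ON CONFLICT (customer_id) DO UPDATE SET
    first_name = EXCLUDED.first_name,
    last_name  = EXCLUDED.last_name,
    email      = EXCLUDED.email,
    phone      = EXCLUDED.phone,
    created_at = EXCLUDED.created_at,
    status     = EXCLUDED.status;
</pre></div>


<p>Running such a statement twice still leaves a single row for the key, which is what makes replays after a restart harmless. It also means the target table needs a primary key or unique constraint on <code>customer_id</code>, otherwise PostgreSQL rejects the <code>ON CONFLICT</code> clause.</p>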



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-flink-and-debezium-how-cdc-works">Flink and Debezium: How CDC Works</h2>



<p>Flink CDC connectors use Debezium, an open-source platform for Change Data Capture that captures row-level changes in databases by reading their transaction logs.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
┌───────────────────────────────────────────────────────────────────────┐
│                        Flink CDC Architecture                         │
│                                                                       │
│  ┌──────────────┐    ┌────────────────────────────────┐    ┌────────┐ │
│  │    Oracle    │    │      Flink CDC Connector       │    │  Sink  │ │
│  │   Database   │    │  ┌─────────────────────────┐   │    │Database│ │
│  │              │───►│  │   Debezium (embedded)   │   │───►│        │ │
│  │  • Redo logs │    │  │   ─────────────────     │   │    │  Post- │ │
│  │  • Tables    │    │  │   • Oracle connector    │   │    │  greSQL│ │
│  │              │    │  │   • Log parsing         │   │    │        │ │
│  └──────────────┘    │  │   • Event streaming     │   │    └────────┘ │
│                      │  └─────────────────────────┘   │               │
│                      └────────────────────────────────┘               │
└───────────────────────────────────────────────────────────────────────┘
</pre></div>


<h3 class="wp-block-heading" id="h-why-debezium">Why Debezium?</h3>



<ul class="wp-block-list">
<li><strong>Log-based CDC</strong>: Reads database transaction logs, not polling tables—much lower overhead</li>



<li><strong>Low impact</strong>: Minimal performance hit on source database</li>



<li><strong>Exactly-once delivery</strong>: When combined with Flink&#8217;s checkpointing</li>



<li><strong>Schema tracking</strong>: Handles schema evolution in streaming scenarios</li>
</ul>



<h3 class="wp-block-heading" id="h-snapshot-vs-cdc-modes">Snapshot vs. CDC Modes</h3>



<p>When you configure a Flink CDC source, you can choose:</p>



<ol class="wp-block-list">
<li><strong>Snapshot Only</strong>: Read current table state (what we use in this demo)—fastest for one-time migrations</li>



<li><strong>Snapshot + CDC</strong>: Initial snapshot, then stream ongoing changes—for zero-downtime migrations</li>



<li><strong>CDC Only</strong>: Stream only new changes (requires an existing snapshot)<br><br><em>Note: the snapshot itself can be done within one transaction (which can take a long time for big tables) or as an incremental snapshot. Since I am using Oracle Express Edition for this demo, I will stick with the normal snapshot. If you have big tables to load, Standard or Enterprise Edition is required for supplemental logging.</em></li>
</ol>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-anatomy-of-a-flink-sql-pipeline">Anatomy of a Flink SQL Pipeline</h2>



<p>A Flink SQL migration pipeline has <strong>four distinct parts</strong>. Understanding each part is critical for troubleshooting and optimization.</p>



<h3 class="wp-block-heading" id="h-part-1-runtime-configuration-set-statements">Part 1: Runtime Configuration (SET Statements)</h3>



<p>These settings control how the Flink job executes. Think of them as the &#8220;knobs&#8221; you turn to tune behavior:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
-- Pipeline identification
SET &#039;pipeline.name&#039; = &#039;Oracle-to-PostgreSQL: CUSTOMERS Migration&#039;;

-- Runtime mode: STREAMING for CDC, BATCH for one-time loads
SET &#039;execution.runtime-mode&#039; = &#039;STREAMING&#039;;

-- Parallelism: how many workers process data concurrently
SET &#039;parallelism.default&#039; = &#039;4&#039;;

-- Checkpointing: how often Flink saves progress for recovery
SET &#039;execution.checkpointing.mode&#039; = &#039;AT_LEAST_ONCE&#039;;
SET &#039;execution.checkpointing.interval&#039; = &#039;60 s&#039;;
SET &#039;execution.checkpointing.timeout&#039; = &#039;10 min&#039;;
SET &#039;execution.checkpointing.min-pause&#039; = &#039;30 s&#039;;

-- Restart strategy: what happens on failure
SET &#039;restart-strategy.type&#039; = &#039;fixed-delay&#039;;
SET &#039;restart-strategy.fixed-delay.attempts&#039; = &#039;3&#039;;
SET &#039;restart-strategy.fixed-delay.delay&#039; = &#039;10 s&#039;;
</pre></div>


<p><strong>Key points:</strong></p>



<ul class="wp-block-list">
<li><code>AT_LEAST_ONCE</code> is faster than <code>EXACTLY_ONCE</code> for snapshot migrations where idempotency is guaranteed by upserts</li>



<li>Checkpoint interval affects both recovery granularity and overhead</li>



<li>Higher parallelism isn&#8217;t always better—you can hit contention on the target</li>
</ul>



<h3 class="wp-block-heading" id="h-part-2-source-table-definition-oracle-cdc">Part 2: Source Table Definition (Oracle CDC)</h3>



<p>This defines how Flink reads from Oracle. The column definitions must match your Oracle schema, using Flink SQL types:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
DROP TABLE IF EXISTS src_customers;

CREATE TABLE src_customers (
    -- Column definitions must match Oracle schema
    -- Use Flink SQL types that map to Oracle types
    CUSTOMER_ID   DECIMAL(10,0),
    FIRST_NAME    STRING,
    LAST_NAME     STRING,
    EMAIL         STRING,
    PHONE         STRING,
    CREATED_AT    TIMESTAMP(6),
    STATUS        STRING,
    -- Primary key is required for CDC (NOT ENFORCED = Flink won&#039;t validate)
    PRIMARY KEY (CUSTOMER_ID) NOT ENFORCED
) WITH (
    -- Connector type: oracle-cdc (uses Debezium internally)
    &#039;connector&#039; = &#039;oracle-cdc&#039;,

    -- Oracle connection details
    &#039;hostname&#039; = &#039;oracle&#039;,
    &#039;port&#039; = &#039;1521&#039;,
    &#039;username&#039; = &#039;demo&#039;,
    &#039;password&#039; = &#039;demo&#039;,

    -- Database configuration (pluggable database for Oracle XE)
    -- Use url to connect via service name instead of SID
    &#039;url&#039; = &#039;jdbc:oracle:thin:@//oracle:1521/XEPDB1&#039;,
    &#039;database-name&#039; = &#039;XEPDB1&#039;,
    &#039;schema-name&#039; = &#039;DEMO&#039;,
    &#039;table-name&#039; = &#039;CUSTOMERS&#039;,

    -- Snapshot mode: &#039;initial&#039; = full snapshot, then stop (for snapshot-only)
    &#039;scan.startup.mode&#039; = &#039;initial&#039;,

    -- IMPORTANT: Disable incremental snapshot for this demo
    -- Incremental snapshot requires additional Oracle configuration
    &#039;scan.incremental.snapshot.enabled&#039; = &#039;false&#039;,

    -- Debezium snapshot configuration
    &#039;debezium.snapshot.mode&#039; = &#039;initial&#039;,
    &#039;debezium.snapshot.fetch.size&#039; = &#039;10000&#039;
);
</pre></div>


<p><strong>Key concepts:</strong></p>



<ul class="wp-block-list">
<li><strong>PRIMARY KEY NOT ENFORCED</strong>: Tells Flink the key exists but it won&#8217;t validate uniqueness</li>



<li><strong><code>scan.incremental.snapshot.enabled</code></strong>: Set to <code>false</code> for simple snapshots; <code>true</code> requires Oracle archive log mode and supplemental logging</li>



<li><strong><code>debezium.snapshot.fetch.size</code></strong>: How many rows to fetch per database round-trip—larger = fewer round-trips</li>
</ul>
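


<p>The Oracle-side prerequisites mentioned above typically come down to a few statements like the following. This is a sketch based on the usual Debezium Oracle setup, assuming a privileged (SYSDBA) session and the <code>DEMO.CUSTOMERS</code> table from this demo. Check the connector documentation for your exact version before running it:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Check the current log mode; log-based capture expects ARCHIVELOG
SELECT log_mode FROM v$database;

-- Enable minimal supplemental logging at the database level
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;

-- Log all columns for the captured table so change events carry full rows
ALTER TABLE demo.customers ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;

-- Verify that minimal supplemental logging is active (YES or IMPLICIT)
SELECT supplemental_log_data_min FROM v$database;
</pre></div>


<p>Switching a database to ARCHIVELOG mode requires a restart into MOUNT state, so on a real system this is a change to plan with the Oracle DBA rather than something to toggle during the migration.</p>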



<h3 class="wp-block-heading" id="h-part-3-sink-table-definition-postgresql-jdbc">Part 3: Sink Table Definition (PostgreSQL JDBC)</h3>



<p>This defines how Flink writes to PostgreSQL:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
DROP TABLE IF EXISTS sink_customers;

CREATE TABLE sink_customers (
    -- Column definitions for PostgreSQL target
    customer_id   BIGINT,
    first_name    STRING,
    last_name     STRING,
    email         STRING,
    phone         STRING,
    created_at    TIMESTAMP(6),
    status        STRING,
    PRIMARY KEY (customer_id) NOT ENFORCED
) WITH (
    -- Connector type: jdbc
    &#039;connector&#039; = &#039;jdbc&#039;,

    -- PostgreSQL connection with batch optimization
    -- reWriteBatchedInserts=true is CRITICAL for performance (5-10x improvement)
    &#039;url&#039; = &#039;jdbc:postgresql://postgres:5432/demo?reWriteBatchedInserts=true&#039;,
    &#039;table-name&#039; = &#039;customers&#039;,
    &#039;username&#039; = &#039;demo&#039;,
    &#039;password&#039; = &#039;demo&#039;,
    &#039;driver&#039; = &#039;org.postgresql.Driver&#039;,

    -- Sink parallelism (tune based on target DB capacity)
    -- Too high can cause contention; 4-8 is usually optimal
    &#039;sink.parallelism&#039; = &#039;4&#039;,

    -- Buffer configuration for throughput
    &#039;sink.buffer-flush.max-rows&#039; = &#039;10000&#039;,
    &#039;sink.buffer-flush.interval&#039; = &#039;5 s&#039;,

    -- Retry configuration
    &#039;sink.max-retries&#039; = &#039;3&#039;
);
</pre></div>


<p><strong>Key optimization: <code>reWriteBatchedInserts=true</code></strong> is critical for PostgreSQL performance. Note the capital <code>W</code>: that is how the PostgreSQL JDBC driver spells the property. It tells the driver to rewrite batched single-row INSERT statements into a single multi-row INSERT:</p>



<p>Without this:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
INSERT INTO t (a,b) VALUES (1,&#039;x&#039;);
INSERT INTO t (a,b) VALUES (2,&#039;y&#039;);
INSERT INTO t (a,b) VALUES (3,&#039;z&#039;);
</pre></div>


<p>With <code>reWriteBatchedInserts=true</code>:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
INSERT INTO t (a,b) VALUES (1,&#039;x&#039;),(2,&#039;y&#039;),(3,&#039;z&#039;);
</pre></div>


<p>This single change gave us a <strong>5-10x throughput improvement</strong> in production.</p>



<h3 class="wp-block-heading" id="h-part-4-data-flow-insert-select">Part 4: Data Flow (INSERT…SELECT)</h3>



<p>This starts the actual data migration. The <code>CAST</code> operations convert Oracle types to PostgreSQL types:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
INSERT INTO sink_customers
SELECT
    CAST(CUSTOMER_ID AS BIGINT) AS customer_id,
    FIRST_NAME AS first_name,
    LAST_NAME AS last_name,
    EMAIL AS email,
    PHONE AS phone,
    CREATED_AT AS created_at,
    STATUS AS status
FROM src_customers;
</pre></div>


<p>This single statement:</p>



<ol class="wp-block-list">
<li>Reads from the Oracle source table</li>



<li>Transforms data types (CAST operations)</li>



<li>Writes to the PostgreSQL sink table</li>



<li>Handles batching, parallelism, and fault tolerance automatically</li>
</ol>
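


<p>The same statement is also where filtering and light transformations go. As a hedged variation on the statement above, the sketch below migrates only part of the table and normalizes one column. The <code>&#039;ACTIVE&#039;</code> status value is a hypothetical example, since the actual values stored in <code>STATUS</code> are not shown in this post:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Variation: migrate only a subset of rows and normalize the email column
INSERT INTO sink_customers
SELECT
    CAST(CUSTOMER_ID AS BIGINT) AS customer_id,
    FIRST_NAME AS first_name,
    LAST_NAME AS last_name,
    LOWER(EMAIL) AS email,          -- normalize case during the copy
    PHONE AS phone,
    CREATED_AT AS created_at,
    STATUS AS status
FROM src_customers
WHERE STATUS = &#039;ACTIVE&#039;;          -- hypothetical filter value
</pre></div>


<p>Keeping such logic in the <code>SELECT</code> leaves the source and sink table definitions untouched, which makes it easy to compare a filtered run against a full one.</p>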



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-complete-sql-example">Complete SQL Example</h2>



<p>Here is the complete migration pipeline that you can run in Flink SQL Client. This is production-ready code with all the optimizations we&#8217;ve discussed:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- =============================================================================
-- Flink CDC Pipeline: Migrate CUSTOMERS table (Oracle -&gt; PostgreSQL)
-- =============================================================================
-- Mode: Snapshot-only (no incremental, no streaming)
-- Source: Oracle XE 21c
-- Target: PostgreSQL 18
-- =============================================================================

-- ============================================================================
-- PART 1: Runtime Configuration (SET statements)
-- ============================================================================
-- These settings control how the Flink job executes

SET &#039;pipeline.name&#039; = &#039;Oracle-to-PostgreSQL: CUSTOMERS Migration&#039;;
SET &#039;execution.runtime-mode&#039; = &#039;STREAMING&#039;;
SET &#039;parallelism.default&#039; = &#039;4&#039;;

-- Checkpointing configuration
-- AT_LEAST_ONCE is faster for snapshot/migration workloads
SET &#039;execution.checkpointing.mode&#039; = &#039;AT_LEAST_ONCE&#039;;
SET &#039;execution.checkpointing.interval&#039; = &#039;60 s&#039;;
SET &#039;execution.checkpointing.timeout&#039; = &#039;10 min&#039;;
SET &#039;execution.checkpointing.min-pause&#039; = &#039;30 s&#039;;

-- Restart strategy for fault tolerance
SET &#039;restart-strategy.type&#039; = &#039;fixed-delay&#039;;
SET &#039;restart-strategy.fixed-delay.attempts&#039; = &#039;3&#039;;
SET &#039;restart-strategy.fixed-delay.delay&#039; = &#039;10 s&#039;;

-- ============================================================================
-- PART 2: Source Table Definition (Oracle CDC)
-- ============================================================================
-- This defines how Flink reads from Oracle using Debezium under the hood

DROP TABLE IF EXISTS src_customers;

CREATE TABLE src_customers (
    -- Column definitions must match Oracle schema
    -- Use Flink SQL types that map to Oracle types
    CUSTOMER_ID   DECIMAL(10,0),
    FIRST_NAME    STRING,
    LAST_NAME     STRING,
    EMAIL         STRING,
    PHONE         STRING,
    CREATED_AT    TIMESTAMP(6),
    STATUS        STRING,
    -- Primary key is required for CDC (NOT ENFORCED = Flink won&#039;t validate)
    PRIMARY KEY (CUSTOMER_ID) NOT ENFORCED
) WITH (
    -- Connector type: oracle-cdc (uses Debezium internally)
    &#039;connector&#039; = &#039;oracle-cdc&#039;,

    -- Oracle connection details
    &#039;hostname&#039; = &#039;oracle&#039;,
    &#039;port&#039; = &#039;1521&#039;,
    &#039;username&#039; = &#039;demo&#039;,
    &#039;password&#039; = &#039;demo&#039;,

    -- Database configuration (pluggable database for Oracle XE)
    -- Use url to connect via service name instead of SID
    &#039;url&#039; = &#039;jdbc:oracle:thin:@//oracle:1521/XEPDB1&#039;,
    &#039;database-name&#039; = &#039;XEPDB1&#039;,
    &#039;schema-name&#039; = &#039;DEMO&#039;,
    &#039;table-name&#039; = &#039;CUSTOMERS&#039;,

    -- Snapshot mode: &#039;initial&#039; = full snapshot, then stop (for snapshot-only)
    &#039;scan.startup.mode&#039; = &#039;initial&#039;,

    -- IMPORTANT: Disable incremental snapshot for this demo
    -- Incremental snapshot requires additional Oracle configuration
    &#039;scan.incremental.snapshot.enabled&#039; = &#039;false&#039;,

    -- Debezium snapshot configuration
    &#039;debezium.snapshot.fetch.size&#039; = &#039;10000&#039;
);

-- ============================================================================
-- PART 3: Sink Table Definition (PostgreSQL JDBC)
-- ============================================================================
-- This defines how Flink writes to PostgreSQL

DROP TABLE IF EXISTS sink_customers;

CREATE TABLE sink_customers (
    -- Column definitions for PostgreSQL target
    customer_id   BIGINT,
    first_name    STRING,
    last_name     STRING,
    email         STRING,
    phone         STRING,
    created_at    TIMESTAMP(6),
    status        STRING,
    PRIMARY KEY (customer_id) NOT ENFORCED
) WITH (
    -- Connector type: jdbc
    &#039;connector&#039; = &#039;jdbc&#039;,

    -- PostgreSQL connection with batch optimization
    -- reWriteBatchedInserts=true is CRITICAL for performance (5-10x improvement)
    &#039;url&#039; = &#039;jdbc:postgresql://postgres:5432/demo?reWriteBatchedInserts=true&#039;,
    &#039;table-name&#039; = &#039;customers&#039;,
    &#039;username&#039; = &#039;demo&#039;,
    &#039;password&#039; = &#039;demo&#039;,
    &#039;driver&#039; = &#039;org.postgresql.Driver&#039;,

    -- Sink parallelism (tune based on target DB capacity)
    -- Too high can cause contention; 4-8 is usually optimal
    &#039;sink.parallelism&#039; = &#039;4&#039;,

    -- Buffer configuration for throughput
    &#039;sink.buffer-flush.max-rows&#039; = &#039;10000&#039;,
    &#039;sink.buffer-flush.interval&#039; = &#039;5 s&#039;,

    -- Retry configuration
    &#039;sink.max-retries&#039; = &#039;3&#039;
);

-- ============================================================================
-- PART 4: Data Flow (INSERT...SELECT)
-- ============================================================================
-- This starts the actual data migration
-- CAST operations convert Oracle types to PostgreSQL types

INSERT INTO sink_customers
SELECT
    CAST(CUSTOMER_ID AS BIGINT) AS customer_id,
    FIRST_NAME AS first_name,
    LAST_NAME AS last_name,
    EMAIL AS email,
    PHONE AS phone,
    CREATED_AT AS created_at,
    STATUS AS status
FROM src_customers;
</pre></div>


<hr class="wp-block-separator has-alpha-channel-opacity" />



<h3 class="wp-block-heading" id="h-lab-setup">LAB Setup </h3>



<p>In my lab I am using PostgreSQL 18 and Oracle XE Docker containers, together with Flink JobManager and TaskManager containers, with the following definition:<br> <br><strong>Create a <code>docker-compose.yml</code>:</strong></p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: yaml; title: ; notranslate">
services:
  # Oracle Database 21c XE (Source)
  oracle:
    image: gvenzl/oracle-xe:21-slim-faststart
    container_name: oracle-demo
    environment:
      ORACLE_PASSWORD: OracleDemo123
      APP_USER: demo
      APP_USER_PASSWORD: demo
    ports:
      - &quot;1521:1521&quot;
    volumes:
      - oracle-data:/opt/oracle/oradata
      - ./oracle-init:/container-entrypoint-initdb.d
    healthcheck:
      test: &#x5B;&quot;CMD&quot;, &quot;healthcheck.sh&quot;]
      interval: 30s
      timeout: 10s
      retries: 10
      start_period: 120s
    networks:
      - flink-network

  # PostgreSQL 18 (Target)
  postgres:
    image: postgres:18
    container_name: postgres-demo
    environment:
      POSTGRES_USER: demo
      POSTGRES_PASSWORD: demo
      POSTGRES_DB: demo
    ports:
      - &quot;5432:5432&quot;
    volumes:
      - postgres-data:/var/lib/postgresql
      - ./postgres-init:/docker-entrypoint-initdb.d
    healthcheck:
      test: &#x5B;&quot;CMD-SHELL&quot;, &quot;pg_isready -U demo -d demo&quot;]
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - flink-network

  # Flink Job Manager
  flink-jobmanager:
    build:
      context: ./flink
      dockerfile: Dockerfile
    container_name: flink-jobmanager
    ports:
      - &quot;8081:8081&quot;
    command: jobmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: flink-jobmanager
        jobmanager.memory.process.size: 1600m
        parallelism.default: 4
        state.backend.type: rocksdb
    volumes:
      - ./pipelines:/opt/flink/pipelines
    networks:
      - flink-network
    depends_on:
      oracle:
        condition: service_healthy
      postgres:
        condition: service_healthy

  # Flink Task Manager
  flink-taskmanager:
    build:
      context: ./flink
      dockerfile: Dockerfile
    container_name: flink-taskmanager
    command: taskmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: flink-jobmanager
        taskmanager.memory.process.size: 2048m
        taskmanager.numberOfTaskSlots: 8
    networks:
      - flink-network
    depends_on:
      - flink-jobmanager

volumes:
  oracle-data:
  postgres-data:

networks:
  flink-network:
    driver: bridge
</pre></div>





<p><strong>Create <code>flink/Dockerfile</code>:</strong></p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
FROM flink:1.20.3-scala_2.12-java11

# Download Flink CDC connector for Oracle
RUN wget -P /opt/flink/lib/ \
    https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-oracle-cdc/3.5.0/flink-sql-connector-oracle-cdc-3.5.0.jar

# Download JDBC connector
RUN wget -P /opt/flink/lib/ \
    https://repo1.maven.org/maven2/org/apache/flink/flink-connector-jdbc/3.2.0-1.19/flink-connector-jdbc-3.2.0-1.19.jar

# Download PostgreSQL JDBC driver
RUN wget -P /opt/flink/lib/ \
    https://repo1.maven.org/maven2/org/postgresql/postgresql/42.7.4/postgresql-42.7.4.jar

# Download Oracle JDBC driver
RUN wget -P /opt/flink/lib/ \
    https://repo1.maven.org/maven2/com/oracle/database/jdbc/ojdbc11/23.5.0.24.07/ojdbc11-23.5.0.24.07.jar
</pre></div>





<p>Access the Flink Web UI at: http://localhost:8081</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-running-the-migration">Running the Migration</h2>



<p>Let&#8217;s execute the actual migration with full command outputs.</p>



<h3 class="wp-block-heading" id="h-step-1-verify-source-data-before-migration">Step 1: Verify Source Data (Before Migration)</h3>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
$ docker exec oracle-demo bash -c &quot;echo &#039;SELECT COUNT(*) FROM customers;&#039; | \
    sqlplus -s demo/demo@//localhost:1521/XEPDB1&quot;

  COUNT(*)
----------
     10000
</pre></div>


<h3 class="wp-block-heading" id="h-step-2-verify-target-is-empty-before-migration">Step 2: Verify Target is Empty (Before Migration)</h3>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
$ docker exec postgres-demo psql -U demo -d demo -c &quot;SELECT COUNT(*) FROM customers;&quot;

 count
-------
     0
(1 row)
</pre></div>


<h3 class="wp-block-heading" id="h-step-3-run-the-migration-pipeline">Step 3: Run the Migration Pipeline</h3>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
$ docker exec flink-jobmanager /opt/flink/bin/sql-client.sh \
    -f /opt/flink/pipelines/migrate-customers.sql

&#x5B;INFO] Executing SQL from file.
Flink SQL&gt; SET &#039;pipeline.name&#039; = &#039;Oracle-to-PostgreSQL: CUSTOMERS Migration&#039;;
&#x5B;INFO] Execute statement succeeded.
...
Flink SQL&gt; INSERT INTO sink_customers SELECT ...
&#x5B;INFO] Submitting SQL update statement to the cluster...
&#x5B;INFO] SQL update statement has been successfully submitted to the cluster:
Job ID: c554d99dce69b084607080502c13ffca
</pre></div>


<h3 class="wp-block-heading" id="h-step-4-monitor-progress">Step 4: Monitor Progress</h3>



<p>Check the Flink Web UI at http://localhost:8081 or use the REST API:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
$ curl -s http://localhost:8081/jobs | jq &#039;.jobs&#x5B;] | select(.status == &quot;RUNNING&quot; or .status == &quot;FINISHED&quot;)&#039;
{
  &quot;id&quot;: &quot;c554d99dce69b084607080502c13ffca&quot;,
  &quot;status&quot;: &quot;FINISHED&quot;
}
</pre></div>
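

<p>The same filter can be applied from a script. The following is a minimal Python sketch, assuming the <code>/jobs</code> response has the shape shown in the output above; the helper name <code>jobs_by_status</code> and the second job entry in the sample payload are hypothetical and only serve the illustration.</p>

```python
import json

def jobs_by_status(payload, wanted=("RUNNING", "FINISHED")):
    """Return (id, status) pairs from the body of a Flink /jobs response."""
    jobs = json.loads(payload).get("jobs", [])
    return [(job["id"], job["status"]) for job in jobs if job["status"] in wanted]

# Sample body: the first job is the one from the output above,
# the second one is a hypothetical failed job added for illustration.
sample = json.dumps({
    "jobs": [
        {"id": "c554d99dce69b084607080502c13ffca", "status": "FINISHED"},
        {"id": "00000000000000000000000000000000", "status": "FAILED"},
    ]
})

print(jobs_by_status(sample))
```

<p>In a live setup the payload would come from <code>http://localhost:8081/jobs</code>, for example via <code>urllib.request.urlopen</code>, instead of the inline sample.</p>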


<h3 class="wp-block-heading" id="h-step-5-verify-migration-after">Step 5: Verify Migration (After)</h3>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
$ docker exec postgres-demo psql -U demo -d demo -c &quot;SELECT COUNT(*) FROM customers;&quot;

 count
-------
 10000
(1 row)

$ docker exec postgres-demo psql -U demo -d demo -c &quot;SELECT * FROM customers LIMIT 3;&quot;

 customer_id |  first_name   |  last_name   |          email           |    phone
-------------+---------------+--------------+--------------------------+-------------
        8836 | FirstName8836 | LastName8836 | customer8836@example.com | +1-555-8836
        4740 | FirstName4740 | LastName4740 | customer4740@example.com | +1-555-4740
        8835 | FirstName8835 | LastName8835 | customer8835@example.com | +1-555-8835
(3 rows)
</pre></div>


<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-type-mapping-reference">Type Mapping Reference</h2>



<p>Before migrating data, you need to understand the type conversions. Here&#8217;s a reference:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Oracle</th><th>Flink SQL</th><th>PostgreSQL</th><th>Notes</th></tr></thead><tbody><tr><td>NUMBER(10)</td><td>DECIMAL(10,0)</td><td>BIGINT</td><td>Use CAST in SELECT</td></tr><tr><td>NUMBER(12,2)</td><td>DECIMAL(12,2)</td><td>NUMERIC(12,2)</td><td>Direct mapping</td></tr><tr><td>VARCHAR2(n)</td><td>STRING</td><td>VARCHAR(n)</td><td>Direct mapping</td></tr><tr><td>DATE</td><td>TIMESTAMP(6)</td><td>TIMESTAMP</td><td>Oracle DATE includes time</td></tr><tr><td>TIMESTAMP</td><td>TIMESTAMP(6)</td><td>TIMESTAMP</td><td>Direct mapping</td></tr><tr><td>CLOB</td><td>STRING</td><td>TEXT</td><td>Large text</td></tr><tr><td>BLOB</td><td>BYTES</td><td>BYTEA</td><td>Binary data</td></tr></tbody></table></figure>
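

<p>When many tables are involved, the table above can be turned into a small lookup. This is a rough Python sketch, not part of the original pipeline: the function name <code>map_oracle_type</code> is hypothetical, and the rule that integer-like <code>NUMBER(p)</code> columns with more than 18 digits stay <code>NUMERIC</code> instead of <code>BIGINT</code> is an added assumption to avoid overflow.</p>

```python
import re

# Base types taken from the table above: Oracle -> (Flink SQL, PostgreSQL)
TYPE_MAP = {
    "VARCHAR2": ("STRING", "VARCHAR"),
    "DATE": ("TIMESTAMP(6)", "TIMESTAMP"),
    "TIMESTAMP": ("TIMESTAMP(6)", "TIMESTAMP"),
    "CLOB": ("STRING", "TEXT"),
    "BLOB": ("BYTES", "BYTEA"),
}

TYPE_PATTERN = re.compile(r"(\w+)(?:\((\d+)(?:,(\d+))?\))?")

def map_oracle_type(oracle_type):
    """Return (flink_type, postgres_type) for an Oracle column type."""
    match = TYPE_PATTERN.fullmatch(oracle_type.upper().replace(" ", ""))
    if match is None:
        raise ValueError(f"cannot parse type: {oracle_type!r}")
    base, precision, scale = match.groups()
    if base == "NUMBER" and precision is not None:
        if scale is None or scale == "0":
            # Assumption: more than 18 digits may not fit into BIGINT.
            if int(precision) > 18:
                return (f"DECIMAL({precision},0)", f"NUMERIC({precision},0)")
            return (f"DECIMAL({precision},0)", "BIGINT")
        return (f"DECIMAL({precision},{scale})", f"NUMERIC({precision},{scale})")
    if base == "VARCHAR2" and precision is not None:
        return ("STRING", f"VARCHAR({precision})")
    if base in TYPE_MAP:
        return TYPE_MAP[base]
    raise ValueError(f"no mapping for {oracle_type!r}")

print(map_oracle_type("NUMBER(10)"))
print(map_oracle_type("NUMBER(12,2)"))
```

<p>Raising an error for unknown types is deliberate: a type that silently falls through is exactly the kind of conversion that should be reviewed by hand.</p>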



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-performance-optimization-what-we-learned">Performance Optimization: What We Learned</h2>



<p>Based on production experience, here are the key optimizations that improved throughput from <strong>300 rec/sec to 19,500 rec/sec</strong> (65x improvement).</p>



<h3 class="wp-block-heading" id="h-understanding-cpu-bound-vs-iops-bound-pipelines">Understanding CPU-Bound vs. IOPS-Bound Pipelines</h3>



<p>Before tuning, you need to understand what&#8217;s limiting your pipeline. This is critical because the solutions are different:</p>



<p><strong>CPU-Bound Pipeline:</strong></p>



<ul class="wp-block-list">
<li>Symptoms: High CPU usage on Flink Task Manager, low disk I/O on target database</li>



<li>Cause: Complex transformations, serialization/deserialization overhead, too few parallel workers</li>



<li>Solution: Increase parallelism, simplify transformations, use more Task Manager slots</li>
</ul>



<p><strong>IOPS-Bound Pipeline:</strong></p>



<ul class="wp-block-list">
<li>Symptoms: Low CPU usage on Flink, high disk I/O or lock contention on target database</li>



<li>Cause: Too many small writes, target database bottleneck, excessive parallelism causing lock contention</li>



<li>Solution: Larger batch sizes, <code>rewriteBatchedInserts=true</code>, reduce sink parallelism, tune target database</li>
</ul>



<p><strong>Network-Bound Pipeline:</strong></p>



<ul class="wp-block-list">
<li>Symptoms: High network wait times, gaps between source reads and sink writes</li>



<li>Cause: Small fetch sizes, high latency between Flink and databases</li>



<li>Solution: Larger fetch sizes, co-locate components when possible</li>
</ul>



<h3 class="wp-block-heading" id="h-how-to-identify-your-bottleneck">How to Identify Your Bottleneck</h3>



<p>In the Flink Web UI, look at:</p>



<ol class="wp-block-list">
<li><strong>Backpressure indicators</strong>: Red/yellow backpressure on source = sink can&#8217;t keep up (IOPS-bound)</li>



<li><strong>Records sent/received</strong>: Compare source output rate vs. sink input rate</li>



<li><strong>Checkpoint duration</strong>: Long checkpoints often indicate IOPS issues on state backend</li>



<li><strong>Task Manager metrics</strong>: CPU%, memory usage, GC pauses</li>
</ol>



<p>On your databases:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
-- Oracle: check redo log generation rate
SELECT * FROM V$SYSSTAT WHERE NAME LIKE &#039;%redo%&#039;;

-- PostgreSQL: check write activity
SELECT * FROM pg_stat_bgwriter;
SELECT * FROM pg_stat_database WHERE datname = &#039;demo&#039;;
</pre></div>


<h3 class="wp-block-heading" id="h-critical-optimizations">Critical Optimizations</h3>



<h4 class="wp-block-heading" id="h-1-jdbc-batch-rewriting-5-10x-improvement">1. JDBC Batch Rewriting (5-10x Improvement)</h4>



<p>The single most impactful optimization for IOPS-bound pipelines:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
&#039;url&#039; = &#039;jdbc:postgresql://host/db?rewriteBatchedInserts=true&#039;
</pre></div>


<p>This is so important I&#8217;ll repeat it: <strong>this single parameter gave us 5-10x throughput improvement</strong>. Without it, every row is a separate INSERT statement. With it, rows are batched into efficient multi-row INSERTs.</p>
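

<p>To make the difference concrete, the following Python sketch only builds the two statement shapes as text. It does not reproduce the driver&#8217;s internal logic, which may group rows differently; the helper names are hypothetical.</p>

```python
def single_row_inserts(table, columns, n_rows):
    """One parameterized INSERT per row: the shape used without rewriting."""
    cols = ", ".join(columns)
    params = ", ".join("?" for _ in columns)
    return [f"INSERT INTO {table} ({cols}) VALUES ({params})"] * n_rows

def multi_row_insert(table, columns, n_rows):
    """A single INSERT carrying n_rows value tuples: the rewritten shape."""
    cols = ", ".join(columns)
    one_row = "(" + ", ".join("?" for _ in columns) + ")"
    return f"INSERT INTO {table} ({cols}) VALUES " + ", ".join([one_row] * n_rows)

columns = ["customer_id", "first_name"]
separate = single_row_inserts("customers", columns, 3)
combined = multi_row_insert("customers", columns, 3)

print(len(separate), "statements without rewriting")
print(combined)
```

<p>Three rows become one statement instead of three, which is where the reduction in per-statement overhead on the target comes from.</p>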



<h4 class="wp-block-heading" id="h-2-sink-parallelism-2-4x-improvement">2. Sink Parallelism (2-4x Improvement)</h4>



<p>More workers can process more data—but there&#8217;s a sweet spot:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
&#039;sink.parallelism&#039; = &#039;12&#039;
</pre></div>


<p>Our testing showed:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Parallelism</th><th>Throughput</th><th>Notes</th></tr></thead><tbody><tr><td>1</td><td>5,000 rec/sec</td><td>Baseline</td></tr><tr><td>4</td><td>12,000 rec/sec</td><td>Good improvement</td></tr><tr><td>8</td><td>17,000 rec/sec</td><td>Still scaling</td></tr><tr><td>12</td><td>19,500 rec/sec</td><td>Sweet spot</td></tr><tr><td>16</td><td>18,000 rec/sec</td><td>Contention starts</td></tr><tr><td>24</td><td>15,000 rec/sec</td><td>Too much contention</td></tr></tbody></table></figure>



<p><strong>Why does too much parallelism hurt?</strong> Lock contention on the target database. Each parallel writer tries to acquire locks, and beyond a certain point, they spend more time waiting than writing.</p>
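

<p>Finding the sweet spot from such a test series is simple arithmetic. The sketch below uses the numbers from the table above; the function names are hypothetical, and the per-worker figure is just one possible way to see where scaling flattens.</p>

```python
# Throughput (rec/sec) per sink parallelism, taken from the table above.
measurements = {1: 5000, 4: 12000, 8: 17000, 12: 19500, 16: 18000, 24: 15000}

def sweet_spot(results):
    """Return the parallelism with the highest measured throughput."""
    return max(results, key=results.get)

def throughput_per_worker(results):
    """Records per second contributed by each writer, per parallelism."""
    return {p: results[p] / p for p in sorted(results)}

best = sweet_spot(measurements)
per_worker = throughput_per_worker(measurements)

print("best parallelism:", best)
for parallelism, rate in per_worker.items():
    print(parallelism, "writers:", rate, "rec/sec per writer")
```

<p>The per-writer rate drops steadily as writers are added, and past 12 writers the total drops as well, which matches the contention explanation above.</p>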



<h4 class="wp-block-heading" id="h-3-buffer-size-tuning">3. Buffer Size Tuning</h4>



<p>Larger buffers = fewer flushes = better throughput (at cost of memory and latency):</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
&#039;sink.buffer-flush.max-rows&#039; = &#039;50000&#039;
&#039;sink.buffer-flush.interval&#039; = &#039;10 s&#039;
</pre></div>


<p>For IOPS-bound pipelines, larger buffers are critical. For CPU-bound pipelines, smaller buffers with higher parallelism may be better.</p>
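

<p>A quick back-of-the-envelope calculation shows why the buffer size matters. This sketch assumes flushes are triggered by the row limit only (the interval is ignored), and the 10 million row volume is an illustrative figure, not a measurement.</p>

```python
def flush_count(total_rows, max_rows):
    """Number of flushes needed when each flush carries at most max_rows rows."""
    return -(-total_rows // max_rows)  # integer ceiling division

total_rows = 10_000_000  # illustrative volume handled by one writer

small_buffer = flush_count(total_rows, 1_000)
large_buffer = flush_count(total_rows, 50_000)

print("flushes with max-rows 1000: ", small_buffer)
print("flushes with max-rows 50000:", large_buffer)
```

<p>The same arithmetic applies to source round-trips when the fetch size is raised.</p>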



<h4 class="wp-block-heading" id="h-4-source-fetch-size">4. Source Fetch Size</h4>



<p>Reduce round-trips to the source database:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- For JDBC connector:
&#039;scan.fetch-size&#039; = &#039;20000&#039;

-- For CDC connector:
&#039;debezium.snapshot.fetch.size&#039; = &#039;20000&#039;
</pre></div>


<p>Larger fetch sizes reduce network overhead but increase memory usage. Find your balance based on available memory.</p>



<h4 class="wp-block-heading" id="h-5-checkpointing-mode">5. Checkpointing Mode</h4>



<p>For migrations (where exactly-once is less critical):</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
SET &#039;execution.checkpointing.mode&#039; = &#039;AT_LEAST_ONCE&#039;;
</pre></div>


<p><code>AT_LEAST_ONCE</code> is faster than <code>EXACTLY_ONCE</code> because operators do not have to align checkpoint barriers across their parallel input channels. Since our sink uses upserts (<code>INSERT ... ON CONFLICT</code>), replaying records after a recovery is idempotent anyway.</p>
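

<p>The idempotency argument can be checked with a small experiment. The sketch below uses Python&#8217;s built-in <code>sqlite3</code> module as a stand-in for PostgreSQL (an assumption made only to keep the example self-contained; it needs SQLite 3.24 or newer for the upsert syntax) and applies the same batch twice, as would happen when records are replayed after a restart.</p>

```python
import sqlite3

def apply_batch(conn, rows):
    """Upsert a batch: insert new keys, overwrite existing ones."""
    conn.executemany(
        "INSERT INTO customers (customer_id, email) VALUES (?, ?) "
        "ON CONFLICT (customer_id) DO UPDATE SET email = excluded.email",
        rows,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, email TEXT)")

batch = [(8835, "customer8835@example.com"), (8836, "customer8836@example.com")]
apply_batch(conn, batch)
apply_batch(conn, batch)  # replay of the same records

rows = conn.execute(
    "SELECT customer_id, email FROM customers ORDER BY customer_id"
).fetchall()
print(rows)
```

<p>The table ends up with the same two rows regardless of how often the batch is replayed, which is why duplicates from at-least-once delivery do no harm here.</p>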



<h4 class="wp-block-heading" id="h-6-checkpoint-interval">6. Checkpoint Interval</h4>



<p>Longer intervals = less overhead, but longer recovery time on failure:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
SET &#039;execution.checkpointing.interval&#039; = &#039;60 s&#039;;
</pre></div>


<p>For our production migrations, 45-60 seconds was optimal. Shorter intervals caused excessive state backend I/O (another IOPS consideration).</p>



<h3 class="wp-block-heading" id="h-performance-reference">Performance Reference</h3>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Setting</th><th>Baseline</th><th>Optimized</th><th>Impact</th></tr></thead><tbody><tr><td>rewriteBatchedInserts</td><td>false</td><td>true</td><td>5-10x</td></tr><tr><td>sink.parallelism</td><td>1</td><td>12</td><td>2-4x</td></tr><tr><td>buffer-flush.max-rows</td><td>1000</td><td>50000</td><td>1.5-2x</td></tr><tr><td>fetch-size</td><td>1000</td><td>20000</td><td>1.3-1.5x</td></tr><tr><td>checkpoint.mode</td><td>EXACTLY_ONCE</td><td>AT_LEAST_ONCE</td><td>1.2-1.3x</td></tr><tr><td><strong>Combined Throughput</strong></td><td>300 rec/sec</td><td>19,500 rec/sec</td><td><strong>65x</strong></td></tr></tbody></table></figure>
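

<p>As a sanity check, the individual impact ranges can be multiplied and compared with the observed overall gain. Multiplying assumes the gains are independent of each other, which is a simplification and not something the measurements above prove.</p>

```python
from math import prod

# Impact ranges (low, high) taken from the table above.
impact_ranges = {
    "rewriteBatchedInserts": (5, 10),
    "sink.parallelism": (2, 4),
    "buffer-flush.max-rows": (1.5, 2),
    "fetch-size": (1.3, 1.5),
    "checkpoint.mode": (1.2, 1.3),
}

low = prod(lo for lo, _ in impact_ranges.values())
high = prod(hi for _, hi in impact_ranges.values())
observed = 19500 / 300

print("combined range: %.1fx to %.1fx" % (low, high))
print("observed:       %.1fx" % observed)
```

<p>The observed 65x sits inside the combined range of roughly 23x to 156x, so the individual figures are at least consistent with the overall result.</p>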



<h3 class="wp-block-heading" id="h-real-world-tuning-process">Real-World Tuning Process</h3>



<p>Here&#8217;s how I approach tuning a new migration:</p>



<ol class="wp-block-list">
<li><strong>Start with defaults</strong>: Run the pipeline and observe behavior in Flink UI</li>



<li><strong>Identify the bottleneck</strong>: Is it CPU, IOPS, or network?</li>



<li><strong>Apply the biggest lever first</strong>: Usually <code>rewriteBatchedInserts=true</code> for PostgreSQL</li>



<li><strong>Increase parallelism gradually</strong>: Watch for the point where throughput stops improving</li>



<li><strong>Tune batch sizes</strong>: Larger for IOPS-bound, smaller for CPU-bound</li>



<li><strong>Monitor the target database</strong>: Watch for lock contention, checkpoint lag, WAL accumulation</li>



<li><strong>Document your findings</strong>: Each environment is different; what works for one may not work for another</li>
</ol>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-incremental-snapshot-for-large-databases">Incremental Snapshot for Large Databases</h2>



<p>For databases larger than ~100GB, incremental snapshot mode is essential. Instead of reading entire tables at once (which can cause locks and memory issues), incremental snapshot divides tables into chunks.</p>



<h3 class="wp-block-heading" id="h-what-is-incremental-snapshot">What is Incremental Snapshot?</h3>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
┌─────────────────────────────────────────────────────────────────┐
│                   Incremental Snapshot                          │
│                                                                 │
│  Table (1M rows, chunked by ID):                                │
│                                                                 │
│  ┌───────┬───────┬───────┬───────┬───────┐                      │
│  │ Chunk │ Chunk │ Chunk │ Chunk │ Chunk │                      │
│  │  1    │  2    │  3    │  4    │  5    │  ...                 │
│  │ 1-200K│200K-  │400K-  │600K-  │800K-  │                      │
│  │       │ 400K  │ 600K  │ 800K  │ 1M    │                      │
│  └───┬───┴───┬───┴───┬───┴───┬───┴───┬───┘                      │
│      │       │       │       │                                  │
│      ▼       ▼       ▼       ▼                                  │
│   Process in parallel, no table locks                           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
</pre></div>
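

<p>The chunking idea from the diagram can be sketched in a few lines of Python. This is a simplified illustration with evenly sized chunks over a numeric key; it is not the splitting algorithm the connector actually uses, and the function name is hypothetical.</p>

```python
def chunk_ranges(min_id, max_id, chunk_size):
    """Split the inclusive key range [min_id, max_id] into contiguous chunks."""
    ranges = []
    for start in range(min_id, max_id + 1, chunk_size):
        end = min(start + chunk_size - 1, max_id)
        ranges.append((start, end))
    return ranges

chunks = chunk_ranges(1, 1_000_000, 200_000)
for number, (start, end) in enumerate(chunks, start=1):
    # Each chunk corresponds to a bounded query on the key column.
    print(f"chunk {number}: WHERE id BETWEEN {start} AND {end}")
```

<p>Each chunk is a bounded range scan, so chunks can be read by different workers and retried individually.</p>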


<h3 class="wp-block-heading" id="h-oracle-requirements">Oracle Requirements</h3>



<p>Incremental snapshot with CDC requires additional Oracle configuration:</p>



<ol class="wp-block-list">
<li><strong>Archive Log Mode</strong>: Must be enabled.<br><code>-- Check current mode</code><br><code>SELECT LOG_MODE FROM V$DATABASE;</code><br><code>-- Enable (requires DB restart)</code><br><code>SHUTDOWN IMMEDIATE;</code><br><code>STARTUP MOUNT;</code><br><code>ALTER DATABASE ARCHIVELOG;</code><br><code>ALTER DATABASE OPEN;</code></li>



<li><strong>Supplemental Logging</strong>:<br><code>ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;</code><br><code>ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS;</code></li>



<li><strong>LogMiner Privileges</strong> for the CDC user:<br><code>GRANT SELECT ON V_$DATABASE TO cdc_user;</code><br><code>GRANT SELECT ON V_$LOG TO cdc_user;</code><br><code>GRANT SELECT ON V_$LOGFILE TO cdc_user;</code><br><code>GRANT SELECT ON V_$ARCHIVED_LOG TO cdc_user;</code><br><code>GRANT EXECUTE ON DBMS_LOGMNR TO cdc_user;</code><br><code>GRANT EXECUTE ON DBMS_LOGMNR_D TO cdc_user;</code><br><code>GRANT SELECT ON V_$LOGMNR_LOGS TO cdc_user;</code><br><code>GRANT SELECT ON V_$LOGMNR_CONTENTS TO cdc_user;</code><br><code>GRANT FLASHBACK ANY TABLE TO cdc_user;</code></li>
</ol>



<h3 class="wp-block-heading" id="h-pipeline-configuration">Pipeline Configuration</h3>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
CREATE TABLE src_large_table (...) WITH (
    &#039;connector&#039; = &#039;oracle-cdc&#039;,
    &#039;url&#039; = &#039;jdbc:oracle:thin:@//oracle:1521/XEPDB1&#039;,
    &#039;database-name&#039; = &#039;XEPDB1&#039;,
    &#039;schema-name&#039; = &#039;DEMO&#039;,
    &#039;table-name&#039; = &#039;LARGE_TABLE&#039;,

    -- Enable incremental snapshot
    &#039;scan.incremental.snapshot.enabled&#039; = &#039;true&#039;,
    &#039;scan.incremental.snapshot.chunk.size&#039; = &#039;100000&#039;,
    &#039;scan.incremental.snapshot.chunk.key-column&#039; = &#039;ID&#039;,

    -- Debezium settings
    &#039;debezium.snapshot.fetch.size&#039; = &#039;20000&#039;
);
</pre></div>


<h3 class="wp-block-heading" id="h-when-to-use-incremental-snapshot">When to Use Incremental Snapshot</h3>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Database Size</th><th>Recommendation</th></tr></thead><tbody><tr><td>&lt; 10 GB</td><td>Standard snapshot (JDBC)</td></tr><tr><td>10-100 GB</td><td>Either approach works</td></tr><tr><td>&gt; 100 GB</td><td>Incremental snapshot required</td></tr><tr><td>Active production DB</td><td>Incremental snapshot recommended</td></tr></tbody></table></figure>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-production-implementation-advice">Production Implementation Advice</h2>



<p>Before taking this approach to production, keep the following considerations in mind:</p>



<ol class="wp-block-list">
<li><strong>Deploy on Kubernetes</strong>: This lab setup runs Flink in standalone mode, which is fine for testing but lacks persistence: if you restart the Flink processes, your pipelines are lost. For production, deploy on Kubernetes using the official Flink Kubernetes Operator, which provides proper state management, automatic recovery, and horizontal scaling.</li>



<li><strong>Check version compatibility</strong>: Not all of the latest versions of Flink, the CDC connectors, and the JDBC drivers work together. I learned this the hard way, so check the compatibility matrix before building your stack and stick with LTS versions like Flink 1.20 when possible.</li>



<li><strong>Externalize your checkpoints</strong>: Store them on durable storage like S3, MinIO, or HDFS rather than on the local filesystem. This enables true fault tolerance and job recovery across restarts.</li>



<li><strong>Implement proper monitoring</strong>: Connect Flink&#8217;s metrics to Prometheus and Grafana and set up alerts for checkpoint failures, backpressure, and throughput drops. The Web UI is great for debugging but not for 24/7 operations.</li>



<li><strong>Secure your connections</strong>: Use SSL/TLS for database connections, store credentials in a secrets manager rather than in plain text in SQL files, and implement network segmentation between Flink and your databases.</li>



<li><strong>Consider managed services</strong>: If your organization allows it, seriously consider managed services like AWS Managed Flink, Confluent Cloud, or Azure Stream Analytics, which eliminate most of the operational burden of running Flink clusters yourself.</li>
</ol>



<p>The official documentation provides comprehensive guidance for production deployments: <a href="https://nightlies.apache.org/flink/flink-cdc-docs-stable/docs/get-started/introduction/">Apache Flink CDC Introduction</a>.</p>



<p>As an example, in one migration project for an 800 GB Oracle database with around 1,500 tables and 4.8 billion rows, the VM hosting the Flink services had 16 cores and 48 GB of RAM. The initial incremental snapshot took 3.5 days at a throughput of 18,000 records/sec and 15k IOPS. Several pieces of automation had to be built, for example to generate the pipelines for all tables and to move sequentially from the initial load to the streaming phase while keeping the CPU cores busy.</p>
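

<p>The figures from this project can be cross-checked with simple arithmetic. The sketch below assumes the quoted throughput was sustained for the whole run, which is an idealization.</p>

```python
SECONDS_PER_DAY = 86_400

def snapshot_days(total_rows, rows_per_second):
    """Ideal snapshot duration in days at a constant throughput."""
    return total_rows / rows_per_second / SECONDS_PER_DAY

estimate = snapshot_days(4_800_000_000, 18_000)
print("ideal duration: %.2f days" % estimate)
```

<p>At a constant 18,000 records/sec the load would take a little over three days. The reported 3.5 days is in the same range; the difference presumably comes from phases that ran below the quoted rate, although that reading is an assumption.</p>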



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-what-we-ve-learned">What We&#8217;ve Learned</h2>



<p>Through this guide, we&#8217;ve explored database migration with Flink CDC and learned several important lessons.</p>



<p><strong>Technical:</strong> Start simple with snapshot mode and add complexity like incremental or streaming CDC only when needed; don&#8217;t overengineer for a one-time migration. Understanding your bottleneck is critical because the tuning strategy differs completely depending on whether your pipeline is CPU-bound, IOPS-bound, or network-bound. The <code>rewriteBatchedInserts=true</code> parameter gave us a 5-10x improvement on PostgreSQL with a single setting. Parallelism has a sweet spot where more isn&#8217;t always better: we found 12 workers optimal before lock contention started hurting performance. Checkpointing is a trade-off between throughput and recovery time, with 45-60 seconds being optimal for our migrations. Type mapping matters because incorrect Oracle → Flink → PostgreSQL conversions can cause silent data corruption.</p>



<p><strong>Operational:</strong> Monitor everything using the Flink Web UI alongside source and target database metrics. Test thoroughly in a test environment first because production surprises are expensive. Have a rollback plan by keeping the source database running until cutover is verified, and document your tuning because each environment is different.</p>



<p><strong>Strategic:</strong> Know when NOT to use Flink, since simpler tools are better for small databases or same-technology migrations. Factor in the operational complexity of maintaining another system, and consider cloud-managed Flink/CDC solutions if your organization allows it.</p>



<h2 class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>Flink CDC transforms database migrations from anxious &#8220;big bang&#8221; events into controlled, observable, and recoverable processes by combining real-time monitoring in the Flink Web UI, fault tolerance through checkpointing, configurable parallelism for performance, and transform capabilities in Flink SQL—making it a powerful tool for cross-technology migrations. We achieved a 65x throughput improvement (300 → 19,500 rec/sec) by understanding our bottlenecks and applying targeted optimizations, with the key insight being to identify whether you&#8217;re CPU-bound or IOPS-bound and tune accordingly. As with any tool, use it where it fits: for large, cross-technology migrations with near zero-downtime requirements, Flink CDC is excellent, but for small databases or simple same-technology copies, stick with native tools.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-resources">Resources</h2>



<ul class="wp-block-list">
<li><a href="https://flink.apache.org/docs/">Apache Flink Documentation</a></li>



<li><a href="https://github.com/apache/flink-cdc">Flink CDC Connectors</a></li>



<li><a href="https://debezium.io/documentation/reference/connectors/oracle.html">Debezium Oracle Connector</a></li>



<li><a href="https://ora2pg.darold.net/">ora2pg Tool</a></li>



<li><a href="https://jdbc.postgresql.org/">PostgreSQL JDBC Driver</a></li>
</ul>



<p>The article <a href="https://www.dbi-services.com/blog/oracle-to-postgresql-migration-with-flink-cdc/">Oracle to PostgreSQL Migration with Flink CDC</a> first appeared on <a href="https://www.dbi-services.com/blog">dbi Blog</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.dbi-services.com/blog/oracle-to-postgresql-migration-with-flink-cdc/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>PostgreSQL 19: Two nice little improvements: log_autoanalyze_min_duration and search_path in the psql prompt</title>
		<link>https://www.dbi-services.com/blog/postgresql-19-two-nice-little-improvements-log_autoanalyze_min_duration-and-search_path-in-the-psql-prompt/</link>
					<comments>https://www.dbi-services.com/blog/postgresql-19-two-nice-little-improvements-log_autoanalyze_min_duration-and-search_path-in-the-psql-prompt/#respond</comments>
		
		<dc:creator><![CDATA[Daniel Westermann]]></dc:creator>
		<pubDate>Wed, 29 Oct 2025 07:10:34 +0000</pubDate>
				<category><![CDATA[PostgreSQL]]></category>
		<guid isPermaLink="false">https://www.dbi-services.com/blog/?p=41345</guid>

					<description><![CDATA[<p>Two nice little improvements have been committed for PostgreSQL 19. The first one is about logging the duration of automatic analyze while the second one is about displaying the current search_path in psql&#8217;s prompt. Lets start with the improvement for psql. As you probably know, the default prompt in psql looks like this: While I [&#8230;]</p>
<p>The article <a href="https://www.dbi-services.com/blog/postgresql-19-two-nice-little-improvements-log_autoanalyze_min_duration-and-search_path-in-the-psql-prompt/">PostgreSQL 19: Two nice little improvements: log_autoanalyze_min_duration and search_path in the psql prompt</a> first appeared on <a href="https://www.dbi-services.com/blog">dbi Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Two nice little improvements have been committed for PostgreSQL 19. The first one is about logging the duration of automatic analyze while the second one is about displaying the current search_path in psql&#8217;s prompt.</p>



<p>Let&#8217;s start with the improvement for psql. As you probably know, the default prompt in psql looks like this:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
postgres@:/home/postgres/ &#x5B;pgdev] psql
psql (19devel)
Type &quot;help&quot; for help.

postgres=# 
</pre></div>


<p>While I do not have an issue with the default prompt, you may want to see more information. An example of what you might do is this:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
postgres=# \set PROMPT1 &#039;%M:%&gt; %n@%/%R%#%x &#039;
&#x5B;local]:5432 postgres@postgres=# 
</pre></div>


<p>Now you immediately see that this is a connection over a socket on port 5432, and you&#8217;re connected as the &#8220;postgres&#8221; user to the &#8220;postgres&#8221; database. If you want to make this permanent, add it to your &#8220;<a href="https://www.postgresql.org/docs/current/app-psql.html#APP-PSQL-FILES-PSQLRC" target="_blank" rel="noreferrer noopener">.psqlrc</a>&#8221; file.</p>



<p>The new prompting option which will come with PostgreSQL 19 is &#8220;%S&#8221;, and this will give you the <a href="https://www.postgresql.org/docs/current/ddl-schemas.html#DDL-SCHEMAS-PATH" target="_blank" rel="noreferrer noopener">search_path</a>:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
postgres=# show search_path;
   search_path   
-----------------
 &quot;$user&quot;, public
(1 row)

postgres=# \set PROMPT1 &#039;%/%R%x%..%S..# &#039;
postgres=..&quot;$user&quot;, public..# set search_path = &#039;xxxxxxx&#039;;
SET
postgres=..xxxxxxx..# 
</pre></div>


<p>Nice. You can find all the other prompting options in the documentation of <a href="https://www.postgresql.org/docs/current/app-psql.html" target="_blank" rel="noreferrer noopener">psql</a>, the commit is <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b3ce55f413cdf70b1bc4724052fb4eacf9de239a" target="_blank" rel="noreferrer noopener">here</a>.</p>



<p>The second improvement is about logging the time of automatic analyze. Before PostgreSQL 19 we only had <a href="https://www.postgresql.org/docs/18/runtime-config-logging.html#GUC-LOG-AUTOVACUUM-MIN-DURATION" target="_blank" rel="noreferrer noopener">log_autovacuum_min_duration</a>. This logs all autovacuum actions that exceed the specified threshold. That of course includes automatic analyze as well, but usually it is the vacuum part that takes most of the time. The two are now separated, and there is a new parameter called &#8220;log_autoanalyze_min_duration&#8221;. You can easily test this with the following snippet:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
postgres=# create table t ( a int , b text );
CREATE TABLE
postgres=# insert into t select i, i::text from generate_series(1,1000000) i;
INSERT 0 1000000
postgres=# alter system set log_autoanalyze_min_duration = &#039;1ms&#039;;
ALTER SYSTEM
postgres=# select pg_reload_conf();
 pg_reload_conf 
----------------
 t
(1 row)

postgres=# insert into t select i, i::text from generate_series(1,1000000) i;
INSERT 0 1000000
</pre></div>


<p>Looking at the log file there is now this:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
2025-10-29 08:05:52.744 CET - 1 - 3454 -  - @ - 771LOG:  automatic analyze of table &quot;postgres.public.t&quot;
        avg read rate: 0.033 MB/s, avg write rate: 0.033 MB/s
        buffer usage: 10992 hits, 1 reads, 1 dirtied
        WAL usage: 5 records, 1 full page images, 10021 bytes, 0 buffers full
        system usage: CPU: user: 0.10 s, system: 0.01 s, elapsed: 0.23 s
</pre></div>


<p>Also nice, the commit is <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=dd3ae378301f7e84c18f7a90f183c3cd4165c0da" target="_blank" rel="noreferrer noopener">here</a>.</p>
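<p>If you want to collect these entries for reporting, the table name and elapsed time can be pulled out of the log text. The following Python sketch assumes the message layout shown in the log excerpt above; the function name <code>parse_autoanalyze</code> is hypothetical.</p>

```python
import re

# Log entry copied from the excerpt above.
LOG_ENTRY = '''2025-10-29 08:05:52.744 CET - 1 - 3454 -  - @ - 771LOG:  automatic analyze of table "postgres.public.t"
        avg read rate: 0.033 MB/s, avg write rate: 0.033 MB/s
        buffer usage: 10992 hits, 1 reads, 1 dirtied
        WAL usage: 5 records, 1 full page images, 10021 bytes, 0 buffers full
        system usage: CPU: user: 0.10 s, system: 0.01 s, elapsed: 0.23 s'''

def parse_autoanalyze(entry):
    """Return (table, elapsed_seconds), or None if the entry does not match."""
    table = re.search(r'automatic analyze of table "([^"]+)"', entry)
    elapsed = re.search(r"elapsed: ([0-9.]+) s", entry)
    if table is None or elapsed is None:
        return None
    return table.group(1), float(elapsed.group(1))

parsed = parse_autoanalyze(LOG_ENTRY)
print(parsed)
```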
<p>The article <a href="https://www.dbi-services.com/blog/postgresql-19-two-nice-little-improvements-log_autoanalyze_min_duration-and-search_path-in-the-psql-prompt/">PostgreSQL 19: Two nice little improvements: log_autoanalyze_min_duration and search_path in the psql prompt</a> first appeared on <a href="https://www.dbi-services.com/blog">dbi Blog</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.dbi-services.com/blog/postgresql-19-two-nice-little-improvements-log_autoanalyze_min_duration-and-search_path-in-the-psql-prompt/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>RAG Series – Agentic RAG</title>
		<link>https://www.dbi-services.com/blog/rag-series-agentic-rag/</link>
					<comments>https://www.dbi-services.com/blog/rag-series-agentic-rag/#respond</comments>
		
		<dc:creator><![CDATA[Adrien Obernesser]]></dc:creator>
		<pubDate>Sun, 26 Oct 2025 20:29:09 +0000</pubDate>
				<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[AI/LLM]]></category>
		<category><![CDATA[pgvector]]></category>
		<guid isPermaLink="false">https://www.dbi-services.com/blog/?p=41108</guid>

					<description><![CDATA[<p>Introduction In earlier parts, we moved from Naive RAG (vector search) to Hybrid RAG (dense + sparse) to Adaptive RAG (query classification and dynamic weighting). Each step improved what we retrieve. Agentic RAG goes further: the LLM decides if and when to retrieve at all and can take multiple steps (retrieve → inspect → refine [&#8230;]</p>
<p>The article <a href="https://www.dbi-services.com/blog/rag-series-agentic-rag/">RAG Series – Agentic RAG</a> first appeared on <a href="https://www.dbi-services.com/blog">dbi Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading" id="h-introduction">Introduction</h2>



<p>In earlier parts, we moved from <strong>Naive RAG</strong> (vector search) to <strong>Hybrid RAG</strong> (dense + sparse) to <strong>Adaptive RAG</strong> (query classification and dynamic weighting). Each step improved <em>what</em> we retrieve. <strong>Agentic RAG</strong> goes further: the LLM decides <em>if</em> and <em>when</em> to retrieve at all and can take multiple steps (retrieve → inspect → refine → retrieve) before answering. Retrieval stops being a fixed stage and becomes a tool the model invokes when and how it needs to. This post explains the fundamental principles of agentic RAG from a DBA perspective, a foundation on which you can build your own business logic and governance rules.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>RAG Type</th><th>Decision Logic</th><th>Flexibility</th><th>Typical Latency</th><th>Best For</th></tr></thead><tbody><tr><td>Naive</td><td>None</td><td>Fixed</td><td>~0.5 s</td><td>Simple FAQ</td></tr><tr><td>Hybrid</td><td>Static weights</td><td>Moderate</td><td>~0.6 s</td><td>Mixed queries</td></tr><tr><td>Adaptive</td><td>Query classifier</td><td>Dynamic</td><td>~0.7 s</td><td>Varied, predictable query types</td></tr><tr><td><strong>Agentic</strong></td><td><strong>LLM agent (tool use)</strong></td><td><strong>Fully dynamic</strong></td><td><strong>~2.0–2.5 s</strong></td><td>Complex, exploratory, multi-hop</td></tr></tbody></table></figure>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-when-one-retrieval-isn-t-enough">When One Retrieval Isn’t Enough</h2>



<p><strong>Example:</strong> <em>“Compare PostgreSQL and MySQL indexing approaches.”</em></p>



<p><strong>Traditional (single-pass) RAG</strong></p>



<ul class="wp-block-list">
<li>Retrieves mixed docs about both systems</li>



<li>LLM synthesizes from noisy context</li>



<li>Often misses nuanced differences or secondary linked subjects.</li>
</ul>



<p><strong>Agentic RAG</strong></p>



<ol class="wp-block-list">
<li>Agent searches “PostgreSQL indexing mechanisms”</li>



<li>Reads snippets; detects a gap for MySQL</li>



<li>Searches “MySQL indexing mechanisms”</li>



<li>Synthesizes a side-by-side comparison from focused contexts</li>
</ol>



<p>This loop generalizes: the agent decomposes questions, detects missing context, and invokes tools until it has enough evidence to answer confidently.</p>
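

<p>The loop can be sketched in a few lines of Python. Everything here is a toy stand-in: the in-memory <code>CORPUS</code> replaces the database search tool, and the rule-based <code>missing_topics</code> function replaces the LLM&#8217;s judgment about gaps. None of these names come from the repository discussed later in this post.</p>

```python
# Hypothetical two-document corpus standing in for a real search tool.
CORPUS = {
    "postgresql": "PostgreSQL offers B-tree, GIN, GiST, BRIN and hash indexes.",
    "mysql": "MySQL InnoDB uses a clustered B-tree primary key and secondary indexes.",
}

def search_tool(query):
    """Toy retrieval: return snippets whose topic key occurs in the query."""
    lowered = query.lower()
    return [text for topic, text in CORPUS.items() if topic in lowered]

def missing_topics(question, evidence):
    """Rule-based stand-in for the LLM detecting gaps in the evidence."""
    collected = " ".join(evidence).lower()
    asked = question.lower()
    return [topic for topic in CORPUS if topic in asked and topic not in collected]

def agentic_retrieve(question, max_steps=4):
    """Retrieve step by step until no gap remains or the step budget is used."""
    evidence, trace = [], []
    for _ in range(max_steps):
        gaps = missing_topics(question, evidence)
        if not gaps:
            break  # enough evidence collected, stop calling the tool
        query = f"{gaps[0]} indexing mechanisms"
        trace.append(query)
        evidence.extend(search_tool(query))
    return trace, evidence

trace, evidence = agentic_retrieve("Compare PostgreSQL and MySQL indexing approaches.")
print(trace)
```

<p>The <code>max_steps</code> budget is the important design choice: it bounds latency and cost even when the gap detection never reports that it is satisfied.</p>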



<p><em>Generalization is an agent&#8217;s ability to take on new tasks, or variations of existing ones, by reusing and combining what it already knows rather than memorizing patterns. This is very useful for handling variations in inputs and lets an agent adapt faster with fewer codified examples, but it also comes with new limitations. There are different ways to generalize, and you need to measure this capability to detect when it fails. So far we have covered monitoring at the ranking level of retrieval. Reliably solving new tasks or coping with shifts in the input is a separate topic that I am not going to expand on here; one way to measure it would be to implement failure detection wired to the agent logs.</em></p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-architecture-overview">Architecture Overview</h2>



<p>Agentic RAG inserts a <strong>decision loop</strong> between query and retrieval; the database is explicitly a <strong>tool</strong>.<br></p>



<figure class="wp-block-image size-full"><img loading="lazy" decoding="async" width="1352" height="605" src="https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/image-12.png" alt="" class="wp-image-41232" srcset="https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/image-12.png 1352w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/image-12-300x134.png 300w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/image-12-1024x458.png 1024w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/image-12-768x344.png 768w" sizes="auto, (max-width: 1352px) 100vw, 1352px" /></figure>



<h2 class="wp-block-heading" id="h-about-this-implementation">About This Implementation</h2>



<p>The code examples in this post are based on the <strong>complete, production-ready implementation</strong> available in the <a href="https://github.com/boutaga/pgvector_RAG_search_lab">pgvector_RAG_search_lab</a> repository.<br><strong>What&#8217;s included</strong>:</p>



<ul class="wp-block-list">
<li>✅ Agentic search engine</li>



<li>✅ Modern OpenAI tools API</li>



<li>✅ Interactive demo and CLI</li>



<li>✅ FastAPI integration</li>



<li>✅ n8n workflow template</li>
</ul>



<p><strong>You don&#8217;t need to build from scratch</strong>: the implementation is ready to use. The post explains the concepts and design decisions behind the working code.</p>



<h3 class="wp-block-heading" id="h-picking-an-orchestration-style">Picking an orchestration style</h3>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Approach</th><th>Best For</th><th>Complexity</th><th>Control</th></tr></thead><tbody><tr><td><strong>LangGraph</strong></td><td>Production agent graphs &amp; branches</td><td>Medium</td><td>Medium</td></tr><tr><td><strong>n8n</strong></td><td>Low-code demos / single-decision flows</td><td>Low</td><td>Low</td></tr><tr><td><strong>Custom Python</strong> (ours)</td><td>Full transparency &amp; tight DB integration</td><td>Medium</td><td>High</td></tr></tbody></table></figure>



<p>We’ll keep the loop in <strong>custom Python</strong> (testable, observable), while remaining compatible with LangGraph or n8n if you want to wrap it later.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-implementing-the-agent-loop-python">Implementing the Agent Loop (Python)</h2>



<p>Below are compact, production-minded snippets using the <strong>new client style</strong> and <strong>gpt-5 / gpt-5-mini</strong>. We use <strong>gpt-5-mini</strong> for the <em>decision</em> step (cheap/fast) and <strong>gpt-5</strong> for the <em>final synthesis</em> (quality). You can also run everything on <code>gpt-5-mini</code> if cost/latency is critical.</p>



<h3 class="wp-block-heading" id="h-1-expose-the-retrieval-tool">1) Expose the retrieval tool</h3>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
# lab/search/agentic_search.py
from typing import List
from dataclasses import dataclass

@dataclass
class SearchResult:
    content: str
    metadata: dict

class VectorSearchService:
    def __init__(self, pg_conn):
        self.pg = pg_conn  # psycopg / asyncpg / SQLAlchemy
    def search(self, query: str, top_k: int = 5) -&gt; List&#x5B;SearchResult]:
        # Implement hybrid/dense search as in prior posts; this is the dense-only core.
        # SELECT title, text FROM wiki ORDER BY embedding &lt;-&gt; embed($1) LIMIT $2;
        ...

def format_snippets(results: List&#x5B;SearchResult]) -&gt; str:
    lines = &#x5B;]
    for i, r in enumerate(results, 1):
        title = r.metadata.get(&quot;title&quot;, &quot;Untitled&quot;)
        snippet = (r.content or &quot;&quot;).replace(&quot;\n&quot;, &quot; &quot;)&#x5B;:220]
        lines.append(f&quot;&#x5B;{i}] {title}: {snippet}...&quot;)
    return &quot;\n&quot;.join(lines)

def search_wikipedia(vector_service: VectorSearchService, query: str, top_k: int = 5) -&gt; str:
    results = vector_service.search(query, top_k=top_k)
    return format_snippets(results)
</pre></div>
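<p>To see what the model actually receives from the tool, here is a minimal, self-contained sketch of the snippet formatting step. The two rows are hypothetical stand-ins for what <code>VectorSearchService.search()</code> would return, and the <code>max_chars</code> parameter is an illustrative addition (the block above hard-codes 220).</p>

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SearchResult:
    content: str
    metadata: dict = field(default_factory=dict)


def format_snippets(results: List[SearchResult], max_chars: int = 220) -> str:
    """Render results as numbered one-line snippets, mirroring the block above."""
    lines = []
    for i, r in enumerate(results, 1):
        title = r.metadata.get("title", "Untitled")
        # Flatten newlines so each result stays on a single line, then truncate
        snippet = (r.content or "").replace("\n", " ")[:max_chars]
        lines.append(f"[{i}] {title}: {snippet}...")
    return "\n".join(lines)


# Hypothetical rows standing in for real search results
fake_results = [
    SearchResult("B-tree is the default index type.\nIt supports range scans.",
                 {"title": "PostgreSQL"}),
    SearchResult("InnoDB uses a clustered index on the primary key.", {}),
]
formatted = format_snippets(fake_results)
print(formatted)
```

<p>Keeping each snippet short and numbered keeps the tool output cheap in tokens and easy to trace in logs.</p>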


<h3 class="wp-block-heading" id="h-2-register-tool-schema-for-the-llm">2) Register tool schema for the LLM</h3>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
search_tool = {
  &quot;type&quot;: &quot;function&quot;,
  &quot;function&quot;: {
    &quot;name&quot;: &quot;search_wikipedia&quot;,
    &quot;description&quot;: &quot;Retrieve relevant Wikipedia snippets from Postgres+pgvector.&quot;,
    &quot;parameters&quot;: {
      &quot;type&quot;: &quot;object&quot;,
      &quot;properties&quot;: {
        &quot;query&quot;: {&quot;type&quot;: &quot;string&quot;},
        &quot;top_k&quot;: {&quot;type&quot;: &quot;integer&quot;, &quot;default&quot;: 5}
      },
      &quot;required&quot;: &#x5B;&quot;query&quot;]
    }
  }
}
</pre></div>
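<p>The schema only describes the arguments; the model can still return malformed JSON or an unreasonable <code>top_k</code>. A small validation helper, sketched below, is one way to guard the database. The helper name <code>parse_search_args</code> and the <code>MAX_TOP_K</code> ceiling are assumptions for illustration, not part of the repository.</p>

```python
import json

DEFAULT_TOP_K = 5
MAX_TOP_K = 20  # assumed ceiling for illustration; tune to your workload


def parse_search_args(raw_arguments, fallback_query):
    """Turn the JSON string of a tool call into safe (query, top_k) values."""
    try:
        args = json.loads(raw_arguments or "{}")
    except json.JSONDecodeError:
        args = {}
    if not isinstance(args, dict):
        args = {}

    # Fall back to the user's question when the model sends no usable query
    query = str(args.get("query") or "").strip() or fallback_query

    try:
        top_k = int(args.get("top_k", DEFAULT_TOP_K))
    except (TypeError, ValueError):
        top_k = DEFAULT_TOP_K
    top_k = max(1, min(top_k, MAX_TOP_K))  # clamp to a sane range
    return query, top_k


print(parse_search_args('{"query": "MySQL indexing", "top_k": 50}', "original question"))
```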


<h3 class="wp-block-heading" id="h-3-prompts-concise-outcome-oriented">3) Prompts (concise, outcome-oriented)</h3>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
SYSTEM_PROMPT = &quot;&quot;&quot;
You are an expert assistant with access to a Wikipedia database via the tool `search_wikipedia`.
Decide if retrieval is needed before answering. If you are unsure, retrieve first.
If context is insufficient after the first retrieval, you may request one additional retrieval.
Base answers strictly on provided snippets; otherwise reply &quot;Unknown&quot;.
Conclude with a one-line decision note: `Decision: used search` or `Decision: skipped search`.
&quot;&quot;&quot;
</pre></div>
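<p>Because the prompt asks for a closing decision note, that note can be parsed back out of the answer for logging. The sketch below assumes the model follows the requested wording; the function name <code>extract_decision</code> is hypothetical.</p>

```python
import re
from typing import Optional

# Matches the note requested by the system prompt, with or without backticks
DECISION_RE = re.compile(r"Decision:\s*(used|skipped) search", re.IGNORECASE)


def extract_decision(answer: str) -> Optional[str]:
    """Return 'used_search', 'skipped_search', or None when the note is missing."""
    matches = DECISION_RE.findall(answer or "")
    if not matches:
        return None
    # Take the last occurrence, since the note is supposed to conclude the answer
    return f"{matches[-1].lower()}_search"


print(extract_decision("MVCC keeps several row versions.\nDecision: used search"))
```

<p>A missing note (<code>None</code>) is worth counting separately: it usually means the model ignored part of the prompt.</p>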


<h3 class="wp-block-heading" id="h-4-the-agent-loop-with-gpt-5-mini-decision-gpt-5-final">4) The agent loop with <strong>gpt-5-mini</strong> (decision) + <strong>gpt-5</strong> (final)</h3>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import json, time
from openai import OpenAI
client = OpenAI()

def agentic_answer(pg_conn, user_question: str, max_retries: int = 1):
    vs = VectorSearchService(pg_conn)

    messages = &#x5B;
        {&quot;role&quot;: &quot;system&quot;, &quot;content&quot;: SYSTEM_PROMPT},
        {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: user_question},
    ]

    # Phase 1: Decision / planning (cheap &amp; fast)
    start = time.time()
    decision = client.chat.completions.create(
        model=&quot;gpt-5-mini&quot;,
        messages=messages,
        tools=&#x5B;search_tool],
        tool_choice=&quot;auto&quot;,  # let the model decide
        temperature=0.2,
    )
    msg = decision.choices&#x5B;0].message
    tool_used = False
    loops = 0

    # Handle tool calls: the first retrieval plus at most max_retries extra rounds
    while msg.tool_calls and loops &lt;= max_retries:
        tool_used = True
        messages.append(msg)  # assistant turn that requested the tool call(s)
        got_results = False

        # The API expects one tool message per tool call, linked by tool_call_id
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments or &quot;{}&quot;)
            query = args.get(&quot;query&quot;) or user_question
            top_k = int(args.get(&quot;top_k&quot;) or 5)
            snippets = search_wikipedia(vs, query=query, top_k=top_k)
            got_results = got_results or bool(snippets.strip())
            messages.append({
                &quot;role&quot;: &quot;tool&quot;,
                &quot;tool_call_id&quot;: call.id,
                &quot;content&quot;: snippets if snippets.strip() else &quot;NO_RESULTS&quot;,
            })

        if not got_results:
            break  # no-results safeguard: stop instead of looping

        # Optionally allow one more decision round on mini
        decision = client.chat.completions.create(
            model=&quot;gpt-5-mini&quot;,
            messages=messages,
            tools=&#x5B;search_tool],
            tool_choice=&quot;auto&quot;,
            temperature=0.2,
        )
        msg = decision.choices&#x5B;0].message
        loops += 1

    # Phase 2: Final synthesis (high quality)
    final = client.chat.completions.create(
        model=&quot;gpt-5&quot;,
        messages=messages + (&#x5B;msg] if not msg.tool_calls else &#x5B;]),
        tools=&#x5B;search_tool],  # tool_choice is only accepted when tools are declared
        tool_choice=&quot;none&quot;,
        temperature=0.3,  # drop this if the chosen model only accepts its default temperature
    )
    answer = final.choices&#x5B;0].message.content or &quot;Unknown&quot;
    total_ms = int((time.time() - start) * 1000)

    return {
        &quot;answer&quot;: answer,
        &quot;tool_used&quot;: tool_used,
        &quot;loops&quot;: loops,
        &quot;latency_ms&quot;: total_ms,
    }
</pre></div>


<p>The two-phase strategy (gpt-5-mini for decisions, gpt-5 for synthesis) is an <strong>optional optimization</strong> for high-volume production use. The repository implementation uses a single configurable model (default: gpt-5-mini), which works well for most use cases. Implement the two-phase approach only if you need to optimize the cost/quality balance.</p>



<p><strong>Decision logic</strong></p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>LLM Output</th><th>Action</th><th>Next Step</th></tr></thead><tbody><tr><td>Direct answer</td><td>Return</td><td>Done</td></tr><tr><td>Tool call</td><td>Execute search</td><td>Feed snippets back; allow one re-decision</td></tr><tr><td>Tool call + no results</td><td>Log low confidence</td><td>Stop (avoid loops) → “Unknown”</td></tr></tbody></table></figure>



<p><strong>Guards</strong></p>



<ul class="wp-block-list">
<li>Max loop (<code>max_retries</code>) to prevent infinite cycles</li>



<li>Empty-results check</li>



<li>Low temperature for consistent decisions</li>



<li>Optional <strong>rate limiting</strong> (e.g., sleep/backoff) if your OpenAI or DB tier needs it</li>
</ul>
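<p>The rate-limiting guard can be as simple as a retry wrapper with exponential backoff. The helper below is a generic sketch, not code from the repository: <code>call_with_backoff</code> and its parameters are hypothetical, and in practice you would narrow <code>retry_on</code> to the error your client raises when throttled (for the OpenAI SDK, likely <code>openai.RateLimitError</code>).</p>

```python
import random
import time


def call_with_backoff(fn, max_attempts=4, base_delay=0.5,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            delay = base_delay * (2 ** attempt)
            # Up to 10% jitter so that concurrent clients do not retry in lockstep
            sleep(delay + random.uniform(0, delay / 10))


# Usage with a stand-in function that fails twice before succeeding
attempts = {"n": 0}

def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("simulated throttling")
    return "ok"

print(call_with_backoff(flaky_call, base_delay=0.01))
```

<p>Wrapping both the LLM call and the database query this way keeps the loop bounded: retries are capped per call, and <code>max_retries</code> still caps the number of retrieval rounds.</p>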



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-evaluating-agentic-decisions">Evaluating Agentic Decisions</h2>



<p>Agentic RAG adds a new layer to measure: <strong>decision quality</strong>. Keep the retrieval metrics (precision@k, nDCG), but <strong>add</strong> decision metrics and overhead.</p>



<h3 class="wp-block-heading" id="h-1-decision-accuracy-4-outcomes">1) Decision Accuracy (4 outcomes)</h3>



<ul class="wp-block-list">
<li><strong>TP</strong> (True Positive): Agent retrieved when external info was needed ✓</li>



<li><strong>FP</strong> (False Positive): Agent retrieved unnecessarily (latency/cost waste)</li>



<li><strong>FN</strong> (False Negative): Agent skipped retrieval but should have (hallucination risk)</li>



<li><strong>TN</strong> (True Negative): Agent skipped retrieval appropriately ✓</li>



<li><strong>N</strong>: total questions evaluated</li>
</ul>



<p><strong>Accuracy</strong> = (TP + TN) / N</p>



<p><strong>Interpretation</strong>:</p>



<ul class="wp-block-list">
<li>How often did the agent make the right call about retrieval?</li>



<li>“Right call” means either:
<ul class="wp-block-list">
<li>it retrieved when it should (TP), or</li>



<li>it skipped when it could safely skip (TN).</li>
</ul>
</li>
</ul>



<p><strong>Example:</strong></p>



<ul class="wp-block-list">
<li>TP = 40</li>



<li>TN = 50</li>



<li>FP = 5</li>



<li>FN = 5</li>



<li>N = 100</li>
</ul>



<p><strong>Accuracy </strong>= (40 + 50) / 100 = 90%</p>



<p>That means: in 90% of cases, the agent made the correct decision about using retrieval.</p>



<p>Note: This doesn’t judge how <em>good</em> the final answer is — that’s a separate metric. This only measures the <em>decision to retrieve or not retrieve</em>.</p>
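<p>The arithmetic above is easy to script. The sketch below reproduces the 90% example; precision and recall of the retrieval decision are my own additions here, included because FP and FN have different costs (wasted latency versus hallucination risk) that accuracy alone hides.</p>

```python
def decision_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy of the retrieve/skip decision, plus precision and recall as extras."""
    n = tp + fp + fn + tn
    if n == 0:
        raise ValueError("need at least one evaluated question")
    return {
        "accuracy": (tp + tn) / n,
        # Of the retrievals made, how many were needed?
        "precision": tp / (tp + fp) if (tp + fp) else None,
        # Of the retrievals needed, how many were made?
        "recall": tp / (tp + fn) if (tp + fn) else None,
    }


metrics = decision_metrics(tp=40, fp=5, fn=5, tn=50)
print(metrics["accuracy"])
```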



<h3 class="wp-block-heading" id="h-2-tool-usage-rate">2) Tool Usage Rate</h3>



<p>The percentage of queries that triggered retrieval.<br>Too low suggests an overconfident model; too high suggests an overly cautious and costly one. Track it per domain and query type.</p>
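<p>Tracking the rate per query type can be sketched in a few lines. The query-type labels and the sample data below are made up for illustration; in practice they would come from your own classification of the logged queries.</p>

```python
from collections import defaultdict


def tool_usage_by_type(records):
    """records: iterable of (query_type, tool_used) pairs -> {query_type: usage rate}."""
    totals = defaultdict(int)
    used = defaultdict(int)
    for query_type, tool_used in records:
        totals[query_type] += 1
        used[query_type] += int(bool(tool_used))
    return {qt: used[qt] / totals[qt] for qt in totals}


# Hypothetical log extract
sample = [
    ("factual", True), ("factual", True), ("factual", False),
    ("chitchat", False), ("chitchat", False),
]
rates = tool_usage_by_type(sample)
print(rates)
```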



<h3 class="wp-block-heading" id="h-3-latency-cost-impact">3) Latency/Cost Impact</h3>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Scenario</th><th>LLM Calls</th><th>DB Queries</th><th>Avg Latency</th></tr></thead><tbody><tr><td>No retrieval</td><td>1</td><td>0</td><td>~0.5 s</td></tr><tr><td>Single retrieval</td><td>2</td><td>1</td><td>~2.1 s</td></tr><tr><td>Double retrieval</td><td>3</td><td>2</td><td>~3.8 s</td></tr></tbody></table></figure>



<p>Report both <strong>p50</strong> and <strong>p95</strong> to capture long-tail tool loops.</p>
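<p>A minimal way to compute both values is the nearest-rank percentile, sketched below. This is one of several percentile definitions (interpolating variants exist), and the latency values are invented for the example.</p>

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the value at 1-based position ceil(pct/100 * n)."""
    if not values:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Invented latencies (ms): a mix of skipped, single and double retrievals
latencies_ms = [480, 510, 520, 2050, 2100, 2150, 2200, 2300, 3700, 3900]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(p50, p95)
```

<p>With a mix like this, the median looks like a single retrieval while the p95 exposes the double-retrieval tail, which is why reporting only an average hides the cost of tool loops.</p>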



<h3 class="wp-block-heading" id="h-4-answer-quality-agentic-vs-adaptive">4) Answer Quality (Agentic vs. Adaptive)</h3>



<p>Run the <strong>same query set</strong> through <em>Adaptive</em> and <em>Agentic</em>:</p>



<ul class="wp-block-list">
<li>Compare precision@k/nDCG of retrieved sets</li>



<li>Human-rate final answers for factuality and completeness</li>



<li>Track “Unknown” rate (good: avoids hallucination; too high: under-retrieval)</li>
</ul>



<p><strong>Tip:</strong> log a one-liner in every response:<br><code>decision=used_search|skipped_search loops=0|1 latency_ms=...</code></p>
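<p>As a sketch, the log line can be built directly from the dictionary returned by <code>agentic_answer()</code> earlier in this post. The function name <code>decision_log_line</code> is hypothetical.</p>

```python
def decision_log_line(result: dict) -> str:
    """Build 'decision=... loops=... latency_ms=...' from an agentic_answer() result."""
    decision = "used_search" if result.get("tool_used") else "skipped_search"
    loops = int(result.get("loops", 0))
    latency_ms = int(result.get("latency_ms", 0))
    return f"decision={decision} loops={loops} latency_ms={latency_ms}"


example = {"answer": "...", "tool_used": True, "loops": 1, "latency_ms": 2134}
print(decision_log_line(example))
```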



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-when-agentic-rag-may-not-be-worth-it">When Agentic RAG May Not Be Worth It</h2>



<p>Prefer <strong>Adaptive</strong> if:</p>



<ul class="wp-block-list">
<li>p95 latency must be &lt;1s (agentic adds ~1-2s when searching)</li>



<li>Requests are predictable and schema-bound</li>



<li>Compliance needs strong deterministic behavior</li>



<li>Token cost is primary constraint</li>
</ul>



<p>Choose <strong>Agentic</strong> when:</p>



<ul class="wp-block-list">
<li>Queries are exploratory / multi-hop</li>



<li>Multiple tools or sources are available</li>



<li>Context quality &gt; raw speed</li>



<li>You’re building research assistants, not simple FAQ bots</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-repository-integration-pgvector-rag-search-lab">Repository Integration (pgvector_RAG_search_lab)</h2>



<p><strong>Proposed tree</strong></p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
lab/
├── search/
│   ├── ...                         # other search scripts
│   ├── agentic_search.py           # Main agentic engine 
│   └── examples/
│       └── agentic_demo.py         # Interactive demo 
├── core/
│   └── generation.py               # Includes generate_with_tools() for function calling
├── api/
│   └── fastapi_server.py           # REST endpoint: POST /search with method=&quot;agentic&quot;
├── workflows/
│   └── agentic_rag_workflow.json   # n8n visual workflow
└── evaluation/
    └── metrics.py                  # nDCG and ranking metrics (not agentic-specific)
</pre></div>


<p><strong>FastAPI endpoint (sketch)</strong></p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
# lab/search/api_agentic.py
from fastapi import APIRouter
from .agentic_search import agentic_answer
router = APIRouter()

@router.post(&quot;/agent_search&quot;)
def agent_search(payload: dict):
    q = payload.get(&quot;query&quot;, &quot;&quot;)
    result = agentic_answer(pg_conn=..., user_question=q)
    return result
</pre></div>


<p><strong>n8n (optional)</strong></p>



<ul class="wp-block-list">
<li>Manual trigger → HTTP Request <code>/agent_search</code> → Markdown render</li>



<li>Show 🔄 if <code>tool_used: true</code> and add latency badge</li>
</ul>



<p><strong>LangGraph (optional)</strong></p>



<ul class="wp-block-list">
<li>Map our tool into a graph node; Agent node → Tool node → Agent node</li>



<li>Useful if you later add web search / SQL tools in parallel branches</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-prompt-tips-small-changes-big-impact">Prompt Tips (small changes, big impact)</h2>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Goal</th><th>Prompt Additions</th></tr></thead><tbody><tr><td>Fewer false positives</td><td>“Only search when facts are needed; avoid unnecessary retrieval.”</td></tr><tr><td>Fewer false negatives</td><td>“Never guess facts; reply ‘Unknown’ if not in snippets.”</td></tr><tr><td>Lower latency</td><td>“Limit to at most one additional retrieval if context is missing.”</td></tr><tr><td>Better observability</td><td>“End with: <code>Decision: used search</code> or <code>Decision: skipped search</code>.”</td></tr></tbody></table></figure>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-quickstart">Quickstart</h2>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
git clone https://github.com/boutaga/pgvector_RAG_search_lab
cd pgvector_RAG_search_lab

# Export your API key
export OPENAI_API_KEY=...

# Interactive demo (recommended first try)
python lab/search/examples/agentic_demo.py

# Command-line single query
python lab/search/agentic_search.py \
  --source wikipedia \
  --query &quot;Compare PostgreSQL and MySQL indexing&quot; \
  --show-decision \
  --show-sources

# Interactive mode
python lab/search/agentic_search.py --source wikipedia --interactive

# Start API server
python lab/api/fastapi_server.py

# Call API endpoint
curl -X POST http://localhost:8000/search \
  -H &quot;Content-Type: application/json&quot; \
  -d &#039;{
    &quot;query&quot;: &quot;How does PostgreSQL MVCC work?&quot;,
    &quot;method&quot;: &quot;agentic&quot;,
    &quot;source&quot;: &quot;wikipedia&quot;,
    &quot;top_k&quot;: 5,
    &quot;generate_answer&quot;: true
  }&#039;
</pre></div>





<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-appendix-a-minimal-all-mini-variant-cheapest-path">Appendix A — Minimal “all-mini” variant (cheapest path)</h2>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
# Use gpt-5-mini for both phases
resp = client.chat.completions.create(
    model=&quot;gpt-5-mini&quot;,
    messages=messages,
    tools=&#x5B;search_tool],
    tool_choice=&quot;auto&quot;,
    temperature=0.2,
)
# ... identical loop, then final:
final = client.chat.completions.create(
    model=&quot;gpt-5-mini&quot;,
    messages=messages + (&#x5B;msg] if not msg.tool_calls else &#x5B;]),
    tools=&#x5B;search_tool],  # tool_choice is only accepted when tools are declared
    tool_choice=&quot;none&quot;,
    temperature=0.3,
)
</pre></div>





<h2 class="wp-block-heading" id="h-appendix-b-simple-metrics-logger">Appendix B — Simple metrics logger</h2>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
# lab/search/metrics.py
import math
from dataclasses import dataclass

@dataclass
class DecisionLog:
    query: str
    tool_used: bool
    loops: int
    latency_ms: int
    label_retrieval_required: bool | None = None  # optional gold label
    hallucinated: bool | None = None             # set via eval

def summarize(logs: list&#x5B;DecisionLog]):
    n = len(logs)
    if n == 0:
        raise ValueError(&quot;summarize() needs at least one log entry&quot;)
    usage = sum(1 for x in logs if x.tool_used) / n
    # Nearest-rank p95: the value at 1-based position ceil(0.95 * n)
    p95 = sorted(x.latency_ms for x in logs)&#x5B;math.ceil(95 * n / 100) - 1]
    # If labels present, compute TP/FP/FN/TN
    labeled = &#x5B;x for x in logs if x.label_retrieval_required is not None]
    if labeled:
        tp = sum(1 for x in labeled if x.tool_used and x.label_retrieval_required)
        tn = sum(1 for x in labeled if (not x.tool_used) and (not x.label_retrieval_required))
        acc = (tp + tn) / len(labeled)
    else:
        acc = None
    return {&quot;tool_usage_rate&quot;: usage, &quot;p95_latency_ms&quot;: p95, &quot;decision_accuracy&quot;: acc}
</pre></div>


<p>This is a simplified example for illustration. The repository currently tracks decision<br>metadata within the response object (<code>decision</code>, <code>tool_used</code>, <code>search_count</code>, <code>cost</code>). You can<br>implement this <code>DecisionLog</code> class separately if you need persistent decision analytics.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-conclusion">Conclusion</h2>



<p>Agentic RAG makes retrieval <strong>intentional</strong>. By letting the model decide <em>if</em> and <em>when</em> to search—and by limiting the loop to one safe retry—you gain better answers on complex queries with measured, predictable overhead. </p>



<p><strong>Takeaways</strong></p>



<ul class="wp-block-list">
<li>Expect roughly 2–4 s of latency when the tool is used (see the latency table above); near-naive latency when skipped</li>



<li>Measure <strong>decision accuracy</strong> and <strong>tool usage rate</strong> alongside precision/nDCG</li>



<li>Start with one tool (wiki search) and one retry; expand only if metrics justify it</li>



<li>Use smaller models for decisions and bigger ones for synthesis to balance quality/cost</li>
</ul>



<p>In this post we focused only on retrieval quality: teaching the agent when to call the vector store and when to skip it, and giving it a controlled retrieval loop. That is the foundation. The next step is to extend the loop with governance and compliance checks (who is asking, what data can they see, should this query even be answered), and only then layer domain-specific business logic on top. That’s how an agentic workflow evolves from “smart retrieval” into something you can actually trust in production.</p>



<p>The post <a href="https://www.dbi-services.com/blog/rag-series-agentic-rag/">RAG Series – Agentic RAG</a> first appeared on <a href="https://www.dbi-services.com/blog">dbi Blog</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.dbi-services.com/blog/rag-series-agentic-rag/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>pgconf.eu 2025 &#8211; RECAP</title>
		<link>https://www.dbi-services.com/blog/pgconf-eu-2025-recap/</link>
					<comments>https://www.dbi-services.com/blog/pgconf-eu-2025-recap/#respond</comments>
		
		<dc:creator><![CDATA[Adrien Obernesser]]></dc:creator>
		<pubDate>Sun, 26 Oct 2025 18:30:40 +0000</pubDate>
				<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[pgconfeu]]></category>
		<guid isPermaLink="false">https://www.dbi-services.com/blog/?p=41244</guid>

					<description><![CDATA[<p>I was fortunate to attend pgconf.eu 2025. This year&#8217;s event took place in Riga and once again brought together key members of the community, contributors, committers, sponsors and users from across the world. I would summarize this year&#8217;s event with three main topics: AI/LLM, PG18 and Monitoring. AI/LLMs [&#8230;]</p>
<p>The post <a href="https://www.dbi-services.com/blog/pgconf-eu-2025-recap/">pgconf.eu 2025 &#8211; RECAP</a> first appeared on <a href="https://www.dbi-services.com/blog">dbi Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-group is-vertical is-content-justification-center is-layout-flex wp-container-core-group-is-layout-4b2eccd6 wp-block-group-is-layout-flex">
<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" width="1024" height="1024" src="https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/95801-1-1024x1024.jpg" alt="" class="wp-image-41250" style="width:575px;height:auto" srcset="https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/95801-1-1024x1024.jpg 1024w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/95801-1-300x300.jpg 300w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/95801-1-150x150.jpg 150w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/95801-1-768x768.jpg 768w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/95801-1-1536x1536.jpg 1536w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/95801-1.jpg 1716w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>I was fortunate to attend <a href="https://2025.pgconf.eu/">pgconf.eu 2025</a>. <br>This year&#8217;s event took place in Riga and once again brought together key members of the community, contributors, committers, sponsors and users from across the world. <br>I would summarize this year&#8217;s event with three main topics: AI/LLM, PG18 and Monitoring. </p>



<p></p>



<h2 class="wp-block-heading" id="h-ai-llms">AI/LLMs</h2>



<p><br>Compared to last year, the formula changed a bit for the Community events day on Tuesday, where for the first time different &#8220;Summits&#8221; were organized. Full details on the event, the schedule and the presentation slides of each talk are available here: <a href="https://www.postgresql.eu/events/pgconfeu2025/schedule/">Schedule — PostgreSQL Conference Europe 2025</a><br>I had the chance to be chosen as a speaker for the AI Summit, which was quite interesting for me. In total there were 13 short talks (10 min each) on various topics related to PostgreSQL and AI/LLMs. It was dense, with a lot of interesting implementation ideas; you can find the details and slides here: <a href="https://wiki.postgresql.org/wiki/PGConf.EU_2025_PostgreSQL_AI_Summit">PGConf.EU 2025 PostgreSQL AI Summit &#8211; PostgreSQL wiki</a>. AI/LLMs are the hot topic of the moment, and naturally they came up often during this event, both in the talks and in the discussions. You can find the PDF of my presentation <a href="https://drive.google.com/file/d/1qrE-42-CFVvIJ4PZ1dvkTuAVPWUkNSwF/view?pli=1">here</a>. I presented a business case: a self-service BI agentic RAG that finds the relevant fields for a target KPI and creates data marts as output. Because the talks were short, there was time for a debate between the audience and the speakers at the end. The discussion, nicely moderated by the organizers, was interesting because it exposed the same strong opinions people have in general about AI/LLMs: a blend of distrust and an incomplete understanding of what it is about or how it could help organizations. This, in itself, shows that the PostgreSQL community has the same difficulty in distinguishing technical challenges from organizational and human ones. 
My view is that the challenges are not technical (those are almost irrelevant to most of the arguments) but are rather about human relations and about understanding what value a DBA, for example, brings to the organization. To me, installing and configuring PostgreSQL has no benefit in terms of personal growth, so automating it is quite natural, and adding AI/LLMs on top is &#8220;nice to have&#8221; but not fundamentally different from an Ansible playbook. For a junior DBA, however, this is an additional abstraction that can be dangerous, because it provides tools whose consequences users cannot fully grasp. This underlines that the main issue with integrating AI/LLM workflows is more a governance and C-level management issue than a technical one, and it can&#8217;t become the latest excuse for adding to the technical debt. <br><a href="https://www.postgresql.eu/events/pgconfeu2025/schedule/session/6932-dont-leak-user-data-to-ai-strategies-for-protecting-pii-from-llms-and-mcp/">Jay Miller</a> from Aiven explained how PII can leak to LLMs and MCP, and how to protect it. This is a really relevant topic, knowing that more and more organizations are facing issues like shadow IT. He was also quite the show host and was fun to listen to. I strongly recommend watching the recording when it is released. </p>
</div>



<h2 class="wp-block-heading has-text-align-center" id="h-pg18">PG18</h2>



<div class="wp-block-group is-vertical is-content-justification-center is-layout-flex wp-container-core-group-is-layout-4b2eccd6 wp-block-group-is-layout-flex">
<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="439" src="https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/9642-1024x439.jpg" alt="" class="wp-image-41247" srcset="https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/9642-1024x439.jpg 1024w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/9642-300x129.jpg 300w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/9642-768x329.jpg 768w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/9642-1536x659.jpg 1536w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/9642-2048x879.jpg 2048w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div>



<p>This year&#8217;s conference came just after the release of PostgreSQL 18, one of the versions that brought major improvements and that initiates changes for releases to come. I was quite enthusiastic to listen to <a href="https://www.postgresql.eu/events/pgconfeu2025/schedule/session/6952-improved-freezing-in-postgres-vacuum-from-idea-to-commit/">Melanie Plageman</a> on how she worked on the freezing improvements in this release. I have to say, when I attend an advanced internals talk, I am usually more confused afterwards than before. But here, Melanie did an amazing job of presenting a technically complex topic without losing the audience. <br><a href="https://www.postgresql.eu/events/pgconfeu2025/schedule/session/7133-what-you-should-know-about-constraints-in-postgresql-and-whats-new-in-18/">Gülçin Yıldırım Jelínek</a>, for her part, explained what&#8217;s new in PG18 for constraints such as NOT ENFORCED and NOT NULL, and how to use them. The COO of Cybertec, Raj Verma, explained during <a href="https://www.postgresql.eu/events/pgconfeu2025/schedule/session/7185-from-chaos-to-compliance-how-postgresql-makes-regulations-work-for-you/">a sponsor talk</a> why compliance matters, how to minimize the risks, and how PostgreSQL helps us to be PCI DSS, GDPR, nLPD or HIPAA compliant. <br> Another interesting talk I was happy to attend was from <a href="https://www.postgresql.eu/events/pgconfeu2025/schedule/session/7021-they-grow-up-so-fast-donating-your-open-source-project-to-a-foundation/">Floor Drees and Gabriele Bartolini</a>. They explained how they went about donating the <a href="https://cloudnative-pg.io/">CloudNativePG</a> project to the CNCF.</p>



<p> </p>



<h2 class="wp-block-heading has-text-align-center" id="h-monitoring">Monitoring</h2>



<p>This leads me to another important topic, one I wasn&#8217;t looking for but that has become something of a main subject for me over the years as a DBA interested in performance tuning. Monitoring in PostgreSQL was covered by several talks, such as the one from <a href="https://www.postgresql.eu/events/pgconfeu2025/schedule/session/7187-dbtune-ai-driven-performance-tuning-for-all-postgresql-flavors/">Luigi Nardi</a> and his idea of workload fingerprinting in the DBtune tool. Additionally, <a href="https://www.postgresql.eu/events/pgconfeu2025/schedule/session/7112-tracking-plan-shapes-over-time-with-plan-ids-and-the-new-pg_stat_plans/">Lukas Fittl</a> presented pg_stat_plans, an extension that aims to track execution plans over time. This is definitely something I am going to try, and I will push for its inclusion in the core extensions, if not in the core code itself. <br>The reason is obvious to me: PostgreSQL is becoming more and more central to enterprise organizations and, apart from subjects like TDE, monitoring is going to become a key aspect of automation, CloudNativePG and AI/LLM workflows. Making PostgreSQL easier and better to monitor at its core will provide leverage at all these levels. Cloud companies already realize that, hence their involvement in similar projects. <br><br>In the end, this year was once again the occasion for me to think about many relevant topics and to exchange with PostgreSQL hackers as well as users from around the world. I came back home with my head full of ideas to investigate. </p>



<p>Additionally, after the conference, the videos of each talk will be uploaded to the pgconf Europe YouTube channel, <a href="https://www.youtube.com/@pgeu/videos">PostgreSQL Europe</a>. You can already check out amazing talks from previous years and from this year&#8217;s pgDay Paris.</p>



<p>So once again, the PostgreSQL flag was flying high!</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img loading="lazy" decoding="async" width="1024" height="924" src="https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/9680-1024x924.jpg" alt="" class="wp-image-41248" style="width:726px;height:auto" srcset="https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/9680-1024x924.jpg 1024w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/9680-300x271.jpg 300w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/9680-768x693.jpg 768w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/9680.jpg 1280w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>
</div><p>The article <a href="https://www.dbi-services.com/blog/pgconf-eu-2025-recap/">pgconf.eu 2025 &#8211; RECAP</a> first appeared on <a href="https://www.dbi-services.com/blog">dbi Blog</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.dbi-services.com/blog/pgconf-eu-2025-recap/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>RAG Series – Adaptive RAG, understanding Confidence, Precision &#038; nDCG</title>
		<link>https://www.dbi-services.com/blog/rag-series-adaptive-rag-understanding-confidence-precision-ndcg/</link>
					<comments>https://www.dbi-services.com/blog/rag-series-adaptive-rag-understanding-confidence-precision-ndcg/#respond</comments>
		
		<dc:creator><![CDATA[Adrien Obernesser]]></dc:creator>
		<pubDate>Sun, 12 Oct 2025 12:17:54 +0000</pubDate>
				<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[ai]]></category>
		<category><![CDATA[AI/LLM]]></category>
		<guid isPermaLink="false">https://www.dbi-services.com/blog/?p=40963</guid>

					<description><![CDATA[<p>Introduction So far in this RAG series, each article has introduced new concepts of the RAG workflow. This one is no exception and introduces key concepts at the heart of retrieval. Adaptive RAG gives us the opportunity to talk about measuring the quality of the retrieved data and how we can leverage [&#8230;]</p>
<p>The article <a href="https://www.dbi-services.com/blog/rag-series-adaptive-rag-understanding-confidence-precision-ndcg/">RAG Series – Adaptive RAG, understanding Confidence, Precision &amp; nDCG</a> first appeared on <a href="https://www.dbi-services.com/blog">dbi Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h1 class="wp-block-heading" id="h-introduction">Introduction</h1>



<p>So far in this RAG series, each article has introduced new concepts of the RAG workflow. This one is no exception and introduces key concepts at the heart of retrieval. Adaptive RAG gives us the opportunity to talk about measuring the quality of the retrieved data and how we can leverage it to push our optimizations further.<br>A now <a href="https://mlq.ai/media/quarterly_decks/v0.1_State_of_AI_in_Business_2025_Report.pdf">famous study from MIT</a> states that 95% of organizations fail to get a return on investment within the first 6 months of their &#8220;AI projects&#8221;. Although we could argue about the relevance of the study and what it actually measured, one of the key elements of a successful implementation is measurement. <br>An old BI principle is to know your KPI: what it really measures, but also when it fails to measure. For example, if you use the speedometer on your car&#8217;s dashboard to measure your speed, you are right only as long as the wheels are touching the ground. With that in mind, let&#8217;s see how we can create smart and reliable retrieval. <br></p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-from-hybrid-to-adaptive">From Hybrid to Adaptive</h2>



<p>Hybrid search significantly improves retrieval quality by combining dense semantic vectors with sparse lexical signals. However, real-world queries vary:</p>



<ul class="wp-block-list">
<li>Some are <strong>factual</strong>, asking for specific names, numbers, or entities.</li>



<li>Others are <strong>conceptual</strong>, exploring ideas, reasons, or relationships.</li>
</ul>



<p>A single static weighting between dense and sparse methods cannot perform optimally across all query types.</p>



<p><strong>Adaptive RAG</strong> introduces a lightweight classifier that analyzes each query to determine its type and dynamically adjusts the hybrid weights before searching.<br>For example:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Query Type</th><th>Example</th><th>Dense Weight</th><th>Sparse Weight</th></tr></thead><tbody><tr><td>Factual</td><td>“Who founded PostgreSQL?”</td><td>0.3</td><td>0.7</td></tr><tr><td>Conceptual</td><td>“How does PostgreSQL handle concurrency?”</td><td>0.7</td><td>0.3</td></tr><tr><td>Exploratory</td><td>“Tell me about Postgres performance tuning”</td><td>0.5</td><td>0.5</td></tr></tbody></table></figure>



<p>This dynamic weighting ensures that each search leverages the right signals:</p>



<ul class="wp-block-list">
<li>Sparse when <strong>exact matching</strong> matters.</li>



<li>Dense when <strong>semantic similarity</strong> matters.</li>
</ul>



<p>Under the hood, our <code>AdaptiveSearchEngine</code> wraps dense and sparse retrieval modules. Before executing, it classifies the query, assigns weights, and fuses the results via a <strong>weighted Reciprocal Rank Fusion (RRF)</strong>, giving us the best of both worlds — adaptivity without complexity.</p>
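<p>To make the fusion step concrete, here is a minimal sketch of a weighted RRF. It is not the code of <code>AdaptiveSearchEngine</code>: the function name <code>weighted_rrf</code> and the smoothing constant <code>k=60</code> (a commonly used default for RRF) are assumptions made for illustration.</p>

```python
from collections import defaultdict

def weighted_rrf(dense_ids, sparse_ids, dense_weight, sparse_weight, k=60):
    """Fuse two ranked lists of document ids with weighted Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for weight, ranking in ((dense_weight, dense_ids), (sparse_weight, sparse_ids)):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes weight / (k + rank) for every document it returns
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d1", "d2", "d3"]    # ranking from the dense (vector) search
sparse = ["d3", "d1", "d4"]   # ranking from the sparse (lexical) search
fused = weighted_rrf(dense, sparse, dense_weight=0.3, sparse_weight=0.7)
print(fused)
```

<p>With the factual weights (0.3 dense, 0.7 sparse), the document ranked first by the sparse search ends up first in the fused list, even though the dense search ranked it last.</p>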



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-confidence-driven-retrieval">Confidence-Driven Retrieval</h2>



<p>Once we make retrieval adaptive, the next challenge is <strong>trust</strong>. How confident are we in the results we just returned?</p>



<h3 class="wp-block-heading" id="h-confidence-from-classification">Confidence from Classification</h3>



<p>Each query classification includes a <strong>confidence score</strong> (e.g., 0.92 “factual” vs 0.58 “conceptual”).<br>When classification confidence is low, Adaptive RAG defaults to a balanced retrieval (dense 0.5, sparse 0.5) — avoiding extreme weighting that might miss relevant content.</p>
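<p>The weight selection described above could look like the following sketch. The weights come from the table in the previous section; the name <code>choose_weights</code> and the 0.6 threshold for &#8220;low&#8221; classification confidence are assumptions, not values taken from the lab.</p>

```python
# (dense_weight, sparse_weight) per query type, as in the table of the previous section
WEIGHTS = {
    "factual": (0.3, 0.7),
    "conceptual": (0.7, 0.3),
    "exploratory": (0.5, 0.5),
}
BALANCED = (0.5, 0.5)

def choose_weights(query_type, classification_confidence, min_confidence=0.6):
    """Return (dense_weight, sparse_weight) for a classified query."""
    # min_confidence is an assumed threshold: below it, avoid extreme weights
    if classification_confidence < min_confidence:
        return BALANCED
    # Unknown query types also fall back to the balanced weights
    return WEIGHTS.get(query_type, BALANCED)

print(choose_weights("factual", 0.92))
print(choose_weights("conceptual", 0.58))
```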



<h3 class="wp-block-heading" id="h-confidence-from-retrieval">Confidence from Retrieval</h3>



<p>We also compute confidence based on retrieval statistics:</p>



<ul class="wp-block-list">
<li>The similarity gap between the first and second ranked results (large gap = high confidence).</li>



<li>Average similarity score of the top-k results.</li>



<li>Ratio of sparse vs dense agreement (when both find the same document, confidence increases).</li>
</ul>



<p>These metrics are aggregated into a <strong>normalized confidence score</strong> between 0 and 1:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
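def compute_confidence(top_scores, overlap_ratio):
    # Mean similarity of the top 3 results (fewer if less were returned), capped at 1.0
    top = top_scores&#x5B;:3]
    sim_conf = min(1.0, sum(top) / len(top)) if top else 0.0
    # Dense/sparse agreement mapped from &#x5B;0, 1] onto &#x5B;0.3, 1.0]
    overlap_conf = 0.3 + 0.7 * overlap_ratio
    return round((sim_conf + overlap_conf) / 2, 2)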

</pre></div>
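<p>The three retrieval statistics listed above can be derived from the ranked lists themselves. The sketch below is one possible way to do it, under assumptions: the helper name <code>retrieval_signals</code> is hypothetical, and the dense/sparse agreement is measured here as the Jaccard overlap of the two top-k sets, which may differ from what the lab uses.</p>

```python
def retrieval_signals(dense_ids, sparse_ids, scores, k=10):
    """Compute the three retrieval statistics for one query.

    dense_ids / sparse_ids: ranked document ids from each retriever.
    scores: similarity scores of the final results, best first.
    """
    top = scores[:k]
    # Gap between the best and the second-best result
    gap = top[0] - top[1] if len(top) >= 2 else 0.0
    # Mean similarity of the top-k results
    mean_score = sum(top) / len(top) if top else 0.0
    # Share of documents found by both retrievers (Jaccard overlap, an assumed choice)
    dense_set, sparse_set = set(dense_ids[:k]), set(sparse_ids[:k])
    union = dense_set | sparse_set
    overlap_ratio = len(dense_set & sparse_set) / len(union) if union else 0.0
    return gap, mean_score, overlap_ratio

gap, mean_score, overlap_ratio = retrieval_signals(
    ["a", "b", "c"], ["b", "c", "d"], [0.9, 0.8, 0.7]
)
print(gap, mean_score, overlap_ratio)
```

<p>The resulting <code>overlap_ratio</code> and the list of scores are the kind of inputs a function like <code>compute_confidence</code> expects.</p>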


<p>If confidence &lt; 0.5, the system triggers a <strong>fallback strategy</strong>:</p>



<ul class="wp-block-list">
<li>Expands <code>top_k</code> results (e.g., from 10 → 30).</li>



<li>Broadens search to both dense and sparse equally.</li>



<li>Logs the event for later evaluation.</li>
</ul>
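<p>A minimal sketch of this fallback, assuming a hypothetical <code>search</code> callable that returns a <code>(results, confidence)</code> pair. The function name and its parameters are illustrative and not the lab&#8217;s API.</p>

```python
def search_with_fallback(search, query, dense_weight, sparse_weight,
                         top_k=10, fallback_top_k=30, min_confidence=0.5, log=print):
    """Run `search`; retry with a wider, balanced search when confidence is low.

    `search` is a hypothetical callable
    (query, dense_weight, sparse_weight, top_k) -> (results, confidence).
    """
    results, confidence = search(query, dense_weight, sparse_weight, top_k)
    if confidence < min_confidence:
        # Record the event so the query can be reviewed later
        log(f"low confidence {confidence:.2f} for query {query!r}, retrying")
        # Wider search with equal dense and sparse weights
        results, confidence = search(query, 0.5, 0.5, fallback_top_k)
    return results, confidence
```

<p>Passing the logger as a parameter keeps the sketch easy to test. In a real service it would typically write to the same place as the evaluation logs.</p>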



<p>The retrieval API now returns a structured response:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
{
  &quot;query&quot;: &quot;When was PostgreSQL 1.0 released?&quot;,
  &quot;query_type&quot;: &quot;factual&quot;,
  &quot;confidence&quot;: 0.87,
  &quot;precision@10&quot;: 0.8,
  &quot;recall@10&quot;: 0.75
}

</pre></div>


<p>This allows monitoring not just <em>what</em> was retrieved, but <em>how sure</em> the system is, which enables alerting, adaptive reruns, or downstream LLM prompt adjustments (e.g., &#8220;Answer cautiously&#8221; when confidence &lt; 0.6).</p>
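<p>As an illustration of such a downstream prompt adjustment, the sketch below adds a caution instruction when the retrieval confidence is below 0.6. The function <code>build_prompt</code> and the exact wording of the instructions are assumptions.</p>

```python
def build_prompt(question, context_chunks, confidence, cautious_below=0.6):
    """Assemble an LLM prompt, adding a caution note when retrieval confidence is low."""
    instructions = "Answer the question using only the context below."
    if confidence < cautious_below:
        # Low retrieval confidence: ask the model to hedge rather than guess
        instructions += " Answer cautiously and say so if the context is insufficient."
    context = "\n\n".join(context_chunks)
    return f"{instructions}\n\nContext:\n{context}\n\nQuestion: {question}"

print(build_prompt("When was PostgreSQL 1.0 released?", ["chunk one", "chunk two"], 0.45))
```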



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-evaluating-quality-with-ndcg">Evaluating Quality with nDCG</h2>



<p>Precision and recall are fundamental metrics for retrieval systems, but they don’t consider <strong>the order</strong> of results. If a relevant document appears at rank 10 instead of rank 1, the user experience is still poor even if recall is high.</p>



<p>That’s why we now add <strong>nDCG@k (normalized Discounted Cumulative Gain)</strong> — a ranking-aware measure that rewards systems for ordering relevant results near the top.</p>



<p>The idea:</p>



<ul class="wp-block-list">
<li><strong>DCG@k</strong> evaluates gain by position:</li>
</ul>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="424" height="129" src="https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/image-5.png" alt="" class="wp-image-40969" srcset="https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/image-5.png 424w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/image-5-300x91.png 300w" sizes="auto, (max-width: 424px) 100vw, 424px" /></figure>
</div>


<ul class="wp-block-list">
<li><strong>nDCG@k</strong> normalizes this against the ideal order (IDCG):</li>
</ul>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img loading="lazy" decoding="async" width="408" height="116" src="https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/image-6.png" alt="" class="wp-image-40970" srcset="https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/image-6.png 408w, https://www.dbi-services.com/blog/wp-content/uploads/sites/2/2025/10/image-6-300x85.png 300w" sizes="auto, (max-width: 408px) 100vw, 408px" /></figure>
</div>


<p>A perfect ranking yields nDCG = 1.0. Poorly ordered but complete results may still have high recall, but lower nDCG.</p>



<p>In practice, we calculate nDCG@10 for each query and average it over the dataset.<br>Our evaluation script (<code>lab/04_evaluate/metrics.py</code>) integrates this directly:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
from evaluation import ndcg_at_k

score = ndcg_at_k(actual=relevant_docs, predicted=retrieved_docs, k=10)
print(f&quot;nDCG@10: {score:.3f}&quot;)

</pre></div>


<h3 class="wp-block-heading" id="h-results-on-the-wikipedia-dataset-25k-articles">Results on the Wikipedia dataset (25K articles)</h3>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Method</th><th>Precision@10</th><th>Recall@10</th><th>nDCG@10</th></tr></thead><tbody><tr><td>Dense only</td><td>0.61</td><td>0.54</td><td>0.63</td></tr><tr><td>Hybrid fixed weights</td><td>0.72</td><td>0.68</td><td>0.75</td></tr><tr><td><strong>Adaptive (dynamic)</strong></td><td><strong>0.78</strong></td><td><strong>0.74</strong></td><td><strong>0.82</strong></td></tr></tbody></table></figure>



<p>These results confirm that <strong>adaptive weighting not only improves raw accuracy but also produces better-ranked results</strong>, giving users relevant documents earlier in the list.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-implementation-in-our-lab">Implementation in our LAB</h2>



<p>You can explore the implementation in the GitHub repository:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
git clone https://github.com/boutaga/pgvector_RAG_search_lab
cd pgvector_RAG_search_lab

</pre></div>


<p>Key components:</p>



<ul class="wp-block-list">
<li><code>lab/04_search/adaptive_search.py</code> — query classification, adaptive weights, confidence scoring.</li>



<li><code>lab/04_evaluate/metrics.py</code> — precision, recall, and nDCG evaluation.</li>



<li>Streamlit UI (<code>streamlit run streamlit_demo.py</code>) — visualize retrieved chunks, scores, and confidence in real time.</li>
</ul>



<p>Example usage:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
python lab/04_search/adaptive_search.py --query &quot;Who invented SQL?&quot;

</pre></div>


<p>Output:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
Query type: factual (0.91 confidence)
Dense weight: 0.3 | Sparse weight: 0.7
Precision@10: 0.82 | Recall@10: 0.77 | nDCG@10: 0.84

</pre></div>


<p>This feedback loop closes the gap between research and production — making RAG not only smarter but measurable.</p>



<h2 class="wp-block-heading" id="h-what-is-relevance">What is “Relevance”?</h2>



<p>When we talk about <strong>precision</strong>, <strong>recall</strong>, or <strong>nDCG</strong>, all three depend on one hidden thing:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p class="has-text-align-center"><strong>a <em>ground truth</em> of which documents are relevant for each query.</strong></p>
</blockquote>



<p>There are <strong>two main ways</strong> to establish that ground truth:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Approach</th><th>Who decides relevance</th><th>Pros</th><th>Cons</th></tr></thead><tbody><tr><td><strong>Human labeling</strong></td><td>Experts mark which documents correctly answer each query</td><td>Most accurate; useful for benchmarks</td><td>Expensive and slow</td></tr><tr><td><strong>Automated or LLM-assisted labeling</strong></td><td>An LLM (or rules) judges if a retrieved doc contains the correct answer</td><td>Scalable and repeatable</td><td>Risk of bias / noise</td></tr></tbody></table></figure>



<p>In some business domains, you are almost forced to use human labeling because the business technicalities are so deep that automating it is hard. Labeling can be slow and expensive for a business, but I learned that it is also a way to introduce change management towards AI workflows: it enables key employees of the company to participate and build a solution with their own expertise, without going through the harder project of asking an external organization to build specific business logic into software that was never made to handle it in the first place. As a DBA, I witnessed business logic move away from databases towards ORMs and application code, and this time the business logic is moving towards AI workflows. Starting this human labeling project may be the first step in that direction, and it guarantees solid foundations. <br>Managers need to keep in mind that AI workflows are not just a technical solution; they are a socio-technical framework that enables organizational growth. You can&#8217;t just ship an AI chatbot into an app and expect 10x returns with minimal effort. This simplistic mindset has already cost billions, according to the MIT study.  <br><br>In a research setup (like the <code>pgvector_RAG_search_lab</code>), you can <strong>mix both</strong> approaches:</p>



<ul class="wp-block-list">
<li>Start with a <strong>seed dataset</strong> of <code>(query, relevant_doc_ids)</code> pairs (e.g. small set labeled manually).</li>



<li>Use the LLM to <strong>extend or validate</strong> relevance judgments automatically.</li>
</ul>



<p>For example:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
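import openai

# Ask the LLM for a yes/no relevance judgment on one (query, document) pair
prompt = f&quot;&quot;&quot;
Query: {query}
Document: {doc_text&#x5B;:2000]}
Is this document relevant to answering the query? (yes/no)
&quot;&quot;&quot;
# Legacy call style of the openai Python SDK before 1.0; newer versions use client.chat.completions.create(...)
llm_response = openai.ChatCompletion.create(...)
answer = llm_response&#x5B;&#039;choices&#039;]&#x5B;0]&#x5B;&#039;message&#039;]&#x5B;&#039;content&#039;].strip().lower()
# startswith tolerates answers such as &quot;Yes.&quot; or &quot;yes, because ...&quot;
label = answer.startswith(&#039;yes&#039;)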

</pre></div>


<p>Then you store that in a simple table or CSV:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>query_id</th><th>doc_id</th><th>relevant</th></tr></thead><tbody><tr><td>1</td><td>101</td><td>true</td></tr><tr><td>1</td><td>102</td><td>false</td></tr><tr><td>2</td><td>104</td><td>true</td></tr></tbody></table></figure>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-precision-amp-recall-in-practice">Precision &amp; Recall in Practice</h2>



<p>Once you have that table of true relevances, you can compute:</p>



<ul class="wp-block-list">
<li><strong>Precision@k</strong> → “Of the top <em>k</em> documents I retrieved, how many were actually relevant?” </li>



<li><strong>Recall@k</strong> → “Of all truly relevant documents, how many did I retrieve in my top <em>k</em>?” </li>
</ul>



<p>They’re correlated but not the same:</p>



<ul class="wp-block-list">
<li><strong>High precision</strong> → few false positives.</li>



<li><strong>High recall</strong> → few false negatives.</li>
</ul>



<p>For example:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Query</th><th>Retrieved docs (top 5)</th><th>True relevant</th><th>Precision@5</th><th>Recall@5</th></tr></thead><tbody><tr><td>“Who founded PostgreSQL?”</td><td>[d3, d7, d9, d1, d4]</td><td>[d1, d4]</td><td>0.4</td><td>1.0</td></tr></tbody></table></figure>



<p>You got both relevant docs (good recall = 1.0), but only 2 of the 5 retrieved were correct (precision = 0.4).</p>
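<p>The same numbers can be reproduced with two small helper functions. This is a sketch: the names <code>precision_at_k</code> and <code>recall_at_k</code> are illustrative and are not necessarily the signatures used in <code>metrics.py</code>.</p>

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        return 0.0
    return sum(doc in relevant for doc in retrieved[:k]) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

retrieved = ["d3", "d7", "d9", "d1", "d4"]
relevant = {"d1", "d4"}
print(precision_at_k(retrieved, relevant, 5))  # 0.4
print(recall_at_k(retrieved, relevant, 5))     # 1.0
```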



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-why-ndcg-is-needed">Why nDCG is Needed</h2>



<p>Precision and recall only measure <em>which</em> docs were retrieved, not <em>where they appeared in the ranking</em>.</p>



<p><strong>nDCG@k</strong> adds <em>ranking quality</em>:</p>



<ul class="wp-block-list">
<li>Each document gets a <strong>relevance grade</strong> (commonly 0, 1, or 2: irrelevant, relevant, highly relevant).</li>



<li>The higher it appears in the ranked list, the higher the gain.</li>
</ul>



<p>So if a highly relevant doc is ranked 1st, you get more credit than if it’s ranked 10th.</p>



<p><strong>In your database</strong>, you can store relevance grades in a table like:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>query_id</th><th>doc_id</th><th>rel_grade</th></tr></thead><tbody><tr><td>1</td><td>101</td><td>2</td></tr><tr><td>1</td><td>102</td><td>1</td></tr><tr><td>1</td><td>103</td><td>0</td></tr></tbody></table></figure>



<p>Then your evaluator computes:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import math

def dcg_at_k(relevances, k):
    return sum((2**rel - 1) / math.log2(i+2) for i, rel in enumerate(relevances&#x5B;:k]))

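def ndcg_at_k(actual_relevances, k):
    # actual_relevances: relevance grades in the order the documents were returned
    ideal = sorted(actual_relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    # When no returned document is relevant, IDCG is 0: return 0.0 instead of dividing by zero
    return dcg_at_k(actual_relevances, k) / idcg if idcg &gt; 0 else 0.0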

</pre></div>
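<p>A short worked example shows the effect of ordering. The sketch below restates the two functions so that it runs on its own and adds an optional <code>all_relevances</code> argument. That argument is an assumption, not part of the lab code: it lets the ideal ranking be built from every judged document of the query, not only from the documents that were returned.</p>

```python
import math

def dcg_at_k(relevances, k):
    return sum((2**rel - 1) / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k, all_relevances=None):
    """nDCG@k; `all_relevances` holds the grades of every judged document for the query."""
    pool = ranked_relevances if all_relevances is None else all_relevances
    idcg = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / idcg if idcg > 0 else 0.0

good_order = [2, 1, 0]   # highly relevant document ranked first
bad_order = [0, 1, 2]    # same documents, worst possible ordering
print(round(ndcg_at_k(good_order, 3), 3))  # 1.0
print(round(ndcg_at_k(bad_order, 3), 3))   # 0.587
```

<p>Both rankings contain the same documents, so precision and recall are identical. Only nDCG separates them.</p>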


<p><strong>You do need to keep track of rank</strong> (the order in which docs were returned).<br>In PostgreSQL, you could log that like:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>query_id</th><th>doc_id</th><th>rank</th><th>score</th><th>rel_grade</th></tr></thead><tbody><tr><td>1</td><td>101</td><td>1</td><td>0.92</td><td>2</td></tr><tr><td>1</td><td>102</td><td>2</td><td>0.87</td><td>1</td></tr><tr><td>1</td><td>103</td><td>3</td><td>0.54</td><td>0</td></tr></tbody></table></figure>



<p>Then it’s easy to run SQL to evaluate:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
SELECT query_id,
       SUM((POWER(2, rel_grade) - 1) / LOG(2, rank + 1)) AS dcg
FROM eval_results
WHERE rank &lt;= 10
GROUP BY query_id;

</pre></div>


<p>In a real system (like your Streamlit or API demo), you can:</p>



<ul class="wp-block-list">
<li>Log <strong>each retrieval attempt</strong> (query, timestamp, ranking list, scores, confidence).</li>



<li>Periodically <strong>recompute metrics</strong> (precision, recall, nDCG) using a fixed ground-truth set.</li>
</ul>



<p>This lets you track if tuning (e.g., changing dense/sparse weights) is improving performance.</p>



<p>The structure of your evaluation log table could be:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>run_id</th><th>query_id</th><th>method</th><th>rank</th><th>doc_id</th><th>score</th><th>confidence</th><th>rel_grade</th></tr></thead><tbody><tr><td>2025-10-12_01</td><td>1</td><td>adaptive_rrf</td><td>1</td><td>101</td><td>0.92</td><td>0.87</td><td>2</td></tr><tr><td>2025-10-12_01</td><td>1</td><td>adaptive_rrf</td><td>2</td><td>102</td><td>0.85</td><td>0.87</td><td>1</td></tr></tbody></table></figure>



<p>From there, you can generate:</p>



<ul class="wp-block-list">
<li><strong>nDCG@10 trend over runs</strong> (e.g., in Prometheus or Streamlit chart)</li>



<li><strong>Precision vs Confidence correlation</strong></li>



<li><strong>Recall improvements per query type</strong></li>
</ul>
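<p>The nDCG trend over runs can be computed directly from rows shaped like the evaluation log table. The sketch below assumes the rows are available as dictionaries with the same column names. The function <code>mean_ndcg_by_run</code> is hypothetical, and the ideal ranking is built only from the logged documents.</p>

```python
import math
from collections import defaultdict

def mean_ndcg_by_run(rows, k=10):
    """rows: dicts with run_id, query_id, rank and rel_grade, as in the log table."""
    grades = defaultdict(list)
    # Collect the grades of each (run, query) in rank order
    for row in sorted(rows, key=lambda r: r["rank"]):
        grades[(row["run_id"], row["query_id"])].append(row["rel_grade"])

    def dcg(rels):
        return sum((2**rel - 1) / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

    per_run = defaultdict(list)
    for (run_id, _query_id), rels in grades.items():
        idcg = dcg(sorted(rels, reverse=True))
        per_run[run_id].append(dcg(rels) / idcg if idcg > 0 else 0.0)
    # Average nDCG@k over the queries of each run
    return {run_id: sum(vals) / len(vals) for run_id, vals in per_run.items()}

rows = [
    {"run_id": "2025-10-12_01", "query_id": 1, "rank": 1, "rel_grade": 2},
    {"run_id": "2025-10-12_01", "query_id": 1, "rank": 2, "rel_grade": 1},
]
print(mean_ndcg_by_run(rows))
```

<p>The resulting dictionary, one value per <code>run_id</code>, is what you would chart or export to follow the trend.</p>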



<p><em>⚠️ Note: While nDCG is a strong metric for ranking quality, it’s not free from bias. Because it normalizes per query, easier questions (with few relevant documents) can inflate the average score. In our lab, we mitigate this by logging both raw DCG and nDCG, and by comparing results across query categories (factual vs conceptual vs exploratory). This helps ensure improvements reflect true retrieval quality rather than statistical artifacts.</em></p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-human-llm-hybrid-evaluation-practical-middle-ground">Human + LLM Hybrid Evaluation (Practical Middle Ground)</h2>



<p>For your PostgreSQL lab setup:</p>



<ul class="wp-block-list">
<li>Label a <strong>small gold set</strong> manually (e.g., 20–50 queries × 3–5 relevant docs each).</li>



<li>For larger coverage, use the <strong>LLM as an auto-grader</strong>.<br>You can even use <em>self-consistency</em>: ask the LLM to re-evaluate relevance twice and keep consistent labels only.</li>
</ul>
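<p>The self-consistency check can be sketched as follows. The <code>judge</code> argument is a hypothetical callable wrapping the LLM relevance prompt shown earlier. It is passed in as a parameter so that the logic does not depend on a particular LLM client.</p>

```python
def consistent_label(judge, query, doc_text, runs=2):
    """Ask `judge` several times; return the label only if all answers agree.

    `judge` is a hypothetical callable (query, doc_text) -> bool wrapping the LLM call.
    Returns True or False when the answers agree, or None when they disagree.
    """
    answers = {judge(query, doc_text) for _ in range(runs)}
    return answers.pop() if len(answers) == 1 else None

# Stand-in judge for demonstration: relevant when the query term appears in the text
def keyword_judge(query, doc_text):
    return query.lower() in doc_text.lower()

print(consistent_label(keyword_judge, "vacuum", "VACUUM reclaims storage in PostgreSQL"))
```

<p>Pairs that come back as <code>None</code> are good candidates for the manual gold set, since the LLM could not agree with itself on them.</p>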



<p>This gives you a <strong>semi-automated evaluation dataset</strong>, good enough to monitor:</p>



<ul class="wp-block-list">
<li>Precision@10</li>



<li>Recall@10</li>



<li>nDCG@10 over time</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-lessons-learned">Lessons Learned</h2>



<p>Through Adaptive RAG, we’ve transformed retrieval from a static process into a self-aware one.</p>



<ul class="wp-block-list">
<li><strong>Precision increased by ~6–7%</strong>, especially for conceptual queries.</li>



<li><strong>Recall improved by ~8%</strong> for factual questions thanks to better keyword anchoring.</li>



<li><strong>nDCG@10 rose from 0.75 → 0.82</strong>, confirming that relevant results are appearing earlier.</li>



<li><strong>Confidence scoring</strong> provides operational visibility: we now know when the system is uncertain, enabling safe fallbacks and trust signals.</li>
</ul>



<p>The combination of adaptive routing, confidence estimation, and nDCG evaluation makes this pipeline suitable for enterprise-grade RAG use cases — where explainability, reliability, and observability are as important as accuracy.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" />



<h2 class="wp-block-heading" id="h-conclusion-and-next-steps">Conclusion and Next Steps</h2>



<p>Adaptive RAG is the bridge between smart retrieval and <strong>reliable retrieval</strong>.<br>By classifying queries, tuning dense/sparse balance dynamically, and measuring ranking quality with nDCG, we now have a system that understands <em>what kind of question it’s facing</em> and <em>how well it performed</em> in answering it.</p>



<p>This version of the lab introduces the first metrics-driven feedback loop for RAG in PostgreSQL:</p>



<ul class="wp-block-list">
<li>Retrieve adaptively,</li>



<li>Measure precisely,</li>



<li>Adjust intelligently.</li>
</ul>



<p>In <strong>the next part</strong>, we’ll push even further — introducing <strong>Agentic RAG</strong>, and how it plans and executes multi-step reasoning chains to improve retrieval and answer quality even more.</p>



<p>Try Adaptive RAG in the <a href="https://github.com/boutaga/pgvector_RAG_search_lab">pgvector_RAG_search_lab</a> repository, explore your own datasets, and start measuring nDCG@10 to see how adaptive retrieval changes the game.</p>



<p>The article <a href="https://www.dbi-services.com/blog/rag-series-adaptive-rag-understanding-confidence-precision-ndcg/">RAG Series – Adaptive RAG, understanding Confidence, Precision &amp; nDCG</a> first appeared on <a href="https://www.dbi-services.com/blog">dbi Blog</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.dbi-services.com/blog/rag-series-adaptive-rag-understanding-confidence-precision-ndcg/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
