Over the years, ORMs have industrialized query generation, allowing developers to focus on optimizing the few queries that truly need it. However, they can sometimes produce overly complex SQL with excessive joins and embedded business logic, making tuning a nightmare, and it is often hard to know when a query should be split into multiple smaller parts. The complexity is amplified by the use of JSON, which adds another layer for DBAs to manage: although JSON is not a new data type, best practices for using it in PostgreSQL still feel like dark magic. This blog post intends to shed some light on all of this with a practical example.

Improving ORM-Generated Queries

Example of an Inefficient ORM-Generated Query

Below is a truncated and anonymized version of a complex SQL query generated by an ORM, which I had to tune because its execution time (90+ minutes) wasn't in line with the business need. Note that the original query was over 1,500 lines:

WITH result_list AS (
    SELECT o.id,
           o.data AS order_data,
           ARRAY_AGG(DISTINCT cr.id) AS customer_refs_id,
           ARRAY_AGG(DISTINCT cr.data) AS customer_refs_data,
           ARRAY_AGG(DISTINCT cp.policy_id) AS customer_policy_id,
           ROW_NUMBER() OVER () AS rowid
    FROM orders o
    LEFT JOIN customer_refs cr ON cr.data->>'id' = o.data->'customer'->>'id'
    LEFT JOIN customer_policies cp ON cp.policy_data->>'id' = o.data->'customer'->>'id'
        AND cp.policy_data->>'region' = ANY (ARRAY['GLOBAL', 'REGION_A', 'REGION_B', 'REGION_C', 'REGION_D'])
        AND (
            (
                cp.policy_data->>'status' IS NOT NULL
                AND cp.policy_data->>'status' NOT IN ('inactive', 'blocked', 'suspended')
            )
            AND (
                (
                    cp.policy_data->>'customer_type' IS NOT NULL
                    AND cp.policy_data->>'customer_type' = 'issuer'
                )
                OR (
                    cp.policy_data->>'customer_type' IS NOT NULL
                    AND cp.policy_data->>'customer_type' IN ('type1', 'type2', 'type3')
                )
                OR (
                    get_text_from_path(cp.policy_data->'shared_groups', '$[*]', 'NONE') IS NOT NULL
                    AND get_text_from_path(cp.policy_data->'shared_groups', '$[*]', 'NONE') && ARRAY['GROUP_1']
                )
            )
        )
    WHERE EXISTS (
        SELECT 1
        FROM account_policies ap
        WHERE ap.policy_id = ARRAY_TO_STRING(ARRAY[
            o.data->'account'->'id'->>'system_ref',
            o.data->'account'->'id'->>'type_ref',
            o.data->'account'->'id'->>'region',
            o.data->'account'->'id'->>'id'], '-')
        AND ap.policy_data->>'region' = ANY (ARRAY['GLOBAL', 'REGION_A', 'REGION_B', 'REGION_C', 'REGION_D'])
        AND ap.policy_data->>'status' NOT IN ('inactive', 'blocked', 'suspended')
        AND ap.policy_data->>'contract' != 'prospect'
        AND (
            ap.policy_data->>'account_type' IN ('typeX', 'typeY', 'typeZ')
            OR get_text_from_path(ap.policy_data->'shared_groups', '$[*]', 'NONE') && ARRAY['GROUP_1']
        )
    )
    AND (
        (
            o.data->>'order_type' = 'MONEY_ORDER'
            AND (
                o.data->'close_date'->>'value' > '2024-09-10'
                OR o.data->'close_date'->>'value' IS NULL
                OR o.data->'close_date'->>'value' = ''
            )
        )
        OR (
            o.data->>'order_type' != 'MONEY_ORDER'
            AND COALESCE(
                NULLIF(o.data->'valuation'->'quantity'->>'value', '')::DECIMAL,
                0
            ) != 0
        )
    )
    AND (
        LOWER(o.data->>'display_name') LIKE '%current%'
        OR LOWER(o.data->'product'->'item'->'item_name'->>'abbreviation') LIKE '%current%'
        OR LOWER(o.data->'product'->'item'->'item_name'->>'full') LIKE '%current%'
        OR LOWER(o.data->'product'->'item'->>'identifier') LIKE '%current%'
        OR LOWER(o.data->'product'->'issuer'->>'display_name') LIKE '%current%'
        OR get_text_for_search_from_path(o.data->'product'->'underlying', '$[*].item.item_name.abbreviation', 'LOWER') LIKE '%current%'
    )
    GROUP BY o.id, o.data
    ORDER BY o.id
),
ordered_list AS (
    SELECT *
    FROM result_list
    ORDER BY rowid
    LIMIT 23
)
SELECT *
FROM ordered_list
ORDER BY rowid;

Key Characteristics of the Query:

  • Multiple Joins: The query includes several LEFT JOIN operations with complex ON conditions involving JSON fields.
  • Nested JSON Access: Frequent use of JSON operators (->, ->>) to access deeply nested fields.
  • Custom Functions: Usage of functions like get_text_from_path and get_text_for_search_from_path, which may not be optimized for performance.
  • Complex WHERE Conditions: The WHERE clause contains multiple nested conditions and subqueries, often involving JSON fields.

Why the Query is Inefficient

  • Accessing deeply nested JSON fields with operators on large datasets can be slow; when the expressions are not properly indexed, this causes full table scans and drastically increases execution time.
  • The query performs numerous LOWER() and LIKE operations on JSON fields, which are computationally expensive and, on top of that, prevent effective index usage (see the sketch after this list).
  • The indexes on JSON fields may not be utilized due to the way conditions are written.
  • The use of LEFT JOIN with conditions that include functions and JSON field comparisons leads to inefficient execution plans.
  • The EXISTS subquery further complicates the execution, potentially causing full table scans.
  • ORMs often generate generic SQL that doesn’t consider database-specific optimizations.
  • The abstraction can lead to redundant or unnecessary conditions and joins.
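
As a minimal sketch of the LIKE problem mentioned above (the index names are illustrative, not taken from the original system): a leading-wildcard LIKE can never use a plain BTREE index, even an expression index on the extracted JSON field, whereas a trigram index from the pg_trgm extension can serve such patterns:

-- A BTREE expression index cannot help a '%...%' search:
CREATE INDEX idx_orders_display_name
    ON orders ((LOWER(data->>'display_name')));

SELECT id
FROM orders
WHERE LOWER(data->>'display_name') LIKE '%current%';  -- still a full scan

-- A trigram GIN index (pg_trgm extension) can serve this pattern:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_orders_display_name_trgm
    ON orders USING gin ((LOWER(data->>'display_name')) gin_trgm_ops);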

Result: A Bad Execution Plan

Limit  (cost=292646835.98..292646835.99 rows=1 width=1813) (actual time=5504844.738..5504844.745 rows=23 loops=1)
  Buffers: shared hit=3622772239 read=7862262 dirtied=3639
  ->  Sort  (cost=292646835.98..292646835.99 rows=1 width=1813) (actual time=5502255.883..5502255.889 rows=23 loops=1)
        Sort Key: (row_number() OVER (?))
        Sort Method: quicksort  Memory: 114kB
        Buffers: shared hit=3622772239 read=7862262 dirtied=3639
        ->  WindowAgg  (cost=292646835.93..292646835.97 rows=1 width=1813) (actual time=5502254.506..5502255.810 rows=43 loops=1)
              Buffers: shared hit=3622772236 read=7862262 dirtied=3639
              ->  GroupAggregate  (cost=292646835.93..292646835.96 rows=1 width=1805) (actual time=5502253.088..5502253.249 rows=43 loops=1)
                    Group Key: o.id
                    Buffers: shared hit=3622772236 read=7862262 dirtied=3639
                    ->  Sort  (cost=292646835.93..292646835.93 rows=1 width=1930) (actual time=5502244.952..5502244.964 rows=43 loops=1)
                          Sort Key: o.id, cr.id
                          Sort Method: quicksort  Memory: 71kB
                          Buffers: shared hit=3622772236 read=7862262 dirtied=3639
                          ->  Nested Loop  (cost=0.98..292646835.92 rows=1 width=1930) (actual time=2015030.222..5502243.020 rows=43 loops=1)
                                Join Filter: (array_to_string(ARRAY[...]) = ap.policy_id)
                                Rows Removed by Join Filter: 246198062
                                Buffers: shared hit=3622772236 read=7862262 dirtied=3639
---

"Planning:"
"  Buffers: shared hit=649 read=2"
"Planning Time: 133.626 ms"
"JIT:"
"  Functions: 32"
"  Options: Inlining true, Optimization true, Expressions true, Deforming true"
"  Timing: Generation 226.277 ms, Inlining 631.748 ms, Optimization 801.960 ms, Emission 1156.811 ms, Total 2816.797 ms"
"Execution Time: 5505139.477 ms"

Observations from the Execution Plan:

Execution Time: ~90 minutes
Buffer Usage:

  • Shared hits: ~3.6 billion
  • Reads: ~7.8 million
  • Dirtied: ~3,639

Key Issues:

  • Functions in Joins and Conditions: Prevented effective index usage.
  • Nested Loops: Inefficient row-by-row processing.
  • Sort Operations: High-cost sorting due to lack of index usage.
  • Full Table Scans: Extensive rows removed by join filters.

Optimizing the Query

Suggested Optimized Query:

Here I modified a few things to help the optimizer use the indexes:

  • I added indexes and created generated columns to avoid applying functions to some fields in the WHERE clause.
  • I changed the LEFT JOINs to INNER JOINs where possible, to reduce the data set early in the query execution.
  • I rewrote some conditions to avoid the use of functions on the JSON fields.

ALTER TABLE orders ADD COLUMN customer_id TEXT GENERATED ALWAYS AS (data->'customer'->>'id') STORED;
ALTER TABLE orders ADD COLUMN order_type TEXT GENERATED ALWAYS AS (data->>'order_type') STORED;
ALTER TABLE orders ADD COLUMN display_name_lower TEXT GENERATED ALWAYS AS (LOWER(data->>'display_name')) STORED;
CREATE INDEX idx_orders_customer_id ON orders (customer_id);
CREATE INDEX idx_orders_order_type ON orders (order_type);
CREATE INDEX idx_orders_display_name_lower ON orders (display_name_lower);
-- ... etc.

WITH result_list AS (
    SELECT o.id,
           o.data AS order_data,
           ARRAY_AGG(DISTINCT cr.id) AS customer_refs_id,
           ARRAY_AGG(DISTINCT cr.data) AS customer_refs_data,
           ARRAY_AGG(DISTINCT cp.policy_id) AS customer_policy_id,
           ROW_NUMBER() OVER () AS rowid
    FROM orders o
    INNER JOIN customer_refs cr ON cr.data->>'id' = o.customer_id
    INNER JOIN customer_policies cp ON cp.policy_data->>'id' = o.customer_id
        AND cp.policy_data->>'region' = ANY (ARRAY['GLOBAL', 'REGION_A', 'REGION_B', 'REGION_C', 'REGION_D'])
        AND cp.policy_data->>'status' NOT IN ('inactive', 'blocked', 'suspended')
        AND (
            cp.policy_data->>'customer_type' = 'issuer'
            OR cp.policy_data->>'customer_type' IN ('type1', 'type2', 'type3')
            OR get_text_from_path(cp.policy_data->'shared_groups', '$[*]', 'NONE') && ARRAY['GROUP_1']
        )
    WHERE EXISTS (
        SELECT 1
        FROM account_policies ap
        WHERE ap.policy_id = ARRAY_TO_STRING(ARRAY[
            o.data->'account'->'id'->>'system_ref',
            o.data->'account'->'id'->>'type_ref',
            o.data->'account'->'id'->>'region',
            o.data->'account'->'id'->>'id'], '-')
        AND ap.policy_data->>'region' = ANY (ARRAY['GLOBAL', 'REGION_A', 'REGION_B', 'REGION_C', 'REGION_D'])
        AND ap.policy_data->>'status' NOT IN ('inactive', 'blocked', 'suspended')
        AND ap.policy_data->>'contract' != 'prospect'
        AND (
            ap.policy_data->>'account_type' IN ('typeX', 'typeY', 'typeZ')
            OR get_text_from_path(ap.policy_data->'shared_groups', '$[*]', 'NONE') && ARRAY['GROUP_1']
        )
    )
    AND o.order_type = 'MONEY_ORDER'
    AND (o.data->'close_date'->>'value' > '2024-09-10' OR o.data->'close_date'->>'value' IS NULL OR o.data->'close_date'->>'value' = '')
    AND o.display_name_lower LIKE '%current%'
    GROUP BY o.id, o.data
    ORDER BY o.id
)
SELECT *
FROM result_list
ORDER BY rowid
LIMIT 23;

Updated Execution Plan Analysis

Some developers were also working on tuning this query in parallel and decided to implement a materialized view, which solved the issue differently but with the same intent; the execution plan below reflects their version.

Limit  (cost=9013972.46..9013972.46 rows=1 width=1789) (actual time=175757.446..175757.449 rows=0 loops=1)
  Buffers: shared hit=20716139 read=836484 dirtied=1676
  ->  Sort  (cost=9013972.46..9013972.46 rows=1 width=1789) (actual time=174722.049..174722.052 rows=0 loops=1)
        Sort Key: (row_number() OVER (?))
        Sort Method: quicksort  Memory: 25kB
        Buffers: shared hit=20716139 read=836484 dirtied=1676
        ->  WindowAgg  (cost=9013972.40..9013972.45 rows=1 width=1789) (actual time=174722.010..174722.013 rows=0 loops=1)
              Buffers: shared hit=20716136 read=836484 dirtied=1676
              ->  GroupAggregate  (cost=9013972.40..9013972.44 rows=1 width=1781) (actual time=174722.008..174722.011 rows=0 loops=1)
                    Group Key: positions.id
                    Buffers: shared hit=20716136 read=836484 dirtied=1676
                    ->  Sort  (cost=9013972.40..9013972.41 rows=1 width=1906) (actual time=174722.007..174722.009 rows=0 loops=1)
                          Sort Key: positions.id, client_refs.id
                          Sort Method: quicksort  Memory: 25kB
                          Buffers: shared hit=20716136 read=836484 dirtied=1676
                          ->  Nested Loop  (cost=1.41..9013972.39 rows=1 width=1906) (actual time=174722.001..174722.004 rows=0 loops=1)
                                Buffers: shared hit=20716136 read=836484 dirtied=1676
                                ->  Nested Loop Left Join  (cost=0.99..9013961.75 rows=1 width=1933) (actual time=184.358..170945.266 rows=8554 loops=1)
                                      Buffers: shared hit=20683249 read=835147 dirtied=1672
                                      ->  Nested Loop Left Join  (cost=0.56..9013950.91 rows=1 width=1931) (actual time=138.461..146595.622 rows=8554 loops=1)
                                            Join Filter: (client_refs.id = pv.client_id)
                                             Rows Removed by Join Filter: 328884138
                                            Buffers: shared hit=20655467 read=828553 dirtied=1153
                                            ->  Nested Loop  (cost=0.56..9008615.75 rows=1 width=1736) (actual time=33.568..75336.287 rows=8554 loops=1)
                                                   Buffers: shared hit=27870 read=824707 dirtied=851
                                                  ->  Seq Scan on positions_view pv  (cost=0.00..9008612.97 rows=1 width=78) (actual time=16.384..12504.255 rows=8554 loops=1)
                                                        Filter: ...
                                                         Rows Removed by Filter: 32144940
                                                         Buffers: shared hit=7 read=809465
                                                  ->  Index Scan using positions_pkey on positions positions  (cost=0.56..2.78 rows=1 width=1685) (actual time=7.327..7.327 rows=1 loops=8554)
                                                        Index Cond: (id = pv.id)
                                                         Buffers: shared hit=27550 read=15242 dirtied=602
                                             ->  Seq Scan on client_refs  (cost=0.00..4513.96 rows=65696 width=195) (actual time=0.004..5.734 rows=38449 loops=8554)
                                                   Buffers: shared hit=20627597 read=3846 dirtied=302
                                      ->  Index Scan using clients_policy_pkey on clients_policy  (cost=0.43..10.84 rows=1 width=26) (actual time=2.841..2.841 rows=0 loops=8554)
                                            Index Cond: (policy_id = pv.client_id)
                                            Filter: ...
                                            Rows Removed by Filter: 1
                                             Buffers: shared hit=27782 read=6594 dirtied=519
                                ->  Index Scan using idx_btree_portfolios_policy_id on portfolios_policy  (cost=0.42..10.64 rows=1 width=28) (actual time=0.439..0.439 rows=0 loops=8554)
                                      Index Cond: (policy_id = pv.portfolio_id)
                                      Filter: ...
                                      Rows Removed by Filter: 1
                                       Buffers: shared hit=32887 read=1337 dirtied=4
                                      
---

"Planning:"
"  Buffers: shared hit=954 read=78 dirtied=9"
"Planning Time: 130.627 ms"
"JIT:"
"  Functions: 33"
"  Options: Inlining true, Optimization true, Expressions true, Deforming true"
"  Timing: Generation 5.483 ms, Inlining 81.013 ms, Optimization 530.635 ms, Emission 423.850 ms, Total 1040.982 ms"
"Execution Time: 175799.036 ms"

Execution Time: Under 3 minutes

Buffer Usage:

  • Shared hits: ~20.7 million
  • Reads: ~836,484
  • Dirtied: ~1,676

Improvements:

  • Efficient Joins: The INNER JOINs limited the data set early.
  • Indexed Columns: Enabled rapid data retrieval without full scans.
  • Simplified Conditions: Eliminating some functions in the WHERE clause allowed index usage.
  • Reduced Buffer Operations: Significantly lower I/O overhead.

Comparison to the Previous Execution Plan:

  • Despite processing a large number of rows, the optimized query executes much faster.
  • The use of materialized views allows for more efficient joins and filtering, as the data is stored in a form that is easier for the database to handle (see the sketch below).
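
To illustrate that approach, here is a minimal sketch of such a materialized view (the names are hypothetical, not the actual view the team built): the hot JSON fields are pre-extracted into plain columns that can be indexed and joined cheaply.

CREATE MATERIALIZED VIEW orders_flat AS
SELECT id,
       data->'customer'->>'id'      AS customer_id,
       data->>'order_type'          AS order_type,
       LOWER(data->>'display_name') AS display_name_lower,
       data->'close_date'->>'value' AS close_date
FROM orders;

-- Plain columns index like any relational table.
CREATE UNIQUE INDEX idx_orders_flat_id ON orders_flat (id);
CREATE INDEX idx_orders_flat_customer_id ON orders_flat (customer_id);

-- Refresh on a schedule; CONCURRENTLY requires the unique index above.
REFRESH MATERIALIZED VIEW CONCURRENTLY orders_flat;

The trade-off is freshness: queries see the data as of the last refresh, which is often acceptable for search and reporting workloads like this one.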

When to Use JSON vs. Traditional Table Columns

Use JSON When:

  • Schema Flexibility is Required: If your application deals with varying data structures or frequent schema changes, JSON allows you to store this data without altering the database schema.
  • Storing Semi-Structured or Unstructured Data: JSON is ideal for data that doesn’t fit neatly into a relational schema, such as logs, configuration files, or user-generated content.
  • Aggregating Diverse Data: When consolidating data from multiple sources with different schemas, JSON can accommodate the discrepancies without data loss.
  • Embedding Related Data: Storing related data together as JSON can simplify data retrieval in certain scenarios, reducing the need for complex joins.

Use Traditional Table Columns When:

  • Data Integrity is Crucial: Relational columns allow for constraints, foreign keys, and data types that enforce data validity.
  • Frequent Updates on Specific Fields: Updating a field within a JSON object requires rewriting the entire object, whereas individual columns can be updated more efficiently (see the sketch after this list).
  • Complex Queries and Reporting: SQL excels at querying structured data. If you need to perform complex queries, aggregations, or reports, structured columns are more efficient.
  • Performance Optimization: Indexing and query planning are generally more effective with traditional columns.
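
To make the update point concrete, here is a quick sketch (the relational table is hypothetical): updating one key inside a jsonb document rewrites the whole value on disk, while a plain column update touches only that column.

-- JSON variant: the entire jsonb value is rewritten.
UPDATE users
SET data = jsonb_set(data, '{age}', '31'::jsonb)
WHERE data->>'name' = 'Alice';

-- Column variant: only the age column changes.
UPDATE users_relational
SET age = 31
WHERE name = 'Alice';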

Best Practices for Running Efficient Queries on JSON Data

Utilize JSON Functions and Operators

PostgreSQL offers a rich set of functions and operators for querying JSON data:

  • Access Operators:
      • -> : Returns a JSON object field by key.
      • ->> : Returns a JSON object field by key as text.
      • #> : Returns the JSON sub-object at the specified path.
      • #>> : Returns the value at the specified path as text.

Table and Sample Data

postgres=# create database jsonlab;
CREATE DATABASE
postgres=# \c jsonlab
You are now connected to database "jsonlab" as user "postgres".
jsonlab=# CREATE TABLE users (
    id serial PRIMARY KEY,
    data jsonb
);
CREATE TABLE
jsonlab=# INSERT INTO users (data) VALUES
    ('{"name": "Alice", "age": 30, "address": {"city": "New York", "zip": "10001"}}'),
    ('{"name": "Bob", "age": 25, "address": {"city": "Los Angeles", "zip": "90001"}}'),
    ('{"name": "Charlie", "age": 35, "address": {"city": "Chicago", "zip": "60601"}}');
INSERT 0 3
jsonlab=# SELECT
    data->'name' AS name_json,                  -- Returns name as JSON
    data->>'age' AS age_text,                   -- Returns age as text
    data#>'{address}' AS address_json,          -- Returns the address object as JSON
    data#>>'{address,city}' AS city_text        -- Returns the city from the address as text
FROM users;
 name_json | age_text |              address_json               |  city_text
-----------+----------+-----------------------------------------+-------------
 "Alice"   | 30       | {"zip": "10001", "city": "New York"}    | New York
 "Bob"     | 25       | {"zip": "90001", "city": "Los Angeles"} | Los Angeles
 "Charlie" | 35       | {"zip": "60601", "city": "Chicago"}     | Chicago
(3 rows)

Indexing for Performance

To speed up queries on JSON data, we can create indexes on the fields that are accessed frequently. PostgreSQL allows you to create different types of indexes on JSON and JSONB expressions.
Here are some example indexes to support the query above:

  • idx_users_data_gin:
    A general-purpose GIN index on the data column to speed up key-value searches.
  • idx_users_data_name:
    A BTREE index on the name field to quickly access user names.
  • idx_users_data_city:
    A BTREE index on the city field in the address to efficiently retrieve users by their city.
  • idx_users_age_above_30:
    A partial index, useful when a filter condition is common and you want to reduce index size and maintenance overhead (included purely as an example; it doesn't make much sense on this tiny data set).

These indexes will significantly improve the performance of JSON queries, especially when dealing with large datasets.

CREATE INDEX idx_users_data_gin ON users USING gin (data);
CREATE INDEX idx_users_data_name ON users ((data->>'name'));
CREATE INDEX idx_users_data_city ON users ((data#>>'{address,city}'));
CREATE INDEX idx_users_age_above_30 ON users ((data->>'age'))
    WHERE (data->>'age')::int > 30;
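
For the GIN index to be considered, the query must use an operator it supports, such as the containment operator @> (a quick sketch against the sample table above):

-- Containment search that can be served by idx_users_data_gin:
SELECT data->>'name'
FROM users
WHERE data @> '{"address": {"city": "Chicago"}}';

If you only ever run containment queries, declaring the GIN index with the jsonb_path_ops operator class instead of the default produces a smaller and usually faster index.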

Avoid Unnecessary Casting

The optimizer is still the same optimizer when you use JSON operators, so hiding the intent of your code behind extra complexity has the same drawbacks as ever. Casting JSON data unnecessarily can degrade performance because the planner may no longer match the expression to an existing index. It is important to use the operator that already returns the data in the desired format.
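
Here is a minimal sketch of the pitfall, reusing the sample users table (the extra index is illustrative): the expression in the predicate has to match the indexed expression exactly, otherwise the planner ignores the index.

-- A BTREE index on the text expression:
CREATE INDEX idx_users_data_age ON users ((data->>'age'));

-- Text-to-text comparison: matches the indexed expression.
SELECT * FROM users WHERE data->>'age' = '30';

-- The cast changes the expression: the text index above no longer
-- applies, so the planner falls back to a sequential scan.
SELECT * FROM users WHERE (data->>'age')::int = 30;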

Leverage JSON Path Queries

PostgreSQL 12 introduced JSON path queries, allowing for more advanced querying capabilities. This was a huge step forward for navigating complex JSON structures: while XML was generally the choice for highly structured data in the past, this feature made the case for JSONB once more.
In this example, I want to display all orders that contain at least one item priced above 100:

SELECT data
FROM orders
WHERE data @? '$.items[*] ? (@.price > 100)';

Explanation

  • data: This is the JSONB column storing the order details.
  • @?: This operator is used to evaluate a JSON path predicate and returns true or false based on whether the condition is met.
  • '$.items[*] ? (@.price > 100)':
    • $: Represents the root of the JSON document.
    • items[*]: Selects all elements in the items array.
    • ?: The filter expression that applies a condition.
    • @.price > 100: The condition being checked — in this case, whether the price of any item is greater than 100.
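
If you want the matching elements themselves rather than the whole document, jsonb_path_query unnests whatever the same path expression selects (a short sketch against the same hypothetical orders table):

-- One row per item priced above 100, instead of the full order:
SELECT jsonb_path_query(data, '$.items[*] ? (@.price > 100)') AS expensive_item
FROM orders;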

The equivalent of the original filter against a conventional relational model would look like this:

SELECT DISTINCT o.order_id
FROM orders o
JOIN items i ON o.order_id = i.order_id
WHERE i.price > 100;

Conclusion

By using materialized views or plain views to pre-extract JSON fields into regular columns, we can dramatically improve query performance. This approach reduces the computational burden during query execution and allows the database optimizer to make effective use of indexes. Oddly enough, the old tricks still work :).

JSON support in PostgreSQL 17 allows developers and DBAs to design flexible, efficient, and scalable applications. By understanding when and how to use JSON effectively, you can leverage PostgreSQL's full potential, combining the reliability of a relational database with the flexibility of a dynamic schema. In a world where RAG search through vector or graph databases is the new trend, knowing how JSONB and PostgreSQL work together can bring you closer to the finish line.