By Franck Pachot

.
Do you think that it’s better to write semi-join SQL statements with IN(), EXISTS(), or to do a JOIN? Usually, the optimizer will evaluate the cost and do the transformation for you. And in this area, one more transformation has been introduced in 12c which is the Partial Join Evaluation (PJE).

First, let’s have a look at the 11g behaviour. For that example, I use the SCOTT schema, but I hire a lot more employees in departement 40:


SQL> alter table EMP modify empno number(10);
Table altered.
SQL> insert into EMP(empno,deptno) select rownum+10000,40 from EMP,(select * from dual connect by level

Why department 40? I’ll explain it below, but I let you think about it before. In the default SCOTT schema, there is a department 40 in DEPT table, but which has no employees in EMP. And the new transformation is not useful in that case.

11g behaviour

Now, I’m running the following query to check all the departments that have at least one employee:
I can write it with IN:


PLAN_TABLE_OUTPUT
 ------------------------------------------------------------------------
 SQL_ID 6y71msam9w32r, child number 0
 -------------------------------------
 select distinct deptno,dname from dept
 where deptno in ( select deptno from emp)
 
Plan hash value: 1754319153
 
------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows | Buffers |
------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |      4 |      15 |
|*  1 |  HASH JOIN SEMI    |      |      1 |      4 |      4 |      15 |
|   2 |   TABLE ACCESS FULL| DEPT |      1 |      4 |      4 |       7 |
|   3 |   TABLE ACCESS FULL| EMP  |      1 |  15068 |    388 |       8 |
------------------------------------------------------------------------
Predicate Information (identified by operation id):
 ---------------------------------------------------
 
   1 - access("DEPTNO"="DEPTNO")

or with EXISTS:


PLAN_TABLE_OUTPUT
------------------------------------------------------------------------
SQL_ID  cbpa3zjtzfzrn, child number 0
-------------------------------------
select distinct deptno,dname from dept 
 where exists ( select 1 from emp where emp.deptno=dept.deptno)
 
Plan hash value: 1754319153
 
------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows | Buffers |
------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |      4 |      15 |
|*  1 |  HASH JOIN SEMI    |      |      1 |      4 |      4 |      15 |
|   2 |   TABLE ACCESS FULL| DEPT |      1 |      4 |      4 |       7 |
|   3 |   TABLE ACCESS FULL| EMP  |      1 |  15068 |    388 |       8 |
------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - access("DEPTNO"="DEPTNO")

Both are good. We didn’t have to read the whole EMP table. I have 15000 rows in my table, I do a full scan on it, but look at the A-Rows: only 388 rows were actually read.

The HASH JOIN first read the DEPT table in order to build the hash table. So it already knows that we cannot have more than 4 distinct departments.

Then we do the join to EMP just to check which of those departments have an employee. But we can stop as soon as we find the 4 departments. This is the reason why we have read only 388 rows here. And this is exactly what a Semi Join is: we don’t need all the matching rows, we return at most one row per matching pair.

Ok. What if we write the join ourselves?


PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------
SQL_ID  2xjj9jybqja87, child number 1
-------------------------------------
select distinct deptno,dname from dept join emp using(deptno)
 
Plan hash value: 2962452962
 
-------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows | Buffers |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      4 |     129 |
|   1 |  HASH UNIQUE        |      |      1 |  15068 |      4 |     129 |
|*  2 |   HASH JOIN         |      |      1 |  15068 |  14014 |     129 |
|   3 |    TABLE ACCESS FULL| DEPT |      1 |      4 |      4 |       7 |
|   4 |    TABLE ACCESS FULL| EMP  |      1 |  15068 |  14014 |     122 |
-------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("DEPT"."DEPTNO"="EMP"."DEPTNO")

Bad luck. We have to read all the rows. More rows and more buffers.

12c behaviour

Let’s do the same in 12.1.0.1:


PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------
SQL_ID  2xjj9jybqja87, child number 0
-------------------------------------
select distinct deptno,dname from dept join emp using(deptno)
 
Plan hash value: 1629510749
 
-------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows | Buffers |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      4 |      14 |
|   1 |  HASH UNIQUE        |      |      1 |      4 |      4 |      14 |
|*  2 |   HASH JOIN SEMI    |      |      1 |      4 |      4 |      14 |
|   3 |    TABLE ACCESS FULL| DEPT |      1 |      4 |      4 |       7 |
|   4 |    TABLE ACCESS FULL| EMP  |      1 |  15068 |    388 |       7 |
-------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("DEPT"."DEPTNO"="EMP"."DEPTNO")

Same plan but less rows have been read. If we check the outlines, we see the new feature:


      PARTIAL_JOIN(@"SEL$58A6D7F6" "EMP"@"SEL$1")

And here is what we see in the optimizer trace:


OPTIMIZER STATISTICS AND COMPUTATIONS
PJE: Checking validity of partial join eval on query block SEL$58A6D7F6 (#1)
PJE: Passed validity of partial join eval by query block SEL$58A6D7F6 (#1)
PJE: Partial join eval conversion for query block SEL$58A6D7F6 (#1).
PJE: Table marked for partial join eval: EMP[EMP]#1

The hints that control the feature are PARTIAL_JOIN and NO_PARTIAL_JOIN and the transformation is enabled by _optimizer_partial_join_eval which appeared in 12c.

But of course, the optimization is useful only when we have all the values at the beginning of the table. This is why I added at least one employee in department 40. If there are some rows in DEPT that have no matching row in EMP, then Oracle cannot know the result before reaching the end of the table.