By Franck Pachot
.
I know that lot of people are against the ANSI join syntax in Oracle. And this goes beyond the limits when talking about NATURAL JOIN. But I like them and use them quite often.
Why is Natural Join bad?
Natural join is bad because it relies on column names, and, at the time of writing the query, you don’t know which columns will be added or removed later. Here is an example on the SCOTT schema, joining on DEPTNO which has the same name in DEPT and EMP:
SQL> select * from EMP natural join DEPT where DNAME='SALES';
DEPTNO EMPNO ENAME JOB MGR HIREDATE SAL COMM DNAME LOC
---------- ---------- ---------- --------- ---------- --------- ---------- ---------- -------------- -------------
30 7521 WARD SALESMAN 7698 22-FEB-81 1250 500 SALES CHICAGO
30 7844 TURNER SALESMAN 7698 08-SEP-81 1500 0 SALES CHICAGO
30 7499 ALLEN SALESMAN 7698 20-FEB-81 1600 300 SALES CHICAGO
30 7900 JAMES CLERK 7698 03-DEC-81 950 SALES CHICAGO
30 7698 BLAKE MANAGER 7839 01-MAY-81 2850 SALES CHICAGO
30 7654 MARTIN SALESMAN 7698 28-SEP-81 1250 1400 SALES CHICAGO
The DEPT table has a ‘LOC’column for the location of the department. But the data model may evolve and you may add a location for each employee. And we may also call it LOC:
SQL> alter table EMP add (LOC varchar2(10));
Table altered.
But now our Natural Join adds this column to the join predicate and the result is wrong because it shows only rows which have same department location as employee location:
SQL> select * from EMP natural join DEPT where DNAME='SALES';
no rows selected
Projection
In my opinion, the problem is not the Natural Join. Column names have a meaning for their tables. But the tables have different roles in our queries. As soon as a table or view participates to our query, we should redefine the column names. If we don’t, the result is completely wrong as:
SQL> select * from EMP join DEPT using(DEPTNO) where DNAME='SALES';
DEPTNO EMPNO ENAME JOB MGR HIREDATE SAL COMM LOC DNAME LOC
---------- ---------- ---------- --------- ---------- --------- ---------- ---------- ---------- -------------- -------------
30 7521 WARD SALESMAN 7698 22-FEB-81 1250 500 SALES CHICAGO
30 7844 TURNER SALESMAN 7698 08-SEP-81 1500 0 SALES CHICAGO
30 7499 ALLEN SALESMAN 7698 20-FEB-81 1600 300 SALES CHICAGO
30 7900 JAMES CLERK 7698 03-DEC-81 950 SALES CHICAGO
30 7698 BLAKE MANAGER 7839 01-MAY-81 2850 SALES CHICAGO
30 7654 MARTIN SALESMAN 7698 28-SEP-81 1250 1400 SALES CHICAGO
Look: the result has two columns with the same name. This is completely wrong for a relational database and I don’t even understand why this parses without raising an error.
The projection is the most important relational operation, often overlooked as if it was just a rename for aesthetic purpose. You need to name the columns of your result set. They are the metadata for the interface between SQL and the host language. ‘select *’ is a shortcut when running an interactive query, to get a glance at the result rows. But a SQL query result is not complete without proper column names. And in most cases, at least when you query more than one table, the name of the query result columns should be different than the name of the underlying table columns. A department may have a location. And an employee may have a location. But the location of the employee department is something completely different than the employee location.
Then, as you need to name each column anyway, why not doing it as soon as possible? Do it for each table involved in the query, so that you are sure that all column names are correct within the query. As soon as you introduce an new table in the FROM clause, you should actually name the columns according to their role in the query. Let’s take an example with an airline data model. Each airport is linked to a city. This can be a CITY column in the AIRPORTS table. But as soon as you join FLIGHTS with AIRPORTS, this table has a different role. You join on destination airport or source airport. Then you alias the AIRPORTS table in the FROM clause, such as DST_AIRPORTS or SRC_AIRPORTS. Within the query, you can reference the columns with the table alias, such as DST_AIRPORTS.CITY or SRC_AIRPORTS.CITY but this cannot be exposed as-is in the query result. You must name them in the SELECT clause with something like SELECT DST_AIRPORTS.CITY as DST_ARP_CITY , SRC_AIRPORTS.CITY as SRC_ARP_CITY.
Then, as I’ll need to rename them anyway, I prefer to do it as soon as I join to a new table in the FROM clause. Instead of joining to AIRPORTS DST_AIRPORTS I can join to (SELECT IATA DST_ARP_IATA, CITY DST_ARP_CITY FROM AIRPORTS) and all column names will relate to the role without table aliases and without further renaming. And when I do that correctly, I can use natural join without risk.
Projection in the FROM clause
Let’s take an example. Here is a query in DEPT where I explicitly mention that LOC is the department location. This is implicit when the column name belongs to the DEPT table. But it will not be implicit anymore once I join this table to another table. Here is the view ready to be included in any query:
SQL> select DEPTNO,DNAME DEPT_DNAME,LOC DEPT_LOC from DEPT where DNAME='SALES';
DEPTNO DEPT_DNAME DEPT_LOC
---------- -------------- -------------
30 SALES CHICAGO
Now, I can join this to the EMP table. I prefix all columns from EMP with “EMP_” and all columns from DEPT with “EMP_DEPT_” because they belong to DEPT when in the role of employee department:
SQL> select EMP_EMPNO,EMP_ENAME,EMP_DEPT_DNAME,EMP_DEPT_LOC,EMP_LOC,EMP_MGR_EMPNO
from
(select DEPTNO EMP_DEPTNO,EMPNO EMP_EMPNO,ENAME EMP_ENAME,MGR EMP_MGR_EMPNO,LOC EMP_LOC from EMP)
natural join
(select DEPTNO EMP_DEPTNO,DNAME EMP_DEPT_DNAME,LOC EMP_DEPT_LOC from DEPT)
where EMP_DEPT_DNAME='SALES';
EMP_EMPNO EMP_ENAME EMP_DEPT_DNAME EMP_DEPT_LOC EMP_LOC EMP_MGR_EMPNO
---------- ---------- -------------- ------------- ---------- -------------
7521 WARD SALES CHICAGO 7698
7844 TURNER SALES CHICAGO 7698
7499 ALLEN SALES CHICAGO 7698
7900 JAMES SALES CHICAGO 7698
7698 BLAKE SALES CHICAGO 7839
7654 MARTIN SALES CHICAGO 7698
As you can see, when the names are clearly indicating the column with its role in the join, and how they are correlated with the other tables, there is no need to mention any join predicate. I used Natural Join because the join is on EMP_DEPTNO and I’m sure that it will always be the one and only one column with the same name. By query design.
And the column names in the result are correct, explicitly mentioning what is an Employee attribute or an Employee department attribute. That can be easy to parse and put in an object graph in the host language. You can see there that the MGR column of EMP was named EMP_MGR_EMPNO because this is actually what it is: the EMPNO of the employee manager. It is a foreign key to the EMP table.
And then, adding more information about the manager is easy: join with EMP again but with the proper projection of columns: EMPNO will be EMP_MGR_EMPNO when in the role of the employee manager, ENAME will be EMP_MGR_ENAME, DEPTNO will be EMP_MGR_DEPTNO, and so on:
SQL> select EMP_EMPNO,EMP_ENAME,EMP_DEPT_DNAME,EMP_DEPT_LOC,EMP_LOC,EMP_MGR_DEPTNO,EMP_MGR_ENAME
from
(select DEPTNO EMP_DEPTNO,EMPNO EMP_EMPNO,ENAME EMP_ENAME,MGR EMP_MGR_EMPNO,LOC EMP_LOC from EMP)
natural join
(select DEPTNO EMP_DEPTNO,DNAME EMP_DEPT_DNAME,LOC EMP_DEPT_LOC from DEPT)
natural join
(select DEPTNO EMP_MGR_DEPTNO,EMPNO EMP_MGR_EMPNO,ENAME EMP_MGR_ENAME from EMP)
where EMP_DEPT_DNAME='SALES';
EMP_EMPNO EMP_ENAME EMP_DEPT_DNAME EMP_DEPT_LOC EMP_LOC EMP_MGR_DEPTNO EMP_MGR_ENAME
---------- ---------- -------------- ------------- ---------- -------------- -------------
7900 JAMES SALES CHICAGO 30 BLAKE
7499 ALLEN SALES CHICAGO 30 BLAKE
7654 MARTIN SALES CHICAGO 30 BLAKE
7844 TURNER SALES CHICAGO 30 BLAKE
7521 WARD SALES CHICAGO 30 BLAKE
7698 BLAKE SALES CHICAGO 10 KING
No need to review the whole query when adding a new table. No need to solve the new ‘column ambiguously defined’. We don’t even need to alias the tables here.
Want to add the department name of the manager? That’s easy: join to DEPT with the right column projection (all prefixed by EMP_MGR_DEPT as the new columns are all about the employee manager’s department):
SQL> select EMP_EMPNO,EMP_ENAME,EMP_DEPT_DNAME,EMP_DEPT_LOC,EMP_LOC,EMP_MGR_DEPTNO,EMP_MGR_ENAME,EMP_MGR_DEPT_DNAME
from
(select DEPTNO EMP_DEPTNO,EMPNO EMP_EMPNO,ENAME EMP_ENAME,MGR EMP_MGR_EMPNO,LOC EMP_LOC from EMP)
natural join
(select DEPTNO EMP_DEPTNO,DNAME EMP_DEPT_DNAME,LOC EMP_DEPT_LOC from DEPT)
natural join
(select DEPTNO EMP_MGR_DEPTNO,EMPNO EMP_MGR_EMPNO,ENAME EMP_MGR_ENAME from EMP)
natural join
(select DEPTNO EMP_MGR_DEPTNO,DNAME EMP_MGR_DEPT_DNAME,LOC EMP_MGR_DEPT_LOC from DEPT)
where EMP_DEPT_DNAME='SALES';
EMP_EMPNO EMP_ENAME EMP_DEPT_DNAME EMP_DEPT_LOC EMP_LOC EMP_MGR_DEPTNO EMP_MGR_EN EMP_MGR_DEPT_D
---------- ---------- -------------- ------------- ---------- -------------- ---------- --------------
7698 BLAKE SALES CHICAGO 10 KING ACCOUNTING
7900 JAMES SALES CHICAGO 30 BLAKE SALES
7499 ALLEN SALES CHICAGO 30 BLAKE SALES
7654 MARTIN SALES CHICAGO 30 BLAKE SALES
7844 TURNER SALES CHICAGO 30 BLAKE SALES
7521 WARD SALES CHICAGO 30 BLAKE SALES
This can be even easier when you generate SQL queries. When adding a new table to join to, you just prefix all columns with their role. Check foreign keys so that the naming is consistent with the referenced tables. Then when parsing the result, the naming convention can help to break on the object hierarchy.
Additional notes
I mentioned that aliasing the subquery is not mandatory because I do not have to prefix the column names. However, when looking at the predicates section of the execution plan, the columns may be prefixed with an internal alias:
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("from$_subquery$_006"."EMP_MGR_DEPTNO"="from$_subquery$_009"."EMP_MGR_DEPTNO")
2 - access("from$_subquery$_001"."EMP_MGR_EMPNO"="from$_subquery$_006"."EMP_MGR_EMPNO" AND "from$_subquery$_001"."EMP_DEPTNO"="from$_subquery$_003"."EMP_DEPTNO")
Then it is a good idea to add prefixes, such as EMP, EMP_DEPT, EMP_MGR EMP_MGR_DEPTNO in the query above so that the predicates become:
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("EMP_MGR"."EMP_MGR_DEPTNO"="EMP_MGR_DEPT"."EMP_MGR_DEPTNO")
2 - access("EMP"."EMP_MGR_EMPNO"="EMP_MGR"."EMP_MGR_EMPNO" AND "EMP"."EMP_DEPTNO"="EMP_DEPT"."EMP_DEPTNO")
5 - filter("DNAME"='SALES')
I also like to add a QB_NAME hint so that I can reference easily those subqueries if I have to add some hints there. Finally, this is what I can generate for this query:
SQL> select EMP_EMPNO,EMP_ENAME,EMP_DEPT_DNAME,EMP_DEPT_LOC,EMP_LOC,EMP_MGR_DEPTNO,EMP_MGR_ENAME,EMP_MGR_DEPT_DNAME
from
(select /*+qb_name(EMP)*/ DEPTNO EMP_DEPTNO,EMPNO EMP_EMPNO,ENAME EMP_ENAME,MGR EMP_MGR_EMPNO,LOC EMP_LOC from EMP) EMP
natural join
(select /*+qb_name(EMP_DEPT)*/ DEPTNO EMP_DEPTNO,DNAME EMP_DEPT_DNAME,LOC EMP_DEPT_LOC from DEPT) EMP_DEPT
natural join
(select /*+qb_name(EMP_MGR)*/ DEPTNO EMP_MGR_DEPTNO,EMPNO EMP_MGR_EMPNO,ENAME EMP_MGR_ENAME from EMP) EMP_MGR
natural join
(select /*+qb_name(EMP_MGR_DEPT)*/ DEPTNO EMP_MGR_DEPTNO,DNAME EMP_MGR_DEPT_DNAME,LOC EMP_MGR_DEPT_LOC from DEPT) EMP_MGR_DEPT
where EMP_DEPT_DNAME='SALES';
So what?
My goal here is not to recommend to always use natural joins. This depends on the context (ad-hoc queries, embedded ones in existing code with naming standards,…) and whether con control exactly the columns names. There are also a few bugs with ANSI joins, and natural join is not widely used, so maybe not tested a lot. But when I hear that Natural Join is bad, I want to explain the why/how/when. And one of the good sides of it is that it forces us to do the projection/rename as soon as possible and this makes the query easier to read/maintain/evolve. Of course, using natural join in that way requires that all tables are added to the FROM clause through a subquery which carefully names all columns in the SELECT clause so that the correlation with the other tables is clearly defined.