At the DOAG conference I attended the session “Top-level DB design for Big Data in ATLAS Experiment at CERN” given by Gancho Dimitrov. The presentation was actually very interesting. As part of Gancho’s improvement activities to reduce the space consumed by a table, he stored GUIDs in a 16-byte RAW format (instead of a string representing the hex values, which requires 36 bytes) and used virtual columns to derive the original hex string.

So the original value is e.g. 21EC2020-3AEA-4069-A2DD-08002B30309D, which is reduced to 16 bytes by removing the ‘-‘ characters and converting the resulting hex string to RAW:

HEXTORAW(REPLACE('21EC2020-3AEA-4069-A2DD-08002B30309D', '-', ''))
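
To see the difference between the two representations, one can compare their internal sizes directly, e.g. with VSIZE (this is just a quick check of my own; the 36 bytes assume a single-byte character set):

-- expected: 36 bytes for the formatted string vs. 16 bytes for the RAW value
select vsize('21EC2020-3AEA-4069-A2DD-08002B30309D') as char_bytes,
       vsize(hextoraw(replace('21EC2020-3AEA-4069-A2DD-08002B30309D', '-', ''))) as raw_bytes
from dual;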

The problem was that the long virtual columns are included in the average row length statistic in Oracle. Here is a simple test case:


create table cern_test
(
  GUID0 RAW(16)
 ,GUID1 RAW(16)
 ,GUID2 RAW(16)
 ,GUID0_CHAR as (SUBSTR(RAWTOHEX(GUID0),1,8)||'-'||
                 SUBSTR(RAWTOHEX(GUID0),9,4)||'-'||
                 SUBSTR(RAWTOHEX(GUID0),13,4)||'-'||
                 SUBSTR(RAWTOHEX(GUID0),17,4)||'-'||
                 SUBSTR(RAWTOHEX(GUID0),21,12))
 ,GUID1_CHAR as (SUBSTR(RAWTOHEX(GUID1),1,8)||'-'||
                 SUBSTR(RAWTOHEX(GUID1),9,4)||'-'||
                 SUBSTR(RAWTOHEX(GUID1),13,4)||'-'||
                 SUBSTR(RAWTOHEX(GUID1),17,4)||'-'||
                 SUBSTR(RAWTOHEX(GUID1),21,12))
 ,GUID2_CHAR as (SUBSTR(RAWTOHEX(GUID2),1,8)||'-'||
                 SUBSTR(RAWTOHEX(GUID2),9,4)||'-'||
                 SUBSTR(RAWTOHEX(GUID2),13,4)||'-'||
                 SUBSTR(RAWTOHEX(GUID2),17,4)||'-'||
                 SUBSTR(RAWTOHEX(GUID2),21,12))
);
 
insert into cern_test (guid0,guid1,guid2)
select HEXTORAW('21EC20203AEA4069A2DD08002B30309D'),
HEXTORAW('31DC20203AEA4069A2DD08002B30309D'),
HEXTORAW('41CC20203AEA4069A2DD08002B30309D')
from xmltable('1 to 10000');
commit;
 
exec dbms_stats.gather_table_stats(user,'CERN_TEST',estimate_percent=>100,method_opt=>'FOR ALL COLUMNS SIZE 1');
 
select avg_row_len from tabs where table_name='CERN_TEST';
 
AVG_ROW_LEN
-----------
162
 
select sum(avg_col_len) from user_tab_columns
where table_name='CERN_TEST' and column_name in ('GUID0','GUID1','GUID2');
 
SUM(AVG_COL_LEN)
----------------
51
 
select sum(avg_col_len) from user_tab_columns
where table_name='CERN_TEST' and column_name in ('GUID0_CHAR','GUID1_CHAR','GUID2_CHAR');
 
SUM(AVG_COL_LEN)
----------------
111
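
As a quick cross-check of my own, USER_TAB_COLS exposes the VIRTUAL_COLUMN flag, so both groups can be summed in one go; for this test case the 'NO' group should come out at 51 and the 'YES' group at 111, which together give the 162 reported as avg_row_len:

-- avg_col_len summed separately for real and virtual columns
select virtual_column, sum(avg_col_len) sum_avg_col_len
from user_tab_cols
where table_name='CERN_TEST'
group by virtual_column;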

The question is whether Oracle’s computation of the average row length is correct, i.e. should the physically non-existent virtual columns be considered at all?
Physically they take no space, so physically the average row length in the example above is 51 bytes, but logically it is 162 bytes. Which is correct?

To answer that question we have to check what the average row length is actually used for. That information is not documented, but my assumption is that it is only used to calculate the Bytes required when doing a “select *”, i.e. when accessing all columns. That number, however, may become important later on, e.g. when calculating the memory required for a hash join.
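
One way to check that assumption (my own addition, not part of the original test) is to explain a plain “select *” on the test table and compare the Bytes figure of the full table scan with num_rows * avg_row_len, i.e. roughly 10000 * 162 bytes ≈ 1582K here:

explain plan for select * from cern_test;
select * from table(dbms_xplan.display(format=>'BASIC +ROWS +BYTES'));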

Anyway, the basic question is how Oracle treats virtual columns in execution plans: does it compute the value of a virtual column when the table is accessed, or only when it actually needs it (e.g. when fetching the row or when using it as a join column)? Judging by the “Bytes” figures in the execution plan, the virtual column is accounted for when the table is accessed:


SQL> explain plan for
2 select a.*, b.guid0 b_guid0 from cern_test a, cern_test b
3 where a.guid0_char=b.guid0_char;
 
Explained.
 
SQL> select * from table(dbms_xplan.display(format=>'+PROJECTION'));
 
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------
Plan hash value: 3506643611
 
--------------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |   100M|    20G|   267  (83)| 00:00:01 |
|*  1 |  HASH JOIN         |           |   100M|    20G|   267  (83)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| CERN_TEST | 10000 |   527K|    23   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| CERN_TEST | 10000 |  1582K|    23   (0)| 00:00:01 |
--------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
1 - access("A"."GUID0_CHAR"="B"."GUID0_CHAR")
 
Column Projection Information (identified by operation id):
-----------------------------------------------------------
 
1 - (#keys=1) "A"."GUID0_CHAR"[VARCHAR2,132], "GUID0"[RAW,16],
"GUID0"[RAW,16], "GUID1"[RAW,16], "GUID2"[RAW,16]
2 - "GUID0"[RAW,16]
3 - "GUID0"[RAW,16], "GUID1"[RAW,16], "GUID2"[RAW,16]
 
23 rows selected.
 
SQL> select column_name, avg_col_len from user_tab_columns
2 where table_name='CERN_TEST' and column_name in ('GUID0','GUID0_CHAR');
 
COLUMN_NAME                         AVG_COL_LEN
----------------------------------- -----------
GUID0                                        17
GUID0_CHAR                                   37
 
SQL> select (10000*17)/1024 from dual;
 
(10000*17)/1024
---------------
166.015625
 
SQL> select ((10000*17)+(10000*37))/1024 from dual;
 
((10000*17)+(10000*37))/1024
----------------------------
527.34375

So according to the projection, step 2 of the plan returns only B.GUID0, but the Bytes value of 527K accounts for both GUID0 and the virtual column GUID0_CHAR. So the Bytes calculation is done for the table access and not for the step where the virtual column is actually needed (the hash join). (For step 3 the 1582K simply corresponds to 10000 rows times the full avg_row_len of 162 bytes, since “a.*” includes the virtual columns anyway.)

In that regard the calculation of the avg_row_len by dbms_stats, with the virtual columns included, is correct.
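
As a counter-check (again only a sketch on top of this test case), a query that does not reference any of the virtual columns should only be charged for the real column it selects, i.e. roughly 10000 * 17 bytes ≈ 166K for GUID0:

explain plan for select guid0 from cern_test;
select * from table(dbms_xplan.display(format=>'BASIC +ROWS +BYTES'));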

The only issue I see is with old scripts written long ago that try to compute the amount of data in a table based on its avg_row_len statistic, using something like


SELECT table_name, num_rows * avg_row_len actual_size_of_data
FROM user_tables order by 2;

If there are virtual columns in the table, such a select may return values that are much too high for “actual_size_of_data”.
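
If that matters, one possible variant (just a sketch, assuming current statistics) is to sum the avg_col_len of the non-virtual columns from USER_TAB_COLS instead of relying on avg_row_len:

-- approximate data volume per table based on the physical columns only
select t.table_name,
       t.num_rows * sum(c.avg_col_len) as approx_size_of_data
from user_tables t
join user_tab_cols c on c.table_name = t.table_name
where c.virtual_column = 'NO'
group by t.table_name, t.num_rows
order by 2;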

REMARK: Using the old ANALYZE command to gather statistics results in an avg_row_len that considers only the “real” columns. However, ANALYZE should of course no longer be used to gather optimizer statistics.
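
For illustration only (again, do not use ANALYZE for optimizer statistics in real life), this can be reproduced on the test table; the resulting avg_row_len should end up close to the 51 bytes of the physical columns plus a few bytes of row overhead, rather than 162:

analyze table cern_test compute statistics;
select avg_row_len from tabs where table_name='CERN_TEST';
-- afterwards, re-gather with dbms_stats to restore proper optimizer statistics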