By Franck Pachot
.
This post is the first one from a series of small examples on recent Oracle features. My goal is to present them to people outside of Oracle and relational databases usage, maybe some NoSQL players. And this is why the title is “select from a flat-file” rather than “Inline External Tables”. In my opinion, the names of the features of Oracle Database are invented by the architects and developers, sometimes renamed by Marketing or CTO, and all that is very far from what the users are looking for. In order to understand “Inline External Table” you need to know all the history behind: there were tables, then external tables, and there were queries, and inlined queries, and… But imagine a junior who just wants to query a file, he will never find this feature. He has a file, it is not a table, it is not external, and it is not inline. What is external to him is this SQL language and what we want to show him is that this language can query his file.
I’m running this in the Oracle 20c preview in the Oracle Cloud.
In this post, my goal is to load a small fact and dimension table for the next posts about some recent features that are interesting in data warehouses. It is the occasion to show that with Oracle we can easily select from a .csv file, without the need to run SQL*Loader or create an external table.
I’m running everything from SQLcl and then I use the host command to call curl:
host curl -L http://opendata.ecdc.europa.eu/covid19/casedistribution/csv/ | dos2unix | (sed -u 1q; sort -t, -n -k4,4 -k3,3 -k2,2 ) > /tmp/covid-19.csv
This gets the latest number of COVID-19 cases per day and per country.
It looks like this:
SQL> host head /tmp/covid-19.csv
dateRep,day,month,year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2018,continentExp
31/12/2019,31,12,2019,0,0,Afghanistan,AF,AFG,37172386,Asia
31/12/2019,31,12,2019,0,0,Algeria,DZ,DZA,42228429,Africa
31/12/2019,31,12,2019,0,0,Armenia,AM,ARM,2951776,Europe
31/12/2019,31,12,2019,0,0,Australia,AU,AUS,24992369,Oceania
31/12/2019,31,12,2019,0,0,Austria,AT,AUT,8847037,Europe
31/12/2019,31,12,2019,0,0,Azerbaijan,AZ,AZE,9942334,Europe
31/12/2019,31,12,2019,0,0,Bahrain,BH,BHR,1569439,Asia
31/12/2019,31,12,2019,0,0,Belarus,BY,BLR,9485386,Europe
31/12/2019,31,12,2019,0,0,Belgium,BE,BEL,11422068,Europe
I sorted them on date on purpose (next post may talk about data clustering) but this is the way the file comes anyway.
I need a directory object to access the file:
SQL> create or replace directory "/tmp" as '/tmp';
Directory created.
You don’t have to use quoted identifiers if you don’t like it. I find it convenient here.
I can directly select from the file, the EXTERNAL clause mentioning what we had to put in an external table before 18c:
SQL>
select *
from external (
(
dateRep date
,n_day number
,n_month number
,n_year number
,cases number
,deaths number
,countriesAndTerritories varchar2(50)
,geoId varchar2(10)
,countryterritoryCode varchar2(3)
,popData2018 number
,continentExp varchar2(10)
)
default directory "/tmp"
access parameters (
records delimited by newline skip 1 -- skip header
logfile 'covid-19.log'
badfile 'covid-19.bad'
fields terminated by "," optionally enclosed by '"'
(dateRep date 'dd/mm/yyyy',n_day,n_month,n_year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2018,continentExp))
location ('covid.csv')
reject limit 10000 - because I've seen some bad data ;)
)
.
SQL> /
DATEREP N_DAY N_MONTH N_YEAR CASES DEATHS COUNTRIESANDTERRITORIES GEOID COUNTRYTERRITORYCODE POPDATA2018 CONTINENTEXP
_____________ ________ __________ _________ ________ _________ _____________________________________________ ___________ _______________________ ______________ _______________
31/12/2019 31 12 2019 0 0 Afghanistan AF AFG 37172386 Asia
31/12/2019 31 12 2019 0 0 Algeria DZ DZA 42228429 Africa
31/12/2019 31 12 2019 0 0 Armenia AM ARM 2951776 Europe
31/12/2019 31 12 2019 0 0 Australia AU AUS 24992369 Oceania
31/12/2019 31 12 2019 0 0 Austria AT AUT 8847037 Europe
31/12/2019 31 12 2019 0 0 Azerbaijan AZ AZE 9942334 Europe
31/12/2019 31 12 2019 0 0 Bahrain BH BHR 1569439 Asia
ORA-01013: user requested cancel of current operation
SQL>
I cancelled it as that’s too long to display here.
As the query is still in the buffer, I just add a CREATE TABLE in front of it:
SQL> 1
1* select *
SQL> c/select/create table covid as select/
create table covid as select *
2 from external (
3 (
4 dateRep varchar2(10)
5 ,day number
...
SQL> /
Table created.
SQL>
While I’m there I’ll quickly create a fact table and a dimension hierarchy:
SQL>
create table continents as select rownum continent_id, continentexp continent_name from (select distinct continentexp from covid where continentexp!='Other');
Table created.
SQL> create table countries as select country_id,country_code,country_name,continent_id,popdata2018 from (select distinct geoid country_id,countryterritorycode country_code,countriesandterritories country_name,continentexp continent_name,popdata2018 from covid where continentexp!='Other') left join continents using(continent_name);
Table created.
SQL>
create table cases as select daterep, geoid country_id,cases from covid where continentexp!='Other';
Table created.
SQL>
alter table continents add primary key (continent_id);
Table altered.
SQL>
alter table countries add foreign key (continent_id) references continents;
Table altered.
SQL>
alter table countries add primary key (country_id);
Table altered.
SQL>
alter table cases add foreign key (country_id) references countries;
Table altered.
SQL>
alter table cases add primary key (country_id,daterep);
Table altered.
SQL>
This creates a CASES fact table with only one measure (covid-19 cases) and two dimensions. To get it simple, the date dimension here is just a date column (you usually have a foreign key to a calendar dimension). The geographical dimension is a foreign key to the COUNTRIES table which itself has a foreign key referencing the CONTINENTS table.
12c Top-N queries
In 12c we have a nice syntax for Top-N queries with the FETCH FIRST clause of the ORDER BY:
SQL>
select continent_name,country_code,max(cases) from cases join countries using(country_id) join continents using(continent_id) group by continent_name,country_code order by max(cases) desc fetch first 10 rows only;
CONTINENT_NAME COUNTRY_CODE MAX(CASES)
_________________ _______________ _____________
America USA 48529
America BRA 33274
Asia CHN 15141
Europe RUS 11656
America ECU 11536
Asia IND 9851
Europe ESP 9181
America PER 8875
Europe GBR 8719
Europe FRA 7578
10 rows selected.
This returns the 10 countries which had the maximum covid-19 cases per day.
20c WINDOW clauses
If I want to show the date with the maximum value, I can use analytic functions and in 20c I don’t have to repeat the window several times:
SQL>
select continent_name,country_code,top_date,top_cases from (
select continent_name,country_code,daterep,cases
,first_value(daterep)over(w) top_date
,first_value(cases)over(w) top_cases
,row_number()over(w) r
from cases join countries using(country_id) join continents using(continent_id)
window w as (partition by continent_id order by cases desc)
)
where r=1 -- this to get the rows with the highest value only
order by top_cases desc fetch first 10 rows only;
CONTINENT_NAME COUNTRY_CODE TOP_DATE TOP_CASES
_________________ _______________ _____________ ____________
America USA 26/04/2020 48529
Asia CHN 13/02/2020 15141
Europe RUS 12/05/2020 11656
Africa ZAF 05/06/2020 3267
Oceania AUS 23/03/2020 611
The same can be done before 20c but you have to write the (partition by continent_id order by cases desc) for each projection.
In the next post I’ll show a very nice feature. Keeping the 3 tables normalized data model but, because storage is cheap, materializing some pre-computed joins. If you are a fan of NoSQL because “storage is cheap” and “joins are expensive”, then you will see what we can do with SQL in this area…