By Franck Pachot
This post is the first one from a series of small examples on recent Oracle features. My goal is to present them to people outside of Oracle and relational databases usage, maybe some NoSQL players. And this is why the title is “select from a flat-file” rather than “Inline External Tables”. In my opinion, the names of the features of Oracle Database are invented by the architects and developers, sometimes renamed by Marketing or CTO, and all that is very far from what the users are looking for. In order to understand “Inline External Table” you need to know all the history behind: there were tables, then external tables, and there were queries, and inlined queries, and… But imagine a junior who just wants to query a file, he will never find this feature. He has a file, it is not a table, it is not external, and it is not inline. What is external to him is this SQL language and what we want to show him is that this language can query his file.
I’m running this in the Oracle 20c preview in the Oracle Cloud.
In this post, my goal is to load a small fact and dimension table for the next posts about some recent features that are interesting in data warehouses. It is the occasion to show that with Oracle we can easily select from a .csv file, without the need to run SQL*Loader or create an external table.
I’m running everything from SQLcl and then I use the host command to call curl:
host curl -L http://opendata.ecdc.europa.eu/covid19/casedistribution/csv/ | dos2unix | (sed -u 1q; sort -t, -n -k4,4 -k3,3 -k2,2 ) > /tmp/covid-19.csv
This gets the latest number of COVID-19 cases per day and per country.
It looks like this:
SQL> host head /tmp/covid-19.csv dateRep,day,month,year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2018,continentExp 31/12/2019,31,12,2019,0,0,Afghanistan,AF,AFG,37172386,Asia 31/12/2019,31,12,2019,0,0,Algeria,DZ,DZA,42228429,Africa 31/12/2019,31,12,2019,0,0,Armenia,AM,ARM,2951776,Europe 31/12/2019,31,12,2019,0,0,Australia,AU,AUS,24992369,Oceania 31/12/2019,31,12,2019,0,0,Austria,AT,AUT,8847037,Europe 31/12/2019,31,12,2019,0,0,Azerbaijan,AZ,AZE,9942334,Europe 31/12/2019,31,12,2019,0,0,Bahrain,BH,BHR,1569439,Asia 31/12/2019,31,12,2019,0,0,Belarus,BY,BLR,9485386,Europe 31/12/2019,31,12,2019,0,0,Belgium,BE,BEL,11422068,Europe
I sorted them on date on purpose (next post may talk about data clustering) but this is the way the file comes anyway.
I need a directory object to access the file:
SQL> create or replace directory "/tmp" as '/tmp'; Directory created.
You don’t have to use quoted identifiers if you don’t like it. I find it convenient here.
I can directly select from the file, the EXTERNAL clause mentioning what we had to put in an external table before 18c:
SQL> select * from external ( ( dateRep date ,n_day number ,n_month number ,n_year number ,cases number ,deaths number ,countriesAndTerritories varchar2(50) ,geoId varchar2(10) ,countryterritoryCode varchar2(3) ,popData2018 number ,continentExp varchar2(10) ) default directory "/tmp" access parameters ( records delimited by newline skip 1 -- skip header logfile 'covid-19.log' badfile 'covid-19.bad' fields terminated by "," optionally enclosed by '"' (dateRep date 'dd/mm/yyyy',n_day,n_month,n_year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2018,continentExp)) location ('covid.csv') reject limit 10000 - because I've seen some bad data ;) ) . SQL> / DATEREP N_DAY N_MONTH N_YEAR CASES DEATHS COUNTRIESANDTERRITORIES GEOID COUNTRYTERRITORYCODE POPDATA2018 CONTINENTEXP _____________ ________ __________ _________ ________ _________ _____________________________________________ ___________ _______________________ ______________ _______________ 31/12/2019 31 12 2019 0 0 Afghanistan AF AFG 37172386 Asia 31/12/2019 31 12 2019 0 0 Algeria DZ DZA 42228429 Africa 31/12/2019 31 12 2019 0 0 Armenia AM ARM 2951776 Europe 31/12/2019 31 12 2019 0 0 Australia AU AUS 24992369 Oceania 31/12/2019 31 12 2019 0 0 Austria AT AUT 8847037 Europe 31/12/2019 31 12 2019 0 0 Azerbaijan AZ AZE 9942334 Europe 31/12/2019 31 12 2019 0 0 Bahrain BH BHR 1569439 Asia ORA-01013: user requested cancel of current operation SQL>
I cancelled it as that’s too long to display here.
As the query is still in the buffer, I just add a CREATE TABLE in front of it:
SQL> 1 1* select * SQL> c/select/create table covid as select/ create table covid as select * 2 from external ( 3 ( 4 dateRep varchar2(10) 5 ,day number ... SQL> / Table created. SQL>
While I’m there I’ll quickly create a fact table and a dimension hierarchy:
SQL> create table continents as select rownum continent_id, continentexp continent_name from (select distinct continentexp from covid where continentexp!='Other'); Table created. SQL> create table countries as select country_id,country_code,country_name,continent_id,popdata2018 from (select distinct geoid country_id,countryterritorycode country_code,countriesandterritories country_name,continentexp continent_name,popdata2018 from covid where continentexp!='Other') left join continents using(continent_name); Table created. SQL> create table cases as select daterep, geoid country_id,cases from covid where continentexp!='Other'; Table created. SQL> alter table continents add primary key (continent_id); Table altered. SQL> alter table countries add foreign key (continent_id) references continents; Table altered. SQL> alter table countries add primary key (country_id); Table altered. SQL> alter table cases add foreign key (country_id) references countries; Table altered. SQL> alter table cases add primary key (country_id,daterep); Table altered. SQL>
This creates a CASES fact table with only one measure (covid-19 cases) and two dimensions. To get it simple, the date dimension here is just a date column (you usually have a foreign key to a calendar dimension). The geographical dimension is a foreign key to the COUNTRIES table which itself has a foreign key referencing the CONTINENTS table.
12c Top-N queries
In 12c we have a nice syntax for Top-N queries with the FETCH FIRST clause of the ORDER BY:
SQL> select continent_name,country_code,max(cases) from cases join countries using(country_id) join continents using(continent_id) group by continent_name,country_code order by max(cases) desc fetch first 10 rows only; CONTINENT_NAME COUNTRY_CODE MAX(CASES) _________________ _______________ _____________ America USA 48529 America BRA 33274 Asia CHN 15141 Europe RUS 11656 America ECU 11536 Asia IND 9851 Europe ESP 9181 America PER 8875 Europe GBR 8719 Europe FRA 7578 10 rows selected.
This returns the 10 countries which had the maximum covid-19 cases per day.
20c WINDOW clauses
If I want to show the date with the maximum value, I can use analytic functions and in 20c I don’t have to repeat the window several times:
SQL> select continent_name,country_code,top_date,top_cases from ( select continent_name,country_code,daterep,cases ,first_value(daterep)over(w) top_date ,first_value(cases)over(w) top_cases ,row_number()over(w) r from cases join countries using(country_id) join continents using(continent_id) window w as (partition by continent_id order by cases desc) ) where r=1 -- this to get the rows with the highest value only order by top_cases desc fetch first 10 rows only; CONTINENT_NAME COUNTRY_CODE TOP_DATE TOP_CASES _________________ _______________ _____________ ____________ America USA 26/04/2020 48529 Asia CHN 13/02/2020 15141 Europe RUS 12/05/2020 11656 Africa ZAF 05/06/2020 3267 Oceania AUS 23/03/2020 611
The same can be done before 20c but you have to write the (partition by continent_id order by cases desc) for each projection.
In the next post I’ll show a very nice feature. Keeping the 3 tables normalized data model but, because storage is cheap, materializing some pre-computed joins. If you are a fan of NoSQL because “storage is cheap” and “joins are expensive”, then you will see what we can do with SQL in this area…