dbi services Blog
Welcome to the dbi services Blog! This IT blog focuses on database, middleware, and OS technologies such as Oracle, Microsoft SQL Server & SharePoint, EMC Documentum, MySQL, PostgreSQL, Sybase, Unix/Linux, etc. The dbi services blog represents the view of our consultants, not necessarily that of dbi services. Feel free to comment on our blog postings.
Oracle LOBs: Infer the file type with dbms_lob package
LOBs (Large OBjects) first appeared in Oracle 7 and were created to store large amount of data such as document files, video, pictures, etc. Today, we can store up to 128 TB of data in a LOB, depending on the DB_BLOCK_SIZE parameter settings. In this posting, I will show you how to load LOB files into the Oracle database and will present a way to identify the file type of the LOB stored in the database, based on the LOB value column only.
Each file type is associated to a "file signature". This signature is composed of several bytes (i. e. 00 01), and allows to uniquely identify a file type (i.e. zip file, png file, etc). The signature corresponds to the first bytes of the file. To identify the file type of a LOB stored in a database, we just have to extract the first bytes of the LOB value and then compare them to a list of known bytes. I have created a PL/SQL script for that purpose.
The first step is to create a table containing LOB files. Here, I will create a simple table to store internal Binary objects (BLOB). The statement used is:
CREATE TABLE app_doc ( doc_id NUMBER, doc_value BLOB);
SQL> desc app_doc;
Name Null? Type
----------------------------------------- -------- ----------------------------
Now, I want to load a file from the operating file system to the table APP_DOC. The file to load is a picture saved as PNG format. The path is "/oracle/documents/my_pic.png".
First, we must create an Oracle directory in order to access the file:
SQL> CREATE OR REPLACE DIRECTORY DIR_DOCUMENTS AS '/oracle/documents';
We then use the DBMS_LOB package through a PL/SQL block to load a file in the database:
src_lob BFILE := BFILENAME('DIR_DOCUMENTS', 'my_pic.png');
INSERT INTO app_doc VALUES(1, EMPTY_BLOB())
RETURNING doc_value INTO dest_lob;
DBMS_LOB.LoadFromFile( DEST_LOB => dest_lob,
SRC_LOB => src_lob,
AMOUNT => DBMS_LOB.GETLENGTH(src_lob) );
The table has now a new record:
SQL> select count (*) from app_doc;
I have populated the table with 28 files of several types (png, pdf, bmp, etc.) and the table has 28 records. Here is the list of files loaded into the table:
oracle@vmtest:/oracle/documents/ [rdbms11203] ls -1
Here is an extract of the PL/SQL written to identify the LOB file type:
-- Create a temporary table to store known document types
CREATE GLOBAL TEMPORARY TABLE temp_doc_types (id NUMBER, type VARCHAR(5), sign VARCHAR(4)) ON COMMIT DELETE ROWS;
-- Populate the temporary table
INSERT INTO temp_doc_types VALUES (1,'jpeg','FFD8');
INSERT INTO temp_doc_types VALUES (2,'gif','4749');
INSERT INTO temp_doc_types VALUES (3,'png','8950');
INSERT INTO temp_doc_types VALUES (4,'bmp','424D');
INSERT INTO temp_doc_types VALUES (5,'pdf','2550');
INSERT INTO temp_doc_types VALUES (6,'doc','D0CF');
INSERT INTO temp_doc_types VALUES (7,'zip','504B');
INSERT INTO temp_doc_types VALUES (8,'rar','5261');
In this example, I chose to compare only the four first bytes of the LOB value. The number of bytes composing a file signature can be larger, but to simplify the example, I only used file signatures that have four bytes.
The number of files of each extension is returned by this statement:
SQL> select decode(type,NULL,'Unknown document type',type) as "TYPE OF DOCUMENT",
2 count(*) as "NUMBER OF DOCUMENTS"
3 from temp_doc_types a,
4 (select (dbms_lob.substr(doc_value,2,1)) as FILE_TYPE from app_doc) b
5 where a.sign(+)=b.FILE_TYPE group by a.type;
TYPE OF DOCUMENT NUMBER OF DOCUMENTS
Unknown document type 1
The statement gets the first four bytes of each LOB and compares them to the temporary table, matching the LOB to a type.
- The decode function is used to display 'Unknown document type' when NULL is returned and no extension matches between the LOB and the temporary table. This is the case for text files which have no file signature.
- The DBMS_LOB.SUBSTR function is used to retrieve only the first four bytes from the LOB value. Parameters are the source LOB value, the amount of bytes to read (4), and the starting position (1).
- An outer join is used to count LOBs not matching the temporary table. This will correspond to unidentified files types.
To finish, I drop the temporary table before exiting the job:
-- The temporary table is dropped before exit
DROP TABLE temp_doc_types;
And just for the eyes, this is the result in a HTML fashion, generated directly from sqlplus using the MARKUP option and an css style sheet. I let you read the Oracle documentation in order to know how to use the MARKUP option ;-)