LOBs (Large OBjects) first appeared in Oracle 7 and were created to store large amount of data such as document files, video, pictures, etc. Today, we can store up to 128 TB of data in a LOB, depending on the DB_BLOCK_SIZE parameter settings. In this posting, I will show you how to load LOB files into the Oracle database and will present a way to identify the file type of the LOB stored in the database, based on the LOB value column only.
Each file type is associated to a “file signature”. This signature is composed of several bytes (i. e. 00 01), and allows to uniquely identify a file type (i.e. zip file, png file, etc). The signature corresponds to the first bytes of the file. To identify the file type of a LOB stored in a database, we just have to extract the first bytes of the LOB value and then compare them to a list of known bytes. I have created a PL/SQL script for that purpose.
The first step is to create a table containing LOB files. Here, I will create a simple table to store internal Binary objects (BLOB). The statement used is:
CREATE TABLE app_doc ( doc_id NUMBER, doc_value BLOB); SQL> desc app_doc; Name Null? Type ----------------------------------------- -------- ---------------------------- DOC_ID NUMBER DOC_VALUE BLOB
Now, I want to load a file from the operating file system to the table APP_DOC. The file to load is a picture saved as PNG format. The path is “/oracle/documents/my_pic.png”.
First, we must create an Oracle directory in order to access the file:
SQL> CREATE OR REPLACE DIRECTORY DIR_DOCUMENTS AS '/oracle/documents';
We then use the DBMS_LOB package through a PL/SQL block to load a file in the database:
DECLARE src_lob BFILE := BFILENAME('DIR_DOCUMENTS', 'my_pic.png'); dest_lob BLOB; BEGIN INSERT INTO app_doc VALUES(1, EMPTY_BLOB()) RETURNING doc_value INTO dest_lob; DBMS_LOB.OPEN(src_lob, DBMS_LOB.LOB_READONLY); DBMS_LOB.LoadFromFile( DEST_LOB => dest_lob, SRC_LOB => src_lob, AMOUNT => DBMS_LOB.GETLENGTH(src_lob) ); DBMS_LOB.CLOSE(src_lob); COMMIT; END; /
The table has now a new record:
SQL> select count (*) from app_doc; COUNT(*) ---------- 1
I have populated the table with 28 files of several types (png, pdf, bmp, etc.) and the table has 28 records. Here is the list of files loaded into the table:
oracle@vmtest:/oracle/documents/ [rdbms11203] ls -1 001.png 002.png 003.png 004.png 005.png 006.png 007.png 008.png 009.png 010.png 011.png 012.png 013.png 014.png 015.png 016.png 017.png 018.png 100021667-logo-dbi-services-sa.jpg dbi_twitter_bigger.jpg e18294-SecureFiles_and_Large_Objects_Developers_Guide.pdf e25523-VLDB_and_Partitioning_Guide.pdf file.zip Mybmp2.bmp Mybmp.bmp Mydoc1.doc mypdf.pdf text3.txt
Here is an extract of the PL/SQL written to identify the LOB file type:
-- Create a temporary table to store known document types CREATE GLOBAL TEMPORARY TABLE temp_doc_types (id NUMBER, type VARCHAR(5), sign VARCHAR(4)) ON COMMIT DELETE ROWS;
-- Populate the temporary table INSERT INTO temp_doc_types VALUES (1,'jpeg','FFD8'); INSERT INTO temp_doc_types VALUES (2,'gif','4749'); INSERT INTO temp_doc_types VALUES (3,'png','8950'); INSERT INTO temp_doc_types VALUES (4,'bmp','424D'); INSERT INTO temp_doc_types VALUES (5,'pdf','2550'); INSERT INTO temp_doc_types VALUES (6,'doc','D0CF'); INSERT INTO temp_doc_types VALUES (7,'zip','504B'); INSERT INTO temp_doc_types VALUES (8,'rar','5261');
In this example, I chose to compare only the four first bytes of the LOB value. The number of bytes composing a file signature can be larger, but to simplify the example, I only used file signatures that have four bytes.
The number of files of each extension is returned by this statement:
SQL> select decode(type,NULL,'Unknown document type',type) as "TYPE OF DOCUMENT", 2 count(*) as "NUMBER OF DOCUMENTS" 3 from temp_doc_types a, 4 (select (dbms_lob.substr(doc_value,2,1)) as FILE_TYPE from app_doc) b 5 where a.sign(+)=b.FILE_TYPE group by a.type; TYPE OF DOCUMENT NUMBER OF DOCUMENTS --------------------- ------------------- Unknown document type 1 doc 1 bmp 2 png 18 pdf 3 jpeg 2 zip 1
The statement gets the first four bytes of each LOB and compares them to the temporary table, matching the LOB to a type.
- The decode function is used to display ‘Unknown document type’ when NULL is returned and no extension matches between the LOB and the temporary table. This is the case for text files which have no file signature.
- The DBMS_LOB.SUBSTR function is used to retrieve only the first four bytes from the LOB value. Parameters are the source LOB value, the amount of bytes to read (4), and the starting position (1).
- An outer join is used to count LOBs not matching the temporary table. This will correspond to unidentified files types.
To finish, I drop the temporary table before exiting the job:
-- The temporary table is dropped before exit DROP TABLE temp_doc_types; exit;
And just for the eyes, this is the result in a HTML fashion, generated directly from sqlplus using the MARKUP option and an css style sheet. I let you read the Oracle documentation in order to know how to use the MARKUP option 😉