Note: All posts take a practical approach, avoiding lengthy theory. Everything has been tested on development servers. Please don't try any post on a production server until you are sure.

Thursday, January 04, 2018

Using Preprocessor with External Table [Over HDFS]

Oracle 11g Release 2 introduced the PREPROCESSOR clause, which identifies a directory object and a program used to process data files before they are read by the external table. This feature was backported to 11gR1 (11.1.0.7). The PREPROCESSOR clause is especially useful for reading compressed files, since they can be decompressed and piped straight into the external table process without ever being unpacked on the file system.
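Conceptually, the access driver runs the preprocessor with the location file as its argument and reads the program's stdout as if it were the data file. The shell sketch below illustrates that pipeline outside the database; the scratch file names are hypothetical, not the ones used later in this post.

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d)

# A plain comma-delimited data file, then compress it so only the .gz remains
printf 'SCOTT,7788\nADAMS,7876\n' > "$tmp/emp.csv"
gzip "$tmp/emp.csv"

# What the database effectively does with PREPROCESSOR zcat:
# zcat <location-file> | <row parser>
zcat "$tmp/emp.csv.gz" | wc -l    # prints the row count (2)

rm -rf "$tmp"
```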


In our environment, the text files to be loaded into the database live on Hadoop HDFS, and they are either compressed or otherwise not precisely in the format expected by the external tables. Normally such files would have to be transformed into the appropriate format, for example uncompressed, in a separate process before external tables could read them. Thanks to the preprocessor feature of Oracle Database 11g Release 2, rather than uncompressing the file in a separate step, you can have the external table uncompress it inline.

Before starting the test, we first need some data.


Prepare Data
****************

Create a comma-delimited text file. You can use any tool, such as Toad or SQL*Plus, to generate the data.

set colsep ',';
SET HEADING OFF;
SET FEEDBACK OFF;
SET LINESIZE 1000;
SPOOL /data/ora_ext_tab_dir/xt_uo_tx_preprocessor.txt;
select * from myobj ;
SPOOL OFF;

If you want more control over the data generation, you can create a procedure like the one below.

create or replace procedure dump_table_to_csv (
  p_tname      in varchar2,
  p_dir        in varchar2,
  p_filename   in varchar2,
  p_separator  in varchar2,
  p_where_cond in varchar2 default '1=1'
)
is
  l_output        utl_file.file_type;
  l_theCursor     integer default dbms_sql.open_cursor;
  l_columnValue   varchar2(4000);
  l_status        integer;
  l_query         varchar2(1000) default 'select * from ' || p_tname || ' where ' || p_where_cond;
  l_colCnt        number := 0;
  l_separator     varchar2(10);
  l_descTbl       dbms_sql.desc_tab;
  l_fileGenerated varchar2(255);
begin
  dbms_output.put_line('Query generated  :  ' || l_query);

  l_output := utl_file.fopen(p_dir, p_filename, 'w');
  execute immediate 'alter session set nls_date_format=''dd-mon-yyyy hh24:mi:ss'' ';
  dbms_sql.parse(l_theCursor, l_query, dbms_sql.native);
  dbms_sql.describe_columns(l_theCursor, l_colCnt, l_descTbl);

  for i in 1 .. l_colCnt loop
    --utl_file.put( l_output, l_separator || '"' || l_descTbl(i).col_name || '"' );
    dbms_sql.define_column(l_theCursor, i, l_columnValue, 4000);
    l_separator := p_separator;
  end loop;
  --utl_file.new_line( l_output );

  l_status := dbms_sql.execute(l_theCursor);
  while dbms_sql.fetch_rows(l_theCursor) > 0 loop
    l_separator := '';
    for i in 1 .. l_colCnt loop
      dbms_sql.column_value(l_theCursor, i, l_columnValue);
      utl_file.put(l_output, l_separator || l_columnValue);
      l_separator := p_separator;
    end loop;
    utl_file.new_line(l_output);
  end loop;

  dbms_sql.close_cursor(l_theCursor);
  utl_file.fclose(l_output);

  select directory_path || '/' || p_filename
    into l_fileGenerated
    from dba_directories
   where directory_name = p_dir;
  dbms_output.put_line('File generated at: ' || l_fileGenerated);
  execute immediate 'alter session set nls_date_format=''dd-MON-yy'' ';
exception
  when others then
    execute immediate 'alter session set nls_date_format=''dd-MON-yy'' ';
    raise;
end;
/

After creating the procedure, call it to generate your data.

SQL> EXEC dump_table_to_csv( p_tname =>'SYS.MYOBJ',  p_dir => 'LOCAL_OS_DIR',  p_filename => 'xt_uo_tx_preprocessor.txt',p_separator =>',' )

PL/SQL procedure successfully completed.

Elapsed: 00:52:04.54

MYOBJ is based on USER_OBJECTS, but inflated to about 4 GB for this test.

--Another example
EXEC dump_table_to_csv( p_tname =>'SCOTT.EMPLOYEE',  p_dir => 'LOCAL_OS_DIR',  p_filename => 'employee.txt',p_separator =>'<FT>' ,p_where_cond => 'job=''CLERK''' ) 


Create necessary Directory Objects
********************************************

You also need to create the directory objects referenced in your external table definition.


SQL> CREATE OR REPLACE DIRECTORY admin_log_dir AS '/data/log/'
  2  /

Directory created.

SQL> CREATE OR REPLACE DIRECTORY nfs_dir AS '/data/hdfsloc/data/oraclenfs/';

Directory created.

SQL> CREATE OR REPLACE DIRECTORY local_os_dir AS '/data/ora_ext_tab_dir';

Directory created.

SQL> CREATE OR REPLACE DIRECTORY exec_dir AS '/data/ora_ext_tab_dir';

Directory created.

EXEC_DIR holds the executable named in the PREPROCESSOR option of the external table, so copy the relevant executable into this directory if required. In my example I'm using zcat as the preprocessor; zcat lives in /bin on RHEL, so I copied it to EXEC_DIR. Remember that the path of an Oracle directory object used with PREPROCESSOR must not contain symbolic links.



[hdpclient@te1-hdp-rp-en01 bin]$ cp zcat /data/ora_ext_tab_dir
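A couple of quick sanity checks are worth running before pointing EXEC_DIR at this path: the preprocessor must be executable by the oracle OS user, and the directory path must not resolve through a symbolic link. This is a sketch using the path from this post; adjust as needed.

```shell
#!/bin/sh
dir=/data/ora_ext_tab_dir

# 1) the preprocessor must be executable
if [ -x "$dir/zcat" ]; then
  echo "zcat is executable"
else
  echo "zcat missing or not executable"
fi

# 2) the canonical path should equal the path as given (no symlinks)
if [ "$(readlink -f "$dir")" = "$dir" ]; then
  echo "no symlinks in path"
else
  echo "path resolves through a symlink"
fi
```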


Compress the generated data
****************************
After creating the data, compress the delimited file with gzip and compare the sizes. In my test the delimited text file is about 4 GB, while its compressed version is just 352 MB (roughly 12:1).


[root@te1-hdp-rp-en01 ora_ext_tab_dir]# gzip -c xt_uo_tx_preprocessor.txt > xt_uo_tx_preprocessor.txt.gz


[root@te1-hdp-rp-en01 ora_ext_tab_dir]# ls -l xt_uo_tx_pre*
-rw-r--r-- 1 oracle oinstall 4324215296 Jan  2 16:53 xt_uo_tx_preprocessor.txt
-rw-r--r-- 1 root   root      368855592 Jan  2 16:59 xt_uo_tx_preprocessor.txt.gz
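You can also ask gzip itself for the compression ratio without decompressing anything. One caveat I should flag: gzip stores the uncompressed size in a 32-bit header field, so for files larger than 4 GiB (like this one) the "uncompressed" column wraps around and the reported ratio is misleading; the ls sizes above are authoritative.

```shell
# List compression statistics for the archive (file name from this post)
gzip -l xt_uo_tx_preprocessor.txt.gz
```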


Create External Table without Preprocessor and test the performance
********************************************************************************
Now create external table on local storage (LOCAL_OS_DIR) using text file (size 4GB) and check performance

drop table xt_uo_tx_preprocessor;

create table xt_uo_tx_preprocessor
(
OBJECT_NAME                                        VARCHAR2(128)
,SUBOBJECT_NAME                                     VARCHAR2(128)
,OBJECT_ID                                          NUMBER
,DATA_OBJECT_ID                                     NUMBER
,OBJECT_TYPE                                        VARCHAR2(23)
,CREATED                                            DATE
,LAST_DDL_TIME                                      DATE
,TIMESTAMP1                                          VARCHAR2(19)
,STATUS                                             VARCHAR2(7)
,TEMPORARY                                          VARCHAR2(1)
,GENERATED                                          VARCHAR2(1)
,SECONDARY                                          VARCHAR2(1)
,NAMESPACE                                          NUMBER
,EDITION_NAME                                       VARCHAR2(128)
,SHARING                                            VARCHAR2(18)
,EDITIONABLE                                        VARCHAR2(1)
,ORACLE_MAINTAINED                                  VARCHAR2(1)
,APPLICATION                                        VARCHAR2(1)
,DEFAULT_COLLATION                                  VARCHAR2(100)
,DUPLICATED                                         VARCHAR2(1)
,SHARDED                                            VARCHAR2(1)
,CREATED_APPID                                      NUMBER
,CREATED_VSNID                                      NUMBER
,MODIFIED_APPID                                     NUMBER
,MODIFIED_VSNID                                     NUMBER                     
)
organization external
(
type oracle_loader
default directory LOCAL_OS_DIR
access parameters
    (
       RECORDS DELIMITED BY NEWLINE 
       FIELDS TERMINATED BY ',' 
   OPTIONALLY ENCLOSED BY '"' 
   MISSING FIELD VALUES ARE NULL
   ( OBJECT_NAME,SUBOBJECT_NAME,OBJECT_ID,DATA_OBJECT_ID,OBJECT_TYPE,CREATED CHAR(20) DATE_FORMAT DATE MASK "DD-MON-YYYY HH24:MI:SS"  ,LAST_DDL_TIME CHAR(20) DATE_FORMAT DATE MASK "DD-MON-YYYY HH24:MI:SS"  ,TIMESTAMP1,STATUS,TEMPORARY,GENERATED,SECONDARY,NAMESPACE,EDITION_NAME,SHARING,EDITIONABLE,ORACLE_MAINTAINED,APPLICATION,DEFAULT_COLLATION,DUPLICATED,SHARDED,CREATED_APPID,CREATED_VSNID,MODIFIED_APPID,MODIFIED_VSNID      
   )
    )
location ('xt_uo_tx_preprocessor.txt')
)
REJECT LIMIT 100;

--Check table performance with and without parallelism (LOCAL_OS_DIR)

SQL> select count(*) from xt_uo_tx_preprocessor;

  COUNT(*)
----------
  26524160

Elapsed: 00:02:21.48

SQL> SELECT /*+ PARALLEL(xt_uo_tx_preprocessor,4) */ COUNT(*)  FROM xt_uo_tx_preprocessor;

  COUNT(*)
----------
  26524160

Elapsed: 00:00:35.76


Create External Table with Preprocessor and test the performance
*********************************************************************************

Drop and create the table again, this time with the PREPROCESSOR option, and update the LOCATION parameter to point to the .gz file already created on LOCAL_OS_DIR.

drop table xt_uo_tx_preprocessor;

create table xt_uo_tx_preprocessor
(
OBJECT_NAME                                        VARCHAR2(128)
,SUBOBJECT_NAME                                     VARCHAR2(128)
,OBJECT_ID                                          NUMBER
,DATA_OBJECT_ID                                     NUMBER
,OBJECT_TYPE                                        VARCHAR2(23)
,CREATED                                            DATE
,LAST_DDL_TIME                                      DATE
,TIMESTAMP1                                          VARCHAR2(19)
,STATUS                                             VARCHAR2(7)
,TEMPORARY                                          VARCHAR2(1)
,GENERATED                                          VARCHAR2(1)
,SECONDARY                                          VARCHAR2(1)
,NAMESPACE                                          NUMBER
,EDITION_NAME                                       VARCHAR2(128)
,SHARING                                            VARCHAR2(18)
,EDITIONABLE                                        VARCHAR2(1)
,ORACLE_MAINTAINED                                  VARCHAR2(1)
,APPLICATION                                        VARCHAR2(1)
,DEFAULT_COLLATION                                  VARCHAR2(100)
,DUPLICATED                                         VARCHAR2(1)
,SHARDED                                            VARCHAR2(1)
,CREATED_APPID                                      NUMBER
,CREATED_VSNID                                      NUMBER
,MODIFIED_APPID                                     NUMBER
,MODIFIED_VSNID                                     NUMBER                     
)
organization external
(
type oracle_loader
default directory LOCAL_OS_DIR
access parameters
    (
       RECORDS DELIMITED BY NEWLINE 
       preprocessor exec_dir:'zcat'
       FIELDS TERMINATED BY ',' 
   OPTIONALLY ENCLOSED BY '"' 
   MISSING FIELD VALUES ARE NULL
   ( OBJECT_NAME,SUBOBJECT_NAME,OBJECT_ID,DATA_OBJECT_ID,OBJECT_TYPE,CREATED CHAR(20) DATE_FORMAT DATE MASK "DD-MON-YYYY HH24:MI:SS"  ,LAST_DDL_TIME CHAR(20) DATE_FORMAT DATE MASK "DD-MON-YYYY HH24:MI:SS"  ,TIMESTAMP1,STATUS,TEMPORARY,GENERATED,SECONDARY,NAMESPACE,EDITION_NAME,SHARING,EDITIONABLE,ORACLE_MAINTAINED,APPLICATION,DEFAULT_COLLATION,DUPLICATED,SHARDED,CREATED_APPID,CREATED_VSNID,MODIFIED_APPID,MODIFIED_VSNID       
   )
    )
location ('xt_uo_tx_preprocessor.txt.gz')
)
REJECT LIMIT 100;

Now run the count queries again (this time the location points at the gzip file) and observe the time taken. Note that with a single compressed file the input stream cannot be split across parallel servers, so the parallel hint brings little benefit here.

SQL> select count(*) from xt_uo_tx_preprocessor;

  COUNT(*)
----------
  26524160

Elapsed: 00:02:29.85

SQL> SELECT /*+ PARALLEL(xt_uo_tx_preprocessor,4) */ COUNT(*)  FROM xt_uo_tx_preprocessor;

  COUNT(*)
----------
  26524160

Elapsed: 00:02:28.40

Testing the Preprocessor with Hadoop HDFS
********************************************************

Now that you have timings with the text and gzip files on LOCAL_OS_DIR, simply copy them to HDFS (NFS_DIR), change the LOCATION parameter of the external table, and run the queries again to compare the time taken.

--copy the zip file on HDFS

SQL> host time cp /data/ora_ext_tab_dir/xt_uo_tx_preprocessor.txt.gz /data/hdfsloc/data/oraclenfs

real    0m30.635s
user    0m0.000s
sys     0m0.375s


--alter the location of the table and run the count query again

SQL> alter table xt_uo_tx_preprocessor location (NFS_DIR:'xt_uo_tx_preprocessor.txt.gz')
  2  /

Table altered.


--run count query again; the result is almost the same as on LOCAL_OS_DIR

SQL> SELECT /*+ PARALLEL(xt_uo_tx_preprocessor,4) */ COUNT(*)  FROM xt_uo_tx_preprocessor;

  COUNT(*)
----------
  26524160

Elapsed: 00:02:29.60


Another Use Case - Directory Listing

Decompression is not the only use case for the preprocessor. You can use the same technique to expose the output of any program as an external table. The example below creates an external table that shows a directory listing.


1- Create a bash script that the external table will call through the PREPROCESSOR option.

[oracle@te1-hdp-rp-en01 ~]$ vi /data/ora_ext_tab_dir/myprog.sh
[oracle@te1-hdp-rp-en01 ora_ext_tab_dir]$ cat /data/ora_ext_tab_dir/myprog.sh

#!/bin/bash

/bin/ls -a -l -X /data/hdfsloc/flume/twitter/


2- Make the script executable.


[oracle@te1-hdp-rp-en01 ~]$ chmod u+x /data/ora_ext_tab_dir/myprog.sh
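A preprocessor program simply writes rows to stdout: the access driver invokes it with the full path of the LOCATION file as its only argument and parses whatever it prints. Our listing script ignores that argument, which is why the table in the next step can point at the script itself as its LOCATION. It is worth testing the script from the shell before the database ever calls it (path from this post):

```shell
# Run the preprocessor script by hand and eyeball the first few rows
/data/ora_ext_tab_dir/myprog.sh | head -3
```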

3- Create external table.

drop table extlstfile;

create table extlstfile
(
  file_priv      VARCHAR2(11),
  file_links     NUMBER,
  file_owner     VARCHAR2(25),
  file_owner_grp VARCHAR2(25),
  file_size      NUMBER,
  file_month     VARCHAR2(3),
  file_day       NUMBER,
  file_time      VARCHAR2(6),
  file_name      VARCHAR2(25)
)
organization external
(
  type oracle_loader
  default directory LOCAL_OS_DIR
  access parameters
  (
    records delimited by newline
    preprocessor LOCAL_OS_DIR:'myprog.sh'
    skip 1
fields terminated by whitespace ldrtrim 
  )
location (LOCAL_OS_DIR:'myprog.sh') 
)
reject limit unlimited
/

LDRTRIM is used to provide compatibility with SQL*Loader trim features. It is the same as NOTRIM except in the following cases:


  • If the field is not a delimited field, then spaces will be trimmed from the right.
  • If the field is a delimited field with OPTIONALLY ENCLOSED BY specified, and the optional enclosures are missing for a particular instance, then spaces will be trimmed from the left.


4- Query external table

SQL> set lines 200
SQL> select * from extlstfile;

FILE_PRIV   FILE_LINKS FILE_OWNER                FILE_OWNER_GRP             FILE_SIZE FIL   FILE_DAY FILE_T FILE_NAME
----------- ---------- ------------------------- ------------------------- ---------- --- ---------- ------ -------------------------
drwxrwxrwx         481 hdfs                      hdfs                           15392 Jan          7 2018   .
drwxrwxrwx           6 hdfs                      hdfs                             192 Aug          9 12:50  ..
-rw-r--r--           1 hdpclient                 hdfs                           15294 Jan          7 07:41  FlumeData.1515300019774
-rw-r--r--           1 hdpclient                 hdfs                           24359 Jan          7 07:41  FlumeData.1515300049921
-rw-r--r--           1 hdpclient                 hdfs                            5659 Jan          7 07:42  FlumeData.1515300080442
-rw-r--r--           1 hdpclient                 hdfs                           26006 Jan          7 07:42  FlumeData.1515300113515
-rw-r--r--           1 hdpclient                 hdfs                           17226 Jan          7 07:43  FlumeData.1515300149653
-rw-r--r--           1 hdpclient                 hdfs                           30049 Jan          7 07:43  FlumeData.1515300183800
-rw-r--r--           1 hdpclient                 hdfs                           21977 Jan          7 07:44  FlumeData.1515300228408
-rw-r--r--           1 hdpclient                 hdfs                           13785 Jan          7 07:45  FlumeData.1515300263047
-rw-r--r--           1 hdpclient                 hdfs                            9785 Jan          7 07:45  FlumeData.1515300305630


SQL> select count(*) from extlstfile;

  COUNT(*)
----------
       469
