

Note: All posts take a practical approach, avoiding lengthy theory. Everything has been tested on development servers. Please don't try any post on production servers until you are sure.

Thursday, June 08, 2017

Installing/Configuring and Working on Apache Sqoop



Introduction


Apache Sqoop is a Hadoop ecosystem tool (a Hadoop client) designed to efficiently transfer bulk data between Apache Hadoop and structured datastores such as Oracle. It helps offload certain tasks (such as ETL processing) from the EDW to Hadoop for efficient execution at a much lower cost. It can also extract data from Hadoop and export it into external structured datastores.


It provides two main tools (import and export) to facilitate data transfers, along with several other utility tools.



The import tool imports individual tables from an RDBMS into HDFS. Each row of a table is treated as a record in HDFS. All records are stored either as text data in text files or as binary data in Avro and SequenceFiles.


You can also export data back from HDFS to the RDBMS. The target table must already exist in the target database. The default operation is to insert all records from the input files into the database table using INSERT statements. In update mode, Sqoop generates UPDATE statements that replace the existing records in the database.
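To make the two export modes concrete, here is a small illustrative sketch (not Sqoop's actual code; the table and column names are just examples) of how one comma-delimited text record maps to an INSERT in default mode and an UPDATE in update mode:

```python
# Illustrative sketch only: how an exported text record becomes an
# INSERT (default mode) or an UPDATE (update mode with --update-key).

def record_to_insert(table, columns, line, delimiter=","):
    """Default export mode: every input line becomes an INSERT."""
    values = line.rstrip("\n").split(delimiter)
    cols = ", ".join(columns)
    placeholders = ", ".join("?" for _ in values)
    return f"INSERT INTO {table} ({cols}) VALUES ({placeholders})", values

def record_to_update(table, columns, key_column, line, delimiter=","):
    """Update mode: existing rows are matched on the key column."""
    values = line.rstrip("\n").split(delimiter)
    set_clause = ", ".join(f"{c} = ?" for c in columns if c != key_column)
    return f"UPDATE {table} SET {set_clause} WHERE {key_column} = ?", values

stmt, vals = record_to_insert("EXP_FROM_HDFS", ["ID", "NAME"], "7,Hassan")
print(stmt)  # INSERT INTO EXP_FROM_HDFS (ID, NAME) VALUES (?, ?)
stmt, vals = record_to_update("EXP_FROM_HDFS", ["ID", "NAME"], "ID", "7,Hassan")
print(stmt)  # UPDATE EXP_FROM_HDFS SET NAME = ? WHERE ID = ?
```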



Installing/Configuring Sqoop 1.4.6

First, download the latest stable Sqoop release; you can find the download location below.

http://www-eu.apache.org/dist/sqoop/1.4.6/
Unzip the file in your desired location and set the ownership and permissions as per your environment.

[root@en01 ~]# chown -R hdpclient:hadoop_edge /usr/hadoopsw/sqoop-1.4.6.bin__hadoop-0.23
[root@en01 ~]# chmod -R 777 /usr/hadoopsw/sqoop-1.4.6.bin__hadoop-0.23

Configure the Sqoop-related environment variables:


vi ~/.bash_profile


### SQOOP Variables
export SQOOP_HOME=/usr/hadoopsw/sqoop-1.4.6.bin__hadoop-0.23
export PATH=$PATH:$SQOOP_HOME/bin


[hdpclient@en01 bin]$ cd $SQOOP_HOME/conf
[hdpclient@en01 conf]$ mv sqoop-env-template.sh sqoop-env.sh

vi sqoop-env.sh
#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/usr/hadoopsw/hadoop-2.7.3
#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/usr/hadoopsw/hadoop-2.7.3
#set the path to where bin/hbase is available
#export HBASE_HOME=
#Set the path to where bin/hive is available
#export HIVE_HOME=
#Set the path for where zookeper config dir is
#export ZOOCFGDIR=

Verify sqoop

[hdpclient@en01 ~]$ sqoop-version
.....
17/06/07 11:21:01 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Sqoop 1.4.6
git commit id c0c5a81723759fa575844a0a1eae8f510fa32c25
Compiled by root on Mon Apr 27 14:20:17 CST 2015

Download the Oracle driver and place it in lib
Download the JDBC drivers for your database and place them in $SQOOP_HOME/lib. You can download them from the locations below. I'm using Oracle for the purpose of this post.

Oracle JDBC Driver
http://www.oracle.com/technetwork/database/enterprise-edition/jdbc-112010-090769.html.

Import data from Oracle database to Hadoop HDFS

Create location in HDFS to store imported data

[hdpsysuser@nn01 ~]$ hdfs dfs -mkdir /oradata
[hdpsysuser@nn01 ~]$ hdfs dfs -chown hdpclient:hadoop_edge /oradata


Create a table in an Oracle schema (hdfsuser in my case) to test the Sqoop import.
SQL> create table test(id number, name varchar2(10));
SQL> insert into test values(&id,&name);
SQL> select * from test;
        ID NAME

---------- ----------
         1 Ali
         2 Abuzar
         3 Zeeshan
         4 Hafiz
         5 Ikram
         6 Usman

6 rows selected.


Now run the Sqoop import as shown below.

If your table has no primary key defined, you must either pass the -m 1 option (a single mapper) or provide the --split-by argument with a column name; otherwise the import fails with an error. You can also import a subset of a table using the --where argument.
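To see why a split column matters, here is a rough sketch of Sqoop's splitting idea: it queries MIN and MAX of the split column and cuts that range into one sub-range per mapper (simplified; the real splitter also handles data types and rounding):

```python
# Simplified sketch of split-by logic: divide [min, max] of the split
# column into num_mappers sub-ranges, one per map task.

def compute_splits(min_val, max_val, num_mappers):
    splits = []
    size = (max_val - min_val) / num_mappers
    lo = min_val
    for i in range(num_mappers):
        # Last split takes the exact max to avoid rounding gaps.
        hi = max_val if i == num_mappers - 1 else min_val + size * (i + 1)
        splits.append((lo, hi))
        lo = hi
    return splits

# With ids 1..6 and 3 mappers, each mapper scans roughly a third of the range:
print(compute_splits(1, 6, 3))
```

Without a primary key or --split-by, Sqoop has no column to compute these ranges on, which is why a single mapper (-m 1) is the fallback.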



[hdpclient@en01 conf]$ sqoop import --connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=en01)(port=1521))(connect_data=(service_name=hadob)))" --table TEST --username hdfsuser --password hdfsuser --target-dir /oradata/test --columns "id,name" -m 1


Warning: /usr/hadoopsw/sqoop-1.4.6.bin__hadoop-0.23/../hbase does not exist! HBase imports will fail.

Please set $HBASE_HOME to the root of your HBase installation.
Warning: /usr/hadoopsw/sqoop-1.4.6.bin__hadoop-0.23/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /usr/hadoopsw/sqoop-1.4.6.bin__hadoop-0.23/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /usr/hadoopsw/sqoop-1.4.6.bin__hadoop-0.23/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
17/06/07 12:03:50 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
17/06/07 12:03:50 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
17/06/07 12:03:50 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
17/06/07 12:03:50 INFO manager.SqlManager: Using default fetchSize of 1000
17/06/07 12:03:50 INFO tool.CodeGenTool: Beginning code generation
17/06/07 12:04:16 INFO manager.OracleManager: Time zone has been set to GMT
17/06/07 12:04:16 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM TEST t WHERE 1=0
17/06/07 12:04:16 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hadoopsw/hadoop-2.7.3
Note: /tmp/sqoop-hdpclient/compile/e2e81658abee225ed66486494c9a67b9/TEST.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
17/06/07 12:04:17 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdpclient/compile/e2e81658abee225ed66486494c9a67b9/TEST.jar
17/06/07 12:04:17 INFO manager.OracleManager: Time zone has been set to GMT
17/06/07 12:04:17 INFO mapreduce.ImportJobBase: Beginning import of TEST
17/06/07 12:04:18 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
17/06/07 12:04:18 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
17/06/07 12:04:18 INFO client.RMProxy: Connecting to ResourceManager at nn01/192.168.44.133:8032
17/06/07 12:04:22 INFO db.DBInputFormat: Using read commited transaction isolation
17/06/07 12:04:22 INFO mapreduce.JobSubmitter: number of splits:1
17/06/07 12:04:22 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1494154575009_0312
17/06/07 12:04:22 INFO impl.YarnClientImpl: Submitted application application_1494154575009_0312
17/06/07 12:04:22 INFO mapreduce.Job: The url to track the job: http://nn01:8088/proxy/application_1494154575009_0312/
17/06/07 12:04:22 INFO mapreduce.Job: Running job: job_1494154575009_0312
17/06/07 12:04:29 INFO mapreduce.Job: Job job_1494154575009_0312 running in uber mode : false
17/06/07 12:04:29 INFO mapreduce.Job:  map 0% reduce 0%
17/06/07 12:04:35 INFO mapreduce.Job:  map 100% reduce 0%
17/06/07 12:04:36 INFO mapreduce.Job: Job job_1494154575009_0312 completed successfully
17/06/07 12:04:36 INFO mapreduce.Job: Counters: 30
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=138171
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=87
                HDFS: Number of bytes written=49
                HDFS: Number of read operations=4
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=1
                Other local map tasks=1
                Total time spent by all maps in occupied slots (ms)=3686
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=3686
                Total vcore-milliseconds taken by all map tasks=3686
                Total megabyte-milliseconds taken by all map tasks=3774464
        Map-Reduce Framework
                Map input records=6
                Map output records=6
                Input split bytes=87
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=60
                CPU time spent (ms)=1170
                Physical memory (bytes) snapshot=179765248
                Virtual memory (bytes) snapshot=2166857728
                Total committed heap usage (bytes)=146800640
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=49
17/06/07 12:04:36 INFO mapreduce.ImportJobBase: Transferred 49 bytes in 17.5645 seconds (2.7897 bytes/sec)
17/06/07 12:04:36 INFO mapreduce.ImportJobBase: Retrieved 6 records.


Verify the imported data in HDFS

[hdpsysuser@nn01 ~]$ hdfs dfs -ls /oradata/test
Found 2 items
-rw-r--r-- 3 hdpclient hadoop_edge 0 2017-06-07 12:03 /oradata/test/_SUCCESS
-rw-r--r-- 3 hdpclient hadoop_edge 49 2017-06-07 12:03 /oradata/test/part-m-00000

[hdpsysuser@nn01 ~]$ hadoop fs -cat /oradata/test/part-m-00000
1,Ali
2,Abuzar
3,Zeeshan
4,Hafiz
5,Ikram
6,Usman
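The imported data is plain comma-delimited text, one record per line. As a small illustrative sketch (not part of Sqoop) of reading such a part file back into typed tuples:

```python
# Parse comma-delimited part-m-* lines into (int, str) tuples,
# matching the TEST table's (id, name) columns shown above.

def parse_part_file(lines):
    rows = []
    for line in lines:
        id_str, name = line.rstrip("\n").split(",", 1)
        rows.append((int(id_str), name))
    return rows

sample = ["1,Ali\n", "2,Abuzar\n", "3,Zeeshan\n"]
print(parse_part_file(sample))  # [(1, 'Ali'), (2, 'Abuzar'), (3, 'Zeeshan')]
```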

Incremental Import

Incremental import brings over only the newly added rows of a table. You must supply the --incremental, --check-column, and --last-value options to perform an incremental import.
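Conceptually, incremental append keeps only rows whose check-column value exceeds the saved last-value, then records the new maximum for the next run. This is a simplified sketch; Sqoop actually does this by generating a WHERE clause against the check column:

```python
# Simplified model of --incremental append: filter rows newer than
# last_value and report the new last_value to save for the next run.

def incremental_append(rows, check_col_index, last_value):
    new_rows = [r for r in rows if r[check_col_index] > last_value]
    new_last = max((r[check_col_index] for r in new_rows), default=last_value)
    return new_rows, new_last

table = [(1, "Ali"), (6, "Usman"), (7, "Hassan")]
rows, last = incremental_append(table, 0, 6)
print(rows)  # [(7, 'Hassan')]
print(last)  # 7
```

This mirrors the run below: with --last-value 6, only the row with id 7 is pulled, and Sqoop reports --last-value 7 for the next run.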

Insert new data in Oracle to test incremental import
SQL> insert into test values(7,'Hassan');

Run sqoop incremental import

sqoop import --connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=en01)(port=1521))(connect_data=(service_name=hadob)))" --table TEST --username hdfsuser --password hdfsuser --target-dir /oradata/test --columns "id,name" -m 1 --incremental append --check-column id --last-value 6

....
....
17/06/07 12:39:49 INFO mapreduce.ImportJobBase: Transferred 9 bytes in 28.3266 seconds (0.3177 bytes/sec)
17/06/07 12:39:49 INFO mapreduce.ImportJobBase: Retrieved 1 records.
17/06/07 12:39:49 INFO util.AppendUtils: Appending to directory test
17/06/07 12:39:49 INFO util.AppendUtils: Using found partition 1
17/06/07 12:39:49 INFO tool.ImportTool: Incremental import complete! To run another incremental import of all data following this import, supply the following arguments:
17/06/07 12:39:49 INFO tool.ImportTool:  --incremental append
17/06/07 12:39:49 INFO tool.ImportTool:   --check-column id
17/06/07 12:39:49 INFO tool.ImportTool:   --last-value 7
17/06/07 12:39:49 INFO tool.ImportTool: (Consider saving this with 'sqoop job --create')


Verify incremental import


[hdpsysuser@nn01 ~]$ hdfs dfs -ls /oradata/test

Found 3 items
-rw-r--r--   3 hdpclient hadoop_edge          0 2017-06-07 12:03 /oradata/test/_SUCCESS
-rw-r--r--   3 hdpclient hadoop_edge         49 2017-06-07 12:03 /oradata/test/part-m-00000
-rw-r--r--   3 hdpclient supergroup           9 2017-06-07 12:38 /oradata/test/part-m-00001

[hdpsysuser@nn01 ~]$ hadoop fs -cat /oradata/test/part-m-*
1,Ali
2,Abuzar
3,Zeeshan
4,Hafiz
5,Ikram
6,Usman
7,Hassan



[hdpsysuser@nn01 ~]$ hadoop fs -cat /oradata/test/part-m-*1
7,Hassan


Import all tables
When importing all tables, each table's data is stored in a separate directory whose name matches the table name. Every table in the database must also have a primary key field.

sqoop import-all-tables  --connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=en01)(port=1521))(connect_data=(service_name=hadob)))"  --username hdfsuser --password hdfsuser  --target-dir /oradata


Import data from an RDBMS into Hive

Create a second table in the Oracle schema to test the import from Oracle into a Hive table.

SQL> create table test2(id int,name varchar2(10));
SQL> insert into test2 select * from test;
SQL> commit;

Create a table in Hive to receive the data from the Sqoop import

hive (scott)> create table ora_imp_hive (id int,name string);

Now run the Sqoop import; you need to add the --hive-import and --hive-table options to the sqoop command.

sqoop import  --connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=en01)(port=1521))(connect_data=(service_name=hadob)))" --table TEST2 --username hdfsuser --password hdfsuser  --target-dir /oradata/testhive --hive-import --hive-table ora_imp_hive  -m 1

If you have not already created the table in Hive, you can use the --create-hive-table option in the sqoop command.

sqoop import --connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=en01)(port=1521))(connect_data=(service_name=hadob)))" --table TEST2 --username hdfsuser --password hdfsuser --target-dir /oradata/test3 --hive-import --hive-table ora_imp_hive3 --create-hive-table -m 1

Export data from HDFS back to the RDBMS

We already have a file in HDFS at the location below and want to export it to the Oracle database.

[hdpsysuser@nn01 ~]$ hadoop fs -ls /oradata/test/
Found 3 items
-rw-r--r--   3 hdpclient hadoop_edge          0 2017-06-07 12:03 /oradata/test/_SUCCESS
-rw-r--r--   3 hdpclient hadoop_edge         49 2017-06-07 12:03 /oradata/test/part-m-00000
-rw-r--r--   3 hdpclient supergroup           9 2017-06-07 12:38 /oradata/test/part-m-00001

[hdpsysuser@nn01 ~]$ hadoop fs -cat /oradata/test/part-m-*1
7,Hassan

You need to create the table in the Oracle schema first, and then you can export to it.
SQL> create table exp_from_hdfs(id number,name varchar2(10));

Run the sqoop export command:

sqoop export --connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=en01)(port=1521))(connect_data=(service_name=hadob)))" --username hdfsuser --password hdfsuser --table EXP_FROM_HDFS --export-dir /oradata/test

......
......
Map-Reduce Framework
                Map input records=7
                Map output records=7
                Input split bytes=606
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=367
                CPU time spent (ms)=8960
                Physical memory (bytes) snapshot=772395008
                Virtual memory (bytes) snapshot=8645033984
                Total committed heap usage (bytes)=588775424
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=0
17/06/07 16:12:57 INFO mapreduce.ExportJobBase: Transferred 746 bytes in 240.0369 seconds (3.1079 bytes/sec)
17/06/07 16:12:57 INFO mapreduce.ExportJobBase: Exported 7 records.

Verify from SQL*Plus

SQL>  select * from exp_from_hdfs;

        ID NAME
---------- ----------
         7 Hassan
         1 Ali
         2 Abuzar
         5 Ikram
         6 Usman
         3 Zeeshan
         4 Hafiz

7 rows selected.


Other Sqoop Tools

Sqoop - Job
A Sqoop job creates and saves import and export commands, recording the parameters needed to identify and recall the saved job. This recall and re-execution is especially useful for incremental imports, which pull only the updated rows from an RDBMS table into HDFS.

sqoop job --create import_from_oracle -- import --connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=en01)(port=1521))(connect_data=(service_name=hadob)))" --table TEST --username hdfsuser --password hdfsuser --target-dir /oradata/test --columns "id,name" -m 1 --incremental append --check-column id --last-value 7


List all jobs in the metastore
[hdpclient@en01 ~]$ sqoop job --list

.....

17/06/07 16:45:44 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Available jobs:
  import_from_oracle


Show metadata information about your job
[hdpclient@en01 ~]$ sqoop job --show import_from_oracle

......
Enter password:
Job: import_from_oracle
Tool: import
Options:
----------------------------
verbose = false
incremental.last.value = 7
db.connect.string = jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=en01)(port=1521))(connect_data=(service_name=hadob)))
codegen.output.delimiters.escape = 0
codegen.output.delimiters.enclose.required = false
codegen.input.delimiters.field = 0
hbase.create.table = false
db.require.password = true
hdfs.append.dir = true
db.table = TEST
codegen.input.delimiters.escape = 0
import.fetch.size = null
accumulo.create.table = false
codegen.input.delimiters.enclose.required = false
db.username = hdfsuser
reset.onemapper = false
codegen.output.delimiters.record = 10
import.max.inline.lob.size = 16777216
hbase.bulk.load.enabled = false
hcatalog.create.table = false
db.clear.staging.table = false
incremental.col = id
codegen.input.delimiters.record = 0
enable.compression = false
hive.overwrite.table = false
hive.import = false
codegen.input.delimiters.enclose = 0
accumulo.batch.size = 10240000
hive.drop.delims = false
codegen.output.delimiters.enclose = 0
hdfs.delete-target.dir = false
codegen.output.dir = .
codegen.auto.compile.dir = true
db.column.list = id,name
relaxed.isolation = false
mapreduce.num.mappers = 1
accumulo.max.latency = 5000
import.direct.split.size = 0
codegen.output.delimiters.field = 44
export.new.update = UpdateOnly
incremental.mode = AppendRows
hdfs.file.format = TextFile
codegen.compile.dir = /tmp/sqoop-hdpclient/compile/bd7d0040bdbedafdeb502e0f61bcf3e8
direct.import = false
hdfs.target.dir = /oradata/test
hive.fail.table.exists = false

db.batch = false


Execute Job
The --exec option is used to execute a saved job.
[hdpclient@en01 ~]$ sqoop job --exec import_from_oracle



Sqoop - Codegen tool

The codegen tool automatically generates the DAO class, which contains getter and setter methods for each column of the database table. Its main use is to regenerate the Java code if the originally generated code has been lost; the tool creates a new version of the class with the default delimiters between fields.


[hdpclient@en01 lib]$ sqoop codegen -connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=en01)(port=1521))(connect_data=(service_name=hadob)))" --table TEST --username hdfsuser --password hdfsuser
......
......
17/06/08 13:05:52 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM TEST t WHERE 1=0
17/06/08 13:05:52 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hadoopsw/hadoop-2.7.3
Note: /tmp/sqoop-hdpclient/compile/aa603ae3cb66f71003ac49b95b74136c/TEST.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
17/06/08 13:05:53 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdpclient/compile/aa603ae3cb66f71003ac49b95b74136c/TEST.jar

Verify generated class

[hdpclient@en01 lib]$ ll /tmp/sqoop-hdpclient/compile/aa603ae3cb66f71003ac49b95b74136c
total 16
-rw-rw-r--. 1 hdpclient hdpclient 8684 Jun  8 13:05 TEST.class
-rw-rw-r--. 1 hdpclient hdpclient 3626 Jun  8 13:05 TEST.jar
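The generated TEST class is Java, but conceptually it is just a record holder with a getter and setter per column plus parsing logic. As a rough Python analogue (illustrative only; the real generated code also handles MapReduce serialization):

```python
# Conceptual analogue of the codegen output for the TEST(id, name) table:
# one field, getter, and setter per column, plus a line parser.

class TestRecord:
    def __init__(self):
        self.id = None
        self.name = None

    # Getters/setters, as in the generated Java DAO class.
    def get_id(self): return self.id
    def set_id(self, v): self.id = v
    def get_name(self): return self.name
    def set_name(self, v): self.name = v

    @classmethod
    def parse(cls, line, delimiter=","):
        rec = cls()
        id_str, name = line.rstrip("\n").split(delimiter, 1)
        rec.set_id(int(id_str))
        rec.set_name(name)
        return rec

rec = TestRecord.parse("7,Hassan")
print(rec.get_id(), rec.get_name())  # 7 Hassan
```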


Sqoop - Eval tool

The eval tool lets users execute ad hoc queries against the database server and preview the results in the console. You can evaluate any type of SQL query, whether a DDL or DML statement.


[hdpclient@en01 lib]$ sqoop eval -connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=en01)(port=1521))(connect_data=(service_name=hadob)))" --username hdfsuser --password hdfsuser --query "SELECT * FROM test WHERE rownum <= 3"
.....

17/06/08 13:18:19 INFO manager.OracleManager: Time zone has been set to GMT
-------------------------------------
| ID                   | NAME       |
-------------------------------------
| 1                    | Ali        |
| 2                    | Abuzar     |
| 3                    | Zeeshan    |


-------------------------------------

[hdpclient@en01 lib]$ sqoop eval -connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=en01)(port=1521))(connect_data=(service_name=hadob)))" --username hdfsuser --password hdfsuser -e "INSERT INTO test VALUES(11,'Mohsin')"

....
....
17/06/08 13:21:32 INFO tool.EvalSqlTool: 1 row(s) updated.


SQL> select * from test;

        ID NAME
---------- ----------
         1 Ali
         2 Abuzar
         3 Zeeshan
         4 Hafiz
         5 Ikram
         6 Usman
         7 Hassan
        11 Mohsin

8 rows selected.


Sqoop - List Database tool
The sqoop list-databases tool lists the databases present on the server. In the case of Oracle, it shows the schemas.

[hdpclient@en01 lib]$ sqoop list-databases -connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=en01)(port=1521))(connect_data=(service_name=hadob)))" --username system --password manager


SYS
SYSTEM
....
....
HDFSUSER




Sqoop - List Tables tool
The list-tables tool lists the tables of a particular database/schema.
[hdpclient@en01 lib]$ sqoop list-tables --connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=en01)(port=1521))(connect_data=(service_name=hadob)))" --username hdfsuser --password hdfsuser

...
...
CATALOG_EXT
TEST
TEST2
EXP_FROM_HDFS


Connecting with SQLServer

Put the SQL Server JDBC driver in $SQOOP_HOME/lib, and then you can run the same sqoop commands as above against SQL Server; below is one example.


sqoop eval --connect "jdbc:sqlserver://sqlserver01;database=DBA" --username hdfsuser --password hdfsuser --query "SELECT * FROM [DBA].[dbo].[test]"

