

Monday, April 23, 2018

Integrating Hadoop and Elasticsearch


Introduction

Hadoop shines as a batch processing system, but serving real-time results can be challenging. For truly interactive data discovery, ES-Hadoop (the Elasticsearch-Hadoop connector) lets you index Hadoop data into the Elastic Stack to take full advantage of the speedy Elasticsearch engine and beautiful Kibana visualizations. With ES-Hadoop, you can easily build dynamic, embedded search applications to serve your Hadoop data, or perform deep, low-latency analytics using full-text and geospatial queries and aggregations.

At its core, elasticsearch-hadoop integrates two distributed systems: Hadoop, a distributed computing platform, and Elasticsearch, a real-time search and analytics engine. From a high-level view, both provide a computational component: Hadoop through MapReduce or more recent libraries like Apache Spark on one hand, and Elasticsearch through its search and aggregations on the other. The goal of elasticsearch-hadoop is to connect these two entities so that they can transparently benefit from each other.


Prerequisite

Working with Elasticsearch

Installation
elasticsearch-hadoop binaries can be obtained from the elastic.co site. Download and unzip the Elasticsearch for Apache Hadoop (ES-Hadoop) distribution; the elasticsearch-hadoop binary is suitable for Hadoop 2.x, and the distribution also provides minimalistic jars for each individual integration.
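For example, a minimal sketch of the download step (the exact version and download URL may differ in your environment):

# Download and unzip the ES-Hadoop distribution (6.2.3 shown here)
wget https://artifacts.elastic.co/downloads/elasticsearch-hadoop/elasticsearch-hadoop-6.2.3.zip
unzip elasticsearch-hadoop-6.2.3.zip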


Configuration

There are different configuration options available; for the purposes of this post I'll be using the Hive integration, which is what my environment requires.

After extracting the connector, place the jars in the Hive and Hadoop distribution folders.

Place the jars below in HIVE_HOME/lib:
elasticsearch-hadoop-hive-6.2.3-javadoc.jar
elasticsearch-hadoop-hive-6.2.3-sources.jar

Place the jar below in HADOOP_HOME/share/hadoop/hdfs/lib:
elasticsearch-hadoop-6.2.3.jar
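
Note that the -javadoc and -sources jars carry documentation and source code only; the runtime classes live in elasticsearch-hadoop-hive-6.2.3.jar (or the all-in-one elasticsearch-hadoop-6.2.3.jar), so make sure one of those is on the classpath. Alternatively, instead of copying jars into the distribution folders, you can point Hive at the connector jar when starting the CLI; the path below is an assumption for your environment:

# Start Hive with the ES-Hadoop Hive jar on the auxiliary classpath
hive --auxpath /usr/hadoopsw/elasticsearch-hadoop-6.2.3/dist/elasticsearch-hadoop-hive-6.2.3.jar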


Now you need to create the external table in Hive backed by Elasticsearch:


CREATE EXTERNAL TABLE customer (name string)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'customer/_doc',
              'es.index.auto.create' = 'false');

hive>
    > CREATE EXTERNAL TABLE customer (name string)
    > STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
    > TBLPROPERTIES('es.resource' = 'customer/_doc',
    >               'es.index.auto.create' = 'false');
OK
Time taken: 7.027 seconds
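
Since es.index.auto.create is set to false, the customer index must already exist in Elasticsearch before the table is queried. A minimal sketch to index one matching document (assuming a local node on port 9200):

curl -XPUT 'localhost:9200/customer/_doc/1?pretty' -H 'Content-Type: application/json' -d'
{
  "name": "Inaam Bukhary"
}
'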


hive> select * from customer;
OK
Inaam Bukhary
Time taken: 7.274 seconds, Fetched: 1 row(s)



elasticsearch-hadoop behavior can be customized through properties set while creating the external table.
es.resource: the Elasticsearch resource location where data is read from and written to. Requires the format <index>/<type>. Other commonly used properties are es.nodes (the Elasticsearch nodes to connect to, defaulting to localhost) and es.port (defaulting to 9200).

You can find the list of all properties at the link below:

https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html
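
For example, to point the same kind of table at a remote cluster you would set es.nodes and es.port explicitly. A sketch (the host name es-node1 and table name customer_remote are hypothetical):

# Create a Hive table backed by an Elasticsearch node on another host
hive -e "
CREATE EXTERNAL TABLE customer_remote (name string)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'customer/_doc',
              'es.nodes'    = 'es-node1',
              'es.port'     = '9200',
              'es.index.auto.create' = 'false');
"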


Reading data from Elasticsearch
The Hive table below is based on the bank index in Elasticsearch.

--Sample record from the bank index
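
This record is the response of a standard GET by id (assuming the index holds the familiar bank sample data on a local node):

curl -XGET 'localhost:9200/bank/_doc/1?pretty'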
{
  "_index" : "bank",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "account_number" : 1,
    "balance" : 39225,
    "firstname" : "Amber",
    "lastname" : "Duke",
    "age" : 32,
    "gender" : "M",
    "address" : "880 Holmes Lane",
    "employer" : "Pyrami",
    "email" : "amberduke@pyrami.com",
    "city" : "Brogan",
    "state" : "IL"
  }
}


CREATE EXTERNAL TABLE bank3 (account_number string, balance string, firstname string, lastname string, age string, gender string, address string, employer string, email string, city string, state string)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'bank/_doc',
              'es.index.auto.create' = 'false');



hive> describe bank;
OK
account_number          int                     from deserializer
balance                 int                     from deserializer
firstname               string                  from deserializer
lastname                string                  from deserializer
age                     int                     from deserializer
gender                  string                  from deserializer
address                 string                  from deserializer
employer                string                  from deserializer
email                   string                  from deserializer
Time taken: 0.746 seconds, Fetched: 9 row(s)
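
You can also push a filter down to Elasticsearch at table-definition time with the es.query property, so that only matching documents are ever read. A sketch using the URI-query form (the table name bank_brogan is hypothetical):

# Only documents with city = Brogan are exposed through this table
hive -e "
CREATE EXTERNAL TABLE bank_brogan (firstname string, lastname string, city string)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'bank/_doc',
              'es.query'    = '?q=city:Brogan',
              'es.index.auto.create' = 'false');
"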




Writing to Elasticsearch

drop table artists;

CREATE EXTERNAL TABLE artists (
    id      BIGINT,
    name    STRING
   )
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'artists/_doc','es.index.auto.create' = 'true');


-- Insert data into Elasticsearch from a source table

INSERT OVERWRITE TABLE artists SELECT DISTINCT s.id, s.name FROM employee s;

hive> INSERT OVERWRITE TABLE artists SELECT DISTINCT s.id, s.name FROM employee s;

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hdpsysuser_20180409191844_7b9806b6-833b-483d-8d34-9f7407a18a1f
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1523248038304_0014, Tracking URL = http://localhost:8088/proxy/application_1523248038304_0014/
Kill Command = /usr/hadoopsw/hadoop-2.7.3/bin/hadoop job  -kill job_1523248038304_0014
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-04-09 19:19:42,879 Stage-1 map = 0%,  reduce = 0%
2018-04-09 19:20:05,022 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 10.79 sec
2018-04-09 19:20:39,869 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 22.94 sec
2018-04-09 19:20:46,854 Stage-1 map = 100%,  reduce = 71%, Cumulative CPU 25.76 sec
2018-04-09 19:20:49,278 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 26.47 sec
MapReduce Total cumulative CPU time: 26 seconds 470 msec
Ended Job = job_1523248038304_0014
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 26.47 sec   HDFS Read: 11197 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 26 seconds 470 msec
OK
Time taken: 138.918 seconds
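
Once the job completes, a quick way to verify the write is to count the documents in the target index (assuming a local node):

curl -XGET 'localhost:9200/artists/_count?pretty'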


Querying data in Elasticsearch

--List of indices

curl -XGET 'localhost:9200/_cat/indices?v&pretty'

[hdpsysuser@hdpmaster ~]$ curl -XGET 'localhost:9200/_cat/indices?v&pretty'
health status index               uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   logstash-2015.05.18 BnCViLEGT6i21FYJnlCvvw   5   1       4631            0     22.5mb         22.5mb
yellow open   shakespeare         44dwGPXmTJ6eOOdRuGQf9w   5   1     111395            0     22.9mb         22.9mb
yellow open   logstash-2015.05.20 5bQkSCxGT2Oka-hNjl1c-A   5   1       4750            0     23.6mb         23.6mb
yellow open   bank                ymIPSIiDS9i_iiQeM6mD1w   5   1       1000            0      499kb          499kb
yellow open   logstash-2015.05.19 gvUQBcvVSweGcHob0t7FXw   5   1       4624            0     24.2mb         24.2mb
yellow open   customer            f-9cBDSlQ-aNmAWhHhj_rA   5   1          1            0      4.5kb          4.5kb
yellow open   artists             E_OTOKO3TTGLWg8D6jRltg   5   1         14            0     19.3kb         19.3kb

Query index information

[hdpsysuser@hdpmaster ~]$ curl -XGET 'localhost:9200/artists/?pretty&pretty'
{
  "artists" : {
    "aliases" : { },
    "mappings" : {
      "_doc" : {
        "properties" : {
          "id" : {
            "type" : "long"
          },
          "name" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1523301641741",
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "uuid" : "E_OTOKO3TTGLWg8D6jRltg",
        "version" : {
          "created" : "6020399"
        },
        "provided_name" : "artists"
      }
    }
  }
}


Query index documents

curl -XGET 'localhost:9200/artists/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": { "match_all": {} }
}
'

[hdpsysuser@hdpmaster ~]$ curl -XGET 'localhost:9200/artists/_search?pretty' -H 'Content-Type: application/json' -d'
> {
>   "query": { "match_all": {} }
> }
> '
{
  "took" : 117,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 14,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "artists",
        "_type" : "_doc",
        "_id" : "Xxbaq2IBeMlMXEgTgPjf",
        "_score" : 1.0,
        "_source" : {
          "id" : 7782,
          "name" : "CLARK"
        }
      },
...
...


--Search by a specific field value

curl -XGET 'localhost:9200/artists/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "id": "7369" } }       
      ]
    }
  }
}
'
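
The same lookup can also be expressed as a URI search, which is handy for quick checks from the shell:

curl -XGET 'localhost:9200/artists/_search?q=id:7369&pretty'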
