

Monday, April 23, 2018

Integrating Hadoop and Elasticsearch


Introduction

Hadoop shines as a batch processing system, but serving real-time results can be challenging. For truly interactive data discovery, ES-Hadoop (the Elasticsearch-Hadoop connector) lets you index Hadoop data into the Elastic Stack to take full advantage of the speedy Elasticsearch engine and beautiful Kibana visualizations. With ES-Hadoop, you can easily build dynamic, embedded search applications to serve your Hadoop data, or perform deep, low-latency analytics using full-text and geospatial queries and aggregations.

At its core, elasticsearch-hadoop integrates two distributed systems: Hadoop, a distributed computing platform, and Elasticsearch, a real-time search and analytics engine. From a high-level view, both provide a computational component: Hadoop through MapReduce or more recent libraries like Apache Spark on one hand, and Elasticsearch through its search and aggregations on the other. The goal of elasticsearch-hadoop is to connect these two entities so that they can transparently benefit from each other.


Prerequisite

Working with Elasticsearch

Installation
elasticsearch-hadoop binaries can be obtained from the elastic.co site. Download and unzip the Elasticsearch for Apache Hadoop (ES-Hadoop) distribution; the elasticsearch-hadoop binary is suitable for Hadoop 2.x, and the distribution also provides minimalistic jars for each individual integration.
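For example, a minimal sketch of the download step (the exact version and download URL may differ in your environment):

# Download and unzip the ES-Hadoop distribution (6.2.3 shown here)
wget https://artifacts.elastic.co/downloads/elasticsearch-hadoop/elasticsearch-hadoop-6.2.3.zip
unzip elasticsearch-hadoop-6.2.3.zip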


Configuration

There are different configuration options available; for the purposes of this post I'll be using the Hive integration, which is what my environment requires.

After extracting the connector, place the jars in the Hive and Hadoop distribution folders.

Place the jars below in HIVE_HOME/lib:
elasticsearch-hadoop-hive-6.2.3-javadoc.jar
elasticsearch-hadoop-hive-6.2.3-sources.jar

Place the jar below in HADOOP_HOME/share/hadoop/hdfs/lib:
elasticsearch-hadoop-6.2.3.jar
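
Note that the -javadoc and -sources jars carry documentation and source code only; the runtime classes live in elasticsearch-hadoop-hive-6.2.3.jar (or the all-in-one elasticsearch-hadoop-6.2.3.jar), so make sure one of those is on the classpath. Alternatively, instead of copying jars into the distribution folders, you can point Hive at the connector jar when starting the CLI; the path below is an assumption for your environment:

# Start Hive with the ES-Hadoop Hive jar on the auxiliary classpath
hive --auxpath /usr/hadoopsw/elasticsearch-hadoop-6.2.3/dist/elasticsearch-hadoop-hive-6.2.3.jar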


Now you need to create the external table in Hive backed by Elasticsearch:


CREATE EXTERNAL TABLE customer (name string)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'customer/_doc',
              'es.index.auto.create' = 'false');

hive>
    > CREATE EXTERNAL TABLE customer (name string)
    > STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
    > TBLPROPERTIES('es.resource' = 'customer/_doc',
    >               'es.index.auto.create' = 'false');
OK
Time taken: 7.027 seconds
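
Since es.index.auto.create is set to false, the customer index must already exist in Elasticsearch before the table is queried. A minimal sketch to index one matching document (assuming a local node on port 9200):

curl -XPUT 'localhost:9200/customer/_doc/1?pretty' -H 'Content-Type: application/json' -d'
{
  "name": "Inaam Bukhary"
}
'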


hive> select * from customer;
OK
Inaam Bukhary
Time taken: 7.274 seconds, Fetched: 1 row(s)



elasticsearch-hadoop behavior can be customized through properties set while creating the external table.
es.resource: the Elasticsearch resource location where data is read from and written to. Requires the format <index>/<type>. Other commonly used properties are es.nodes (the Elasticsearch nodes to connect to, defaulting to localhost) and es.port (defaulting to 9200).

You can find the list of all properties at the link below:

https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html
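
For example, to point the same kind of table at a remote cluster you would set es.nodes and es.port explicitly. A sketch (the host name es-node1 and table name customer_remote are hypothetical):

# Create a Hive table backed by an Elasticsearch node on another host
hive -e "
CREATE EXTERNAL TABLE customer_remote (name string)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'customer/_doc',
              'es.nodes'    = 'es-node1',
              'es.port'     = '9200',
              'es.index.auto.create' = 'false');
"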


Reading data from Elasticsearch
The Hive table below is based on the bank index in Elasticsearch.

--Sample record from the bank index
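
This record is the response of a standard GET by id (assuming the index holds the familiar bank sample data on a local node):

curl -XGET 'localhost:9200/bank/_doc/1?pretty'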
{
  "_index" : "bank",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "account_number" : 1,
    "balance" : 39225,
    "firstname" : "Amber",
    "lastname" : "Duke",
    "age" : 32,
    "gender" : "M",
    "address" : "880 Holmes Lane",
    "employer" : "Pyrami",
    "email" : "amberduke@pyrami.com",
    "city" : "Brogan",
    "state" : "IL"
  }
}


CREATE EXTERNAL TABLE bank3 (account_number string, balance string, firstname string, lastname string, age string, gender string, address string, employer string, email string, city string, state string)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'bank/_doc',
              'es.index.auto.create' = 'false');



hive> describe bank;
OK
account_number          int                     from deserializer
balance                 int                     from deserializer
firstname               string                  from deserializer
lastname                string                  from deserializer
age                     int                     from deserializer
gender                  string                  from deserializer
address                 string                  from deserializer
employer                string                  from deserializer
email                   string                  from deserializer
Time taken: 0.746 seconds, Fetched: 9 row(s)
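
You can also push a filter down to Elasticsearch at table-definition time with the es.query property, so that only matching documents are ever read. A sketch using the URI-query form (the table name bank_brogan is hypothetical):

# Only documents with city = Brogan are exposed through this table
hive -e "
CREATE EXTERNAL TABLE bank_brogan (firstname string, lastname string, city string)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'bank/_doc',
              'es.query'    = '?q=city:Brogan',
              'es.index.auto.create' = 'false');
"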




Writing to Elasticsearch

drop table artists;

CREATE EXTERNAL TABLE artists (
    id      BIGINT,
    name    STRING
   )
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'artists/_doc','es.index.auto.create' = 'true');


-- Insert data into Elasticsearch from a source table

INSERT OVERWRITE TABLE artists SELECT DISTINCT s.id, s.name FROM employee s;

hive> INSERT OVERWRITE TABLE artists SELECT DISTINCT s.id, s.name FROM employee s;

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hdpsysuser_20180409191844_7b9806b6-833b-483d-8d34-9f7407a18a1f
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1523248038304_0014, Tracking URL = http://localhost:8088/proxy/application_1523248038304_0014/
Kill Command = /usr/hadoopsw/hadoop-2.7.3/bin/hadoop job  -kill job_1523248038304_0014
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-04-09 19:19:42,879 Stage-1 map = 0%,  reduce = 0%
2018-04-09 19:20:05,022 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 10.79 sec
2018-04-09 19:20:39,869 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 22.94 sec
2018-04-09 19:20:46,854 Stage-1 map = 100%,  reduce = 71%, Cumulative CPU 25.76 sec
2018-04-09 19:20:49,278 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 26.47 sec
MapReduce Total cumulative CPU time: 26 seconds 470 msec
Ended Job = job_1523248038304_0014
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 26.47 sec   HDFS Read: 11197 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 26 seconds 470 msec
OK
Time taken: 138.918 seconds
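
Once the job completes, a quick way to verify the write is to count the documents in the target index (assuming a local node):

curl -XGET 'localhost:9200/artists/_count?pretty'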


Querying data in Elasticsearch

--List of indices

curl -XGET 'localhost:9200/_cat/indices?v&pretty'

[hdpsysuser@hdpmaster ~]$ curl -XGET 'localhost:9200/_cat/indices?v&pretty'
health status index               uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   logstash-2015.05.18 BnCViLEGT6i21FYJnlCvvw   5   1       4631            0     22.5mb         22.5mb
yellow open   shakespeare         44dwGPXmTJ6eOOdRuGQf9w   5   1     111395            0     22.9mb         22.9mb
yellow open   logstash-2015.05.20 5bQkSCxGT2Oka-hNjl1c-A   5   1       4750            0     23.6mb         23.6mb
yellow open   bank                ymIPSIiDS9i_iiQeM6mD1w   5   1       1000            0      499kb          499kb
yellow open   logstash-2015.05.19 gvUQBcvVSweGcHob0t7FXw   5   1       4624            0     24.2mb         24.2mb
yellow open   customer            f-9cBDSlQ-aNmAWhHhj_rA   5   1          1            0      4.5kb          4.5kb
yellow open   artists             E_OTOKO3TTGLWg8D6jRltg   5   1         14            0     19.3kb         19.3kb

Query index information

[hdpsysuser@hdpmaster ~]$ curl -XGET 'localhost:9200/artists/?pretty&pretty'
{
  "artists" : {
    "aliases" : { },
    "mappings" : {
      "_doc" : {
        "properties" : {
          "id" : {
            "type" : "long"
          },
          "name" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1523301641741",
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "uuid" : "E_OTOKO3TTGLWg8D6jRltg",
        "version" : {
          "created" : "6020399"
        },
        "provided_name" : "artists"
      }
    }
  }
}


Query index documents

curl -XGET 'localhost:9200/artists/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": { "match_all": {} }
}
'

[hdpsysuser@hdpmaster ~]$ curl -XGET 'localhost:9200/artists/_search?pretty' -H 'Content-Type: application/json' -d'
> {
>   "query": { "match_all": {} }
> }
> '
{
  "took" : 117,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 14,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "artists",
        "_type" : "_doc",
        "_id" : "Xxbaq2IBeMlMXEgTgPjf",
        "_score" : 1.0,
        "_source" : {
          "id" : 7782,
          "name" : "CLARK"
        }
      },
...
...


--Search by a specific field value

curl -XGET 'localhost:9200/artists/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "id": "7369" } }       
      ]
    }
  }
}
'
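
The same lookup can also be expressed as a URI search, which is handy for quick checks from the shell:

curl -XGET 'localhost:9200/artists/_search?q=id:7369&pretty'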
