Please see my other blog for Oracle EBusiness Suite Posts - EBMentors


Note: All the posts are based on a practical approach, avoiding lengthy theory. Everything has been tested on development servers. Please don't test any post on production servers until you are sure.

Saturday, June 24, 2017

Installing/Configuring and working with Apache Kafka


Introduction

Apache Kafka is an open-source, distributed publish-subscribe messaging system,
designed mainly for persistent messaging, high throughput, support for multiple clients, and real-time message visibility to consumers.

Kafka is a solution to the real-time problems of any software system: it handles large volumes of information in real time and routes it to multiple consumers quickly. Kafka provides seamless integration between producers and consumers without blocking the producers and without requiring them to know who the final consumers are. It also supports parallel data loading into Hadoop systems.
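The publish-subscribe flow can be exercised with the console tools that ship with Kafka. Below is a minimal sketch, assuming a single broker on localhost:9092 with ZooKeeper on localhost:2181 and a hypothetical topic named test; adjust these to your own setup.

# create a topic, then publish and consume a few messages from the console
kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
kafka-console-producer.sh --broker-list localhost:9092 --topic test        # type messages, one per line
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning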


Friday, June 23, 2017

Forward syslog to Flume with rsyslog


Introduction


Syslog  


In computing, syslog is a standard for message logging. It allows separation of the software that generates messages, the system that stores them, and the software that reports on and analyzes them. Each message is labeled with a facility code, indicating the type of software generating the message, and a severity level.
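Both labels are encoded in the priority value at the start of a raw syslog message (facility × 8 + severity). A small illustration using the standard logger utility, with an example message text:

# facility "auth" is 4 and severity "crit" is 2, so this message carries priority <34> (4*8 + 2)
logger -p auth.crit "su: authentication failure for user guest"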

Computer system designers may use syslog for system management and security auditing as well as general informational, analysis, and debugging messages. A wide variety of devices, such as printers, routers, and message receivers across many platforms use the syslog standard. This permits the consolidation of logging data from different types of systems in a central repository. Implementations of syslog exist for many operating systems.

Benefits of syslog

  • Helps analyze the root cause of any trouble or problem
  • Reduces overall downtime by helping to troubleshoot issues faster, with all the logs in one place
  • Improves incident management through active detection of issues
  • Supports self-determination of incidents along with auto-resolution
  • Simplifies the architecture with different severity levels such as error, info, and warning


In this post, I'll be using HDFS as the central repository for syslog messages and Hive as the analytical platform.

Rsyslog  


Rsyslog is an open-source software utility used on UNIX and Unix-like computer systems for forwarding log messages in an IP network. It implements the basic syslog protocol and extends it with content-based filtering, rich filtering capabilities, flexible configuration options, and features such as TCP transport.

Note:
Please review the post Streaming Twitter Data using Apache Flume before this one for a Flume introduction and architecture.

Flume's syslog TCP source  


The Syslog TCP source provides an endpoint for messages over TCP, allowing for a larger payload size and TCP retry semantics that should be used for any reliable inter-server communications.

To create a Syslog TCP source, set the type property to syslogtcp.

vi /usr/hadoopsw/apache-flume-1.7.0-bin/conf/syslog.conf

# Naming the components on the current agent.
agent.sources=SourceSyslog
agent.channels=ChannelMem
agent.sinks=SinkHDFS

# Describing/Configuring the source
#agent.sources.SourceSyslog.type=syslogudp
agent.sources.SourceSyslog.type=syslogtcp
agent.sources.SourceSyslog.host=0.0.0.0
agent.sources.SourceSyslog.port=12345
agent.sources.SourceSyslog.keepFields=true


# Describing/Configuring the channel
agent.channels.ChannelMem.type=memory
agent.channels.ChannelMem.capacity = 10000
agent.channels.ChannelMem.transactionCapacity = 1000

# Describing/Configuring the sink  
agent.sinks.SinkHDFS.type=hdfs
agent.sinks.SinkHDFS.hdfs.path = /flume/syslogs/
agent.sinks.SinkHDFS.hdfs.fileType = DataStream
agent.sinks.SinkHDFS.hdfs.writeFormat = Text
agent.sinks.SinkHDFS.hdfs.batchSize = 1000
agent.sinks.SinkHDFS.hdfs.rollSize = 0
agent.sinks.SinkHDFS.hdfs.rollCount = 10000

# Binding the source and sink to the channel
agent.sources.SourceSyslog.channels = ChannelMem
agent.sinks.SinkHDFS.channel = ChannelMem



The keepFields property tells the source to include the syslog fields as part of the body.

By default, these fields are simply removed from the body, as they become Flume header values. For the memory channel, the capacity property sets the maximum number of events the channel can hold (10,000 here). transactionCapacity sets the maximum number of events that can be written in a single transaction (a put) by a source's ChannelProcessor, the component responsible for moving data from the source to the channel. It is also the maximum number of events that can be read in a single transaction (a take) by the SinkProcessor, the component responsible for moving data from the channel to the sink.

Remember that if you increase the capacity property value, you will most likely have to increase the Java heap space as well, using the -Xmx, and optionally -Xms, parameters.
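A hedged sketch of one way to do that: Flume picks up JAVA_OPTS from conf/flume-env.sh (copy it from flume-env.sh.template if it does not exist yet); the 2 GB value below is only an example, size it to your channel capacity.

# append the heap settings to the Flume environment file used in this post
echo 'export JAVA_OPTS="-Xms512m -Xmx2048m"' >> /usr/hadoopsw/apache-flume-1.7.0-bin/conf/flume-env.sh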

Run Flume agent 

Create relevant folders in HDFS as mentioned in flume configuration

[hdpsysuser@te1-hdp-rp-nn01 ~]$ hdfs dfs -mkdir /flume/syslogs
[hdpsysuser@te1-hdp-rp-nn01 ~]$ hdfs dfs -chmod -R 777 /flume


Now you can run the Flume agent using the command below

flume-ng agent -n agent -c conf -f $FLUME_HOME/conf/syslog.conf -Dflume.root.logger=INFO,console


Test with nc


Now test the connection to your Flume agent using nc. nc is the command that runs netcat, a simple Unix utility that reads and writes data across network connections using the TCP or UDP protocol.


[hdpclient@te1-hdp-rp-en01 ~]$ nc localhost 12345  
Event-1

Type the line above and press Enter; the line should be transported to the Flume agent, which should write it to the HDFS location specified in the configuration file.

Test whether the data reached its final destination (HDFS)

[hdpclient@te1-hdp-rp-en01 ~]$ hdfs dfs -cat /flume/syslogs/FlumeData.1497263656231
Event-1


Use Hive to analyze (Optional)

create database flume;
use flume;

create external table syslog(line string) location '/flume/syslogs';

select line
from syslog
--where line like '%user root%'
--where line like '%Invalid user%'
where lower(line) like '%authentication failure%'
limit 5;
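As a hypothetical follow-up, the same table can be queried non-interactively with the Hive CLI; the LIKE pattern below is only an example, adjust it to your own log content.

# count how many forwarded syslog lines mention an authentication failure
hive -e "
use flume;
select count(*) as auth_failures
from syslog
where lower(line) like '%authentication failure%';"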

Configure remote logging with rsyslog (Unix Client) 

OS: Redhat 7.x


Configure rsyslog to send its events to another server using TCP.

1- Add the following line to the RULES section of /etc/rsyslog.conf. The remote host is given as name/ip:port (the port is optional and defaults to 514); a single @ forwards over UDP, while @@ forwards over TCP. Here we point it at the Flume agent's syslogtcp source:

#*.* @remote-host:514
*.*         @@192.168.44.134:12345

vi /etc/rsyslog.conf

#### RULES ####

# Log all kernel messages to the console.
# Logging much else clutters up the screen.
#kern.*                                                 /dev/console

# Log anything (except mail) of level info or higher.
# Don't log private authentication messages!
*.info;mail.none;authpriv.none;cron.none                /var/log/messages

# The authpriv file has restricted access.
authpriv.*                                              /var/log/secure

# Log all the mail messages in one place.
mail.*                                                  -/var/log/maillog

# Log cron stuff
cron.*                                                  /var/log/cron

# Everybody gets emergency messages
*.emerg                                                 :omusrmsg:*

# Save news errors of level crit and higher in a special file.
uucp,news.crit                                          /var/log/spooler

# Save boot messages also to boot.log
local7.*                                                /var/log/boot.log

# remote host is: name/ip:port, e.g. 192.168.0.1:514, port optional
#*.* @remote-host:514
*.*         @@192.168.44.134:12345


2- Restart rsyslog.

[hdpsysuser@te1-hdp-rp-dn04 ~]$ service rsyslog restart

Redirecting to /bin/systemctl restart  rsyslog.service
==== AUTHENTICATING FOR org.freedesktop.systemd1.manage-units ===
Authentication is required to manage system services or units.
Authenticating as: hdpsysuser
Password:
==== AUTHENTICATION COMPLETE ===
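If the test in the next step shows nothing arriving on the Flume side, also check that the agent's port is reachable from the client. On RHEL 7 with firewalld enabled, that may mean opening the port explicitly on the Flume host; a hedged example, assuming the default firewalld setup:

# run on the Flume agent host; 12345 is the syslogtcp source port from the configuration
firewall-cmd --permanent --add-port=12345/tcp
firewall-cmd --reload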

3- Test the configuration using the logger command, which makes entries in the system log.


[root@te1-hdp-rp-dn04 ~]# logger Test from Data Node - 4
Check your message

[root@te1-hdp-rp-dn04 ~]# tail /var/log/messages
Jun 12 14:18:59 te1-hdp-rp-dn04 fprintd: ** Message: entering main loop
Jun 12 14:19:19 te1-hdp-rp-dn04 su: (to root) hdpsysuser on pts/1
Jun 12 14:19:19 te1-hdp-rp-dn04 dbus[950]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Jun 12 14:19:19 te1-hdp-rp-dn04 dbus-daemon: dbus[950]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Jun 12 14:19:19 te1-hdp-rp-dn04 dbus[950]: [system] Successfully activated service 'org.freedesktop.problems'
Jun 12 14:19:19 te1-hdp-rp-dn04 dbus-daemon: dbus[950]: [system] Successfully activated service 'org.freedesktop.problems'
Jun 12 14:19:29 te1-hdp-rp-dn04 fprintd: ** Message: No devices in use, exit
Jun 12 14:19:39 te1-hdp-rp-dn04 hdpsysuser: Test from Data Node - 4
Jun 12 14:20:01 te1-hdp-rp-dn04 systemd: Started Session 11254 of user root.
Jun 12 14:20:01 te1-hdp-rp-dn04 systemd: Starting Session 11254 of user root.


4- Verify the Flume agent's HDFS location

[hdpclient@te1-hdp-rp-en01 ~]$ hdfs dfs -ls /flume/syslogs
Found 8 items
-rw-r--r--   3 hdpclient supergroup         10 2017-06-12 13:46 /flume/syslogs/FlumeData.1497264415313
-rw-r--r--   3 hdpclient supergroup         17 2017-06-12 13:50 /flume/syslogs/FlumeData.1497264697909
-rw-r--r--   3 hdpclient supergroup          9 2017-06-12 13:51 /flume/syslogs/FlumeData.1497264761727
-rw-r--r--   3 hdpclient supergroup         10 2017-06-12 13:55 /flume/syslogs/FlumeData.1497264970862
-rw-r--r--   3 hdpclient supergroup        497 2017-06-12 14:14 /flume/syslogs/FlumeData.1497266092851
-rw-r--r--   3 hdpclient supergroup         36 2017-06-12 14:16 /flume/syslogs/FlumeData.1497266217106
-rw-r--r--   3 hdpclient supergroup       1272 2017-06-12 14:16 /flume/syslogs/FlumeData.1497266252522
-rw-r--r--   3 hdpclient supergroup        176 2017-06-12 14:17 /flume/syslogs/FlumeData.1497266292306

Verify using browser


Check using the Hive table
Query the Hive table created earlier.





Monitor Flume metrics

You can configure the Flume agent to start an HTTP server that outputs JSON, which can be queried by outside mechanisms.

Start the Flume agent with these properties:

-Dflume.monitoring.type=http
-Dflume.monitoring.port=44444

flume-ng agent -n agent -c conf -f $FLUME_HOME/conf/syslog.conf -Dflume.root.logger=INFO,console -Dflume.monitoring.type=http -Dflume.monitoring.port=44444



Now, when you go to http://SERVER_OR_IP:44444/metrics, you will see something like the output below.


{
  "CHANNEL.ChannelMem": {
    "ChannelCapacity": "1000000",
    "ChannelFillPercentage": "0.0",
    "Type": "CHANNEL",
    "EventTakeSuccessCount": "14",
    "ChannelSize": "0",
    "EventTakeAttemptCount": "47",
    "StartTime": "1497273770141",
    "EventPutAttemptCount": "14",
    "EventPutSuccessCount": "14",
    "StopTime": "0"
  },
  "SINK.SinkHDFS": {
    "ConnectionCreatedCount": "2",
    "ConnectionClosedCount": "2",
    "Type": "SINK",
    "BatchCompleteCount": "0",
    "BatchEmptyCount": "31",
    "EventDrainAttemptCount": "14",
    "StartTime": "1497273770143",
    "EventDrainSuccessCount": "14",
    "BatchUnderflowCount": "2",
    "StopTime": "0",
    "ConnectionFailedCount": "0"
  },
  "SOURCE.SourceSyslog": {
    "EventReceivedCount": "14",
    "AppendBatchAcceptedCount": "0",
    "Type": "SOURCE",
    "EventAcceptedCount": "14",
    "AppendReceivedCount": "0",
    "StartTime": "1497273770202",
    "AppendAcceptedCount": "0",
    "OpenConnectionCount": "0",
    "AppendBatchReceivedCount": "0",
    "StopTime": "0"
  }
}




The channel's ChannelSize or ChannelFillPercentage metrics will give you a good idea of whether data is coming in faster than it is going out. They will also tell you whether the channel is sized large enough to ride out maintenance windows or outages at your data volume.
Looking at the sink, comparing EventDrainSuccessCount with EventDrainAttemptCount tells you how often output succeeds relative to the number of attempts. The ConnectionFailedCount metric is a good indicator of persistent connection problems.


A growing ConnectionCreatedCount metric can indicate that connections are dropping and reopening too often.
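A minimal sketch of pulling a single value out of that JSON from the shell, assuming the agent was started with the monitoring options above and is reachable on localhost:

# print the current fill percentage of the memory channel every 30 seconds
while true; do
  curl -s http://localhost:44444/metrics | python -c 'import sys, json; print(json.load(sys.stdin)["CHANNEL.ChannelMem"]["ChannelFillPercentage"])'
  sleep 30
done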


Streaming Twitter Data by Flume using Cloudera Twitter Source

In my previous post, Streaming Twitter Data using Apache Flume, tweets were fetched using Flume and the Twitter streaming API for data analysis. The Twitter source converts tweets to Avro format and sends Avro events to the downstream HDFS sink; when the Hive table backed by Avro loaded the data, I got an error message saying "Avro block size is invalid or too large". In order to overcome this issue, I used the Cloudera TwitterSource rather than the Apache TwitterSource.

Streaming Twitter Data using Apache Flume


Introduction                                                                           

Flume is a distributed service for efficiently collecting, aggregating, and moving large amounts of streaming event data. It is a highly reliable, distributed, and configurable tool. It is principally designed to copy streaming data (event/log data) from various web servers and services like Facebook and Twitter to HDFS.


Building Teradata Presto Cluster


Prerequisites:
Before working on this post you should review below posts.



Installing/Configuring PrestoDB
Working with PrestoDB Connectors

In this post, I'll be covering the following:

1- Installing and configuring Presto Admin
2- Installing Presto Cluster on a single node
3- Using Presto ODBC Driver
4- Installing and configuring Presto Cluster with one coordinator and three workers

Working with PrestoDB Connectors



Prerequisite:
Complete my previous post Installing/Configuring PrestoDB


Presto enables you to connect to other databases using connectors, in order to perform queries and joins over several sources; the connectors provide the metadata and data for those queries. In this post we will work with some of these connectors. A coordinator (the master daemon) uses connectors to get the metadata (such as table schemas) needed to build a query plan, while workers use connectors to get the actual data they process.
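A connector is registered by dropping a catalog properties file on every node. A minimal sketch for a Hive catalog, assuming the metastore runs on te1-hdp-rp-nn01 with its default Thrift port (adjust the path under your Presto installation directory):

# etc/catalog/hive.properties on each coordinator and worker
cat > etc/catalog/hive.properties <<'EOF'
connector.name=hive-hadoop2
hive.metastore.uri=thrift://te1-hdp-rp-nn01:9083
EOF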


Installing/Configuring PrestoDB

Introduction

Presto (invented at Facebook) is an open-source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. It allows querying data where it lives, including Hive, Cassandra, relational databases, or even proprietary data stores. Unlike Hive, Presto doesn't use the MapReduce framework for its execution. Instead, Presto directly accesses the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs. A single Presto query can combine data (through pluggable connectors) from multiple sources, allowing for analytics across your entire organization. It is targeted at analysts who expect response times ranging from sub-second to minutes.
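As an illustration of combining sources, a hypothetical query joining a Hive table with a MySQL table through two catalogs (all catalog, schema and table names here are assumptions for the example):

presto --server localhost:8080 --execute "
select o.order_id, c.customer_name
from hive.sales.orders o
join mysql.crm.customers c on o.customer_id = c.id
limit 10"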

Managing HDFS Quotas


The Hadoop Distributed File System (HDFS) allows the administrator to set quotas for the number of names used and the amount of space used for individual directories. Name quotas and space quotas operate independently, but the administration and implementation of the two types of quotas are closely parallel.
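A short sketch of the relevant commands, using a directory from this blog as an example:

hdfs dfsadmin -setQuota 10000 /flume/syslogs        # limit the number of names (files and directories)
hdfs dfsadmin -setSpaceQuota 10g /flume/syslogs     # limit the raw space consumed (replication included)
hdfs dfs -count -q /flume/syslogs                   # show both quotas and the current usage
hdfs dfsadmin -clrQuota /flume/syslogs              # remove the name quota again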

Hadoop DFSAdmin Commands

The dfsadmin tools are a specific set of tools designed to help you dig out information about your Hadoop Distributed File System (HDFS). As an added bonus, you can use them to perform some administration operations on HDFS as well.
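A few representative dfsadmin commands (run as the HDFS superuser):

hdfs dfsadmin -report            # capacity, usage and state of every DataNode
hdfs dfsadmin -safemode get      # check whether the NameNode is in safe mode
hdfs dfsadmin -refreshNodes      # re-read the include/exclude host files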

Recover the deleted file/folder in HDFS


By default, Hadoop deletes files and directories permanently, but sometimes they are deleted accidentally and you want to get them back. You have to enable the Trash feature for this purpose: two properties (fs.trash.interval and fs.trash.checkpoint.interval) are set in core-site.xml so that deleted files and directories are moved into the .Trash folder, which is located in HDFS at /user/$USER/.Trash.
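A hedged recovery sketch, assuming trash is enabled (fs.trash.interval > 0) and using one of the Flume files from this blog as an example path:

hdfs dfs -rm /flume/syslogs/FlumeData.1497264415313
# the file is moved into the trash rather than deleted; restore it with:
hdfs dfs -mv /user/$USER/.Trash/Current/flume/syslogs/FlumeData.1497264415313 /flume/syslogs/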

Hive Streaming



Streaming offers an alternative way to transform data. During a streaming job, the Hadoop Streaming API opens an I/O pipe to an external process. Data is then passed to the process, which operates on the data it reads from standard input and writes the results out through standard output, back to the Streaming API job.
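A minimal TRANSFORM sketch run with the Hive CLI: each row of the syslog table built earlier on this blog is piped through an external process (/bin/cat here, i.e. an identity transform); swap in your own script to do real work.

hive -e "
use flume;
select transform(line) using '/bin/cat' as (line_out string)
from syslog
limit 5;"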


Thursday, June 08, 2017

Installing/Configuring and Working on Apache Sqoop



Introduction


Apache Sqoop is a Hadoop ecosystem tool (a Hadoop client) designed to efficiently transfer bulk data between Apache Hadoop and structured datastores like Oracle. It helps offload certain tasks (such as ETL processing) from the EDW to Hadoop for efficient execution at a much lower cost. It can also be used to extract data from Hadoop and export it into external structured datastores.
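A hypothetical import sketch; the JDBC URL, credentials, table and target directory are all assumptions for illustration:

sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott -P \
  --table EMP \
  --target-dir /user/hdpclient/emp \
  --num-mappers 1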

Friday, June 02, 2017

Apache PIG - a Short Tutorial


Introduction

Apache Pig is an abstraction over MapReduce, developed as a research project at Yahoo in 2006 and open sourced via the Apache incubator in 2007. In 2008, the first release of Apache Pig came out, and in 2010 it graduated as an Apache top-level project. It is a tool/platform used to analyze large data sets by representing them as data flows. To write data analysis programs, Pig provides a high-level language known as Pig Latin. Scripts written in Pig Latin are internally converted to Map and Reduce tasks. Apache Pig has a component known as Pig Engine that accepts Pig Latin scripts as input and converts them into MapReduce jobs.
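A small Pig Latin sketch, counting the forwarded syslog lines per host; the whitespace-delimited field layout (month, day, time, host, ...) is an assumption, so adjust the schema to your data.

cat > count_hosts.pig <<'EOF'
logs    = LOAD '/flume/syslogs' USING PigStorage(' ') AS (mon:chararray, day:chararray, time:chararray, host:chararray);
by_host = GROUP logs BY host;
counts  = FOREACH by_host GENERATE group AS host, COUNT(logs) AS cnt;
DUMP counts;
EOF
pig count_hosts.pig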

Tuesday, May 30, 2017

Creating External Table for HDFS using Oracle Connector for Hadoop (OSCH)


Introduction


Oracle Big Data Connectors facilitate access to data stored in an Apache Hadoop cluster. They can be licensed for use on either Oracle Big Data Appliance or a Hadoop cluster running on commodity hardware. There are three connectors available, of which we are going to work with Oracle SQL Connector for Hadoop Distributed File System for the purpose of this post.

Sunday, May 14, 2017

Connect Oracle SQL Developer to Hive


Oracle SQL Developer is one of the most common SQL client tools used by developers, data analysts, and data architects to interact with Oracle and other relational systems, so extending SQL Developer to connect to Hive is very useful for Oracle users. You can use the SQL Worksheet to query, create, and alter Hive tables, dynamically accessing data sources defined in the Hive metastore.

Tuesday, May 02, 2017

Using Hadoop Compression


Hadoop Compression

Hive can read data from a variety of sources, such as text files, sequence files, or even custom formats, using Hadoop's InputFormat APIs, and can write data to various formats using the OutputFormat API. You can leverage Hadoop to store data in compressed form and save significant disk space. Compression can also increase throughput and performance. Compressing and decompressing data incurs extra CPU overhead; however, the I/O savings from moving fewer bytes into memory can result in a net performance gain.
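A hypothetical sketch of enabling compressed output for a Hive job (Snappy is used only as an example codec and must be available on the cluster):

hive -e "
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
insert overwrite directory '/tmp/syslog_compressed'
select * from flume.syslog;"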