Please see my other blog for Oracle E-Business Suite posts - EBMentors


Note: All posts take a practical approach and avoid lengthy theory. Everything has been tested on development servers. Please don't try any post on a production server until you are sure.

Thursday, August 10, 2017

Working with Talend for Big Data (TOSBD)


Introduction

Talend (Eclipse based) provides unified development and management tools to integrate and process all of your data with an easy-to-use visual designer. It helps companies become data driven by making data more accessible, improving its quality, and quickly moving it where it's needed for real-time decision making.
Talend for Big Data is built on top of Talend's data integration solution. It enables users to access, transform, move, and synchronize big data by leveraging the Apache Hadoop platform, and makes Hadoop much easier to use.

Tuesday, August 08, 2017

Analyzing/Parsing syslogs using Hive and Presto


Scenario

My company asked me to provide a solution for syslog aggregation across all environments so that the teams can analyze the logs and gain insights. Logs should first be captured, then retained, and finally processed by the analyst team in the way they already query and process data with a database. The requirements are not very clear, and the volume of data cannot be determined at this stage.
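
As a first sketch of what "query the logs like a database" could look like, the snippet below exposes raw syslog files to SQL through a Hive external table and then runs an analyst-style query against the same table from Presto. The host names, HDFS path, table name, and regex are assumptions for illustration, not the final design.

```python
# Sketch: expose raw syslog files to SQL via a Hive external table,
# then query them from Presto with PyHive. Hosts, the HDFS path and the
# regex are illustrative assumptions, not the final design.
from pyhive import hive, presto  # pip install 'pyhive[hive,presto]'

hive_conn = hive.connect(host="hive-server.example.com", port=10000)
cur = hive_conn.cursor()

# External table: Hive reads the files in place and does not take ownership of them.
cur.execute("""
CREATE EXTERNAL TABLE IF NOT EXISTS syslog_raw (
  log_month STRING, log_day STRING, log_time STRING,
  host STRING, process STRING, message STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\\\\w+)\\\\s+(\\\\d+)\\\\s+(\\\\S+)\\\\s+(\\\\S+)\\\\s+(\\\\S+):\\\\s+(.*)")
STORED AS TEXTFILE
LOCATION '/data/syslog/'
""")

# Analysts can then run familiar SQL through Presto against the same table.
presto_cur = presto.connect(host="presto-coordinator.example.com", port=8080).cursor()
presto_cur.execute(
    "SELECT host, count(*) AS events FROM syslog_raw "
    "GROUP BY host ORDER BY events DESC LIMIT 10")
for row in presto_cur.fetchall():
    print(row)
```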

Wednesday, August 02, 2017

Working with Apache Cassandra (RHEL 7)


Introduction
Cassandra (created at Facebook for inbox search), like HBase, is a NoSQL database, which generally means you cannot manipulate it with SQL. However, Cassandra implements CQL (Cassandra Query Language), whose syntax is clearly modeled after SQL and is designed to manage and manipulate extremely large data sets. Cassandra is a distributed database: clients can connect to any node in the cluster and access any data.
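
A minimal sketch of that "connect to any node, talk CQL" idea with the DataStax Python driver; the contact points, keyspace, and table names are illustrative assumptions.

```python
# Sketch: connect to a Cassandra cluster and run a few CQL statements.
# Contact points, keyspace and table names are illustrative assumptions.
from cassandra.cluster import Cluster  # pip install cassandra-driver

# Any node can serve as a contact point; the driver discovers the rest of the ring.
cluster = Cluster(["cass-node1.example.com", "cass-node2.example.com"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("demo")
session.execute("CREATE TABLE IF NOT EXISTS users (user_id int PRIMARY KEY, name text)")

# CQL reads like SQL, but data modelling is query-driven rather than relational.
session.execute("INSERT INTO users (user_id, name) VALUES (%s, %s)", (1, "inam"))
for row in session.execute("SELECT user_id, name FROM users"):
    print(row.user_id, row.name)

cluster.shutdown()
```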

Tuesday, August 01, 2017

Hortonworks - Using HDP Spark SQL



Using SQLContext, Apache Spark SQL can read data directly from the file system. This is useful when the data you are trying to analyze does not reside in Apache Hive (for example, JSON files stored in HDFS).
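
A minimal PySpark sketch of that idea, assuming a JSON file already sits at a hypothetical HDFS path; submit it with spark-submit (or run it in the pyspark shell) so the master comes from the cluster configuration.

```python
# Sketch: read JSON straight from HDFS with Spark SQL, bypassing Hive.
# The HDFS path and column names are illustrative assumptions.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="json-over-hdfs")   # master is supplied by spark-submit
sqlContext = SQLContext(sc)

# Spark infers the schema directly from the JSON documents.
people = sqlContext.read.json("hdfs:///data/people.json")
people.printSchema()

# Register the DataFrame as a temporary table and query it with SQL.
people.registerTempTable("people")
adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()

sc.stop()
```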

Monday, July 31, 2017

Installing/Configuring Hortonworks Data Platform [HDP]

Ambari is a completely open source management platform for provisioning, managing, monitoring, and securing Apache Hadoop clusters. Apache Ambari takes the guesswork out of operating Hadoop. As part of the Hortonworks Data Platform, it allows enterprises to plan, install, and securely configure HDP, making it easier to provide ongoing cluster maintenance and management, no matter the size of the cluster.
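
As a small illustration of how Ambari exposes the cluster for management, the sketch below queries its REST API with Python; the host, port, credentials, and cluster name are assumptions.

```python
# Sketch: query the Ambari REST API for cluster and service information.
# Host, port, credentials and cluster name are illustrative assumptions.
import requests

AMBARI = "http://ambari-server.example.com:8080/api/v1"
AUTH = ("admin", "admin")                 # default credentials; change in practice
HEADERS = {"X-Requested-By": "ambari"}    # required by Ambari for write requests

# List the clusters managed by this Ambari server.
clusters = requests.get(f"{AMBARI}/clusters", auth=AUTH, headers=HEADERS).json()
for item in clusters["items"]:
    print("Cluster:", item["Clusters"]["cluster_name"])

# Show the state of each service in a given cluster (assumed name: 'hdp_cluster').
services = requests.get(f"{AMBARI}/clusters/hdp_cluster/services",
                        auth=AUTH, headers=HEADERS).json()
for svc in services["items"]:
    detail = requests.get(svc["href"], auth=AUTH, headers=HEADERS).json()
    print(detail["ServiceInfo"]["service_name"], detail["ServiceInfo"]["state"])
```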


Monday, July 10, 2017

Working with Apache Spark SQL


What is Spark?
Apache Spark is a lightning-fast, in-memory cluster computing technology designed for fast computation. Spark does not depend on Hadoop because it has its own cluster management; Hadoop is just one way to deploy Spark, which can use Hadoop (HDFS) for storage. Spark extends the MapReduce model to efficiently support more types of computation, including interactive queries and stream processing.
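
To make the "own cluster management, Hadoop optional" point concrete, here is a minimal sketch that runs Spark locally on an in-memory collection, with no Hadoop involved; the numbers are made up.

```python
# Sketch: Spark running locally on an in-memory collection, no Hadoop needed.
# local[*] uses all cores on the current machine; the data is made up.
from pyspark import SparkContext

sc = SparkContext(master="local[*]", appName="spark-intro")

numbers = sc.parallelize(range(1, 1001))

# Chained transformations go beyond a single map/reduce pass,
# and cache() keeps the intermediate result in memory for reuse.
evens = numbers.filter(lambda n: n % 2 == 0).cache()

print("count of evens:", evens.count())
print("sum of squares of evens:",
      evens.map(lambda n: n * n).reduce(lambda a, b: a + b))

sc.stop()
```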



Saturday, June 24, 2017

Installing/Configuring and working with Apache Kafka


Introduction

Apache Kafka is an open source, distributed publish-subscribe messaging system, mainly designed for persistent messaging, high throughput, support for multiple clients, and real-time message visibility for consumers.

Kafka addresses the real-time needs of any software solution: handling real-time volumes of information and routing it to multiple consumers quickly. Kafka provides seamless integration between producers and consumers of information without blocking the producers, and without the producers needing to know who the final consumers are. It also supports parallel data loading into Hadoop systems.
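
A minimal producer/consumer sketch with the kafka-python client, assuming a broker at a made-up address and a hypothetical topic name; it shows the decoupling described above, since the consumer reads the topic without the producer knowing about it.

```python
# Sketch: publish and consume messages with the kafka-python client.
# The broker address and topic name are illustrative assumptions.
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

BROKER = "kafka-broker.example.com:9092"
TOPIC = "syslog-events"

# Producer: sends messages to the topic; the broker persists them durably.
producer = KafkaProducer(bootstrap_servers=BROKER)
for i in range(5):
    producer.send(TOPIC, key=str(i).encode(), value=f"event {i}".encode())
producer.flush()
producer.close()

# Consumer: reads the same topic independently; producers never know who consumes.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,   # stop iterating when no new messages arrive
)
for msg in consumer:
    print(msg.key, msg.value)
consumer.close()
```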