Please see my other blog for Oracle E-Business Suite posts - EBMentors


Note: All posts take a practical approach and avoid lengthy theory. Everything has been tested on development servers. Please don't try any post on a production server until you are sure.

Thursday, November 09, 2017

Diagnostics: Hive CLI is hanging On HDP


On our HDP cluster, the Hive CLI shell sometimes just hangs.




Diagnostics: Fix Under replicated blocks [Ambari Dashboard]


I see the warning below in the Ambari dashboard under the HDFS Summary.
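The usual fix is to locate the affected files with `hdfs fsck` and then re-assert the replication factor. As a hedged sketch, the Python below parses the kind of per-file lines that `hdfs fsck /` prints for under-replicated blocks (the sample output and paths are invented for illustration, not taken from a real cluster):

```python
import re

def under_replicated_files(fsck_output):
    """Extract (path, live_replicas, target_replicas) tuples from
    'hdfs fsck /' output lines that report under-replicated blocks."""
    pattern = re.compile(
        r"^(\S+):\s+Under replicated \S+\. "
        r"Target Replicas is (\d+) but found (\d+) live replica"
    )
    results = []
    for line in fsck_output.splitlines():
        m = pattern.match(line.strip())
        if m:
            results.append((m.group(1), int(m.group(3)), int(m.group(2))))
    return results

# Hypothetical sample of 'hdfs fsck /' output, for illustration only
sample = """\
/tmp/data/file1.csv: Under replicated BP-1-blk_1. Target Replicas is 3 but found 2 live replica(s).
/apps/hive/warehouse/t1/000000_0: Under replicated BP-1-blk_2. Target Replicas is 3 but found 1 live replica(s).
"""
print(under_replicated_files(sample))
```

Once you have the list of paths, `hdfs dfs -setrep -w 3 <path>` on each one asks the NameNode to bring the replica count back up to the target.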

Wednesday, November 08, 2017

Using HDP Zeppelin



Apache Zeppelin is a web-based notebook that enables interactive data analytics. With Zeppelin, you can build beautiful, data-driven, interactive and collaborative documents with a rich set of pre-built language backends, or interpreters (an interpreter is a plugin that lets you access processing engines and data sources from the Zeppelin UI), such as Scala (with Apache Spark), Python (with Apache Spark), SparkSQL, Hive, Markdown, Angular, and Shell.


Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. Apache Spark is supported in Zeppelin through the Spark interpreter group, which consists of the interpreters below.
Name              Class                  Description
%spark2           SparkInterpreter       Creates a SparkContext and provides a Scala environment
%spark2.pyspark   PySparkInterpreter     Provides a Python environment
%r                SparkRInterpreter      Provides an R environment with SparkR support
%sql              SparkSQLInterpreter    Provides a SQL environment
%dep              DepInterpreter         Dependency loader



Tuesday, November 07, 2017

Using Apache Phoenix on HDP



Apache Phoenix is an open source, massively parallel, relational database engine supporting OLTP for Hadoop, using Apache HBase as its backing store. It is a SQL abstraction layer for interacting with HBase: Phoenix translates SQL to native HBase API calls, and it provides JDBC, ODBC, and Python drivers.

Monday, November 06, 2017

Working with HBase on HDP

Introduction
Apache HBase is a NoSQL database that runs on a Hadoop cluster. It is ideal for storing unstructured or semi-structured data. It was designed to scale: data that is accessed together is stored together, which lets you build big data applications that scale past the limitations of relational databases.
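That locality property comes down to row key design: HBase stores rows sorted byte-wise by row key, so keys that share a prefix end up stored together. A minimal sketch of one common scheme, a composite key of entity id plus reversed timestamp so a single entity's rows cluster together with the newest first (the sensor-id naming here is an illustrative assumption, not from the post):

```python
import struct

LONG_MAX = 2**63 - 1

def row_key(entity_id, ts_millis):
    """Composite row key: entity id prefix keeps one entity's rows
    adjacent; reversed timestamp makes newer rows sort first."""
    reversed_ts = LONG_MAX - ts_millis
    return entity_id.encode("utf-8") + b"|" + struct.pack(">q", reversed_ts)

k1 = row_key("sensor-42", 1_500_000_000_000)
k2 = row_key("sensor-42", 1_500_000_100_000)
assert k2 < k1  # the newer event sorts first within the same sensor
```

A scan over the prefix `sensor-42|` then reads one contiguous range of rows instead of touching the whole table.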

Thursday, August 10, 2017

Working with Talend for Big Data (TOSBD)


Introduction

Talend (Eclipse-based) provides unified development and management tools to integrate and process all of your data with an easy-to-use visual designer. It helps companies become data-driven by making data more accessible, improving its quality, and quickly moving it where it's needed for real-time decision making.
Talend for Big Data is built on top of Talend's data integration solution. It enables users to access, transform, move and synchronize big data by leveraging the Apache Hadoop platform, and makes that platform much easier to use.

Tuesday, August 08, 2017

Analyzing/Parsing syslogs using Hive and Presto


Scenario


My company asked me to provide a syslog aggregation solution for all environments so that the teams can analyze the logs and get insights. Logs should be captured first, then retained, and finally processed by the analyst team in the way they already query and process data with a database. The requirements are not very clear, and the volume of data can't be determined at this stage.
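Before committing to a Hive table layout, it helps to prototype the parse. The sketch below applies a classic BSD (RFC 3164) syslog pattern in plain Python; the same regular expression could later be handed to Hive's RegexSerDe to expose the fields as table columns. The sample log line is invented for illustration:

```python
import re

# Pattern for classic BSD (RFC 3164) syslog lines:
# timestamp, host, process[, pid], message
SYSLOG_RE = re.compile(
    r"^(\w{3}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})"   # e.g. 'Aug  8 11:22:33'
    r"\s+(\S+)"                                 # host
    r"\s+([\w\-/\.]+)"                          # process name
    r"(?:\[(\d+)\])?"                           # optional pid
    r":\s*(.*)$"                                # message body
)

def parse_syslog(line):
    m = SYSLOG_RE.match(line)
    if not m:
        return None
    ts, host, proc, pid, msg = m.groups()
    return {"ts": ts, "host": host, "process": proc, "pid": pid, "message": msg}

line = "Aug  8 11:22:33 web01 sshd[2412]: Accepted password for oracle from 10.0.0.5"
print(parse_syslog(line))
```

If the prototype parses cleanly on real samples, the analysts can keep their SQL workflow: the raw lines land in HDFS, and a regex-backed Hive table (queryable from Presto as well) presents them as structured columns.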

Wednesday, August 02, 2017

Working with Apache Cassandra (RHEL 7)


Introduction
Cassandra (created at Facebook for inbox search), like HBase, is a NoSQL database; generally this means you cannot manipulate it with SQL. However, Cassandra implements CQL (Cassandra Query Language), whose syntax is obviously modeled after SQL and which is designed to manage extremely large data sets. It is a distributed database: clients can connect to any node in the cluster and access any data.
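That "connect to any node, access any data" property rests on consistent hashing: every node owns a position on a token ring, and the hash of a row's partition key decides which node stores it, so any node can compute where data lives and route the request. A toy sketch of the idea (MD5 stands in for Cassandra's actual Murmur3 partitioner, and the node addresses are made up):

```python
import bisect
import hashlib

class TokenRing:
    """Toy sketch of Cassandra-style partitioning: each node owns a
    token, and a partition key is routed to the owner of the first
    token at or after the key's hash, wrapping around the ring."""

    def __init__(self, nodes):
        self.ring = sorted((self.token(n), n) for n in nodes)
        self.tokens = [tok for tok, _ in self.ring]

    @staticmethod
    def token(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, partition_key):
        t = self.token(partition_key)
        i = bisect.bisect_right(self.tokens, t) % len(self.ring)
        return self.ring[i][1]

ring = TokenRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print(ring.node_for("user:alice"))
```

Because the routing is a pure function of the key, every node reaches the same answer independently; the node a client happens to contact simply forwards the request to the owner (acting as the coordinator, in Cassandra's terms).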