Please see my other blog for Oracle EBusiness Suite Posts - EBMentors


Note: All the posts are based on a practical approach, avoiding lengthy theory. Everything has been tested on development servers. Please don't test any post on production servers until you are sure.

Saturday, June 24, 2017

Installing/Configuring and working with Apache Kafka


Introduction

Apache Kafka is an open-source, distributed publish-subscribe messaging system,
designed mainly for persistent messaging, high throughput, support for multiple clients, and real-time message visibility to consumers.

Kafka is a solution to the real-time problems of any software system: it handles large volumes of information in real time and routes it to multiple consumers quickly. Kafka provides seamless integration between producers and consumers without blocking the producers and without requiring them to know who the final consumers are. It also supports parallel data loading into Hadoop systems.
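The publish-subscribe flow can be exercised with the console tools that ship with Kafka. Below is a minimal sketch, assuming a single broker on localhost:9092 with ZooKeeper on localhost:2181 and a hypothetical topic named test; adjust these to your own setup.

# create a topic, then publish and consume a few messages from the console
kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
kafka-console-producer.sh --broker-list localhost:9092 --topic test        # type messages, one per line
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning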


Friday, June 23, 2017

Forward syslog to Flume with rsyslog


Introduction


Syslog  


In computing, syslog is a standard for message logging. It allows separation of the software that generates messages, the system that stores them, and the software that reports on and analyzes them. Each message is labeled with a facility code, indicating the type of software generating the message, and a severity level.
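Both labels are encoded in the priority value at the start of a raw syslog message (facility × 8 + severity). A small illustration using the standard logger utility, with an example message text:

# facility "auth" is 4 and severity "crit" is 2, so this message carries priority <34> (4*8 + 2)
logger -p auth.crit "su: authentication failure for user guest"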

Computer system designers may use syslog for system management and security auditing as well as general informational, analysis, and debugging messages. A wide variety of devices, such as printers, routers, and message receivers across many platforms use the syslog standard. This permits the consolidation of logging data from different types of systems in a central repository. Implementations of syslog exist for many operating systems.

Benefits of syslog

  • Helps analyze the root cause of any trouble or problem
  • Reduces overall downtime by helping to troubleshoot issues faster, with all the logs in one place
  • Improves incident management through active detection of issues
  • Supports self-determination of incidents along with auto-resolution
  • Simplifies the architecture with different severity levels such as error, info, and warning


In this post, I'll be using HDFS as the central repository for syslog messages and Hive as the analytical platform.

Rsyslog  


Rsyslog is an open-source software utility used on UNIX and Unix-like computer systems for forwarding log messages in an IP network. It implements the basic syslog protocol and extends it with content-based filtering, rich filtering capabilities, flexible configuration options, and features such as TCP transport.

Note:
Please review the post Streaming Twitter Data using Apache Flume before this one for a Flume introduction and architecture.

Flume's syslog TCP source  


The Syslog TCP source provides an endpoint for messages over TCP, allowing for a larger payload size and TCP retry semantics that should be used for any reliable inter-server communications.

To create a Syslog TCP source, set the type property to syslogtcp.

vi /usr/hadoopsw/apache-flume-1.7.0-bin/conf/syslog.conf

# Naming the components on the current agent.
agent.sources=SourceSyslog
agent.channels=ChannelMem
agent.sinks=SinkHDFS

# Describing/Configuring the source
#agent.sources.SourceSyslog.type=syslogudp
agent.sources.SourceSyslog.type=syslogtcp
agent.sources.SourceSyslog.host=0.0.0.0
agent.sources.SourceSyslog.port=12345
agent.sources.SourceSyslog.keepFields=true


# Describing/Configuring the channel
agent.channels.ChannelMem.type=memory
agent.channels.ChannelMem.capacity = 10000
agent.channels.ChannelMem.transactionCapacity = 1000

# Describing/Configuring the sink  
agent.sinks.SinkHDFS.type=hdfs
agent.sinks.SinkHDFS.hdfs.path = /flume/syslogs/
agent.sinks.SinkHDFS.hdfs.fileType = DataStream
agent.sinks.SinkHDFS.hdfs.writeFormat = Text
agent.sinks.SinkHDFS.hdfs.batchSize = 1000
agent.sinks.SinkHDFS.hdfs.rollSize = 0
agent.sinks.SinkHDFS.hdfs.rollCount = 10000

# Binding the source and sink to the channel
agent.sources.SourceSyslog.channels = ChannelMem
agent.sinks.SinkHDFS.channel = ChannelMem



The keepFields property tells the source to include the syslog fields as part of the body.

By default, these fields are simply removed from the body, as they become Flume header values. For the memory channel, the capacity property sets the maximum number of events the channel can hold (10,000 here). transactionCapacity sets the maximum number of events that can be written in a single transaction (a put) by a source's ChannelProcessor, the component responsible for moving data from the source to the channel. It is also the maximum number of events that can be read in a single transaction (a take) by the SinkProcessor, the component responsible for moving data from the channel to the sink.

Remember that if you increase the capacity property value, you will most likely have to increase the Java heap space as well, using the -Xmx, and optionally -Xms, parameters.
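A hedged sketch of one way to do that: Flume picks up JAVA_OPTS from conf/flume-env.sh (copy it from flume-env.sh.template if it does not exist yet); the 2 GB value below is only an example, size it to your channel capacity.

# append the heap settings to the Flume environment file used in this post
echo 'export JAVA_OPTS="-Xms512m -Xmx2048m"' >> /usr/hadoopsw/apache-flume-1.7.0-bin/conf/flume-env.sh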

Run Flume agent 

Create relevant folders in HDFS as mentioned in flume configuration

[hdpsysuser@te1-hdp-rp-nn01 ~]$ hdfs dfs -mkdir /flume/syslogs
[hdpsysuser@te1-hdp-rp-nn01 ~]$ hdfs dfs -chmod -R 777 /flume


Now you can run the Flume agent using the command below

flume-ng agent -n agent -c conf -f $FLUME_HOME/conf/syslog.conf -Dflume.root.logger=INFO,console


Test with nc


Now test the connection to your Flume agent using nc. nc is the command that runs netcat, a simple Unix utility that reads and writes data across network connections using the TCP or UDP protocol.


[hdpclient@te1-hdp-rp-en01 ~]$ nc localhost 12345  
Event-1

Type the line above and press Enter; the line should be transported to the Flume agent, which should write it to the HDFS location specified in the configuration file.

Test whether the data reached its final destination (HDFS)

[hdpclient@te1-hdp-rp-en01 ~]$ hdfs dfs -cat /flume/syslogs/FlumeData.1497263656231
Event-1


Use Hive to analyze (Optional)

create database flume;
use flume;

create external table syslog(line string) location '/flume/syslogs';

select line
from syslog
--where line like '%user root%'
--where line like '%Invalid user%'
where lower(line) like '%authentication failure%'
limit 5;
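As a hypothetical follow-up, the same table can be queried non-interactively with the Hive CLI; the LIKE pattern below is only an example, adjust it to your own log content.

# count how many forwarded syslog lines mention an authentication failure
hive -e "
use flume;
select count(*) as auth_failures
from syslog
where lower(line) like '%authentication failure%';"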

Configure remote logging with rsyslog (Unix Client) 

OS: Redhat 7.x


Configure rsyslog to send its events to another server using TCP.

1- Add the following line to the RULES section of /etc/rsyslog.conf. The remote host is given as name/ip:port (the port is optional and defaults to 514); a single @ forwards over UDP, while @@ forwards over TCP. Here we point it at the Flume agent's syslogtcp source:

#*.* @remote-host:514
*.*         @@192.168.44.134:12345

vi /etc/rsyslog.conf

#### RULES ####

# Log all kernel messages to the console.
# Logging much else clutters up the screen.
#kern.*                                                 /dev/console

# Log anything (except mail) of level info or higher.
# Don't log private authentication messages!
*.info;mail.none;authpriv.none;cron.none                /var/log/messages

# The authpriv file has restricted access.
authpriv.*                                              /var/log/secure

# Log all the mail messages in one place.
mail.*                                                  -/var/log/maillog

# Log cron stuff
cron.*                                                  /var/log/cron

# Everybody gets emergency messages
*.emerg                                                 :omusrmsg:*

# Save news errors of level crit and higher in a special file.
uucp,news.crit                                          /var/log/spooler

# Save boot messages also to boot.log
local7.*                                                /var/log/boot.log

# remote host is: name/ip:port, e.g. 192.168.0.1:514, port optional
#*.* @remote-host:514
*.*         @@192.168.44.134:12345


2- Restart rsyslog.

[hdpsysuser@te1-hdp-rp-dn04 ~]$ service rsyslog restart

Redirecting to /bin/systemctl restart  rsyslog.service
==== AUTHENTICATING FOR org.freedesktop.systemd1.manage-units ===
Authentication is required to manage system services or units.
Authenticating as: hdpsysuser
Password:
==== AUTHENTICATION COMPLETE ===
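If the test in the next step shows nothing arriving on the Flume side, also check that the agent's port is reachable from the client. On RHEL 7 with firewalld enabled, that may mean opening the port explicitly on the Flume host; a hedged example, assuming the default firewalld setup:

# run on the Flume agent host; 12345 is the syslogtcp source port from the configuration
firewall-cmd --permanent --add-port=12345/tcp
firewall-cmd --reload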

3- Test the configuration using the logger command, which makes entries in the system log.


[root@te1-hdp-rp-dn04 ~]# logger Test from Data Node - 4
Check your message

[root@te1-hdp-rp-dn04 ~]# tail /var/log/messages
Jun 12 14:18:59 te1-hdp-rp-dn04 fprintd: ** Message: entering main loop
Jun 12 14:19:19 te1-hdp-rp-dn04 su: (to root) hdpsysuser on pts/1
Jun 12 14:19:19 te1-hdp-rp-dn04 dbus[950]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Jun 12 14:19:19 te1-hdp-rp-dn04 dbus-daemon: dbus[950]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Jun 12 14:19:19 te1-hdp-rp-dn04 dbus[950]: [system] Successfully activated service 'org.freedesktop.problems'
Jun 12 14:19:19 te1-hdp-rp-dn04 dbus-daemon: dbus[950]: [system] Successfully activated service 'org.freedesktop.problems'
Jun 12 14:19:29 te1-hdp-rp-dn04 fprintd: ** Message: No devices in use, exit
Jun 12 14:19:39 te1-hdp-rp-dn04 hdpsysuser: Test from Data Node - 4
Jun 12 14:20:01 te1-hdp-rp-dn04 systemd: Started Session 11254 of user root.
Jun 12 14:20:01 te1-hdp-rp-dn04 systemd: Starting Session 11254 of user root.


4- Verify the Flume agent's HDFS location

[hdpclient@te1-hdp-rp-en01 ~]$ hdfs dfs -ls /flume/syslogs
Found 8 items
-rw-r--r--   3 hdpclient supergroup         10 2017-06-12 13:46 /flume/syslogs/FlumeData.1497264415313
-rw-r--r--   3 hdpclient supergroup         17 2017-06-12 13:50 /flume/syslogs/FlumeData.1497264697909
-rw-r--r--   3 hdpclient supergroup          9 2017-06-12 13:51 /flume/syslogs/FlumeData.1497264761727
-rw-r--r--   3 hdpclient supergroup         10 2017-06-12 13:55 /flume/syslogs/FlumeData.1497264970862
-rw-r--r--   3 hdpclient supergroup        497 2017-06-12 14:14 /flume/syslogs/FlumeData.1497266092851
-rw-r--r--   3 hdpclient supergroup         36 2017-06-12 14:16 /flume/syslogs/FlumeData.1497266217106
-rw-r--r--   3 hdpclient supergroup       1272 2017-06-12 14:16 /flume/syslogs/FlumeData.1497266252522
-rw-r--r--   3 hdpclient supergroup        176 2017-06-12 14:17 /flume/syslogs/FlumeData.1497266292306

Verify using browser


Check using the Hive table
Query the Hive table created earlier.





Monitor Flume metrics

You can configure the Flume agent to start an HTTP server that outputs JSON, which can be queried by outside mechanisms.

Start the Flume agent with these properties:

-Dflume.monitoring.type=http
-Dflume.monitoring.port=44444

flume-ng agent -n agent -c conf -f $FLUME_HOME/conf/syslog.conf -Dflume.root.logger=INFO,console -Dflume.monitoring.type=http -Dflume.monitoring.port=44444



Now, when you go to http://SERVER_OR_IP:44444/metrics, you will see something like the output below.


{
  "CHANNEL.ChannelMem": {
    "ChannelCapacity": "1000000",
    "ChannelFillPercentage": "0.0",
    "Type": "CHANNEL",
    "EventTakeSuccessCount": "14",
    "ChannelSize": "0",
    "EventTakeAttemptCount": "47",
    "StartTime": "1497273770141",
    "EventPutAttemptCount": "14",
    "EventPutSuccessCount": "14",
    "StopTime": "0"
  },
  "SINK.SinkHDFS": {
    "ConnectionCreatedCount": "2",
    "ConnectionClosedCount": "2",
    "Type": "SINK",
    "BatchCompleteCount": "0",
    "BatchEmptyCount": "31",
    "EventDrainAttemptCount": "14",
    "StartTime": "1497273770143",
    "EventDrainSuccessCount": "14",
    "BatchUnderflowCount": "2",
    "StopTime": "0",
    "ConnectionFailedCount": "0"
  },
  "SOURCE.SourceSyslog": {
    "EventReceivedCount": "14",
    "AppendBatchAcceptedCount": "0",
    "Type": "SOURCE",
    "EventAcceptedCount": "14",
    "AppendReceivedCount": "0",
    "StartTime": "1497273770202",
    "AppendAcceptedCount": "0",
    "OpenConnectionCount": "0",
    "AppendBatchReceivedCount": "0",
    "StopTime": "0"
  }
}




The channel's ChannelSize or ChannelFillPercentage metrics will give you a good idea of whether data is coming in faster than it is going out. They will also tell you whether the channel is sized large enough to ride out maintenance windows or outages at your data volume.
Looking at the sink, comparing EventDrainSuccessCount with EventDrainAttemptCount tells you how often output succeeds relative to the number of attempts. The ConnectionFailedCount metric is a good indicator of persistent connection problems.


A growing ConnectionCreatedCount metric can indicate that connections are dropping and reopening too often.
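A minimal sketch of pulling a single value out of that JSON from the shell, assuming the agent was started with the monitoring options above and is reachable on localhost:

# print the current fill percentage of the memory channel every 30 seconds
while true; do
  curl -s http://localhost:44444/metrics | python -c 'import sys, json; print(json.load(sys.stdin)["CHANNEL.ChannelMem"]["ChannelFillPercentage"])'
  sleep 30
done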


Streaming Twitter Data by Flume using Cloudera Twitter Source

In my previous post, Streaming Twitter Data using Apache Flume, tweets were fetched using Flume and the Twitter streaming API for data analysis. The Twitter source converts tweets to Avro format and sends Avro events to the downstream HDFS sink; when the Hive table backed by Avro loaded the data, I got an error message saying "Avro block size is invalid or too large". In order to overcome this issue, I used the Cloudera TwitterSource rather than the Apache TwitterSource.

Streaming Twitter Data using Apache Flume


Introduction                                                                           

Flume is a distributed service for efficiently collecting, aggregating, and moving large amounts of streaming event data. It is a highly reliable, distributed, and configurable tool. It is principally designed to copy streaming data (event/log data) from various web servers and services like Facebook and Twitter to HDFS.


Building Teradata Presto Cluster


Prerequisites:
Before working on this post you should review below posts.



Installing/Configuring PrestoDB
Working with PrestoDB Connectors

In this post, I'll be covering the following:

1- Installing and configuring Presto Admin
2- Installing Presto Cluster on a single node
3- Using Presto ODBC Driver
4- Installing and configuring Presto Cluster with one coordinator and three workers

Working with PrestoDB Connectors



Prerequisite:
Complete my previous post Installing/Configuring PrestoDB


Presto enables you to connect to other databases using connectors, in order to perform queries and joins over several sources; the connectors provide the metadata and data for those queries. In this post we will work with some of these connectors. A coordinator (the master daemon) uses connectors to get the metadata (such as table schemas) needed to build a query plan, while workers use connectors to get the actual data they process.
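A connector is registered by dropping a catalog properties file on every node. A minimal sketch for a Hive catalog, assuming the metastore runs on te1-hdp-rp-nn01 with its default Thrift port (adjust the path under your Presto installation directory):

# etc/catalog/hive.properties on each coordinator and worker
cat > etc/catalog/hive.properties <<'EOF'
connector.name=hive-hadoop2
hive.metastore.uri=thrift://te1-hdp-rp-nn01:9083
EOF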


Installing/Configuring PrestoDB

Introduction

Presto (invented at Facebook) is an open-source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. It allows querying data where it lives, including Hive, Cassandra, relational databases, or even proprietary data stores. Unlike Hive, Presto doesn't use the MapReduce framework for its execution. Instead, Presto directly accesses the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs. A single Presto query can combine data (through pluggable connectors) from multiple sources, allowing for analytics across your entire organization. It is targeted at analysts who expect response times ranging from sub-second to minutes.
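As an illustration of combining sources, a hypothetical query joining a Hive table with a MySQL table through two catalogs (all catalog, schema and table names here are assumptions for the example):

presto --server localhost:8080 --execute "
select o.order_id, c.customer_name
from hive.sales.orders o
join mysql.crm.customers c on o.customer_id = c.id
limit 10"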

Managing HDFS Quotas


The Hadoop Distributed File System (HDFS) allows the administrator to set quotas for the number of names used and the amount of space used for individual directories. Name quotas and space quotas operate independently, but the administration and implementation of the two types of quotas are closely parallel.
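A short sketch of the relevant commands, using a directory from this blog as an example:

hdfs dfsadmin -setQuota 10000 /flume/syslogs        # limit the number of names (files and directories)
hdfs dfsadmin -setSpaceQuota 10g /flume/syslogs     # limit the raw space consumed (replication included)
hdfs dfs -count -q /flume/syslogs                   # show both quotas and the current usage
hdfs dfsadmin -clrQuota /flume/syslogs              # remove the name quota again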

Hadoop DFSAdmin Commands

The dfsadmin tools are a specific set of tools designed to help you dig out information about your Hadoop Distributed File System (HDFS). As an added bonus, you can use them to perform some administration operations on HDFS as well.
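A few representative dfsadmin commands (run as the HDFS superuser):

hdfs dfsadmin -report            # capacity, usage and state of every DataNode
hdfs dfsadmin -safemode get      # check whether the NameNode is in safe mode
hdfs dfsadmin -refreshNodes      # re-read the include/exclude host files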

Recover the deleted file/folder in HDFS


By default, Hadoop deletes files and directories permanently, but sometimes they are deleted accidentally and you want to get them back. You have to enable the Trash feature for this purpose: two properties (fs.trash.interval and fs.trash.checkpoint.interval) are set in core-site.xml so that deleted files and directories are moved into the .Trash folder, which is located in HDFS at /user/$USER/.Trash.
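A hedged recovery sketch, assuming trash is enabled (fs.trash.interval > 0) and using one of the Flume files from this blog as an example path:

hdfs dfs -rm /flume/syslogs/FlumeData.1497264415313
# the file is moved into the trash rather than deleted; restore it with:
hdfs dfs -mv /user/$USER/.Trash/Current/flume/syslogs/FlumeData.1497264415313 /flume/syslogs/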

Hive Streaming



Streaming offers an alternative way to transform data. During a streaming job, the Hadoop Streaming API opens an I/O pipe to an external process. Data is then passed to the process, which operates on the data it reads from standard input and writes the results out through standard output, back to the Streaming API job.
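A minimal TRANSFORM sketch run with the Hive CLI: each row of the syslog table built earlier on this blog is piped through an external process (/bin/cat here, i.e. an identity transform); swap in your own script to do real work.

hive -e "
use flume;
select transform(line) using '/bin/cat' as (line_out string)
from syslog
limit 5;"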


Thursday, June 08, 2017

Installing/Configuring and Working on Apache Sqoop



Introduction


Apache Sqoop is a Hadoop ecosystem tool (a Hadoop client) designed to efficiently transfer bulk data between Apache Hadoop and structured datastores like Oracle. It helps offload certain tasks (such as ETL processing) from the EDW to Hadoop for efficient execution at a much lower cost. It can also be used to extract data from Hadoop and export it into external structured datastores.
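A hypothetical import sketch; the JDBC URL, credentials, table and target directory are all assumptions for illustration:

sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott -P \
  --table EMP \
  --target-dir /user/hdpclient/emp \
  --num-mappers 1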

Friday, June 02, 2017

Apache PIG - a Short Tutorial


Introduction

Apache Pig is an abstraction over MapReduce, developed as a research project at Yahoo in 2006 and open sourced via the Apache incubator in 2007. In 2008, the first release of Apache Pig came out, and in 2010 it graduated as an Apache top-level project. It is a tool/platform used to analyze large data sets by representing them as data flows. To write data analysis programs, Pig provides a high-level language known as Pig Latin. Scripts written in Pig Latin are internally converted to Map and Reduce tasks. Apache Pig has a component known as Pig Engine that accepts Pig Latin scripts as input and converts them into MapReduce jobs.
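A small Pig Latin sketch, counting the forwarded syslog lines per host; the whitespace-delimited field layout (month, day, time, host, ...) is an assumption, so adjust the schema to your data.

cat > count_hosts.pig <<'EOF'
logs    = LOAD '/flume/syslogs' USING PigStorage(' ') AS (mon:chararray, day:chararray, time:chararray, host:chararray);
by_host = GROUP logs BY host;
counts  = FOREACH by_host GENERATE group AS host, COUNT(logs) AS cnt;
DUMP counts;
EOF
pig count_hosts.pig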

Tuesday, May 30, 2017

Creating External Table for HDFS using Oracle Connector for Hadoop (OSCH)


Introduction


Oracle Big Data Connectors facilitate access to data stored in an Apache Hadoop cluster. They can be licensed for use on either Oracle Big Data Appliance or a Hadoop cluster running on commodity hardware. There are three connectors available, of which we are going to work with Oracle SQL Connector for Hadoop Distributed File System for the purpose of this post.

Sunday, May 14, 2017

Connect Oracle SQL Developer to Hive


Oracle SQL Developer is one of the most common SQL client tools used by developers, data analysts, and data architects to interact with Oracle and other relational systems, so extending SQL Developer to connect to Hive is very useful for Oracle users. You can use the SQL Worksheet to query, create, and alter Hive tables, dynamically accessing data sources defined in the Hive metastore.

Tuesday, May 02, 2017

Using Hadoop Compression


Hadoop Compression

Hive can read data from a variety of sources, such as text files, sequence files, or even custom formats, using Hadoop's InputFormat APIs, and can write data to various formats using the OutputFormat API. You can leverage Hadoop to store data in compressed form and save significant disk space. Compression can also increase throughput and performance. Compressing and decompressing data incurs extra CPU overhead; however, the I/O savings from moving fewer bytes into memory can result in a net performance gain.
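A hypothetical sketch of enabling compressed output for a Hive job (Snappy is used only as an example codec and must be available on the cluster):

hive -e "
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
insert overwrite directory '/tmp/syslog_compressed'
select * from flume.syslog;"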