Data and Analytics Platform Modernization – Data Lake – Cloud Big Data Platforms

Big Data Platform

A big data platform is an enterprise information platform that combines the features and capabilities of several big data applications and utilities in a single solution.

A big data platform converts large amounts of structured and unstructured raw data, retrieved from different sources, into data products useful for the organization's business. It serves as the single version of truth on which the enterprise builds decision-making and forecasting applications. This single platform collects data from all data sources, munges it so that it is available to be consumed by analysts, and finally delivers data products useful to the business.

Let’s discuss the traditional data analytics platform architecture and its components, and why enterprises have to modernize this platform to ingest and consume data from multiple sources, including online streaming data, and to generate insights and automated advisory solutions for the business.

Traditional D&A Platform

Bottom Tier: The data warehouse database serves as the bottom tier. It is usually a relational database system. Data is cleansed, transformed, and loaded into this layer using back-end tools.

Middle Tier: The middle tier in a data warehouse is an OLAP server, implemented using either the ROLAP or MOLAP model. For a user, this application tier presents an abstracted view of the database. This layer also acts as a mediator between the end user and the database.

Top Tier: The top tier is the front-end client layer, consisting of the tools and APIs you connect to in order to get data out of the data warehouse: query tools, reporting tools, managed query tools, analysis tools, and data mining tools.

Data Warehouse Database – The central database is the foundation of the data warehousing environment and is implemented on RDBMS technology. This kind of implementation is constrained by the fact that a traditional RDBMS is optimized for transactional processing, not for data warehousing. Multidimensional databases (MDDBs) are used to overcome the limitations imposed by the relational data model.

ETL

  • Anonymize data as per regulatory stipulations.
  • Prevent unwanted data in operational databases from being loaded into the data warehouse.
  • Search and replace common names and definitions for data arriving from different sources.
  • Calculate summaries and derived data.
  • Populate missing data with defaults.
  • These Extract, Transform, and Load tools may generate cron jobs, background jobs, COBOL programs, shell scripts, etc. that regularly update data in the data warehouse (a small sketch of such transformations follows below).
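The transformations above are not tied to any particular ETL tool; purely as an illustration, a minimal Python/pandas sketch of a few of them could look like the following, where the DataFrame and its column names (customer_email, record_status, and so on) are hypothetical:

```python
# Minimal, illustrative ETL transform (hypothetical columns, not tied to any specific tool).
import hashlib

import pandas as pd


def transform(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()
    # Anonymize a personally identifiable column per regulatory stipulations
    df["customer_email"] = df["customer_email"].apply(
        lambda v: hashlib.sha256(str(v).encode()).hexdigest()
    )
    # Eliminate unwanted operational rows before loading into the warehouse
    df = df[df["record_status"] != "DELETED"]
    # Standardize common names and definitions arriving from different sources
    df["country"] = df["country"].replace({"USA": "US", "U.S.": "US"})
    # Populate missing data with defaults
    df["discount"] = df["discount"].fillna(0.0)
    # Calculate summaries and derived data
    df["net_amount"] = df["gross_amount"] - df["discount"]
    return df
```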

Metadata – Metadata is data about data that defines the data warehouse. Metadata plays an important role, as it specifies the source, usage, values, and features of the data. It also defines how data can be changed and processed.

Query Tools

  • Query and reporting tools.
  • Application Development tools.
  • Data mining tools.
  • OLAP tools.

Data & Analytics Platform Modernization

The traditional data & analytics platform cannot maintain and manage unstructured data. Handling the speed, volume, and variety of data, and formats such as ORC and Parquet, are mandatory requirements for the data management, processing, and reporting layers of the D&A platform.

Big Data is not the Solution… It’s the Challenge

Your data and analytics platform should support

  • Huge Volumes, Massive Streams, Mixed Structures, and Complex Processing.

It should also support emerging technologies and applications, such as:

  • Blockchain Applications, Internet of Things (IoT) Solutions, Personalized Services.
  • Recommender Systems, Geolocation Services.
  • Logistics & SC Optimization, Preventive Maintenance.
  • Cybersecurity.

Hadoop Based Data Platforms

Hadoop-based data hub platforms are common across enterprise IT as a way to bring unstructured and online streaming data into the D&A platform. The Hadoop open-source ecosystem contains many tools and utilities to ingest, store, manage, and maintain enterprise data.

An enterprise data hub is a big data management model that uses a Hadoop platform as the central data repository. The goal of an enterprise data hub is to provide an organization with a centralized, unified data source that can quickly provide diverse business users with the information they need to do their jobs.

Hadoop Architecture


Moving on, let’s talk about data life cycle frameworks for building Hadoop-based platforms. There are many frameworks and methods available for the data life cycle in big data and analytics platforms; we will discuss the following:

  • CRISP Methodology
  • SIPMAA Framework

CRISP Methodology

CRISP – Phases

SIPMAA Framework

  • An architect’s framework that supports the most sophisticated patterns.
  • Not linear, but rather interactive, dynamic, and continuous.
SIPMAA

Hadoop Tools & Utilities – SIPMAA Mapping

Hadoop allows you to build a modern data & analytics platform with its open-source tools and utilities for data sourcing, prep, and presentation.

Let’s discuss the Hadoop tools and utilities used to ingest, process, model, access, and apply data in the modern data & analytics platform.

1) Ingest

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows.

A very common scenario in log collection is a large number of log-producing clients sending data to a few consumer agents that are attached to the storage subsystem. For example, logs collected from hundreds of web servers are sent to a dozen agents that write to an HDFS cluster.

Sqoop

  • Sqoop can be used to import data from an RDBMS into the Hadoop platform and to export it back (see the sketch below).
  • Sqoop 2 exposes a REST API as a web service, which can be easily integrated with other systems. Connectors can be non-JDBC based.
  • With its service-oriented design, Sqoop 2 can have role-based authentication and audit trail logging to increase security.
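As a rough illustration only, a Sqoop 1 import can be scripted by shelling out to the standard `sqoop import` command; the connection string, credentials file, table name, and HDFS target directory below are placeholders:

```python
# Hedged sketch: drive a Sqoop 1 import from Python via the CLI (placeholder values).
import subprocess

cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/sales",    # hypothetical source database
    "--username", "etl_user",
    "--password-file", "/user/etl/.sqoop_password",   # avoid plain-text passwords
    "--table", "orders",                              # RDBMS table to import
    "--target-dir", "/data/raw/orders",               # HDFS landing directory
    "--num-mappers", "4",                             # parallel import tasks
]
subprocess.run(cmd, check=True)
```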

2) Process

HDFS
The Hadoop Distributed File System (HDFS) is a Java-based distributed, scalable, and portable filesystem designed to span large clusters of commodity servers. The design of HDFS is based on GFS, the Google File System, which was published by Google. Like many other distributed filesystems, HDFS holds a large amount of data and provides transparent access to many clients distributed across a network. Its key concepts are listed below, followed by a short usage sketch.

  • Master/slave architecture
  • Name Node
  • Data Nodes
  • Data Distribution
  • Data Replication
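As a minimal sketch of working with HDFS, the snippet below wraps the standard `hdfs dfs` shell client from Python; the directory and file names are placeholders:

```python
# Minimal sketch: interact with HDFS by shelling out to the `hdfs dfs` client.
import subprocess


def hdfs(*args: str) -> None:
    subprocess.run(["hdfs", "dfs", *args], check=True)


hdfs("-mkdir", "-p", "/data/landing/weblogs")         # create a landing directory
hdfs("-put", "access.log", "/data/landing/weblogs/")  # copy a local file into HDFS
hdfs("-ls", "/data/landing/weblogs")                  # list the directory contents
```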

YARN

YARN (Yet Another Resource Negotiator) is a core component of Hadoop, managing access to all resources in a cluster. Before YARN, jobs were forced to go through the MapReduce framework, which is designed for long-running batch operations. Now, YARN brokers access to cluster compute resources on behalf of multiple applications, using selectable criteria such as fairness or capacity, allowing for a more general-purpose experience. Some new capabilities unlocked by YARN include:

  • In-memory Execution: Apache Spark is a data processing engine for Hadoop, offering performance-enhancing features like in-memory processing and cyclic data flow. By interacting directly with YARN, Spark is able to reach its full performance potential on a Hadoop cluster.
  • Real-time Processing: Apache Storm lets users define a multi-stage processing pipeline to process data as it enters a Hadoop cluster. Users expect Storm to process millions of events each second with low latency, so customers wanting to run Storm and batch processing engines like MapReduce on the same cluster need YARN to manage resource sharing.
  • Resource Management and job scheduling/monitoring

Apache Hadoop – MapReduce

MapReduce is a parallel programming model for writing distributed applications, devised at Google for efficient processing of large amounts of data (multi-terabyte data sets) on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. MapReduce programs run on Hadoop, an Apache open-source framework. It is a processing technique and programming model for distributed computing based on Java.

The MapReduce model consists of two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs). The Reduce task takes the output from a map as its input and combines those tuples into a smaller set of tuples. As the name MapReduce implies, the reduce task is always performed after the map job.

  • MapReduce helps solve big data problems, specifically sort-intensive or disk-read-intensive jobs.
  • You have to code two functions (see the sketch below):
  • Mapper – converts input into key/value pairs.
  • Reducer – aggregates all the values for a key.
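As a sketch of that Mapper/Reducer split, the classic word count can be written as two small Python scripts and submitted through Hadoop Streaming; the file names here are illustrative:

```python
# --- mapper.py: converts input lines into "word<TAB>1" key/value pairs ---
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# --- reducer.py: aggregates all the values for each key (word) ---
# Hadoop Streaming sorts the mapper output by key before it reaches the reducer.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```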

Hbase

HBase is a data model similar to Google’s Bigtable. It is an open-source, distributed database developed by the Apache Software Foundation and written in Java. HBase is an essential part of the Hadoop ecosystem and runs on top of HDFS. It can store massive amounts of data, from terabytes to petabytes. It is column-oriented and horizontally scalable.

  • Use HBase when you need random, real-time read/write access to your big data (see the sketch below).
  • Hosting of very large tables – billions of rows by millions of columns.
  • Leverages the distributed data storage provided by Hadoop and HDFS.
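Purely as an illustration of random, real-time reads and writes, the sketch below uses the third-party `happybase` Python client, which talks to HBase through the Thrift server; the table name, column family, and row keys are hypothetical:

```python
# Hedged sketch: random, real-time reads/writes against HBase via happybase
# (requires the HBase Thrift server; names below are hypothetical).
import happybase

connection = happybase.Connection("hbase-thrift-host", port=9090)
table = connection.table("clickstream")

# Write: cells are grouped under a column family ("cf") in column-oriented storage
table.put(b"user123|2024-01-01T10:00:00", {
    b"cf:page": b"/home",
    b"cf:referrer": b"google",
})

# Random read by row key
row = table.row(b"user123|2024-01-01T10:00:00")
print(row[b"cf:page"])
```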

Apache Spark

Apache Spark is a data processing engine for Hadoop, offering performance-enhancing features like in-memory processing and cyclic data flow. By interacting directly with YARN, Spark is able to reach its full performance potential on a Hadoop cluster. Its key capabilities include the following (a short PySpark sketch follows the list):

  • In-memory distributed processing
  • Scala, Python, Java, and R APIs
  • Resilient Distributed Datasets (RDDs)
  • MLlib – machine learning algorithms
  • SQL and DataFrames / pipelines
  • Streaming
  • Big graph analytics
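A minimal PySpark sketch of a few of these features (in-memory caching, the RDD and DataFrame APIs, and writing a columnar format such as Parquet) is shown below; the HDFS paths are placeholders:

```python
# Minimal PySpark sketch: DataFrame and RDD APIs with in-memory caching (placeholder paths).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("modern-dna-platform-demo")
         .getOrCreate())

# DataFrame API: read raw text logs from HDFS and cache them in memory
logs = spark.read.text("hdfs:///data/landing/weblogs").cache()
errors = logs.filter(logs.value.contains("ERROR"))
print(errors.count())

# RDD API: the lower-level Resilient Distributed Dataset abstraction
counts = (logs.rdd
          .flatMap(lambda row: row.value.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
print(counts.take(5))

# Persist curated results in a columnar format such as Parquet
errors.write.mode("overwrite").parquet("hdfs:///data/curated/error_logs")
```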

3) Model

Apache Spark – MLlib

MLlib is Apache Spark’s machine learning library and provides us with Spark’s superb scalability and ease-of-use when trying to solve machine learning problems.
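As a hedged illustration, the sketch below trains a simple classifier with the DataFrame-based spark.ml API; the input path, feature columns, and label column are assumptions made for the example:

```python
# Hedged sketch: a simple MLlib (spark.ml) classification pipeline with assumed columns.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()
df = spark.read.parquet("hdfs:///data/curated/churn_features")  # placeholder path

assembler = VectorAssembler(
    inputCols=["tenure", "monthly_spend", "support_calls"],  # assumed feature columns
    outputCol="features",
)
lr = LogisticRegression(featuresCol="features", labelCol="churned")  # assumed label column

model = Pipeline(stages=[assembler, lr]).fit(df)
predictions = model.transform(df)
predictions.select("churned", "prediction").show(5)
```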

Apache Hadoop – Mahout
Apache Mahout is a powerful, scalable machine-learning library that runs on top of Hadoop MapReduce. Machine learning is a discipline of artificial intelligence that enables systems to learn based on data alone, continuously improving performance as more data is processed.

  • Machine learning and data mining library that leverages Hadoop and MapReduce
  • Designed for massive data
  • Integrates with the Hadoop ecosystem
  • Supports a variety of algorithms: classification, clustering, collaborative filtering, dimensionality reduction, topic modeling, and others

4) Access

Apache Hive – Hive is a data warehouse system for Hadoop. It runs SQL-like queries called HQL (Hive Query Language), which are internally converted to MapReduce jobs. Hive was developed by Facebook. Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and user-defined functions.

SQL queries are submitted to Hive and executed as follows (a short sketch follows the list):

  • Hive compiles the query.
  • An execution engine, such as Tez or MapReduce, executes the compiled query.
  • The resource manager, YARN, allocates resources for applications across the cluster.
  • The data that the query acts upon resides in HDFS (Hadoop Distributed File System). Supported data formats are ORC, AVRO, Parquet, and text.
  • Query results are then returned over a JDBC/ODBC connection.
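HQL is usually submitted through Beeline or a JDBC/ODBC client; to keep the example in Python, the sketch below issues the same kind of HQL through Spark’s Hive integration (so Spark, rather than Tez or MapReduce, executes it here). The database, table, and columns are hypothetical:

```python
# Hedged sketch: DDL and a query in HQL via Spark's Hive support (hypothetical schema).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-access-demo")
         .enableHiveSupport()
         .getOrCreate())

# DDL: define an ORC-backed warehouse table in the Hive metastore
spark.sql("CREATE DATABASE IF NOT EXISTS sales")
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales.orders (
        order_id BIGINT,
        customer_id BIGINT,
        amount DOUBLE
    ) STORED AS ORC
""")

# Query: aggregate over the warehouse table
top_customers = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM sales.orders
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
""")
top_customers.show()
```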

Apache Impala

  • Impala raises the bar for SQL query performance on Apache Hadoop while retaining a familiar user experience.
  • With Impala, you can query data, whether stored in HDFS or Apache HBase – including SELECT, JOIN, and aggregate functions – in real time.

5) Apply

Finally, integrate all of the Hadoop utilities considered above and build a modern data and analytics system with the following features:

  • Massively Parallel Processing
  • Shared Nothing
  • Massively Parallel Data Loading
  • Integration with Hadoop
  • Native MapReduce

Hadoop Security

Apache Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.

Apache Metron provides a scalable, advanced security analytics framework built with the Hadoop community, evolving from the Cisco OpenSOC project. It is a cybersecurity application framework that gives organizations the ability to detect cyber anomalies and to respond rapidly to the anomalies they identify.

Apache Sentry is a system to enforce fine-grained, role-based authorization to data and metadata stored on a Hadoop cluster.

Apache Eagle analyzes big data platforms for security and performance. It is an open-source monitoring platform for the Hadoop ecosystem, which started with monitoring data activities in Hadoop. It can instantly identify access to sensitive data, recognize attacks and malicious activities, and block access in real time.

In conjunction with components such as Ranger, Sentry, Knox, DgSecure, and Splunk, Eagle provides a comprehensive solution to secure sensitive data stored in Hadoop.

Hadoop Governance

Enterprises adopting modern data architecture with Hadoop must reconcile data management realities when they bring existing and new data from disparate platforms under management.

As customers deploy Hadoop into corporate data and processing environments, metadata and data governance must be vital parts of any enterprise-ready data lake.

Apache Atlas and Apache Falcon

Apache Atlas is a scalable and extensible set of core foundational governance services. It enables enterprises to effectively and efficiently meet their compliance requirements within Hadoop and allows integration with the whole enterprise data ecosystem.

Apache Falcon is a framework for managing the data life cycle in Hadoop clusters. It addresses enterprise challenges related to Hadoop data replication, business continuity, and lineage tracing by providing a framework for data management and processing.

Falcon centrally manages the data life cycle, facilitates quick data replication for business continuity and disaster recovery, and provides a foundation for audit and compliance by tracking entity lineage and collecting audit logs.

We will discuss how to build a data lake using public cloud offerings, along with the reference architecture, in the next blog…
