Cloud Big Data Platform - Reference Architecture - Public Cloud Provider Offerings

Cloud platforms are transforming the way enterprises manage and consume data. The big data and analytics services introduced by cloud providers now help businesses make decisions and open up new business opportunities based on data insights.

A reference architecture is a collection of modules that breaks a solution down into elements, each providing capabilities that address a specific set of concerns.

A reference architecture has to address multiple concerns:

  • Concerns related to external constraints on the system to be built and to design decisions.
  • Concerns related to reuse or sharing of modules; capabilities provided by a module are typically shared within the system or exposed externally.
  • Concerns aligned with stakeholder communities of interest or stakeholder roles.

Big Data Reference Architecture - Modules

Big data platform modules can be grouped under three categories: application provider, framework provider, and cross-cutting.

  1. Big Data Application Provider Modules.
  2. Big Data Framework Provider Modules.
  3. Cross-cutting modules.

Big Data Application Provider Modules

Application Orchestration Module-Application Orchestration configures and combines other modules of the big data Application Provider, integrating activities into a cohesive application. An application is the end-to-end data processing through the system to satisfy one or more use cases.

Collection Module-The Collection module is primarily concerned with the interface to external Data Providers.

Preparation Module –The main concern of the Preparation module is transforming data to make it useful for the other downstream modules.

  • Data validation.
  • Cleansing.
  • Optimization.
  • Data Transformation and standardization.
  • Performance optimization for faster lookups.

Analytics Module-This module is concerned with efficiently extracting insights from the data. Analytics can contribute further to the transform stage of the ETL cycle by performing more advanced transformations.

Visualization Module -The Visualization module is concerned with presenting processed data and the outputs of analytics to a human Data Consumer, in a format that communicates meaning and knowledge. It provides a “human interface” to the big data.

Access Module-The Access module is concerned with the interactions with external actors, such as the Data Consumer, or with human users.

Big Data Framework Provider Modules

Processing Module– The Processing module is concerned with efficient, scalable, and reliable execution of analytics. A common solution pattern to achieve scalability and efficiency is to distribute the processing logic and execute it locally on the same nodes where data is stored, transferring only the results of processing over the network.

Messaging Module –The Messaging module is concerned with reliable queuing, transmission, and delivery of data and control functions between components.

Data Storage Module- The primary concerns of the Data Storage module are providing reliable and efficient access to the persistent data.

Infrastructure Module-The Infrastructure module provides the infrastructure resources necessary to host and execute the activities. Infrastructure and data centre design are concerns when architecting a big data solution and can be an important factor in achieving the desired performance. Big data infrastructure needs to be scalable, reliable, and able to support the target workloads.

Cross-cutting modules

Security Module-The Security module is concerned with controlling access to data and applications, including enforcement of access rules and restricting access based on classification.

Management Module-System management, including activities such as monitoring, configuration, provisioning, and control of infrastructure and applications; and data management, covering the data life cycle of collection, preparation, analytics, visualization, and access.

Federation Module– The Federation module is concerned with interoperation between federated instances of the platform.

Moving on, let's discuss the big data and analytics platform components provided by different cloud providers and how we can build a big data platform using those components.

Big Data Cloud Platform - Logical Architecture

Below is the logical reference architecture of the big data platform. We will map the components provided by each cloud provider to it and see how the platform can be designed.

AWS - Big Data Platform

First, let's map the components provided by AWS to design and build the big data platform.

Ingest:

AWS offers Kinesis streams to handle high-frequency real-time data. Data producers can push data in real time to Kinesis streams (a minimal sketch follows the list below), and you can also connect Kinesis to an Apache Storm cluster.

Kinesis Data Firehose can be used for large-scale data ingestion; data pushed to it can be automatically delivered to different storage layers such as S3, Redshift, and the Elasticsearch service.

  • Kinesis Video Streams: ingestion of video streams.
  • Kinesis Data Streams: process streams using popular streaming frameworks such as Apache Spark and Flink.
  • Kinesis Data Firehose: capture, lightly transform, and load data directly into storage and BI targets.
  • Kinesis Data Analytics: run SQL on streaming data.
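
A minimal sketch of pushing an event into a Kinesis data stream with the boto3 SDK; the region, stream name, and payload are illustrative assumptions, and the stream is expected to exist already.

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"user_id": "u-123", "action": "page_view", "ts": "2019-06-01T10:00:00Z"}

response = kinesis.put_record(
    StreamName="clickstream-events",          # assumed, pre-created stream
    Data=json.dumps(event).encode("utf-8"),   # payload is sent as bytes
    PartitionKey=event["user_id"],            # determines the target shard
)
print(response["ShardId"], response["SequenceNumber"])
```

Downstream, a Kinesis Data Analytics application, a Spark or Flink consumer, or a Firehose delivery stream can read from the same stream.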

AWS Snowball can be used for transporting data into and out of the cloud. Snowball devices come in 50 TB and 80 TB sizes, and multiple Snowballs can be used together for larger data volume transfers.

Process

EMR: AWS Elastic MapReduce (EMR) is a managed Hadoop, Spark, and Presto solution. EMR takes care of setting up the underlying EC2 cluster and provides integration with a number of AWS services, including S3 and DynamoDB, so Spark jobs can run on EMR clusters for big data computation.
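
As a hedged sketch of how EMR is typically driven programmatically, the boto3 call below launches a transient Spark cluster and submits a single step; the bucket names, script path, and IAM roles are placeholders, not values from this article.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="nightly-spark-etl",
    ReleaseLabel="emr-5.29.0",
    Applications=[{"Name": "Spark"}],
    LogUri="s3://my-emr-logs/",                      # assumed logging bucket
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,        # terminate when the step finishes
    },
    Steps=[{
        "Name": "spark-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-code-bucket/etl_job.py"],  # assumed script
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```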

Data Pipeline is a data orchestration product that moves, copies, transforms, and enriches data. Data Pipeline manages the scheduling, orchestration, and monitoring of the pipeline activities.

Store

AWS S3 is the primary storage layer because of its durability and availability; it can maintain a segregation between raw storage buckets, transformed/discovery data buckets, and curated data buckets.

Redshift is a columnar database that can be used to build data warehouses from terabyte to petabyte scale.

AWS DynamoDB can be used to store key-value metadata, for example to build a data catalog over the data lake's raw storage.

Access

External tables built on AWS S3 enable services like AWS Athena to run SQL directly on files stored in S3. AWS Redshift Spectrum enables querying Redshift tables and S3 external tables in parallel using SQL.
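
The sketch below shows one way to run an ad hoc Athena query over such an external table from Python; the database, table, and results bucket are assumptions for illustration.

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

execution = athena.start_query_execution(
    QueryString="SELECT event_date, COUNT(*) AS events FROM raw_events GROUP BY event_date",
    QueryExecutionContext={"Database": "datalake_raw"},                  # assumed Glue database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},   # assumed results bucket
)
query_id = execution["QueryExecutionId"]

# Athena is asynchronous: poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(len(rows) - 1, "result rows")   # the first row holds the column headers
```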

Metadata Management: the metadata layer can be built on AWS-native services such as the Glue catalog, or bespoke on NoSQL databases such as DynamoDB, with advanced search enabled through services such as AWS CloudSearch or the AWS Elasticsearch service.

Document Classification: AWS Macie can discover, classify, and protect sensitive data in AWS. It classifies documents based on rules that can evolve further with machine learning models for data discovery, classification, and protection.

Model

AWS makes it very easy to create predictive models without the need to learn complex algorithms. To create a model, users are guided through the process of selecting data, preparing data, and training and evaluating models via a simple wizard-based UI. Once trained, the model can be used to create predictions via an online (request/response) API or a batch API for processing multiple input records.

Consumption

AWS offers QuickSight for data visualization. Dashboards can be built from data stored across most AWS data storage services, and a number of third-party sources are also supported.

AWS Big Data Platform Security Components

  • AWS Identity and Access Management (IAM) lets you define individual user accounts with permissions across AWS resources, with MFA for privileged accounts.
  • Data encryption capabilities are available in AWS storage and database services such as EBS, S3, Glacier, Oracle RDS, SQL Server RDS, and Redshift.
  • AWS Key Management Service (KMS) lets you choose whether AWS manages the encryption keys or the customer keeps complete control over them.

Azure - Big Data Platform

Let's map the components provided by Azure to design and build the big data platform.

Ingest

Event Hub – Azure Event Hubs is a big data streaming platform and event ingestion service, capable of receiving and processing millions of events per second. Event Hubs can process and store events, data, or telemetry produced by distributed software and devices. Data sent to an event hub can be transformed and stored by using any real-time analytics provider or batching/storage adapters. Event Hubs for Apache Kafka enables native Kafka clients, tools, and applications such as MirrorMaker and Apache Flink to work against Event Hubs.
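
A minimal producer sketch using the azure-eventhub Python SDK (the v5 API is assumed); the connection string and event hub name are placeholders.

```python
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hubs-namespace-connection-string>",   # placeholder
    eventhub_name="telemetry",                             # assumed event hub
)

batch = producer.create_batch()
batch.add(EventData('{"device": "sensor-01", "temperature": 21.5}'))
producer.send_batch(batch)   # events become visible to Stream Analytics, Kafka consumers, etc.
producer.close()
```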

Stream Analytics – Stream Analytics is Microsoft's latest addition to its suite of advanced, fully managed, serverless Platform-as-a-Service (PaaS) cloud components. The process of spinning up complex data pipelines and analytics has in the past been both time-consuming and expensive, but can now be done within minutes to hours at a very reasonable cost. Azure Stream Analytics currently supports three types of inputs: Blob storage, IoT Hub, and Event Hubs.

Process

HDInsight is Azure's managed Apache platform, which comes with Hadoop, Spark, Storm, or HBase. The platform has a standard and a premium tier, the latter including the option of running R Server, Microsoft's enterprise solution for building and running R models at scale. HDInsight comes with a local HDFS and can also connect to Blob storage or Data Lake Store.

Azure Data Factory is a data orchestration service that is used to build data processing pipelines. Data factory can read data from a range of Azure and third party data sources, and through Data Management Gateway, can connect and consume on-premise data. Data Factory comes with a range of activities that can run compute tasks in HDInsight, Azure Machine Learning, stored procedures, Data Lake and custom code running on Batch.

Store

Azure Storage is highly available, secure, durable, scalable, and redundant. Azure Storage includes Azure Blobs (objects), Azure Data Lake Storage Gen2, Azure Files, Azure Queues, and Azure Tables.

Azure Data Lake Store: Azure Data Lake Storage is a high-speed, scalable, secure, and cost-effective platform. The high-performance Azure Blob File System (ABFS) is built for big data analytics and is compatible with the Hadoop Distributed File System. Some of the attractive features of the new service are the following: encryption of data in transit with TLS 1.2, encryption of data at rest, storage account firewalls, virtual network integration, role-based access security, and hierarchical namespaces with granular ACLs.
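
As a small illustration, the snippet below lands a raw file in a storage container using the azure-storage-blob SDK (the v12 API is assumed); against an ADLS Gen2 account the same data is reachable through the ABFS driver for analytics engines. The container and file names are placeholders.

```python
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-account-connection-string>")
container = service.get_container_client("raw-zone")     # assumed container / filesystem

with open("orders_2019-06-01.csv", "rb") as data:
    container.upload_blob(
        name="sales/orders_2019-06-01.csv",               # folder-like path within the container
        data=data,
        overwrite=True,
    )
```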

Azure Data Catalog is a registry of data assets within an organisation. Technical and business users can then use Data Catalog to discover datasets and understand their intended use.

Model

Azure Machine Learning is a fully managed data science platform used to build and deploy powerful predictive and statistical models. It comes with a flexible UI canvas and a set of predefined modules that can be used to build and run data science experiments. The platform includes a series of predefined machine learning models and the ability to run custom R or Python code. Trained models can be published as web services for consumption, either as a real-time request/response API or for batch execution. The platform also comes with interactive Jupyter notebooks for recording and documenting experiments.

Consumption

Power BI can consume data from a range of Azure and third-party services, as well as connect to on-premises data sources, and it allows users to run R scripts and embed R-generated visuals.

Cognitive Services is a suite of readymade intelligence APIs that make it easy to enable and integrate advanced speech, vision, and natural language into business solutions.

Azure Big Data Platform Security

Azure Storage Service Encryption (SSE) can automatically encrypt data before it is stored, and it automatically decrypts the data when you retrieve it. The process is completely transparent to users. Storage Service Encryption uses 256-bit Advanced Encryption Standard (AES) encryption, which is one of the strongest block ciphers available. AES handles encryption, decryption, and key management transparently.

TDE is used to encrypt SQL Server, Azure SQL Database, and Azure SQL Data Warehouse data files in real time, using a Database Encryption Key (DEK), which is stored in the database boot record for availability during recovery.
TDE protects data and log files, using AES and Triple Data Encryption Standard (3DES) encryption algorithms.

Azure Cosmos DB is Microsoft’s globally distributed, multi-model database. User data that’s stored in Cosmos DB in non-volatile storage (solid-state drives) is encrypted by default.

Data Lake Store supports “on by default,” transparent encryption of data at rest.

Microsoft uses the Transport Layer Security (TLS) protocol to protect data when it’s traveling between the cloud services and customers. Microsoft datacenters negotiate a TLS connection with client systems that connect to Azure services. TLS provides strong authentication, message privacy, and integrity (enabling detection of message tampering, interception, and forgery), interoperability, algorithm flexibility, and ease of deployment and use.

Azure Key Vault– A secure secrets store for the passwords, connection strings, and other information you need to keep your apps working.

Google Cloud - Big Data Platform

Let's map the components provided by Google to design and build the big data platform.

Ingest

Pub/Sub

Cloud Pub/Sub is a scalable, durable, event ingestion and delivery system that supports the publish-subscribe pattern at large and small scales. Cloud Pub/Sub makes your systems more robust by decoupling publishers and subscribers of event data.
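
A minimal publisher sketch with the google-cloud-pubsub client; the project and topic names are assumptions, and the topic must already exist.

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "clickstream-events")   # assumed project/topic

# publish() is asynchronous and returns a future; result() blocks until the
# message is acknowledged and returns the server-assigned message ID.
future = publisher.publish(topic_path, data=b'{"user_id": "u-123", "action": "page_view"}')
print(future.result())
```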

Process

Cloud Dataproc is Google's fully managed Hadoop and Spark offering. Google boasts an impressive 90-second lead time to start or scale Cloud Dataproc clusters, by far the quickest of the three providers. Pricing is based on the underlying Compute Engine costs plus an additional charge per vCPU per minute. An HDFS-compatible connector is available for Cloud Storage that can be used to store data that needs to survive after the cluster has been shut down.

Data processing pipelines can be built using Cloud Dataflow, a fully programmable framework (available for Java and Python via Apache Beam) backed by a distributed compute platform. Cloud Dataflow supports both batch and streaming pipelines.
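
A hedged Apache Beam sketch of such a pipeline; run locally it uses the DirectRunner, and the same code can be submitted to Cloud Dataflow by passing a Dataflow runner plus project, region, and staging options. The bucket paths are assumptions.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()   # add runner/project/temp_location options for Dataflow

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/raw/events-*.csv")     # assumed input
        | "KeyByUser" >> beam.Map(lambda line: (line.split(",")[0], 1))          # user_id in column 0
        | "CountPerUser" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: "{},{}".format(kv[0], kv[1]))
        | "Write" >> beam.io.WriteToText("gs://my-bucket/curated/user_counts")   # assumed output
    )
```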

Store

BigQuery: BigQuery is Google's fully managed, petabyte-scale, low-cost analytics data warehouse. As a fully managed PaaS offering there is no infrastructure to manage, so you can focus on analyzing data with familiar SQL to find meaningful insights and take advantage of a pay-as-you-go model.

BigQuery connects to a variety of visualization tools, so reports can be generated using the customer's tool of preference. Ad hoc analysis can also be done directly in the BigQuery UI using the query editor.
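
For instance, a short query from Python with the google-cloud-bigquery client; the project, dataset, and table are assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")    # assumed project

sql = """
    SELECT event_date, COUNT(*) AS events
    FROM `my-gcp-project.analytics.raw_events`        -- assumed dataset.table
    GROUP BY event_date
    ORDER BY event_date
"""

for row in client.query(sql).result():                # blocks until the job completes
    print(row.event_date, row.events)
```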

Cloud Datastore is a high-performance NoSQL database, designed for auto-scaling and ease of application development. While it is NoSQL, Datastore offers many features similar to traditional databases. Other databases that can be considered are Cloud SQL and Cloud Spanner.

Model

Cloud AutoML is a fully managed platform for training and hosting TensorFlow models. It relies on Cloud Dataflow for data and feature processing and Cloud Storage for data storage. There is also Cloud Datalab, a notebook environment based on Jupyter. A set of pre-trained models is also available: the Vision API detects features in images such as text, faces, or company logos; the Speech API converts audio to text across a range of languages; the Natural Language API can be used to extract meaning from text; and there is an API for translation.

Consumption

Data Studio

Data Studio is closely integrated with Google Cloud and allows you to access data from Google Analytics, Google Ads, Display & Video 360, Search Ads 360, YouTube Analytics, Google Sheets, Google BigQuery, and over 500 more data sources, both Google and non-Google, to visualize and interactively explore data. It makes it easy to share your insights, and beyond sharing, Data Studio offers seamless real-time collaboration with others.

Google announced a lot of enhancements to their big data platform at Cloud Next '19.

Here’s an overview of what’s new:

  • Simplifying data migration and integration
    • Cloud Data Fusion (beta)
    • BigQuery DTS SaaS application connectors (beta)
    • Data warehouse migration service to BigQuery (beta)
    • Cloud Dataflow SQL (public alpha, coming soon)
    • Dataflow FlexRS (beta)
  • Accelerating time to insights
    • BigQuery BI Engine (beta)
    • Connected sheets (beta, coming soon)
  • Turning data into predictions
    • BigQuery ML (GA, coming soon), with additional models supported
    • AutoML Tables (beta)
  • Enhancing data discovery and governance
    • Cloud Data Catalog (beta, coming soon)

Google Cloud Big Data Platform Security

  • Encryption by default in transit and at rest.
  • Cloud Key Management Service (KMS) – manage cryptographic keys for your cloud services.
  • Cloud Data Loss Prevention (DLP) – fast, scalable de-identification for sensitive data. It is mainly aimed at text data and allows you to detect and redact sensitive elements such as credit card numbers, phone numbers, social security numbers, and names.
  • Backup and recovery – in storage, encryption at rest protects data on backup media. Data is also replicated in encrypted form for backup and disaster recovery.
  • Cloud Data Catalog – fully managed and scalable metadata management service that empowers you to quickly discover, manage, and understand your data.
  • Cloud Identity-Aware Proxy (Cloud IAP) – controls access to cloud applications running on Google Cloud Platform.
  • Access Transparency for GCP – a service that creates logs in near real time when GCP administrators interact with your data for support.


Conclusion

Big data and analytics are becoming a critical component of modern business and a core capability that is driving cloud adoption. All three providers offer similar building blocks for data processing, orchestration, streaming analytics, ML, and visualization.

AWS has all the bases covered with a solid set of products that will meet most needs except managed lab notebooks.

Azure offers a comprehensive and impressive suite of managed analytical products. It supports open source big data solutions alongside new serverless analytical products such as Data Lake, and offers everything from pre-trained models through to custom R models running over big data. Azure also offers the capability for organisations to track and document their data assets.

Google provides its own set of products across its range of services. With Dataproc and Dataflow, Google has a strong core to its offerings; TensorFlow and Cloud AutoML are getting a lot of attention, and Google has a rich set of pre-trained APIs.

Will discuss the serverless big data platform pipeline and architecture in the next blog …

Ref: AWS, Google, and Azure online documentation

Data and Analytics Platform Modernization – Data Lake – Cloud Big Data Platforms

Big Data Platform

A big data platform is an enterprise information platform that combines the features and capabilities of several big data applications and utilities in a single solution.

A big data platform converts large amounts of structured and unstructured raw data retrieved from different sources into data products useful to the business. It acts as the single version of truth on which an enterprise builds decision-making and forecasting applications. This single platform collects data from all data sources and munges it so that it becomes available for consumption by analysts, ultimately delivering data products useful to the organization's business.

Let's discuss the traditional data analytics platform architecture and its components, and why enterprises have to modernize this platform to ingest and consume data from multiple sources, including online streaming data, and generate insights and automated advisory solutions for the business.

Traditional D&A Platform

Bottom Tier: The database of the data warehouse serves as the bottom tier. It is usually a relational database system. Data is cleansed, transformed, and loaded into this layer using back-end tools.

Middle Tier: The middle tier in Data warehouse is an OLAP server which is implemented using either ROLAP or MOLAP model. For a user, this application tier presents an abstracted view of the database. This layer also acts as a mediator between the end-user and the database.

Top Tier: The top tier is a front-end client layer. It holds the tools and APIs that you connect to in order to get data out of the data warehouse, such as query tools, reporting tools, managed query tools, analysis tools, and data mining tools.

Data Warehouse Database – The central database is the foundation of the data warehousing environment. This database is implemented on RDBMS technology. This kind of implementation is constrained by the fact that a traditional RDBMS is optimized for transactional processing rather than data warehousing; multidimensional databases (MDDBs) can be used to overcome the limitations imposed by the relational data model.

ETL

  • Anonymize data as per regulatory stipulations.
  • Eliminating unwanted data in operational databases from loading into Data warehouse.
  • Search and replace common names and definitions for data arriving from different sources.
  • Calculating summaries and derived data
  • In case of missing data, populate them with defaults
  • These Extract, Transform, and Load tools may generate cron jobs, background jobs, COBOL programs, shell scripts, etc. that regularly update data in Datawarehouse.

Metadata –Metadata is data about data which defines the data warehouse. Metadata plays an important role as it specifies the source, usage, values, and features of data. It also defines how data can be changed and processed.

Query Tools

  • Query and reporting tools.
  • Application Development tools.
  • Data mining tools.
  • OLAP tools.

Data & Analytics Platform Modernization

A traditional data & analytics platform cannot maintain and manage non-structured data. Speed, volume, and a variety of data formats such as ORC and Parquet are mandatory requirements in the data management, processing, and reporting layers of the D&A platform.

Big Data is not the Solution….It’s the Challenge

Your data and analytics platform should support

  • Huge volumes, massive streams, mixed structures, and complex processing.

It should also support emerging technologies and applications:

  • Blockchain applications, Internet of Things (IoT) solutions, personalized services.
  • Recommender systems, geolocation services.
  • Logistics & supply chain optimization, preventive maintenance.
  • Cybersecurity.

Hadoop Based Data Platforms

Hadoop-based data hub platforms are common across enterprise IT for bringing unstructured and online streaming data into the D&A platform. The Hadoop open source ecosystem contains a lot of tools and utilities to ingest, store, manage, and maintain enterprise data.

An enterprise data hub is a big data management model that uses a Hadoop platform as the central data repository. The goal of an enterprise data hub is to provide an organization with a centralized, unified data source that can quickly provide diverse business users with the information they need to do their jobs.

Hadoop Architecture


Moving on, let's talk about data life cycle frameworks used to build Hadoop-based platforms. There are many frameworks and methods available for the data life cycle in big data and analytics platforms; we will discuss the following frameworks.

  • CRISP Methodology
  • SIPMAA Framework

CRISP Methodology

CRISP -Phases

SIPMAA Framework

  • A framework for architects that supports the most sophisticated patterns.
  • Not linear, but rather interactive, dynamic, and continuous.

Hadoop Tools & Utilities -SIPMAA Mapping

Hadoop allows you to build a modern data & analytics platform with its open source tools & utilities for data sourcing, preparation, and presentation.

Let's discuss the Hadoop tools & utilities which are used to ingest, process, access, model, and apply data in the modern data & analytics platform.

1) Ingest

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows.

A very common scenario in log collection is a large number of log-producing clients sending data to a few consumer agents that are attached to the storage subsystem. For example, logs collected from hundreds of web servers are sent to a dozen agents that write to the HDFS cluster.

Sqoop

  • Sqoop can be used to import/export data between an RDBMS and the Hadoop platform.
  • Sqoop 2 exposes a REST API as a web service, which can be easily integrated with other systems. Connectors can be non-JDBC based.
  • With its service-oriented design, Sqoop 2 can provide role-based authentication and audit trail logging to increase security.

2) Process

HDFS
The Hadoop Distributed File System (HDFS) is a Java-based distributed, scalable, and portable filesystem designed to span large clusters of commodity servers. The design of HDFS is based on GFS, the Google File System, which is  published by Google. Like many other distributed filesystems, HDFS holds a large amount of data and provides transparent access to many clients distributed across a network.

  • Master/slave architecture
  • Name Node
  • Data Nodes
  • Data Distribution
  • Data Replication

YARN

YARN (Yet Another Resource Negotiator) is a core component of Hadoop, managing access to all resources in a cluster. Before YARN, jobs were forced to go through the Map Reduce framework, which is designed for long-running batch operations. Now, YARN brokers access to cluster compute resources on behalf of multiple applications, using selectable criteria such as fairness or capacity, allowing for a more general-purpose experience. Some new capabilities unlocked with YARN include:

  • In-memory Execution: Apache Spark is a data processing engine for Hadoop, offering performance-enhancing features like in-memory processing and cyclic data flow. By interacting directly with YARN, Spark is able to reach its full performance potential on a Hadoop cluster.
  • Real-time Processing: Apache Storm lets users define a multi-stage processing pipeline to process data as it enters a Hadoop cluster. Users expect Storm to process millions of events each second with low latency, so customers wanting to run Storm and batch processing engines like MapReduce on the same cluster need YARN to manage resource sharing.
  • Resource Management and job scheduling/monitoring

Apache Hadoop – MapReduce

MapReduce is a parallel programming model for writing distributed applications, devised at Google for efficient processing of large amounts of data (multi-terabyte data sets) on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. MapReduce programs run on Hadoop, an Apache open-source framework. It is a processing technique and a programming model for distributed computing based on Java.

The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce task takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce task is always performed after the map job.

  • MapReduce helps solve big data problems, specifically sort-intensive or disk-read-intensive jobs.
  • You have to code two functions (a minimal sketch follows this list):
  • Mapper – converts input into key-value pairs.
  • Reducer – aggregates all the values for a key.
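
The sketch below is an in-memory illustration of the word-count pattern just described: map emits (key, value) pairs, the framework groups values by key (the shuffle), and reduce aggregates each group. On a real cluster the same two functions would run distributed (for example via Hadoop Streaming), not in a single process.

```python
from collections import defaultdict

def mapper(line):
    # Map: turn one input record into (word, 1) pairs.
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    # Reduce: aggregate all values observed for one key.
    return word, sum(counts)

lines = ["big data needs big clusters", "map then reduce the data"]

# Shuffle: group intermediate values by key, as the framework does between phases.
grouped = defaultdict(list)
for line in lines:
    for word, count in mapper(line):
        grouped[word].append(count)

results = [reducer(word, counts) for word, counts in grouped.items()]
print(sorted(results))
```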

Hbase

HBase is a data model that is similar to Google's Bigtable. It is an open source, distributed database written in Java and developed by the Apache Software Foundation. HBase is an essential part of the Hadoop ecosystem and runs on top of HDFS. It can store massive amounts of data, from terabytes to petabytes. It is column-oriented and horizontally scalable.

  • Use HBase when you need random, realtime read/write access to your big data (a small client sketch follows this list).
  • Hosting of very large tables: billions of rows by millions of columns.
  • Leverages the distributed data storage provided by Hadoop and HDFS.
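
The sketch below uses the happybase client, which talks to HBase through the Thrift gateway (the gateway must be enabled on the cluster); the host, table, and column family are assumptions.

```python
import happybase

connection = happybase.Connection(host="hbase-thrift.example.com", port=9090)
table = connection.table("user_events")

# HBase stores everything as bytes under column-family:qualifier keys.
table.put(b"user#123", {b"cf:last_action": b"page_view", b"cf:last_seen": b"2019-06-01"})

row = table.row(b"user#123")          # random, real-time read by row key
print(row[b"cf:last_action"])
connection.close()
```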

Apache Spark

Apache Spark is a data processing engine for Hadoop, offering performance-enhancing features like in-memory processing and cyclic data flow. By interacting directly with YARN, Spark is able to reach its full performance potential on a Hadoop cluster (a short PySpark sketch follows the feature list below).

  • In memory distributed Processing
  • Scala, Python, Java and R
  • Resilient Distributed Dataset (RDD)
  • MLlib – Machine Learning Algorithms
  • SQL and Data Frames / Pipelines
  • Streaming
  • Big Graph Analytics
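
A short PySpark sketch of the same word-count problem, this time with the DataFrame API; the input path is an assumption, and on a cluster the job would normally be launched with spark-submit.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wordcount").getOrCreate()

lines = spark.read.text("hdfs:///data/raw/books/*.txt")          # assumed location

counts = (
    lines.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
         .where(F.col("word") != "")
         .groupBy("word")
         .count()
         .orderBy(F.col("count").desc())
)

counts.show(10)
spark.stop()
```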

3) Model

Apache Spark – MLlib

MLlib is Apache Spark’s machine learning library and provides us with Spark’s superb scalability and ease-of-use when trying to solve machine learning problems.
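
As a hedged example, the pipeline below trains a simple logistic-regression classifier with spark.ml; the feature table, column names, and label are assumptions (the label column is expected to be numeric).

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("churn-model").getOrCreate()

df = spark.read.parquet("hdfs:///data/curated/churn_features")   # assumed feature table

assembler = VectorAssembler(
    inputCols=["tenure_months", "monthly_spend", "support_tickets"],   # assumed features
    outputCol="features",
)
lr = LogisticRegression(featuresCol="features", labelCol="churned")    # assumed 0/1 label

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=[assembler, lr]).fit(train)

model.transform(test).select("churned", "prediction", "probability").show(5)
```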

Apache Hadoop – Mahout
Apache Mahout is a powerful, scalable machine-learning library that runs on top of Hadoop MapReduce. Machine learning is a discipline of artificial intelligence that enables systems to learn based on data alone, continuously improving performance as more data is processed.

  • Machine learning and data mining library that leverages Hadoop and MapReduce.
  • Designed for Massive Data
  • Integrates with Hadoop Ecosystem
  • Support for Variety of Algorithms: Classification, Clustering, Collaborative Filtering, Dimensionality Reduction, Topic Modeling, Others

4) Access

Apache Hive – Hive is a data warehouse system for Hadoop. It runs SQL-like queries, written in HQL (Hive Query Language), which are internally converted to MapReduce jobs. Hive was developed by Facebook. Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and user-defined functions (a small client sketch follows the execution steps below).

SQL queries are submitted to Hive and they are executed as follows:

  • Hive compiles the query.
  • An execution engine, such as Tez or MapReduce, executes the compiled query.
  • The resource manager, YARN, allocates resources for applications across the cluster.
  • The data that the query acts upon resides in HDFS (Hadoop Distributed File System). Supported data formats are ORC, AVRO, Parquet, and text.
  • Query results are then returned over a JDBC/ODBC connection.
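
A small client sketch submitting HiveQL from Python through HiveServer2 using the PyHive library; the host, database, and table are assumptions.

```python
from pyhive import hive

conn = hive.Connection(host="hive-server.example.com", port=10000, database="datalake")
cursor = conn.cursor()

cursor.execute("""
    SELECT event_date, COUNT(*) AS events
    FROM raw_events
    GROUP BY event_date
""")

for event_date, events in cursor.fetchall():
    print(event_date, events)

cursor.close()
conn.close()
```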

Apache Impala

  • Impala raises the bar for SQL query performance on Apache Hadoop while retaining a familiar user experience.
  • With Impala, you can query data, whether stored in HDFS or Apache HBase – including SELECT, JOIN, and aggregate functions – in real time.

5) Apply

Finally, integrate all the considered Hadoop utilities and build a modern data and analytics system with the following features.

  • Massively Parallel Processing
  • Shared Nothing
  • Massively Parallel Data Loading
  • Integration with Hadoop
  • Native MapReduce

Hadoop Security

Apache Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.

Apache Metron provides a scalable, advanced security analytics framework built with the Hadoop community, evolving from the Cisco OpenSOC project. It is a cyber security application framework that gives organizations the ability to detect cyber anomalies and rapidly respond to them.

Apache Sentry is a system to enforce fine-grained, role-based authorization to data and metadata stored on a Hadoop cluster.

Apache Eagle analyzes big data platforms for security and performance. It is an open source monitoring platform for the Hadoop ecosystem, which started with monitoring data activities in Hadoop. It can instantly identify access to sensitive data, recognize attacks or malicious activities, and block access in real time.

In conjunction with components such as Ranger, Sentry, Knox, DgSecure, and Splunk, Eagle provides a comprehensive solution to secure sensitive data stored in Hadoop.

Hadoop Governance

Enterprises adopting modern data architecture with Hadoop must reconcile data management realities when they bring existing and new data from disparate platforms under management.

As customers deploy Hadoop into corporate data and processing environments, metadata and data governance must be vital parts of any enterprise-ready data lake.

Apache Atlas and Apache Falcon

Apache Atlas is a scalable and extensible set of core foundational governance services. It enables enterprises to effectively and efficiently meet their compliance requirements within Hadoop and allows integration with the whole enterprise data ecosystem.

Apache Falcon is a framework for managing the data life cycle in Hadoop clusters. It addresses enterprise challenges related to Hadoop data replication, business continuity, and lineage tracing by providing a framework for data management and processing.

Falcon centrally manages the data life cycle, facilitates quick data replication for business continuity and disaster recovery, and provides a foundation for audit and compliance by tracking entity lineage and collecting audit logs.

Will discuss how to build a data lake using public cloud offerings and the reference architecture in the next blog …

Microservices Container Platform Part 2 – Securing Kubernetes Environment

Microservices container platform Part 1 : https://digitaltransformationcloud.home.blog/2019/04/17/digital-transformation-container-based-platforms-for-microservice-applications

As I mentioned in my previous microservices container platform blog, let's discuss how to secure the Kubernetes environment in this blog.

The default behavior of a Kubernetes cluster can leave many security gaps and issues; you need to secure the Kubernetes platform by implementing appropriate security solutions and controls.

Kubernetes platform security design must include the following security principles.

  • Defense in depth.
  • Least privilege.
  • Limiting the attack surface.

It's preferable to have multiple layers of defense against attacks on your Kubernetes cluster. If you're relying on a single defensive solution, attackers might find their way around it.

There are various ways that an attacker could attempt to get access to your Kubernetes cluster or the applications running on it.

Work with your Kubernetes team and find answers to the following questions.

  • Do you have end-to-end visibility of Kubernetes pod deployments?
  • How are you alerted when internal service pods or containers start to scan ports internally?
  • Is there a monitoring solution in place to monitor network connections?
  • Are you able to monitor what's going on inside a pod or container?
  • Have you reviewed access rights to the Kubernetes cluster to understand potential insider attack vectors?
  • Do you have a security checklist covering Kubernetes services, access controls, and container hosts?

Kubernetes Platform -Security layers

You need to have security solutions at all layers of the Kubernetes platform:

Cluster, node, namespace, pod, container.

Kubernetes Security Layers
  1. Protect the cluster.
  2. Protect the nodes, registry, and etcd.
  3. Protect the API server: configure RBAC for the API server.
  4. Restrict kubelet permissions: configure RBAC for kubelets and manage certificate rotation to secure the kubelet.
  5. Implement authentication for all external ports.
  6. Limit/remove console access.

Kubernetes best practices and features to secure your container platform

RBAC – use namespaces, roles and role bindings, and service accounts (a small example follows the list below).

  • Namespaces: Logical segmentation and isolation, or "virtual clusters". Correct use of Kubernetes namespaces is fundamental for security, as you can group together users, roles, and resources according to business logic without granting global privileges for the cluster.
  • Service accounts: Used to assign permissions to software entities. Kubernetes creates its own default service accounts, and you can create additional ones for your pods/deployments. Any pod run by Kubernetes gets its privileges through its service account, applied to all processes run within the containers of that pod.
  • Start by making sure your cluster configuration supports RBAC. The configuration lives in your kube-apiserver manifest; its location depends on the deployment method, but it is usually inside /etc/kubernetes/manifests on the master node(s) or in the apiserver pod.
  • Look for this flag: --authorization-mode=Node,RBAC
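
A hedged example of defining a namespaced, read-only Role with the official Kubernetes Python client; the namespace and role name are illustrative, and a RoleBinding would still be needed to grant the role to a user or service account.

```python
from kubernetes import client, config

config.load_kube_config()                  # use load_incluster_config() when running in a pod
rbac = client.RbacAuthorizationV1Api()

pod_reader = client.V1Role(
    metadata=client.V1ObjectMeta(name="pod-reader", namespace="dev"),
    rules=[
        client.V1PolicyRule(
            api_groups=[""],                 # "" is the core API group
            resources=["pods"],
            verbs=["get", "list", "watch"],  # read-only, no create/update/delete
        )
    ],
)

rbac.create_namespaced_role(namespace="dev", body=pod_reader)
```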

Kubernetes admission controllers

An admission controller is a piece of code that intercepts requests to the Kubernetes API server prior to persistence of the object, but after the request is authenticated and authorized. Admission controllers pre-process the requests and can provide utility functions (such as filling out empty parameters with default values), but can also be used to enforce security policies and other checks.
Admission controllers are found on the kube-apiserver conf file:

--admission-control=Initializers,NamespaceLifecycle,LimitRanger,ServiceAccount,PersistentVolumeLabel,DefaultStorageClass,DefaultTolerationSeconds,NodeRestriction,ResourceQuota

Kubernetes Security Context

When you declare a pod or deployment, you can group several security-related parameters, such as the SELinux profile and Linux capabilities, in a security context block (a minimal example follows).
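
A minimal sketch of such a block expressed with the Kubernetes Python client (the equivalent YAML sits under spec.containers[].securityContext); the image name is a placeholder.

```python
from kubernetes import client

container = client.V1Container(
    name="api",
    image="registry.example.com/api:1.0",                     # assumed private image
    security_context=client.V1SecurityContext(
        run_as_non_root=True,                                  # refuse to start as root
        read_only_root_filesystem=True,                        # block writes to the container filesystem
        allow_privilege_escalation=False,
        capabilities=client.V1Capabilities(drop=["ALL"]),      # drop all Linux capabilities
    ),
)
```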

Kubernetes Security Policy (Pod security Policy)

Kubernetes Pod Security Policy (PSP), often shortened to Kubernetes Security Policy, is implemented as an admission controller. Using security policies, you can restrict which pods are allowed to run on your cluster: only those that follow the policies you have defined.

Kubernetes Network Policies

A network policy is a specification of how groups of pods are allowed to communicate with each other and other network endpoints.

Kubelet security – access to the kubelet API

Verify the following Kubernetes security settings when configuring kubelet parameters:

  • --anonymous-auth is set to false to disable anonymous access (the kubelet will send 401 Unauthorized responses to unauthenticated requests).
  • The kubelet has a --client-ca-file flag, providing a CA bundle to verify client certificates.
  • --authorization-mode is not set to AlwaysAllow; the more secure Webhook mode delegates authorization decisions to the Kubernetes API server.
  • --read-only-port is set to 0 to avoid unauthorized connections to the read-only endpoint (optional).

Securing Kubernetes etcd

etcd is a key-value distributed database that persists Kubernetes state. The etcd configuration and upgrading guide stresses the security relevance of this component:

“Access to etcd is equivalent to root permission in the cluster so ideally, only the API server should have access to it. Considering the sensitivity of the data, it is recommended to grant permission to only those nodes that require access to etcd clusters.”

You can enforce these restrictions in three different (complementary) ways:

• Regular Linux firewalling (iptables/netfilter, etc).

• Run-time access protection.

• PKI-based authentication + parameters to use the configured certs.

Run-time access protection: an example of run-time access protection could be making sure that the etcd binary only reads and writes from a set of configured directories or network sockets; any run-time access that is not explicitly whitelisted will raise an alarm.

Using a trusted Docker registry

Configure private Docker registry in Kubernetes

Kubernetes provides a convenient way to configure a private Docker registry and store access credentials, including server URL, as a secret:

kubectl create secret docker-registry regcred --docker-server=<your-registry-server> --docker-username=<your-name> --docker-password=<your-pword> --docker-email=<your-email>

Authentication, authorization, and admission controllers for the API

API - Authentication, Authorization and Admission Controllers

CI/CD Best Practices

Implement security solutions at the build, ship, and run phases.

Image Scanning

Checklist to review Kubernetes deployments during run-time

Staging

  • Use namespaces.
  • Restrict Linux capabilities.
  • Enable SELinux.
  • Utilize Seccomp.
  • Use a minimal Host OS.
  • Update system patches.
  • Conduct security auditing.

Run-Time

  • Enforce isolation.
  • Inspect network connections for application attacks.
  • Monitor containers for suspicious process or file system activity.
  • Protect worker nodes from host privilege escalations, suspicious processes or file system activity.
  • Capture packets for security events.
  • Scan containers & hosts for vulnerabilities.
  • Alert, log, and respond in real-time to security incidents.
  • Conduct security auditing.

Ref: Kubernetes Security – Operating Kubernetes Clusters and Applications Safely, Liz Rice
Ref: The Ultimate Guide to Kubernetes Security – NeuVector
Ref: Kubernetes Docs

Will discuss Kubernetes application deployment best practices in the next blog …

Digital Transformation – Container based platforms for Microservice applications

Digital transformation is fundamentally changing how enterprise IT operates and delivers value to customers. A cloud native application platform is the preferred platform in the enterprise digital journey to transform, deploy, and operate application workloads in an optimized manner.

Migrating workloads to a cloud platform as-is and deploying and operating them in the same way does not qualify as digital transformation.

Digital Transformation is

  • Deploying and managing applications at scale.
  • Able to recreate the platform including infrastructure using code (IaC –Infra as code).
  • Deploying and releasing the module /code to production in an automated seamless manner.
  • Automated product component upgrades using Blue green deployment.
  • Continuous integration and delivery.
  • Move workloads without having to redesign your applications or completely rethink your infrastructure, which lets you standardize on a platform and avoid vendor lock-in.
  • Easily performing canary deployments and rollbacks.
  • Zero-downtime platform infrastructure maintenance (upgrades/patching) – rolling updates, migration with node pools.

Microservice application architecture plays an important role in enterprise IT transformation journey.

Let's move further and discuss the monolithic architecture pattern and its issues.

Monolithic Architecture

  • A monolithic application is built as a single unit. Enterprise applications contain three parts: a database, a client-side user interface, and a server-side application. The server-side application handles HTTP requests, executes domain-specific logic, retrieves and updates data from the database, and populates the HTML views sent to the browser. To make any alterations to the system, a developer must build and deploy an updated version of the server-side application.
  • In the monolithic architecture, all the different REST endpoints and the business and data layers are wired together behind one REST interface. The only physically separate component is the front end.
Monolithic App

Monolithic Application Issues

  • The application is too large and complex to fully understand and to make changes to quickly and correctly.
  • The application's size can slow down start-up time.
  • The entire application must be redeployed on each update.
  • Continuous deployment is difficult.
  • It is difficult to scale when different modules have conflicting resource requirements.
  • All instances of the application are identical, so one bug can impact the availability of the entire application.
  • Changes in frameworks or languages affect the entire application.

Moving on, let's discuss the microservice architecture framework and how it helps enterprises with digital transformation.

Microservice architecture

Microservices architecture design principles help enterprises modernize their application platform. Application modernization using microservices design principles means splitting your application into a set of smaller, interconnected services instead of building a single monolithic application.

Each microservice is a small application that has its own hexagonal architecture consisting of business logic along with various adapters. Some microservices expose a REST, RPC, or message-based API, and most services consume APIs provided by other services. Other microservices might implement a web UI.

The Microservice architecture pattern significantly impacts the relationship between the application and the database.

Some APIs are also exposed to mobile, desktop, and web apps. The apps don't, however, have direct access to the back-end services. Instead, communication is mediated by an intermediary known as an API Gateway. The API Gateway is responsible for tasks such as load balancing, caching, access control, API metering, and monitoring.

Microservice Reference Architecture


Microservice architecture tackles the problem of complexity by decomposing application into a set of manageable services which are much faster to develop, and much easier to understand and maintain.

Microservice architecture enables each service to be developed independently by a team that is focused on that service.

Microservice architecture reduces the barrier to adopting new technologies, since developers are free to choose whatever technologies make sense for their service and are not bound to the choices made at the start of the project.

Microservice architecture enables each Microservice to be deployed independently. As a result, it makes continuous deployment possible for complex applications. Microservice architecture enables each service to be scaled independently.

Microservice platform using open source packages as platform components

A microservice application platform can be built using the following open source components:

API layer where all the REST API services reside, a proxy layer that acts as an intermediary connecting the APIs to the data streams (Kafka), and finally the Kafka layer that provides centralised data streams – Apigee.

A centralised log system aggregates logs from different microservice components and makes them available in a centralised location – Splunk.

Distributed tracing systems to provide analytics on latency in the micro service architecture –Zipkin.

Service Registry – phone book for your micro services. Each service registers itself with the service registry and tells the registry where it lives (host, port, node name) and perhaps other service-specific metadata –Zookeeper.

Data Streams –Apache Kafka.

Config Management – server and client-side support for externalized configuration in a distributed system –spring cloud.

CI/CD and infrastructure configuration management – Jenkins, Ansible, and CloudFormation templates.

However, it has become much more complicated to automate and operate this type of platform. Nearly all applications nowadays need answers for things like replication of components, auto-scaling, load balancing, rolling updates, logging across components, monitoring and health checking, service discovery, and authentication.

As application development moves towards a container-based approach, the need to orchestrate and manage resources becomes important. Kubernetes is the leading platform providing reliable scheduling of fault-tolerant application workloads.

The marriage of Kubernetes with containers for a microservices architecture makes a lot of sense, and clears the path for efficient software delivery.

As a container orchestration tool, Kubernetes helps to automate many aspects of microservices development, including:

  • Container deployment.
  • Elasticity (scaling up and down to meet demand).
  • The logical grouping of containers.
  • Management of containers and applications that use them.

Kubernetes –Container orchestration platform  

  • Open source container orchestration platform, allowing large numbers of containers to work together reducing operational burden.
  • Built on top of Docker.
  • Provides a platform for automating deployment, scaling, and operations of application containers across clusters of hosts.
  • Scaling up or down by adding or removing containers when demand changes.
  • It was originally designed by Google and is now maintained by the Cloud Native Computing Foundation (CNCF).
Kubernetes Reference Architecture

Kubernetes -Key Components

  • Container – Sealed application package that can be run in isolation.
  • Pod – Small group of tightly coupled containers (web server).
  • Controller: Loop that drives current state toward desired state – replication controller – Kube controller (kube-controller-manager).
  • Service – A set of running pods that work together – load-balanced back ends.
  • Labels – Metadata attached to objects (Phase = Canary, Phase = Prod).
  • API server -API server validates and configures data for the API objects which include pods, services, replication controllers, and others. The API Server services REST operations and provides the front end to the cluster’s shared state through which all other components interact.
  • Etcd – reliable distributed key-value store.
  • Scheduler-Kubernetes scheduler is one of the critical components of the platform. It runs on the master nodes working closely with the API server and controller. The scheduler is responsible for matchmaking — a pod with a node.
Kubernetes -Components

The managed Kubernetes service platforms provided by public cloud providers remove the need to build and integrate many of these components in the microservices platform yourself.

A public cloud provider's managed Kubernetes platform helps enterprises deploy and operate modernized workloads in an easy and seamless way.

Google Kubernetes Engine – GKE -Reference Architecture 

Kubernetes Engine is a managed Kubernetes service offered by Google. It allows customers to easily create and maintain Kubernetes clusters that use the open-source version of Kubernetes as a base. Kubernetes Engine also adds additional components (add-ons) to the cluster that help the applications running in the cluster to use other Google products and services, such as Google Container Registry, Stackdriver monitoring and logging, and integration with Identity and Access Management (IAM). Google also manages the Kubernetes control plane, sometimes referred to as the Kubernetes Master, and guarantees certain up-time for the control plane.

GKE Reference Architecture

AWS Kubernetes -EKS Reference Architecture

Amazon Elastic Container Service for Kubernetes (Amazon EKS) is a managed service that makes it easy for you to run Kubernetes.

Amazon EKS runs Kubernetes control plane instances across multiple Availability Zones to ensure high availability. Amazon EKS automatically detects and replaces unhealthy control plane instances, and it provides automated version upgrades and patching for them.

Amazon EKS is also integrated with many AWS services to provide scalability and security for your applications, including the following:

  • Amazon ECR for container images.
  • Elastic Load Balancing for load distribution.
  • IAM for authentication.
  • Amazon VPC for isolation.

Amazon EKS runs up-to-date versions of the open-source Kubernetes software, so you can use all the existing plugins and tooling from the Kubernetes community.

AWS Kubernetes (EKS) architecture

How EKS works

EKS Deployment workflow

AWS ECS Reference Architecture –Non Managed Container Platform

You can use an AWS ECS-based platform for deploying microservice applications, but you need to manage the servers/clusters. AWS now also provides Fargate, a compute engine for Amazon ECS that allows you to run containers without having to manage servers or clusters.

AWS ECS reference architecture

Azure AKS reference Architecture

Azure Kubernetes Service (AKS) is the quickest way to use Kubernetes on Azure. AKS provides capabilities to deploy and manage Docker containers using Kubernetes. Azure DevOps helps in creating Docker images for faster deployments and reliability using the continuous build option.

AKS reduces the complexity and operational overhead of managing Kubernetes by offloading much of that responsibility to Azure, like the other cloud providers. As a hosted Kubernetes service, Azure handles critical tasks like health monitoring and maintenance. The Kubernetes masters are managed by Azure.

Azure Managed Kubernetes (AKS) Platform Reference Architecture

It’s easy to create AKS platform and deploy the applications using Azure DevOps.

Azure AKS Platform Key Components

Azure Kubernetes Service (AKS). AKS is an Azure service that deploys a managed Kubernetes cluster.

Kubernetes cluster. AKS is responsible for deploying the Kubernetes cluster and for managing the Kubernetes masters. You only manage the agent nodes.

Virtual network. By default, AKS creates a virtual network to deploy the agent nodes into. For more advanced scenarios, you can create the virtual network first, which lets you control things like how the subnets are configured, on-premises connectivity, and IP addressing.

Ingress. An ingress exposes HTTP(S) routes to services inside the cluster.

External data stores. Microservices are typically stateless and write state to external data stores, such as Azure SQL Database or Cosmos DB.

Azure Active Directory. AKS uses an Azure Active Directory (Azure AD) identity to create and manage other Azure resources such as Azure load balancers. Azure AD is also recommended for user authentication in client applications.

Azure Container Registry. Use Container Registry to store private Docker images, which are deployed to the cluster. AKS can authenticate with Container Registry using its Azure AD identity. Note that AKS does not require Azure Container Registry. You can use other container registries, such as Docker Hub.

Azure Pipelines. Pipelines is part of Azure DevOps Services and runs automated builds, tests, and deployments. You can also use third-party CI/CD solutions such as Jenkins.

Helm. Helm is a package manager for Kubernetes – a way to bundle Kubernetes objects into a single unit that you can publish, deploy, version, and update.

Azure Monitor. Azure Monitor collects and stores metrics and logs, including platform metrics for the Azure services in the solution and application telemetry. Use this data to monitor the application, set up alerts and dashboards, and perform root cause analysis of failures. Azure Monitor integrates with AKS to collect metrics from controllers, nodes, and containers, as well as container logs and master node logs.

Will discuss more about how to secure the Kubernetes platform in the next blog ….