Cloud platforms are transforming the way enterprises manage and consume data. Big data and analytics services introduced by cloud providers now help businesses make better decisions and open up new opportunities based on data insights.
A reference architecture is a collection of modules that breaks a solution down into elements, each providing capabilities that address a set of concerns.
A reference architecture has to address multiple concerns:
- Concerns related to external constraints on the system to be built and the resulting design decisions.
- Concerns related to reuse or sharing of modules; capabilities provided by a module are typically shared within the system or exposed externally.
- Concerns aligned with stakeholder communities of interest or stakeholder roles.

Big Data Reference Architecture - Modules
Big data platform modules can be grouped under three categories: application provider, framework provider and cross-cutting.
- Big Data Application Provider Modules.
- Big Data Framework Provider Modules.
- Cross-cutting Modules.

Big Data Application Provider Modules
Application Orchestration Module-Application Orchestration configures and combines other modules of the big data Application Provider, integrating activities into a cohesive application. An application is the end-to-end data processing through the system to satisfy one or more use cases.
Collection Module-The Collection module is primarily concerned with the interface to external Data Providers.
Preparation Module –The main concern of the Preparation module is transforming data to make it useful for the other downstream modules.
- Data validation.
- Cleansing.
- Optimization.
- Data Transformation and standardization.
- Performance optimization for faster lookups.
Analytics Module-This module is concerned with efficiently extracting insights from the data. Analytics can contribute further to the transform stage of the ETL cycle by performing more advanced transformations.
Visualization Module -The Visualization module is concerned with presenting processed data and the outputs of analytics to a human Data Consumer, in a format that communicates meaning and knowledge. It provides a “human interface” to the big data.
Access Module-The Access module is concerned with the interactions with external actors, such as the Data Consumer, or with human users.
Big Data Framework Provider Modules
Processing Module– The Processing module is concerned with efficient, scalable, and reliable execution of analytics. A common solution pattern to achieve scalability and efficiency is to distribute the processing logic and execute it locally on the same nodes where data is stored, transferring only the results of processing over the network.
Messaging Module –The Messaging module is concerned with reliable queuing, transmission, and delivery of data and control functions between components.
Data Storage Module- The primary concerns of the Data Storage module are providing reliable and efficient access to the persistent data.
Infrastructure Module-The Infrastructure module provides the infrastructure resources necessary to host and execute the activities. Infrastructure and data centre design are concerns when architecting a big data solution, and can be an important factor in achieving the desired performance. Big data infrastructure needs to be scalable and reliable, and must support the target workloads.
Cross-cutting modules
Security Module-The Security module is concerned with controlling access to data and applications, including enforcement of access rules and restricting access based on classification.
Management Module-The Management module covers system management, including activities such as monitoring, configuration, provisioning and control of infrastructure and applications, and data management, covering the data life cycle of collection, preparation, analytics, visualization and access.
Federation Module– The Federation module is concerned with interoperation between federated instances of the platform.
Moving on, let's discuss the big data and analytics platform components provided by the different cloud providers and how we can build a big data platform using those components.
Big Data Cloud Platform - Logical Architecture
Below is the logical reference architecture of the big data platform; we will map the components provided by each cloud provider onto it and see how the platform can be designed.

AWS - Big Data Platform
First, let's map the components provided by AWS to design and build the big data platform.

Ingest
AWS has Kinesis Data Streams to handle high-frequency, real-time data. Data producers can push data in real time to Kinesis streams, and you can also connect Kinesis to an Apache Storm cluster; a minimal ingestion sketch follows the list below.
Kinesis Data Firehose can be used for large-scale data ingestion; data pushed to Firehose can be automatically delivered to storage layers such as S3, the Redshift database and the Elasticsearch service.
- Kinesis Video Streams: ingesting video streams.
- Kinesis Data Streams: processing streams with popular frameworks such as Apache Spark and Flink.
- Kinesis Data Firehose: capture, lightly transform and load data directly into storage and BI targets.
- Kinesis Data Analytics: the ability to run SQL on streaming data.
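As a minimal sketch of pushing a record into a stream with the AWS SDK for Python (boto3) — the stream name and payload below are hypothetical:

```python
import json
import boto3

# Hypothetical stream name and payload; producers push records to the stream,
# and downstream consumers (Firehose, Spark, Flink, etc.) read from it.
kinesis = boto3.client("kinesis")

kinesis.put_record(
    StreamName="example-clickstream",
    Data=json.dumps({"user_id": "u-123", "event_type": "page_view"}),
    PartitionKey="u-123",
)
```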
AWS Snowball can be used for transporting data in and out of the cloud. Snowball devices come in 50 TB and 80 TB sizes, and multiple Snowballs can be daisy-chained for larger data volumes.
Process
EMR: AWS Elastic MapReduce (EMR) is a managed Hadoop, Spark and Presto solution. EMR takes care of setting up the underlying EC2 cluster and provides integration with a number of AWS services, including S3 and DynamoDB. Spark can run on EMR clusters to perform big data computation; a minimal sketch is shown below.
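A minimal Spark job of the kind that might run on EMR, reading raw data from S3 and writing curated output back — bucket names, paths and columns are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On EMR the job would normally be submitted with spark-submit; paths are placeholders.
spark = SparkSession.builder.appName("curate-events").getOrCreate()

raw = spark.read.json("s3://example-raw-bucket/events/2019/04/")

curated = (
    raw.filter(F.col("event_type").isNotNull())            # basic validation
       .withColumn("event_date", F.to_date("event_ts"))    # standardize the timestamp
)

(curated.write.mode("overwrite")
        .partitionBy("event_date")
        .parquet("s3://example-curated-bucket/events/"))
```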
Data Pipeline is a data orchestration product that moves, copies, transforms and enriches data. It manages the scheduling, orchestration and monitoring of the pipeline activities.
Store
AWS S3 is the primary storage layer because of its durability and availability, and it can maintain segregation between raw storage buckets, transformed/discovery data buckets and curated data buckets.
Redshift is a columnar database that can be used for terabyte- to petabyte-scale data warehouses.
AWS DynamoDB can be used to store key-value metadata, for example to build a data catalog over the data lake's raw storage; a minimal sketch follows.
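A minimal sketch of registering a dataset in such a DynamoDB-backed catalog — the table name, key schema and attributes are assumptions, not a prescribed design:

```python
import boto3

# Hypothetical metadata-catalog table; "dataset" is assumed to be the partition key.
dynamodb = boto3.resource("dynamodb")
catalog = dynamodb.Table("datalake-catalog")

catalog.put_item(
    Item={
        "dataset": "customer_events",
        "s3_location": "s3://example-raw-bucket/events/",
        "format": "json",
        "owner": "analytics-team",
    }
)
```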
Access
External tables defined over AWS S3 enable services like AWS Athena to run SQL on files stored in S3. AWS Redshift Spectrum makes it possible to query Redshift tables and S3 external tables together using SQL. A minimal Athena sketch is shown below.
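A minimal sketch of submitting an Athena query with boto3 — the database, table and result location are hypothetical:

```python
import boto3

# Athena reads the data directly from S3; results land in the output bucket below.
athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString=(
        "SELECT event_type, COUNT(*) AS events "
        "FROM customer_events GROUP BY event_type"
    ),
    QueryExecutionContext={"Database": "datalake"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])
```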
Metadata Management: metadata can be managed with AWS-native services such as the Glue Data Catalog, or built bespoke on NoSQL databases such as DynamoDB, with advanced search enabled through services such as AWS CloudSearch or the AWS Elasticsearch Service.
Document Classification: AWS Macie can discover, classify and protect sensitive data in AWS. It classifies documents based on rules, which can evolve further based on machine learning models for data discovery, classification and protection.
Model
AWS makes it very easy to create predictive models without the need to learn complex algorithms. To create a model, users are guided through selecting data, preparing data, and training and evaluating models via a simple wizard-based UI. Once trained, the model can be used to create predictions via an online API (request/response) or a batch API for processing multiple input records.
Consumption
AWS offers QuickSight for data visualization. Dashboards can be built from data stored across most AWS data storage services, and QuickSight supports a number of third-party solutions.
AWS Bigdata Platform Security Components
- AWS Identity and Access Management (IAM) lets you define individual user accounts with permissions across AWS resources, with MFA for privileged accounts.
- Data encryption capabilities are available in AWS storage and database services such as EBS, S3, Glacier, Oracle RDS, SQL Server RDS, and Redshift.
- AWS Key Management Service (KMS) lets you choose whether AWS manages the encryption keys or the customer keeps complete control over them; a minimal server-side encryption sketch follows this list.
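A minimal sketch of writing an object to S3 with server-side encryption under a KMS key — the bucket, object key and key alias are hypothetical:

```python
import boto3

# S3 encrypts the object server-side with the specified KMS key before storing it.
s3 = boto3.client("s3")

s3.put_object(
    Bucket="example-curated-bucket",
    Key="reports/daily_summary.csv",
    Body=b"event_type,events\nlogin,1042\n",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/example-datalake-key",
)
```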
Azure - Big Data Platform
Next, let's map the components provided by Azure to design and build the big data platform.

Ingest
Event Hub – Azure Event Hubs is a big data streaming platform and event ingestion service, capable of receiving and processing millions of events per second. Event Hubs can process and store events, data, or telemetry produced by distributed software and devices. Data sent to an event hub can be transformed and stored by using any real-time analytics provider or batching/storage adapters. Event Hubs for Apache Kafka enables native Kafka clients, tools, and applications such as MirrorMaker and Apache Flink. A minimal producer sketch is shown below.
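A minimal sketch of sending an event with the azure-eventhub Python SDK — the connection string and event hub name are placeholders for an existing Event Hubs namespace:

```python
from azure.eventhub import EventData, EventHubProducerClient

# Connection string and event hub name are placeholders.
producer = EventHubProducerClient.from_connection_string(
    conn_str="Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=...",
    eventhub_name="telemetry",
)

with producer:
    batch = producer.create_batch()
    batch.add(EventData('{"device": "sensor-01", "temperature": 21.5}'))
    producer.send_batch(batch)
```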
Stream Analytics - Stream Analytics is Microsoft's addition to its suite of advanced, fully managed, serverless Platform-as-a-Service (PaaS) cloud components. Spinning up complex data pipelines and analytics has in the past been both time-consuming and expensive, but can now be done within minutes to hours at a very reasonable cost. Azure Stream Analytics currently supports three types of inputs: Blob storage, IoT Hub and Event Hubs.
Process
Azure's managed Apache platform is HDInsight, which comes with Hadoop, Spark, Storm or HBase. The platform has a standard and a premium tier, the latter including the option of running R Server, Microsoft's enterprise solution for building and running R models at scale. HDInsight comes with a local HDFS and can also connect to Blob storage or Data Lake Store.
Azure Data Factory is a data orchestration service used to build data processing pipelines. Data Factory can read data from a range of Azure and third-party data sources and, through the Data Management Gateway, can connect to and consume on-premises data. Data Factory comes with a range of activities that can run compute tasks in HDInsight, Azure Machine Learning, stored procedures, Data Lake and custom code running on Batch.
Store
Azure Storage is highly available, secure, durable, scalable, and redundant. Azure Storage includes Azure Blobs (objects), Azure Data Lake Storage Gen2, Azure Files, Azure Queues, and Azure Tables.
Azure Data Lake Store: Azure Data Lake Storage is a high-speed, scalable, secure and cost-effective platform. The high-performance Azure Blob File System (ABFS) driver is built for big data analytics and is compatible with the Hadoop Distributed File System. Attractive features of the service include encryption of data in transit with TLS 1.2, encryption of data at rest, storage account firewalls, virtual network integration, role-based access control, and hierarchical namespaces with granular ACLs. A minimal upload sketch is shown below.
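A minimal sketch of landing a raw file in Blob storage with the azure-storage-blob SDK — the connection string, container and blob path are placeholders:

```python
from azure.storage.blob import BlobServiceClient

# Connection string, container and blob path are placeholders for an existing storage account.
service = BlobServiceClient.from_connection_string(
    "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;EndpointSuffix=core.windows.net"
)

blob = service.get_blob_client(container="raw", blob="events/2019/04/events.json")
with open("events.json", "rb") as data:
    blob.upload_blob(data, overwrite=True)
```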
Azure Data Catalog is a registry of data assets within an organisation. Technical and business users can then use Data Catalog to discover datasets and understand their intended use.
Model
Azure Machine Learning is a fully managed data science platform used to build and deploy powerful predictive and statistical models. Azure Machine Learning comes with a flexible UI canvas and a set of predefined modules that can be used to build and run data science experiments. The platform comes with a series of predefined machine learning models and includes the ability to run custom R or Python code. Trained models can be published as web services for consumption, either as a real-time request/response API or for batch execution. The platform also comes with interactive Jupyter notebooks for recording and documenting lab notes.
Consumption
Power BI can consume data from a range of Azure and third-party services, as well as connect to on-premises data sources. It also allows users to run R scripts and embed R-generated visuals.
Cognitive Services is a suite of readymade intelligence APIs that make it easy to enable and integrate advanced speech, vision, and natural language into business solutions.
Azure Big Data Platform Security
Azure Storage Service Encryption (SSE) can automatically encrypt data before it is stored, and it automatically decrypts the data when you retrieve it. The process is completely transparent to users. Storage Service Encryption uses 256-bit Advanced Encryption Standard (AES) encryption, which is one of the strongest block ciphers available. AES handles encryption, decryption, and key management transparently.
Transparent Data Encryption (TDE) is used to encrypt SQL Server, Azure SQL Database, and Azure SQL Data Warehouse data files in real time, using a Database Encryption Key (DEK), which is stored in the database boot record for availability during recovery.
TDE protects data and log files using the AES and Triple Data Encryption Standard (3DES) encryption algorithms.
Azure Cosmos DB is Microsoft’s globally distributed, multi-model database. User data that’s stored in Cosmos DB in non-volatile storage (solid-state drives) is encrypted by default.
Data Lake Store supports “on by default,” transparent encryption of data at rest.
Microsoft uses the Transport Layer Security (TLS) protocol to protect data when it’s traveling between the cloud services and customers. Microsoft datacenters negotiate a TLS connection with client systems that connect to Azure services. TLS provides strong authentication, message privacy, and integrity (enabling detection of message tampering, interception, and forgery), interoperability, algorithm flexibility, and ease of deployment and use.
Azure Key Vault – a secure secrets store for the passwords, connection strings, and other information you need to keep your apps working. A minimal retrieval sketch is shown below.
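A minimal sketch of reading a secret from Key Vault with the azure-keyvault-secrets SDK — the vault URL and secret name are placeholders, and DefaultAzureCredential picks up whatever identity is available (managed identity, environment variables or an Azure CLI login):

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Vault URL and secret name are placeholders.
client = SecretClient(
    vault_url="https://example-vault.vault.azure.net/",
    credential=DefaultAzureCredential(),
)

secret = client.get_secret("datalake-storage-connection-string")
print(secret.name)  # the value is available as secret.value
```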
Google Cloud - Big Data Platform
Finally, let's map the components provided by Google to design and build the big data platform.

Ingest
Pub/Sub:
Cloud Pub/Sub is a scalable, durable event ingestion and delivery system that supports the publish-subscribe pattern at both large and small scales. Cloud Pub/Sub makes your systems more robust by decoupling publishers and subscribers of event data; a minimal publishing sketch is shown below.
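A minimal sketch of publishing a message with the google-cloud-pubsub client — the project and topic IDs are placeholders for an existing topic:

```python
from google.cloud import pubsub_v1

# Project and topic IDs are placeholders; the client uses application default credentials.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("example-project", "telemetry")

future = publisher.publish(
    topic_path,
    data=b'{"device": "sensor-01", "temperature": 21.5}',
)
print(future.result())  # blocks until the server returns the message ID
```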
Process
Cloud Dataproc is Google's fully managed Hadoop and Spark offering. Google boasts an impressive 90-second lead time to start or scale Cloud Dataproc clusters, by far the quickest of the three providers. Pricing is based on the underlying Compute Engine costs plus an additional charge per vCPU per minute. An HDFS-compliant connector is available for Cloud Storage, which can be used to store data that needs to survive after the cluster has been shut down.
Data processing pipelines can be built using Cloud Dataflow, a fully programmable framework (available for Java and Python) and distributed compute platform. Cloud Dataflow supports both batch and streaming workers; a minimal Apache Beam sketch is shown below.
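A minimal Apache Beam pipeline in Python of the kind Cloud Dataflow executes — bucket paths are placeholders, and switching the runner to DataflowRunner (with a project and region configured) would run it as a managed Dataflow job:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# DirectRunner executes locally; DataflowRunner would run the same pipeline on Cloud Dataflow.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://example-raw-bucket/events/*.csv")
        | "ExtractEventType" >> beam.Map(lambda line: line.split(",")[0])
        | "Count" >> beam.combiners.Count.PerElement()
        | "Format" >> beam.Map(lambda kv: "{},{}".format(kv[0], kv[1]))
        | "Write" >> beam.io.WriteToText("gs://example-curated-bucket/event_counts")
    )
```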
Store
BigQuery: BigQuery is Google's fully managed, petabyte-scale, low-cost analytics data warehouse. BigQuery is a fully managed PaaS offering from Google; there is no infrastructure to manage, so you can focus on analyzing data to find meaningful insights, use familiar SQL, and take advantage of the pay-as-you-go model.
BigQuery can connect to a variety of visualization tools, providing the ability to generate reports using the customer's tool of preference. Ad hoc analyses of the data can also be done directly in the BigQuery UI using the Query Editor; a minimal client sketch is shown below.
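A minimal sketch of running a query with the google-cloud-bigquery client — the project, dataset and table are hypothetical:

```python
from google.cloud import bigquery

# The client uses application default credentials; dataset and table are placeholders.
client = bigquery.Client(project="example-project")

query = """
    SELECT event_type, COUNT(*) AS events
    FROM `example-project.datalake.customer_events`
    GROUP BY event_type
    ORDER BY events DESC
"""

for row in client.query(query).result():
    print(row.event_type, row.events)
```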
Cloud Datastore is a high-performance NoSQL database designed for auto-scaling and ease of application development. While it is NoSQL, Datastore offers many features similar to traditional databases. Other databases that could be considered for the solution are Cloud SQL and Cloud Spanner. A minimal write sketch is shown below.
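A minimal sketch of writing an entity with the google-cloud-datastore client — the kind, key name and properties are illustrative only:

```python
from google.cloud import datastore

# The client uses application default credentials; kind and properties are placeholders.
client = datastore.Client(project="example-project")

key = client.key("DatasetMetadata", "customer_events_raw")
entity = datastore.Entity(key=key)
entity.update({
    "location": "gs://example-raw-bucket/events/",
    "format": "csv",
    "owner": "analytics-team",
})
client.put(entity)
```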
Model
Cloud AutoML is a fully managed platform for training and hosting TensorFlow models. It relies on Cloud Dataflow for data and feature processing and Cloud Storage for data storage. There is also Cloud Datalab, a lab notebook environment based on Jupyter. A set of pre-trained models is also available: the Vision API detects features in images, such as text, faces or company logos; the Speech API converts audio to text across a range of languages; the Natural Language API can be used to extract meaning from text; and there is an API for translation.
Consumption
Data Studio
Data Studio is closely integrated with Google Cloud and allows you to easily access data from Google Analytics, Google Ads, Display & Video 360, Search Ads 360, YouTube Analytics, Google Sheets, Google BigQuery and over 500 more data sources, both Google and non-Google, to visualize and interactively explore data. It makes it easy to share your insights, and beyond sharing, Data Studio offers seamless real-time collaboration with others.
Google announced a lot of enhancements to their big data platform at Cloud Next '19.
Here’s an overview of what’s new:
- Simplifying data migration and integration
  - Cloud Data Fusion (beta)
  - BigQuery DTS SaaS application connectors (beta)
  - Data warehouse migration service to BigQuery (beta)
  - Cloud Dataflow SQL (public alpha, coming soon)
  - Dataflow FlexRS (beta)
- Accelerating time to insights
  - BigQuery BI Engine (beta)
  - Connected Sheets (beta, coming soon)
- Turning data into predictions
  - BigQuery ML (GA, coming soon), with additional models supported
  - AutoML Tables (beta)
- Enhancing data discovery and governance
  - Cloud Data Catalog (beta, coming soon)
Google Cloud Big Data Platform Security
Encryption by default in transit and at rest
Cloud Key Management System (KMS) –Manage cryptographic keys for your cloud services.
Cloud Data Loss Prevention (DLP) - fast, scalable de-identification for sensitive data like credit card numbers, names, social security numbers, and more. It is mainly aimed at text data and allows you to detect and redact sensitive data such as credit card numbers, phone numbers and names.
Backup and recovery-In storage, encryption at rest protects data on backup media. Data is also replicated in encrypted form for backup and disaster recovery.
Cloud Data Catalog-Fully managed and scalable metadata management service that empowers you to quickly discover, manage, and understand your data.
Cloud Identity-Aware Proxy (Cloud IAP) controls access to cloud applications running on Google Cloud Platform.
Access Transparency for GCP is a service that creates logs in near real time when GCP administrators interact with your data for support purposes.
Conclusion
Big data and analytics are becoming a critical component of modern business and a core capability that is driving cloud adoption. All three providers offer similar building blocks for data processing, orchestration, streaming analytics, ML and visualization.
AWS has all the bases covered with a solid set of products that will meet most needs except managed lab notebooks.
Azure offers a comprehensive and impressive suite of managed analytical products. It supports open-source big data solutions alongside newer serverless analytical products such as Data Lake, and offers everything from pre-trained models through to custom R models running over big data. Azure also gives organisations the capability to track and document their data assets.
Google provides its own set of products across its range of services. With Dataproc and Dataflow, Google has a strong core to its offerings. TensorFlow and Cloud AutoML are getting a lot of attention, and Google has a rich set of pre-trained APIs.
We will discuss serverless big data platform pipelines and architecture in the next blog.
Ref: AWS, Google and Azure online documentation.