
Big Data CoE
Analytics on Cloud - Conceptualization
Response to proposal, April 2019
Version: Draft (under review)

Feb-18
Understanding of Requirements
Key Objective
Vodafone Idea Limited (VIL) wants to build a Big Data and Analytics platform on Cloud to support the requirements of its Machine Learning models and
provide ML output feeds to downstream systems. VIL wants to apply security policies on the data before ingesting it onto the Cloud platform.

Requirements
Landing Zone
• Create a Landing Zone on VIL premises for the identified sources of data (EDW – Netezza and DB2, RDBMS – Oracle and SQL, flat files – csv and txt) as per the discussion with VIL
on 13th April
• Perform basic data reconciliation and data validations in the Landing Zone
• Apply data security such as data masking, data encryption and data tokenization on the PII and SPI data residing in the landing zone
• Hardware/software sizing for the Landing Zone (2 TB of data daily, to be available for 7 days)
Data Lake on Cloud
• Ingest the secured data from Landing Zone to data lake on Cloud
• Implement data processing for Machine Learning models
• Deploy the Machine Learning models and monitor the same
• Create data feeds to downstream systems (Flytxt, Call Center, Adobe Digital Platform) in flat file format
• Monitor and Manage the big data platform on Cloud (24X7)
• Cloud monitoring and administration (24X7)
• One-time initial data migration from sources to the Cloud platform (1.5 PB of data)

2
Assumptions

• All the source data would be available in Data Warehouse Systems (DB2 and Netezza) and RDBMS (Oracle and SQL) and flat files (csv, txt)

• The total number of feeds considered for ingesting data into landing zone is 1000 (100 – flat files, 200 – Oracle and SQL, 700 in EDW)

• Basic validations (e.g. null values, date formats) and reconciliation would be done in LZ layer

• Custom frameworks to support ingestion in the LZ are considered in scope, e.g. audit logging, reconciliation, data quality checks, file checks, housekeeping,
scheduling and orchestration, etc.

• Data transformation and data processing would not be done in the LZ and would be taken care of on the Cloud platform. The daily data would get
appended in the Landing Zone.

• Historical data would get migrated to Cloud platform directly from sources. LZ would not be configured for data migration.

• Managing and monitoring of the LZ would be done by the SI during the development phase.

• Customer to support the AWS Snowball preparation for applying security policies on data before loading to Cloud Platform

• Activities required for monitoring would be built during implementation. Resource sizing for post go-live monitoring activities would be done
later

• Network links and required routers will be procured by Vodafone Idea Limited.

• SLAs for Cloud network links will be as per commitment of respective network service provider.

• Link sizing is done as per the requirement of transferring 2 TB of data in 5 hrs. Network link utilization can be monitored subsequently for actual load and
traffic patterns to augment the links further.

3
Solution Architecture for the Use Case on AWS Cloud
[Architecture diagram spanning three layers:
• Source Data Layer (not in current scope): semi-structured data (XML, JSON), structured data, flat files (csv, txt), Adobe DMP & Analytics (Idea), RDBMS (Oracle, SQL) and EDW (DB2, Netezza), with batch data ingestion into the Landing Zone.
• Landing Zone on VIL premises: platform management layer; data processing layer (data validations, reconciliation, purging); storage layer (raw data layer and processed/secured data layer); security layer with a dedicated application server to mask structured data, KeySecure for key encryption (appliance based) and an encryption agent installed on each data node.
• Data Lake on AWS Cloud: platform management layer (Amazon CloudWatch, AWS CloudTrail, management portal); AWS Glue as the interface and reconciliation layer; batch processing layer (data cleanser, parser, enricher, transformer); data storage layer (granular data, curated layer, backup); advanced analytics (Machine Learning, Analytics Workbench, API); consumption layer serving analytical users, Flytxt, Adobe Digital Platform (live connect) and Call Center. A data migration path flows from the sources to the Cloud platform.]

4
Data Masking in Hadoop
Data masking uses a Data Masking Utility on a dedicated application server (Application Server 1; development effort is required here), together with SafeNet KeySecure and the Hadoop database.

Step 1: Run the Data Masking utility on the application server. It takes the input PII data from the Hadoop database (e.g. CC number: 1234567890) and generates a corresponding masked value.

Step 2: Authenticate with KeySecure. KeySecure is a centralized device where all the keys will be managed and owned by Vodafone.

Step 3: Once the data is masked, import it back into the Hadoop database (e.g. masked CC number: 1234XXXXXX). From there, any client that tries to access this data would only see the masked values.
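A minimal sketch of what such a masking utility could look like, assuming a simple "keep the first four digits, mask the rest" rule for card numbers; the function name and rule are illustrative assumptions, and the KeySecure authentication step is represented only by a comment:

# Illustrative masking sketch (assumed rule: keep first 4 digits, mask the rest).
# In the actual flow, the utility would first authenticate against the KeySecure
# appliance (Step 2) before reading or writing any data.

def mask_card_number(cc_number: str, visible_prefix: int = 4) -> str:
    """Return the card number with everything after the prefix replaced by 'X'."""
    digits = cc_number.strip()
    return digits[:visible_prefix] + "X" * max(len(digits) - visible_prefix, 0)

if __name__ == "__main__":
    print(mask_card_number("1234567890"))  # -> 1234XXXXXX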

5
Encryption at rest in Hadoop

Hadoop cluster (Name Node and Data Nodes):

• An Encryption Agent (SafeNet ProtectFile software) will be installed on each Data Node. The agent will encrypt the HDFS data directory containing the data blocks of HDFS (Hadoop Distributed File System).
• KeySecure is used to manage the encryption keys centrally.

For example, we encrypt the parent data directory of HDFS, which is defined in "hdfs-site.xml" under the property "dfs.datanode.data.dir":

<property>
  <name>dfs.datanode.data.dir</name>
  <value>/hadoop/hdfs/data</value>
  <final>true</final>
</property>

6
Solution Details
As per the requirement to ingest structured and unstructured data into the Landing Zone, TCS proposes a Hadoop based solution for the landing
zone. The solution details for the required use case are detailed below.

 Creating data extracts and loading into the Landing Zone

Source: EDW (Netezza, DB2) and RDBMS (Oracle and SQL)
• Scenario: Database access is available and a change data identifier is present
  Mode of Extraction: The data would be extracted using Sqoop and loaded into HDFS of the Hadoop landing zone
• Scenario: Database access is available and a change data identifier is not present
  Mode of Extraction: A standard CDC tool like OGG for Big Data would be used. The logs of the database would be read and, using a Kafka messaging queue, loaded into the HDFS file system of the Hadoop landing zone. Alternatively, VIL’s existing infrastructure to capture change data records on a landing area can be leveraged
• Scenario: Database access is not available
  Mode of Extraction: The source needs to provide incremental data extracts in flat file format to an SFTP location, from where the files would be copied and loaded into HDFS of the Hadoop landing zone using an SFTP tool

Source: Files (csv, txt)
• Mode of Extraction: The source needs to provide incremental data extracts in flat file format to an SFTP location, from where the files would be copied and loaded into HDFS of the Hadoop landing zone using an SFTP tool
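As an illustration of the Sqoop-based extraction above, a minimal incremental import could look like the sketch below; the connection string, credentials, table name, check column and target directory are placeholder assumptions, not the final design:

# Illustrative Sqoop incremental import driven from Python via subprocess.
# All connection details, table names and paths are placeholder assumptions.
import subprocess

sqoop_cmd = [
    "sqoop", "import",
    "--connect", "jdbc:oracle:thin:@//source-db-host:1521/ORCL",  # assumed source DB
    "--username", "etl_user",
    "--password-file", "/user/etl/.sqoop_pwd",       # password file stored on HDFS
    "--table", "SUBSCRIBER_USAGE",                   # assumed source table
    "--incremental", "lastmodified",                 # pull only changed rows
    "--check-column", "UPDATED_AT",                  # assumed change data identifier
    "--last-value", "2019-04-13 00:00:00",           # watermark from the previous run
    "--target-dir", "/landing/raw/subscriber_usage",
    "--num-mappers", "4",
]
subprocess.run(sqoop_cmd, check=True)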

7
Solution Details
 Data Processing in Landing Zone
• After loading the data into the landing zone, a custom framework would be developed for basic data validation and reconciliation of the data with the source (a minimal sketch follows below).
• Data masking and data encryption would be carried out using a 3rd party security tool deployed on the Application Server
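A minimal sketch of what the validation and reconciliation framework could do, assuming Spark on the landing zone cluster and a source-provided control file with expected record counts; the table names, columns, paths and control-file layout are illustrative assumptions:

# Illustrative landing-zone validation and reconciliation: compare HDFS record
# counts against a source-provided control file. Paths and schema are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lz-reconciliation").getOrCreate()

# Records landed for today's feed (assumed CSV extract in HDFS)
landed = spark.read.option("header", "true").csv("/landing/raw/subscriber_usage/2019-04-13")

# Basic validations: reject rows with null keys or unparseable event dates
valid = landed.filter(
    F.col("msisdn").isNotNull()
    & F.to_date("event_date", "yyyy-MM-dd").isNotNull()
)

# Control totals supplied by the source system (assumed layout: feed_name, expected_count)
control = spark.read.option("header", "true").csv("/landing/control/2019-04-13")
expected = int(control.filter(F.col("feed_name") == "subscriber_usage").first()["expected_count"])

landed_count, valid_count = landed.count(), valid.count()
print(f"landed={landed_count} valid={valid_count} expected={expected}")
if landed_count != expected:
    raise RuntimeError("Reconciliation failed: landed count does not match the source control total")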
 Data Ingestion from Landing Zone to Cloud
For the AWS Cloud option, data ingestion from the Landing Zone into the data lake would be done using AWS Glue (a minimal job skeleton is sketched below)
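A minimal AWS Glue (PySpark) job skeleton for moving the secured landing-zone extracts into the raw S3 layer; the staging prefix, bucket names and formats are illustrative assumptions, not the final design:

# Illustrative AWS Glue job: read secured extracts staged for ingestion and land
# them in the raw S3 layer. Paths and bucket names are placeholder assumptions.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source: secured extracts staged for ingestion (assumed S3 staging prefix)
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://vil-lz-staging/subscriber_usage/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Sink: raw data layer, stored as Parquet for downstream processing
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://vil-datalake-raw/subscriber_usage/"},
    format="parquet",
)
job.commit()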

 Data Processing in Big Data Lake on AWS Cloud


Data Quality Checks and reconciliation

Once the raw data is loaded into AWS S3 in raw format, the data would be validated and cleansed as per the quality requirements and stored in the Processed
S3 bucket by creating custom components (a cleansing sketch follows below)
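A minimal sketch of such a custom cleansing component, assuming Spark (e.g. on Glue or EMR); the bucket names, columns and cleansing rules are illustrative assumptions:

# Illustrative raw-to-processed cleansing step on S3. Buckets, columns and
# rules are placeholder assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("raw-to-processed").getOrCreate()

raw = spark.read.parquet("s3://vil-datalake-raw/subscriber_usage/")

processed = (
    raw.dropDuplicates(["msisdn", "event_date"])              # remove duplicate events
       .filter(F.col("msisdn").isNotNull())                   # enforce mandatory key
       .withColumn("data_volume_mb", F.col("data_volume_mb").cast("double"))  # fix types
)

processed.write.mode("overwrite").parquet("s3://vil-datalake-processed/subscriber_usage/")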

Data Curation

As per the requirement to build the ML models, the data would be aggregated and stored in S3/Redshift using AWS Glue and/or SageMaker (an aggregation sketch follows below)
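A sketch of the kind of aggregation the curation step could perform before the curated data is used by the ML models; the feature definitions and paths are illustrative assumptions:

# Illustrative curation step: aggregate processed data into per-subscriber
# features for the ML models. Feature names and paths are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("curation").getOrCreate()

processed = spark.read.parquet("s3://vil-datalake-processed/subscriber_usage/")

features = processed.groupBy("msisdn").agg(
    F.sum("data_volume_mb").alias("total_data_mb"),
    F.countDistinct("event_date").alias("active_days"),
)

features.write.mode("overwrite").parquet("s3://vil-datalake-curated/subscriber_features/")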

 Data Feed to Downstream Systems


• The data would be fed to the downstream systems using the File Transfer Protocol (a minimal transfer sketch follows below). Alternate options of sharing access through API creation or JDBC connect
are also available based on the requirement.
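A minimal sketch of the file-based feed, assuming SFTP via the paramiko library; the host name, credentials and paths are illustrative assumptions:

# Illustrative SFTP push of a generated flat-file feed to a downstream system
# (e.g. Flytxt). Host, credentials and paths are placeholder assumptions.
import paramiko

transport = paramiko.Transport(("downstream-sftp.example.com", 22))
transport.connect(username="vil_feed", password="********")
sftp = paramiko.SFTPClient.from_transport(transport)
try:
    sftp.put("/exports/flytxt/ml_scores_20190413.csv", "/inbound/ml_scores_20190413.csv")
finally:
    sftp.close()
    transport.close()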

 Data Migration

• The data from sources would get migrated to AWS Cloud platform using AWS Snowball

• The data masking and encryption as per VIL’s requirement would be done on premises with the support of VIL.

8
Infra Requirement – Landing Zone
Landing Zone – Indicative Sizing

Production Environment
• Cloudera Master (Secondary NM, Metadata DB, HDFS, Sentry, Management Service & Edge Services) – Node Count: 3, Storage: 3 x 1 TB SATA, OS Disk: 2 x 300 GB (RAID 1), Memory: 128 GB, CPU: 2 x 8 cores (with Hyper-Threading), Network: 10 Gbps, OS: RHEL 7
• Cloudera Data Nodes – Node Count: 3, Storage: 5 x 6 TB SATA, OS Disk: 2 x 300 GB (RAID 1), Memory: 128 GB, CPU: 2 x 8 cores (with Hyper-Threading), Network: 10 Gbps, OS: RHEL 7

Test + Dev Environment
• Cloudera Master Node + Edge Node – Node Count: 1, Storage: 3 x 1 TB SATA, OS Disk: 2 x 300 GB (RAID 1), Memory: 128 GB, CPU: 2 x 8 cores (with Hyper-Threading), Network: 10 Gbps, OS: RHEL 7
• Cloudera Data Nodes – Node Count: 2, Storage: 2 x 6 TB SATA, OS Disk: 2 x 300 GB (RAID 1), Memory: 128 GB, CPU: 2 x 8 cores (with Hyper-Threading), Network: 10 Gbps, OS: RHEL 7

• The sizing is based on daily data ingestion of 2 TB to be available for 7 days. Older data would be purged as there is no archival requirement

• No data processing or data transformation is required in the Landing Zone. Basic validation and reconciliation would be done

• Data encryption would be done using a 3rd party security tool

• The infrastructure can be scaled out by adding more compute nodes in case the load increases due to the addition of processing or any other operations

• Dev and UAT will share a single cluster with separate areas for UAT and Dev, and user based access for both environments will be provisioned

• HA for the Dev and UAT environments for data platforms is not considered; it can be provisioned if required

9
Infra Requirement – Security Application Server
Production Environment
• Application Hypervisor Nodes (Application Server for Data Masking, Webserver) – Node Count: 2, Storage Size: 200 GB SATA, OS Disk (SSD): 2 x 50 GB, Memory: 32 GB, CPU: 4 cores, Network: 1 NIC, OS/Software: Tomcat webserver

• These can be deployed on virtual machines; 2 nodes are stated for HA

• The same configuration would also be needed on the Cloud platform.

10
Networking Solution

11
Cloud Connectivity Model | High level Architecture

[Diagram: high level connectivity from the VIL DC to the Public Cloud. The VIL CE router connects through the ISP PE and Cloud PE over the VIL MPLS network, using two 1 Gbps links: a primary link and a secondary link to the Public Cloud.]

12
Cloud Connectivity & Availability

13
Network Sizing & Availability

• Dual 1 Gbps Active-Active private connect MPLS links for high availability from the VIL DC to the Cloud from MPLS service
providers, considering 2 TB of data to be transferred in 5 hours
• These 2 links will provide high and steady throughput as well as act as redundancy for each other
• In case high availability is not required, a single 1 Gbps link is sufficient for transferring 2 TB of data in 5 hours (see the check below)
• Dual routers at the Vodafone DC for maintaining high availability in case of 2 links
• SLAs of 99.90% can be obtained with managed links from service providers
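Back-of-envelope check of the stated link sizing (assuming 1 TB ≈ 8,000 Gb and ignoring protocol overhead):

Required throughput = (2 TB x 8,000 Gb/TB) / (5 h x 3,600 s/h)
                    = 16,000 Gb / 18,000 s
                    ≈ 0.89 Gbps, which fits within a single 1 Gbps link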

14
ICC (Integrated Command Center) Managed Shared IT Infrastructure Services

In the Integrated Command Center (ICC) model, services are provided with shared resources and integrated
tools to achieve higher service levels, greater responsiveness, agile expansion and reduced setup time, on a
pre-built and working platform - ready-to-go support:
 Defined and time-tested ITIL framework processes
 Reduced time to roll out and integrate into operations support
 Integrated tool set (Ticketing, Monitoring, Remote Management, Access Management & Automation)
 Pre-hosted environment for tools
 Shared resources - Service Desk, EUC and DC L1, L2, Service Management & Voice Quality Assurance
 Optimized resources based on skill for L1, L2 and specific niche skills as required
 Flexibility to align with client requirements
 Privileged Identity Management (PIM) & Automation as part of service delivery
 Triple ISO certified - ISO 9000, ISO 20000 & ISO 27001

15
Solution on GCP and Azure

16
Solution Architecture for the Use Case on GCP Cloud
[Architecture diagram: the same layering as the AWS solution, realized with GCP services.
• Source Data Layer (not in current scope): semi-structured data (XML, JSON), structured data, flat files (csv, txt), Adobe DMP & Analytics (Idea), RDBMS (Oracle, SQL) and EDW (DB2, Netezza), with batch data ingestion into the Landing Zone.
• Landing Zone on VIL premises: platform management layer; data processing layer (data validations, reconciliation, purging); storage layer (raw data layer and processed/secured data layer); security layer with a dedicated application server to mask structured data, KeySecure for key encryption (appliance based) and an encryption agent installed on each data node; format, encryption & compression before transfer.
• GCP Data Lake: platform management layer (Cloud IAM, metadata management, monitoring, audit & balancing, data governance, workflow orchestration with Cloud Composer); data migration using gsutil and Data Migration Services; data processing layer (data recon, validations, enricher, transformer, delta processing) with Cloud Dataprep, Cloud Dataproc and Cloud Dataflow; data storage layer with Cloud Storage (raw and processed); advanced analytics with Cloud ML Engine and BigQuery ML; consumption layer (API, file sharing) serving analytical users, Flytxt, Adobe Digital Platform and Call Center.]

17
Solution Architecture for Use Case on Azure
[Architecture diagram: the same layering as the AWS solution, realized with Azure services.
• Source Data Layer (not in current scope): semi-structured data (XML, JSON), structured data, flat files (csv, txt), Adobe DMP & Analytics (Idea), RDBMS (Oracle, SQL) and EDW (DB2, Netezza), with batch data ingestion into the Landing Zone.
• Landing Zone on VIL premises: platform management layer; data processing layer (data validations, reconciliation, purging); storage layer (raw data layer and processed/secured data layer); security layer with a dedicated application server to mask structured data, KeySecure for key encryption (appliance based) and an encryption agent installed on each data node.
• Azure Data Lake: platform management layer (Application Insights, OMS Log Analytics, Azure Data Catalog, Azure Key Vault, Azure Data Factory, Azure Active Directory); data migration via Azure Data Factory and physical device transfer; data processing layer (data recon, validations, enricher, transformer, delta processing); Azure storage layer with Azure Blob Storage and Azure SQL Data Warehouse; advanced analytics; consumption layer (API, file sharing, file transfer) serving analytical users, Flytxt, Adobe Digital Platform and Call Center.]

18
Team Structure

19
Proposed Team Structure
Resources/Roles | Skills/Profiles | Number of Resources
Program Manager | Project kick-off, project scheduling, work assignment and tracking, status reporting, handling issues and escalations | 1
Cloud Architect | Responsible for the cloud solution based on the use case and future requirements | 1
Big Data Architect | Architecting the Big Data solution for the use case and future requirements | 1
Hadoop Tech Lead | Responsible for understanding the requirements and the Hadoop layer high level and detailed design, helping the developers for successful delivery | 1
Hadoop Developers | Responsible for developing the Hadoop platform for data ingestion and the various frameworks for auditing, reconciliation, etc. | 4
Cloud Tech Lead | Responsible for understanding the requirements and the cloud layer high level and detailed design, helping the developers for successful delivery | 1
Cloud Developers | Work on API, Spark, Glue, Java, etc. | 4
Cyber Security SME | Understand and design the security layers and adhere to the VIL security policy | 2
Network Administrator | Network administration | For BAU support the options are: (a) VIL network team; (b) TCS resource from ICC - shared service model; (c) 6 TCS resources for 24x7 support
Quality Resource | Testing of Hadoop and Cloud systems | 2
Cloud Admin | Monitoring and managing the cloud platform | 2
Hadoop Admin | Manage and monitor the Hadoop platform and build the activities required for monitoring | 1

The above resource counts are based on the assumptions stated in the Assumptions slide and would vary once the requirements are finalized.

20
Thank you!
Looking forward to working with you on your data journey.
