TCS Big Data Lake Presentation - VIL - 17apr2019
Analytics on Cloud – Conceptualization
Response to proposal, April 2019
Version: Draft (under review)
Understanding of Requirements
Key Objective
Vodafone Idea Limited (VIL) wants to build a Big Data and Analytics platform on Cloud to support the requirements of its Machine Learning models and to provide ML output feeds to downstream systems. VIL wants to apply security policies to data before ingesting it onto the Cloud platform.
Requirements
Landing Zone
• Create a Landing Zone on VIL premises for the mentioned data sources (EDW – Netezza and DB2, RDBMS – Oracle and SQL, flat files – csv and txt), as per the discussion with VIL on 13th April
• Perform basic data reconciliation and data validations on Landing Zone
• Apply data security measures such as data masking, data encryption and data tokenization to the PII and SPI data residing in the Landing Zone
• Hardware/Software sizing for Landing zone (2 TB of data daily to be available for 7 days)
Data Lake on Cloud
• Ingest the secured data from Landing Zone to data lake on Cloud
• Implement data processing for Machine Learning models
• Deploy the Machine Learning models and monitor the same
• Create data feeds to downstream systems (Flytxt, Call Center, Adobe Digital Platform) in flat file format
• Monitor and Manage the big data platform on Cloud (24X7)
• Cloud monitoring and administration (24X7)
• One time initial data migration from source to Cloud platform (1.5 PB of data)
Assumptions
• All the source data would be available in Data Warehouse Systems (DB2 and Netezza) and RDBMS (Oracle and SQL) and flat files (csv, txt)
• The total number of feeds considered for ingesting data into the landing zone is 1000 (100 from flat files, 200 from Oracle and SQL, 700 from EDW)
• Basic validations (e.g. null values, date formats) and reconciliation would be done in LZ layer
• Custom frameworks to support ingestion in the LZ are considered in scope, e.g. audit logging, reconciliation, data quality checks, file checks, housekeeping, scheduling and orchestration, etc.
• Data transformation and data processing would not be done in the LZ and would be taken care of on the Cloud platform. The daily data would get appended in the Landing Zone.
• Historical data would get migrated to Cloud platform directly from sources. LZ would not be configured for data migration.
• Customer to support the AWS Snowball preparation for applying security policies on data before loading to Cloud Platform
• Activities required for monitoring would be built during implementation. Resource sizing for post go-live monitoring activities would be done
later
• Network links and required routers will be procured by Vodafone Idea Limited.
• SLAs for Cloud network links will be as per commitment of respective network service provider.
• Link sizing is done as per the requirement of transferring 2 TB of data in 5 hours. Network link utilization can be monitored subsequently against the actual load and traffic pattern to augment the links further.
Solution Architecture for the Use Case on AWS Cloud
[Architecture diagram: Source Data Layer → Landing Zone on VIL Premise → Data Lake on Cloud]
Data Masking in Hadoop
[Diagram: data masking flow – in Step 1, source records (e.g. CC number: 1234567890) are passed to a Data Masking Utility on Application Server 1, where the masking is performed (development effort is required here); by Step 3 the masked data is loaded into the Hadoop database.]
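The actual masking would be performed by the third-party utility shown in the diagram; as a generic, hypothetical illustration of the masking and tokenization operations described above (not the actual tool's API), a minimal Python sketch:

import hashlib

def mask_card_number(cc: str, keep_last: int = 4) -> str:
    """Replace all but the last `keep_last` digits with 'X' (illustrative only)."""
    digits = [c for c in cc if c.isdigit()]
    masked = ["X"] * (len(digits) - keep_last) + digits[-keep_last:]
    return "".join(masked)

def tokenize(value: str, salt: str = "demo-salt") -> str:
    """Deterministic token for join-friendly pseudonymisation (illustrative only)."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

print(mask_card_number("1234567890"))   # XXXXXX7890
print(tokenize("1234567890"))           # stable 16-char token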
Encryption at rest in Hadoop
For example, we encrypt the parent directory of the HDFS data, which is defined in "hdfs-site.xml" under the property "dfs.datanode.data.dir":

<property>
  <name>dfs.datanode.data.dir</name>
  <value>/hadoop/hdfs/data</value>
  <final>true</final>
</property>

SafeNet ProtectFile (agent) runs on each Data Node; KeySecure manages the encryption keys centrally.
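SafeNet ProtectFile and KeySecure are proprietary; as a generic stand-in illustrating file-level encryption at rest, a minimal sketch using the Python cryptography library, with a hypothetical file path and a locally generated key in place of the centrally managed keys:

from cryptography.fernet import Fernet

# In the proposed solution, keys are managed centrally in KeySecure;
# here a locally generated key stands in for a centrally managed one.
key = Fernet.generate_key()
f = Fernet(key)

# Hypothetical file under the HDFS data directory from hdfs-site.xml.
path = "/hadoop/hdfs/data/example.blk"
with open(path, "rb") as fh:
    ciphertext = f.encrypt(fh.read())
with open(path + ".enc", "wb") as fh:
    fh.write(ciphertext)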
Solution Details
As per the requirement to ingest structured and unstructured data into the Landing Zone, TCS proposes a Hadoop based solution for the landing zone. The solution details for the required use case are given below.
Source Scenario | Ingestion Approach
Database access is not available | The source needs to provide incremental data extracts in flat file format to an SFTP location, from where the files would be copied and loaded into HDFS of the Hadoop landing zone using an SFTP tool
Files (csv, txt) | The source needs to provide incremental data extracts in flat file format to an SFTP location, from where the files would be copied and loaded into HDFS of the Hadoop landing zone using an SFTP tool
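A minimal sketch of the SFTP pull and HDFS load described above; the host, credentials, file names and paths are hypothetical placeholders:

import subprocess
import paramiko

# Hypothetical SFTP endpoint and feed file; actual values would come from
# the source-system discussions.
with paramiko.SSHClient() as ssh:
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect("sftp.source.example", username="feed_user", password="***")
    sftp = ssh.open_sftp()
    sftp.get("/outbound/customer_20190417.csv", "/staging/customer_20190417.csv")

# Load the landed file into HDFS on the landing-zone cluster.
subprocess.run(
    ["hdfs", "dfs", "-put", "-f",
     "/staging/customer_20190417.csv", "/landing/raw/customer/"],
    check=True,
)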
Solution Details
Data Processing in Landing Zone
• Post loading of data into the landing zone, a custom framework would be developed for basic data validation and reconciliation of data with the source (a sketch follows below)
• Data masking and data encryption would be carried out using a 3rd party security tool deployed on the Application Server
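A minimal PySpark sketch of the kind of validation and reconciliation framework described above; the paths, column names and source count are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lz-validation").getOrCreate()

# Hypothetical feed path and column names.
df = spark.read.csv("/landing/raw/customer/", header=True)

# Basic validations: null key columns and unparseable dates are split out.
bad = df.filter(F.col("customer_id").isNull() |
                F.to_date("event_date", "yyyy-MM-dd").isNull())
good = df.exceptAll(bad)

# Reconciliation: compare the landed row count against the count reported
# by the source (e.g. in a control file shipped with the feed).
source_count = 1_000_000  # hypothetical; would be read from the control file
print("landed:", df.count(), "valid:", good.count(), "source:", source_count)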
Data Ingestion from Landing Zone to Cloud
For AWS Cloud, the data ingestion would be done using AWS Glue.
After the raw data is loaded into AWS S3 in raw format, it would be validated and cleansed as per the quality requirements and stored in the Processed S3 bucket by creating custom components, as sketched below.
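A minimal sketch of a Glue-style PySpark job for this step; the bucket names and quality rules are hypothetical placeholders, not the final design:

import sys
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue = GlueContext(SparkContext.getOrCreate())
spark = glue.spark_session

# Hypothetical bucket/prefix names for the raw and processed zones.
raw = spark.read.csv("s3://vil-raw-zone/customer/", header=True)

# Quality rules: drop duplicate keys and rows failing basic checks.
processed = (raw.dropDuplicates(["customer_id"])
                .filter(F.col("customer_id").isNotNull()))

processed.write.mode("overwrite").parquet("s3://vil-processed-zone/customer/")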
Data Curation
As per the requirement to build the ML models, the data would get aggregated and stored in S3/Redshift using AWS Glue and/or SageMaker, as sketched below.
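A minimal curation sketch, assuming hypothetical S3 paths, column names and features; writing to Redshift would additionally need a JDBC or Glue connection:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("curation").getOrCreate()

# Hypothetical processed-zone path.
usage = spark.read.parquet("s3://vil-processed-zone/usage/")

# Aggregate per-subscriber features for the ML models.
features = (usage.groupBy("customer_id")
                 .agg(F.sum("data_mb").alias("monthly_data_mb"),
                      F.countDistinct("cell_id").alias("distinct_cells")))

features.write.mode("overwrite").parquet("s3://vil-curated-zone/features/")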
Data Migration
• The data from the sources would get migrated to the AWS Cloud platform using AWS Snowball (see the sketch below)
• The data masking and encryption as per VIL’s requirement would be done on premise with the support of VIL.
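A hedged sketch of requesting a Snowball import job via boto3; the region, address ID, role ARN and bucket name are hypothetical placeholders that would come from VIL's AWS account setup:

import boto3

snowball = boto3.client("snowball", region_name="ap-south-1")  # region assumed

# All identifiers below are placeholders, not real resources.
job = snowball.create_job(
    JobType="IMPORT",
    Resources={"S3Resources": [{"BucketArn": "arn:aws:s3:::vil-datalake-raw"}]},
    Description="VIL one-time historical data migration",
    AddressId="ADID-placeholder",          # from a prior create_address() call
    RoleARN="arn:aws:iam::123456789012:role/snowball-import",
    SnowballCapacityPreference="T80",
    ShippingOption="SECOND_DAY",
)
print("Snowball job created:", job["JobId"])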
Infra Requirement – Landing Zone
Landing Zone – Indicative Sizing

Node Type | Node Count | Storage Size | OS Disk | Memory (GB) | CPU (cores) | Network | OS with Version
Cloudera Data Nodes | 3 | 5 x 6 TB SATA | 2 x 300 GB (RAID 1) | 128 | 2 x 8 | 10 Gbps | RHEL 7
• The sizing is based on daily data ingestion of 2 TB to be retained for 7 days. Older data would be purged as there is no archival requirement (a quick capacity check follows below)
• No data processing or data transformation is required in the Landing Zone. Basic validation and reconciliation would be done
• The infrastructure can be horizontally scaled by adding more compute nodes in case the load increases due to additional processing or other operations
• Dev and UAT will share a single cluster with separate areas for each, and user-based access will be provisioned for both environments
• HA for the Dev and UAT data platform environments is not considered; it can be provisioned if required
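A quick capacity check of the table above, assuming HDFS's default 3x replication (an assumption; the slide does not state the replication factor):

# Back-of-envelope check of the indicative sizing.
daily_tb, retention_days, replication = 2, 7, 3
needed_tb = daily_tb * retention_days * replication   # 42 TB after replication
raw_capacity_tb = 3 * 5 * 6                           # 3 nodes x 5 disks x 6 TB = 90 TB
print(f"needed: {needed_tb} TB, raw capacity: {raw_capacity_tb} TB")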
Infra Requirement – Security Application Server
Production Environment

Node Type | Node Count | Storage Size | OS Disk (SSD) | Memory (GB) | CPU (cores) | Network | Software
Application Server for Data Masking (Hypervisor Nodes) | 2 | 200 GB SATA | 2 x 50 GB | 32 | 4 | 1 NIC | Tomcat Webserver
Networking Solution
Cloud Connectivity Model | High level Architecture
[Diagram: the VIL DC connects through VIL CE routers over dual 1 Gbps VIL MPLS links (ISP PE to Cloud PE) to the Public Cloud]
Cloud Connectivity & Availability
Network Sizing & Availability
• Dual 1 Gbps Active-Active private MPLS links for high availability from the VIL DC to Cloud from MPLS service providers, considering 2 TB of data to be transferred in 5 hours
• These 2 links will provide high and steady throughput as well as act as redundancy for each other
• In case high availability is not required, a single 1 Gbps link is sufficient for transferring 2 TB of data in 5 hours (see the check below)
• Dual routers at the Vodafone DC for maintaining high availability in the case of 2 links
• SLAs of 99.90% can be obtained with managed links from service providers
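A quick arithmetic check that a single 1 Gbps link meets the 2 TB in 5 hours requirement (decimal units assumed):

# Required sustained throughput for 2 TB in 5 hours.
bytes_to_move = 2 * 10**12          # 2 TB, decimal units assumed
seconds = 5 * 3600
required_gbps = bytes_to_move * 8 / seconds / 10**9
print(f"required: {required_gbps:.2f} Gbps")  # ~0.89 Gbps, so 1 Gbps suffices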
ICC (Integrated Command Center) Managed Shared IT Infrastructure Services
In the Integrated Command Center (ICC) model, services are provided with shared resources and integrated tools to achieve higher service levels, greater responsiveness, agile expansion and reduced setup time.
• Pre-built and working platform – ready-to-go support
• Defined and time-tested ITIL framework processes
• Reduced time to roll out and integrate into operations support
• Integrated tool set (ticketing, monitoring, remote management, access management and automation)
• Pre-hosted environment for tools
• Shared resources – Service Desk, EUC and DC L1, L2, Service Management and Voice Quality Assurance
• Resources optimized by skill for L1, L2 and specific niche skills as required
• Flexibility to align with client requirements
• Privileged Identity Management (PIM) and automation as part of service delivery
• Triple ISO certified – ISO 9000, ISO 20000 and ISO 27001
Solution on GCP and Azure
Solution Architecture for the Use Case on GCP Cloud
[Architecture diagram: Source Data Layer → Landing Zone on VIL Premise → Data Lake on Cloud]
Solution Architecture for Use Case on Azure
[Architecture diagram: the Source Data Layer (EDW – DB2 and Netezza; RDBMS – Oracle and SQL; semi-structured XML/JSON; flat files – csv, txt) feeds the Landing Zone on VIL Premise, whose Security Layer comprises a dedicated application server to mask structured data, an agent installed on the Data Nodes for encryption, and appliance-based key encryption. Azure Data Factory performs batch data ingestion, delta processing and data migration into the Data Lake on Cloud, which has an Azure Storage Layer (Azure Blob storage with raw, processed and secured data layers, plus Azure SQL Data Warehouse), a Data Processing Layer (validations, reconciliation, purging, enrichment, transformation) and Advanced Analytics. The Platform Management Layer uses Azure Data Catalog, Azure KeyVault, Azure Active Directory, OMS Log Analytics and Application Insights. The Consumption Layer serves Adobe DMP / Adobe Digital Platform, Flytxt, the Call Center and analytical users/APIs via file sharing and file transfer. The application layer on VIL Premise is marked as not in current scope.]
Team Structure
Proposed Team Structure
Resources/Roles | Skills/Profiles | Number of Resources
Program Manager | Project kick-off, project scheduling, work assignment and tracking, status reporting, handling issues and escalations | 1
Cloud Architect | Responsible for the cloud solution based on the use case and future requirements | 1
Big Data Architect | Architecting the Big Data solution for the use case and future requirements | 1
Hadoop Tech Lead | Responsible for understanding the requirements and the Hadoop layer high level and detailed design, helping the developers for successful delivery | 1
Hadoop Developers | Responsible for developing the Hadoop platform for data ingestion, various frameworks for auditing, reconciliation, etc. | 4
Cloud Tech Lead | Responsible for understanding the requirements and the Cloud layer high level and detailed design, helping the developers for successful delivery | 1
Network Administrator | Network administration | For BAU support the options are: a. VIL network team; b. TCS resource from ICC (shared service model); c. 6 TCS resources for 24x7 support

The above resource counts are based on the assumptions stated in the Assumptions slide and would vary once the requirements are finalized.
Thank you!
Looking forward to working with you on your Data journey.