01 - IBM Data Lake Solutions & Technologies - Le Nhan Tam

IBM Approach for Building AI Data Lake
—
Solutions & Technologies
Le Nhan Tam, Ph.D.
CTO, Technical Sales Leader

IBM Vietnam
Data Lake – Growth
Data Lake Market – Growth Rate by region, 2019-2024
Study period: 2018-2019
Base year: 2018
Fastest growing market: Asia-Pacific
Largest market: North-America
CAGR 2019-2024: 27.4%
[Map legend: Regional Growth Rates: High, Medium, Low]
The data lakes market is expected to witness growth at a CAGR of 27.4% over the forecast period 2019-2024. Data lakes have become an economical option for many companies.
The cost of maintaining a data lake is lower owing to the number of operations and space involved in building the database for warehouses.
© 2019 IBM Corporation Source: https://www.mordorintelligence.com/industry-reports/data-lakes-market
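To put the quoted growth rate in perspective, the compounding arithmetic can be sketched as below. This is an illustration only: the function name `growth_multiple` is hypothetical, and treating the 2019-2024 forecast period as five compounding years is an assumption. The only figure taken from the slide is the 27.4% CAGR.

```python
def growth_multiple(cagr: float, years: int) -> float:
    """Return the total growth factor implied by a compound annual growth rate."""
    return (1.0 + cagr) ** years


# 27.4% CAGR is from the slide; five compounding years is an assumption.
multiple = growth_multiple(0.274, 5)
print(f"Implied market size multiple: {multiple:.2f}x")
```

Under these assumptions the market would end the period at roughly 3.4 times its starting size.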

What problem(s) are we trying to solve with a Data Lake?
Organizations typically have lots of data
• Difficult to find and access

• Difficult to integrate
• Meaning is not necessarily clear
• Accuracy may be questionable
• etc
Difficult to understand
And difficult to trust
Ideally, the data is well organized and can be found easily
The desire…
• Ideally, data from throughout the organization will be categorized, integrated and can be easily found, like a well-managed library.
In reality…
• In reality however, while there may be pockets of well-organized data, the overall corpus of data is frequently badly organized and badly understood. This leads to confusion and lack of trust in data, which in turn leads to lack of trust in the insights based on that data.
• This is the situation that an effective Data Lake is intended to avoid.
Big Data ➔ Bigger confusion
• As we collect data
• Can we preserve clarity?
• Do we know what we are collecting?
• Can we find the data we need?
• Are we creating a data swamp?
• How do we build trust in big data?
• Do we know what data is being used for?
The objective of an effective Data Lake
• Allow users to easily find the data they need
• ‘Shop for data’
• Collect and aggregate data from multiple sources
• Minimize the need for lengthy IT involvement
• Allow them to understand and trust the data they find
• Understand its meaning in a business context
• Understand its origin (where it came from and when)
• Assess its completeness
• Assess its accuracy
• Support many different types of data
• Multiple types and formats
• Structured and unstructured
• Internal and external sources
[Diagram label: Systems of Insight]
What is a Data Lake?
IBM’s view point on Data Lake
A Data Lake is…
A group of repositories
which provide self-service access
to trusted data
and which are governed, managed, protected and connected

by metadata
Traditional Data Lake = Hadoop
• Data Warehouse Offloading
• Hadoop First & New Data Type
[Diagram labels: Source Systems, New Sources, External Data; EL-T; CDC – Kafka – Spark Streaming; Landing Zone; Hadoop; Data Warehouse Offloading; SQL; Apps, Tools, Users, Data]
Evolving to address the challenges
AI Data Lake
Need for automation (DataOps and MLOps), focus on ML/AI workloads, hybrid cloud.
Design principle: Microservices, business outcome.
Cloud Object Storage + Spark (Cloud Data Lake)
Growth of cloud and AI experimentation led to the requirement for an elastic compute environment.
Design principle: Isolation of compute and storage.
Hadoop Data Lake (“Data Lake” from here)
Growing digital footprint and web-scale analytics demanded a low-cost, scalable infrastructure.
Design principle: Focus on ad-hoc analysis and batch processing for large-scale data.
Data Warehouse
Proliferation of data silos and the need for enterprise insight led to the data warehouse.
Design principle: Focus on KPIs, known metrics, business intelligence.
Major Architecture and Technology shifts underway
• Cloud Experience: Easy to use, self-service, on-demand, elastic consumption
• Compute and storage: Separation in public and private clouds for increased performance
• Kubernetes and containers: Adoption as standard operating environment for flexibility and agility
• Streaming and ML/AI: Multi-function analytics for the data-driven enterprise
IBM AI Data Lake
• An analytics sandbox for exploring data to gain insight.
• An enterprise-wide catalog to find data across the
enterprise and to link from business term to technical
metadata.
• An environment where users can access vast amounts of raw data at low cost.
• Tools and technologies for processing large volumes of data.
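The catalog bullet above, linking a business term to technical metadata, can be pictured with a small sketch. Everything here is hypothetical and for illustration only: the class names `TechnicalAsset` and `Catalog`, the field names, and the sample paths are assumptions, not part of any IBM product API.

```python
from dataclasses import dataclass, field


@dataclass
class TechnicalAsset:
    """Technical metadata for one physical data set (all fields hypothetical)."""
    repository: str
    path: str
    data_format: str


@dataclass
class Catalog:
    """Maps business terms to the technical assets that implement them."""
    _index: dict = field(default_factory=dict)

    def link(self, business_term: str, asset: TechnicalAsset) -> None:
        # Terms are stored lower-cased so lookups are case-insensitive.
        self._index.setdefault(business_term.lower(), []).append(asset)

    def find(self, business_term: str) -> list:
        return list(self._index.get(business_term.lower(), []))


catalog = Catalog()
catalog.link("Customer", TechnicalAsset("object-store", "raw/crm/customers.parquet", "parquet"))
catalog.link("Customer", TechnicalAsset("warehouse", "SALES.CUSTOMER_DIM", "table"))
for asset in catalog.find("customer"):
    print(asset.repository, asset.path)
```

A user searching for the business term gets back every repository that holds matching data, without needing to know table or file names in advance.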
AI Data Lake
• DataOps: Focus on Business Value
• Data as a Service: Focus on Publish/Subscribe Model
• ML/AI driven: Focus on data monetization
IBM’s Data Lake – designed for data access – with safeguards
Data Lake Services
Data Lake Repositories
Information Management and Governance Fabric
Data Lake (System of Insight)
IBM’s Data Lake = Efficient Management, Governance, Protection and Access.
Personas/roles supported by the Data Lake
Users supported by the Data Lake:
• Enterprise IT
• Analytics Teams (Data Scientists)
• Information Curator
• Governance, Risk and Compliance Team
• Line of Business Teams
• Data Lake Operations
[Diagram: the Data Lake comprises Data Lake Services, Data Lake Repositories, and the Information Management and Governance Fabric. Users shown: Data Scientists & App Developers, Data Stewards & Strategists, IT Security & Compliance Teams, Enterprise IT, LOB Users, Business Analysts, Data Lake Operations. Sources shown: Systems of Record, Systems of Engagement, New Sources, Systems of Automation, Other Data Lakes.]
The Data Lake sub-systems
The Data Lake subsystems:
• Enterprise IT Data Exchange
• Catalogue
• Self-Service Access for Analytics Teams (Data Scientists)
• Self-Service Access for Line of Business Teams
• Information Management and Governance Fabric
[Diagram: inside the Data Lake: Catalogue, Raw Data Interaction, View-Based Interaction, Enterprise IT Data Exchange, Data Lake Repositories, Information Management and Governance Fabric. Users shown: Data Scientists & App Developers, Data Stewards & Strategists, IT Security & Compliance Teams, Enterprise IT, LOB Users, Business Analysts, Data Lake Operations. Sources shown: Systems of Record, Systems of Engagement, New Sources, Systems of Automation, Other Data Lakes.]
Who benefits from a Data Lake?
[Personas shown, from Business to IT: Campaign Manager (LOB), Business Analyst, Data Scientist/Developer, Data Strategist, Data Steward, IT Security & Governance]
• LOB users, business analysts and data scientists can easily find the information they need without extensive IT involvement.
• Data strategists and data stewards can make information available to users in an organized and well-governed manner.
• IT security and governance teams can be assured that information is governed according to well-defined organizational and regulatory policies.
IBM Reference Architecture for AI Data Lake
Microservices-based architecture deployed on Red Hat OpenShift
[Diagram components]
• Data Sources: Machine and sensor data, Image and video, Enterprise content, Transaction and application data, Social data, Third-party data
• Streaming: Message Hub
• Enterprise Data Warehouse
• Data Integration & Transformation
• Spark cluster (ad-hoc query)
• Hadoop cluster (transformation)
• Object Store – Raw data + processed data (parquet file)
• Data Catalog + governance
• Data Virtualization
• SQL + RestAPI
• Data Science tools
• ML Model Deployment
• Dashboard/Reporting
• Data as a Service
• Legend: Maybe Existing
IBM Product Offerings for AI Data Lake
Microservices-based architecture deployed on Red Hat OpenShift
[Diagram: the reference architecture annotated with IBM products]
• Data Sources: Machine and sensor data, Image and video, Transaction and application data, Social data, Third-party data
• Products shown: CP4D-WKC; IBM Event Streams; CP4D-Data Virtualization; CP4D-Cognos Analytics Dashboard; CP4D-Watson Studio / AutoAI; CP4D-WML; CP4D-Warehouse; IBM BigIntegrate; CP4D-Analytics Engine; IBM Advanced Data Preparation; Object Store – IBM ESS
• Other components shown: Streaming, Message Hub, Enterprise Data Warehouse, Data Integration & Transformation, Spark cluster, Hadoop cluster, ML Model Deployment, Dashboard/Reporting, Data as a Service
• Legend: Maybe Existing
Note: CP4D = Cloud Pak for Data
IBM Reference Architecture for AI Data Lake
Value Propositions
1 Isolation of compute and storage provides seamless scalability and elasticity.
2 Data Governance integrated with the stack.
3 Multi-tenant Spark as a service allows experiment and production workloads to coexist without any conflict.
4 Easily onboard new tools and services. Focus on AI-driven applications.
5 Hybrid deployment architecture, supports multi-cloud deployment.
Legend: IBM Unique Value Proposition; Maybe Existing
[Diagram: the microservices-based reference architecture deployed on Red Hat OpenShift, with numbered callouts]
Data Catalog & Governance/Quality for AI Data Lake
Business-ready Data foundation
Data Sources: Machine and sensor data, Image and video, Enterprise content, Transaction and application data, Social data, Third-party data
Automated Metadata Curation Services
• Auto Discover Data
• Auto Classify Data
• Auto Detect Sensitive Data
• Auto Analyze Data Quality
• Auto Assign Business Terms
Automated Metadata Management
• Knowledge Catalog
Self-Services Interaction
• Search & Find Relevant Data
• Tagging, Annotations, Comments
• Workflow & Collaboration
• Self-Services Data Preparation
Automated Core Governance & Master Data Management Services
• Policy Management & Enforcement
• Consent Management
• Business Glossary Management
• Data Lineage
• Data Archival & Disposal
• Model Governance & Bias Reporting
• Entity Management & Resolution
• Data Quality Management
Machine Learning & Automation
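As a rough illustration of what "Auto Classify Data" and "Auto Detect Sensitive Data" could mean in practice, the sketch below labels a column when most of its sampled values match a pattern. It is a simplified stand-in, not how Watson Knowledge Catalog works: the patterns, the 0.8 threshold, and the function names `classify_column` and `classify_table` are all assumptions.

```python
import re

# Hypothetical detection rules; real catalogs use far richer classifiers.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?\d[\d\s-]{7,14}\d$"),
}


def classify_column(values, threshold=0.8):
    """Return the first label matching at least `threshold` of the non-empty values, else None."""
    samples = [v.strip() for v in values if v and v.strip()]
    if not samples:
        return None
    for label, pattern in SENSITIVE_PATTERNS.items():
        hits = sum(1 for v in samples if pattern.match(v))
        if hits / len(samples) >= threshold:
            return label
    return None


def classify_table(columns):
    """Classify every column of a table given as {column name: list of values}."""
    return {name: classify_column(vals) for name, vals in columns.items()}


table = {
    "contact": ["an@example.com", "binh@example.org", "chi@example.net"],
    "mobile": ["+84 912 345 678", "0903-555-123", "+1 415 555 0100"],
    "city": ["Hanoi", "Da Nang", "Hue"],
}
print(classify_table(table))
```

The resulting labels are the kind of metadata a catalog could then use to assign business terms or trigger protection policies.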

Data Virtualization
The ability to view, access, manipulate and analyze data without the need to know or understand its physical
format or location, and without having to move or copy it.
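The definition above can be sketched in a few lines: callers ask for a logical table, and the layer streams rows from whatever physical sources back it. This is a toy model under stated assumptions. The `VirtualLayer` class, the reader functions, and the in-memory CSV and JSON strings standing in for remote sources are all hypothetical.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterable, Iterator

Row = Dict[str, str]

# Stand-ins for two physical sources in different formats (hypothetical data).
CSV_SOURCE = "id,name\n1,Lan\n2,Minh\n"
JSON_SOURCE = '[{"id": "3", "name": "Hoa"}]'


def read_csv_source() -> Iterator[Row]:
    yield from csv.DictReader(io.StringIO(CSV_SOURCE))


def read_json_source() -> Iterator[Row]:
    yield from json.loads(JSON_SOURCE)


class VirtualLayer:
    """Exposes logical table names; callers never see format or location."""

    def __init__(self) -> None:
        self._tables: Dict[str, list] = {}

    def register(self, table: str, reader: Callable[[], Iterable[Row]]) -> None:
        self._tables.setdefault(table, []).append(reader)

    def scan(self, table: str) -> Iterator[Row]:
        # Rows are streamed from each source on demand, not copied into a new store.
        for reader in self._tables[table]:
            yield from reader()


layer = VirtualLayer()
layer.register("customers", read_csv_source)
layer.register("customers", read_json_source)
names = [row["name"] for row in layer.scan("customers")]
print(names)
```

The caller only names the logical table `customers`; whether a row came from a flat file or a JSON document is hidden behind the reader functions.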
[Diagram components]
• Data Sources: Machine and sensor data, Image and video, Enterprise content, Transaction and application data, Social data, Third-party data
• Data Lake Repositories: Flat Files, NoSQL, Object Store, Relational, Hadoop, …
• Data Virtualization Services: Data Integration, Data Movement, Data Replication, DV Engine, Data Access (SQL, APIs, NLQ)
• Policy Enforcement: Mask / De-Identify, Deny access
• Optional: Policy Enforcement via Ranger / Guardium
• Business Applications: AI, ML & Optimization; Compliance Reporting; Discovery & Exploration; Self-Services Analytics; BI Reporting, Dashboard
• Watson Knowledge Catalog: Policy Engine; Knowledge Catalog (Data Assets Details, Policies & Rules, Lineage, ….)
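The three enforcement outcomes named on this slide (mask, de-identify, deny access) can be illustrated with a minimal policy lookup. This is a sketch under assumptions, not the behaviour of Watson Knowledge Catalog, Ranger or Guardium: the `POLICIES` table, the role names, and the `enforce` function are hypothetical.

```python
import hashlib

# Hypothetical policy table: (role, column classification) -> action.
POLICIES = {
    ("analyst", "email"): "mask",
    ("analyst", "national_id"): "deny",
    ("data_scientist", "email"): "de-identify",
}


def enforce(role, classification, value):
    """Apply the policy for this role and classification; default is to allow."""
    action = POLICIES.get((role, classification), "allow")
    if action == "deny":
        raise PermissionError(f"{role} may not read {classification}")
    if action == "mask":
        return "*" * len(value)
    if action == "de-identify":
        # Stable pseudonym: the same input always yields the same token.
        # An unsalted hash is used for brevity only; it is not strong protection.
        return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]
    return value


print(enforce("analyst", "email", "an@example.com"))
print(enforce("data_scientist", "email", "an@example.com"))
```

Keeping the rules in one table, separate from the data access path, mirrors the slide's split between a policy engine and the points where policies are enforced.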

Data Analytics & Collaboration Tools
Real Time Marketing (RTM) Use Cases
• Quality of Service analysis
• Internet attacks
• Customer Lifestyle profiling
• Predicting inbound calls
• Detecting preferred channel
• Real time speech to text and chatbots
Build and train at scale (Model Development)
• Authoring Tools: AutoAI, ML/DL, Python/R Notebooks, IBM Adv Data Preparation, Decision Optimization
• Machine Learning Runtimes, Deep Learning Runtimes
• Scalable & modern infrastructure: Hadoop Execution Engine; On x-86, Power + GPUs
Embed ML in your business (Model Deployment & Operational)
• Operationalize: Models, Functions, Data Pipelines
• Deployment methods: Batch, Edge
• Management & Monitoring: Version Control, Lineage, Automation
Governed data lake
“AI Data Lake” – Think Big, Start Small …
Existing Hadoop Infrastructure
• Upgrade existing Hortonworks/Cloudera to CDP
• Augment with Cloud Pak for Data - DataOps/MLOps
Lite Data Lake (Cloud Object Storage + Spark)
• Lead with minimal Cloud Pak for Data & Object store
New Data Lake Initiatives and Data Warehouse replacements
• Look from use cases: Customer behavior analytics, Data warehouse modernization, Audit and compliance analytics …
• Lead with Cloud Pak for Data & Object store

Example Deployment of Data Lake in Hybrid Cloud
1. Raw data is stored on Object Storage.
2. Data is reduced, enhanced or refined with Spark/BigSQL.
3. Data analysis occurs in Watson Studio.
4. The end-user accesses a web application.
5. Refined data is pulled from Object Storage.
6. Charts are built using IBM Cognos Dashboard.
7. Data governance using Watson Knowledge Catalog.
[Diagram labels: On-Premise, Off-Premise; Object store, spark, BigSQL, WSL, WML, WKC, BI/Dashboard, AI, Customer Apps]
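The flow in steps 1, 2 and 5 above (land raw data, refine it, then serve only the refined result) can be sketched without any cluster. This is a plain-Python stand-in under stated assumptions: the dictionary `object_store` replaces a real object storage bucket, the `refine` function replaces the Spark/BigSQL job, and the keys and sample records are hypothetical.

```python
import json
from collections import defaultdict

object_store = {}  # in-memory stand-in for an object storage bucket (assumption)


def put(key, records):
    object_store[key] = json.dumps(records)


def get(key):
    return json.loads(object_store[key])


# Step 1: raw data lands in object storage.
put("raw/events.json", [
    {"customer": "A", "amount": 120.0},
    {"customer": "B", "amount": None},  # incomplete record
    {"customer": "A", "amount": 30.0},
])


# Step 2: reduce and refine (the deck uses Spark/BigSQL; plain Python here).
def refine(raw_key, refined_key):
    totals = defaultdict(float)
    for rec in get(raw_key):
        if rec["amount"] is not None:
            totals[rec["customer"]] += rec["amount"]
    put(refined_key, [{"customer": c, "total": t} for c, t in sorted(totals.items())])


refine("raw/events.json", "refined/customer_totals.json")

# Step 5: the application pulls only the refined data.
refined = get("refined/customer_totals.json")
print(refined)
```

Keeping raw and refined data under separate prefixes in the same store lets the web application and dashboards read a small, clean data set while the raw records remain available for later reprocessing.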

Thank you!

Cloud Paks – Enterprise-ready containerized software
A faster, more secure way to move your core business applications to any cloud through enterprise-ready containerized software solutions
• Complete yet simple: Application, data and AI services, fully modular and easy to consume
• IBM certified: Full software stack support, and ongoing security, compliance and version compatibility
• Run anywhere: On-premises, on private and public clouds, and in pre-integrated systems
[Diagram layers: IBM containerized software (packaged with Open Source components, pre-integrated with the common operational services, and secure by design); Container platform and operational services (Logging, monitoring, security, identity access management); IBM Cloud, Edge, Private, Systems]
Cloud Pak for Data
Delivers the foundational platform for deploying an information architecture for AI, on any cloud
• Eliminate data silos, connect all data
• Automate and govern the data & AI lifecycle
• Operationalize AI with trust & transparency
• Avoid lock-in, run anywhere with agility
Cloud Pak for Data: A set of unified, pre-integrated data and AI services delivered within an open and extensive cloud native platform
[Diagram layers: Collect Data, Organize Data, Analyze Data, Infuse AI; Cloud-native container platform & operational services (Logging, monitoring, security, identity access management); IBM Cloud, Private Cloud, Hyperconverged System]
Delivering Insight Applications
[Diagram labels: Continuous Delivery of Applications (Application Workloads): New Requirements, Development, Test, Deployment, App Monitoring & Engagement. Continuous Delivery of Insights (Data & AI Workloads): Search for Data, Acquiring Data / Self Service, Refining Data, Model building, Insight Deployment (AI models, dashboards, etc.), Retraining. Virtualized Data Access: Data Warehouse, Hadoop, OLTP. Platform: Cloud Pak for Data.]

Watson Studio and Watson Machine Learning inject AI firepower into your business
Build and train at scale (Watson Studio)
• Authoring Tools: AutoAI, ML/DL, Python/R Notebooks, Decision Optimization
• Machine Learning Runtimes, Deep Learning Runtimes
• Scalable & modern infrastructure: Hadoop Execution Engine; On x-86, Power + GPUs
Embed ML in your business (Watson Machine Learning)
• Operationalize: Models, Functions, Data Pipelines
• Deployment methods: Batch, Edge
• Management & Monitoring: Version Control, Lineage, Automation

Mix and Match your deployment
✓ Cloud – IBM Cloud, Azure, AWS
✓ On Premise / Private Data center
✓ Desktop
IBM Watson Studio
Enterprise Data Science platform that helps your team work together to build models to make better data driven decisions for your business
Analyze any data, no matter where it lives
Connect to and analyze your data without moving a single byte through dozens of connectors and multiple deployment options
Empower your entire organization with notebooks, visual productivity, and automation tools
Leverage your entire organization with a variety of tools in a single integrated platform
One platform to rule them all from discovery to production
Analyze data, build predictive models, and seamlessly integrate Watson Machine Learning to deploy at scale
IBM Watson Machine Learning
Embed Machine Learning and Deep Learning in your Business
Deploy and Manage Models
Move models to production in an easy, secure, and compliant way
Intelligent Model Operations
Embed intelligent training services, with feedback loops that constantly learn from new data, regardless of where it resides
Accelerate Compute Intensive Workloads
Distribute your deep learning training and Hadoop/Spark workloads with multi-tenant job scheduling
IBM Elastic Storage Server (ESS)
Integrated scale-out data management for file and object data
Optimal building block for high-performance, scalable, reliable enterprise Spectrum Scale storage
• Faster data access with choice to scale-up or scale-out
• Easy to deploy clusters with unified system GUI
• Simplified storage administration with IBM Spectrum Control integration
One solution for all your Spectrum Scale data needs
• Single repository of data with unified file and object support
• Anywhere access with multi-protocol support using protocol nodes - NFS 4.0, SMB, Object
• Ideal for Big Data analytics including full Hadoop transparency
Ready for business-critical data
• Disaster recovery with synchronous or asynchronous replication
• Ensure reliability and fast rebuild times using Spectrum Scale RAID’s dispersed data and erasure code
• Five 9s (99.999%) of availability
[Images: Entry ESS 3000; ESS 3000 cluster; Elastic Storage Server cluster]
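For readers who want to translate the "five 9s" availability figure on this slide into downtime, the arithmetic is sketched below. The function name `max_downtime_minutes` is hypothetical and the 365-day year is an assumption; only the 99.999% figure comes from the slide.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def max_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


print(f"{max_downtime_minutes(99.999):.2f} minutes per year")
```

Under these assumptions, 99.999% availability allows a little over five minutes of downtime per year.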
IBM Elastic Storage Server (ESS) solution packaging
Integrated solution
• Spectrum Scale is integrated, tested, and factory preloaded
• Leverage the latest IBM Spectrum Scale releases
• Data Management Edition and Data Access Edition
Optimal storage capacity & economy
• ESS has various models providing SAS, NL-SAS, SSD, or NVMe storage
• Choose from various sizes of HDD, SSD, and NVMe
Non-disruptive upgrades
• Capacity upgrades can be performed without application disruption
• Software automatically rebalances data across all drives
High performance connectivity
• Optional integrated InfiniBand or 100Gb Mellanox Ethernet switch
• Provides a lower cost high performance network interconnect
• Rack-mountable solution
