
IBM Approach for Building AI Data Lake


Solutions & Technologies

Le Nhan Tam, Ph.D.

CTO, Technical Sales Leader


IBM Vietnam
Data Lake – Growth Data Lake Market

Data Lake Market – Growth Rate


by region 2019-2024
Study period: 2018-2019

Base year: 2018

Fastest growing
market: Asia-Pacific

Largest market: North-America


CAGR 2019-2024: 27.4%

Regional Growth Rates (map legend): High, Medium, Low

The data lakes market is expected to witness growth at a CAGR of 27.4% over the forecast period 2019-2024. Data lakes have become an economical option for many companies.

The cost of maintaining a data lake is lower owing to the number of operations and space involved in building the database for warehouses.
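
To put the quoted rate in perspective, the arithmetic behind a compound annual growth rate can be sketched as follows. The function name and the choice of five compounding years (2019 to 2024) are illustrative assumptions; the source gives only the 27.4% rate, not absolute market sizes.

```python
def growth_multiple(cagr: float, years: int) -> float:
    """Return the total growth factor implied by a compound annual growth rate."""
    return (1.0 + cagr) ** years


# Assumption: 2019 -> 2024 is treated as five compounding periods.
multiple = growth_multiple(0.274, 5)
print(f"Implied growth over the period: about {multiple:.2f}x")
```

Under that assumption, a market growing at 27.4% per year would end the period at roughly 3.4 times its starting size.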

© 2019 IBM Corporation Source: https://www.mordorintelligence.com/industry-reports/data-lakes-market


What problem(s) are we trying to solve with a Data Lake?
Organizations typically have lots of data

• Difficult to find and access


• Difficult to integrate
• Meaning is not necessarily clear
• Accuracy may be questionable
• etc

Difficult to understand
And difficult to trust

What problem(s) are we trying to solve with a Data Lake?
Ideally, the data is well organized and can be found easily

The desire…
• Ideally, data from throughout the organization will be categorized, integrated and can be easily found, like a well-managed library.

In reality…
• In reality, however, while there may be pockets of well-organized data, the overall corpus of data is frequently badly organized and badly understood. This leads to confusion and lack of trust in data, which in turn leads to lack of trust in the insights based on that data.
• This is the situation that an effective Data Lake is intended to avoid.

Big Data ➔ Bigger confusion

What problem(s) are we trying to solve with a Data Lake?
The objective of an effective Data Lake
• Allow users to easily find the data they need
  • ‘Shop for data’
  • Collect and aggregate data from multiple sources
  • Minimize the need for lengthy IT involvement
• Allow them to understand and trust the data they find
  • Understand its meaning in a business context
  • Understand its origin (where it came from and when)
  • Assess its completeness
  • Assess its accuracy
• Support many different types of data
  • Multiple types and formats
  • Structured and unstructured
  • Internal and external sources

Questions to ask:
• As we collect data
  • Can we preserve clarity?
  • Do we know what we are collecting?
  • Can we find the data we need?
• Are we creating a data swamp?
• How do we build trust in big data?
• Do we know what data is being used for?

Systems of Insight
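
The objectives on this slide (meaning, origin, completeness, findability) map naturally onto the fields of a catalog entry. The sketch below is a hypothetical record layout, not the schema of any IBM product; every class, field and function name is an assumption made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CatalogEntry:
    """Hypothetical metadata record describing one data set in the lake."""
    name: str
    business_meaning: str   # meaning in a business context
    source_system: str      # origin: where the data came from
    loaded_on: date         # origin: when it arrived
    expected_rows: int
    actual_rows: int

    def completeness(self) -> float:
        """Fraction of expected rows actually present, capped at 1.0."""
        if self.expected_rows <= 0:
            return 0.0
        return min(self.actual_rows / self.expected_rows, 1.0)


def find(entries: list, keyword: str) -> list:
    """'Shop for data': match a keyword against names and business meanings."""
    kw = keyword.lower()
    return [e for e in entries
            if kw in e.name.lower() or kw in e.business_meaning.lower()]


entries = [
    CatalogEntry("crm_customers", "Active retail customers", "CRM",
                 date(2019, 6, 1), 1000, 950),
    CatalogEntry("web_clicks", "Raw clickstream events", "Web logs",
                 date(2019, 6, 2), 5000, 5000),
]
for entry in find(entries, "customer"):
    print(entry.name, entry.source_system, f"{entry.completeness():.0%}")
```

Keeping origin and completeness next to the business description is what lets a user judge whether to trust a data set without asking IT.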

What is a Data Lake?
IBM’s viewpoint on Data Lake

A Data Lake is…

A group of repositories

which provide self-service access

to trusted data

and which are governed, managed, protected and connected


by metadata


Traditional Data Lake = Hadoop

• Data Warehouse Offloading


• Hadoop First & New Data Type

[Diagram labels: Source Systems; New Sources; External Data; CDC – Kafka – Spark Streaming; Landing Zone; Hadoop; EL-T; SQL; Offloading Data; Data Warehouse; SQL Tools; Apps; Users]

Evolving to address the challenges

AI Data Lake
Need for automation (DataOps and ML/Ops), focus on ML/AI workload, hybrid cloud
Design Principle: Microservices, Business Outcome.

Cloud Object Storage + Spark (Cloud Data Lake)


Growth of cloud and AI experimentation led to a requirement for an elastic compute environment.
Design principle: isolation of compute and storage.
(Callout: “Data Lake” from here)

Hadoop Data Lake


Growing digital footprint and web-scale analytics demanded a low-cost, scalable infrastructure.
Design principle: Focus on ad-hoc analysis and batch processing for large scale data.

Data Warehouse
Proliferation of data silos and the need for enterprise insight led to the data warehouse.
Design principle: Focus on KPIs, known metrics, business intelligence.

Major Architecture and Technology shifts underway

• Cloud Experience: easy to use, self-service, on-demand, elastic consumption
• Compute and storage: separation in public and private clouds for increased performance
• Kubernetes and containers: adoption as standard operating environment for flexibility and agility
• Streaming and ML/AI: multi-function analytics for the data-driven enterprise

IBM AI Data Lake
• An analytics sandbox for exploring data to gain insight.
• An enterprise-wide catalog to find data across the enterprise and to link from business term to technical metadata.
• An environment where users can access vast amounts of raw data at low cost.
• Tools and technologies for processing large volumes of data.

AI Data Lake

• DataOps: focus on business value
• Data as a Service: focus on publish/subscribe model
• ML/AI driven: focus on data monetization
What is a Data Lake?
IBM’s Data Lake – designed for data access – with safeguards

Data Lake Services

Data Lake Repositories

Information Management and Governance Fabric

Data Lake (System of Insight)

IBM’s Data Lake = Efficient Management, Governance, Protection and Access.

What is a Data Lake?
Personas/roles supported by the Data Lake

Users supported by the Data Lake:
• Enterprise IT
• Analytics Teams (Data Scientists)
• Information Curator
• Governance, Risk and Compliance Team
• Line of Business Teams
• Data Lake Operations

[Diagram labels. Personas: Data Scientists & App Developers; Data Stewards & Strategists; IT Security & Compliance Teams; LOB Users; Business Analysts; Data Lake Operations. Sources via Enterprise IT: Systems of Record; Systems of Engagement; New Sources; Systems of Automation; Other Data Lakes. Layers: Data Lake Services; Data Lake Repositories; Information Management and Governance Fabric; Data Lake (System of Insight)]

What is a Data Lake?
The Data Lake sub-systems

The Data Lake subsystems:
• Enterprise IT Data Exchange
• Catalogue
• Self-Service Access for Analytics Teams (Data Scientists)
• Self-Service Access for Line of Business Teams
• Information Management and Governance Fabric

[Diagram labels. Personas: Data Scientists & App Developers; Data Stewards & Strategists; IT Security & Compliance Teams; LOB Users; Business Analysts; Data Lake Operations. Sources via Enterprise IT: Systems of Record; Systems of Engagement; New Sources; Systems of Automation; Other Data Lakes. Subsystems around the Data Lake Repositories: Enterprise IT Data Exchange; Catalogue; Raw Data Interaction; View-Based Interaction; Information Management and Governance Fabric; Data Lake (System of Insight)]

What is a Data Lake?
Who benefits from a Data Lake?

Roles, from Business to IT: Campaign Manager (LOB); Business Analyst; Data Scientist/Developer; Data Strategist; Data Steward; IT Security & Governance

• LOB users, business analysts and data scientists can easily find the information they need without extensive IT involvement.
• Data strategists and data stewards can make information available to users in an organized and well-governed manner.
• IT security and governance teams can be assured that information is governed according to well-defined organizational and regulatory policies.

IBM Reference Architecture for AI Data Lake
Micro Services based architecture deployed on Red Hat OpenShift

Data Sources: Machine and sensor data; Image and video; Enterprise content; Transaction and application data; Social data; Third-party data

Architecture components:
• Data Catalog + governance
• Streaming: Message Hub
• Data Integration & Transformation
• Spark cluster (ad-hoc query)
• Hadoop cluster (transformation)
• Object Store – Raw data + processed data (parquet file)
• Data Virtualization
• Enterprise Data Warehouse
• SQL + RestAPI
• Data Science tools
• ML Model Deployment
• Dashboard/Reporting
• Data as a Service

Legend: Maybe Existing



IBM Product Offerings for AI Data Lake
Micro Services based architecture deployed on Red Hat OpenShift

Data Sources: Machine and sensor data; Image and video; Enterprise content; Transaction and application data; Social data; Third-party data

Architecture components and the products shown next to them:
• Data Catalog + governance: CP4D-WKC
• Streaming (Message Hub): IBM Event Streams
• Data Integration & Transformation: IBM BigIntegrate
• Spark cluster (ad-hoc query): CP4D-Analytics Engine
• Hadoop cluster (transformation): IBM Advanced Data Preparation
• Object Store – IBM ESS: Raw data + processed data (parquet file)
• Data Virtualization: CP4D-Data Virtualization
• Enterprise Data Warehouse: CP4D-Warehouse
• SQL + RestAPI
• Data Science tools: CP4D-Watson Studio / AutoAI
• ML Model Deployment: CP4D-WML
• Dashboard/Reporting: CP4D-Cognos Analytics Dashboard
• Data as a Service

Legend: Maybe Existing



Note: CP4D = Cloud Pak for Data
IBM Reference Architecture for AI Data Lake
Value Propositions
[Diagram: the reference architecture (data sources, Data Catalog + governance, Streaming, Spark cluster, Hadoop cluster, Object Store, Data Virtualization, Enterprise Data Warehouse, SQL + RestAPI, Data Science tools, ML Model Deployment, Dashboard/Reporting, Data as a Service), deployed as micro services on Red Hat OpenShift]

1. Isolation of compute and storage provides seamless scalability and elasticity.
2. Data Governance integrated with the stack.
3. Multi-tenant Spark as a service allows experiment and production workloads to coexist without any conflict.
4. Easily onboard new tools and services. Focus on AI driven applications.

IBM Unique Value Proposition

5. Hybrid deployment architecture, supporting multi-cloud deployment.
Data Catalog & Governance/Quality for AI Data Lake
Data Sources: Machine and sensor data; Image and video; Enterprise content; Transaction and application data; Social data; Third-party data

Automated Metadata Curation Services:
• Auto Discover Data
• Auto Classify Data
• Auto Detect Sensitive Data
• Auto Analyze Data Quality
• Auto Assign Business Terms

Automated Metadata Management:
• Knowledge Catalog

Self-Services Interaction:
• Search & Find Relevant Data
• Tagging, Annotations, Comments
• Workflow & Collaboration
• Self-Services Data Preparation

Automated Core Governance & Master Data Management Services:
• Policy Management & Enforcement
• Consent Management
• Business Glossary Management
• Data Lineage
• Data Archival & Disposal
• Model Governance & Bias Reporting
• Entity Management & Resolution
• Data Quality Management

Business-ready Data foundation
Machine Learning & Automation
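
As a rough illustration of what "Auto Classify Data" and "Auto Detect Sensitive Data" could mean in practice, the sketch below labels columns by matching sample values against patterns. It is a deliberately simple stand-in, not how Watson Knowledge Catalog works; the patterns, the 80% threshold and all function names are assumptions.

```python
import re
from typing import Optional

# Hypothetical patterns; a production classifier would use far richer rules or ML models.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?\d[\d\s-]{7,14}$"),
}


def classify_column(values: list, threshold: float = 0.8) -> Optional[str]:
    """Return the first label whose pattern matches at least `threshold` of the non-empty values."""
    cleaned = [v.strip() for v in values if v and v.strip()]
    if not cleaned:
        return None
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in cleaned if pattern.match(v))
        if hits / len(cleaned) >= threshold:
            return label
    return None


def detect_sensitive(table: dict) -> dict:
    """Map each column name to a sensitive-data label, skipping unclassified columns."""
    found = {}
    for column, values in table.items():
        label = classify_column(values)
        if label is not None:
            found[column] = label
    return found


sample = {
    "contact": ["an@example.com", "binh@example.org", "chi@example.net"],
    "msisdn": ["+84 912 345 678", "0912-345-678", "+84912345678"],
    "city": ["Hanoi", "Da Nang", "Hue"],
}
print(detect_sensitive(sample))
```

The resulting labels are the kind of metadata that policy enforcement and business-term assignment can then build on.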


Data Virtualization
The ability to view, access, manipulate and analyze data without the need to know or understand its physical
format or location, and without having to move or copy it.
Data Sources: Machine and sensor data; Image and video; Enterprise content; Transaction and application data; Social data; Third-party data

Data Lake Repositories: Flat Files; NoSQL; Object Store; Relational; Hadoop

Data Virtualization Services:
• Data Integration, Data Movement, Data Replication
• DV Engine
• Data Access (SQL, APIs, NLQ)
• Policy Enforcement: Mask / De-Identify, Deny access
• Optional: Policy Enforcement via Ranger / Guardium

Business Applications:
• AI, ML & Optimization
• Compliance Reporting
• Discovery & Exploration
• Self-Services Analytics
• BI Reporting, Dashboard

Watson Knowledge Catalog:
• Policy Engine
• Knowledge Catalog
  • Data Assets Details
  • Policies & Rules
  • Lineage
  • ….
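
The "Mask / De-Identify" and "Deny access" policy actions on this slide can be sketched as a small enforcement function sitting between the consumer and the data. This is a minimal illustration under stated assumptions: the policy table, role names, column names and functions are all hypothetical and do not reflect the Ranger, Guardium or Watson Knowledge Catalog APIs.

```python
# Hypothetical policy table: role -> column -> action.
# "allow" returns the value, "mask" redacts it, "deny" refuses the request.
POLICIES = {
    "analyst": {"customer_id": "allow", "phone": "mask", "salary": "deny"},
    "steward": {"customer_id": "allow", "phone": "allow", "salary": "mask"},
}


def mask(value: str, visible: int = 2) -> str:
    """Keep the last `visible` characters and replace the rest with '*'."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def enforce(role: str, row: dict, columns: list) -> dict:
    """Apply the role's policy to the requested columns of one row.

    Unknown roles and columns without an explicit rule are denied by default.
    """
    rules = POLICIES.get(role, {})
    result = {}
    for column in columns:
        action = rules.get(column, "deny")
        if action == "deny":
            raise PermissionError(f"{role} may not read {column}")
        value = row[column]
        result[column] = mask(value) if action == "mask" else value
    return result


row = {"customer_id": "C-1001", "phone": "0912345678", "salary": "2500"}
print(enforce("analyst", row, ["customer_id", "phone"]))
```

Because the check runs at access time, the same underlying row can serve different roles without being copied or pre-redacted, which is the point of enforcing policy in the virtualization layer.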


Data Analytics & Collaboration Tools

Use cases: Quality of Service analysis; Internet attacks; Customer Lifestyle profiling; Predicting inbound calls; Detecting preferred channel; Real time speech to text and chatbots; Real Time Marketing (RTM)

Governed data lake

Build and train at scale (Model Development)
• Authoring Tools: AutoAI; ML/DL Pipelines; Python/R Notebooks; IBM Adv Data Preparation
• Machine Learning Runtimes; Deep Learning Runtimes
• Scalable & modern infrastructure: Hadoop Execution Engine; On x-86, Power + GPUs; On-prem

Embed ML in your business (Model Deployment & Operational)
• Operationalize (labels as shown): Decision Optimization; Data; Models; Functions
• Deployment methods: Batch; Edge
• Management & Monitoring: Version Control; Lineage; Automation
“AI Data Lake” – Think Big, Start Small …

Existing Hadoop Infrastructure
• Upgrade existing Hortonworks/Cloudera to CDP
• Augment with Cloud Pak for Data - DataOps/MLOps

Lite Data Lake (Cloud Object Storage + Spark)
• Lead with minimal Cloud Pak for Data & Object store

New Data Lake Initiatives and Data Warehouse replacements
• Look from use cases: Customer behavior analytics, Data warehouse modernization, Audit and compliance analytics …
• Lead with Cloud Pak for Data & Object store


Example Deployment of Data Lake in Hybrid Cloud

1. Raw data is stored on Object Storage.
2. Data is reduced, enhanced or refined with Spark/BigSQL.
3. Data analysis occurs in Watson Studio.
4. The end-user accesses a web application.
5. Refined data is pulled from Object Storage.
6. Charts are built using IBM Cognos Dashboard.
7. Data governance using Watson Knowledge Catalog.

[Diagram labels: Customer; AI Apps; WML; WSL; Spark; Object store; BigSQL; BI/Dashboard; WKC; On-Premise; Off-Premise]
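
The raw-to-refined flow in steps 1, 2, 5 and 6 can be sketched in a few lines. This is only a toy model: a Python dict stands in for the object store, plain functions stand in for Spark/BigSQL and the dashboard, and every name and key prefix is an assumption rather than part of the deployment described above.

```python
import json

# Stand-in for an object store: keys are object paths, values are bytes.
# A real deployment would use an S3-compatible client instead of a dict.
object_store = {}


def put_raw(name: str, records: list) -> str:
    """Step 1: land raw records unchanged under the raw/ prefix."""
    key = f"raw/{name}.json"
    object_store[key] = json.dumps(records).encode("utf-8")
    return key


def refine(raw_key: str) -> str:
    """Step 2: drop incomplete records and keep only the fields needed downstream."""
    records = json.loads(object_store[raw_key].decode("utf-8"))
    refined = [
        {"customer": r["customer"], "amount": float(r["amount"])}
        for r in records
        if r.get("customer") and r.get("amount") is not None
    ]
    key = raw_key.replace("raw/", "refined/", 1)
    object_store[key] = json.dumps(refined).encode("utf-8")
    return key


def totals_for_dashboard(refined_key: str) -> dict:
    """Steps 5-6: pull refined data and aggregate it for a chart."""
    totals = {}
    for r in json.loads(object_store[refined_key].decode("utf-8")):
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals


raw_key = put_raw("orders", [
    {"customer": "A", "amount": "10.5"},
    {"customer": "A", "amount": "4.5"},
    {"customer": None, "amount": "3"},
    {"customer": "B", "amount": "7"},
])
refined_key = refine(raw_key)
print(totals_for_dashboard(refined_key))
```

Keeping the raw object untouched and writing the refined copy under a separate prefix means the refinement step can be rerun or changed later without re-ingesting from the source systems.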


Thank you!



Cloud Paks – Enterprise-ready containerized software
A faster, more secure way to move your core business applications to any cloud
through enterprise-ready containerized software solutions

IBM containerized software
Packaged with Open Source components, pre-integrated with the common operational services, and secure by design

Container platform and operational services
Logging, monitoring, security, identity access management

Complete yet simple
Application, data and AI services, fully modular and easy to consume

IBM certified
Full software stack support, and ongoing security, compliance and version compatibility

Run anywhere
On-premises, on private and public clouds, and in pre-integrated systems

[Diagram labels: IBM Cloud; Edge; Private; Systems]
Cloud Pak for Data
Delivers the foundational platform for deploying an information architecture for AI, on
any cloud

• Eliminate data silos, connect all data
• Automate and govern the data & AI lifecycle
• Operationalize AI with trust & transparency
• Avoid lock-in, run anywhere with agility

Cloud Pak for Data
A set of unified, pre-integrated data and AI services delivered within an open and extensive cloud native platform

Collect Data | Organize Data | Analyze Data | Infuse AI

Cloud-native container platform & operational services
Logging, monitoring, security, identity access management

[Diagram labels: IBM Cloud; Private Cloud; Hyperconverged System]
Delivering Insight Applications

[Diagram: two linked delivery loops running on Cloud Pak for Data.
Continuous Delivery of Applications (Application Workloads): New Requirements; Development; Test; App Deployment; Monitoring & Engagement.
Continuous Delivery of Insights (Data & AI Workloads): Search for Data; Acquiring Data / Self Service; Refining Data; Model building; Insight Deployment (AI models, dashboards, etc.); Retraining.
Data reached through Virtualized Data Access: Data Warehouse; Hadoop; OLTP]


Watson Studio and Watson Machine Learning inject AI
firepower into your business
Watson Studio: Build and train at scale
• Authoring Tools: AutoAI; ML/DL Pipelines; Python/R Notebooks
• Machine Learning Runtimes; Deep Learning Runtimes
• Scalable & modern infrastructure: Hadoop Execution Engine; On x-86, Power + GPUs; On-prem

Watson Machine Learning: Embed ML in your business
• Operationalize (labels as shown): Decision Optimization; Data; Models; Functions
• Deployment methods: Batch; Edge
• Management & Monitoring: Version Control; Lineage; Automation

Mix and Match your deployment
✓ Cloud – IBM Cloud, Azure, AWS
✓ On Premise / Private Data center
✓ Desktop
IBM Watson Studio
Enterprise Data Science platform that helps your
team work together to build models to make better
data driven decisions for your business

Analyze any data, no matter where it lives
Connect to and analyze your data without moving a single byte through dozens of connectors and multiple deployment options

Empower your entire organization with notebooks, visual productivity, and automation tools
Leverage your entire organization with a variety of tools in a single integrated platform

One platform to rule them all from discovery to production
Analyze data, build predictive models, and seamlessly integrate Watson Machine Learning to deploy at scale
IBM Watson Machine Learning
Embed Machine Learning and Deep Learning
in your Business

Deploy and Manage Models
Move models to production in an easy, secure, and compliant way

Intelligent Model Operations
Embed intelligent training services, with feedback loops that constantly learn from new data, regardless of where it resides

Accelerate Compute Intensive Workloads
Distribute your deep learning training and Hadoop/Spark workloads with multi-tenant job scheduling
IBM Elastic Storage Server (ESS)
Integrated scale-out data management for file and object data

Optimal building block for high-performance, scalable, reliable enterprise Spectrum Scale storage
• Faster data access with choice to scale-up or scale-out
• Easy to deploy clusters with unified system GUI
• Simplified storage administration with IBM Spectrum Control integration

One solution for all your Spectrum Scale data needs
• Single repository of data with unified file and object support
• Anywhere access with multi-protocol support using protocol nodes - NFS 4.0, SMB, Object
• Ideal for Big Data analytics including full Hadoop transparency

Ready for business-critical data
• Disaster recovery with synchronous or asynchronous replication
• Ensure reliability and fast rebuild times using Spectrum Scale RAID’s dispersed data and erasure code
• Five 9s (99.999%) of availability

[Images: Entry ESS 3000; ESS 3000 cluster; Elastic Storage Server cluster]
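
The "five 9s" figure translates into a concrete downtime budget. A minimal arithmetic sketch, assuming a 365-day year; the constant and function name are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def max_downtime_minutes(availability: float) -> float:
    """Return the yearly downtime allowed by an availability fraction such as 0.99999."""
    return (1.0 - availability) * MINUTES_PER_YEAR


print(f"{max_downtime_minutes(0.99999):.2f} minutes per year")
```

At 99.999% availability the budget works out to a little over five minutes of downtime per year.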
IBM Elastic Storage Server (ESS) solution packaging

Integrated solution
• Spectrum Scale is integrated, tested, and factory preloaded
• Leverage the latest IBM Spectrum Scale releases
• Data Management Edition and Data Access Edition

Optimal storage capacity & economy
• ESS has various models providing SAS, NL-SAS, SSD, or NVMe storage
• Choose from various sizes of HDD, SSD, and NVMe
• Rack-mountable solution

Non-disruptive upgrades
• Capacity upgrades can be performed without application disruption
• Software automatically rebalances data across all drives

High performance connectivity
• Optional integrated InfiniBand or 100Gb Mellanox Ethernet switch
• Provides a lower cost high performance network interconnect
