
“A Sneak Peek at Big Data Analytics and Hadoop”

Session 1

March 6, 2013
Vivek Seth, Lead BI Architect (BI/ETL Program)
Committee Chair, TDWI D.C. Chapter
Agenda
 What is Big Data?
 Big Data Characteristics
 Big Data Technology and Services Market Sizing Criteria
 Value of Big Data
 What Big Data is Not
 Why the Sudden “Hype” in Big Data Analytics?
 Traditional vs. Big Data Analytics
 Industry Segments where Big Data Will Play a Major Role
 Data Footprint & Time Horizon
 Three Big Data Platforms (Systems)
 Hadoop and MapReduce
 Big Data:
– Future Outlook
– Challenges
– Key Observations to Consider
 Executive Summary

2
What is “Big Data”?
What do we hear from the grapevine?

a) Lots of data
b) Different types of data
c) More data than you can handle
d) Purpose-built analytical systems
e) Distributed file system
f) New staging area and archive
g) A Java developer’s employment act
h) A replacement for the RDBMS
i) A club for hip data people

3
What is “Big Data”?

4
What is “Big Data”?
A pragmatic definition of Big Data must be actionable for both IT and
business professionals.
Big Data is the frontier of a firm’s ability to store, process, and access
(SPA) all the data it needs to operate effectively, make decisions,
reduce risks, and serve customers.
To remember the pragmatic definition of big data, think SPA — the
three questions of big data:
Store. Can you capture and store the data?
Process. Can you cleanse, enrich, and analyze the data?
Access. Can you retrieve, search, integrate, and visualize the data?

5
Big Data is Getting Bigger…

6
How BIG is 50 Petabytes?

7
Big Data Characteristics

8
Big Data Technology and Services Market
Sizing Criteria

9
Value of Big Data

10
What Big Data is Not…
• It is not a replacement for your
Database strategy.
• It is not a replacement for your Data
Warehouse strategy.
• It is not a solution by itself; it needs
jobs/applications to drive value.

11
Why the Sudden “Hype” in Big Data
Analytics?

12
Traditional vs. Big Data Analytics

13
Industry Segments where Big Data Will
Play a Major Role
Financial Services
• Detect fraud
• Model and manage risk
• Improve debt recovery rates
• Personalize banking/insurance products

Healthcare
• Optimal treatment pathways
• Remote patient monitoring
• Predictive modeling for new drugs
• Personalized medicine

Government
• Reduce fraud and waste
• Segment populations, customize action
• Support open data initiatives
• Automate decision making

Web/Social/Mobile
• Location-based marketing
• Social segmentation
• Sentiment analysis
• Price comparison services

14
Data Footprint & Time Horizon
[Chart: data footprint vs. time horizon. Consumption ranges from real time
through daily, monthly, and yearly, while the data footprint grows from
gigabytes at the source to terabytes and petabytes. Highly summarized data
feeds visualization and dashboards; aggregated data feeds analytic marts and
cubes; detailed events/facts live in core ERP/legacy applications and the
data warehouse, supporting predictive analytics; unstructured web/telemetry
data lands in Big Data platforms such as Hadoop.]

15
Three Big Data Platforms (Systems)
 General-Purpose Relational Database
 Analytical Database
 Hadoop

16
General-Purpose RDBMS - Powers First-Generation DW

[Diagram: operational systems feed, via an ETL server, a data warehouse
(Teradata, Oracle, DB2, SQL Server, etc.); further ETL populates datamarts
(PDMs & VDMs) that serve BI reports and dashboards.]

Benefits:
- RDBMS already in house
- SQL-based
- Trained DBAs

Challenges:
- Cost to deploy and upgrade
- Doesn't support complex analytics
- Scalability and performance

17
Analytical Platforms

Purpose-built database management systems designed explicitly for query
processing and analysis that provide dramatically higher price/performance
and availability compared to general-purpose solutions.

Vendors:
1010data
Aster Data (Teradata)
Calpont
Datallegro (Microsoft)
Exasol
Greenplum (EMC)
IBM SmartAnalytics
Infobright
Kognitio
Netezza (IBM)
Oracle Exadata
ParAccel
Pervasive
Sand Technology
SAP HANA
Sybase IQ (SAP)
Teradata
Vertica (HP)

Deployment Options
• Software only (ParAccel, Vertica)
• Appliance (SAP, Exadata, Netezza)
• Hosted (1010data, Kognitio)

18
Business Value of Analytic Platforms

 Kelley Blue Book – Consolidates millions of auto transactions each week
on an analytical appliance to calculate car valuations
 AT&T Mobility – Tracks purchasing patterns for 80M customers daily in an
analytical database to optimize targeted marketing

19
Hadoop

• Ecosystem of open source projects
• Hosted by Apache Foundation
• Google developed and shared concepts
• Distributed file system that scales out on commodity servers with
direct attached storage and automatic failover

20
What is Hadoop?
Hadoop is a framework that provides open source libraries for distributed
computing using a simple programming model, MapReduce, and its own
distributed filesystem called HDFS. It facilitates scalability and takes care
of detecting and handling failures.
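
As a concrete illustration of the "distributed filesystem" half of that
definition, here is a minimal sketch (not from the slides) that writes and
reads a file through the HDFS Java API; the file path is hypothetical, and
the cluster address is assumed to come from core-site.xml:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath;
        // on a real cluster this points at the NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/demo/hello.txt"); // hypothetical path

        // Write a small file; HDFS splits it into blocks and replicates
        // them across DataNodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello, big data\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back through the same FileSystem abstraction.
        try (BufferedReader in =
                 new BufferedReader(new InputStreamReader(fs.open(path)))) {
            System.out.println(in.readLine());
        }
    }
}
```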

21
Hadoop – Version History

• 1.0.X - Current stable version, 1.0 release
• 1.1.X - Current beta version, 1.1 release
• 2.X.X - Current alpha version
• 0.23.X - Similar to 2.X.X but missing NN HA
• 0.22.X - Does not include security
• 0.20.203.X - Old legacy stable version
• 0.20.X - Old legacy version
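
To check which of these releases a given cluster (or client classpath) is
actually running, Hadoop ships a small utility class; a minimal sketch:

```java
import org.apache.hadoop.util.VersionInfo;

public class PrintHadoopVersion {
    public static void main(String[] args) {
        // Reports the version of the Hadoop libraries on the classpath,
        // e.g. a 1.0.x release from the stable line above.
        System.out.println("Hadoop version: " + VersionInfo.getVersion());
        System.out.println("Built from revision: " + VersionInfo.getRevision());
    }
}
```

The same information is available from the command line via `hadoop version`.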

22
Why Hadoop?

23
Hadoop Distilled: What’s New?
[Diagram: Big Data at the center, surrounded by what Hadoop adds -
unstructured data, a distributed file system, schema on read, the data
scientist, open source economics ($$), NoSQL, and MapReduce.]

Benefits:
- Comprehensive
- Agile
- Expressive
- Affordable

Drawbacks:
- Immature
- Batch-oriented
- Expertise
- TCO

24
What is MapReduce?
 Framework introduced by Google
 Processes vast amounts of data (multi-terabyte data-sets) in
parallel
 Achieves high performance on large clusters (thousands of
nodes) of commodity hardware in a reliable, fault-tolerant
manner
 Splits the input data-set into independent chunks
 Sorts the outputs of the maps, which are then input to the
reduce tasks
 Takes care of scheduling tasks, monitoring them, and re-
executing the failed tasks
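
To make those bullets concrete, here is a minimal sketch of the canonical
word-count job written against the org.apache.hadoop.mapreduce API of the
Hadoop 1.x line; the input and output paths are supplied on the command line
and are assumptions, not something taken from the slides:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: one input line -> (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: (word, [1, 1, ...]) -> (word, count).
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // combine locally on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar (the jar name here is hypothetical), it would be launched
with `hadoop jar wordcount.jar WordCount <input> <output>`; the framework
handles splitting the input, shuffling and sorting the map output, and
re-executing failed tasks.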
MapReduce - Process
Inputs & Outputs
The MapReduce framework operates exclusively on <key, value> pairs;
that is, the framework views the input to the job as a set of <key, value>
pairs and produces a set of <key, value> pairs as the output of the job,
conceivably of different types.
The key and value classes have to be serializable by the framework and
hence need to implement the Writable interface. Additionally, the key
classes have to implement the WritableComparable interface to facilitate
sorting by the framework.
Input and Output types of a MapReduce job:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, List(v2)> ->
reduce -> <k3, v3> (output)
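
To show what that serialization contract looks like in practice, here is a
minimal sketch of a custom composite key implementing WritableComparable;
the class name and fields are hypothetical, used only for illustration:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// A composite key (customerId, timestamp) usable as a map-output key.
public class CustomerEventKey implements WritableComparable<CustomerEventKey> {
    private long customerId;
    private long timestamp;

    public CustomerEventKey() { } // no-arg constructor required by the framework

    public CustomerEventKey(long customerId, long timestamp) {
        this.customerId = customerId;
        this.timestamp = timestamp;
    }

    @Override
    public void write(DataOutput out) throws IOException {    // serialize
        out.writeLong(customerId);
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialize
        customerId = in.readLong();
        timestamp = in.readLong();
    }

    @Override
    public int compareTo(CustomerEventKey other) {             // drives the sort phase
        int cmp = Long.compare(customerId, other.customerId);
        return cmp != 0 ? cmp : Long.compare(timestamp, other.timestamp);
    }

    @Override
    public int hashCode() {                    // used by the default hash partitioner
        return (int) (customerId * 31 + timestamp);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof CustomerEventKey)) return false;
        CustomerEventKey k = (CustomerEventKey) o;
        return customerId == k.customerId && timestamp == k.timestamp;
    }
}
```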
Hadoop Ecosystem

28
Top Uses for Hadoop
• Risk Modeling:
– How business/industry can better understand customers and market.

• Customer Churn Analysis:
– Why companies really lose customers.

• Recommendation Engine:
– How to predict customer preferences.

29
Top Uses for Hadoop Continued
• Ad Targeting:
– How to increase campaign efficiency

• Point of Sale Transaction Analysis:
– Targeting promotions to make customers buy

• Predicting Network Failure:
– Using machine-generated data to identify trouble spots

30
Top Uses for Hadoop Continued
• Threat Analysis:
– Detecting threats and fraudulent activity

• Trade Surveillance:
– Helping businesses spot the rogue trader

• Search Quality:
– Delivering more relevant search results to customers

31
Big Data Future Outlook
Worldwide Big Data Technology and Services Revenue, 2010–2016

Source: Worldwide Big Data Technology and Services 2012-2016 Forecast

32
Big Data Challenges

 Organizations either struggle with or focus on questions such as what to measure and how best to
measure it; that is the analytics part of deciding which data is relevant. Whole industry segments are
being reengineered, new business models are being introduced, and new products and services are
being launched based on the management and analysis of Big Data. In most of these cases of innovation,
new data and/or new analytic techniques are being used, such as a new search algorithm or an
application of machine learning using topology.
Source: IDC - Big Opportunities and Big Challenges 2012

33
Key Observations to Consider
To capitalize on the opportunities, vendors should consider the following:
 One size (technology) does not fit all (use cases). There is no such thing as a single Big Data technology,
nor is there such a thing as an enterprise-wide Big Data project. It's imperative to approach the market with
discrete, albeit connected, products that enable incremental deployment and dynamic scaling.
 The measures of volume, variety, and velocity represent only the first and highest level of opportunity
segmentation and need to be followed by industry, business process, and activity-based segmentation of
potential opportunities.
 Package and price offerings to address specific use cases. There's no such thing as an enterprise-wide Big
Data product; Big Data solutions will need to incorporate a range of functionality delivered through
discrete, workload-specific products.
 Be prepared to take on a more consultative role as a trusted advisor. Many end users want more than just to
purchase technology. They are looking for advice about what data to track, how to analyze the data, and
how to influence action and effect change based on the results of the data analysis. Only 25% of
organizations report having a business analytics strategy. It's important for vendors to help end users create
such a strategy as well as have one of their own.
 Big Data analytics services will see a surge in demand as customers find it difficult to source experienced
talent for both Big Data and analytics needs. Services vendors should capitalize on this opportunity by
bolstering their Big Data services portfolios as customers will increasingly leverage the service providers'
infrastructure and expertise (around business analytics tools, cloud, mobility, etc.) as well as their ability to
provide industry- and domain-specific solutions.

Source: RN Analysis

34
Executive Summary
 Big Data is becoming either meaningless hyperbole or the foundation for foreseeable progress and
innovation that will propel us all into the intelligent economy of smart cars, smart buildings, smart
healthcare, smart law enforcement, smart education, and other previously "not so smart" human
endeavors.
 Organizations today define Big Data for themselves without limiting the definition to a specific
technology or specific characteristics of the data. They focus on specific business processes within
the organization and industry and look for ways to drive return on their data assets.
 To address the ongoing needs of these and other organizations, vendors across the technology and
services spectrum should focus not just on a general-purpose Big Data solution but on one or
several solutions that address specific use cases and drivers of adoption.
 IDC expects the Big Data technology and services market to grow from $6 billion in 2011 to $23.8
billion in 2016. This represents a compound annual growth rate (CAGR) of 31.7% (see the worked
calculation after this list), or about seven times that of the overall information and communication
technology (ICT) market.
 Opportunities for vendors will exist at all levels of the Big Data technology stack, including
infrastructure, software, and services, via either an on-premises or a cloud delivery model.
 Big Data technology and services demand and supply factors continue to evolve rapidly,
necessitating a frequent review of market sizing methodology and forecast assumptions, in addition
to competitive market assessment.
 Big Data analytics services will see a surge in demand as customers find it difficult to source
experienced talent for both Big Data and analytics needs.
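
As a check on the quoted 31.7% figure, using only the market sizes cited
above, the compound annual growth rate over the five years from 2011 to 2016
works out as:

```latex
\mathrm{CAGR} = \left(\frac{23.8}{6.0}\right)^{1/5} - 1 \approx 3.97^{0.2} - 1 \approx 0.317 = 31.7\%
```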

35
Questions?

36
