Introduction to Big Data, Apache Hadoop, and Cloudera
Ian Wrigley, Curriculum Manager, Cloudera

The Motivation for Hadoop

Traditional Large-Scale Computation

Traditionally, computation has been processor-bound

Relatively small amounts of data
Significant amount of complex processing performed on that data
For decades, the primary push was to increase the computing power of a
single machine
Faster processor, more RAM

The Data Explosion


1.8 trillion gigabytes of data was created in 2011
More than 90% is unstructured data
Approx. 500 quadrillion files
Quantity doubles every 2 years

[Chart: growth in data volume, 2005 to 2015]


Source: IDC 2011

Current Solutions

Current Database Solutions are designed for structured data.

Optimized to answer known questions quickly
Schemas dictate form/context
Difficult to adapt to new data types and new
Expensive at Petabyte scale

[Chart: 2005 to 2015, with a 10% marker]


Why Use Hadoop?

Move beyond rigid legacy frameworks

Hadoop handles any data type, in any quantity
Structured, unstructured
Schema, no schema
High volume, low volume
All kinds of analytic applications

Hadoop grows with your business
Proven at petabyte scale
Capacity and performance grow simultaneously
Leverages commodity hardware to mitigate costs

Hadoop is 100% Apache licensed and open source
No vendor lock-in
Community development
Rich ecosystem of related projects

Hadoop helps you derive the complete value of all your data
Drives revenue by extracting value from data that was previously out of reach
Controls costs by storing data more affordably than any other platform

The Origins of Hadoop

[Timeline, 2002 to 2012; company logos not reproduced]
Open source web crawler project created by Doug Cutting
Publishes MapReduce and GFS Paper
Open Source MapReduce and HDFS project created by Doug Cutting
Runs 4,000-node Hadoop cluster
Hadoop wins Terabyte sort benchmark
Launches SQL support for Hadoop
Releases CDH and Cloudera Enterprise

Core Hadoop: HDFS

Self-healing, high bandwidth


[Figure: a file split into numbered blocks, with copies of each block placed on different nodes]

HDFS breaks incoming files into blocks and stores them redundantly across the cluster.
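
The splitting and redundant storage described above can be sketched in a few lines of Python. This is an illustrative toy, not HDFS code: the function names, the 256-byte block size, the five node names, and the round-robin placement are all assumptions made for the example. Real HDFS uses far larger blocks and its own placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes.

    Round-robin placement is an assumption for illustration only;
    it is not the policy HDFS actually uses.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


def surviving_blocks(placement, failed_node):
    """Return ids of blocks that still have a replica after one node fails."""
    return [
        block_id
        for block_id, holders in placement.items()
        if any(node != failed_node for node in holders)
    ]


# Hypothetical cluster and file, chosen only to keep the numbers small.
nodes = ["node1", "node2", "node3", "node4", "node5"]
blocks = split_into_blocks(b"x" * 1000, block_size=256)
placement = place_replicas(len(blocks), nodes, replication=3)
still_readable = surviving_blocks(placement, failed_node="node1")
```

Because every block lives on three different nodes in this sketch, losing any single node leaves all blocks readable, which is the intuition behind the "self-healing" label on the slide.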

Core Hadoop: MapReduce


[Figure: a MapReduce (MR) job running against the blocks stored on each node]

Processes large jobs in parallel across many nodes and combines the results.
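
The idea of processing in parallel and then combining results can be illustrated with a word count, the customary first MapReduce example. The sketch below is plain Python, not the Hadoop API: the function names are hypothetical, and a local thread pool stands in for the cluster's nodes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks, workers=4):
    """Run the map step on each chunk in parallel, then combine the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


# Each string plays the role of one block of a larger input file.
counts = word_count(["Hadoop stores data", "Hadoop processes data in parallel"])
```

Each chunk is mapped independently, so the map step can run wherever the data already sits; only the grouped intermediate pairs need to be brought together for the reduce step.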

Hadoop and Databases

You need

Databases, Best Used For:
Interactive OLAP Analytics (<1sec)
Multistep ACID Transactions
100% SQL Compliance

Hadoop, Best Used For:
Structured or Not (Flexibility)
Scalability of Storage/Compute
Complex Data Processing

Typical Datacenter Architecture

[Diagram labels: Enterprise web site; Interactive database; Data export; OLAP load; intelligence apps; Oracle, SAP...]

Adding Hadoop To The Mix

[Diagram labels: Enterprise web site; Interactive database; Hadoop; New; OLAP queries; intelligence apps; Oracle, SAP...; Recommendations, etc...]

Why Cloudera?

Cloudera is

in Customers and Users in Integrated Partners

in banking, across hardware, platforms,

telecommunications, mobile services, defense & intelligence, database and business intelligence (BI)
media and retail depend on Cloudera

than for hardware, platforms, software and services
all other Hadoop systems combined

in Training and Certification in Nodes Under Management

developers, administrators and

managers trained on 6 continents since 2009
in Open Source Contributions

for developers, administrators and managers

in Data Science

Experienced and Proven Across Hundreds of Deployments

The Only Vendor With a Complete Solution

Cloudera's Distribution Including Apache Hadoop (CDH)
Big Data storage, processing and analytics platform based on Apache Hadoop
100% open source
[Diagram labels: COMPUTATION, INTEGRATION]

Cloudera Enterprise 4.0

Cloudera Manager
End-to-end management application for the deployment and operation of CDH
[Diagram labels: DEPLOYMENT, DIAGNOSING, REPORTING]

Production Support
Our team of experts on call to help you meet your Service Level Agreements (SLAs)
[Diagram labels: ISSUE RESOLUTION, KNOWLEDGE BASE]

Cloudera University
Equipping the Big Data workforce
12,000+ trained

Partner Ecosystem
250+ partners across hardware, software, platforms and services

Professional Services
Use case discovery, pilots, process & team development

Solving Problems with Hadoop

Eight Common Hadoop-able Problems

1. Modeling true risk
2. Customer churn analysis
3. Recommendation engine
4. PoS transaction analysis
5. Analyzing network data to predict failure
6. Threat analysis
7. Search quality
8. Data sandbox

1. Modeling True Risk

How much risk exposure does an organization really have with each
Multiple sources of data and across multiple lines of business
Solution with Hadoop:
Source and aggregate disparate data sources to build data picture
e.g. credit card records, call recordings, chat sessions,
emails, banking activity
Structure and analyze
Sentiment analysis, graph creation, pattern recognition
Typical Industry:
Financial Services (banks, insurance companies)

2. Customer Churn Analysis

Why is an organization really losing customers?
Data on these factors comes from different sources
Solution with Hadoop:
Rapidly build behavioral model from disparate data sources
Structure and analyze with Hadoop
Graph creation
Pattern recognition
Typical Industry:
Telecommunications, Financial Services

3. Recommendation Engine/Ad Targeting

Using user data to predict which products to recommend
Solution with Hadoop:
Batch processing framework
Allow execution in parallel over large datasets
Collaborative filtering
Collecting taste information from many users
Utilizing information to predict what similar users like
Typical Industry:
Ecommerce, Manufacturing, Retail
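
Collaborative filtering, as outlined on this slide, can be sketched at toy scale. The example below is one simple user-based variant and only an assumption about how such a system might score items: the Jaccard similarity measure, the function names, and the sample data are all invented for illustration, and a production system on Hadoop would run comparable logic as batch jobs over far larger datasets.

```python
def jaccard(a, b):
    """Similarity of two users' item sets: shared items / all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, likes, top_n=3):
    """Rank items the target has not seen by the similarity of users who like them."""
    scores = {}
    for user, items in likes.items():
        if user == target:
            continue
        similarity = jaccard(likes[target], items)
        for item in items - likes[target]:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:top_n] if score > 0]


# Hypothetical taste data collected from three users.
likes = {
    "ana": {"tent", "stove", "boots"},
    "ben": {"tent", "stove", "lantern"},
    "cal": {"novel", "lamp"},
}
suggestions = recommend("ana", likes)
```

Here "ben" shares two of four distinct items with "ana", so his remaining item is suggested, while "cal" has nothing in common with her and contributes no recommendations.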

4. Point of Sale Transaction Analysis

Analyzing Point of Sale (PoS) data to target promotions and manage
Sources are complex and data volumes grow across chains of stores and
other sources
Solution with Hadoop:
Batch processing framework
Allow execution in parallel over large datasets
Pattern recognition
Optimizing over multiple data sources
Utilizing information to predict demand
Typical Industry:

5. Analyzing Network Data to Predict Failure

Analyzing real-time data series from a network of sensors
Calculating average frequency over time is extremely tedious because
of the need to analyze terabytes
Solution with Hadoop:
Take the computation to the data
Expand from simple scans to more complex data mining
Better understand how the network reacts to fluctuations
Discrete anomalies may, in fact, be interconnected
Identify leading indicators of component failure
Typical Industry:
Utilities, Telecommunications, Data Centers

6. Threat Analysis/Trade Surveillance

Detecting threats in the form of fraudulent activity or attacks
Large data volumes involved
Like looking for a needle in a haystack
Solution with Hadoop:
Parallel processing over huge datasets
Pattern recognition to identify anomalies,
i.e., threats
Typical Industry:
Security, Financial Services,
General: spam fighting, click fraud

7. Search Quality

Providing real-time, meaningful search results
Solution with Hadoop:
Analyzing search attempts in conjunction with structured data
Pattern recognition
Browsing pattern of users performing searches in different categories
Typical Industry:
Web, Ecommerce

8. Data Sandbox

Data Deluge
Don't know what to do with the data or what analysis to run
Solution with Hadoop:
Dump all this data into an HDFS cluster
Use Hadoop to start trying out different analyses on the data
See patterns to derive value from data
Typical Industry:
Common across all industries

Orbitz: Major Online Travel Booking Service

Orbitz performs millions of searches and transactions daily, which leads
to hundreds of gigabytes of log data every day
Not all of that data has value (i.e., it is logged for historic reasons)
Much is quite valuable
Want to capture even more data
Solution with Hadoop:
Hadoop provides Orbitz with
efficient, economical, scalable,
and reliable storage and processing
of these large amounts of data
Hadoop places no constraints
on how data is processed

Before Hadoop

Orbitz's data warehouse contains a full archive of all transactions
Every booking, refund, cancellation etc.
Non-transactional data was thrown away because it was uneconomical to

[Diagram: Non-transactional Data (e.g., Searches) and Transactional Data (e.g., Bookings); Data Warehouse]

After Hadoop

Hadoop was deployed late 2009/early 2010 to begin collecting this non-transactional data
Orbitz has been using CDH for that entire period with great success.
Much of this non-transactional data is contained in Web analytics logs

[Diagram: Non-transactional Data (e.g., Searches) and Transactional Data (e.g., Bookings); Hadoop and Data Warehouse]

What Now?

Access to this non-transactional data enables a number of applications

Optimizing hotel search
E.g., optimize hotel ranking and show consumers hotels more
closely matching their preferences
User-specific product recommendations
Web page performance tracking
Analyses to optimize search result cache performance
User segments analysis, which can drive personalization
Lots of press coverage in June 2012: company discovered that
people using Macs are willing to spend 30% more on hotels than PC users
Mac users are now presented with pricier hotels first in the list

Major National Bank

100M customers
Relational data: 2.5B records/month
Card transactions, home loans, auto loans, etc.
Data volume growing by hundreds of TB/year
Needs to incorporate non-relational data as well
Web clicks, check images, voice data
Uses Hadoop to
Identify credit risk, fraud
Proactively manage capital

Financial Regulatory Body

Stringent data reliability requirements

Must store seven years of data
850TB of data collected from every Wall Street trade each year
Data volumes growing at 40% each year
Replacing EMC Greenplum + SAN with CDH
Goal is to store data from years two to seven in Hadoop
Will have 5PB of data in Hadoop by the end of 2013
Cost savings predicted to be 10s of millions of dollars
Application performance testing is showing speed gains of 20x in some

Leading North American Retailer

Storing 400TB of data in CDH cluster

Capture and analysis of data on individual customers and SKUs across
4,000 locations
Using Hadoop for:
Loyalty program analytics and personal pricing
Fraud detection
Supply chain optimization
Marketing and promotions
Locating and pricing overstocked items for clearance

Digital Media Company

Needs to quickly and reliably process high volume clickstream and
pageview data
Experienced database bottlenecks and reliability issues
Now using CDH
A cluster of just 20 nodes
Ingesting 75 million clickstream, page view, and user profile events per
15GB of data
Processes 430 million records from six million users in 11 minutes
Alternative solution would have required 10x more investment in
database software, high-end servers, developer time

Leader in Real-Time Advertising Technology

Hundreds of customers need unique views of the data

Were using Netezza; unable to run more than 2-3 big jobs per day
Too expensive to scale
Now using CDH
Processing hundreds of jobs concurrently
200-300GB/hour per job
Ingesting 10TB of data per day
Moving data between CDH, Netezza, and Vertica

Before Hadoop
Nightly processing of logs
Imported into a database
As data volume grew, it took more than 24 hours to process and load a
day's worth of logs
Today, an hourly Hadoop job processes logs for quicker availability of the
data for analysis/BI
Currently ingesting approximately 1TB of data per day

Hadoop as Cheap Storage

Before Hadoop: $1 million for 10TB storage
With Hadoop: $1 million for 1 PB of storage
Other Large Company
Before Hadoop: $5 million to store data in Oracle
With Hadoop: $240K to store the data in HDFS
Hadoop as unified storage
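
The scale of the savings quoted on this slide is easier to see per terabyte. The short calculation below uses only the figures given above; treating 1 PB as 1,000 TB (decimal units) is an assumption, and the function name is illustrative.

```python
TB_PER_PB = 1000  # assumption: decimal units, 1 PB = 1,000 TB


def cost_per_tb(total_cost_usd, capacity_tb):
    """Return the storage cost in US dollars per terabyte."""
    return total_cost_usd / capacity_tb


# First example: the same $1 million buys 10 TB before and 1 PB with Hadoop.
before = cost_per_tb(1_000_000, 10)
after = cost_per_tb(1_000_000, 1 * TB_PER_PB)
ratio = before / after

# Second example: $5 million in Oracle versus $240K in HDFS for the same data.
oracle_vs_hdfs = 5_000_000 / 240_000

print(f"Before: ${before:,.0f}/TB, with Hadoop: ${after:,.0f}/TB ({ratio:.0f}x cheaper)")
print(f"Oracle vs. HDFS: about {oracle_vs_hdfs:.1f}x")
```

Under that assumption the first example works out to $100,000 per TB versus $1,000 per TB, a factor of 100, and the second to a factor of roughly 21.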

Hadoop Jobs

The Roles People Play

System Administrators
Data Stewards

System Administrators

Required skills:
Strong Linux administration skills
Networking knowledge
Understanding of hardware
Job responsibilities:
Install, configure and upgrade Hadoop software
Manage hardware components
Monitor the cluster
Integrate with other systems (e.g., Flume and Sqoop)

Developers

Required skills:
Strong Java or scripting capabilities
Understanding of MapReduce and algorithms
Job responsibilities:
Write, package and deploy MapReduce programs
Optimize MapReduce jobs and Hive/Pig programs

Data Analyst/Business Analyst

Required skills:
Understanding data analytics/data mining
Job responsibilities:
Extract intelligence from the data
Write Hive and/or Pig programs

Data Steward

Required skills:
Data modeling and ETL
Scripting skills
Job responsibilities:
Cataloging the data (analogous to a librarian for books)
Manage data lifecycle, retention
Data quality control with SLAs

Combining Roles

System Administrator + Steward analogous to DBA

Required skills:
Data modeling and ETL
Scripting skills
Strong Linux administration skills
Job responsibilities:
Manage data lifecycle, retention
Data quality control with SLAs
Install, configure and upgrade Hadoop software
Manage hardware components
Monitor the cluster
Integrate with other systems (e.g., Flume and Sqoop)

Finding The Right People

Hiring Hadoop experts

Strong Hadoop skills are scarce and expensive
Hadoop User Groups
Key words
Developers: MapReduce, Cloudera Certified Developer for Apache
Hadoop (CCDH)
System Admins: distributed systems (e.g., Teradata, RedHat
Cluster), Linux, Cloudera Certified Administrator for Apache Hadoop
Consider cross-training, especially system administrators and data

Cloudera's Academic Partnership Program

Cloudera's Academic Partnerships: Overview

Cloudera's Academic Partnerships (CAP)

An essential component of Cloudera's strategy to provide
comprehensive Apache Hadoop training to current and future data
Designed to be a mutually beneficial relationship
Universities are enabled to deliver new and relevant areas of study to
their students
Cloudera is able to help fill the demand for qualified data professionals
to help the market continue its explosive growth
With CDH and Cloudera Manager available for free, and our curriculum
and Virtual Machine, we provide universities the foundation to start
experimenting with Hadoop and developing expertise among their
Cloudera's Academic Partnerships: Goals

Introduce students to Apache Hadoop

Provide students and instructors with quality course materials and virtual
machine images to complete hands-on labs
Grant 50% discount on certification costs to students associated with the
program who are interested in attempting Cloudera's industry leading
Hadoop certification exams
Highly recommended they take the class and attempt the certification
Allow academic institutions options to augment their degree program

Cloudera's Academic Partnerships: Financial Overview

Cloudera does not currently charge Academic Partners for usage of the
training materials
This is a program designed solely to facilitate students' learning of an
emerging technology
Our reward is helping the industry grow, and ideally the exposure to
Cloudera is a positive one which will be remembered when the students
we service today are making decisions for their business tomorrow
Instructors who are delivering the Cloudera courses are eligible for a 50%
discount to commercial training courses delivered by Cloudera
We want to make sure the folks leading the classes have the skillset to
help their students be successful
Normally we provide universities with courses focused on the roles of
Hadoop Developer or Administrator

