
Data Science Analytics & Research Centre
9/20/2014

Big Data
HDFS
Hadoop

Big Data Overview
  Characteristics
  Applications & Use Cases

Hadoop Distributed File System (HDFS) Overview
  HDFS Architecture
  Data Replication
  Node Types
  JobTracker / TaskTracker
  HDFS Data Flows
  HDFS Limitations

Hadoop Overview
  Inputs & Outputs
  Data Types
  What is MapReduce (MR)
  Example
  Functionalities of MR
  Speculative Execution
  Hadoop Streaming
  Hadoop Job Scheduling


Big Data Overview
  Characteristics
  Applications & Use Cases
  Data Footprint & Time Horizon
  Technology Adoption Lifecycle


[Figure: Data footprint vs. time horizon. Time horizons run from real time, near real time, hourly, daily, weekly, monthly, quarterly, and yearly out to 3, 5, and 10 years. On the consumption side, highly summarized data feeds visualization & dashboards, aggregated data feeds analytic marts & cubes, and detailed events/facts feed predictive analytics. On the source side, data ranges from core ERP, legacy applications, and the data warehouse to unstructured web/telemetry handled by big data platforms such as Hadoop.]
[Figure: Data footprint grows with the time horizon, from gigabytes at real time, through terabytes at a monthly horizon, to petabytes at a yearly horizon.]


Financial Services
  Detect fraud
  Model and manage risk
  Improve debt recovery rates
  Personalize banking/insurance products

Healthcare
  Optimal treatment pathways
  Remote patient monitoring
  Predictive modeling for new drugs
  Personalized medicine

Retail
  In-store behavior analysis
  Cross selling
  Optimize pricing, placement, design
  Optimize inventory and distribution

Web / Social / Mobile
  Location-based marketing
  Social segmentation
  Sentiment analysis
  Price comparison services

Government
  Reduce fraud
  Segment populations, customize action
  Support open data initiatives
  Automate decision making

Manufacturing
  Design to value
  Crowd-sourcing
  Digital factory for lean manufacturing
  Improve service via product sensor data

Hadoop Distributed File System (HDFS)
  Overview
  HDFS Architecture
  Data Replication
  Node Types
  JobTracker / TaskTracker
  HDFS Data Flows
  HDFS Limitations


HDFS is Hadoop's own implementation of a distributed file system.

It is coherent and provides all the facilities of a file system.
It implements ACLs and provides a subset of the usual UNIX commands for accessing or querying the filesystem.
It uses a large block size (64 MB by default; 128 MB is recommended) so that seek time stays small relative to transfer time and network bandwidth. Very large files are therefore ideal candidates for storage.
Streaming data access: a write-once, read-many-times architecture. Because files are large, the time to read a whole file matters more than the seek time to the first record.
Commodity hardware: HDFS is designed to run on commodity hardware that may fail, and it detects and handles such failures.

E.g., with a 128 MB block size, a 420 MB file is split into blocks of 128 MB, 128 MB, 128 MB, and 36 MB.
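To make the splitting arithmetic concrete, here is a minimal Java sketch (a hypothetical helper, not a Hadoop API) that computes the block layout for a given file and block size:

// Hypothetical illustration of HDFS block-splitting arithmetic (not part of Hadoop).
public class BlockSplitDemo {
    static long[] split(long fileSize, long blockSize) {
        int fullBlocks = (int) (fileSize / blockSize);
        long remainder = fileSize % blockSize;
        long[] blocks = new long[fullBlocks + (remainder > 0 ? 1 : 0)];
        for (int i = 0; i < fullBlocks; i++) blocks[i] = blockSize;
        if (remainder > 0) blocks[blocks.length - 1] = remainder;   // last block holds the tail
        return blocks;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        for (long b : split(420 * mb, 128 * mb)) {
            System.out.println(b / mb + " MB");                     // prints 128, 128, 128, 36
        }
    }
}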


[Figure: HDFS data replication. A client issues create and complete calls for File 1, which is split into blocks B1, B2, and B3. The Namenode records which datanodes (n1-n4) hold each block, and replicas of each block are placed on multiple datanodes spread across Rack 1, Rack 2, and Rack 3.]


HDFS Flow: Read

HDFS Flow: Write
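Programmatically, the read and write paths are exercised through the org.apache.hadoop.fs.FileSystem API. A minimal sketch, assuming a running HDFS; the file path below is hypothetical:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Write: the client asks the Namenode where to place blocks, then streams to Datanodes.
        Path file = new Path("/user/demo/hello.txt");   // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello HDFS\n".getBytes("UTF-8"));
        }

        // Read: the client asks the Namenode for block locations, then reads from Datanodes.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file), "UTF-8"))) {
            System.out.println(in.readLine());          // prints: Hello HDFS
        }
    }
}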

cat - Copies source paths to stdout.
    hadoop dfs -cat URI [URI ...]

chgrp - Change group association of files. With -R, make the change recursively through the directory structure.
    hadoop dfs -chgrp [-R] GROUP URI [URI ...]

chmod - Change the permissions of files. With -R, make the change recursively through the directory structure.
    hadoop dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]

chown - Change the owner of files. With -R, make the change recursively through the directory structure.
    hadoop dfs -chown [-R] [OWNER][:[GROUP]] URI [URI ...]

copyFromLocal - Similar to the put command, except that the source is restricted to a local file reference.
    hadoop dfs -copyFromLocal <localsrc> URI

copyToLocal - Similar to the get command, except that the destination is restricted to a local file reference.
    hadoop dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

cp - Copy files from source to destination.
    hadoop dfs -cp URI [URI ...] <dest>

du - Displays the aggregate length of files contained in the directory, or the length of a file in case it is just a file.
    hadoop dfs -du URI [URI ...]

dus - Displays a summary of file lengths.
    hadoop dfs -dus <args>

expunge - Empty the Trash.
    hadoop dfs -expunge

get - Copy files to the local file system.
    hadoop dfs -get [-ignorecrc] [-crc] <src> <localdst>

getmerge - Concatenates files in source into the destination local file.
    hadoop dfs -getmerge <src> <localdst> [addnl]

ls (or lsr) - For a file, returns stat on the file; for a directory, returns the list of its direct children.
    hadoop dfs -ls <args>

mkdir - Takes path URIs as arguments and creates directories.
    hadoop dfs -mkdir <paths>

moveFromLocal
    hadoop dfs -moveFromLocal <src> <dst>

mv - Moves files from source to destination.
    hadoop dfs -mv URI [URI ...] <dest>

put - Copy single src, or multiple srcs, from the local file system to the destination filesystem.
    hadoop dfs -put <localsrc> ... <dst>

rm (or rmr) - Delete files specified as args; rmr deletes recursively.
    hadoop dfs -rm URI [URI ...]

setrep - Changes the replication factor of a file. The -R option recursively changes the replication factor of files within a directory.
    hadoop dfs -setrep [-R] <path>

stat - Returns the stat information on the path.
    hadoop dfs -stat URI [URI ...]

tail - Displays the last kilobyte of the file to stdout.
    hadoop dfs -tail [-f] URI

test - Checks the path: -e if the file exists, -z if the file is zero length, -d if the path is a directory.
    hadoop dfs -test -[ezd] URI

text - Takes a source file and outputs the file in text format.
    hadoop dfs -text <src>

touchz - Create a file of zero length.
    hadoop dfs -touchz URI [URI ...]
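Most of these shell commands have programmatic equivalents on org.apache.hadoop.fs.FileSystem. A minimal sketch; the paths, file names, and replication factor below are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path dir = new Path("/user/demo/data");                     // hypothetical directory
        fs.mkdirs(dir);                                             // like: hadoop dfs -mkdir

        Path local = new Path("file:///tmp/sample.txt");            // hypothetical local file
        fs.copyFromLocalFile(local, dir);                           // like: hadoop dfs -put / -copyFromLocal

        fs.setReplication(new Path(dir, "sample.txt"), (short) 2);  // like: hadoop dfs -setrep

        for (FileStatus status : fs.listStatus(dir)) {              // like: hadoop dfs -ls
            System.out.println(status.getPath() + " " + status.getLen());
        }
    }
}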

Low-latency data access: HDFS is not optimized for low-latency access; it trades latency for higher data throughput.
Lots of small files: since the block size is 64 MB, many small files waste blocks and increase the memory requirements of the namenode, which keeps metadata for every block.
Multiple writers and arbitrary modification: HDFS does not support multiple concurrent writers; a file is written by a single writer, and writes are always appended at the end of the file.


Hadoop Overview
  Inputs & Outputs
  Data Types
  What is MR
  Example
  Functionalities of MR
  Speculative Execution
  How Hadoop runs MR
  Hadoop Streaming
  Hadoop Job Scheduling


Hadoop is a framework that provides open source libraries for distributed computing using a simple MapReduce programming interface and its own distributed filesystem, HDFS. It facilitates scalability and takes care of detecting and handling failures.


Hadoop release series:
  1.0.X - current stable version, 1.0 release
  1.1.X - current beta version, 1.1 release
  2.X.X - current alpha version
  0.23.X - similar to 2.X.X, but missing NameNode HA
  0.22.X - does not include security
  0.20.203.X - old legacy stable version
  0.20.X - old legacy version


Risk Modeling:
  How business/industry can better understand customers and markets.

Customer Churn Analysis:
  Why companies really lose customers.

Recommendation Engine:
  How to predict customer preferences.


Ad Targeting:
  How to increase campaign efficiency.

Point of Sale Transaction Analysis:
  Targeting promotions to make customers buy.

Predicting Network Failure:
  Using machine-generated data to identify trouble spots.


Threat Analysis:
  Detecting threats and fraudulent activity.

Trade Surveillance:
  Helping businesses spot the rogue trader.

Search Quality:
  Delivering more relevant search results to customers.


MapReduce is a framework introduced by Google.

It processes vast amounts of data (multi-terabyte data-sets) in parallel.
It achieves high performance on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
It splits the input data-set into independent chunks.
It sorts the outputs of the maps, which are then input to the reduce tasks.
It takes care of scheduling tasks, monitoring them, and re-executing failed tasks.


The MapReduce framework operates exclusively on <key, value> pairs: the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.

The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.

Input and output types of a MapReduce job:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, List(v2)> -> reduce -> <k3, v3> (output)
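In the old org.apache.hadoop.mapred API, these key/value types are declared on the JobConf. A minimal sketch; the MyJob class and job name are placeholders:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

// Minimal sketch: declaring the key/value types a job emits (old mapred API).
public class MyJob {
    public static JobConf configure() {
        JobConf conf = new JobConf(MyJob.class);
        conf.setJobName("typed-job");

        // Types of the intermediate <k2, v2> pairs emitted by the mapper.
        conf.setMapOutputKeyClass(Text.class);          // k2 must implement WritableComparable
        conf.setMapOutputValueClass(IntWritable.class); // v2 must implement Writable

        // Types of the final <k3, v3> pairs emitted by the reducer.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        return conf;
    }
}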


Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage.

Hadoop has a Writable interface that supports serialization.

The following predefined implementations of WritableComparable are available:
1. IntWritable
2. LongWritable
3. DoubleWritable
4. VLongWritable - variable size, stores only as many bytes as needed (1-9 bytes)
5. VIntWritable - less commonly used, since its range is covered by VLongWritable
6. BooleanWritable
7. FloatWritable


8. BytesWritable
9. NullWritable
10. MD5Hash
11. ObjectWritable
12. GenericWritable

Apart from the above, there are four Writable collection types:
1. ArrayWritable
2. TwoDArrayWritable
3. MapWritable
4. SortedMapWritable
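Applications can also define their own key and value types. A minimal sketch of a custom WritableComparable key; the PairKey class and its fields are hypothetical:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: serialized by Hadoop via write()/readFields(),
// sorted by the framework via compareTo().
public class PairKey implements WritableComparable<PairKey> {
    private long id;
    private int rank;

    public PairKey() {}                       // no-arg constructor required by the framework

    public PairKey(long id, int rank) {
        this.id = id;
        this.rank = rank;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(id);
        out.writeInt(rank);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        id = in.readLong();
        rank = in.readInt();
    }

    @Override
    public int compareTo(PairKey other) {
        int byId = Long.compare(id, other.id);
        return byId != 0 ? byId : Integer.compare(rank, other.rank);
    }
}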


MapperClass / ReducerClass data flow:

Input Data -> Input Data Format -> <K1, V1> -> Mapper -> <K2, V2> -> <K2, List(V2)> -> Reducer -> <K3, V3>

public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);               // emit <word, 1> for every token
        }
    }
}

public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();              // add up the counts for this word
        }
        output.collect(key, new IntWritable(sum));   // emit <word, total count>
    }
}
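The deck shows only the mapper and reducer; a driver in the style of the official WordCount tutorial would wire them into a job roughly as follows. A hedged sketch, assuming it sits inside the org.myorg.WordCount class with org.apache.hadoop.fs.Path, org.apache.hadoop.io.*, and org.apache.hadoop.mapred.* imported:

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);           // k3
    conf.setOutputValueClass(IntWritable.class);  // v3

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);          // the reducer doubles as the combiner
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);                       // submit the job and wait for completion
}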


Sample input:

$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
Hello World Bye World
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
Hello Hadoop Goodbye Hadoop

Run the application:

$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output

Mapper implementation (lines 18 - 25):

The first map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>

The second map emits:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>

Combiner implementation (line 46):

Output of the first map after combining:
< Bye, 1>
< Hello, 1>
< World, 2>

Output of the second map after combining:
< Goodbye, 1>
< Hadoop, 2>
< Hello, 1>

Reducer implementation (lines 29 - 35):

Output of the job:
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>

Speculative execution is a way of coping with variation in individual machine performance:

The same input can be processed multiple times in parallel, to exploit differences in machine capabilities.
The Hadoop platform schedules redundant copies of the remaining tasks across several nodes that do not have other work to perform.

Name | Value | Description
mapred.map.tasks.speculative.execution | true | If true, then multiple instances of some map tasks may be executed in parallel.
mapred.reduce.tasks.speculative.execution | true | If true, then multiple instances of some reduce tasks may be executed in parallel.
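These properties can also be set per job through the old JobConf API; a minimal sketch that turns speculative execution off for one job:

import org.apache.hadoop.mapred.JobConf;

// Minimal sketch: disabling speculative execution for a single job.
public class SpeculativeExecutionConfig {
    public static JobConf withoutSpeculation(JobConf conf) {
        conf.setMapSpeculativeExecution(false);     // mapred.map.tasks.speculative.execution
        conf.setReduceSpeculativeExecution(false);  // mapred.reduce.tasks.speculative.execution
        return conf;
    }
}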


Hadoop Streaming is a utility that comes with the Hadoop distribution.

It allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer /bin/wc \
    -jobconf mapred.reduce.tasks=2


Default Scheduler

A single priority-based queue of jobs.

Scheduling tries to balance map and reduce load on all tasktrackers in the cluster.

Capacity Scheduler

Within a queue, jobs with higher priority have access to the queue's resources before jobs with lower priority.

To prevent one or more users from monopolizing its resources, each queue enforces a limit on the percentage of resources allocated to a user at any given time, if there is competition for them.

Fair Scheduler

Multiple queues (pools) of jobs, sorted FIFO or by fairness limits.

Each pool is guaranteed a minimum capacity, and excess capacity is shared by all jobs using a fairness algorithm.

The scheduler tries to ensure that, over time, all jobs receive roughly the same share of resources.
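With the Capacity Scheduler enabled, a job selects its queue through job configuration; a minimal sketch, where the queue name "research" is hypothetical:

import org.apache.hadoop.mapred.JobConf;

// Minimal sketch: submitting a job to a named Capacity Scheduler queue.
public class QueueSelection {
    public static JobConf toResearchQueue(JobConf conf) {
        conf.setQueueName("research");   // sets mapred.job.queue.name; "research" is a hypothetical queue
        return conf;
    }
}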


Thank you !!

Data Science Analytics & Research Centre
