
Data Science Analytics & Research Centre
9/20/2014

Big Data
HDFS
Hadoop

Big Data Overview
  Characteristics
  Applications & Use Cases

Hadoop Distributed File System (HDFS) Overview
  HDFS Architecture
  Data Replication
  Node Types
  JobTracker / TaskTracker
  HDFS Data Flows
  HDFS Limitations

Hadoop Overview
  Inputs & Outputs
  Data Types
  What is MapReduce (MR)
  Example
  Functionalities of MR
  Speculative Execution
  Hadoop Streaming
  Hadoop Job Scheduling


Big Data Overview
  Characteristics
  Applications & Use Cases
  Data Footprint & Time Horizon
  Technology Adoption Lifecycle


[Figure: Data footprint vs. time horizon. Time horizons run from real time, near real time, hourly, daily, weekly, monthly, quarterly, and yearly out to 3, 5, and 10 years. On the consumption side, highly summarized data feeds visualization & dashboards, aggregated data feeds analytic marts & cubes, and detailed events/facts feed predictive analytics. On the source side, data ranges from core ERP, legacy applications, and the data warehouse to unstructured web/telemetry handled by big data platforms such as Hadoop.]
[Figure: Data footprint grows with the time horizon, from gigabytes at real time, through terabytes at a monthly horizon, to petabytes at a yearly horizon.]


Financial Services
  Detect fraud
  Model and manage risk
  Improve debt recovery rates
  Personalize banking/insurance products

Healthcare
  Optimal treatment pathways
  Remote patient monitoring
  Predictive modeling for new drugs
  Personalized medicine

Retail
  In-store behavior analysis
  Cross selling
  Optimize pricing, placement, design
  Optimize inventory and distribution

Web / Social / Mobile
  Location-based marketing
  Social segmentation
  Sentiment analysis
  Price comparison services

Government
  Reduce fraud
  Segment populations, customize action
  Support open data initiatives
  Automate decision making

Manufacturing
  Design to value
  Crowd-sourcing
  Digital factory for lean manufacturing
  Improve service via product sensor data

Hadoop Distributed File System (HDFS)
  Overview
  HDFS Architecture
  Data Replication
  Node Types
  JobTracker / TaskTracker
  HDFS Data Flows
  HDFS Limitations


HDFS is Hadoop's own implementation of a distributed file system.

It is coherent and provides all the facilities of a file system.
It implements ACLs and provides a subset of the usual UNIX commands for accessing or querying the filesystem.
It uses a large block size (64 MB by default; 128 MB is recommended) so that seek time stays small relative to transfer time and network bandwidth. Very large files are therefore ideal candidates for storage.
Streaming data access: a write-once, read-many-times architecture. Because files are large, the time to read a whole file matters more than the seek time to the first record.
Commodity hardware: HDFS is designed to run on commodity hardware that may fail, and it detects and handles such failures.

E.g., with a 128 MB block size, a 420 MB file is split into blocks of 128 MB, 128 MB, 128 MB, and 36 MB.
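To make the splitting arithmetic concrete, here is a minimal Java sketch (a hypothetical helper, not a Hadoop API) that computes the block layout for a given file and block size:

// Hypothetical illustration of HDFS block-splitting arithmetic (not part of Hadoop).
public class BlockSplitDemo {
    static long[] split(long fileSize, long blockSize) {
        int fullBlocks = (int) (fileSize / blockSize);
        long remainder = fileSize % blockSize;
        long[] blocks = new long[fullBlocks + (remainder > 0 ? 1 : 0)];
        for (int i = 0; i < fullBlocks; i++) blocks[i] = blockSize;
        if (remainder > 0) blocks[blocks.length - 1] = remainder;   // last block holds the tail
        return blocks;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        for (long b : split(420 * mb, 128 * mb)) {
            System.out.println(b / mb + " MB");                     // prints 128, 128, 128, 36
        }
    }
}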


[Figure: HDFS data replication. A client issues create and complete calls for File 1, which is split into blocks B1, B2, and B3. The Namenode records which datanodes (n1-n4) hold each block, and replicas of each block are placed on multiple datanodes spread across Rack 1, Rack 2, and Rack 3.]


HDFS Flow: Read

HDFS Flow: Write
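Programmatically, the read and write paths are exercised through the org.apache.hadoop.fs.FileSystem API. A minimal sketch, assuming a running HDFS; the file path below is hypothetical:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Write: the client asks the Namenode where to place blocks, then streams to Datanodes.
        Path file = new Path("/user/demo/hello.txt");   // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello HDFS\n".getBytes("UTF-8"));
        }

        // Read: the client asks the Namenode for block locations, then reads from Datanodes.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file), "UTF-8"))) {
            System.out.println(in.readLine());          // prints: Hello HDFS
        }
    }
}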

cat - Copies source paths to stdout.
    hadoop dfs -cat URI [URI ...]

chgrp - Change group association of files. With -R, make the change recursively through the directory structure.
    hadoop dfs -chgrp [-R] GROUP URI [URI ...]

chmod - Change the permissions of files. With -R, make the change recursively through the directory structure.
    hadoop dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]

chown - Change the owner of files. With -R, make the change recursively through the directory structure.
    hadoop dfs -chown [-R] [OWNER][:[GROUP]] URI [URI ...]

copyFromLocal - Similar to the put command, except that the source is restricted to a local file reference.
    hadoop dfs -copyFromLocal <localsrc> URI

copyToLocal - Similar to the get command, except that the destination is restricted to a local file reference.
    hadoop dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

cp - Copy files from source to destination.
    hadoop dfs -cp URI [URI ...] <dest>

du - Displays the aggregate length of files contained in the directory, or the length of a file in case it is just a file.
    hadoop dfs -du URI [URI ...]

dus - Displays a summary of file lengths.
    hadoop dfs -dus <args>

expunge - Empty the Trash.
    hadoop dfs -expunge

get - Copy files to the local file system.
    hadoop dfs -get [-ignorecrc] [-crc] <src> <localdst>

getmerge - Concatenates files in source into the destination local file.
    hadoop dfs -getmerge <src> <localdst> [addnl]

ls (or lsr) - For a file, returns stat on the file; for a directory, returns the list of its direct children.
    hadoop dfs -ls <args>

mkdir - Takes path URIs as arguments and creates directories.
    hadoop dfs -mkdir <paths>

moveFromLocal
    hadoop dfs -moveFromLocal <src> <dst>

mv - Moves files from source to destination.
    hadoop dfs -mv URI [URI ...] <dest>

put - Copy single src, or multiple srcs, from the local file system to the destination filesystem.
    hadoop dfs -put <localsrc> ... <dst>

rm (or rmr) - Delete files specified as args; rmr deletes recursively.
    hadoop dfs -rm URI [URI ...]

setrep - Changes the replication factor of a file. The -R option recursively changes the replication factor of files within a directory.
    hadoop dfs -setrep [-R] <path>

stat - Returns the stat information on the path.
    hadoop dfs -stat URI [URI ...]

tail - Displays the last kilobyte of the file to stdout.
    hadoop dfs -tail [-f] URI

test - Checks the path: -e if the file exists, -z if the file is zero length, -d if the path is a directory.
    hadoop dfs -test -[ezd] URI

text - Takes a source file and outputs the file in text format.
    hadoop dfs -text <src>

touchz - Create a file of zero length.
    hadoop dfs -touchz URI [URI ...]
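Most of these shell commands have programmatic equivalents on org.apache.hadoop.fs.FileSystem. A minimal sketch; the paths, file names, and replication factor below are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path dir = new Path("/user/demo/data");                     // hypothetical directory
        fs.mkdirs(dir);                                             // like: hadoop dfs -mkdir

        Path local = new Path("file:///tmp/sample.txt");            // hypothetical local file
        fs.copyFromLocalFile(local, dir);                           // like: hadoop dfs -put / -copyFromLocal

        fs.setReplication(new Path(dir, "sample.txt"), (short) 2);  // like: hadoop dfs -setrep

        for (FileStatus status : fs.listStatus(dir)) {              // like: hadoop dfs -ls
            System.out.println(status.getPath() + " " + status.getLen());
        }
    }
}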

Low-latency data access: HDFS is not optimized for low-latency access; it trades latency for higher data throughput.
Lots of small files: since the block size is 64 MB, many small files waste blocks and increase the memory requirements of the namenode, which keeps metadata for every block.
Multiple writers and arbitrary modification: HDFS does not support multiple concurrent writers; a file is written by a single writer, and writes are always appended at the end of the file.


Hadoop Overview
  Inputs & Outputs
  Data Types
  What is MR
  Example
  Functionalities of MR
  Speculative Execution
  How Hadoop runs MR
  Hadoop Streaming
  Hadoop Job Scheduling


Hadoop is a framework that provides open source libraries for distributed computing using a simple MapReduce programming interface and its own distributed filesystem, HDFS. It facilitates scalability and takes care of detecting and handling failures.


Hadoop release series:
  1.0.X - current stable version, 1.0 release
  1.1.X - current beta version, 1.1 release
  2.X.X - current alpha version
  0.23.X - similar to 2.X.X, but missing NameNode HA
  0.22.X - does not include security
  0.20.203.X - old legacy stable version
  0.20.X - old legacy version


Risk Modeling:
  How business/industry can better understand customers and markets.

Customer Churn Analysis:
  Why companies really lose customers.

Recommendation Engine:
  How to predict customer preferences.


Ad Targeting:
  How to increase campaign efficiency.

Point of Sale Transaction Analysis:
  Targeting promotions to make customers buy.

Predicting Network Failure:
  Using machine-generated data to identify trouble spots.


Threat Analysis:
  Detecting threats and fraudulent activity.

Trade Surveillance:
  Helping businesses spot the rogue trader.

Search Quality:
  Delivering more relevant search results to customers.


MapReduce is a framework introduced by Google.

It processes vast amounts of data (multi-terabyte data-sets) in parallel.
It achieves high performance on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
It splits the input data-set into independent chunks.
It sorts the outputs of the maps, which are then input to the reduce tasks.
It takes care of scheduling tasks, monitoring them, and re-executing failed tasks.


The MapReduce framework operates exclusively on <key, value> pairs: the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.

The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.

Input and output types of a MapReduce job:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, List(v2)> -> reduce -> <k3, v3> (output)
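In the old org.apache.hadoop.mapred API, these key/value types are declared on the JobConf. A minimal sketch; the MyJob class and job name are placeholders:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

// Minimal sketch: declaring the key/value types a job emits (old mapred API).
public class MyJob {
    public static JobConf configure() {
        JobConf conf = new JobConf(MyJob.class);
        conf.setJobName("typed-job");

        // Types of the intermediate <k2, v2> pairs emitted by the mapper.
        conf.setMapOutputKeyClass(Text.class);          // k2 must implement WritableComparable
        conf.setMapOutputValueClass(IntWritable.class); // v2 must implement Writable

        // Types of the final <k3, v3> pairs emitted by the reducer.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        return conf;
    }
}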


Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage.

Hadoop has a Writable interface that supports serialization.

The following predefined implementations of WritableComparable are available:
1. IntWritable
2. LongWritable
3. DoubleWritable
4. VLongWritable - variable size, stores only as many bytes as needed (1-9 bytes)
5. VIntWritable - less commonly used, since its range is covered by VLongWritable
6. BooleanWritable
7. FloatWritable


8. BytesWritable
9. NullWritable
10. MD5Hash
11. ObjectWritable
12. GenericWritable

Apart from the above, there are four Writable collection types:
1. ArrayWritable
2. TwoDArrayWritable
3. MapWritable
4. SortedMapWritable
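Applications can also define their own key and value types. A minimal sketch of a custom WritableComparable key; the PairKey class and its fields are hypothetical:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: serialized by Hadoop via write()/readFields(),
// sorted by the framework via compareTo().
public class PairKey implements WritableComparable<PairKey> {
    private long id;
    private int rank;

    public PairKey() {}                       // no-arg constructor required by the framework

    public PairKey(long id, int rank) {
        this.id = id;
        this.rank = rank;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(id);
        out.writeInt(rank);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        id = in.readLong();
        rank = in.readInt();
    }

    @Override
    public int compareTo(PairKey other) {
        int byId = Long.compare(id, other.id);
        return byId != 0 ? byId : Integer.compare(rank, other.rank);
    }
}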


MapperClass / ReducerClass data flow:

Input Data -> Input Data Format -> <K1, V1> -> Mapper -> <K2, V2> -> <K2, List(V2)> -> Reducer -> <K3, V3>

public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);               // emit <word, 1> for every token
        }
    }
}

public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();              // add up the counts for this word
        }
        output.collect(key, new IntWritable(sum));   // emit <word, total count>
    }
}
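The deck shows only the mapper and reducer; a driver in the style of the official WordCount tutorial would wire them into a job roughly as follows. A hedged sketch, assuming it sits inside the org.myorg.WordCount class with org.apache.hadoop.fs.Path, org.apache.hadoop.io.*, and org.apache.hadoop.mapred.* imported:

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);           // k3
    conf.setOutputValueClass(IntWritable.class);  // v3

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);          // the reducer doubles as the combiner
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);                       // submit the job and wait for completion
}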


Sample input:

$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
Hello World Bye World
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
Hello Hadoop Goodbye Hadoop

Run the application:

$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output

Mapper implementation (lines 18 - 25):

The first map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>

The second map emits:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>

Combiner implementation (line 46):

Output of the first map after combining:
< Bye, 1>
< Hello, 1>
< World, 2>

Output of the second map after combining:
< Goodbye, 1>
< Hadoop, 2>
< Hello, 1>

Reducer implementation (lines 29 - 35):

Output of the job:
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>

Speculative execution is a way of coping with variation in individual machine performance:

The same input can be processed multiple times in parallel, to exploit differences in machine capabilities.
The Hadoop platform schedules redundant copies of the remaining tasks across several nodes that do not have other work to perform.

Name | Value | Description
mapred.map.tasks.speculative.execution | true | If true, then multiple instances of some map tasks may be executed in parallel.
mapred.reduce.tasks.speculative.execution | true | If true, then multiple instances of some reduce tasks may be executed in parallel.
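These properties can also be set per job through the old JobConf API; a minimal sketch that turns speculative execution off for one job:

import org.apache.hadoop.mapred.JobConf;

// Minimal sketch: disabling speculative execution for a single job.
public class SpeculativeExecutionConfig {
    public static JobConf withoutSpeculation(JobConf conf) {
        conf.setMapSpeculativeExecution(false);     // mapred.map.tasks.speculative.execution
        conf.setReduceSpeculativeExecution(false);  // mapred.reduce.tasks.speculative.execution
        return conf;
    }
}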


Hadoop Streaming is a utility that comes with the Hadoop distribution.

It allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer /bin/wc \
    -jobconf mapred.reduce.tasks=2


Default Scheduler

A single priority-based queue of jobs.

Scheduling tries to balance map and reduce load on all tasktrackers in the cluster.

Capacity Scheduler

Within a queue, jobs with higher priority have access to the queue's resources before jobs with lower priority.

To prevent one or more users from monopolizing its resources, each queue enforces a limit on the percentage of resources allocated to a user at any given time, if there is competition for them.

Fair Scheduler

Multiple queues (pools) of jobs, sorted FIFO or by fairness limits.

Each pool is guaranteed a minimum capacity, and excess capacity is shared by all jobs using a fairness algorithm.

The scheduler tries to ensure that, over time, all jobs receive roughly the same share of resources.
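With the Capacity Scheduler enabled, a job selects its queue through job configuration; a minimal sketch, where the queue name "research" is hypothetical:

import org.apache.hadoop.mapred.JobConf;

// Minimal sketch: submitting a job to a named Capacity Scheduler queue.
public class QueueSelection {
    public static JobConf toResearchQueue(JobConf conf) {
        conf.setQueueName("research");   // sets mapred.job.queue.name; "research" is a hypothetical queue
        return conf;
    }
}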


Thank you !!

Data Science Analytics & Research Centre
