
Hello everyone, my name is Nikita Kukreja, and today I am going to host the Knowledge Sharing Session.

Today I will focus on Big Data.

So let's have a look at the agenda; this is what we'll be discussing today. We'll begin by understanding how data evolved, that is, how big data came into existence. Then we'll see what exactly big data is and understand what sort of data can be considered big data. After that, we'll see how big data can be an opportunity. Now, obviously big data is an opportunity, but we know there are no free lunches in life, so we'll also focus on the various problems associated with encashing this opportunity. And it won't be fair if I tell you only about the problems and not the solution, so we'll see how Hadoop solves these problems, and we'll dig a bit deeper and understand a few components of the Hadoop framework.

.......................................................

Now I think it's the best time to tell the story of how data evolved, and to see how technology has evolved along with it. Earlier we had landline phones, but now we have smartphones, with Android and iOS, that are making our lives smarter as well as our phones smarter. Apart from that, we were also using bulky desktops for processing megabytes of data. If you can remember, we were using floppies, and you know how little data those could store. Then came hard disks for storing terabytes of data, and now we can store data on the cloud as well. Similarly, nowadays even self-driving cars have come up.

I know you must be wondering why I am telling you all this. If you notice, due to this enhancement of technology we are generating a lot of data. Let's take the example of your phone. Have you ever noticed how much data is generated because of your fancy smartphone? Your every action, even one video sent through WhatsApp or any other messenger app, generates data. And this is just one example; you have no idea how much data you are generating with every action you take. Also, this data is not in a format that our relational databases can handle, and apart from that, the volume of data has increased exponentially.

Now, I was talking about self-driving cars. Basically, these cars have sensors that record every minute detail, like the size of an obstacle and the distance from it, and then decide how to react. You can imagine how much data is generated for every kilometer that you drive in such a car.

.........................................................

Let's see now what exactly Big Data is, and how we decide whether a data set counts as big data. So let's move forward and understand what exactly it is. Okay, now let us look at the proper definition of Big Data. Big Data is a term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database system tools or traditional data processing applications. The real problem is that there is too much data to process: when the traditional systems were invented, we never anticipated that we would have to deal with such an enormous amount of data.

.........................................................

There are a lot of problems in dealing with big data, but there are always different ways to look at something, so let us get some positivity in the environment now and understand how we can use big data as an opportunity. Let us go through the fields where big data can be a boon. There are certain problems that got solved only because we started dealing with big data, and the boon that we are talking about is big data analytics. First, with big data we figured out how to store our data cost-effectively. We were spending too much money on storage before; until big data came into the picture, we never thought of using commodity hardware to store and manage data, which is both reliable and feasible compared to costly servers.

Now let me give you a few examples to show how important big data analytics is nowadays. When you go to a website like Amazon, YouTube, Pandora or Netflix, they recommend products, videos, movies or songs for you, right? So how do you think they do that? Basically, whatever data you generate on these websites, they make sure they analyze it properly, and let me tell you, that data is not small; it is actually big data. They analyze that big data, work out what you like and what your preferences are, and accordingly they generate recommendations for you. How does it happen? It happens only because of big data analytics.

....................................................

As we have seen, big data is an opportunity, but there are many problems in encashing this opportunity, so let us focus on those problems one by one. The first problem is storing a large amount of data. Storing this huge data in a traditional system is not possible, and the reason is obvious: the storage of a single system is limited. For example, say you have a server with a storage limit of 10 terabytes, but your company is growing really fast and the data is increasing exponentially. What do you do? At some point you exhaust all the storage, and investing in bigger and bigger servers is definitely not a cost-effective solution. So what do you think the solution to this problem can be? According to me, a distributed file system is a better way to store this huge data, because with it we will be saving a lot of money. Let me tell you how: with a distributed system you can actually store your data on commodity hardware instead of spending money on high-end servers.

.............................................
Let's see a few more. Okay, so we saw that the data is not only huge but is also present in various formats: unstructured, semi-structured and structured. So you not only need to store this huge data, you also need to make sure that a system is present to store the variety of data generated from various sources.

.....................................................

And now let's focus on the next problem. If you look at the diagram, you can notice that hard disk capacity is increasing, but disk performance, that is, the read/write speed, is not increasing at the same rate. Let me explain this with an example. If you have only one 100 MB/s input-output channel and you are processing, say, one terabyte of data, how much time will it take? If you calculate it, it comes to somewhere around 2.91 hours. And I have taken an example of just one terabyte; what if you are processing zettabytes of data? You can imagine how much time that would take. Now, what if you had four input-output channels for the same amount of data? Then it would take approximately 0.72 hours, which, converted into minutes, is around 43 minutes.
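To make the arithmetic concrete, here is a quick back-of-the-envelope check in Python. It assumes 1 TB is treated as 1024 x 1024 MB and that each channel sustains a steady 100 MB/s, which is what reproduces the 2.91-hour and roughly 43-minute figures; real disks and networks will of course vary.

```python
# Back-of-the-envelope check of the transfer times quoted above.
# Assumptions: 1 TB taken as 1024 * 1024 MB, channel throughput 100 MB/s.

data_mb = 1024 * 1024          # 1 TB of data expressed in MB
channel_mb_per_s = 100         # throughput of a single I/O channel

def transfer_hours(size_mb, throughput_mb_s, channels=1):
    """Time in hours to move size_mb over `channels` parallel channels."""
    seconds = size_mb / (throughput_mb_s * channels)
    return seconds / 3600

print(f"1 channel : {transfer_hours(data_mb, channel_mb_per_s):.2f} h")    # ~2.91 h
print(f"4 channels: {transfer_hours(data_mb, channel_mb_per_s, 4):.2f} h")  # ~0.73 h (~44 min)
```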

...................................................................

Now, it won't be fair if we don't discuss the solution to these problems. So what is the solution? Hadoop. Hadoop is the solution, so let's introduce Hadoop now. Okay, so what is Hadoop? Hadoop is a framework that allows you to first store big data in a distributed environment so that you can then process it in parallel. There are basically two parts. One is HDFS, the Hadoop Distributed File System, for storage; it allows you to store data of various formats across a cluster. The second part is MapReduce, which is nothing but the processing unit of Hadoop; it allows parallel processing of the data that is stored across HDFS.

...................................

Now let us dig deeper into HDFS and understand it better. HDFS creates an abstraction of resources; let me simplify it for you. Similar to virtualization, you can see HDFS logically as a single unit for storing big data, but actually you are storing your data across multiple systems, or you can say in a distributed fashion. Here you have a master/slave architecture in which the NameNode is the master node and the DataNodes are the slaves. The NameNode contains the metadata about the data that is stored in the DataNodes, like which data block is stored on which DataNode, where the replicas of a data block are kept, and so on. The actual data is stored in the DataNodes. I also want to add that we actually replicate the data blocks present on the DataNodes, and by default the replication factor is 3, which means there are 3 copies of each block. Now let me tell you why we need that replication: since we are using commodity hardware, and we know the failure rate of this hardware is pretty high, if one of the DataNodes fails I would lose that data block, and that is the reason we need to replicate the data blocks.

.....................................................

Now let us understand how Hadoop actually provided the solution to the big data problems that we have discussed. Can you remember what the first problem was? Yes, it was storing the big data. So how did HDFS solve it? Let's discuss. HDFS provides a distributed way to store big data; we have already told you that. Your data is stored in blocks across DataNodes, and you can specify the size of each block. Basically, if you have 512 MB of data and you have configured HDFS to create 128 MB data blocks, HDFS will divide the data into 4 blocks, because 512 divided by 128 is 4, store them across different DataNodes, and also replicate the data blocks on different DataNodes. Since we are using commodity hardware, storing is not a challenge.

It also solves the scaling problem: it focuses on horizontal scaling instead of vertical scaling. You can always add extra DataNodes to your HDFS cluster as and when required, instead of scaling up the resources of your existing DataNodes. So you are not increasing the resources of your DataNodes; you are just adding a few more DataNodes when you need them. Let me summarize it for you: for storing 1 TB of data I don't need a 1 TB system; I can instead do it on multiple 128 GB systems, or even smaller ones.
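As a minimal sketch of that block arithmetic (the 512 MB file, 128 MB block size and replication factor of 3 are the example figures from above, not values read from a real cluster):

```python
import math

# Example figures from the text above (not read from an actual cluster).
file_size_mb = 512        # size of the file being stored
block_size_mb = 128       # configured HDFS block size
replication_factor = 3    # HDFS default

blocks = math.ceil(file_size_mb / block_size_mb)    # logical blocks
physical_copies = blocks * replication_factor       # blocks actually written

print(f"Logical blocks  : {blocks}")           # 4
print(f"Physical copies : {physical_copies}")  # 12, spread across DataNodes
```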

...................................................

Now on to the second challenge with big data. The next problem was storing a variety of data, and that problem was also addressed by HDFS. With HDFS you can store all kinds of data, whether structured, semi-structured or unstructured, because in HDFS there is no pre-dumping schema validation; you can just dump all the kinds of data you have in one place. It also follows a write-once, read-many model, so you can write the data once and read it multiple times to find insights.

.............................................

And if you can recall, the third challenge was accessing and processing the data faster, and this is one of the major challenges with big data. In order to solve it, we move processing to the data, and not data to the processing. Let me explain what moving processing to the data actually means. Consider this as the master and these as our slaves, and the data is stored on these slaves. One way of processing this data is to send the data to the master node and process it there, but what will happen if all of my slaves send their data to the master node? It will cause network congestion and input-output congestion, and at the same time the master node will take a lot of time to process that huge amount of data. So what I can do instead is send the processing to the data: I send the logic to all the slaves, which actually contain the data, and the processing is performed on the slaves themselves. After that, only the small chunks of results that come out are sent back to the master node, so there is no network or input-output congestion, and it takes comparatively very little time. This is what sending processing to the data actually means.
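Here is a toy Python sketch of the "send the logic to the data" idea. The three slave partitions and the counting logic are purely illustrative; in real Hadoop the framework ships the compiled map logic to the DataNodes, but the shape of the exchange, a small function going out and small summaries coming back, is the same.

```python
from collections import Counter

# Toy model: each "slave" holds its own partition of the data locally.
# These partitions and the counting logic are illustrative only.
slave_partitions = {
    "slave1": ["big", "data", "big"],
    "slave2": ["data", "hadoop"],
    "slave3": ["big", "hadoop", "hadoop"],
}

def count_words(partition):
    """The 'logic' we ship to each slave; it runs next to the data."""
    return Counter(partition)

# Each slave processes its own data and sends back only a small summary...
partial_results = [count_words(p) for p in slave_partitions.values()]

# ...and the master merely merges those small results.
total = Counter()
for partial in partial_results:
    total.update(partial)

print(dict(total))   # {'big': 3, 'data': 2, 'hadoop': 3}
```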

.......................................

So let's move forward and focus on a few components of Hadoop by looking at the Hadoop ecosystem. You can see that this is the entire Hadoop ecosystem.

The Hadoop Ecosystem is neither a programming language nor a service; it is a platform or framework which solves big data problems.

You can consider it as a suite which encompasses a number of services

(ingesting, storing, analyzing and maintaining) inside it. Let us discuss

and get a brief idea about how the services work individually and in collaboration.

..........................................................................................
HDFS

Hadoop Distributed File System is the core component, or you can say the backbone, of the Hadoop Ecosystem.

HDFS is the component which makes it possible to store different types of large data sets (i.e. structured, unstructured and semi-structured data).

HDFS has two core components, i.e. NameNode and DataNode.

The NameNode is the main node and it doesn’t store the actual data.

It contains metadata, just like a log file, or you can say a table of contents.

Therefore, it requires less storage and high computational resources.

On the other hand, all your data is stored on the DataNodes and hence it requires more storage
resources.

These DataNodes are commodity hardware (like your laptops and desktops) in the distributed
environment.

That’s the reason, why Hadoop solutions are very cost effective.

YARN

Consider YARN as the brain of your Hadoop Ecosystem.

It performs all your processing activities by allocating resources and scheduling tasks.

It has two major components, i.e. ResourceManager and NodeManager.

ResourceManager is again a main node in the processing department.

It receives the processing requests, and then passes the parts of requests

to corresponding NodeManagers accordingly, where the actual processing takes place.

NodeManagers are installed on every DataNode. They are responsible for the execution of tasks on every single DataNode.

...............................................

MAPREDUCE

Apache MapReduce is the core processing component of the Hadoop Ecosystem, as it provides the logic of processing. In other words, MapReduce is a software framework which helps in writing applications that process large data sets using distributed and parallel algorithms inside the Hadoop environment.

In a MapReduce program, Map() and Reduce() are two functions.

The Map function performs actions like filtering, grouping and sorting, while the Reduce function aggregates and summarizes the results produced by the Map function.

The result generated by the Map function is a key-value pair (K, V), which acts as the input for the Reduce function.

Let us take an example to get a better understanding of a MapReduce program.

We have a sample case of students and their respective departments.

We want to calculate the number of students in each department.

Initially, the Map program will execute and emit a key-value pair for each student's department, as mentioned above. These key-value pairs are the input to the Reduce function.

The Reduce function will then aggregate the pairs for each department, calculate the total number of students in each department, and produce the final result.
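A minimal, in-process Python sketch of those Map and Reduce steps might look like the following; the student records are made up, and a real job would use the Hadoop MapReduce API or Hadoop Streaming rather than a single script.

```python
from collections import defaultdict

# Made-up sample records: (student, department)
records = [
    ("asha", "CS"), ("ravi", "CS"), ("meena", "EE"),
    ("john", "ME"), ("sara", "EE"), ("li", "CS"),
]

def map_phase(record):
    """Emit a (key, value) pair per record: (department, 1)."""
    student, department = record
    return (department, 1)

def reduce_phase(key, values):
    """Aggregate all values for one key: total students per department."""
    return (key, sum(values))

# Shuffle/sort: group the mapped pairs by key (done by the framework in Hadoop).
grouped = defaultdict(list)
for dept, one in map(map_phase, records):
    grouped[dept].append(one)

result = dict(reduce_phase(k, v) for k, v in grouped.items())
print(result)   # {'CS': 3, 'EE': 2, 'ME': 1}
```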

APACHE PIG

PIG has two parts: Pig Latin, the language, and the Pig runtime, the execution environment.

You can think of them like Java and the JVM.

It supports the Pig Latin language, which has a SQL-like command structure.

Not everyone comes from a programming background, so Apache PIG relieves them.

You might be curious to know how?


Well, I will tell you an interesting fact:

10 lines of Pig Latin = approx. 200 lines of MapReduce Java code

But don't be shocked when I say that at the back end of a Pig job, a MapReduce job executes.

The compiler internally converts Pig Latin to MapReduce. It produces a sequential set of MapReduce jobs, and that's an abstraction (which works like a black box).

PIG was initially developed by Yahoo.

It gives you a platform for building data flow for ETL (Extract, Transform and Load), processing

and analyzing huge data sets.

How does Pig work?

In PIG, first the load command loads the data. Then we perform various functions on it, like grouping, filtering, joining, sorting, etc. At last, you can either dump the data to the screen or store the result back in HDFS.

APACHE HIVE

Facebook created HIVE for people who are fluent in SQL. Thus, HIVE makes them feel

at home while working in a Hadoop Ecosystem.

Basically, HIVE is a data warehousing component which performs reading, writing and

managing large data sets in a distributed environment using SQL-like interface.

HIVE + SQL = HQL

The query language of Hive is called Hive Query Language (HQL), which is very similar to SQL.

APACHE MAHOUT

Now, let us talk about Mahout which is renowned for machine learning.

Mahout provides an environment for creating machine learning applications which are scalable.

What does Mahout do?

It performs collaborative filtering, clustering and classification.

Mahout provides a command line to invoke various algorithms.

It has a predefined set of libraries which already contain different inbuilt algorithms for different use cases.

APACHE SPARK

Apache Spark is a framework for real time data analytics in a distributed computing environment.

Spark itself is written in Scala.

It executes in-memory computations to increase the speed of data processing over MapReduce.

It can be up to 100x faster than Hadoop MapReduce for large-scale data processing by exploiting in-memory computations and other optimizations; therefore, it requires higher processing power (and memory) than MapReduce.

As you can see, Spark comes packed with high-level libraries, including support for R, SQL, Python, Scala, Java, etc.
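As a rough illustration of that in-memory, high-level style, here is a small PySpark word-count sketch. It assumes a local Spark installation, and the HDFS path is hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes Spark is installed locally; the HDFS path below is hypothetical.
spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()

lines = spark.sparkContext.textFile("hdfs:///data/sample.txt")  # hypothetical path

counts = (
    lines.flatMap(lambda line: line.split())   # split each line into words
         .map(lambda word: (word, 1))          # emit (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)      # aggregate counts in memory
)

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```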

This is a very common question in everyone’s mind:

“Apache Spark: A Killer or Saviour of Apache Hadoop?”

The answer to this: Apache Spark best fits real-time processing, whereas Hadoop was designed to store unstructured data and execute batch processing over it.


When we combine Apache Spark's abilities, i.e. high processing speed, advanced analytics and multiple integration support, with Hadoop's low-cost operation on commodity hardware, it gives the best results.

That is the reason why Spark and Hadoop are used together by many companies for processing and analyzing their Big Data stored in HDFS.

APACHE HBASE

HBase is an open-source, non-relational, distributed database. In other words, it is a NoSQL database.

It supports all types of data, and that is why it's capable of handling anything and everything inside a Hadoop ecosystem.

For better understanding, let us take an example. You have billions of customer emails and you need to find out the number of customers who have used the word "complaint" in their emails.

The request needs to be processed quickly (i.e. in real time).

So here we are handling a large data set while retrieving only a small amount of data.

HBase was designed for solving exactly this kind of problem.
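As a hedged sketch of that use case, the snippet below uses the third-party happybase Python client against a hypothetical "emails" table with a "msg:body" column, with an HBase Thrift server assumed on localhost. Filtering in Python keeps the example simple; a production design would push the filter to the server or maintain a pre-computed counter.

```python
import happybase  # third-party HBase client; requires a running HBase Thrift server

# Hypothetical table 'emails' with a column family 'msg' holding the body text.
connection = happybase.Connection("localhost")
table = connection.table("emails")

complaints = 0
for row_key, data in table.scan(columns=[b"msg:body"]):
    body = data.get(b"msg:body", b"").decode("utf-8", errors="ignore")
    if "complaint" in body.lower():
        complaints += 1

print("emails mentioning 'complaint':", complaints)
connection.close()
```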

APACHE ZOOKEEPER
Apache Zookeeper is the coordinator of any Hadoop job which includes a combination of various
services in a Hadoop Ecosystem.

Apache Zookeeper coordinates with various services in a distributed environment.

Before Zookeeper, it was very difficult and time-consuming to coordinate between the different services in the Hadoop Ecosystem. The services earlier had many problems with interactions, like sharing common configuration while synchronizing data. Even when the services were configured, changes in their configurations made them complex and difficult to handle. Grouping and naming was also a time-consuming factor.

Due to the above problems, Zookeeper was introduced. It saves a lot of time by performing
synchronization,

configuration maintenance, grouping and naming.

Although it’s a simple service, it can be used to build powerful solutions.
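To give a feel for the shared-configuration use case, here is a small sketch with the kazoo Python client; it assumes a ZooKeeper server on localhost:2181, and the znode path and value are made up.

```python
from kazoo.client import KazooClient  # third-party ZooKeeper client

# Assumes a ZooKeeper server on localhost:2181; the znode path is made up.
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

config_path = "/demo/config/db_url"
zk.ensure_path("/demo/config")

# Publish a piece of shared configuration that every service can read.
if zk.exists(config_path):
    zk.set(config_path, b"jdbc:mysql://db.internal:3306/app")
else:
    zk.create(config_path, b"jdbc:mysql://db.internal:3306/app")

value, _stat = zk.get(config_path)
print("shared config:", value.decode())

zk.stop()
```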

APACHE OOZIE

Consider Apache Oozie as a clock and alarm service inside the Hadoop Ecosystem.

For Apache Hadoop jobs, Oozie acts just like a scheduler. It schedules Hadoop jobs and binds them together as one logical unit of work.

There are two kinds of Oozie jobs:

Oozie workflow: a sequential set of actions to be executed.

You can think of it as a relay race, where each athlete waits for the previous one to complete their part.

Oozie Coordinator: Oozie jobs which are triggered when data is made available to them.

Think of this as the response-stimuli system in our body. In the same manner as we respond to an
external stimulus, an Oozie coordinator responds to the availability of data and it rests otherwise.

APACHE FLUME

Ingesting data is an important part of our Hadoop Ecosystem.

Flume is a service which helps in ingesting unstructured and semi-structured data into HDFS.

It gives us a solution which is reliable and distributed, and helps us in collecting, aggregating and moving large amounts of data.

It helps us ingest online streaming data from various sources, like network traffic, social media, email messages, log files, etc., into HDFS.

APACHE SQOOP

Now, let us talk about another data ingestion service, i.e. Sqoop. The major difference between Flume and Sqoop is that:

Flume only ingests unstructured or semi-structured data into HDFS,

while Sqoop can import as well as export structured data between an RDBMS or enterprise data warehouse and HDFS.
