
Hello everyone, my name is Nikita Kukreja, and today I am going to host the Knowledge Sharing Session.

Today I will focus on Big Data.

So let's have a look at the agenda; this is what we'll be discussing today. We'll begin by understanding how data evolved, that is, how big data came into existence. Then we'll see what exactly big data is and understand what sort of data can be considered big data. After that, we'll see how big data can be an opportunity. Now, obviously big data is an opportunity, but we know there are no free lunches in life, so we'll also focus on the various problems associated with encashing this opportunity. And it won't be fair if I tell you only about the problems and not the solution, so we'll see how Hadoop solves these problems, and we'll dig a bit deeper and understand a few components of the Hadoop framework.

.......................................................

Now I think it's the best time to tell the story of how data evolved, and to see how technology has evolved along with it. Earlier we had landline phones, but now we have smartphones, with Android and iOS, that are making our lives smarter as well as our phones smarter. Apart from that, we were also using bulky desktops for processing megabytes of data. If you can remember, we were using floppies, and you know how little data those could store. Then came hard disks for storing terabytes of data, and now we can store data on the cloud as well. Similarly, nowadays even self-driving cars have come up.

I know you must be wondering why I am telling you all this. If you notice, due to this enhancement of technology we are generating a lot of data. Let's take the example of your phone. Have you ever noticed how much data is generated because of your fancy smartphone? Your every action, even one video sent through WhatsApp or any other messenger app, generates data. And this is just one example; you have no idea how much data you are generating with every action you take. Also, this data is not in a format that our relational databases can handle, and apart from that, the volume of data has increased exponentially.

Now, I was talking about self-driving cars. Basically, these cars have sensors that record every minute detail, like the size of an obstacle and the distance from it, and then decide how to react. You can imagine how much data is generated for every kilometer that you drive in such a car.

.........................................................

Let's see now what exactly Big Data is, and how we decide whether a data set counts as big data. So let's move forward and understand what exactly it is. Okay, now let us look at the proper definition of Big Data. Big Data is a term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database system tools or traditional data processing applications. The real problem is that there is too much data to process: when the traditional systems were invented, we never anticipated that we would have to deal with such an enormous amount of data.

.........................................................

There are a lot of problems in dealing with big data, but there are always different ways to look at something, so let us get some positivity in the environment now and understand how we can use big data as an opportunity. Let us go through the fields where big data can be a boon. There are certain problems that got solved only because we started dealing with big data, and the boon that we are talking about is big data analytics. First, with big data we figured out how to store our data cost-effectively. We were spending too much money on storage before; until big data came into the picture, we never thought of using commodity hardware to store and manage data, which is both reliable and feasible compared to costly servers.

Now let me give you a few examples to show how important big data analytics is nowadays. When you go to a website like Amazon, YouTube, Pandora or Netflix, they recommend products, videos, movies or songs for you, right? So how do you think they do that? Basically, whatever data you generate on these websites, they make sure they analyze it properly, and let me tell you, that data is not small; it is actually big data. They analyze that big data, work out what you like and what your preferences are, and accordingly they generate recommendations for you. How does it happen? It happens only because of big data analytics.

....................................................

As we have seen, big data is an opportunity, but there are many problems in encashing this opportunity, so let us focus on those problems one by one. The first problem is storing a large amount of data. Storing this huge data in a traditional system is not possible, and the reason is obvious: the storage of a single system is limited. For example, say you have a server with a storage limit of 10 terabytes, but your company is growing really fast and the data is increasing exponentially. What do you do? At some point you exhaust all the storage, and investing in bigger and bigger servers is definitely not a cost-effective solution. So what do you think the solution to this problem can be? According to me, a distributed file system is a better way to store this huge data, because with it we will be saving a lot of money. Let me tell you how: with a distributed system you can actually store your data on commodity hardware instead of spending money on high-end servers.

.............................................
Let's see a few more. Okay, so we saw that the data is not only huge but is also present in various formats: unstructured, semi-structured and structured. So you not only need to store this huge data, you also need to make sure that a system is present to store the variety of data generated from various sources.

.....................................................

And now let's focus on the next problem. If you look at the diagram, you can notice that hard disk capacity is increasing, but disk performance, that is, the read/write speed, is not increasing at the same rate. Let me explain this with an example. If you have only one 100 MB/s input-output channel and you are processing, say, one terabyte of data, how much time will it take? If you calculate it, it comes to somewhere around 2.91 hours. And I have taken an example of just one terabyte; what if you are processing zettabytes of data? You can imagine how much time that would take. Now, what if you had four input-output channels for the same amount of data? Then it would take approximately 0.72 hours, which, converted into minutes, is around 43 minutes.
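To make the arithmetic concrete, here is a quick back-of-the-envelope check in Python. It assumes 1 TB is treated as 1024 x 1024 MB and that each channel sustains a steady 100 MB/s, which is what reproduces the 2.91-hour and roughly 43-minute figures; real disks and networks will of course vary.

```python
# Back-of-the-envelope check of the transfer times quoted above.
# Assumptions: 1 TB taken as 1024 * 1024 MB, channel throughput 100 MB/s.

data_mb = 1024 * 1024          # 1 TB of data expressed in MB
channel_mb_per_s = 100         # throughput of a single I/O channel

def transfer_hours(size_mb, throughput_mb_s, channels=1):
    """Time in hours to move size_mb over `channels` parallel channels."""
    seconds = size_mb / (throughput_mb_s * channels)
    return seconds / 3600

print(f"1 channel : {transfer_hours(data_mb, channel_mb_per_s):.2f} h")    # ~2.91 h
print(f"4 channels: {transfer_hours(data_mb, channel_mb_per_s, 4):.2f} h")  # ~0.73 h (~44 min)
```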

...................................................................

Now, it won't be fair if we don't discuss the solution to these problems. So what is the solution? Hadoop. Hadoop is the solution, so let's introduce Hadoop now. Okay, so what is Hadoop? Hadoop is a framework that allows you to first store big data in a distributed environment so that you can then process it in parallel. There are basically two parts. One is HDFS, the Hadoop Distributed File System, for storage; it allows you to store data of various formats across a cluster. The second part is MapReduce, which is nothing but the processing unit of Hadoop; it allows parallel processing of the data that is stored across HDFS.

...................................

Now let us dig deeper into HDFS and understand it better. HDFS creates an abstraction of resources; let me simplify it for you. Similar to virtualization, you can see HDFS logically as a single unit for storing big data, but actually you are storing your data across multiple systems, or you can say in a distributed fashion. Here you have a master/slave architecture in which the NameNode is the master node and the DataNodes are the slaves. The NameNode contains the metadata about the data that is stored in the DataNodes, like which data block is stored on which DataNode, where the replicas of a data block are kept, and so on. The actual data is stored in the DataNodes. I also want to add that we actually replicate the data blocks present on the DataNodes, and by default the replication factor is 3, which means there are 3 copies of each block. Now let me tell you why we need that replication: since we are using commodity hardware, and we know the failure rate of this hardware is pretty high, if one of the DataNodes fails I would lose that data block, and that is the reason we need to replicate the data blocks.

.....................................................

Now let us understand how Hadoop actually provided the solution to the big data problems that we have discussed. Can you remember what the first problem was? Yes, it was storing the big data. So how did HDFS solve it? Let's discuss. HDFS provides a distributed way to store big data; we have already told you that. Your data is stored in blocks across DataNodes, and you can specify the size of each block. Basically, if you have 512 MB of data and you have configured HDFS to create 128 MB data blocks, HDFS will divide the data into 4 blocks, because 512 divided by 128 is 4, store them across different DataNodes, and also replicate the data blocks on different DataNodes. Since we are using commodity hardware, storing is not a challenge.

It also solves the scaling problem: it focuses on horizontal scaling instead of vertical scaling. You can always add extra DataNodes to your HDFS cluster as and when required, instead of scaling up the resources of your existing DataNodes. So you are not increasing the resources of your DataNodes; you are just adding a few more DataNodes when you need them. Let me summarize it for you: for storing 1 TB of data I don't need a 1 TB system; I can instead do it on multiple 128 GB systems, or even smaller ones.
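As a minimal sketch of that block arithmetic (the 512 MB file, 128 MB block size and replication factor of 3 are the example figures from above, not values read from a real cluster):

```python
import math

# Example figures from the text above (not read from an actual cluster).
file_size_mb = 512        # size of the file being stored
block_size_mb = 128       # configured HDFS block size
replication_factor = 3    # HDFS default

blocks = math.ceil(file_size_mb / block_size_mb)    # logical blocks
physical_copies = blocks * replication_factor       # blocks actually written

print(f"Logical blocks  : {blocks}")           # 4
print(f"Physical copies : {physical_copies}")  # 12, spread across DataNodes
```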

...................................................

Now on to the second challenge with big data. The next problem was storing a variety of data, and that problem was also addressed by HDFS. With HDFS you can store all kinds of data, whether structured, semi-structured or unstructured, because in HDFS there is no pre-dumping schema validation; you can just dump all the kinds of data you have in one place. It also follows a write-once, read-many model, so you can write the data once and read it multiple times to find insights.

.............................................

And if you can recall, the third challenge was accessing and processing the data faster, and this is one of the major challenges with big data. In order to solve it, we move processing to the data, and not data to the processing. Let me explain what moving processing to the data actually means. Consider this as the master and these as our slaves, and the data is stored on these slaves. One way of processing this data is to send the data to the master node and process it there, but what will happen if all of my slaves send their data to the master node? It will cause network congestion and input-output congestion, and at the same time the master node will take a lot of time to process that huge amount of data. So what I can do instead is send the processing to the data: I send the logic to all the slaves, which actually contain the data, and the processing is performed on the slaves themselves. After that, only the small chunks of results that come out are sent back to the master node, so there is no network or input-output congestion, and it takes comparatively very little time. This is what sending processing to the data actually means.
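Here is a toy Python sketch of the "send the logic to the data" idea. The three slave partitions and the counting logic are purely illustrative; in real Hadoop the framework ships the compiled map logic to the DataNodes, but the shape of the exchange, a small function going out and small summaries coming back, is the same.

```python
from collections import Counter

# Toy model: each "slave" holds its own partition of the data locally.
# These partitions and the counting logic are illustrative only.
slave_partitions = {
    "slave1": ["big", "data", "big"],
    "slave2": ["data", "hadoop"],
    "slave3": ["big", "hadoop", "hadoop"],
}

def count_words(partition):
    """The 'logic' we ship to each slave; it runs next to the data."""
    return Counter(partition)

# Each slave processes its own data and sends back only a small summary...
partial_results = [count_words(p) for p in slave_partitions.values()]

# ...and the master merely merges those small results.
total = Counter()
for partial in partial_results:
    total.update(partial)

print(dict(total))   # {'big': 3, 'data': 2, 'hadoop': 3}
```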

.......................................

So let's move forward and focus on a few components of Hadoop by looking at the Hadoop ecosystem. You can see that this is the entire Hadoop ecosystem.

The Hadoop Ecosystem is neither a programming language nor a service; it is a platform or framework which solves big data problems.

You can consider it as a suite which encompasses a number of services

(ingesting, storing, analyzing and maintaining) inside it. Let us discuss

and get a brief idea about how the services work individually and in collaboration.

..........................................................................................
HDFS

Hadoop Distributed File System is the core component, or you can say the backbone, of the Hadoop Ecosystem.

HDFS is the component which makes it possible to store different types of large data sets (i.e. structured, unstructured and semi-structured data).

HDFS has two core components, i.e. NameNode and DataNode.

The NameNode is the main node and it doesn’t store the actual data.

It contains metadata, just like a log file, or you can say a table of contents.

Therefore, it requires less storage and high computational resources.

On the other hand, all your data is stored on the DataNodes and hence it requires more storage
resources.

These DataNodes are commodity hardware (like your laptops and desktops) in the distributed
environment.

That’s the reason, why Hadoop solutions are very cost effective.

YARN

Consider YARN as the brain of your Hadoop Ecosystem.

It performs all your processing activities by allocating resources and scheduling tasks.

It has two major components, i.e. ResourceManager and NodeManager.

ResourceManager is again a main node in the processing department.

It receives the processing requests, and then passes the parts of requests

to corresponding NodeManagers accordingly, where the actual processing takes place.

NodeManagers are installed on every DataNode. They are responsible for the execution of tasks on every single DataNode.

...............................................

MAPREDUCE

Apache MapReduce is the core processing component of the Hadoop Ecosystem, as it provides the logic of processing. In other words, MapReduce is a software framework which helps in writing applications that process large data sets using distributed and parallel algorithms inside the Hadoop environment.

In a MapReduce program, Map() and Reduce() are two functions.

The Map function performs actions like filtering, grouping and sorting, while the Reduce function aggregates and summarizes the results produced by the Map function.

The result generated by the Map function is a key-value pair (K, V), which acts as the input for the Reduce function.

Let us take an example to get a better understanding of a MapReduce program.

We have a sample case of students and their respective departments.

We want to calculate the number of students in each department.

Initially, the Map program will execute and emit a key-value pair for each student's department, as mentioned above. These key-value pairs are the input to the Reduce function.

The Reduce function will then aggregate the pairs for each department, calculate the total number of students in each department, and produce the final result.
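A minimal, in-process Python sketch of those Map and Reduce steps might look like the following; the student records are made up, and a real job would use the Hadoop MapReduce API or Hadoop Streaming rather than a single script.

```python
from collections import defaultdict

# Made-up sample records: (student, department)
records = [
    ("asha", "CS"), ("ravi", "CS"), ("meena", "EE"),
    ("john", "ME"), ("sara", "EE"), ("li", "CS"),
]

def map_phase(record):
    """Emit a (key, value) pair per record: (department, 1)."""
    student, department = record
    return (department, 1)

def reduce_phase(key, values):
    """Aggregate all values for one key: total students per department."""
    return (key, sum(values))

# Shuffle/sort: group the mapped pairs by key (done by the framework in Hadoop).
grouped = defaultdict(list)
for dept, one in map(map_phase, records):
    grouped[dept].append(one)

result = dict(reduce_phase(k, v) for k, v in grouped.items())
print(result)   # {'CS': 3, 'EE': 2, 'ME': 1}
```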

APACHE PIG

PIG has two parts: Pig Latin, the language, and the Pig runtime, the execution environment.

You can think of them like Java and the JVM.

It supports the Pig Latin language, which has a SQL-like command structure.

Not everyone comes from a programming background, so Apache PIG relieves them.

You might be curious to know how?


Well, I will tell you an interesting fact:

10 lines of Pig Latin = approx. 200 lines of MapReduce Java code

But don't be shocked when I say that at the back end of a Pig job, a MapReduce job executes.

The compiler internally converts Pig Latin to MapReduce. It produces a sequential set of MapReduce jobs, and that's an abstraction (which works like a black box).

PIG was initially developed by Yahoo.

It gives you a platform for building data flow for ETL (Extract, Transform and Load), processing

and analyzing huge data sets.

How does Pig work?

In PIG, first the load command loads the data. Then we perform various functions on it, like grouping, filtering, joining, sorting, etc. At last, you can either dump the data to the screen or store the result back in HDFS.

APACHE HIVE

Facebook created HIVE for people who are fluent in SQL. Thus, HIVE makes them feel

at home while working in a Hadoop Ecosystem.

Basically, HIVE is a data warehousing component which performs reading, writing and

managing large data sets in a distributed environment using SQL-like interface.

HIVE + SQL = HQL

The query language of Hive is called Hive Query Language (HQL), which is very similar to SQL.

APACHE MAHOUT

Now, let us talk about Mahout which is renowned for machine learning.

Mahout provides an environment for creating machine learning applications which are scalable.

What does Mahout do?

It performs collaborative filtering, clustering and classification.

Mahout provides a command line to invoke various algorithms.

It has a predefined set of libraries which already contain different inbuilt algorithms for different use cases.

APACHE SPARK

Apache Spark is a framework for real time data analytics in a distributed computing environment.

Spark itself is written in Scala.

It executes in-memory computations to increase the speed of data processing over MapReduce.

It can be up to 100x faster than Hadoop MapReduce for large-scale data processing by exploiting in-memory computations and other optimizations; therefore, it requires higher processing power (and memory) than MapReduce.

As you can see, Spark comes packed with high-level libraries, including support for R, SQL, Python, Scala, Java, etc.
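As a rough illustration of that in-memory, high-level style, here is a small PySpark word-count sketch. It assumes a local Spark installation, and the HDFS path is hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes Spark is installed locally; the HDFS path below is hypothetical.
spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()

lines = spark.sparkContext.textFile("hdfs:///data/sample.txt")  # hypothetical path

counts = (
    lines.flatMap(lambda line: line.split())   # split each line into words
         .map(lambda word: (word, 1))          # emit (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)      # aggregate counts in memory
)

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```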

This is a very common question in everyone’s mind:

“Apache Spark: A Killer or Saviour of Apache Hadoop?”

The answer to this: Apache Spark best fits real-time processing, whereas Hadoop was designed to store unstructured data and execute batch processing over it.


When we combine Apache Spark's abilities, i.e. high processing speed, advanced analytics and multiple integration support, with Hadoop's low-cost operation on commodity hardware, it gives the best results.

That is the reason why Spark and Hadoop are used together by many companies for processing and analyzing their Big Data stored in HDFS.

APACHE HBASE

HBase is an open-source, non-relational, distributed database. In other words, it is a NoSQL database.

It supports all types of data, and that is why it's capable of handling anything and everything inside a Hadoop ecosystem.

For better understanding, let us take an example. You have billions of customer emails and you need to find out the number of customers who have used the word "complaint" in their emails.

The request needs to be processed quickly (i.e. in real time).

So here we are handling a large data set while retrieving only a small amount of data.

HBase was designed for solving exactly this kind of problem.
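As a hedged sketch of that use case, the snippet below uses the third-party happybase Python client against a hypothetical "emails" table with a "msg:body" column, with an HBase Thrift server assumed on localhost. Filtering in Python keeps the example simple; a production design would push the filter to the server or maintain a pre-computed counter.

```python
import happybase  # third-party HBase client; requires a running HBase Thrift server

# Hypothetical table 'emails' with a column family 'msg' holding the body text.
connection = happybase.Connection("localhost")
table = connection.table("emails")

complaints = 0
for row_key, data in table.scan(columns=[b"msg:body"]):
    body = data.get(b"msg:body", b"").decode("utf-8", errors="ignore")
    if "complaint" in body.lower():
        complaints += 1

print("emails mentioning 'complaint':", complaints)
connection.close()
```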

APACHE ZOOKEEPER
Apache Zookeeper is the coordinator of any Hadoop job which includes a combination of various
services in a Hadoop Ecosystem.

Apache Zookeeper coordinates with various services in a distributed environment.

Before Zookeeper, it was very difficult and time-consuming to coordinate between the different services in the Hadoop Ecosystem. The services earlier had many problems with interactions, like sharing common configuration while synchronizing data. Even when the services were configured, changes in their configurations made them complex and difficult to handle. Grouping and naming was also a time-consuming factor.

Due to the above problems, Zookeeper was introduced. It saves a lot of time by performing
synchronization,

configuration maintenance, grouping and naming.

Although it’s a simple service, it can be used to build powerful solutions.
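To give a feel for the shared-configuration use case, here is a small sketch with the kazoo Python client; it assumes a ZooKeeper server on localhost:2181, and the znode path and value are made up.

```python
from kazoo.client import KazooClient  # third-party ZooKeeper client

# Assumes a ZooKeeper server on localhost:2181; the znode path is made up.
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

config_path = "/demo/config/db_url"
zk.ensure_path("/demo/config")

# Publish a piece of shared configuration that every service can read.
if zk.exists(config_path):
    zk.set(config_path, b"jdbc:mysql://db.internal:3306/app")
else:
    zk.create(config_path, b"jdbc:mysql://db.internal:3306/app")

value, _stat = zk.get(config_path)
print("shared config:", value.decode())

zk.stop()
```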

APACHE OOZIE

Consider Apache Oozie as a clock and alarm service inside the Hadoop Ecosystem.

For Apache Hadoop jobs, Oozie acts just like a scheduler. It schedules Hadoop jobs and binds them together as one logical unit of work.

There are two kinds of Oozie jobs:

Oozie workflow: a sequential set of actions to be executed.

You can think of it as a relay race, where each athlete waits for the previous one to complete their part.

Oozie Coordinator: Oozie jobs which are triggered when data is made available to them.

Think of this as the response-stimuli system in our body. In the same manner as we respond to an
external stimulus, an Oozie coordinator responds to the availability of data and it rests otherwise.

APACHE FLUME

Ingesting data is an important part of our Hadoop Ecosystem.

Flume is a service which helps in ingesting unstructured and semi-structured data into HDFS.

It gives us a solution which is reliable and distributed, and helps us in collecting, aggregating and moving large amounts of data.

It helps us ingest online streaming data from various sources, like network traffic, social media, email messages, log files, etc., into HDFS.

APACHE SQOOP

Now, let us talk about another data ingestion service, i.e. Sqoop. The major difference between Flume and Sqoop is that:

Flume only ingests unstructured or semi-structured data into HDFS,

while Sqoop can import as well as export structured data between an RDBMS or enterprise data warehouse and HDFS.
