Session 17 - 07 Jan 2023 - Big Data Hadoop


Module 2 starts from this Session 17; we will cover Big Data, Databricks and Spark.

Until now, in Module 1 (up to Session 16), we were focusing on Azure-specific topics.

 Big Data – we are now moving from the digitization world to the big data world. Digitization means projects like online registration or ticket booking, i.e. common apps. In the last 10 years the majority of applications have gone online, and all of these applications generate data. Problem number one is how we scale this up as the number of users increases. The second thing is that a lot of data gets generated; every device generates a huge amount of data. Data generation is fine and the website runs smoothly; now the idea is to take insights from this data, i.e. to do analytics on this data.
 If there are millions of users, log files are generated for all of their online transactions on any application. We need analysis: how many users just visit the website, how many actually buy products, who the repeat customers are, and more information like this.
 ATM example – one ATM is always empty and we can get cash from it at any time, while some other ATM has a long queue. Some ATMs near offices are always crowded for the first 5 days of the month and much less crowded for the last 5 days. An ATM near a mall is more crowded on weekends and sees less traffic on weekdays. There can be many such patterns. If we have the data from all these ATMs, we can find these patterns and deploy cash in such a way that we get more customer satisfaction and more business for the bank.
 Similar thinking can be applied to medical, telecom, FMCG, etc.
 This huge amount of information is used for data insights and for taking the right business decisions based on those insights.
 How are we going to do this analytics? We need big data technology for this.
 Many organizations want to manage their data and do data analytics.
 Data Engineers have to do this data analytics using Spark code, etc.
 There are more Data Engineering projects than Data Science projects. That's why it's the perfect time to become a data engineer.
 Big data doesn't have any specific definition; it is just a huge amount of data. No size or limit is specified anywhere.
 ----------------------------------------------------------------------------------------------------------------------
 Data can be of three types: Structured, Semi-Structured and Unstructured.
 Structured Data – data in the form of a table (rows and columns). The structure is the same for every row: all rows have exactly the same number of columns with the same datatypes. Analysis is easy if we have all data in a structured format.
 Semi-structured Data – data with a somewhat flexible schema; there is no fixed format. CSV, JSON and XML files are examples of semi-structured data.
 Un-Structured Data – data which doesn't have any structure, like log files, video, audio, etc. (A short sketch contrasting the three types follows below.)
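 To make the three categories concrete, here is a minimal Python sketch using made-up sample records (the field names and values are purely illustrative, not from any real dataset):

    import csv
    import io
    import json

    # Structured: every row has the same columns and datatypes (like a database table).
    structured_rows = [
        {"user_id": 1, "name": "Asha", "amount": 250.0},
        {"user_id": 2, "name": "Ravi", "amount": 120.5},
    ]

    # Semi-structured: a flexible schema -- JSON records may carry optional fields.
    json_record = '{"user_id": 3, "name": "Meena", "tags": ["repeat-customer"]}'
    parsed = json.loads(json_record)      # fields can vary from record to record
    print(parsed.get("tags", []))         # a missing field is handled gracefully

    # Semi-structured: CSV text parsed with the csv module.
    csv_text = "user_id,name\n4,Kiran\n5,Divya\n"
    for row in csv.DictReader(io.StringIO(csv_text)):
        print(row["name"])

    # Unstructured: a raw log line with no fixed schema; we must write custom parsing.
    log_line = "2023-01-07 10:15:32 INFO user=6 action=checkout"
    print("checkout" in log_line)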

 ------------------------------------------------------------------------------------------------------------------------
 If we are getting data in a structured format, we can analyze it using SQL queries.
 How do we analyse semi-structured or unstructured data? We can use Python or Java code to read the file and then write some logic to extract and analyse the data. But the problem with this approach is the file size: if the file is 10 MB we can go ahead with it, but if the file size is in GB we can't, because it would take a huge amount of time and effort.
 For this we need a system where we can store such a huge amount of data, and huge computation power is required to process it.
 Imagine we have 1 TB of data; merely copying the data will take a lot of time.
 It is pretty clear that a single machine can't solve this problem; it has its own limitations.
 So, we need multiple machines. In our projects, when we get a huge amount of requirements, we ask for more people to distribute the workload, so that we can finish the work on time and deliver to the client within the agreed timelines.
 That's the reason we need Parallel Processing and Distributed Computing. Parallel processing means more than one task is being processed at the same time, in parallel. Distributed computing means the work is divided into pieces and the computation happens on those pieces, to save time and effort. (A small single-machine sketch of this idea appears below, after the node and cluster definitions.)
 This is the reason Big Data Analytics systems came into the picture: multiple systems are used to solve one huge problem.
 Here in the big data world, a Node is nothing but one laptop/machine.
 A Cluster is nothing but a group of Nodes.
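 As a rough single-machine analogy of parallel processing (this is only a sketch using Python's multiprocessing module, not an actual cluster; the log lines and the "BUY" marker are invented for illustration):

    from multiprocessing import Pool

    # Hypothetical helper: count the lines containing "BUY" in one chunk of a log.
    def count_purchases(chunk_of_lines):
        return sum(1 for line in chunk_of_lines if "BUY" in line)

    if __name__ == "__main__":
        # Pretend this is a huge log file already split into pieces;
        # on a real cluster each piece would live on a different node.
        log_lines = [f"user={i} action={'BUY' if i % 3 == 0 else 'VIEW'}" for i in range(1_000_000)]
        n_workers = 4
        chunk_size = len(log_lines) // n_workers
        chunks = [log_lines[i * chunk_size:(i + 1) * chunk_size] for i in range(n_workers)]

        # Parallel processing: the four chunks are counted at the same time.
        with Pool(n_workers) as pool:
            partial_counts = pool.map(count_purchases, chunks)

        # Distributed-computing idea: combine the partial results into one answer.
        print(sum(partial_counts))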

 Hadoop is a big data framework which was developed in 200_.


 Over a period of time, Hadoop's place has been taken over by Spark.
 If any project is starting fresh, it will use only Spark instead of Hadoop.
 Only legacy/old support projects will still be using Hadoop.
 ------------------
 How does Hadoop work? Hadoop is important as the base of big data analytics.
 In the computer world we can segregate everything into two things: compute and storage. Everything in the cloud world revolves around them.
 Compute means computation/processing and Storage means storing the data.
 Distributed Computing – there will be multiple nodes, and all these nodes will work in parallel to give the result in less time.
 Storage – if we have 1 TB of data, it cannot be stored on one machine. At the start we get the file from the client/source team, and as soon as we get this data we store it in the cluster. It is the cluster's responsibility to divide this data and store it.
 As in the example below, a 1 TB file gets divided into 250 GB pieces and stored on 4 different nodes.

 We are thinking about how any big data or Hadoop framework is designed.


 This storage management is taken care of by a system called HDFS (Hadoop Distributed File System).
 Actually, in Hadoop there are two pieces: HDFS and MapReduce.
 HDFS handles file management, or we can say storage management.
 HDFS takes the file, divides it into pieces, and keeps/stores those pieces on different nodes. (A toy sketch of this bookkeeping follows below.)
 The Name Node is the one which stores all the information (metadata) about the other nodes.
 The other nodes, which store the data, are called Data Nodes.
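 A toy Python sketch of the idea described above (this is not real HDFS internals, only the bookkeeping: cut a file into blocks and record which data node holds each block; the node names and round-robin placement are assumptions for illustration):

    # Conceptual sketch only -- not real HDFS code.
    BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default block size

    def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
        """Return how many blocks a file of the given size would occupy."""
        return (file_size_bytes + block_size - 1) // block_size  # ceiling division

    data_nodes = ["node1", "node2", "node3", "node4"]
    name_node_metadata = {}  # block id -> data node holding that block (metadata only)

    one_tb = 1024 ** 4
    for block_id in range(split_into_blocks(one_tb)):
        # Round-robin placement, purely for illustration; real HDFS placement
        # also considers racks, free space and replication.
        name_node_metadata[block_id] = data_nodes[block_id % len(data_nodes)]

    print(len(name_node_metadata), "blocks tracked by the name node")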

 In the initial days the cloud was not there, so this solution was developed without considering the cloud. All machines were on-premises at that time. On-prem, it is possible that one machine goes down for some reason. If that one machine goes down, the data becomes unusable: if we want to run an algorithm on the full 1 TB of data and one node is down, we get only three pieces out of 4 and the file is incomplete.

 This was a very common problem 15 years ago, when the cloud was not there.
 To get rid of this problem, they said it's better to do replication: dividing the data into four pieces is fine, but we have to make copies as well.

 If node 4, which holds data D, goes down, node 3 also has a replica of D. All of this block location information is stored in the Name Node.
 This is called Fault Tolerance: if one node goes down, our work will still be up and running.
 3 is the replication factor; it means every block/file will be copied 3 times. 3 replicas of any file will be there as a backup in case of failure.
 Hadoop is smart enough to do all of this with just one copy command; it will copy the data and do the replication by itself.
 These pieces are called Blocks in the world of Hadoop.
 The default block size is 64 MB or 128 MB.
 This block size is configurable.
 Hadoop is aware of the network topology; it understands where our servers are located. In a server room we may have Rack 1, Rack 2, etc., with a number of servers present in each rack. What if one rack fails? Hadoop locates another rack and keeps a replica of every block in Rack 2 as well.
 So, the replica of a block/piece of a file is present not only on different nodes/machines but also in different racks. (A toy placement sketch follows below.)
 This type of fault tolerance is maintained in the Hadoop system. This is the HDFS part.
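 A small sketch of the placement idea with a replication factor of 3 (illustrative only; the rack and node names are invented, and the real HDFS placement policy is more involved):

    # Illustrative sketch: the three copies of a block should not all sit on the
    # same node or in the same rack. This mirrors the spirit of HDFS's policy
    # without copying it exactly.
    REPLICATION_FACTOR = 3  # in real HDFS this is the dfs.replication setting

    racks = {
        "rack1": ["node1", "node2"],
        "rack2": ["node3", "node4"],
    }

    def place_replicas(block_id):
        """Pick 3 nodes for a block, spread over at least two racks."""
        rack_names = list(racks)
        first_rack = rack_names[block_id % len(rack_names)]
        other_rack = rack_names[(block_id + 1) % len(rack_names)]
        # One replica on the first rack, two on a different rack.
        chosen = [racks[first_rack][0], racks[other_rack][0], racks[other_rack][1]]
        return chosen[:REPLICATION_FACTOR]

    for block_id in range(4):
        print(block_id, place_replicas(block_id))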
 ---------------------------------------------------------------------------------------------------------------
 Now, what about analytics?
 All the algorithms and logic run in Hadoop using MapReduce. These are also popularly called MR jobs.
 We can run all our logic using MapReduce.
 There is some complexity in writing code using MapReduce.
 Data Locality – consider the setup that existed before Hadoop was developed: one master machine with several slave machines, where all data was kept with the master only. Take a very common example, Word Count: imagine we have one file and we have to find the occurrence of every word in this file.
 Slave nodes are also called Worker Nodes.

 Data locality is one of the reasons for Hadoop's popularity.


 Instead of moving the data to the code, we can transfer the logic to the data. Most big data is written once and read many times, so there is no point in transferring the data again and again.
 When someone writes code, the NameNode will send this code to the specific child node (like B) and command it to process the data.
 So, here the data is local. HDFS divides the data and keeps it on the Data Nodes (slaves). When we have to run code, Hadoop understands which child node's data it needs for execution, sends the code to just that child node, and executes the program there. We are not moving data, we are only moving code, because the code size is very small compared to the data size. This is called Data Locality.
 MR stands for MapReduce. An MR job has two phases: a Map phase and a Reduce phase. We try to solve the problem in these two steps.
 Map Phase: each Mapper works individually on its own piece of the data.
 Reduce Phase: the Reducer assembles/integrates all the work done by the Mappers.

 We divide the data (A, B, C and D) and process it in parallel.


 We take the result from everyone and aggregate it. This is called the Reducer/Output.

 The final result is the addition/aggregation of all the mapper outputs on the Reducer side, produced by the Reducer code. (A plain-Python sketch of this shape follows below.)
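 The word-count example from earlier in these notes, written in plain Python in the MapReduce shape (a conceptual sketch only, not an actual Hadoop MR job; the sample lines are made up):

    from collections import defaultdict

    # Word count in the MapReduce style, mimicked on one machine.
    # On Hadoop the mappers and reducers would run on different data nodes.

    def mapper(line):
        """Map phase: emit (word, 1) for every word in one line."""
        for word in line.split():
            yield (word, 1)

    def reducer(word, counts):
        """Reduce phase: add up all the 1s emitted for one word."""
        return (word, sum(counts))

    lines = ["big data is big", "data engineers love big data"]

    # Shuffle step: group all the values emitted for the same key together.
    grouped = defaultdict(list)
    for line in lines:
        for word, one in mapper(line):
            grouped[word].append(one)

    results = [reducer(word, counts) for word, counts in grouped.items()]
    print(results)  # e.g. [('big', 3), ('data', 3), ('is', 1), ...]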
 We can write MapReduce code mainly in two languages: Java and Python.
 Java is the most preferred, because MapReduce and Hadoop are themselves developed in Java.
 Facebook came up with a solution for the problem of complex MapReduce programs: Hive. In Hive we can write a query using SQL; Hive converts this query into MapReduce, runs that MapReduce code and gives us the result.
 Now Hive is also a part of the Hadoop ecosystem.
 Yahoo also did some work on this issue. They developed something called Pig. In Pig we write scripts in Pig Latin, a high-level scripting language; Pig converts the script into MapReduce, runs that MapReduce code and gives us the result.
 Another one is Sqoop. It was developed to copy data between HDFS and relational databases such as MySQL.
 Hadoop installation takes a lot of time and had many installation problems. Two companies came forward for this: Cloudera and Hortonworks provide ready-made Hadoop solutions (the two have since merged). We just have to install Cloudera and use its services; Cloudera is nothing but a Hadoop distribution. Cloudera's place has now been taken by Databricks.
