Session 17 - 07thjan2023 - Big Data Hadoop
Until now, in Module 1 (up to Session 16), we were focusing on Azure-specific things.
Big Data – we are now moving from the digitization world to the big data world. Digitization
means projects like online registration or ticket booking, such as a common application
portal. In the last 10 years the majority of applications have gone online, and all of these
applications generate data. Problem number one is how to scale this up as the number of
users increases. Second, a lot of data gets generated; every device generates a huge amount
of data. Data generation is fine and the website runs smoothly; the next idea is to take
insights from this data, i.e. to do analytics on it.
If there are millions of users, log files get generated for all of their online transactions
on any application. We need analysis: how many users just visit the website, how many
actually buy products, who the repeat customers are, and more information like this.
ATM example – Suppose one ATM is always empty and we can get cash out of it anytime, while
some other ATM has a long queue. An ATM near an office is crowded for the first 5 days of
the month and much less crowded for the last 5 days. An ATM near a mall is more crowded on
weekends and has less traffic on weekdays. There can be any number of such patterns. If we
have the data from all these ATMs, we can find these patterns and deploy cash in a way that
gives more customer satisfaction and more business to the bank.
A similar approach can be applied to medical, telecom, FMCG, and other domains.
This huge amount of information is used for data insights and for taking the right business
decisions based on those insights.
How are we going to do this analytics? We need big data technology for it.
Many organizations want to manage this data and do data analytics on it.
Data Engineers do this data analytics using Spark code and similar tools.
There are more projects for Data Engineers than for Data Scientists. That's why it's a
perfect time to become a data engineer.
Big data doesn't have any specific definition; it is just a huge amount of data. No size or
limit is mentioned anywhere.
----------------------------------------------------------------------------------------------------------------------
Data can be of three types: structured, semi-structured, and unstructured.
Structured Data – Data in the form of a table (rows and columns). Every row follows the same
layout: the exact same number of columns with the same datatypes. Analysis is easy if all
the data is in a structured format.
Semi-structured Data – Data with a somewhat flexible schema rather than a fixed format. CSV,
JSON, and XML files are examples of semi-structured data.
Unstructured Data – Data that doesn't have any structure, like log files, video, audio, etc.
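A tiny Python sketch (with made-up records) shows why JSON and CSV count as semi-structured: the schema is flexible, so one record may carry a field that another omits.

```python
import csv
import io
import json

# Semi-structured data: the schema is flexible -- one JSON record
# may carry fields that another record omits.
records = [
    json.loads('{"user": "amit", "action": "login"}'),
    json.loads('{"user": "neha", "action": "buy", "item": "phone"}'),
]
for r in records:
    # .get() tolerates the missing "item" field in the first record
    print(r["user"], r["action"], r.get("item", "-"))

# CSV is also semi-structured: rows follow a header line, but the
# values are all plain text and no datatypes are enforced.
csv_text = "user,action\namit,login\nneha,buy\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["action"])  # login
```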
------------------------------------------------------------------------------------------------------------------------
If we get data in a structured format, we can analyze it using SQL queries.
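As a minimal illustration (using an in-memory SQLite table with invented column names), a question like "how many visitors actually bought something?" becomes a one-line SQL query once the data is structured:

```python
import sqlite3

# Structured data: fixed rows and columns, so SQL answers questions
# directly. Table and column names here are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user TEXT, bought INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("amit", 0), ("neha", 1), ("amit", 1), ("ravi", 0)],
)
# How many visits actually ended in a purchase?
buyers = conn.execute(
    "SELECT COUNT(*) FROM visits WHERE bought = 1"
).fetchone()[0]
print(buyers)  # 2
```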
How do we analyze semi-structured or unstructured data? We can use Python or Java code to
read the file and then write some logic to extract and analyze the data. The problem with
this approach is file size: if the file is 10 MB we can go ahead with it, but if the file is
in GBs we can't, because it would take a huge amount of time and effort.
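A sketch of that single-machine approach, with a made-up log format: it is fine at this size, but scanning GBs of log lines one by one on a single machine becomes impractical.

```python
from collections import Counter

# A naive single-machine scan of an access log. Works for a small
# file, but for multi-GB logs one machine becomes the bottleneck.
# The log format below is a made-up example.
log_lines = [
    "amit GET /home",
    "neha GET /product/42",
    "neha POST /buy/42",
    "amit GET /product/7",
]
# Count visits per user: the first token of each line is the user.
visits = Counter(line.split()[0] for line in log_lines)
print(visits["neha"])  # 2
```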
For this we need a system that can store such a huge amount of data, and processing that
data requires huge computation power.
Imagine we have 1 TB of data; just copying it will take a lot of time.
It is pretty clear that one machine alone can't solve this problem; it has its own
limitations. So we need multiple machines. In our projects, when we get a huge amount of
work, we ask for more people to distribute the workload so that we can finish on time and
deliver to the client within the agreed timelines.
That's the reason we need Parallel Processing and Distributed Computing. Parallel processing
means more than one task runs at the same time, in parallel. Distributed computing means the
work is divided into pieces across machines, and the computation happens on each piece to
save time and effort.
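The two ideas can be sketched in a few lines of Python (a toy illustration, not a big data framework): the work is divided into pieces (distributed computing) and the pieces are processed at the same time by a pool of workers (parallel processing).

```python
from multiprocessing import Pool

def count_words(chunk):
    # Each worker counts words in its own piece, independently.
    return sum(len(line.split()) for line in chunk)

if __name__ == "__main__":
    lines = ["big data is huge"] * 1000
    # Divide the work into 4 pieces...
    pieces = [lines[i::4] for i in range(4)]
    # ...and process the pieces in parallel with 4 workers.
    with Pool(4) as pool:
        partial = pool.map(count_words, pieces)
    # Aggregate the partial results into the final answer.
    print(sum(partial))  # 4000
```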
This is the reason Big Data Analytics systems came into the picture: multiple machines are
used together to solve one huge problem.
In the big data world, a Node is just one laptop/machine, and a Cluster is a group of nodes.
In the initial days the cloud did not exist, so this solution was developed without the
cloud in the picture; all machines were on-premises. On-prem, it is possible that one
machine goes down for some reason. If our 1 TB of data is split into 4 pieces and one node
goes down, we get only 3 pieces out of 4, the file is incomplete, and we cannot run an
algorithm on the full data.
This was a very common problem 15 years ago, before the cloud.
To get rid of this problem, the designers introduced replication. Dividing the data into
four pieces is fine, but we also have to keep copies. If node 4, which holds piece D, goes
down, node 3 also has a replica of D. The metadata about where each piece is stored is kept
on the Name Node.
This is called Fault Tolerance: if one node goes down, our work stays up and running.
3 is the default replication factor, meaning every block/file is copied 3 times; 3 replicas
of any file exist as a backup in case of failure.
Hadoop is smart enough to do this with just one copy command: it copies the file and does
the replication by itself.
These pieces are called Blocks in the Hadoop world.
The default block size is 64 MB (128 MB in newer versions), and it is configurable.
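The arithmetic is easy to sketch (the block size and replication factor below are the common defaults, not values read from any real cluster):

```python
import math

# Rough arithmetic for HDFS storage: a file is split into fixed-size
# blocks, and every block is copied "replication factor" times.
# 128 MB blocks and replication factor 3 are common defaults.
def hdfs_footprint(file_mb, block_mb=128, replication=3):
    blocks = math.ceil(file_mb / block_mb)   # pieces the file is split into
    total_mb = file_mb * replication         # raw storage used with replicas
    return blocks, total_mb

# A 1 TB (1,048,576 MB) file:
blocks, total = hdfs_footprint(1_048_576)
print(blocks)  # 8192 blocks of 128 MB each
print(total)   # 3145728 MB (~3 TB) of raw storage with 3 replicas
```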
Hadoop can also read the network topology: it understands where each server is located. In a
server room we may have Rack 1, Rack 2, etc., with a number of servers in each rack. What if
a whole rack fails? Hadoop places a replica on another rack as well, so replicas of a
block/piece of a file exist not only on different nodes/machines but also on different
racks.
This kind of fault tolerance is maintained in the Hadoop storage layer, which is called
HDFS.
---------------------------------------------------------------------------------------------------------------
Now, what about analytics?
All algorithms and logic run in Hadoop using MapReduce, popularly called MR jobs. We can
express all our logic using MapReduce, but there is some complexity in writing MapReduce
code.
Data Locality – Before Hadoop was developed, there was typically one master machine with
several slave machines, and all the data was kept with the master; data had to be moved to
wherever the code ran. Hadoop reverses this: the code is shipped to the nodes where the data
blocks already live, which is called data locality. Take a very common example, Word Count:
we have one file and we have to find the occurrence of every word in it. Each worker runs
the mapper code on its own blocks of the file.
Slave nodes are also called Worker Nodes.
The final result is the addition/aggregation of all the mapper outputs on the reducer side;
that aggregation logic is called the reducer code.
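The word-count flow above can be sketched in plain Python (this mimics the MapReduce idea only; real Hadoop jobs use its Java or streaming APIs):

```python
from collections import defaultdict

# MapReduce word count in miniature: mappers emit (word, 1) pairs from
# their own piece of the file, the framework groups pairs by word, and
# reducers add up the counts for each word.

def mapper(chunk):
    for line in chunk:
        for word in line.split():
            yield word, 1

def reducer(word, counts):
    return word, sum(counts)

chunks = [["big data big"], ["data data big"]]  # two "blocks" on two nodes

# Shuffle phase: group all (word, 1) pairs by word.
grouped = defaultdict(list)
for chunk in chunks:
    for word, one in mapper(chunk):
        grouped[word].append(one)

# Reduce phase: aggregate the grouped values.
result = dict(reducer(w, c) for w, c in grouped.items())
print(result)  # {'big': 3, 'data': 3}
```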
MapReduce can be written mainly in two languages: Java and Python. Java is the most
preferred, because MapReduce and Hadoop themselves are developed in Java.
Facebook came up with a solution to the problem of complex MapReduce programs: Hive. In Hive
we write a query in an SQL-like language; Hive converts the query into MapReduce jobs, runs
them, and gives us the result.
Hive is now part of the Hadoop ecosystem.
Yahoo also worked on this issue and developed something called Pig. In Pig we write scripts
in its own language, Pig Latin (not SQL); Pig converts the script into MapReduce jobs, runs
them, and gives us the result.
Another tool is Sqoop. It was developed to copy data between relational databases such as
MySQL and HDFS, in both directions.
Hadoop installation takes a lot of time and had many installation problems. For this, two
companies came forward: Cloudera and Hortonworks provide ready-made Hadoop solutions (the
two have since merged). We just install Cloudera and use its services. Cloudera is
essentially a Hadoop distribution; in the cloud era, Databricks has largely taken that
place.