
UNIVERSITY INSTITUTE OF COMPUTING

MASTER OF COMPUTER APPLICATIONS


Big Data Analytics
22CAH-782

DISCOVER . LEARN . EMPOWER


Outlines
• History of Hadoop
• Apache Hadoop
• Analysing Data with Unix tools
• Analysing Data with Hadoop
• Hadoop Streaming, Hadoop Ecosystem
• IBM Big Data Strategy

Hadoop: Introduction

Hadoop is an open-source software framework that is used for storing and processing large amounts of data in a distributed computing environment. It is designed to handle big data and is based on the MapReduce programming model, which allows for the parallel processing of large datasets.

Hadoop: Introduction

Features of Hadoop:
1. It is fault tolerant.
2. It is highly available.
3. It is easy to program.
4. It offers huge, flexible storage.
5. It is low cost.

What is Hadoop?
• Hadoop is a framework that uses distributed storage and parallel
processing to store and manage big data. It is the software most used
by data analysts to handle big data, and its market size continues to
grow. There are three components of Hadoop:
• Hadoop HDFS - Hadoop Distributed File System (HDFS) is the
storage unit.
• Hadoop MapReduce - Hadoop MapReduce is the processing unit.
• Hadoop YARN - Yet Another Resource Negotiator (YARN) is a
resource management unit.

What is Apache Hadoop?
Apache Hadoop is an open source, Java-based software platform that
manages data processing and storage for huge data applications. The
platform works by distributing Hadoop big data and analytics jobs
across nodes in a computing cluster, breaking them down into smaller
workloads that can be run in parallel.
• Hadoop is written in Java and is not OLAP (online analytical processing). It is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many more. Moreover, it can be scaled up just by adding nodes to the cluster.
History of Hadoop
The Apache Software Foundation is the developer of Hadoop, and its co-founders are Doug Cutting and Mike Cafarella. Co-founder Doug Cutting named it after his son's toy elephant. In October 2003, Google published its Google File System paper. In January 2006, MapReduce development started on Apache Nutch, which consisted of around 6,000 lines of code for MapReduce and around 5,000 lines of code for HDFS. In April 2006, Hadoop 0.1.0 was released.

(Photos: Doug Cutting and Mike Cafarella)
Modules of Hadoop
• There are three core components of Hadoop as mentioned earlier.
They are HDFS, MapReduce, and YARN. These together form
the Hadoop framework architecture.
Modules of Hadoop: Description
• HDFS (Hadoop Distributed File System):
• It is the data storage system. Since the data sets are huge, it uses a distributed system to store the data. Data is stored in blocks, where each block is 128 MB by default. HDFS consists of a NameNode and DataNodes: there is only one active NameNode but multiple DataNodes (a short command-line sketch follows the feature list below).
• Features:
• The storage is distributed to handle a large data pool
• Blocks are replicated across nodes, increasing data durability and availability
• It is fault-tolerant: if one node or block replica fails, other replicas can take over
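As a minimal command-line sketch (the file name and HDFS paths here are illustrative, not from the slides), the block layout described above can be inspected with the standard HDFS shell and fsck tools:

# Copy a local file into HDFS (illustrative paths)
hdfs dfs -mkdir -p /user/demo
hdfs dfs -put weather-1901.gz /user/demo/

# Show how the file was split into blocks and where each replica lives
hdfs fsck /user/demo/weather-1901.gz -files -blocks -locations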
Modules of Hadoop: Description
• MapReduce:
• The MapReduce framework is the processing unit. All data is distributed and processed in parallel. A MasterNode distributes data amongst the SlaveNodes; the SlaveNodes do the processing and send the results back to the MasterNode.
• Features:
• Consists of two phases, the Map phase and the Reduce phase.
• Processes big data faster, with multiple nodes working on different parts of the data in parallel
Modules of Hadoop: Description
• YARN (Yet Another Resource Negotiator):
• It is the resource management unit of the Hadoop framework. Data stored in HDFS can be processed with the help of YARN by different data processing engines, such as batch, interactive, or stream processing, so it can support almost any sort of data analysis (a small command-line sketch follows the feature list).
• Features:
• It acts as a kind of operating system for the cluster, managing the resources used to process the data stored on HDFS
• It schedules tasks so that no single node in the system is overloaded
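As a small, hedged sketch of YARN in its role as resource manager (no cluster-specific names assumed), the standard YARN command-line tool can list the nodes that contribute resources and the applications currently scheduled on them:

# Show the NodeManagers that contribute CPU and memory to the cluster
yarn node -list

# Show the applications currently known to the ResourceManager
yarn application -list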
Analyzing data with UNIX
• To understand how to analyze data with Unix tools, a weather dataset is used.
Weather sensors collect readings continuously at many locations across the globe and gather a large volume of log data, which is a good candidate for analysis with MapReduce because we want to process all of the data, and the data is record-oriented and semi-structured. The data used is from the National Climatic Data Center (NCDC). It is stored in a line-oriented ASCII format, in which each line is a record. Data files are organized by date and weather station.
Structure of NCDC record

Fig: Structure of the weather dataset. (Credits: Hadoop: The Definitive Guide, Third Edition, by Tom White)
• So now we’ll find out the highest recorded global temperature in the
dataset (for each year) using Unix. The classic tool for processing line-
oriented data is awk.
Small script to find the maximum temperature for each year in
NCDC data
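The script itself is reconstructed below from the step-by-step description that follows (it mirrors the version in Hadoop: The Definitive Guide); the all/ directory is assumed to hold one gzipped NCDC file per year:

#!/usr/bin/env bash
# For each year's gzipped NCDC file, print the year and the maximum valid temperature
for year in all/*
do
  echo -ne `basename $year .gz`"\t"
  gunzip -c $year | \
    awk '{ temp = substr($0, 88, 5) + 0;   # air temperature field
           q = substr($0, 93, 1);          # quality code
           if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
done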
To understand the script and its functionality:

1. The script begins with the shebang #!/usr/bin/env bash, which specifies the
interpreter to be used for executing the script (in this case, Bash).
2. The script uses a for loop to iterate over the files in the all/ directory. Each file
corresponds to a year's weather records.
3. Within the loop, echo is used to print the year by extracting it from the filename
using the basename command.
4. The gunzip -c $year command decompresses the file and outputs its contents to
the standard output.
5. The output of gunzip is then piped (|) to the awk command for further
processing.
6. The awk script inside the curly braces {} performs the data extraction and
analysis. It extracts two fields from each line of the data: the air temperature and the
quality code.
7. The extracted air temperature is converted into an integer by adding 0 (temp =
substr($0, 88, 5) + 0).
8. Conditions are checked to determine if the temperature is valid and the quality
code indicates a reliable reading. Specifically, it checks if the temperature is not
equal to 9999 (which represents a missing value) and if the quality code matches the
pattern [01459].
9. If the temperature passes the validity check and is greater than the current
maximum temperature (temp > max), the max variable is updated with the new
maximum value.
10. After processing all the lines in the file, the END block is executed, and it prints
the maximum temperature found (print max).
The temperature values in the source file are scaled by a factor of
10, so a value of 317 corresponds to a maximum temperature of
31.7°C for the year 1901.
→ This script serves as a baseline for performance comparison and
demonstrates how Unix tools like awk can be utilized for data
analysis tasks without relying on Hadoop or other distributed
computing frameworks.
→ The script’s author mentions that the complete run for the entire
century took 42 minutes on a single EC2 High-CPU Extra Large
Instance.
Analysing Data with Hadoop
• To take advantage of the parallel processing that Hadoop provides, we
need to express our query as a MapReduce job. After some local,
small-scale testing, we will be able to run it on a cluster of machines.
• MapReduce works by breaking the processing into two phases: the
map phase and the reduce phase.
Analysing Data with Hadoop

Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function. The input to our map phase is the raw NCDC data. We choose a text input format that gives us each line in the dataset as a text value. The key is the offset of the beginning of the line from the beginning of the file, but as we have no need for this, we ignore it.
Analysing Data with Hadoop
• To visualize the way the map works, consider the following sample
lines of input data (some unused columns have been dropped to fit the
page, indicated by ellipses):
• 0067011990999991950051507004…9999999N9+00001+99999999999…
0043011990999991950051512004…9999999N9+00221+99999999999…
0043011990999991950051518004…9999999N9-00111+99999999999…
0043012650999991949032412004…0500001N9+01111+99999999999…
0043012650999991949032418004…0500001N9+00781+99999999999…
Analysing Data with Hadoop
• These lines are presented to the map function as the key-value
pairs:
• (0, 0067011990999991950051507004…9999999N9+00001+99999999999…)
(106, 0043011990999991950051512004…9999999N9+00221+99999999999…)
(212, 0043011990999991950051518004…9999999N9-00111+99999999999…)
(318, 0043012650999991949032412004…0500001N9+01111+99999999999…)
(424, 0043012650999991949032418004…0500001N9+00781+99999999999…)
Analysing Data with Hadoop
• The keys are the line offsets within the file, which we ignore in our map function. The map function merely extracts the year and the air temperature, and emits them as its output (the temperature values have been interpreted as integers):
• (1950, 0)
• (1950, 22)
• (1950, −11)
• (1949, 111)
• (1949, 78)
Analysing Data with Hadoop
• The output from the map function is processed by the MapReduce framework
before being sent to the reduce function. This processing sorts and groups the key-
value pairs by key. So, continuing the example, our reduce function sees the
following input:
• (1949, [111, 78])
• (1950, [0, 22, −11])
• Each year appears with a list of all its air temperature readings. All the reduce
function has to do now is iterate through the list and pick up the maximum
reading:
• (1949, 111)
• (1950, 22)
Analysing Data with Hadoop

This is the final output: the maximum global temperature recorded in each year.

(Credits: Hadoop: The Definitive Guide, Third Edition, by Tom White)
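In the book, this map and reduce logic is implemented as Java classes that Hadoop runs on the cluster. As a hedged sketch of the same data flow only (not the book's actual code), the two functions can also be written as small scripts that read lines from stdin and write tab-separated key-value pairs to stdout; the script names are illustrative, and the field offsets follow the NCDC record format used earlier. This style is exactly what Hadoop Streaming, introduced next, expects.

#!/usr/bin/env bash
# max_temp_map.sh (illustrative name): emit "year<TAB>temperature" for each valid record
awk '{
  year = substr($0, 16, 4);        # year field of the NCDC record
  temp = substr($0, 88, 5) + 0;    # air temperature, forced to a number
  q    = substr($0, 93, 1);        # quality code
  if (temp != 9999 && q ~ /[01459]/) print year "\t" temp
}'

#!/usr/bin/env bash
# max_temp_reduce.sh (illustrative name): keep the maximum temperature seen for each year
awk -F '\t' '{
  if (!($1 in max) || $2 + 0 > max[$1]) max[$1] = $2 + 0
}
END { for (y in max) print y "\t" max[y] }'

Locally, the shuffle-and-sort step between the two phases can be mimicked with the Unix sort command (sample.txt is a placeholder for a few NCDC records):

cat sample.txt | ./max_temp_map.sh | sort | ./max_temp_reduce.sh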


Introduction to Hadoop Streaming
• Hadoop Streaming uses UNIX standard streams as the interface between Hadoop and your program, so you can write MapReduce programs in any language that can read standard input and write to standard output. Hadoop offers several mechanisms to help non-Java development.
• The primary mechanisms are Hadoop Pipes, which gives a native C++ interface to Hadoop, and Hadoop Streaming, which permits any program that uses standard input and output to be used for map tasks and reduce tasks.
• With this utility, one can create and run MapReduce jobs with any executable or script as the mapper and/or the reducer, for example:
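As a hedged sketch, a streaming job that uses the mapper and reducer scripts sketched in the previous section could be submitted as follows; the streaming jar location varies with the Hadoop version and installation, and the HDFS input/output paths are placeholders:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -files max_temp_map.sh,max_temp_reduce.sh \
  -input /user/demo/ncdc/input \
  -output /user/demo/ncdc/output \
  -mapper max_temp_map.sh \
  -reducer max_temp_reduce.sh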
Hadoop Streaming
• The Hadoop MapReduce framework is written in Java and natively supports writing map/reduce programs in Java. However, Hadoop also provides an API for writing MapReduce programs in languages other than Java.
• Hadoop Streaming is the utility that allows us to create and run MapReduce jobs with any script or executable as the mapper or the reducer. It uses Unix streams as the interface between Hadoop and our MapReduce program, so we can use any language that can read standard input and write to standard output to write our MapReduce program.
How Hadoop Streaming works.

• The mapper and the reducer are the scripts (or executables) that read the input line by line from stdin and emit the output to stdout.
How Hadoop Streaming works.
• The utility creates a MapReduce job, submits the job to an appropriate cluster, and monitors the progress of the job until it completes.
• When a script is specified for the mappers, each mapper task launches the script as a separate process when the mapper is initialized.
• The mapper task converts its inputs (key-value pairs) into lines and pushes the lines to the standard input of the process. Meanwhile, the mapper collects the line-oriented outputs from the standard output of the process and converts each line into a key-value pair, which is collected as the output of the mapper.
How Hadoop Streaming works.
• When a reducer script is specified, each reducer task launches the script as a separate process when the reducer is initialized.
• As the reducer task runs, it converts its input key-value pairs into lines and feeds the lines to the standard input of the process. Meanwhile, the reducer gathers the line-oriented outputs from the stdout of the process and converts each line into a key-value pair, which is then collected as the output of the reducer.
• For both the mapper and the reducer, the prefix of a line up to the first tab character is the key, and the rest of the line (excluding the tab character) is the value. If there is no tab character in the line, the entire line is considered the key and the value is null. This behaviour can be customized, for example with the stream.map.output.field.separator and stream.num.map.output.key.fields settings, as shown below.
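For example, a hedged sketch of such customization (placeholder paths; /bin/cat used as an identity mapper and reducer) treats everything up to the fourth "." in a map output line as the key:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -D stream.map.output.field.separator=. \
  -D stream.num.map.output.key.fields=4 \
  -input /user/demo/input \
  -output /user/demo/output \
  -mapper /bin/cat \
  -reducer /bin/cat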
Hadoop Ecosystem
• Overview: Apache Hadoop is an open-source framework intended to make working with big data easier. For those who are not acquainted with this technology, one question arises: what is big data? Big data is a term given to data sets that cannot be processed efficiently with traditional approaches such as an RDBMS. Hadoop has made its place in industries and companies that need to work on large data sets which are sensitive and need efficient handling. Hadoop is a framework that enables processing of large data sets which reside in the form of clusters. Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.
The Hadoop Ecosystem is a platform, or a suite, which provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions. There are four major elements of Hadoop, i.e. HDFS, MapReduce, YARN, and Hadoop Common. Most of the tools or solutions are used to supplement or support these major elements. All these tools work collectively to provide services such as ingestion, analysis, storage, and maintenance of data.
Following are the components that collectively form a Hadoop
ecosystem:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data Processing
• Spark: In-Memory data processing
Hadoop Ecosystem
• PIG, HIVE: Query based processing of data services
• HBase: NoSQL Database
• Mahout, Spark MLLib: Machine Learning algorithm libraries
• Solr, Lucene: Searching and indexing
• Zookeeper: Managing cluster
• Oozie: Job Scheduling
Hadoop Ecosystem

Note: Apart from the above-mentioned components, there are many other components too that are part of the
Hadoop ecosystem.
All these toolkits or components revolve around one thing, i.e. data. That is the beauty of Hadoop: it revolves around data, which makes its processing and analysis easier.
Hadoop Ecosystem
HDFS:
HDFS is the primary or major component of the Hadoop ecosystem and is responsible for storing large data sets of structured or unstructured data across various nodes, while maintaining the metadata in the form of log files.
• HDFS consists of two core components, i.e.
• NameNode
• DataNode
• The NameNode is the prime node and contains the metadata (data about data); it requires comparatively fewer resources than the DataNodes, which store the actual data. The DataNodes run on commodity hardware in the distributed environment, which undoubtedly makes Hadoop cost-effective.
• HDFS maintains all the coordination between the clusters and hardware, thus
working at the heart of the system.
Hadoop Ecosystem
• YARN:
Yet Another Resource Negotiator, as the name implies, is the component that helps to manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
• It consists of three major components, i.e.
• Resource Manager
• Node Manager
• Application Master
• The Resource Manager has the privilege of allocating resources for the applications in the system, whereas the Node Managers manage the resources (such as CPU, memory, and bandwidth) on each machine and report back to the Resource Manager. The Application Master works as an interface between the Resource Manager and the Node Managers and performs negotiations as per the requirements of the two.
Hadoop Ecosystem
• MapReduce:
By making use of distributed and parallel algorithms, MapReduce makes it possible to carry the processing logic over to where the data resides and helps to write applications which transform big data sets into manageable ones.
• MapReduce makes use of two functions, i.e. Map() and Reduce(), whose tasks are:
• Map() performs filtering and sorting of the data and thereby organizes it into groups. Map generates key-value pair based results which are later processed by the Reduce() method.
• Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
Hadoop Ecosystem
• PIG:
• Pig was developed by Yahoo. It works on Pig Latin, a query-based language similar to SQL.
• It is a platform for structuring the data flow and for processing and analyzing huge data sets.
• Pig does the work of executing commands, and in the background all the activities of MapReduce are taken care of. After the processing, Pig stores the result in HDFS.
• The Pig Latin language is specially designed for this framework, which runs on Pig Runtime, just the way Java runs on the JVM.
• Pig helps to achieve ease of programming and optimization and hence is a major segment of the Hadoop ecosystem.
Hadoop Ecosystem
• HIVE:
With the help of an SQL-like methodology and interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
• It is highly scalable, as it allows both real-time and batch processing. Also, all the SQL data types are supported by Hive, making query processing easier.
• Like other query-processing frameworks, HIVE comes with two components: JDBC drivers and the HIVE command line.
• The JDBC and ODBC drivers establish the connection and data storage permissions, whereas the HIVE command line helps in the processing of queries.
Hadoop Ecosystem
• Mahout:
• Mahout provides machine learning capability to a system or application. Machine learning, as the name suggests, helps a system to develop itself based on patterns, user/environmental interaction, or algorithms.
• It provides various libraries and functionalities such as collaborative filtering, clustering, and classification, which are core concepts of machine learning. It allows invoking algorithms as per our need with the help of its own libraries.
Hadoop Ecosystem
• Apache Spark:
• It is a platform that handles all the processing-intensive tasks like batch processing, interactive or iterative real-time processing, graph conversions, and visualization.
• It uses in-memory resources and is hence faster than the prior engines in terms of optimization.
• Spark is best suited for real-time data, whereas Hadoop is best suited for structured data or batch processing; hence both are used in most companies interchangeably.
Hadoop Ecosystem
• Apache HBase:
• It is a NoSQL database which supports all kinds of data and is thus capable of handling anything in a Hadoop database. It provides the capabilities of Google's BigTable and is thus able to work on big data sets effectively.
• At times when we need to search or retrieve the occurrences of something small in a huge database, the request must be processed within a short span of time. At such times, HBase comes in handy, as it gives us a fault-tolerant way of storing sparse data.
What is IBM?
● International Business Machines Corporation is an American
multinational technology corporation headquartered in Armonk, New
York, with operations in over 171 countries.
● CEO: Arvind Krishna (6 Apr 2020)
● Headquarters: Armonk, New York, United States
● Founder: Charles Ranlett Flint
What is the IBM Big Data Strategy?
❖ IBM, a US-based computer hardware and software manufacturer, had implemented a Big Data strategy:
• The company offered solutions to store, manage, and analyze the huge amounts of data generated daily and equipped large and small companies to make informed business decisions.
❖ The company believed that its Big Data and analytics products and services would help its clients become more competitive and drive growth.
Big Data Strategy
❖ Issues: Understand the concept of Big Data and its importance to large, medium, and small companies in the current industry scenario.
❖ Analyze the Big Data strategy of IBM.
❖ Explore ways in which IBM's Big Data strategy could be improved further.
IBM Big Data Strategy: Brief
IBM Big Data Strategy: Move the Analytics Closer to the Data
• New analytic applications drive the requirements for a big data platform:
• Integrate and manage the full variety, velocity, and volume of data
• Apply advanced analytics to information in its native form
• Visualize all available data for ad hoc analysis
• Development environment for building new analytic applications
• Workload optimization and scheduling
• Security and governance
InfoSphere BigInsights
• It enables companies to turn complex information sets into insight and to do so at
Internet scale. InfoSphere BigInsights is an analytics platform that provides
unique capabilities from IBM emerging technologies, IBM research technologies
and IBM software built on top of an Apache Hadoop open-source framework to
deliver a platform that is business-ready to accelerate the time to value. In
addition to core capabilities for installation, configuration and management,
InfoSphere BigInsights includes advanced analytics and user interfaces for the
non-developer business analyst. Flexible enough to be used for unstructured or
semi-structured information, the solution does not require schema definitions or
data preprocessing and allows for structure and associations to be added on the fly
across information types. The platform runs on commonly available, low-cost hardware in parallel, supporting linear scalability; as information grows, you simply add more commodity hardware. It complements existing solutions.
InfoSphere BigSheets
• BigSheets is a spreadsheet-style tool for business analysts provided with IBM
InfoSphere BigInsights, a platform based on the open source Apache Hadoop
project. BigSheets enables non-programmers to iteratively explore, manipulate,
and visualize data stored in your distributed file system.
• BigSheets translates user commands, expressed through a graphical interface, into
Pig scripts executed against a subset of the underlying data. In this manner, an
analyst can iteratively explore various transformations efficiently. When satisfied,
the user can save and run the workbook, which causes BigSheets to initiate
MapReduce jobs over the full set of data, write the results to the distributed file
system, and display the contents of the new workbook. [1]
• BigSheets can process huge amounts of data because user commands, expressed through a graphical interface, are translated into Pig scripts and can be run as MapReduce jobs in parallel on many nodes.
THANK YOU

