
Hadoop Cluster

A Hadoop cluster is a collection of computers, known as nodes, that are networked together to
perform parallel computations on big data sets. Unlike other computer clusters, Hadoop clusters
are designed specifically to store and analyze massive amounts of structured and unstructured
data in a distributed computing environment.
Further distinguishing Hadoop ecosystems from other computer clusters are their unique
structure and architecture. Hadoop clusters consist of a network of connected master and slave
nodes built on low-cost, high-availability commodity hardware. The ability to scale linearly and
to quickly add or remove nodes as volume demands makes them well-suited to big data
analytics jobs with data sets that vary widely in size.
Hadoop Cluster Architecture

Hadoop clusters are composed of a network of master and worker nodes that orchestrate and
execute the various jobs across the Hadoop distributed file system. The master nodes typically
utilize higher quality hardware and include a NameNode, Secondary NameNode, and JobTracker,
with each running on a separate machine. The workers consist of virtual machines, running both
DataNode and TaskTracker services on commodity hardware, and do the actual work of storing
and processing the jobs as directed by the master nodes. The final part of the system is the
client nodes, which are responsible for loading the data and fetching the results.
Master nodes are responsible for storing data in HDFS and overseeing key operations, such as
running parallel computations on the data using MapReduce.
The worker nodes comprise most of the virtual machines in a Hadoop cluster, and perform the
job of storing the data and running computations. Each worker node runs the DataNode and
TaskTracker services, which receive instructions from the master nodes.
Client nodes are in charge of loading the data into the cluster. Client nodes first submit
MapReduce jobs describing how data needs to be processed, and then fetch the results once
the processing is finished.
Advantages of a Hadoop Cluster

•Hadoop clusters can boost the processing speed of many big data analytics jobs, given their ability to
break down large computational tasks into smaller tasks that can be run in a parallel, distributed
fashion.
•Hadoop clusters are easily scalable and can quickly add nodes to increase throughput and maintain
processing speed in the face of growing data volumes.
•The use of low-cost, high-availability commodity hardware makes Hadoop clusters relatively easy and
inexpensive to set up and maintain.
•Hadoop clusters replicate each data set across the distributed file system, making them resilient to
data loss and cluster failure.
•Hadoop clusters make it possible to integrate and leverage data from multiple different source
systems and data formats.
•It is possible to deploy Hadoop as a single-node installation for evaluation purposes.
Configuration modes:

Single Node (Local Mode or Standalone Mode)


Standalone mode is the default mode in which Hadoop runs. Standalone mode is mainly used
for debugging, and it does not use HDFS: both input and output go through the local file
system.
You also do not need any custom configuration in the files mapred-site.xml, core-site.xml, and
hdfs-site.xml.
Standalone mode is usually the fastest Hadoop mode, since it uses the local file system for all
input and output.
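As a rough illustration, a stock Hadoop download is already set up for standalone mode: the
configuration files ship essentially empty, so nothing needs to be edited. A sketch of what
etc/hadoop/core-site.xml looks like out of the box (exact contents vary by distribution):

    <?xml version="1.0"?>
    <!-- core-site.xml in standalone mode: no properties are set, so Hadoop -->
    <!-- falls back to the local file system and the local job runner. -->
    <configuration>
    </configuration>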
Pseudo-distributed Mode
The pseudo-distributed mode is also known as a single-node cluster where both NameNode and
DataNode will reside on the same machine.
In pseudo-distributed mode, all the Hadoop daemons run on a single node. This configuration
is mainly used for testing, when we do not need to think about resources and other users
sharing them.
In this architecture, a separate JVM is spawned for each Hadoop component, and the
components communicate across network sockets, effectively producing a fully functioning
mini-cluster on a single host.
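As a hedged sketch, the two files most often edited for pseudo-distributed mode follow the
Apache Hadoop single-node setup guide (the port and paths may differ in your distribution):

    <!-- etc/hadoop/core-site.xml: point the default file system at the local HDFS daemon -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

    <!-- etc/hadoop/hdfs-site.xml: a single DataNode can hold only one replica of each block -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>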
Cluster Types
Cluster load balancing: Load-balancing clusters are employed where network and internet
utilization is heavy, and they act as the fundamental mechanism for spreading that traffic. This
type of clustering offers increased network capacity and enhanced performance. The nodes
operate cohesively, with every node aware of the requests present in the network, so work can
be distributed evenly across them.
High-availability clusters: High-availability clusters (also known as HA clusters, fail-over
clusters, or Metroclusters Active/Active) are groups of computers that support server applications
that can be reliably utilized with a minimum amount of down-time. They operate by using
high availability software to harness redundant computers in groups or clusters that provide
continued service when system components fail. Without clustering, if a server running a particular
application crashes, the application will be unavailable until the crashed server is fixed. HA
clustering remedies this situation by detecting hardware/software faults, and immediately
restarting the application on another system without requiring administrative intervention, a
process known as failover.
High-performance clusters: High-performance computing (HPC) clusters pool the processing
power of many nodes so that workloads too large for any single machine can be split up and
run in parallel. Where high-availability clusters emphasize redundancy and uptime,
high-performance clusters emphasize raw throughput and computation speed; Hadoop's
MapReduce model applies this idea to large-scale data processing.
HIVE
What Is Hive
Hive is a data warehousing infrastructure based on Apache Hadoop. Hadoop provides massive
scale out and fault tolerance capabilities for data storage and processing on commodity
hardware.
Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large
volumes of data. It provides an SQL-like language that enables users to do ad-hoc querying,
summarization, and data analysis easily. At the same time, Hive's SQL gives users multiple places
to integrate their own functionality for custom analysis, such as User Defined Functions (UDFs).
Built In Operators
Relational Operators—The following operators compare the passed operands and generate a
TRUE or FALSE value, depending on whether the comparison between the operands holds or
not.
Arithmetic Operators—The following operators support various common arithmetic operations
on the operands. All of them return number types.
Logical Operators — The following operators provide support for creating logical expressions. All
of them return boolean TRUE or FALSE depending upon the boolean values of the operands.
Operators on Complex Types—The following operators provide mechanisms to access elements
in complex types such as arrays, maps, and structs.
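The sketch below touches each operator family in one query; the orders table and its columns
(including a properties map) are hypothetical, used only to show the syntax:

    SELECT userid,
           price * quantity AS total,           -- arithmetic operator
           properties['browser'] AS browser     -- complex-type operator: map element access
    FROM   orders
    WHERE  country = 'US'                       -- relational operator
      AND  (price > 100.0 OR quantity >= 5);    -- logical operators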
Data Units

In order of granularity, Hive data is organized into:


•Databases: Namespaces function to avoid naming conflicts for tables, views, partitions, columns, and so
on.  Databases can also be used to enforce security for a user or group of users.
•Tables: Homogeneous units of data which have the same schema. An example of a table could be a
page_views table, where each row comprises the following columns (schema), and for which a DDL
sketch appears after this list:
•timestamp—which is of INT type that corresponds to a UNIX timestamp of when the page was
viewed.
•userid —which is of BIGINT type that identifies the user who viewed the page.
•page_url—which is of STRING type that captures the location of the page.
•referer_url—which is of STRING type that captures the location of the page from where the user
arrived at the current page.
•IP—which is of STRING type that captures the IP address from where the page request was made.
•Partitions: Each table can have one or more partition keys, which determine how the data is stored.
Partitions—apart from being storage units—also allow the user to efficiently identify the rows that
satisfy specified criteria; for example, a date_partition of type STRING and a country_partition of type
STRING. Each unique value of the partition keys defines a partition of the table. For example, all "US"
data from "2009-12-23" is a partition of the page_views table. Therefore, if you run analysis on only the
"US" data for 2009-12-23, you can run that query only on the relevant partition of the table, thereby
speeding up the analysis significantly. Note, however, that just because a partition is named 2009-12-23
does not mean that it contains all or only data from that date; partitions are named after dates for
convenience, and it is the user's job to guarantee the relationship between partition name and data
content. Partition columns are virtual columns; they are not part of the data itself but are derived on
load.
•Buckets (or Clusters): Data in each partition may in turn be divided into buckets, based on the value of
a hash function of some column of the table. For example, the page_views table may be bucketed by
userid, which is one of its columns other than the partition columns. Buckets can be used to efficiently
sample the data.
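Tying these units together, here is a hedged DDL sketch for the page_views example; the
column and partition names follow the text, while the database name and bucket count are
illustrative:

    CREATE DATABASE IF NOT EXISTS analytics;    -- a database acts as a namespace

    CREATE TABLE analytics.page_views (
        `timestamp`  INT,       -- UNIX timestamp of the page view (backticks: reserved word)
        userid       BIGINT,    -- user who viewed the page
        page_url     STRING,    -- location of the page
        referer_url  STRING,    -- page the user arrived from
        ip           STRING     -- address the request was made from
    )
    PARTITIONED BY (date_partition STRING, country_partition STRING)  -- virtual columns
    CLUSTERED BY (userid) INTO 32 BUCKETS;      -- hash-bucketed for efficient sampling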
Language Capabilities
Hive's SQL provides the basic SQL operations. These operations work on tables or partitions. These operations are:
Ability to filter rows from a table using a WHERE clause.
Ability to select certain columns from the table using a SELECT clause.
Ability to do equi-joins between two tables.
Ability to evaluate aggregations on multiple "group by" columns for the data stored in a table.
Ability to store the results of a query into another table.
Ability to download the contents of a table to a local (for example, NFS) directory.
Ability to store the results of a query in a hadoop dfs directory.
Ability to manage tables and partitions (create, drop and alter).
Ability to plug in custom scripts in the language of choice for custom map/reduce jobs.
Usage and Examples

Creating, Showing, Altering, and Dropping Tables
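A minimal sketch of the basic DDL statements; the events table and its columns are illustrative:

    CREATE TABLE events (ts BIGINT, userid BIGINT, page_url STRING);

    SHOW TABLES;                                        -- list tables in the current database
    ALTER TABLE events ADD COLUMNS (country STRING);    -- evolve the schema
    ALTER TABLE events RENAME TO site_events;           -- rename the table
    DROP TABLE site_events;                             -- remove the table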


Loading Data
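A hedged sketch of loading a file into the partitioned page_views table defined earlier (the path
and partition values are illustrative; drop LOCAL for files already in HDFS):

    LOAD DATA LOCAL INPATH '/tmp/page_views.txt'
    OVERWRITE INTO TABLE page_views
    PARTITION (date_partition = '2009-12-23', country_partition = 'US');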
Querying and Inserting Data
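Illustrative statements exercising the capabilities listed above; the daily_counts table is assumed
to already exist with a matching schema:

    -- filter rows and select columns
    SELECT userid, page_url
    FROM   page_views
    WHERE  date_partition = '2009-12-23' AND country_partition = 'US';

    -- aggregate with GROUP BY and store the result in another table
    INSERT OVERWRITE TABLE daily_counts
    SELECT userid, COUNT(*) AS views
    FROM   page_views
    GROUP  BY userid;

    -- download query results to a local directory
    INSERT OVERWRITE LOCAL DIRECTORY '/tmp/us_views'
    SELECT * FROM page_views WHERE country_partition = 'US';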
Apache Oozie
Apache Oozie is a tool with which all sorts of programs can be pipelined in a desired order to
run in Hadoop's distributed environment. Oozie also provides a mechanism to run a job on a
given schedule.
One of the main advantages of Oozie is that it is tightly integrated with the Hadoop stack,
supporting various Hadoop jobs such as Hive, Pig, and Sqoop, as well as system-specific jobs
such as Java and shell actions.
Oozie is an Open Source Java Web-Application available under Apache license 2.0. It is
responsible for triggering the workflow actions, which in turn uses the Hadoop execution engine
to actually execute the task. Hence, Oozie is able to leverage the existing Hadoop machinery for
load balancing, fail-over, etc.
Oozie detects completion of tasks through callback and polling. When Oozie starts a task, it
provides the task with a unique callback HTTP URL, and the task notifies that URL when it is
complete. If the task fails to invoke the callback URL, Oozie can poll the task for completion.
The following three types of jobs are common in Oozie −
Oozie Workflow Jobs − These are represented as Directed Acyclic Graphs (DAGs) to specify a
sequence of actions to be executed.
Oozie Coordinator Jobs − These consist of workflow jobs triggered by time and data availability.
Oozie Bundle − These can be referred to as a package of multiple coordinator and workflow
jobs.
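A minimal workflow DAG sketch in standard Oozie workflow XML; the workflow name, the
output path, and the nameNode property (normally supplied via job.properties) are
assumptions:

    <workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
      <start to="clean-output"/>
      <action name="clean-output">
        <fs>
          <!-- delete the output directory before a downstream job re-creates it -->
          <delete path="${nameNode}/user/demo/output"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>Workflow failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
      </kill>
      <end name="end"/>
    </workflow-app>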
What is Flume?

Apache Flume is a tool/service/data ingestion mechanism for collecting, aggregating, and
transporting large amounts of streaming data, such as log files and events, from various
sources to a centralized data store.
Flume is a highly reliable, distributed, and configurable tool. It is principally designed to copy
streaming data (log data) from various web servers to HDFS.
Applications of Flume

Assume an e-commerce web application wants to analyze customer behavior from a
particular region. To do so, it would need to move the available log data into Hadoop for
analysis. Here, Apache Flume comes to the rescue: it moves the log data generated by
application servers into HDFS at high speed, as the sketch below shows.
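A hedged sketch of such an agent in Flume's standard properties format; the agent name, log
path, and NameNode URI are assumptions:

    # one source (tail a log file), one in-memory channel, one HDFS sink
    agent.sources  = r1
    agent.channels = c1
    agent.sinks    = k1

    agent.sources.r1.type = exec
    agent.sources.r1.command = tail -F /var/log/app/access.log
    agent.sources.r1.channels = c1

    agent.channels.c1.type = memory
    agent.channels.c1.capacity = 10000

    agent.sinks.k1.type = hdfs
    agent.sinks.k1.channel = c1
    agent.sinks.k1.hdfs.path = hdfs://namenode:9000/flume/logs/%Y-%m-%d
    agent.sinks.k1.hdfs.fileType = DataStream
    # useLocalTimeStamp lets the %Y-%m-%d escapes resolve without an interceptor
    agent.sinks.k1.hdfs.useLocalTimeStamp = true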
Advantages of Flume

Using Apache Flume we can store the data into any of the centralized stores (HBase, HDFS).
When the rate of incoming data exceeds the rate at which data can be written to the
destination, Flume acts as a mediator between data producers and the centralized stores and
provides a steady flow of data between them.
Flume provides the feature of contextual routing.
Transactions in Flume are channel-based: two transactions (one for the sender and one for the
receiver) are maintained for each message, which guarantees reliable message delivery.
Flume is reliable, fault tolerant, scalable, manageable, and customizable.
Features of Flume
Flume ingests log data from multiple web servers into a centralized store (HDFS, HBase)
efficiently.
Using Flume, we can get the data from multiple servers immediately into Hadoop.
Along with the log files, Flume is also used to import huge volumes of event data produced by
social networking sites like Facebook and Twitter, and e-commerce websites like Amazon and
Flipkart.
Flume supports a large set of source and destination types.
Flume supports multi-hop flows, fan-in fan-out flows, contextual routing, etc.
Flume can be scaled horizontally.
