Describe The Functions and Features of HDP
The Ambari Metrics System ("AMS") is a system for collecting, aggregating and
serving Hadoop and system metrics in Ambari-managed clusters.
Ambari User Interface: a web-based interface that allows users to easily interact with the system.
Ambari Architecture
Ambari Server: contains or interacts with the following components:
Ambari terminology
Service: Service refers to services in the Hadoop stack. HDFS, HBase, and Pig are examples of services.
Component: A service consists of one or more components. For example, HDFS has 3 components: NameNode,
DataNode and Secondary NameNode.
Node/Host: Node refers to a machine in the cluster. Node and host are used interchangeably in this document.
Node-Component: Node-component refers to an instance of a component on a particular node.
Operation: An operation refers to a set of changes or actions performed on a cluster to satisfy a user request or to
achieve a desirable state change in the cluster.
Task: Task is the unit of work that is sent to a node to execute. A task is the work that node has to carry out as part
of an action.
Stage: A stage refers to a set of tasks that are required to complete an operation and are independent of each other;
all tasks in the same stage can be run across different nodes in parallel.
Action: An 'action' consists of a task or tasks on a machine or a group of machines. Each action is tracked by an
action id and nodes report the status at least at the granularity of the action.
Stage Plan: An operation typically consists of multiple tasks on various machines, and they usually have
dependencies requiring them to run in a particular order; the stage plan is the arrangement of an operation's tasks
into an ordered sequence of stages that satisfies those dependencies.
Manifest: Manifest refers to the definition of a task which is sent to a node for execution.
Role: A role maps to either a component (for example, NameNode, DataNode) or an action (for example, HDFS
rebalancing, HBase smoke test, other admin commands, etc.)
MapReduce and YARN
The Distributed File System (DFS)
• Driving principles
data is stored across the entire cluster
programs are brought to the data, not the data to the program
• Data is stored across the entire cluster (the DFS)
the entire cluster participates in the file system
blocks of a single file are distributed across the cluster
a given block is typically replicated as well for resiliency
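This distribution can be observed directly through the Hadoop FileSystem API. Below is a minimal sketch, assuming a reachable HDFS and a hypothetical path /user/demo/data.txt, that prints each block of a file along with the hosts holding its replicas:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    // Connect using the cluster configuration found on the classpath.
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/user/demo/data.txt")); // hypothetical path

    // One BlockLocation per block: its offset, length, and replica hosts.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d replicas=%s%n",
          block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
    }
    fs.close();
  }
}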
Shuffle
Shuffle phase
• The output of each mapper is locally grouped together by key
• One node is chosen to process data for each unique key
• All of this movement (shuffle) of data is transparently orchestrated by MapReduce
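Which node handles which key is decided by the partitioner. A minimal sketch, assuming a job configured with four reduce tasks, uses Hadoop's default HashPartitioner to show that every occurrence of a key is routed to the same reducer:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class PartitionDemo {
  public static void main(String[] args) {
    HashPartitioner<Text, IntWritable> partitioner = new HashPartitioner<>();
    int numReduceTasks = 4; // assumption: the job was configured with 4 reducers

    for (String k : new String[] {"alpha", "beta", "alpha"}) {
      // HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks,
      // so "alpha" always lands on the same reduce task.
      int reducer = partitioner.getPartition(new Text(k), new IntWritable(1), numReduceTasks);
      System.out.println(k + " -> reducer " + reducer);
    }
  }
}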
Reduce
Reducers
Typically small programs that aggregate all of the values for the key
that they are responsible for
Each reducer writes output to its own file
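As a concrete illustration, here is a minimal reducer sketch in the classic word-count style: it sums all of the values received for its key, and its output lands in that reducer's own part-r-NNNNN file in the job output directory.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get(); // aggregate every value seen for this key
    }
    result.set(sum);
    context.write(key, result); // one output record per key, written to this reducer's file
  }
}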
Combiner
Combiner (Optional)
• The data that will go to each reduce node is sorted and merged before being sent, doing part of the receiving
reduce node's work in advance in order to minimize network traffic between map and reduce nodes; the driver sketch
below shows how a combiner is enabled.
1. First, the MapReduce program you have written tells the Job Client to run a MapReduce job.
2. The Job Client sends a message to the JobTracker, which produces a unique ID for the job.
3. The Job Client copies job resources, such as the jar file containing the Java code
you have written to implement the map or the reduce task, to the shared file system, usually HDFS.
4. Once the resources are in HDFS, the Job Client can tell the JobTracker to start the job.
5. The JobTracker does its own initialization for the job. It calculates how to split the data so that it can send each
"split" to a different mapper process to maximize throughput.
6. It retrieves these "input splits" (the metadata describing the splits) from the distributed file system, not the data itself.
7. The TaskTrackers are continually sending heartbeat messages to the JobTracker. Now that the JobTracker has
work for them, it will return a map task or a reduce task as a response to the heartbeat.
8. The TaskTrackers need to obtain the code to execute, so they get it from the shared file system.
9. Then they can launch a Java Virtual Machine with a child process running in it, and this child process runs your
map code or your reduce code. The result of the map operation remains on the local disk of the given
TaskTracker node (not in HDFS).
10. The output of the Reduce task is stored in HDFS using the number of copies specified by the replication
factor.
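The client-side steps (1 through 4) correspond to a short driver program. Below is a minimal sketch using the current org.apache.hadoop.mapreduce API; the walk-through above describes the classic JobTracker/TaskTracker runtime, but the submission steps are the same. Paths are hypothetical, and TokenizerMapper/IntSumReducer refer to the mapper and reducer sketches in this section.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count"); // step 1: ask the client to run a job
    job.setJarByClass(WordCountDriver.class); // step 3: this jar is copied to the shared file system

    job.setMapperClass(TokenizerMapper.class);   // mapper sketch shown after the Classes list below
    job.setCombinerClass(IntSumReducer.class);   // optional combiner (see the Combiner section above)
    job.setReducerClass(IntSumReducer.class);    // reducer sketch shown earlier
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // hypothetical paths
    FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

    // Step 4 onward: submit the job and wait; the framework handles steps 5-10.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}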
Classes
• There are three main Java classes provided in Hadoop to read data in MapReduce:
InputSplit: a division of the input; large files are divided into splits
-Splits are normally the block size, but this depends on the number of requested Map
tasks, whether the compression codec allows splitting, etc.
RecordReader: takes a split and reads the file into records
-For example, one record per line (LineRecordReader)
-But note that a record can be split across splits
InputFormat: computes the splits and supplies the RecordReader that turns each record into a
<key, value> pair, which is then passed to the Map task
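To make the hand-off concrete, here is a minimal mapper sketch consuming the records produced by TextInputFormat's LineRecordReader: the key is the line's byte offset (LongWritable) and the value is the line text (Text). This is the TokenizerMapper referenced in the driver sketch above.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Each call receives one <key, value> record from the RecordReader.
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE); // emit <word, 1> for the shuffle to group by key
      }
    }
  }
}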
The Scheduler is responsible for allocating resources to the various running applications, subject to the familiar
constraints of capacities and queues.
The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for
executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster
container on failure.
The NodeManager is the per-machine framework agent that is responsible for containers, monitoring their resource
usage (CPU, memory, disk, network) and reporting the same to the ResourceManager/Scheduler.
The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from
the Scheduler, tracking their status, and monitoring progress.
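A small client-side sketch illustrates the ApplicationsManager's role: the first thing any YARN application does is ask the ResourceManager for a new application id. This is a sketch only; it assumes a reachable cluster configured via yarn-site.xml on the classpath, and it requests an id without submitting a real ApplicationMaster.

import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnAppIdDemo {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // The ApplicationsManager accepts the request and hands back an application id;
    // negotiating an ApplicationMaster container would be the next step.
    YarnClientApplication app = yarnClient.createApplication();
    System.out.println("Granted application id: "
        + app.getNewApplicationResponse().getApplicationId());

    yarnClient.stop();
  }
}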
YARN features
• Scalability
• Multi-tenancy
• Compatibility
• Serviceability
• Higher cluster utilization
• Reliability/Availability
---------------------------------------------------------------------------------------------------------------------
Apache Spark
List the purpose of Apache Spark in the Hadoop ecosystem
Faster results from analytics have become increasingly important.
Apache Spark is a computing platform designed to be fast, general-purpose, and
easy to use.
Spark SQL is designed to work with Spark via SQL and HiveQL (a Hive variant of SQL).
Spark Streaming provides processing of live streams of data. The Spark
Streaming API closely matches that of the Spark Core API, making it easy for developers to move
between applications that process data stored in memory versus data arriving in real time.
MLlib is the machine learning library that provides multiple types of machine learning algorithms.
GraphX is a graph processing library with APIs to manipulate graphs and
perform graph-parallel computations. Graphs are data structures composed of vertices and the edges
connecting them.
Spark SQL
• Allows relational queries expressed in:
SQL
HiveQL
Scala
• SchemaRDD
Row objects
Schema
Created from:
-Existing RDD
-Parquet file
-JSON dataset
-HiveQL against Apache Hive
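A minimal sketch ties these pieces together. SchemaRDD (an RDD of Row objects plus a schema) was later renamed DataFrame, which in the Java API is a Dataset of Row; the example below assumes Spark 2.x, a local master, and a hypothetical people.json file, and runs a relational query expressed in SQL.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlDemo {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("SparkSqlDemo")
        .master("local[*]") // assumption: run locally for the sketch
        .getOrCreate();

    // Rows with a schema inferred from a JSON dataset (hypothetical file).
    Dataset<Row> people = spark.read().json("people.json");
    people.createOrReplaceTempView("people");

    // A relational query expressed in SQL.
    spark.sql("SELECT name FROM people WHERE age > 21").show();

    spark.stop();
  }
}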