1. Describe the functions and features of HDP

Hortonworks Data Platform (HDP)


• HDP is a platform for data-at-rest
• Secure, enterprise-ready open source Apache Hadoop distribution
based on a centralized architecture (YARN)
• HDP is:
 Open
 Central
 Interoperable
 Enterprise ready
Apache Ambari
• Used for provisioning, managing, and monitoring Apache Hadoop clusters.
• Provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs

Ambari REST APIs


 Allows application developers and system integrators to easily integrate Hadoop provisioning, management, and
monitoring capabilities into their own applications
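
For example, a minimal sketch in Python (the server address and credentials are placeholders, not from this material) that lists the clusters an Ambari instance manages through its REST API:

import requests

AMBARI = "http://ambari.example.com:8080/api/v1"   # hypothetical Ambari server address
AUTH = ("admin", "admin")                          # replace with real credentials
HEADERS = {"X-Requested-By": "ambari"}             # header Ambari expects on API calls

# List the clusters managed by this Ambari instance
resp = requests.get(f"{AMBARI}/clusters", auth=AUTH, headers=HEADERS)
resp.raise_for_status()
for item in resp.json()["items"]:
    print(item["Clusters"]["cluster_name"])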

Functionality of Apache Ambari


Ambari enables System Administrators to:
• Provision a Hadoop cluster
 Ambari provides a wizard for installing Hadoop services across any number of hosts
• Manage a Hadoop cluster
 Ambari provides central management for starting, stopping, and reconfiguring
Hadoop services across the entire cluster
• Monitor a Hadoop cluster
 Ambari provides a dashboard for monitoring health and status of the Hadoop cluster
 Ambari leverages Ambari Metrics System ("AMS") for metrics collection
• Ambari enables application developers and system integrators to:
 Easily integrate Hadoop provisioning, management, and monitoring capabilities into
their own applications with the Ambari REST APIs

Ambari Metrics System ("AMS")


• System for collecting, aggregating and serving Hadoop and system
metrics in Ambari-managed clusters. The AMS works as follows:
1. Metrics Monitors run on each host and send system-level metrics to the Metrics Collector (a daemon).
2. Hadoop Sinks run on each host and send Hadoop-level metrics to the Collector.
3. The Metrics Collector stores and aggregates metrics. The Collector can store data either on the local
filesystem ("embedded mode") or can use an external HDFS for storage ("distributed mode").
4. Ambari exposes a REST API, which makes metrics retrieval easy.
5. Ambari REST API feeds the Ambari Web UI.
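
As an illustration, metrics can also be read programmatically; the sketch below queries the Metrics Collector's timeline endpoint directly (the collector host, metric name, and default port 6188 are assumptions, and field names can vary between AMS versions):

import requests

COLLECTOR = "http://metrics-collector.example.com:6188"   # hypothetical Metrics Collector host

params = {
    "metricNames": "cpu_user",         # a system-level metric reported by the Metrics Monitors
    "hostname": "worker1.example.com", # host to read metrics for (placeholder)
    "appId": "HOST",                   # "HOST" selects host-level (system) metrics
}
resp = requests.get(f"{COLLECTOR}/ws/v1/timeline/metrics", params=params)
resp.raise_for_status()
for metric in resp.json().get("metrics", []):
    print(metric["metricname"], metric["metrics"])         # metric name and its timestamp/value map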


Ambari Web UI: a web-based interface that allows users to easily interact with the system.
Ambari Architecture
Ambari Server: contains or interacts with the following components:

 Postgres RDBMS (default) stores the cluster configurations


 Authorization Provider integrates with an organization's authentication/authorization provider such as the
LDAP service (By default, Ambari uses an internal database as the user store for authentication and
authorization)
 Ambari Alert Framework supports alerts and notifications
 REST API integrates with the web-based front-end Ambari Web. This REST API can also be used by
custom applications.

How Ambari manages hosts in a cluster


• Ambari provides the following actions using the Hosts tab:
 Working with Hosts
 Determining Host Status
 Filtering the Hosts List
 Performing Host-Level Actions
 Viewing Components on a Host
 Decommissioning Masters and Slaves
 Deleting a Host from a Cluster
 Setting Maintenance Mode
 Adding Hosts to a Cluster
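
Much of this host-level information is also exposed through the REST API. A sketch (cluster name, server address, and credentials are placeholders) that lists every host in a cluster together with its state:

import requests

AMBARI = "http://ambari.example.com:8080/api/v1"   # hypothetical Ambari server address
AUTH = ("admin", "admin")                          # replace with real credentials

resp = requests.get(f"{AMBARI}/clusters/MyCluster/hosts",
                    params={"fields": "Hosts/host_state"}, auth=AUTH)
resp.raise_for_status()
for item in resp.json()["items"]:
    print(item["Hosts"]["host_name"], item["Hosts"]["host_state"])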

Ambari terminology
Service: Service refers to services in the Hadoop stack. HDFS, HBase, and Pig are examples of services

Component: A service consists of one or more components. For example, HDFS has 3 components: NameNode,
DataNode and Secondary NameNode.

Node/Host: Node refers to a machine in the cluster. Node and host are used interchangeably in this document.
Node-Component: Node-component refers to an instance of a component on a particular node.

Operation: An operation refers to a set of changes or actions performed on a cluster to satisfy a user request or to
achieve a desirable state change in the cluster.

Task: Task is the unit of work that is sent to a node to execute. A task is the work that node has to carry out as part
of an action.

Stage: A stage refers to a set of tasks that are required to complete an operation and are independent of each other;
all tasks in the same stage can be run across different nodes in parallel.

Action: An 'action' consists of a task or tasks on a machine or a group of machines. Each action is tracked by an
action id and nodes report the status at least at the granularity of the action.

Stage Plan: An operation typically consists of multiple tasks on various machines with dependencies requiring them
to run in a particular order; the stage plan arranges those tasks into ordered stages, where each stage must complete
before the next, while the tasks within a stage run in parallel.

Manifest: Manifest refers to the definition of a task which is sent to a node for execution.

Role: A role maps to either a component (for example, NameNode, DataNode) or an action (for example, HDFS
rebalancing, HBase smoke test, other admin commands, etc.)
MapReduce and YARN
The Distributed File System (DFS)
• Driving principles
 data is stored across the entire cluster
 programs are brought to the data, not the data to the program
• Data is stored across the entire cluster (the DFS)
 the entire cluster participates in the file system
 blocks of a single file are distributed across the cluster
 a given block is typically replicated as well for resiliency

Describe the MapReduce model v1

Hadoop computational model


 Data stored in a distributed file system spanning many inexpensive computers
 Bring function to the data
 Distribute application to the compute resources where the data is stored
The MapReduce programming model
"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to
worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node
processes that smaller problem, and passes the answer back to its master node.
"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to
get the output - the answer to the problem it was originally trying to solve.
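
A classic illustration is word count. The sketch below follows the Hadoop Streaming convention, where the mapper and reducer are plain programs that read from standard input and write tab-separated key/value pairs to standard output (the framework sorts the mapper output by key before the reducer sees it); it is an illustrative sketch, not code from this material:

import sys
from itertools import groupby

def mapper(lines=sys.stdin):
    # "Map" step: emit (word, 1) for every word in this mapper's input split
    for line in lines:
        for word in line.split():
            print(f"{word}\t1")

def reducer(lines=sys.stdin):
    # "Reduce" step: input arrives sorted by key, so lines for the same word are adjacent
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

With Hadoop Streaming, these two functions would run as separate mapper and reducer programs submitted with the streaming jar, and the same reducer logic can usually double as a combiner to cut shuffle traffic.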

The MapReduce execution environments


• APIs vs. Execution Environment
 APIs are implemented by applications and are largely independent of
execution environment
 Execution Environment defines how MapReduce jobs are executed
• MapReduce APIs
 org.apache.hadoop.mapred:
- Old API, largely superseded; some of its classes are still used by the new API
- Not changed with YARN
 org.apache.hadoop.mapreduce:
- New API, more flexibility, widely used
- Applications may have to be recompiled to use YARN (not binary compatible)
• Execution Environments
 Classic JobTracker/TaskTracker from Hadoop v1.
MapReduce phases
 Map
Mappers
 Small program (typically), distributed across the cluster, local to data
 Handed a portion of the input data (called a split)
 Each mapper parses, filters, or transforms its input

 Shuffle

Shuffle phase
• The output of each mapper is locally grouped together by key
• One node is chosen to process data for each unique key
• All of this movement (shuffle) of data is transparently orchestrated by MapReduce
 Reduce

Reducers
 Small programs (typically) that aggregate all of the values for the key
that they are responsible for
 Each reducer writes output to its own file

 Combiner

Combiner (Optional)
• The data that will go to each reduce node is sorted and merged before it is sent, pre-doing some of
the work of the receiving reduce node in order to minimize network traffic between map and reduce nodes.

The process of running a MapReduce job on Hadoop consists of 10 major steps:

1. The MapReduce program you have written tells the Job Client to run a MapReduce job.
2. The Job Client sends a message to the JobTracker, which produces a unique ID for the job.
3. The Job Client copies job resources, such as a jar file containing the Java code
you have written to implement the map or the reduce task, to the shared file system, usually HDFS.
4. Once the resources are in HDFS, the Job Client can tell the JobTracker to start the job.
5. The JobTracker does its own initialization for the job. It calculates how to split the data so that it can send each
"split" to a different mapper process to maximize throughput.
6. It retrieves these "input splits" from the distributed file system: only the split metadata, not the data itself.

7. The TaskTrackers are continually sending heartbeat messages to the JobTracker. Now that the JobTracker has
work for them, it will return a map task or a reduce task as a response to the heartbeat.
8. The TaskTrackers need to obtain the code to execute, so they get it from the shared file system.
9. Then they can launch a Java Virtual Machine with a child process running in it, and this child process runs your
map code or your reduce code. The result of the map operation remains on the local disk of the given
TaskTracker node (not in HDFS).
10. The output of the Reduce task is stored in HDFS file system using the number of copies specified by replication
factor.

Classes
• There are three main Java classes provided in Hadoop to read data
in MapReduce:
 InputFormat divides the input into splits (InputSplit objects)
-Splits are normally the block size, but this depends on the number of requested Map
tasks, whether any compression allows splitting, etc.
 Each InputSplit describes the portion of the input that one Map task will process
 RecordReader takes a split and reads it into records, transforming each record into a
<key, value> pair that is then passed to the Map task
-For example, one record per line (LineRecordReader)
-But note that a record can be split across split boundaries

Limitations of classic MapReduce (MRv1)


The most serious limitations of classical MapReduce are:
 Scalability
 Resource utilization
 Support of workloads different from MapReduce.
• In the MapReduce framework, the job execution is controlled by two
types of processes:
 A single master process called JobTracker, which coordinates all jobs
running on the cluster and assigns map and reduce tasks to run on the TaskTrackers
 A number of subordinate processes called TaskTrackers, which run assigned tasks and periodically report the
progress to the JobTracker

YARN overhauls MRv1


• MapReduce has undergone a complete overhaul with YARN, splitting
up the two major functionalities of JobTracker (resource management
and job scheduling/monitoring) into separate daemons
• ResourceManager (RM)
 The global ResourceManager and per-node slave, the NodeManager (NM),
form the data-computation framework
 The ResourceManager is the ultimate authority that arbitrates resources
among all the applications in the system
• ApplicationMaster (AM)
 The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating
resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks
 An application is either a single job in the classical sense of Map-Reduce jobs or a directed acyclic graph (DAG)
of jobs

The Scheduler is responsible for allocating resources to the various running applications, subject to familiar
constraints such as capacities and queues.

The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for
executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster
container on failure.

The NodeManager is the per-machine framework agent that is responsible for containers, monitoring their resource
usage (CPU, memory, disk, network) and reporting the same to the ResourceManager/Scheduler.
The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from
the Scheduler, tracking their status and monitoring for progress.
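
The ResourceManager's role as the central arbiter can be observed through its web services API. A sketch (the ResourceManager address and its default web port 8088 are assumptions) that lists the applications currently running on the cluster:

import requests

RM = "http://resourcemanager.example.com:8088"   # hypothetical ResourceManager address

resp = requests.get(f"{RM}/ws/v1/cluster/apps", params={"states": "RUNNING"})
resp.raise_for_status()
apps = resp.json().get("apps") or {}             # "apps" is null when nothing is running
for app in apps.get("app", []):
    print(app["id"], app["name"], app["state"], app["queue"])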

YARN features
• Scalability
• Multi-tenancy
• Compatibility
• Serviceability
• Higher cluster utilization
• Reliability/Availability

YARN major features summarized


• Multi-tenancy
 YARN allows multiple access engines (either open-source or proprietary) to use Hadoop as the common standard
for batch, interactive, and real-time engines that can simultaneously access the same data sets
 Multi-tenant data processing improves an enterprise's return on its Hadoop investments.
• Cluster utilization
 YARN's dynamic allocation of cluster resources improves utilization over more static MapReduce rules used in
early versions of Hadoop
• Scalability
 Data center processing power continues to rapidly expand. YARN's ResourceManager focuses exclusively on
scheduling and keeps pace as clusters expand to thousands of nodes managing petabytes of data.
• Compatibility
 Existing MapReduce applications developed for Hadoop 1 can run on YARN without any disruption to existing
processes that already work.

1. List the phases in an MR job.


 Map, Shuffle, Reduce, Combiner
2. What are the limitations of MR v1?
 Centralized handling of job control flow
 Tight coupling of a specific programming model with the resource management
infrastructure
 Hadoop is now being used for all kinds of tasks beyond its original design
3. The JobTracker in MR1 is replaced by which component(s) in YARN?
 ResourceManager
 ApplicationMaster
4. What are the major features of YARN?
 Multi-tenancy
 Cluster utilization
 Scalability
 Compatibility

---------------------------------------------------------------------------------------------------------------------
Apache Spark
List the purpose of Apache Spark in the Hadoop ecosystem
 Faster results from analytics have become increasingly important
 Apache Spark is a computing platform designed to be fast, general-purpose, and
easy to use

Who uses Spark and why?


• Parallel distributed processing, fault tolerance on commodity hardware,
scalability, in-memory computing, high level APIs, etc.
• Data scientists
 Analyze and model the data to obtain insight using ad-hoc analysis
 Transforming the data into a usable format
 Statistics, machine learning, SQL
• Data engineers
 Develop a data processing system or application
 Inspect and tune their applications
 Programming with Spark's API
• Everyone else
 Ease of use
 Wide variety of functionality
 Mature and reliable.
List and describe the architecture and components of the Spark unified stack

 Spark SQL is designed to work with Spark via SQL and HiveQL (a Hive variant of SQL).
 Spark Streaming provides processing of live streams of data. The Spark
Streaming API closely matches that of Spark Core's API, making it easy for developers to move
between applications that process data stored in memory and data arriving in real time (see the sketch after this list).
 MLlib is the machine learning library that provides multiple types of machine learning algorithms.
 GraphX is a graph processing library with APIs to manipulate graphs and
perform graph-parallel computations. Graphs are data structures composed of vertices and the edges
connecting them.
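
As a small illustration of how closely the Spark Streaming API mirrors the core API, a PySpark sketch (the socket source on localhost:9999 and the 10-second batch interval are assumptions for the example) that counts words in a live stream:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingWordCount")
ssc = StreamingContext(sc, 10)                      # 10-second batch interval

lines = ssc.socketTextStream("localhost", 9999)     # live text stream (source is illustrative)
counts = (lines.flatMap(lambda line: line.split())  # the same transformations used on ordinary RDDs
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                     # print a sample of each batch's counts

ssc.start()
ssc.awaitTermination()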

Describe the role of a Resilient Distributed Dataset (RDD)


Resilient Distributed Datasets (RDDs)
• Spark's primary abstraction: Distributed collection of elements, parallelized across the cluster
• Two types of RDD operations:
 Transformations
-Build up a directed acyclic graph (DAG)
-Lazily evaluated
-Return a new RDD rather than a computed value
 Actions
-Trigger execution of the accumulated transformations
-Return a value to the driver program
• RDD provides fault tolerance
• Has in-memory caching (with overflow to disk).

Resilient Distributed Dataset (RDD)


• Fault-tolerant collection of elements that can be operated on in parallel
• RDDs are immutable
• Three methods for creating an RDD (see the sketch after this list)
 Parallelizing an existing collection
 Referencing a dataset
 Transformation from an existing RDD
• Two types of RDD operations
 Transformations
 Actions
• Dataset from any storage supported by Hadoop
 HDFS, Cassandra, HBase, Amazon S3, etc.
• Types of files supported:
 Text files, SequenceFiles, Hadoop InputFormat, etc.
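
A PySpark sketch of the three creation methods (the HDFS path is illustrative):

from pyspark import SparkContext

sc = SparkContext(appName="RDDCreation")

# 1. Parallelizing an existing collection
numbers = sc.parallelize([1, 2, 3, 4, 5])

# 2. Referencing a dataset in storage supported by Hadoop
logs = sc.textFile("hdfs:///data/logs/app.log")      # illustrative path

# 3. Transformation from an existing RDD
errors = logs.filter(lambda line: "ERROR" in line)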

RDD operations: Transformations


• Some of the transformations available include map, filter, flatMap, and reduceByKey; the full set can be found on
Spark's website.
• Transformations are lazily evaluated
• Each transformation returns a pointer to the new, transformed RDD

RDD operations: Actions
• Actions trigger evaluation of the DAG and return a result to the driver program or write it to storage
• Examples include collect(), count(), reduce(), take(), and saveAsTextFile()

RDD persistence
• Each node stores in memory any partitions of the dataset that it computes
• Reuses them in other actions on that dataset (or datasets derived from it)
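
A sketch that ties transformations, actions, and persistence together:

from pyspark import SparkContext

sc = SparkContext(appName="PersistenceExample")

# Transformations only build the DAG; nothing is computed yet
squares = sc.parallelize(range(1, 1000001)).map(lambda x: x * x)

squares.cache()              # keep computed partitions in memory; persist() accepts other storage levels

print(squares.count())       # first action: evaluates the DAG and populates the cache
print(squares.take(5))       # second action: reuses the cached partitions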

Spark SQL
• Allows relational queries expressed in
 SQL
 HiveQL
 Scala
• SchemaRDD
 Row objects
 Schema
 Created from:
-Existing RDD
-Parquet file
-JSON dataset
-HiveQL against Apache Hive

• Supports Scala, Java, R, and Python
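
A PySpark sketch in the Spark 1.x style described here (the JSON path is illustrative):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="SparkSQLExample")
sqlContext = SQLContext(sc)

# Create a SchemaRDD/DataFrame from a JSON dataset
people = sqlContext.read.json("hdfs:///data/people.json")   # illustrative path

# Register it as a table and run a relational query in SQL
people.registerTempTable("people")
adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
for row in adults.collect():
    print(row.name, row.age)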


MLlib
• MLlib is Spark's machine learning library - under active development
• Currently provides the following common algorithms and utilities (see the sketch after this list)
 Classification
 Regression
 Clustering
 Collaborative filtering
 Dimensionality reduction
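
For instance, a clustering sketch with MLlib's RDD-based KMeans (the data points are made up for the example):

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="KMeansExample")

# Tiny made-up dataset: two obvious groups of 2-D points
points = sc.parallelize([
    [0.0, 0.0], [0.5, 0.4], [0.2, 0.1],
    [9.0, 9.5], [9.3, 9.9], [8.8, 9.1],
])

model = KMeans.train(points, k=2, maxIterations=10)
print(model.clusterCenters)              # the two learned cluster centers
print(model.predict([0.3, 0.2]))         # cluster index assigned to a new point
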
Advantages and disadvantages of Hadoop
• Hadoop is good for:
 processing massive amounts of data through parallelism
 handling a variety of data (structured, unstructured, semi-structured)
 using inexpensive commodity hardware
• Hadoop is not good for:
 processing transactions (random access)
 when work cannot be parallelized
 low latency data access
 processing lots of small files
 intensive calculations with small amounts of data
