
HADOOP Internals

Training Contents

Description
Intended Audience
Key Skills
Prerequisites
Instructional Method
Course Contents


HADOOP Internals
Course Contents

Day 1
Hadoop Introduction
MapReduce
Distributing Data with HDFS

Day 2
Understanding Hadoop I/O
Advanced MapReduce
Writing Map-Reduce Applications

Day 3
Map-Reduce Internals
Managing Hadoop
Map-Reduce Features
Map-Reduce Ecosystem


Description:

This training introduces attendees to the core concepts of Hadoop. It takes a
deep dive into the critical architecture paths of HDFS, MapReduce, and HBase,
teaches the basics of writing effective Pig and Hive scripts, and explains how
to choose the right use cases for Hadoop.

Intended Audience:

Engineers, Programmers, Networking specialists, Managers, Executives

Key Skills:

The attendees will learn:

Big Data and the Hadoop Ecosystem
Hadoop Distributed File System (HDFS)
Using the MapReduce API and writing common algorithms
Advanced MapReduce concepts and algorithms
Importing and exporting data
Best practices for developing and debugging MapReduce programs
Hadoop best practices, tips, and techniques
Managing and monitoring a Hadoop cluster

Prerequisites:

Participants should have a basic understanding of Java and Linux.

Instructional Method:

This is an instructor-led course that combines lectures with the practical
application of Hadoop and its underlying technologies. Most concepts are
presented pictorially, and a detailed case study ties together the
technologies, patterns, and design.

HADOOP Internals

Hadoop Introduction

Move computation, not data
Volunteer Computing
Grid Computing
Hadoop Releases
Hadoop performance and data-scale facts
The Apache Hadoop Project
Hadoop in the context of other data stores
Apache Hadoop and the Hadoop Ecosystem
A Brief History of Hadoop
Hadoop, an inside view: MapReduce and HDFS
What about NoSQL?
RDBMS
Comparison with Other Systems

MapReduce

Constructing the basic template of a MapReduce program (sketched below)
Running a Distributed MapReduce Job
Data Flow
Combiner Functions
Java MapReduce
Scaling Out
Counting things
Analyzing the Data with Hadoop
Map and Reduce
Hadoop Pipes
Adapting to Hadoop's API changes
Improving performance with combiners
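
As a taste of the basic-template topic above, here is a minimal word-count
sketch against the org.apache.hadoop.mapreduce API; the class and field names
are illustrative, not part of the course materials.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Mapper: emits (word, 1) for every token in the input line.
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reducer: sums the counts per word; also usable as a combiner.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }

Because this reducer is associative and commutative, registering it as the
job's combiner (job.setCombinerClass(SumReducer.class)) is the usual first step
in "Improving performance with combiners".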

Hadoop Streaming

Ruby
Python

Streaming in Hadoop

Distributing Data with HDFS

Interfaces
Hadoop Filesystems
The Design of HDFS

Using Hadoop Archives

Anatomy of a File Write


Anatomy of a File Read
Coherency Model

The Command-Line Interface

Keeping an HDFS Cluster Balanced


Hadoop Archives

Data Flow

Limitations

Parallel Copying with distcp

Basic Filesystem Operations

The Java Interface

Streaming with key/value pairs


Streaming with Unix commands
Streaming with the Aggregate package
Streaming with scripts

Querying the Filesystem


Reading Data Using the FileSystem API (sketched below)
Directories
Deleting Data
Reading Data from a Hadoop URL
Writing Data
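
To ground the FileSystem API topics above, a minimal sketch of reading a file
from HDFS and copying it to standard output; the URI is a placeholder.

    import java.io.InputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsCat {
        public static void main(String[] args) throws Exception {
            // e.g. hdfs://namenode/user/train/sample.txt (placeholder path)
            String uri = args[0];
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            InputStream in = null;
            try {
                in = fs.open(new Path(uri));   // returns an FSDataInputStream
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }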

Understanding Hadoop I/O


File-Based Data Structures

MapFile
SequenceFile

Serialization

Compression

Codecs
Using Compression in MapReduce
Compression and Input Splits

Data Integrity

Implementing a Custom Writable (see the sketch at the end of this section)


Serialization Frameworks
The Writable Interface
Writable Classes
Avro

ChecksumFileSystem
LocalFileSystem
Data Integrity in HDFS
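
As a taste of the custom Writable topic, a minimal two-field value type; the
class name is illustrative.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // Hadoop calls write() to serialize and readFields() to deserialize.
    public class IntPairWritable implements Writable {
        private int first;
        private int second;

        public IntPairWritable() { }   // Writables need a no-arg constructor

        public void set(int first, int second) {
            this.first = first;
            this.second = second;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(first);
            out.writeInt(second);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            first = in.readInt();
            second = in.readInt();
        }
    }

A type used as a key must also implement WritableComparable so the framework
can sort it during the shuffle.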

Advanced MapReduce
Chaining MapReduce jobs

Creating a Bloom filter

What does a Bloom filter do?


Bloom filter in Hadoop version 0.20+
Implementing a Bloom filter (sketched at the end of this section)

Joining data from different sources

Chaining preprocessing and postprocessing steps


Chaining MapReduce jobs in a sequence
Chaining MapReduce jobs with complex dependency

Reduce-side joining
Replicated joins using DistributedCache
Semijoin: reduce-side join with map-side filtering
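
For the Bloom filter topics, a minimal sketch using the
org.apache.hadoop.util.bloom package that ships with Hadoop 0.20+; the sizing
parameters are arbitrary.

    import org.apache.hadoop.util.bloom.BloomFilter;
    import org.apache.hadoop.util.bloom.Key;
    import org.apache.hadoop.util.hash.Hash;

    public class BloomFilterDemo {
        public static void main(String[] args) {
            // vectorSize and nbHash trade memory against the false-positive rate.
            BloomFilter filter = new BloomFilter(10000, 6, Hash.MURMUR_HASH);

            filter.add(new Key("user123".getBytes()));

            // May return a false positive, but never a false negative.
            System.out.println(filter.membershipTest(new Key("user123".getBytes())));
        }
    }

In the semijoin pattern, a filter built from the smaller dataset is shipped to
the mappers (for example via the DistributedCache) so records that cannot
possibly join are dropped before the shuffle.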

Writing Map-Reduce Applications

Hadoop in the Cloud


Cluster Setup and Installation
Hadoop Configuration
YARN Configuration

The Configuration API (sketched at the end of this section)


Running Locally on Test Data
Configuring the Development Environment
Cluster Specs
Tuning
MapReduce Workflows
Monitoring and debugging on a production cluster
Tuning for performance
Benchmarking a Hadoop Cluster
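
A minimal sketch of the Configuration API; the resource name follows the
standard Hadoop convention, and the two properties are common built-ins.

    import org.apache.hadoop.conf.Configuration;

    public class ConfigDemo {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.addResource("core-site.xml");  // later resources override earlier ones

            // get() returns the supplied default when the property is unset.
            String fsUri = conf.get("fs.defaultFS", "file:///");
            int bufferSize = conf.getInt("io.file.buffer.size", 4096);

            System.out.println(fsUri + " / " + bufferSize);
        }
    }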

Map-Reduce Internals
Failures

Anatomy of a MapReduce Job Run

Skipping Bad Records


Output Committers
The Task Execution Environment
Speculative Execution (see the configuration sketch at the end of this section)
Task JVM Reuse

Job Scheduling

The Reduce Side


The Map Side
Configuration Tuning

Task Execution

Classic MapReduce (MapReduce 1)


YARN (MapReduce 2)

Shuffle and Sort

Failures in YARN
Failures in Classic MapReduce

The Capacity Scheduler


The Fair Scheduler
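
As a small illustration of the speculative-execution and JVM-reuse topics, both
can be toggled through job configuration; the property names below are the
classic MapReduce 1 ones.

    import org.apache.hadoop.conf.Configuration;

    public class TaskTuningDemo {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // Disable speculative execution for map tasks (MR1 property name).
            conf.setBoolean("mapred.map.tasks.speculative.execution", false);

            // Reuse each task JVM for an unlimited number of tasks (MR1).
            conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
        }
    }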

Managing Hadoop

Setting permissions

Enabling trash
Adding DataNodes
Managing NameNode and Secondary NameNode
Designing network layout and rack awareness
Checking systems health
Managing quotas
Setting up parameter values for practical use
Removing DataNodes
Recovering from a failed NameNode

Map-Reduce Features

Counters (sketched below)
Sorting
Side Data Distribution
Map-Reduce Library
Joins
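
For the Counters topic, a minimal sketch of a custom counter incremented from
inside a mapper; the group and counter names are illustrative.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class RecordQualityMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (value.toString().split(",").length < 3) {
                // Counters roll up across all tasks and print with the job summary.
                context.getCounter("RecordQuality", "MALFORMED").increment(1);
                return;
            }
            context.write(value, NullWritable.get());
        }
    }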

Map-Reduce Ecosystem
Hive

Installing and configuring Hive

HiveQL in details
Example queries (see the JDBC sketch below)
Hive Sum-up
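
As one way to run the example queries, a minimal sketch that submits HiveQL
through the HiveServer2 JDBC driver; the host, port, credentials, and the
words table are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryDemo {
        public static void main(String[] args) throws Exception {
            // Registers the HiveServer2 driver (auto-registered under JDBC 4).
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "hive", "");
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery(
                    "SELECT word, count(*) FROM words GROUP BY word");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
            conn.close();
        }
    }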

HBase

Introduction
Clients
Concepts
HBase vs. RDBMS

Pig

Installing Pig

Running Pig

Learning Pig Latin through Grunt


Managing the Grunt shell

Thinking like a Pig

Data flow language


User-defined functions
Data types

Speaking Pig Latin

Execution optimization
Expressions and functions
Relational operators
Data types and schemas

Mobile: +91 7719882295 / 9730463630
Email: sales@anikatechnologies.com
Website: www.anikatechnologies.com
