Training Structure
DAY 1
Session 1: Starting with Hadoop
Fundamental and Core concepts of Hadoop
Infrastructure and Architecture of HDFS
HDFS command line and Web interface
Lab: Going through the Hadoop VM
Concepts of MapReduce function
Architectural overview of MapReduce
MapReduce Types and Formats
Managing and Scheduling Jobs
Concepts of MapReduce Version 2
Lab: Running MapReduce functions
Session 2: MapReduce Program
MapReduce Flow
Examining a Sample MapReduce Program
Basic MapReduce API Concepts
Driver Code
Mapper
Reducer
Streaming API
Using Eclipse for Rapid Development
New MapReduce API
Lab: Writing a MapReduce Program
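The Driver/Mapper/Reducer topics above can be previewed without a cluster. The following is a minimal sketch of the map-shuffle-reduce flow in plain Java, with no Hadoop dependencies; the input lines and class name are illustrative, not from the course material.

```java
// Plain-Java sketch of the MapReduce flow: map each line to (word, 1) pairs,
// then shuffle (group by key) and reduce (sum per key). No Hadoop required.
import java.util.*;

public class WordCountFlow {
    // Mapper step: emit (word, 1) for each word in a line, like Mapper.map().
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.split("\\s+")) out.add(Map.entry(w, 1));
        return out;
    }

    // Shuffle + reduce steps: group intermediate pairs by key and sum the
    // values per key, like Reducer.reduce() summing its iterable of counts.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines)
            for (Map.Entry<String, Integer> e : map(line))
                counts.merge(e.getKey(), e.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        // "the" appears three times, "fox" twice across the three lines.
        System.out.println(run(List.of("the quick brown fox", "the lazy dog", "the fox")));
    }
}
```

A real Hadoop job splits these responsibilities across a Driver class (job configuration), a Mapper, and a Reducer, with the framework performing the shuffle between them.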
DAY 2
Session 1: Hadoop APIs in depth
ToolRunner
Testing with MRUnit
Reducing Intermediate Data with Combiners
The configure and close Methods for Map/Reduce Setup and Teardown
Writing Partitioners for Better Load Balancing
Directly Accessing HDFS
Using the Distributed Cache
Lab: Implementing Combiner
Lab: Writing a Partitioner
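The load-balancing idea behind the partitioner lab can be sketched in a few lines. This mirrors Hadoop's default HashPartitioner scheme, where a key's partition is its masked hash modulo the number of reducers, so equal keys always reach the same reducer; the key strings here are made up for illustration.

```java
// Sketch of hash partitioning: deterministic assignment of keys to reducers.
// A custom Partitioner replaces this function with domain-aware logic to
// spread load more evenly when the key distribution is skewed.
public class HashPartitionerSketch {
    static int partition(String key, int numReducers) {
        // Mask the sign bit so the modulo result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        for (String k : new String[] {"alpha", "beta", "gamma"})
            System.out.println(k + " -> reducer " + partition(k, 3));
    }
}
```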
Session 2: Practical Development Tips and Techniques
Debugging MapReduce Code
Using LocalJobRunner Mode for Easier Debugging
Retrieving Job Information with Counters
Logging
Splittable File Formats
Determining the Optimal Number of Reducers
Map-Only MapReduce Jobs
DAY 3
Hive (hands on)
Session 1:
What Is Hive?
Hive Schema and Data Storage
Hive Use Cases, Interacting with Hive
Relational Data Analysis with Hive
Hive Databases and Tables
Basic HiveQL Syntax
Data Types, Joining Data Sets
Common Built-in Functions
Hands-On Exercise: Running Hive Queries
Session 2:
Hive Data Management
Hive Data Formats
Creating Databases and Hive-Managed Tables
Loading Data into Hive
Altering Databases and Tables
Self-Managed Tables
Simplifying Queries with Views
Storing Query Results
Hands-On Exercise: Data Management with Hive
Session 3:
Text Processing with Hive
Sentiment Analysis and N-Grams
Hands-On Exercise: Gaining Insight with Sentiment Analysis
Hive Optimization
Partitioning
Bucketing
Indexing Data
DAY 4
Scala Basics
Values, functions, classes, methods, inheritance, try-catch-finally
Expression-oriented programming
Case classes, objects, packages, apply, update
Functions are Objects (uniform access principle), pattern matching
Collections
Lists, Maps, functional combinators (map, foreach, filter, zip, folds)
Why Spark?
Problems with Traditional Large-Scale Systems
Introducing Spark
Spark Basics
What is Apache Spark?
Using the Spark Shell
Resilient Distributed Datasets (RDDs)
Functional Programming with Spark
Working with RDDs, RDD Operations
Key-Value Pair RDD
MapReduce and Pair RDD Operations
Passing Functions to Spark
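The pair-RDD operations listed above are also exposed through Spark's Java API (e.g. JavaPairRDD.reduceByKey). Their semantics can be previewed on local collections without a cluster; this sketch reproduces reduceByKey's behavior, and the sales figures are invented for the example.

```java
// Plain-Java sketch of reduceByKey semantics: combine all values that share
// a key with a binary operator, as Spark does across partitions.
import java.util.*;
import java.util.function.BinaryOperator;

public class ReduceByKeySketch {
    static <K, V> Map<K, V> reduceByKey(List<Map.Entry<K, V>> pairs, BinaryOperator<V> f) {
        Map<K, V> out = new HashMap<>();
        for (Map.Entry<K, V> e : pairs)
            out.merge(e.getKey(), e.getValue(), f); // apply f when the key repeats
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> sales = List.of(
            Map.entry("east", 10), Map.entry("west", 5),
            Map.entry("east", 7),  Map.entry("west", 3));
        // east totals 17, west totals 8.
        System.out.println(reduceByKey(sales, Integer::sum));
    }
}
```

In real Spark code the combining function is passed to the RDD (e.g. as a lambda), which is the "Passing Functions to Spark" topic above.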
DAY 5
Storm Using Java
Features of Storm
Storm components, Nimbus, Supervisor nodes
The ZooKeeper cluster
The Storm data model
Definition of a Storm topology
Operation modes
Setting Up a Storm Cluster
Setting up a distributed Storm cluster
Deploying a topology on a remote Storm cluster
Deploying the sample topology on the remote cluster
Configuring the parallelism of a topology
The worker process
The executor
Tasks
Configuring parallelism at the code level
Distributing worker processes, executors, and tasks in the sample topology
Rebalancing the parallelism of a topology
Rebalancing the parallelism of the sample topology
Stream grouping, Shuffle grouping, Fields grouping
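The difference between the groupings above can be sketched without a Storm cluster. Under fields grouping, tuples carrying the same value of the grouping field are always routed to the same task (modeled here with a hash); shuffle grouping instead spreads tuples evenly (modeled as round-robin). Task counts and field values are illustrative.

```java
// Sketch of Storm stream-grouping routing decisions for a bolt's tasks.
public class GroupingSketch {
    // Fields grouping: route by a hash of the grouping field's value, so
    // equal field values always land on the same task.
    static int fieldsGrouping(String fieldValue, int numTasks) {
        return (fieldValue.hashCode() & Integer.MAX_VALUE) % numTasks;
    }

    // Shuffle grouping: distribute tuples evenly regardless of content.
    static int shuffleGrouping(int tupleIndex, int numTasks) {
        return tupleIndex % numTasks;
    }

    public static void main(String[] args) {
        // Two tuples with the same field value take the same route.
        System.out.println(fieldsGrouping("storm", 4) == fieldsGrouping("storm", 4));
    }
}
```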
Storm and Kafka Integration
The Kafka architecture
The producer
Replication
Consumers
Brokers
Data retention
Setting up Kafka
Setting up a single-node Kafka cluster
A sample Kafka producer
Integrating Kafka with Storm