Big Data Masters Program

CURRICULUM

Big Data Masters Program


01 Week
Introduction to Big Data & HDFS Concepts along with Linux Commands
»» Introduction to Big Data
»» What is Big Data And Why Big Data
»» Big Data System Requirements
»» Monolithic vs Distributed System
»» Distributed System Architecture
»» What is Hadoop And Evolution of Hadoop
»» Google File System (GFS)
»» Distributed Processing (MapReduce)
»» Hadoop 1.0 vs Hadoop 2.0
»» What is Yarn
»» Core Components of Hadoop
»» Hadoop Ecosystems Tools
»» Brief Introduction to Spark
»» Hadoop Cluster Vs Spark Cluster
»» HDFS Architecture:
»» What is Node And What is Cluster
»» Data Block & Block Size
»» Slave Node, Master Node, Data Node & Name Node
»» Metadata And Replication Factor
»» Heart Beat & Fault Tolerance
»» Handling Namenode Failure
»» What is SPOF
»» FSimage & Edit Logs
»» Secondary Namenode
»» Name Node Recovery
»» Check Pointing
»» Understanding Replication Factor
»» What is Rack And Rack Failure
»» Rack Awareness Mechanism
»» Block Report
»» Namenode High Availability
»» Quorum Journal Manager & Quorum Journal Node
»» Understanding Linux File System
»» List & Parameters of List Command
»» Touch, Mkdir, Rmdir & Other Linux Commands
»» HDFS Commands:
»» List Files & Directories
»» How HDFS Commands Work
»» ‘ls’ Command With Various Parameters
»» Create, Remove File/Directory
»» Copy & Get Files/Folders From Local to HDFS & Vice Versa
»» Move Files/Folders From HDFS to HDFS
»» Change Replication Factor Dynamically
»» View File Metadata Information
»» Week1: Quiz
»» Week1: Assignment
02 Week »» Realtime Use Case: Google Web Search
»» How Google Search Works
MapReduce - Distributed Computing »» MapReduce Programming
Framework »» MR Code Explanation
»» Introduction to MapReduce »» How to Write Map Reduce Code
»» What is MapReduce »» Mapper Code
»» Stages in MapReduce »» Reducer Code
»» What is Key-Value »» Main Code
»» What is Map & What is Reduce »» Finding the Frequency of Each Word in a File
»» Example to Understand Map & Reduce »» MapReduce Jars
»» Word Count Example in MapReduce »» MapReduce Practical Sessions
»» Record Reader »» Word Count Program - Practical Session1
»» Mapper Phase »» Jar Creation & Execution - Practical Session2:
»» Reducer Phase »» How to Create a Jar
»» MapReduce Shuffle & Sort »» How to Execute the Jar
»» Inside Map & Reduce Phase »» How to Track a Job
»» Wordcount Example in MapReduce »» How to Track All Previous Jobs
»» Typical MapReduce Flow »» MR Program Variations - Practical Session3:
»» Blocks in MapReduce »» How to Change Number of Reducers
»» Default Number of Mappers & Reducers »» Writing Custom Partitioner Logic
»» Understanding Number of Mappers/Reducers »» Changing Number of Reducers to Zero
»» MapReduce Framework Behind the Scenes »» Introducing Combiner
»» Role of Hash Function in MapReduce »» Writing Custom Combiner logic
»» Partitioning in MapReduce »» Week2: Quiz
»» How to Choose Number of Reducers »» Week2: Assignment
»» How Hash Function Works »» Week1 Assignment Solution
»» Understanding Shuffle & Sort
»» Example: Calculating Max Temperature in a Day
»» Combiner Function in MapReduce
»» Advantages of Combiners
03 Week
»» When to Use or Not to Use Combiner
Apache Sqoop - Data Ingestion to Hadoop
»» Example1: Filtering Data using MapReduce »» Sqoop Fundamentals
»» Example2: Finding Distinct Values »» Sqoop Basics
»» Example3: Finding Top 3 Most Influential users »» What is sqoop
»» Sqoop Workflow »» What is Hive
»» Key Features of Sqoop »» Hive Query Language (HQL)
»» Sqoop Import »» Understanding Hive Table
»» Sqoop Export »» Introduction to Hive Metadata
»» Connecting to MySQL »» Why Hive over traditional databases
»» Accessing MySQL Databases from Hadoop »» Transactional and Analytical Processing
»» Accessing MySQL Tables from Hadoop »» What is Data Warehouse
»» Sqoop Eval »» Hive Architecture
»» Sqoop Import Practicals »» Hive on top of Hadoop
»» Sqoop Export Practicals »» How Hive Works
»» Sqoop Job »» Transactional vs Analytical Processing
»» Sqoop Password Management »» Data Warehouse Concept
»» Sqoop Incremental Load »» The Hive Metastore
»» Sqoop Default Import »» Hive vs RDBMS
»» Sqoop Free-Form Query Import »» HQL vs SQL
»» Sqoop Direct import »» Hive Subqueries Views & Index
»» Importing Data Into Hive »» Transactional and Analytical Processing
»» Importing Data Into HBase »» What is Data Warehouse
»» Sqoop Validate »» Hive Architecture
»» When a Sqoop Export May Fail »» Hive on Hadoop
»» Week3: Quiz »» Hive Metastore
»» Week3: Assignment »» Hive vs. RDBMS
»» Week2 Assignment Solution »» Hive Complex Data Types
»» Hive Array, Map & Struct

04 Week »» Hive Built-in Functions


»» Hive UDF, UDAF & UDTF
Apache Hive Basics - Process Structure »» Hive Lateral Views
Data in Hadoop »» Hive Subqueries
»» Hive Overview: »» Hive Views
»» Transactional System and Analytical System »» Hive Normalization vs Denormalization
»» Examples of Transactional Systems »» Week4: Quiz
»» Examples of Analytical Systems »» Week3 Assignment Solution
05 Week 07 Week
Apache Hive Advance - Part 1 NoSQL Databases - HBase
»» Hive Structure Level Optimizations: »» Hbase Basics
»» Hive Partitioning
»» Key requirements of database
»» Hive Partitioning With 2 Columns
»» Limitations of Hadoop
»» Hive Bucketing
»» Google Bigtable concept for quick searching
»» Hive Partitioning With Bucketing
»» Hive Query Level Optimizations: »» Implementation of Bigtable as Hbase
»» Hive Join Optimizations »» Properties of Hbase
»» Hive Bucket Map Join Optimizations »» What Hbase can offer
»» Hive Window Functions »» Row based storage vs Columnar storage
»» Hive Ranking »» Advantages of columnar storage
»» Hive Sorting »» Normalization vs Denormalization
»» Week5: Quiz »» CRUD Operation
»» Week5: Assignment
»» RDBMS vs Hbase

06 Week
»» Hbase data model
»» 4-Dimensional data model

Apache Hive Advance - Part 2 »» CAP Theorem


»» Hbase Architecture
»» Hive File Format
»» Row vs Column File Formats »» Hbase Region Server
»» Specialized File Formats »» Region, Memstore, Wal & Block Cache
»» Internals of ORC File Formats »» Hfile
»» Internals of Parquet File Formats »» Zookeeper
»» ORC vs Parquet File Formats »» Hmaster & Meta Table
»» Hive Compression Techniques »» Hbase Architecture components in details
»» Hive Vectorization »» Hbase Read/Write operations
»» Changing the Hive Engine
»» Compaction
»» Hive Thrift Server
»» Hbase Data Update
»» Hive MSCK Repair
»» Hbase Data Deletion
»» Hive SCD
»» Week6: Quiz »» Handling Server Failures
»» Week6: Assignment »» Hbase Practicals
»» Week5: Assignment Solution »» Handling Hbase Failure Services
»» Create & List Table »» Var vs val
»» Insert Records in Table »» Type inference
»» Scan(view) & Get records from table »» Data types in Scala
»» Delete a column »» String Interpolation
»» Describe a table »» String Comparison
»» Check table exists or not »» Flow control: If else
»» Drop table - Understanding how it works »» Match Case
»» Parameters of get command »» For Loop
»» Parameters of scan command »» While loop
»» Hbase files structure in HDFS »» Scala Functional Programming
»» How to disable/enable a table »» How to define a function
»» Various filters in Hbase »» Higher order function
»» Count Records »» Anonymous function
»» Cassandra Overview »» Scala Collections
»» What is Cassandra »» Array
»» How a Cassandra Cluster Looks »» List
»» Tunable read/write Consistency »» Tuple
»» Hbase vs Cassandra »» Range
»» Integration with Hadoop (Mini Project) »» Set
»» Hbase-Hive Integration »» Map
»» Week7: Quiz »» Scala Functional Programming:
»» Week7: Assignment »» Why Scala
»» Week6 Assignment Solution »» Modes of writing Scala code
»» What is functional programming

08 week »» What is a function


»» What is a pure function?
Learning Scala - A Guide to Functional »» First class function
Programming »» Higher order function
»» Why Scala »» Anonymous function
»» Where to Run Scala Code »» Immutability
»» Scala Code Using IDE »» Loop
»» Scala Basics »» Recursion
»» Tail recursion »» What is a diamond problem in Scala
»» Statement vs Expression »» What is a trait
»» Closure »» Why Scala is the top most choice for a big data
»» Scala type system developer over Python and Java

»» Scala operators »» What is Apache Spark

»» Anonymous function »» Understanding Spark cluster

»» Placeholder syntax »» Is Spark a replacement to Hadoop

»» Partially applied functions »» Why Spark is faster than MapReduce

»» Function currying »» How Data is Stored in Spark

»» Week8: Quiz »» What is RDD

»» Week8: Assignment »» What is DAG

»» Week7 Assignment Solution »» RDD Lineage


»» Resiliency

09 Week
»» Immutability
»» Transformation & Action
Apache Spark - General Purpose Cluster »» Lazy Evaluation
Computing Framework »» Word count program in Spark
»» Word count program in PySpark
»» Scala Interview Preparation Series
»» Word count problem real-time example
»» What is App class in Scala
»» Week9: Quiz
»» Default args, named args & variable args
»» Week9: Assignment
»» Difference between nil, null, none & nothing
»» Week8 Assignment Solution
»» What is option in Scala
»» What is unit in Scala
»» Dealing with nulls in Scala
»» What is yield
10 Week
»» What is vector Apache Spark - In Depth
»» Scala if guards & pattern guards »» Spark Real-Time Example
»» What is “for comprehensions” »» Broadcast Variable
»» Difference between “==” in java and Scala »» Accumulators
»» Difference between strict val vs lazy val »» How Spark Executes Program on the Cluster
»» What are default packages in Scala »» Spark Driver and Executors
»» What is Scala apply method »» Client Mode, Cluster Mode and Local Mode
»» Analyzing Log Messages - Hands on »» Spark Ecosystem
»» Narrow vs Wide Transformations »» Map vs Map Partitions
»» Stages in Spark »» Introduction to Spark Structured API
»» Difference Between reduceByKey & reduce »» Spark DataFrame
»» Difference Between groupByKey & reduceByKey »» Understanding SparkSession
»» Pair RDD »» SparkSession vs SparkContext
»» Pair RDD vs Map »» Dataframe with Various Transformations
»» Understanding Default Parallelism »» RDD vs DataFrame vs Datasets
»» Difference Between repartition & coalesce »» Challenges with DataFrame
»» When to Increase/Decrease Partitions »» Spark Dataset API
»» Spark on YARN Architecture »» Difference Between DataFrame and Dataset
»» Benefits of Dataset
YARN - Yet Another Resource Negotiator »» Creating Dataframe/Datasets from Various
File Formats
»» Limitations or Drawbacks of MR1
»» Read Modes & Schema
»» Resource Manager
»» Ways to Define the Schema
»» Node Manager
»» Defining an Explicit Schema
»» Application Master
»» Week11: Quiz
»» Containers
»» Week11: Assignment
»» Week10: Quiz
»» Week10 Assignment Solution
»» Week10: Assignment
»» Week9 Assignment Solution

11 Week 12 Week
Apache Spark - Structured API Part-1 Apache Spark - Structured API Part-2
»» Cache vs Persist »» Writing Output to Sink (spark.write)

»» Spark Storage Levels »» Spark File Layout

»» Difference Between DAG & Lineage »» Benefits of Repartitions

»» How to Submit a Spark Job »» partitionBy & bucketBy

»» Real-time example - Finding top movies based »» Saving file in Various file format
on ratings »» Introduction to SparkSql
»» Storing Data in Persistent Manner
»» Handling Spark Metadata 13 Week
»» Low & High level Transformations Apache Spark - Optimization Part-1
»» Referring to a Column in Dataframe/Dataset »» Level of Optimizations
»» Column String »» Resource level optimizations
»» Column Object »» Application level optimizations
»» Column Expression »» Cluster level optimizations
»» Spark UDF using Structured API »» How to calculate no of Executors
»» Adding Column in Dataframe »» Thin Executor
»» Dataframe to Dataset Using Case Class. »» Fat Executor
»» Dataset to DataFrame Conversion »» How to calculate no of Executors
»» Spark Catalog »» How to Calculate Memory Allocation
»» Registering UDF with Driver »» How to Calculate No of Cores
»» Transformations Hands on Examples »» Heap Memory
»» Aggregate Transformations »» Off-Heap Memory
»» Simple Aggregations »» Hands on With Real-time cluster
»» Grouping Aggregations »» Understanding Cluster Configurations
»» Window Aggregations »» Realtime Example:
»» Joins on DataFrame Moving Data to HDFS using an Edge node and
»» Simple Join (Shuffle Sort Merge Join) work around it in a realtime cluster

»» Broadcast Join »» Static Resource allocation

»» Dealing With Ambiguous Column Names »» Dynamic Resource allocation

»» Dealing With Nulls »» Understanding Memory Usage in Spark

»» Internals of Join Operations »» Execution Memory

»» When to Use Simple Join & When to Use »» Storage Memory


Broadcast Join »» Practical Demonstration:
»» Grouping Aggregation Real-time Example Cache & Persist

»» Inferring Data in SparkSQL »» Java Serializer vs Kryo Serializer

»» Week12: Quiz »» Week12: Quiz

»» Week12: Assignment »» Week12: Assignment

»» Week11 Assignment Solution »» Week11 Assignment Solution


14 Week »» Understanding Producer & Consumer
»» Practical on Real-time Processing
Apache Spark - Optimization Part-2 »» Stream Transformations

»» Broadcast Join Practical Demonstrations »» Stateless Transformations

»» Broadcast Join Using RDD »» Stateful Transformations

»» When to Use Broadcast Join »» Window Operations

»» Broadcast Join Using Dataframe »» Batch Interval

»» Visualizing Broadcast Join with Structured API »» Window Size

»» Practical Demo on Repartition vs Coalesce »» Sliding Interval

»» Client Mode vs Cluster Mode When using Spark »» Practical on Stateless Transformation
Submit »» Practical on Stateful Transformation
»» Spark Join Optimizations »» reduceByKey vs updateStateByKey
»» Spark Advance Optimizations: Sort Aggregate »» Working With Sliding Window
vs Hash Aggregate »» reduceByKeyAndWindow Transformation
»» Spark Catalyst Optimizer »» reduceByWindow Transformation
»» Week14: Quiz »» countByWindow Transformation
»» Week14: Assignment »» Week15: Quiz
»» Week13 Assignment Solution »» Week15: Assignment
»» Week14 Assignment Solution

15 Week 16 Week
Apache Spark - Streaming Part-1
Apache Spark - Streaming Part-2
»» Kind of Processing
»» What Is Structured Streaming
»» What is Real-time Processing
»» Requirement Of Structured Streaming
»» The Importance of Real-time Processing
»» Limitations Of Spark Streaming
»» Batch processing vs Real-time Stream Processing
»» Benefits Of Spark Structured Streaming
»» Spark Streaming Data
»» Practical - Wordcount Example On Structured
»» Spark discretized stream or DStream Streaming
»» Batch & Batch Interval »» Dynamically Setting The Shuffle Partitions
»» Is Spark a Real-time Streaming Engine »» Data Stream Writer Output Modes
»» Stream Processing in Spark »» Datastream Output Modes - append, update &
»» Transformed DStream complete
»» Spark Streaming Graceful Shutdown
»» How Does Spark Streaming Code Execute Internally 17 Week
»» How a Job is Converted to Micro batches Apache Kafka - Distributed Event
»» Trigger Point For Micro Batches Streaming Platform
»» Types of Triggers - unspecified, time interval, »» Introduction To Kafka
one time, continuous
»» Kafka Architecture
»» Types of Data Sources - Socket Source, Rate
Source, File Source, Kafka Source »» Kafka Key Concepts/Fundamentals

»» Limitations of socket source »» Overview Of Zookeeper And Its Role In Kafka


Cluster
»» Practical on File Data Source
»» Cluster, Nodes, Brokers, Topics
»» Types of Spark Streaming Output Data Options
»» Consumer, Producers, Logs, Partitions
»» Fault Tolerance and Exactly Once Guarantee
»» Concept Of Consumer Groups
»» Understanding Checkpoint Location
»» Leader & Follower Partition
»» Stateful vs Stateless Transformations
»» Installing One Node Kafka Cluster On Local
»» Managed Stateful Operations vs UnManaged
Stateful Operations »» Installing Multinode Kafka Cluster On Local

»» Types of Aggregations - Continuous »» Command Line Producer And Consumer


Aggregations vs Time Bound Aggregations »» Replication Concept For Fault Tolerance
»» Window Transformations »» How Data Is Stored In Brokers
»» updateStateByKey, reduceByKeyAndWindow, »» Log Segments, Message Offsets, Message
reduceByWindow, countByWindow Index
»» Types of windows - Tumbling Time Window, »» ISR List / Minimum ISR
Sliding Time Window »» Committed Vs Uncommitted Messages
»» Dealing With Late Coming Records Using »» Writing A Kafka Producer In Java
Watermark
»» Writing A Kafka Consumer In Java
»» State Store Cleanup
»» Scaling Up The Kafka Cluster
»» Calculating the Watermark Boundary
»» Achieving Exactly Once Semantics
»» Streaming Joins
»» Integrating Kafka With Spark Structured
»» Streaming Dataframe to static dataframe Streaming.
»» Streaming Dataframe With Another Streaming »» Week16: Quiz
Dataframes
»» Week16: Assignment
»» Week16: Quiz
»» Week15 Assignment Solution
»» Week16: Assignment
»» Week15 Assignment Solution
18 Week AWS Athena:
»» What is Athena
Big Data on Cloud Part-1 »» When do we require Athena
AWS EMR (Elastic MapReduce): »» What problem Athena Solve

»» What is a VM (Virtual Machine) »» How Athena Works

»» On-Premise vs Cloud Setup »» Athena Pricing

»» Major Vendors of Hadoop Distribution »» Athena Practical Demonstration:

»» Why Cloud & Big Data on Cloud »» How to create a normal table manually on csv
data residing in s3
»» Major Cloud Providers of Bigdata
»» How to minimize data scanning in Athena
»» What is EMR
»» How to create partition table on Parquet file
»» HDFS vs S3
»» Inferring Schema automatically using AWS Glue
»» What Is S3
»» Glue Catalog
»» Important Instances in AWS
»» Week18: Quiz
»» Kinds of Nodes in Cluster
»» Week18: Assignment
»» Transient vs Long Running Cluster
»» Week17 Assignment Solution
»» Running Spark Code on Emr
»» How to Track Your Job
»» Copy File From S3 to Local
»» Zeppelin Notebook
19 Week
»» Types of EC2 Instances
Big Data on Cloud Part-2
»» How to Create a VM AWS Glue
»» What is a Keypair »» What is AWS Glue?
»» Elastic IP »» Introduction To Glue
»» AWS Storage, Networking & CLI »» Features of Glue
»» Instance Store »» AWS Glue Benefits
»» S3 & EBS »» AWS Glue Terminology
»» Public Ip Vs Private Ip »» Pointing to Specific Data Stores and Endpoints
»» Network Switches »» Glue Data Catalogue
»» Security Group »» Crawlers
»» AWS Command Line Interface »» Connecting to Your Data Store
»» Launch an EMR Cluster Using Advanced Options »» Using Crawlers for Catalogue Tables
»» Overview and Working of Glue Jobs »» Viewing The DAG In Ui-Graph View, Tree View,
»» Adding New Jobs in Glue Logs Viewing

»» Triggering Jobs and Their Scheduling »» Example Showcasing Bash Operators Usage
»» Setting Precedence Among Various Tasks
AWS Redshift
»» Lifecycle Of A Task-Understanding Various Stages
»» Database vs Data Warehouse vs Data Lake
»» About Trigger_rules & Understanding With Example
»» Introduction to Amazon Redshift
»» Airflow Artifact - More On Operators
»» Benefits of Amazon Redshift
»» Writing Our Own Custom Operators
»» Use Cases of Amazon Redshift
»» Walkthrough Of Airflow UI
»» Redshift Master Slave Architecture
»» Connections To Various Datastores & Variables
»» Types of Nodes
»» Working With Connections, Understanding
»» Redshift Spectrum Sensors – Demo
»» Redshift Fault Tolerance »» Building an end-to-end customer-360 pipeline
»» Redshift Sort Keys using Airflow involving data collection from
various sources, processing in spark, loading
»» Redshift Distribution Styles
the processed data in hive and uploading the
»» Practical Demonstration same to HBase and generating a notification
»» Week19: Quiz about success of the pipeline to the
downstream applications.
»» Week19: Assignment
»» Week18 Assignment Solution

20 Week Plus
One end-to-end pipeline PROJECT
Apache Airflow - Workflow Management involving all Major components like
Platform Sqoop, Hdfs, Hive, Hbase, Spark... etc.
»» Introduction To Airflow And Its Usage Interview Preparation Tips:
»» What Is Workflow
»» Cron-Job Creation Example
Sample Resume
»» Airflow Additional Features 15+ Mock Interview Recordings
»» Airflow Architecture And Components
Mock Interview QA
»» Airflow Installation Demo
»» Dags-Creating A Simple Helloworld Dag
Interview Questions
»» Introduction To Tasks And Operators How to Handle Managerial Round Qs
5 Star Google Rated
Big Data Course
LEARN FROM THE EXPERT

9108179578
