
Topics

What is RDD in Spark

What is DataFrame in Spark

What is DataSet in Spark

Different operation in RDD and DataFrame

What is lazy execution


Speculative Execution

XGBoost and CATBoost

How regions are split in HBase


How tables are split into regions

HBase Region Server, Regions, HMaster, ZooKeeper


What is ZooKeeper

What is Region Server


HotSpot Problem

Partitioning and bucketing in Hive


Types of partitioning

How to Delete a partition

What is the HiveRC file

What is MetaStore

Difference between internal(Managed table) and External Table

Can we print the headers of the columns


Can we limit a query to certain rows
What is map side join

Difference between orderBy and SortBy in Hive

Sqoop basics (import, export, incremental job parameters)


Sqoop Incremental import and import all

Sqoop Export
Difference between input split and HDFS Block

Responsibilities of InputFormat, default InputFormat

Default No. Of Mappers/Reducers

The significance of job.setJarByClass

What is combiner class

In What cases you want to use a combiner


What is a Partitioner? What is the default Partitioner class? How are the keys partitioned?
ANSWER

Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
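A minimal Scala sketch of creating a partitioned RDD (assuming an existing SparkContext named sc):

val data = Seq(1, 2, 3, 4, 5)
val rdd = sc.parallelize(data, numSlices = 3)  // RDD with 3 logical partitions
rdd.getNumPartitions                           // 3
rdd.map(_ * 2).collect()                       // Array(2, 4, 6, 8, 10)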

A DataFrame is a distributed collection of data organized into named columns. Conceptually, it is equivalent to relational tables with good optimization techniques.

A DataFrame can be constructed from an array of different sources such as Hive tables, structured data files, external databases, or existing RDDs.
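A small Scala sketch of constructing DataFrames from local data and from a structured file (assuming a SparkSession named spark; the file path is illustrative):

import spark.implicits._
val peopleDF = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")  // from in-memory data
val jsonDF = spark.read.json("/data/people.json")                   // from a structured data file (path assumed)
peopleDF.select("name").filter($"age" > 26).show()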

A Dataset is a data structure in Spark SQL which is strongly typed and maps to a relational schema. It represents structured queries with encoders and is an extension of the DataFrame API. A Spark Dataset provides both type safety and an object-oriented programming interface.
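A minimal Scala sketch of the typed Dataset API (assuming a SparkSession named spark):

case class Person(name: String, age: Int)
import spark.implicits._
val ds = Seq(Person("alice", 30), Person("bob", 25)).toDS()  // Dataset[Person]
ds.filter(_.age > 26).map(_.name).show()                     // typed lambdas, checked at compile time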

Transformation:
In Spark, the role of a transformation is to create a new dataset from an existing one. Transformations are considered lazy, as they are only computed when an action requires a result to be returned to the driver program. E.g. filter, union, intersection...
Action:
In Spark, the role of an action is to return a value to the driver program after running a computation on the dataset. E.g. count(), collect()

Lazy evaluation in Spark means that execution does not start until an action is triggered. In Spark, lazy evaluation comes into the picture when transformations occur.
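A short Scala sketch showing transformations, actions, and lazy evaluation together (assuming an existing SparkContext named sc):

val nums = sc.parallelize(1 to 10)
val evens = nums.filter(_ % 2 == 0)  // transformation: nothing executes yet
val doubled = evens.map(_ * 2)       // still lazy, only the lineage is recorded
doubled.count()                      // action: the whole chain runs now, returns 5
doubled.collect()                    // action: Array(4, 8, 12, 16, 20)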
In big data applications, dozens of cluster nodes are supposed to collectively work on a certain job, but one machine with a hardware or software issue can terribly slow down the whole process. Spark has a solution to this problem and it's called speculative execution.

Spark monitors the time needed to complete tasks in a stage. If some task takes much more time than the other ones in the same stage, Spark resubmits a new copy of the task on another worker node. There are then two identical tasks running in parallel; when one of them completes successfully, Spark kills the other one, picks the output of the successful task and moves on.
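Speculative execution is disabled by default and is turned on through configuration; a sketch in Scala (the multiplier/quantile values shown are simply the usual defaults):

import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.speculation", "true")            // enable speculative execution
  .set("spark.speculation.multiplier", "1.5")  // a task counts as slow if it takes 1.5x the median
  .set("spark.speculation.quantile", "0.75")   // start checking after 75% of tasks in a stage finish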

Rather than training all of the models in isolation from one another, boosting trains models in succession, with each new model being trained to correct the errors made by the previous ones. Models are added sequentially until no further improvements can be made.

The advantage of this iterative approach is that the new models being added are focused on correcting the mistakes made by the other models. In a standard ensemble method where models are trained in isolation, all of the models might simply end up making the same mistakes!

Gradient boosting involves creating and adding decision trees to an ensemble model sequentially. New trees are created to correct the residual errors in the predictions from the existing ensemble. XGBoost and CatBoost are popular gradient-boosting libraries.
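A toy residual-boosting sketch in plain Scala, only to make the "fit the previous errors" idea concrete; it uses one-dimensional data and single-split stumps and is not how XGBoost or CatBoost are implemented:

case class Stump(threshold: Double, leftValue: Double, rightValue: Double) {
  def predict(x: Double): Double = if (x <= threshold) leftValue else rightValue
}

// Fit a single stump to the current residuals by brute-force threshold search.
def fitStump(xs: Array[Double], residuals: Array[Double]): Stump =
  xs.distinct.map { t =>
    val (l, r) = xs.zip(residuals).partition(_._1 <= t)
    val lv = if (l.nonEmpty) l.map(_._2).sum / l.length else 0.0
    val rv = if (r.nonEmpty) r.map(_._2).sum / r.length else 0.0
    val err = l.map { case (_, y) => (y - lv) * (y - lv) }.sum +
              r.map { case (_, y) => (y - rv) * (y - rv) }.sum
    (err, Stump(t, lv, rv))
  }.minBy(_._1)._2

// Each round fits a new stump to the residuals of the ensemble built so far.
def boost(xs: Array[Double], ys: Array[Double], rounds: Int, lr: Double = 0.1): Seq[Stump] = {
  var preds = Array.fill(xs.length)(0.0)
  (1 to rounds).map { _ =>
    val residuals = ys.zip(preds).map { case (y, p) => y - p }   // errors of the current ensemble
    val stump = fitStump(xs, residuals)
    preds = xs.zip(preds).map { case (x, p) => p + lr * stump.predict(x) }
    stump
  }
}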

Whenever a region becomes too large, it is divided into two child regions. Each child region represents exactly half of the parent region. The split is then reported to the HMaster. It is handled by the same Region Server until the HMaster allocates it to a new Region Server for load balancing.
A region contains all the rows between the start key and the end key assigned to that region. HBase tables can be divided into a number of regions in such a way that all the columns of a column family are stored in one region. Each region contains the rows in sorted order.

Regions:
1.) A table can be divided into a number of regions. A region is a sorted range of rows storing data between a start key and an end key.
2.) A region has a default size of 256MB, which can be configured according to the need.
3.) A group of regions is served to the clients by a Region Server.
4.) A Region Server can serve approximately 1000 regions to the client.

HMaster Server:
1.) The HBase HMaster performs DDL operations (create and delete tables) and assigns regions to Region Servers.
2.) It coordinates and manages the Region Servers.
3.) It assigns regions to the Region Servers on startup and re-assigns regions to Region Servers during recovery and load balancing.
4.) It monitors all the Region Server instances in the cluster (with the help of ZooKeeper) and performs recovery activities whenever any Region Server is down.
ZooKeeper:
1.) ZooKeeper acts like a coordinator inside the HBase distributed environment.
2.) Every Region Server, along with the HMaster Server, sends a continuous heartbeat at regular intervals to ZooKeeper, and it checks which servers are alive and available.
3.) There is an inactive HMaster server which acts as a backup for the active server; if the active server fails, it comes to the rescue.
4.) The active HMaster sends heartbeats to ZooKeeper, while the inactive HMaster listens for the notifications sent by the active HMaster. If the active HMaster fails to send a heartbeat, its session is deleted and the inactive HMaster becomes active.
5.) If a Region Server fails to send a heartbeat, the HMaster performs suitable recovery actions.
6.) ZooKeeper also maintains the .META server's path, which helps any client in searching for a region. The client first has to check with the .META server to find out in which Region Server a region resides, and it gets the path of that Region Server.

Region Server components:
1.) WAL: The Write Ahead Log (WAL) is a file attached to every Region Server inside the distributed environment. The WAL stores new data that hasn't yet been persisted or committed to permanent storage. It is used to recover the data sets in case of failure.
2.) Block Cache: The Block Cache resides in the top of the Region Server. It keeps the frequently read data in memory. If the data in the BlockCache is least recently used, that data is removed from the BlockCache.
3.) MemStore: The MemStore is the write cache. It stores all the incoming data before committing it to the disk or permanent memory. There is one MemStore for each column family in a region, so there are multiple MemStores for a region when the region has multiple column families.
4.) HFile: HFiles are stored on HDFS; they hold the actual cells on disk. The MemStore flushes its data to an HFile when the size of the MemStore exceeds a threshold.
HBase hotspotting occurs because of a poorly designed row key. Because of a bad row key design, HBase stores a large amount of data on a single node and the entire traffic is redirected to this node whenever a client requests data, leaving the other nodes idle.

More details:
In HBase, a set of rows resides on each region server. When we have sequential row keys, we write the rows to a single region server, which results in a huge amount of writes to a single node, causing congestion and an overloaded region server. So we design the row key in such a way that it is distributed better throughout the region servers.
This can be solved using techniques such as hashing and salting.
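A minimal salting sketch in plain Scala: a sequential row key is prefixed with a bucket derived from its hash, so consecutive keys land on different regions (the bucket count of 8 is an assumption):

val saltBuckets = 8
def saltedKey(rowKey: String): String = {
  val salt = (rowKey.hashCode & Int.MaxValue) % saltBuckets  // deterministic bucket 0..7
  f"$salt%02d-$rowKey"                                       // e.g. "03-20240115-user42"
}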

Partitioning:
Hive partitioning is a way to organize tables into partitions by dividing them into different parts based on partition keys.
Partitioning is helpful when the table has one or more partition keys. Partition keys are the basic elements for determining how the data is stored in the table.
Bucketing:
Buckets in Hive are used to segregate Hive table data into multiple files or directories for efficient querying.

The data present in the partitions can be divided further into buckets.
The division is performed based on the hash of the particular columns that we selected for the table.
Buckets use some form of hashing algorithm at the back end to read each record and place it into buckets.
In Hive, we have to enable buckets by using set hive.enforce.bucketing=true;
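A sketch of the corresponding DDL, issued here from Scala through a Hive-enabled SparkSession (assuming spark is such a session; table and column names are illustrative):

spark.sql("SET hive.enforce.bucketing=true")
spark.sql("""
  CREATE TABLE sales (id INT, amount DOUBLE, customer_id INT)
  PARTITIONED BY (sale_date STRING)
  CLUSTERED BY (customer_id) INTO 8 BUCKETS
  STORED AS ORC
""")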
Static:
Inserting input data files individually into a partition table is static partitioning.
Usually when loading files (big files) into Hive tables, static partitions are preferred.
If you want to use static partitioning in Hive, you should set the property hive.mapred.mode = strict. This property is set by default in hive-site.xml.
Dynamic:
If you want to partition on a number of columns but you don't know how many columns there are, dynamic partitioning is suitable.
If you want to use dynamic partitioning in Hive, the mode must be set to non-strict.
You can use ALTER TABLE with the DROP PARTITION option to drop a partition of a table.
ALTER TABLE some_table DROP IF EXISTS PARTITION(year = 2012);

It is a file that is executed when you launch the Hive shell, making it an ideal place for any Hive configuration/customization you want to be set on start of the Hive shell. This could be:
- Setting column headers to be visible in query results
- Making the current database name part of the hive prompt
- Adding any jars or files
- Registering UDFs

The metastore is the central repository of Apache Hive metadata. It stores metadata for Hive tables (like their schema and location) and partitions in a relational database.

Hive metastore consists of two fundamental units:


A service that provides metastore access to other Apache Hive services.
Disk storage for the Hive metadata which is separate from HDFS storage.

When drop is used on an internal (managed) table, both the data and the table schema will be deleted.

When drop is used on an external table, only the table schema will be deleted and the data will still be in HDFS.
Yes: hive> set hive.cli.print.header=true;
Yes, using LIMIT (along with WHERE, HAVING, etc.).
A map-side join is a process where the join between two tables is performed in the map phase without the involvement of the reduce phase. A map-side join allows a table to be loaded into memory, ensuring a very fast join operation performed entirely within a mapper, without having to use both the map and reduce phases.

In SORT BY only partial sorting is done, as sorting is restricted to each reducer, whereas in ORDER BY, since only one reducer is used, we get completely sorted output.
To let Hive convert eligible joins to map-side joins automatically, configure: hive.auto.convert.join=true
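For comparison, the Spark DataFrame analogue of a map-side join is a broadcast join; a minimal Scala sketch (largeDf and smallDf are assumed DataFrames, with smallDf small enough to fit in executor memory):

import org.apache.spark.sql.functions.broadcast
val joined = largeDf.join(broadcast(smallDf), Seq("customer_id"))  // smallDf is shipped to every executor, avoiding a shuffle of largeDf
joined.explain()  // the plan should show a BroadcastHashJoin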

Sqoop is a tool designed to transfer data between Hadoop and relational database servers.

Import:
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp_add \
--m 1 \
--target-dir /queryresult

Using --where to import a subset of a table:


$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp_add \
--m 1 \
--where "city = 'sec-bad'" \
--target-dir
Incremental import:
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--m 1 \
--incremental append \
--check-column id \
--last-value 1205
Import All:
$ sqoop import-all-tables \
--connect jdbc:mysql://localhost/userdb \
--username root

It is mandatory that the table being exported to is created manually and is present in the database into which the data is exported.
$ mysql
mysql> USE db;
mysql> CREATE TABLE employee (
id INT NOT NULL PRIMARY KEY,
name VARCHAR(20),
deg VARCHAR(20),
salary INT,
dept VARCHAR(10));

The following command is used to export the table data (which is in the emp_data file on HDFS) to the employee table in the db database of the MySQL database server.
$ sqoop export \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee \
--export-dir /emp/emp_data
Block – An HDFS block is the physical representation of data in Hadoop.
InputSplit – A MapReduce InputSplit is the logical representation of the data present in the block in Hadoop.

Block – By default, the HDFS block size is 128MB, which you can change as per your requirement.
InputSplit – By default, the InputSplit size is approximately equal to the block size. It is user defined.

InputFormat is the first component in MapReduce; it is responsible for creating the input splits and dividing them into records. Using the InputFormat we define how the input files are split and read.
The InputFormat defines the data splits, which define both the size of the individual map tasks and their potential execution server.
The InputFormat defines the RecordReader, which is responsible for reading actual records from the input files.
E.g. TextInputFormat (the default), KeyValueTextInputFormat

By default, the number of reducers is 1.
To change the number of reducers: job.setNumReduceTasks(5);

The number of mappers depends on the file size (one mapper per input split).

With the setJarByClass method we tell Hadoop to find the relevant jar by finding out that the class specified as its parameter is present as part of that jar.

A combiner, also known as a semi-reducer, is an optional class that operates by accepting the inputs from the Map class and thereafter passing the output key-value pairs to the Reducer class.

The Combiner class is used in between the Map class and the Reduce class to reduce the volume of data transferred between map and reduce. Usually, the output of the map task is large and the data transferred to the reduce task is high.
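A minimal Scala sketch of wiring a combiner into a Hadoop job (TokenizerMapper, IntSumReducer and WordCountDriver are hypothetical classes of an assumed word-count job; reusing the reducer as the combiner only works because its output types match its input types):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

val job = Job.getInstance(new Configuration(), "wordcount")
job.setJarByClass(classOf[WordCountDriver])    // hypothetical driver class
job.setMapperClass(classOf[TokenizerMapper])   // hypothetical mapper
job.setCombinerClass(classOf[IntSumReducer])   // combiner runs on each mapper's local output
job.setReducerClass(classOf[IntSumReducer])    // same class reused as the reducer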
The Partitioner in MapReduce controls the partitioning of the keys of the intermediate map output. Via a hash function, the key (or a subset of the key) is used to derive the partition. According to the key value, each mapper's output is partitioned, and records having the same key value go into the same partition; each partition is then sent to a reducer. The Partitioner class determines which partition a given (key, value) pair will go to. The partition phase takes place after the map phase and before the reduce phase.

The default partitioner in Hadoop MapReduce is the HashPartitioner, which computes a hash value for the key and assigns the partition based on this result.
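A minimal custom Partitioner sketch in Scala mirroring what the default HashPartitioner does, i.e. partition = (key.hashCode & Integer.MAX_VALUE) % numPartitions (the key/value types are assumed to be Text and IntWritable):

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Partitioner

class HashLikePartitioner extends Partitioner[Text, IntWritable] {
  override def getPartition(key: Text, value: IntWritable, numPartitions: Int): Int =
    (key.hashCode & Integer.MAX_VALUE) % numPartitions
}
// Wired into a job with job.setPartitionerClass(classOf[HashLikePartitioner])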
Domain: Spark, HBase, Hive, Sqoop, MapReduce
Reference: https://www.youtube.com/watch?v=t5J1iIww4R0
Transformations:

map(func): It returns a new distributed dataset formed by passing each element of the source through a function func.
filter(func): It returns a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func): Here, each input item can be mapped to zero or more output items, so func should return a sequence rather than a single item.
mapPartitions(func): It is similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
mapPartitionsWithIndex(func): It is similar to mapPartitions but provides func with an integer value representing the index of the partition; func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.
sample(withReplacement, fraction, seed): It samples a fraction (fraction) of the data, with or without replacement, using a given random number generator seed.
union(otherDataset): It returns a new dataset that contains the union of the elements in the source dataset and the argument.
intersection(otherDataset): It returns a new RDD that contains the intersection of elements in the source dataset and the argument.
distinct([numPartitions]): It returns a new dataset that contains the distinct elements of the source dataset.
groupByKey([numPartitions]): It returns a dataset of (K, Iterable<V>) pairs when called on a dataset of (K, V) pairs.
reduceByKey(func, [numPartitions]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V.
aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions]): When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value.
sortByKey([ascending], [numPartitions]): It returns a dataset of key-value pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
join(otherDataset, [numPartitions]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
cogroup(otherDataset, [numPartitions]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.
cartesian(otherDataset): When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).
pipe(command, [envVars]): Pipes each partition of the RDD through a shell command, e.g. a Perl or bash script.
coalesce(numPartitions): It decreases the number of partitions in the RDD to numPartitions.
repartition(numPartitions): It reshuffles the data in the RDD randomly to create either more or fewer partitions and balance it across them.
repartitionAndSortWithinPartitions(partitioner): It repartitions the RDD according to the given partitioner and, within each resulting partition, sorts records by their keys.
Actions:

reduce(func): It aggregates the elements of the dataset using a function func (which takes two arguments and returns one); the function should be commutative and associative so that it can be computed correctly in parallel.
collect(): It returns all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count(): It returns the number of elements in the dataset.
first(): It returns the first element of the dataset (similar to take(1)).
take(n): It returns an array with the first n elements of the dataset.
takeSample(withReplacement, num, [seed]): It returns an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
takeOrdered(n, [ordering]): It returns the first n elements of the RDD using either their natural order or a custom comparator.
saveAsTextFile(path): It is used to write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark calls toString on each element to convert it to a line of text in the file.
saveAsSequenceFile(path) (Java and Scala): It is used to write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system.
saveAsObjectFile(path) (Java and Scala): It is used to write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().
countByKey(): It is only available on RDDs of type (K, V). It returns a hashmap of (K, Int) pairs with the count of each key.
foreach(func): It runs a function func on each element of the dataset for side effects such as updating an Accumulator or interacting with external storage systems.
Map stage − The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.

Reduce stage − This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer's job is to process the data and produce a new set of output, which will be stored in HDFS.

Mapper − Maps the input key/value pairs to a set of intermediate key/value pairs.
JobTracker − Schedules jobs and tracks the assigned jobs to the TaskTracker.
TaskTracker − Tracks the tasks and reports status to the JobTracker.
https://techvidvan.com/tutorials/hadoop-inputsplit-vs-blocks/

To change block size : dfs.block.size property in hdfs-site.xml


A combiner does not have a predefined interface and it must implement the Reducer interface’s reduce() method.

A combiner operates on each map output key. It must have the same output key-value types as the Reducer class.

A combiner can produce summary information from a large dataset because it replaces the original Map output.
https://data-flair.training/blogs/hadoop-partitioner-tutorial/
