
Big Data Analytics with Hadoop and Apache Spark

The combined power of Spark and Hadoop Distributed File System


(HDFS)

- [Kumaran Ponnambalam] Data engineers often use stacks   to leverage the power of
multiple technologies.  For example, there is often a need for not just  scalable storage
but also fast processing.  Many teams find themselves using the combination  of
Hadoop for storage and Spark for compute,  because it provides unparalleled
scalability  and performance for analytics pipelines.  In order to harness this power,  it is
important to understand how Hadoop and Spark  work with each other and utilize the
levers available. My name is Kumaran Ponnambalam. In this course, I will show you how
to build scalable and high-performance analytics pipelines with Apache Hadoop and
Spark. I will only discuss key tools and best practices for taking advantage of this
combination. We will use a Hortonworks Sandbox for this course. You need prior
familiarity with both Apache Hadoop and Spark. In this course we will only focus on
using Hadoop and Spark together. We will also use Zeppelin notebooks for our
examples. Please refer to other essential courses and resources if you want to learn the
essentials of these technologies.  That being said, let's explore how to maximize  the
combined power of Hadoop and Spark. 

Apache Hadoop overview

- [Instructor] In this video,  I will review the key features  and the current state of
technology for Apache Hadoop.  Hadoop is an open-source technology  that started the
big data wave.  It provides distributed data storage  and computing using low-cost
hardware.  It can scale to petabytes of data  and can run on clusters with hundreds of
nodes. Hadoop mainly consists of two components: the Hadoop Distributed File
System, or HDFS, which provides data storage, and MapReduce, a programming model and
implementation that provides distributed computing capabilities for data stored in
HDFS. Where does Hadoop stand today? Let's look at HDFS and MapReduce
separately.  HDFS is still a very good option  for cheap storage of large quantities of
data. It provides scaling, security, and cost benefits  that help in its adoption.  It is most
suitable for enterprises  with in-house data centers  who want to host the data within
their network.  Cloud alternatives like AWS S3, Google Cloud Storage,  and Azure Blob
are becoming increasingly popular too.  MapReduce, on the other hand, is becoming
old. While it scales horizontally  over hundreds of compute nodes, it is very slow,  as it
primarily uses disk storage  for intermediate caching instead of memory.  Newer
technologies, like Apache Spark and Apache Flink,  have emerged that can execute the
same processing  at much faster rates.  The newer technologies also support other
capabilities  and a growing library of connectors,  which makes them a better choice
than MapReduce.  

Apache Spark overview

- [Instructor] In this video, I will review  some of the salient features of Apache
Spark. Apache Spark is an open source technology  that started out as a more efficient
alternative  to Hadoop MapReduce.  It is a large-scale distributed data processing
engine. Spark stores its data primarily in memory  to speed up computations.  It has also
grown to add a number of capabilities  like batch processing, real-time
streaming,  machine learning and graphs.  Spark can be programmed with Scala, Java,
Python and R. Its performance features and programming support make it the most
popular big data technology today.

Integrating Hadoop and Spark

- In this video, I will review the benefits  of using Hadoop and Spark together for big
data analytics.  Why is the combination of Hadoop and Spark so powerful?  HDFS
provides large-scale distributed data storage.  Spark provides large-scale fast processing
of the same data.  Together, they make an excellent combination  for building data
pipelines.  Spark is well integrated with Hadoop natively  and makes optimal use of that
integration.  For example, Spark can access and update HDFS data  using multiple
parallel nodes.  There are a number of data read optimizations  that use less memory
and I/O.  Spark can use HDFS for intermediate data caching.  Also, YARN provides a
single cluster management mechanism  for both HDFS and Spark.  So, my
recommendation, especially for enterprise deployments,  is to utilize the processing
power of Spark  with the scalable storage of HDFS  to build high performance
processing jobs.  In this course, I will demonstrate  the strengths of this integration  and
provide samples and best practices  for building big data pipelines with Spark and
Hadoop.  

Setting up the environment

- [Narrator] This course requires a number of big data components to be installed and
set up, including Apache Hadoop and Apache Spark. We also will use Apache Zeppelin
as the notebook  for building our exercises.  For ease of installation,  we will use the
Hortonworks Sandbox that already pre-bundles all these components. Since you are
familiar with these technologies, you may also try out these examples in your own
setups. We will use the Docker version of the Sandbox on Mac. Please install Docker if
you do not already have it.  The Sandbox itself can be downloaded from this
website, cloudera.com/downloads/hortonworks-sandbox.html.  You should download
the Hortonworks HDP Sandbox.  It provides downloads and instructions for
Mac,  Windows and Linux.  There is also documentation to set up and
troubleshoot. Please use the same if you run into issues with other versions of the
Sandbox. Windows installation can get tricky and may require hacks to get it running
based on your individual setup. Before we begin, please make sure that the Docker
setup has sufficient memory and CPUs allocated to it. Please provide at least two CPU
cores and eight GB of memory to Docker. If you notice sluggishness in the UIs, or
services frequently stopping, then please increase your RAM allocation. I have already
downloaded the installation package and unzipped it in this specific folder. Before you
move forward, please update your /etc/hosts file to map sandbox-hdp to 127.0.0.1. Now,
we can proceed to install the Sandbox. We can do so by executing the command sh
docker-deploy-hdp30.sh. It will install two Docker images. sandbox-hdp is the main
image that runs all the software. sandbox-proxy is a reverse proxy that is required for
the main image to work. The install will also start both the containers. Please note that
they are already in my local Docker repository, so it skips the Docker pull
step. The images are huge, so the pull may take many minutes. The containers
usually take some time to start up, say 30 minutes to start all the services. Please check
the status of the services by visiting the page localhost:8080. You can log in using the
account raj_ops with the password raj_ops again. This Ambari dashboard shows the
status  of all the services installed,  and most of them will be in the starting state.  Go to
hosts and click on the host link here. You will notice that the services are in starting
state for most of the time, so give it about 30 minutes for all the services to come
alive.

Using exercise files

- [Narrator] In this video,  I will show you the steps to set up the exercise files.  Before
we go there, let's make sure  that the Sandbox is fully up and running.  We can do so by
checking the host link in the Ambari UI.  Please make sure that the green check
box shows up for the host. Next, I have downloaded the exercise files for this course in
this directory. It has three CSV files, which are data files, and four JSON files, which
are Zeppelin notebook files. Let's now load them into the setup. Go to the
Hortonworks landing page on port 1080. This shows all the application UIs
available. Go to the Shell client available here. Log in as raj_ops, with the password
again raj_ops. We want to provide full access to the /user/raj_ops directory in HDFS to all
the users. We can do so with the following command: hdfs dfs -chmod 777
/user/raj_ops. This command has been executed successfully. Now go to the Ambari
UI on port 8080. On the top right corner, open the Files View. Click on Files View to open
the viewer for HDFS files. This shows all the HDFS directories available inside this
instance. Navigate to the /user/raj_ops directory. You can create a new folder here
called raw_data. Navigate to the raw_data folder. Now upload all the three CSV files we
have in the exercise files to this folder. We first upload product_vendor.csv, then we upload
sales_orders.csv, and finally we upload student_scores.csv. Verify that the uploaded files
show up in the folder correctly. Next, we go to the Zeppelin notebook. The Zeppelin
notebook runs on port 9995 on the sandbox-hdp host. We first want to import
all our exercise notes,  so click on the import note link.  Click select a JSON file.  Now
upload each of the JSON files one by one.  The JSON files will start showing  under this
Sparks course directory.  You can verify if all the notebooks have been uploaded  by
looking under the Sparks course directory.  Next, open the notebook called 03_XX_data
ingestion  with Spark and HDFS. Please make sure that the notebook loads up
correctly  and shows up as shown here. Now go to the second paragraph here,  which is
just a command  to print the current version of Spark.  Click on the run paragraph
button here. Now this should immediately run and print the version of Spark. Typically,
when you run Spark for the first time, it may take some time, even a couple of
minutes, for the first command to run and successfully come back. This is perfectly
okay; you just have to be patient while the command executes. This confirms that our
setup is up and running,  and now we can start using it for our course.  

Storage formats

- [Instructor] In this chapter, I will review  various options available, and best practices to
store data in HDFS.  I will start off with storage formats in this video.  HDFS supports a
variety of storage formats,  each with its own advantages and use cases.  The list
includes raw text files,  structured text files like CSV, XML, and JSON,  native sequence
files, Avro formatted files,  ORC files, and Parquet files.  I will review the most popular
ones for analytics now.  Text files carry the same format  they have in a normal file
system.  They are stored as a single physical file in HDFS.  They are of low
performance, as they do not support parallel operations.  They require more storage,
and do not have any schema. In general, they are not recommended. Avro files support
language-neutral data serialization. So data written through one language can
be read by another with no problems. Data is stored row by row, like CSV files. They
support a self-describing schema, which can be used to enforce constraints on data. They are
compressible, and hence can optimize on storage.  They are splittable into
partitions,  and hence can help in parallel reads and writes.  They are ideal for
situations  that require multi-language support.  Parquet files store data column by
column,  similar to columnar databases.  This means each column can be read separately
from disk  without reading other columns. This saves on I/O.  They support
schema.  Parquet files are both compressible and splittable,  and hence are performance
and storage optimized.  They also can support nested data structures.  Parquet files are
ideal for batch analytics jobs  for these reasons. Analytics applications typically have
data stored  as records and columns, similar to RDBMS tables.  Parquet provides overall
better performance  and flexibility for these applications.  I will show later in the
course  how Parquet enables parallelization and I/O optimization. 
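As a quick illustration of the difference in practice, here is a minimal Spark (Scala) sketch that writes the same data frame in two of these formats. It assumes a Zeppelin %spark2 paragraph where the spark session is predefined, and the output paths are hypothetical.

    // Read one of the structured text files uploaded earlier in the course.
    val sampleDF = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/user/raj_ops/raw_data/sales_orders.csv")

    // Row-oriented text output: easy to inspect, but larger and slower for analytics.
    sampleDF.write.mode("overwrite").csv("/tmp/formats_demo/csv")

    // Columnar Parquet output: compressible, splittable, and schema-aware.
    sampleDF.write.mode("overwrite").parquet("/tmp/formats_demo/parquet")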


Partitioning

- [Narrator] Partitioning is a key concept to use while   working with HDFS data.  In this
video I will review the importance of partitioning  and how it works.  Why do we need
partitioning?  Relational databases speed up data access by using  indexes on columns
used in filter conditions. HDFS does not have the concept of an index. Even if a single
row is required from a large petabyte-scale file, the entire file must be read to extract the
record. This introduces significant disk I/O. Partitioning provides a way to
read only a subset of data based on a partition key. Similar to indexes, partitions can
also be based on multiple attributes. Typical attributes suitable for partitioning include
dates and element identifiers like customer or product names. How does partitioning
work? When we create an HDFS file specifying a partition key, Hadoop creates a separate
directory per partition. Records corresponding to a specific partition key are stored in the
same directory. For example, if we use product as a partition key, a separate directory
will be created for each product, and corresponding records will be stored there. If we
use a filter on the product attribute while querying, only those subdirectories that match
the filter need to be read. While selecting attributes for partitioning, choose attributes
that have a limited or controlled set of values; otherwise too many subdirectories might
be created. Also ensure that the records are equally distributed among the various
values. Choose attributes that are most used in query patterns; likely candidates
include dates, customer IDs, and product IDs, among others. In the next video, I will discuss
an alternative to partitioning called bucketing.

Bucketing

- [Instructor] As seen in the previous video, partitioning is only optimal when a given
attribute has a small set of unique values. What if we need to partition for a key with a
large number of values without proliferating the number of directories created? Bucketing
is the answer. Bucketing works similar to partitioning, but instead of using the value of
the attribute, it uses a hash function to convert the value into a specific hash
key. Values that have the same hash key end up in the same bucket, or
subdirectory. The number of unique buckets can be controlled and limited. This also ensures
even distribution of values across all buckets. It's ideal for attributes that have a large
number of unique values, like order number or transaction ID. Choose bucketing for
attributes that have a large number of unique values and those that are most
frequently used in query filters. Experiment with multiple bucket columns to find
optimal read/write performance  for the specific use case.  In the next video I will
review  some best practices for data storage.  

Best practices for data storage

- [Instructor] In this video,  I will walk through some of the best practices  for designing
HDFS schema and storage. First, during the design stage, understand the most-used
read and write patterns for your data. Identify if it's read intensive or write intensive or
both. For reads, analyze what filters are usually applied on the data. Determine what needs
optimization and what can be compromised. Is it important to reduce storage
requirements, or is it okay to compromise on storage for better read-write
performance? Choose your options carefully, as these cannot be easily changed after
the pipeline is deployed and data is created. Changing things like storage formats and
compression codecs would require reprocessing all the data. Run tests on actual
data to understand performance and storage characteristics. Experiment if required to
compare between different storage options available. Choose partitioning and
bucketing keys wisely, as they incur significant additional costs during writes while
helping in reads. In the next chapter, let's start reading and writing HDFS
files with Spark using these practices.

Reading external files into Spark


- In this chapter, I will demonstrate options available  to ingest data into HDFS with
Spark.  We will be using the Zeppelin notebook  titled Code_03_XX  Data Ingestion with
Spark and HDFS. Navigate to this notebook at sandbox-hdp:9995. On opening the
notebook, you will find that Zeppelin is similar to Jupyter notebooks in many ways. We
can create paragraphs, each with a different interpreter. The code can be executed by
clicking on the Run button. Results will display immediately below the paragraph. In
this video, we will focus on reading external data  into Spark.  Spark provides
connectors  to a number of external data sources  including a local file, a file from
HDFS,  or even a Kafka Topic.  The first paragraph here is to test  if Spark is successfully
installed and running. The %spark2 in this first line indicates  the interpreter to use.  We
can run this paragraph with a run button  and the results will show up in the
bottom. We see that the current version of Spark is showing up correctly. So we are
good to proceed with the other exercises.  In the next paragraph, we read a CSV
file. Since Spark is running under YARN in the sandbox, it uses HDFS as its disk. We will
load the sales_orders.csv file that we uploaded earlier in the course into a data frame
called rawSalesData. We set the option for header to tell Spark to consider the first
line of this file as the header. We also specify inferSchema equal to true, so Spark will
examine the first few lines in the file to infer the data type of each column. It will also
use the header line to name the individual columns. We then print the schema of the
data frame as well as the first five rows to make sure that the data is read
correctly. Let's run this code now and review the results. We can see that the schema as
well as the data show up as desired. In the next few videos, I will show you many
ways  of parallelizing this data and storing in HDFS. 
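For reference, a minimal Scala sketch of the read described above might look like the following; it assumes the %spark2 interpreter with the spark session predefined, and the exact paragraph in the course notebook may differ slightly.

    // Read the CSV uploaded earlier into a data frame, using the header row
    // for column names and inferring column types from the first few lines.
    val rawSalesData = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/user/raj_ops/raw_data/sales_orders.csv")

    rawSalesData.printSchema()   // verify the inferred schema
    rawSalesData.show(5)         // print the first five rows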

Writing to HDFS

- As discussed in the previous videos,  CSV files cannot be used for parallel reads and
writes.  We need to convert them to other formats like Parquet,  for efficient processing
of data in the later stages.  In this video, we will write the raw sales data data frame  into
a Parquet file in HDFS. The code for this is simple. We will use the write function
available on the data frame. We then set the format to Parquet, and the mode is set to
overwrite, to overwrite any existing contents. In real pipelines though, append may be
the better option if there are periodic additions to the data. We then use gzip to
compress the data. We save it to the raw_parquet directory under /user/raj_ops. Let's
execute this code and review the results. First, notice the Spark job link
appearing at the top of the paragraph. You can click on this to open the Spark UI and
look at how Spark executed this job. The Spark UI may launch with a fully qualified
URL, and this may generate an error. You can overcome it by adding that URL to the
/etc/hosts file. Here is how my /etc/hosts file is set up. We can also go to the HDFS file
viewer to review the data that is created. We can see a directory created called
raw_parquet. If you go under that, you will see part files created under this specific directory,
and the extension shows that they are gzip-compressed files in Parquet format. Depending on the
size of data,  there could be more files that get created.  In the next video,  I will show
you how to partition data  while writing to HDFS.  
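A minimal Scala sketch of this write, assuming the rawSalesData data frame from the previous video and the output path mentioned above:

    // Write the data frame to HDFS as gzip-compressed Parquet.
    // "overwrite" replaces existing contents; "append" may suit incremental loads.
    rawSalesData.write
      .format("parquet")
      .mode("overwrite")
      .option("compression", "gzip")
      .save("/user/raj_ops/raw_parquet")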

Parallel writes with partitioning

- [Instructor] As reviewed in the earlier videos,  partitioning of data enables parallel
reads and writes.  It also helps in filtering out data  while reading into memory.  We will
create a partitioned HDFS store based on the product column. There are only four unique
products in the data set, so it lends itself to easier partitioning. We simply need to add
the partitionBy method in the write process to trigger partitioning while storing
data. We then save this to the partitioned_parquet directory. Let's run this code and
examine the HDFS files created. Let's go and look at the HDFS files. When we navigate
to the partitioned_parquet directory, we see four subdirectories created. They are one per
partition. The name of the directory shows the partition key and the value. This
directory name can then be used to filter the data and focus only on directories that contain
the relevant data. In the next video, I will show you how to use bucketing with
Hive.  
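A minimal Scala sketch of this partitioned write, assuming the same rawSalesData data frame; the Product column name is an assumption based on the data set.

    // Partition the output by the Product column: one subdirectory per product,
    // named like Product=<value>, enabling parallel writes and partition pruning on read.
    rawSalesData.write
      .format("parquet")
      .mode("overwrite")
      .partitionBy("Product")
      .save("/user/raj_ops/partitioned_parquet")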

Parallel writes with bucketing

- [Instructor] As reviewed in the earlier videos, bucketing can be used to partition
data when there are a large number of unique values for a given column. In this
example, we will create buckets again based on the product column. We will create
three buckets. In order to do bucketing, we use the bucketBy method. We specify the
number of buckets and the column to bucket by. We also want to save this data as a
Hive table. In the sandbox, Spark is already integrated with Hive as its default warehouse
tool. Adding a saveAsTable with the table name saves the data in Hive. We also print
the HDFS directory where the data would be stored so we can go and examine it. We run
an example query from this table to verify its contents. Let's execute this code
now. We see the contents printed correctly. We can go to the HDFS directory to
examine the contents. The HDFS directory is /apps/spark/warehouse. We see the
product bucket table created here. Navigating to this table, we see three part files being
created. They correspond to the three different buckets. In the next video, let us
review some of the best practices for data ingestion.
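A minimal Scala sketch of this bucketed save; the table name product_bucket_table and the Product column name are assumptions.

    // Bucket by the Product column into three buckets and save as a Hive table.
    // bucketBy requires saveAsTable; the files land under the Spark warehouse directory.
    rawSalesData.write
      .format("parquet")
      .mode("overwrite")
      .bucketBy(3, "Product")
      .saveAsTable("product_bucket_table")

    // Confirm where the table is stored and verify its contents.
    println(spark.conf.get("spark.sql.warehouse.dir"))
    spark.sql("SELECT * FROM product_bucket_table LIMIT 5").show()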

Best practices for ingestion

- [Instructor] Let's review some of the best practices   for data ingestion with Hadoop
and Spark. Enable parallelism for maximum write performance. This can be achieved by
using splittable file formats like Parquet, and using partitions or buckets. When doing
incremental data ingestion, use append mode. This will help optimally distribute the write
loads across partitions and buckets. While reading external data into Spark, prefer
sources that can enable parallelism. This includes JDBC and Kafka. Break down large
files into smaller files if reading from disk. Request the data originators to create such
parallelizable data sources. In the next chapter, I will show you how to read data that is
stored in an optimal fashion in HDFS.

How Spark works

- [Narrator] In order to optimize an Apache Spark Pipeline,  it is important to understand
how Spark works internally.  When design decisions are made,  they need to be analyzed
on how they impact scalability,  and performance.  In this video,  I will review how Spark
executes a Pipeline  and optimizes it.  I recommend further reading on this topic  to
master the internals. Spark programs run on a Driver Node, which works with a Spark
cluster to execute them. A Spark cluster can consist of multiple Executor Nodes
capable of executing  the program in parallel.  The level of parallelism and performance
achieved is dependent upon how the Pipeline is designed.  Let's review an example
Pipeline and how it gets executed. First, the source data is read from an external
data source into a structure, Data 1. Data 1 is then converted to a data frame or its
internal representation, Resilient Distributed Datasets, or RDDs. During this conversion,
it is partitioned,  and individual partitions are assigned and moved  to the Executor
Nodes available.  When a transform operation like Map or Filter is executed,  these
operations are pushed down to the Executors.  The Executors execute the code locally
on their partitions and create new partitions with the result. There is no movement of
data between the Executors. Hence, transforms can be executed in parallel. Next, when
an action like Reduce or Group By is performed, the partitions need to be shuffled and
aggregated. This results in a movement of data between Executors and can create I/O
and memory bottlenecks. Finally, the data is collected back to the Driver Node. The
partitions are merged and sent back to the Driver.  From here,  they can be stored into
external destination databases.  Spark has an optimizer  that analyzes the steps needed
to process data  and optimizes for performance and resources.  Spark only executes
code  when an action like Reduce or Collect is performed.  At this point, the optimizer
kicks in  and analyzes all the previous steps required  to achieve this action.  It then
comes up with a physical execution plan. The optimizer looks for reducing
I/O,  shuffling, and memory usage.  If the data sources can support parallel I/O,  then
Spark accesses them directly from the Executor  and parallelizes these operations.  This
provides improved performance  and reduces memory requirements on the driver.  In
the later videos,  I will show you how to influence the physical plans for better
performance. 
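To make the lazy-evaluation behavior concrete, here is a minimal Scala sketch; the column name Quantity is an assumption. Nothing executes until the action at the end, at which point the optimizer builds the physical plan.

    import spark.implicits._

    // Transformations only describe the computation; no job runs yet.
    val orders   = spark.read.parquet("/user/raj_ops/raw_parquet")
    val filtered = orders.filter($"Quantity" > 10)                    // transformation
    val doubled  = filtered.withColumn("DoubleQty", $"Quantity" * 2)  // transformation

    // The action below triggers optimization and execution across the executors.
    println(doubled.count())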

Reading HDFS files with schema

- [Instructor] In this chapter, we will read  the parquet files we created earlier into
Spark. We will examine the execution plans  to understand how Spark works to read
these files. We will use the exercise files code_04_XX  Data Extraction into Spark for this
chapter. Let's open this. We can read the non-partitioned raw Parquet file into
Spark using the read.parquet function. We print the first five records in the data
frame. We also use the spark.time function to measure the elapsed time for the total
operation.  Spark.time can be used to compare the performance  of different
approaches while designing data pipelines.  Finally, we execute the explain function  to
print out the physical plan.  Let's run this code and examine the results.  The operation
took 228 milliseconds.  For a small data set like the one we have here,  most of the time
is overhead  and may not make sense for comparison.  But for operations that run for
many minutes or a few hours,  this can provide a true measure.  Let's examine the
physical plan to understand what it shows.  It does a file scan for a parquet file.  It shows
the columns that are read from the file.  It shows the location of the file, and then it
shows  the schema that is used to read the file. We will examine the rest of the
contents  in the future examples as we exercise them.  
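A minimal Scala sketch of this timed read, using the raw_parquet directory created earlier:

    // Read the raw Parquet data, timing the whole operation with spark.time.
    val salesData = spark.time {
      val df = spark.read.parquet("/user/raj_ops/raw_parquet")
      df.show(5)   // print the first five records
      df
    }

    // Print the physical plan: it shows the FileScan, columns, file location, and schema.
    salesData.explain()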

Reading partitioned data


- [Instructor] In this video,  we will read a partitioned data set into Spark  and
understand how it works. We will read the Parquet files under the directory
partitioned_parquet. The product name, which is the partition value, will not be stored
inside the files, as it is already available in the directory name. The base path needs to be
provided for Spark to read the product name also as a column. We again time
the operation and display the first five rows. We will also print the execution plan. Let's
run this code and review the results. The most important addition to the physical
plan is the partition count. This shows the number of partitions read into
memory. More partitions mean more I/O and memory requirements. Reducing this
count will lead to better performance. We will see techniques for this later in the
course. Next, we only read one partition from the stored data. If we need to analyze
only a subset of data, it is recommended to only read that subset and minimize I/O
and memory. Let's run this code now. In the physical plan, you will notice that the
partition count is not printed, as only one subdirectory has been read. In the next video,
I will show you  how to read bucketed data from Hive.  
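A minimal Scala sketch of both reads described above; the partition value Mouse in the single-partition read is an assumption.

    // Read all partitions; basePath lets Spark recover Product as a column
    // from the Product=<value> directory names.
    val partitionedSalesData = spark.read
      .option("basePath", "/user/raj_ops/partitioned_parquet")
      .parquet("/user/raj_ops/partitioned_parquet")
    partitionedSalesData.show(5)
    partitionedSalesData.explain()

    // Read only a single partition's subdirectory to minimize I/O and memory.
    val mouseOnly = spark.read
      .parquet("/user/raj_ops/partitioned_parquet/Product=Mouse")
    mouseOnly.explain()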

Reading bucketed data

- [Instructor] In this video,  I will show you how Spark reads bucketed data  stored in
Hive.  We can read data in Hive using a SQL command.  We do a simple SELECT
statement  to read the entire table and print its contents.  We also print its execution
plan. Let's run this code and examine the results.  When we look at the execution
plan,  We can see that it is no different  than reading a file from HDFS.  Data frames,
data sets, SQL, and RDDs  provide different interfaces  to the same underlying
operations.  So the execution plans will be similar  irrespective of which API we use.  The
plan still shows what HDFS file is read and will also provide partition information if it is
used.  We will now review some of the best practices  for reading data into Spark  in the
next video.  
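A minimal Scala sketch of reading the bucketed Hive table via Spark SQL, reusing the hypothetical table name from the earlier bucketing sketch:

    // Read the bucketed Hive table through Spark SQL and inspect the plan.
    val bucketedData = spark.sql("SELECT * FROM product_bucket_table")
    bucketedData.show(5)
    bucketedData.explain()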

Best practices for data extraction

- [Instructor] What are some of the key best practices   for data extraction from HDFS
into Spark for analytics?  The first is to read only the required data into memory.  This
means read subdirectories, subset of partitions,  and subset of columns.  Less data
means less resource requirements  and less time to execute.  Use data sources and file
formats that support parallelism.  Avro and Parquet are some of the recommended
ones. The number of partitions in the data files is important. Each partition can be
independently read by a separate executor core in parallel. The number of parallel
operations in a Spark cluster is the number of executor nodes multiplied by the
number of CPU cores in each executor. If the number of partitions is greater than this
value,  it will trigger maximum parallelism.  Please keep in mind that other jobs  running
at the same time will also compete  for these resources.  In the next chapter,  I will focus
on optimizing processing data read from HDFS.  

Pushing down projections

- [Instructor] In this chapter,  we will review some of the techniques  that can be used
during data processing  to optimize Spark and HDFS performance.  The code for this
chapter is available  in the notebook, code_05_XX Optimizing Data Processing.  We will
start with pushing down projections.  Projection here means the set or subset of
columns that are selected from a data set. Typically, we read an entire file with all
the columns into memory and then use only a subset of columns later for
computations. During lazy evaluation, Spark is smart enough to identify the subset of
columns  that will actually be used  and only fetch them into memory.  This is called
projection push down. In this example, we read the entire Parquet file  into the sales
data data frame.  Later, we only select the product and quantity columns.  Spark
identifies this and only fetches these columns into memory.  Let's run this code and
review the execution plan.  In the execution plan, the FileScan  reads only two columns,
quantity and product,  and this provides optimization.  A pipeline developer needs to
help Spark  to do these optimizations  by not using columns unnecessarily.  For
example, using a show function with all the columns, even for troubleshooting,  would
fetch all the columns  and prevent projection push downs.  In the next video, let's look
at pushing down filters. 
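A minimal Scala sketch of the projection push down described above; the column names are assumptions based on the sales data.

    // Read the full Parquet file, but select only two columns downstream.
    val salesData    = spark.read.parquet("/user/raj_ops/raw_parquet")
    val productQtyDF = salesData.select("Product", "Quantity")

    // The FileScan in the plan should list only Product and Quantity,
    // confirming the projection was pushed down to the data source.
    productQtyDF.explain()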

Pushing down filters

- [Instructor] Similar to projection push downs,  Spark is capable of identifying a subset
of rows that are actually required for processing and fetch only them into memory. If the
subset of rows corresponds to specific partitions, Spark will only read those
partitions. In this example, we try two filters. One is the filter on Product. Product is a
partition column. The other is a filter  on a non-partition column called Customer.  Spark
will push down both these filters to the file scan,  so it will not read unnecessary rows
into memory.  Let's execute the code and review the results.  In the case of a filter on a
partition column,  we see partition filters being used.  The partition count is also
one.  This means that Spark has identified  that the filter is on a partition  and only
attempts to read files for that partition.  In the case of filter without a partition
column,  we see pushed filters on the file scan.  But the partition count used is still
four.  Partition-based filter push down is most efficient  since Spark will only attempt  to
read files within that partition. In the non-partition case, it has to read the files of all
partitions, but only load those records that match the filter condition. Understanding
this mechanism helps design partitions to maximize push downs. This saves on I/O
and memory. The earlier the filtering happens, the less data needs to be processed in the
later stages of the pipeline. In the next video, I will discuss partitioning and
coalescing.  
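A minimal Scala sketch of the two filters compared above, using the partitioned data from the earlier sketch; the Customer column and the literal values are assumptions.

    import spark.implicits._

    // Filter on the partition column: only the matching partition directory is read.
    val byProduct = partitionedSalesData.filter($"Product" === "Mouse")
    byProduct.explain()    // expect PartitionFilters with a partition count of 1

    // Filter on a non-partition column: the predicate is pushed to the FileScan,
    // but files from all four partitions still have to be read.
    val byCustomer = partitionedSalesData.filter($"Customer" === "Maria")
    byCustomer.explain()   // expect PushedFilters with a partition count of 4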

Managing partitions

- [Explainer] One of the key aspects to understand  about Spark internals is
partitioning. This is different from HDFS partitioning. When Spark reads a partitioned
file, it creates internal partitions equal to the default parallelism set up for
Spark. Transforms maintain the same number of partitions. But actions will create a
different number, usually equal to the default parallelism set up for this Spark
instance. Typically, in local mode, parallelism is two. And in cluster mode, it's
200. Having too many or too few partitions will impact performance. As discussed in
the earlier video, the ideal number of partitions should be equal to the total number of
cores available for Spark.  We can change the number of partitions  by repartitioning
and coalescing. Let's run the exercise code first and then review the results. We first
print the default parallelism set up for this cluster. It's two; this number can be changed
in Spark configuration. Then we look at the number of partitions on a partitioned data
source, which is salesData. The number of partitions is again two. Next we read a
non-partitioned data source, which is the rawOrders file, and look at the partition count. This
number is only one. This is the problem with having data sources that do not support
parallelism. An RDD can be repartitioned to a new number of partitions using the
repartition function. We specify the number of partitions: we repartition
rawSalesData into eight partitions and confirm that count. Next, we can also coalesce a
data frame  to reduce the number of partitions.  We coalesce the partitionedSalesData
from eight partitions to three partitions, and we also print and confirm that
size. Repartitioning does partitioning from scratch. Coalescing simply combines
existing partitions to create larger partitions. Note that both these activities
themselves take a significant amount of time. Typically, actions upset the ideal partition
count, and hence repartitioning may need to be done after them. But it's recommended to do
that only if there are multiple transforms that follow the action and can take
advantage of these resized partitions. In the next video, I will show you how to
optimize shuffling.
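A minimal Scala sketch of the partition checks described above; the data frame names follow the earlier sketches and the rawOrders path is an assumption.

    // Default parallelism configured for this Spark instance.
    println(spark.sparkContext.defaultParallelism)

    // Partition counts of a partitioned and a non-partitioned source.
    println(partitionedSalesData.rdd.getNumPartitions)
    val rawOrders = spark.read.option("header", "true").csv("/user/raj_ops/raw_data/sales_orders.csv")
    println(rawOrders.rdd.getNumPartitions)

    // Repartition from scratch into eight partitions, then coalesce down to three.
    val repartitionedSalesData = partitionedSalesData.repartition(8)
    println(repartitionedSalesData.rdd.getNumPartitions)
    val coalescedSalesData = repartitionedSalesData.coalesce(3)
    println(coalescedSalesData.rdd.getNumPartitions)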

Managing shuffling

- [Instructor] As seen in the earlier videos, Spark actions like reduce and group by cause
shuffling of data between executor nodes. This creates I/O and delays in overall
processing. The Spark optimizer does a lot of work in the background to minimize
shuffling. However, as a designer, it's still important to understand the shuffling
impact of the design and focus on minimizing it. We can either eliminate shuffling
steps or minimize the amount of data being shuffled. In this exercise, we will
implement two different word count logics and compare the shuffling between them. The
first method uses a groupByKey and then a map to compute the word count. The
second method uses a reduceByKey directly. Let's run this code and review the
results. The groupByKey took 540 milliseconds. The reduceByKey took 361
milliseconds. Comparison of time taken may not make sense for small datasets, but
doing the same on large data sets will provide the real difference. Let's check the Spark
UI for the execution plans for these alternatives. First, let's click on job zero for the
execution plan for groupByKey. Look at the shuffle read and shuffle write numbers
here; they indicate the amount of data being shuffled. This is 182 bytes for
groupByKey. Next let's look at job one. This is the execution plan for the reduceByKey. We see
that the shuffle read and write numbers are 154 bytes. The data shuffled here for
reduceByKey is less than that of the groupByKey. Let's also look at the actual data being
shuffled. To look at the actual data being shuffled, we print the outcomes of these
activities. We see that in the case of a groupByKey, individual values are being
shuffled, whereas in the case of a reduceByKey, only the sums are being shuffled,
leading to less data. This is a very small data set. In the case of large data sets, the
difference can be significant. It is important to look at execution plans and the data
sizes being shuffled and optimize them for building faster processing pipelines for
analytics.  In the next video,  I will show you how to optimize joins.  
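A minimal Scala sketch contrasting the two word count approaches; the sample words are made up for illustration.

    // Two ways to compute a word count over an RDD of words.
    val words = spark.sparkContext.parallelize(Seq("spark", "hadoop", "spark", "hdfs", "spark"))

    // Method 1: groupByKey shuffles every individual value before counting.
    val countsViaGroup = spark.time {
      words.map(w => (w, 1)).groupByKey().map { case (w, ones) => (w, ones.sum) }.collect()
    }

    // Method 2: reduceByKey combines counts locally first, shuffling only partial sums.
    val countsViaReduce = spark.time {
      words.map(w => (w, 1)).reduceByKey(_ + _).collect()
    }

    println(countsViaGroup.mkString(", "))
    println(countsViaReduce.mkString(", "))
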
Improving joins

- [Instructor] Joins in Spark help combine two data sets to provide better insights. As
important as they are in analytics, they also cause significant delays. A join between two
data frames requires the partitions of these data frames to be shuffled, so that rows that
match the join condition are on the same executor nodes. The Spark optimizer again does a
lot of behind-the-scenes work to optimize joins. One of them is called a broadcast
join. If the size of one of the joined data frames  is less than the
spark.sql.autoBroadcastJoinThreshold,  then Spark broadcasts the entire data frame  to
all executor nodes where the other data frame resides.  Then, the join itself becomes a
local activity  within the executor since the entire copy  of the smaller data frame is
available in that node.  In this example, we read the product_vendor.csv  into the
products data frame.  This data frame itself is very small.  We then join it with sales
data to produce a combined data frame. In Spark, we provide a join hint to recommend
the use of broadcast joins using the broadcast function. Let's execute the code and
look at the execution plan  that Spark has generated.  We see that there are two parallel
FileScan operations  to read both the files.  Then we see a BroadcastExchange
operation  for the smaller products data frame.  This means that this data frame is
broadcast to all executor nodes. Then we see the BroadcastHashJoin operation for
the actual join.  It is recommended to use denormalized data sources  and avoid joins if
possible.  Else, review the execution plans to make sure  that the joins are optimal.  
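A minimal Scala sketch of the broadcast join described above; the join column name Product is an assumption, and salesData follows the earlier sketches.

    import org.apache.spark.sql.functions.broadcast

    // Small lookup data frame: product-to-vendor mapping.
    val products = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/user/raj_ops/raw_data/product_vendor.csv")

    // Hint Spark to broadcast the small data frame so the join happens locally on each executor.
    val combined = salesData.join(broadcast(products), "Product")
    combined.explain()   // expect BroadcastExchange followed by BroadcastHashJoin
    combined.show(5)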

Storing intermediate results

- [Narrator] As we have seen in the previous examples of execution plans, every time
an action is performed, Spark goes all the way to the data source and reads the
data. This happens even if the data was read before and some actions were
performed. While this works fine when running automated jobs, it is a problem during
interactive analytics. Every time a new action command is executed on an interactive
shell, Spark goes back to its source. It is better to cache intermediate results, so we can
resume analytics from these results without starting all over. Spark has two modes of
caching: in memory and on disk. The cache method is used to cache in memory only. The
persist method is used to cache in memory, disk, or both. In this example, we first cache
the words RDD into memory using the cache function. Spark does lazy evaluation, so
we need to execute an action to trigger the caching. Next, we will compare execution
plans before and after intermediate caching. First, we do a filter for product equals
mouse on the coalesced sales data data frame. Then we use the persist function to
store the intermediate coalesced sales data data frame to disk. We then run the same
filter for product equals mouse on this data frame, and then review the execution
plan. Let's run this code and review the results. First, let's open the Spark UI. Go to the
Storage tab. This shows all the cached and persisted data frames. We see the first
words RDD showing as stored in memory. We also see the coalesced RDD being stored
on disk. We coalesced into three partitions before, and we see the same partition count
here.  Next, let's review the execution plans.  In the plan before caching,  we see that
there is a filter push down.  The filter push down goes all the way to the file scan.  This
means that Spark goes all the way back to the file  and re-executes all the code.  In the
second plan after caching,  we see that there is an in memory scan.  And the filter is on
this in memory scan.  This means that Spark is using  the temporary persisted data
frame for doing this filter.  Caching helps in performance and re-use,  especially while
doing interactive queries.  
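A minimal Scala sketch of the caching and persistence steps described above, reusing the hypothetical words RDD and coalescedSalesData data frame from the earlier sketches:

    import org.apache.spark.storage.StorageLevel
    import spark.implicits._

    // Cache the words RDD in memory; an action is needed to materialize the cache.
    words.cache()
    words.count()

    // Filter before persisting: the plan goes all the way back to the file scan.
    coalescedSalesData.filter($"Product" === "Mouse").explain()

    // Persist the intermediate data frame to disk, then re-run the same filter:
    // the plan should now show an in-memory/persisted scan instead of a FileScan.
    coalescedSalesData.persist(StorageLevel.DISK_ONLY)
    coalescedSalesData.count()
    coalescedSalesData.filter($"Product" === "Mouse").explain()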

Best practices for data processing

- [Instructor] In this video, I will review  the best practices for data processing  with Spark
and HDFS.  Use push downs for filters and projections  to data sources as much as
possible.  The smaller the data being transferred,  the better is the performance.  Choose
and design partition keys  based on the columns most used in filters and
aggregations. This speeds up both reading and processing data. Use repartitioning and
coalescing wisely. These activities themselves take significant time, so only use them if
there are a series of transforms that can take advantage of them. Avoid joins as much
as possible. Use denormalized data sources. If required, use them judiciously and check
execution plans. Clock all operations with spark.time() on production-equivalent data to
understand slow-running operations and take actions. Use caching when
appropriate. Caching takes memory and disk space, hence use it for
intermediate results that are frequently reused. Use the explain plan to understand the
physical plan  and look for ways in which the plan can be made better.  In the next
chapter, we will do a use-case project  to exercise the learnings in this course.  
