Big Data Analytics With Hadoop and Apache Spark
- [Kumaran Ponnambalam] Data engineers often use stacks to leverage the power of
multiple technologies. For example, there is often a need for not just scalable storage
but also fast processing. Many teams find themselves using the combination of
Hadoop for storage and Spark for compute, because it provides unparalleled
scalability and performance for analytics pipelines. In order to harness this power, it is
important to understand how Hadoop and Spark work with each other and utilize the
levers available. My name is Kumaran Ponnambalam, and in this course, I will show you how
to build scalable, high-performance analytics pipelines with Apache Hadoop and
Spark. I will discuss key tools and best practices for taking advantage of this
combination. We will use the Hortonworks Sandbox for this course. You need prior
familiarity with both Apache Hadoop and Spark; in this course we will focus only on
using Hadoop and Spark together. We will also use Zeppelin notebooks for our
examples. Please refer to other essential courses and resources if you want to learn the
basics of these technologies. That being said, let's explore how to maximize the
combined power of Hadoop and Spark.
- [Instructor] In this video, I will review the key features and the current state of
technology for Apache Hadoop. Hadoop is an open-source technology that started the
big data wave. It provides distributed data storage and computing using low-cost
hardware. It can scale to petabytes of data and can run on clusters with hundreds of
nodes. Hadoop mainly consists of two components: the Hadoop Distributed File
System, or HDFS, which provides data storage, and MapReduce, a programming model and
implementation that provides distributed computing capabilities over data stored in
HDFS. Where does Hadoop stand today? Let's look at HDFS and MapReduce
separately. HDFS is still a very good option for cheap storage of large quantities of
data. It provides scaling, security, and cost benefits that help in its adoption. It is most
suitable for enterprises with in-house data centers who want to host the data within
their network. Cloud alternatives like AWS S3, Google Cloud Storage, and Azure Blob
are becoming increasingly popular too. MapReduce, on the other hand, is becoming
old. While it scales horizontally over hundreds of compute nodes, it is very slow, as it
primarily uses disk storage for intermediate caching instead of memory. Newer
technologies, like Apache Spark and Apache Flink, have emerged that can execute the
same processing at much faster rates. The newer technologies also support other
capabilities and a growing library of connectors, which makes them a better choice
than MapReduce.
- [Instructor] In this video, I will review some of the salient features of Apache
Spark. Apache Spark is an open source technology that started out as a more efficient
alternative to Hadoop MapReduce. It is a large-scale distributed data processing
engine. Spark stores its data primarily in memory to speed up computations. It has also
grown to add a number of capabilities like batch processing, real-time
streaming, machine learning and graphs. Spark can be programmed with Scala, Java,
Python, and R. Its performance features and programming support make it the most
popular big-data technology today.
- In this video, I will review the benefits of using Hadoop and Spark together for big
data analytics. Why is the combination of Hadoop and Spark so powerful? HDFS
provides large-scale distributed data storage. Spark provides large-scale fast processing
of the same data. Together, they make an excellent combination for building data
pipelines. Spark is well integrated with Hadoop natively and makes optimal use of that
integration. For example, Spark can access and update HDFS data using multiple
parallel nodes. There are a number of data read optimizations that use less memory
and I/O. Spark can use HDFS for intermediate data caching. Also, YARN provides a
single cluster management mechanism for both HDFS and Spark. So, my
recommendation, especially for enterprise deployments, is to utilize the processing
power of Spark with the scalable storage of HDFS to build high performance
processing jobs. In this course, I will demonstrate the strengths of this integration and
provide samples and best practices for building big data pipelines with Spark and
Hadoop.
- [Narrator] This course requires a number of big data components to be installed and
set up, including Apache Hadoop and Apache Spark. We will also use Apache Zeppelin
as the notebook for building our exercises. For ease of installation, we will use the
Hortonworks Sandbox that pre-bundles all these components. Since you are
familiar with these technologies, you may also try out these examples in your own
setups. We will use the Docker version of the Sandbox on Mac; please install Docker if
you do not already have it. The Sandbox itself can be downloaded from this
website, cloudera.com/downloads/hortonworks-sandbox.html. You should download
the Hortonworks HDP Sandbox. It provides downloads and instructions for
Mac, Windows, and Linux. There is also documentation to help set up and
troubleshoot. Please use the same documentation if you run into issues with other versions of the
Sandbox. Windows installation can get tricky and may require hacks to get it running
based on your individual setup. Before we begin, please make sure that the Docker
setup has sufficient memory and CPUs allocated to it. Please provide at least two CPU
cores and eight GB of memory to Docker. If you notice sluggishness in the UI, or
services frequently stopping, then please increase your RAM allocation. I have already
downloaded the installation package and unzipped it in this specific folder. Before you
move forward, please update your /etc/hosts file to map sandbox-hdp to 127.0.0.1. Now,
we can proceed to install the Sandbox. We can do so by executing the command sh
docker-deploy-hdp30.sh. It will install two Docker images. sandbox-hdp is the main
image that runs all the software. sandbox-proxy is a reverse proxy that is required for
the main image to work. The install will also start both the containers. Please note that
they are already in my local Docker repository, so it skips the Docker pull
step. The images are huge, so the pull may take many minutes. The containers
usually take some time to start up, say 30 minutes, to start all the services. Please check
the status of the services by visiting the page localhost:8080. You can log in using the
account raj_ops with the password raj_ops again. This Ambari dashboard shows the
status of all the services installed, and most of them will be in the starting state. Go to
hosts and click on the host link here. You will notice that the services are in starting
state for most of the time, so give it about 30 minutes for all the services to come
alive.
Using exercise files
- [Narrator] In this video, I will show you the steps to set up the exercise files. Before
we go there, let's make sure that the Sandbox is fully up and running. We can do so by
checking the host link in the Ambari UI. Please make sure that the green check
mark shows up for the host. Next, I have downloaded the exercise files for this course in
this directory. It has three CSV files, which are data files, and four JSON files, which
are Zeppelin notebook files. Let's now load them into the setup. Go to the
Hortonworks landing page on port 1080. This shows all the application UIs
available. Go to the shell client available here. Log in as raj_ops, with the password
again raj_ops. We want to provide full access to the raj_ops directory in HDFS to all
the users. We can do so with the following command: hdfs dfs -chmod 777
/user/raj_ops. This command has been executed successfully. Now go to the Ambari
UI on port 8080. On the top right corner, open the files view. Click on files view to open
the viewer for HDFS files. This shows all the HDFS directories available inside this
instance. Navigate to the /user/raj_ops directory. You can create a new folder here
called raw_data. Navigate to the raw_data folder. Now upload all the three CSV files we
have in the exercise files to this folder. We first upload the product vendor CSV, then
the sales orders CSV, and finally the student scores CSV. Verify that the uploaded files
show up in the folder correctly. Next, we go to the Zeppelin notebook. The Zeppelin
notebook runs on port 9995 on the sandbox-hdp website. We first want to import
all our exercise notes, so click on the import note link. Click select a JSON file. Now
upload each of the JSON files one by one. The JSON files will start showing under this
Spark course directory. You can verify that all the notebooks have been uploaded by
looking under the Spark course directory. Next, open the notebook called 03_XX Data
Ingestion with Spark and HDFS. Please make sure that the notebook loads up
correctly and shows up as shown here. Now go to the second paragraph here, which is
just a command to print the current version of Spark. Click on the run paragraph
button here. This should immediately run and print the version of Spark. Typically,
when you run Spark for the first time, it may take some time, even a couple of
minutes, for the first command to run and successfully come back. This is perfectly
okay; you just have to be patient while the command executes. This confirms that our
setup is up and running, and now we can start using it for our course.
Storage formats
- [Instructor] In this chapter, I will review various options available, and best practices to
store data in HDFS. I will start off with storage formats in this video. HDFS supports a
variety of storage formats, each with its own advantages and use cases. The list
includes raw text files, structured text files like CSV, XML, and JSON, native sequence
files, Avro formatted files, ORC files, and Parquet files. I will review the most popular
ones for analytics now. Text files carry the same format they have in a normal file
system. They are stored as a single physical file in HDFS. They are of low
performance, as they do not support parallel operations. They require more storage,
and do not have any schema. In general, they are not recommended. Avro files support
language-neutral data serialization, so data written through one language can
be read by another with no problems. Data is stored row by row, like CSV files. They
support a self-describing schema, which can be used to enforce constraints on data. They are
compressible, and hence can optimize storage. They are splittable into
partitions, and hence can help in parallel reads and writes. They are ideal for
situations that require multi-language support. Parquet files store data column by
column, similar to columnar databases. This means each column can be read separately
from disk without reading other columns. This saves on I/O. They support
schema. Parquet files are both compressible and splittable, and hence are performance
and storage optimized. They also can support nested data structures. Parquet files are
ideal for batch analytics jobs for these reasons. Analytics applications typically have
data stored as records and columns, similar to RDBMS tables. Parquet provides overall
better performance and flexibility for these applications. I will show later in the
course how Parquet enables parallelization and I/O optimization.
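The columnar idea can be illustrated with a pure-Python sketch. This is only an illustration of the layout, not actual Parquet code, and the data and field names are made up; real Parquet adds encoding, compression, and row groups on top of this.

```python
# Pure-Python illustration of row vs. columnar storage layouts.

rows = [
    {"product": "Mouse", "quantity": 2, "customer": "Ana"},
    {"product": "Keyboard", "quantity": 1, "customer": "Ben"},
    {"product": "Mouse", "quantity": 5, "customer": "Cid"},
]

# Row-oriented layout (like CSV): all columns of a record stored together.
row_store = rows

# Column-oriented layout (like Parquet): one contiguous list per column.
column_store = {
    "product": [r["product"] for r in rows],
    "quantity": [r["quantity"] for r in rows],
    "customer": [r["customer"] for r in rows],
}

def total_quantity_row_store(store):
    # Must touch every full record, even though only one field is needed.
    return sum(record["quantity"] for record in store)

def total_quantity_column_store(store):
    # Touches only the 'quantity' column; other columns are never read.
    return sum(store["quantity"])

print(total_quantity_row_store(row_store))        # 8
print(total_quantity_column_store(column_store))  # 8
```

Both layouts give the same answer, but the columnar version never reads the product or customer values at all, which is where the I/O saving comes from.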
Partitioning
- [Narrator] Partitioning is a key concept to use while working with HDFS data. In this
video I will review the importance of partitioning and how it works. Why do we need
partitioning? Relational databases speed up data access by using indexes on columns
used in filter conditions. HDFS does not have the concept of an index. Even if a single
row is required from a large petabyte-scale file, the entire file must be read to extract the
record. This introduces significant disk I/O. Partitioning provides a way to
read only a subset of data based on a partition key. Similar to indexes, partitions can
also be based on multiple attributes. Typical attributes suitable for partitioning include
dates, and element identifiers like customer or product names. How does partitioning
work? When we create an HDFS file specifying a partition key, Hadoop creates a separate
directory per partition. Records corresponding to a specific partition key are stored in the
same directory. For example, if we use product as a partition key, a separate directory
will be created for each product, and corresponding records will be stored there. If we
use a filter on the product attribute while querying, only those subdirectories that match
the filter need to be read. While selecting attributes for partitioning, choose attributes
that have a limited or controlled set of values; otherwise, too many subdirectories might
be created. Also ensure that the records are equally distributed among the various
values. Choose attributes that are most used in query patterns; likely candidates
include dates, customer IDs, and product IDs, among others. In the next video, I will discuss
an alternative to partitioning called bucketing.
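The directory-per-key mechanism described above can be sketched in plain Python. The product names are made up, but real Hive-style layouts use the same key=value subdirectory convention.

```python
# Pure-Python sketch of HDFS-style partitioning by a key column.
from collections import defaultdict

records = [
    ("Mouse", 10), ("Keyboard", 5), ("Mouse", 7), ("Monitor", 3),
]

# "Write": route each record into one subdirectory per partition-key value.
directories = defaultdict(list)
for product, qty in records:
    directories[f"product={product}"].append((product, qty))

# "Read" with a filter on the partition key: only the matching
# subdirectory is scanned; the others are never opened.
def read_partition(dirs, product):
    return dirs.get(f"product={product}", [])

print(sorted(directories))
# ['product=Keyboard', 'product=Monitor', 'product=Mouse']
print(read_partition(directories, "Mouse"))
# [('Mouse', 10), ('Mouse', 7)]
```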
Bucketing
- [Instructor] As seen in the previous video, partitioning is only optimal when a given
attribute has a small set of unique values. What if we need to partition for a key with a
large number of values without proliferating the number of directories? Bucketing
is the answer. Bucketing works similar to partitioning, but instead of using the value of
the attribute, it uses a hash function to convert the value into a specific hash
key. Values that have the same hash key end up in the same bucket, or
subdirectory. The number of unique buckets can be controlled and limited. This also ensures
even distribution of values across all buckets. It's ideal for attributes that have a large
number of unique values, like order number or transaction ID. Choose bucketing for
attributes that have a large number of unique values and those that are most
frequently used in query filters. Experiment with multiple bucketing columns to find
optimal read/write performance for the specific use case. In the next video I will
review some best practices for data storage.
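The hashing idea can be sketched in plain Python. Spark uses its own internal hash function; the md5-based hash and the order IDs here are just illustrative stand-ins.

```python
# Pure-Python sketch of bucketing: a hash function maps a
# high-cardinality key to a fixed, small number of buckets.
import hashlib

NUM_BUCKETS = 3

def bucket_for(value, num_buckets=NUM_BUCKETS):
    # Stable hash of the key, reduced modulo the bucket count.
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

order_ids = range(1000)
buckets = {}
for oid in order_ids:
    buckets.setdefault(bucket_for(oid), []).append(oid)

# The number of buckets stays fixed no matter how many unique keys
# exist, and the hash spreads keys roughly evenly across them.
print(len(buckets))                        # 3
print(sorted(len(v) for v in buckets.values()))
```

A thousand unique order IDs still produce only three buckets, which is the whole point: directory count is controlled even when key cardinality is not.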
- [Instructor] In this video, I will walk through some of the best practices for designing
HDFS schema and storage. First, during the design stage, understand the most common
read and write patterns for your data. Identify if it is read intensive, write intensive, or
both. For reads, analyze what filters are usually applied on the data. Determine what needs
optimization and what can be compromised. Is it important to reduce storage
requirements, or is it okay to compromise on storage for better read-write
performance? Choose your options carefully, as these cannot be easily changed after
the pipeline is deployed and data is created. Changing things like storage formats and
compression codecs would require reprocessing all the data. Run tests on actual
data to understand performance and storage characteristics. Experiment if required to
compare between the different storage options available. Choose partitioning and
bucketing keys wisely, as they incur significant additional costs during writes while
helping in reads. In the next chapter, let's start reading and writing HDFS
files with Spark using these practices.
Writing to HDFS
- As discussed in the previous videos, CSV files cannot be used for parallel reads and
writes. We need to convert them to other formats like Parquet, for efficient processing
of data in the later stages. In this video, we will write the raw sales data data frame into
a Parquet file in HDFS. The code for this is simple. We will use the write function
available on the data frame. We then set the format to Parquet; the mode is set to
overwrite, to overwrite any existing contents. In real pipelines, though, append may be
the better option if there are periodic additions to the data. We then use GZIP to
compress the data. We save it to the raw parquet directory under /user/raj_ops. Let's
execute this code and review the results. First, notice the Spark job link
appearing at the top of the paragraph. You can click on this to open the Spark UI and
look at how Spark executed this job. The Spark UI may launch with a fully qualified
URL, and this may generate an error. You can overcome it by adding that hostname to the
/etc/hosts file. Here is how my /etc/hosts file is set up. We can also go to the HDFS file
viewer to review the data that is created. We can see a directory created called raw
parquet. If you go under that, you will see part files created under this specific directory,
and the extension shows that they are gzip-compressed Parquet files. Depending on the
size of data, there could be more files that get created. In the next video, I will show
you how to partition data while writing to HDFS.
- [Instructor] As reviewed in the earlier videos, partitioning of data enables parallel
reads and writes. It also helps in filtering out data while reading into memory. We will
create a partitioned HDFS store based on the product column. There are only four unique
products in the dataset, so it lends itself to easier partitioning. We simply need to add
the partitionBy method in the write process to trigger partitioning while storing
data. We then save this to the partitioned parquet directory. Let's run this code and
examine the HDFS files created. Let's go and look at the HDFS files. When we navigate
to the partitioned parquet directory, we see four subdirectories created. There is one per
partition. The name of the directory shows the partition key and the value. This
directory name can then be used to filter the data, and focus only on directories that contain
the relevant data. In the next video, I will show you how to use bucketing with
Hive.
- [Instructor] As reviewed in the earlier videos, bucketing can be used to partition
data when there are a large number of unique values for a given column. In this
example, we will create buckets again based on the product column. We will create
three buckets. In order to do bucketing, we use the bucketBy method. We specify the
number of buckets and the column to bucket by. We also want to save this data as a
Hive table. In the sandbox, Spark is already integrated with Hive as its default warehouse
tool. Adding a saveAsTable call with the table name saves the data in Hive. We also print
the HDFS directory where the data would be stored, so we can go and examine it. We run
an example query from this table to verify its contents. Let's execute this code
now. We see the contents printed correctly. We can go to the HDFS directory to
examine the contents. The HDFS directory is apps/spark/warehouse. We see the
product bucket table created here. Navigating to this table, we see three part files being
created. They correspond to the three different buckets. In the next video, let us
review some of the best practices for data ingestion.
- [Instructor] Let's review some of the best practices for data ingestion with Hadoop
and Spark. Enable parallelism for maximum write performance. This can be achieved by
using splittable file formats like Parquet, and using partitions or buckets. When doing
incremental data ingestion, use append mode. This will help optimally distribute the write
loads across partitions and buckets. While reading external data into Spark, prefer
sources that can enable parallelism. This includes JDBC and Kafka. Break down large
files into smaller files if reading from disk. Request the data originators to create such
parallelizable data sources. In the next chapter, I will show you how to read data that is
stored in an optimal fashion in HDFS.
- [Narrator] In order to optimize an Apache Spark Pipeline, it is important to understand
how Spark works internally. When design decisions are made, they need to be analyzed
on how they impact scalability, and performance. In this video, I will review how Spark
executes a Pipeline and optimizes it. I recommend further reading on this topic to
master the internals. Spark programs run on a Driver Node, which works with a Spark
cluster to execute them. A Spark cluster can consist of multiple Executor Nodes
capable of executing the program in parallel. The level of parallelism and performance
achieved is dependent upon how the Pipeline is designed. Let's review an example
Pipeline and how it gets executed. First, the source data is read from an external
data source into a structure, Data 1. Data 1 is then converted to a data frame, or its
internal representation, Resilient Distributed Datasets, or RDDs. During this conversion,
it is partitioned, and individual partitions are assigned and moved to the Executor
Nodes available. When a transform operation like Map or Filter is executed, these
operations are pushed down to the Executors. The Executors execute the code locally
on their partitions and create new partitions with the result. There is no movement of
data between the Executors. Hence, transforms can be executed in parallel. Next, when
an action like Reduce or Group By is performed, the partitions need to be shuffled and
aggregated. This results in movement of data between Executors and can create I/O
and memory bottlenecks. Finally, the data is collected back to the Driver Node. The
partitions are merged and sent back to the Driver. From here, they can be stored into
external destination databases. Spark has an optimizer that analyzes the steps needed
to process data and optimizes for performance and resources. Spark only executes
code when an action like Reduce or Collect is performed. At this point, the optimizer
kicks in and analyzes all the previous steps required to achieve this action. It then
comes up with a physical execution plan. The optimizer looks to reduce
I/O, shuffling, and memory usage. If the data sources can support parallel I/O, then
Spark accesses them directly from the Executor and parallelizes these operations. This
provides improved performance and reduces memory requirements on the driver. In
the later videos, I will show you how to influence the physical plans for better
performance.
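The difference between a local transform and a shuffling action can be sketched in plain Python. This is a toy model with made-up records, not Spark code, but it shows why transforms parallelize freely while aggregations force data movement.

```python
# Narrow transforms run independently on each partition; an aggregation
# forces records from all partitions to be brought together (a "shuffle").
from collections import defaultdict

# Data already split across two hypothetical executors.
partitions = [
    [("Mouse", 2), ("Keyboard", 1)],   # partition on executor 1
    [("Mouse", 5), ("Monitor", 3)],    # partition on executor 2
]

# Transform (map/filter): each partition is processed locally, in
# isolation -- no data moves between executors.
doubled = [[(key, value * 2) for key, value in part] for part in partitions]

# Action (groupBy/reduce): values for the same key live in different
# partitions, so records must be combined across partitions to aggregate.
totals = defaultdict(int)
for part in doubled:
    for key, value in part:
        totals[key] += value

print(dict(totals))  # {'Mouse': 14, 'Keyboard': 2, 'Monitor': 6}
```

Note that the "Mouse" total needed values from both partitions; that cross-partition dependency is exactly what makes shuffles an I/O and memory bottleneck.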
- [Instructor] In this chapter, we will read the parquet files we created earlier into
Spark. We will examine the execution plans to understand how Spark works to read
these files. We will use the exercise files code_04_XX Data Extraction into Spark for this
chapter. Let's open this. We can read the nonpartitioned raw parquet file into
Spark using the read.parquet function. We print the first file records in the data
frame. We also use the spark.time function to measure the elapsed time for the total
operation. Spark.time can be used to compare the performance of different
approaches while designing data pipelines. Finally, we execute the explain function to
print out the physical plan. Let's run this code and examine the results. The operation
took 228 milliseconds. For a small data set like the one we have here, most of the time
is overhead and may not make sense for comparison. But for operations that run for
many minutes or a few hours, this can provide a true measure. Let's examine the
physical plan to understand what it shows. It does a file scan for a parquet file. It shows
the columns that are read from the file. It shows the location of the file, and then it
shows the schema that is used to read the file. We will examine the rest of the
contents in the future examples as we exercise them.
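For comparison, a plain-Python equivalent of such a timing helper might look like this. The function name `timed` is our own invention, not a Spark API; it just shows the wrap-measure-return pattern that spark.time applies to an operation.

```python
# A plain-Python analogue of a timing helper: run a function, print the
# elapsed wall-clock time, and return the function's result.
import time

def timed(fn, *args, **kwargs):
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"Time taken: {elapsed_ms:.0f} ms")
    return result

# Usage: compare two approaches by timing each one the same way.
result = timed(sum, range(1_000_000))
print(result)  # 499999500000
```

As with spark.time, the elapsed figure is only meaningful for comparison when the work being measured is large enough to dominate the fixed overhead.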
- [Instructor] In this video, I will show you how Spark reads bucketed data stored in
Hive. We can read data in Hive using a SQL command. We do a simple SELECT
statement to read the entire table and print its contents. We also print its execution
plan. Let's run this code and examine the results. When we look at the execution
plan, We can see that it is no different than reading a file from HDFS. Data frames,
data sets, SQL, and RDDs provide different interfaces to the same underlying
operations. So the execution plans will be similar irrespective of which API we use. The
plan also shows which HDFS file is read, and will provide partition information if it is
used. We will now review some of the best practices for reading data into Spark in the
next video.
- [Instructor] What are some of the key best practices for data extraction from HDFS
into Spark for analytics? The first is to read only the required data into memory. This
means read subdirectories, subset of partitions, and subset of columns. Less data
means less resource requirements and less time to execute. Use data sources and file
formats that support parallelism. Avro and Parquet are some of the recommended
ones. The number of partitions in the data files is important. Each partition can be
independently read by a separate executor core in parallel. The number of parallel
operations in a Spark cluster is the number of executor nodes multiplied by the
number of CPU cores in each executor. If the number of partitions is at least this
value, it will trigger maximum parallelism. Please keep in mind that other jobs running
at the same time will also compete for these resources. In the next chapter, I will focus
on optimizing processing data read from HDFS.
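This rule of thumb is simple arithmetic, and a small sketch with a hypothetical cluster size makes it concrete:

```python
# The parallelism rule of thumb from this video, as arithmetic:
# total parallel slots = executor nodes x CPU cores per executor.
# Maximum parallelism needs at least that many input partitions.

def parallel_slots(executor_nodes, cores_per_executor):
    return executor_nodes * cores_per_executor

def achieves_max_parallelism(num_partitions, executor_nodes, cores_per_executor):
    return num_partitions >= parallel_slots(executor_nodes, cores_per_executor)

# Hypothetical cluster: 4 executors with 8 cores each -> 32 parallel reads.
print(parallel_slots(4, 8))               # 32
print(achieves_max_parallelism(16, 4, 8)) # False: 16 partitions leave cores idle
print(achieves_max_parallelism(64, 4, 8)) # True
```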
- [Instructor] In this chapter, we will review some of the techniques that can be used
during data processing to optimize Spark and HDFS performance. The code for this
chapter is available in the notebook, code_05_XX Optimizing Data Processing. We will
start with pushing down projections. Projection here means the set or subset of
columns that are selected from a data set. Typically, a program reads an entire file with all
the columns into memory and then uses only a subset of columns later for
computations. During lazy evaluation, Spark is smart enough to identify the subset of
columns that will actually be used and only fetch them into memory. This is called
projection push down. In this example, we read the entire Parquet file into the sales
data data frame. Later, we only select the product and quantity columns. Spark
identifies this and only fetches these columns into memory. Let's run this code and
review the execution plan. In the execution plan, the FileScan reads only two columns,
quantity and product, and this provides optimization. A pipeline developer needs to
help Spark to do these optimizations by not using columns unnecessarily. For
example, using a show function with all the columns, even for troubleshooting, would
fetch all the columns and prevent projection push downs. In the next video, let's look
at pushing down filters.
- [Instructor] Similar to projection push downs, Spark is capable of identifying a subset
of rows that are actually required for processing, and fetch only them into memory. If the
subset of rows corresponds to specific partitions, Spark will only read those
partitions. In this example, we try two filters. One is the filter on Product. Product is a
partition column. The other is a filter on a non-partition column called Customer. Spark
will push down both these filters to the file scan, so it will not read unnecessary rows
into memory. Let's execute the code and review the results. In the case of a filter on a
partition column, we see partition filters being used. The partition count is also
one. This means that Spark has identified that the filter is on a partition and only
attempts to read files for that partition. In the case of filter without a partition
column, we see pushed filters on the file scan. But the partition count used is still
four. Partition-based filter push down is most efficient since Spark will only attempt to
read files within that partition. In the non-partition case, it has to read all the files of
partitions, but only load those records that match the filter condition. Understanding
this mechanism helps design partitions to maximize push downs. This saves on I/O
and memory. The earlier the filtering happens, the less data needs to be processed in the
later stages in the pipeline. In the next video, I will discuss partitioning and
coalescing.
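The two filter cases can be contrasted in a plain-Python sketch. The data and directory names are toy examples; real Spark reports the difference as partition filters versus pushed filters in the physical plan, as described above.

```python
# Filter on a partition column vs. filter on a non-partition column.

# Four partition directories keyed by product, each holding
# (product, customer) records.
partitions = {
    "product=Mouse":    [("Mouse", "Ana"), ("Mouse", "Ben")],
    "product=Keyboard": [("Keyboard", "Ana")],
    "product=Monitor":  [("Monitor", "Cid")],
    "product=Webcam":   [("Webcam", "Ben")],
}

def scan(filter_product=None, filter_customer=None):
    """Return (directories_read, matching_records)."""
    if filter_product is not None:
        # Partition filter: prune down to the single matching directory.
        dirs = [f"product={filter_product}"]
    else:
        # Non-partition filter: every directory must still be read; the
        # filter only drops rows after they have been scanned.
        dirs = list(partitions)
    rows = [r for d in dirs for r in partitions.get(d, [])
            if filter_customer is None or r[1] == filter_customer]
    return len(dirs), rows

print(scan(filter_product="Mouse"))   # (1, [('Mouse', 'Ana'), ('Mouse', 'Ben')])
print(scan(filter_customer="Ana"))    # (4, [('Mouse', 'Ana'), ('Keyboard', 'Ana')])
```

Both queries return only matching rows, but the partition-column filter touched one directory while the customer filter had to open all four.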
Managing partitions
- [Explainer] One of the key aspects to understand about Spark internals is
partitioning. This is different from HDFS partitioning. When Spark reads a data
file, it creates internal partitions equal to the default parallelism set up for
Spark. Transforms maintain the same number of partitions, but actions will create a
different number, usually equal to the default parallelism set up for the Spark
instance. Typically, in local mode, parallelism is two, and in cluster mode, it's
200. Having too many or too few partitions will impact performance. As discussed in
the earlier video, the ideal number of partitions should be equal to the total number of
cores available for Spark. We can change the number of partitions by repartitioning
and coalescing. Let's run the exercise code first and then review the results. We first
print the default parallelism set up for this cluster. It's two; this number can be changed
in the Spark configuration. Then we look at the number of partitions on a partitioned data
source, which is SalesData. The number of partitions is again two. Next, we read a
non-partitioned data source, which is the rawOrders file, and look at the partition count. This
number is only one. This is the problem with data sources that do not support
parallelism. An RDD can be repartitioned to a new number of partitions using the
repartition function, where we can specify the number of partitions. We repartition
rawSalesData into eight partitions and confirm that count. Next, we can also coalesce a
data frame to reduce the number of partitions. We coalesce the partitionedSalesData
from eight partitions to three partitions, and we also print and confirm that
size. Repartitioning does partitioning from scratch; coalescing simply combines
existing partitions to create larger partitions. Note that both these activities
themselves take a significant amount of time. Typically, actions upset the ideal partition
count, and hence repartitioning needs to be done after them. But it's recommended to do
that only if there are multiple transforms that follow the action and can take
advantage of these resized partitions. In the next video, I will show you how to
optimize shuffling.
Managing shuffling
- [Instructor] As seen in the earlier videos, Spark actions like reduce and group by cause
shuffling of data between executor nodes. This creates I/O and delays in overall
processing. The Spark optimizer does a lot of work in the background to minimize
shuffling. However, as a designer, it's still important to understand the shuffling
impact of the design and focus on minimizing it. We can either eliminate shuffling
steps or minimize the amount of data being shuffled. In this exercise, we will
implement two different word count logics and compare the shuffling between them. The
first method uses a groupByKey and then a map to compute the word count. The
second method uses a reduceByKey directly. Let's run this code and review the
results. The groupByKey took 540 milliseconds. The reduceByKey took 361
milliseconds. Comparison of time taken may not make sense for small datasets, but
doing the same on large datasets will show the real difference. Let's check the Spark
UI for the execution plans for these alternatives. First, let's click on job zero for the
execution plan for groupByKey. Look at the shuffle read and shuffle write numbers
here; they indicate the amount of data being shuffled. This is 182 bytes for
groupByKey. Next, let's look at job one. This is the execution plan for the reduceByKey. We see
that the shuffle read and write numbers are 154 bytes. The data shuffled here for
reduceByKey is less than that of the groupByKey. Let's also look at the actual data being
shuffled. To do that, we print the outcomes of these
activities. We see that in the case of a groupByKey, individual values are being
shuffled, whereas in the case of a reduceByKey, only the sums are being shuffled,
leading to less data. This is a very small dataset; in the case of large datasets, the
difference can be significant. It is important to look at execution plans and the sizes
of data being shuffled, and optimize them to build faster processing pipelines for
analytics. In the next video, I will show you how to optimize joins.
Improving joins
- [Instructor] Joins in Spark help combine two datasets to provide better insights. As
important as they are in analytics, they also cause significant delays. A join between two
data frames requires the partitions of these data frames to be shuffled, so that rows that
match the join condition are in the same executor nodes. The Spark optimizer again does a
lot of behind-the-scenes work to optimize joins. One of them is called a broadcast
join. If the size of one of the joined data frames is less than
spark.sql.autoBroadcastJoinThreshold, then Spark broadcasts the entire data frame to
all executor nodes where the other data frame resides. Then, the join itself becomes a
local activity within the executor, since an entire copy of the smaller data frame is
available in that node. In this example, we read product_vendor.csv into the
products data frame. This data frame itself is very small. We then join it with sales
data to produce a combined data frame. In Spark, we provide a join hint to recommend
the use of broadcast joins using the broadcast function. Let's execute the code and
look at the execution plan that Spark has generated. We see that there are two parallel
FileScan operations to read both the files. Then we see a BroadcastExchange
operation for the smaller products data frame. This means that this data frame is
broadcast to all executor nodes. Then we see the BroadcastHashJoin operation for
the actual join. It is recommended to use denormalized data sources and avoid joins if
possible. Else, review the execution plans to make sure that the joins are optimal.
- [Narrator] As we have seen in the previous examples of execution plans, every time
an action is performed, Spark goes all the way to the data source and reads the
data. This happens even if the data was read before and some actions were
performed. While this works fine when running automated jobs, it is a problem during
interactive analytics. Every time a new action command is executed on an interactive
shell, Spark goes back to its source. It is better to cache intermediate results, so we can
derive analytics from these results without starting all over. Spark has two modes of
caching: in memory and on disk. The cache method is used to cache in memory only. The
persist method is used to cache in memory, on disk, or both. In this example, we first cache
the words RDD into memory using the cache function. Spark does lazy evaluation, so
we need to execute an action to trigger the caching. Next, we will compare execution
plans before and after intermediate caching. First, we do a filter for product equals
mouse on the coalesced sales data data frame. Then we use the persist function to
store the intermediate coalesced sales data data frame to disk. We then run the same
filter for product equals mouse on this data frame, and then review the execution
plan. Let's run this code and review the results. First, let's open the Spark UI and go to the
storage tab. This shows all the cached and persisted data frames. We see the first
words RDD showing as stored in memory. We also see the coalesced RDD being stored
on disk. We coalesced into three partitions before, and we see the same partition count
here. Next, let's review the execution plans. In the plan before caching, we see that
there is a filter push down. The filter push down goes all the way to the file scan. This
means that Spark goes all the way back to the file and re-executes all the code. In the
second plan, after caching, we see that there is an in-memory scan, and the filter is on
this in-memory scan. This means that Spark is using the temporary persisted data
frame for doing this filter. Caching helps in performance and reuse, especially while
doing interactive queries.
- [Instructor] In this video, I will review the best practices for data processing with Spark
and HDFS. Push down filters and projections to data sources as much as
possible. The smaller the data being transferred, the better the performance. Choose
and design partition keys based on the columns most used in filters and
aggregations. This speeds up both reading and processing data. Use repartitioning and
coalescing wisely. These activities themselves take significant time, so only use them if
there are a series of transforms that can take advantage of them. Avoid joins as much
as possible. Use denormalized data sources. If joins are required, use them judiciously and check
execution plans. Clock all operations with spark.time() on production-equivalent data to
understand slow-running operations and take action. Use caching when
appropriate. Caching takes memory and disk space, hence choose it for
intermediate results that are frequently reused. Use explain to understand the
physical plan and look for ways in which the plan can be made better. In the next
chapter, we will do a use-case project to exercise the learnings in this course.