Spark Introduction
History of Spark
Apache Spark began at UC Berkeley in 2009 as the Spark research project, which was first published
the following year in a paper entitled “Spark: Cluster Computing with Working Sets” by Matei
Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, and Ion Stoica of the UC Berkeley
AMPlab. At the time, Hadoop MapReduce was the dominant parallel programming engine for
clusters, being the first open source system to tackle data-parallel processing on clusters of thousands
of nodes. The AMPlab had worked with multiple early MapReduce users to understand the benefits
and drawbacks of this new programming model, and was therefore able to synthesize a list of
problems across several use cases and begin designing more general computing platforms. In
addition, Zaharia had also worked with Hadoop users at UC Berkeley to understand their needs for
the platform—specifically, teams that were doing large-scale machine learning using iterative
algorithms that need to make multiple passes over the data.
Across these conversations, two things were clear. First, cluster computing held tremendous
potential: at every organization that used MapReduce, brand new applications could be built using the
existing data, and many new groups began using the system after its initial use cases. Second,
however, the MapReduce engine made it both challenging and inefficient to build large applications.
For example, the typical machine learning algorithm might need to make 10 or 20 passes over the
data, and in MapReduce, each pass had to be written as a separate MapReduce job, which had to be
launched separately on the cluster and load the data from scratch.
To address this problem, the Spark team first designed an API based on functional programming that
could succinctly express multistep applications. The team then implemented this API over a new
engine that could perform efficient, in-memory data sharing across computation steps. The team also
began testing this system with both Berkeley and external users.
The first version of Spark supported only batch applications, but soon enough another compelling use
case became clear: interactive data science and ad hoc queries. By simply plugging the Scala
interpreter into Spark, the project could provide a highly usable interactive system for running queries
on hundreds of machines. The AMPlab also quickly built on this idea to develop Shark, an engine that
could run SQL queries over Spark and enable interactive use by analysts as well as data scientists.
Shark was first released in 2011.
After these initial releases, it quickly became clear that the most powerful additions to Spark would
be new libraries, and so the project began to follow the “standard library” approach it has today. In
particular, different AMPlab groups started MLlib, Spark Streaming, and GraphX. They also ensured
that these APIs would be highly interoperable, enabling writing end-to-end big data applications in
the same engine for the first time.
In 2013, the project had grown to widespread use, with more than 100 contributors from more than 30
organizations outside UC Berkeley. The AMPlab contributed Spark to the Apache Software
Foundation as a long-term, vendor-independent home for the project. The early AMPlab team also
launched a company, Databricks, to harden the project, joining the community of other companies and
organizations contributing to Spark. Since that time, the Apache Spark community released Spark 1.0
in 2014 and Spark 2.0 in 2016, and continues to make regular releases, bringing new features into the
project.
Finally, Spark’s core idea of composable APIs has also been refined over time. Early versions of
Spark (before 1.0) largely defined this API in terms of functional operations—parallel operations
such as maps and reduces over collections of Java objects. Beginning with 1.0, the project added
Spark SQL, a new API for working with structured data—tables with a fixed data format that is not
tied to Java’s in-memory representation. Spark SQL enabled powerful new optimizations across
libraries and APIs by understanding both the data format and the user code that runs on it in more
detail. Over time, the project added a plethora of new APIs that build on this more powerful
structured foundation, including DataFrames, machine learning pipelines, and Structured Streaming, a
high-level, automatically optimized streaming API.
Apache Spark was created to work with the huge amount of information available to modern consumers. It is a distributed, cluster-based computing system and a highly popular big data framework, designed for speed and ease of use, and it includes APIs that support the following use cases:
• Easy cluster management
• Data integration and ETL procedures
• Interactive advanced analytics
• ML and deep learning
• Real-time data processing
It can run very quickly on large datasets thanks to its in-memory processing design, which lets it work with very few disk read/write operations. It has a SQL-like interface, its object-oriented design makes it easy to understand and write code for, and it has a large support community.
Despite its numerous benefits, Apache Spark has its limitations. These limitations include
the following:
• Users need to provide a database infrastructure to store the information to
work with.
• The in-memory processing feature makes it fast to run, but also implies that it has
high memory requirements.
• It isn't well suited for real-time analytics.
• It has an inherent complexity with a significant learning curve.
• Because of its open source nature, it lacks dedicated training and customer support.
Let's look at the solution to these issues: Azure Databricks.
DataFrames in Spark are built on top of Resilient Distributed Datasets (RDDs), which
are now treated as the assembly language of the Spark ecosystem. Spark is compatible
with various programming languages – Scala, Python, R, Java, and SQL.
Note
Executor slots are also called cores or threads.
Every time a DAG is triggered by an action, a Spark job is created. A Spark job is further broken down into stages; the number of stages depends on the number of times a shuffle occurs. All narrow transformations occur within a single stage, while wide transformations lead to the formation of new stages. Each stage comprises one or more tasks, and each task processes one partition of data in a slot. For wide transformations, the stage's execution time is determined by its slowest-running task; this is not the case with narrow transformations.
At any moment, one or more tasks run in parallel across the cluster. Every time a Spark cluster is set up, a Spark session is created. This Spark session provides the entry point into the Spark program and is accessible through the spark variable.
Sometimes a few tasks process small partitions while others process larger chunks; we call this data skew. This skewing of data should always be avoided if you hope to run efficient Spark jobs. In a wide transformation, the stage is determined by its slowest task, so if one task is slow, the overall stage is slow and everything waits for it to finish. Also, whenever a wide transformation runs, the number of partitions of the data in the cluster changes to 200. This is a default setting (spark.sql.shuffle.partitions), but it can be modified through the Spark configuration.
As a rule of thumb, the total number of partitions should always be a multiple of the total slots in the cluster. For instance, if a cluster has 16 slots and the data has 16 partitions, each slot receives 1 task that processes 1 partition. If instead there are 15 partitions, 1 slot remains empty, leaving the cluster underutilized. With 17 partitions, the stage takes roughly twice as long to complete, because everything waits for the 1 extra task to finish processing in a second wave.
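The slots-versus-partitions arithmetic above reduces to a simple back-of-the-envelope calculation, sketched here in plain Python:

```python
import math

def stage_waves(num_partitions: int, num_slots: int) -> int:
    """Each slot runs one task at a time, so a stage needs ceil(partitions / slots) waves."""
    return math.ceil(num_partitions / num_slots)

slots = 16
print(stage_waves(16, slots))  # 1 -> every slot busy, full utilization
print(stage_waves(15, slots))  # 1 -> still one wave, but one slot sits idle
print(stage_waves(17, slots))  # 2 -> the 17th task runs alone in a second wave
```

This is why 17 partitions on 16 slots roughly doubles the stage time: the second wave runs a single task while 15 slots wait.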
Query execution process –
1. We begin by writing the code. This code can use the DataFrame or Dataset API, or be a SQL query, and then we submit it.
2. If the code is valid, Spark converts it into a Logical Plan.
3. Spark then passes the Logical Plan to the Catalyst Optimizer.
4. In the next step, the Physical Plan is generated (after the plan has passed through the Catalyst Optimizer); this is where the majority of our optimizations happen.
5. Once the above steps are complete, Spark executes the Physical Plan and does all the computation to produce the output.
Now let's break down each step in detail. The DRIVER (master node) is responsible for generating the Logical and Physical Plans.
Logical Plan:
Let's say we have some code (DataFrame, Dataset, or SQL). The first step is the generation of the Logical Plan.
The Logical Plan is an abstract description of all the transformation steps that need to be performed; it says nothing about the Driver (master node) or Executors (worker nodes). The SparkContext is responsible for generating and storing it. This helps us convert the user's expression into the most optimized version.
It may happen that our code is valid and the syntax is correct, but a column name or table name is inaccurate or does not even exist. That is why this first plan is called an Unresolved Logical Plan: even when we give a wrong column name, it is still created, because at this stage Spark performs no checks on column names, table names, and so on.
For example:
dataFrame.select("age") // Column "age" may not even exist.
If the Analyzer is not able to resolve them (column names, table names, etc.), it rejects our Unresolved Logical Plan. If everything resolves, it creates the Resolved Logical Plan.
We can also customize the Catalyst Optimizer for our business use case by defining and applying our own rules to perform custom optimization.
Physical Plan:
Now coming to the Physical Plan: it is an internal optimization for Spark. Once our Optimized Logical Plan is created, the Physical Plan is generated.
It specifies how our Logical Plan is going to be executed on the cluster. Spark generates different execution strategies and compares them in a "Cost Model", estimating, for example, the execution time and resources each strategy would take. Finally, whichever plan/strategy turns out to be optimal is selected as the "Best Physical Plan" or "Selected Physical Plan".
Let's say we are performing a join query between two tables. In that join operation, we may have a fat table and a thin table with different numbers of partitions scattered across nodes in the cluster (on the same rack or different racks). For better optimization, Spark decides which partitions should be joined first (that is, the order in which the partitions are joined), the type of join, and so on.
The Physical Plan is specific to Spark operations; Spark evaluates multiple candidate physical plans and decides on the best optimal one. Finally, the Best Physical Plan runs on our cluster.
Once the Best Physical Plan is selected, it is time to generate the executable code (a DAG of RDDs) for the query, to be executed in the cluster in a distributed fashion. This process is called Codegen, and it is the job of Spark's Tungsten Execution Engine.
One of the major features introduced in Apache Spark 3.0 is Adaptive Query Execution (AQE) in the Spark SQL engine.
With this feature, the Spark SQL engine can keep updating the execution plan at runtime based on the observed properties of the data. For example, it can automatically tune the number of shuffle partitions when doing an aggregation, or handle data skew in a join operation.
This makes Spark much easier to run: as developers we don't need to configure these things in advance, as it is now AQE's responsibility, making the developer's life easier. AQE adapts and optimizes based on our input data, and in many cases this also leads to better performance.
Moving on to ADB architecture
The immediate benefit this architecture gives to users is that there is a seamless
connection with Azure, allowing them to easily connect Azure Databricks to any resource
within the same Azure account and have a centrally managed Databricks from the Azure
control center with no additional setup.
As mentioned previously, Azure Databricks is a managed application on the Azure cloud that is composed of a control plane and a data plane. The control plane lives in the Azure cloud and hosts services such as cluster management and job services. The data plane is a component that includes the aforementioned VNet, NSG, and the storage account known as DBFS.
You could also deploy the data plane in a customer-managed VNet to allow data
engineering teams to build and secure the network architecture according to their
organization policies. This is called VNet injection.
Create a Databricks Workspace –
1) Portal
2) CLI
3) ARM
From the Azure portal, you should see a resource group containing the name of your workspace, as shown in the following screenshot. This is called the Managed Resource Group.
Figure 1.8 – Databricks workspace – locked in resource group
• Since we haven't created the cluster yet, you won't see the VMs in the managed
resource group that was created by the Azure Databricks service.
• We won't be able to remove the resources or read the contents in the storage
account in the managed resource group that was created by the service.
• These resources and the respective resource group will be deleted when the Azure
Databricks service is deleted.
4. You can grant admin privileges or allow cluster creation permissions based on your
requirements while adding users.
5. You can create a group from the admin console and add users to that group. You can
do this to classify groups such as data engineers and data scientists and then provide
access accordingly:
6. Once we've created a group, we can add users to it, or we can add an Azure Active Directory (AAD) group to it. Here, we are adding the AAD user to the group that we have created:
Creating a cluster from the user interface (UI)
In this recipe, we will look at the different types of clusters and cluster modes in Azure
Databricks and how to create them. Understanding cluster types will help you determine
the right type of cluster you should use for your workload and usage pattern.
Getting ready
Before you get started, ensure you have created an Azure Databricks workspace, as shown
in the preceding recipes.
There are three cluster modes that are supported in Azure Databricks:
• Standard clusters
This mode is recommended for single users and data applications that can be developed using Python, Scala, SQL, and R. Clusters are set to terminate after 120 minutes of inactivity, and Standard is the default cluster mode. We can use this type of cluster to, for example, execute a Notebook through a Databricks job or via ADF as a scheduled activity.
• High Concurrency clusters
This mode is ideal when multiple users are trying to use the same cluster. It can be used to provide maximum resource utilization and to meet lower query-latency requirements in multiuser scenarios. Data applications can be developed using Python, SQL, and R. These clusters are configured not to terminate by default.
• Single-Node clusters
Single-Node clusters do not have worker nodes, and all the Spark jobs run on the driver node. This cluster mode can be used for building and testing small data pipelines and for doing lightweight Exploratory Data Analysis (EDA). The driver node acts as both a master and a worker. These clusters are configured to terminate after 120 minutes of inactivity by default. Data applications can be developed using Python, Scala, SQL, and R.
The Databricks runtime includes Apache Spark, along with many other components, to
improve performance, security, and ease of use. It comes with Python, Scala, and Java
libraries installed for building pipelines in any language. It also provides Delta Lake, an
open source storage layer that provides ACID transactions and metadata handling. It also
unifies batch data and streaming data for building near-real-time analytics.
Once we have a cluster created and running, we will see that multiple VMs have been created in the data plane, in the managed resource group created by the Azure Databricks service, as we discussed in the Creating a Databricks service in the Azure portal recipe.
Here, we have created a cluster with two worker nodes and one driver node, which is
why we see three VMs in the Databricks managed resource group, as shown in the
following screenshot:
Getting ready
PATs are enabled by default for all Databricks workspaces created on or after 2018. If this
is not enabled, an administrator can enable or disable tokens, irrespective of their
creation date.
Users can create PATs and use them in REST API requests. Tokens have optional
expiration dates and can be revoked.
How to do it…
This section will show you how to generate PATs using the Azure Databricks UI. Also,
apart from the UI, you can use the Token API to generate and revoke tokens. However,
there is a limitation of 600 tokens per user, per workspace. Let's go through the steps for
creating tokens using the Databricks UI:
1. Click on the profile icon in the top-right corner of the Databricks workspace.
2. Click on User Settings.
3. Click on the Access Tokens tab:
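Once generated, a PAT is passed as a Bearer token in REST API requests. A short sketch using only the Python standard library; the workspace URL and token value below are placeholders, not real endpoints or secrets (the /api/2.0/clusters/list endpoint is part of the Databricks REST API):

```python
import urllib.request

# Placeholders: substitute your own workspace URL and a PAT generated in the UI.
workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"
token = "dapiXXXXXXXXXXXXXXXX"

req = urllib.request.Request(
    f"{workspace_url}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
)
# urllib.request.urlopen(req) would execute the call against a real workspace;
# here we only build the authenticated request.
print(req.get_header("Authorization"))
```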
5. Open Power BI Desktop, go to Get Data > Azure, and select the Azure Databricks
connector:
7. Paste the Server Hostname and HTTP Path details you retrieved in the
previous step:
• You relieve Spark from the onus of inferring data types.
• You prevent Spark from creating a separate job just to read a large portion of your file to ascertain the schema, which for a large data file can be expensive and time-consuming.
• You can detect errors early if data doesn't match the schema.
So, we encourage you to always define your schema up front whenever you want to read a large file from a data source. For a short illustration, let's define a schema for the data in Table 3-1 and use that schema to create a DataFrame.
Two ways to define a schema
Spark allows you to define a schema in two ways. One is to define it programmatically, and the other is to employ a Data Definition Language (DDL) string, which is much simpler and easier to read. To define a schema programmatically for a DataFrame with three named columns, author, title, and pages, you can use the Spark DataFrame API. For example:

from pyspark.sql.types import *

schema = StructType([
    StructField("author", StringType(), False),
    StructField("title", StringType(), False),
    StructField("pages", IntegerType(), False),
])
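The second way, the DDL string, expresses the same three columns as plain text. A sketch (the data row is made up for illustration; the commented line assumes an active SparkSession named spark, as in a Databricks notebook):

```python
# DDL-string equivalent of the programmatic StructType above.
schema_ddl = "author STRING, title STRING, pages INT"

# With an active SparkSession, this string can be passed wherever a schema is accepted, e.g.:
# df = spark.createDataFrame([("Matei Zaharia", "Spark Paper", 12)], schema=schema_ddl)
print(schema_ddl)
```

Note that DDL fields are nullable by default, unlike the StructFields above, which were declared with False (not null).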
Transformations are methods that return a new DataFrame and are lazily evaluated.
Actions are methods that trigger computation.