Spark Introduction


Spark training –

History of Spark
Apache Spark began at UC Berkeley in 2009 as the Spark research project, which was first published
the following year in a paper entitled “Spark: Cluster Computing with Working Sets” by Matei
Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, and Ion Stoica of the UC Berkeley
AMPlab. At the time, Hadoop MapReduce was the dominant parallel programming engine for
clusters, being the first open source system to tackle data-parallel processing on clusters of thousands
of nodes. The AMPlab had worked with multiple early MapReduce users to understand the benefits
and drawbacks of this new programming model, and was therefore able to synthesize a list of
problems across several use cases and begin designing more general computing platforms. In
addition, Zaharia had also worked with Hadoop users at UC Berkeley to understand their needs for
the platform—specifically, teams that were doing large-scale machine learning using iterative
algorithms that need to make multiple passes over the data.
Across these conversations, two things were clear. First, cluster computing held tremendous
potential: at every organization that used MapReduce, brand new applications could be built using the
existing data, and many new groups began using the system after its initial use cases. Second,
however, the MapReduce engine made it both challenging and inefficient to build large applications.
For example, the typical machine learning algorithm might need to make 10 or 20 passes over the
data, and in MapReduce, each pass had to be written as a separate MapReduce job, which had to be
launched separately on the cluster and load the data from scratch.
To address this problem, the Spark team first designed an API based on functional programming that
could succinctly express multistep applications. The team then implemented this API over a new
engine that could perform efficient, in-memory data sharing across computation steps. The team also
began testing this system with both Berkeley and external users.
The first version of Spark supported only batch applications, but soon enough another compelling use
case became clear: interactive data science and ad hoc queries. By simply plugging the Scala
interpreter into Spark, the project could provide a highly usable interactive system for running queries
on hundreds of machines. The AMPlab also quickly built on this idea to develop Shark, an engine that
could run SQL queries over Spark and enable interactive use by analysts as well as data scientists.
Shark was first released in 2011.
After these initial releases, it quickly became clear that the most powerful additions to Spark would
be new libraries, and so the project began to follow the “standard library” approach it has today. In
particular, different AMPlab groups started MLlib, Spark Streaming, and GraphX. They also ensured
that these APIs would be highly interoperable, enabling writing end-to-end big data applications in
the same engine for the first time.
In 2013, the project had grown to widespread use, with more than 100 contributors from more than 30
organizations outside UC Berkeley. The AMPlab contributed Spark to the Apache Software
Foundation as a long-term, vendor-independent home for the project. The early AMPlab team also
launched a company, Databricks, to harden the project, joining the community of other companies and
organizations contributing to Spark. Since that time, the Apache Spark community released Spark 1.0
in 2014 and Spark 2.0 in 2016, and continues to make regular releases, bringing new features into the
project.
Finally, Spark’s core idea of composable APIs has also been refined over time. Early versions of
Spark (before 1.0) largely defined this API in terms of functional operations—parallel operations
such as maps and reduces over collections of Java objects. Beginning with 1.0, the project added
Spark SQL, a new API for working with structured data—tables with a fixed data format that is not
tied to Java’s in-memory representation. Spark SQL enabled powerful new optimizations across
libraries and APIs by understanding both the data format and the user code that runs on it in more
detail. Over time, the project added a plethora of new APIs that build on this more powerful
structured foundation, including DataFrames, machine learning pipelines, and Structured Streaming, a
high-level, automatically optimized streaming API.

To work with the huge amount of information available to modern consumers, Apache
Spark was created. It is a distributed, cluster-based computing system and a highly
popular framework used for big data, with capabilities that provide speed and ease
of use, and includes APIs that support the following use cases:
• Easy cluster management
• Data integration and ETL procedures
• Interactive advanced analytics
• ML and deep learning
• Real-time data processing
It can run very quickly on large datasets thanks to its in-memory processing design that
allows it to run with very few read/write disk operations. It has a SQL-like interface and
its object-oriented design makes it very easy to understand and write code for; it also has
a large support community.
Despite its numerous benefits, Apache Spark has its limitations. These limitations include
the following:
• Users need to provide their own storage infrastructure for the data they want to
work with.
• The in-memory processing feature makes it fast to run, but also implies that it has
high memory requirements.
• Its micro-batch streaming model makes it less suited to true low-latency, real-time analytics.
• It has an inherent complexity with a significant learning curve.
• Because of its open source nature, it lacks dedicated training and customer support.
Let's look at the solution to these issues: Azure Databricks.

Introducing Azure Databricks


With these and other limitations in mind, Databricks was designed. It is a cloud-based
platform that uses Apache Spark as a backend and builds on top of it, to add features
including the following:
• Highly reliable data pipelines
• Data science at scale
• Simple data lake integration
• Built-in security
• Automatic cluster management
Built as a joint effort by Microsoft and the team that started Apache Spark, Azure
Databricks also allows easy integration with other Azure products, such as Blob Storage
and SQL databases, alongside AWS services, including S3 buckets. It has a dedicated
support team that assists the platform's clients.
Databricks streamlines and simplifies the setup and maintenance of clusters while
supporting different languages, such as Scala and Python, making it easy for developers
to create ETL pipelines. It also allows data teams to have real-time, cross-functional
collaboration thanks to its notebook-like integrated workspace, while keeping a significant
amount of backend services managed by Azure Databricks. Notebooks can be used to
create jobs that can later be scheduled, meaning that locally developed notebooks can be
deployed to production easily. Other features that make Azure Databricks a great tool for
any data team include the following:
• A high-speed connection to all Azure resources, such as storage accounts.
• Clusters scale and are terminated automatically according to use.
• Optimized SQL execution.
• Integration with BI tools such as Power BI and Tableau.

Spark UI env –

• DataFrames: Fundamental data structures consisting of rows and columns.
• Machine Learning (ML): Spark ML provides ML algorithms for processing big data.
• Graph processing: GraphX helps to analyze relationships between objects.
• Streaming: Spark's Structured Streaming helps to process real-time data.
• Spark SQL: A SQL engine on top of Spark, with query plans and a cost-based optimizer.

DataFrames in Spark are built on top of Resilient Distributed Datasets (RDDs), which
are now treated as the assembly language of the Spark ecosystem. Spark is compatible
with various programming languages – Scala, Python, R, Java, and SQL.
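As a minimal sketch of working with DataFrames (assuming a PySpark environment; the column names and sample rows below are made up for illustration):

from pyspark.sql import SparkSession

# Entry point into a Spark program (Databricks notebooks already provide `spark`)
spark = SparkSession.builder.appName("intro-demo").getOrCreate()

# A small DataFrame with named columns; under the hood it is backed by RDDs
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"],
)

# DataFrame operations look much the same whether you use Scala, Python, R, or SQL
df.filter(df.age > 30).select("name").show()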

Spark Cluster Architecture –


Spark's architecture consists of one driver node and multiple worker nodes, which
together constitute a Spark cluster. Under the hood, these nodes run as Java Virtual
Machine (JVM) processes. The driver is responsible for assigning and
coordinating work between the workers.
The worker nodes have executors running inside each of them, which host the Spark
program. Each executor consists of one or more slots that act as the compute resource.
Each slot can process a single unit of work at a time.

Every executor reserves memory for two purposes:


• Cache
• Computation
The cache section of the memory is used to store the DataFrames in a compressed format
(called caching), while the compute section is utilized for data processing (aggregations,
joins, and so on). For resource allocation, Spark can be used with a cluster manager that
is responsible for provisioning the nodes of the cluster. Databricks has an in-built cluster
manager as part of its overall offering.
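To make the cache portion of executor memory concrete, here is a minimal sketch, assuming the DataFrame df from the earlier example:

# Mark the DataFrame for caching; its blocks are materialized in executor memory
# the first time an action runs against it
df.cache()
df.count()   # action that actually populates the cache

# Later aggregations reuse the cached data instead of recomputing it
df.groupBy("age").count().show()

# Free the cached blocks when they are no longer needed
df.unpersist()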

Note
Executor slots are also called cores or threads.

Spark supports parallelism in two ways:


• Vertical parallelism: Scaling the number of slots in the executors
• Horizontal parallelism: Scaling the number of executors in a Spark cluster
Spark processes the data by breaking it down into chunks called partitions. These
partitions are usually 128 MB blocks that are read by the executors and assigned to them
by the driver. The size and the number of partitions are decided by the driver node. While
writing Spark code, we come across two functionalities, transformations and actions.
Transformations instruct the Spark cluster to perform changes to the DataFrame. These
are further categorized into narrow transformations and wide transformations. Wide
transformations cause data to be shuffled, since records must move across executors,
whereas narrow transformations require no data movement between executors.
Running these transformations does not make the Spark cluster do anything. It is only
when an action is called that the Spark cluster begins execution; hence the saying that Spark
is lazy. Before executing an action, all that Spark does is build a data processing plan.
We call this plan the Directed Acyclic Graph (DAG). The DAG consists of various
transformations such as read, filter, and join, and is triggered by an action.
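For illustration, here is a minimal sketch of this lazy behaviour (the column names and rows are made up):

# Transformations only describe the work; nothing executes yet
sales = spark.createDataFrame(
    [("East", 1200), ("West", 800), ("East", 500)],
    ["region", "amount"],
)
high_value = sales.filter(sales.amount > 1000)       # narrow transformation
by_region = high_value.groupBy("region").count()     # wide transformation (shuffle)

# Only this action triggers the DAG and makes the cluster do work
by_region.show()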

Every time a DAG is triggered by an action, a Spark job gets created. A Spark job is
further broken down into stages. The number of stages depends on the number of times a
shuffle occurs. All narrow transformations occur in one stage while wide transformations
lead to the formation of new stages. Each stage comprises one or more tasks, and each
task processes one partition of the data in a slot. For wide transformations, the stage's
execution time is determined by its slowest running task. This is not the case with
narrow transformations.
At any moment, one or more tasks run in parallel across the cluster. Every time a Spark
cluster is set up, a Spark session is created. This Spark session provides the entry point
into the Spark program and is accessible through the spark keyword.
Sometimes a few tasks process small partitions while others process larger chunks; we
call this data skew. Skew should be avoided wherever possible if you hope to run
efficient Spark jobs: in a wide transformation, the stage is only as fast as its slowest
task, so if one task is slow, the overall stage is slow and everything waits for it to finish.
Also, whenever a wide transformation runs, the number of partitions of the data in the
cluster changes to 200. This is a default setting, but it can be modified using the Spark
configuration.
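For example, that default comes from the spark.sql.shuffle.partitions setting, which can be checked and changed per session:

# Inspect the current default number of post-shuffle partitions
print(spark.conf.get("spark.sql.shuffle.partitions"))   # '200' by default

# Override it for this session, for example to match the cluster's total slots
spark.conf.set("spark.sql.shuffle.partitions", "64")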
As a rule of thumb, the total number of partitions should be a multiple of the total
number of slots in the cluster. For instance, if a cluster has 16 slots and the data has
16 partitions, then each slot receives 1 task that processes 1 partition. If instead there
are 15 partitions, then 1 slot remains empty, leaving the cluster underutilized. In the
case of 17 partitions, the stage takes roughly twice as long, because the 1 extra task
only starts once a slot frees up and everything else waits for it to finish.
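A hedged sketch of checking and adjusting the partition count, reusing the sales DataFrame from the earlier example (the numbers are illustrative and should be matched to your own cluster's slots):

# See how many partitions the DataFrame currently has
print(sales.rdd.getNumPartitions())

# Repartition so the count is a multiple of the total slots (causes a full shuffle)
sales = sales.repartition(16)

# coalesce() reduces the partition count without a full shuffle
sales = sales.coalesce(8)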

Candy analogy

- Filtering –
Show them spark internals video-
Query execution process –
1. We begin by writing the code. This code can use the DataFrame or Dataset API, or be
a SQL query, and we then submit it.
2. If the code is valid, Spark converts it into a Logical Plan.
3. Spark then passes the Logical Plan to the Catalyst Optimizer.
4. In the next step, the Physical Plan is generated (after the plan has passed through the
Catalyst Optimizer); this is where the majority of the optimizations happen.
5. Once the above steps are complete, Spark executes/processes the Physical
Plan and does all the computation to get the output.

These are the five high-level steps that Spark follows.

Now let's break down each step in detail. The Driver (master node) is responsible for
generating the Logical and Physical Plans; a short sketch of how to inspect these plans
follows below.
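A hedged sketch of inspecting these plans with DataFrame.explain, reusing the sales DataFrame from earlier (the mode string requires Spark 3.0 or later; on older versions use explain(True)):

query = sales.filter(sales.amount > 1000).groupBy("region").count()

# Prints the Parsed (Unresolved), Analyzed, and Optimized Logical Plans plus the Physical Plan
query.explain(mode="extended")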

Query Plan - https://blog.knoldus.com/understanding-sparks-logical-and-physical-plan-in-laymans-term/
https://blog.knoldus.com/adaptive-query-execution-aqe-in-spark-3-0/

Logical Plan:

Let's say we have some code (DataFrame, Dataset, or SQL). The first step will be the
generation of the Logical Plan.

Logical Plan is divided into three parts:

1. Unresolved Logical Plan OR Parsed Logical Plan
2. Resolved Logical Plan OR Analyzed Logical Plan OR Logical Plan
3. Optimized Logical Plan

Now the question is what is a Logical Plan?

A Logical Plan is an abstract description of all the transformation steps that need to be
performed; it does not refer to anything about the Driver (master node) or the Executors
(worker nodes). The SparkContext is responsible for generating and storing it. This helps
us convert the user's expression into the most optimized version.

The first step is the generation of the Unresolved Logical Plan.

It may happen that our code is valid and the syntax is correct, but the column name
or the table name is inaccurate or does not even exist. That's why we call it
an Unresolved Logical Plan.

Basically, even when we give a wrong column name, the Unresolved Logical Plan is
still created. In this first step Spark creates a blank Logical Plan with no checks on
the column names, table names, and so on.

Further, Spark uses a component called the "Catalog", a repository where all the
information about Spark tables, DataFrames, and Datasets is kept. Data from the
metastore is pulled into the Catalog, which is an internal storage component of Spark.
There is another component named the "Analyzer", which resolves/verifies the
semantics, column names, and table names by cross-checking them against the Catalog.
And that's why I say that DataFrames/Datasets follow semi-lazy evaluation: at the time
of the creation of the Logical Plan, Spark starts performing analysis without an Action.

For example:

dataFrame.select("age")        // Column "age" may not even exist
dataFrame.select(max("name")) // Checks whether column "name" is a number

If the Analyzer is not able to resolve them (column names, table names, and so on),
it rejects our Unresolved Logical Plan. If everything is resolved, it creates the Resolved
Logical Plan.

SQL to Resolved Logical Plan

After the Resolved Logical Plan is generated, it is passed on to the "Catalyst Optimizer",
which applies its own rules and tries to optimize the plan. Basically, the Catalyst
Optimizer performs logical optimization. For example: (i) it checks for all the tasks
that can be performed and computed together in one stage; (ii) in a multi-join query,
it decides the order in which the joins are executed for better performance; (iii) it tries
to optimize the query by evaluating the filter clause before any projection. This, in turn,
generates the Optimized Logical Plan.

We can also create our own customized Catalyst Optimizer according to our business
use case by defining/applying specific rules to it to perform custom optimization.

Physical Plan:
Now coming to Physical Plan, it is an internal optimization for Spark. Once our
Optimized Logical Plan is created then further, Physical Plan is generated.

What does this Physical Plan do?

It simply specifies how our Logical Plan is going to be executed on the cluster. It
generates different execution strategies and compares them in a "Cost Model". For
example, it estimates the execution time and the resources to be consumed by each
strategy. Finally, whichever strategy is the most optimal is selected as the "Best
Physical Plan" or "Selected Physical Plan".

What actually happens in Physical Plan?

Let's say we are performing a join query between two tables. In that join operation, we
may have a fat table and a thin table, with different numbers of partitions scattered
across different nodes of the cluster (on the same rack or on different racks). Spark
decides which partitions should be joined first (that is, the order in which the partitions
are joined), the type of join, and so on, for better optimization.

The Physical Plan is specific to Spark operations; Spark evaluates multiple candidate
physical plans and picks the best, most optimal one. Finally, the Best Physical Plan
runs on our cluster.
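As a hedged illustration of how the chosen strategy surfaces in the plan (the table and column names are made up; broadcast() is just a hint that nudges Spark toward a broadcast join when one side is small):

from pyspark.sql.functions import broadcast

# fat_df is assumed to be a large fact table, dim_df a small dimension table
joined = fat_df.join(broadcast(dim_df), "customer_id")

# The physical plan shows the selected join strategy, e.g. BroadcastHashJoin
# instead of SortMergeJoin
joined.explain()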

Optimized Logical Plan to Physical Plan

Once the Best Physical Plan is selected, it's time to generate the executable code
(a DAG of RDDs) for the query, which is then executed on the cluster in a distributed
fashion. This process is called Codegen, and it is the job of Spark's Tungsten
execution engine.
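On Spark 3.0+ the output of this code generation can also be inspected; a small sketch reusing the query from the earlier example:

# Prints the whole-stage generated Java code produced by the Tungsten engine
query.explain(mode="codegen")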

One of the major features introduced in Apache Spark 3.0 is Adaptive Query
Execution (AQE) in the Spark SQL engine.

With this feature, the Spark SQL engine can keep updating the execution plan at
runtime based on the observed properties of the data. For example, it can automatically
tune the number of shuffle partitions during an aggregation and handle data skew in
join operations.

This makes Spark much easier to run, as developers no longer need to configure these
things in advance; it is now AQE's responsibility. AQE adapts and optimizes based on
the input data and leads to better performance in many cases.
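The relevant switches are ordinary Spark SQL configurations; a minimal sketch (AQE is enabled by default from Spark 3.2 onward):

# Enable AQE explicitly (default since Spark 3.2)
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Let AQE coalesce many small post-shuffle partitions into fewer, larger ones
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# Let AQE split skewed partitions during sort-merge joins
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")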

Moving on to ADB architecture –

Each Databricks cluster is a Databricks application composed of a set of pre-configured
VMs running as Azure resources managed as a single group. You can specify the number
and type of VMs that it will use, while Databricks manages other parameters in the
backend. The managed resource group is deployed and populated with a virtual network
(VNet), a network security group (NSG) that controls network traffic to the resources, and
a storage account that will be used, among other things, as the Databricks filesystem (DBFS).
Once everything is deployed, users can manage these clusters through the Azure Databricks
UI. All the metadata used is stored in a geo-replicated, fault-tolerant Azure database. This
can all be seen in Figure 1.1.

The immediate benefit this architecture gives to users is that there is a seamless
connection with Azure, allowing them to easily connect Azure Databricks to any resource
within the same Azure account and have a centrally managed Databricks from the Azure
control center with no additional setup.
As mentioned previously, Azure Databricks is a managed application on the Azure cloud
that is composed of a control plane and a data plane. The control plane lives in the Azure
cloud and hosts services such as cluster management and the jobs service. The data plane
is the component that includes the aforementioned VNet, NSG, and the storage account
known as DBFS.
You could also deploy the data plane in a customer-managed VNet to allow data
engineering teams to build and secure the network architecture according to their
organization policies. This is called VNet injection.
Create a Databricks Workspace –

1) Portal
2) CLI
3) ARM

From the Azure portal, you should see a resource group containing the name of
your workspace, as shown in the following screenshot. This is called the managed
resource group.
Figure 1.8 – Databricks workspace – locked in resource group
• Since we haven't created the cluster yet, you won't see the VMs in the managed
resource group that was created by the Azure Databricks service.
• We won't be able to remove the resources or read the contents in the storage
account in the managed resource group that was created by the service.
• These resources and the respective resource group will be deleted when the Azure
Databricks service is deleted.

Adding users and groups to the workspace


In this recipe, we will learn how to add users and groups to the workspace so that they can
collaborate when creating data applications. This exercise will provide you with a detailed
understanding of how users are created in a workspace. You will also learn about the
different permissions that can be granted to users.
Getting ready
Log into the Databricks workspace as an Azure Databricks admin. Before you add a user
to the workspace, ensure that the user exists in Azure Active Directory.
How to do it…
Follow these steps to create users and groups from the Admin Console:
1. From the Azure Databricks service, click on the Launch workspace option.
2. After launching the workspace, click the user icon at the top right and click on
Admin Console, as shown in the following screenshot:
3. You can click on Add User and invite users who are part of your Azure Active
Directory (AAD). Here, we have added demouser1@gmail.com:

4. You can grant admin privileges or allow cluster creation permissions based on your
requirements while adding users.
5. You can create a group from the admin console and add users to that group. You can
do this to classify groups such as data engineers and data scientists and then provide
access accordingly:

6. Once we've created a group, we can add users to it, or we can add an AAD group
to it. Here, we are adding the AAD user to the group that we have created:
Creating a cluster from the user interface (UI)
In this recipe, we will look at the different types of clusters and cluster modes in Azure
Databricks and how to create them. Understanding cluster types will help you determine
the right type of cluster for your workload and usage pattern.

Getting ready
Before you get started, ensure you have created an Azure Databricks workspace, as shown
in the preceding recipes.

There are three cluster modes that are supported in Azure Databricks:
• Standard clusters
This mode is recommended for single users and data applications that can be
developed using Python, Scala, SQL, and R. Clusters are set to terminate after
120 minutes of inactivity, and Standard is the default cluster mode. We can
use this type of cluster to, for example, execute a notebook through a Databricks job
or via ADF as a scheduled activity.
• High Concurrency clusters
This mode is ideal when multiple users share the same cluster. It provides
maximum resource utilization and lower query latencies in multiuser scenarios.
Data applications can be developed using Python, SQL, and R. These clusters are
configured not to terminate by default.
• Single-Node clusters
Single-Node clusters do not have worker nodes; all the Spark jobs run on the
driver node, which acts as both master and worker. This cluster mode can be used
for building and testing small data pipelines and for lightweight Exploratory Data
Analysis (EDA). These clusters are configured to terminate after 120 minutes of
inactivity by default. Data applications can be developed using Python, Scala,
SQL, and R.
The Databricks runtime includes Apache Spark, along with many other components, to
improve performance, security, and ease of use. It comes with Python, Scala, and Java
libraries installed for building pipelines in any language. It also provides Delta Lake, an
open source storage layer that provides ACID transactions and metadata handling. It also
unifies batch data and streaming data for building near-real-time analytics.
Once we have created and started a cluster, we will see that multiple VMs have been
created in the managed resource group that the Azure Databricks service maintains for
the workspace, as we discussed in the Creating a Databricks service in the Azure portal
recipe. Here, we have created a cluster with two worker nodes and one driver node, which
is why we see three VMs in the Databricks managed resource group, as shown in the
following screenshot:

Authenticating to Databricks using a PAT


To authenticate and access Databricks REST APIs, we can use two types of tokens:
• PATs
• Azure Active Directory tokens
A PAT is used as an alternative to a password to authenticate against and access the
Databricks REST APIs. By the end of this recipe, you will have learned how to use PATs
to access, from Power BI Desktop, the Spark managed tables that we created in the
preceding recipes, and to create basic visualizations.

Getting ready
PATs are enabled by default for all Databricks workspaces created in 2018 or later. If they
are not enabled, a workspace administrator can enable or disable them, irrespective of
when the workspace was created.
Users can create PATs and use them in REST API requests. Tokens have optional
expiration dates and can be revoked.
How to do it…
This section will show you how to generate PATs using the Azure Databricks UI. Also,
apart from the UI, you can use the Token API to generate and revoke tokens. However,
there is a limitation of 600 tokens per user, per workspace. Let's go through the steps for
creating tokens using the Databricks UI:
1. Click on the profile icon in the top-right corner of the Databricks workspace.
2. Click on User Settings.
3. Click on the Access Tokens tab:

4. Click on the Generate New Token button.


5. Enter a description (Comment) and expiration period (Lifetime (days)). Both are
optional, but it is good practice to set a lifetime for the token and to note
where it will be used:

6. Click the Generate button.


7. Copy the generated token and store it in a secure location.
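The generated token can be used directly against the Databricks REST API; a minimal sketch in Python (the workspace URL and token below are placeholders, not real values):

import requests

DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder URL
TOKEN = "<paste-your-PAT-here>"                                          # placeholder token

# List the clusters in the workspace, using the PAT as a bearer token
resp = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
for cluster in resp.json().get("clusters", []):
    print(cluster["cluster_id"], cluster["cluster_name"], cluster["state"])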
Now that the access token has been created, we will learn how to connect a Databricks
cluster to Power BI using a PAT:
1. Click on the cluster icon in the left sidebar of your Azure Databricks workspace.
2. Click on the Advanced Options toggle.
3. Click on the JDBC/ODBC tab.
4. Copy the hostname, port, and HTTP path provided:

5. Open Power BI Desktop, go to Get Data > Azure, and select the Azure Databricks
connector:
7. Paste the Server Hostname and HTTP Path details you retrieved in the
previous step:
Figure

8. Select a Data Connectivity mode.


9. Click OK.
10. At the authentication prompt, select how you wish to authenticate to Azure
Databricks:
(a) Azure Active Directory: Use your Azure account credentials.
(b) PAT: Use the PAT you retrieved earlier.
11. Click Connect.
12. Create a Power BI report about the mnmdata dataset, which is available in
Azure Databricks:
Now, you can explore and visualize this data, as you would with any other data in Power
BI Desktop.

DataFrames: Distributed collection of data grouped into named columns


Schemas and Creating DataFrames

A schema in Spark defines the column names and associated data types for a DataFrame.
Most often, schemas come into play when you are reading structured data from an external
data source (more on this in the next chapter). Defining a schema up front, as opposed to
taking a schema-on-read approach, offers three benefits:

• You relieve Spark from the onus of inferring data types.
• You prevent Spark from creating a separate job just to read a large portion of your file
to ascertain the schema, which for a large data file can be expensive and time-consuming.
• You can detect errors early if data doesn't match the schema.

So, we encourage you to always define your schema up front whenever you want to read a
large file from a data source. For a short illustration, let's define a schema for the data in
Table 3-1 and use that schema to create a DataFrame.

Two ways to define a schema

Spark allows you to define a schema in two ways. One is to define it programmatically,
and the other is to employ a Data Definition Language (DDL) string, which is much
simpler and easier to read. To define a schema programmatically for a DataFrame with
three named columns, author, title, and pages, you can use the Spark DataFrame API.
For example:

from pyspark.sql.types import *

schema = StructType([StructField("author", StringType(), False),
                     StructField("title", StringType(), False),
                     StructField("pages", IntegerType(), False)])

The same schema can also be defined using a DDL string:

schema = "author STRING, title STRING, pages INT"

Transformations are methods that return a new DataFrame and are lazily evaluated.
Actions are methods that trigger computation.
