MOD5-BDA
Data analysis using Apache Spark programming: Introduction to DataFrame and Dataset, Spark Types, overview
of structured API execution: logical planning, physical planning, Spark SQL, performing ETL using PySpark, ML
packages in Spark, real-world applications.
DataFrame
A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a
table in a relational database. DataFrames provide a higher-level abstraction compared to RDDs (Resilient
Distributed Datasets) and are optimized for SQL-style operations.
Schema: DataFrames have a schema that defines the column names and data types.
API: DataFrames provide a domain-specific language for structured data manipulation (filtering, aggregations,
etc.) in languages like Java, Scala, Python, and R.
Optimization: DataFrames leverage the Catalyst query optimizer for efficient query execution.
Interoperability: They can interoperate with SQL queries using the Spark SQL module.
Example:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("DataFrame Example").getOrCreate()
val df = spark.read.json("path/to/your/json/file")
df.show()
df.printSchema()
df.select("name").show()
df.groupBy("age").count().show()
Dataset
A Dataset is a distributed collection of data that combines the benefits of RDDs with the optimizations of
DataFrames. Datasets provide type safety, meaning that errors can be caught at compile time, which is particularly
useful in large-scale data processing applications.
Type Safety: Datasets are strongly typed, allowing the compiler to catch errors at compile time.
API: Datasets provide a high-level API similar to DataFrames, but with the added benefit of compile-time type
checking.
Interoperability: Datasets can interoperate with DataFrames and SQL queries.
Optimizations: Datasets leverage the Catalyst optimizer and Tungsten execution engine for efficient query
execution.
Example:
import org.apache.spark.sql.SparkSession
case class Person(name: String, age: Long)
val spark = SparkSession.builder().appName("Dataset Example").getOrCreate()
import spark.implicits._
val peopleDS = Seq(Person("Alice", 30), Person("Bob", 16)).toDS()  // sample data for illustration
val adultsDS = peopleDS.filter(_.age >= 18)
peopleDS.show()
adultsDS.show()
Spark Types
All data types of Spark SQL are located in the package pyspark.sql.types. You can access them with:
from pyspark.sql.types import (StructType, StructField, StringType, IntegerType, FloatType, DateType, ArrayType)
from pyspark.sql import SparkSession
from datetime import date
# Initialize a SparkSession
spark = SparkSession.builder.appName("SparkTypesExample").getOrCreate()
# Define an explicit schema (column names after "name" and "age" are inferred from the sample data)
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("salary", FloatType(), True),
    StructField("hire_date", DateType(), True),
    StructField("skills", ArrayType(StringType()), True)
])
data = [
    ("Alice", 30, 50000.0, date(2015, 4, 23), ["Python", "Spark"]),
    ("Bob", 25, 45000.0, date(2017, 8, 12), ["Java", "Scala"]),
    ("Charlie", 35, 70000.0, date(2012, 5, 19), ["SQL", "R"]),
    ("David", 29, 60000.0, date(2018, 1, 1), ["Python", "Java"])
]
df = spark.createDataFrame(data, schema)
df.printSchema()
df.show()
# Perform some operations
df.select("name", "age").show()
df.groupBy("age").count().show()
spark.stop()
Output
The output displays the schema and data of the DataFrame, followed by the results of the select and group by
operations. The schema will look like this:
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- salary: float (nullable = true)
 |-- hire_date: date (nullable = true)
 |-- skills: array (nullable = true)
 |    |-- element: string (containsNull = true)
In Apache Spark, the process of query execution involves several stages, including the creation of logical and
physical plans. These plans are part of the Catalyst optimizer, which optimizes and executes the queries efficiently.
Here’s a detailed explanation of logical and physical plans in Spark:
Logical Plan
The logical plan is an abstract representation of the structured query. It goes through several stages before becoming
an executable plan. The stages include parsing, analysis, and logical optimization.
1. Parsing
Description: The initial step where the high-level query (written in SQL, DataFrame, or Dataset API) is parsed
into an unresolved logical plan.
Unresolved Logical Plan: This plan captures the structure of the query but doesn't resolve references to
columns or tables.
2. Analysis
Description: The unresolved logical plan is analyzed to resolve all references to datasets, columns, and
functions.
Analyzer: Uses the catalog (metadata repository) to resolve these references.
Resolved Logical Plan: After analysis, the logical plan includes fully qualified identifiers.
3. Logical Optimization
Description: The resolved logical plan is optimized using various rule-based optimizations.
Catalyst Optimizer: Applies several rules to simplify and optimize the logical plan.
Optimized Logical Plan: The result is a logical plan that is efficient and ready for translation to a physical plan.
Physical Plan
The physical plan is a detailed plan that describes how the logical plan will be executed on the cluster. It includes the
specific operations and their order, as well as details about data distribution and execution strategies.
1. Physical Planning
Description: The optimized logical plan is translated into one or more physical execution plans.
Physical Plan(s): Each physical plan represents a potential way to execute the query, detailing how data is
distributed and how operations are executed on the data.
2. Cost-Based Optimization
Description: Among the generated physical plans, Spark uses cost-based optimization to select the most
efficient plan.
Cost-Based Optimizer: Evaluates the cost of each plan based on factors like data size, complexity of
operations, and available resources.
Selected Physical Plan: The plan with the lowest estimated cost is chosen for execution.
3. Execution
Description: The selected physical plan is executed across the Spark cluster.
Task Execution: The plan is divided into tasks that are distributed to worker nodes.
Shuffling and Execution: Data is shuffled as necessary (e.g., for joins or aggregations), and intermediate results
are combined to produce the final output.
Consider a simple example where we read a JSON file into a DataFrame, filter the data, and perform a group by
operation:
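A minimal PySpark sketch of this example (the input path and column name are illustrative assumptions); calling explain(True) on the resulting DataFrame prints the parsed, analyzed, and optimized logical plans followed by the selected physical plan:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PlanExample").getOrCreate()
# Hypothetical JSON input with an "age" column, used only to illustrate plan inspection
df = spark.read.json("path/to/people.json")
result = df.filter(df.age > 21).groupBy("age").count()
# Prints the parsed, analyzed, and optimized logical plans and the physical plan
result.explain(True)
spark.stop()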
Spark SQL
Spark SQL is a module in Apache Spark that allows for structured data processing using SQL queries. It integrates
relational processing with Spark’s functional programming API, enabling users to run SQL queries, both ad-hoc and
complex, on large-scale datasets.
Key Features of Spark SQL:
Unified Data Access: Allows access to data in various formats (like JSON, Parquet, Avro) and storage systems
(like HDFS, HBase, Cassandra).
Seamless Integration with Spark: Combines SQL querying capabilities with Spark’s powerful data processing
features.
Optimized Execution: Utilizes the Catalyst optimizer and Tungsten execution engine to optimize query plans
and execution for better performance.
DataFrames and Datasets: Provides DataFrame and Dataset APIs, which are more type-safe and object-
oriented than RDDs, for working with structured data.
Hive Compatibility: Supports querying data stored in Apache Hive, leveraging Hive's metastore, SerDes, and
UDFs.
DataFrame API: A DataFrame is a distributed collection of data organized into named columns. It provides a
domain-specific language for structured data manipulation.
Dataset API: Dataset, like DataFrame, is a distributed collection of data but adds the benefits of type-safety and
object-oriented programming, allowing compile-time type checking.
SQL Queries: Spark SQL allows executing raw SQL queries directly, making it possible to leverage SQL skills
directly in Spark.
Catalyst Optimizer: A sophisticated query optimizer that automatically optimizes the logical and physical
query plans for improved performance.
Tungsten Execution Engine: A backend execution engine focused on CPU and memory efficiency.
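A minimal sketch of running an ad-hoc SQL query through Spark SQL (the view name, data, and column names are illustrative assumptions): a DataFrame is registered as a temporary view and then queried with spark.sql, which returns another DataFrame.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()
# Illustrative in-memory data; any DataFrame can be exposed as a temporary view
df = spark.createDataFrame([("Alice", 30), ("Bob", 25)], ["name", "age"])
df.createOrReplaceTempView("people")
# Run a raw SQL query against the view; the result is again a DataFrame
adults = spark.sql("SELECT name, age FROM people WHERE age >= 26")
adults.show()
spark.stop()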
ML Packages in Spark
Apache Spark includes a powerful library for machine learning called MLlib. MLlib contains a variety of tools
designed to make it easy to build and deploy machine learning models at scale. The library is divided into two main
components: the original RDD-based API and the newer DataFrame-based API, which is part of the spark.ml
package. The DataFrame-based API is recommended for most users due to its simplicity, scalability, and ease of use.
Algorithms: Includes a wide array of algorithms for classification, regression, clustering, collaborative filtering,
and more.
Featurization: Tools for feature extraction, transformation, and selection.
Pipelines: High-level APIs for building machine learning workflows.
Evaluation Metrics: Tools for evaluating the performance of machine learning models.
Persistence: Tools for saving and loading models and pipelines.
PySpark MLlib is organized into several key components, each designed to handle different aspects of machine
learning tasks. Here’s a detailed overview of the main packages and features available in PySpark MLlib:
Data Types (pyspark.ml.linalg):
DenseVector
SparseVector
DenseMatrix
SparseMatrix
Pipelines
pyspark.ml.Pipeline: Allows chaining of multiple transformers and estimators into a single workflow.
Classification:
LogisticRegression
DecisionTreeClassifier
RandomForestClassifier
GBTClassifier (Gradient-Boosted Trees)
NaiveBayes
MultilayerPerceptronClassifier (Neural Networks)
Regression:
LinearRegression
DecisionTreeRegressor
RandomForestRegressor
GBTRegressor
AFTSurvivalRegression
Clustering:
KMeans
GaussianMixture
BisectingKMeans
LDA (Latent Dirichlet Allocation)
Collaborative Filtering:
ALS (Alternating Least Squares)
Frequent Pattern Mining:
FPGrowth
PrefixSpan
Evaluation Metrics:
BinaryClassificationEvaluator
MulticlassClassificationEvaluator
RegressionEvaluator
ClusteringEvaluator
Example
A minimal end-to-end sketch (the input data, feature columns, and the choice of LogisticRegression are illustrative assumptions):
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
spark = SparkSession.builder.appName("MLlibExample").getOrCreate()
# Illustrative training data: two numeric feature columns and a binary label
data = spark.createDataFrame([(1.0, 0.5, 1.0), (2.0, 1.5, 0.0), (3.0, 3.5, 1.0), (0.5, 0.1, 0.0)], ["f1", "f2", "label"])
# Feature engineering: assemble raw columns into a single feature vector
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
# Model
lr = LogisticRegression(featuresCol="features", labelCol="label")
# Pipeline
pipeline = Pipeline(stages=[assembler, lr])
# Train model
model = pipeline.fit(data)
# Make predictions
predictions = model.transform(data)
# Evaluate model
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
spark.stop()