
Module - 5

Data analysis using Apache Spark programming: Introduction to DataFrame and Dataset, Spark types, overview of
structured API execution (logical planning, physical planning), Spark SQL, performing ETL using PySpark, ML
packages in Spark, real-world applications.

DataFrame

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a
table in a relational database. DataFrames provide a higher-level abstraction compared to RDDs (Resilient
Distributed Datasets) and are optimized for SQL-style operations.

Key Features of DataFrame:

Schema: DataFrames have a schema that defines the column names and data types.
API: DataFrames provide a domain-specific language for structured data manipulation (filtering, aggregations,
etc.) in languages like Java, Scala, Python, and R.
Optimization: DataFrames leverage the Catalyst query optimizer for efficient query execution.
Interoperability: They can interoperate with SQL queries using the Spark SQL module.

Example:

Creating a DataFrame from a JSON file:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DataFrame Example").getOrCreate()

val df = spark.read.json("path/to/your/json/file")

df.show()

df.printSchema()

df.select("name").show()

df.filter($"age" > 21).show()

df.groupBy("age").count().show()

Dataset

A Dataset is a distributed collection of data that combines the benefits of RDDs with the optimizations of
DataFrames. Datasets provide type safety, meaning that errors can be caught at compile time, which is particularly
useful in large-scale data processing applications.

Key Features of Dataset:

Type Safety: Datasets are strongly typed, allowing the compiler to catch errors at compile time.
API: Datasets provide a high-level API similar to DataFrames, but with the added benefit of compile-time type
checking.
Interoperability: Datasets can interoperate with DataFrames and SQL queries.
Optimizations: Datasets leverage the Catalyst optimizer and Tungsten execution engine for efficient query
execution.

Example:

Creating a Dataset from a case class:

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder() .appName("Dataset Example").getOrCreate()

import spark.implicits._case class Person(name: String, age: Long)

val peopleDS = Seq(Person("John", 25), Person("Doe", 29)).toDS()

peopleDS.show()

val adultsDS = peopleDS.filter(_.age > 21)

adultsDS.show()

Spark Types

All data types of Spark SQL are located in the package pyspark.sql.types. You can access them with:

from pyspark.sql.types import *

For each data type, the corresponding value type in Python and the API used to access or create the data type are listed below.

ByteType: int or long; API: ByteType(). Note: Numbers will be converted to 1-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -128 to 127.

ShortType: int or long; API: ShortType(). Note: Numbers will be converted to 2-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -32768 to 32767.

IntegerType: int or long; API: IntegerType().

LongType: long; API: LongType(). Note: Numbers will be converted to 8-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -9223372036854775808 to 9223372036854775807. Otherwise, please convert data to decimal.Decimal and use DecimalType.

FloatType: float; API: FloatType(). Note: Numbers will be converted to 4-byte single-precision floating point numbers at runtime.

DoubleType: float; API: DoubleType().

DecimalType: decimal.Decimal; API: DecimalType().

StringType: string; API: StringType().

BinaryType: bytearray; API: BinaryType().

BooleanType: bool; API: BooleanType().

TimestampType: datetime.datetime; API: TimestampType().

TimestampNTZType: datetime.datetime; API: TimestampNTZType().

DateType: datetime.date; API: DateType().

DayTimeIntervalType: datetime.timedelta; API: DayTimeIntervalType().

ArrayType: list, tuple, or array; API: ArrayType(elementType, [containsNull]). Note: The default value of containsNull is True.

MapType: dict; API: MapType(keyType, valueType, [valueContainsNull]). Note: The default value of valueContainsNull is True.

StructType: list or tuple; API: StructType(fields). Note: fields is a Seq of StructFields. Also, two fields with the same name are not allowed.

StructField: the value type in Python of the data type of this field (for example, int for a StructField with the data type IntegerType); API: StructField(name, dataType, [nullable]). Note: The default value of nullable is True.

from pyspark.sql import SparkSession

from pyspark.sql.types import (StructType, StructField, StringType, IntegerType, FloatType, DateType, ArrayType)

from pyspark.sql.functions import col

from datetime import date  # used to build values for the DateType column

# Initialize a SparkSession

spark = SparkSession.builder \
    .appName("Spark Data Types Example") \
    .getOrCreate()

# Define the schema using Spark data types

schema = StructType([

StructField("name", StringType(), True),

StructField("age", IntegerType(), True),

StructField("salary", FloatType(), True),

StructField("joining_date", DateType(), True),

StructField("skills", ArrayType(StringType()), True)

])

# Create a list of rows (data)

data = [ ("Alice", 30, 50000.0, "2015-04-23", ["Python", "Spark"]), ("Bob", 25, 45000.0, "2017-08-12", ["Java",
"Scala"]),

("Charlie", 35, 70000.0, "2012-05-19", ["SQL", "R"]), ("David", 29, 60000.0, "2018-01-01", ["Python", "Java"]) ]

# Create a DataFrame using the schema and data

df = spark.createDataFrame(data, schema)

# Show the DataFrame schema

df.printSchema()

# Show the DataFrame data

df.show()
# Perform some operations

# 1. Select name and age columns

df.select("name", "age").show()

# 2. Filter rows where age is greater than 28

df.filter(col("age") > 28).show()

# 3. Group by age and count

df.groupBy("age").count().show()

# Stop the SparkSession

spark.stop()

Output

The output will display the schema and data of the DataFrame, and the results of the select, filter, and group by
operations. The schema will look like this:

root

|-- name: string (nullable = true)

|-- age: integer (nullable = true)

|-- salary: float (nullable = true)

|-- joining_date: date (nullable = true)

|-- skills: array (nullable = true)

| |-- element: string (containsNull = true)

Logical and Physical Plan

In Apache Spark, the process of query execution involves several stages, including the creation of logical and
physical plans. These plans are part of the Catalyst optimizer, which optimizes and executes the queries efficiently.
Here’s a detailed explanation of logical and physical plans in Spark:

Logical Plan

The logical plan is an abstract representation of the structured query. It goes through several stages before becoming
an executable plan. The stages include parsing, analysis, and logical optimization.

1. Parsing

Description: The initial step where the high-level query (written in SQL, DataFrame, or Dataset API) is parsed
into an unresolved logical plan.
Unresolved Logical Plan: This plan captures the structure of the query but doesn't resolve references to
columns or tables.

2. Analysis
Description: The unresolved logical plan is analyzed to resolve all references to datasets, columns, and
functions.
Analyzer: Uses the catalog (metadata repository) to resolve these references.
Resolved Logical Plan: After analysis, the logical plan includes fully qualified identifiers.

3. Logical Optimization

Description: The resolved logical plan is optimized using various rule-based optimizations.
Catalyst Optimizer: Applies several rules to simplify and optimize the logical plan.
Optimized Logical Plan: The result is a logical plan that is efficient and ready for translation to a physical plan.

Physical Plan

The physical plan is a detailed plan that describes how the logical plan will be executed on the cluster. It includes the
specific operations and their order, as well as details about data distribution and execution strategies.

1. Physical Planning

Description: The optimized logical plan is translated into one or more physical execution plans.
Physical Plan(s): Each physical plan represents a potential way to execute the query, detailing how data is
distributed and how operations are executed on the data.

2. Cost-Based Optimization

Description: Among the generated physical plans, Spark uses cost-based optimization to select the most
efficient plan.
Cost-Based Optimizer: Evaluates the cost of each plan based on factors like data size, complexity of
operations, and available resources.
Selected Physical Plan: The plan with the lowest estimated cost is chosen for execution.

3. Execution

Description: The selected physical plan is executed across the Spark cluster.
Task Execution: The plan is divided into tasks that are distributed to worker nodes.
Shuffling and Execution: Data is shuffled as necessary (e.g., for joins or aggregations), and intermediate results
are combined to produce the final output.

Example of Logical and Physical Plans

Consider a simple example where we read a JSON file into a DataFrame, filter the data, and perform a group by
operation:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Logical and Physical Plan Example").getOrCreate()


# Read JSON file into DataFrame
df = spark.read.json("path/to/your/json/file")
# Perform transformations
filtered_df = df.filter(df["age"] > 21)
grouped_df = filtered_df.groupBy("age").count()
# Show the DataFrame schema
grouped_df.printSchema()
# Explain the query plan (physical plan only, by default)
grouped_df.explain()
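
By default, explain() prints only the physical plan. Passing True (or, in Spark 3.0 and later, a mode string) also prints the parsed, analyzed, and optimized logical plans, which correspond to the stages described above:

# Physical plan only
grouped_df.explain()

# Parsed, analyzed, and optimized logical plans plus the physical plan
grouped_df.explain(True)

# Spark 3.0+ also accepts a mode: "simple", "extended", "codegen", "cost", or "formatted"
grouped_df.explain(mode="formatted")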

Spark SQL

Spark SQL is a module in Apache Spark that allows for structured data processing using SQL queries. It integrates
relational processing with Spark’s functional programming API, enabling users to run SQL queries, both ad-hoc and
complex, on large-scale datasets.
Key Features of Spark SQL:

Unified Data Access: Allows access to data in various formats (like JSON, Parquet, Avro) and storage systems
(like HDFS, HBase, Cassandra).
Seamless Integration with Spark: Combines SQL querying capabilities with Spark’s powerful data processing
features.
Optimized Execution: Utilizes the Catalyst optimizer and Tungsten execution engine to optimize query plans
and execution for better performance.
DataFrames and Datasets: Provides DataFrame and Dataset APIs, which are more type-safe and object-
oriented than RDDs, for working with structured data.
Hive Compatibility: Supports querying data stored in Apache Hive, leveraging Hive's metastore, SerDes, and
UDFs.

Core Components of Spark SQL:

DataFrame API: A DataFrame is a distributed collection of data organized into named columns. It provides a
domain-specific language for structured data manipulation.
Dataset API: Dataset, like DataFrame, is a distributed collection of data but adds the benefits of type-safety and
object-oriented programming, allowing compile-time type checking.
SQL Queries: Spark SQL allows executing raw SQL queries directly, making it possible to leverage SQL skills
directly in Spark.
Catalyst Optimizer: A sophisticated query optimizer that automatically optimizes the logical and physical
query plans for improved performance.
Tungsten Execution Engine: A backend execution engine focused on CPU and memory efficiency.
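
A minimal PySpark sketch of running raw SQL alongside the DataFrame API (the file path, view name, and column names are placeholders for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Spark SQL Example").getOrCreate()

# Read structured data and register it as a temporary view (placeholder path)
df = spark.read.json("path/to/your/json/file")
df.createOrReplaceTempView("people")

# Run an ad-hoc SQL query over the view
adults = spark.sql("SELECT name, age FROM people WHERE age > 21")
adults.show()

# The same result via the DataFrame API
df.filter(df["age"] > 21).select("name", "age").show()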

ML Packages in Spark

Apache Spark includes a powerful library for machine learning called MLlib. MLlib contains a variety of tools
designed to make it easy to build and deploy machine learning models at scale. The library is divided into two main
components: the original RDD-based API and the newer DataFrame-based API, which is part of the spark.ml
package. The DataFrame-based API is recommended for most users due to its simplicity, scalability, and ease of use.

Key Features of Spark MLlib:

Algorithms: Includes a wide array of algorithms for classification, regression, clustering, collaborative filtering,
and more.
Featurization: Tools for feature extraction, transformation, and selection.
Pipelines: High-level APIs for building machine learning workflows.
Evaluation Metrics: Tools for evaluating the performance of machine learning models.
Persistence: Tools for saving and loading models and entire pipelines.

PySpark MLlib is organized into several key components, each designed to handle different aspects of machine
learning tasks. Here’s a detailed overview of the main packages and features available in PySpark MLlib:

Key Components of PySpark MLlib

Data Types:

Vectors: Used to represent features for ML algorithms.
DenseVector
SparseVector

Matrices: Used for mathematical computations.
DenseMatrix
SparseMatrix

Feature Extraction and Transformation:

pyspark.ml.feature: Contains tools for feature extraction, transformation, normalization, and selection (a short sketch follows this list).
Tokenizer, RegexTokenizer: Split text into words.
CountVectorizer, HashingTF: Convert text into feature vectors.
PCA, StandardScaler, MinMaxScaler: For dimensionality reduction and scaling.
StringIndexer, OneHotEncoder: Convert categorical data into numerical format.
VectorAssembler: Combine multiple columns into a single feature vector.
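
A minimal sketch of two of the text featurization tools listed above, Tokenizer and HashingTF (the sample sentences and column names are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF

spark = SparkSession.builder.appName("FeatureExample").getOrCreate()

# Toy text data (made up for illustration)
sentences = spark.createDataFrame(
    [(0, "spark makes big data simple"), (1, "mllib provides machine learning tools")],
    ["id", "text"])

# Split each sentence into words
tokenizer = Tokenizer(inputCol="text", outputCol="words")
words = tokenizer.transform(sentences)

# Hash the words into a fixed-length feature vector
hashing_tf = HashingTF(inputCol="words", outputCol="features", numFeatures=32)
features = hashing_tf.transform(words)

features.select("text", "features").show(truncate=False)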

Pipelines:

pyspark.ml.Pipeline: Allows chaining of multiple transformers and estimators into a single workflow.

Pipeline: Defines a sequence of stages (transformers and estimators).
PipelineModel: Represents a fitted pipeline.

Classification and Regression:

pyspark.ml.classification: Algorithms for classification tasks.

LogisticRegression
DecisionTreeClassifier
RandomForestClassifier
GBTClassifier (Gradient-Boosted Trees)
NaiveBayes
MultilayerPerceptronClassifier (Neural Networks)

pyspark.ml.regression: Algorithms for regression tasks (a short sketch follows this list).

LinearRegression
DecisionTreeRegressor
RandomForestRegressor
GBTRegressor
AFTSurvivalRegression
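
A minimal regression sketch using LinearRegression (the toy data, column names, and session name are assumptions for illustration):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("RegressionExample").getOrCreate()

# Toy data: label is roughly 2*x + 1 (made up for illustration)
rows = spark.createDataFrame([(1.0, 3.1), (2.0, 5.0), (3.0, 7.2), (4.0, 8.9)], ["x", "label"])
train = VectorAssembler(inputCols=["x"], outputCol="features").transform(rows)

# Fit an ordinary least squares model and inspect the learned parameters
lr = LinearRegression(featuresCol="features", labelCol="label")
model = lr.fit(train)
print(model.coefficients, model.intercept)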

Clustering:

pyspark.ml.clustering: Algorithms for clustering tasks (a short sketch follows this list).

KMeans
GaussianMixture
BisectingKMeans
LDA (Latent Dirichlet Allocation)
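
A minimal clustering sketch using KMeans (the toy points and session name are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("KMeansExample").getOrCreate()

# Toy two-dimensional points (made up for illustration)
points = spark.createDataFrame([(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)], ["x", "y"])

# Assemble the coordinate columns into a single feature vector
assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
dataset = assembler.transform(points)

# Fit a KMeans model with two clusters and show the cluster assignments
kmeans = KMeans(k=2, featuresCol="features", seed=1)
model = kmeans.fit(dataset)
model.transform(dataset).select("x", "y", "prediction").show()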

Collaborative Filtering:

pyspark.ml.recommendation: For building recommendation systems.

ALS (Alternating Least Squares)

Frequent Pattern Mining:

pyspark.ml.fpm: Algorithms for frequent pattern mining.

FPGrowth
PrefixSpan

Evaluation Metrics:

pyspark.ml.evaluation: Tools for evaluating the performance of ML models.

BinaryClassificationEvaluator
MulticlassClassificationEvaluator
RegressionEvaluator
ClusteringEvaluator

Example

from pyspark.sql import SparkSession

from pyspark.ml import Pipeline

from pyspark.ml.feature import VectorAssembler, StringIndexer

from pyspark.ml.classification import RandomForestClassifier

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Create Spark session

spark = SparkSession.builder.appName("MLlibExample").getOrCreate()

# Load and prepare data

data = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)

# Feature engineering

indexer = StringIndexer(inputCol="label", outputCol="indexedLabel")

assembler = VectorAssembler(inputCols=["feature1", "feature2", "feature3"], outputCol="features")

# Model

rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="features", numTrees=10)

# Pipeline

pipeline = Pipeline(stages=[indexer, assembler, rf])

# Train model

model = pipeline.fit(data)

# Make predictions

predictions = model.transform(data)

# Evaluate model

evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction",


metricName="accuracy")

accuracy = evaluator.evaluate(predictions)

print(f"Test Accuracy: {accuracy}")

# Stop Spark session

spark.stop()
