MOD5-BDA
Data analysis using Apache Spark programming: Introduction to DataFrame and Dataset, Spark Types, overview
of structured API execution: logical planning, physical planning, Spark SQL, performing ETL using PySpark, ML
packages in Spark, real-world applications.
DataFrame
A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a
table in a relational database. DataFrames provide a higher-level abstraction compared to RDDs (Resilient
Distributed Datasets) and are optimized for SQL-style operations.
Schema: DataFrames have a schema that defines the column names and data types.
API: DataFrames provide a domain-specific language for structured data manipulation (filtering, aggregations,
etc.) in languages like Java, Scala, Python, and R.
Optimization: DataFrames leverage the Catalyst query optimizer for efficient query execution.
Interoperability: They can interoperate with SQL queries using the Spark SQL module.
Example:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("DataFrame Example").getOrCreate()
val df = spark.read.json("path/to/your/json/file")
df.show()
df.printSchema()
df.select("name").show()
df.groupBy("age").count().show()
Dataset
A Dataset is a distributed collection of data that combines the benefits of RDDs with the optimizations of
DataFrames. Datasets provide type safety, meaning that errors can be caught at compile time, which is particularly
useful in large-scale data processing applications.
Type Safety: Datasets are strongly typed, allowing the compiler to catch errors at compile time.
API: Datasets provide a high-level API similar to DataFrames, but with the added benefit of compile-time type
checking.
Interoperability: Datasets can interoperate with DataFrames and SQL queries.
Optimizations: Datasets leverage the Catalyst optimizer and Tungsten execution engine for efficient query
execution.
Example:
import org.apache.spark.sql.SparkSession
case class Person(name: String, age: Long)
val spark = SparkSession.builder().appName("Dataset Example").getOrCreate()
import spark.implicits._
val peopleDS = Seq(Person("Alice", 30), Person("Bob", 16)).toDS()  // sample data for illustration
val adultsDS = peopleDS.filter(_.age >= 18)
peopleDS.show()
adultsDS.show()
Spark Types
All data types of Spark SQL are located in the package pyspark.sql.types. You can access them with:
from pyspark.sql.types import (StructType, StructField, StringType, IntegerType, FloatType, DateType, ArrayType)
from pyspark.sql import SparkSession
from datetime import date
# Initialize a SparkSession
spark = SparkSession.builder.appName("SparkTypesExample").getOrCreate()
# Define an explicit schema (column names after "name" and "age" are inferred from the sample data)
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("salary", FloatType(), True),
    StructField("hire_date", DateType(), True),
    StructField("skills", ArrayType(StringType()), True)
])
data = [
    ("Alice", 30, 50000.0, date(2015, 4, 23), ["Python", "Spark"]),
    ("Bob", 25, 45000.0, date(2017, 8, 12), ["Java", "Scala"]),
    ("Charlie", 35, 70000.0, date(2012, 5, 19), ["SQL", "R"]),
    ("David", 29, 60000.0, date(2018, 1, 1), ["Python", "Java"])
]
df = spark.createDataFrame(data, schema)
df.printSchema()
df.show()
# Perform some operations
df.select("name", "age").show()
df.groupBy("age").count().show()
spark.stop()
Output
The output displays the schema and data of the DataFrame, followed by the results of the select and group by
operations. The schema will look like this:
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- salary: float (nullable = true)
 |-- hire_date: date (nullable = true)
 |-- skills: array (nullable = true)
 |    |-- element: string (containsNull = true)
In Apache Spark, the process of query execution involves several stages, including the creation of logical and
physical plans. These plans are part of the Catalyst optimizer, which optimizes and executes the queries efficiently.
Here’s a detailed explanation of logical and physical plans in Spark:
Logical Plan
The logical plan is an abstract representation of the structured query. It goes through several stages before becoming
an executable plan. The stages include parsing, analysis, and logical optimization.
1. Parsing
Description: The initial step where the high-level query (written in SQL, DataFrame, or Dataset API) is parsed
into an unresolved logical plan.
Unresolved Logical Plan: This plan captures the structure of the query but doesn't resolve references to
columns or tables.
2. Analysis
Description: The unresolved logical plan is analyzed to resolve all references to datasets, columns, and
functions.
Analyzer: Uses the catalog (metadata repository) to resolve these references.
Resolved Logical Plan: After analysis, the logical plan includes fully qualified identifiers.
3. Logical Optimization
Description: The resolved logical plan is optimized using various rule-based optimizations.
Catalyst Optimizer: Applies several rules to simplify and optimize the logical plan.
Optimized Logical Plan: The result is a logical plan that is efficient and ready for translation to a physical plan.
Physical Plan
The physical plan is a detailed plan that describes how the logical plan will be executed on the cluster. It includes the
specific operations and their order, as well as details about data distribution and execution strategies.
1. Physical Planning
Description: The optimized logical plan is translated into one or more physical execution plans.
Physical Plan(s): Each physical plan represents a potential way to execute the query, detailing how data is
distributed and how operations are executed on the data.
2. Cost-Based Optimization
Description: Among the generated physical plans, Spark uses cost-based optimization to select the most
efficient plan.
Cost-Based Optimizer: Evaluates the cost of each plan based on factors like data size, complexity of
operations, and available resources.
Selected Physical Plan: The plan with the lowest estimated cost is chosen for execution.
3. Execution
Description: The selected physical plan is executed across the Spark cluster.
Task Execution: The plan is divided into tasks that are distributed to worker nodes.
Shuffling and Execution: Data is shuffled as necessary (e.g., for joins or aggregations), and intermediate results
are combined to produce the final output.
Consider a simple example where we read a JSON file into a DataFrame, filter the data, and perform a group by
operation:
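A minimal PySpark sketch of this example (the input path and column name are illustrative assumptions); calling explain(True) on the resulting DataFrame prints the parsed, analyzed, and optimized logical plans followed by the selected physical plan:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PlanExample").getOrCreate()
# Hypothetical JSON input with an "age" column, used only to illustrate plan inspection
df = spark.read.json("path/to/people.json")
result = df.filter(df.age > 21).groupBy("age").count()
# Prints the parsed, analyzed, and optimized logical plans and the physical plan
result.explain(True)
spark.stop()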
Spark SQL
Spark SQL is a module in Apache Spark that allows for structured data processing using SQL queries. It integrates
relational processing with Spark’s functional programming API, enabling users to run SQL queries, both ad-hoc and
complex, on large-scale datasets.
Key Features of Spark SQL:
Unified Data Access: Allows access to data in various formats (like JSON, Parquet, Avro) and storage systems
(like HDFS, HBase, Cassandra).
Seamless Integration with Spark: Combines SQL querying capabilities with Spark’s powerful data processing
features.
Optimized Execution: Utilizes the Catalyst optimizer and Tungsten execution engine to optimize query plans
and execution for better performance.
DataFrames and Datasets: Provides DataFrame and Dataset APIs, which are more type-safe and object-
oriented than RDDs, for working with structured data.
Hive Compatibility: Supports querying data stored in Apache Hive, leveraging Hive's metastore, SerDes, and
UDFs.
DataFrame API: A DataFrame is a distributed collection of data organized into named columns. It provides a
domain-specific language for structured data manipulation.
Dataset API: Dataset, like DataFrame, is a distributed collection of data but adds the benefits of type-safety and
object-oriented programming, allowing compile-time type checking.
SQL Queries: Spark SQL allows executing raw SQL queries directly, making it possible to leverage SQL skills
directly in Spark.
Catalyst Optimizer: A sophisticated query optimizer that automatically optimizes the logical and physical
query plans for improved performance.
Tungsten Execution Engine: A backend execution engine focused on CPU and memory efficiency.
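A minimal sketch of running an ad-hoc SQL query through Spark SQL (the view name, data, and column names are illustrative assumptions): a DataFrame is registered as a temporary view and then queried with spark.sql, which returns another DataFrame.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()
# Illustrative in-memory data; any DataFrame can be exposed as a temporary view
df = spark.createDataFrame([("Alice", 30), ("Bob", 25)], ["name", "age"])
df.createOrReplaceTempView("people")
# Run a raw SQL query against the view; the result is again a DataFrame
adults = spark.sql("SELECT name, age FROM people WHERE age >= 26")
adults.show()
spark.stop()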
ML Packages in Spark
Apache Spark includes a powerful library for machine learning called MLlib. MLlib contains a variety of tools
designed to make it easy to build and deploy machine learning models at scale. The library is divided into two main
components: the original RDD-based API and the newer DataFrame-based API, which is part of the spark.ml
package. The DataFrame-based API is recommended for most users due to its simplicity, scalability, and ease of use.
Algorithms: Includes a wide array of algorithms for classification, regression, clustering, collaborative filtering,
and more.
Featurization: Tools for feature extraction, transformation, and selection.
Pipelines: High-level APIs for building machine learning workflows.
Evaluation Metrics: Tools for evaluating the performance of machine learning models.
Persistence: Tools for saving and loading models and pipelines.
PySpark MLlib is organized into several key components, each designed to handle different aspects of machine
learning tasks. Here’s a detailed overview of the main packages and features available in PySpark MLlib:
Data Types (pyspark.ml.linalg):
DenseVector
SparseVector
DenseMatrix
SparseMatrix
Pipelines
pyspark.ml.Pipeline: Allows chaining of multiple transformers and estimators into a single workflow.
Classification:
LogisticRegression
DecisionTreeClassifier
RandomForestClassifier
GBTClassifier (Gradient-Boosted Trees)
NaiveBayes
MultilayerPerceptronClassifier (Neural Networks)
Regression:
LinearRegression
DecisionTreeRegressor
RandomForestRegressor
GBTRegressor
AFTSurvivalRegression
Clustering:
KMeans
GaussianMixture
BisectingKMeans
LDA (Latent Dirichlet Allocation)
Collaborative Filtering:
ALS (Alternating Least Squares)
Frequent Pattern Mining:
FPGrowth
PrefixSpan
Evaluation Metrics:
BinaryClassificationEvaluator
MulticlassClassificationEvaluator
RegressionEvaluator
ClusteringEvaluator
Example
A minimal end-to-end sketch (the input data, feature columns, and the choice of LogisticRegression are illustrative assumptions):
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
spark = SparkSession.builder.appName("MLlibExample").getOrCreate()
# Illustrative training data: two numeric feature columns and a binary label
data = spark.createDataFrame([(1.0, 0.5, 1.0), (2.0, 1.5, 0.0), (3.0, 3.5, 1.0), (0.5, 0.1, 0.0)], ["f1", "f2", "label"])
# Feature engineering: assemble raw columns into a single feature vector
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
# Model
lr = LogisticRegression(featuresCol="features", labelCol="label")
# Pipeline
pipeline = Pipeline(stages=[assembler, lr])
# Train model
model = pipeline.fit(data)
# Make predictions
predictions = model.transform(data)
# Evaluate model
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
spark.stop()