
PySpark DataFrame Cheat Sheet

Getting Started with PySpark DataFrames

Initializing a SparkSession:

# Step 1: Import the required modules
from pyspark.sql import SparkSession

# Step 2: Create a SparkSession
# .appName() sets a name for your Spark application;
# .config() adds any additional configurations if needed.
spark = SparkSession.builder \
    .appName("PySpark DataFrame Tutorial") \
    .config("spark.some.config.option", "config-value") \
    .getOrCreate()

# Step 3: Verify the SparkSession
print(spark.version)  # Print the Spark version to ensure the SparkSession was created successfully

DataFrame Basics

In PySpark, a DataFrame is a distributed collection of data organized into named columns. For
example, consider the following simple DataFrame representing sales data with distinct rows
and columns:

+-------+------+--------+---------+
|OrderID|  Item|Quantity|UnitPrice|
+-------+------+--------+---------+
|    101| Apple|       5|     1.50|
|    102|Banana|       3|     0.75|
|    103|Orange|       2|     1.00|
+-------+------+--------+---------+
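If you want to follow along, here is a minimal sketch that recreates the sales table above (the variable names sales_data and sales_df are illustrative, not part of the original example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Sales Example").getOrCreate()

# Recreate the sales DataFrame shown above
sales_data = [
    (101, "Apple", 5, 1.50),
    (102, "Banana", 3, 0.75),
    (103, "Orange", 2, 1.00)
]
sales_df = spark.createDataFrame(sales_data, ["OrderID", "Item", "Quantity", "UnitPrice"])
sales_df.show()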

Creating DataFrames

PySpark's versatility extends beyond just working with structured data; it also offers the flexibility
to create DataFrames from Resilient Distributed Datasets (RDDs) and various external data sources.
Let's explore how you can harness these capabilities.

Creating DataFrames from RDDs

# Sample RDD
rdd = spark.sparkContext.parallelize([(1, "Alice"), (2, "Bob"), (3, "Charlie")])

# Convert RDD to DataFrame
df_from_rdd = rdd.toDF(["ID", "Name"])

# Show the DataFrame
df_from_rdd.show()

Working with Various Data Sources

# Creating DataFrame from CSV
csv_path = "path/to/data.csv"
df_from_csv = spark.read.csv(csv_path, header=True, inferSchema=True)

# Creating DataFrame from JSON
json_path = "path/to/data.json"
df_from_json = spark.read.json(json_path)

# Creating DataFrame from Parquet
parquet_path = "path/to/data.parquet"
df_from_parquet = spark.read.parquet(parquet_path)

Loading data from CSV

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Loading Data from CSV") \
    .getOrCreate()

# Load data from CSV file into a DataFrame
csv_file_path = "path/to/your/csv/file.csv"
df_csv = spark.read.csv(csv_file_path, header=True, inferSchema=True)

# Show the first few rows of the DataFrame
df_csv.show()

Loading data from JSON

# Load data from JSON file into a DataFrame
json_file_path = "path/to/your/json/file.json"
df_json = spark.read.json(json_file_path)

# Show the first few rows of the DataFrame
df_json.show()

Loading data from Parquet

# Load data from Parquet file into a DataFrame
parquet_file_path = "path/to/your/parquet/file.parquet"
df_parquet = spark.read.parquet(parquet_file_path)

# Show the first few rows of the DataFrame
df_parquet.show()
Basic DataFrame Operations

A sample DataFrame to demonstrate each operation:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a SparkSession
spark = SparkSession.builder \
    .appName("DataFrame Operations Example") \
    .getOrCreate()

# Sample data for the DataFrame
data = [
    (1, "Alice", 28),
    (2, "Bob", 22),
    (3, "Charlie", 35)
]

# Define the DataFrame schema
columns = ["ID", "Name", "Age"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

Display DataFrame contents

Input:

# Display the first 20 rows of the DataFrame
df.show()

Output:

+---+-------+---+
| ID|   Name|Age|
+---+-------+---+
|  1|  Alice| 28|
|  2|    Bob| 22|
|  3|Charlie| 35|
+---+-------+---+

Check DataFrame schema

Input:

# Check the DataFrame schema
df.printSchema()

Output:

root
 |-- ID: long (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)
Selecting and Filtering Data

Selecting specific columns from the DataFrame:

# Select specific columns from the DataFrame
selected_columns = df.select("Name", "Age")

# Show the DataFrame with selected columns
print("DataFrame with Selected Columns:")
selected_columns.show()

Output:

DataFrame with Selected Columns:
+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 28|
|    Bob| 22|
|Charlie| 35|
+-------+---+

Additionally, you can also use the col() function from pyspark.sql.functions to select columns.
Here's how you can do it:

# Using col() function to select columns
selected_columns_v2 = df.select(col("Name"), col("Age"))

# Show the DataFrame with selected columns using col() function
print("DataFrame with Selected Columns (using col() function):")
selected_columns_v2.show()

Output:

DataFrame with Selected Columns (using col() function):
+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 28|
|    Bob| 22|
|Charlie| 35|
+-------+---+

Filtering Data Elements

Creating a sample DataFrame to demonstrate each method:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Sample data for the DataFrame
data = [
    (1, "Alice", 28),
    (2, "Bob", 22),
    (3, "Charlie", 35),
    (4, "David", 30),
    (5, "Eva", 25)
]

# Define the DataFrame schema
schema = StructType([
    StructField("ID", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True)
])

# Create the DataFrame
df = spark.createDataFrame(data, schema)

Using filter() method

# Using filter() method
filtered_df = df.filter(df.Age > 25)

# Show the filtered DataFrame
print("Filtered DataFrame (using filter()):")
filtered_df.show()

Output:

Filtered DataFrame (using filter()):
+---+-------+---+
| ID|   Name|Age|
+---+-------+---+
|  1|  Alice| 28|
|  3|Charlie| 35|
|  4|  David| 30|
+---+-------+---+

Using expr() Function

from pyspark.sql.functions import expr

# Using expr() function
expr_filtered_df = df.filter(expr("Age > 25"))

# Show the filtered DataFrame
print("Filtered DataFrame (using expr() function):")
expr_filtered_df.show()

Output:

Filtered DataFrame (using expr() function):
+---+-------+---+
| ID|   Name|Age|
+---+-------+---+
|  1|  Alice| 28|
|  3|Charlie| 35|
|  4|  David| 30|
+---+-------+---+

Using where() Method

# Using where() method
where_filtered_df = df.where(df.Age > 25)

# Show the filtered DataFrame
print("Filtered DataFrame (using where() method):")
where_filtered_df.show()

Output:

Filtered DataFrame (using where() method):
+---+-------+---+
| ID|   Name|Age|
+---+-------+---+
|  1|  Alice| 28|
|  3|Charlie| 35|
|  4|  David| 30|
+---+-------+---+
Data Manipulation and Transformation

Let's use a sample DataFrame to demonstrate some common data transformation examples:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, concat, expr, when

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Data Transformation") \
    .getOrCreate()

# Sample data for the DataFrame
data = [
    (1, "Alice", 28),
    (2, "Bob", 22),
    (3, "Charlie", 35)
]

# Define the DataFrame schema
columns = ["ID", "Name", "Age"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

Using Expressions to Transform Data

# Using expressions to transform data
df_with_id_name = df.withColumn("ID_Name", concat(col("ID"), lit("_"), col("Name")))

# Show the DataFrame with the new "ID_Name" column
print("DataFrame with 'ID_Name' Column:")
df_with_id_name.show()

Output:

DataFrame with 'ID_Name' Column:
+---+-------+---+---------+
| ID|   Name|Age|  ID_Name|
+---+-------+---+---------+
|  1|  Alice| 28|  1_Alice|
|  2|    Bob| 22|    2_Bob|
|  3|Charlie| 35|3_Charlie|
+---+-------+---+---------+

Using SQL-like Expressions

# Using SQL-like expressions to transform data
df_with_age_group = df.withColumn("Age_Group", expr("CASE WHEN Age <= 25 THEN 'Young' ELSE 'Adult' END"))

# Show the DataFrame with the new "Age_Group" column
print("DataFrame with 'Age_Group' Column:")
df_with_age_group.show()

Output:

DataFrame with 'Age_Group' Column:
+---+-------+---+---------+
| ID|   Name|Age|Age_Group|
+---+-------+---+---------+
|  1|  Alice| 28|    Adult|
|  2|    Bob| 22|    Young|
|  3|Charlie| 35|    Adult|
+---+-------+---+---------+

Using Functions to Transform Data

# Using functions to transform data
df_with_status = df.withColumn("Status", when(col("Age") > 25, "Adult").otherwise("Young"))

# Show the DataFrame with the new "Status" column
print("DataFrame with 'Status' Column:")
df_with_status.show()

Output:

DataFrame with 'Status' Column:
+---+-------+---+------+
| ID|   Name|Age|Status|
+---+-------+---+------+
|  1|  Alice| 28| Adult|
|  2|    Bob| 22| Young|
|  3|Charlie| 35| Adult|
+---+-------+---+------+

Adding new columns

# Adding a new column "City" with a default value
df_with_new_column = df.withColumn("City", lit("New York"))

# Show the DataFrame with the new column
print("DataFrame with New Column:")
df_with_new_column.show()

Output:

DataFrame with New Column:
+---+-------+---+--------+
| ID|   Name|Age|    City|
+---+-------+---+--------+
|  1|  Alice| 28|New York|
|  2|    Bob| 22|New York|
|  3|Charlie| 35|New York|
+---+-------+---+--------+

Renaming Columns

# Renaming the column "Name" to "Full_Name"
df_with_renamed_column = df.withColumnRenamed("Name", "Full_Name")

# Show the DataFrame with the renamed column
print("DataFrame with Renamed Column:")
df_with_renamed_column.show()

Output:

DataFrame with Renamed Column:
+---+---------+---+
| ID|Full_Name|Age|
+---+---------+---+
|  1|    Alice| 28|
|  2|      Bob| 22|
|  3|  Charlie| 35|
+---+---------+---+

Dropping Columns

# Dropping the column "Age"
df_dropped_column = df.drop("Age")

# Show the DataFrame with the dropped column
print("DataFrame with Dropped Column:")
df_dropped_column.show()

Output:

DataFrame with Dropped Column:
+---+-------+
| ID|   Name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
+---+-------+
Aggregating and Grouping Data

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg, count, min, max

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Aggregation Functions") \
    .getOrCreate()

# Sample data for the DataFrame
data = [
    ("Alice", 100),
    ("Bob", 200),
    ("Charlie", 150),
    ("David", 120),
    ("Eva", 180)
]

# Define the DataFrame schema
columns = ["Name", "Salary"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

# Show the DataFrame
print("DataFrame:")
df.show()

Output:

DataFrame:
+-------+------+
|   Name|Salary|
+-------+------+
|  Alice|   100|
|    Bob|   200|
|Charlie|   150|
|  David|   120|
|    Eva|   180|
+-------+------+

Aggregation using sum() Function

# Using sum() function to calculate total salary
total_salary = df.select(sum("Salary")).collect()[0][0]

print("Total Salary:", total_salary)

Output:

Total Salary: 750

Aggregation using avg() Function

# Using avg() function to calculate average salary
average_salary = df.select(avg("Salary")).collect()[0][0]

print("Average Salary:", average_salary)

Output:

Average Salary: 150.0

Aggregation using count() Function

# Using count() function to count the number of employees
employee_count = df.select(count("Name")).collect()[0][0]

print("Number of Employees:", employee_count)

Output:

Number of Employees: 5

Aggregation using min() and max() Functions

# Using min() and max() functions to find minimum and maximum salary
min_salary = df.select(min("Salary")).collect()[0][0]
max_salary = df.select(max("Salary")).collect()[0][0]

print("Minimum Salary:", min_salary)
print("Maximum Salary:", max_salary)

Output:

Minimum Salary: 100
Maximum Salary: 200

Grouping Data Elements

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Grouping Data") \
    .getOrCreate()

# Sample data for the DataFrame
data = [
    ("Alice", "Department A", 100),
    ("Bob", "Department B", 200),
    ("Charlie", "Department A", 150),
    ("David", "Department C", 120),
    ("Eva", "Department B", 180)
]

# Define the DataFrame schema
columns = ["Name", "Department", "Salary"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

# Show the DataFrame
print("DataFrame:")
df.show()

Output:

DataFrame:
+-------+------------+------+
|   Name|  Department|Salary|
+-------+------------+------+
|  Alice|Department A|   100|
|    Bob|Department B|   200|
|Charlie|Department A|   150|
|  David|Department C|   120|
|    Eva|Department B|   180|
+-------+------------+------+

# Grouping data based on "Department" and calculating average salary
grouped_data = df.groupBy("Department").agg(avg("Salary").alias("AvgSalary"))

# Show the grouped data
print("Grouped Data:")
grouped_data.show()

Output:

Grouped Data:
+------------+---------+
|  Department|AvgSalary|
+------------+---------+
|Department B|    190.0|
|Department C|    120.0|
|Department A|    125.0|
+------------+---------+
Joins and Combining DataFrames

A sample example to illustrate the different join types:

DataFrame 1:
+---+-------+
| ID|   Name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
+---+-------+

DataFrame 2:
+---+--------+
| ID|    Role|
+---+--------+
|  2| Manager|
|  3|Employee|
|  4|  Intern|
+---+--------+

Now, let's demonstrate different join types:

1. Inner Join:

inner_join = df1.join(df2, on="ID", how="inner")
inner_join.show()

2. Outer Join:

outer_join = df1.join(df2, on="ID", how="outer")
outer_join.show()

3. Left Join:

left_join = df1.join(df2, on="ID", how="left")
left_join.show()

4. Right Join:

right_join = df1.join(df2, on="ID", how="right")
right_join.show()

Let's now use a few examples to illustrate how to combine DataFrames using different join types.
Consider two sample DataFrames:

a. DataFrame orders:
+-----+-------+--------+
|order|product|quantity|
+-----+-------+--------+
|  101|  apple|       3|
|  102| banana|       2|
|  103| orange|       4|
+-----+-------+--------+

b. DataFrame products:
+-------+-----+
|product|price|
+-------+-----+
|  apple|  1.5|
| banana| 0.75|
|  grape|  2.0|
+-------+-----+

1. Inner Join:

inner_join = orders.join(products, on="product", how="inner")
inner_join.show()

Output:
+-------+-----+--------+-----+
|product|order|quantity|price|
+-------+-----+--------+-----+
|  apple|  101|       3|  1.5|
| banana|  102|       2| 0.75|
+-------+-----+--------+-----+

2. Left Join:

left_join = orders.join(products, on="product", how="left")
left_join.show()

Output:
+-------+-----+--------+-----+
|product|order|quantity|price|
+-------+-----+--------+-----+
|  apple|  101|       3|  1.5|
| banana|  102|       2| 0.75|
| orange|  103|       4| null|
+-------+-----+--------+-----+

3. Right Join:

right_join = orders.join(products, on="product", how="right")
right_join.show()

Output:
+-------+-----+--------+-----+
|product|order|quantity|price|
+-------+-----+--------+-----+
|  apple|  101|       3|  1.5|
| banana|  102|       2| 0.75|
|  grape| null|    null|  2.0|
+-------+-----+--------+-----+

4. Outer Join:

outer_join = orders.join(products, on="product", how="outer")
outer_join.show()

Output:
+-------+-----+--------+-----+
|product|order|quantity|price|
+-------+-----+--------+-----+
|  apple|  101|       3|  1.5|
| banana|  102|       2| 0.75|
| orange|  103|       4| null|
|  grape| null|    null|  2.0|
+-------+-----+--------+-----+
Working with Dates and Timestamps

A sample DataFrame with date and timestamp data is shown below:

+-------+----------+-------------------+
|   Name|      Date|          Timestamp|
+-------+----------+-------------------+
|  Alice|2022-01-15|2022-01-15 08:30:00|
|    Bob|2021-12-20|2021-12-20 15:45:00|
|Charlie|2022-02-28|2022-02-28 11:00:00|
+-------+----------+-------------------+

Current Date and Timestamp

from pyspark.sql.functions import current_date, current_timestamp

# Adding columns for current date and timestamp
df_with_current_date = df.withColumn("CurrentDate", current_date())
df_with_current_timestamp = df.withColumn("CurrentTimestamp", current_timestamp())

# Show the DataFrames
print("DataFrame with Current Date:")
df_with_current_date.show()

print("DataFrame with Current Timestamp:")
df_with_current_timestamp.show()

Date Difference

from pyspark.sql.functions import datediff

date_diff_df = df.withColumn("DaysSince", datediff(current_date(), col("Date")))

# Show the DataFrame with date difference
print("DataFrame with Date Difference:")
date_diff_df.show()

Months Between

from pyspark.sql.functions import months_between

months_between_df = df.withColumn("MonthsBetween", months_between(current_date(), col("Date")))

# Show the DataFrame with months between
print("DataFrame with Months Between:")
months_between_df.show()

Date Addition and Subtraction

from pyspark.sql.functions import date_add, date_sub

date_add_df = df.withColumn("DatePlus10Days", date_add(col("Date"), 10))
date_sub_df = df.withColumn("DateMinus5Days", date_sub(col("Date"), 5))

# Show the DataFrames with date addition and subtraction
print("DataFrame with Date Addition:")
date_add_df.show()

print("DataFrame with Date Subtraction:")
date_sub_df.show()
Handling Missing Data

Identifying and handling missing data is a crucial step in data preprocessing and analysis.
PySpark provides several techniques to identify and handle missing data in DataFrames.
Let's explore these techniques.

Identifying Missing Data

PySpark represents missing data as null values. You can use various methods to identify missing
data in DataFrames:

▪ isNull() and isNotNull()
▪ dropna()
▪ fillna()

Handling Missing Data

• Dropping Rows with Null Values

df_without_null = df.dropna()

• Filling Null Values

df_filled_mean = df.fillna({'age': df.select(avg('age')).first()[0]})

• Imputation

from pyspark.ml.feature import Imputer

imputer = Imputer(inputCols=['age'], outputCols=['imputed_age'])
imputed_df = imputer.fit(df).transform(df)

• Adding an Indicator Column

df_with_indicator = df.withColumn('age_missing', df['age'].isNull())

• Handling Categorical Missing Data

df_filled_category = df.fillna({'gender': 'unknown'})
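The snippets above operate on a DataFrame with nullable age and gender columns, which is not defined elsewhere in this sheet. A minimal sketch of such a DataFrame, together with an isNull()-based check to identify the missing values, might look like this (the column names and sample values are illustrative assumptions):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, when

spark = SparkSession.builder.appName("Missing Data Example").getOrCreate()

# Hypothetical DataFrame with some missing values
people = spark.createDataFrame(
    [("Alice", 28, "F"), ("Bob", None, None), ("Charlie", 35, "M")],
    ["name", "age", "gender"]
)

# Rows where age is missing
people.filter(col("age").isNull()).show()

# Count nulls per column
people.select(
    [count(when(col(c).isNull(), c)).alias(c) for c in people.columns]
).show()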
Advanced DataFrame Operations

Window Functions

Suppose you have a DataFrame with sales data and you want to calculate the rolling average
of sales for each product over a specific window size.

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import col, avg

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Window Functions") \
    .getOrCreate()

# Sample data for the DataFrame
data = [
    ("ProductA", "2022-01-01", 100),
    ("ProductA", "2022-01-02", 150),
    ("ProductA", "2022-01-03", 200),
    ("ProductA", "2022-01-04", 120),
    ("ProductA", "2022-01-05", 180),
    ("ProductB", "2022-01-01", 50),
    ("ProductB", "2022-01-02", 70),
    ("ProductB", "2022-01-03", 90)
]

# Define the DataFrame schema
columns = ["Product", "Date", "Sales"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

# Define the window specification
window_spec = Window.partitionBy("Product").orderBy("Date").rowsBetween(-1, 1)

# Calculate rolling average using window function
df_with_rolling_avg = df.withColumn("RollingAvg", avg(col("Sales")).over(window_spec))

# Show the DataFrame with rolling average
df_with_rolling_avg.show()

Pivot Tables

Suppose you have a DataFrame with sales data and you want to create a pivot table showing
total sales for each product in different months.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Pivot Tables") \
    .getOrCreate()

# Sample data for the DataFrame
data = [
    ("ProductA", "2022-01-01", 100),
    ("ProductA", "2022-02-01", 150),
    ("ProductA", "2022-01-01", 200),
    ("ProductB", "2022-02-01", 120),
    ("ProductB", "2022-02-01", 180)
]

# Define the DataFrame schema
columns = ["Product", "Date", "Sales"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

# Create a pivot table
pivot_table = df.groupBy("Product").pivot("Date").sum("Sales")

# Show the pivot table
pivot_table.show()
Performance Optimization Tips

Optimizing PySpark DataFrame operations is essential to achieve better performance and
efficient data processing. Here are some tips and best practices to consider; a short sketch
that pulls several of them together (caching, partitioning, broadcasting, explain(), and Parquet)
follows the list.

1. Use Lazy Evaluation: PySpark uses lazy evaluation, which means that transformations on
DataFrames are not executed immediately but are queued up. This allows PySpark to
optimize the execution plan before actually performing computations.

2. Caching: Caching involves storing a DataFrame or RDD in memory so that it can be
reused efficiently across multiple operations. Use .cache() or .persist() to cache intermediate
DataFrames that are used in multiple transformations or actions.

3. Partitioning: Partitioning involves dividing your data into smaller subsets (partitions)
based on certain criteria. This can significantly improve query performance as it reduces the
amount of data that needs to be processed. Use .repartition() or .coalesce() to manage
partitions.

4. Broadcasting: Broadcasting is a technique where smaller DataFrames are distributed to
worker nodes and cached in memory for join operations. This is particularly useful when you
have a small DataFrame that needs to be joined with a larger one.

5. Avoid Using .collect(): Using .collect() brings all the data to the driver node, which can
lead to memory issues. Instead, try to perform operations using distributed computations.

6. Use Built-in Functions: Whenever possible, use built-in functions from the pyspark.sql.functions
module. These functions are optimized for distributed processing and provide better
performance than custom Python functions.

7. Avoid Shuffling: Shuffling is an expensive operation that involves data movement
between partitions. Minimize shuffling by using appropriate partitioning and avoiding
operations that require data to be rearranged.

8. Optimize Joins: Joins can be performance-intensive. Try to avoid shuffling during joins by
ensuring both DataFrames are appropriately partitioned and using the appropriate join
strategy (broadcast, sortMerge, etc.).

9. Use explain(): Use the explain() method on DataFrames to understand the execution plan
and identify potential optimization opportunities.

10. Hardware Considerations: Consider the cluster configuration, hardware resources,
and the amount of data being processed. Properly allocating resources and scaling up the
cluster can significantly impact performance.

11. Monitor Resource Usage: Keep an eye on resource usage, including CPU, memory, and
disk I/O. Monitoring can help identify performance bottlenecks and resource constraints.

12. Use Parquet Format: Parquet is a columnar storage format that is highly efficient for
both reading and writing. Consider using Parquet for storage as it can improve read and write
performance.
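The sketch below is one way to put several of these tips into practice. It is illustrative only: the file paths, the Department and Salary columns, and the large_df/small_df DataFrames are assumptions made for the example, not objects defined earlier in this cheat sheet.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, avg

spark = SparkSession.builder.appName("Optimization Tips Sketch").getOrCreate()

# Hypothetical inputs; replace the paths with your own data
large_df = spark.read.parquet("path/to/large/dataset")  # tip 12: Parquet is efficient to read
small_df = spark.read.csv("path/to/small/lookup.csv", header=True, inferSchema=True)

# Tip 2: cache a DataFrame that is reused by several actions
filtered = large_df.filter(large_df["Salary"] > 100).cache()

# Tip 1: nothing has executed yet (lazy evaluation); the first action triggers the job
print(filtered.count())

# Tip 3: control the number of partitions before an expensive wide operation
repartitioned = filtered.repartition(8, "Department")

# Tips 4 and 8: broadcast the small DataFrame so the join avoids a full shuffle
joined = repartitioned.join(broadcast(small_df), on="Department", how="left")

# Tip 6: prefer built-in functions such as avg() over custom Python functions
summary = joined.groupBy("Department").agg(avg("Salary").alias("AvgSalary"))

# Tip 9: inspect the execution plan to spot shuffles and broadcast joins
summary.explain()

# Tip 5: avoid collect(); take a small sample or write the result out instead
print(summary.take(5))

# Tip 12: Parquet is also an efficient output format
summary.write.mode("overwrite").parquet("path/to/output/summary.parquet")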
PySpark SQL Cheat Sheet: SQL Functions for DataFrames

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a SparkSession
spark = SparkSession.builder \
    .appName("PySpark SQL Functions") \
    .getOrCreate()

# Sample data for the DataFrame
data = [
    ("Alice", 28),
    ("Bob", 22),
    ("Charlie", 35)
]

# Define the DataFrame schema
columns = ["Name", "Age"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

Filtering Data

# Using SQL-like functions to filter data
filtered_data = df.filter(col("Age") > 25)
selected_columns = df.select("Name", "Age")

# Show the filtered and selected data
print("Filtered Data:")
filtered_data.show()

print("Selected Columns:")
selected_columns.show()

Sorting Data

from pyspark.sql.functions import desc

# Using SQL-like functions for sorting
sorted_data = df.orderBy("Age")
desc_sorted_data = df.sort(col("Age").desc())

# Show the sorted data
print("Sorted Data:")
sorted_data.show()

print("Descending Sorted Data:")
desc_sorted_data.show()

Aggregation

from pyspark.sql.functions import avg, max

# Using SQL-like functions for aggregation
grouped_data = df.groupBy("Age").agg(avg("Age"), max("Age"))

# Show the aggregated data
print("Aggregated Data:")
grouped_data.show()

Joining DataFrames

# Sample data for another DataFrame
data2 = [
    ("Alice", "New York"),
    ("Bob", "San Francisco"),
    ("Eva", "Los Angeles")
]

columns2 = ["Name", "City"]
df2 = spark.createDataFrame(data2, columns2)

# Using SQL-like functions to join DataFrames
joined_data = df.join(df2, on="Name", how="inner")

# Show the joined data
print("Joined Data:")
joined_data.show()