PySpark DataFrame Cheat Sheet

PySpark DataFrames are initialized by creating a SparkSession, which connects a Python application to an existing Spark cluster. DataFrames let you work with structured data through Spark SQL and support operations such as filtering, aggregation, joining, and sorting. Common date operations include calculating the number of months between two dates and sorting data by date columns.
Initializing a SparkSession

# Step 1: Import the required modules
from pyspark.sql import SparkSession

# Step 2: Create a SparkSession
# - appName() sets a name for your Spark application
# - config() adds any additional configuration options if needed
spark = SparkSession.builder \
    .appName("PySpark DataFrame Tutorial") \
    .config("spark.some.config.option", "config-value") \
    .getOrCreate()

# Step 3: Verify the SparkSession
print(spark.version)  # Print the Spark version to confirm the SparkSession was created successfully
DataFrame Basics

In PySpark, a DataFrame is a distributed collection of data organized into named columns. For example, consider the following simple DataFrame representing sales data with distinct rows and columns:

+---------+--------+----------+-----------+
| OrderID | Item   | Quantity | UnitPrice |
+---------+--------+----------+-----------+
|     101 | Apple  |        5 |      1.50 |
|     102 | Banana |        3 |      0.75 |
|     103 | Orange |        2 |      1.00 |
+---------+--------+----------+-----------+
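As a minimal sketch (not part of the original cheat sheet), the same table can be built directly from in-memory data, assuming the SparkSession created above:

# Sketch: build the sample sales DataFrame from a list of tuples
sales_data = [
    (101, "Apple", 5, 1.50),
    (102, "Banana", 3, 0.75),
    (103, "Orange", 2, 1.00),
]
sales_columns = ["OrderID", "Item", "Quantity", "UnitPrice"]
sales_df = spark.createDataFrame(sales_data, sales_columns)
sales_df.show()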
Creating DataFrames

PySpark's versatility extends beyond working with structured data; it also offers the flexibility to create DataFrames from Resilient Distributed Datasets (RDDs) and various external data sources. Let's explore how you can harness these capabilities.

Creating DataFrames from RDDs

# Sample RDD
rdd = spark.sparkContext.parallelize([(1, "Alice"), (2, "Bob"), (3, "Charlie")])

# Convert the RDD to a DataFrame
df_from_rdd = rdd.toDF(["ID", "Name"])

# Show the DataFrame
df_from_rdd.show()

Working with Various Data Sources

# Creating a DataFrame from a CSV file
csv_file_path = "path/to/your/csv/file.csv"
df_csv = spark.read.csv(csv_file_path, header=True, inferSchema=True)
# Show the first few rows of the DataFrame
df_csv.show()

# Creating a DataFrame from a JSON file
json_file_path = "path/to/your/json/file.json"
df_json = spark.read.json(json_file_path)
df_json.show()

# Creating a DataFrame from a Parquet file
parquet_file_path = "path/to/your/parquet/file.parquet"
df_parquet = spark.read.parquet(parquet_file_path)
df_parquet.show()
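Writing DataFrames back out follows the same pattern through the DataFrameWriter API; a minimal sketch (not from the original cheat sheet) with hypothetical placeholder paths:

# Sketch: persisting a DataFrame with the DataFrameWriter API
df_csv.write.mode("overwrite").parquet("path/to/output/parquet_dir")
df_csv.write.mode("overwrite").option("header", True).csv("path/to/output/csv_dir")
df_csv.write.mode("overwrite").json("path/to/output/json_dir")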
Basic DataFrame Operations

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a SparkSession
spark = SparkSession.builder \
    .appName("DataFrame Operations Example") \
    .getOrCreate()

# Sample data for the DataFrame
data = [
    (1, "Alice", 28),
    (2, "Bob", 22),
    (3, "Charlie", 35)
]

# Define the DataFrame schema
columns = ["ID", "Name", "Age"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

Display DataFrame contents

# Display the first 20 rows of the DataFrame
df.show()

Output:

+---+-------+---+
| ID|   Name|Age|
+---+-------+---+
|  1|  Alice| 28|
|  2|    Bob| 22|
|  3|Charlie| 35|
+---+-------+---+

Check DataFrame schema

# Check the DataFrame schema
df.printSchema()

Output:

root
 |-- ID: long (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)
Selecting and Filtering Data

A sample DataFrame to demonstrate each operation:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Sample data for the DataFrame
data = [
    (1, "Alice", 28),
    (2, "Bob", 22),
    (3, "Charlie", 35),
    (4, "David", 30),
    (5, "Eva", 25)
]

# Define the DataFrame schema
schema = StructType([
    StructField("ID", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True)
])

# Create the DataFrame
df = spark.createDataFrame(data, schema)

Original DataFrame:

+---+-------+---+
| ID|   Name|Age|
+---+-------+---+
|  1|  Alice| 28|
|  2|    Bob| 22|
|  3|Charlie| 35|
|  4|  David| 30|
|  5|    Eva| 25|
+---+-------+---+

Selecting specific columns from the DataFrame:

# Select specific columns from the DataFrame
selected_columns = df.select("Name", "Age")

# Show the DataFrame with selected columns
print("DataFrame with Selected Columns:")
selected_columns.show()

Additionally, you can use the col() function from pyspark.sql.functions to select columns. Here's how you can do it:

# Using the col() function to select columns
selected_columns_v2 = df.select(col("Name"), col("Age"))

# Show the DataFrame with selected columns using the col() function
print("DataFrame with Selected Columns (using col() function):")
selected_columns_v2.show()

Filtering Data Elements

# Using the filter() method
filtered_df = df.filter(df.Age > 25)
print("Filtered DataFrame (using filter()):")
filtered_df.show()

# Using the where() method
where_filtered_df = df.where(df.Age > 25)
print("Filtered DataFrame (using where() method):")
where_filtered_df.show()

# Using the expr() function
from pyspark.sql.functions import expr
expr_filtered_df = df.filter(expr("Age > 25"))
print("Filtered DataFrame (using expr() function):")
expr_filtered_df.show()

Output (each method returns the same rows):

+---+-------+---+
| ID|   Name|Age|
+---+-------+---+
|  1|  Alice| 28|
|  3|Charlie| 35|
|  4|  David| 30|
+---+-------+---+
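Filter conditions can also be combined; a small sketch (an addition, not from the original cheat sheet), using the same df and wrapping each condition in parentheses when composing with & and |:

# Sketch: combining filter conditions on the sample DataFrame
combined_filtered_df = df.filter((col("Age") > 25) & (col("Name") != "David"))
combined_filtered_df.show()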
Data Manipulation and Transformation

Let's use a sample DataFrame to demonstrate some common data transformation examples:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, concat, expr, when

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Data Transformation") \
    .getOrCreate()

# Sample data for the DataFrame
data = [
    (1, "Alice", 28),
    (2, "Bob", 22),
    (3, "Charlie", 35)
]

# Define the DataFrame schema
columns = ["ID", "Name", "Age"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

Using Expressions to Transform Data

# Using expressions to transform data
df_with_id_name = df.withColumn("ID_Name", concat(col("ID"), lit("_"), col("Name")))

# Show the DataFrame with the new "ID_Name" column
print("DataFrame with 'ID_Name' Column:")
df_with_id_name.show()

Output:

+---+-------+---+---------+
| ID|   Name|Age|  ID_Name|
+---+-------+---+---------+
|  1|  Alice| 28|  1_Alice|
|  2|    Bob| 22|    2_Bob|
|  3|Charlie| 35|3_Charlie|
+---+-------+---+---------+

Using SQL-like Expressions

# Using SQL-like expressions to transform data
df_with_age_group = df.withColumn("Age_Group", expr("CASE WHEN Age <= 25 THEN 'Young' ELSE 'Adult' END"))

# Show the DataFrame with the new "Age_Group" column
print("DataFrame with 'Age_Group' Column:")
df_with_age_group.show()

Output:

+---+-------+---+---------+
| ID|   Name|Age|Age_Group|
+---+-------+---+---------+
|  1|  Alice| 28|    Adult|
|  2|    Bob| 22|    Young|
|  3|Charlie| 35|    Adult|
+---+-------+---+---------+

Using Functions to Transform Data

# Using functions to transform data
df_with_status = df.withColumn("Status", when(col("Age") > 25, "Adult").otherwise("Young"))

# Show the DataFrame with the new "Status" column
print("DataFrame with 'Status' Column:")
df_with_status.show()

Output:

+---+-------+---+------+
| ID|   Name|Age|Status|
+---+-------+---+------+
|  1|  Alice| 28| Adult|
|  2|    Bob| 22| Young|
|  3|Charlie| 35| Adult|
+---+-------+---+------+

Adding new columns

# Adding a new column "City" with a default value
df_with_new_column = df.withColumn("City", lit("New York"))

# Show the DataFrame with the new column
print("DataFrame with New Column:")
df_with_new_column.show()

Output:

+---+-------+---+--------+
| ID|   Name|Age|    City|
+---+-------+---+--------+
|  1|  Alice| 28|New York|
|  2|    Bob| 22|New York|
|  3|Charlie| 35|New York|
+---+-------+---+--------+

Renaming Columns

# Renaming the column "Name" to "Full_Name"
df_with_renamed_column = df.withColumnRenamed("Name", "Full_Name")

# Show the DataFrame with the renamed column
print("DataFrame with Renamed Column:")
df_with_renamed_column.show()

Output:

+---+---------+---+
| ID|Full_Name|Age|
+---+---------+---+
|  1|    Alice| 28|
|  2|      Bob| 22|
|  3|  Charlie| 35|
+---+---------+---+

Dropping Columns

# Dropping the column "Age"
df_dropped_column = df.drop("Age")

# Show the DataFrame with the dropped column
print("DataFrame with Dropped Column:")
df_dropped_column.show()

Output:

+---+-------+
| ID|   Name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
+---+-------+
Aggregating and Grouping Data

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg, count, min, max

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Aggregation Functions") \
    .getOrCreate()

# Sample data for the DataFrame
data = [
    ("Alice", 100),
    ("Bob", 200),
    ("Charlie", 150),
    ("David", 120),
    ("Eva", 180)
]

# Define the DataFrame schema
columns = ["Name", "Salary"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

# Show the DataFrame
print("DataFrame:")
df.show()

Output:

+-------+------+
|   Name|Salary|
+-------+------+
|  Alice|   100|
|    Bob|   200|
|Charlie|   150|
|  David|   120|
|    Eva|   180|
+-------+------+

Aggregation using sum() Function

# Using the sum() function to calculate the total salary
total_salary = df.select(sum("Salary")).collect()[0][0]
print("Total Salary:", total_salary)

Output:

Total Salary: 750

Aggregation using avg() Function

# Using the avg() function to calculate the average salary
average_salary = df.select(avg("Salary")).collect()[0][0]
print("Average Salary:", average_salary)

Output:

Average Salary: 150.0

Aggregation using count() Function

# Using the count() function to count the number of employees
employee_count = df.select(count("Name")).collect()[0][0]
print("Number of Employees:", employee_count)

Output:

Number of Employees: 5

Aggregation using min() and max() Functions

# Using the min() and max() functions to find the minimum and maximum salary
min_salary = df.select(min("Salary")).collect()[0][0]
max_salary = df.select(max("Salary")).collect()[0][0]
print("Minimum Salary:", min_salary)
print("Maximum Salary:", max_salary)

Output:

Minimum Salary: 100
Maximum Salary: 200

Grouping Data Elements

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Grouping Data") \
    .getOrCreate()

# Sample data for the DataFrame
data = [
    ("Alice", "Department A", 100),
    ("Bob", "Department B", 200),
    ("Charlie", "Department A", 150),
    ("David", "Department C", 120),
    ("Eva", "Department B", 180)
]

# Define the DataFrame schema
columns = ["Name", "Department", "Salary"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

# Show the DataFrame
print("DataFrame:")
df.show()

Output:

+-------+------------+------+
|   Name|  Department|Salary|
+-------+------------+------+
|  Alice|Department A|   100|
|    Bob|Department B|   200|
|Charlie|Department A|   150|
|  David|Department C|   120|
|    Eva|Department B|   180|
+-------+------------+------+

# Grouping data based on "Department" and calculating the average salary
grouped_data = df.groupBy("Department").agg(avg("Salary").alias("AvgSalary"))

# Show the grouped data
print("Grouped Data:")
grouped_data.show()

Output:

+------------+---------+
|  Department|AvgSalary|
+------------+---------+
|Department B|    190.0|
|Department C|    120.0|
|Department A|    125.0|
+------------+---------+
Joins and Combining DataFrames

Let's now use a few examples to illustrate how to combine DataFrames using different join types. Consider two sample DataFrames:

a. DataFrame orders:

+-------+---------+----------+
| order | product | quantity |
+-------+---------+----------+
|   101 | apple   |        3 |
|   102 | banana  |        2 |
|   103 | orange  |        5 |
+-------+---------+----------+

b. DataFrame products:

+---------+-------+
| product | price |
+---------+-------+
| apple   |  1.5  |
| banana  |  0.75 |
| grape   |  2.0  |
+---------+-------+

1. Inner Join:

inner_join = orders.join(products, on="product", how="inner")
inner_join.show()

Output:

+---------+-------+----------+-------+
| product | order | quantity | price |
+---------+-------+----------+-------+
| apple   |   101 |        3 |  1.5  |
| banana  |   102 |        2 |  0.75 |
+---------+-------+----------+-------+

2. Left Join:

left_join = orders.join(products, on="product", how="left")
left_join.show()

Output:

+---------+-------+----------+-------+
| product | order | quantity | price |
+---------+-------+----------+-------+
| apple   |   101 |        3 |  1.5  |
| banana  |   102 |        2 |  0.75 |
| orange  |   103 |        5 |  null |
+---------+-------+----------+-------+

3. Right Join:

right_join = orders.join(products, on="product", how="right")
right_join.show()

Output:

+---------+-------+----------+-------+
| product | order | quantity | price |
+---------+-------+----------+-------+
| apple   |   101 |        3 |  1.5  |
| banana  |   102 |        2 |  0.75 |
| grape   |  null |     null |  2.0  |
+---------+-------+----------+-------+

4. Outer Join:

outer_join = orders.join(products, on="product", how="outer")
outer_join.show()

Output:

+---------+-------+----------+-------+
| product | order | quantity | price |
+---------+-------+----------+-------+
| apple   |   101 |        3 |  1.5  |
| banana  |   102 |        2 |  0.75 |
| grape   |  null |     null |  2.0  |
| orange  |   103 |        5 |  null |
+---------+-------+----------+-------+

A second example illustrates the same join types on an "ID" key:

DataFrame 1:

+----+---------+
| ID | Name    |
+----+---------+
|  1 | Alice   |
|  2 | Bob     |
|  3 | Charlie |
+----+---------+

DataFrame 2:

+----+----------+
| ID | Role     |
+----+----------+
|  2 | Manager  |
|  3 | Employee |
|  4 | Intern   |
+----+----------+

Now, let's demonstrate the different join types:

1. Inner Join:
inner_join = df1.join(df2, on="ID", how="inner")
inner_join.show()

2. Outer Join:
outer_join = df1.join(df2, on="ID", how="outer")
outer_join.show()

3. Left Join:
left_join = df1.join(df2, on="ID", how="left")
left_join.show()

4. Right Join:
right_join = df1.join(df2, on="ID", how="right")
right_join.show()
Handling Missing Data

Identifying and handling missing data is a crucial step in data preprocessing and analysis. PySpark provides several techniques to identify and handle missing data in DataFrames. Let's explore these techniques.

Identifying Missing Data

PySpark represents missing data as null values. You can use various methods to identify missing data in DataFrames (a short counting sketch follows this list):

▪ isNull() and isNotNull()
▪ dropna()
▪ fillna()

Handling Missing Data

• Dropping Rows with Null Values

df_without_null = df.dropna()

• Filling Null Values

df_filled_mean = df.fillna({'age': df.select(avg('age')).first()[0]})

• Imputation

from pyspark.ml.feature import Imputer

imputer = Imputer(inputCols=['age'], outputCols=['imputed_age'])
imputed_df = imputer.fit(df).transform(df)

• Adding an Indicator Column

df_with_indicator = df.withColumn('age_missing', df['age'].isNull())

• Handling Categorical Missing Data

df_filled_category = df.fillna({'gender': 'unknown'})
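As a minimal sketch (an addition, assuming a DataFrame df with "age" and "gender" columns like the one used in these snippets), isNull() can also be used to count missing values per column:

# Sketch: count null values in each column of df
from pyspark.sql.functions import col, count, when

null_counts = df.select([
    count(when(col(c).isNull(), c)).alias(c + "_nulls") for c in df.columns
])
null_counts.show()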
PySpark SQL Cheat Sheet: SQL Functions for DataFrames

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a SparkSession
spark = SparkSession.builder \
    .appName("PySpark SQL Functions") \
    .getOrCreate()

# Sample data for the DataFrame
data = [
    ("Alice", 28),
    ("Bob", 22),
    ("Charlie", 35)
]

# Define the DataFrame schema
columns = ["Name", "Age"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

Filtering Data

# Using SQL-like functions to filter data
filtered_data = df.filter(col("Age") > 25)
selected_columns = df.select("Name", "Age")

# Show the filtered and selected data
print("Filtered Data:")
filtered_data.show()
print("Selected Columns:")
selected_columns.show()

Aggregation

from pyspark.sql.functions import avg, max

# Using SQL-like functions for aggregation
grouped_data = df.groupBy("Age").agg(avg("Age"), max("Age"))

# Show the aggregated data
print("Aggregated Data:")
grouped_data.show()

Sorting Data

from pyspark.sql.functions import desc

# Using SQL-like functions for sorting
sorted_data = df.orderBy("Age")
desc_sorted_data = df.sort(col("Age").desc())

# Show the sorted data
print("Sorted Data:")
sorted_data.show()
print("Descending Sorted Data:")
desc_sorted_data.show()

Joining DataFrames

# Sample data for another DataFrame
data2 = [
    ("Alice", "New York"),
    ("Bob", "San Francisco"),
    ("Eva", "Los Angeles")
]
columns2 = ["Name", "City"]
df2 = spark.createDataFrame(data2, columns2)

# Using SQL-like functions to join DataFrames
joined_data = df.join(df2, on="Name", how="inner")

# Show the joined data
print("Joined Data:")
joined_data.show()

Working with Dates and Timestamps

A sample DataFrame with date and timestamp data is shown below.

+---------+------------+---------------------+
| Name    | Date       | Timestamp           |
+---------+------------+---------------------+
| Alice   | 2022-01-15 | 2022-01-15 08:30:00 |
| Bob     | 2021-12-20 | 2021-12-20 15:45:00 |
| Charlie | 2022-02-28 | 2022-02-28 11:00:00 |
+---------+------------+---------------------+

Current Date and Timestamp

from pyspark.sql.functions import current_date, current_timestamp

# Adding columns for the current date and timestamp
df_with_current_date = df.withColumn("CurrentDate", current_date())
df_with_current_timestamp = df.withColumn("CurrentTimestamp", current_timestamp())

# Show the DataFrames
print("DataFrame with Current Date:")
df_with_current_date.show()
print("DataFrame with Current Timestamp:")
df_with_current_timestamp.show()

Date Difference

from pyspark.sql.functions import datediff

date_diff_df = df.withColumn("DaysSince", datediff(current_date(), col("Date")))

# Show the DataFrame with the date difference
print("DataFrame with Date Difference:")
date_diff_df.show()

Months Between

from pyspark.sql.functions import months_between

months_between_df = df.withColumn("MonthsBetween", months_between(current_date(), col("Date")))

# Show the DataFrame with months between
print("DataFrame with Months Between:")
months_between_df.show()

Date Addition and Subtraction

from pyspark.sql.functions import date_add, date_sub

date_add_df = df.withColumn("DatePlus10Days", date_add(col("Date"), 10))
date_sub_df = df.withColumn("DateMinus5Days", date_sub(col("Date"), 5))

# Show the DataFrames with date addition and subtraction
print("DataFrame with Date Addition:")
date_add_df.show()
print("DataFrame with Date Subtraction:")
date_sub_df.show()
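If the Date and Timestamp columns are stored as strings, as in the sample table above, they can be converted to proper date and timestamp types before applying these functions; a small sketch (an addition, not part of the original cheat sheet):

# Sketch: casting string columns to date/timestamp types on the sample df
from pyspark.sql.functions import to_date, to_timestamp

df_typed = df.withColumn("Date", to_date(col("Date"), "yyyy-MM-dd")) \
             .withColumn("Timestamp", to_timestamp(col("Timestamp"), "yyyy-MM-dd HH:mm:ss"))
df_typed.printSchema()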
Advanced DataFrame Operations

Window Functions

Suppose you have a DataFrame with sales data and you want to calculate the rolling average of sales for each product over a specific window size.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg
from pyspark.sql.window import Window

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Window Functions") \
    .getOrCreate()

# Sample data for the DataFrame
data = [
    ("ProductA", "2022-01-01", 100),
    ("ProductA", "2022-01-02", 150),
    ("ProductA", "2022-01-03", 200),
    ("ProductA", "2022-01-04", 120),
    ("ProductA", "2022-01-05", 180),
    ("ProductB", "2022-01-01", 50),
    ("ProductB", "2022-01-02", 70),
    ("ProductB", "2022-01-03", 90)
]

# Define the DataFrame schema
columns = ["Product", "Date", "Sales"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

# Define the window specification
window_spec = Window.partitionBy("Product").orderBy("Date").rowsBetween(-1, 1)

# Calculate the rolling average using the window function
df_with_rolling_avg = df.withColumn("RollingAvg", avg(col("Sales")).over(window_spec))

# Show the DataFrame with the rolling average
df_with_rolling_avg.show()

Pivot Tables

Suppose you have a DataFrame with sales data and you want to create a pivot table showing total sales for each product in different months.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Pivot Tables") \
    .getOrCreate()

# Sample data for the DataFrame
data = [
    ("ProductA", "2022-01-01", 100),
    ("ProductA", "2022-02-01", 150),
    ("ProductA", "2022-01-01", 200),
    ("ProductB", "2022-02-01", 120),
    ("ProductB", "2022-02-01", 180)
]

# Define the DataFrame schema
columns = ["Product", "Date", "Sales"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

# Create a pivot table
pivot_table = df.groupBy("Product").pivot("Date").sum("Sales")

# Show the pivot table
pivot_table.show()

Performance Optimization Tips

Optimizing PySpark DataFrame operations is essential for better performance and efficient data processing. Here are some tips and best practices to consider (a short sketch follows the list):

1. Use Lazy Evaluation: PySpark uses lazy evaluation, which means that transformations on DataFrames are not executed immediately but are queued up. This allows PySpark to optimize the execution plan before actually performing computations.

2. Caching: Caching involves storing a DataFrame or RDD in memory so that it can be reused efficiently across multiple operations. Use .cache() or .persist() to cache intermediate DataFrames that are used in multiple transformations or actions.

3. Partitioning: Partitioning involves dividing your data into smaller subsets (partitions) based on certain criteria. This can significantly improve query performance as it reduces the amount of data that needs to be processed. Use .repartition() or .coalesce() to manage partitions.

4. Broadcasting: Broadcasting is a technique where smaller DataFrames are distributed to worker nodes and cached in memory for join operations. This is particularly useful when you have a small DataFrame that needs to be joined with a larger one.

5. Avoid Using .collect(): Using .collect() brings all the data to the driver node, which can lead to memory issues. Instead, try to perform operations using distributed computations.

6. Use Built-in Functions: Whenever possible, use built-in functions from the pyspark.sql.functions module. These functions are optimized for distributed processing and provide better performance than custom Python functions.

7. Avoid Shuffling: Shuffling is an expensive operation that involves data movement between partitions. Minimize shuffling by using appropriate partitioning and avoiding operations that require data to be rearranged.

8. Optimize Joins: Joins can be performance-intensive. Try to avoid shuffling during joins by ensuring both DataFrames are appropriately partitioned and using the appropriate join strategy (broadcast, sort-merge, etc.).

9. Use explain(): Use the explain() method on DataFrames to understand the execution plan and identify potential optimization opportunities.

10. Hardware Considerations: Consider the cluster configuration, hardware resources, and the amount of data being processed. Properly allocating resources and scaling up the cluster can significantly impact performance.

11. Monitor Resource Usage: Keep an eye on resource usage, including CPU, memory, and disk I/O. Monitoring can help identify performance bottlenecks and resource constraints.

12. Use Parquet Format: Parquet is a columnar storage format that is highly efficient for both reading and writing. Consider using Parquet for storage as it can improve read and write performance.
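A minimal sketch illustrating tips 2, 4, 9, and 12 together; the DataFrame names, join key, and paths below are hypothetical placeholders, not part of the original cheat sheet:

# Sketch: caching, broadcast joins, explain(), and Parquet I/O (hypothetical names and paths)
from pyspark.sql.functions import broadcast, col

large_df = spark.read.parquet("path/to/large/dataset")   # Tip 12: read from Parquet
small_df = spark.read.parquet("path/to/small/lookup")

# Tip 2: cache an intermediate DataFrame that is reused in several actions
filtered_df = large_df.filter(col("Age") > 25).cache()

# Tip 4: hint that the small DataFrame should be broadcast for the join
joined_df = filtered_df.join(broadcast(small_df), on="ID", how="inner")

# Tip 9: inspect the execution plan before running the job
joined_df.explain()

# Tip 12: write the result back in Parquet format
joined_df.write.mode("overwrite").parquet("path/to/output")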