
Delta merge operation in Databricks using PySpark

Ashish Garg
3 min read · Dec 29, 2023

This article describes how to copy data into a Delta table using a merge operation, something data engineering professionals do on a day-to-day basis.

File system
When we talk about storage, we are really talking about a file system. For instance, on a standalone PC, if we want to save some data we might save it in Notepad (as a .txt file). To organize several such files we use folders: a single folder can hold many files and other folders. The distributed equivalent is object storage, which is what the major cloud providers most commonly offer. The core building blocks of any file system are folders (aka directories) and files, and storage always means some sort of file or object system to persist data into. For example, Google Cloud Storage (GCS) is Google's cloud service for storing your objects (files of any format) on Google Cloud. In the next section, we will look at how to perform the copy operation to a GCS bucket from Databricks and ensure that it performs an incremental load, i.e., a merge or upsert operation.

Bucket: The container that holds data in Google Cloud Storage is called a bucket (similar to a bucket in AWS S3). The best analogy for a bucket is a folder.
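To make the analogy concrete, an object in GCS is addressed with a gs:// URI made up of the bucket name followed by a folder-like prefix and the object name (the names below are hypothetical):

# gs://<bucket>/<folder>/<object> -- hypothetical example path
example_path = "gs://my-company-bucket/silver/employees.parquet"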

Copy with the merge operation over GCS

Create a spark session

The first step is to create the Spark session.

from pyspark.sql import SparkSession

# SparkSession creation
spark = (SparkSession
    .builder
    .appName("Merge Example")
    .getOrCreate())
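Note that inside a Databricks notebook a SparkSession named spark is already created for you, so this step mainly matters when running the code outside Databricks (for example, as a locally submitted job).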

The second step involves loading the source data and creating the Delta table.

# Load the source data
source_data = spark.createDataFrame([
    (1, "Aline", 28, "111-11-1111"),
    (2, "Bob", 40, "222-22-2222")
], ["EmpNr", "EmpName", "EmpAge", "SSN"])

bucket_name = "gs://gcompany-datateam-preenv-gcs-silverbucket/silver/"
directory_name = "emp_target_delta_table"
target_delta_table_name = bucket_name + directory_name

# Create the target Delta table by writing the source data to the bucket
(source_data.write
    .format("delta")
    .mode("overwrite")
    .save(target_delta_table_name))
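A quick sanity check (not part of the original walkthrough) is to read the path back as a Delta table and confirm the two rows landed:

# Read the freshly written Delta table back and display it
spark.read.format("delta").load(target_delta_table_name).show()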

Now declare three new data sets:

# Load the new data1
new_data1 = spark.createDataFrame([
    (3, "Charlie", 45, "333-33-3333"),
    (4, "Danny", 35, "444-44-4444")
], ["EmpNr", "EmpName", "EmpAge", "SSN"])

# Load the new data2
new_data2 = spark.createDataFrame([
    (1, "Aline", 30, "111-11-1111"),
    (2, "Bob", None, "222-22-2222")
], ["EmpNr", "EmpName", "EmpAge", "SSN"])

# Load the new data3, with the NULL age coalesced to 0
new_data3 = spark.createDataFrame([
    (1, "Aline", 30, "111-11-1111"),
    (2, "Bob", 0, "222-22-2222")
], ["EmpNr", "EmpName", "EmpAge", "SSN"])

Time to load the target Delta table with the new data.

Import the DeltaTable class and load the existing Delta table into a variable.

from delta.tables import DeltaTable

# Load the target Delta table
target_delta_table_load = DeltaTable.forPath(spark, target_delta_table_name)
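Before running any merges, the current contents of the target table can be inspected through this handle (an optional check, not in the original article):

# Show the current state of the target Delta table
target_delta_table_load.toDF().show()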

Scenario 1: Data with new rows. Performs an Insert operation.

merge_condition = "target.EmpNr = source.EmpNr AND target.EmpAge = source.EmpAge"

# Merge the new data into the target Delta table
(target_delta_table_load.alias("target")
    .merge(new_data1.alias("source"), merge_condition)
    .whenMatchedUpdate(set={"EmpName": "source.EmpName", "EmpAge": "source.EmpAge"})
    .whenNotMatchedInsertAll()
    .execute())

df = spark.read.format("delta").load(target_delta_table_name)
df.show()

Scenario 2: Data with existing rows where one of the merge keys is NULL. Performs an Update operation.

# Merge the new data into the target Delta table
(target_delta_table_load.alias("target")
    .merge(new_data2.alias("source"), merge_condition)
    .whenMatchedUpdate(set={"EmpName": "source.EmpName", "EmpAge": "source.EmpAge"})
    .whenNotMatchedInsertAll()
    .execute())

df = spark.read.format("delta").load(target_delta_table_name)
df.show()

Scenario 3: Data with existing rows where the merge key that was NULL has been replaced with 0. Performs an Update operation.

# Merge the new data into the target Delta table
(target_delta_table_load.alias("target")
    .merge(new_data3.alias("source"), merge_condition)
    .whenMatchedUpdate(set={"EmpName": "source.EmpName", "EmpAge": "source.EmpAge"})
    .whenNotMatchedInsertAll()
    .execute())

df = spark.read.format("delta").load(target_delta_table_name)
df.show()
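Finally, the effect of the initial write and the three merges can be reviewed through the table's transaction history (an optional check using the DeltaTable history() API):

# Show the table's transaction history: the initial write plus the three merge operations
target_delta_table_load.history().show(truncate=False)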
