
Delta merge operation in Databricks using PySpark

Ashish Garg
3 min read · Dec 29, 2023

This article describes how to copy data into a Delta table using a merge operation, something data engineering professionals do on a day-to-day basis.

File system
When we talk about storage, we are really talking about a file system. For instance, on a standalone PC, if we want to save some data we might save it in Notepad (as a .txt file). To organize several such files we use folders: a single folder can hold many files and other folders. The distributed equivalent is object storage, which is what the major cloud providers most commonly offer. The core building blocks of any file system are folders (aka directories) and files, and storage always means some sort of file or object system to persist data into. For example, Google Cloud Storage (GCS) is Google's cloud service for storing your objects (files of any format) on Google Cloud. In the next section, we will look at how to perform the copy operation to a GCS bucket from Databricks and ensure that it performs an incremental load, i.e., a merge or upsert operation.

Bucket: The container that holds data in Google Cloud Storage is called a bucket (similar to a bucket in AWS S3). The best analogy for a bucket is a folder.
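To make the analogy concrete, an object in GCS is addressed with a gs:// URI made up of the bucket name followed by a folder-like prefix and the object name (the names below are hypothetical):

# gs://<bucket>/<folder>/<object> -- hypothetical example path
example_path = "gs://my-company-bucket/silver/employees.parquet"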

Copy with the merge operation over GCS

Create a spark session

The first step is to create the Spark session.

from pyspark.sql import SparkSession

# SparkSession creation
spark = (SparkSession
    .builder
    .appName("Merge Example")
    .getOrCreate())
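Note that inside a Databricks notebook a SparkSession named spark is already created for you, so this step mainly matters when running the code outside Databricks (for example, as a locally submitted job).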

The second step involves loading the source data and creating the Delta table.

# Load the source data
source_data = spark.createDataFrame([
    (1, "Aline", 28, "111-11-1111"),
    (2, "Bob", 40, "222-22-2222")
], ["EmpNr", "EmpName", "EmpAge", "SSN"])

bucket_name = "gs://gcompany-datateam-preenv-gcs-silverbucket/silver/"
directory_name = "emp_target_delta_table"
target_delta_table_name = bucket_name + directory_name

# Create the target Delta table by writing the source data to the bucket
(source_data.write
    .format("delta")
    .mode("overwrite")
    .save(target_delta_table_name))
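A quick sanity check (not part of the original walkthrough) is to read the path back as a Delta table and confirm the two rows landed:

# Read the freshly written Delta table back and display it
spark.read.format("delta").load(target_delta_table_name).show()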

Now declare three new data sets:

# Load the new data1
new_data1 = spark.createDataFrame([
    (3, "Charlie", 45, "333-33-3333"),
    (4, "Danny", 35, "444-44-4444")
], ["EmpNr", "EmpName", "EmpAge", "SSN"])

# Load the new data2
new_data2 = spark.createDataFrame([
    (1, "Aline", 30, "111-11-1111"),
    (2, "Bob", None, "222-22-2222")
], ["EmpNr", "EmpName", "EmpAge", "SSN"])

# Load the new data3, with the NULL age coalesced to 0
new_data3 = spark.createDataFrame([
    (1, "Aline", 30, "111-11-1111"),
    (2, "Bob", 0, "222-22-2222")
], ["EmpNr", "EmpName", "EmpAge", "SSN"])

Time to load the target Delta table with the new data.

Import the DeltaTable class and load the existing Delta table into a variable.

from delta.tables import DeltaTable

# Load the target Delta table
target_delta_table_load = DeltaTable.forPath(spark, target_delta_table_name)
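Before running any merges, the current contents of the target table can be inspected through this handle (an optional check, not in the original article):

# Show the current state of the target Delta table
target_delta_table_load.toDF().show()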

Scenario 1: Data with new rows. Performs an Insert operation.

merge_condition = "target.EmpNr = source.EmpNr AND target.EmpAge = source.EmpAge"

# Merge the new data into the target Delta table
(target_delta_table_load.alias("target")
    .merge(new_data1.alias("source"), merge_condition)
    .whenMatchedUpdate(set={"EmpName": "source.EmpName", "EmpAge": "source.EmpAge"})
    .whenNotMatchedInsertAll()
    .execute())

df = spark.read.format("delta").load(target_delta_table_name)
df.show()

Scenario 2: Data with existing rows where one of the merge keys is NULL. Performs an Update operation.

# Merge the new data into the target Delta table
(target_delta_table_load.alias("target")
    .merge(new_data2.alias("source"), merge_condition)
    .whenMatchedUpdate(set={"EmpName": "source.EmpName", "EmpAge": "source.EmpAge"})
    .whenNotMatchedInsertAll()
    .execute())

df = spark.read.format("delta").load(target_delta_table_name)
df.show()

Scenario 3: Data with existing rows where the merge key that was NULL has been replaced with 0. Performs an Update operation.

# Merge the new data into the target Delta table
(target_delta_table_load.alias("target")
    .merge(new_data3.alias("source"), merge_condition)
    .whenMatchedUpdate(set={"EmpName": "source.EmpName", "EmpAge": "source.EmpAge"})
    .whenNotMatchedInsertAll()
    .execute())

df = spark.read.format("delta").load(target_delta_table_name)
df.show()
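Finally, the effect of the initial write and the three merges can be reviewed through the table's transaction history (an optional check using the DeltaTable history() API):

# Show the table's transaction history: the initial write plus the three merge operations
target_delta_table_load.history().show(truncate=False)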
