Delta Merge Operation in Databricks Using PySpark - by Ashish Garg - Medium
File system
When we talk about storage, we are always talking about a file system. For
instance, on a standalone PC, if we want to save some data we might save it
in Notepad (as a .txt file). To organize several such files we use folders:
a single folder can hold many files and other folders. In distributed
systems the analogous concept is object storage, which is what the major
cloud providers commonly offer. The core building blocks of any file system
are folders (also known as directories) and files, and storage always means
some sort of file or object system into which data is persisted. For
example, Google Cloud Storage (GCS) is Google's cloud service for storing
your objects (files of any format) on Google Cloud. In the next section, we
will look at how to copy data to a GCS bucket from Databricks and ensure
that the copy performs an incremental load, i.e., a merge or upsert
operation.
Bucket: In Google Cloud Storage, the container that holds data is called a
bucket (similar to a bucket in AWS). The best analogy for a bucket is a
folder.
# SparkSession creation
from pyspark.sql import SparkSession

spark = (SparkSession
         .builder
         .appName("Merge Example")
         .getOrCreate())
The second step involves loading the source data and creating the Delta
table.

# Create the source DataFrame (SSNs are strings, so they must be quoted)
source_data = spark.createDataFrame([
    (1, "Aline", 28, "111-11-1111"),
    (2, "Bob", 40, "222-22-2222")
], ["EmpNr", "EmpName", "EmpAge", "SSN"])
bucket_name = "gs://gcompany-datateam-preenv-gcs-silverbucket/silver/"
directory_name = "emp_target_delta_table"
target_delta_table_name = bucket_name + directory_name

(source_data.write
    .format("delta")
    .mode("overwrite")
    .save(target_delta_table_name))
# Load new_data3, in which the NULL age has been coalesced to 0
new_data3 = spark.createDataFrame([
    (1, "Aline", 30, "111-11-1111"),
    (2, "Bob", 0, "222-22-2222")
], ["EmpNr", "EmpName", "EmpAge", "SSN"])
Time to load the target Delta table with the new data.

Import the DeltaTable library and load the existing Delta table into a
variable:

from delta.tables import DeltaTable

target_delta_table_load = DeltaTable.forPath(spark, target_delta_table_name)

df = spark.read.format("delta").load(target_delta_table_name)
df.show()
Scenario 2: Data with existing rows and one of the merge keys is NULL.
Performs Update operation.
# Merge the new data into the target Delta table
(target_delta_table_load.alias("target")
    .merge(new_data2.alias("source"), merge_condition)
    .whenMatchedUpdate(set={"first_name": "source.first_name",
                            "last_name": "source.last_name"})
    .whenNotMatchedInsertAll()
    .execute())
df = spark.read.format("delta").load(target_delta_table_name)
df.show()
Scenario 3: Data with existing rows, where the merge key that was NULL has
been replaced with 0. Performs an update operation.

df = spark.read.format("delta").load(target_delta_table_name)
df.show()