
Performing operations on multiple columns in a PySpark DataFrame


You can use reduce, for loops, or list comprehensions to apply PySpark functions to multiple columns in a DataFrame.

Using iterators to apply the same operation on multiple columns is vital for maintaining a DRY codebase.

Let’s explore different ways to lowercase all of the columns in a DataFrame to illustrate this concept.

If you’re using the Scala API, see this blog post on performing
operations on multiple columns in a Spark DataFrame with
foldLeft.

Lowercase all columns with reduce

Let’s import the reduce function from functools and use it to lowercase all the columns in a DataFrame.
from functools import reduce
from pyspark.sql.functions import col, lower

source_df = spark.createDataFrame(
    [
        ("Jose", "BLUE"),
        ("li", "BrOwN")
    ],
    ["name", "eye_color"]
)

# fold over the column names, lowercasing each column in turn
actual_df = reduce(
    lambda memo_df, col_name: memo_df.withColumn(col_name, lower(col(col_name))),
    source_df.columns,
    source_df
)

actual_df.show()

+----+---------+
|name|eye_color|
+----+---------+
|jose|     blue|
|  li|    brown|
+----+---------+

The physical plan that’s generated by this code looks efficient.


actual_df.explain()

== Physical Plan ==
*Project [lower(name#0) AS name#5, lower(eye_color#1) AS eye_color#9]
+- Scan ExistingRDD[name#0,eye_color#1]

It is no secret that reduce is not among the favored functions of the Pythonistas. — dawg

Let’s see how we can achieve the same result with a for loop.

Lowercase all columns with a for loop


Let’s use the same source_df as earlier and build up the actual_df with
a for loop.
actual_df = source_df

for col_name in actual_df.columns:
    actual_df = actual_df.withColumn(col_name, lower(col(col_name)))

This code is a bit ugly, but Spark is smart and generates the same
physical plan.
actual_df.explain()

== Physical Plan ==
*Project [lower(name#18) AS name#23, lower(eye_color#19) AS eye_color#27]
+- Scan ExistingRDD[name#18,eye_color#19]

Let’s see how we can also use a list comprehension to write this
code.
Lowercase all columns with a list comprehension
Let’s use the same source_df as earlier and lowercase all the columns
with list comprehensions that are beloved by Pythonistas far and
wide.
actual_df = source_df.select(
    *[lower(col(col_name)).name(col_name) for col_name in source_df.columns]
)

Spark is still smart and generates the same physical plan.


actual_df.explain()

== Physical Plan ==
*Project [lower(name#36) AS name#41, lower(eye_color#37) AS eye_color#42]
+- Scan ExistingRDD[name#36,eye_color#37]
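As an aside, if you prefer SQL snippets, the same transformation can be expressed with selectExpr. This is just a sketch of an equivalent formulation, not one of the three approaches this post compares:

# build one "lower(col) as col" SQL expression per column
actual_df = source_df.selectExpr(
    *["lower({0}) as {0}".format(col_name) for col_name in source_df.columns]
)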

Let’s mix it up and see how these solutions work when they’re run
on some, but not all, of the columns in a DataFrame.

Performing operations on a subset of the DataFrame columns
Let’s define a remove_some_chars function that removes all exclamation
points and question marks from a column.
from pyspark.sql.functions import regexp_replace

def remove_some_chars(col_name):
    removed_chars = ("!", "?")
    regexp = "|".join("\\{0}".format(i) for i in removed_chars)
    return regexp_replace(col_name, regexp, "")
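If you’d rather not build the backslash escapes by hand, the standard library’s re.escape can construct the pattern instead. A small variant, assuming the same removed_chars:

import re
from pyspark.sql.functions import regexp_replace

def remove_some_chars(col_name):
    removed_chars = ("!", "?")
    # re.escape escapes any regex metacharacters so each char matches literally
    regexp = "|".join(re.escape(i) for i in removed_chars)
    return regexp_replace(col_name, regexp, "")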

Let’s use reduce to apply the remove_some_chars function to two columns in a new DataFrame.
source_df = spark.createDataFrame(
    [
        ("h!o!c!k!e!y", "rangers", "new york"),
        ("soccer", "??nacional!!", "medellin")
    ],
    ["sport", "team", "city"]
)

source_df.show()

+-----------+------------+--------+
|      sport|        team|    city|
+-----------+------------+--------+
|h!o!c!k!e!y|     rangers|new york|
|     soccer|??nacional!!|medellin|
+-----------+------------+--------+

actual_df = reduce(
    lambda memo_df, col_name: memo_df.withColumn(col_name, remove_some_chars(col_name)),
    ["sport", "team"],
    source_df
)

actual_df.show()

+------+--------+--------+
| sport|    team|    city|
+------+--------+--------+
|hockey| rangers|new york|
|soccer|nacional|medellin|
+------+--------+--------+

Let’s try building up the actual_df with a for loop.


actual_df = source_df

for col_name in ["sport", "team"]:
    actual_df = actual_df.withColumn(col_name, remove_some_chars(col_name))

The for loop looks pretty clean. Now let’s try it with a list
comprehension.
actual_df = source_df.select(
    *[
        remove_some_chars(col_name).name(col_name)
        if col_name in ["sport", "team"]
        else col_name
        for col_name in source_df.columns
    ]
)

Wow, the list comprehension is really ugly for a subset of the columns 😿

reduce, for loops, and list comprehensions all output the same physical plan as in the previous example, so each option is equally performant when executed.

== Physical Plan ==
*Project [regexp_replace(sport#109, \!|\?, ) AS sport#116, regexp_replace(team#110, \!|\?, ) AS team#117, city#111]
+- Scan ExistingRDD[sport#109,team#110,city#111]

What approach should you use?


for loops seem to yield the most readable code. List
comprehensions can be used for operations that are performed on
all columns of a DataFrame, but should be avoided for operations
performed on a subset of the columns. The reduce code is pretty
clean too, so that’s also a viable alternative.

It’s best to write functions that operate on a single column and wrap the iterator in a separate DataFrame transformation so the code can easily be applied to multiple columns.

Let’s define a multi_remove_some_chars DataFrame transformation that takes an array of col_names as an argument and applies remove_some_chars to each col_name.
def multi_remove_some_chars(col_names):
    def inner(df):
        for col_name in col_names:
            df = df.withColumn(col_name, remove_some_chars(col_name))
        return df
    return inner

We can invoke multi_remove_some_chars as follows:

multi_remove_some_chars(["sport", "team"])(source_df)
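If you’re on Spark 3.0 or later, the same call also chains cleanly through DataFrame.transform, which keeps longer pipelines flat:

# DataFrame.transform (Spark 3.0+) applies a function that takes and returns a DataFrame
actual_df = source_df.transform(multi_remove_some_chars(["sport", "team"]))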

This separation of concerns creates a codebase that’s easy to test and reuse.
