Day 19



Additional Functions

Objectives:
1. Apply built-in functions to generate data for new columns
2. Apply DataFrame NA functions to handle null values
3. Join DataFrames

Methods:
DataFrameNaFunctions: fill

Built-In Functions:

• Aggregate: collect_set
• Collection: explode
• Non-aggregate and miscellaneous: col, lit
1-A: Get emails of converted users from transactions
Select the email column in UserDF and remove duplicates. Add a new column converted with the value True for all rows. Save the result as convertedUsersDF.

User-Defined Functions
Objectives:
1. Define a function
2. Create and apply a UDF
3. Register the UDF to use in SQL
4. Create and register a UDF with Python decorator syntax
5. Create and apply a Pandas (vectorized) UDF

Methods:
• UDF Registration (spark.udf): register
• Built-In Functions: udf
• Python UDF Decorator: @udf
• Pandas UDF Decorator: @pandas_udf
User-Defined Function (UDF)
A custom column transformation function

• Can’t be optimized by Catalyst Optimizer
• Function is serialized and sent to executors
• Row data is deserialized from Spark's native binary format to pass to the UDF, and the results
are serialized back into Spark's native format
• For Python UDFs, additional interprocess communication overhead between the executor and
a Python interpreter running on each worker node

Define a function
Define a function (on the driver) that returns the first letter of a string from the email field.
Register UDF to use in SQL
Register the UDF using spark.udf.register to also make it available for use in the SQL namespace.

Note:

For more hands-on practice with UDFs, see the notebook ASP 2.5 – UDFs by Sandip Sir.
